feat(sim): Optimize support for downstream simulators and gym. (#36)

* feat(sim): Optimize support for downstream simulators and gym.

* feat(sim): Optimize support for downstream simulators and gym.

* docs: update docs

* update version
This commit is contained in:
Xinjie 2025-08-29 23:39:13 +08:00 committed by GitHub
parent e06900fe57
commit 0ca4a4894f
GPG Key ID: B5690EEEBB952194
22 changed files with 1303 additions and 474 deletions


@@ -4,6 +4,11 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0).
+## [0.1.4] - 2025-08-29
+### Feature
+- Supports importing the output of the EmbodiedGen Layout Generation module with a single `gym.make` command to create multiple parallel simulation environments and interactive 3D worlds.
+- Supports coacd-based mesh decomposition for better efficiency and accuracy in physical simulations.
 ## [0.1.3] - 2025-08-19
 ### Feature
 - Release `layout-cli` 🏞️ Layout(Interactive 3D Worlds) Generation, generate interactive 3D scene from task description.


@@ -30,7 +30,7 @@
 ```sh
 git clone https://github.com/HorizonRobotics/EmbodiedGen.git
 cd EmbodiedGen
-git checkout v0.1.3
+git checkout v0.1.4
 git submodule update --init --recursive --progress
 conda create -n embodiedgen python=3.10.13 -y # recommended to use a new env.
 conda activate embodiedgen
@@ -67,7 +67,8 @@ You can choose between two backends for the GPT agent:
 [![🤗 Hugging Face](https://img.shields.io/badge/🤗-Image_to_3D_Demo-blue)](https://huggingface.co/spaces/HorizonRobotics/EmbodiedGen-Image-to-3D) Generate physically plausible 3D asset URDF from single input image, offering high-quality support for digital twin systems.
 (HF space is a simplified demonstration. For the full functionality, please refer to `img3d-cli`.)
-<img src="apps/assets/image_to_3d.jpg" alt="Image to 3D" width="900">
+<img src="apps/assets/image_to_3d.jpg" alt="Image to 3D" width="700">
 ### ☁️ Service
 Run the image-to-3D generation service locally.
@@ -95,7 +96,7 @@ img3d-cli --image_path apps/assets/example_image/sample_04.jpg apps/assets/examp
 [![🤗 Hugging Face](https://img.shields.io/badge/🤗-Text_to_3D_Demo-blue)](https://huggingface.co/spaces/HorizonRobotics/EmbodiedGen-Text-to-3D) Create 3D assets from text descriptions for a wide range of geometry and styles. (HF space is a simplified demonstration. For the full functionality, please refer to `text3d-cli`.)
-<img src="apps/assets/text_to_3d.jpg" alt="Text to 3D" width="900">
+<img src="apps/assets/text_to_3d.jpg" alt="Text to 3D" width="700">
 ### ☁️ Service
 Deploy the text-to-3D generation service locally.
@@ -133,7 +134,7 @@ ps: models with more permissive licenses found in `embodied_gen/models/image_com
 [![🤗 Hugging Face](https://img.shields.io/badge/🤗-Texture_Gen_Demo-blue)](https://huggingface.co/spaces/HorizonRobotics/EmbodiedGen-Texture-Gen) Generate visually rich textures for 3D mesh.
-<img src="apps/assets/texture_gen.jpg" alt="Texture Gen" width="900">
+<img src="apps/assets/texture_gen.jpg" alt="Texture Gen" width="700">
 ### ☁️ Service
@@ -161,7 +162,7 @@ bash embodied_gen/scripts/texture_gen.sh \
 <h2 id="3d-scene-generation">🌍 3D Scene Generation</h2>
-<img src="apps/assets/scene3d.gif" alt="scene3d" style="width: 640px;">
+<img src="apps/assets/scene3d.gif" alt="scene3d" style="width: 600px;">
 ### ⚡ API
 > Run `bash install.sh extra` to install additional requirements if you need to use `scene3d-cli`.
@@ -184,7 +185,7 @@ CUDA_VISIBLE_DEVICES=0 scene3d-cli \
 🚧 *Coming Soon*
-<img src="apps/assets/articulate.gif" alt="articulate" style="width: 480px;">
+<img src="apps/assets/articulate.gif" alt="articulate" style="width: 430px;">
 ---
@@ -196,12 +197,12 @@ CUDA_VISIBLE_DEVICES=0 scene3d-cli \
 <table>
 <tr>
-<td><img src="apps/assets/layout1.gif" alt="layout1" width="360"/></td>
-<td><img src="apps/assets/layout2.gif" alt="layout2" width="360"/></td>
+<td><img src="apps/assets/layout1.gif" alt="layout1" width="320"/></td>
+<td><img src="apps/assets/layout2.gif" alt="layout2" width="320"/></td>
 </tr>
 <tr>
-<td><img src="apps/assets/layout3.gif" alt="layout3" width="360"/></td>
-<td><img src="apps/assets/layout4.gif" alt="layout4" width="360"/></td>
+<td><img src="apps/assets/layout3.gif" alt="layout3" width="320"/></td>
+<td><img src="apps/assets/layout4.gif" alt="layout4" width="320"/></td>
 </tr>
 </table>
@@ -219,8 +220,8 @@ layout-cli --task_descs "Place the pen in the mug on the desk" "Put the fruit on
 <table>
 <tr>
-<td><img src="apps/assets/Iscene_demo1.gif" alt="Iscene_demo1" width="300"/></td>
-<td><img src="apps/assets/Iscene_demo2.gif" alt="Iscene_demo2" width="450"/></td>
+<td><img src="apps/assets/Iscene_demo1.gif" alt="Iscene_demo1" width="234"/></td>
+<td><img src="apps/assets/Iscene_demo2.gif" alt="Iscene_demo2" width="350"/></td>
 </tr>
 </table>
@@ -233,7 +234,7 @@ CUDA_VISIBLE_DEVICES=0 nohup layout-cli \
     --output_root "outputs/layouts_gens" --insert_robot > layouts_gens.log &
 ```
-Using `compose_layout.py`, you can recompose the layout of the generated interactive 3D scenes. (Support for texture editing and augmentation will be added later.)
+Using `compose_layout.py`, you can recompose the layout of the generated interactive 3D scenes.
 ```sh
 python embodied_gen/scripts/compose_layout.py \
     --layout_path "outputs/layouts_gens/task_0000/layout.json" \
@@ -243,13 +244,29 @@ python embodied_gen/scripts/compose_layout.py \
 We provide `sim-cli`, which allows users to easily load generated layouts into an interactive 3D simulation using the SAPIEN engine (support for more simulators will be added in future updates).
 ```sh
-sim-cli --layout_path "outputs/layouts_gens/task_0000/recompose/layout.json" \
-    --output_dir "outputs/layouts_gens/task_0000/recompose/sapien_render" --robot_name "franka"
+sim-cli --layout_path "outputs/layouts_gen/task_0000/layout.json" \
+    --output_dir "outputs/layouts_gen/task_0000/sapien_render" --insert_robot
+```
+Example: generate multiple parallel simulation envs with `gym.make` and record sensor and trajectory data.
+<table>
+<tr>
+<td><img src="apps/assets/parallel_sim.gif" alt="parallel_sim1" width="290"/></td>
+<td><img src="apps/assets/parallel_sim2.gif" alt="parallel_sim2" width="290"/></td>
+</tr>
+</table>
+```sh
+python embodied_gen/scripts/parallel_sim.py \
+    --layout_file "outputs/layouts_gen/task_0000/layout.json" \
+    --output_dir "outputs/parallel_sim/task_0000" \
+    --num_envs 20
 ```
 ### 🖼️ Real-to-Sim Digital Twin
-<img src="apps/assets/real2sim_mujoco.gif" alt="real2sim_mujoco" style="width: 512px;">
+<img src="apps/assets/real2sim_mujoco.gif" alt="real2sim_mujoco" width="400">
 ---
@@ -280,7 +297,7 @@ If you use EmbodiedGen in your research or projects, please cite:
 ## 🙌 Acknowledgement
 EmbodiedGen builds upon the following amazing projects and models:
-🌟 [Trellis](https://github.com/microsoft/TRELLIS) | 🌟 [Hunyuan-Delight](https://huggingface.co/tencent/Hunyuan3D-2/tree/main/hunyuan3d-delight-v2-0) | 🌟 [Segment Anything](https://github.com/facebookresearch/segment-anything) | 🌟 [Rembg](https://github.com/danielgatis/rembg) | 🌟 [RMBG-1.4](https://huggingface.co/briaai/RMBG-1.4) | 🌟 [Stable Diffusion x4](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) | 🌟 [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN) | 🌟 [Kolors](https://github.com/Kwai-Kolors/Kolors) | 🌟 [ChatGLM3](https://github.com/THUDM/ChatGLM3) | 🌟 [Aesthetic Score](http://captions.christoph-schuhmann.de/aesthetic_viz_laion_sac+logos+ava1-l14-linearMSE-en-2.37B.html) | 🌟 [Pano2Room](https://github.com/TrickyGo/Pano2Room) | 🌟 [Diffusion360](https://github.com/ArcherFMY/SD-T2I-360PanoImage) | 🌟 [Kaolin](https://github.com/NVIDIAGameWorks/kaolin) | 🌟 [diffusers](https://github.com/huggingface/diffusers) | 🌟 [gsplat](https://github.com/nerfstudio-project/gsplat) | 🌟 [QWEN-2.5VL](https://github.com/QwenLM/Qwen2.5-VL) | 🌟 [GPT4o](https://platform.openai.com/docs/models/gpt-4o) | 🌟 [SD3.5](https://huggingface.co/stabilityai/stable-diffusion-3.5-medium)
+🌟 [Trellis](https://github.com/microsoft/TRELLIS) | 🌟 [Hunyuan-Delight](https://huggingface.co/tencent/Hunyuan3D-2/tree/main/hunyuan3d-delight-v2-0) | 🌟 [Segment Anything](https://github.com/facebookresearch/segment-anything) | 🌟 [Rembg](https://github.com/danielgatis/rembg) | 🌟 [RMBG-1.4](https://huggingface.co/briaai/RMBG-1.4) | 🌟 [Stable Diffusion x4](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) | 🌟 [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN) | 🌟 [Kolors](https://github.com/Kwai-Kolors/Kolors) | 🌟 [ChatGLM3](https://github.com/THUDM/ChatGLM3) | 🌟 [Aesthetic Score](http://captions.christoph-schuhmann.de/aesthetic_viz_laion_sac+logos+ava1-l14-linearMSE-en-2.37B.html) | 🌟 [Pano2Room](https://github.com/TrickyGo/Pano2Room) | 🌟 [Diffusion360](https://github.com/ArcherFMY/SD-T2I-360PanoImage) | 🌟 [Kaolin](https://github.com/NVIDIAGameWorks/kaolin) | 🌟 [diffusers](https://github.com/huggingface/diffusers) | 🌟 [gsplat](https://github.com/nerfstudio-project/gsplat) | 🌟 [QWEN-2.5VL](https://github.com/QwenLM/Qwen2.5-VL) | 🌟 [GPT4o](https://platform.openai.com/docs/models/gpt-4o) | 🌟 [SD3.5](https://huggingface.co/stabilityai/stable-diffusion-3.5-medium) | 🌟 [ManiSkill](https://github.com/haosulab/ManiSkill)
 ---

Binary file not shown (image added, 897 KiB).

Binary file not shown (image added, 711 KiB).


@@ -547,7 +547,9 @@ def extract_urdf(
     # Convert to URDF and recover attrs by GPT.
     filename = "sample"
-    urdf_convertor = URDFGenerator(GPT_CLIENT, render_view_num=4)
+    urdf_convertor = URDFGenerator(
+        GPT_CLIENT, render_view_num=4, decompose_convex=True
+    )
     asset_attrs = {
         "version": VERSION,
         "gs_model": f"{urdf_convertor.output_mesh_dir}/(unknown)_gs.ply",


@@ -0,0 +1,161 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
import logging
import multiprocessing as mp
import os

import coacd
import trimesh

logger = logging.getLogger(__name__)

__all__ = [
    "decompose_convex_coacd",
    "decompose_convex_mesh",
    "decompose_convex_mp",
]


def decompose_convex_coacd(
    filename: str, outfile: str, params: dict, verbose: bool = False
) -> None:
    coacd.set_log_level("info" if verbose else "warn")
    mesh = trimesh.load(filename, force="mesh")
    mesh = coacd.Mesh(mesh.vertices, mesh.faces)
    result = coacd.run_coacd(mesh, **params)
    combined = sum([trimesh.Trimesh(*m) for m in result])
    combined.export(outfile)


def decompose_convex_mesh(
    filename: str,
    outfile: str,
    threshold: float = 0.05,
    max_convex_hull: int = -1,
    preprocess_mode: str = "auto",
    preprocess_resolution: int = 30,
    resolution: int = 2000,
    mcts_nodes: int = 20,
    mcts_iterations: int = 150,
    mcts_max_depth: int = 3,
    pca: bool = False,
    merge: bool = True,
    seed: int = 0,
    verbose: bool = False,
) -> str:
    """Decompose a mesh into convex parts using the CoACD algorithm."""
    coacd.set_log_level("info" if verbose else "warn")
    if os.path.exists(outfile):
        logger.warning(f"Output file {outfile} already exists, removing it.")
        os.remove(outfile)
    params = dict(
        threshold=threshold,
        max_convex_hull=max_convex_hull,
        preprocess_mode=preprocess_mode,
        preprocess_resolution=preprocess_resolution,
        resolution=resolution,
        mcts_nodes=mcts_nodes,
        mcts_iterations=mcts_iterations,
        mcts_max_depth=mcts_max_depth,
        pca=pca,
        merge=merge,
        seed=seed,
    )
    try:
        decompose_convex_coacd(filename, outfile, params, verbose)
        if os.path.exists(outfile):
            return outfile
    except Exception as e:
        if verbose:
            print(f"Decompose convex first attempt failed: {e}.")
    if preprocess_mode != "on":
        try:
            params["preprocess_mode"] = "on"
            decompose_convex_coacd(filename, outfile, params, verbose)
            if os.path.exists(outfile):
                return outfile
        except Exception as e:
            if verbose:
                print(
                    f"Decompose convex second attempt with preprocess_mode='on' failed: {e}"
                )
    raise RuntimeError(f"Convex decomposition failed on (unknown)")


def decompose_convex_mp(
    filename: str,
    outfile: str,
    threshold: float = 0.05,
    max_convex_hull: int = -1,
    preprocess_mode: str = "auto",
    preprocess_resolution: int = 30,
    resolution: int = 2000,
    mcts_nodes: int = 20,
    mcts_iterations: int = 150,
    mcts_max_depth: int = 3,
    pca: bool = False,
    merge: bool = True,
    seed: int = 0,
    verbose: bool = False,
) -> str:
    """Decompose a mesh into convex parts using the CoACD algorithm in a separate process.

    See https://simulately.wiki/docs/toolkits/ConvexDecomp for details.
    """
    params = dict(
        threshold=threshold,
        max_convex_hull=max_convex_hull,
        preprocess_mode=preprocess_mode,
        preprocess_resolution=preprocess_resolution,
        resolution=resolution,
        mcts_nodes=mcts_nodes,
        mcts_iterations=mcts_iterations,
        mcts_max_depth=mcts_max_depth,
        pca=pca,
        merge=merge,
        seed=seed,
    )
    ctx = mp.get_context("spawn")
    p = ctx.Process(
        target=decompose_convex_coacd,
        args=(filename, outfile, params, verbose),
    )
    p.start()
    p.join()
    if p.exitcode == 0 and os.path.exists(outfile):
        return outfile
    if preprocess_mode != "on":
        params["preprocess_mode"] = "on"
        p = ctx.Process(
            target=decompose_convex_coacd,
            args=(filename, outfile, params, verbose),
        )
        p.start()
        p.join()
        if p.exitcode == 0 and os.path.exists(outfile):
            return outfile
    raise RuntimeError(f"Convex decomposition failed on (unknown)")


@@ -16,13 +16,17 @@
 import logging
+import multiprocessing as mp
+import os
 from typing import Tuple, Union
+import coacd
 import igraph
 import numpy as np
 import pyvista as pv
 import spaces
 import torch
+import trimesh
 import utils3d
 from pymeshfix import _meshfix
 from tqdm import tqdm
@@ -33,7 +37,9 @@ logging.basicConfig(
 logger = logging.getLogger(__name__)
-__all__ = ["MeshFixer"]
+__all__ = [
+    "MeshFixer",
+]
 def _radical_inverse(base, n):


@@ -0,0 +1,389 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
import json
import os
from copy import deepcopy

import numpy as np
import sapien
import torch
import torchvision.transforms as transforms
from mani_skill.envs.sapien_env import BaseEnv
from mani_skill.sensors.camera import CameraConfig
from mani_skill.utils import sapien_utils
from mani_skill.utils.building import actors
from mani_skill.utils.registration import register_env
from mani_skill.utils.structs.actor import Actor
from mani_skill.utils.structs.pose import Pose
from mani_skill.utils.structs.types import (
    GPUMemoryConfig,
    SceneConfig,
    SimConfig,
)
from mani_skill.utils.visualization.misc import tile_images
from tqdm import tqdm

from embodied_gen.models.gs_model import GaussianOperator
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.geometry import bfs_placement, quaternion_multiply
from embodied_gen.utils.log import logger
from embodied_gen.utils.process_media import alpha_blend_rgba
from embodied_gen.utils.simulation import (
    SIM_COORD_ALIGN,
    load_assets_from_layout_file,
)

__all__ = ["PickEmbodiedGen"]


@register_env("PickEmbodiedGen-v1", max_episode_steps=100)
class PickEmbodiedGen(BaseEnv):
    SUPPORTED_ROBOTS = ["panda", "panda_wristcam", "fetch"]
    goal_thresh = 0.0

    def __init__(
        self,
        *args,
        robot_uids: str | list[str] = "panda",
        robot_init_qpos_noise: float = 0.02,
        num_envs: int = 1,
        reconfiguration_freq: int = None,
        **kwargs,
    ):
        self.robot_init_qpos_noise = robot_init_qpos_noise
        if reconfiguration_freq is None:
            if num_envs == 1:
                reconfiguration_freq = 1
            else:
                reconfiguration_freq = 0
        # Init params from kwargs.
        layout_file = kwargs.pop("layout_file", None)
        replace_objs = kwargs.pop("replace_objs", True)
        self.enable_grasp = kwargs.pop("enable_grasp", False)
        self.init_quat = kwargs.pop("init_quat", [0.7071, 0, 0, 0.7071])
        # Add small offset in z-axis to avoid collision.
        self.objs_z_offset = kwargs.pop("objs_z_offset", 0.002)
        self.robot_z_offset = kwargs.pop("robot_z_offset", 0.002)
        self.layouts = self.init_env_layouts(
            layout_file, num_envs, replace_objs
        )
        self.robot_pose = self.compute_robot_init_pose(
            self.layouts, num_envs, self.robot_z_offset
        )
        self.env_actors = dict()
        self.image_transform = transforms.PILToTensor()
        super().__init__(
            *args,
            robot_uids=robot_uids,
            reconfiguration_freq=reconfiguration_freq,
            num_envs=num_envs,
            **kwargs,
        )
        self.bg_images = dict()
        if self.render_mode == "hybrid":
            self.bg_images = self.render_gs3d_images(
                self.layouts, num_envs, self.init_quat
            )

    @staticmethod
    def init_env_layouts(
        layout_file: str, num_envs: int, replace_objs: bool
    ) -> list[LayoutInfo]:
        layout = LayoutInfo.from_dict(json.load(open(layout_file, "r")))
        layouts = []
        for env_idx in range(num_envs):
            if replace_objs and env_idx > 0:
                layout = bfs_placement(deepcopy(layout))
            layouts.append(layout)
        return layouts

    @staticmethod
    def compute_robot_init_pose(
        layouts: list[LayoutInfo], num_envs: int, z_offset: float = 0.0
    ) -> list[list[float]]:
        robot_pose = []
        for env_idx in range(num_envs):
            layout = layouts[env_idx]
            robot_node = layout.relation[Scene3DItemEnum.ROBOT.value]
            x, y, z, qx, qy, qz, qw = layout.position[robot_node]
            robot_pose.append([x, y, z + z_offset, qw, qx, qy, qz])
        return robot_pose

    @property
    def _default_sim_config(self):
        return SimConfig(
            scene_config=SceneConfig(
                solver_position_iterations=30,
                # contact_offset=0.04,
                # rest_offset=0.001,
            ),
            # sim_freq=200,
            control_freq=50,
            gpu_memory_config=GPUMemoryConfig(
                max_rigid_contact_count=2**20, max_rigid_patch_count=2**19
            ),
        )

    @property
    def _default_sensor_configs(self):
        pose = sapien_utils.look_at(eye=[0.3, 0, 0.6], target=[-0.1, 0, 0.1])
        return [
            CameraConfig("base_camera", pose, 128, 128, np.pi / 2, 0.01, 100)
        ]

    @property
    def _default_human_render_camera_configs(self):
        pose = sapien_utils.look_at(
            eye=[0.9, 0.0, 1.1], target=[0.0, 0.0, 0.9]
        )
        return CameraConfig(
            "render_camera", pose, 256, 256, np.deg2rad(75), 0.01, 100
        )

    def _load_agent(self, options: dict):
        super()._load_agent(options, sapien.Pose(p=[-10, 0, 10]))

    def _load_scene(self, options: dict):
        all_objects = []
        logger.info("Loading assets and decomposing mesh collisions...")
        for env_idx in range(self.num_envs):
            env_actors = load_assets_from_layout_file(
                self.scene,
                self.layouts[env_idx],
                z_offset=self.objs_z_offset,
                init_quat=self.init_quat,
                env_idx=env_idx,
            )
            self.env_actors[f"env{env_idx}"] = env_actors
            all_objects.extend(env_actors.values())
        self.obj = all_objects[-1]
        for obj in all_objects:
            self.remove_from_state_dict_registry(obj)
        self.all_objects = Actor.merge(all_objects, name="all_objects")
        self.add_to_state_dict_registry(self.all_objects)
        self.goal_site = actors.build_sphere(
            self.scene,
            radius=self.goal_thresh,
            color=[0, 1, 0, 0],
            name="goal_site",
            body_type="kinematic",
            add_collision=False,
            initial_pose=sapien.Pose(),
        )
        self._hidden_objects.append(self.goal_site)

    def _initialize_episode(self, env_idx: torch.Tensor, options: dict):
        with torch.device(self.device):
            b = len(env_idx)
            goal_xyz = torch.zeros((b, 3))
            goal_xyz[:, :2] = torch.rand((b, 2)) * 0.2 - 0.1
            self.goal_site.set_pose(Pose.create_from_pq(goal_xyz))
            qpos = np.array(
                [
                    0.0,
                    np.pi / 8,
                    0,
                    -np.pi * 3 / 8,
                    0,
                    np.pi * 3 / 4,
                    np.pi / 4,
                    0.04,
                    0.04,
                ]
            )
            qpos = (
                np.random.normal(
                    0, self.robot_init_qpos_noise, (self.num_envs, len(qpos))
                )
                + qpos
            )
            qpos[:, -2:] = 0.04
            self.agent.robot.set_root_pose(np.array(self.robot_pose))
            self.agent.reset(qpos)
            self.agent.init_qpos = qpos
            self.agent.controller.controllers["gripper"].reset()

    def render_gs3d_images(
        self, layouts: list[LayoutInfo], num_envs: int, init_quat: list[float]
    ) -> dict[str, np.ndarray]:
        sim_coord_align = (
            torch.tensor(SIM_COORD_ALIGN).to(torch.float32).to(self.device)
        )
        cameras = self.scene.sensors.copy()
        cameras.update(self.scene.human_render_cameras)
        bg_node = layouts[0].relation[Scene3DItemEnum.BACKGROUND.value]
        gs_path = os.path.join(layouts[0].assets[bg_node], "gs_model.ply")
        raw_gs: GaussianOperator = GaussianOperator.load_from_ply(gs_path)
        bg_images = dict()
        for env_idx in tqdm(range(num_envs), desc="Pre-rendering Background"):
            layout = layouts[env_idx]
            x, y, z, qx, qy, qz, qw = layout.position[bg_node]
            qx, qy, qz, qw = quaternion_multiply([qx, qy, qz, qw], init_quat)
            init_pose = torch.tensor([x, y, z, qx, qy, qz, qw])
            gs_model = raw_gs.get_gaussians(instance_pose=init_pose)
            for key in cameras:
                camera = cameras[key]
                Ks = camera.camera.get_intrinsic_matrix()  # (n_env, 3, 3)
                c2w = camera.camera.get_model_matrix()  # (n_env, 4, 4)
                result = gs_model.render(
                    c2w[env_idx] @ sim_coord_align,
                    Ks[env_idx],
                    image_width=camera.config.width,
                    image_height=camera.config.height,
                )
                bg_images[f"{key}-env{env_idx}"] = result.rgb[..., ::-1]
        return bg_images

    def render(self):
        if self.render_mode is None:
            raise RuntimeError("render_mode is not set.")
        if self.render_mode == "human":
            return self.render_human()
        elif self.render_mode == "rgb_array":
            res = self.render_rgb_array()
            return res
        elif self.render_mode == "sensors":
            res = self.render_sensors()
            return res
        elif self.render_mode == "all":
            return self.render_all()
        elif self.render_mode == "hybrid":
            return self.hybrid_render()
        else:
            raise NotImplementedError(
                f"Unsupported render mode {self.render_mode}."
            )

    def render_rgb_array(
        self, camera_name: str = None, return_alpha: bool = False
    ):
        for obj in self._hidden_objects:
            obj.show_visual()
        self.scene.update_render(
            update_sensors=False, update_human_render_cameras=True
        )
        images = []
        render_images = self.scene.get_human_render_camera_images(
            camera_name, return_alpha
        )
        for image in render_images.values():
            images.append(image)
        if len(images) == 0:
            return None
        if len(images) == 1:
            return images[0]
        for obj in self._hidden_objects:
            obj.hide_visual()
        return tile_images(images)

    def render_sensors(self):
        images = []
        sensor_images = self.get_sensor_images()
        for image in sensor_images.values():
            for img in image.values():
                images.append(img)
        return tile_images(images)

    def hybrid_render(self):
        fg_images = self.render_rgb_array(
            return_alpha=True
        )  # (n_env, h, w, 3)
        images = []
        for key in self.bg_images:
            if "render_camera" not in key:
                continue
            env_idx = int(key.split("-env")[-1])
            rgba = alpha_blend_rgba(
                fg_images[env_idx].cpu().numpy(), self.bg_images[key]
            )
            images.append(self.image_transform(rgba))
        images = torch.stack(images, dim=0)
        images = images.permute(0, 2, 3, 1)
        return images[..., :3]

    def evaluate(self):
        obj_to_goal_pos = (
            self.obj.pose.p
        )  # self.goal_site.pose.p - self.obj.pose.p
        is_obj_placed = (
            torch.linalg.norm(obj_to_goal_pos, axis=1) <= self.goal_thresh
        )
        is_grasped = self.agent.is_grasping(self.obj)
        is_robot_static = self.agent.is_static(0.2)
        return dict(
            is_grasped=is_grasped,
            obj_to_goal_pos=obj_to_goal_pos,
            is_obj_placed=is_obj_placed,
            is_robot_static=is_robot_static,
            is_grasping=self.agent.is_grasping(self.obj),
            success=torch.logical_and(is_obj_placed, is_robot_static),
        )

    def _get_obs_extra(self, info: dict):
        return dict()

    def compute_dense_reward(self, obs: any, action: torch.Tensor, info: dict):
        tcp_to_obj_dist = torch.linalg.norm(
            self.obj.pose.p - self.agent.tcp.pose.p, axis=1
        )
        reaching_reward = 1 - torch.tanh(5 * tcp_to_obj_dist)
        reward = reaching_reward
        is_grasped = info["is_grasped"]
        reward += is_grasped
        # obj_to_goal_dist = torch.linalg.norm(
        #     self.goal_site.pose.p - self.obj.pose.p, axis=1
        # )
        obj_to_goal_dist = torch.linalg.norm(
            self.obj.pose.p - self.obj.pose.p, axis=1
        )
        place_reward = 1 - torch.tanh(5 * obj_to_goal_dist)
        reward += place_reward * is_grasped
        reward += info["is_obj_placed"] * is_grasped
        static_reward = 1 - torch.tanh(
            5
            * torch.linalg.norm(self.agent.robot.get_qvel()[..., :-2], axis=1)
        )
        reward += static_reward * info["is_obj_placed"] * is_grasped
        reward[info["success"]] = 6
        return reward

    def compute_normalized_dense_reward(
        self, obs: any, action: torch.Tensor, info: dict
    ):
        return self.compute_dense_reward(obs=obs, action=action, info=info) / 6


@@ -137,7 +137,7 @@ def entrypoint() -> None:
     sim_cli(
         layout_path=layout_path,
         output_dir=output_root,
-        robot_name="franka" if args.insert_robot else None,
+        insert_robot=args.insert_robot,
     )
     torch.cuda.empty_cache()


@@ -58,7 +58,7 @@ os.environ["GRADIO_ANALYTICS_ENABLED"] = "false"
 os.environ["SPCONV_ALGO"] = "native"
 random.seed(0)
-logger.info("Loading Models...")
+logger.info("Loading Image3D Models...")
 DELIGHT = DelightingModel()
 IMAGESR_MODEL = ImageRealESRGAN(outscale=4)
 RBG_REMOVER = RembgRemover()
@@ -107,6 +107,7 @@ def parse_args():
         type=int,
         default=2,
     )
+    parser.add_argument("--disable_decompose_convex", action="store_true")
     args, unknown = parser.parse_known_args()
     return args
@@ -251,7 +252,11 @@ def entrypoint(**kwargs):
     mesh_glb_path = os.path.join(output_root, f"(unknown).glb")
     mesh.export(mesh_glb_path)
-    urdf_convertor = URDFGenerator(GPT_CLIENT, render_view_num=4)
+    urdf_convertor = URDFGenerator(
+        GPT_CLIENT,
+        render_view_num=4,
+        decompose_convex=not args.disable_decompose_convex,
+    )
     asset_attrs = {
         "version": VERSION,
         "gs_model": f"{urdf_convertor.output_mesh_dir}/(unknown)_gs.ply",


@@ -0,0 +1,148 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
from embodied_gen.utils.monkey_patches import monkey_patch_maniskill
monkey_patch_maniskill()
import json
from collections import defaultdict
from dataclasses import dataclass
from typing import Literal
import gymnasium as gym
import numpy as np
import torch
import tyro
from mani_skill.utils.wrappers import RecordEpisode
from tqdm import tqdm
import embodied_gen.envs.pick_embodiedgen
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.log import logger
from embodied_gen.utils.simulation import FrankaPandaGrasper
@dataclass
class ParallelSimConfig:
"""CLI parameters for Parallel Sapien simulation."""
# Environment configuration
layout_file: str
"""Path to the layout JSON file"""
output_dir: str
"""Directory to save recorded videos"""
gym_env_name: str = "PickEmbodiedGen-v1"
"""Name of the Gym environment to use"""
num_envs: int = 4
"""Number of parallel environments"""
render_mode: Literal["rgb_array", "hybrid"] = "hybrid"
"""Rendering mode: rgb_array or hybrid"""
enable_shadow: bool = True
"""Whether to enable shadows in rendering"""
control_mode: str = "pd_joint_pos"
"""Control mode for the agent"""
# Recording configuration
max_steps_per_video: int = 1000
"""Maximum steps to record per video"""
save_trajectory: bool = False
"""Whether to save trajectory data"""
# Simulation parameters
seed: int = 0
"""Random seed for environment reset"""
warmup_steps: int = 50
"""Number of warmup steps before action computation"""
reach_target_only: bool = True
"""Whether to only reach target without full action"""
def entrypoint(**kwargs):
if kwargs is None or len(kwargs) == 0:
cfg = tyro.cli(ParallelSimConfig)
else:
cfg = ParallelSimConfig(**kwargs)
env = gym.make(
cfg.gym_env_name,
num_envs=cfg.num_envs,
render_mode=cfg.render_mode,
enable_shadow=cfg.enable_shadow,
layout_file=cfg.layout_file,
control_mode=cfg.control_mode,
)
env = RecordEpisode(
env,
cfg.output_dir,
max_steps_per_video=cfg.max_steps_per_video,
save_trajectory=cfg.save_trajectory,
)
env.reset(seed=cfg.seed)
default_action = env.unwrapped.agent.init_qpos[:, :8]
for _ in tqdm(range(cfg.warmup_steps), desc="SIM Warmup"):
# action = env.action_space.sample() # Random action
obs, reward, terminated, truncated, info = env.step(default_action)
grasper = FrankaPandaGrasper(
env.unwrapped.agent,
env.unwrapped.sim_config.control_freq,
)
layout_data = LayoutInfo.from_dict(json.load(open(cfg.layout_file, "r")))
actions = defaultdict(list)
# Plan a grasp reach pose for each manipulated object in each env.
for env_idx in range(env.num_envs):
actors = env.unwrapped.env_actors[f"env{env_idx}"]
for node in layout_data.relation[
Scene3DItemEnum.MANIPULATED_OBJS.value
]:
action = grasper.compute_grasp_action(
actor=actors[node]._objs[0],
reach_target_only=True,
env_idx=env_idx,
)
actions[node].append(action)
# Execute the planned actions for each manipulated object in each env.
for node in actions:
max_env_steps = 0
for env_idx in range(env.num_envs):
if actions[node][env_idx] is None:
continue
max_env_steps = max(max_env_steps, len(actions[node][env_idx]))
action_tensor = np.ones(
(max_env_steps, env.num_envs, env.action_space.shape[-1])
)
action_tensor *= default_action[None, ...]
for env_idx in range(env.num_envs):
action = actions[node][env_idx]
if action is None:
continue
action_tensor[: len(action), env_idx, :] = action
for step in tqdm(range(max_env_steps), desc=f"Grasping: {node}"):
action = torch.Tensor(action_tensor[step]).to(env.unwrapped.device)
env.unwrapped.agent.set_action(action)
obs, reward, terminated, truncated, info = env.step(action)
env.close()
logger.info(f"Results saved in {cfg.output_dir}")
if __name__ == "__main__":
entrypoint()
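The execution loop above merges per-env planned trajectories of unequal lengths into one `(steps, envs, action_dim)` tensor, holding envs without a plan (or past the end of a shorter plan) at the default action. A standalone sketch of that padding logic (the helper name is illustrative, not part of the codebase):

```python
import numpy as np

def pad_env_actions(env_actions, default_action):
    """Pad per-env action sequences (or None) into one (T, N, D) tensor.

    env_actions: list of arrays with shape (T_i, D), or None for envs
        that have no plan.
    default_action: array of shape (N, D) used to fill missing steps.
    """
    num_envs, dim = default_action.shape
    max_steps = max((len(a) for a in env_actions if a is not None), default=0)
    # Broadcast the default hold action across all steps, then overwrite
    # the leading steps of each env that actually has a plan.
    out = np.ones((max_steps, num_envs, dim)) * default_action[None, ...]
    for i, a in enumerate(env_actions):
        if a is None:
            continue
        out[: len(a), i, :] = a
    return out
```

Envs whose plan is shorter than `max_steps` simply hold the default action for the remaining steps, which mirrors how the tensor is consumed one step at a time in the loop above.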

View File

@@ -35,6 +35,9 @@ from embodied_gen.utils.simulation import (
     SIM_COORD_ALIGN,
     FrankaPandaGrasper,
     SapienSceneManager,
+    load_assets_from_layout_file,
+    load_mani_skill_robot,
+    render_images,
 )
@@ -50,8 +53,8 @@ class SapienSimConfig:
         default_factory=lambda: [0.7071, 0, 0, 0.7071]
     )  # xyzw
     device: str = "cuda"
-    robot_name: str | None = None
-    control_freq: int = 40
+    control_freq: int = 50
+    insert_robot: bool = False
     # Camera settings.
     render_interval: int = 10
     num_cameras: int = 3
@@ -91,48 +94,41 @@ def entrypoint(**kwargs):
         layout_data = json.load(f)
     layout_data: LayoutInfo = LayoutInfo.from_dict(layout_data)
-    scene_manager.load_assets_from_layout_file(
+    actors = load_assets_from_layout_file(
+        scene_manager.scene,
         layout_data,
         cfg.z_offset,
         cfg.init_quat,
-        cfg.robot_name,
-        cfg.control_freq,
     )
+    agent = load_mani_skill_robot(
+        scene_manager.scene, layout_data, cfg.control_freq
+    )
     frames = defaultdict(list)
     image_cnt = 0
     for step in tqdm(range(cfg.sim_step), desc="Simulation"):
         scene_manager.scene.step()
-        if cfg.robot_name is not None:
-            scene_manager.robot.reset(scene_manager.robot.init_qpos)
+        agent.reset(agent.init_qpos)
         if step % cfg.render_interval != 0:
             continue
         scene_manager.scene.update_render()
         image_cnt += 1
         for camera in scene_manager.cameras:
             camera.take_picture()
-            images = scene_manager.render_images(
-                camera, render_keys=cfg.render_keys
-            )
+            images = render_images(camera, cfg.render_keys)
             frames[camera.name].append(images)
-    grasp_frames = defaultdict(dict)
-    if cfg.robot_name is not None:
-        sim_steps_per_control = cfg.sim_freq // cfg.control_freq
-        control_timestep = 1.0 / cfg.control_freq
+    actions = dict()
+    if cfg.insert_robot:
         grasper = FrankaPandaGrasper(
-            scene_manager.robot,
-            scene_manager,
-            sim_steps_per_control,
-            control_timestep,
+            agent,
+            cfg.control_freq,
         )
         for node in layout_data.relation[
             Scene3DItemEnum.MANIPULATED_OBJS.value
         ]:
-            grasp_frames[node] = grasper.render_grasp(
-                scene_manager.actors[node],
-                scene_manager.cameras,
-                cfg.render_keys,
+            actions[node] = grasper.compute_grasp_action(
+                actor=actors[node], reach_target_only=True
             )
     if "Foreground" not in cfg.render_keys:
@@ -160,22 +156,36 @@ def entrypoint(**kwargs):
         bg_images[camera.name] = result.rgb[..., ::-1]
     video_frames = []
-    for camera in scene_manager.cameras:
+    for idx, camera in enumerate(scene_manager.cameras):
         # Scene rendering
-        for step in tqdm(range(image_cnt), desc="Rendering"):
+        if idx == 0:
+            for step in range(image_cnt):
                 rgba = alpha_blend_rgba(
                     frames[camera.name][step]["Foreground"],
                     bg_images[camera.name],
                 )
                 video_frames.append(np.array(rgba))
         # Grasp rendering
-        for node in grasp_frames:
-            for frame in grasp_frames[node][camera.name]:
+        for node in actions:
+            if actions[node] is None:
+                continue
+            for action in tqdm(actions[node]):
+                grasp_frames = scene_manager.step_action(
+                    agent,
+                    torch.Tensor(action[None, ...]),
+                    scene_manager.cameras,
+                    cfg.render_keys,
+                    sim_steps_per_control=cfg.sim_freq // cfg.control_freq,
+                )
                 rgba = alpha_blend_rgba(
-                    frame["Foreground"], bg_images[camera.name]
+                    grasp_frames[camera.name][0]["Foreground"],
+                    bg_images[camera.name],
                 )
                 video_frames.append(np.array(rgba))
+        agent.reset(agent.init_qpos)
     os.makedirs(cfg.output_dir, exist_ok=True)
     video_path = f"{cfg.output_dir}/Iscene.mp4"
     imageio.mimsave(video_path, video_frames, fps=30)

View File

@@ -42,7 +42,7 @@ from embodied_gen.validators.quality_checkers import (
 os.environ["TOKENIZERS_PARALLELISM"] = "false"
 random.seed(0)
-logger.info("Loading Models...")
+logger.info("Loading TEXT2IMG_MODEL...")
 SEMANTIC_CHECKER = SemanticConsistChecker(GPT_CLIENT)
 SEG_CHECKER = ImageSegChecker(GPT_CLIENT)
 TXTGEN_CHECKER = TextGenAlignChecker(GPT_CLIENT)
@@ -170,6 +170,7 @@ def text_to_3d(**kwargs) -> dict:
             seed=random.randint(0, 100000) if seed_3d is None else seed_3d,
             n_retry=args.n_asset_retry,
             keep_intermediate=args.keep_intermediate,
+            disable_decompose_convex=args.disable_decompose_convex,
         )
         mesh_path = f"{node_save_dir}/result/mesh/{save_node}.obj"
         image_path = render_asset3d(
@@ -270,6 +271,7 @@ def parse_args():
         help="Random seed for 3D generation",
     )
     parser.add_argument("--keep_intermediate", action="store_true")
+    parser.add_argument("--disable_decompose_convex", action="store_true")
     args, unknown = parser.parse_known_args()

View File

@@ -18,6 +18,7 @@ import os
 import sys
 import zipfile
+import numpy as np
 import torch
 from huggingface_hub import hf_hub_download
 from omegaconf import OmegaConf
@@ -150,3 +151,68 @@ def monkey_patch_pano2room():
         self.inpaint_pipe = pipe
     SDFTInpainter.__init__ = patched_sd_inpaint_init
def monkey_patch_maniskill():
from mani_skill.envs.scene import ManiSkillScene
def get_sensor_images(
self, obs: dict[str, any]
) -> dict[str, dict[str, torch.Tensor]]:
sensor_data = dict()
for name, sensor in self.sensors.items():
sensor_data[name] = sensor.get_images(obs[name])
return sensor_data
def get_human_render_camera_images(
self, camera_name: str = None, return_alpha: bool = False
) -> dict[str, torch.Tensor]:
def get_rgba_tensor(camera, return_alpha):
color = camera.get_obs(
rgb=True, depth=False, segmentation=False, position=False
)["rgb"]
if return_alpha:
seg_labels = camera.get_obs(
rgb=False, depth=False, segmentation=True, position=False
)["segmentation"]
masks = np.where((seg_labels.cpu() > 0), 255, 0).astype(
np.uint8
)
masks = torch.tensor(masks).to(color.device)
color = torch.concat([color, masks], dim=-1)
return color
image_data = dict()
if self.gpu_sim_enabled:
if self.parallel_in_single_scene:
for name, camera in self.human_render_cameras.items():
camera.camera._render_cameras[0].take_picture()
rgba = get_rgba_tensor(camera, return_alpha)
image_data[name] = rgba
else:
for name, camera in self.human_render_cameras.items():
if camera_name is not None and name != camera_name:
continue
assert camera.config.shader_config.shader_pack not in [
"rt",
"rt-fast",
"rt-med",
], "ray tracing shaders do not work with parallel rendering"
camera.capture()
rgba = get_rgba_tensor(camera, return_alpha)
image_data[name] = rgba
else:
for name, camera in self.human_render_cameras.items():
if camera_name is not None and name != camera_name:
continue
camera.capture()
rgba = get_rgba_tensor(camera, return_alpha)
image_data[name] = rgba
return image_data
ManiSkillScene.get_sensor_images = get_sensor_images
ManiSkillScene.get_human_render_camera_images = (
get_human_render_camera_images
)
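`monkey_patch_maniskill` rebinds the two methods on `ManiSkillScene` at import time so downstream code picks up the alpha-aware rendering without forking ManiSkill. The same pattern in isolation, on a toy class:

```python
class Camera:
    def capture(self):
        return {"rgb": [0, 0, 0]}

def monkey_patch_camera():
    # Rebind the method on the class itself; every existing and future
    # instance now dispatches to the patched version.
    def capture(self):
        return {"rgb": [0, 0, 0], "alpha": [255]}
    Camera.capture = capture

monkey_patch_camera()
```

Patching the class (rather than one instance) is what lets the patched `get_human_render_camera_images` above take effect for every scene ManiSkill creates afterwards.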

View File

@@ -14,7 +14,7 @@
 # implied. See the License for the specific language governing
 # permissions and limitations under the License.
+import json
 import logging
 import os
 import xml.etree.ElementTree as ET
@@ -24,19 +24,20 @@ from typing import Literal
 import mplib
 import numpy as np
 import sapien.core as sapien
+import sapien.physx as physx
 import torch
-from PIL import Image, ImageColor
-from scipy.spatial.transform import Rotation as R
-from tqdm import tqdm
-from embodied_gen.data.utils import DiffrastRender
-from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
-from embodied_gen.utils.geometry import quaternion_multiply
-logger = logging.getLogger(__name__)
+from mani_skill.agents.base_agent import BaseAgent
+from mani_skill.envs.scene import ManiSkillScene
 from mani_skill.examples.motionplanning.panda.utils import (
     compute_grasp_info_by_obb,
 )
 from mani_skill.utils.geometry.trimesh_utils import get_component_mesh
+from PIL import Image, ImageColor
+from scipy.spatial.transform import Rotation as R
+from embodied_gen.data.utils import DiffrastRender
+from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
+from embodied_gen.utils.geometry import quaternion_multiply
+from embodied_gen.utils.log import logger
 COLORMAP = list(set(ImageColor.colormap.values()))
 COLOR_PALETTE = np.array(
@@ -54,10 +55,267 @@ SIM_COORD_ALIGN = np.array(
 __all__ = [
     "SIM_COORD_ALIGN",
     "FrankaPandaGrasper",
-    "SimpleGrasper",
+    "load_assets_from_layout_file",
+    "load_mani_skill_robot",
+    "render_images",
 ]
def load_actor_from_urdf(
scene: ManiSkillScene | sapien.Scene,
file_path: str,
pose: sapien.Pose,
env_idx: int = None,
use_static: bool = False,
update_mass: bool = False,
) -> sapien.pysapien.Entity:
tree = ET.parse(file_path)
root = tree.getroot()
node_name = root.get("name")
file_dir = os.path.dirname(file_path)
visual_mesh = root.find('.//visual/geometry/mesh')
visual_file = visual_mesh.get("filename")
visual_scale = visual_mesh.get("scale", "1.0 1.0 1.0")
visual_scale = np.array([float(x) for x in visual_scale.split()])
collision_mesh = root.find('.//collision/geometry/mesh')
collision_file = collision_mesh.get("filename")
collision_scale = collision_mesh.get("scale", "1.0 1.0 1.0")
collision_scale = np.array([float(x) for x in collision_scale.split()])
visual_file = os.path.join(file_dir, visual_file)
collision_file = os.path.join(file_dir, collision_file)
static_fric = root.find('.//collision/gazebo/mu1').text
dynamic_fric = root.find('.//collision/gazebo/mu2').text
material = physx.PhysxMaterial(
static_friction=np.clip(float(static_fric), 0.1, 0.7),
dynamic_friction=np.clip(float(dynamic_fric), 0.1, 0.6),
restitution=0.05,
)
builder = scene.create_actor_builder()
body_type = "static" if use_static else "dynamic"
builder.set_physx_body_type(body_type)
builder.add_multiple_convex_collisions_from_file(
collision_file if body_type == "dynamic" else visual_file,
material=material,
scale=collision_scale if body_type == "dynamic" else visual_scale,
# decomposition="coacd",
# decomposition_params=dict(
# threshold=0.05, max_convex_hull=64, verbose=False
# ),
)
builder.add_visual_from_file(visual_file, scale=visual_scale)
builder.set_initial_pose(pose)
if isinstance(scene, ManiSkillScene) and env_idx is not None:
builder.set_scene_idxs([env_idx])
actor = builder.build(name=f"{node_name}-{env_idx}")
if update_mass and hasattr(actor.components[1], "mass"):
node_mass = float(root.find('.//inertial/mass').get("value"))
actor.components[1].set_mass(node_mass)
return actor
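`load_actor_from_urdf` reads the mesh paths, scales, and Gazebo friction tags straight out of the URDF XML with `xml.etree.ElementTree` instead of going through a full URDF loader. The same queries can be exercised on a toy document (the URDF contents below are illustrative, not from the codebase):

```python
import xml.etree.ElementTree as ET

URDF = """<robot name="mug">
  <link name="base">
    <visual>
      <geometry><mesh filename="mesh/visual.obj" scale="0.5 0.5 0.5"/></geometry>
    </visual>
    <collision>
      <geometry><mesh filename="mesh/collision.obj"/></geometry>
      <gazebo><mu1>0.9</mu1><mu2>0.8</mu2></gazebo>
    </collision>
  </link>
</robot>"""

root = ET.fromstring(URDF)
name = root.get("name")
visual = root.find(".//visual/geometry/mesh")
visual_scale = [float(x) for x in visual.get("scale", "1.0 1.0 1.0").split()]
collision = root.find(".//collision/geometry/mesh")
# The scale attribute is optional; fall back to unit scale as the loader does.
collision_scale = [
    float(x) for x in collision.get("scale", "1.0 1.0 1.0").split()
]
static_fric = float(root.find(".//collision/gazebo/mu1").text)
```

The `.//` XPath prefix searches the whole subtree, so the queries work regardless of how deeply the `visual`/`collision` elements are nested under `link`.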
def load_assets_from_layout_file(
scene: ManiSkillScene | sapien.Scene,
layout: LayoutInfo | str,
z_offset: float = 0.0,
init_quat: list[float] = [0, 0, 0, 1],
env_idx: int = None,
) -> dict[str, sapien.pysapien.Entity]:
"""Load assets from `EmbodiedGen` layout-gen output and create actors in the scene.
Args:
scene (sapien.Scene | ManiSkillScene): The SAPIEN or ManiSkill scene to load assets into.
layout (LayoutInfo): The layout information data.
z_offset (float): Offset to apply to the Z-coordinate of non-context objects.
init_quat (List[float]): Initial quaternion (x, y, z, w) for orientation adjustment.
env_idx (int): Environment index for multi-environment setup.
"""
if isinstance(layout, str) and layout.endswith(".json"):
layout = LayoutInfo.from_dict(json.load(open(layout, "r")))
actors = dict()
for node in layout.assets:
file_dir = layout.assets[node]
file_name = f"{node.replace(' ', '_')}.urdf"
urdf_file = os.path.join(file_dir, file_name)
if layout.objs_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
continue
position = layout.position[node].copy()
if layout.objs_mapping[node] != Scene3DItemEnum.CONTEXT.value:
position[2] += z_offset
use_static = (
layout.relation.get(Scene3DItemEnum.CONTEXT.value, None) == node
)
# Combine initial quaternion with object quaternion
x, y, z, qx, qy, qz, qw = position
qx, qy, qz, qw = quaternion_multiply([qx, qy, qz, qw], init_quat)
actor = load_actor_from_urdf(
scene,
urdf_file,
sapien.Pose(p=[x, y, z], q=[qw, qx, qy, qz]),
env_idx,
use_static=use_static,
update_mass=False,
)
actors[node] = actor
return actors
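The loader composes each object's layout quaternion with `init_quat` via `quaternion_multiply` before building the SAPIEN pose (note SAPIEN's `Pose` takes wxyz while the layout stores xyzw). A minimal xyzw Hamilton product for reference; the project's helper may use a different argument convention:

```python
import numpy as np

def quat_multiply_xyzw(q1, q2):
    """Hamilton product q1 * q2 for quaternions stored as (x, y, z, w)."""
    x1, y1, z1, w1 = q1
    x2, y2, z2, w2 = q2
    return np.array([
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
    ])
```

As a sanity check, composing two 90-degree rotations about z yields the 180-degree rotation `(0, 0, 1, 0)`.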
def load_mani_skill_robot(
scene: sapien.Scene | ManiSkillScene,
layout: LayoutInfo | str,
control_freq: int = 20,
robot_init_qpos_noise: float = 0.0,
control_mode: str = "pd_joint_pos",
backend_str: tuple[str, str] = ("cpu", "gpu"),
) -> BaseAgent:
from mani_skill.agents import REGISTERED_AGENTS
from mani_skill.envs.scene import ManiSkillScene
from mani_skill.envs.utils.system.backend import (
parse_sim_and_render_backend,
)
if isinstance(layout, str) and layout.endswith(".json"):
layout = LayoutInfo.from_dict(json.load(open(layout, "r")))
robot_name = layout.relation[Scene3DItemEnum.ROBOT.value]
x, y, z, qx, qy, qz, qw = layout.position[robot_name]
delta_z = 0.002 # Add small offset to avoid collision.
pose = sapien.Pose([x, y, z + delta_z], [qw, qx, qy, qz])
if robot_name not in REGISTERED_AGENTS:
logger.warning(
    f"Robot `{robot_name}` is not registered; choose from "
    f"{REGISTERED_AGENTS.keys()}. Falling back to `panda`."
)
robot_name = "panda"
ROBOT_CLS = REGISTERED_AGENTS[robot_name].agent_cls
backend = parse_sim_and_render_backend(*backend_str)
if isinstance(scene, sapien.Scene):
scene = ManiSkillScene([scene], device=backend_str[0], backend=backend)
robot = ROBOT_CLS(
scene=scene,
control_freq=control_freq,
control_mode=control_mode,
initial_pose=pose,
)
# Set robot initial joint positions in radians (joint0-joint6 plus 2 finger joints).
qpos = np.array(
[
0.0,
np.pi / 8,
0,
-np.pi * 3 / 8,
0,
np.pi * 3 / 4,
np.pi / 4,
0.04,
0.04,
]
)
qpos = (
np.random.normal(
0, robot_init_qpos_noise, (len(scene.sub_scenes), len(qpos))
)
+ qpos
)
qpos[:, -2:] = 0.04
robot.reset(qpos)
robot.init_qpos = robot.robot.qpos
robot.controller.controllers["gripper"].reset()
return robot
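`load_mani_skill_robot` perturbs a canonical ready pose with per-env Gaussian noise, then pins the two finger joints open at 0.04 so the gripper always starts released. That initialization step, sketched independently (the helper name is illustrative):

```python
import numpy as np

def noisy_init_qpos(base_qpos, num_envs, noise_std, seed=None):
    """Sample a (num_envs, dof) batch of start poses around base_qpos."""
    rng = np.random.default_rng(seed)
    qpos = base_qpos + rng.normal(0.0, noise_std, (num_envs, len(base_qpos)))
    qpos[:, -2:] = 0.04  # keep both gripper fingers fully open
    return qpos
```

Overwriting the last two entries after adding noise matters: jittering the finger joints could start the gripper partially closed and make the first grasp nondeterministic.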
def render_images(
camera: sapien.render.RenderCameraComponent,
render_keys: list[
Literal[
"Color",
"Segmentation",
"Normal",
"Mask",
"Depth",
"Foreground",
]
] = None,
) -> dict[str, Image.Image]:
"""Render images from a given sapien camera.
Args:
camera (sapien.render.RenderCameraComponent): The camera to render from.
render_keys (List[str]): Types of images to render (e.g., Color, Segmentation).
Returns:
Dict[str, Image.Image]: Dictionary of rendered images.
"""
if render_keys is None:
render_keys = [
"Color",
"Segmentation",
"Normal",
"Mask",
"Depth",
"Foreground",
]
results: dict[str, Image.Image] = {}
if "Color" in render_keys:
color = camera.get_picture("Color")
color_rgb = (np.clip(color[..., :3], 0, 1) * 255).astype(np.uint8)
results["Color"] = Image.fromarray(color_rgb)
if "Mask" in render_keys:
alpha = (np.clip(color[..., 3], 0, 1) * 255).astype(np.uint8)
results["Mask"] = Image.fromarray(alpha)
if "Segmentation" in render_keys:
seg_labels = camera.get_picture("Segmentation")
label0 = seg_labels[..., 0].astype(np.uint8)
seg_color = COLOR_PALETTE[label0]
results["Segmentation"] = Image.fromarray(seg_color)
if "Foreground" in render_keys:
seg_labels = camera.get_picture("Segmentation")
label0 = seg_labels[..., 0]
mask = np.where((label0 > 1), 255, 0).astype(np.uint8)
color = camera.get_picture("Color")
color_rgb = (np.clip(color[..., :3], 0, 1) * 255).astype(np.uint8)
foreground = np.concatenate([color_rgb, mask[..., None]], axis=-1)
results["Foreground"] = Image.fromarray(foreground)
if "Normal" in render_keys:
normal = camera.get_picture("Normal")[..., :3]
normal_img = (((normal + 1) / 2) * 255).astype(np.uint8)
results["Normal"] = Image.fromarray(normal_img)
if "Depth" in render_keys:
position_map = camera.get_picture("Position")
depth = -position_map[..., 2]
alpha = torch.tensor(color[..., 3], dtype=torch.float32)
norm_depth = DiffrastRender.normalize_map_by_mask(
torch.tensor(depth), alpha
)
depth_img = (norm_depth * 255).to(torch.uint8).numpy()
results["Depth"] = Image.fromarray(depth_img)
return results
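The `Depth` branch above normalizes metric depth into [0, 1] only over valid pixels via `DiffrastRender.normalize_map_by_mask`. A plausible numpy equivalent, assuming a masked min-max normalization (the actual helper may differ):

```python
import numpy as np

def normalize_depth_by_mask(depth, mask):
    """Min-max normalize depth over mask > 0 pixels; background stays 0."""
    valid = mask > 0
    if not valid.any():
        return np.zeros_like(depth, dtype=np.float64)
    d_min = depth[valid].min()
    d_max = depth[valid].max()
    # Guard against a constant depth map to avoid division by zero.
    norm = (depth - d_min) / max(d_max - d_min, 1e-8)
    return np.where(valid, np.clip(norm, 0.0, 1.0), 0.0)
```

Restricting the min/max to masked pixels keeps far-plane background values from flattening the usable depth range of the foreground.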
 class SapienSceneManager:
     """A class to manage SAPIEN simulator."""
@@ -81,11 +339,6 @@ class SapienSceneManager:
         sapien.render.set_ray_tracing_path_depth(10)
         sapien.render.set_ray_tracing_denoiser("oidn")
-        # TODO: Support GPU.
-        # import sapien.physx as physx
-        # physx.enable_gpu()
-        # scene = sapien.Scene(systems=[sapien.physx.PhysxGpuSystem(), sapien.render.RenderSystem()])
         scene = sapien.Scene()
         scene.set_timestep(1 / self.sim_freq)
@@ -114,12 +367,13 @@ class SapienSceneManager:
     def step_action(
         self,
+        agent: BaseAgent,
         action: torch.Tensor,
-        sim_steps_per_control: int,
         cameras: list[sapien.render.RenderCameraComponent],
         render_keys: list[str],
+        sim_steps_per_control: int = 1,
     ) -> dict:
-        self.robot.set_action(action)
+        agent.set_action(action)
         frames = defaultdict(list)
         for _ in range(sim_steps_per_control):
             self.scene.step()
@@ -127,125 +381,11 @@ class SapienSceneManager:
             self.scene.update_render()
             for camera in cameras:
                 camera.take_picture()
-                images = self.render_images(camera, render_keys=render_keys)
+                images = render_images(camera, render_keys=render_keys)
                 frames[camera.name].append(images)
         return frames
def load_exists_robot(
self,
robot_name: str,
pose: sapien.Pose,
control_freq: int = 20,
robot_init_qpos_noise: float = 0.02,
control_mode: str = "pd_joint_pos",
backend_str: tuple[str, str] = ("cpu", "gpu"),
):
from mani_skill.agents import REGISTERED_AGENTS
from mani_skill.envs.scene import ManiSkillScene
from mani_skill.envs.utils.system.backend import (
parse_sim_and_render_backend,
)
if robot_name == "franka":
robot_name = "panda" # remapping
if robot_name not in REGISTERED_AGENTS:
raise KeyError(
f"Robot `{robot_name}` not registered, chosen from {REGISTERED_AGENTS.keys()}."
)
ROBOT_CLS = REGISTERED_AGENTS[robot_name].agent_cls
# backend = parse_sim_and_render_backend("physx_cuda", "gpu")
backend = parse_sim_and_render_backend(*backend_str)
scene = ManiSkillScene(
[self.scene], device=backend_str[0], backend=backend
)
robot = ROBOT_CLS(
scene=scene,
control_freq=control_freq,
control_mode=control_mode,
initial_pose=pose,
)
# Set robot init joint rad agree(joint0 to joint6 w 2 finger).
qpos = np.array(
[
0.0,
np.pi / 8,
0,
-np.pi * 3 / 8,
0,
np.pi * 3 / 4,
np.pi / 4,
0.04,
0.04,
]
)
qpos = (
np.random.normal(
0, robot_init_qpos_noise, (len(scene.sub_scenes), len(qpos))
)
+ qpos
)
qpos[:, -2:] = 0.04
robot.reset(qpos)
robot.init_qpos = robot.robot.qpos
robot.controller.controllers["gripper"].reset()
return robot
def load_actor_from_urdf(
self,
file_path: str,
pose: sapien.Pose,
use_static: bool = False,
update_mass: bool = False,
) -> sapien.pysapien.Entity:
"""Load an actor from a URDF file into the scene.
Args:
file_path (str): Path to the URDF file.
pose (sapien.Pose): Pose of the actor.
use_static (bool): Whether to create a static actor.
update_mass (bool): Whether to update the actor's mass from the URDF.
Returns:
sapien.pysapien.Entity: The loaded actor.
"""
tree = ET.parse(file_path)
root = tree.getroot()
node_name = root.get("name")
file_dir = os.path.dirname(file_path)
visual_file = root.find('.//visual/geometry/mesh').get("filename")
collision_file = root.find('.//collision/geometry/mesh').get(
"filename"
)
visual_file = os.path.join(file_dir, visual_file)
collision_file = os.path.join(file_dir, collision_file)
static_fric = root.find('.//collision/gazebo/mu1').text
dynamic_fric = root.find('.//collision/gazebo/mu2').text
material = self.scene.create_physical_material(
static_friction=np.clip(float(static_fric), 0.1, 0.7),
dynamic_friction=np.clip(float(dynamic_fric), 0.1, 0.6),
restitution=0.05,
)
builder = self.scene.create_actor_builder()
body_type = "static" if use_static else "dynamic"
builder.set_physx_body_type(body_type)
builder.add_multiple_convex_collisions_from_file(
collision_file, material=material
)
builder.add_visual_from_file(visual_file)
actor = builder.build(name=node_name)
actor.set_name(node_name)
actor.set_pose(pose)
if update_mass and hasattr(actor.components[1], "mass"):
node_mass = float(root.find('.//inertial/mass').get("value"))
actor.components[1].set_mass(node_mass)
return actor
def create_camera( def create_camera(
self, self,
cam_name: str, cam_name: str,
@ -344,179 +484,32 @@ class SapienSceneManager:
return self.cameras return self.cameras
def render_images(
self,
camera: sapien.render.RenderCameraComponent,
render_keys: list[
Literal[
"Color",
"Segmentation",
"Normal",
"Mask",
"Depth",
"Foreground",
]
] = None,
) -> dict[str, Image.Image]:
"""Render images from a given camera.
Args:
camera (sapien.render.RenderCameraComponent): The camera to render from.
render_keys (List[str]): Types of images to render (e.g., Color, Segmentation).
Returns:
Dict[str, Image.Image]: Dictionary of rendered images.
"""
if render_keys is None:
render_keys = [
"Color",
"Segmentation",
"Normal",
"Mask",
"Depth",
"Foreground",
]
results: dict[str, Image.Image] = {}
if "Color" in render_keys:
color = camera.get_picture("Color")
color_rgb = (np.clip(color[..., :3], 0, 1) * 255).astype(np.uint8)
results["Color"] = Image.fromarray(color_rgb)
if "Mask" in render_keys:
alpha = (np.clip(color[..., 3], 0, 1) * 255).astype(np.uint8)
results["Mask"] = Image.fromarray(alpha)
if "Segmentation" in render_keys:
seg_labels = camera.get_picture("Segmentation")
label0 = seg_labels[..., 0].astype(np.uint8)
seg_color = self.COLOR_PALETTE[label0]
results["Segmentation"] = Image.fromarray(seg_color)
if "Foreground" in render_keys:
seg_labels = camera.get_picture("Segmentation")
label0 = seg_labels[..., 0]
mask = np.where((label0 > 1), 255, 0).astype(np.uint8)
color = camera.get_picture("Color")
color_rgb = (np.clip(color[..., :3], 0, 1) * 255).astype(np.uint8)
foreground = np.concatenate([color_rgb, mask[..., None]], axis=-1)
results["Foreground"] = Image.fromarray(foreground)
if "Normal" in render_keys:
normal = camera.get_picture("Normal")[..., :3]
normal_img = (((normal + 1) / 2) * 255).astype(np.uint8)
results["Normal"] = Image.fromarray(normal_img)
if "Depth" in render_keys:
position_map = camera.get_picture("Position")
depth = -position_map[..., 2]
alpha = torch.tensor(color[..., 3], dtype=torch.float32)
norm_depth = DiffrastRender.normalize_map_by_mask(
torch.tensor(depth), alpha
)
depth_img = (norm_depth * 255).to(torch.uint8).numpy()
results["Depth"] = Image.fromarray(depth_img)
return results
def load_assets_from_layout_file(
self,
layout: LayoutInfo,
z_offset: float = 0.0,
init_quat: list[float] = [0, 0, 0, 1],
robot_name: str = None,
control_freq: int = 40,
robot_init_qpos_noise: float = 0,
) -> None:
"""Load assets from `EmbodiedGen` layout-gen output and create actors in the scene.
Args:
layout (LayoutInfo): The layout information data.
z_offset (float): Offset to apply to the Z-coordinate of non-context objects.
init_quat (List[float]): Initial quaternion (x, y, z, w) for orientation adjustment.
robot_name (str): Name of the robot to load, None not load robot.
control_freq (int): Control frequency for the robot, default is 40.
robot_init_qpos_noise (float): Noise to add to the robot's initial joint positions, default is 0.
"""
for node in layout.assets:
file_dir = layout.assets[node]
file_name = f"{node.replace(' ', '_')}.urdf"
urdf_file = os.path.join(file_dir, file_name)
if layout.objs_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
continue
position = layout.position[node].copy()
if layout.objs_mapping[node] != Scene3DItemEnum.CONTEXT.value:
position[2] += z_offset
use_static = (
layout.relation.get(Scene3DItemEnum.CONTEXT.value, None)
== node
)
# Combine initial quaternion with object quaternion
x, y, z, qx, qy, qz, qw = position
qx, qy, qz, qw = quaternion_multiply([qx, qy, qz, qw], init_quat)
actor = self.load_actor_from_urdf(
urdf_file,
sapien.Pose(p=[x, y, z], q=[qw, qx, qy, qz]),
use_static=use_static,
update_mass=False,
)
self.actors[node] = actor
if robot_name is not None:
robot_node = layout.relation[Scene3DItemEnum.ROBOT.value]
x, y, z, qx, qy, qz, qw = layout.position[robot_node]
delta_z = 0.002 # Add small offset to avoid collision.
pose = sapien.Pose([x, y, z + delta_z], [qw, qx, qy, qz])
self.robot = self.load_exists_robot(
robot_name, pose, control_freq, robot_init_qpos_noise
)
return
 class FrankaPandaGrasper(object):
-    """Grasper planner for Franka Panda robot arm.
-
-    Args:
-        robot: See from `mani_skill.agents.robots.panda.panda`.
-        scene: The `SapienSceneManager` object.
-        sim_steps_per_control: Number of simulation steps per control step.
-        control_timestep: Time step for planning.
-        joint_vel_limits: Velocity limit for all joints (default: 0.9)
-        joint_acc_limits: Acceleration limit for all joints (default: 0.9)
-        finger_length: Length of the robot's fingers (default: 0.025)
-    """
     def __init__(
         self,
-        robot,
-        scene: SapienSceneManager,
-        sim_steps_per_control: int,
-        control_timestep: float,
-        joint_vel_limits: float = 0.9,
-        joint_acc_limits: float = 0.9,
+        agent: BaseAgent,
+        control_freq: float,
+        joint_vel_limits: float = 2.0,
+        joint_acc_limits: float = 1.0,
         finger_length: float = 0.025,
     ) -> None:
-        self.agent = robot
-        self.robot = robot.robot
-        self.scene = scene
-        self.sim_steps_per_control = sim_steps_per_control
-        self.control_timestep = control_timestep
+        self.agent = agent
+        self.robot = agent.robot
+        self.control_freq = control_freq
+        self.control_timestep = 1 / control_freq
         self.joint_vel_limits = joint_vel_limits
         self.joint_acc_limits = joint_acc_limits
         self.finger_length = finger_length
-        self.planner = self._setup_planner()
+        self.planners = self._setup_planner()
     def _setup_planner(self) -> mplib.Planner:
+        planners = []
+        for pose in self.robot.pose:
         link_names = [link.get_name() for link in self.robot.get_links()]
         joint_names = [
             joint.get_name() for joint in self.robot.get_active_joints()
         ]
         planner = mplib.Planner(
             urdf=self.agent.urdf_path,
             srdf=self.agent.urdf_path.replace(".urdf", ".srdf"),
@@ -526,75 +519,35 @@ class FrankaPandaGrasper(object):
             joint_vel_limits=np.ones(7) * self.joint_vel_limits,
             joint_acc_limits=np.ones(7) * self.joint_acc_limits,
         )
-        planner.set_base_pose(self.robot.pose.raw_pose[0].tolist())
+            planner.set_base_pose(pose.raw_pose[0].tolist())
+            planners.append(planner)
-        return planner
+        return planners
-    def update_state(
-        self,
-        result: dict,
-        gripper_state: Literal[-1, 1],
-        sim_steps_per_control: int,
-        cameras: list[sapien.render.RenderCameraComponent],
-        render_keys: list[str],
-    ) -> dict[str, list[Image.Image]]:
-        n_step = len(result["position"])
-        total_frames = defaultdict(list)
-        for i in tqdm(range(n_step), desc="Grasping"):
-            qpos = result["position"][min(i, n_step - 1)]
-            action = np.hstack([qpos, gripper_state])[None, ...]
-            action = torch.from_numpy(action).float()
-            frames = self.scene.step_action(
-                action, sim_steps_per_control, cameras, render_keys
-            )
-            for camera in cameras:
-                total_frames[camera.name].extend(frames[camera.name])
-        return total_frames
     def control_gripper(
         self,
-        sim_steps_per_control: int,
-        cameras: list[sapien.render.RenderCameraComponent],
-        render_keys: list[str],
         gripper_state: Literal[-1, 1],
         n_step: int = 10,
-    ) -> dict[str, list[Image.Image]]:
+    ) -> np.ndarray:
-        total_frames = defaultdict(list)
         qpos = self.robot.get_qpos()[0, :-2].cpu().numpy()
+        actions = []
         for _ in range(n_step):
             action = np.hstack([qpos, gripper_state])[None, ...]
-            action = torch.from_numpy(action).float()
-            frames = self.scene.step_action(
-                action, sim_steps_per_control, cameras, render_keys
-            )
-            for camera in cameras:
-                total_frames[camera.name].extend(frames[camera.name])
-        return total_frames
+            actions.append(action)
+        return np.concatenate(actions, axis=0)
def move_to_pose( def move_to_pose(
self, self,
pose: sapien.Pose, pose: sapien.Pose,
sim_steps_per_control: int,
control_timestep: float, control_timestep: float,
cameras: list[sapien.render.RenderCameraComponent],
render_keys: list[str],
gripper_state: Literal[-1, 1], gripper_state: Literal[-1, 1],
use_point_cloud: bool = False, use_point_cloud: bool = False,
n_max_step: int = 100, n_max_step: int = 100,
) -> dict[str, list[Image.Image]]: action_key: str = "position",
# First try screw motion planning env_idx: int = 0,
result = self.planner.plan_screw( ) -> np.ndarray:
np.concatenate([pose.p, pose.q]), result = self.planners[env_idx].plan_qpos_to_pose(
self.robot.get_qpos().cpu().numpy()[0],
time_step=control_timestep,
use_point_cloud=use_point_cloud,
)
# Fallback to standard planning if screw fails
if result["status"] != "Success":
result = self.planner.plan_qpos_to_pose(
np.concatenate([pose.p, pose.q]), np.concatenate([pose.p, pose.q]),
self.robot.get_qpos().cpu().numpy()[0], self.robot.get_qpos().cpu().numpy()[0],
time_step=control_timestep, time_step=control_timestep,
@@ -602,44 +555,42 @@ class FrankaPandaGrasper(object):
         )
         if result["status"] != "Success":
-            return defaultdict(list)
-
-        for key in result:
-            if key in ["status", "duration"]:
-                continue
-            sample_ratio = (len(result[key]) // n_max_step) + 1
-            result[key] = result[key][::sample_ratio]
-
-        frames = self.update_state(
-            result, gripper_state, sim_steps_per_control, cameras, render_keys
-        )
-        return frames
+            result = self.planners[env_idx].plan_screw(
+                np.concatenate([pose.p, pose.q]),
+                self.robot.get_qpos().cpu().numpy()[0],
+                time_step=control_timestep,
+                use_point_cloud=use_point_cloud,
+            )
+            if result["status"] != "Success":
+                return
+
+        sample_ratio = (len(result[action_key]) // n_max_step) + 1
+        result[action_key] = result[action_key][::sample_ratio]
+        n_step = len(result[action_key])
+        actions = []
+        for i in range(n_step):
+            qpos = result[action_key][i]
+            action = np.hstack([qpos, gripper_state])[None, ...]
+            actions.append(action)
+
+        return np.concatenate(actions, axis=0)

-    def render_grasp(
+    def compute_grasp_action(
         self,
         actor: sapien.pysapien.Entity,
-        cameras: list[sapien.render.RenderCameraComponent],
-        render_keys: list[
-            Literal[
-                "Color",
-                "Foreground",
-                "Segmentation",
-                "Normal",
-                "Mask",
-                "Depth",
-            ]
-        ],
         reach_target_only: bool = True,
-    ) -> dict[str, list[Image.Image]]:
+        offset: tuple[float, float, float] = [0, 0, -0.05],
+        env_idx: int = 0,
+    ) -> np.ndarray:
         physx_rigid = actor.components[1]
         mesh = get_component_mesh(physx_rigid, to_world_frame=True)
         obb = mesh.bounding_box_oriented

         approaching = np.array([0, 0, -1])
+        tcp_pose = self.agent.tcp.pose[env_idx]
         target_closing = (
-            self.agent.tcp.pose.to_transformation_matrix()[0, :3, 1]
-            .cpu()
-            .numpy()
+            tcp_pose.to_transformation_matrix()[0, :3, 1].cpu().numpy()
         )
         grasp_info = compute_grasp_info_by_obb(
             obb,
@@ -647,60 +598,46 @@ class FrankaPandaGrasper(object):
             target_closing=target_closing,
             depth=self.finger_length,
         )
-        closing, center = grasp_info["closing"], grasp_info["center"]
-        raw_tcp_pose = self.agent.tcp.pose.sp
-        grasp_pose = self.agent.build_grasp_pose(approaching, closing, center)
-        reach_pose = grasp_pose * sapien.Pose(p=[0, 0, -0.05])
-        grasp_pose = grasp_pose * sapien.Pose(p=[0, 0, 0.01])
-
-        total_frames = defaultdict(list)
-        frames_list = []
-        reach_frames = self.move_to_pose(
+        closing, center = grasp_info["closing"], grasp_info["center"]
+        raw_tcp_pose = tcp_pose.sp
+        grasp_pose = self.agent.build_grasp_pose(approaching, closing, center)
+        reach_pose = grasp_pose * sapien.Pose(p=offset)
+        grasp_pose = grasp_pose * sapien.Pose(p=[0, 0, 0.01])
+
+        actions = []
+        reach_actions = self.move_to_pose(
             reach_pose,
-            self.sim_steps_per_control,
             self.control_timestep,
-            cameras,
-            render_keys,
             gripper_state=1,
+            env_idx=env_idx,
         )
-        frames_list.append(reach_frames)
-        if len(reach_frames) == 0:
+        actions.append(reach_actions)
+        if reach_actions is None:
             logger.warning(
                 f"Failed to reach the grasp pose for node `{actor.name}`, skipping grasping."
             )
-            return total_frames
+            return None

         if not reach_target_only:
-            grasp_frames = self.move_to_pose(
+            grasp_actions = self.move_to_pose(
                 grasp_pose,
-                self.sim_steps_per_control,
                 self.control_timestep,
-                cameras,
-                render_keys,
                 gripper_state=1,
+                env_idx=env_idx,
             )
-            frames_list.append(grasp_frames)
-            close_frames = self.control_gripper(
-                self.sim_steps_per_control,
-                cameras,
-                render_keys,
+            actions.append(grasp_actions)
+            close_actions = self.control_gripper(
                 gripper_state=-1,
+                env_idx=env_idx,
             )
-            frames_list.append(close_frames)
-            back_frames = self.move_to_pose(
+            actions.append(close_actions)
+            back_actions = self.move_to_pose(
                 raw_tcp_pose,
-                self.sim_steps_per_control,
                 self.control_timestep,
-                cameras,
-                render_keys,
                 gripper_state=-1,
+                env_idx=env_idx,
             )
-            frames_list.append(back_frames)
+            actions.append(back_actions)

-        for frame_dict in frames_list:
-            for cam_name, frames in frame_dict.items():
-                total_frames[cam_name].extend(frames)
-        self.agent.reset(self.agent.init_qpos)
-        return total_frames
+        return np.concatenate(actions, axis=0)
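The refactor above stops rendering frames inside the grasper and instead returns batched action arrays; the trajectory subsampling in the new `move_to_pose` can be sketched in isolation (the names `n_max_step` and `gripper_state` follow the diff, while the trajectory itself is synthetic):

```python
import numpy as np

def subsample_actions(positions: np.ndarray, gripper_state: int,
                      n_max_step: int = 100) -> np.ndarray:
    # Same stride trick as move_to_pose: keep at most ~n_max_step waypoints.
    sample_ratio = (len(positions) // n_max_step) + 1
    positions = positions[::sample_ratio]
    # Append the gripper command as an extra column to each 7-DoF target.
    actions = [np.hstack([qpos, gripper_state])[None, ...] for qpos in positions]
    return np.concatenate(actions, axis=0)

traj = np.zeros((250, 7))  # synthetic 250-step joint trajectory
actions = subsample_actions(traj, gripper_state=1)
print(actions.shape)  # (84, 8): 250 steps strided by 3, plus a gripper column
```

Because `sample_ratio = len // n_max_step + 1`, the strided result always stays at or below `n_max_step` waypoints.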


@@ -1 +1 @@
-VERSION = "v0.1.3"
+VERSION = "v0.1.4"


@@ -24,6 +24,7 @@ from xml.dom.minidom import parseString

 import numpy as np
 import trimesh
+from embodied_gen.data.convex_decomposer import decompose_convex_mesh
 from embodied_gen.utils.gpt_clients import GPT_CLIENT, GPTclient
 from embodied_gen.utils.process_media import render_asset3d
 from embodied_gen.utils.tags import VERSION
@@ -84,6 +85,7 @@ class URDFGenerator(object):
         attrs_name: list[str] = None,
         render_dir: str = "urdf_renders",
         render_view_num: int = 4,
+        decompose_convex: bool = False,
     ) -> None:
         if mesh_file_list is None:
             mesh_file_list = []
@@ -133,7 +135,8 @@ class URDFGenerator(object):
        For example:
        - A pen standing upright in the first image (aligned with the image's vertical axis)
            full body visible in the first image: vertical height 0.14-0.20 m
-       - A pen lying flat in the first image (showing thickness or as a dot) vertical height 0.018-0.025 m
+       - A pen lying flat in the first image or either the tip or the tail is facing the image
+           (showing thickness or as a circle) vertical height 0.018-0.025 m
        - Tilted pen in the first image (e.g., ~45° angle): vertical height 0.07-0.12 m
        - Use the rest views to help determine the object's 3D pose and orientation.
        Assume the object is in real-world scale and estimate the approximate vertical height
@@ -156,6 +159,7 @@ class URDFGenerator(object):
             "gs_model",
         ]
         self.attrs_name = attrs_name
+        self.decompose_convex = decompose_convex

     def parse_response(self, response: str) -> dict[str, any]:
         lines = response.split("\n")
@@ -258,9 +262,24 @@ class URDFGenerator(object):
         # Update collision geometry
         collision = link.find("collision/geometry/mesh")
         if collision is not None:
-            collision.set(
-                "filename", os.path.join(self.output_mesh_dir, obj_name)
-            )
+            collision_mesh = os.path.join(self.output_mesh_dir, obj_name)
+            if self.decompose_convex:
+                try:
+                    d_params = dict(
+                        threshold=0.05, max_convex_hull=64, verbose=False
+                    )
+                    filename = f"{os.path.splitext(obj_name)[0]}_collision.ply"
+                    output_path = os.path.join(mesh_folder, filename)
+                    decompose_convex_mesh(
+                        mesh_output_path, output_path, **d_params
+                    )
+                    collision_mesh = f"{self.output_mesh_dir}/(unknown)"
+                except Exception as e:
+                    logger.warning(
+                        f"Convex decomposition failed for {output_path}, {e}."
+                        "Use original mesh for collision computation."
+                    )
+            collision.set("filename", collision_mesh)
             collision.set("scale", "1.0 1.0 1.0")

         # Update friction coefficients
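The collision-mesh rewrite above is a small ElementTree edit on the URDF; a self-contained sketch of the same pattern follows (the URDF snippet and file paths here are invented for illustration, not taken from the repo):

```python
import xml.etree.ElementTree as ET

urdf = """
<robot name="demo">
  <link name="base">
    <collision>
      <geometry>
        <mesh filename="mesh/demo.obj"/>
      </geometry>
    </collision>
  </link>
</robot>
"""

root = ET.fromstring(urdf)
link = root.find("link")
# Same lookup path as the diff: collision/geometry/mesh under the link.
collision = link.find("collision/geometry/mesh")
if collision is not None:
    # Point the collision geometry at the decomposed convex mesh instead of
    # the visual mesh; on decomposition failure the diff keeps the original.
    collision.set("filename", "mesh/demo_collision.ply")
    collision.set("scale", "1.0 1.0 1.0")

print(collision.get("filename"))  # mesh/demo_collision.ply
```

Using a simplified convex collision mesh while keeping the detailed visual mesh is what makes the physics step cheaper without changing how the asset renders.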


@@ -9,10 +9,10 @@ PIP_INSTALL_PACKAGES=(
    "xformers==0.0.27.post2 --index-url https://download.pytorch.org/whl/cu118"
    "flash-attn==2.7.0.post2 --no-build-isolation"
    "-r requirements.txt --use-deprecated=legacy-resolver"
-   "utils3d@git+https://github.com/EasternJournalist/utils3d.git#egg=9a4eb15"
+   "utils3d@git+https://github.com/EasternJournalist/utils3d.git@9a4eb15"
    "clip@git+https://github.com/openai/CLIP.git"
-   "segment-anything@git+https://github.com/facebookresearch/segment-anything.git#egg=dca509f"
-   "nvdiffrast@git+https://github.com/NVlabs/nvdiffrast.git#egg=729261d"
+   "segment-anything@git+https://github.com/facebookresearch/segment-anything.git@dca509f"
+   "nvdiffrast@git+https://github.com/NVlabs/nvdiffrast.git@729261d"
    "kolors@git+https://github.com/HochCC/Kolors.git"
    "kaolin@git+https://github.com/NVIDIAGameWorks/kaolin.git@v0.16.0"
    "git+https://github.com/nerfstudio-project/gsplat.git@v1.5.3"


@@ -7,7 +7,7 @@ packages = ["embodied_gen"]

 [project]
 name = "embodied_gen"
-version = "v0.1.3"
+version = "v0.1.4"
 readme = "README.md"
 license = "Apache-2.0"
 license-files = ["LICENSE", "NOTICE"]


@@ -38,5 +38,6 @@ tyro
 pyquaternion
 shapely
 sapien==3.0.0b1
+coacd
 mani_skill==3.0.0b21
 typing_extensions==4.14.1


@@ -39,6 +39,7 @@ def test_urdf_convertor():
         "outputs/test_urdf/notebook/result/mesh/notebook.obj",
         "outputs/test_urdf/marker/result/mesh/marker.obj",
         "outputs/test_urdf/pen2/result/mesh/pen.obj",
+        "outputs/test_urdf/pen3/result/mesh/pen.obj",
     ]
     for idx, mesh_path in enumerate(mesh_paths):
         filename = mesh_path.split("/")[-1].split(".")[0]
@@ -49,3 +50,19 @@ def test_urdf_convertor():
         # min_height=1.0,
         # max_height=1.2,
     )
+
+
+def test_decompose_convex_mesh():
+    urdf_gen = URDFGenerator(GPT_CLIENT, decompose_convex=True)
+    mesh_paths = [
+        "outputs/test_urdf/sample_0/mesh/pen.obj",
+        "outputs/test_urdf/sample_1/mesh/notepad.obj",
+        "outputs/test_urdf/sample_2/mesh/plate.obj",
+    ]
+    for idx, mesh_path in enumerate(mesh_paths):
+        filename = mesh_path.split("/")[-1].split(".")[0]
+        urdf_path = urdf_gen(
+            mesh_path=mesh_path,
+            output_root=f"outputs/test_urdf3/sample_{idx}",
+            category=filename,
+        )
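The `mesh_path.split("/")[-1].split(".")[0]` idiom in the tests above recovers the file stem; as a side note (not part of the diff), `pathlib` expresses the same thing more robustly:

```python
from pathlib import Path

mesh_path = "outputs/test_urdf/pen2/result/mesh/pen.obj"
# Equivalent to mesh_path.split("/")[-1].split(".")[0], but also handles
# platform-specific separators and names without an extension.
filename = Path(mesh_path).stem
print(filename)  # pen
```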


@@ -0,0 +1,44 @@
+# Project EmbodiedGen
+#
+# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#       http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+# implied. See the License for the specific language governing
+# permissions and limitations under the License.
+
+import logging
+import tempfile
+from time import time
+
+import pytest
+
+from embodied_gen.data.convex_decomposer import decompose_convex_mesh
+
+logger = logging.getLogger(__name__)
+
+
+@pytest.mark.parametrize(
+    "input_mesh_path, max_convex_hull",
+    [
+        ("apps/assets/example_texture/meshes/robot_text.obj", 8),
+        # ("apps/assets/example_texture/meshes/robot_text.obj", 32),
+        # ("apps/assets/example_texture/meshes/robot_text.obj", 64),
+        # ("apps/assets/example_texture/meshes/robot_text.obj", 128),
+    ],
+)
+def test_decompose_convex_mesh(input_mesh_path, max_convex_hull):
+    d_params = dict(
+        threshold=0.05, max_convex_hull=max_convex_hull, verbose=False
+    )
+    with tempfile.NamedTemporaryFile(suffix='.ply', delete=True) as tmp_file:
+        start_time = time()
+        decompose_convex_mesh(input_mesh_path, tmp_file.name, **d_params)
+        logger.info(f"Finished in {time()-start_time:.2f}s")
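The new test writes the decomposition result to a throwaway file and logs wall-clock time; the same harness pattern can be run standalone with a stand-in `decompose` function (hypothetical, used only to keep the sketch runnable without coacd installed):

```python
import tempfile
from time import time

def decompose(input_path: str, output_path: str, **params) -> None:
    # Stand-in for embodied_gen.data.convex_decomposer.decompose_convex_mesh;
    # it just records what it was asked to do instead of decomposing a mesh.
    with open(output_path, "w") as f:
        f.write(f"decomposed {input_path} with {sorted(params)}\n")

d_params = dict(threshold=0.05, max_convex_hull=8, verbose=False)
# NamedTemporaryFile cleans up the output on exit, so repeated benchmark
# runs never leave stale .ply files behind.
with tempfile.NamedTemporaryFile(suffix=".ply", delete=True) as tmp_file:
    start_time = time()
    decompose("robot_text.obj", tmp_file.name, **d_params)
    elapsed = time() - start_time
    with open(tmp_file.name) as f:
        content = f.read()
print(content.strip())
```

Note that re-opening the temporary file by name works on POSIX systems; on Windows the file cannot be opened a second time while still held open.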