feat(sim): Optimize support for downstream simulators and gym. (#36)

* feat(sim): Optimize support for downstream simulators and gym.
* docs: update docs
* update version

parent e06900fe57
commit 0ca4a4894f
@@ -4,6 +4,11 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0).

+## [0.1.4] - 2025-08-29
+
+### Feature
+
+- Supports importing the output of the EmbodiedGen Layout Generation module with a single `gym.make` command to create multiple parallel simulation environments and interactive 3D worlds.
+- Supports CoACD-based mesh decomposition for better efficiency and accuracy in physical simulation.

## [0.1.3] - 2025-08-19

### Feature

- Release `layout-cli` 🏞️ Layout (Interactive 3D Worlds) Generation: generate interactive 3D scenes from a task description.
README.md  (51 lines changed)
@@ -30,7 +30,7 @@
```sh
git clone https://github.com/HorizonRobotics/EmbodiedGen.git
cd EmbodiedGen
-git checkout v0.1.3
+git checkout v0.1.4
git submodule update --init --recursive --progress
conda create -n embodiedgen python=3.10.13 -y  # recommended to use a new env.
conda activate embodiedgen
@@ -67,7 +67,8 @@ You can choose between two backends for the GPT agent:

[](https://huggingface.co/spaces/HorizonRobotics/EmbodiedGen-Image-to-3D) Generate physically plausible 3D asset URDF from a single input image, offering high-quality support for digital twin systems.
+(HF space is a simplified demonstration. For the full functionality, please refer to `img3d-cli`.)
-<img src="apps/assets/image_to_3d.jpg" alt="Image to 3D" width="900">
+<img src="apps/assets/image_to_3d.jpg" alt="Image to 3D" width="700">

### ☁️ Service
Run the image-to-3D generation service locally.
@@ -95,7 +96,7 @@ img3d-cli --image_path apps/assets/example_image/sample_04.jpg apps/assets/examp

[](https://huggingface.co/spaces/HorizonRobotics/EmbodiedGen-Text-to-3D) Create 3D assets from text descriptions for a wide range of geometry and styles. (HF space is a simplified demonstration. For the full functionality, please refer to `text3d-cli`.)

-<img src="apps/assets/text_to_3d.jpg" alt="Text to 3D" width="900">
+<img src="apps/assets/text_to_3d.jpg" alt="Text to 3D" width="700">

### ☁️ Service
Deploy the text-to-3D generation service locally.
@@ -133,7 +134,7 @@ ps: models with more permissive licenses found in `embodied_gen/models/image_com

[](https://huggingface.co/spaces/HorizonRobotics/EmbodiedGen-Texture-Gen) Generate visually rich textures for 3D meshes.

-<img src="apps/assets/texture_gen.jpg" alt="Texture Gen" width="900">
+<img src="apps/assets/texture_gen.jpg" alt="Texture Gen" width="700">

### ☁️ Service
@@ -161,7 +162,7 @@ bash embodied_gen/scripts/texture_gen.sh \

<h2 id="3d-scene-generation">🌍 3D Scene Generation</h2>

-<img src="apps/assets/scene3d.gif" alt="scene3d" style="width: 640px;">
+<img src="apps/assets/scene3d.gif" alt="scene3d" style="width: 600px;">

### ⚡ API
> Run `bash install.sh extra` to install additional requirements if you need to use `scene3d-cli`.
@@ -184,7 +185,7 @@ CUDA_VISIBLE_DEVICES=0 scene3d-cli \

🚧 *Coming Soon*

-<img src="apps/assets/articulate.gif" alt="articulate" style="width: 480px;">
+<img src="apps/assets/articulate.gif" alt="articulate" style="width: 430px;">

---
@@ -196,12 +197,12 @@ CUDA_VISIBLE_DEVICES=0 scene3d-cli \

<table>
<tr>
-<td><img src="apps/assets/layout1.gif" alt="layout1" width="360"/></td>
-<td><img src="apps/assets/layout2.gif" alt="layout2" width="360"/></td>
+<td><img src="apps/assets/layout1.gif" alt="layout1" width="320"/></td>
+<td><img src="apps/assets/layout2.gif" alt="layout2" width="320"/></td>
</tr>
<tr>
-<td><img src="apps/assets/layout3.gif" alt="layout3" width="360"/></td>
-<td><img src="apps/assets/layout4.gif" alt="layout4" width="360"/></td>
+<td><img src="apps/assets/layout3.gif" alt="layout3" width="320"/></td>
+<td><img src="apps/assets/layout4.gif" alt="layout4" width="320"/></td>
</tr>
</table>
@@ -219,8 +220,8 @@ layout-cli --task_descs "Place the pen in the mug on the desk" "Put the fruit on

<table>
<tr>
-<td><img src="apps/assets/Iscene_demo1.gif" alt="Iscene_demo1" width="300"/></td>
-<td><img src="apps/assets/Iscene_demo2.gif" alt="Iscene_demo2" width="450"/></td>
+<td><img src="apps/assets/Iscene_demo1.gif" alt="Iscene_demo1" width="234"/></td>
+<td><img src="apps/assets/Iscene_demo2.gif" alt="Iscene_demo2" width="350"/></td>
</tr>
</table>
@@ -233,7 +234,7 @@ CUDA_VISIBLE_DEVICES=0 nohup layout-cli \
    --output_root "outputs/layouts_gens" --insert_robot > layouts_gens.log &
```

-Using `compose_layout.py`, you can recompose the layout of the generated interactive 3D scenes. (Support for texture editing and augmentation will be added later.)
+Using `compose_layout.py`, you can recompose the layout of the generated interactive 3D scenes.
```sh
python embodied_gen/scripts/compose_layout.py \
    --layout_path "outputs/layouts_gens/task_0000/layout.json" \
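The layout consumed above is a plain JSON file, so programmatic recomposition boils down to load, edit, save. A minimal stand-alone sketch; the `position` key and the 7-value pose layout `[x, y, z, qx, qy, qz, qw]` are assumptions inferred from this commit's code, not a documented schema:

```python
import json
import os
import tempfile

# Hypothetical minimal layout dict; the real LayoutInfo schema may differ.
layout = {
    "position": {"mug": [0.2, 0.0, 0.8, 0.0, 0.0, 0.0, 1.0]},
    "assets": {"mug": "outputs/assets/mug"},
}

path = os.path.join(tempfile.mkdtemp(), "layout.json")
with open(path, "w") as f:
    json.dump(layout, f)

# "Recompose": shift the mug 5 cm along +x, then write the edited layout back.
with open(path) as f:
    edited = json.load(f)
edited["position"]["mug"][0] += 0.05
with open(path, "w") as f:
    json.dump(edited, f)
```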
@@ -243,13 +244,29 @@ python embodied_gen/scripts/compose_layout.py \

We provide `sim-cli`, which allows users to easily load generated layouts into an interactive 3D simulation using the SAPIEN engine (support for more simulators will be added in future updates).

```sh
-sim-cli --layout_path "outputs/layouts_gens/task_0000/recompose/layout.json" \
-    --output_dir "outputs/layouts_gens/task_0000/recompose/sapien_render" --robot_name "franka"
+sim-cli --layout_path "outputs/layouts_gen/task_0000/layout.json" \
+    --output_dir "outputs/layouts_gen/task_0000/sapien_render" --insert_robot
```

+Example: generate multiple parallel simulation environments with `gym.make` and record sensor and trajectory data.
+
+<table>
+<tr>
+<td><img src="apps/assets/parallel_sim.gif" alt="parallel_sim1" width="290"/></td>
+<td><img src="apps/assets/parallel_sim2.gif" alt="parallel_sim2" width="290"/></td>
+</tr>
+</table>
+
+```sh
+python embodied_gen/scripts/parallel_sim.py \
+    --layout_file "outputs/layouts_gen/task_0000/layout.json" \
+    --output_dir "outputs/parallel_sim/task_0000" \
+    --num_envs 20
+```

### 🖼️ Real-to-Sim Digital Twin

-<img src="apps/assets/real2sim_mujoco.gif" alt="real2sim_mujoco" style="width: 512px;">
+<img src="apps/assets/real2sim_mujoco.gif" alt="real2sim_mujoco" width="400">

---
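Under the hood, `parallel_sim.py` resets N environments and steps them in lockstep while recording batched per-step observations. The loop can be sketched self-contained with a stub vectorized environment standing in for the real SAPIEN-backed `PickEmbodiedGen-v1` (all names here are illustrative, not the project's API):

```python
import numpy as np

class StubBatchEnv:
    """Stand-in for a vectorized env: state is one scalar per env."""

    def __init__(self, num_envs: int):
        self.num_envs = num_envs
        self.state = np.zeros(num_envs)

    def reset(self, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.state = rng.uniform(-0.1, 0.1, self.num_envs)
        return self.state.copy(), {}

    def step(self, action: np.ndarray):
        self.state = self.state + action  # trivial batched dynamics
        return self.state.copy(), np.zeros(self.num_envs), False, False, {}

num_envs, steps = 4, 10
env = StubBatchEnv(num_envs)
obs, _ = env.reset(seed=0)

trajectory = [obs]  # record the batched observation at every step
for _ in range(steps):
    action = np.full(num_envs, 0.01)  # same small action for every env
    obs, reward, terminated, truncated, _ = env.step(action)
    trajectory.append(obs)

trajectory = np.stack(trajectory)  # shape: (steps + 1, num_envs)
```

The real script additionally wraps the env in ManiSkill's `RecordEpisode` to dump videos and trajectories; the recording pattern is the same: accumulate per-step batched outputs, then stack.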
@@ -280,7 +297,7 @@ If you use EmbodiedGen in your research or projects, please cite:

## 🙌 Acknowledgement

EmbodiedGen builds upon the following amazing projects and models:
-🌟 [Trellis](https://github.com/microsoft/TRELLIS) | 🌟 [Hunyuan-Delight](https://huggingface.co/tencent/Hunyuan3D-2/tree/main/hunyuan3d-delight-v2-0) | 🌟 [Segment Anything](https://github.com/facebookresearch/segment-anything) | 🌟 [Rembg](https://github.com/danielgatis/rembg) | 🌟 [RMBG-1.4](https://huggingface.co/briaai/RMBG-1.4) | 🌟 [Stable Diffusion x4](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) | 🌟 [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN) | 🌟 [Kolors](https://github.com/Kwai-Kolors/Kolors) | 🌟 [ChatGLM3](https://github.com/THUDM/ChatGLM3) | 🌟 [Aesthetic Score](http://captions.christoph-schuhmann.de/aesthetic_viz_laion_sac+logos+ava1-l14-linearMSE-en-2.37B.html) | 🌟 [Pano2Room](https://github.com/TrickyGo/Pano2Room) | 🌟 [Diffusion360](https://github.com/ArcherFMY/SD-T2I-360PanoImage) | 🌟 [Kaolin](https://github.com/NVIDIAGameWorks/kaolin) | 🌟 [diffusers](https://github.com/huggingface/diffusers) | 🌟 [gsplat](https://github.com/nerfstudio-project/gsplat) | 🌟 [QWEN-2.5VL](https://github.com/QwenLM/Qwen2.5-VL) | 🌟 [GPT4o](https://platform.openai.com/docs/models/gpt-4o) | 🌟 [SD3.5](https://huggingface.co/stabilityai/stable-diffusion-3.5-medium)
+🌟 [Trellis](https://github.com/microsoft/TRELLIS) | 🌟 [Hunyuan-Delight](https://huggingface.co/tencent/Hunyuan3D-2/tree/main/hunyuan3d-delight-v2-0) | 🌟 [Segment Anything](https://github.com/facebookresearch/segment-anything) | 🌟 [Rembg](https://github.com/danielgatis/rembg) | 🌟 [RMBG-1.4](https://huggingface.co/briaai/RMBG-1.4) | 🌟 [Stable Diffusion x4](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) | 🌟 [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN) | 🌟 [Kolors](https://github.com/Kwai-Kolors/Kolors) | 🌟 [ChatGLM3](https://github.com/THUDM/ChatGLM3) | 🌟 [Aesthetic Score](http://captions.christoph-schuhmann.de/aesthetic_viz_laion_sac+logos+ava1-l14-linearMSE-en-2.37B.html) | 🌟 [Pano2Room](https://github.com/TrickyGo/Pano2Room) | 🌟 [Diffusion360](https://github.com/ArcherFMY/SD-T2I-360PanoImage) | 🌟 [Kaolin](https://github.com/NVIDIAGameWorks/kaolin) | 🌟 [diffusers](https://github.com/huggingface/diffusers) | 🌟 [gsplat](https://github.com/nerfstudio-project/gsplat) | 🌟 [QWEN-2.5VL](https://github.com/QwenLM/Qwen2.5-VL) | 🌟 [GPT4o](https://platform.openai.com/docs/models/gpt-4o) | 🌟 [SD3.5](https://huggingface.co/stabilityai/stable-diffusion-3.5-medium) | 🌟 [ManiSkill](https://github.com/haosulab/ManiSkill)

---
BIN  apps/assets/parallel_sim.gif  (new file, binary file not shown, 897 KiB)
BIN  apps/assets/parallel_sim2.gif  (new file, binary file not shown, 711 KiB)
@@ -547,7 +547,9 @@ def extract_urdf(

    # Convert to URDF and recover attrs by GPT.
    filename = "sample"
-    urdf_convertor = URDFGenerator(GPT_CLIENT, render_view_num=4)
+    urdf_convertor = URDFGenerator(
+        GPT_CLIENT, render_view_num=4, decompose_convex=True
+    )
    asset_attrs = {
        "version": VERSION,
        "gs_model": f"{urdf_convertor.output_mesh_dir}/(unknown)_gs.ply",
embodied_gen/data/convex_decomposer.py  (new file, 161 lines)
@@ -0,0 +1,161 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.

import logging
import multiprocessing as mp
import os

import coacd
import trimesh

logger = logging.getLogger(__name__)

__all__ = [
    "decompose_convex_coacd",
    "decompose_convex_mesh",
    "decompose_convex_mp",
]


def decompose_convex_coacd(
    filename: str, outfile: str, params: dict, verbose: bool = False
) -> None:
    coacd.set_log_level("info" if verbose else "warn")

    mesh = trimesh.load(filename, force="mesh")
    mesh = coacd.Mesh(mesh.vertices, mesh.faces)

    result = coacd.run_coacd(mesh, **params)
    combined = sum([trimesh.Trimesh(*m) for m in result])
    combined.export(outfile)


def decompose_convex_mesh(
    filename: str,
    outfile: str,
    threshold: float = 0.05,
    max_convex_hull: int = -1,
    preprocess_mode: str = "auto",
    preprocess_resolution: int = 30,
    resolution: int = 2000,
    mcts_nodes: int = 20,
    mcts_iterations: int = 150,
    mcts_max_depth: int = 3,
    pca: bool = False,
    merge: bool = True,
    seed: int = 0,
    verbose: bool = False,
) -> str:
    """Decompose a mesh into convex parts using the CoACD algorithm."""
    coacd.set_log_level("info" if verbose else "warn")

    if os.path.exists(outfile):
        logger.warning(f"Output file {outfile} already exists, removing it.")
        os.remove(outfile)

    params = dict(
        threshold=threshold,
        max_convex_hull=max_convex_hull,
        preprocess_mode=preprocess_mode,
        preprocess_resolution=preprocess_resolution,
        resolution=resolution,
        mcts_nodes=mcts_nodes,
        mcts_iterations=mcts_iterations,
        mcts_max_depth=mcts_max_depth,
        pca=pca,
        merge=merge,
        seed=seed,
    )

    try:
        decompose_convex_coacd(filename, outfile, params, verbose)
        if os.path.exists(outfile):
            return outfile
    except Exception as e:
        if verbose:
            print(f"Decompose convex first attempt failed: {e}.")

    if preprocess_mode != "on":
        try:
            params["preprocess_mode"] = "on"
            decompose_convex_coacd(filename, outfile, params, verbose)
            if os.path.exists(outfile):
                return outfile
        except Exception as e:
            if verbose:
                print(
                    f"Decompose convex second attempt with preprocess_mode='on' failed: {e}"
                )

    raise RuntimeError(f"Convex decomposition failed on (unknown)")


def decompose_convex_mp(
    filename: str,
    outfile: str,
    threshold: float = 0.05,
    max_convex_hull: int = -1,
    preprocess_mode: str = "auto",
    preprocess_resolution: int = 30,
    resolution: int = 2000,
    mcts_nodes: int = 20,
    mcts_iterations: int = 150,
    mcts_max_depth: int = 3,
    pca: bool = False,
    merge: bool = True,
    seed: int = 0,
    verbose: bool = False,
) -> str:
    """Decompose a mesh into convex parts using the CoACD algorithm in a separate process.

    See https://simulately.wiki/docs/toolkits/ConvexDecomp for details.
    """
    params = dict(
        threshold=threshold,
        max_convex_hull=max_convex_hull,
        preprocess_mode=preprocess_mode,
        preprocess_resolution=preprocess_resolution,
        resolution=resolution,
        mcts_nodes=mcts_nodes,
        mcts_iterations=mcts_iterations,
        mcts_max_depth=mcts_max_depth,
        pca=pca,
        merge=merge,
        seed=seed,
    )

    ctx = mp.get_context("spawn")
    p = ctx.Process(
        target=decompose_convex_coacd,
        args=(filename, outfile, params, verbose),
    )
    p.start()
    p.join()
    if p.exitcode == 0 and os.path.exists(outfile):
        return outfile

    if preprocess_mode != "on":
        params["preprocess_mode"] = "on"
        p = ctx.Process(
            target=decompose_convex_coacd,
            args=(filename, outfile, params, verbose),
        )
        p.start()
        p.join()
        if p.exitcode == 0 and os.path.exists(outfile):
            return outfile

    raise RuntimeError(f"Convex decomposition failed on (unknown)")
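Both decomposition entry points share the same control flow: try CoACD once with the requested settings, and on failure retry once with `preprocess_mode="on"` before giving up (`decompose_convex_mp` additionally isolates each attempt in a spawned process so a native crash cannot take down the caller). The retry skeleton can be exercised with the stdlib alone; `_toy_decompose` is a stand-in for `decompose_convex_coacd`, not the real function:

```python
import os
import tempfile

def _toy_decompose(outfile: str, params: dict) -> None:
    # Stand-in worker: only succeeds when preprocessing is forced on,
    # simulating a mesh that CoACD cannot handle in "auto" mode.
    if params["preprocess_mode"] != "on":
        raise RuntimeError("simulated CoACD failure")
    with open(outfile, "w") as f:
        f.write("convex parts")

def decompose_with_retry(outfile: str, preprocess_mode: str = "auto") -> str:
    params = {"preprocess_mode": preprocess_mode}
    try:
        _toy_decompose(outfile, params)
        if os.path.exists(outfile):
            return outfile
    except Exception:
        pass  # fall through to the forced-preprocessing retry
    if preprocess_mode != "on":
        params["preprocess_mode"] = "on"
        _toy_decompose(outfile, params)
        if os.path.exists(outfile):
            return outfile
    raise RuntimeError(f"convex decomposition failed for {outfile}")

result = decompose_with_retry(os.path.join(tempfile.mkdtemp(), "mesh.obj"))
```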
@@ -16,13 +16,17 @@

import logging
+import multiprocessing as mp
import os
from typing import Tuple, Union

+import coacd
import igraph
import numpy as np
import pyvista as pv
import spaces
import torch
import trimesh
import utils3d
from pymeshfix import _meshfix
from tqdm import tqdm

@@ -33,7 +37,9 @@ logging.basicConfig(
logger = logging.getLogger(__name__)

-__all__ = ["MeshFixer"]
+__all__ = [
+    "MeshFixer",
+]


def _radical_inverse(base, n):
embodied_gen/envs/pick_embodiedgen.py  (new file, 389 lines)
@@ -0,0 +1,389 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.

import json
import os
from copy import deepcopy

import numpy as np
import sapien
import torch
import torchvision.transforms as transforms
from mani_skill.envs.sapien_env import BaseEnv
from mani_skill.sensors.camera import CameraConfig
from mani_skill.utils import sapien_utils
from mani_skill.utils.building import actors
from mani_skill.utils.registration import register_env
from mani_skill.utils.structs.actor import Actor
from mani_skill.utils.structs.pose import Pose
from mani_skill.utils.structs.types import (
    GPUMemoryConfig,
    SceneConfig,
    SimConfig,
)
from mani_skill.utils.visualization.misc import tile_images
from tqdm import tqdm

from embodied_gen.models.gs_model import GaussianOperator
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.geometry import bfs_placement, quaternion_multiply
from embodied_gen.utils.log import logger
from embodied_gen.utils.process_media import alpha_blend_rgba
from embodied_gen.utils.simulation import (
    SIM_COORD_ALIGN,
    load_assets_from_layout_file,
)

__all__ = ["PickEmbodiedGen"]


@register_env("PickEmbodiedGen-v1", max_episode_steps=100)
class PickEmbodiedGen(BaseEnv):
    SUPPORTED_ROBOTS = ["panda", "panda_wristcam", "fetch"]
    goal_thresh = 0.0

    def __init__(
        self,
        *args,
        robot_uids: str | list[str] = "panda",
        robot_init_qpos_noise: float = 0.02,
        num_envs: int = 1,
        reconfiguration_freq: int = None,
        **kwargs,
    ):
        self.robot_init_qpos_noise = robot_init_qpos_noise
        if reconfiguration_freq is None:
            if num_envs == 1:
                reconfiguration_freq = 1
            else:
                reconfiguration_freq = 0

        # Init params from kwargs.
        layout_file = kwargs.pop("layout_file", None)
        replace_objs = kwargs.pop("replace_objs", True)
        self.enable_grasp = kwargs.pop("enable_grasp", False)
        self.init_quat = kwargs.pop("init_quat", [0.7071, 0, 0, 0.7071])
        # Add small offset in z-axis to avoid collision.
        self.objs_z_offset = kwargs.pop("objs_z_offset", 0.002)
        self.robot_z_offset = kwargs.pop("robot_z_offset", 0.002)

        self.layouts = self.init_env_layouts(
            layout_file, num_envs, replace_objs
        )
        self.robot_pose = self.compute_robot_init_pose(
            self.layouts, num_envs, self.robot_z_offset
        )
        self.env_actors = dict()
        self.image_transform = transforms.PILToTensor()

        super().__init__(
            *args,
            robot_uids=robot_uids,
            reconfiguration_freq=reconfiguration_freq,
            num_envs=num_envs,
            **kwargs,
        )

        self.bg_images = dict()
        if self.render_mode == "hybrid":
            self.bg_images = self.render_gs3d_images(
                self.layouts, num_envs, self.init_quat
            )

    @staticmethod
    def init_env_layouts(
        layout_file: str, num_envs: int, replace_objs: bool
    ) -> list[LayoutInfo]:
        layout = LayoutInfo.from_dict(json.load(open(layout_file, "r")))
        layouts = []
        for env_idx in range(num_envs):
            if replace_objs and env_idx > 0:
                layout = bfs_placement(deepcopy(layout))
            layouts.append(layout)

        return layouts

    @staticmethod
    def compute_robot_init_pose(
        layouts: list[LayoutInfo], num_envs: int, z_offset: float = 0.0
    ) -> list[list[float]]:
        robot_pose = []
        for env_idx in range(num_envs):
            layout = layouts[env_idx]
            robot_node = layout.relation[Scene3DItemEnum.ROBOT.value]
            x, y, z, qx, qy, qz, qw = layout.position[robot_node]
            robot_pose.append([x, y, z + z_offset, qw, qx, qy, qz])

        return robot_pose

    @property
    def _default_sim_config(self):
        return SimConfig(
            scene_config=SceneConfig(
                solver_position_iterations=30,
                # contact_offset=0.04,
                # rest_offset=0.001,
            ),
            # sim_freq=200,
            control_freq=50,
            gpu_memory_config=GPUMemoryConfig(
                max_rigid_contact_count=2**20, max_rigid_patch_count=2**19
            ),
        )

    @property
    def _default_sensor_configs(self):
        pose = sapien_utils.look_at(eye=[0.3, 0, 0.6], target=[-0.1, 0, 0.1])

        return [
            CameraConfig("base_camera", pose, 128, 128, np.pi / 2, 0.01, 100)
        ]

    @property
    def _default_human_render_camera_configs(self):
        pose = sapien_utils.look_at(
            eye=[0.9, 0.0, 1.1], target=[0.0, 0.0, 0.9]
        )

        return CameraConfig(
            "render_camera", pose, 256, 256, np.deg2rad(75), 0.01, 100
        )

    def _load_agent(self, options: dict):
        super()._load_agent(options, sapien.Pose(p=[-10, 0, 10]))

    def _load_scene(self, options: dict):
        all_objects = []
        logger.info(f"Loading assets and decomposing mesh collisions...")
        for env_idx in range(self.num_envs):
            env_actors = load_assets_from_layout_file(
                self.scene,
                self.layouts[env_idx],
                z_offset=self.objs_z_offset,
                init_quat=self.init_quat,
                env_idx=env_idx,
            )
            self.env_actors[f"env{env_idx}"] = env_actors
            all_objects.extend(env_actors.values())

        self.obj = all_objects[-1]
        for obj in all_objects:
            self.remove_from_state_dict_registry(obj)

        self.all_objects = Actor.merge(all_objects, name="all_objects")
        self.add_to_state_dict_registry(self.all_objects)

        self.goal_site = actors.build_sphere(
            self.scene,
            radius=self.goal_thresh,
            color=[0, 1, 0, 0],
            name="goal_site",
            body_type="kinematic",
            add_collision=False,
            initial_pose=sapien.Pose(),
        )
        self._hidden_objects.append(self.goal_site)

    def _initialize_episode(self, env_idx: torch.Tensor, options: dict):
        with torch.device(self.device):
            b = len(env_idx)
            goal_xyz = torch.zeros((b, 3))
            goal_xyz[:, :2] = torch.rand((b, 2)) * 0.2 - 0.1
            self.goal_site.set_pose(Pose.create_from_pq(goal_xyz))

            qpos = np.array(
                [
                    0.0,
                    np.pi / 8,
                    0,
                    -np.pi * 3 / 8,
                    0,
                    np.pi * 3 / 4,
                    np.pi / 4,
                    0.04,
                    0.04,
                ]
            )
            qpos = (
                np.random.normal(
                    0, self.robot_init_qpos_noise, (self.num_envs, len(qpos))
                )
                + qpos
            )
            qpos[:, -2:] = 0.04
            self.agent.robot.set_root_pose(np.array(self.robot_pose))
            self.agent.reset(qpos)
            self.agent.init_qpos = qpos
            self.agent.controller.controllers["gripper"].reset()

    def render_gs3d_images(
        self, layouts: list[LayoutInfo], num_envs: int, init_quat: list[float]
    ) -> dict[str, np.ndarray]:
        sim_coord_align = (
            torch.tensor(SIM_COORD_ALIGN).to(torch.float32).to(self.device)
        )
        cameras = self.scene.sensors.copy()
        cameras.update(self.scene.human_render_cameras)

        bg_node = layouts[0].relation[Scene3DItemEnum.BACKGROUND.value]
        gs_path = os.path.join(layouts[0].assets[bg_node], "gs_model.ply")
        raw_gs: GaussianOperator = GaussianOperator.load_from_ply(gs_path)
        bg_images = dict()
        for env_idx in tqdm(range(num_envs), desc="Pre-rendering Background"):
            layout = layouts[env_idx]
            x, y, z, qx, qy, qz, qw = layout.position[bg_node]
            qx, qy, qz, qw = quaternion_multiply([qx, qy, qz, qw], init_quat)
            init_pose = torch.tensor([x, y, z, qx, qy, qz, qw])
            gs_model = raw_gs.get_gaussians(instance_pose=init_pose)
            for key in cameras:
                camera = cameras[key]
                Ks = camera.camera.get_intrinsic_matrix()  # (n_env, 3, 3)
                c2w = camera.camera.get_model_matrix()  # (n_env, 4, 4)
                result = gs_model.render(
                    c2w[env_idx] @ sim_coord_align,
                    Ks[env_idx],
                    image_width=camera.config.width,
                    image_height=camera.config.height,
                )
                bg_images[f"{key}-env{env_idx}"] = result.rgb[..., ::-1]

        return bg_images

    def render(self):
        if self.render_mode is None:
            raise RuntimeError("render_mode is not set.")
        if self.render_mode == "human":
            return self.render_human()
        elif self.render_mode == "rgb_array":
            res = self.render_rgb_array()
            return res
        elif self.render_mode == "sensors":
            res = self.render_sensors()
            return res
        elif self.render_mode == "all":
            return self.render_all()
        elif self.render_mode == "hybrid":
            return self.hybrid_render()
        else:
            raise NotImplementedError(
                f"Unsupported render mode {self.render_mode}."
            )

    def render_rgb_array(
        self, camera_name: str = None, return_alpha: bool = False
    ):
        for obj in self._hidden_objects:
            obj.show_visual()
        self.scene.update_render(
            update_sensors=False, update_human_render_cameras=True
        )
        images = []
        render_images = self.scene.get_human_render_camera_images(
            camera_name, return_alpha
        )
        for image in render_images.values():
            images.append(image)
        if len(images) == 0:
            return None
        if len(images) == 1:
            return images[0]
        for obj in self._hidden_objects:
            obj.hide_visual()
        return tile_images(images)

    def render_sensors(self):
        images = []
        sensor_images = self.get_sensor_images()
        for image in sensor_images.values():
            for img in image.values():
                images.append(img)
        return tile_images(images)

    def hybrid_render(self):
        fg_images = self.render_rgb_array(
            return_alpha=True
        )  # (n_env, h, w, 3)
        images = []
        for key in self.bg_images:
            if "render_camera" not in key:
                continue
            env_idx = int(key.split("-env")[-1])
            rgba = alpha_blend_rgba(
                fg_images[env_idx].cpu().numpy(), self.bg_images[key]
            )
            images.append(self.image_transform(rgba))

        images = torch.stack(images, dim=0)
        images = images.permute(0, 2, 3, 1)

        return images[..., :3]

    def evaluate(self):
        obj_to_goal_pos = (
            self.obj.pose.p
        )  # self.goal_site.pose.p - self.obj.pose.p
        is_obj_placed = (
            torch.linalg.norm(obj_to_goal_pos, axis=1) <= self.goal_thresh
        )
        is_grasped = self.agent.is_grasping(self.obj)
        is_robot_static = self.agent.is_static(0.2)

        return dict(
            is_grasped=is_grasped,
            obj_to_goal_pos=obj_to_goal_pos,
            is_obj_placed=is_obj_placed,
            is_robot_static=is_robot_static,
            is_grasping=self.agent.is_grasping(self.obj),
            success=torch.logical_and(is_obj_placed, is_robot_static),
        )

    def _get_obs_extra(self, info: dict):
        return dict()

    def compute_dense_reward(self, obs: any, action: torch.Tensor, info: dict):
        tcp_to_obj_dist = torch.linalg.norm(
            self.obj.pose.p - self.agent.tcp.pose.p, axis=1
        )
        reaching_reward = 1 - torch.tanh(5 * tcp_to_obj_dist)
        reward = reaching_reward

        is_grasped = info["is_grasped"]
        reward += is_grasped

        # obj_to_goal_dist = torch.linalg.norm(
        #     self.goal_site.pose.p - self.obj.pose.p, axis=1
        # )
        obj_to_goal_dist = torch.linalg.norm(
            self.obj.pose.p - self.obj.pose.p, axis=1
        )
        place_reward = 1 - torch.tanh(5 * obj_to_goal_dist)
        reward += place_reward * is_grasped

        reward += info["is_obj_placed"] * is_grasped

        static_reward = 1 - torch.tanh(
            5
            * torch.linalg.norm(self.agent.robot.get_qvel()[..., :-2], axis=1)
        )
        reward += static_reward * info["is_obj_placed"] * is_grasped

        reward[info["success"]] = 6
        return reward

    def compute_normalized_dense_reward(
        self, obs: any, action: torch.Tensor, info: dict
    ):
        return self.compute_dense_reward(obs=obs, action=action, info=info) / 6
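The dense reward above repeatedly uses the shaping term `1 - tanh(5 * d)`, which maps a nonnegative distance `d` into (0, 1]: exactly 1 at zero distance, smoothly decaying toward 0 as the gripper (or object) moves away. A quick stand-alone numeric check of that shaping curve:

```python
import math

def shaping(distance: float, scale: float = 5.0) -> float:
    """Smooth reward in (0, 1]: 1 at distance 0, approaching 0 far away."""
    return 1.0 - math.tanh(scale * distance)

# Evaluate at a few distances (meters): closer is strictly better.
rewards = [shaping(d) for d in (0.0, 0.05, 0.2, 1.0)]
```

Summing one such bounded term per stage (reach, grasp, place, hold still) plus a constant override on success is what keeps the total reward capped at 6, so `compute_normalized_dense_reward` can simply divide by 6.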
@@ -137,7 +137,7 @@ def entrypoint() -> None:
    sim_cli(
        layout_path=layout_path,
        output_dir=output_root,
-        robot_name="franka" if args.insert_robot else None,
+        insert_robot=args.insert_robot,
    )

    torch.cuda.empty_cache()
@@ -58,7 +58,7 @@ os.environ["GRADIO_ANALYTICS_ENABLED"] = "false"
os.environ["SPCONV_ALGO"] = "native"
random.seed(0)

-logger.info("Loading Models...")
+logger.info("Loading Image3D Models...")
DELIGHT = DelightingModel()
IMAGESR_MODEL = ImageRealESRGAN(outscale=4)
RBG_REMOVER = RembgRemover()
@@ -107,6 +107,7 @@ def parse_args():
        type=int,
        default=2,
    )
+    parser.add_argument("--disable_decompose_convex", action="store_true")
    args, unknown = parser.parse_known_args()

    return args
@@ -251,7 +252,11 @@ def entrypoint(**kwargs):
    mesh_glb_path = os.path.join(output_root, f"(unknown).glb")
    mesh.export(mesh_glb_path)

-    urdf_convertor = URDFGenerator(GPT_CLIENT, render_view_num=4)
+    urdf_convertor = URDFGenerator(
+        GPT_CLIENT,
+        render_view_num=4,
+        decompose_convex=not args.disable_decompose_convex,
+    )
    asset_attrs = {
        "version": VERSION,
        "gs_model": f"{urdf_convertor.output_mesh_dir}/(unknown)_gs.ply",
embodied_gen/scripts/parallel_sim.py  (new file, 148 lines)
@@ -0,0 +1,148 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.

from embodied_gen.utils.monkey_patches import monkey_patch_maniskill

monkey_patch_maniskill()
import json
from collections import defaultdict
from dataclasses import dataclass
from typing import Literal

import gymnasium as gym
import numpy as np
import torch
import tyro
from mani_skill.utils.wrappers import RecordEpisode
from tqdm import tqdm

import embodied_gen.envs.pick_embodiedgen
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.log import logger
from embodied_gen.utils.simulation import FrankaPandaGrasper


@dataclass
class ParallelSimConfig:
    """CLI parameters for parallel SAPIEN simulation."""

    # Environment configuration
    layout_file: str
    """Path to the layout JSON file"""
    output_dir: str
    """Directory to save recorded videos"""
    gym_env_name: str = "PickEmbodiedGen-v1"
    """Name of the Gym environment to use"""
    num_envs: int = 4
    """Number of parallel environments"""
    render_mode: Literal["rgb_array", "hybrid"] = "hybrid"
    """Rendering mode: rgb_array or hybrid"""
    enable_shadow: bool = True
    """Whether to enable shadows in rendering"""
    control_mode: str = "pd_joint_pos"
    """Control mode for the agent"""

    # Recording configuration
    max_steps_per_video: int = 1000
    """Maximum steps to record per video"""
    save_trajectory: bool = False
    """Whether to save trajectory data"""

    # Simulation parameters
    seed: int = 0
    """Random seed for environment reset"""
    warmup_steps: int = 50
    """Number of warmup steps before action computation"""
    reach_target_only: bool = True
    """Whether to only reach the target instead of executing the full action"""


def entrypoint(**kwargs):
    if kwargs is None or len(kwargs) == 0:
        cfg = tyro.cli(ParallelSimConfig)
    else:
        cfg = ParallelSimConfig(**kwargs)

    env = gym.make(
        cfg.gym_env_name,
        num_envs=cfg.num_envs,
        render_mode=cfg.render_mode,
        enable_shadow=cfg.enable_shadow,
        layout_file=cfg.layout_file,
        control_mode=cfg.control_mode,
    )
    env = RecordEpisode(
        env,
        cfg.output_dir,
        max_steps_per_video=cfg.max_steps_per_video,
        save_trajectory=cfg.save_trajectory,
    )
    env.reset(seed=cfg.seed)

    default_action = env.unwrapped.agent.init_qpos[:, :8]
    for _ in tqdm(range(cfg.warmup_steps), desc="SIM Warmup"):
        # action = env.action_space.sample()  # Random action
        obs, reward, terminated, truncated, info = env.step(default_action)

    grasper = FrankaPandaGrasper(
        env.unwrapped.agent,
        env.unwrapped.sim_config.control_freq,
    )

    layout_data = LayoutInfo.from_dict(json.load(open(cfg.layout_file, "r")))
    actions = defaultdict(list)
    # Plan the grasp reach pose for each manipulated object in each env.
    for env_idx in range(env.num_envs):
        actors = env.unwrapped.env_actors[f"env{env_idx}"]
        for node in layout_data.relation[
            Scene3DItemEnum.MANIPULATED_OBJS.value
        ]:
            action = grasper.compute_grasp_action(
                actor=actors[node]._objs[0],
                reach_target_only=True,
                env_idx=env_idx,
            )
            actions[node].append(action)

    # Execute the planned actions for each manipulated object in each env.
    for node in actions:
        max_env_steps = 0
        for env_idx in range(env.num_envs):
            if actions[node][env_idx] is None:
                continue
            max_env_steps = max(max_env_steps, len(actions[node][env_idx]))

        action_tensor = np.ones(
            (max_env_steps, env.num_envs, env.action_space.shape[-1])
        )
        action_tensor *= default_action[None, ...]
        for env_idx in range(env.num_envs):
            action = actions[node][env_idx]
            if action is None:
                continue
            action_tensor[: len(action), env_idx, :] = action

        for step in tqdm(range(max_env_steps), desc=f"Grasping: {node}"):
            action = torch.Tensor(action_tensor[step]).to(env.unwrapped.device)
            env.unwrapped.agent.set_action(action)
            obs, reward, terminated, truncated, info = env.step(action)

    env.close()
    logger.info(f"Results saved in {cfg.output_dir}")


if __name__ == "__main__":
    entrypoint()
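The execution loop in this new file packs per-env action sequences of different lengths (some of which may be `None` when planning failed) into a single `(max_steps, num_envs, act_dim)` tensor, padding with a hold-position default action. A NumPy-only sketch of that padding scheme, with toy shapes standing in for the real `init_qpos` and `action_space`:

```python
import numpy as np

# Toy stand-ins: 3 parallel envs, 8-dim actions, zero "hold" action.
num_envs, act_dim = 3, 8
default_action = np.zeros((num_envs, act_dim))  # stand-in for init_qpos[:, :8]
per_env = [np.ones((5, act_dim)), None, np.ones((2, act_dim))]

# Pad every sequence to the longest one; envs with None (or short
# sequences) simply keep the default action for the remaining steps.
max_steps = max(len(a) for a in per_env if a is not None)
action_tensor = np.ones((max_steps, num_envs, act_dim))
action_tensor *= default_action[None, ...]
for env_idx, action in enumerate(per_env):
    if action is None:
        continue
    action_tensor[: len(action), env_idx, :] = action
```

Stepping the vectorized env with `action_tensor[step]` then drives all envs in lockstep, which is why the padding is needed at all.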
@@ -35,6 +35,9 @@ from embodied_gen.utils.simulation import (
    SIM_COORD_ALIGN,
    FrankaPandaGrasper,
    SapienSceneManager,
    load_assets_from_layout_file,
    load_mani_skill_robot,
    render_images,
)

@@ -50,8 +53,8 @@ class SapienSimConfig:
        default_factory=lambda: [0.7071, 0, 0, 0.7071]
    )  # xyzw
    device: str = "cuda"
    robot_name: str | None = None
    control_freq: int = 40
    control_freq: int = 50
    insert_robot: bool = False
    # Camera settings.
    render_interval: int = 10
    num_cameras: int = 3
@@ -91,48 +94,41 @@ def entrypoint(**kwargs):
        layout_data = json.load(f)
    layout_data: LayoutInfo = LayoutInfo.from_dict(layout_data)

    scene_manager.load_assets_from_layout_file(
    actors = load_assets_from_layout_file(
        scene_manager.scene,
        layout_data,
        cfg.z_offset,
        cfg.init_quat,
        cfg.robot_name,
        cfg.control_freq,
    )
    agent = load_mani_skill_robot(
        scene_manager.scene, layout_data, cfg.control_freq
    )

    frames = defaultdict(list)
    image_cnt = 0
    for step in tqdm(range(cfg.sim_step), desc="Simulation"):
        scene_manager.scene.step()
        if cfg.robot_name is not None:
            scene_manager.robot.reset(scene_manager.robot.init_qpos)
            agent.reset(agent.init_qpos)
        if step % cfg.render_interval != 0:
            continue
        scene_manager.scene.update_render()
        image_cnt += 1
        for camera in scene_manager.cameras:
            camera.take_picture()
            images = scene_manager.render_images(
                camera, render_keys=cfg.render_keys
            )
            images = render_images(camera, cfg.render_keys)
            frames[camera.name].append(images)

    grasp_frames = defaultdict(dict)
    if cfg.robot_name is not None:
        sim_steps_per_control = cfg.sim_freq // cfg.control_freq
        control_timestep = 1.0 / cfg.control_freq
    actions = dict()
    if cfg.insert_robot:
        grasper = FrankaPandaGrasper(
            scene_manager.robot,
            scene_manager,
            sim_steps_per_control,
            control_timestep,
            agent,
            cfg.control_freq,
        )
        for node in layout_data.relation[
            Scene3DItemEnum.MANIPULATED_OBJS.value
        ]:
            grasp_frames[node] = grasper.render_grasp(
                scene_manager.actors[node],
                scene_manager.cameras,
                cfg.render_keys,
            actions[node] = grasper.compute_grasp_action(
                actor=actors[node], reach_target_only=True
            )

    if "Foreground" not in cfg.render_keys:
@@ -160,22 +156,36 @@ def entrypoint(**kwargs):
            bg_images[camera.name] = result.rgb[..., ::-1]

    video_frames = []
    for camera in scene_manager.cameras:
    for idx, camera in enumerate(scene_manager.cameras):
        # Scene rendering
        for step in tqdm(range(image_cnt), desc="Rendering"):
            rgba = alpha_blend_rgba(
                frames[camera.name][step]["Foreground"],
                bg_images[camera.name],
            )
            video_frames.append(np.array(rgba))
        # Grasp rendering
        for node in grasp_frames:
            for frame in grasp_frames[node][camera.name]:
        if idx == 0:
            for step in range(image_cnt):
                rgba = alpha_blend_rgba(
                    frame["Foreground"], bg_images[camera.name]
                    frames[camera.name][step]["Foreground"],
                    bg_images[camera.name],
                )
                video_frames.append(np.array(rgba))

        # Grasp rendering
        for node in actions:
            if actions[node] is None:
                continue
            for action in tqdm(actions[node]):
                grasp_frames = scene_manager.step_action(
                    agent,
                    torch.Tensor(action[None, ...]),
                    scene_manager.cameras,
                    cfg.render_keys,
                    sim_steps_per_control=cfg.sim_freq // cfg.control_freq,
                )
                rgba = alpha_blend_rgba(
                    grasp_frames[camera.name][0]["Foreground"],
                    bg_images[camera.name],
                )
                video_frames.append(np.array(rgba))

        agent.reset(agent.init_qpos)

    os.makedirs(cfg.output_dir, exist_ok=True)
    video_path = f"{cfg.output_dir}/Iscene.mp4"
    imageio.mimsave(video_path, video_frames, fps=30)
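The rendering loop above composites the "Foreground" RGBA render over a background plate via `alpha_blend_rgba` (defined elsewhere in the repo). A hedged sketch of what such a compositor conventionally does, using a hypothetical `alpha_blend_rgb` helper so the signature is not mistaken for the project's actual one:

```python
import numpy as np

def alpha_blend_rgb(foreground_rgba: np.ndarray, background_rgb: np.ndarray) -> np.ndarray:
    """Blend an RGBA foreground over an RGB background by the alpha channel."""
    alpha = foreground_rgba[..., 3:4].astype(np.float32) / 255.0
    fg = foreground_rgba[..., :3].astype(np.float32)
    bg = background_rgb.astype(np.float32)
    return (alpha * fg + (1.0 - alpha) * bg).astype(np.uint8)

# Fully opaque black foreground over a gray background: output is black.
fg = np.zeros((2, 2, 4), dtype=np.uint8)
fg[..., 3] = 255
bg = np.full((2, 2, 3), 200, dtype=np.uint8)
out = alpha_blend_rgb(fg, bg)
```

With alpha 0 everywhere the background passes through unchanged, which is what lets the hybrid render mode paste the simulated foreground onto a pre-rendered 3DGS background.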
@@ -42,7 +42,7 @@ from embodied_gen.validators.quality_checkers import (
os.environ["TOKENIZERS_PARALLELISM"] = "false"
random.seed(0)

logger.info("Loading Models...")
logger.info("Loading TEXT2IMG_MODEL...")
SEMANTIC_CHECKER = SemanticConsistChecker(GPT_CLIENT)
SEG_CHECKER = ImageSegChecker(GPT_CLIENT)
TXTGEN_CHECKER = TextGenAlignChecker(GPT_CLIENT)
@@ -170,6 +170,7 @@ def text_to_3d(**kwargs) -> dict:
        seed=random.randint(0, 100000) if seed_3d is None else seed_3d,
        n_retry=args.n_asset_retry,
        keep_intermediate=args.keep_intermediate,
        disable_decompose_convex=args.disable_decompose_convex,
    )
    mesh_path = f"{node_save_dir}/result/mesh/{save_node}.obj"
    image_path = render_asset3d(
@@ -270,6 +271,7 @@ def parse_args():
        help="Random seed for 3D generation",
    )
    parser.add_argument("--keep_intermediate", action="store_true")
    parser.add_argument("--disable_decompose_convex", action="store_true")

    args, unknown = parser.parse_known_args()
@@ -18,6 +18,7 @@ import os
import sys
import zipfile

import numpy as np
import torch
from huggingface_hub import hf_hub_download
from omegaconf import OmegaConf
@@ -150,3 +151,68 @@ def monkey_patch_pano2room():
        self.inpaint_pipe = pipe

    SDFTInpainter.__init__ = patched_sd_inpaint_init


def monkey_patch_maniskill():
    from mani_skill.envs.scene import ManiSkillScene

    def get_sensor_images(
        self, obs: dict[str, any]
    ) -> dict[str, dict[str, torch.Tensor]]:
        sensor_data = dict()
        for name, sensor in self.sensors.items():
            sensor_data[name] = sensor.get_images(obs[name])
        return sensor_data

    def get_human_render_camera_images(
        self, camera_name: str = None, return_alpha: bool = False
    ) -> dict[str, torch.Tensor]:
        def get_rgba_tensor(camera, return_alpha):
            color = camera.get_obs(
                rgb=True, depth=False, segmentation=False, position=False
            )["rgb"]
            if return_alpha:
                seg_labels = camera.get_obs(
                    rgb=False, depth=False, segmentation=True, position=False
                )["segmentation"]
                masks = np.where((seg_labels.cpu() > 0), 255, 0).astype(
                    np.uint8
                )
                masks = torch.tensor(masks).to(color.device)
                color = torch.concat([color, masks], dim=-1)

            return color

        image_data = dict()
        if self.gpu_sim_enabled:
            if self.parallel_in_single_scene:
                for name, camera in self.human_render_cameras.items():
                    camera.camera._render_cameras[0].take_picture()
                    rgba = get_rgba_tensor(camera, return_alpha)
                    image_data[name] = rgba
            else:
                for name, camera in self.human_render_cameras.items():
                    if camera_name is not None and name != camera_name:
                        continue
                    assert camera.config.shader_config.shader_pack not in [
                        "rt",
                        "rt-fast",
                        "rt-med",
                    ], "ray tracing shaders do not work with parallel rendering"
                    camera.capture()
                    rgba = get_rgba_tensor(camera, return_alpha)
                    image_data[name] = rgba
        else:
            for name, camera in self.human_render_cameras.items():
                if camera_name is not None and name != camera_name:
                    continue
                camera.capture()
                rgba = get_rgba_tensor(camera, return_alpha)
                image_data[name] = rgba

        return image_data

    ManiSkillScene.get_sensor_images = get_sensor_images
    ManiSkillScene.get_human_render_camera_images = (
        get_human_render_camera_images
    )
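`monkey_patch_maniskill` above uses the standard runtime monkey-patching idiom: define a replacement function and assign it onto the imported class, so every existing and future instance picks up the new behavior. A self-contained sketch of the same idiom with a toy class standing in for `ManiSkillScene`:

```python
# Toy class standing in for an imported third-party class.
class Scene:
    def render(self) -> str:
        return "rgb"

def patched_render(self) -> str:
    # Extended behavior, e.g. also returning an alpha channel.
    return "rgb+alpha"

# Assigning onto the class rebinds the method for all instances.
Scene.render = patched_render
```

The patch must run before any code caches a reference to the original method, which is why `parallel_sim.py` calls `monkey_patch_maniskill()` before importing the env module.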
@@ -14,7 +14,7 @@
# implied. See the License for the specific language governing
# permissions and limitations under the License.


import json
import logging
import os
import xml.etree.ElementTree as ET
@@ -24,19 +24,20 @@ from typing import Literal
import mplib
import numpy as np
import sapien.core as sapien
import sapien.physx as physx
import torch
from PIL import Image, ImageColor
from scipy.spatial.transform import Rotation as R
from tqdm import tqdm
from embodied_gen.data.utils import DiffrastRender
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.geometry import quaternion_multiply

logger = logging.getLogger(__name__)
from mani_skill.agents.base_agent import BaseAgent
from mani_skill.envs.scene import ManiSkillScene
from mani_skill.examples.motionplanning.panda.utils import (
    compute_grasp_info_by_obb,
)
from mani_skill.utils.geometry.trimesh_utils import get_component_mesh
from PIL import Image, ImageColor
from scipy.spatial.transform import Rotation as R

from embodied_gen.data.utils import DiffrastRender
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.geometry import quaternion_multiply
from embodied_gen.utils.log import logger

COLORMAP = list(set(ImageColor.colormap.values()))
COLOR_PALETTE = np.array(
@@ -54,10 +55,267 @@ SIM_COORD_ALIGN = np.array(
__all__ = [
    "SIM_COORD_ALIGN",
    "FrankaPandaGrasper",
    "SimpleGrasper",
    "load_assets_from_layout_file",
    "load_mani_skill_robot",
    "render_images",
]
def load_actor_from_urdf(
    scene: ManiSkillScene | sapien.Scene,
    file_path: str,
    pose: sapien.Pose,
    env_idx: int = None,
    use_static: bool = False,
    update_mass: bool = False,
) -> sapien.pysapien.Entity:
    tree = ET.parse(file_path)
    root = tree.getroot()
    node_name = root.get("name")
    file_dir = os.path.dirname(file_path)

    visual_mesh = root.find('.//visual/geometry/mesh')
    visual_file = visual_mesh.get("filename")
    visual_scale = visual_mesh.get("scale", "1.0 1.0 1.0")
    visual_scale = np.array([float(x) for x in visual_scale.split()])

    collision_mesh = root.find('.//collision/geometry/mesh')
    collision_file = collision_mesh.get("filename")
    collision_scale = collision_mesh.get("scale", "1.0 1.0 1.0")
    collision_scale = np.array([float(x) for x in collision_scale.split()])

    visual_file = os.path.join(file_dir, visual_file)
    collision_file = os.path.join(file_dir, collision_file)
    static_fric = root.find('.//collision/gazebo/mu1').text
    dynamic_fric = root.find('.//collision/gazebo/mu2').text

    material = physx.PhysxMaterial(
        static_friction=np.clip(float(static_fric), 0.1, 0.7),
        dynamic_friction=np.clip(float(dynamic_fric), 0.1, 0.6),
        restitution=0.05,
    )
    builder = scene.create_actor_builder()

    body_type = "static" if use_static else "dynamic"
    builder.set_physx_body_type(body_type)
    builder.add_multiple_convex_collisions_from_file(
        collision_file if body_type == "dynamic" else visual_file,
        material=material,
        scale=collision_scale if body_type == "dynamic" else visual_scale,
        # decomposition="coacd",
        # decomposition_params=dict(
        #     threshold=0.05, max_convex_hull=64, verbose=False
        # ),
    )

    builder.add_visual_from_file(visual_file, scale=visual_scale)
    builder.set_initial_pose(pose)
    if isinstance(scene, ManiSkillScene) and env_idx is not None:
        builder.set_scene_idxs([env_idx])

    actor = builder.build(name=f"{node_name}-{env_idx}")

    if update_mass and hasattr(actor.components[1], "mass"):
        node_mass = float(root.find('.//inertial/mass').get("value"))
        actor.components[1].set_mass(node_mass)

    return actor
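`load_actor_from_urdf` above pulls the mesh filenames and optional scales out of the URDF with fixed ElementTree XPath queries. A runnable sketch of just those queries against a minimal inline URDF (the `mug` asset and paths are made up for illustration):

```python
import xml.etree.ElementTree as ET

urdf = """<robot name="mug">
  <link name="base">
    <visual><geometry><mesh filename="mesh/mug.obj" scale="0.5 0.5 0.5"/></geometry></visual>
    <collision><geometry><mesh filename="mesh/mug_collision.obj"/></geometry></collision>
  </link>
</robot>"""
root = ET.fromstring(urdf)

visual_mesh = root.find('.//visual/geometry/mesh')
visual_file = visual_mesh.get("filename")
# scale is optional in URDF; default to unit scale, as the loader above does.
scale = [float(x) for x in visual_mesh.get("scale", "1.0 1.0 1.0").split()]
collision_file = root.find('.//collision/geometry/mesh').get("filename")
```

The filenames are relative to the URDF's directory, which is why the loader joins them with `os.path.dirname(file_path)` before handing them to the actor builder.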
def load_assets_from_layout_file(
    scene: ManiSkillScene | sapien.Scene,
    layout: LayoutInfo | str,
    z_offset: float = 0.0,
    init_quat: list[float] = [0, 0, 0, 1],
    env_idx: int = None,
) -> dict[str, sapien.pysapien.Entity]:
    """Load assets from `EmbodiedGen` layout-gen output and create actors in the scene.

    Args:
        scene (sapien.Scene | ManiSkillScene): The SAPIEN or ManiSkill scene to load assets into.
        layout (LayoutInfo): The layout information data.
        z_offset (float): Offset to apply to the Z-coordinate of non-context objects.
        init_quat (List[float]): Initial quaternion (x, y, z, w) for orientation adjustment.
        env_idx (int): Environment index for the multi-environment setup.
    """
    if isinstance(layout, str) and layout.endswith(".json"):
        layout = LayoutInfo.from_dict(json.load(open(layout, "r")))

    actors = dict()
    for node in layout.assets:
        file_dir = layout.assets[node]
        file_name = f"{node.replace(' ', '_')}.urdf"
        urdf_file = os.path.join(file_dir, file_name)

        if layout.objs_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
            continue

        position = layout.position[node].copy()
        if layout.objs_mapping[node] != Scene3DItemEnum.CONTEXT.value:
            position[2] += z_offset

        use_static = (
            layout.relation.get(Scene3DItemEnum.CONTEXT.value, None) == node
        )

        # Combine the initial quaternion with the object quaternion.
        x, y, z, qx, qy, qz, qw = position
        qx, qy, qz, qw = quaternion_multiply([qx, qy, qz, qw], init_quat)
        actor = load_actor_from_urdf(
            scene,
            urdf_file,
            sapien.Pose(p=[x, y, z], q=[qw, qx, qy, qz]),
            env_idx,
            use_static=use_static,
            update_mass=False,
        )
        actors[node] = actor

    return actors
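The loader above composes each object's quaternion with `init_quat` via the repo's `quaternion_multiply` (xyzw convention) and then reorders to wxyz for `sapien.Pose`. A sketch of an assumed-equivalent Hamilton product, purely to illustrate the convention and the reordering; the real helper lives in `embodied_gen.utils.geometry`:

```python
import numpy as np

def quat_multiply_xyzw(q1, q2):
    """Hamilton product of two quaternions in (x, y, z, w) order (assumed convention)."""
    x1, y1, z1, w1 = q1
    x2, y2, z2, w2 = q2
    return np.array([
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
    ])

identity = np.array([0.0, 0.0, 0.0, 1.0])
q = np.array([0.7071, 0.0, 0.0, 0.7071])  # ~90 deg about x, like init_quat above
composed = quat_multiply_xyzw(q, identity)

# sapien.Pose expects (w, x, y, z), hence the reorder in the loader.
qx, qy, qz, qw = composed
pose_q_wxyz = [qw, qx, qy, qz]
```

Composing with the identity leaves the quaternion unchanged, and composing with the conjugate returns (approximately) the identity, which is an easy sanity check for the convention.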
def load_mani_skill_robot(
    scene: sapien.Scene | ManiSkillScene,
    layout: LayoutInfo | str,
    control_freq: int = 20,
    robot_init_qpos_noise: float = 0.0,
    control_mode: str = "pd_joint_pos",
    backend_str: tuple[str, str] = ("cpu", "gpu"),
) -> BaseAgent:
    from mani_skill.agents import REGISTERED_AGENTS
    from mani_skill.envs.scene import ManiSkillScene
    from mani_skill.envs.utils.system.backend import (
        parse_sim_and_render_backend,
    )

    if isinstance(layout, str) and layout.endswith(".json"):
        layout = LayoutInfo.from_dict(json.load(open(layout, "r")))

    robot_name = layout.relation[Scene3DItemEnum.ROBOT.value]
    x, y, z, qx, qy, qz, qw = layout.position[robot_name]
    delta_z = 0.002  # Add a small offset to avoid collision.
    pose = sapien.Pose([x, y, z + delta_z], [qw, qx, qy, qz])

    if robot_name not in REGISTERED_AGENTS:
        logger.warning(
            f"Robot `{robot_name}` not registered, choose from {REGISTERED_AGENTS.keys()}; falling back to `panda`."
        )
        robot_name = "panda"

    ROBOT_CLS = REGISTERED_AGENTS[robot_name].agent_cls
    backend = parse_sim_and_render_backend(*backend_str)
    if isinstance(scene, sapien.Scene):
        scene = ManiSkillScene([scene], device=backend_str[0], backend=backend)
    robot = ROBOT_CLS(
        scene=scene,
        control_freq=control_freq,
        control_mode=control_mode,
        initial_pose=pose,
    )

    # Set the robot's initial joint positions in radians (joint0-joint6 plus 2 fingers).
    qpos = np.array(
        [
            0.0,
            np.pi / 8,
            0,
            -np.pi * 3 / 8,
            0,
            np.pi * 3 / 4,
            np.pi / 4,
            0.04,
            0.04,
        ]
    )
    qpos = (
        np.random.normal(
            0, robot_init_qpos_noise, (len(scene.sub_scenes), len(qpos))
        )
        + qpos
    )
    qpos[:, -2:] = 0.04
    robot.reset(qpos)
    robot.init_qpos = robot.robot.qpos
    robot.controller.controllers["gripper"].reset()

    return robot
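The initial-pose sampling in `load_mani_skill_robot` adds per-env Gaussian noise to a nominal Franka configuration and then pins both gripper fingers back to the fully open width (0.04 m) regardless of the noise. A NumPy-only sketch of that sampling, with a seeded generator instead of the global `np.random` state:

```python
import numpy as np

# Nominal Franka Panda configuration: 7 arm joints (rad) + 2 finger widths (m).
qpos_nominal = np.array(
    [0.0, np.pi / 8, 0, -np.pi * 3 / 8, 0, np.pi * 3 / 4, np.pi / 4, 0.04, 0.04]
)
num_envs, noise_std = 4, 0.02
rng = np.random.default_rng(0)

# One noisy copy per parallel env.
qpos = rng.normal(0, noise_std, (num_envs, len(qpos_nominal))) + qpos_nominal
qpos[:, -2:] = 0.04  # keep the fingers deterministic (fully open)
```

Pinning the fingers avoids starting an episode with a partially closed gripper, which would bias the grasp planner's reach poses.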
def render_images(
    camera: sapien.render.RenderCameraComponent,
    render_keys: list[
        Literal[
            "Color",
            "Segmentation",
            "Normal",
            "Mask",
            "Depth",
            "Foreground",
        ]
    ] = None,
) -> dict[str, Image.Image]:
    """Render images from a given SAPIEN camera.

    Args:
        camera (sapien.render.RenderCameraComponent): The camera to render from.
        render_keys (List[str]): Types of images to render (e.g., Color, Segmentation).

    Returns:
        Dict[str, Image.Image]: Dictionary of rendered images.
    """
    if render_keys is None:
        render_keys = [
            "Color",
            "Segmentation",
            "Normal",
            "Mask",
            "Depth",
            "Foreground",
        ]

    results: dict[str, Image.Image] = {}
    if "Color" in render_keys:
        color = camera.get_picture("Color")
        color_rgb = (np.clip(color[..., :3], 0, 1) * 255).astype(np.uint8)
        results["Color"] = Image.fromarray(color_rgb)

    if "Mask" in render_keys:
        color = camera.get_picture("Color")  # RGBA; alpha encodes coverage.
        alpha = (np.clip(color[..., 3], 0, 1) * 255).astype(np.uint8)
        results["Mask"] = Image.fromarray(alpha)

    if "Segmentation" in render_keys:
        seg_labels = camera.get_picture("Segmentation")
        label0 = seg_labels[..., 0].astype(np.uint8)
        seg_color = COLOR_PALETTE[label0]
        results["Segmentation"] = Image.fromarray(seg_color)

    if "Foreground" in render_keys:
        seg_labels = camera.get_picture("Segmentation")
        label0 = seg_labels[..., 0]
        mask = np.where((label0 > 1), 255, 0).astype(np.uint8)
        color = camera.get_picture("Color")
        color_rgb = (np.clip(color[..., :3], 0, 1) * 255).astype(np.uint8)
        foreground = np.concatenate([color_rgb, mask[..., None]], axis=-1)
        results["Foreground"] = Image.fromarray(foreground)

    if "Normal" in render_keys:
        normal = camera.get_picture("Normal")[..., :3]
        normal_img = (((normal + 1) / 2) * 255).astype(np.uint8)
        results["Normal"] = Image.fromarray(normal_img)

    if "Depth" in render_keys:
        position_map = camera.get_picture("Position")
        depth = -position_map[..., 2]
        alpha = torch.tensor(
            camera.get_picture("Color")[..., 3], dtype=torch.float32
        )
        norm_depth = DiffrastRender.normalize_map_by_mask(
            torch.tensor(depth), alpha
        )
        depth_img = (norm_depth * 255).to(torch.uint8).numpy()
        results["Depth"] = Image.fromarray(depth_img)

    return results
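The "Foreground" branch of `render_images` above turns the segmentation labels into an alpha channel: pixels whose label is greater than 1 (i.e. not background or ground plane) become opaque, everything else transparent. A NumPy-only sketch of that masking step with tiny toy buffers in place of the camera pictures:

```python
import numpy as np

# Toy stand-ins for camera.get_picture("Color") / ("Segmentation").
color_rgb = np.full((2, 3, 3), 128, dtype=np.uint8)
seg_labels = np.array([[0, 2, 5], [1, 3, 0]])  # per-pixel actor ids

# Labels 0/1 are treated as background; everything else is foreground.
mask = np.where(seg_labels > 1, 255, 0).astype(np.uint8)
foreground = np.concatenate([color_rgb, mask[..., None]], axis=-1)  # RGBA
```

This RGBA image is exactly what the hybrid render mode later alpha-blends over the pre-rendered background plate.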
class SapienSceneManager:
    """A class to manage the SAPIEN simulator."""

@@ -81,11 +339,6 @@ class SapienSceneManager:
        sapien.render.set_ray_tracing_path_depth(10)
        sapien.render.set_ray_tracing_denoiser("oidn")

        # TODO: Support GPU.
        # import sapien.physx as physx
        # physx.enable_gpu()
        # scene = sapien.Scene(systems=[sapien.physx.PhysxGpuSystem(), sapien.render.RenderSystem()])

        scene = sapien.Scene()
        scene.set_timestep(1 / self.sim_freq)

@@ -114,12 +367,13 @@ class SapienSceneManager:

    def step_action(
        self,
        agent: BaseAgent,
        action: torch.Tensor,
        sim_steps_per_control: int,
        cameras: list[sapien.render.RenderCameraComponent],
        render_keys: list[str],
        sim_steps_per_control: int = 1,
    ) -> dict:
        self.robot.set_action(action)
        agent.set_action(action)
        frames = defaultdict(list)
        for _ in range(sim_steps_per_control):
            self.scene.step()
@@ -127,125 +381,11 @@ class SapienSceneManager:
            self.scene.update_render()
            for camera in cameras:
                camera.take_picture()
                images = self.render_images(camera, render_keys=render_keys)
                images = render_images(camera, render_keys=render_keys)
                frames[camera.name].append(images)

        return frames

    def load_exists_robot(
        self,
        robot_name: str,
        pose: sapien.Pose,
        control_freq: int = 20,
        robot_init_qpos_noise: float = 0.02,
        control_mode: str = "pd_joint_pos",
        backend_str: tuple[str, str] = ("cpu", "gpu"),
    ):
        from mani_skill.agents import REGISTERED_AGENTS
        from mani_skill.envs.scene import ManiSkillScene
        from mani_skill.envs.utils.system.backend import (
            parse_sim_and_render_backend,
        )

        if robot_name == "franka":
            robot_name = "panda"  # remapping
        if robot_name not in REGISTERED_AGENTS:
            raise KeyError(
                f"Robot `{robot_name}` not registered, choose from {REGISTERED_AGENTS.keys()}."
            )

        ROBOT_CLS = REGISTERED_AGENTS[robot_name].agent_cls
        # backend = parse_sim_and_render_backend("physx_cuda", "gpu")
        backend = parse_sim_and_render_backend(*backend_str)
        scene = ManiSkillScene(
            [self.scene], device=backend_str[0], backend=backend
        )
        robot = ROBOT_CLS(
            scene=scene,
            control_freq=control_freq,
            control_mode=control_mode,
            initial_pose=pose,
        )
        # Set the robot's initial joint positions in radians (joint0-joint6 plus 2 fingers).
        qpos = np.array(
            [
                0.0,
                np.pi / 8,
                0,
                -np.pi * 3 / 8,
                0,
                np.pi * 3 / 4,
                np.pi / 4,
                0.04,
                0.04,
            ]
        )
        qpos = (
            np.random.normal(
                0, robot_init_qpos_noise, (len(scene.sub_scenes), len(qpos))
            )
            + qpos
        )
        qpos[:, -2:] = 0.04
        robot.reset(qpos)
        robot.init_qpos = robot.robot.qpos
        robot.controller.controllers["gripper"].reset()

        return robot
def load_actor_from_urdf(
|
||||
self,
|
||||
file_path: str,
|
||||
pose: sapien.Pose,
|
||||
use_static: bool = False,
|
||||
update_mass: bool = False,
|
||||
) -> sapien.pysapien.Entity:
|
||||
"""Load an actor from a URDF file into the scene.
|
||||
|
||||
Args:
|
||||
file_path (str): Path to the URDF file.
|
||||
pose (sapien.Pose): Pose of the actor.
|
||||
use_static (bool): Whether to create a static actor.
|
||||
update_mass (bool): Whether to update the actor's mass from the URDF.
|
||||
|
||||
Returns:
|
||||
sapien.pysapien.Entity: The loaded actor.
|
||||
"""
|
||||
tree = ET.parse(file_path)
|
||||
root = tree.getroot()
|
||||
node_name = root.get("name")
|
||||
file_dir = os.path.dirname(file_path)
|
||||
visual_file = root.find('.//visual/geometry/mesh').get("filename")
|
||||
collision_file = root.find('.//collision/geometry/mesh').get(
|
||||
"filename"
|
||||
)
|
||||
visual_file = os.path.join(file_dir, visual_file)
|
||||
collision_file = os.path.join(file_dir, collision_file)
|
||||
static_fric = root.find('.//collision/gazebo/mu1').text
|
||||
dynamic_fric = root.find('.//collision/gazebo/mu2').text
|
||||
|
||||
material = self.scene.create_physical_material(
|
||||
static_friction=np.clip(float(static_fric), 0.1, 0.7),
|
||||
dynamic_friction=np.clip(float(dynamic_fric), 0.1, 0.6),
|
||||
restitution=0.05,
|
||||
)
|
||||
builder = self.scene.create_actor_builder()
|
||||
body_type = "static" if use_static else "dynamic"
|
||||
builder.set_physx_body_type(body_type)
|
||||
builder.add_multiple_convex_collisions_from_file(
|
||||
collision_file, material=material
|
||||
)
|
||||
builder.add_visual_from_file(visual_file)
|
||||
actor = builder.build(name=node_name)
|
||||
actor.set_name(node_name)
|
||||
actor.set_pose(pose)
|
||||
|
||||
if update_mass and hasattr(actor.components[1], "mass"):
|
||||
node_mass = float(root.find('.//inertial/mass').get("value"))
|
||||
actor.components[1].set_mass(node_mass)
|
||||
|
||||
return actor
|
||||
|
||||
def create_camera(
|
||||
self,
|
||||
cam_name: str,
|
||||
@ -344,257 +484,78 @@ class SapienSceneManager:
|
||||
|
||||
return self.cameras
|
||||
|
||||
def render_images(
|
||||
self,
|
||||
camera: sapien.render.RenderCameraComponent,
|
||||
render_keys: list[
|
||||
Literal[
|
||||
"Color",
|
||||
"Segmentation",
|
||||
"Normal",
|
||||
"Mask",
|
||||
"Depth",
|
||||
"Foreground",
|
||||
]
|
||||
] = None,
|
||||
) -> dict[str, Image.Image]:
|
||||
"""Render images from a given camera.
|
||||
|
||||
Args:
|
||||
camera (sapien.render.RenderCameraComponent): The camera to render from.
|
||||
render_keys (List[str]): Types of images to render (e.g., Color, Segmentation).
|
||||
|
||||
Returns:
|
||||
Dict[str, Image.Image]: Dictionary of rendered images.
|
||||
"""
|
||||
if render_keys is None:
|
||||
render_keys = [
|
||||
"Color",
|
||||
"Segmentation",
|
||||
"Normal",
|
||||
"Mask",
|
||||
"Depth",
|
||||
"Foreground",
|
||||
]
|
||||
|
||||
results: dict[str, Image.Image] = {}
|
||||
if "Color" in render_keys:
|
||||
color = camera.get_picture("Color")
|
||||
color_rgb = (np.clip(color[..., :3], 0, 1) * 255).astype(np.uint8)
|
||||
results["Color"] = Image.fromarray(color_rgb)
|
||||
|
||||
if "Mask" in render_keys:
|
||||
alpha = (np.clip(color[..., 3], 0, 1) * 255).astype(np.uint8)
|
||||
results["Mask"] = Image.fromarray(alpha)
|
||||
|
||||
if "Segmentation" in render_keys:
|
||||
seg_labels = camera.get_picture("Segmentation")
|
||||
label0 = seg_labels[..., 0].astype(np.uint8)
|
||||
seg_color = self.COLOR_PALETTE[label0]
|
||||
results["Segmentation"] = Image.fromarray(seg_color)
|
||||
|
||||
if "Foreground" in render_keys:
|
||||
seg_labels = camera.get_picture("Segmentation")
|
||||
label0 = seg_labels[..., 0]
|
||||
mask = np.where((label0 > 1), 255, 0).astype(np.uint8)
|
||||
color = camera.get_picture("Color")
|
||||
color_rgb = (np.clip(color[..., :3], 0, 1) * 255).astype(np.uint8)
|
||||
foreground = np.concatenate([color_rgb, mask[..., None]], axis=-1)
|
||||
results["Foreground"] = Image.fromarray(foreground)
|
||||
|
||||
if "Normal" in render_keys:
|
||||
normal = camera.get_picture("Normal")[..., :3]
|
||||
normal_img = (((normal + 1) / 2) * 255).astype(np.uint8)
|
||||
results["Normal"] = Image.fromarray(normal_img)
|
||||
|
||||
if "Depth" in render_keys:
|
||||
position_map = camera.get_picture("Position")
|
||||
depth = -position_map[..., 2]
|
||||
alpha = torch.tensor(color[..., 3], dtype=torch.float32)
|
||||
norm_depth = DiffrastRender.normalize_map_by_mask(
|
||||
torch.tensor(depth), alpha
|
||||
)
|
||||
depth_img = (norm_depth * 255).to(torch.uint8).numpy()
|
||||
results["Depth"] = Image.fromarray(depth_img)
|
||||
|
||||
return results
|
||||
|
||||
    def load_assets_from_layout_file(
        self,
        layout: LayoutInfo,
        z_offset: float = 0.0,
        init_quat: list[float] = [0, 0, 0, 1],
        robot_name: str | None = None,
        control_freq: int = 40,
        robot_init_qpos_noise: float = 0,
    ) -> None:
        """Load assets from `EmbodiedGen` layout-gen output and create actors in the scene.

        Args:
            layout (LayoutInfo): The layout information data.
            z_offset (float): Offset applied to the Z-coordinate of non-context objects.
            init_quat (list[float]): Initial quaternion (x, y, z, w) for orientation adjustment.
            robot_name (str | None): Name of the robot to load; if None, no robot is loaded.
            control_freq (int): Control frequency for the robot, default is 40.
            robot_init_qpos_noise (float): Noise added to the robot's initial joint positions, default is 0.
        """
        for node in layout.assets:
            file_dir = layout.assets[node]
            file_name = f"{node.replace(' ', '_')}.urdf"
            urdf_file = os.path.join(file_dir, file_name)

            if layout.objs_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
                continue

            position = layout.position[node].copy()
            if layout.objs_mapping[node] != Scene3DItemEnum.CONTEXT.value:
                position[2] += z_offset

            use_static = (
                layout.relation.get(Scene3DItemEnum.CONTEXT.value, None)
                == node
            )

            # Combine the initial quaternion with the object's quaternion.
            x, y, z, qx, qy, qz, qw = position
            qx, qy, qz, qw = quaternion_multiply([qx, qy, qz, qw], init_quat)
            actor = self.load_actor_from_urdf(
                urdf_file,
                sapien.Pose(p=[x, y, z], q=[qw, qx, qy, qz]),
                use_static=use_static,
                update_mass=False,
            )
            self.actors[node] = actor

        if robot_name is not None:
            robot_node = layout.relation[Scene3DItemEnum.ROBOT.value]
            x, y, z, qx, qy, qz, qw = layout.position[robot_node]
            delta_z = 0.002  # Small offset to avoid collision.
            pose = sapien.Pose([x, y, z + delta_z], [qw, qx, qy, qz])
            self.robot = self.load_exists_robot(
                robot_name, pose, control_freq, robot_init_qpos_noise
            )

        return

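`load_assets_from_layout_file` composes each asset's layout quaternion with `init_quat` in (x, y, z, w) order, then reorders to SAPIEN's (w, x, y, z) convention when constructing the `Pose`. A small sketch of that composition — the Hamilton-product helper below is a stand-in for the project's `quaternion_multiply`, whose implementation is not shown in this diff:

```python
# Stand-in for the project's quaternion_multiply: Hamilton product of two
# quaternions, both given in (x, y, z, w) order.
def quaternion_multiply(q1: list, q2: list) -> list:
    x1, y1, z1, w1 = q1
    x2, y2, z2, w2 = q2
    return [
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
    ]

# An identity init_quat leaves the layout orientation unchanged.
qx, qy, qz, qw = quaternion_multiply([0, 0, 0, 1], [0, 0, 0, 1])
# sapien.Pose expects (w, x, y, z) order, so reorder before constructing it.
pose_q = [qw, qx, qy, qz]
```

The reorder step is why the code above unpacks `qx, qy, qz, qw` and then passes `q=[qw, qx, qy, qz]` to `sapien.Pose`.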
class FrankaPandaGrasper(object):
    """Grasp planner for the Franka Panda robot arm.

    Args:
        agent: See `mani_skill.agents.robots.panda.panda`.
        control_freq: Control frequency of the robot.
        joint_vel_limits: Velocity limit for all joints (default: 2.0).
        joint_acc_limits: Acceleration limit for all joints (default: 1.0).
        finger_length: Length of the robot's fingers (default: 0.025).
    """

    def __init__(
        self,
        agent: BaseAgent,
        control_freq: float,
        joint_vel_limits: float = 2.0,
        joint_acc_limits: float = 1.0,
        finger_length: float = 0.025,
    ) -> None:
        self.agent = agent
        self.robot = agent.robot
        self.control_freq = control_freq
        self.control_timestep = 1 / control_freq
        self.joint_vel_limits = joint_vel_limits
        self.joint_acc_limits = joint_acc_limits
        self.finger_length = finger_length
        self.planners = self._setup_planner()

    def _setup_planner(self) -> list[mplib.Planner]:
        planners = []
        for pose in self.robot.pose:
            link_names = [link.get_name() for link in self.robot.get_links()]
            joint_names = [
                joint.get_name() for joint in self.robot.get_active_joints()
            ]
            planner = mplib.Planner(
                urdf=self.agent.urdf_path,
                srdf=self.agent.urdf_path.replace(".urdf", ".srdf"),
                user_link_names=link_names,
                user_joint_names=joint_names,
                move_group="panda_hand_tcp",
                joint_vel_limits=np.ones(7) * self.joint_vel_limits,
                joint_acc_limits=np.ones(7) * self.joint_acc_limits,
            )
            planner.set_base_pose(pose.raw_pose[0].tolist())
            planners.append(planner)

        return planners

    def control_gripper(
        self,
        gripper_state: Literal[-1, 1],
        n_step: int = 10,
        env_idx: int = 0,
    ) -> np.ndarray:
        qpos = self.robot.get_qpos()[env_idx, :-2].cpu().numpy()
        actions = []
        for _ in range(n_step):
            action = np.hstack([qpos, gripper_state])[None, ...]
            actions.append(action)

        return np.concatenate(actions, axis=0)

    def move_to_pose(
        self,
        pose: sapien.Pose,
        gripper_state: Literal[-1, 1],
        use_point_cloud: bool = False,
        n_max_step: int = 100,
        action_key: str = "position",
        env_idx: int = 0,
    ) -> np.ndarray:
        result = self.planners[env_idx].plan_qpos_to_pose(
            np.concatenate([pose.p, pose.q]),
            self.robot.get_qpos().cpu().numpy()[0],
            time_step=self.control_timestep,
            use_point_cloud=use_point_cloud,
        )

        # Fall back to screw motion planning if standard planning fails.
        if result["status"] != "Success":
            result = self.planners[env_idx].plan_screw(
                np.concatenate([pose.p, pose.q]),
                self.robot.get_qpos().cpu().numpy()[0],
                time_step=self.control_timestep,
@@ -602,44 +563,34 @@ class FrankaPandaGrasper(object):
            )

        if result["status"] != "Success":
            return

        sample_ratio = (len(result[action_key]) // n_max_step) + 1
        result[action_key] = result[action_key][::sample_ratio]

        n_step = len(result[action_key])
        actions = []
        for i in range(n_step):
            qpos = result[action_key][i]
            action = np.hstack([qpos, gripper_state])[None, ...]
            actions.append(action)

        return np.concatenate(actions, axis=0)

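`move_to_pose` caps the planned trajectory at roughly `n_max_step` waypoints with a stride-based subsample; the `// n_max_step + 1` arithmetic guarantees the stride is at least 1, so short plans pass through unchanged. The logic can be sketched in isolation:

```python
# Stride-based trajectory downsampling, as used in move_to_pose:
# stride = floor(len / n_max_step) + 1, so the stride is always >= 1
# and the result never greatly exceeds n_max_step waypoints.
def downsample(waypoints: list, n_max_step: int = 100) -> list:
    sample_ratio = (len(waypoints) // n_max_step) + 1
    return waypoints[::sample_ratio]

short = downsample(list(range(50)))   # stride 1: unchanged
long = downsample(list(range(250)))   # stride 3: 84 waypoints remain
```

Subsampling the plan rather than truncating it preserves the full motion at a coarser temporal resolution.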
    def compute_grasp_action(
        self,
        actor: sapien.pysapien.Entity,
        reach_target_only: bool = True,
        offset: tuple[float, float, float] = (0, 0, -0.05),
        env_idx: int = 0,
    ) -> np.ndarray:
        physx_rigid = actor.components[1]
        mesh = get_component_mesh(physx_rigid, to_world_frame=True)
        obb = mesh.bounding_box_oriented
        approaching = np.array([0, 0, -1])
        tcp_pose = self.agent.tcp.pose[env_idx]
        target_closing = (
            tcp_pose.to_transformation_matrix()[0, :3, 1].cpu().numpy()
        )
        grasp_info = compute_grasp_info_by_obb(
            obb,
@@ -647,60 +598,46 @@ class FrankaPandaGrasper(object):
            target_closing=target_closing,
            depth=self.finger_length,
        )
        closing, center = grasp_info["closing"], grasp_info["center"]
        raw_tcp_pose = tcp_pose.sp
        grasp_pose = self.agent.build_grasp_pose(approaching, closing, center)
        reach_pose = grasp_pose * sapien.Pose(p=offset)
        grasp_pose = grasp_pose * sapien.Pose(p=[0, 0, 0.01])
        actions = []
        reach_actions = self.move_to_pose(
            reach_pose,
            gripper_state=1,
            env_idx=env_idx,
        )
        actions.append(reach_actions)

        if reach_actions is None:
            logger.warning(
                f"Failed to reach the grasp pose for node `{actor.name}`, skipping grasping."
            )
            return None

        if not reach_target_only:
            grasp_actions = self.move_to_pose(
                grasp_pose,
                gripper_state=1,
                env_idx=env_idx,
            )
            actions.append(grasp_actions)
            close_actions = self.control_gripper(
                gripper_state=-1,
                env_idx=env_idx,
            )
            actions.append(close_actions)
            back_actions = self.move_to_pose(
                raw_tcp_pose,
                gripper_state=-1,
                env_idx=env_idx,
            )
            actions.append(back_actions)

        self.agent.reset(self.agent.init_qpos)

        return np.concatenate(actions, axis=0)

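The action arrays returned by `move_to_pose`, `control_gripper`, and `compute_grasp_action` share one layout, which is why the stages concatenate cleanly along axis 0: each row holds the arm joint targets plus one gripper command (+1 open, -1 close). A small sketch of that convention, assuming the 7-DoF Panda arm:

```python
import numpy as np

# Each control step is a row: 7 joint positions + 1 gripper command.
qpos = np.zeros(7)
open_step = np.hstack([qpos, 1])[None, ...]    # shape (1, 8), gripper open
close_step = np.hstack([qpos, -1])[None, ...]  # shape (1, 8), gripper close

# Stages (reach, grasp, close, retreat) stack along the step axis.
trajectory = np.concatenate([open_step, close_step], axis=0)  # shape (2, 8)
```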
@@ -1 +1 @@
VERSION = "v0.1.4"

@@ -24,6 +24,7 @@ from xml.dom.minidom import parseString

import numpy as np
import trimesh
from embodied_gen.data.convex_decomposer import decompose_convex_mesh
from embodied_gen.utils.gpt_clients import GPT_CLIENT, GPTclient
from embodied_gen.utils.process_media import render_asset3d
from embodied_gen.utils.tags import VERSION

@@ -84,6 +85,7 @@ class URDFGenerator(object):
        attrs_name: list[str] = None,
        render_dir: str = "urdf_renders",
        render_view_num: int = 4,
        decompose_convex: bool = False,
    ) -> None:
        if mesh_file_list is None:
            mesh_file_list = []

@@ -132,8 +134,9 @@ class URDFGenerator(object):
        Estimate the vertical projection of their real length based on its pose.
        For example:
        - A pen standing upright in the first image (aligned with the image's vertical axis),
        full body visible in the first image: → vertical height ≈ 0.14-0.20 m
        - A pen lying flat in the first image, or with either the tip or the tail facing the image
        (showing thickness or as a circle) → vertical height ≈ 0.018-0.025 m
        - A tilted pen in the first image (e.g., ~45° angle): vertical height ≈ 0.07-0.12 m
        - Use the rest of the views to help determine the object's 3D pose and orientation.
        Assume the object is in real-world scale and estimate the approximate vertical height

@@ -156,6 +159,7 @@ class URDFGenerator(object):
            "gs_model",
        ]
        self.attrs_name = attrs_name
        self.decompose_convex = decompose_convex

    def parse_response(self, response: str) -> dict[str, any]:
        lines = response.split("\n")

@@ -258,9 +262,24 @@ class URDFGenerator(object):
        # Update collision geometry
        collision = link.find("collision/geometry/mesh")
        if collision is not None:
            collision_mesh = os.path.join(self.output_mesh_dir, obj_name)
            if self.decompose_convex:
                try:
                    d_params = dict(
                        threshold=0.05, max_convex_hull=64, verbose=False
                    )
                    filename = f"{os.path.splitext(obj_name)[0]}_collision.ply"
                    output_path = os.path.join(mesh_folder, filename)
                    decompose_convex_mesh(
                        mesh_output_path, output_path, **d_params
                    )
                    collision_mesh = f"{self.output_mesh_dir}/{filename}"
                except Exception as e:
                    logger.warning(
                        f"Convex decomposition failed for {output_path}, {e}. "
                        "Using the original mesh for collision computation."
                    )
            collision.set("filename", collision_mesh)
            collision.set("scale", "1.0 1.0 1.0")

        # Update friction coefficients

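The decomposition branch above writes the convex parts to a sibling file whose name is derived from the visual mesh: the stem plus a `_collision.ply` suffix. The naming convention in isolation:

```python
import os

# Derive the collision-mesh filename from the visual mesh name, mirroring
# the URDFGenerator logic above: "pen.obj" -> "pen_collision.ply".
obj_name = "pen.obj"
filename = f"{os.path.splitext(obj_name)[0]}_collision.ply"
```

Keeping the decomposed mesh next to the visual mesh lets the URDF reference both with the same relative directory.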
@@ -9,10 +9,10 @@ PIP_INSTALL_PACKAGES=(
    "xformers==0.0.27.post2 --index-url https://download.pytorch.org/whl/cu118"
    "flash-attn==2.7.0.post2 --no-build-isolation"
    "-r requirements.txt --use-deprecated=legacy-resolver"
    "utils3d@git+https://github.com/EasternJournalist/utils3d.git@9a4eb15"
    "clip@git+https://github.com/openai/CLIP.git"
    "segment-anything@git+https://github.com/facebookresearch/segment-anything.git@dca509f"
    "nvdiffrast@git+https://github.com/NVlabs/nvdiffrast.git@729261d"
    "kolors@git+https://github.com/HochCC/Kolors.git"
    "kaolin@git+https://github.com/NVIDIAGameWorks/kaolin.git@v0.16.0"
    "git+https://github.com/nerfstudio-project/gsplat.git@v1.5.3"

@@ -7,7 +7,7 @@ packages = ["embodied_gen"]

[project]
name = "embodied_gen"
version = "v0.1.4"
readme = "README.md"
license = "Apache-2.0"
license-files = ["LICENSE", "NOTICE"]

@@ -38,5 +38,6 @@ tyro
pyquaternion
shapely
sapien==3.0.0b1
coacd
mani_skill==3.0.0b21
typing_extensions==4.14.1

@@ -39,6 +39,7 @@ def test_urdf_convertor():
        "outputs/test_urdf/notebook/result/mesh/notebook.obj",
        "outputs/test_urdf/marker/result/mesh/marker.obj",
        "outputs/test_urdf/pen2/result/mesh/pen.obj",
        "outputs/test_urdf/pen3/result/mesh/pen.obj",
    ]
    for idx, mesh_path in enumerate(mesh_paths):
        filename = mesh_path.split("/")[-1].split(".")[0]
@@ -49,3 +50,19 @@ def test_urdf_convertor():
            # min_height=1.0,
            # max_height=1.2,
        )


def test_decompose_convex_mesh():
    urdf_gen = URDFGenerator(GPT_CLIENT, decompose_convex=True)
    mesh_paths = [
        "outputs/test_urdf/sample_0/mesh/pen.obj",
        "outputs/test_urdf/sample_1/mesh/notepad.obj",
        "outputs/test_urdf/sample_2/mesh/plate.obj",
    ]
    for idx, mesh_path in enumerate(mesh_paths):
        filename = mesh_path.split("/")[-1].split(".")[0]
        urdf_path = urdf_gen(
            mesh_path=mesh_path,
            output_root=f"outputs/test_urdf3/sample_{idx}",
            category=filename,
        )

tests/test_unit/test_mesh_process.py (new file, 44 lines)
@@ -0,0 +1,44 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.


import logging
import tempfile
from time import time

import pytest
from embodied_gen.data.convex_decomposer import decompose_convex_mesh

logger = logging.getLogger(__name__)


@pytest.mark.parametrize(
    "input_mesh_path, max_convex_hull",
    [
        ("apps/assets/example_texture/meshes/robot_text.obj", 8),
        # ("apps/assets/example_texture/meshes/robot_text.obj", 32),
        # ("apps/assets/example_texture/meshes/robot_text.obj", 64),
        # ("apps/assets/example_texture/meshes/robot_text.obj", 128),
    ],
)
def test_decompose_convex_mesh(input_mesh_path, max_convex_hull):
    d_params = dict(
        threshold=0.05, max_convex_hull=max_convex_hull, verbose=False
    )
    with tempfile.NamedTemporaryFile(suffix=".ply", delete=True) as tmp_file:
        start_time = time()
        decompose_convex_mesh(input_mesh_path, tmp_file.name, **d_params)
        logger.info(f"Finished in {time() - start_time:.2f}s")