feat(pipe): Add layout generation by text desc pipe and release v0.1.3. (#33)

* feat(pipe): Add layout generation pipe.

* tag v0.1.3

* chore: update assets

* update

---------

Co-authored-by: xinjie.wang <xinjie.wang@gpu-4090-dev015.hogpu.cc>
Xinjie 2025-08-19 15:40:33 +08:00 committed by GitHub
parent e05095f51f
commit e06900fe57
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
28 changed files with 2369 additions and 64 deletions


@@ -4,8 +4,18 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0).
## [0.1.3] - 2025-08-19
### Feature
- Release `layout-cli` 🏞️ Layout (Interactive 3D Worlds) Generation: generates an interactive 3D scene from a task description.
## [0.1.2] - 2025-07-31
### Feature
- Release `scene3d-cli` 🌍 3D Scene Generation: generates a color mesh and 3DGS scene from a text description.
- Release Dockerfile and Docker image.
## [0.1.1] - 2025-07-11
### Feature
- Added intelligent quality checkers and auto-retry pipeline for `image-to-3d` and `text-to-3d`.
- Added unit tests for quality checkers.


@@ -30,7 +30,7 @@
 ```sh
 git clone https://github.com/HorizonRobotics/EmbodiedGen.git
 cd EmbodiedGen
-git checkout v0.1.2
+git checkout v0.1.3
 git submodule update --init --recursive --progress
 conda create -n embodiedgen python=3.10.13 -y # recommended to use a new env.
 conda activate embodiedgen
@@ -194,8 +194,6 @@ CUDA_VISIBLE_DEVICES=0 scene3d-cli \
 ### 💬 Generate Layout from task description
-🚧 *Coming Soon*
<table>
<tr>
<td><img src="apps/assets/layout1.gif" alt="layout1" width="360"/></td>
@@ -207,6 +205,47 @@ CUDA_VISIBLE_DEVICES=0 scene3d-cli \
</tr>
</table>
The text-to-image model is based on SD3.5 Medium; usage requires agreement to the [model license](https://huggingface.co/stabilityai/stable-diffusion-3.5-medium). All models are auto-downloaded on the first run.
You can generate any desired room as background using `scene3d-cli`. As each scene takes approximately 30 minutes to generate, we recommend pre-generating them for efficiency and adding them to `outputs/bg_scenes/scene_list.txt`.
We provide some sample background assets created with `scene3d-cli`. Download them (~4 GB) using `hf download xinjjj/scene3d-bg --repo-type dataset --local-dir outputs`.
Generating one interactive 3D scene from a task description with `layout-cli` takes approximately 30 minutes.
```sh
layout-cli --task_descs "Place the pen in the mug on the desk" "Put the fruit on the table on the plate" \
--bg_list "outputs/bg_scenes/scene_list.txt" --output_root "outputs/layouts_gen" --insert_robot
```
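The `scene_list.txt` registry passed via `--bg_list` can be prepared ahead of time. A minimal sketch (the one-entry-per-line format and the `my_kitchen_scene` name are assumptions; check `load_scene_dict` for the exact format expected):

```shell
# Create the background-scene registry referenced by --bg_list.
mkdir -p outputs/bg_scenes

# Sample scenes (~4 GB) can be fetched with:
#   hf download xinjjj/scene3d-bg --repo-type dataset --local-dir outputs

# Register a pre-generated scene folder (hypothetical name) in the list.
echo "my_kitchen_scene" >> outputs/bg_scenes/scene_list.txt
cat outputs/bg_scenes/scene_list.txt
```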
<table>
<tr>
<td><img src="apps/assets/Iscene_demo1.gif" alt="Iscene_demo1" width="300"/></td>
<td><img src="apps/assets/Iscene_demo2.gif" alt="Iscene_demo2" width="450"/></td>
</tr>
</table>
Run multiple tasks defined in `task_list.txt` in the background.
Remove `--insert_robot` if you do not want the robot pose to be considered in layout generation.
```sh
CUDA_VISIBLE_DEVICES=0 nohup layout-cli \
--task_descs "apps/assets/example_layout/task_list.txt" \
--bg_list "outputs/bg_scenes/scene_list.txt" \
--output_root "outputs/layouts_gens" --insert_robot > layouts_gens.log &
```
Using `compose_layout.py`, you can recompose the layout of the generated interactive 3D scenes. (Support for texture editing and augmentation will be added later.)
```sh
python embodied_gen/scripts/compose_layout.py \
--layout_path "outputs/layouts_gens/task_0000/layout.json" \
--output_dir "outputs/layouts_gens/task_0000/recompose" --insert_robot
```
We provide `sim-cli`, which allows users to easily load generated layouts into an interactive 3D simulation using the SAPIEN engine (support for more simulators will be added in future updates).
```sh
sim-cli --layout_path "outputs/layouts_gens/task_0000/recompose/layout.json" \
--output_dir "outputs/layouts_gens/task_0000/recompose/sapien_render" --robot_name "franka"
```
### 🖼️ Real-to-Sim Digital Twin

Binary file added (size: 471 KiB)

Binary file added (size: 875 KiB)


@@ -0,0 +1,18 @@
Place the pen in the mug on the desk
Get the pen from the table and put it inside the cup
Put the fruit on the table on the plate
Put the apples on the table on the plate
Put the spoon on the plate in the table
Put the green lid on the desk
Put the pen on the tray
Move the plate forward
Move the lamp to the left
Pick up the cup and put on the middle of the table
Close the laptop
Close the red jar with the lid on the table
Pick the cloth and shove it in the box
Pick up the marker from the table and put it in the bowl
Pick up the charger and move it slightly to the left
Move the jar to the left side of the desk
Pick the rubik's cube on the top of the desk
Move the mug to the right


@@ -189,7 +189,7 @@ os.makedirs(TMP_DIR, exist_ok=True)
 lighting_css = """
 <style>
 #lighter_mesh canvas {
-filter: brightness(1.8) !important;
+filter: brightness(1.9) !important;
 }
 </style>
 """


@@ -51,17 +51,15 @@ class RenderResult:
 def __post_init__(self):
 if isinstance(self.rgb, torch.Tensor):
-rgb = self.rgb.detach().cpu().numpy()
-rgb = (rgb * 255).astype(np.uint8)
-self.rgb = cv2.cvtColor(rgb, cv2.COLOR_BGR2RGB)
+rgb = (self.rgb * 255).to(torch.uint8)
+self.rgb = rgb.cpu().numpy()[..., ::-1]
 if isinstance(self.depth, torch.Tensor):
-self.depth = self.depth.detach().cpu().numpy()
+self.depth = self.depth.cpu().numpy()
 if isinstance(self.opacity, torch.Tensor):
-opacity = self.opacity.detach().cpu().numpy()
-opacity = (opacity * 255).astype(np.uint8)
-self.opacity = cv2.cvtColor(opacity, cv2.COLOR_GRAY2RGB)
+opacity = (self.opacity * 255).to(torch.uint8)
+self.opacity = opacity.cpu().numpy()
 mask = np.where(self.opacity > self.mask_threshold, 255, 0)
-self.mask = mask[..., 0:1].astype(np.uint8)
+self.mask = mask.astype(np.uint8)
 self.rgba = np.concatenate([self.rgb, self.mask], axis=-1)
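The hunk above replaces `cv2.cvtColor` with a `[..., ::-1]` slice, relying on the fact that reversing the channel axis is equivalent to a BGR↔RGB swap for 3-channel images; a quick standalone check:

```python
import numpy as np

# A 1x2 "image" whose pixels have distinct B, G, R values.
bgr = np.array([[[10, 20, 30], [40, 50, 60]]], dtype=np.uint8)

# Reversing the last (channel) axis swaps BGR -> RGB, matching
# cv2.cvtColor(..., cv2.COLOR_BGR2RGB) for 3-channel inputs.
rgb = bgr[..., ::-1]
print(rgb[0, 0].tolist())  # [30, 20, 10]
```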


@@ -0,0 +1,509 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
import argparse
import json
import logging
import os
import re
import json_repair
from embodied_gen.utils.enum import (
LayoutInfo,
RobotItemEnum,
Scene3DItemEnum,
SpatialRelationEnum,
)
from embodied_gen.utils.gpt_clients import GPT_CLIENT, GPTclient
from embodied_gen.utils.process_media import SceneTreeVisualizer
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
__all__ = [
"LayoutDesigner",
"LAYOUT_DISASSEMBLER",
"LAYOUT_GRAPHER",
"LAYOUT_DESCRIBER",
]
DISTRACTOR_NUM = 3 # Maximum number of distractor objects allowed
LAYOUT_DISASSEMBLE_PROMPT = f"""
You are an intelligent 3D scene planner. Given a natural language
description of a robotic task, output a structured description of
an interactive 3D scene.
The output must include the following fields:
- task: A high-level task type (e.g., "single-arm pick",
"dual-arm grasping", "pick and place", "object sorting").
- {Scene3DItemEnum.ROBOT}: The name or type of robot involved. If not mentioned,
use {RobotItemEnum.FRANKA} as default.
- {Scene3DItemEnum.BACKGROUND}: The room or indoor environment where the task happens
(e.g., Kitchen, Bedroom, Living Room, Workshop, Office).
- {Scene3DItemEnum.CONTEXT}: An indoor object involved in the manipulation
(e.g., Table, Shelf, Desk, Bed, Cabinet).
- {Scene3DItemEnum.MANIPULATED_OBJS}: The main object(s) that the robot directly interacts with.
- {Scene3DItemEnum.DISTRACTOR_OBJS}: Other objects that naturally belong to the scene but are not part of the main task.
Constraints:
- The {Scene3DItemEnum.BACKGROUND} must logically match the described task.
- The {Scene3DItemEnum.CONTEXT} must fit within the {Scene3DItemEnum.BACKGROUND}. (e.g., a bedroom may include a table or bed, but not a workbench.)
- The {Scene3DItemEnum.CONTEXT} must be a concrete indoor object, such as a "table",
"shelf", "desk", or "bed". It must not be an abstract concept (e.g., "area", "space", "zone")
or structural surface (e.g., "floor", "ground"). If the input describes an interaction near
the floor or vague space, you must infer a plausible object like a "table", "cabinet", or "storage box" instead.
- {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS} objects must be plausible,
and semantically compatible with the {Scene3DItemEnum.CONTEXT} and {Scene3DItemEnum.BACKGROUND}.
- {Scene3DItemEnum.DISTRACTOR_OBJS} must not confuse or overlap with the manipulated objects.
- {Scene3DItemEnum.DISTRACTOR_OBJS} number limit: {DISTRACTOR_NUM} distractors maximum.
- All {Scene3DItemEnum.BACKGROUND} are limited to indoor environments.
- {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS} must be rigid bodies and must not include flexible objects.
- {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS} must be common
household or office items or furniture, not abstract concepts, and not too small (e.g., a needle).
- If the input includes a plural or grouped object (e.g., "pens", "bottles", "plates", "fruit"),
you must decompose it into multiple individual instances (e.g., ["pen", "pen"], ["apple", "pear"]).
- Containers that hold objects (e.g., "bowl of apples", "box of tools") must
be separated into individual items (e.g., ["bowl", "apple", "apple"]).
- Do not include transparent objects such as "glass", "plastic", etc.
- The output must be in compact JSON format and use Markdown syntax, just like the output in the example below.
Examples:
Input:
"Pick up the marker from the table and put it in the bowl robot {RobotItemEnum.UR5}."
Output:
```json
{{
"task_desc": "Pick up the marker from the table and put it in the bowl.",
"task": "pick and place",
"{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.UR5}",
"{Scene3DItemEnum.BACKGROUND}": "kitchen",
"{Scene3DItemEnum.CONTEXT}": "table",
"{Scene3DItemEnum.MANIPULATED_OBJS}": ["marker"],
"{Scene3DItemEnum.DISTRACTOR_OBJS}": ["mug", "notebook", "bowl"]
}}
```
Input:
"Put the rubik's cube on the top of the shelf."
Output:
```json
{{
"task_desc": "Put the rubik's cube on the top of the shelf.",
"task": "pick and place",
"{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.FRANKA}",
"{Scene3DItemEnum.BACKGROUND}": "bedroom",
"{Scene3DItemEnum.CONTEXT}": "shelf",
"{Scene3DItemEnum.MANIPULATED_OBJS}": ["rubik's cube"],
"{Scene3DItemEnum.DISTRACTOR_OBJS}": ["pen", "cup", "toy car"]
}}
```
Input:
"Remove all the objects from the white basket and put them on the table."
Output:
```json
{{
"task_desc": "Remove all the objects from the white basket and put them on the table, robot {RobotItemEnum.PIPER}.",
"task": "pick and place",
"{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.PIPER}",
"{Scene3DItemEnum.BACKGROUND}": "office",
"{Scene3DItemEnum.CONTEXT}": "table",
"{Scene3DItemEnum.MANIPULATED_OBJS}": ["banana", "mobile phone"],
"{Scene3DItemEnum.DISTRACTOR_OBJS}": ["plate", "white basket"]
}}
```
Input:
"Pick up the rope on the chair and put it in the box."
Output:
```json
{{
"task_desc": "Pick up the rope on the chair and put it in the box, robot {RobotItemEnum.FRANKA}.",
"task": "pick and place",
"{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.FRANKA}",
"{Scene3DItemEnum.BACKGROUND}": "living room",
"{Scene3DItemEnum.CONTEXT}": "chair",
"{Scene3DItemEnum.MANIPULATED_OBJS}": ["rope", "box"],
"{Scene3DItemEnum.DISTRACTOR_OBJS}": ["magazine"]
}}
```
Input:
"Pick up the seal tape and plastic from the counter and put them in the open drawer and close it."
Output:
```json
{{
"task_desc": "Pick up the seal tape and plastic from the counter and put them in the open drawer and close it.",
"task": "pick and place",
"robot": "franka",
"background": "kitchen",
"context": "counter",
"manipulated_objs": ["seal tape", "plastic", "opened drawer"],
"distractor_objs": ["scissors"]
}}
```
Input:
"Put the pens in the grey bowl."
Output:
```json
{{
"task_desc": "Put the pens in the grey bowl.",
"task": "pick and place",
"robot": "franka",
"background": "office",
"context": "table",
"manipulated_objs": ["pen", "pen", "grey bowl"],
"distractor_objs": ["notepad", "cup"]
}}
```
"""
LAYOUT_HIERARCHY_PROMPT = f"""
You are a 3D scene layout reasoning expert.
Your task is to generate a spatial relationship dictionary in multiway tree
that describes how objects are arranged in a 3D environment
based on a given task description and object list.
Input in JSON format containing the task description, task type,
{Scene3DItemEnum.ROBOT}, {Scene3DItemEnum.BACKGROUND}, {Scene3DItemEnum.CONTEXT},
and a list of objects, including {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS}.
### Supported Spatial Relations:
- "{SpatialRelationEnum.ON}": The child object bottom is directly on top of the parent object top.
- "{SpatialRelationEnum.INSIDE}": The child object is inside the context object.
- "{SpatialRelationEnum.IN}": The {Scene3DItemEnum.ROBOT} in the {Scene3DItemEnum.BACKGROUND}.
- "{SpatialRelationEnum.FLOOR}": The child object bottom is on the floor of the {Scene3DItemEnum.BACKGROUND}.
### Rules:
- The {Scene3DItemEnum.CONTEXT} object must be "{SpatialRelationEnum.FLOOR}" the {Scene3DItemEnum.BACKGROUND}.
- {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS} must be either
"{SpatialRelationEnum.ON}" or "{SpatialRelationEnum.INSIDE}" the {Scene3DItemEnum.CONTEXT}
- Or "{SpatialRelationEnum.FLOOR}" {Scene3DItemEnum.BACKGROUND}.
- Use "{SpatialRelationEnum.INSIDE}" only if the parent is a container-like object (e.g., shelf, rack, cabinet).
- Do not define relationship edges between objects, only for the child and parent nodes.
- {Scene3DItemEnum.ROBOT} must "{SpatialRelationEnum.IN}" the {Scene3DItemEnum.BACKGROUND}.
- Ensure that each object appears only once in the layout tree, and its spatial relationship is defined with only one parent.
- Ensure a valid multiway tree structure with a maximum depth of 2 levels suitable for a 3D scene layout representation.
- Only output the final output in JSON format, using Markdown syntax as in examples.
### Example
Input:
{{
"task_desc": "Pick up the marker from the table and put it in the bowl.",
"task": "pick and place",
"{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.FRANKA}",
"{Scene3DItemEnum.BACKGROUND}": "kitchen",
"{Scene3DItemEnum.CONTEXT}": "table",
"{Scene3DItemEnum.MANIPULATED_OBJS}": ["marker", "bowl"],
"{Scene3DItemEnum.DISTRACTOR_OBJS}": ["mug", "chair"]
}}
Intermediate Think:
table {SpatialRelationEnum.FLOOR} kitchen
chair {SpatialRelationEnum.FLOOR} kitchen
{RobotItemEnum.FRANKA} {SpatialRelationEnum.IN} kitchen
marker {SpatialRelationEnum.ON} table
bowl {SpatialRelationEnum.ON} table
mug {SpatialRelationEnum.ON} table
Final Output:
```json
{{
"kitchen": [
["table", "{SpatialRelationEnum.FLOOR}"],
["chair", "{SpatialRelationEnum.FLOOR}"],
["{RobotItemEnum.FRANKA}", "{SpatialRelationEnum.IN}"]
],
"table": [
["marker", "{SpatialRelationEnum.ON}"],
["bowl", "{SpatialRelationEnum.ON}"],
["mug", "{SpatialRelationEnum.ON}"]
]
}}
```
Input:
{{
"task_desc": "Put the marker on top of the book.",
"task": "pick and place",
"{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.UR5}",
"{Scene3DItemEnum.BACKGROUND}": "office",
"{Scene3DItemEnum.CONTEXT}": "desk",
"{Scene3DItemEnum.MANIPULATED_OBJS}": ["marker", "book"],
"{Scene3DItemEnum.DISTRACTOR_OBJS}": ["pen holder", "notepad"]
}}
Intermediate Think:
desk {SpatialRelationEnum.FLOOR} office
{RobotItemEnum.UR5} {SpatialRelationEnum.IN} office
marker {SpatialRelationEnum.ON} desk
book {SpatialRelationEnum.ON} desk
pen holder {SpatialRelationEnum.ON} desk
notepad {SpatialRelationEnum.ON} desk
Final Output:
```json
{{
"office": [
["desk", "{SpatialRelationEnum.FLOOR}"],
["{RobotItemEnum.UR5}", "{SpatialRelationEnum.IN}"]
],
"desk": [
["marker", "{SpatialRelationEnum.ON}"],
["book", "{SpatialRelationEnum.ON}"],
["pen holder", "{SpatialRelationEnum.ON}"],
["notepad", "{SpatialRelationEnum.ON}"]
]
}}
```
Input:
{{
"task_desc": "Put the rubik's cube on the top of the shelf.",
"task": "pick and place",
"{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.UR5}",
"{Scene3DItemEnum.BACKGROUND}": "bedroom",
"{Scene3DItemEnum.CONTEXT}": "shelf",
"{Scene3DItemEnum.MANIPULATED_OBJS}": ["rubik's cube"],
"{Scene3DItemEnum.DISTRACTOR_OBJS}": ["toy car", "pen"]
}}
Intermediate Think:
shelf {SpatialRelationEnum.FLOOR} bedroom
{RobotItemEnum.UR5} {SpatialRelationEnum.IN} bedroom
rubik's cube {SpatialRelationEnum.INSIDE} shelf
toy car {SpatialRelationEnum.INSIDE} shelf
pen {SpatialRelationEnum.INSIDE} shelf
Final Output:
```json
{{
"bedroom": [
["shelf", "{SpatialRelationEnum.FLOOR}"],
["{RobotItemEnum.UR5}", "{SpatialRelationEnum.IN}"]
],
"shelf": [
["rubik's cube", "{SpatialRelationEnum.INSIDE}"],
["toy car", "{SpatialRelationEnum.INSIDE}"],
["pen", "{SpatialRelationEnum.INSIDE}"]
]
}}
```
Input:
{{
"task_desc": "Put the marker in the cup on the counter.",
"task": "pick and place",
"robot": "franka",
"background": "kitchen",
"context": "counter",
"manipulated_objs": ["marker", "cup"],
"distractor_objs": ["plate", "spoon"]
}}
Intermediate Think:
counter {SpatialRelationEnum.FLOOR} kitchen
{RobotItemEnum.FRANKA} {SpatialRelationEnum.IN} kitchen
marker {SpatialRelationEnum.ON} counter
cup {SpatialRelationEnum.ON} counter
plate {SpatialRelationEnum.ON} counter
spoon {SpatialRelationEnum.ON} counter
Final Output:
```json
{{
"kitchen": [
["counter", "{SpatialRelationEnum.FLOOR}"],
["{RobotItemEnum.FRANKA}", "{SpatialRelationEnum.IN}"]
],
"counter": [
["marker", "{SpatialRelationEnum.ON}"],
["cup", "{SpatialRelationEnum.ON}"],
["plate", "{SpatialRelationEnum.ON}"],
["spoon", "{SpatialRelationEnum.ON}"]
]
}}
```
"""
LAYOUT_DESCRIBER_PROMPT = """
You are a 3D asset style descriptor.
Given a task description and a dictionary where the key is the object content and
the value is the object type, output a JSON dictionary with each object paired
with a concise, styled visual description suitable for 3D asset generation.
Generation Guidelines:
- For each object, brainstorm multiple style candidates before selecting the final
description. Vary phrasing, material, texture, color, and spatial details.
- Each description must be a maximum of 15 words, including color, style, materials.
- Descriptions should be visually grounded, specific, and reflect surface texture and structure.
- For objects marked as "context", explicitly mention the object is standalone, has an empty top.
- Use rich style descriptors: e.g., "scratched brown wooden desk" etc.
- Ensure all object styles align with the task's overall context and environment.
Format your output in JSON like the example below.
Example Input:
"Pick up the rope on the chair and put it in the box. {'living room': 'background', 'chair': 'context',
'rope': 'manipulated_objs', 'box': 'manipulated_objs', 'magazine': 'distractor_objs'}"
Example Output:
```json
{
"living room": "modern cozy living room with soft sunlight and light grey carpet",
"chair": "standalone dark oak chair with no surroundings and clean empty seat",
"rope": "twisted hemp rope with rough fibers and dusty beige texture",
"box": "slightly crumpled cardboard box with open flaps and brown textured surface",
"magazine": "celebrity magazine with glossy red cover and large bold title"
}
```
"""
class LayoutDesigner(object):
def __init__(
self,
gpt_client: GPTclient,
system_prompt: str,
verbose: bool = False,
) -> None:
self.prompt = system_prompt.strip()
self.verbose = verbose
self.gpt_client = gpt_client
def query(self, prompt: str, params: dict = None) -> str:
full_prompt = self.prompt + f"\n\nInput:\n\"{prompt}\""
response = self.gpt_client.query(
text_prompt=full_prompt,
params=params,
)
if self.verbose:
logger.info(f"Response: {response}")
return response
def format_response(self, response: str) -> dict:
cleaned = re.sub(r"^```json\s*|\s*```$", "", response.strip())
try:
output = json.loads(cleaned)
except json.JSONDecodeError as e:
# json.JSONDecodeError requires (msg, doc, pos); re-raise as ValueError.
raise ValueError(
f"Failed to parse JSON response: {response}"
) from e
return output
def format_response_repair(self, response: str) -> dict:
return json_repair.loads(response)
def save_output(self, output: dict, save_path: str) -> None:
os.makedirs(os.path.dirname(save_path), exist_ok=True)
with open(save_path, 'w') as f:
json.dump(output, f, indent=4)
def __call__(
self, prompt: str, save_path: str = None, params: dict = None
) -> dict | str:
response = self.query(prompt, params=params)
output = self.format_response_repair(response)
if save_path:
self.save_output(output, save_path)
return output
LAYOUT_DISASSEMBLER = LayoutDesigner(
gpt_client=GPT_CLIENT, system_prompt=LAYOUT_DISASSEMBLE_PROMPT
)
LAYOUT_GRAPHER = LayoutDesigner(
gpt_client=GPT_CLIENT, system_prompt=LAYOUT_HIERARCHY_PROMPT
)
LAYOUT_DESCRIBER = LayoutDesigner(
gpt_client=GPT_CLIENT, system_prompt=LAYOUT_DESCRIBER_PROMPT
)
def build_scene_layout(
task_desc: str, output_path: str = None, gpt_params: dict = None
) -> LayoutInfo:
layout_relation = LAYOUT_DISASSEMBLER(task_desc, params=gpt_params)
layout_tree = LAYOUT_GRAPHER(layout_relation, params=gpt_params)
object_mapping = Scene3DItemEnum.object_mapping(layout_relation)
obj_prompt = f'{layout_relation["task_desc"]} {object_mapping}'
objs_desc = LAYOUT_DESCRIBER(obj_prompt, params=gpt_params)
layout_info = LayoutInfo(
layout_tree, layout_relation, objs_desc, object_mapping
)
if output_path is not None:
visualizer = SceneTreeVisualizer(layout_info)
visualizer.render(save_path=output_path)
logger.info(f"Scene hierarchy tree saved to {output_path}")
return layout_info
def parse_args():
parser = argparse.ArgumentParser(description="3D Scene Layout Designer")
parser.add_argument(
"--task_desc",
type=str,
default="Put the apples on the table on the plate",
help="Natural language description of the robotic task",
)
parser.add_argument(
"--save_root",
type=str,
default="outputs/layout_tree",
help="Path to save the layout output",
)
return parser.parse_args()
if __name__ == "__main__":
from embodied_gen.utils.enum import LayoutInfo
from embodied_gen.utils.process_media import SceneTreeVisualizer
args = parse_args()
params = {
"temperature": 1.0,
"top_p": 0.95,
"frequency_penalty": 0.3,
"presence_penalty": 0.5,
}
layout_relation = LAYOUT_DISASSEMBLER(args.task_desc, params=params)
layout_tree = LAYOUT_GRAPHER(layout_relation, params=params)
object_mapping = Scene3DItemEnum.object_mapping(layout_relation)
obj_prompt = f'{layout_relation["task_desc"]} {object_mapping}'
objs_desc = LAYOUT_DESCRIBER(obj_prompt, params=params)
layout_info = LayoutInfo(layout_tree, layout_relation, objs_desc)
visualizer = SceneTreeVisualizer(layout_info)
os.makedirs(args.save_root, exist_ok=True)
scene_graph_path = f"{args.save_root}/scene_tree.jpg"
visualizer.render(save_path=scene_graph_path)
with open(f"{args.save_root}/layout.json", "w") as f:
json.dump(layout_info.to_dict(), f, indent=4)
print(f"Scene hierarchy tree saved to {scene_graph_path}")
print(f"Disassembled Layout: {layout_relation}")
print(f"Layout Graph: {layout_tree}")
print(f"Layout Descriptions: {objs_desc}")
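The fence-stripping logic in `format_response` above can be exercised standalone. This sketch parses with `json.loads` only; the module additionally falls back to `json_repair` for malformed replies:

```python
import json
import re

def parse_llm_json(response: str) -> dict:
    # Strip a surrounding Markdown json code fence, then parse the payload.
    cleaned = re.sub(r"^```json\s*|\s*```$", "", response.strip())
    return json.loads(cleaned)

# A typical fenced LLM reply, built with escapes to keep this example compact.
demo = "```json\n{\"task\": \"pick and place\", \"robot\": \"franka\"}\n```"
print(parse_llm_json(demo)["robot"])  # franka
```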


@@ -0,0 +1,73 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
import json
import os
from dataclasses import dataclass
import tyro
from embodied_gen.scripts.simulate_sapien import entrypoint as sim_cli
from embodied_gen.utils.enum import LayoutInfo
from embodied_gen.utils.geometry import bfs_placement, compose_mesh_scene
from embodied_gen.utils.log import logger
@dataclass
class LayoutPlacementConfig:
layout_path: str
output_dir: str | None = None
seed: int | None = None
max_attempts: int = 1000
output_iscene: bool = False
insert_robot: bool = False
def entrypoint(**kwargs):
if kwargs is None or len(kwargs) == 0:
args = tyro.cli(LayoutPlacementConfig)
else:
args = LayoutPlacementConfig(**kwargs)
output_dir = (
args.output_dir
if args.output_dir is not None
else os.path.dirname(args.layout_path)
)
os.makedirs(output_dir, exist_ok=True)
out_scene_path = f"{output_dir}/Iscene.glb"
out_layout_path = f"{output_dir}/layout.json"
with open(args.layout_path, "r") as f:
layout_info = LayoutInfo.from_dict(json.load(f))
layout_info = bfs_placement(layout_info, seed=args.seed)
with open(out_layout_path, "w") as f:
json.dump(layout_info.to_dict(), f, indent=4)
if args.output_iscene:
compose_mesh_scene(layout_info, out_scene_path)
sim_cli(
layout_path=out_layout_path,
output_dir=output_dir,
robot_name="franka" if args.insert_robot else None,
)
logger.info(f"Layout placement completed in {output_dir}")
if __name__ == "__main__":
entrypoint()


@@ -0,0 +1,156 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
import gc
import json
import os
from dataclasses import dataclass, field
from shutil import copytree
from time import time
from typing import Optional
import torch
import tyro
from embodied_gen.models.layout import build_scene_layout
from embodied_gen.scripts.simulate_sapien import entrypoint as sim_cli
from embodied_gen.scripts.textto3d import text_to_3d
from embodied_gen.utils.config import GptParamsConfig
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.geometry import bfs_placement, compose_mesh_scene
from embodied_gen.utils.gpt_clients import GPT_CLIENT
from embodied_gen.utils.log import logger
from embodied_gen.utils.process_media import (
load_scene_dict,
parse_text_prompts,
)
from embodied_gen.validators.quality_checkers import SemanticMatcher
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
@dataclass
class LayoutGenConfig:
task_descs: list[str]
output_root: str
bg_list: str = "outputs/bg_scenes/scene_list.txt"
n_img_sample: int = 3
text_guidance_scale: float = 7.0
img_denoise_step: int = 25
n_image_retry: int = 4
n_asset_retry: int = 3
n_pipe_retry: int = 2
seed_img: Optional[int] = None
seed_3d: Optional[int] = None
seed_layout: Optional[int] = None
keep_intermediate: bool = False
output_iscene: bool = False
insert_robot: bool = False
gpt_params: GptParamsConfig = field(
default_factory=lambda: GptParamsConfig(
temperature=1.0,
top_p=0.95,
frequency_penalty=0.3,
presence_penalty=0.5,
)
)
def entrypoint() -> None:
args = tyro.cli(LayoutGenConfig)
SCENE_MATCHER = SemanticMatcher(GPT_CLIENT)
task_descs = parse_text_prompts(args.task_descs)
scene_dict = load_scene_dict(args.bg_list)
gpt_params = args.gpt_params.to_dict()
for idx, task_desc in enumerate(task_descs):
logger.info(f"Generate Layout and 3D scene for task: {task_desc}")
output_root = f"{args.output_root}/task_{idx:04d}"
scene_graph_path = f"{output_root}/scene_tree.jpg"
start_time = time()
layout_info: LayoutInfo = build_scene_layout(
task_desc, scene_graph_path, gpt_params
)
prompts_mapping = {v: k for k, v in layout_info.objs_desc.items()}
prompts = [
v
for k, v in layout_info.objs_desc.items()
if layout_info.objs_mapping[k] != Scene3DItemEnum.BACKGROUND.value
]
for prompt in prompts:
node = prompts_mapping[prompt]
generation_log = text_to_3d(
prompts=[
prompt,
],
output_root=output_root,
asset_names=[
node,
],
n_img_sample=args.n_img_sample,
text_guidance_scale=args.text_guidance_scale,
img_denoise_step=args.img_denoise_step,
n_image_retry=args.n_image_retry,
n_asset_retry=args.n_asset_retry,
n_pipe_retry=args.n_pipe_retry,
seed_img=args.seed_img,
seed_3d=args.seed_3d,
keep_intermediate=args.keep_intermediate,
)
layout_info.assets.update(generation_log["assets"])
layout_info.quality.update(generation_log["quality"])
# Background generation (for efficiency, temporarily use retrieval instead).
bg_node = layout_info.relation[Scene3DItemEnum.BACKGROUND.value]
text = layout_info.objs_desc[bg_node]
match_key = SCENE_MATCHER.query(text, str(scene_dict))
match_scene_path = f"{os.path.dirname(args.bg_list)}/{match_key}"
bg_save_dir = os.path.join(output_root, "background")
copytree(match_scene_path, bg_save_dir, dirs_exist_ok=True)
layout_info.assets[bg_node] = bg_save_dir
# BFS layout placement.
layout_info = bfs_placement(
layout_info,
limit_reach_range=True if args.insert_robot else False,
seed=args.seed_layout,
)
layout_path = f"{output_root}/layout.json"
with open(layout_path, "w") as f:
json.dump(layout_info.to_dict(), f, indent=4)
if args.output_iscene:
compose_mesh_scene(layout_info, f"{output_root}/Iscene.glb")
sim_cli(
layout_path=layout_path,
output_dir=output_root,
robot_name="franka" if args.insert_robot else None,
)
torch.cuda.empty_cache()
gc.collect()
elapsed_time = (time() - start_time) / 60
logger.info(
f"Layout generation done for {scene_graph_path}, layout result "
f"in {layout_path}, finished in {elapsed_time:.2f} mins."
)
logger.info(f"All tasks completed in {args.output_root}")
if __name__ == "__main__":
entrypoint()


@@ -151,6 +151,9 @@ def entrypoint(**kwargs):
 seg_image.save(seg_path)
 seed = args.seed
+asset_node = "unknown"
+if isinstance(args.asset_type, list) and args.asset_type[idx]:
+asset_node = args.asset_type[idx]
 for try_idx in range(args.n_retry):
 logger.info(
 f"Try: {try_idx + 1}/{args.n_retry}, Seed: {seed}, Prompt: {seg_path}"
@@ -207,7 +210,9 @@ def entrypoint(**kwargs):
 color_path = os.path.join(output_root, "color.png")
 render_gs_api(aligned_gs_path, color_path)
-geo_flag, geo_result = GEO_CHECKER([color_path])
+geo_flag, geo_result = GEO_CHECKER(
+[color_path], text=asset_node
+)
 logger.warning(
 f"{GEO_CHECKER.__class__.__name__}: {geo_result} for {seg_path}"
 )


@@ -0,0 +1,186 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
import json
import os
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Literal
import imageio
import numpy as np
import torch
import tyro
from tqdm import tqdm
from embodied_gen.models.gs_model import GaussianOperator
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.geometry import quaternion_multiply
from embodied_gen.utils.log import logger
from embodied_gen.utils.process_media import alpha_blend_rgba
from embodied_gen.utils.simulation import (
SIM_COORD_ALIGN,
FrankaPandaGrasper,
SapienSceneManager,
)
@dataclass
class SapienSimConfig:
# Simulation settings.
layout_path: str
output_dir: str
sim_freq: int = 200
sim_step: int = 400
z_offset: float = 0.004
init_quat: list[float] = field(
default_factory=lambda: [0.7071, 0, 0, 0.7071]
) # xyzw
device: str = "cuda"
robot_name: str | None = None
control_freq: int = 40
# Camera settings.
render_interval: int = 10
num_cameras: int = 3
camera_radius: float = 0.9
camera_height: float = 1.1
image_hw: tuple[int, int] = (512, 512)
ray_tracing: bool = True
fovy_deg: float = 75.0
camera_target_pt: list[float] = field(
default_factory=lambda: [0.0, 0.0, 0.9]
)
render_keys: list[
Literal[
"Color", "Foreground", "Segmentation", "Normal", "Mask", "Depth"
]
] = field(default_factory=lambda: ["Foreground"])
def entrypoint(**kwargs):
if not kwargs:
cfg = tyro.cli(SapienSimConfig)
else:
cfg = SapienSimConfig(**kwargs)
scene_manager = SapienSceneManager(
cfg.sim_freq, ray_tracing=cfg.ray_tracing
)
_ = scene_manager.initialize_circular_cameras(
num_cameras=cfg.num_cameras,
radius=cfg.camera_radius,
height=cfg.camera_height,
target_pt=cfg.camera_target_pt,
image_hw=cfg.image_hw,
fovy_deg=cfg.fovy_deg,
)
with open(cfg.layout_path, "r") as f:
layout_data = json.load(f)
layout_data: LayoutInfo = LayoutInfo.from_dict(layout_data)
scene_manager.load_assets_from_layout_file(
layout_data,
cfg.z_offset,
cfg.init_quat,
cfg.robot_name,
cfg.control_freq,
)
frames = defaultdict(list)
image_cnt = 0
for step in tqdm(range(cfg.sim_step), desc="Simulation"):
scene_manager.scene.step()
if cfg.robot_name is not None:
scene_manager.robot.reset(scene_manager.robot.init_qpos)
if step % cfg.render_interval != 0:
continue
scene_manager.scene.update_render()
image_cnt += 1
for camera in scene_manager.cameras:
camera.take_picture()
images = scene_manager.render_images(
camera, render_keys=cfg.render_keys
)
frames[camera.name].append(images)
grasp_frames = defaultdict(dict)
if cfg.robot_name is not None:
sim_steps_per_control = cfg.sim_freq // cfg.control_freq
control_timestep = 1.0 / cfg.control_freq
grasper = FrankaPandaGrasper(
scene_manager.robot,
scene_manager,
sim_steps_per_control,
control_timestep,
)
for node in layout_data.relation[
Scene3DItemEnum.MANIPULATED_OBJS.value
]:
grasp_frames[node] = grasper.render_grasp(
scene_manager.actors[node],
scene_manager.cameras,
cfg.render_keys,
)
if "Foreground" not in cfg.render_keys:
return
bg_node = layout_data.relation[Scene3DItemEnum.BACKGROUND.value]
gs_path = f"{layout_data.assets[bg_node]}/gs_model.ply"
gs_model: GaussianOperator = GaussianOperator.load_from_ply(gs_path)
x, y, z, qx, qy, qz, qw = layout_data.position[bg_node]
qx, qy, qz, qw = quaternion_multiply([qx, qy, qz, qw], cfg.init_quat)
init_pose = torch.tensor([x, y, z, qx, qy, qz, qw])
gs_model = gs_model.get_gaussians(instance_pose=init_pose)
bg_images = dict()
for camera in scene_manager.cameras:
Ks = camera.get_intrinsic_matrix()
c2w = camera.get_model_matrix()
c2w = c2w @ SIM_COORD_ALIGN
result = gs_model.render(
torch.tensor(c2w, dtype=torch.float32).to(cfg.device),
torch.tensor(Ks, dtype=torch.float32).to(cfg.device),
image_width=cfg.image_hw[1],
image_height=cfg.image_hw[0],
)
bg_images[camera.name] = result.rgb[..., ::-1]
video_frames = []
for camera in scene_manager.cameras:
# Scene rendering
for step in tqdm(range(image_cnt), desc="Rendering"):
rgba = alpha_blend_rgba(
frames[camera.name][step]["Foreground"],
bg_images[camera.name],
)
video_frames.append(np.array(rgba))
# Grasp rendering
for node in grasp_frames:
for frame in grasp_frames[node][camera.name]:
rgba = alpha_blend_rgba(
frame["Foreground"], bg_images[camera.name]
)
video_frames.append(np.array(rgba))
os.makedirs(cfg.output_dir, exist_ok=True)
video_path = f"{cfg.output_dir}/Iscene.mp4"
imageio.mimsave(video_path, video_frames, fps=30)
logger.info(f"Interactive 3D Scene Visualization saved in {video_path}")
if __name__ == "__main__":
entrypoint()


@@ -81,6 +81,7 @@ done
# Step 1: Text-to-Image
echo ${prompt_args}
eval python3 embodied_gen/scripts/text2image.py \
--prompts ${prompt_args} \
--output_root "${output_root}/images" \


@@ -617,7 +617,7 @@ class Runner:
for rgb, depth in images_cache:
depth_normalized = torch.clip(
(depth - depth_global_min)
/ (depth_global_max - depth_global_min),
/ (depth_global_max - depth_global_min + 1e-8),
0,
1,
)
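The epsilon added in this hunk guards against a degenerate depth range. A standalone check (illustrated with numpy for brevity; the training loop above uses torch):

```python
import numpy as np

# Why the added 1e-8 matters: on a sequence where the global depth range
# collapses (max == min), the original expression divides by zero.
depth = np.full((2, 2), 3.0)
d_min = d_max = 3.0
normalized = np.clip((depth - d_min) / (d_max - d_min + 1e-8), 0, 1)
assert np.isfinite(normalized).all()  # no NaN/inf from the zero range
```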


@@ -30,7 +30,7 @@ from kornia.morphology import dilation
from PIL import Image
from embodied_gen.models.sr_model import ImageRealESRGAN
from embodied_gen.utils.config import Pano2MeshSRConfig
from embodied_gen.utils.gaussian import compute_pinhole_intrinsics
from embodied_gen.utils.geometry import compute_pinhole_intrinsics
from embodied_gen.utils.log import logger
from thirdparty.pano2room.modules.geo_predictors import PanoJointPredictor
from thirdparty.pano2room.modules.geo_predictors.PanoFusionDistancePredictor import (


@@ -17,15 +17,27 @@
from dataclasses import dataclass, field
from typing import List, Optional, Union
from dataclasses_json import DataClassJsonMixin
from gsplat.strategy import DefaultStrategy, MCMCStrategy
from typing_extensions import Literal, assert_never
__all__ = [
"GptParamsConfig",
"Pano2MeshSRConfig",
"GsplatTrainConfig",
]
@dataclass
class GptParamsConfig(DataClassJsonMixin):
temperature: float = 0.1
top_p: float = 0.1
frequency_penalty: float = 0.0
presence_penalty: float = 0.0
stop: str | list[str] | None = None
max_tokens: int = 500
@dataclass
class Pano2MeshSRConfig:
mesh_file: str = "mesh_model.ply"


@@ -102,6 +102,7 @@ class LayoutInfo(DataClassJsonMixin):
tree: dict[str, list]
relation: dict[str, str | list[str]]
objs_desc: dict[str, str] = field(default_factory=dict)
objs_mapping: dict[str, str] = field(default_factory=dict)
assets: dict[str, str] = field(default_factory=dict)
quality: dict[str, str] = field(default_factory=dict)
position: dict[str, list[float]] = field(default_factory=dict)


@@ -35,7 +35,6 @@ __all__ = [
"set_random_seed",
"export_splats",
"create_splats_with_optimizers",
"compute_pinhole_intrinsics",
"resize_pinhole_intrinsics",
"restore_scene_scale_and_position",
]
@@ -265,12 +264,12 @@ def create_splats_with_optimizers(
return splats, optimizers
def compute_pinhole_intrinsics(
image_w: int, image_h: int, fov_deg: float
def compute_intrinsics_from_fovy(
image_w: int, image_h: int, fovy_deg: float
) -> np.ndarray:
fov_rad = np.deg2rad(fov_deg)
fx = image_w / (2 * np.tan(fov_rad / 2))
fy = fx # assuming square pixels
fovy_rad = np.deg2rad(fovy_deg)
fy = image_h / (2 * np.tan(fovy_rad / 2))
fx = fy * (image_w / image_h)
cx = image_w / 2
cy = image_h / 2
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
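The rename in this hunk makes the convention explicit: the vertical FOV fixes fy, and fx follows from the aspect ratio. A standalone check of the updated formula, with illustrative image dimensions:

```python
import numpy as np

# fy from the vertical FOV, fx from the aspect ratio, as in compute_intrinsics_from_fovy.
w, h, fovy_deg = 640, 480, 60.0
fy = h / (2 * np.tan(np.deg2rad(fovy_deg) / 2))
fx = fy * (w / h)
K = np.array([[fx, 0, w / 2], [0, fy, h / 2], [0, 0, 1]])
assert np.isclose(fy, 415.6922, atol=1e-3)  # 480 / (2 * tan(30 deg))
```

Note this differs from the old `compute_pinhole_intrinsics`, which derived fx from a horizontal FOV and assumed square pixels; for non-square aspect ratios the two give different intrinsics.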


@@ -0,0 +1,458 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
import os
import random
from collections import defaultdict, deque
from functools import wraps
from typing import Literal
import numpy as np
import torch
import trimesh
from matplotlib.path import Path
from pyquaternion import Quaternion
from scipy.spatial import ConvexHull
from scipy.spatial.transform import Rotation as R
from shapely.geometry import Polygon
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.log import logger
__all__ = [
"bfs_placement",
"with_seed",
"matrix_to_pose",
"pose_to_matrix",
"quaternion_multiply",
"check_reachable",
"compose_mesh_scene",
"compute_pinhole_intrinsics",
]
def matrix_to_pose(matrix: np.ndarray) -> list[float]:
"""Convert a 4x4 transformation matrix to a pose (x, y, z, qx, qy, qz, qw).
Args:
matrix (np.ndarray): 4x4 transformation matrix.
Returns:
List[float]: Pose as [x, y, z, qx, qy, qz, qw].
"""
x, y, z = matrix[:3, 3]
rot_mat = matrix[:3, :3]
quat = R.from_matrix(rot_mat).as_quat()
qx, qy, qz, qw = quat
return [x, y, z, qx, qy, qz, qw]
def pose_to_matrix(pose: list[float]) -> np.ndarray:
"""Convert pose (x, y, z, qx, qy, qz, qw) to a 4x4 transformation matrix.
Args:
pose (List[float]): Pose as [x, y, z, qx, qy, qz, qw].
Returns:
np.ndarray: 4x4 transformation matrix.
"""
x, y, z, qx, qy, qz, qw = pose
r = R.from_quat([qx, qy, qz, qw])
matrix = np.eye(4)
matrix[:3, :3] = r.as_matrix()
matrix[:3, 3] = [x, y, z]
return matrix
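The two helpers above are inverses of each other; a standalone roundtrip sketch (re-deriving the same scipy logic, not part of this commit):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

pose = [0.1, -0.2, 0.3, 0.0, 0.0, 0.7071068, 0.7071068]  # x, y, z, qx, qy, qz, qw

# pose -> 4x4 matrix, same logic as pose_to_matrix
T = np.eye(4)
T[:3, :3] = R.from_quat(pose[3:]).as_matrix()
T[:3, 3] = pose[:3]

# matrix -> pose, same logic as matrix_to_pose
back = [*T[:3, 3], *R.from_matrix(T[:3, :3]).as_quat()]
assert np.allclose(back, pose, atol=1e-6)
```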
def compute_xy_bbox(
vertices: np.ndarray, col_x: int = 0, col_y: int = 2
) -> list[float]:
x_vals = vertices[:, col_x]
y_vals = vertices[:, col_y]
return x_vals.min(), x_vals.max(), y_vals.min(), y_vals.max()
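The defaults `col_x=0, col_y=2` reflect that the meshes in this pipeline are Y-up, so X/Z span the horizontal plane. A quick check with illustrative vertices:

```python
import numpy as np

# Horizontal extent of a Y-up mesh: take columns 0 (X) and 2 (Z).
verts = np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [-1.0, 0.0, 1.0]])
x_min, x_max = verts[:, 0].min(), verts[:, 0].max()
y_min, y_max = verts[:, 2].min(), verts[:, 2].max()
assert (x_min, x_max, y_min, y_max) == (-1.0, 3.0, 1.0, 5.0)
```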
def has_iou_conflict(
new_box: list[float],
placed_boxes: list[list[float]],
iou_threshold: float = 0.0,
) -> bool:
new_min_x, new_max_x, new_min_y, new_max_y = new_box
for min_x, max_x, min_y, max_y in placed_boxes:
ix1 = max(new_min_x, min_x)
iy1 = max(new_min_y, min_y)
ix2 = min(new_max_x, max_x)
iy2 = min(new_max_y, max_y)
inter_area = max(0, ix2 - ix1) * max(0, iy2 - iy1)
if inter_area > iou_threshold:
return True
return False
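The conflict test above is a plain axis-aligned intersection-area check. A standalone sketch of the same arithmetic:

```python
# Boxes are [min_x, max_x, min_y, max_y], as in has_iou_conflict.
def overlap_area(a: list[float], b: list[float]) -> float:
    ix = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[2], b[2]))
    return ix * iy

placed = [[0.0, 1.0, 0.0, 1.0]]
# A unit box shifted by half its width overlaps a quarter of the original.
assert overlap_area([0.5, 1.5, 0.5, 1.5], placed[0]) == 0.25
# Disjoint boxes intersect with zero area, so no conflict at iou_threshold = 0.
assert overlap_area([2.0, 3.0, 2.0, 3.0], placed[0]) == 0.0
```

Note that despite the parameter name, `iou_threshold` is compared against the raw intersection area, not a normalized IoU ratio.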
def with_seed(seed_attr_name: str = "seed"):
"""A parameterized decorator that temporarily sets the random seed."""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
seed = kwargs.get(seed_attr_name, None)
if seed is not None:
py_state = random.getstate()
np_state = np.random.get_state()
torch_state = torch.get_rng_state()
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
try:
result = func(*args, **kwargs)
finally:
random.setstate(py_state)
np.random.set_state(np_state)
torch.set_rng_state(torch_state)
return result
else:
return func(*args, **kwargs)
return wrapper
return decorator
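A pared-down sketch of the same save/seed/restore pattern, handling only `random` and `numpy` state (the decorator above additionally snapshots torch):

```python
import random
from functools import wraps

import numpy as np

def with_seed_lite(seed_key: str = "seed"):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            seed = kwargs.get(seed_key)
            if seed is None:
                return func(*args, **kwargs)
            py_state, np_state = random.getstate(), np.random.get_state()
            random.seed(seed)
            np.random.seed(seed)
            try:
                return func(*args, **kwargs)
            finally:  # restore outer RNG state even if func raises
                random.setstate(py_state)
                np.random.set_state(np_state)
        return wrapper
    return decorator

@with_seed_lite("seed")
def sample(seed=None):
    return random.random()

assert sample(seed=42) == sample(seed=42)  # deterministic under the same seed
```

The `try/finally` is the important part: the caller's RNG stream continues unaffected after a seeded call.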
def compute_convex_hull_path(
vertices: np.ndarray,
z_threshold: float = 0.05,
interp_per_edge: int = 3,
margin: float = -0.02,
) -> Path:
top_vertices = vertices[
vertices[:, 1] > vertices[:, 1].max() - z_threshold
]
top_xy = top_vertices[:, [0, 2]]
if len(top_xy) < 3:
raise ValueError("Not enough points to form a convex hull")
hull = ConvexHull(top_xy)
hull_points = top_xy[hull.vertices]
polygon = Polygon(hull_points)
polygon = polygon.buffer(margin)
hull_points = np.array(polygon.exterior.coords)
dense_points = []
for i in range(len(hull_points)):
p1 = hull_points[i]
p2 = hull_points[(i + 1) % len(hull_points)]
for t in np.linspace(0, 1, interp_per_edge, endpoint=False):
pt = (1 - t) * p1 + t * p2
dense_points.append(pt)
return Path(np.array(dense_points), closed=True)
def find_parent_node(node: str, tree: dict) -> str | None:
for parent, children in tree.items():
if any(child[0] == node for child in children):
return parent
return None
def all_corners_inside(hull: Path, box: list, threshold: int = 3) -> bool:
x1, x2, y1, y2 = box
corners = [[x1, y1], [x2, y1], [x1, y2], [x2, y2]]
num_inside = sum(hull.contains_point(c) for c in corners)
return num_inside >= threshold
def compute_axis_rotation_quat(
axis: Literal["x", "y", "z"], angle_rad: float
) -> list[float]:
if axis.lower() == 'x':
q = Quaternion(axis=[1, 0, 0], angle=angle_rad)
elif axis.lower() == 'y':
q = Quaternion(axis=[0, 1, 0], angle=angle_rad)
elif axis.lower() == 'z':
q = Quaternion(axis=[0, 0, 1], angle=angle_rad)
else:
raise ValueError(f"Unsupported axis '{axis}', must be one of x, y, z")
return [q.x, q.y, q.z, q.w]
def quaternion_multiply(
init_quat: list[float], rotate_quat: list[float]
) -> list[float]:
qx, qy, qz, qw = init_quat
q1 = Quaternion(w=qw, x=qx, y=qy, z=qz)
qx, qy, qz, qw = rotate_quat
q2 = Quaternion(w=qw, x=qx, y=qy, z=qz)
quat = q2 * q1
return [quat.x, quat.y, quat.z, quat.w]
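An equivalent composition expressed with scipy instead of pyquaternion (a sketch, not part of this commit); all quaternions are xyzw, and `rotate_quat` is applied after `init_quat` (q2 * q1):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

init_quat = [0.0, 0.0, 0.0, 1.0]                                # identity
rotate_quat = [0.0, 0.0, np.sin(np.pi / 4), np.cos(np.pi / 4)]  # 90 deg about z
out = (R.from_quat(rotate_quat) * R.from_quat(init_quat)).as_quat()
assert np.allclose(out, rotate_quat, atol=1e-6)  # rotating identity gives rotate_quat
```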
def check_reachable(
base_xyz: np.ndarray,
reach_xyz: np.ndarray,
min_reach: float = 0.25,
max_reach: float = 0.85,
) -> bool:
"""Check if the target point is within the reachable range."""
distance = np.linalg.norm(reach_xyz - base_xyz)
return min_reach < distance < max_reach
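The reachability test is just a distance band around the robot base; points closer than `min_reach` or farther than `max_reach` are rejected. A standalone sketch with illustrative points:

```python
import numpy as np

base = np.array([0.0, 0.0, 0.0])

def reachable(pt, min_reach=0.25, max_reach=0.85):
    # Same band check as check_reachable above.
    return min_reach < np.linalg.norm(pt - base) < max_reach

assert reachable(np.array([0.5, 0.0, 0.0]))      # inside the band
assert not reachable(np.array([0.1, 0.0, 0.0]))  # too close to the base
assert not reachable(np.array([1.0, 0.0, 0.0]))  # beyond max reach
```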
@with_seed("seed")
def bfs_placement(
layout_info: LayoutInfo,
floor_margin: float = 0,
beside_margin: float = 0.1,
max_attempts: int = 3000,
rotate_objs: bool = True,
rotate_bg: bool = True,
limit_reach_range: bool = True,
robot_dim: float = 0.12,
seed: int | None = None,
) -> LayoutInfo:
object_mapping = layout_info.objs_mapping
position = {} # node: [x, y, z, qx, qy, qz, qw]
parent_bbox_xy = {}
placed_boxes_map = defaultdict(list)
mesh_info = defaultdict(dict)
robot_node = layout_info.relation[Scene3DItemEnum.ROBOT.value]
for node in object_mapping:
if object_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
bg_quat = (
compute_axis_rotation_quat(
axis="y",
angle_rad=np.random.uniform(0, 2 * np.pi),
)
if rotate_bg
else [0, 0, 0, 1]
)
bg_quat = [round(q, 4) for q in bg_quat]
continue
mesh_path = (
f"{layout_info.assets[node]}/mesh/{node.replace(' ', '_')}.obj"
)
mesh_info[node]["path"] = mesh_path
mesh = trimesh.load(mesh_path)
vertices = mesh.vertices
z1 = np.percentile(vertices[:, 1], 1)
z2 = np.percentile(vertices[:, 1], 99)
if object_mapping[node] == Scene3DItemEnum.CONTEXT.value:
object_quat = [0, 0, 0, 1]
mesh_info[node]["surface"] = compute_convex_hull_path(vertices)
# Place the robot at the edge of the CONTEXT surface.
x, y = random.choice(mesh_info[node]["surface"].vertices)
theta = np.arctan2(y, x)
quat_initial = Quaternion(axis=[0, 0, 1], angle=theta)
quat_extra = Quaternion(axis=[0, 0, 1], angle=np.pi)
quat = quat_extra * quat_initial
_pose = [x, y, z2 - z1, quat.x, quat.y, quat.z, quat.w]
position[robot_node] = [round(v, 4) for v in _pose]
node_box = [
x - robot_dim / 2,
x + robot_dim / 2,
y - robot_dim / 2,
y + robot_dim / 2,
]
placed_boxes_map[node].append(node_box)
elif rotate_objs:
# For manipulated and distractor objects, apply random rotation
angle_rad = np.random.uniform(0, 2 * np.pi)
object_quat = compute_axis_rotation_quat(
axis="y", angle_rad=angle_rad
)
object_quat_scipy = np.roll(object_quat, 1) # [w, x, y, z]
rotation = R.from_quat(object_quat_scipy).as_matrix()
vertices = np.dot(mesh.vertices, rotation.T)
z1 = np.percentile(vertices[:, 1], 1)
z2 = np.percentile(vertices[:, 1], 99)
x1, x2, y1, y2 = compute_xy_bbox(vertices)
mesh_info[node]["pose"] = [x1, x2, y1, y2, z1, z2, *object_quat]
mesh_info[node]["area"] = max(1e-5, (x2 - x1) * (y2 - y1))
root = list(layout_info.tree.keys())[0]
queue = deque([((root, None), layout_info.tree.get(root, []))])
while queue:
(node, relation), children = queue.popleft()
if node not in object_mapping:
continue
if object_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
position[node] = [0, 0, floor_margin, *bg_quat]
else:
x1, x2, y1, y2, z1, z2, qx, qy, qz, qw = mesh_info[node]["pose"]
if object_mapping[node] == Scene3DItemEnum.CONTEXT.value:
position[node] = [0, 0, -round(z1, 4), qx, qy, qz, qw]
parent_bbox_xy[node] = [x1, x2, y1, y2, z1, z2]
elif object_mapping[node] in [
Scene3DItemEnum.MANIPULATED_OBJS.value,
Scene3DItemEnum.DISTRACTOR_OBJS.value,
]:
parent_node = find_parent_node(node, layout_info.tree)
parent_pos = position[parent_node]
(
p_x1,
p_x2,
p_y1,
p_y2,
p_z1,
p_z2,
) = parent_bbox_xy[parent_node]
obj_dx = x2 - x1
obj_dy = y2 - y1
hull_path = mesh_info[parent_node].get("surface")
for _ in range(max_attempts):
node_x1 = random.uniform(p_x1, p_x2 - obj_dx)
node_y1 = random.uniform(p_y1, p_y2 - obj_dy)
node_box = [
node_x1,
node_x1 + obj_dx,
node_y1,
node_y1 + obj_dy,
]
if hull_path and not all_corners_inside(
hull_path, node_box
):
continue
# Make sure the manipulated object is reachable by robot.
if (
limit_reach_range
and object_mapping[node]
== Scene3DItemEnum.MANIPULATED_OBJS.value
):
cx = parent_pos[0] + node_box[0] + obj_dx / 2
cy = parent_pos[1] + node_box[2] + obj_dy / 2
cz = parent_pos[2] + p_z2 - z1
robot_pose = position[robot_node][:3]
if not check_reachable(
base_xyz=np.array(robot_pose),
reach_xyz=np.array([cx, cy, cz]),
):
continue
if not has_iou_conflict(
node_box, placed_boxes_map[parent_node]
):
z_offset = 0
break
else:
logger.warning(
f"Cannot place {node} on {parent_node} without overlap"
f" after {max_attempts} attempts; placing it beside {parent_node} instead."
)
for _ in range(max_attempts):
node_x1 = random.choice(
[
random.uniform(
p_x1 - obj_dx - beside_margin,
p_x1 - obj_dx,
),
random.uniform(p_x2, p_x2 + beside_margin),
]
)
node_y1 = random.choice(
[
random.uniform(
p_y1 - obj_dy - beside_margin,
p_y1 - obj_dy,
),
random.uniform(p_y2, p_y2 + beside_margin),
]
)
node_box = [
node_x1,
node_x1 + obj_dx,
node_y1,
node_y1 + obj_dy,
]
z_offset = -(parent_pos[2] + p_z2)
if not has_iou_conflict(
node_box, placed_boxes_map[parent_node]
):
break
placed_boxes_map[parent_node].append(node_box)
abs_cx = parent_pos[0] + node_box[0] + obj_dx / 2
abs_cy = parent_pos[1] + node_box[2] + obj_dy / 2
abs_cz = parent_pos[2] + p_z2 - z1 + z_offset
position[node] = [
round(v, 4)
for v in [abs_cx, abs_cy, abs_cz, qx, qy, qz, qw]
]
parent_bbox_xy[node] = [x1, x2, y1, y2, z1, z2]
sorted_children = sorted(
children, key=lambda x: -mesh_info[x[0]].get("area", 0)
)
for child, rel in sorted_children:
queue.append(((child, rel), layout_info.tree.get(child, [])))
layout_info.position = position
return layout_info
def compose_mesh_scene(
layout_info: LayoutInfo, out_scene_path: str, with_bg: bool = False
) -> None:
object_mapping = Scene3DItemEnum.object_mapping(layout_info.relation)
scene = trimesh.Scene()
for node in layout_info.assets:
if object_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
mesh_path = f"{layout_info.assets[node]}/mesh_model.ply"
if not with_bg:
continue
else:
mesh_path = (
f"{layout_info.assets[node]}/mesh/{node.replace(' ', '_')}.obj"
)
mesh = trimesh.load(mesh_path)
offset = np.array(layout_info.position[node])[[0, 2, 1]]
mesh.vertices += offset
scene.add_geometry(mesh, node_name=node)
os.makedirs(os.path.dirname(out_scene_path), exist_ok=True)
scene.export(out_scene_path)
logger.info(f"Composed interactive 3D layout saved in {out_scene_path}")
return
def compute_pinhole_intrinsics(
image_w: int, image_h: int, fov_deg: float
) -> np.ndarray:
fov_rad = np.deg2rad(fov_deg)
fx = image_w / (2 * np.tan(fov_rad / 2))
fy = fx # assuming square pixels
cx = image_w / 2
cy = image_h / 2
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
return K


@@ -166,7 +166,7 @@ def combine_images_to_grid(
images: list[str | Image.Image],
cat_row_col: tuple[int, int] = None,
target_wh: tuple[int, int] = (512, 512),
) -> list[str | Image.Image]:
) -> list[Image.Image]:
n_images = len(images)
if n_images == 1:
return images
@@ -377,6 +377,42 @@ def parse_text_prompts(prompts: list[str]) -> list[str]:
return prompts
def alpha_blend_rgba(
fg_image: Union[str, Image.Image, np.ndarray],
bg_image: Union[str, Image.Image, np.ndarray],
) -> Image.Image:
"""Alpha blends a foreground RGBA image over a background RGBA image.
Args:
fg_image: Foreground image. Can be a file path (str), a PIL Image,
or a NumPy ndarray.
bg_image: Background image. Can be a file path (str), a PIL Image,
or a NumPy ndarray.
Returns:
A PIL Image representing the alpha-blended result in RGBA mode.
"""
if isinstance(fg_image, str):
fg_image = Image.open(fg_image)
elif isinstance(fg_image, np.ndarray):
fg_image = Image.fromarray(fg_image)
if isinstance(bg_image, str):
bg_image = Image.open(bg_image)
elif isinstance(bg_image, np.ndarray):
bg_image = Image.fromarray(bg_image)
if fg_image.size != bg_image.size:
raise ValueError(
f"Image sizes do not match: {fg_image.size} vs. {bg_image.size}."
)
fg = fg_image.convert("RGBA")
bg = bg_image.convert("RGBA")
return Image.alpha_composite(bg, fg)
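Two boundary cases of the compositing above make the behavior concrete: a fully transparent foreground leaves the background untouched, and a fully opaque one replaces it. A standalone sketch with tiny synthetic images:

```python
import numpy as np
from PIL import Image

bg = Image.fromarray(np.full((4, 4, 4), [0, 0, 255, 255], dtype=np.uint8))  # opaque blue
fg_clear = Image.fromarray(np.zeros((4, 4, 4), dtype=np.uint8))             # alpha 0
fg_solid = Image.fromarray(np.full((4, 4, 4), [255, 0, 0, 255], dtype=np.uint8))

# Same Image.alpha_composite call as alpha_blend_rgba uses internally.
assert np.array_equal(np.array(Image.alpha_composite(bg, fg_clear)), np.array(bg))
assert np.array_equal(np.array(Image.alpha_composite(bg, fg_solid)), np.array(fg_solid))
```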
def check_object_edge_truncated(
mask: np.ndarray, edge_threshold: int = 5
) -> bool:
@@ -400,8 +436,15 @@ def check_object_edge_truncated(
if __name__ == "__main__":
merge_video_video(
"outputs/imageto3d/room_bottle7/room_bottle_007/URDF_room_bottle_007/mesh_glo_normal.mp4", # noqa
"outputs/imageto3d/room_bottle7/room_bottle_007/URDF_room_bottle_007/mesh.mp4", # noqa
"merge.mp4",
)
image_paths = [
"outputs/layouts_sim/task_0000/images/pen.png",
"outputs/layouts_sim/task_0000/images/notebook.png",
"outputs/layouts_sim/task_0000/images/mug.png",
"outputs/layouts_sim/task_0000/images/lamp.png",
"outputs/layouts_sim2/task_0014/images/cloth.png", # TODO
]
for image_path in image_paths:
image = cv2.imread(image_path, cv2.IMREAD_UNCHANGED)
mask = image[..., -1]
flag = check_object_edge_truncated(mask)
print(flag, image_path)


@@ -0,0 +1,706 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
import logging
import os
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Literal
import mplib
import numpy as np
import sapien.core as sapien
import torch
from PIL import Image, ImageColor
from scipy.spatial.transform import Rotation as R
from tqdm import tqdm
from embodied_gen.data.utils import DiffrastRender
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.geometry import quaternion_multiply
logger = logging.getLogger(__name__)
from mani_skill.examples.motionplanning.panda.utils import (
compute_grasp_info_by_obb,
)
from mani_skill.utils.geometry.trimesh_utils import get_component_mesh
COLORMAP = list(set(ImageColor.colormap.values()))
COLOR_PALETTE = np.array(
[ImageColor.getrgb(c) for c in COLORMAP], dtype=np.uint8
)
SIM_COORD_ALIGN = np.array(
[
[1.0, 0.0, 0.0, 0.0],
[0.0, -1.0, 0.0, 0.0],
[0.0, 0.0, -1.0, 0.0],
[0.0, 0.0, 0.0, 1.0],
]
) # Aligns the SAPIEN/MuJoCo camera coordinate convention with the world coordinate system.
__all__ = [
"SIM_COORD_ALIGN",
"FrankaPandaGrasper",
"SimpleGrasper",
]
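`SIM_COORD_ALIGN` is applied to camera-to-world matrices (as `c2w @ SIM_COORD_ALIGN` in the rendering loop); being a pure diagonal, it simply flips the camera's Y and Z axes. A minimal check of that effect:

```python
import numpy as np

# Same diagonal as SIM_COORD_ALIGN above.
align = np.diag([1.0, -1.0, -1.0, 1.0])
c2w = np.eye(4)  # identity camera pose for illustration
aligned = c2w @ align
assert np.array_equal(aligned[:3, :3], np.diag([1.0, -1.0, -1.0]))
```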
class SapienSceneManager:
"""A class to manage a SAPIEN simulation scene, its cameras, and actors."""
def __init__(
self, sim_freq: int, ray_tracing: bool, device: str = "cuda"
) -> None:
self.sim_freq = sim_freq
self.ray_tracing = ray_tracing
self.device = device
self.renderer = sapien.SapienRenderer()
self.scene = self._setup_scene()
self.cameras: list[sapien.render.RenderCameraComponent] = []
self.actors: dict[str, sapien.pysapien.Entity] = {}
def _setup_scene(self) -> sapien.Scene:
"""Set up the SAPIEN scene with lighting and ground."""
# Ray tracing settings
if self.ray_tracing:
sapien.render.set_camera_shader_dir("rt")
sapien.render.set_ray_tracing_samples_per_pixel(64)
sapien.render.set_ray_tracing_path_depth(10)
sapien.render.set_ray_tracing_denoiser("oidn")
# TODO: Support GPU.
# import sapien.physx as physx
# physx.enable_gpu()
# scene = sapien.Scene(systems=[sapien.physx.PhysxGpuSystem(), sapien.render.RenderSystem()])
scene = sapien.Scene()
scene.set_timestep(1 / self.sim_freq)
# Add lighting
scene.set_ambient_light([0.2, 0.2, 0.2])
scene.add_directional_light(
direction=[0, 1, -1],
color=[1.5, 1.45, 1.4],
shadow=True,
shadow_map_size=2048,
)
scene.add_directional_light(
direction=[0, -0.5, 1], color=[0.8, 0.8, 0.85], shadow=False
)
scene.add_directional_light(
direction=[0, -1, 1], color=[1.0, 1.0, 1.0], shadow=False
)
ground_material = self.renderer.create_material()
ground_material.base_color = [0.5, 0.5, 0.5, 1] # rgba, gray
ground_material.roughness = 0.7
ground_material.metallic = 0.0
scene.add_ground(0, render_material=ground_material)
return scene
def step_action(
self,
action: torch.Tensor,
sim_steps_per_control: int,
cameras: list[sapien.render.RenderCameraComponent],
render_keys: list[str],
) -> dict:
self.robot.set_action(action)
frames = defaultdict(list)
for _ in range(sim_steps_per_control):
self.scene.step()
self.scene.update_render()
for camera in cameras:
camera.take_picture()
images = self.render_images(camera, render_keys=render_keys)
frames[camera.name].append(images)
return frames
def load_exists_robot(
self,
robot_name: str,
pose: sapien.Pose,
control_freq: int = 20,
robot_init_qpos_noise: float = 0.02,
control_mode: str = "pd_joint_pos",
backend_str: tuple[str, str] = ("cpu", "gpu"),
):
from mani_skill.agents import REGISTERED_AGENTS
from mani_skill.envs.scene import ManiSkillScene
from mani_skill.envs.utils.system.backend import (
parse_sim_and_render_backend,
)
if robot_name == "franka":
robot_name = "panda" # remapping
if robot_name not in REGISTERED_AGENTS:
raise KeyError(
f"Robot `{robot_name}` not registered, choose from {list(REGISTERED_AGENTS.keys())}."
)
ROBOT_CLS = REGISTERED_AGENTS[robot_name].agent_cls
# backend = parse_sim_and_render_backend("physx_cuda", "gpu")
backend = parse_sim_and_render_backend(*backend_str)
scene = ManiSkillScene(
[self.scene], device=backend_str[0], backend=backend
)
robot = ROBOT_CLS(
scene=scene,
control_freq=control_freq,
control_mode=control_mode,
initial_pose=pose,
)
# Set robot initial joint angles in radians (joint0 to joint6, plus 2 finger joints).
qpos = np.array(
[
0.0,
np.pi / 8,
0,
-np.pi * 3 / 8,
0,
np.pi * 3 / 4,
np.pi / 4,
0.04,
0.04,
]
)
qpos = (
np.random.normal(
0, robot_init_qpos_noise, (len(scene.sub_scenes), len(qpos))
)
+ qpos
)
qpos[:, -2:] = 0.04
robot.reset(qpos)
robot.init_qpos = robot.robot.qpos
robot.controller.controllers["gripper"].reset()
return robot
def load_actor_from_urdf(
self,
file_path: str,
pose: sapien.Pose,
use_static: bool = False,
update_mass: bool = False,
) -> sapien.pysapien.Entity:
"""Load an actor from a URDF file into the scene.
Args:
file_path (str): Path to the URDF file.
pose (sapien.Pose): Pose of the actor.
use_static (bool): Whether to create a static actor.
update_mass (bool): Whether to update the actor's mass from the URDF.
Returns:
sapien.pysapien.Entity: The loaded actor.
"""
tree = ET.parse(file_path)
root = tree.getroot()
node_name = root.get("name")
file_dir = os.path.dirname(file_path)
visual_file = root.find('.//visual/geometry/mesh').get("filename")
collision_file = root.find('.//collision/geometry/mesh').get(
"filename"
)
visual_file = os.path.join(file_dir, visual_file)
collision_file = os.path.join(file_dir, collision_file)
static_fric = root.find('.//collision/gazebo/mu1').text
dynamic_fric = root.find('.//collision/gazebo/mu2').text
material = self.scene.create_physical_material(
static_friction=np.clip(float(static_fric), 0.1, 0.7),
dynamic_friction=np.clip(float(dynamic_fric), 0.1, 0.6),
restitution=0.05,
)
builder = self.scene.create_actor_builder()
body_type = "static" if use_static else "dynamic"
builder.set_physx_body_type(body_type)
builder.add_multiple_convex_collisions_from_file(
collision_file, material=material
)
builder.add_visual_from_file(visual_file)
actor = builder.build(name=node_name)
actor.set_name(node_name)
actor.set_pose(pose)
if update_mass and hasattr(actor.components[1], "mass"):
node_mass = float(root.find('.//inertial/mass').get("value"))
actor.components[1].set_mass(node_mass)
return actor
def create_camera(
self,
cam_name: str,
pose: sapien.Pose,
image_hw: tuple[int, int],
fovy_deg: float,
) -> sapien.render.RenderCameraComponent:
"""Create a single camera in the scene.
Args:
cam_name (str): Name of the camera.
pose (sapien.Pose): Camera pose p=(x, y, z), q=(w, x, y, z)
image_hw (Tuple[int, int]): Image resolution (height, width) for cameras.
fovy_deg (float): Field of view in degrees for cameras.
Returns:
sapien.render.RenderCameraComponent: The created camera.
"""
cam_actor = self.scene.create_actor_builder().build_kinematic()
cam_actor.set_pose(pose)
camera = self.scene.add_mounted_camera(
name=cam_name,
mount=cam_actor,
pose=sapien.Pose(p=[0, 0, 0], q=[1, 0, 0, 0]),
width=image_hw[1],
height=image_hw[0],
fovy=np.deg2rad(fovy_deg),
near=0.01,
far=100,
)
self.cameras.append(camera)
return camera
def initialize_circular_cameras(
self,
num_cameras: int,
radius: float,
height: float,
target_pt: list[float],
image_hw: tuple[int, int],
fovy_deg: float,
) -> list[sapien.render.RenderCameraComponent]:
"""Initialize multiple cameras arranged in a circle.
Args:
num_cameras (int): Number of cameras to create.
radius (float): Radius of the camera circle.
height (float): Fixed Z-coordinate of the cameras.
target_pt (list[float]): 3D point (x, y, z) that cameras look at.
image_hw (Tuple[int, int]): Image resolution (height, width) for cameras.
fovy_deg (float): Field of view in degrees for cameras.
Returns:
List[sapien.render.RenderCameraComponent]: List of created cameras.
"""
angle_step = 2 * np.pi / num_cameras
world_up_vec = np.array([0.0, 0.0, 1.0])
target_pt = np.array(target_pt)
for i in range(num_cameras):
angle = i * angle_step
cam_x = radius * np.cos(angle)
cam_y = radius * np.sin(angle)
cam_z = height
eye_pos = [cam_x, cam_y, cam_z]
forward_vec = target_pt - eye_pos
forward_vec = forward_vec / np.linalg.norm(forward_vec)
temp_right_vec = np.cross(forward_vec, world_up_vec)
if np.linalg.norm(temp_right_vec) < 1e-6:
temp_right_vec = np.array([1.0, 0.0, 0.0])
if np.abs(np.dot(temp_right_vec, forward_vec)) > 0.99:
temp_right_vec = np.array([0.0, 1.0, 0.0])
right_vec = temp_right_vec / np.linalg.norm(temp_right_vec)
up_vec = np.cross(right_vec, forward_vec)
rotation_matrix = np.array([forward_vec, -right_vec, up_vec]).T
rot = R.from_matrix(rotation_matrix)
scipy_quat = rot.as_quat() # (x, y, z, w)
quat = [
scipy_quat[3],
scipy_quat[0],
scipy_quat[1],
scipy_quat[2],
] # (w, x, y, z)
self.create_camera(
f"camera_{i}",
sapien.Pose(p=eye_pos, q=quat),
image_hw,
fovy_deg,
)
return self.cameras
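The look-at math in the loop above can be checked in isolation, without SAPIEN (a sketch with illustrative eye/target values):

```python
import numpy as np

# Build the camera rotation as in initialize_circular_cameras: forward toward
# the target, right from the world up vector, columns assembled [fwd, -right, up].
eye = np.array([0.9, 0.0, 1.1])
target = np.array([0.0, 0.0, 0.9])
fwd = (target - eye) / np.linalg.norm(target - eye)
right = np.cross(fwd, [0.0, 0.0, 1.0])
right /= np.linalg.norm(right)
up = np.cross(right, fwd)
Rm = np.array([fwd, -right, up]).T
assert np.allclose(Rm.T @ Rm, np.eye(3), atol=1e-6)  # orthonormal rotation basis
```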
def render_images(
self,
camera: sapien.render.RenderCameraComponent,
render_keys: list[
Literal[
"Color",
"Segmentation",
"Normal",
"Mask",
"Depth",
"Foreground",
]
] = None,
) -> dict[str, Image.Image]:
"""Render images from a given camera.
Args:
camera (sapien.render.RenderCameraComponent): The camera to render from.
render_keys (List[str]): Types of images to render (e.g., Color, Segmentation).
Returns:
Dict[str, Image.Image]: Dictionary of rendered images.
"""
if render_keys is None:
render_keys = [
"Color",
"Segmentation",
"Normal",
"Mask",
"Depth",
"Foreground",
]
results: dict[str, Image.Image] = {}
if "Color" in render_keys:
color = camera.get_picture("Color")
color_rgb = (np.clip(color[..., :3], 0, 1) * 255).astype(np.uint8)
results["Color"] = Image.fromarray(color_rgb)
if "Mask" in render_keys:
color = camera.get_picture("Color")  # fetch here so "Mask" works without "Color"
alpha = (np.clip(color[..., 3], 0, 1) * 255).astype(np.uint8)
results["Mask"] = Image.fromarray(alpha)
if "Segmentation" in render_keys:
seg_labels = camera.get_picture("Segmentation")
label0 = seg_labels[..., 0].astype(np.uint8)
seg_color = COLOR_PALETTE[label0]
results["Segmentation"] = Image.fromarray(seg_color)
if "Foreground" in render_keys:
seg_labels = camera.get_picture("Segmentation")
label0 = seg_labels[..., 0]
mask = np.where((label0 > 1), 255, 0).astype(np.uint8)
color = camera.get_picture("Color")
color_rgb = (np.clip(color[..., :3], 0, 1) * 255).astype(np.uint8)
foreground = np.concatenate([color_rgb, mask[..., None]], axis=-1)
results["Foreground"] = Image.fromarray(foreground)
if "Normal" in render_keys:
normal = camera.get_picture("Normal")[..., :3]
normal_img = (((normal + 1) / 2) * 255).astype(np.uint8)
results["Normal"] = Image.fromarray(normal_img)
if "Depth" in render_keys:
position_map = camera.get_picture("Position")
depth = -position_map[..., 2]
color = camera.get_picture("Color")  # fetch here so "Depth" works without "Color"
alpha = torch.tensor(color[..., 3], dtype=torch.float32)
norm_depth = DiffrastRender.normalize_map_by_mask(
torch.tensor(depth), alpha
)
depth_img = (norm_depth * 255).to(torch.uint8).numpy()
results["Depth"] = Image.fromarray(depth_img)
return results
def load_assets_from_layout_file(
self,
layout: LayoutInfo,
z_offset: float = 0.0,
init_quat: list[float] = [0, 0, 0, 1],
robot_name: str = None,
control_freq: int = 40,
robot_init_qpos_noise: float = 0,
) -> None:
"""Load assets from `EmbodiedGen` layout-gen output and create actors in the scene.
Args:
layout (LayoutInfo): The layout information data.
z_offset (float): Offset to apply to the Z-coordinate of non-context objects.
init_quat (List[float]): Initial quaternion (x, y, z, w) for orientation adjustment.
robot_name (str): Name of the robot to load; if None, no robot is loaded.
control_freq (int): Control frequency for the robot, default is 40.
robot_init_qpos_noise (float): Noise to add to the robot's initial joint positions, default is 0.
"""
for node in layout.assets:
file_dir = layout.assets[node]
file_name = f"{node.replace(' ', '_')}.urdf"
urdf_file = os.path.join(file_dir, file_name)
if layout.objs_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
continue
position = layout.position[node].copy()
if layout.objs_mapping[node] != Scene3DItemEnum.CONTEXT.value:
position[2] += z_offset
use_static = (
layout.relation.get(Scene3DItemEnum.CONTEXT.value, None)
== node
)
# Combine initial quaternion with object quaternion
x, y, z, qx, qy, qz, qw = position
qx, qy, qz, qw = quaternion_multiply([qx, qy, qz, qw], init_quat)
actor = self.load_actor_from_urdf(
urdf_file,
sapien.Pose(p=[x, y, z], q=[qw, qx, qy, qz]),
use_static=use_static,
update_mass=False,
)
self.actors[node] = actor
if robot_name is not None:
robot_node = layout.relation[Scene3DItemEnum.ROBOT.value]
x, y, z, qx, qy, qz, qw = layout.position[robot_node]
delta_z = 0.002 # Add small offset to avoid collision.
pose = sapien.Pose([x, y, z + delta_z], [qw, qx, qy, qz])
self.robot = self.load_exists_robot(
robot_name, pose, control_freq, robot_init_qpos_noise
)
return
class FrankaPandaGrasper(object):
"""Grasper planner for Franka Panda robot arm.
Args:
robot: The Panda agent, see `mani_skill.agents.robots.panda.panda`.
scene: The `SapienSceneManager` object.
sim_steps_per_control: Number of simulation steps per control step.
control_timestep: Time step for planning.
joint_vel_limits: Velocity limit for all joints (default: 0.9)
joint_acc_limits: Acceleration limit for all joints (default: 0.9)
finger_length: Length of the robot's fingers (default: 0.025)
"""
def __init__(
self,
robot,
scene: SapienSceneManager,
sim_steps_per_control: int,
control_timestep: float,
joint_vel_limits: float = 0.9,
joint_acc_limits: float = 0.9,
finger_length: float = 0.025,
) -> None:
self.agent = robot
self.robot = robot.robot
self.scene = scene
self.sim_steps_per_control = sim_steps_per_control
self.control_timestep = control_timestep
self.joint_vel_limits = joint_vel_limits
self.joint_acc_limits = joint_acc_limits
self.finger_length = finger_length
self.planner = self._setup_planner()
def _setup_planner(self) -> mplib.Planner:
link_names = [link.get_name() for link in self.robot.get_links()]
joint_names = [
joint.get_name() for joint in self.robot.get_active_joints()
]
planner = mplib.Planner(
urdf=self.agent.urdf_path,
srdf=self.agent.urdf_path.replace(".urdf", ".srdf"),
user_link_names=link_names,
user_joint_names=joint_names,
move_group="panda_hand_tcp",
joint_vel_limits=np.ones(7) * self.joint_vel_limits,
joint_acc_limits=np.ones(7) * self.joint_acc_limits,
)
planner.set_base_pose(self.robot.pose.raw_pose[0].tolist())
return planner
def update_state(
self,
result: dict,
gripper_state: Literal[-1, 1],
sim_steps_per_control: int,
cameras: list[sapien.render.RenderCameraComponent],
render_keys: list[str],
) -> dict[str, list[Image.Image]]:
n_step = len(result["position"])
total_frames = defaultdict(list)
for i in tqdm(range(n_step), desc="Grasping"):
qpos = result["position"][min(i, n_step - 1)]
action = np.hstack([qpos, gripper_state])[None, ...]
action = torch.from_numpy(action).float()
frames = self.scene.step_action(
action, sim_steps_per_control, cameras, render_keys
)
for camera in cameras:
total_frames[camera.name].extend(frames[camera.name])
return total_frames
def control_gripper(
self,
sim_steps_per_control: int,
cameras: list[sapien.render.RenderCameraComponent],
render_keys: list[str],
gripper_state: Literal[-1, 1],
n_step: int = 10,
) -> dict[str, list[Image.Image]]:
total_frames = defaultdict(list)
qpos = self.robot.get_qpos()[0, :-2].cpu().numpy()
for _ in range(n_step):
action = np.hstack([qpos, gripper_state])[None, ...]
action = torch.from_numpy(action).float()
frames = self.scene.step_action(
action, sim_steps_per_control, cameras, render_keys
)
for camera in cameras:
total_frames[camera.name].extend(frames[camera.name])
return total_frames
def move_to_pose(
self,
pose: sapien.Pose,
sim_steps_per_control: int,
control_timestep: float,
cameras: list[sapien.render.RenderCameraComponent],
render_keys: list[str],
gripper_state: Literal[-1, 1],
use_point_cloud: bool = False,
n_max_step: int = 100,
) -> dict[str, list[Image.Image]]:
# First try screw motion planning
result = self.planner.plan_screw(
np.concatenate([pose.p, pose.q]),
self.robot.get_qpos().cpu().numpy()[0],
time_step=control_timestep,
use_point_cloud=use_point_cloud,
)
# Fallback to standard planning if screw fails
if result["status"] != "Success":
result = self.planner.plan_qpos_to_pose(
np.concatenate([pose.p, pose.q]),
self.robot.get_qpos().cpu().numpy()[0],
time_step=control_timestep,
use_point_cloud=use_point_cloud,
)
if result["status"] != "Success":
return defaultdict(list)
for key in result:
if key in ["status", "duration"]:
continue
sample_ratio = (len(result[key]) // n_max_step) + 1
result[key] = result[key][::sample_ratio]
frames = self.update_state(
result, gripper_state, sim_steps_per_control, cameras, render_keys
)
return frames
def render_grasp(
self,
actor: sapien.pysapien.Entity,
cameras: list[sapien.render.RenderCameraComponent],
render_keys: list[
Literal[
"Color",
"Foreground",
"Segmentation",
"Normal",
"Mask",
"Depth",
]
],
reach_target_only: bool = True,
) -> dict[str, list[Image.Image]]:
physx_rigid = actor.components[1]
mesh = get_component_mesh(physx_rigid, to_world_frame=True)
obb = mesh.bounding_box_oriented
approaching = np.array([0, 0, -1])
target_closing = (
self.agent.tcp.pose.to_transformation_matrix()[0, :3, 1]
.cpu()
.numpy()
)
grasp_info = compute_grasp_info_by_obb(
obb,
approaching=approaching,
target_closing=target_closing,
depth=self.finger_length,
)
closing, center = grasp_info["closing"], grasp_info["center"]
raw_tcp_pose = self.agent.tcp.pose.sp
grasp_pose = self.agent.build_grasp_pose(approaching, closing, center)
reach_pose = grasp_pose * sapien.Pose(p=[0, 0, -0.05])
grasp_pose = grasp_pose * sapien.Pose(p=[0, 0, 0.01])
total_frames = defaultdict(list)
frames_list = []
reach_frames = self.move_to_pose(
reach_pose,
self.sim_steps_per_control,
self.control_timestep,
cameras,
render_keys,
gripper_state=1,
)
frames_list.append(reach_frames)
if len(reach_frames) == 0:
logger.warning(
f"Failed to reach the grasp pose for node `{actor.name}`, skipping grasping."
)
return total_frames
if not reach_target_only:
grasp_frames = self.move_to_pose(
grasp_pose,
self.sim_steps_per_control,
self.control_timestep,
cameras,
render_keys,
gripper_state=1,
)
frames_list.append(grasp_frames)
close_frames = self.control_gripper(
self.sim_steps_per_control,
cameras,
render_keys,
gripper_state=-1,
)
frames_list.append(close_frames)
back_frames = self.move_to_pose(
raw_tcp_pose,
self.sim_steps_per_control,
self.control_timestep,
cameras,
render_keys,
gripper_state=-1,
)
frames_list.append(back_frames)
for frame_dict in frames_list:
for cam_name, frames in frame_dict.items():
total_frames[cam_name].extend(frames)
self.agent.reset(self.agent.init_qpos)
return total_frames
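
The pose handling in `load_assets_from_layout_file` above is easy to get wrong: layout positions store quaternions as (qx, qy, qz, qw), while `sapien.Pose` expects (w, x, y, z). Below is a minimal, self-contained sketch of that conversion; `quat_multiply` is a local stand-in for the `quaternion_multiply` helper the real module imports, and the sample values are hypothetical:

```python
# Sketch of the quaternion handling in `load_assets_from_layout_file`:
# layout positions are (x, y, z, qx, qy, qz, qw), but `sapien.Pose`
# takes its quaternion in (w, x, y, z) order.

def quat_multiply(q1, q2):
    """Hamilton product of two quaternions, both in (x, y, z, w) order."""
    x1, y1, z1, w1 = q1
    x2, y2, z2, w2 = q2
    return [
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
    ]

position = [0.5, 0.2, 0.8, 0.0, 0.0, 0.0, 1.0]  # hypothetical: identity rotation
init_quat = [0.0, 0.0, 0.7071068, 0.7071068]  # 90 deg about +Z, xyzw order
x, y, z, qx, qy, qz, qw = position
qx, qy, qz, qw = quat_multiply([qx, qy, qz, qw], init_quat)
pose_pq = ([x, y, z], [qw, qx, qy, qz])  # reorder xyzw -> wxyz for sapien.Pose
```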

View File

@@ -1 +1 @@
VERSION = "v0.1.2"
VERSION = "v0.1.3"

View File

@@ -109,7 +109,7 @@ class MeshGeoChecker(BaseChecker):
if self.prompt is None:
self.prompt = """
You are an expert in evaluating the geometry quality of generated 3D asset.
You will be given rendered views of a generated 3D asset with black background.
You will be given rendered views of a generated 3D asset, type {}, with black background.
Your task is to evaluate the quality of the 3D asset generation,
including geometry, structure, and appearance, based on the rendered views.
Criteria:
@@ -130,10 +130,13 @@ class MeshGeoChecker(BaseChecker):
Image shows a chair with simplified back legs and soft edges YES
"""
def query(self, image_paths: list[str | Image.Image]) -> str:
def query(
self, image_paths: list[str | Image.Image], text: str = "unknown"
) -> str:
input_prompt = self.prompt.format(text)
return self.gpt_client.query(
text_prompt=self.prompt,
text_prompt=input_prompt,
image_base64=image_paths,
)
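
The `{}` placeholder added to the prompt above is filled by `query` via `self.prompt.format(text)`. A minimal sketch of that templating step, with the prompt abbreviated and the GPT client call omitted:

```python
# Sketch of the templating step in the updated MeshGeoChecker.query:
# the prompt carries a `{}` placeholder that `query` fills with the
# asset type before sending it to the GPT client (client call omitted).
prompt = (
    "You are an expert in evaluating the geometry quality of generated 3D asset. "
    "You will be given rendered views of a generated 3D asset, type {}, "
    "with black background."
)
input_prompt = prompt.format("pen")  # `text` defaults to "unknown" in query()
```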

View File

@@ -107,36 +107,37 @@ class URDFGenerator(object):
already provided, use it directly), accurately describe this 3D object asset (within 15 words),
Determine the pose of the object in the first image and estimate the true vertical height
(vertical projection) range of the object (in meters), i.e., how tall the object appears from top
to bottom in the front view (first) image. also weight range (unit: kilogram), the average
to bottom in the first image, as well as the weight range (unit: kilogram), the average
static friction coefficient of the object relative to rubber and the average dynamic friction
coefficient of the object relative to rubber. Return response format as shown in Output Example.
coefficient of the object relative to rubber. Return the response in the format shown in the Output Example.
Output Example:
Category: cup
Description: shiny golden cup with floral design
Height: 0.1-0.15 m
Pose: <short_description_within_10_words>
Height: 0.10-0.15 m
Weight: 0.3-0.6 kg
Static friction coefficient: 0.6
Dynamic friction coefficient: 0.5
IMPORTANT: Estimating Vertical Height from the First (Front View) Image.
IMPORTANT: Estimating Vertical Height from the First (Front View) Image and Pose Based on All Views.
- The "vertical height" refers to the real-world vertical size of the object
as projected in the first image, aligned with the image's vertical axis.
- For flat objects like plates, disks, or books, if their face is visible in the front view,
use the diameter as the vertical height. If the edge is visible, use the thickness instead.
- This is not necessarily the full length of the object, but how tall it appears
in the first image vertically, based on its pose and orientation.
- For objects(e.g., spoons, forks, writing instruments etc.) at an angle showing in
the first image, tilted at 45° will appear shorter vertically than when upright.
in the first image vertically, based on its pose and orientation estimated from all views.
- For objects (e.g., spoons, forks, writing instruments) shown at an angle in the images,
e.g., tilted at 45°, they will appear shorter vertically than when upright.
Estimate the vertical projection of their real length based on its pose.
For example:
- A pen standing upright in the first view (aligned with the image's vertical axis)
- A pen standing upright in the first image (aligned with the image's vertical axis)
full body visible in the first image: vertical height 0.14-0.20 m
- A pen lying flat in the front view (showing thickness) vertical height 0.018-0.025 m
- A pen lying flat in the first image (showing thickness or as a dot) vertical height 0.018-0.025 m
- Tilted pen in the first image (e.g., ~45° angle): vertical height 0.07-0.12 m
- Use the rest views(except the first image) to help determine the object's 3D pose and orientation.
- Use the remaining views to help determine the object's 3D pose and orientation.
Assume the object is in real-world scale and estimate the approximate vertical height
(in meters) based on how large it appears vertically in the first image.
based on the pose estimation and how large it appears vertically in the first image.
"""
)
@@ -163,14 +164,14 @@ class URDFGenerator(object):
description = lines[1].split(": ")[1]
min_height, max_height = map(
lambda x: float(x.strip().replace(",", "").split()[0]),
lines[2].split(": ")[1].split("-"),
lines[3].split(": ")[1].split("-"),
)
min_mass, max_mass = map(
lambda x: float(x.strip().replace(",", "").split()[0]),
lines[3].split(": ")[1].split("-"),
lines[4].split(": ")[1].split("-"),
)
mu1 = float(lines[4].split(": ")[1].replace(",", ""))
mu2 = float(lines[5].split(": ")[1].replace(",", ""))
mu1 = float(lines[5].split(": ")[1].replace(",", ""))
mu2 = float(lines[6].split(": ")[1].replace(",", ""))
return {
"category": category.lower(),
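
The index shifts above (lines[2] to lines[3], and so on) follow from the new `Pose:` line inserted into the response format. A minimal, self-contained sketch of the updated parsing against a hypothetical response matching the Output Example:

```python
# Sketch of why the parsed line indices shift: the GPT response now
# contains an extra `Pose:` line, so height moves to lines[3], mass to
# lines[4], and the friction coefficients to lines[5] and lines[6].
# The response text below is a hypothetical example.
response = """Category: cup
Description: shiny golden cup with floral design
Pose: upright on its base, handle facing right
Height: 0.10-0.15 m
Weight: 0.3-0.6 kg
Static friction coefficient: 0.6
Dynamic friction coefficient: 0.5"""
lines = response.splitlines()
min_height, max_height = map(
    lambda x: float(x.strip().replace(",", "").split()[0]),
    lines[3].split(": ")[1].split("-"),
)
min_mass, max_mass = map(
    lambda x: float(x.strip().replace(",", "").split()[0]),
    lines[4].split(": ")[1].split("-"),
)
mu1 = float(lines[5].split(": ")[1].replace(",", ""))
mu2 = float(lines[6].split(": ")[1].replace(",", ""))
```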

View File

@@ -7,7 +7,7 @@ packages = ["embodied_gen"]
[project]
name = "embodied_gen"
version = "v0.1.2"
version = "v0.1.3"
readme = "README.md"
license = "Apache-2.0"
license-files = ["LICENSE", "NOTICE"]
@@ -32,6 +32,8 @@ backproject-cli = "embodied_gen.data.backproject_v2:entrypoint"
img3d-cli = "embodied_gen.scripts.imageto3d:entrypoint"
text3d-cli = "embodied_gen.scripts.textto3d:text_to_3d"
scene3d-cli = "embodied_gen.scripts.gen_scene3d:entrypoint"
layout-cli = "embodied_gen.scripts.gen_layout:entrypoint"
sim-cli = "embodied_gen.scripts.simulate_sapien:entrypoint"
[tool.pydocstyle]
match = '(?!test_).*(?!_pb2)\.py'

View File

@@ -38,4 +38,5 @@ tyro
pyquaternion
shapely
sapien==3.0.0b1
mani_skill==3.0.0b21
typing_extensions==4.14.1

View File

@@ -77,10 +77,18 @@ def panoocc_checker():
def test_geo_checker(geo_checker):
from embodied_gen.utils.process_media import combine_images_to_grid
image_paths = glob(
"outputs/layouts_gens3/task_0000/asset3d/pen/result/renders/image_color/*png"
)
image = combine_images_to_grid(image_paths)[0]
# image = "outputs/utest_cases/geo_checker/color.png"
flag, result = geo_checker(
[
"apps/assets/example_image/sample_02.jpg",
]
image,
],
text="pen",
)
logger.info(f"geo_checker: {flag}, {result}")
assert isinstance(flag, bool)
@@ -107,11 +115,40 @@ def test_seg_checker(seg_checker):
def test_semantic_checker(semantic_checker):
test_cases = [
("table", "outputs/utest_cases/semantic_checker/task_0001_table.png"),
("table", "outputs/utest_cases/semantic_checker/task_0002_table.png"),
(
"banana",
"outputs/utest_cases/semantic_checker/task_0002_banana.png",
),
(
"napkin",
"outputs/utest_cases/semantic_checker/task_0003_napkin.png",
),
("table", "outputs/utest_cases/semantic_checker/task_0003_table.png"),
("fork", "outputs/utest_cases/semantic_checker/task_0005_fork.png"),
("spoon", "outputs/utest_cases/semantic_checker/task_0005_spoon.png"),
("table", "outputs/utest_cases/semantic_checker/task_0007_table.png"),
("table", "outputs/utest_cases/semantic_checker/task_0008_table.png"),
("table", "outputs/utest_cases/semantic_checker/task_0009_table.png"),
("table", "outputs/utest_cases/semantic_checker/task_0011_table.png"),
("desk", "outputs/utest_cases/semantic_checker/task_0012_desk.png"),
(
"laptop",
"outputs/utest_cases/semantic_checker/task_0012_laptop.png",
),
("table", "outputs/utest_cases/semantic_checker/task_0014_table.png"),
("desk", "outputs/utest_cases/semantic_checker/task_0016_desk.png"),
("shelf", "outputs/utest_cases/semantic_checker/task_0018_shelf.png"),
("table", "outputs/utest_cases/semantic_checker/task_0000_table.png"),
]
for test_case in test_cases:
flag, result = semantic_checker(
text="can",
image=["apps/assets/example_image/sample_02.jpg"],
text=test_case[0],
image=[test_case[1]],
)
logger.info(f"semantic_checker: {flag}, {result}")
logger.info(f"semantic_checker: {flag}, {result}, {test_case[1]}")
assert isinstance(flag, bool)
assert isinstance(result, str)
@@ -139,25 +176,21 @@ def test_textgen_checker(textalign_checker, mesh_path, text_desc):
def test_panoheight_estimator(pano_height_estimator):
image_paths = glob("outputs/bg_v3/test2/*/*.png")
image_paths = glob("outputs/bg_scenes/*/*.png")
for image_path in image_paths:
result = pano_height_estimator(image_path)
logger.info(f"{type(result)}, {result}")
def test_pano_checker(pano_checker):
# image_paths = [
# "outputs/bg_gen2/scene_0000/pano_image.png",
# "outputs/bg_gen2/scene_0001/pano_image.png",
# ]
image_paths = glob("outputs/bg_gen/*/*.png")
image_paths = glob("outputs/bg_scenes/*/*.png")
for image_path in image_paths:
flag, result = pano_checker(image_path)
logger.info(f"{image_path} {flag}, {result}")
def test_panoocc_checker(panoocc_checker):
image_paths = glob("outputs/bg_gen/*/*.png")
image_paths = glob("outputs/bg_scenes/*/*.png")
for image_path in image_paths:
flag, result = panoocc_checker(image_path)
logger.info(f"{image_path} {flag}, {result}")

View File

@@ -0,0 +1,51 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
from embodied_gen.utils.gpt_clients import GPT_CLIENT
from embodied_gen.validators.urdf_convertor import URDFGenerator
def test_urdf_convertor():
urdf_gen = URDFGenerator(GPT_CLIENT, render_view_num=4)
mesh_paths = [
"outputs/test_urdf/sample_0/mesh/pen.obj",
"outputs/test_urdf/sample_1/mesh/notepad.obj",
"outputs/test_urdf/sample_2/mesh/plate.obj",
"outputs/test_urdf/sample_3/mesh/spoon.obj",
"outputs/test_urdf/sample_4/mesh/notebook.obj",
"outputs/test_urdf/sample_5/mesh/plate.obj",
"outputs/test_urdf/sample_6/mesh/spoon.obj",
"outputs/test_urdf/sample_7/mesh/book.obj",
"outputs/test_urdf/sample_8/mesh/lamp.obj",
"outputs/test_urdf/sample_9/mesh/remote_control.obj",
"outputs/test_urdf/sample_10/mesh/keyboard.obj",
"outputs/test_urdf/sample_11/mesh/mouse.obj",
"outputs/test_urdf/sample_12/mesh/table.obj",
"outputs/test_urdf/sample_13/mesh/marker.obj",
"outputs/test_urdf/pen/result/mesh/pen.obj",
"outputs/test_urdf/notebook/result/mesh/notebook.obj",
"outputs/test_urdf/marker/result/mesh/marker.obj",
"outputs/test_urdf/pen2/result/mesh/pen.obj",
]
for idx, mesh_path in enumerate(mesh_paths):
filename = mesh_path.split("/")[-1].split(".")[0]
urdf_path = urdf_gen(
mesh_path=mesh_path,
output_root=f"outputs/test_urdf2/sample_{idx}",
category=filename,
# min_height=1.0,
# max_height=1.2,
)