feat(pipe): Add layout generation by text desc pipe and release v0.1.3. (#33)
* feat(pipe): Add layout generation pipe.
* tag v0.1.3
* chore: update assets
* update

Co-authored-by: xinjie.wang <xinjie.wang@gpu-4090-dev015.hogpu.cc>
This commit is contained in:
parent
e05095f51f
commit
e06900fe57
CHANGELOG.md (12 changed lines)
@@ -4,8 +4,18 @@ All notable changes to this project will be documented in this file.

 The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0).

+## [0.1.3] - 2025-08-19
+
+### Feature
+
+- Release `layout-cli` 🏞️ Layout (Interactive 3D Worlds) Generation: generate an interactive 3D scene from a task description.
+
-## [0.1.1] - 2025-07-xx
+## [0.1.2] - 2025-07-31

 ### Feature

 - Release `scene3d-cli` 🌍 3D Scene Generation: generate a color mesh and 3DGS scene from a text description.
 - Release Dockerfile and Docker image.

 ## [0.1.1] - 2025-07-11

 ### Feature

 - Added intelligent quality checkers and auto-retry pipeline for `image-to-3d` and `text-to-3d`.
 - Added unit tests for quality checkers.
README.md (45 changed lines)
@@ -30,7 +30,7 @@
 ```sh
 git clone https://github.com/HorizonRobotics/EmbodiedGen.git
 cd EmbodiedGen
-git checkout v0.1.2
+git checkout v0.1.3
 git submodule update --init --recursive --progress
 conda create -n embodiedgen python=3.10.13 -y # recommended to use a new env.
 conda activate embodiedgen
@@ -194,8 +194,6 @@ CUDA_VISIBLE_DEVICES=0 scene3d-cli \

 ### 💬 Generate Layout from task description

-🚧 *Coming Soon*
-
 <table>
 <tr>
 <td><img src="apps/assets/layout1.gif" alt="layout1" width="360"/></td>
@@ -207,6 +205,47 @@ CUDA_VISIBLE_DEVICES=0 scene3d-cli \
 </tr>
 </table>

+The text-to-image model is based on SD3.5 Medium; using it requires agreement to the [model license](https://huggingface.co/stabilityai/stable-diffusion-3.5-medium). All models are downloaded automatically on the first run.
+
+You can generate any desired room as a background using `scene3d-cli`. Since each scene takes approximately 30 minutes to generate, we recommend pre-generating backgrounds for efficiency and adding them to `outputs/bg_scenes/scene_list.txt`.
+
+We provide sample background assets created with `scene3d-cli`. Download them (~4 GB) with `hf download xinjjj/scene3d-bg --repo-type dataset --local-dir outputs`.
+
+Generating one interactive 3D scene from a task description with `layout-cli` takes approximately 30 minutes.
+
+```sh
+layout-cli --task_descs "Place the pen in the mug on the desk" "Put the fruit on the table on the plate" \
+    --bg_list "outputs/bg_scenes/scene_list.txt" --output_root "outputs/layouts_gen" --insert_robot
+```
+
+<table>
+<tr>
+<td><img src="apps/assets/Iscene_demo1.gif" alt="Iscene_demo1" width="300"/></td>
+<td><img src="apps/assets/Iscene_demo2.gif" alt="Iscene_demo2" width="450"/></td>
+</tr>
+</table>
+
+Run multiple tasks defined in `task_list.txt` in the background.
+Omit `--insert_robot` if you do not want the robot pose considered in layout generation.
+
+```sh
+CUDA_VISIBLE_DEVICES=0 nohup layout-cli \
+    --task_descs "apps/assets/example_layout/task_list.txt" \
+    --bg_list "outputs/bg_scenes/scene_list.txt" \
+    --output_root "outputs/layouts_gens" --insert_robot > layouts_gens.log &
+```
+
+With `compose_layout.py` you can recompose the layout of the generated interactive 3D scenes. (Support for texture editing and augmentation will be added later.)
+
+```sh
+python embodied_gen/scripts/compose_layout.py \
+    --layout_path "outputs/layouts_gens/task_0000/layout.json" \
+    --output_dir "outputs/layouts_gens/task_0000/recompose" --insert_robot
+```
+
+We provide `sim-cli`, which lets users easily load generated layouts into an interactive 3D simulation using the SAPIEN engine (support for more simulators is planned in future updates).
+
+```sh
+sim-cli --layout_path "outputs/layouts_gens/task_0000/recompose/layout.json" \
+    --output_dir "outputs/layouts_gens/task_0000/recompose/sapien_render" --robot_name "franka"
+```
+
 ### 🖼️ Real-to-Sim Digital Twin
BIN  apps/assets/Iscene_demo1.gif  (new file; binary not shown; 471 KiB)
BIN  apps/assets/Iscene_demo2.gif  (new file; binary not shown; 875 KiB)
apps/assets/example_layout/task_list.txt (18 lines, new file)

@@ -0,0 +1,18 @@
+Place the pen in the mug on the desk
+Get the pen from the table and put it inside the cup
+Put the fruit on the table on the plate
+Put the apples on the table on the plate
+Put the spoon on the plate in the table
+Put the green lid on the desk
+Put the pen on the tray
+Move the plate forward
+Move the lamp to the left
+Pick up the cup and put on the middle of the table
+Close the laptop
+Close the red jar with the lid on the table
+Pick the cloth and shove it in the box
+Pick up the marker from the table and put it in the bowl
+Pick up the charger and move it slightly to the left
+Move the jar to the left side of the desk
+Pick the rubik's cube on the top of the desk
+Move the mug to the right
@@ -189,7 +189,7 @@ os.makedirs(TMP_DIR, exist_ok=True)
 lighting_css = """
 <style>
 #lighter_mesh canvas {
-    filter: brightness(1.8) !important;
+    filter: brightness(1.9) !important;
 }
 </style>
 """
@@ -51,17 +51,15 @@ class RenderResult:

     def __post_init__(self):
         if isinstance(self.rgb, torch.Tensor):
-            rgb = self.rgb.detach().cpu().numpy()
-            rgb = (rgb * 255).astype(np.uint8)
-            self.rgb = cv2.cvtColor(rgb, cv2.COLOR_BGR2RGB)
+            rgb = (self.rgb * 255).to(torch.uint8)
+            self.rgb = rgb.cpu().numpy()[..., ::-1]
         if isinstance(self.depth, torch.Tensor):
-            self.depth = self.depth.detach().cpu().numpy()
+            self.depth = self.depth.cpu().numpy()
         if isinstance(self.opacity, torch.Tensor):
-            opacity = self.opacity.detach().cpu().numpy()
-            opacity = (opacity * 255).astype(np.uint8)
-            self.opacity = cv2.cvtColor(opacity, cv2.COLOR_GRAY2RGB)
+            opacity = (self.opacity * 255).to(torch.uint8)
+            self.opacity = opacity.cpu().numpy()

         mask = np.where(self.opacity > self.mask_threshold, 255, 0)
-        self.mask = mask[..., 0:1].astype(np.uint8)
+        self.mask = mask.astype(np.uint8)
         self.rgba = np.concatenate([self.rgb, self.mask], axis=-1)
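The refactored `__post_init__` above replaces `cv2.cvtColor(..., cv2.COLOR_BGR2RGB)` with a plain channel reversal (`[..., ::-1]`). For 3-channel images the two are equivalent; this NumPy-only sketch (an editor's illustration, not part of the commit) demonstrates the reversal:

```python
import numpy as np

# A tiny fake 2x2 BGR image with distinct values per channel.
bgr = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)

# Reversing the last axis swaps the B and R channels, which is exactly
# what a BGR-to-RGB color conversion does for 3-channel inputs.
rgb = bgr[..., ::-1]

print(rgb[0, 0].tolist())  # first pixel's channels, reversed
```

The same trick is what lets the new code stay entirely on the tensor/array level instead of round-tripping through OpenCV.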
embodied_gen/models/layout.py (509 lines, new file)
@@ -0,0 +1,509 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.


import argparse
import json
import logging
import os
import re

import json_repair

from embodied_gen.utils.enum import (
    LayoutInfo,
    RobotItemEnum,
    Scene3DItemEnum,
    SpatialRelationEnum,
)
from embodied_gen.utils.gpt_clients import GPT_CLIENT, GPTclient
from embodied_gen.utils.process_media import SceneTreeVisualizer

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


__all__ = [
    "LayoutDesigner",
    "LAYOUT_DISASSEMBLER",
    "LAYOUT_GRAPHER",
    "LAYOUT_DESCRIBER",
]


DISTRACTOR_NUM = 3  # Maximum number of distractor objects allowed
LAYOUT_DISASSEMBLE_PROMPT = f"""
You are an intelligent 3D scene planner. Given a natural language
description of a robotic task, output a structured description of
an interactive 3D scene.

The output must include the following fields:
- task: A high-level task type (e.g., "single-arm pick",
  "dual-arm grasping", "pick and place", "object sorting").
- {Scene3DItemEnum.ROBOT}: The name or type of robot involved. If not mentioned,
  use {RobotItemEnum.FRANKA} as the default.
- {Scene3DItemEnum.BACKGROUND}: The room or indoor environment where the task happens
  (e.g., Kitchen, Bedroom, Living Room, Workshop, Office).
- {Scene3DItemEnum.CONTEXT}: An indoor object involved in the manipulation
  (e.g., Table, Shelf, Desk, Bed, Cabinet).
- {Scene3DItemEnum.MANIPULATED_OBJS}: The main object(s) that the robot directly interacts with.
- {Scene3DItemEnum.DISTRACTOR_OBJS}: Other objects that naturally belong to the scene but are not part of the main task.

Constraints:
- The {Scene3DItemEnum.BACKGROUND} must logically match the described task.
- The {Scene3DItemEnum.CONTEXT} must fit within the {Scene3DItemEnum.BACKGROUND} (e.g., a bedroom may include a table or bed, but not a workbench).
- The {Scene3DItemEnum.CONTEXT} must be a concrete indoor object, such as a "table",
  "shelf", "desk", or "bed". It must not be an abstract concept (e.g., "area", "space", "zone")
  or a structural surface (e.g., "floor", "ground"). If the input describes an interaction near
  the floor or a vague space, you must infer a plausible object such as a "table", "cabinet", or "storage box" instead.
- {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS} must be plausible
  and semantically compatible with the {Scene3DItemEnum.CONTEXT} and {Scene3DItemEnum.BACKGROUND}.
- {Scene3DItemEnum.DISTRACTOR_OBJS} must not be confused with or overlap the manipulated objects.
- {Scene3DItemEnum.DISTRACTOR_OBJS} number limit: {DISTRACTOR_NUM} distractors maximum.
- All {Scene3DItemEnum.BACKGROUND} values are limited to indoor environments.
- {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS} must be rigid bodies and must not include flexible objects.
- {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS} must be common
  household or office items or furniture, not abstract concepts, and not too small (e.g., a needle).
- If the input includes a plural or grouped object (e.g., "pens", "bottles", "plates", "fruit"),
  you must decompose it into multiple individual instances (e.g., ["pen", "pen"], ["apple", "pear"]).
- Containers that hold objects (e.g., "bowl of apples", "box of tools") must
  be separated into individual items (e.g., ["bowl", "apple", "apple"]).
- Do not include transparent objects such as "glass", "plastic", etc.
- The output must be compact JSON wrapped in Markdown syntax, exactly like the outputs in the examples below.

Examples:

Input:
"Pick up the marker from the table and put it in the bowl, robot {RobotItemEnum.UR5}."
Output:
```json
{{
    "task_desc": "Pick up the marker from the table and put it in the bowl.",
    "task": "pick and place",
    "{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.UR5}",
    "{Scene3DItemEnum.BACKGROUND}": "kitchen",
    "{Scene3DItemEnum.CONTEXT}": "table",
    "{Scene3DItemEnum.MANIPULATED_OBJS}": ["marker"],
    "{Scene3DItemEnum.DISTRACTOR_OBJS}": ["mug", "notebook", "bowl"]
}}
```

Input:
"Put the rubik's cube on the top of the shelf."
Output:
```json
{{
    "task_desc": "Put the rubik's cube on the top of the shelf.",
    "task": "pick and place",
    "{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.FRANKA}",
    "{Scene3DItemEnum.BACKGROUND}": "bedroom",
    "{Scene3DItemEnum.CONTEXT}": "shelf",
    "{Scene3DItemEnum.MANIPULATED_OBJS}": ["rubik's cube"],
    "{Scene3DItemEnum.DISTRACTOR_OBJS}": ["pen", "cup", "toy car"]
}}
```

Input:
"Remove all the objects from the white basket and put them on the table."
Output:
```json
{{
    "task_desc": "Remove all the objects from the white basket and put them on the table, robot {RobotItemEnum.PIPER}.",
    "task": "pick and place",
    "{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.PIPER}",
    "{Scene3DItemEnum.BACKGROUND}": "office",
    "{Scene3DItemEnum.CONTEXT}": "table",
    "{Scene3DItemEnum.MANIPULATED_OBJS}": ["banana", "mobile phone"],
    "{Scene3DItemEnum.DISTRACTOR_OBJS}": ["plate", "white basket"]
}}
```

Input:
"Pick up the rope on the chair and put it in the box."
Output:
```json
{{
    "task_desc": "Pick up the rope on the chair and put it in the box, robot {RobotItemEnum.FRANKA}.",
    "task": "pick and place",
    "{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.FRANKA}",
    "{Scene3DItemEnum.BACKGROUND}": "living room",
    "{Scene3DItemEnum.CONTEXT}": "chair",
    "{Scene3DItemEnum.MANIPULATED_OBJS}": ["rope", "box"],
    "{Scene3DItemEnum.DISTRACTOR_OBJS}": ["magazine"]
}}
```

Input:
"Pick up the seal tape and plastic from the counter and put them in the open drawer and close it."
Output:
```json
{{
    "task_desc": "Pick up the seal tape and plastic from the counter and put them in the open drawer and close it.",
    "task": "pick and place",
    "robot": "franka",
    "background": "kitchen",
    "context": "counter",
    "manipulated_objs": ["seal tape", "plastic", "opened drawer"],
    "distractor_objs": ["scissors"]
}}
```

Input:
"Put the pens in the grey bowl."
Output:
```json
{{
    "task_desc": "Put the pens in the grey bowl.",
    "task": "pick and place",
    "robot": "franka",
    "background": "office",
    "context": "table",
    "manipulated_objs": ["pen", "pen", "grey bowl"],
    "distractor_objs": ["notepad", "cup"]
}}
```

"""
LAYOUT_HIERARCHY_PROMPT = f"""
You are a 3D scene layout reasoning expert.
Your task is to generate a spatial relationship dictionary, as a multiway tree,
that describes how objects are arranged in a 3D environment
based on a given task description and object list.

The input is in JSON format, containing the task description, task type,
{Scene3DItemEnum.ROBOT}, {Scene3DItemEnum.BACKGROUND}, {Scene3DItemEnum.CONTEXT},
and a list of objects, including {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS}.

### Supported Spatial Relations:
- "{SpatialRelationEnum.ON}": The child object's bottom rests directly on the parent object's top.
- "{SpatialRelationEnum.INSIDE}": The child object is inside the context object.
- "{SpatialRelationEnum.IN}": The {Scene3DItemEnum.ROBOT} is in the {Scene3DItemEnum.BACKGROUND}.
- "{SpatialRelationEnum.FLOOR}": The child object's bottom is on the floor of the {Scene3DItemEnum.BACKGROUND}.

### Rules:
- The {Scene3DItemEnum.CONTEXT} object must be "{SpatialRelationEnum.FLOOR}" the {Scene3DItemEnum.BACKGROUND}.
- {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS} must be either
  "{SpatialRelationEnum.ON}" or "{SpatialRelationEnum.INSIDE}" the {Scene3DItemEnum.CONTEXT},
  or "{SpatialRelationEnum.FLOOR}" the {Scene3DItemEnum.BACKGROUND}.
- Use "{SpatialRelationEnum.INSIDE}" only if the parent is a container-like object (e.g., shelf, rack, cabinet).
- Do not define relationship edges between arbitrary objects, only between child and parent nodes.
- The {Scene3DItemEnum.ROBOT} must be "{SpatialRelationEnum.IN}" the {Scene3DItemEnum.BACKGROUND}.
- Ensure that each object appears only once in the layout tree, and its spatial relationship is defined with only one parent.
- Ensure a valid multiway tree structure with a maximum depth of 2 levels, suitable for a 3D scene layout representation.
- Only output the final result in JSON format, using Markdown syntax as in the examples.

### Example
Input:
{{
    "task_desc": "Pick up the marker from the table and put it in the bowl.",
    "task": "pick and place",
    "{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.FRANKA}",
    "{Scene3DItemEnum.BACKGROUND}": "kitchen",
    "{Scene3DItemEnum.CONTEXT}": "table",
    "{Scene3DItemEnum.MANIPULATED_OBJS}": ["marker", "bowl"],
    "{Scene3DItemEnum.DISTRACTOR_OBJS}": ["mug", "chair"]
}}
Intermediate Think:
table {SpatialRelationEnum.FLOOR} kitchen
chair {SpatialRelationEnum.FLOOR} kitchen
{RobotItemEnum.FRANKA} {SpatialRelationEnum.IN} kitchen
marker {SpatialRelationEnum.ON} table
bowl {SpatialRelationEnum.ON} table
mug {SpatialRelationEnum.ON} table
Final Output:
```json
{{
    "kitchen": [
        ["table", "{SpatialRelationEnum.FLOOR}"],
        ["chair", "{SpatialRelationEnum.FLOOR}"],
        ["{RobotItemEnum.FRANKA}", "{SpatialRelationEnum.IN}"]
    ],
    "table": [
        ["marker", "{SpatialRelationEnum.ON}"],
        ["bowl", "{SpatialRelationEnum.ON}"],
        ["mug", "{SpatialRelationEnum.ON}"]
    ]
}}
```

Input:
{{
    "task_desc": "Put the marker on top of the book.",
    "task": "pick and place",
    "{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.UR5}",
    "{Scene3DItemEnum.BACKGROUND}": "office",
    "{Scene3DItemEnum.CONTEXT}": "desk",
    "{Scene3DItemEnum.MANIPULATED_OBJS}": ["marker", "book"],
    "{Scene3DItemEnum.DISTRACTOR_OBJS}": ["pen holder", "notepad"]
}}
Intermediate Think:
desk {SpatialRelationEnum.FLOOR} office
{RobotItemEnum.UR5} {SpatialRelationEnum.IN} office
marker {SpatialRelationEnum.ON} desk
book {SpatialRelationEnum.ON} desk
pen holder {SpatialRelationEnum.ON} desk
notepad {SpatialRelationEnum.ON} desk
Final Output:
```json
{{
    "office": [
        ["desk", "{SpatialRelationEnum.FLOOR}"],
        ["{RobotItemEnum.UR5}", "{SpatialRelationEnum.IN}"]
    ],
    "desk": [
        ["marker", "{SpatialRelationEnum.ON}"],
        ["book", "{SpatialRelationEnum.ON}"],
        ["pen holder", "{SpatialRelationEnum.ON}"],
        ["notepad", "{SpatialRelationEnum.ON}"]
    ]
}}
```

Input:
{{
    "task_desc": "Put the rubik's cube on the top of the shelf.",
    "task": "pick and place",
    "{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.UR5}",
    "{Scene3DItemEnum.BACKGROUND}": "bedroom",
    "{Scene3DItemEnum.CONTEXT}": "shelf",
    "{Scene3DItemEnum.MANIPULATED_OBJS}": ["rubik's cube"],
    "{Scene3DItemEnum.DISTRACTOR_OBJS}": ["toy car", "pen"]
}}
Intermediate Think:
shelf {SpatialRelationEnum.FLOOR} bedroom
{RobotItemEnum.UR5} {SpatialRelationEnum.IN} bedroom
rubik's cube {SpatialRelationEnum.INSIDE} shelf
toy car {SpatialRelationEnum.INSIDE} shelf
pen {SpatialRelationEnum.INSIDE} shelf
Final Output:
```json
{{
    "bedroom": [
        ["shelf", "{SpatialRelationEnum.FLOOR}"],
        ["{RobotItemEnum.UR5}", "{SpatialRelationEnum.IN}"]
    ],
    "shelf": [
        ["rubik's cube", "{SpatialRelationEnum.INSIDE}"],
        ["toy car", "{SpatialRelationEnum.INSIDE}"],
        ["pen", "{SpatialRelationEnum.INSIDE}"]
    ]
}}
```

Input:
{{
    "task_desc": "Put the marker in the cup on the counter.",
    "task": "pick and place",
    "robot": "franka",
    "background": "kitchen",
    "context": "counter",
    "manipulated_objs": ["marker", "cup"],
    "distractor_objs": ["plate", "spoon"]
}}
Intermediate Think:
counter {SpatialRelationEnum.FLOOR} kitchen
{RobotItemEnum.FRANKA} {SpatialRelationEnum.IN} kitchen
marker {SpatialRelationEnum.ON} counter
cup {SpatialRelationEnum.ON} counter
plate {SpatialRelationEnum.ON} counter
spoon {SpatialRelationEnum.ON} counter
Final Output:
```json
{{
    "kitchen": [
        ["counter", "{SpatialRelationEnum.FLOOR}"],
        ["{RobotItemEnum.FRANKA}", "{SpatialRelationEnum.IN}"]
    ],
    "counter": [
        ["marker", "{SpatialRelationEnum.ON}"],
        ["cup", "{SpatialRelationEnum.ON}"],
        ["plate", "{SpatialRelationEnum.ON}"],
        ["spoon", "{SpatialRelationEnum.ON}"]
    ]
}}
```

"""
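The single-parent and maximum-depth constraints stated in this prompt can be checked mechanically on the model's output. A minimal stdlib sketch (an editor's illustration, not part of the commit; `validate_layout_tree` is a hypothetical helper and assumes object names in the tree are unique):

```python
def validate_layout_tree(tree: dict) -> bool:
    """Check a layout dict is a multiway tree of depth <= 2:
    exactly one root, and every object has a single parent."""
    children = [child for pairs in tree.values() for child, _rel in pairs]
    # Each object appears only once, i.e. has exactly one parent.
    if len(children) != len(set(children)):
        return False
    roots = [node for node in tree if node not in children]
    if len(roots) != 1:
        return False
    root = roots[0]
    # Depth <= 2: any non-root parent must hang directly off the root.
    root_children = {child for child, _rel in tree[root]}
    return all(node == root or node in root_children for node in tree)


layout_tree = {
    "kitchen": [["table", "floor"], ["chair", "floor"], ["franka", "in"]],
    "table": [["marker", "on"], ["bowl", "on"], ["mug", "on"]],
}
print(validate_layout_tree(layout_tree))
```

Duplicate instances produced by plural decomposition (e.g. two "pen" entries) would need unique suffixes before this check applies.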
LAYOUT_DESCRIBER_PROMPT = """
You are a 3D asset style descriptor.

Given a task description and a dictionary where the key is the object content and
the value is the object type, output a JSON dictionary with each object paired
with a concise, styled visual description suitable for 3D asset generation.

Generation Guidelines:
- For each object, brainstorm multiple style candidates before selecting the final
  description. Vary phrasing, material, texture, color, and spatial details.
- Each description must be a maximum of 15 words, including color, style, and materials.
- Descriptions should be visually grounded, specific, and reflect surface texture and structure.
- For objects marked as "context", explicitly mention that the object is standalone and has an empty top.
- Use rich style descriptors, e.g., "scratched brown wooden desk".
- Ensure all object styles align with the task's overall context and environment.

Format your output in JSON like the example below.

Example Input:
"Pick up the rope on the chair and put it in the box. {'living room': 'background', 'chair': 'context',
'rope': 'manipulated_objs', 'box': 'manipulated_objs', 'magazine': 'distractor_objs'}"

Example Output:
```json
{
    "living room": "modern cozy living room with soft sunlight and light grey carpet",
    "chair": "standalone dark oak chair with no surroundings and clean empty seat",
    "rope": "twisted hemp rope with rough fibers and dusty beige texture",
    "box": "slightly crumpled cardboard box with open flaps and brown textured surface",
    "magazine": "celebrity magazine with glossy red cover and large bold title"
}
```
"""
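The 15-word cap in the guidelines above is easy to enforce after generation. A small sketch (an editor's illustration, not part of the commit; `over_limit` is a hypothetical helper):

```python
def over_limit(desc_map: dict, max_words: int = 15) -> list:
    """Return the objects whose generated description exceeds the word cap."""
    return [obj for obj, desc in desc_map.items() if len(desc.split()) > max_words]


descs = {
    "chair": "standalone dark oak chair with no surroundings and clean empty seat",
    "rope": "twisted hemp rope with rough fibers and dusty beige texture",
}
print(over_limit(descs))  # both descriptions fit within 15 words
```

A check like this could feed a retry loop when the LLM overruns the limit.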
class LayoutDesigner(object):
    def __init__(
        self,
        gpt_client: GPTclient,
        system_prompt: str,
        verbose: bool = False,
    ) -> None:
        self.prompt = system_prompt.strip()
        self.verbose = verbose
        self.gpt_client = gpt_client

    def query(self, prompt: str, params: dict = None) -> str:
        full_prompt = self.prompt + f"\n\nInput:\n\"{prompt}\""

        response = self.gpt_client.query(
            text_prompt=full_prompt,
            params=params,
        )

        if self.verbose:
            logger.info(f"Response: {response}")

        return response

    def format_response(self, response: str) -> dict:
        cleaned = re.sub(r"^```json\s*|\s*```$", "", response.strip())
        try:
            output = json.loads(cleaned)
        except json.JSONDecodeError as e:
            # Re-raise as ValueError: json.JSONDecodeError requires
            # (msg, doc, pos) and cannot be built from a message alone.
            raise ValueError(
                f"Error: {e}, failed to parse JSON response: {response}"
            ) from e

        return output

    def format_response_repair(self, response: str) -> dict:
        return json_repair.loads(response)

    def save_output(self, output: dict, save_path: str) -> None:
        os.makedirs(os.path.dirname(save_path), exist_ok=True)
        with open(save_path, 'w') as f:
            json.dump(output, f, indent=4)

    def __call__(
        self, prompt: str, save_path: str = None, params: dict = None
    ) -> dict | str:
        response = self.query(prompt, params=params)
        output = self.format_response_repair(response)
        if save_path:
            self.save_output(output, save_path)

        return output


LAYOUT_DISASSEMBLER = LayoutDesigner(
    gpt_client=GPT_CLIENT, system_prompt=LAYOUT_DISASSEMBLE_PROMPT
)
LAYOUT_GRAPHER = LayoutDesigner(
    gpt_client=GPT_CLIENT, system_prompt=LAYOUT_HIERARCHY_PROMPT
)
LAYOUT_DESCRIBER = LayoutDesigner(
    gpt_client=GPT_CLIENT, system_prompt=LAYOUT_DESCRIBER_PROMPT
)
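`format_response` above strips a surrounding Markdown ```json fence before parsing. The regex-plus-`json.loads` pattern can be exercised standalone with the stdlib (an editor's sketch mirroring that method, without the `json_repair` fallback; `parse_fenced_json` is a hypothetical name):

```python
import json
import re


def parse_fenced_json(response: str) -> dict:
    """Strip a surrounding ```json ... ``` fence, then parse the remainder.
    Unfenced input passes through the regex unchanged."""
    cleaned = re.sub(r"^```json\s*|\s*```$", "", response.strip())
    return json.loads(cleaned)


raw = '```json\n{"task": "pick and place", "robot": "franka"}\n```'
print(parse_fenced_json(raw)["robot"])
```

`json_repair.loads` additionally tolerates truncated or slightly malformed JSON, which is why the commit uses it as the default path.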
def build_scene_layout(
    task_desc: str, output_path: str = None, gpt_params: dict = None
) -> LayoutInfo:
    layout_relation = LAYOUT_DISASSEMBLER(task_desc, params=gpt_params)
    layout_tree = LAYOUT_GRAPHER(layout_relation, params=gpt_params)
    object_mapping = Scene3DItemEnum.object_mapping(layout_relation)
    obj_prompt = f'{layout_relation["task_desc"]} {object_mapping}'
    objs_desc = LAYOUT_DESCRIBER(obj_prompt, params=gpt_params)
    layout_info = LayoutInfo(
        layout_tree, layout_relation, objs_desc, object_mapping
    )

    if output_path is not None:
        visualizer = SceneTreeVisualizer(layout_info)
        visualizer.render(save_path=output_path)
        logger.info(f"Scene hierarchy tree saved to {output_path}")

    return layout_info
def parse_args():
    parser = argparse.ArgumentParser(description="3D Scene Layout Designer")
    parser.add_argument(
        "--task_desc",
        type=str,
        default="Put the apples on the table on the plate",
        help="Natural language description of the robotic task",
    )
    parser.add_argument(
        "--save_root",
        type=str,
        default="outputs/layout_tree",
        help="Path to save the layout output",
    )
    return parser.parse_args()


if __name__ == "__main__":
    from embodied_gen.utils.enum import LayoutInfo
    from embodied_gen.utils.process_media import SceneTreeVisualizer

    args = parse_args()
    params = {
        "temperature": 1.0,
        "top_p": 0.95,
        "frequency_penalty": 0.3,
        "presence_penalty": 0.5,
    }
    layout_relation = LAYOUT_DISASSEMBLER(args.task_desc, params=params)
    layout_tree = LAYOUT_GRAPHER(layout_relation, params=params)

    object_mapping = Scene3DItemEnum.object_mapping(layout_relation)
    obj_prompt = f'{layout_relation["task_desc"]} {object_mapping}'

    objs_desc = LAYOUT_DESCRIBER(obj_prompt, params=params)

    layout_info = LayoutInfo(layout_tree, layout_relation, objs_desc)

    visualizer = SceneTreeVisualizer(layout_info)
    os.makedirs(args.save_root, exist_ok=True)
    scene_graph_path = f"{args.save_root}/scene_tree.jpg"
    visualizer.render(save_path=scene_graph_path)
    with open(f"{args.save_root}/layout.json", "w") as f:
        json.dump(layout_info.to_dict(), f, indent=4)

    print(f"Scene hierarchy tree saved to {scene_graph_path}")
    print(f"Disassembled Layout: {layout_relation}")
    print(f"Layout Graph: {layout_tree}")
    print(f"Layout Descriptions: {objs_desc}")
embodied_gen/scripts/compose_layout.py (73 lines, new file)
@@ -0,0 +1,73 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.

import json
import os
from dataclasses import dataclass

import tyro

from embodied_gen.scripts.simulate_sapien import entrypoint as sim_cli
from embodied_gen.utils.enum import LayoutInfo
from embodied_gen.utils.geometry import bfs_placement, compose_mesh_scene
from embodied_gen.utils.log import logger


@dataclass
class LayoutPlacementConfig:
    layout_path: str
    output_dir: str | None = None
    seed: int | None = None
    max_attempts: int = 1000
    output_iscene: bool = False
    insert_robot: bool = False


def entrypoint(**kwargs):
    if kwargs is None or len(kwargs) == 0:
        args = tyro.cli(LayoutPlacementConfig)
    else:
        args = LayoutPlacementConfig(**kwargs)

    output_dir = (
        args.output_dir
        if args.output_dir is not None
        else os.path.dirname(args.layout_path)
    )
    os.makedirs(output_dir, exist_ok=True)
    out_scene_path = f"{output_dir}/Iscene.glb"
    out_layout_path = f"{output_dir}/layout.json"

    with open(args.layout_path, "r") as f:
|
||||||
|
layout_info = LayoutInfo.from_dict(json.load(f))
|
||||||
|
|
||||||
|
layout_info = bfs_placement(layout_info, seed=args.seed)
|
||||||
|
with open(out_layout_path, "w") as f:
|
||||||
|
json.dump(layout_info.to_dict(), f, indent=4)
|
||||||
|
|
||||||
|
if args.output_iscene:
|
||||||
|
compose_mesh_scene(layout_info, out_scene_path)
|
||||||
|
|
||||||
|
sim_cli(
|
||||||
|
layout_path=out_layout_path,
|
||||||
|
output_dir=output_dir,
|
||||||
|
robot_name="franka" if args.insert_robot else None,
|
||||||
|
)
|
||||||
|
|
||||||
|
logger.info(f"Layout placement completed in {output_dir}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
entrypoint()
|
||||||
156  embodied_gen/scripts/gen_layout.py  Normal file
@@ -0,0 +1,156 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.

import gc
import json
import os
from dataclasses import dataclass, field
from shutil import copytree
from time import time
from typing import Optional

import torch
import tyro

from embodied_gen.models.layout import build_scene_layout
from embodied_gen.scripts.simulate_sapien import entrypoint as sim_cli
from embodied_gen.scripts.textto3d import text_to_3d
from embodied_gen.utils.config import GptParamsConfig
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.geometry import bfs_placement, compose_mesh_scene
from embodied_gen.utils.gpt_clients import GPT_CLIENT
from embodied_gen.utils.log import logger
from embodied_gen.utils.process_media import (
    load_scene_dict,
    parse_text_prompts,
)
from embodied_gen.validators.quality_checkers import SemanticMatcher

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"


@dataclass
class LayoutGenConfig:
    task_descs: list[str]
    output_root: str
    bg_list: str = "outputs/bg_scenes/scene_list.txt"
    n_img_sample: int = 3
    text_guidance_scale: float = 7.0
    img_denoise_step: int = 25
    n_image_retry: int = 4
    n_asset_retry: int = 3
    n_pipe_retry: int = 2
    seed_img: Optional[int] = None
    seed_3d: Optional[int] = None
    seed_layout: Optional[int] = None
    keep_intermediate: bool = False
    output_iscene: bool = False
    insert_robot: bool = False
    gpt_params: GptParamsConfig = field(
        default_factory=lambda: GptParamsConfig(
            temperature=1.0,
            top_p=0.95,
            frequency_penalty=0.3,
            presence_penalty=0.5,
        )
    )


def entrypoint() -> None:
    args = tyro.cli(LayoutGenConfig)
    SCENE_MATCHER = SemanticMatcher(GPT_CLIENT)
    task_descs = parse_text_prompts(args.task_descs)
    scene_dict = load_scene_dict(args.bg_list)
    gpt_params = args.gpt_params.to_dict()
    for idx, task_desc in enumerate(task_descs):
        logger.info(f"Generate Layout and 3D scene for task: {task_desc}")
        output_root = f"{args.output_root}/task_{idx:04d}"
        scene_graph_path = f"{output_root}/scene_tree.jpg"
        start_time = time()
        layout_info: LayoutInfo = build_scene_layout(
            task_desc, scene_graph_path, gpt_params
        )
        prompts_mapping = {v: k for k, v in layout_info.objs_desc.items()}
        prompts = [
            v
            for k, v in layout_info.objs_desc.items()
            if layout_info.objs_mapping[k] != Scene3DItemEnum.BACKGROUND.value
        ]

        for prompt in prompts:
            node = prompts_mapping[prompt]
            generation_log = text_to_3d(
                prompts=[
                    prompt,
                ],
                output_root=output_root,
                asset_names=[
                    node,
                ],
                n_img_sample=args.n_img_sample,
                text_guidance_scale=args.text_guidance_scale,
                img_denoise_step=args.img_denoise_step,
                n_image_retry=args.n_image_retry,
                n_asset_retry=args.n_asset_retry,
                n_pipe_retry=args.n_pipe_retry,
                seed_img=args.seed_img,
                seed_3d=args.seed_3d,
                keep_intermediate=args.keep_intermediate,
            )
            layout_info.assets.update(generation_log["assets"])
            layout_info.quality.update(generation_log["quality"])

        # Background GEN (for efficiency, temp use retrieval instead)
        bg_node = layout_info.relation[Scene3DItemEnum.BACKGROUND.value]
        text = layout_info.objs_desc[bg_node]
        match_key = SCENE_MATCHER.query(text, str(scene_dict))
        match_scene_path = f"{os.path.dirname(args.bg_list)}/{match_key}"
        bg_save_dir = os.path.join(output_root, "background")
        copytree(match_scene_path, bg_save_dir, dirs_exist_ok=True)
        layout_info.assets[bg_node] = bg_save_dir

        # BFS layout placement.
        layout_info = bfs_placement(
            layout_info,
            limit_reach_range=True if args.insert_robot else False,
            seed=args.seed_layout,
        )
        layout_path = f"{output_root}/layout.json"
        with open(layout_path, "w") as f:
            json.dump(layout_info.to_dict(), f, indent=4)

        if args.output_iscene:
            compose_mesh_scene(layout_info, f"{output_root}/Iscene.glb")

        sim_cli(
            layout_path=layout_path,
            output_dir=output_root,
            robot_name="franka" if args.insert_robot else None,
        )

        torch.cuda.empty_cache()
        gc.collect()

        elapsed_time = (time() - start_time) / 60
        logger.info(
            f"Layout generation done for {scene_graph_path}, layout result "
            f"in {layout_path}, finished in {elapsed_time:.2f} mins."
        )

    logger.info(f"All tasks completed in {args.output_root}")


if __name__ == "__main__":
    entrypoint()
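gen_layout only sends non-background nodes to `text_to_3d`; the background itself is retrieved from a scene library via SemanticMatcher rather than generated. The filtering step can be sketched as below, with illustrative node names and a plain string in place of the real `Scene3DItemEnum` value:

```python
# Assumed toy data; the real mapping comes from LayoutInfo.objs_mapping.
BACKGROUND = "background"
objs_desc = {"table": "a wooden table", "room": "a bright kitchen"}
objs_mapping = {"table": "context", "room": BACKGROUND}

# Keep only descriptions of objects that must be generated as 3D assets.
prompts = [
    desc
    for node, desc in objs_desc.items()
    if objs_mapping[node] != BACKGROUND
]
```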
@@ -151,6 +151,9 @@ def entrypoint(**kwargs):
         seg_image.save(seg_path)
 
         seed = args.seed
+        asset_node = "unknown"
+        if isinstance(args.asset_type, list) and args.asset_type[idx]:
+            asset_node = args.asset_type[idx]
         for try_idx in range(args.n_retry):
             logger.info(
                 f"Try: {try_idx + 1}/{args.n_retry}, Seed: {seed}, Prompt: {seg_path}"
@@ -207,7 +210,9 @@ def entrypoint(**kwargs):
                 color_path = os.path.join(output_root, "color.png")
                 render_gs_api(aligned_gs_path, color_path)
 
-                geo_flag, geo_result = GEO_CHECKER([color_path])
+                geo_flag, geo_result = GEO_CHECKER(
+                    [color_path], text=asset_node
+                )
                 logger.warning(
                     f"{GEO_CHECKER.__class__.__name__}: {geo_result} for {seg_path}"
                 )
186  embodied_gen/scripts/simulate_sapien.py  Normal file
@@ -0,0 +1,186 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.

import json
import os
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Literal

import imageio
import numpy as np
import torch
import tyro
from tqdm import tqdm

from embodied_gen.models.gs_model import GaussianOperator
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.geometry import quaternion_multiply
from embodied_gen.utils.log import logger
from embodied_gen.utils.process_media import alpha_blend_rgba
from embodied_gen.utils.simulation import (
    SIM_COORD_ALIGN,
    FrankaPandaGrasper,
    SapienSceneManager,
)


@dataclass
class SapienSimConfig:
    # Simulation settings.
    layout_path: str
    output_dir: str
    sim_freq: int = 200
    sim_step: int = 400
    z_offset: float = 0.004
    init_quat: list[float] = field(
        default_factory=lambda: [0.7071, 0, 0, 0.7071]
    )  # xyzw
    device: str = "cuda"
    robot_name: str | None = None
    control_freq: int = 40
    # Camera settings.
    render_interval: int = 10
    num_cameras: int = 3
    camera_radius: float = 0.9
    camera_height: float = 1.1
    image_hw: tuple[int, int] = (512, 512)
    ray_tracing: bool = True
    fovy_deg: float = 75.0
    camera_target_pt: list[float] = field(
        default_factory=lambda: [0.0, 0.0, 0.9]
    )
    render_keys: list[
        Literal[
            "Color", "Foreground", "Segmentation", "Normal", "Mask", "Depth"
        ]
    ] = field(default_factory=lambda: ["Foreground"])


def entrypoint(**kwargs):
    if kwargs is None or len(kwargs) == 0:
        cfg = tyro.cli(SapienSimConfig)
    else:
        cfg = SapienSimConfig(**kwargs)

    scene_manager = SapienSceneManager(
        cfg.sim_freq, ray_tracing=cfg.ray_tracing
    )
    _ = scene_manager.initialize_circular_cameras(
        num_cameras=cfg.num_cameras,
        radius=cfg.camera_radius,
        height=cfg.camera_height,
        target_pt=cfg.camera_target_pt,
        image_hw=cfg.image_hw,
        fovy_deg=cfg.fovy_deg,
    )
    with open(cfg.layout_path, "r") as f:
        layout_data = json.load(f)
    layout_data: LayoutInfo = LayoutInfo.from_dict(layout_data)

    scene_manager.load_assets_from_layout_file(
        layout_data,
        cfg.z_offset,
        cfg.init_quat,
        cfg.robot_name,
        cfg.control_freq,
    )

    frames = defaultdict(list)
    image_cnt = 0
    for step in tqdm(range(cfg.sim_step), desc="Simulation"):
        scene_manager.scene.step()
        if cfg.robot_name is not None:
            scene_manager.robot.reset(scene_manager.robot.init_qpos)
        if step % cfg.render_interval != 0:
            continue
        scene_manager.scene.update_render()
        image_cnt += 1
        for camera in scene_manager.cameras:
            camera.take_picture()
            images = scene_manager.render_images(
                camera, render_keys=cfg.render_keys
            )
            frames[camera.name].append(images)

    grasp_frames = defaultdict(dict)
    if cfg.robot_name is not None:
        sim_steps_per_control = cfg.sim_freq // cfg.control_freq
        control_timestep = 1.0 / cfg.control_freq
        grasper = FrankaPandaGrasper(
            scene_manager.robot,
            scene_manager,
            sim_steps_per_control,
            control_timestep,
        )
        for node in layout_data.relation[
            Scene3DItemEnum.MANIPULATED_OBJS.value
        ]:
            grasp_frames[node] = grasper.render_grasp(
                scene_manager.actors[node],
                scene_manager.cameras,
                cfg.render_keys,
            )

    if "Foreground" not in cfg.render_keys:
        return

    bg_node = layout_data.relation[Scene3DItemEnum.BACKGROUND.value]
    gs_path = f"{layout_data.assets[bg_node]}/gs_model.ply"
    gs_model: GaussianOperator = GaussianOperator.load_from_ply(gs_path)
    x, y, z, qx, qy, qz, qw = layout_data.position[bg_node]
    qx, qy, qz, qw = quaternion_multiply([qx, qy, qz, qw], cfg.init_quat)
    init_pose = torch.tensor([x, y, z, qx, qy, qz, qw])
    gs_model = gs_model.get_gaussians(instance_pose=init_pose)

    bg_images = dict()
    for camera in scene_manager.cameras:
        Ks = camera.get_intrinsic_matrix()
        c2w = camera.get_model_matrix()
        c2w = c2w @ SIM_COORD_ALIGN
        result = gs_model.render(
            torch.tensor(c2w, dtype=torch.float32).to(cfg.device),
            torch.tensor(Ks, dtype=torch.float32).to(cfg.device),
            image_width=cfg.image_hw[1],
            image_height=cfg.image_hw[0],
        )
        bg_images[camera.name] = result.rgb[..., ::-1]

    video_frames = []
    for camera in scene_manager.cameras:
        # Scene rendering
        for step in tqdm(range(image_cnt), desc="Rendering"):
            rgba = alpha_blend_rgba(
                frames[camera.name][step]["Foreground"],
                bg_images[camera.name],
            )
            video_frames.append(np.array(rgba))
        # Grasp rendering
        for node in grasp_frames:
            for frame in grasp_frames[node][camera.name]:
                rgba = alpha_blend_rgba(
                    frame["Foreground"], bg_images[camera.name]
                )
                video_frames.append(np.array(rgba))

    os.makedirs(cfg.output_dir, exist_ok=True)
    video_path = f"{cfg.output_dir}/Iscene.mp4"
    imageio.mimsave(video_path, video_frames, fps=30)
    logger.info(f"Interactive 3D Scene Visualization saved in {video_path}")


if __name__ == "__main__":
    entrypoint()
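`alpha_blend_rgba` above composites the simulator's foreground render over the Gaussian-splat background image. A per-pixel stand-in for the real helper (which operates on whole image arrays) shows the blend rule:

```python
def alpha_blend_pixel(fg_rgba, bg_rgb):
    """Blend one foreground RGBA pixel over an opaque background pixel."""
    r, g, b, a = fg_rgba
    alpha = a / 255.0  # foreground coverage in [0, 1]
    return tuple(
        round(alpha * f + (1 - alpha) * bgc)
        for f, bgc in zip((r, g, b), bg_rgb)
    )


# Half-transparent red over a blue background.
blended = alpha_blend_pixel((255, 0, 0, 128), (0, 0, 255))
```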
@@ -81,6 +81,7 @@ done
 
 
 # Step 1: Text-to-Image
+echo ${prompt_args}
 eval python3 embodied_gen/scripts/text2image.py \
     --prompts ${prompt_args} \
     --output_root "${output_root}/images" \
@@ -617,7 +617,7 @@ class Runner:
             for rgb, depth in images_cache:
                 depth_normalized = torch.clip(
                     (depth - depth_global_min)
-                    / (depth_global_max - depth_global_min),
+                    / (depth_global_max - depth_global_min + 1e-8),
                     0,
                     1,
                 )
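The `+ 1e-8` added above guards the depth min-max normalization against a zero value range (a constant depth map would otherwise divide by zero). A stdlib sketch of the failure mode and the fix:

```python
def normalize(values, eps=1e-8):
    """Min-max normalize to [0, 1]; eps keeps a zero range finite."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]


# Constant input: range is zero, but the epsilon avoids ZeroDivisionError
# and yields all zeros instead of NaN/crash.
flat = normalize([2.0, 2.0, 2.0])
```

For non-degenerate inputs the epsilon perturbs the result by less than one part in 10^8, which is negligible for visualization.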
@@ -30,7 +30,7 @@ from kornia.morphology import dilation
 from PIL import Image
 from embodied_gen.models.sr_model import ImageRealESRGAN
 from embodied_gen.utils.config import Pano2MeshSRConfig
-from embodied_gen.utils.gaussian import compute_pinhole_intrinsics
+from embodied_gen.utils.geometry import compute_pinhole_intrinsics
 from embodied_gen.utils.log import logger
 from thirdparty.pano2room.modules.geo_predictors import PanoJointPredictor
 from thirdparty.pano2room.modules.geo_predictors.PanoFusionDistancePredictor import (
@@ -17,15 +17,27 @@
 from dataclasses import dataclass, field
 from typing import List, Optional, Union
 
+from dataclasses_json import DataClassJsonMixin
 from gsplat.strategy import DefaultStrategy, MCMCStrategy
 from typing_extensions import Literal, assert_never
 
 __all__ = [
+    "GptParamsConfig",
     "Pano2MeshSRConfig",
     "GsplatTrainConfig",
 ]
 
 
+@dataclass
+class GptParamsConfig(DataClassJsonMixin):
+    temperature: float = 0.1
+    top_p: float = 0.1
+    frequency_penalty: float = 0.0
+    presence_penalty: float = 0.0
+    stop: int | None = None
+    max_tokens: int = 500
+
+
 @dataclass
 class Pano2MeshSRConfig:
     mesh_file: str = "mesh_model.ply"
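`GptParamsConfig` inherits `to_dict()` from `DataClassJsonMixin`, which is how gen_layout turns the config into plain GPT-client kwargs (`args.gpt_params.to_dict()`). For a flat config like this one, the stdlib `dataclasses.asdict` gives a rough equivalent (an approximation — the mixin also handles nested and JSON-typed fields):

```python
from dataclasses import asdict, dataclass


@dataclass
class GptParams:
    temperature: float = 0.1
    top_p: float = 0.1
    max_tokens: int = 500


# Override one field, then flatten to a plain dict of kwargs.
params = asdict(GptParams(temperature=1.0))
```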
@@ -102,6 +102,7 @@ class LayoutInfo(DataClassJsonMixin):
     tree: dict[str, list]
     relation: dict[str, str | list[str]]
     objs_desc: dict[str, str] = field(default_factory=dict)
+    objs_mapping: dict[str, str] = field(default_factory=dict)
     assets: dict[str, str] = field(default_factory=dict)
     quality: dict[str, str] = field(default_factory=dict)
     position: dict[str, list[float]] = field(default_factory=dict)
@@ -35,7 +35,6 @@ __all__ = [
     "set_random_seed",
     "export_splats",
     "create_splats_with_optimizers",
-    "compute_pinhole_intrinsics",
     "resize_pinhole_intrinsics",
     "restore_scene_scale_and_position",
 ]
@@ -265,12 +264,12 @@ def create_splats_with_optimizers(
     return splats, optimizers
 
 
-def compute_pinhole_intrinsics(
-    image_w: int, image_h: int, fov_deg: float
+def compute_intrinsics_from_fovy(
+    image_w: int, image_h: int, fovy_deg: float
 ) -> np.ndarray:
-    fov_rad = np.deg2rad(fov_deg)
-    fx = image_w / (2 * np.tan(fov_rad / 2))
-    fy = fx  # assuming square pixels
+    fovy_rad = np.deg2rad(fovy_deg)
+    fy = image_h / (2 * np.tan(fovy_rad / 2))
+    fx = fy * (image_w / image_h)
     cx = image_w / 2
     cy = image_h / 2
     K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
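The rename above also changes the math: the focal length is now derived from the *vertical* FOV (`fy` from `image_h`, matching the `fovy_deg` camera setting used elsewhere in this commit), with `fx` scaled by the aspect ratio, instead of deriving `fx` from the image width. A stdlib-only sketch of the new formula:

```python
import math


def intrinsics_from_fovy(image_w, image_h, fovy_deg):
    """3x3 pinhole intrinsics from image size and vertical field of view."""
    fy = image_h / (2 * math.tan(math.radians(fovy_deg) / 2))
    fx = fy * (image_w / image_h)  # scale fx by aspect ratio, as in the diff
    return [[fx, 0, image_w / 2], [0, fy, image_h / 2], [0, 0, 1]]


# fovy = 90 deg, height 512: fy = 512 / (2 * tan(45 deg)) = 256.
K = intrinsics_from_fovy(640, 512, 90.0)
```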
458  embodied_gen/utils/geometry.py  Normal file
@@ -0,0 +1,458 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.

import os
import random
from collections import defaultdict, deque
from functools import wraps
from typing import Literal

import numpy as np
import torch
import trimesh
from matplotlib.path import Path
from pyquaternion import Quaternion
from scipy.spatial import ConvexHull
from scipy.spatial.transform import Rotation as R
from shapely.geometry import Polygon

from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.log import logger

__all__ = [
    "bfs_placement",
    "with_seed",
    "matrix_to_pose",
    "pose_to_matrix",
    "quaternion_multiply",
    "check_reachable",
    "compose_mesh_scene",
    "compute_pinhole_intrinsics",
]


def matrix_to_pose(matrix: np.ndarray) -> list[float]:
    """Convert a 4x4 transformation matrix to a pose (x, y, z, qx, qy, qz, qw).

    Args:
        matrix (np.ndarray): 4x4 transformation matrix.

    Returns:
        list[float]: Pose as [x, y, z, qx, qy, qz, qw].
    """
    x, y, z = matrix[:3, 3]
    rot_mat = matrix[:3, :3]
    quat = R.from_matrix(rot_mat).as_quat()
    qx, qy, qz, qw = quat

    return [x, y, z, qx, qy, qz, qw]


def pose_to_matrix(pose: list[float]) -> np.ndarray:
    """Convert a pose (x, y, z, qx, qy, qz, qw) to a 4x4 transformation matrix.

    Args:
        pose (list[float]): Pose as [x, y, z, qx, qy, qz, qw].

    Returns:
        np.ndarray: 4x4 transformation matrix.
    """
    x, y, z, qx, qy, qz, qw = pose
    r = R.from_quat([qx, qy, qz, qw])
    matrix = np.eye(4)
    matrix[:3, :3] = r.as_matrix()
    matrix[:3, 3] = [x, y, z]

    return matrix


def compute_xy_bbox(
    vertices: np.ndarray, col_x: int = 0, col_y: int = 2
) -> list[float]:
    x_vals = vertices[:, col_x]
    y_vals = vertices[:, col_y]
    return x_vals.min(), x_vals.max(), y_vals.min(), y_vals.max()


def has_iou_conflict(
    new_box: list[float],
    placed_boxes: list[list[float]],
    iou_threshold: float = 0.0,
) -> bool:
    new_min_x, new_max_x, new_min_y, new_max_y = new_box
    for min_x, max_x, min_y, max_y in placed_boxes:
        ix1 = max(new_min_x, min_x)
        iy1 = max(new_min_y, min_y)
        ix2 = min(new_max_x, max_x)
        iy2 = min(new_max_y, max_y)
        inter_area = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        if inter_area > iou_threshold:
            return True
    return False
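`has_iou_conflict` above rejects a candidate placement when its ground-plane bounding box overlaps any already-placed box by more than a threshold area. The pairwise overlap test is a standard axis-aligned box intersection; a stdlib-only sketch:

```python
def boxes_overlap(a, b, threshold=0.0):
    """a, b: (min_x, max_x, min_y, max_y). True if intersection area > threshold."""
    ix = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[2], b[2]))  # overlap height
    return ix * iy > threshold


hit = boxes_overlap((0, 2, 0, 2), (1, 3, 1, 3))   # 1x1 intersection
miss = boxes_overlap((0, 1, 0, 1), (2, 3, 2, 3))  # disjoint boxes
```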
def with_seed(seed_attr_name: str = "seed"):
    """A parameterized decorator that temporarily sets the random seed."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            seed = kwargs.get(seed_attr_name, None)
            if seed is not None:
                py_state = random.getstate()
                np_state = np.random.get_state()
                torch_state = torch.get_rng_state()

                random.seed(seed)
                np.random.seed(seed)
                torch.manual_seed(seed)
                try:
                    result = func(*args, **kwargs)
                finally:
                    random.setstate(py_state)
                    np.random.set_state(np_state)
                    torch.set_rng_state(torch_state)
                return result
            else:
                return func(*args, **kwargs)

        return wrapper

    return decorator
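The `with_seed` decorator above snapshots the Python, NumPy, and torch RNG states, seeds them for the wrapped call, and restores them afterwards, so a seeded placement cannot perturb the caller's random stream. The same mechanism, sketched for the stdlib `random` module only:

```python
import random
from functools import wraps


def with_seed(seed_kwarg="seed"):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            seed = kwargs.get(seed_kwarg)
            if seed is None:
                return func(*args, **kwargs)
            state = random.getstate()  # snapshot the caller's stream
            random.seed(seed)
            try:
                return func(*args, **kwargs)
            finally:
                random.setstate(state)  # restore even if func raises
        return wrapper
    return decorator


@with_seed()
def sample(seed=None):
    return random.random()
```

Two calls with the same seed return the same value, and the caller's own random sequence continues as if the seeded call never happened.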
def compute_convex_hull_path(
    vertices: np.ndarray,
    z_threshold: float = 0.05,
    interp_per_edge: int = 3,
    margin: float = -0.02,
) -> Path:
    top_vertices = vertices[
        vertices[:, 1] > vertices[:, 1].max() - z_threshold
    ]
    top_xy = top_vertices[:, [0, 2]]

    if len(top_xy) < 3:
        raise ValueError("Not enough points to form a convex hull")

    hull = ConvexHull(top_xy)
    hull_points = top_xy[hull.vertices]

    polygon = Polygon(hull_points)
    polygon = polygon.buffer(margin)
    hull_points = np.array(polygon.exterior.coords)

    dense_points = []
    for i in range(len(hull_points)):
        p1 = hull_points[i]
        p2 = hull_points[(i + 1) % len(hull_points)]
        for t in np.linspace(0, 1, interp_per_edge, endpoint=False):
            pt = (1 - t) * p1 + t * p2
            dense_points.append(pt)

    return Path(np.array(dense_points), closed=True)


def find_parent_node(node: str, tree: dict) -> str | None:
    for parent, children in tree.items():
        if any(child[0] == node for child in children):
            return parent
    return None


def all_corners_inside(hull: Path, box: list, threshold: int = 3) -> bool:
    x1, x2, y1, y2 = box
    corners = [[x1, y1], [x2, y1], [x1, y2], [x2, y2]]

    num_inside = sum(hull.contains_point(c) for c in corners)
    return num_inside >= threshold


def compute_axis_rotation_quat(
    axis: Literal["x", "y", "z"], angle_rad: float
) -> list[float]:
    if axis.lower() == 'x':
        q = Quaternion(axis=[1, 0, 0], angle=angle_rad)
    elif axis.lower() == 'y':
        q = Quaternion(axis=[0, 1, 0], angle=angle_rad)
    elif axis.lower() == 'z':
        q = Quaternion(axis=[0, 0, 1], angle=angle_rad)
    else:
        raise ValueError(f"Unsupported axis '{axis}', must be one of x, y, z")

    return [q.x, q.y, q.z, q.w]


def quaternion_multiply(
    init_quat: list[float], rotate_quat: list[float]
) -> list[float]:
    qx, qy, qz, qw = init_quat
    q1 = Quaternion(w=qw, x=qx, y=qy, z=qz)
    qx, qy, qz, qw = rotate_quat
    q2 = Quaternion(w=qw, x=qx, y=qy, z=qz)
    quat = q2 * q1

    return [quat.x, quat.y, quat.z, quat.w]
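`quaternion_multiply` above composes `rotate_quat` after `init_quat` via pyquaternion (`q2 * q1`). The equivalent Hamilton product written out in plain Python, keeping the same xyzw ordering as the code:

```python
def quat_multiply(init_quat, rotate_quat):
    """Compose rotate_quat after init_quat; both are [x, y, z, w]."""
    x1, y1, z1, w1 = init_quat
    x2, y2, z2, w2 = rotate_quat
    # Hamilton product q2 * q1: apply init_quat first, then rotate_quat.
    return [
        w2 * x1 + x2 * w1 + y2 * z1 - z2 * y1,
        w2 * y1 - x2 * z1 + y2 * w1 + z2 * x1,
        w2 * z1 + x2 * y1 - y2 * x1 + z2 * w1,
        w2 * w1 - x2 * x1 - y2 * y1 - z2 * z1,
    ]


half = 2 ** 0.5 / 2
q90z = [0.0, 0.0, half, half]      # 90 deg about z
q180z = quat_multiply(q90z, q90z)  # composes to 180 deg about z
```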
|
|
||||||
|
|
||||||
|
def check_reachable(
|
||||||
|
base_xyz: np.ndarray,
|
||||||
|
reach_xyz: np.ndarray,
|
||||||
|
min_reach: float = 0.25,
|
||||||
|
max_reach: float = 0.85,
|
||||||
|
) -> bool:
|
||||||
|
"""Check if the target point is within the reachable range."""
|
||||||
|
distance = np.linalg.norm(reach_xyz - base_xyz)
|
||||||
|
|
||||||
|
return min_reach < distance < max_reach
|
||||||
|
|
||||||
|
|
||||||
|
@with_seed("seed")
def bfs_placement(
    layout_info: LayoutInfo,
    floor_margin: float = 0,
    beside_margin: float = 0.1,
    max_attempts: int = 3000,
    rotate_objs: bool = True,
    rotate_bg: bool = True,
    limit_reach_range: bool = True,
    robot_dim: float = 0.12,
    seed: int = None,
) -> LayoutInfo:
    object_mapping = layout_info.objs_mapping
    position = {}  # node: [x, y, z, qx, qy, qz, qw]
    parent_bbox_xy = {}
    placed_boxes_map = defaultdict(list)
    mesh_info = defaultdict(dict)
    robot_node = layout_info.relation[Scene3DItemEnum.ROBOT.value]
    for node in object_mapping:
        if object_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
            bg_quat = (
                compute_axis_rotation_quat(
                    axis="y",
                    angle_rad=np.random.uniform(0, 2 * np.pi),
                )
                if rotate_bg
                else [0, 0, 0, 1]
            )
            bg_quat = [round(q, 4) for q in bg_quat]
            continue

        mesh_path = (
            f"{layout_info.assets[node]}/mesh/{node.replace(' ', '_')}.obj"
        )
        mesh_info[node]["path"] = mesh_path
        mesh = trimesh.load(mesh_path)
        vertices = mesh.vertices
        z1 = np.percentile(vertices[:, 1], 1)
        z2 = np.percentile(vertices[:, 1], 99)

        if object_mapping[node] == Scene3DItemEnum.CONTEXT.value:
            object_quat = [0, 0, 0, 1]
            mesh_info[node]["surface"] = compute_convex_hull_path(vertices)
            # Put robot at the CONTEXT edge.
            x, y = random.choice(mesh_info[node]["surface"].vertices)
            theta = np.arctan2(y, x)
            quat_initial = Quaternion(axis=[0, 0, 1], angle=theta)
            quat_extra = Quaternion(axis=[0, 0, 1], angle=np.pi)
            quat = quat_extra * quat_initial
            _pose = [x, y, z2 - z1, quat.x, quat.y, quat.z, quat.w]
            position[robot_node] = [round(v, 4) for v in _pose]
            node_box = [
                x - robot_dim / 2,
                x + robot_dim / 2,
                y - robot_dim / 2,
                y + robot_dim / 2,
            ]
            placed_boxes_map[node].append(node_box)
        elif rotate_objs:
            # For manipulated and distractor objects, apply random rotation
            angle_rad = np.random.uniform(0, 2 * np.pi)
            object_quat = compute_axis_rotation_quat(
                axis="y", angle_rad=angle_rad
            )
            object_quat_scipy = np.roll(object_quat, 1)  # [w, x, y, z]
            rotation = R.from_quat(object_quat_scipy).as_matrix()
            vertices = np.dot(mesh.vertices, rotation.T)
            z1 = np.percentile(vertices[:, 1], 1)
            z2 = np.percentile(vertices[:, 1], 99)

        x1, x2, y1, y2 = compute_xy_bbox(vertices)
        mesh_info[node]["pose"] = [x1, x2, y1, y2, z1, z2, *object_quat]
        mesh_info[node]["area"] = max(1e-5, (x2 - x1) * (y2 - y1))

    root = list(layout_info.tree.keys())[0]
    queue = deque([((root, None), layout_info.tree.get(root, []))])
    while queue:
        (node, relation), children = queue.popleft()
        if node not in object_mapping:
            continue

        if object_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
            position[node] = [0, 0, floor_margin, *bg_quat]
        else:
            x1, x2, y1, y2, z1, z2, qx, qy, qz, qw = mesh_info[node]["pose"]
            if object_mapping[node] == Scene3DItemEnum.CONTEXT.value:
                position[node] = [0, 0, -round(z1, 4), qx, qy, qz, qw]
                parent_bbox_xy[node] = [x1, x2, y1, y2, z1, z2]
            elif object_mapping[node] in [
                Scene3DItemEnum.MANIPULATED_OBJS.value,
                Scene3DItemEnum.DISTRACTOR_OBJS.value,
            ]:
                parent_node = find_parent_node(node, layout_info.tree)
                parent_pos = position[parent_node]
                (
                    p_x1,
                    p_x2,
                    p_y1,
                    p_y2,
                    p_z1,
                    p_z2,
                ) = parent_bbox_xy[parent_node]

                obj_dx = x2 - x1
                obj_dy = y2 - y1
                hull_path = mesh_info[parent_node].get("surface")
                for _ in range(max_attempts):
                    node_x1 = random.uniform(p_x1, p_x2 - obj_dx)
                    node_y1 = random.uniform(p_y1, p_y2 - obj_dy)
                    node_box = [
                        node_x1,
                        node_x1 + obj_dx,
                        node_y1,
                        node_y1 + obj_dy,
                    ]
                    if hull_path and not all_corners_inside(
                        hull_path, node_box
                    ):
                        continue
                    # Make sure the manipulated object is reachable by robot.
                    if (
                        limit_reach_range
                        and object_mapping[node]
                        == Scene3DItemEnum.MANIPULATED_OBJS.value
                    ):
                        cx = parent_pos[0] + node_box[0] + obj_dx / 2
                        cy = parent_pos[1] + node_box[2] + obj_dy / 2
                        cz = parent_pos[2] + p_z2 - z1
                        robot_pose = position[robot_node][:3]
                        if not check_reachable(
                            base_xyz=np.array(robot_pose),
                            reach_xyz=np.array([cx, cy, cz]),
                        ):
                            continue

                    if not has_iou_conflict(
                        node_box, placed_boxes_map[parent_node]
                    ):
                        z_offset = 0
                        break
                else:
                    logger.warning(
                        f"Cannot place {node} on {parent_node} without overlap"
                        f" after {max_attempts} attempts, place beside {parent_node}."
                    )
                    for _ in range(max_attempts):
                        node_x1 = random.choice(
                            [
                                random.uniform(
                                    p_x1 - obj_dx - beside_margin,
                                    p_x1 - obj_dx,
                                ),
                                random.uniform(p_x2, p_x2 + beside_margin),
                            ]
                        )
                        node_y1 = random.choice(
                            [
                                random.uniform(
                                    p_y1 - obj_dy - beside_margin,
                                    p_y1 - obj_dy,
                                ),
                                random.uniform(p_y2, p_y2 + beside_margin),
                            ]
                        )
                        node_box = [
                            node_x1,
                            node_x1 + obj_dx,
                            node_y1,
                            node_y1 + obj_dy,
                        ]
                        z_offset = -(parent_pos[2] + p_z2)
                        if not has_iou_conflict(
                            node_box, placed_boxes_map[parent_node]
                        ):
                            break

                placed_boxes_map[parent_node].append(node_box)

                abs_cx = parent_pos[0] + node_box[0] + obj_dx / 2
                abs_cy = parent_pos[1] + node_box[2] + obj_dy / 2
                abs_cz = parent_pos[2] + p_z2 - z1 + z_offset
                position[node] = [
                    round(v, 4)
                    for v in [abs_cx, abs_cy, abs_cz, qx, qy, qz, qw]
                ]
                parent_bbox_xy[node] = [x1, x2, y1, y2, z1, z2]

        sorted_children = sorted(
            children, key=lambda x: -mesh_info[x[0]].get("area", 0)
        )
        for child, rel in sorted_children:
            queue.append(((child, rel), layout_info.tree.get(child, [])))

    layout_info.position = position

    return layout_info
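`bfs_placement` defers overlap testing to `has_iou_conflict`, defined elsewhere in the module, whose boxes are axis-aligned `[x1, x2, y1, y2]` rectangles. A minimal sketch of such a conflict test (our assumption: any positive intersection area counts as a conflict; the names `boxes_overlap` and `iou_conflict` are ours):

```python
def boxes_overlap(box_a, box_b):
    """True if two [x1, x2, y1, y2] boxes intersect with positive area."""
    ax1, ax2, ay1, ay2 = box_a
    bx1, bx2, by1, by2 = box_b
    return (min(ax2, bx2) > max(ax1, bx1)) and (min(ay2, by2) > max(ay1, by1))

def iou_conflict(box, placed_boxes):
    # Candidate placement conflicts if it overlaps any already-placed box.
    return any(boxes_overlap(box, other) for other in placed_boxes)

print(iou_conflict([0, 1, 0, 1], [[0.5, 1.5, 0.5, 1.5]]))  # True: half-overlapping
print(iou_conflict([0, 1, 0, 1], [[2, 3, 2, 3]]))          # False: disjoint
```

Boxes that merely touch along an edge have zero intersection area and do not conflict under this test.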
def compose_mesh_scene(
    layout_info: LayoutInfo, out_scene_path: str, with_bg: bool = False
) -> None:
    object_mapping = Scene3DItemEnum.object_mapping(layout_info.relation)
    scene = trimesh.Scene()
    for node in layout_info.assets:
        if object_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
            mesh_path = f"{layout_info.assets[node]}/mesh_model.ply"
            if not with_bg:
                continue
        else:
            mesh_path = (
                f"{layout_info.assets[node]}/mesh/{node.replace(' ', '_')}.obj"
            )

        mesh = trimesh.load(mesh_path)
        offset = np.array(layout_info.position[node])[[0, 2, 1]]
        mesh.vertices += offset
        scene.add_geometry(mesh, node_name=node)

    os.makedirs(os.path.dirname(out_scene_path), exist_ok=True)
    scene.export(out_scene_path)
    logger.info(f"Composed interactive 3D layout saved in {out_scene_path}")

    return
def compute_pinhole_intrinsics(
    image_w: int, image_h: int, fov_deg: float
) -> np.ndarray:
    fov_rad = np.deg2rad(fov_deg)
    fx = image_w / (2 * np.tan(fov_rad / 2))
    fy = fx  # assuming square pixels
    cx = image_w / 2
    cy = image_h / 2
    K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])

    return K
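`compute_pinhole_intrinsics` follows the standard pinhole model, `fx = W / (2 tan(fov / 2))`. For a 512x512 image with a 90° FOV, `tan(45°) = 1`, so `fx = fy = 256` and the principal point sits at (256, 256). A dependency-free sketch:

```python
import math

def pinhole_K(w, h, fov_deg):
    fov = math.radians(fov_deg)
    fx = w / (2 * math.tan(fov / 2))
    # Square pixels assumed, as in compute_pinhole_intrinsics above.
    return [[fx, 0, w / 2], [0, fx, h / 2], [0, 0, 1]]

K = pinhole_K(512, 512, 90)
print(K[0][0])  # ~256.0 (up to floating-point rounding)
```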
@@ -166,7 +166,7 @@ def combine_images_to_grid(
     images: list[str | Image.Image],
     cat_row_col: tuple[int, int] = None,
     target_wh: tuple[int, int] = (512, 512),
-) -> list[str | Image.Image]:
+) -> list[Image.Image]:
     n_images = len(images)
     if n_images == 1:
         return images
@@ -377,6 +377,42 @@ def parse_text_prompts(prompts: list[str]) -> list[str]:
     return prompts


+def alpha_blend_rgba(
+    fg_image: Union[str, Image.Image, np.ndarray],
+    bg_image: Union[str, Image.Image, np.ndarray],
+) -> Image.Image:
+    """Alpha blends a foreground RGBA image over a background RGBA image.
+
+    Args:
+        fg_image: Foreground image. Can be a file path (str), a PIL Image,
+            or a NumPy ndarray.
+        bg_image: Background image. Can be a file path (str), a PIL Image,
+            or a NumPy ndarray.
+
+    Returns:
+        A PIL Image representing the alpha-blended result in RGBA mode.
+    """
+    if isinstance(fg_image, str):
+        fg_image = Image.open(fg_image)
+    elif isinstance(fg_image, np.ndarray):
+        fg_image = Image.fromarray(fg_image)
+
+    if isinstance(bg_image, str):
+        bg_image = Image.open(bg_image)
+    elif isinstance(bg_image, np.ndarray):
+        bg_image = Image.fromarray(bg_image)
+
+    if fg_image.size != bg_image.size:
+        raise ValueError(
+            f"Image sizes do not match: {fg_image.size} vs. {bg_image.size}."
+        )
+
+    fg = fg_image.convert("RGBA")
+    bg = bg_image.convert("RGBA")
+
+    return Image.alpha_composite(bg, fg)
+
+
 def check_object_edge_truncated(
     mask: np.ndarray, edge_threshold: int = 5
 ) -> bool:
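`Image.alpha_composite` implements the standard Porter-Duff "over" operator: per pixel, `out_a = fg_a + bg_a * (1 - fg_a)` and `out_rgb = (fg_rgb * fg_a + bg_rgb * bg_a * (1 - fg_a)) / out_a`. A scalar sketch on normalized [0, 1] values:

```python
def over(fg_rgb, fg_a, bg_rgb, bg_a):
    """Porter-Duff 'over' for one pixel; channels and alphas in [0, 1]."""
    out_a = fg_a + bg_a * (1 - fg_a)
    if out_a == 0:
        # Both layers fully transparent: nothing to show.
        return (0.0, 0.0, 0.0), 0.0
    out_rgb = tuple(
        (f * fg_a + b * bg_a * (1 - fg_a)) / out_a
        for f, b in zip(fg_rgb, bg_rgb)
    )
    return out_rgb, out_a

# A fully opaque foreground hides the background entirely.
rgb, a = over((1.0, 0.0, 0.0), 1.0, (0.0, 0.0, 1.0), 1.0)
print(rgb, a)  # (1.0, 0.0, 0.0) 1.0
```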
@@ -400,8 +436,15 @@ def check_object_edge_truncated(


 if __name__ == "__main__":
-    merge_video_video(
-        "outputs/imageto3d/room_bottle7/room_bottle_007/URDF_room_bottle_007/mesh_glo_normal.mp4",  # noqa
-        "outputs/imageto3d/room_bottle7/room_bottle_007/URDF_room_bottle_007/mesh.mp4",  # noqa
-        "merge.mp4",
-    )
+    image_paths = [
+        "outputs/layouts_sim/task_0000/images/pen.png",
+        "outputs/layouts_sim/task_0000/images/notebook.png",
+        "outputs/layouts_sim/task_0000/images/mug.png",
+        "outputs/layouts_sim/task_0000/images/lamp.png",
+        "outputs/layouts_sim2/task_0014/images/cloth.png",  # TODO
+    ]
+    for image_path in image_paths:
+        image = cv2.imread(image_path, cv2.IMREAD_UNCHANGED)
+        mask = image[..., -1]
+        flag = check_object_edge_truncated(mask)
+        print(flag, image_path)

706 embodied_gen/utils/simulation.py (new file)
@@ -0,0 +1,706 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.


import logging
import os
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Literal

import mplib
import numpy as np
import sapien.core as sapien
import torch
from PIL import Image, ImageColor
from scipy.spatial.transform import Rotation as R
from tqdm import tqdm

from embodied_gen.data.utils import DiffrastRender
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.geometry import quaternion_multiply

logger = logging.getLogger(__name__)

from mani_skill.examples.motionplanning.panda.utils import (
    compute_grasp_info_by_obb,
)
from mani_skill.utils.geometry.trimesh_utils import get_component_mesh

COLORMAP = list(set(ImageColor.colormap.values()))
COLOR_PALETTE = np.array(
    [ImageColor.getrgb(c) for c in COLORMAP], dtype=np.uint8
)
# Aligns the SAPIEN/MuJoCo coordinate system with the world coordinate system.
SIM_COORD_ALIGN = np.array(
    [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, -1.0, 0.0, 0.0],
        [0.0, 0.0, -1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
)

__all__ = [
    "SIM_COORD_ALIGN",
    "FrankaPandaGrasper",
    "SimpleGrasper",
]
class SapienSceneManager:
    """A class to manage the SAPIEN simulator."""

    def __init__(
        self, sim_freq: int, ray_tracing: bool, device: str = "cuda"
    ) -> None:
        self.sim_freq = sim_freq
        self.ray_tracing = ray_tracing
        self.device = device
        self.renderer = sapien.SapienRenderer()
        self.scene = self._setup_scene()
        self.cameras: list[sapien.render.RenderCameraComponent] = []
        self.actors: dict[str, sapien.pysapien.Entity] = {}

    def _setup_scene(self) -> sapien.Scene:
        """Set up the SAPIEN scene with lighting and ground."""
        # Ray tracing settings
        if self.ray_tracing:
            sapien.render.set_camera_shader_dir("rt")
            sapien.render.set_ray_tracing_samples_per_pixel(64)
            sapien.render.set_ray_tracing_path_depth(10)
            sapien.render.set_ray_tracing_denoiser("oidn")

        # TODO: Support GPU.
        # import sapien.physx as physx
        # physx.enable_gpu()
        # scene = sapien.Scene(systems=[sapien.physx.PhysxGpuSystem(), sapien.render.RenderSystem()])

        scene = sapien.Scene()
        scene.set_timestep(1 / self.sim_freq)

        # Add lighting
        scene.set_ambient_light([0.2, 0.2, 0.2])
        scene.add_directional_light(
            direction=[0, 1, -1],
            color=[1.5, 1.45, 1.4],
            shadow=True,
            shadow_map_size=2048,
        )
        scene.add_directional_light(
            direction=[0, -0.5, 1], color=[0.8, 0.8, 0.85], shadow=False
        )
        scene.add_directional_light(
            direction=[0, -1, 1], color=[1.0, 1.0, 1.0], shadow=False
        )

        ground_material = self.renderer.create_material()
        ground_material.base_color = [0.5, 0.5, 0.5, 1]  # RGBA, gray
        ground_material.roughness = 0.7
        ground_material.metallic = 0.0
        scene.add_ground(0, render_material=ground_material)

        return scene

    def step_action(
        self,
        action: torch.Tensor,
        sim_steps_per_control: int,
        cameras: list[sapien.render.RenderCameraComponent],
        render_keys: list[str],
    ) -> dict:
        self.robot.set_action(action)
        frames = defaultdict(list)
        for _ in range(sim_steps_per_control):
            self.scene.step()

        self.scene.update_render()
        for camera in cameras:
            camera.take_picture()
            images = self.render_images(camera, render_keys=render_keys)
            frames[camera.name].append(images)

        return frames
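`step_action` holds one control action for `sim_steps_per_control` physics steps, which is how a low-frequency controller drives a high-frequency simulator. The relation `sim_steps_per_control = sim_freq // control_freq` is our reading of the `sim_freq`/`control_freq` parameters, not stated in this file:

```python
def steps_per_control(sim_freq, control_freq):
    # The simulator integrates sim_freq times per second; the controller
    # updates only control_freq times, so each action is repeated.
    assert sim_freq % control_freq == 0, "sim_freq should be a multiple of control_freq"
    return sim_freq // control_freq

print(steps_per_control(120, 20))  # 6 physics steps per control step
```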
    def load_exists_robot(
        self,
        robot_name: str,
        pose: sapien.Pose,
        control_freq: int = 20,
        robot_init_qpos_noise: float = 0.02,
        control_mode: str = "pd_joint_pos",
        backend_str: tuple[str, str] = ("cpu", "gpu"),
    ):
        from mani_skill.agents import REGISTERED_AGENTS
        from mani_skill.envs.scene import ManiSkillScene
        from mani_skill.envs.utils.system.backend import (
            parse_sim_and_render_backend,
        )

        if robot_name == "franka":
            robot_name = "panda"  # remapping
        if robot_name not in REGISTERED_AGENTS:
            raise KeyError(
                f"Robot `{robot_name}` not registered, choose from {REGISTERED_AGENTS.keys()}."
            )

        ROBOT_CLS = REGISTERED_AGENTS[robot_name].agent_cls
        # backend = parse_sim_and_render_backend("physx_cuda", "gpu")
        backend = parse_sim_and_render_backend(*backend_str)
        scene = ManiSkillScene(
            [self.scene], device=backend_str[0], backend=backend
        )
        robot = ROBOT_CLS(
            scene=scene,
            control_freq=control_freq,
            control_mode=control_mode,
            initial_pose=pose,
        )
        # Set the robot's initial joint angles (joint0-joint6 plus two fingers).
        qpos = np.array(
            [
                0.0,
                np.pi / 8,
                0,
                -np.pi * 3 / 8,
                0,
                np.pi * 3 / 4,
                np.pi / 4,
                0.04,
                0.04,
            ]
        )
        qpos = (
            np.random.normal(
                0, robot_init_qpos_noise, (len(scene.sub_scenes), len(qpos))
            )
            + qpos
        )
        qpos[:, -2:] = 0.04
        robot.reset(qpos)
        robot.init_qpos = robot.robot.qpos
        robot.controller.controllers["gripper"].reset()

        return robot
    def load_actor_from_urdf(
        self,
        file_path: str,
        pose: sapien.Pose,
        use_static: bool = False,
        update_mass: bool = False,
    ) -> sapien.pysapien.Entity:
        """Load an actor from a URDF file into the scene.

        Args:
            file_path (str): Path to the URDF file.
            pose (sapien.Pose): Pose of the actor.
            use_static (bool): Whether to create a static actor.
            update_mass (bool): Whether to update the actor's mass from the URDF.

        Returns:
            sapien.pysapien.Entity: The loaded actor.
        """
        tree = ET.parse(file_path)
        root = tree.getroot()
        node_name = root.get("name")
        file_dir = os.path.dirname(file_path)
        visual_file = root.find(".//visual/geometry/mesh").get("filename")
        collision_file = root.find(".//collision/geometry/mesh").get(
            "filename"
        )
        visual_file = os.path.join(file_dir, visual_file)
        collision_file = os.path.join(file_dir, collision_file)
        static_fric = root.find(".//collision/gazebo/mu1").text
        dynamic_fric = root.find(".//collision/gazebo/mu2").text

        material = self.scene.create_physical_material(
            static_friction=np.clip(float(static_fric), 0.1, 0.7),
            dynamic_friction=np.clip(float(dynamic_fric), 0.1, 0.6),
            restitution=0.05,
        )
        builder = self.scene.create_actor_builder()
        body_type = "static" if use_static else "dynamic"
        builder.set_physx_body_type(body_type)
        builder.add_multiple_convex_collisions_from_file(
            collision_file, material=material
        )
        builder.add_visual_from_file(visual_file)
        actor = builder.build(name=node_name)
        actor.set_name(node_name)
        actor.set_pose(pose)

        if update_mass and hasattr(actor.components[1], "mass"):
            node_mass = float(root.find(".//inertial/mass").get("value"))
            actor.components[1].set_mass(node_mass)

        return actor
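`load_actor_from_urdf` pulls the visual/collision meshes and friction out of the URDF with plain ElementTree XPath queries. A self-contained sketch on a minimal URDF string (the file content and values here are invented for illustration; the `gazebo/mu1` placement mirrors the queries above):

```python
import xml.etree.ElementTree as ET

URDF = """
<robot name="mug">
  <link name="base">
    <visual><geometry><mesh filename="mesh/mug.obj"/></geometry></visual>
    <collision>
      <geometry><mesh filename="mesh/mug_collision.obj"/></geometry>
      <gazebo><mu1>0.5</mu1><mu2>0.4</mu2></gazebo>
    </collision>
  </link>
</robot>
"""

root = ET.fromstring(URDF)
# Same XPath queries as in load_actor_from_urdf.
visual = root.find(".//visual/geometry/mesh").get("filename")
static_fric = float(root.find(".//collision/gazebo/mu1").text)
print(root.get("name"), visual, static_fric)  # mug mesh/mug.obj 0.5
```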
    def create_camera(
        self,
        cam_name: str,
        pose: sapien.Pose,
        image_hw: tuple[int, int],
        fovy_deg: float,
    ) -> sapien.render.RenderCameraComponent:
        """Create a single camera in the scene.

        Args:
            cam_name (str): Name of the camera.
            pose (sapien.Pose): Camera pose, p=(x, y, z), q=(w, x, y, z).
            image_hw (tuple[int, int]): Image resolution (height, width).
            fovy_deg (float): Vertical field of view in degrees.

        Returns:
            sapien.render.RenderCameraComponent: The created camera.
        """
        cam_actor = self.scene.create_actor_builder().build_kinematic()
        cam_actor.set_pose(pose)
        camera = self.scene.add_mounted_camera(
            name=cam_name,
            mount=cam_actor,
            pose=sapien.Pose(p=[0, 0, 0], q=[1, 0, 0, 0]),
            width=image_hw[1],
            height=image_hw[0],
            fovy=np.deg2rad(fovy_deg),
            near=0.01,
            far=100,
        )
        self.cameras.append(camera)

        return camera
    def initialize_circular_cameras(
        self,
        num_cameras: int,
        radius: float,
        height: float,
        target_pt: list[float],
        image_hw: tuple[int, int],
        fovy_deg: float,
    ) -> list[sapien.render.RenderCameraComponent]:
        """Initialize multiple cameras arranged in a circle.

        Args:
            num_cameras (int): Number of cameras to create.
            radius (float): Radius of the camera circle.
            height (float): Fixed Z-coordinate of the cameras.
            target_pt (list[float]): 3D point (x, y, z) that cameras look at.
            image_hw (tuple[int, int]): Image resolution (height, width).
            fovy_deg (float): Vertical field of view in degrees.

        Returns:
            list[sapien.render.RenderCameraComponent]: List of created cameras.
        """
        angle_step = 2 * np.pi / num_cameras
        world_up_vec = np.array([0.0, 0.0, 1.0])
        target_pt = np.array(target_pt)

        for i in range(num_cameras):
            angle = i * angle_step
            cam_x = radius * np.cos(angle)
            cam_y = radius * np.sin(angle)
            cam_z = height
            eye_pos = [cam_x, cam_y, cam_z]

            forward_vec = target_pt - eye_pos
            forward_vec = forward_vec / np.linalg.norm(forward_vec)
            temp_right_vec = np.cross(forward_vec, world_up_vec)

            if np.linalg.norm(temp_right_vec) < 1e-6:
                temp_right_vec = np.array([1.0, 0.0, 0.0])
            if np.abs(np.dot(temp_right_vec, forward_vec)) > 0.99:
                temp_right_vec = np.array([0.0, 1.0, 0.0])

            right_vec = temp_right_vec / np.linalg.norm(temp_right_vec)
            up_vec = np.cross(right_vec, forward_vec)
            rotation_matrix = np.array([forward_vec, -right_vec, up_vec]).T

            rot = R.from_matrix(rotation_matrix)
            scipy_quat = rot.as_quat()  # (x, y, z, w)
            quat = [
                scipy_quat[3],
                scipy_quat[0],
                scipy_quat[1],
                scipy_quat[2],
            ]  # (w, x, y, z)

            self.create_camera(
                f"camera_{i}",
                sapien.Pose(p=eye_pos, q=quat),
                image_hw,
                fovy_deg,
            )

        return self.cameras
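`initialize_circular_cameras` builds each pose by a look-at construction: forward points at the target, right = forward x world-up, up = right x forward, assembled column-wise as [forward, -right, up]. The key invariant is that the three axes come out as mutually orthogonal unit vectors. A dependency-free sketch of the axis construction (helper names are ours; degenerate parallel cases are omitted):

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def norm(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def look_at_axes(eye, target, world_up=(0.0, 0.0, 1.0)):
    forward = norm(sub(target, eye))
    right = norm(cross(forward, world_up))  # assumes forward not parallel to up
    up = cross(right, forward)
    return forward, right, up

f, r, u = look_at_axes([2.0, 0.0, 1.0], [0.0, 0.0, 0.0])
# All pairwise dot products vanish: the frame is orthogonal.
print(abs(dot(f, r)) < 1e-9, abs(dot(f, u)) < 1e-9, abs(dot(r, u)) < 1e-9)
```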
    def render_images(
        self,
        camera: sapien.render.RenderCameraComponent,
        render_keys: list[
            Literal[
                "Color",
                "Segmentation",
                "Normal",
                "Mask",
                "Depth",
                "Foreground",
            ]
        ] = None,
    ) -> dict[str, Image.Image]:
        """Render images from a given camera.

        Args:
            camera (sapien.render.RenderCameraComponent): The camera to render from.
            render_keys (list[str]): Types of images to render (e.g., Color, Segmentation).

        Returns:
            dict[str, Image.Image]: Dictionary of rendered images.
        """
        if render_keys is None:
            render_keys = [
                "Color",
                "Segmentation",
                "Normal",
                "Mask",
                "Depth",
                "Foreground",
            ]

        results: dict[str, Image.Image] = {}
        if "Color" in render_keys:
            color = camera.get_picture("Color")
            color_rgb = (np.clip(color[..., :3], 0, 1) * 255).astype(np.uint8)
            results["Color"] = Image.fromarray(color_rgb)

        if "Mask" in render_keys:
            color = camera.get_picture("Color")
            alpha = (np.clip(color[..., 3], 0, 1) * 255).astype(np.uint8)
            results["Mask"] = Image.fromarray(alpha)

        if "Segmentation" in render_keys:
            seg_labels = camera.get_picture("Segmentation")
            label0 = seg_labels[..., 0].astype(np.uint8)
            # COLOR_PALETTE is a module-level constant, not a class attribute.
            seg_color = COLOR_PALETTE[label0]
            results["Segmentation"] = Image.fromarray(seg_color)

        if "Foreground" in render_keys:
            seg_labels = camera.get_picture("Segmentation")
            label0 = seg_labels[..., 0]
            mask = np.where((label0 > 1), 255, 0).astype(np.uint8)
            color = camera.get_picture("Color")
            color_rgb = (np.clip(color[..., :3], 0, 1) * 255).astype(np.uint8)
            foreground = np.concatenate([color_rgb, mask[..., None]], axis=-1)
            results["Foreground"] = Image.fromarray(foreground)

        if "Normal" in render_keys:
            normal = camera.get_picture("Normal")[..., :3]
            normal_img = (((normal + 1) / 2) * 255).astype(np.uint8)
            results["Normal"] = Image.fromarray(normal_img)

        if "Depth" in render_keys:
            position_map = camera.get_picture("Position")
            depth = -position_map[..., 2]
            color = camera.get_picture("Color")  # alpha marks valid pixels
            alpha = torch.tensor(color[..., 3], dtype=torch.float32)
            norm_depth = DiffrastRender.normalize_map_by_mask(
                torch.tensor(depth), alpha
            )
            depth_img = (norm_depth * 255).to(torch.uint8).numpy()
            results["Depth"] = Image.fromarray(depth_img)

        return results
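The segmentation branch colors a per-pixel integer label map by indexing a palette array (`COLOR_PALETTE[label0]`): each label value selects one RGB row. The same idea in plain Python, with a made-up three-color palette:

```python
PALETTE = [(0, 0, 0), (255, 0, 0), (0, 255, 0)]  # label -> RGB

labels = [
    [0, 1],
    [2, 1],
]
# Each integer label picks its palette entry, pixel by pixel.
colored = [[PALETTE[v] for v in row] for row in labels]
print(colored[0][1])  # (255, 0, 0): label 1 maps to red
```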
    def load_assets_from_layout_file(
        self,
        layout: LayoutInfo,
        z_offset: float = 0.0,
        init_quat: list[float] = [0, 0, 0, 1],
        robot_name: str = None,
        control_freq: int = 40,
        robot_init_qpos_noise: float = 0,
    ) -> None:
        """Load assets from `EmbodiedGen` layout-gen output and create actors in the scene.

        Args:
            layout (LayoutInfo): The layout information data.
            z_offset (float): Offset to apply to the Z-coordinate of non-context objects.
            init_quat (list[float]): Initial quaternion (x, y, z, w) for orientation adjustment.
            robot_name (str): Name of the robot to load; if None, no robot is loaded.
            control_freq (int): Control frequency for the robot, default is 40.
            robot_init_qpos_noise (float): Noise added to the robot's initial joint positions, default is 0.
        """
        for node in layout.assets:
            file_dir = layout.assets[node]
            file_name = f"{node.replace(' ', '_')}.urdf"
            urdf_file = os.path.join(file_dir, file_name)

            if layout.objs_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
                continue

            position = layout.position[node].copy()
            if layout.objs_mapping[node] != Scene3DItemEnum.CONTEXT.value:
                position[2] += z_offset

            use_static = (
                layout.relation.get(Scene3DItemEnum.CONTEXT.value, None)
                == node
            )

            # Combine initial quaternion with object quaternion
            x, y, z, qx, qy, qz, qw = position
            qx, qy, qz, qw = quaternion_multiply([qx, qy, qz, qw], init_quat)
            actor = self.load_actor_from_urdf(
                urdf_file,
                sapien.Pose(p=[x, y, z], q=[qw, qx, qy, qz]),
                use_static=use_static,
                update_mass=False,
            )
            self.actors[node] = actor

        if robot_name is not None:
            robot_node = layout.relation[Scene3DItemEnum.ROBOT.value]
            x, y, z, qx, qy, qz, qw = layout.position[robot_node]
            delta_z = 0.002  # Add a small offset to avoid collision.
            pose = sapien.Pose([x, y, z + delta_z], [qw, qx, qy, qz])
            self.robot = self.load_exists_robot(
                robot_name, pose, control_freq, robot_init_qpos_noise
            )

        return
class FrankaPandaGrasper(object):
    """Grasp planner for the Franka Panda robot arm.

    Args:
        robot: See `mani_skill.agents.robots.panda.panda`.
        scene: The `SapienSceneManager` object.
        sim_steps_per_control: Number of simulation steps per control step.
        control_timestep: Time step for planning.
        joint_vel_limits: Velocity limit for all joints (default: 0.9).
        joint_acc_limits: Acceleration limit for all joints (default: 0.9).
        finger_length: Length of the robot's fingers (default: 0.025).
    """

    def __init__(
        self,
        robot,
        scene: SapienSceneManager,
        sim_steps_per_control: int,
        control_timestep: float,
        joint_vel_limits: float = 0.9,
        joint_acc_limits: float = 0.9,
        finger_length: float = 0.025,
    ) -> None:
        self.agent = robot
        self.robot = robot.robot
        self.scene = scene
        self.sim_steps_per_control = sim_steps_per_control
        self.control_timestep = control_timestep
        self.joint_vel_limits = joint_vel_limits
        self.joint_acc_limits = joint_acc_limits
        self.finger_length = finger_length
        self.planner = self._setup_planner()

    def _setup_planner(self) -> mplib.Planner:
        link_names = [link.get_name() for link in self.robot.get_links()]
        joint_names = [
            joint.get_name() for joint in self.robot.get_active_joints()
        ]

        planner = mplib.Planner(
            urdf=self.agent.urdf_path,
            srdf=self.agent.urdf_path.replace(".urdf", ".srdf"),
            user_link_names=link_names,
            user_joint_names=joint_names,
            move_group="panda_hand_tcp",
            joint_vel_limits=np.ones(7) * self.joint_vel_limits,
            joint_acc_limits=np.ones(7) * self.joint_acc_limits,
        )
        planner.set_base_pose(self.robot.pose.raw_pose[0].tolist())

        return planner

    def update_state(
        self,
        result: dict,
        gripper_state: Literal[-1, 1],
        sim_steps_per_control: int,
        cameras: list[sapien.render.RenderCameraComponent],
        render_keys: list[str],
    ) -> dict[str, list[Image.Image]]:
        n_step = len(result["position"])
        total_frames = defaultdict(list)
        for i in tqdm(range(n_step), desc="Grasping"):
            qpos = result["position"][min(i, n_step - 1)]
            action = np.hstack([qpos, gripper_state])[None, ...]
            action = torch.from_numpy(action).float()
            frames = self.scene.step_action(
                action, sim_steps_per_control, cameras, render_keys
            )
            for camera in cameras:
                total_frames[camera.name].extend(frames[camera.name])

        return total_frames

    def control_gripper(
        self,
        sim_steps_per_control: int,
        cameras: list[sapien.render.RenderCameraComponent],
        render_keys: list[str],
        gripper_state: Literal[-1, 1],
        n_step: int = 10,
    ) -> dict[str, list[Image.Image]]:
        total_frames = defaultdict(list)
        qpos = self.robot.get_qpos()[0, :-2].cpu().numpy()
        for _ in range(n_step):
            action = np.hstack([qpos, gripper_state])[None, ...]
            action = torch.from_numpy(action).float()
            frames = self.scene.step_action(
                action, sim_steps_per_control, cameras, render_keys
            )
            for camera in cameras:
                total_frames[camera.name].extend(frames[camera.name])

        return total_frames

    def move_to_pose(
        self,
        pose: sapien.Pose,
        sim_steps_per_control: int,
        control_timestep: float,
        cameras: list[sapien.render.RenderCameraComponent],
        render_keys: list[str],
        gripper_state: Literal[-1, 1],
        use_point_cloud: bool = False,
        n_max_step: int = 100,
    ) -> dict[str, list[Image.Image]]:
        # First try screw motion planning
        result = self.planner.plan_screw(
            np.concatenate([pose.p, pose.q]),
            self.robot.get_qpos().cpu().numpy()[0],
            time_step=control_timestep,
            use_point_cloud=use_point_cloud,
        )

        # Fallback to standard planning if screw fails
|
||||||
|
if result["status"] != "Success":
|
||||||
|
result = self.planner.plan_qpos_to_pose(
|
||||||
|
np.concatenate([pose.p, pose.q]),
|
||||||
|
self.robot.get_qpos().cpu().numpy()[0],
|
||||||
|
time_step=control_timestep,
|
||||||
|
use_point_cloud=use_point_cloud,
|
||||||
|
)
|
||||||
|
|
||||||
|
if result["status"] != "Success":
|
||||||
|
return defaultdict(list)
|
||||||
|
|
||||||
|
for key in result:
|
||||||
|
if key in ["status", "duration"]:
|
||||||
|
continue
|
||||||
|
sample_ratio = (len(result[key]) // n_max_step) + 1
|
||||||
|
result[key] = result[key][::sample_ratio]
|
||||||
|
|
||||||
|
frames = self.update_state(
|
||||||
|
result, gripper_state, sim_steps_per_control, cameras, render_keys
|
||||||
|
)
|
||||||
|
|
||||||
|
return frames
|
||||||
|
|
||||||
|
def render_grasp(
|
||||||
|
self,
|
||||||
|
actor: sapien.pysapien.Entity,
|
||||||
|
cameras: list[sapien.render.RenderCameraComponent],
|
||||||
|
render_keys: list[
|
||||||
|
Literal[
|
||||||
|
"Color",
|
||||||
|
"Foreground",
|
||||||
|
"Segmentation",
|
||||||
|
"Normal",
|
||||||
|
"Mask",
|
||||||
|
"Depth",
|
||||||
|
]
|
||||||
|
],
|
||||||
|
reach_target_only: bool = True,
|
||||||
|
) -> dict[str, list[Image.Image]]:
|
||||||
|
physx_rigid = actor.components[1]
|
||||||
|
mesh = get_component_mesh(physx_rigid, to_world_frame=True)
|
||||||
|
obb = mesh.bounding_box_oriented
|
||||||
|
approaching = np.array([0, 0, -1])
|
||||||
|
target_closing = (
|
||||||
|
self.agent.tcp.pose.to_transformation_matrix()[0, :3, 1]
|
||||||
|
.cpu()
|
||||||
|
.numpy()
|
||||||
|
)
|
||||||
|
grasp_info = compute_grasp_info_by_obb(
|
||||||
|
obb,
|
||||||
|
approaching=approaching,
|
||||||
|
target_closing=target_closing,
|
||||||
|
depth=self.finger_length,
|
||||||
|
)
|
||||||
|
closing, center = grasp_info["closing"], grasp_info["center"]
|
||||||
|
raw_tcp_pose = self.agent.tcp.pose.sp
|
||||||
|
grasp_pose = self.agent.build_grasp_pose(approaching, closing, center)
|
||||||
|
reach_pose = grasp_pose * sapien.Pose(p=[0, 0, -0.05])
|
||||||
|
grasp_pose = grasp_pose * sapien.Pose(p=[0, 0, 0.01])
|
||||||
|
|
||||||
|
total_frames = defaultdict(list)
|
||||||
|
frames_list = []
|
||||||
|
reach_frames = self.move_to_pose(
|
||||||
|
reach_pose,
|
||||||
|
self.sim_steps_per_control,
|
||||||
|
self.control_timestep,
|
||||||
|
cameras,
|
||||||
|
render_keys,
|
||||||
|
gripper_state=1,
|
||||||
|
)
|
||||||
|
frames_list.append(reach_frames)
|
||||||
|
if len(reach_frames) == 0:
|
||||||
|
logger.warning(
|
||||||
|
f"Failed to reach the grasp pose for node `{actor.name}`, skipping grasping."
|
||||||
|
)
|
||||||
|
return total_frames
|
||||||
|
|
||||||
|
if not reach_target_only:
|
||||||
|
grasp_frames = self.move_to_pose(
|
||||||
|
grasp_pose,
|
||||||
|
self.sim_steps_per_control,
|
||||||
|
self.control_timestep,
|
||||||
|
cameras,
|
||||||
|
render_keys,
|
||||||
|
gripper_state=1,
|
||||||
|
)
|
||||||
|
frames_list.append(grasp_frames)
|
||||||
|
close_frames = self.control_gripper(
|
||||||
|
self.sim_steps_per_control,
|
||||||
|
cameras,
|
||||||
|
render_keys,
|
||||||
|
gripper_state=-1,
|
||||||
|
)
|
||||||
|
frames_list.append(close_frames)
|
||||||
|
back_frames = self.move_to_pose(
|
||||||
|
raw_tcp_pose,
|
||||||
|
self.sim_steps_per_control,
|
||||||
|
self.control_timestep,
|
||||||
|
cameras,
|
||||||
|
render_keys,
|
||||||
|
gripper_state=-1,
|
||||||
|
)
|
||||||
|
frames_list.append(back_frames)
|
||||||
|
|
||||||
|
for frame_dict in frames_list:
|
||||||
|
for cam_name, frames in frame_dict.items():
|
||||||
|
total_frames[cam_name].extend(frames)
|
||||||
|
|
||||||
|
self.agent.reset(self.agent.init_qpos)
|
||||||
|
|
||||||
|
return total_frames
|
||||||
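The `move_to_pose` method above caps any planned trajectory at roughly `n_max_step` waypoints by striding each result array. A standalone sketch of that downsampling step (the `downsample` helper name is hypothetical, not part of the codebase):

```python
def downsample(waypoints: list, n_max_step: int = 100) -> list:
    # Same logic as move_to_pose: stride so the output never exceeds
    # n_max_step entries while always keeping the first waypoint.
    sample_ratio = (len(waypoints) // n_max_step) + 1
    return waypoints[::sample_ratio]


print(len(downsample(list(range(250)))))  # ratio 250 // 100 + 1 = 3 -> 84 waypoints
print(len(downsample(list(range(80)))))   # ratio 1 -> trajectory unchanged (80)
```

Since `sample_ratio > len / n_max_step`, the result is always at most `n_max_step` waypoints, though it can undershoot by nearly 2x (e.g. a 101-entry trajectory strides to 51).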
@@ -1 +1 @@
-VERSION = "v0.1.2"
+VERSION = "v0.1.3"
@@ -109,7 +109,7 @@ class MeshGeoChecker(BaseChecker):
         if self.prompt is None:
             self.prompt = """
 You are an expert in evaluating the geometry quality of generated 3D asset.
-You will be given rendered views of a generated 3D asset with black background.
+You will be given rendered views of a generated 3D asset, type {}, with black background.
 Your task is to evaluate the quality of the 3D asset generation,
 including geometry, structure, and appearance, based on the rendered views.
 Criteria:
@@ -130,10 +130,13 @@ class MeshGeoChecker(BaseChecker):
 Image shows a chair with simplified back legs and soft edges → YES
 """
 
-    def query(self, image_paths: list[str | Image.Image]) -> str:
+    def query(
+        self, image_paths: list[str | Image.Image], text: str = "unknown"
+    ) -> str:
+        input_prompt = self.prompt.format(text)
+
         return self.gpt_client.query(
-            text_prompt=self.prompt,
+            text_prompt=input_prompt,
             image_base64=image_paths,
         )
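The change above threads the asset category into the checker prompt through a `{}` placeholder filled by `str.format` at query time. A minimal sketch of the pattern (prompt text shortened for illustration):

```python
prompt = (
    "You are an expert in evaluating the geometry quality of generated 3D asset.\n"
    "You will be given rendered views of a generated 3D asset, "
    "type {}, with black background."
)

# Fill the placeholder with the object category before sending to the VLM.
input_prompt = prompt.format("pen")
print(input_prompt.splitlines()[-1])
```

Note that `str.format` treats every brace pair as a placeholder, so any literal `{` or `}` added to the prompt later would need doubling (`{{`, `}}`).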
@@ -107,36 +107,37 @@ class URDFGenerator(object):
 already provided, use it directly), accurately describe this 3D object asset (within 15 words),
 Determine the pose of the object in the first image and estimate the true vertical height
 (vertical projection) range of the object (in meters), i.e., how tall the object appears from top
-to bottom in the front view (first) image. also weight range (unit: kilogram), the average
+to bottom in the first image. also weight range (unit: kilogram), the average
 static friction coefficient of the object relative to rubber and the average dynamic friction
-coefficient of the object relative to rubber. Return response format as shown in Output Example.
+coefficient of the object relative to rubber. Return response in format as shown in Output Example.
 
 Output Example:
 Category: cup
 Description: shiny golden cup with floral design
-Height: 0.1-0.15 m
+Pose: <short_description_within_10_words>
+Height: 0.10-0.15 m
 Weight: 0.3-0.6 kg
 Static friction coefficient: 0.6
 Dynamic friction coefficient: 0.5
 
-IMPORTANT: Estimating Vertical Height from the First (Front View) Image.
+IMPORTANT: Estimating Vertical Height from the First (Front View) Image and pose estimation based on all views.
 - The "vertical height" refers to the real-world vertical size of the object
 as projected in the first image, aligned with the image's vertical axis.
 - For flat objects like plates or disks or book, if their face is visible in the front view,
 use the diameter as the vertical height. If the edge is visible, use the thickness instead.
 - This is not necessarily the full length of the object, but how tall it appears
-in the first image vertically, based on its pose and orientation.
+in the first image vertically, based on its pose and orientation estimation on all views.
-- For objects(e.g., spoons, forks, writing instruments etc.) at an angle showing in
-the first image, tilted at 45° will appear shorter vertically than when upright.
+- For objects(e.g., spoons, forks, writing instruments etc.) at an angle showing in images,
+e.g., tilted at 45° will appear shorter vertically than when upright.
 Estimate the vertical projection of their real length based on its pose.
 For example:
-- A pen standing upright in the first view (aligned with the image's vertical axis)
+- A pen standing upright in the first image (aligned with the image's vertical axis)
 full body visible in the first image: → vertical height ≈ 0.14-0.20 m
-- A pen lying flat in the front view (showing thickness) → vertical height ≈ 0.018-0.025 m
+- A pen lying flat in the first image (showing thickness or as a dot) → vertical height ≈ 0.018-0.025 m
 - Tilted pen in the first image (e.g., ~45° angle): vertical height ≈ 0.07-0.12 m
-- Use the rest views(except the first image) to help determine the object's 3D pose and orientation.
+- Use the rest views to help determine the object's 3D pose and orientation.
 Assume the object is in real-world scale and estimate the approximate vertical height
-(in meters) based on how large it appears vertically in the first image.
+based on the pose estimation and how large it appears vertically in the first image.
 """
 )
 
@@ -163,14 +164,14 @@ class URDFGenerator(object):
         description = lines[1].split(": ")[1]
         min_height, max_height = map(
             lambda x: float(x.strip().replace(",", "").split()[0]),
-            lines[2].split(": ")[1].split("-"),
+            lines[3].split(": ")[1].split("-"),
         )
         min_mass, max_mass = map(
             lambda x: float(x.strip().replace(",", "").split()[0]),
-            lines[3].split(": ")[1].split("-"),
+            lines[4].split(": ")[1].split("-"),
         )
-        mu1 = float(lines[4].split(": ")[1].replace(",", ""))
-        mu2 = float(lines[5].split(": ")[1].replace(",", ""))
+        mu1 = float(lines[5].split(": ")[1].replace(",", ""))
+        mu2 = float(lines[6].split(": ")[1].replace(",", ""))
 
         return {
             "category": category.lower(),
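The index shifts in the second hunk follow directly from the new `Pose:` line in the prompt's Output Example: every field after `Description` moves down by one, so `Height` becomes `lines[3]`, `Weight` `lines[4]`, and the friction coefficients `lines[5]`/`lines[6]`. A standalone sketch of parsing the updated response format (the `parse_response` helper is hypothetical, simplified from the real parser):

```python
def parse_response(response: str) -> dict:
    lines = response.strip().splitlines()
    # Height line, e.g. "Height: 0.10-0.15 m" -> (0.10, 0.15)
    min_h, max_h = (
        float(x.strip().split()[0]) for x in lines[3].split(": ")[1].split("-")
    )
    return {
        "category": lines[0].split(": ")[1].lower(),
        "pose": lines[2].split(": ")[1],          # new field in v0.1.3 format
        "height": (min_h, max_h),
        "mu_static": float(lines[5].split(": ")[1]),
        "mu_dynamic": float(lines[6].split(": ")[1]),
    }


response = """Category: cup
Description: shiny golden cup with floral design
Pose: upright, handle facing right
Height: 0.10-0.15 m
Weight: 0.3-0.6 kg
Static friction coefficient: 0.6
Dynamic friction coefficient: 0.5"""
print(parse_response(response)["height"])  # (0.1, 0.15)
```

Index-based parsing like this is brittle by design: any new field in the prompt's Output Example forces every downstream `lines[...]` index to shift in lockstep, which is exactly what this hunk does.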
@@ -7,7 +7,7 @@ packages = ["embodied_gen"]
 
 [project]
 name = "embodied_gen"
-version = "v0.1.2"
+version = "v0.1.3"
 readme = "README.md"
 license = "Apache-2.0"
 license-files = ["LICENSE", "NOTICE"]
@@ -32,6 +32,8 @@ backproject-cli = "embodied_gen.data.backproject_v2:entrypoint"
 img3d-cli = "embodied_gen.scripts.imageto3d:entrypoint"
 text3d-cli = "embodied_gen.scripts.textto3d:text_to_3d"
 scene3d-cli = "embodied_gen.scripts.gen_scene3d:entrypoint"
+layout-cli = "embodied_gen.scripts.gen_layout:entrypoint"
+sim-cli = "embodied_gen.scripts.simulate_sapien:entrypoint"
 
 [tool.pydocstyle]
 match = '(?!test_).*(?!_pb2)\.py'
@@ -38,4 +38,5 @@ tyro
 pyquaternion
 shapely
 sapien==3.0.0b1
+mani_skill==3.0.0b21
 typing_extensions==4.14.1
@@ -77,10 +77,18 @@ def panoocc_checker():
 
 
 def test_geo_checker(geo_checker):
+    from embodied_gen.utils.process_media import combine_images_to_grid
+
+    image_paths = glob(
+        "outputs/layouts_gens3/task_0000/asset3d/pen/result/renders/image_color/*png"
+    )
+    image = combine_images_to_grid(image_paths)[0]
+    # image = "outputs/utest_cases/geo_checker/color.png"
     flag, result = geo_checker(
         [
-            "apps/assets/example_image/sample_02.jpg",
-        ]
+            image,
+        ],
+        text="pen",
     )
     logger.info(f"geo_checker: {flag}, {result}")
     assert isinstance(flag, bool)
@@ -107,13 +115,42 @@ def test_seg_checker(seg_checker):
 
 
 def test_semantic_checker(semantic_checker):
-    flag, result = semantic_checker(
-        text="can",
-        image=["apps/assets/example_image/sample_02.jpg"],
-    )
-    logger.info(f"semantic_checker: {flag}, {result}")
-    assert isinstance(flag, bool)
-    assert isinstance(result, str)
+    test_cases = [
+        ("table", "outputs/utest_cases/semantic_checker/task_0001_table.png"),
+        ("table", "outputs/utest_cases/semantic_checker/task_0002_table.png"),
+        (
+            "banana",
+            "outputs/utest_cases/semantic_checker/task_0002_banana.png",
+        ),
+        (
+            "napkin",
+            "outputs/utest_cases/semantic_checker/task_0003_napkin.png",
+        ),
+        ("table", "outputs/utest_cases/semantic_checker/task_0003_table.png"),
+        ("fork", "outputs/utest_cases/semantic_checker/task_0005_fork.png"),
+        ("spoon", "outputs/utest_cases/semantic_checker/task_0005_spoon.png"),
+        ("table", "outputs/utest_cases/semantic_checker/task_0007_table.png"),
+        ("table", "outputs/utest_cases/semantic_checker/task_0008_table.png"),
+        ("table", "outputs/utest_cases/semantic_checker/task_0009_table.png"),
+        ("table", "outputs/utest_cases/semantic_checker/task_0011_table.png"),
+        ("desk", "outputs/utest_cases/semantic_checker/task_0012_desk.png"),
+        (
+            "laptop",
+            "outputs/utest_cases/semantic_checker/task_0012_laptop.png",
+        ),
+        ("table", "outputs/utest_cases/semantic_checker/task_0014_table.png"),
+        ("desk", "outputs/utest_cases/semantic_checker/task_0016_desk.png"),
+        ("shelf", "outputs/utest_cases/semantic_checker/task_0018_shelf.png"),
+        ("table", "outputs/utest_cases/semantic_checker/task_0000_table.png"),
+    ]
+    for test_case in test_cases:
+        flag, result = semantic_checker(
+            text=test_case[0],
+            image=[test_case[1]],
+        )
+        logger.info(f"semantic_checker: {flag}, {result}, {test_case[1]}")
+        assert isinstance(flag, bool)
+        assert isinstance(result, str)
 
 
 @pytest.mark.parametrize(
@@ -139,25 +176,21 @@ def test_textgen_checker(textalign_checker, mesh_path, text_desc):
 
 
 def test_panoheight_estimator(pano_height_estimator):
-    image_paths = glob("outputs/bg_v3/test2/*/*.png")
+    image_paths = glob("outputs/bg_scenes/*/*.png")
     for image_path in image_paths:
        result = pano_height_estimator(image_path)
        logger.info(f"{type(result)}, {result}")
 
 
 def test_pano_checker(pano_checker):
-    # image_paths = [
-    #     "outputs/bg_gen2/scene_0000/pano_image.png",
-    #     "outputs/bg_gen2/scene_0001/pano_image.png",
-    # ]
-    image_paths = glob("outputs/bg_gen/*/*.png")
+    image_paths = glob("outputs/bg_scenes/*/*.png")
     for image_path in image_paths:
         flag, result = pano_checker(image_path)
         logger.info(f"{image_path} {flag}, {result}")
 
 
 def test_panoocc_checker(panoocc_checker):
-    image_paths = glob("outputs/bg_gen/*/*.png")
+    image_paths = glob("outputs/bg_scenes/*/*.png")
     for image_path in image_paths:
         flag, result = panoocc_checker(image_path)
         logger.info(f"{image_path} {flag}, {result}")
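The rewritten `test_semantic_checker` runs all cases inside one test body, so the first failing assertion masks the remaining cases. The file already uses `@pytest.mark.parametrize` elsewhere; a sketch of applying it here, with a hypothetical `fake_checker` standing in for the real GPT-backed `semantic_checker` fixture:

```python
import pytest


def fake_checker(text: str, image: list) -> tuple:
    # Stand-in for the GPT-backed semantic checker; for illustration only.
    return True, f"image {image[0]} matches '{text}'"


CASES = [
    ("table", "outputs/utest_cases/semantic_checker/task_0001_table.png"),
    ("banana", "outputs/utest_cases/semantic_checker/task_0002_banana.png"),
]


@pytest.mark.parametrize("text,image_path", CASES)
def test_semantic_case(text, image_path):
    # Each (text, image) pair becomes its own test, reported separately.
    flag, result = fake_checker(text=text, image=[image_path])
    assert isinstance(flag, bool)
    assert isinstance(result, str)
```

With parametrization, `pytest` reports one pass/fail line per case and keeps going after a failure, which makes regressions in individual assets much easier to spot.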
tests/test_examples/test_urdf_convertor.py (new file, 51 lines)
@@ -0,0 +1,51 @@
+# Project EmbodiedGen
+#
+# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#       http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+# implied. See the License for the specific language governing
+# permissions and limitations under the License.
+
+from embodied_gen.utils.gpt_clients import GPT_CLIENT
+from embodied_gen.validators.urdf_convertor import URDFGenerator
+
+
+def test_urdf_convertor():
+    urdf_gen = URDFGenerator(GPT_CLIENT, render_view_num=4)
+    mesh_paths = [
+        "outputs/test_urdf/sample_0/mesh/pen.obj",
+        "outputs/test_urdf/sample_1/mesh/notepad.obj",
+        "outputs/test_urdf/sample_2/mesh/plate.obj",
+        "outputs/test_urdf/sample_3/mesh/spoon.obj",
+        "outputs/test_urdf/sample_4/mesh/notebook.obj",
+        "outputs/test_urdf/sample_5/mesh/plate.obj",
+        "outputs/test_urdf/sample_6/mesh/spoon.obj",
+        "outputs/test_urdf/sample_7/mesh/book.obj",
+        "outputs/test_urdf/sample_8/mesh/lamp.obj",
+        "outputs/test_urdf/sample_9/mesh/remote_control.obj",
+        "outputs/test_urdf/sample_10/mesh/keyboard.obj",
+        "outputs/test_urdf/sample_11/mesh/mouse.obj",
+        "outputs/test_urdf/sample_12/mesh/table.obj",
+        "outputs/test_urdf/sample_13/mesh/marker.obj",
+        "outputs/test_urdf/pen/result/mesh/pen.obj",
+        "outputs/test_urdf/notebook/result/mesh/notebook.obj",
+        "outputs/test_urdf/marker/result/mesh/marker.obj",
+        "outputs/test_urdf/pen2/result/mesh/pen.obj",
+    ]
+    for idx, mesh_path in enumerate(mesh_paths):
+        filename = mesh_path.split("/")[-1].split(".")[0]
+        urdf_path = urdf_gen(
+            mesh_path=mesh_path,
+            output_root=f"outputs/test_urdf2/sample_{idx}",
+            category=filename,
+            # min_height=1.0,
+            # max_height=1.2,
+        )