feat(pipe): Add layout generation by text desc pipe and release v0.1.3. (#33)

* feat(pipe): Add layout generation pipe.

* tag v0.1.3

* chore: update assets

* update

---------

Co-authored-by: xinjie.wang <xinjie.wang@gpu-4090-dev015.hogpu.cc>
Xinjie 2025-08-19 15:40:33 +08:00 committed by GitHub
parent e05095f51f
commit e06900fe57
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
28 changed files with 2369 additions and 64 deletions


@@ -4,8 +4,18 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0).
## [0.1.3] - 2025-08-19
### Feature
- Release `layout-cli` 🏞️ Layout (Interactive 3D Worlds) Generation: generates an interactive 3D scene from a task description.
## [0.1.2] - 2025-07-31
### Feature
- Release `scene3d-cli` 🌍 3D Scene Generation: generates a color mesh and 3DGS scene from a text description.
- Release Dockerfile and Docker image.
## [0.1.1] - 2025-07-11
### Feature
- Added intelligent quality checkers and auto-retry pipeline for `image-to-3d` and `text-to-3d`.
- Added unit tests for quality checkers.


@@ -30,7 +30,7 @@
 ```sh
 git clone https://github.com/HorizonRobotics/EmbodiedGen.git
 cd EmbodiedGen
-git checkout v0.1.2
+git checkout v0.1.3
 git submodule update --init --recursive --progress
 conda create -n embodiedgen python=3.10.13 -y # recommended to use a new env.
 conda activate embodiedgen
@@ -194,8 +194,6 @@ CUDA_VISIBLE_DEVICES=0 scene3d-cli \
 ### 💬 Generate Layout from task description
-🚧 *Coming Soon*
<table>
<tr>
<td><img src="apps/assets/layout1.gif" alt="layout1" width="360"/></td>
@@ -207,6 +205,47 @@ CUDA_VISIBLE_DEVICES=0 scene3d-cli \
</tr>
</table>
The text-to-image model is based on SD3.5 Medium; usage requires agreement to the [model license](https://huggingface.co/stabilityai/stable-diffusion-3.5-medium). All models are auto-downloaded on the first run.
You can generate any desired room as background using `scene3d-cli`. As each scene takes approximately 30 minutes to generate, we recommend pre-generating them for efficiency and adding them to `outputs/bg_scenes/scene_list.txt`.
We provide some sample background assets created with `scene3d-cli`. Download them (~4 GB) using `hf download xinjjj/scene3d-bg --repo-type dataset --local-dir outputs`.
Generating one interactive 3D scene from a task description with `layout-cli` takes approximately 30 minutes.
```sh
layout-cli --task_descs "Place the pen in the mug on the desk" "Put the fruit on the table on the plate" \
--bg_list "outputs/bg_scenes/scene_list.txt" --output_root "outputs/layouts_gen" --insert_robot
```
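The `scene_list.txt` registry passed via `--bg_list` can be prepared ahead of time. A minimal sketch (the one-entry-per-line format and the `my_kitchen_scene` name are assumptions; check `load_scene_dict` for the exact format expected):

```shell
# Create the background-scene registry referenced by --bg_list.
mkdir -p outputs/bg_scenes

# Sample scenes (~4 GB) can be fetched with:
#   hf download xinjjj/scene3d-bg --repo-type dataset --local-dir outputs

# Register a pre-generated scene folder (hypothetical name) in the list.
echo "my_kitchen_scene" >> outputs/bg_scenes/scene_list.txt
cat outputs/bg_scenes/scene_list.txt
```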
<table>
<tr>
<td><img src="apps/assets/Iscene_demo1.gif" alt="Iscene_demo1" width="300"/></td>
<td><img src="apps/assets/Iscene_demo2.gif" alt="Iscene_demo2" width="450"/></td>
</tr>
</table>
Run multiple tasks defined in `task_list.txt` in the background.
Remove `--insert_robot` if you do not want the robot pose to be considered in layout generation.
```sh
CUDA_VISIBLE_DEVICES=0 nohup layout-cli \
--task_descs "apps/assets/example_layout/task_list.txt" \
--bg_list "outputs/bg_scenes/scene_list.txt" \
--output_root "outputs/layouts_gens" --insert_robot > layouts_gens.log &
```
Using `compose_layout.py`, you can recompose the layout of the generated interactive 3D scenes. (Support for texture editing and augmentation will be added later.)
```sh
python embodied_gen/scripts/compose_layout.py \
--layout_path "outputs/layouts_gens/task_0000/layout.json" \
--output_dir "outputs/layouts_gens/task_0000/recompose" --insert_robot
```
We provide `sim-cli`, which allows users to easily load generated layouts into an interactive 3D simulation using the SAPIEN engine (support for more simulators will be added in future updates).
```sh
sim-cli --layout_path "outputs/layouts_gens/task_0000/recompose/layout.json" \
--output_dir "outputs/layouts_gens/task_0000/recompose/sapien_render" --robot_name "franka"
```
### 🖼️ Real-to-Sim Digital Twin

Binary file added (size: 471 KiB)

Binary file added (size: 875 KiB)


@@ -0,0 +1,18 @@
Place the pen in the mug on the desk
Get the pen from the table and put it inside the cup
Put the fruit on the table on the plate
Put the apples on the table on the plate
Put the spoon on the plate in the table
Put the green lid on the desk
Put the pen on the tray
Move the plate forward
Move the lamp to the left
Pick up the cup and put on the middle of the table
Close the laptop
Close the red jar with the lid on the table
Pick the cloth and shove it in the box
Pick up the marker from the table and put it in the bowl
Pick up the charger and move it slightly to the left
Move the jar to the left side of the desk
Pick the rubik's cube on the top of the desk
Move the mug to the right


@@ -189,7 +189,7 @@ os.makedirs(TMP_DIR, exist_ok=True)
 lighting_css = """
 <style>
 #lighter_mesh canvas {
-filter: brightness(1.8) !important;
+filter: brightness(1.9) !important;
 }
 </style>
 """


@@ -51,17 +51,15 @@ class RenderResult:
 def __post_init__(self):
 if isinstance(self.rgb, torch.Tensor):
-rgb = self.rgb.detach().cpu().numpy()
-rgb = (rgb * 255).astype(np.uint8)
-self.rgb = cv2.cvtColor(rgb, cv2.COLOR_BGR2RGB)
+rgb = (self.rgb * 255).to(torch.uint8)
+self.rgb = rgb.cpu().numpy()[..., ::-1]
 if isinstance(self.depth, torch.Tensor):
-self.depth = self.depth.detach().cpu().numpy()
+self.depth = self.depth.cpu().numpy()
 if isinstance(self.opacity, torch.Tensor):
-opacity = self.opacity.detach().cpu().numpy()
-opacity = (opacity * 255).astype(np.uint8)
-self.opacity = cv2.cvtColor(opacity, cv2.COLOR_GRAY2RGB)
+opacity = (self.opacity * 255).to(torch.uint8)
+self.opacity = opacity.cpu().numpy()
 mask = np.where(self.opacity > self.mask_threshold, 255, 0)
-self.mask = mask[..., 0:1].astype(np.uint8)
+self.mask = mask.astype(np.uint8)
 self.rgba = np.concatenate([self.rgb, self.mask], axis=-1)
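The hunk above replaces `cv2.cvtColor` with a `[..., ::-1]` slice, relying on the fact that reversing the channel axis is equivalent to a BGR↔RGB swap for 3-channel images; a quick standalone check:

```python
import numpy as np

# A 1x2 "image" whose pixels have distinct B, G, R values.
bgr = np.array([[[10, 20, 30], [40, 50, 60]]], dtype=np.uint8)

# Reversing the last (channel) axis swaps BGR -> RGB, matching
# cv2.cvtColor(..., cv2.COLOR_BGR2RGB) for 3-channel inputs.
rgb = bgr[..., ::-1]
print(rgb[0, 0].tolist())  # [30, 20, 10]
```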


@@ -0,0 +1,509 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
import argparse
import json
import logging
import os
import re
import json_repair
from embodied_gen.utils.enum import (
LayoutInfo,
RobotItemEnum,
Scene3DItemEnum,
SpatialRelationEnum,
)
from embodied_gen.utils.gpt_clients import GPT_CLIENT, GPTclient
from embodied_gen.utils.process_media import SceneTreeVisualizer
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
__all__ = [
"LayoutDesigner",
"LAYOUT_DISASSEMBLER",
"LAYOUT_GRAPHER",
"LAYOUT_DESCRIBER",
]
DISTRACTOR_NUM = 3 # Maximum number of distractor objects allowed
LAYOUT_DISASSEMBLE_PROMPT = f"""
You are an intelligent 3D scene planner. Given a natural language
description of a robotic task, output a structured description of
an interactive 3D scene.
The output must include the following fields:
- task: A high-level task type (e.g., "single-arm pick",
"dual-arm grasping", "pick and place", "object sorting").
- {Scene3DItemEnum.ROBOT}: The name or type of robot involved. If not mentioned,
use {RobotItemEnum.FRANKA} as default.
- {Scene3DItemEnum.BACKGROUND}: The room or indoor environment where the task happens
(e.g., Kitchen, Bedroom, Living Room, Workshop, Office).
- {Scene3DItemEnum.CONTEXT}: An indoor object involved in the manipulation
(e.g., Table, Shelf, Desk, Bed, Cabinet).
- {Scene3DItemEnum.MANIPULATED_OBJS}: The main object(s) that the robot directly interacts with.
- {Scene3DItemEnum.DISTRACTOR_OBJS}: Other objects that naturally belong to the scene but are not part of the main task.
Constraints:
- The {Scene3DItemEnum.BACKGROUND} must logically match the described task.
- The {Scene3DItemEnum.CONTEXT} must fit within the {Scene3DItemEnum.BACKGROUND}. (e.g., a bedroom may include a table or bed, but not a workbench.)
- The {Scene3DItemEnum.CONTEXT} must be a concrete indoor object, such as a "table",
"shelf", "desk", or "bed". It must not be an abstract concept (e.g., "area", "space", "zone")
or structural surface (e.g., "floor", "ground"). If the input describes an interaction near
the floor or vague space, you must infer a plausible object like a "table", "cabinet", or "storage box" instead.
- {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS} objects must be plausible,
and semantically compatible with the {Scene3DItemEnum.CONTEXT} and {Scene3DItemEnum.BACKGROUND}.
- {Scene3DItemEnum.DISTRACTOR_OBJS} must not confuse or overlap with the manipulated objects.
- {Scene3DItemEnum.DISTRACTOR_OBJS} number limit: {DISTRACTOR_NUM} distractors maximum.
- All {Scene3DItemEnum.BACKGROUND} are limited to indoor environments.
- {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS} must be rigid bodies and must not include flexible objects.
- {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS} must be common
household or office items or furniture, not abstract concepts, and not too small (e.g., a needle).
- If the input includes a plural or grouped object (e.g., "pens", "bottles", "plates", "fruit"),
you must decompose it into multiple individual instances (e.g., ["pen", "pen"], ["apple", "pear"]).
- Containers that hold objects (e.g., "bowl of apples", "box of tools") must
be separated into individual items (e.g., ["bowl", "apple", "apple"]).
- Do not include transparent objects such as "glass", "plastic", etc.
- The output must be in compact JSON format and use Markdown syntax, just like the output in the example below.
Examples:
Input:
"Pick up the marker from the table and put it in the bowl robot {RobotItemEnum.UR5}."
Output:
```json
{{
"task_desc": "Pick up the marker from the table and put it in the bowl.",
"task": "pick and place",
"{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.UR5}",
"{Scene3DItemEnum.BACKGROUND}": "kitchen",
"{Scene3DItemEnum.CONTEXT}": "table",
"{Scene3DItemEnum.MANIPULATED_OBJS}": ["marker"],
"{Scene3DItemEnum.DISTRACTOR_OBJS}": ["mug", "notebook", "bowl"]
}}
```
Input:
"Put the rubik's cube on the top of the shelf."
Output:
```json
{{
"task_desc": "Put the rubik's cube on the top of the shelf.",
"task": "pick and place",
"{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.FRANKA}",
"{Scene3DItemEnum.BACKGROUND}": "bedroom",
"{Scene3DItemEnum.CONTEXT}": "shelf",
"{Scene3DItemEnum.MANIPULATED_OBJS}": ["rubik's cube"],
"{Scene3DItemEnum.DISTRACTOR_OBJS}": ["pen", "cup", "toy car"]
}}
```
Input:
"Remove all the objects from the white basket and put them on the table."
Output:
```json
{{
"task_desc": "Remove all the objects from the white basket and put them on the table, robot {RobotItemEnum.PIPER}.",
"task": "pick and place",
"{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.PIPER}",
"{Scene3DItemEnum.BACKGROUND}": "office",
"{Scene3DItemEnum.CONTEXT}": "table",
"{Scene3DItemEnum.MANIPULATED_OBJS}": ["banana", "mobile phone"],
"{Scene3DItemEnum.DISTRACTOR_OBJS}": ["plate", "white basket"]
}}
```
Input:
"Pick up the rope on the chair and put it in the box."
Output:
```json
{{
"task_desc": "Pick up the rope on the chair and put it in the box, robot {RobotItemEnum.FRANKA}.",
"task": "pick and place",
"{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.FRANKA}",
"{Scene3DItemEnum.BACKGROUND}": "living room",
"{Scene3DItemEnum.CONTEXT}": "chair",
"{Scene3DItemEnum.MANIPULATED_OBJS}": ["rope", "box"],
"{Scene3DItemEnum.DISTRACTOR_OBJS}": ["magazine"]
}}
```
Input:
"Pick up the seal tape and plastic from the counter and put them in the open drawer and close it."
Output:
```json
{{
"task_desc": "Pick up the seal tape and plastic from the counter and put them in the open drawer and close it.",
"task": "pick and place",
"robot": "franka",
"background": "kitchen",
"context": "counter",
"manipulated_objs": ["seal tape", "plastic", "opened drawer"],
"distractor_objs": ["scissors"]
}}
```
Input:
"Put the pens in the grey bowl."
Output:
```json
{{
"task_desc": "Put the pens in the grey bowl.",
"task": "pick and place",
"robot": "franka",
"background": "office",
"context": "table",
"manipulated_objs": ["pen", "pen", "grey bowl"],
"distractor_objs": ["notepad", "cup"]
}}
```
"""
LAYOUT_HIERARCHY_PROMPT = f"""
You are a 3D scene layout reasoning expert.
Your task is to generate a spatial relationship dictionary in multiway tree
that describes how objects are arranged in a 3D environment
based on a given task description and object list.
Input in JSON format containing the task description, task type,
{Scene3DItemEnum.ROBOT}, {Scene3DItemEnum.BACKGROUND}, {Scene3DItemEnum.CONTEXT},
and a list of objects, including {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS}.
### Supported Spatial Relations:
- "{SpatialRelationEnum.ON}": The child object bottom is directly on top of the parent object top.
- "{SpatialRelationEnum.INSIDE}": The child object is inside the context object.
- "{SpatialRelationEnum.IN}": The {Scene3DItemEnum.ROBOT} in the {Scene3DItemEnum.BACKGROUND}.
- "{SpatialRelationEnum.FLOOR}": The child object bottom is on the floor of the {Scene3DItemEnum.BACKGROUND}.
### Rules:
- The {Scene3DItemEnum.CONTEXT} object must be "{SpatialRelationEnum.FLOOR}" the {Scene3DItemEnum.BACKGROUND}.
- {Scene3DItemEnum.MANIPULATED_OBJS} and {Scene3DItemEnum.DISTRACTOR_OBJS} must be either
"{SpatialRelationEnum.ON}" or "{SpatialRelationEnum.INSIDE}" the {Scene3DItemEnum.CONTEXT}
- Or "{SpatialRelationEnum.FLOOR}" {Scene3DItemEnum.BACKGROUND}.
- Use "{SpatialRelationEnum.INSIDE}" only if the parent is a container-like object (e.g., shelf, rack, cabinet).
- Do not define relationship edges between objects, only for the child and parent nodes.
- {Scene3DItemEnum.ROBOT} must "{SpatialRelationEnum.IN}" the {Scene3DItemEnum.BACKGROUND}.
- Ensure that each object appears only once in the layout tree, and its spatial relationship is defined with only one parent.
- Ensure a valid multiway tree structure with a maximum depth of 2 levels suitable for a 3D scene layout representation.
- Only output the final output in JSON format, using Markdown syntax as in examples.
### Example
Input:
{{
"task_desc": "Pick up the marker from the table and put it in the bowl.",
"task": "pick and place",
"{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.FRANKA}",
"{Scene3DItemEnum.BACKGROUND}": "kitchen",
"{Scene3DItemEnum.CONTEXT}": "table",
"{Scene3DItemEnum.MANIPULATED_OBJS}": ["marker", "bowl"],
"{Scene3DItemEnum.DISTRACTOR_OBJS}": ["mug", "chair"]
}}
Intermediate Think:
table {SpatialRelationEnum.FLOOR} kitchen
chair {SpatialRelationEnum.FLOOR} kitchen
{RobotItemEnum.FRANKA} {SpatialRelationEnum.IN} kitchen
marker {SpatialRelationEnum.ON} table
bowl {SpatialRelationEnum.ON} table
mug {SpatialRelationEnum.ON} table
Final Output:
```json
{{
"kitchen": [
["table", "{SpatialRelationEnum.FLOOR}"],
["chair", "{SpatialRelationEnum.FLOOR}"],
["{RobotItemEnum.FRANKA}", "{SpatialRelationEnum.IN}"]
],
"table": [
["marker", "{SpatialRelationEnum.ON}"],
["bowl", "{SpatialRelationEnum.ON}"],
["mug", "{SpatialRelationEnum.ON}"]
]
}}
```
Input:
{{
"task_desc": "Put the marker on top of the book.",
"task": "pick and place",
"{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.UR5}",
"{Scene3DItemEnum.BACKGROUND}": "office",
"{Scene3DItemEnum.CONTEXT}": "desk",
"{Scene3DItemEnum.MANIPULATED_OBJS}": ["marker", "book"],
"{Scene3DItemEnum.DISTRACTOR_OBJS}": ["pen holder", "notepad"]
}}
Intermediate Think:
desk {SpatialRelationEnum.FLOOR} office
{RobotItemEnum.UR5} {SpatialRelationEnum.IN} office
marker {SpatialRelationEnum.ON} desk
book {SpatialRelationEnum.ON} desk
pen holder {SpatialRelationEnum.ON} desk
notepad {SpatialRelationEnum.ON} desk
Final Output:
```json
{{
"office": [
["desk", "{SpatialRelationEnum.FLOOR}"],
["{RobotItemEnum.UR5}", "{SpatialRelationEnum.IN}"]
],
"desk": [
["marker", "{SpatialRelationEnum.ON}"],
["book", "{SpatialRelationEnum.ON}"],
["pen holder", "{SpatialRelationEnum.ON}"],
["notepad", "{SpatialRelationEnum.ON}"]
]
}}
```
Input:
{{
"task_desc": "Put the rubik's cube on the top of the shelf.",
"task": "pick and place",
"{Scene3DItemEnum.ROBOT}": "{RobotItemEnum.UR5}",
"{Scene3DItemEnum.BACKGROUND}": "bedroom",
"{Scene3DItemEnum.CONTEXT}": "shelf",
"{Scene3DItemEnum.MANIPULATED_OBJS}": ["rubik's cube"],
"{Scene3DItemEnum.DISTRACTOR_OBJS}": ["toy car", "pen"]
}}
Intermediate Think:
shelf {SpatialRelationEnum.FLOOR} bedroom
{RobotItemEnum.UR5} {SpatialRelationEnum.IN} bedroom
rubik's cube {SpatialRelationEnum.INSIDE} shelf
toy car {SpatialRelationEnum.INSIDE} shelf
pen {SpatialRelationEnum.INSIDE} shelf
Final Output:
```json
{{
"bedroom": [
["shelf", "{SpatialRelationEnum.FLOOR}"],
["{RobotItemEnum.UR5}", "{SpatialRelationEnum.IN}"]
],
"shelf": [
["rubik's cube", "{SpatialRelationEnum.INSIDE}"],
["toy car", "{SpatialRelationEnum.INSIDE}"],
["pen", "{SpatialRelationEnum.INSIDE}"]
]
}}
```
Input:
{{
"task_desc": "Put the marker in the cup on the counter.",
"task": "pick and place",
"robot": "franka",
"background": "kitchen",
"context": "counter",
"manipulated_objs": ["marker", "cup"],
"distractor_objs": ["plate", "spoon"]
}}
Intermediate Think:
counter {SpatialRelationEnum.FLOOR} kitchen
{RobotItemEnum.FRANKA} {SpatialRelationEnum.IN} kitchen
marker {SpatialRelationEnum.ON} counter
cup {SpatialRelationEnum.ON} counter
plate {SpatialRelationEnum.ON} counter
spoon {SpatialRelationEnum.ON} counter
Final Output:
```json
{{
"kitchen": [
["counter", "{SpatialRelationEnum.FLOOR}"],
["{RobotItemEnum.FRANKA}", "{SpatialRelationEnum.IN}"]
],
"counter": [
["marker", "{SpatialRelationEnum.ON}"],
["cup", "{SpatialRelationEnum.ON}"],
["plate", "{SpatialRelationEnum.ON}"],
["spoon", "{SpatialRelationEnum.ON}"]
]
}}
```
"""
LAYOUT_DESCRIBER_PROMPT = """
You are a 3D asset style descriptor.
Given a task description and a dictionary where the key is the object content and
the value is the object type, output a JSON dictionary with each object paired
with a concise, styled visual description suitable for 3D asset generation.
Generation Guidelines:
- For each object, brainstorm multiple style candidates before selecting the final
description. Vary phrasing, material, texture, color, and spatial details.
- Each description must be a maximum of 15 words, including color, style, materials.
- Descriptions should be visually grounded, specific, and reflect surface texture and structure.
- For objects marked as "context", explicitly mention the object is standalone, has an empty top.
- Use rich style descriptors: e.g., "scratched brown wooden desk" etc.
- Ensure all object styles align with the task's overall context and environment.
Format your output in JSON like the example below.
Example Input:
"Pick up the rope on the chair and put it in the box. {'living room': 'background', 'chair': 'context',
'rope': 'manipulated_objs', 'box': 'manipulated_objs', 'magazine': 'distractor_objs'}"
Example Output:
```json
{
"living room": "modern cozy living room with soft sunlight and light grey carpet",
"chair": "standalone dark oak chair with no surroundings and clean empty seat",
"rope": "twisted hemp rope with rough fibers and dusty beige texture",
"box": "slightly crumpled cardboard box with open flaps and brown textured surface",
"magazine": "celebrity magazine with glossy red cover and large bold title"
}
```
"""
class LayoutDesigner(object):
def __init__(
self,
gpt_client: GPTclient,
system_prompt: str,
verbose: bool = False,
) -> None:
self.prompt = system_prompt.strip()
self.verbose = verbose
self.gpt_client = gpt_client
def query(self, prompt: str, params: dict = None) -> str:
full_prompt = self.prompt + f"\n\nInput:\n\"{prompt}\""
response = self.gpt_client.query(
text_prompt=full_prompt,
params=params,
)
if self.verbose:
logger.info(f"Response: {response}")
return response
def format_response(self, response: str) -> dict:
cleaned = re.sub(r"^```json\s*|\s*```$", "", response.strip())
try:
output = json.loads(cleaned)
except json.JSONDecodeError as e:
# json.JSONDecodeError requires (msg, doc, pos); re-raise as ValueError.
raise ValueError(
f"Failed to parse JSON response: {response}"
) from e
return output
def format_response_repair(self, response: str) -> dict:
return json_repair.loads(response)
def save_output(self, output: dict, save_path: str) -> None:
os.makedirs(os.path.dirname(save_path), exist_ok=True)
with open(save_path, 'w') as f:
json.dump(output, f, indent=4)
def __call__(
self, prompt: str, save_path: str = None, params: dict = None
) -> dict | str:
response = self.query(prompt, params=params)
output = self.format_response_repair(response)
if save_path:
self.save_output(output, save_path)
return output
LAYOUT_DISASSEMBLER = LayoutDesigner(
gpt_client=GPT_CLIENT, system_prompt=LAYOUT_DISASSEMBLE_PROMPT
)
LAYOUT_GRAPHER = LayoutDesigner(
gpt_client=GPT_CLIENT, system_prompt=LAYOUT_HIERARCHY_PROMPT
)
LAYOUT_DESCRIBER = LayoutDesigner(
gpt_client=GPT_CLIENT, system_prompt=LAYOUT_DESCRIBER_PROMPT
)
def build_scene_layout(
task_desc: str, output_path: str = None, gpt_params: dict = None
) -> LayoutInfo:
layout_relation = LAYOUT_DISASSEMBLER(task_desc, params=gpt_params)
layout_tree = LAYOUT_GRAPHER(layout_relation, params=gpt_params)
object_mapping = Scene3DItemEnum.object_mapping(layout_relation)
obj_prompt = f'{layout_relation["task_desc"]} {object_mapping}'
objs_desc = LAYOUT_DESCRIBER(obj_prompt, params=gpt_params)
layout_info = LayoutInfo(
layout_tree, layout_relation, objs_desc, object_mapping
)
if output_path is not None:
visualizer = SceneTreeVisualizer(layout_info)
visualizer.render(save_path=output_path)
logger.info(f"Scene hierarchy tree saved to {output_path}")
return layout_info
def parse_args():
parser = argparse.ArgumentParser(description="3D Scene Layout Designer")
parser.add_argument(
"--task_desc",
type=str,
default="Put the apples on the table on the plate",
help="Natural language description of the robotic task",
)
parser.add_argument(
"--save_root",
type=str,
default="outputs/layout_tree",
help="Path to save the layout output",
)
return parser.parse_args()
if __name__ == "__main__":
from embodied_gen.utils.enum import LayoutInfo
from embodied_gen.utils.process_media import SceneTreeVisualizer
args = parse_args()
params = {
"temperature": 1.0,
"top_p": 0.95,
"frequency_penalty": 0.3,
"presence_penalty": 0.5,
}
layout_relation = LAYOUT_DISASSEMBLER(args.task_desc, params=params)
layout_tree = LAYOUT_GRAPHER(layout_relation, params=params)
object_mapping = Scene3DItemEnum.object_mapping(layout_relation)
obj_prompt = f'{layout_relation["task_desc"]} {object_mapping}'
objs_desc = LAYOUT_DESCRIBER(obj_prompt, params=params)
layout_info = LayoutInfo(layout_tree, layout_relation, objs_desc)
visualizer = SceneTreeVisualizer(layout_info)
os.makedirs(args.save_root, exist_ok=True)
scene_graph_path = f"{args.save_root}/scene_tree.jpg"
visualizer.render(save_path=scene_graph_path)
with open(f"{args.save_root}/layout.json", "w") as f:
json.dump(layout_info.to_dict(), f, indent=4)
print(f"Scene hierarchy tree saved to {scene_graph_path}")
print(f"Disassembled Layout: {layout_relation}")
print(f"Layout Graph: {layout_tree}")
print(f"Layout Descriptions: {objs_desc}")
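The fence-stripping logic in `format_response` above can be exercised standalone. This sketch parses with `json.loads` only; the module additionally falls back to `json_repair` for malformed replies:

```python
import json
import re

def parse_llm_json(response: str) -> dict:
    # Strip a surrounding Markdown json code fence, then parse the payload.
    cleaned = re.sub(r"^```json\s*|\s*```$", "", response.strip())
    return json.loads(cleaned)

# A typical fenced LLM reply, built with escapes to keep this example compact.
demo = "```json\n{\"task\": \"pick and place\", \"robot\": \"franka\"}\n```"
print(parse_llm_json(demo)["robot"])  # franka
```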


@@ -0,0 +1,73 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
import json
import os
from dataclasses import dataclass
import tyro
from embodied_gen.scripts.simulate_sapien import entrypoint as sim_cli
from embodied_gen.utils.enum import LayoutInfo
from embodied_gen.utils.geometry import bfs_placement, compose_mesh_scene
from embodied_gen.utils.log import logger
@dataclass
class LayoutPlacementConfig:
layout_path: str
output_dir: str | None = None
seed: int | None = None
max_attempts: int = 1000
output_iscene: bool = False
insert_robot: bool = False
def entrypoint(**kwargs):
if kwargs is None or len(kwargs) == 0:
args = tyro.cli(LayoutPlacementConfig)
else:
args = LayoutPlacementConfig(**kwargs)
output_dir = (
args.output_dir
if args.output_dir is not None
else os.path.dirname(args.layout_path)
)
os.makedirs(output_dir, exist_ok=True)
out_scene_path = f"{output_dir}/Iscene.glb"
out_layout_path = f"{output_dir}/layout.json"
with open(args.layout_path, "r") as f:
layout_info = LayoutInfo.from_dict(json.load(f))
layout_info = bfs_placement(layout_info, seed=args.seed)
with open(out_layout_path, "w") as f:
json.dump(layout_info.to_dict(), f, indent=4)
if args.output_iscene:
compose_mesh_scene(layout_info, out_scene_path)
sim_cli(
layout_path=out_layout_path,
output_dir=output_dir,
robot_name="franka" if args.insert_robot else None,
)
logger.info(f"Layout placement completed in {output_dir}")
if __name__ == "__main__":
entrypoint()


@@ -0,0 +1,156 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
import gc
import json
import os
from dataclasses import dataclass, field
from shutil import copytree
from time import time
from typing import Optional
import torch
import tyro
from embodied_gen.models.layout import build_scene_layout
from embodied_gen.scripts.simulate_sapien import entrypoint as sim_cli
from embodied_gen.scripts.textto3d import text_to_3d
from embodied_gen.utils.config import GptParamsConfig
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.geometry import bfs_placement, compose_mesh_scene
from embodied_gen.utils.gpt_clients import GPT_CLIENT
from embodied_gen.utils.log import logger
from embodied_gen.utils.process_media import (
load_scene_dict,
parse_text_prompts,
)
from embodied_gen.validators.quality_checkers import SemanticMatcher
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
@dataclass
class LayoutGenConfig:
task_descs: list[str]
output_root: str
bg_list: str = "outputs/bg_scenes/scene_list.txt"
n_img_sample: int = 3
text_guidance_scale: float = 7.0
img_denoise_step: int = 25
n_image_retry: int = 4
n_asset_retry: int = 3
n_pipe_retry: int = 2
seed_img: Optional[int] = None
seed_3d: Optional[int] = None
seed_layout: Optional[int] = None
keep_intermediate: bool = False
output_iscene: bool = False
insert_robot: bool = False
gpt_params: GptParamsConfig = field(
default_factory=lambda: GptParamsConfig(
temperature=1.0,
top_p=0.95,
frequency_penalty=0.3,
presence_penalty=0.5,
)
)
def entrypoint() -> None:
args = tyro.cli(LayoutGenConfig)
SCENE_MATCHER = SemanticMatcher(GPT_CLIENT)
task_descs = parse_text_prompts(args.task_descs)
scene_dict = load_scene_dict(args.bg_list)
gpt_params = args.gpt_params.to_dict()
for idx, task_desc in enumerate(task_descs):
logger.info(f"Generate Layout and 3D scene for task: {task_desc}")
output_root = f"{args.output_root}/task_{idx:04d}"
scene_graph_path = f"{output_root}/scene_tree.jpg"
start_time = time()
layout_info: LayoutInfo = build_scene_layout(
task_desc, scene_graph_path, gpt_params
)
prompts_mapping = {v: k for k, v in layout_info.objs_desc.items()}
prompts = [
v
for k, v in layout_info.objs_desc.items()
if layout_info.objs_mapping[k] != Scene3DItemEnum.BACKGROUND.value
]
for prompt in prompts:
node = prompts_mapping[prompt]
generation_log = text_to_3d(
prompts=[
prompt,
],
output_root=output_root,
asset_names=[
node,
],
n_img_sample=args.n_img_sample,
text_guidance_scale=args.text_guidance_scale,
img_denoise_step=args.img_denoise_step,
n_image_retry=args.n_image_retry,
n_asset_retry=args.n_asset_retry,
n_pipe_retry=args.n_pipe_retry,
seed_img=args.seed_img,
seed_3d=args.seed_3d,
keep_intermediate=args.keep_intermediate,
)
layout_info.assets.update(generation_log["assets"])
layout_info.quality.update(generation_log["quality"])
# Background generation (for efficiency, temporarily use retrieval instead).
bg_node = layout_info.relation[Scene3DItemEnum.BACKGROUND.value]
text = layout_info.objs_desc[bg_node]
match_key = SCENE_MATCHER.query(text, str(scene_dict))
match_scene_path = f"{os.path.dirname(args.bg_list)}/{match_key}"
bg_save_dir = os.path.join(output_root, "background")
copytree(match_scene_path, bg_save_dir, dirs_exist_ok=True)
layout_info.assets[bg_node] = bg_save_dir
# BFS layout placement.
layout_info = bfs_placement(
layout_info,
limit_reach_range=True if args.insert_robot else False,
seed=args.seed_layout,
)
layout_path = f"{output_root}/layout.json"
with open(layout_path, "w") as f:
json.dump(layout_info.to_dict(), f, indent=4)
if args.output_iscene:
compose_mesh_scene(layout_info, f"{output_root}/Iscene.glb")
sim_cli(
layout_path=layout_path,
output_dir=output_root,
robot_name="franka" if args.insert_robot else None,
)
torch.cuda.empty_cache()
gc.collect()
elapsed_time = (time() - start_time) / 60
logger.info(
f"Layout generation done for {scene_graph_path}, layout result "
f"in {layout_path}, finished in {elapsed_time:.2f} mins."
)
logger.info(f"All tasks completed in {args.output_root}")
if __name__ == "__main__":
entrypoint()


@@ -151,6 +151,9 @@ def entrypoint(**kwargs):
 seg_image.save(seg_path)
 seed = args.seed
+asset_node = "unknown"
+if isinstance(args.asset_type, list) and args.asset_type[idx]:
+asset_node = args.asset_type[idx]
 for try_idx in range(args.n_retry):
 logger.info(
 f"Try: {try_idx + 1}/{args.n_retry}, Seed: {seed}, Prompt: {seg_path}"
@@ -207,7 +210,9 @@ def entrypoint(**kwargs):
 color_path = os.path.join(output_root, "color.png")
 render_gs_api(aligned_gs_path, color_path)
-geo_flag, geo_result = GEO_CHECKER([color_path])
+geo_flag, geo_result = GEO_CHECKER(
+[color_path], text=asset_node
+)
 logger.warning(
 f"{GEO_CHECKER.__class__.__name__}: {geo_result} for {seg_path}"
 )


@@ -0,0 +1,186 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
import json
import os
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Literal
import imageio
import numpy as np
import torch
import tyro
from tqdm import tqdm
from embodied_gen.models.gs_model import GaussianOperator
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.geometry import quaternion_multiply
from embodied_gen.utils.log import logger
from embodied_gen.utils.process_media import alpha_blend_rgba
from embodied_gen.utils.simulation import (
SIM_COORD_ALIGN,
FrankaPandaGrasper,
SapienSceneManager,
)
@dataclass
class SapienSimConfig:
# Simulation settings.
layout_path: str
output_dir: str
sim_freq: int = 200
sim_step: int = 400
z_offset: float = 0.004
init_quat: list[float] = field(
default_factory=lambda: [0.7071, 0, 0, 0.7071]
) # xyzw
device: str = "cuda"
robot_name: str | None = None
control_freq: int = 40
# Camera settings.
render_interval: int = 10
num_cameras: int = 3
camera_radius: float = 0.9
camera_height: float = 1.1
image_hw: tuple[int, int] = (512, 512)
ray_tracing: bool = True
fovy_deg: float = 75.0
camera_target_pt: list[float] = field(
default_factory=lambda: [0.0, 0.0, 0.9]
)
render_keys: list[
Literal[
"Color", "Foreground", "Segmentation", "Normal", "Mask", "Depth"
]
] = field(default_factory=lambda: ["Foreground"])
def entrypoint(**kwargs):
if not kwargs:
cfg = tyro.cli(SapienSimConfig)
else:
cfg = SapienSimConfig(**kwargs)
scene_manager = SapienSceneManager(
cfg.sim_freq, ray_tracing=cfg.ray_tracing
)
_ = scene_manager.initialize_circular_cameras(
num_cameras=cfg.num_cameras,
radius=cfg.camera_radius,
height=cfg.camera_height,
target_pt=cfg.camera_target_pt,
image_hw=cfg.image_hw,
fovy_deg=cfg.fovy_deg,
)
with open(cfg.layout_path, "r") as f:
layout_data = json.load(f)
layout_data: LayoutInfo = LayoutInfo.from_dict(layout_data)
scene_manager.load_assets_from_layout_file(
layout_data,
cfg.z_offset,
cfg.init_quat,
cfg.robot_name,
cfg.control_freq,
)
frames = defaultdict(list)
image_cnt = 0
for step in tqdm(range(cfg.sim_step), desc="Simulation"):
scene_manager.scene.step()
if cfg.robot_name is not None:
scene_manager.robot.reset(scene_manager.robot.init_qpos)
if step % cfg.render_interval != 0:
continue
scene_manager.scene.update_render()
image_cnt += 1
for camera in scene_manager.cameras:
camera.take_picture()
images = scene_manager.render_images(
camera, render_keys=cfg.render_keys
)
frames[camera.name].append(images)
grasp_frames = defaultdict(dict)
if cfg.robot_name is not None:
sim_steps_per_control = cfg.sim_freq // cfg.control_freq
control_timestep = 1.0 / cfg.control_freq
grasper = FrankaPandaGrasper(
scene_manager.robot,
scene_manager,
sim_steps_per_control,
control_timestep,
)
for node in layout_data.relation[
Scene3DItemEnum.MANIPULATED_OBJS.value
]:
grasp_frames[node] = grasper.render_grasp(
scene_manager.actors[node],
scene_manager.cameras,
cfg.render_keys,
)
if "Foreground" not in cfg.render_keys:
return
bg_node = layout_data.relation[Scene3DItemEnum.BACKGROUND.value]
gs_path = f"{layout_data.assets[bg_node]}/gs_model.ply"
gs_model: GaussianOperator = GaussianOperator.load_from_ply(gs_path)
x, y, z, qx, qy, qz, qw = layout_data.position[bg_node]
qx, qy, qz, qw = quaternion_multiply([qx, qy, qz, qw], cfg.init_quat)
init_pose = torch.tensor([x, y, z, qx, qy, qz, qw])
gs_model = gs_model.get_gaussians(instance_pose=init_pose)
bg_images = dict()
for camera in scene_manager.cameras:
Ks = camera.get_intrinsic_matrix()
c2w = camera.get_model_matrix()
c2w = c2w @ SIM_COORD_ALIGN
result = gs_model.render(
torch.tensor(c2w, dtype=torch.float32).to(cfg.device),
torch.tensor(Ks, dtype=torch.float32).to(cfg.device),
image_width=cfg.image_hw[1],
image_height=cfg.image_hw[0],
)
bg_images[camera.name] = result.rgb[..., ::-1]
video_frames = []
for camera in scene_manager.cameras:
# Scene rendering
for step in tqdm(range(image_cnt), desc="Rendering"):
rgba = alpha_blend_rgba(
frames[camera.name][step]["Foreground"],
bg_images[camera.name],
)
video_frames.append(np.array(rgba))
# Grasp rendering
for node in grasp_frames:
for frame in grasp_frames[node][camera.name]:
rgba = alpha_blend_rgba(
frame["Foreground"], bg_images[camera.name]
)
video_frames.append(np.array(rgba))
os.makedirs(cfg.output_dir, exist_ok=True)
video_path = f"{cfg.output_dir}/Iscene.mp4"
imageio.mimsave(video_path, video_frames, fps=30)
logger.info(f"Interactive 3D Scene Visualization saved in {video_path}")
if __name__ == "__main__":
entrypoint()


@@ -81,6 +81,7 @@ done
# Step 1: Text-to-Image
echo ${prompt_args}
eval python3 embodied_gen/scripts/text2image.py \
--prompts ${prompt_args} \
--output_root "${output_root}/images" \


@@ -617,7 +617,7 @@ class Runner:
for rgb, depth in images_cache:
depth_normalized = torch.clip(
(depth - depth_global_min)
/ (depth_global_max - depth_global_min),
/ (depth_global_max - depth_global_min + 1e-8),
0,
1,
)
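The epsilon added in this hunk guards against a degenerate depth range. A standalone check (illustrated with numpy for brevity; the training loop above uses torch):

```python
import numpy as np

# Why the added 1e-8 matters: on a sequence where the global depth range
# collapses (max == min), the original expression divides by zero.
depth = np.full((2, 2), 3.0)
d_min = d_max = 3.0
normalized = np.clip((depth - d_min) / (d_max - d_min + 1e-8), 0, 1)
assert np.isfinite(normalized).all()  # no NaN/inf from the zero range
```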


@@ -30,7 +30,7 @@ from kornia.morphology import dilation
from PIL import Image
from embodied_gen.models.sr_model import ImageRealESRGAN
from embodied_gen.utils.config import Pano2MeshSRConfig
from embodied_gen.utils.gaussian import compute_pinhole_intrinsics
from embodied_gen.utils.geometry import compute_pinhole_intrinsics
from embodied_gen.utils.log import logger
from thirdparty.pano2room.modules.geo_predictors import PanoJointPredictor
from thirdparty.pano2room.modules.geo_predictors.PanoFusionDistancePredictor import (


@@ -17,15 +17,27 @@
from dataclasses import dataclass, field
from typing import List, Optional, Union
from dataclasses_json import DataClassJsonMixin
from gsplat.strategy import DefaultStrategy, MCMCStrategy
from typing_extensions import Literal, assert_never
__all__ = [
"GptParamsConfig",
"Pano2MeshSRConfig",
"GsplatTrainConfig",
]
@dataclass
class GptParamsConfig(DataClassJsonMixin):
temperature: float = 0.1
top_p: float = 0.1
frequency_penalty: float = 0.0
presence_penalty: float = 0.0
stop: str | list[str] | None = None
max_tokens: int = 500
@dataclass
class Pano2MeshSRConfig:
mesh_file: str = "mesh_model.ply"


@@ -102,6 +102,7 @@ class LayoutInfo(DataClassJsonMixin):
tree: dict[str, list]
relation: dict[str, str | list[str]]
objs_desc: dict[str, str] = field(default_factory=dict)
objs_mapping: dict[str, str] = field(default_factory=dict)
assets: dict[str, str] = field(default_factory=dict)
quality: dict[str, str] = field(default_factory=dict)
position: dict[str, list[float]] = field(default_factory=dict)


@@ -35,7 +35,6 @@ __all__ = [
"set_random_seed",
"export_splats",
"create_splats_with_optimizers",
"compute_pinhole_intrinsics",
"resize_pinhole_intrinsics",
"restore_scene_scale_and_position",
]
@@ -265,12 +264,12 @@ def create_splats_with_optimizers(
return splats, optimizers
def compute_pinhole_intrinsics(
image_w: int, image_h: int, fov_deg: float
def compute_intrinsics_from_fovy(
image_w: int, image_h: int, fovy_deg: float
) -> np.ndarray:
fov_rad = np.deg2rad(fov_deg)
fx = image_w / (2 * np.tan(fov_rad / 2))
fy = fx # assuming square pixels
fovy_rad = np.deg2rad(fovy_deg)
fy = image_h / (2 * np.tan(fovy_rad / 2))
fx = fy * (image_w / image_h)
cx = image_w / 2
cy = image_h / 2
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
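The rename in this hunk makes the convention explicit: the vertical FOV fixes fy, and fx follows from the aspect ratio. A standalone check of the updated formula, with illustrative image dimensions:

```python
import numpy as np

# fy from the vertical FOV, fx from the aspect ratio, as in compute_intrinsics_from_fovy.
w, h, fovy_deg = 640, 480, 60.0
fy = h / (2 * np.tan(np.deg2rad(fovy_deg) / 2))
fx = fy * (w / h)
K = np.array([[fx, 0, w / 2], [0, fy, h / 2], [0, 0, 1]])
assert np.isclose(fy, 415.6922, atol=1e-3)  # 480 / (2 * tan(30 deg))
```

Note this differs from the old `compute_pinhole_intrinsics`, which derived fx from a horizontal FOV and assumed square pixels; for non-square aspect ratios the two give different intrinsics.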


@@ -0,0 +1,458 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
import os
import random
from collections import defaultdict, deque
from functools import wraps
from typing import Literal
import numpy as np
import torch
import trimesh
from matplotlib.path import Path
from pyquaternion import Quaternion
from scipy.spatial import ConvexHull
from scipy.spatial.transform import Rotation as R
from shapely.geometry import Polygon
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.log import logger
__all__ = [
"bfs_placement",
"with_seed",
"matrix_to_pose",
"pose_to_matrix",
"quaternion_multiply",
"check_reachable",
"compose_mesh_scene",
"compute_pinhole_intrinsics",
]
def matrix_to_pose(matrix: np.ndarray) -> list[float]:
"""Convert a 4x4 transformation matrix to a pose (x, y, z, qx, qy, qz, qw).
Args:
matrix (np.ndarray): 4x4 transformation matrix.
Returns:
List[float]: Pose as [x, y, z, qx, qy, qz, qw].
"""
x, y, z = matrix[:3, 3]
rot_mat = matrix[:3, :3]
quat = R.from_matrix(rot_mat).as_quat()
qx, qy, qz, qw = quat
return [x, y, z, qx, qy, qz, qw]
def pose_to_matrix(pose: list[float]) -> np.ndarray:
"""Convert pose (x, y, z, qx, qy, qz, qw) to a 4x4 transformation matrix.
Args:
pose (List[float]): Pose as [x, y, z, qx, qy, qz, qw].
Returns:
np.ndarray: 4x4 transformation matrix.
"""
x, y, z, qx, qy, qz, qw = pose
r = R.from_quat([qx, qy, qz, qw])
matrix = np.eye(4)
matrix[:3, :3] = r.as_matrix()
matrix[:3, 3] = [x, y, z]
return matrix
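The two helpers above are inverses of each other; a standalone roundtrip sketch (re-deriving the same scipy logic, not part of this commit):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

pose = [0.1, -0.2, 0.3, 0.0, 0.0, 0.7071068, 0.7071068]  # x, y, z, qx, qy, qz, qw

# pose -> 4x4 matrix, same logic as pose_to_matrix
T = np.eye(4)
T[:3, :3] = R.from_quat(pose[3:]).as_matrix()
T[:3, 3] = pose[:3]

# matrix -> pose, same logic as matrix_to_pose
back = [*T[:3, 3], *R.from_matrix(T[:3, :3]).as_quat()]
assert np.allclose(back, pose, atol=1e-6)
```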
def compute_xy_bbox(
vertices: np.ndarray, col_x: int = 0, col_y: int = 2
) -> list[float]:
x_vals = vertices[:, col_x]
y_vals = vertices[:, col_y]
return x_vals.min(), x_vals.max(), y_vals.min(), y_vals.max()
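The defaults `col_x=0, col_y=2` reflect that the meshes in this pipeline are Y-up, so X/Z span the horizontal plane. A quick check with illustrative vertices:

```python
import numpy as np

# Horizontal extent of a Y-up mesh: take columns 0 (X) and 2 (Z).
verts = np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [-1.0, 0.0, 1.0]])
x_min, x_max = verts[:, 0].min(), verts[:, 0].max()
y_min, y_max = verts[:, 2].min(), verts[:, 2].max()
assert (x_min, x_max, y_min, y_max) == (-1.0, 3.0, 1.0, 5.0)
```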
def has_iou_conflict(
new_box: list[float],
placed_boxes: list[list[float]],
iou_threshold: float = 0.0,
) -> bool:
new_min_x, new_max_x, new_min_y, new_max_y = new_box
for min_x, max_x, min_y, max_y in placed_boxes:
ix1 = max(new_min_x, min_x)
iy1 = max(new_min_y, min_y)
ix2 = min(new_max_x, max_x)
iy2 = min(new_max_y, max_y)
inter_area = max(0, ix2 - ix1) * max(0, iy2 - iy1)
if inter_area > iou_threshold:
return True
return False
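The conflict test above is a plain axis-aligned intersection-area check. A standalone sketch of the same arithmetic:

```python
# Boxes are [min_x, max_x, min_y, max_y], as in has_iou_conflict.
def overlap_area(a: list[float], b: list[float]) -> float:
    ix = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[2], b[2]))
    return ix * iy

placed = [[0.0, 1.0, 0.0, 1.0]]
# A unit box shifted by half its width overlaps a quarter of the original.
assert overlap_area([0.5, 1.5, 0.5, 1.5], placed[0]) == 0.25
# Disjoint boxes intersect with zero area, so no conflict at iou_threshold = 0.
assert overlap_area([2.0, 3.0, 2.0, 3.0], placed[0]) == 0.0
```

Note that despite the parameter name, `iou_threshold` is compared against the raw intersection area, not a normalized IoU ratio.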
def with_seed(seed_attr_name: str = "seed"):
"""A parameterized decorator that temporarily sets the random seed."""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
seed = kwargs.get(seed_attr_name, None)
if seed is not None:
py_state = random.getstate()
np_state = np.random.get_state()
torch_state = torch.get_rng_state()
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
try:
result = func(*args, **kwargs)
finally:
random.setstate(py_state)
np.random.set_state(np_state)
torch.set_rng_state(torch_state)
return result
else:
return func(*args, **kwargs)
return wrapper
return decorator
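A pared-down sketch of the same save/seed/restore pattern, handling only `random` and `numpy` state (the decorator above additionally snapshots torch):

```python
import random
from functools import wraps

import numpy as np

def with_seed_lite(seed_key: str = "seed"):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            seed = kwargs.get(seed_key)
            if seed is None:
                return func(*args, **kwargs)
            py_state, np_state = random.getstate(), np.random.get_state()
            random.seed(seed)
            np.random.seed(seed)
            try:
                return func(*args, **kwargs)
            finally:  # restore outer RNG state even if func raises
                random.setstate(py_state)
                np.random.set_state(np_state)
        return wrapper
    return decorator

@with_seed_lite("seed")
def sample(seed=None):
    return random.random()

assert sample(seed=42) == sample(seed=42)  # deterministic under the same seed
```

The `try/finally` is the important part: the caller's RNG stream continues unaffected after a seeded call.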
def compute_convex_hull_path(
vertices: np.ndarray,
z_threshold: float = 0.05,
interp_per_edge: int = 3,
margin: float = -0.02,
) -> Path:
top_vertices = vertices[
vertices[:, 1] > vertices[:, 1].max() - z_threshold
]
top_xy = top_vertices[:, [0, 2]]
if len(top_xy) < 3:
raise ValueError("Not enough points to form a convex hull")
hull = ConvexHull(top_xy)
hull_points = top_xy[hull.vertices]
polygon = Polygon(hull_points)
polygon = polygon.buffer(margin)
hull_points = np.array(polygon.exterior.coords)
dense_points = []
for i in range(len(hull_points)):
p1 = hull_points[i]
p2 = hull_points[(i + 1) % len(hull_points)]
for t in np.linspace(0, 1, interp_per_edge, endpoint=False):
pt = (1 - t) * p1 + t * p2
dense_points.append(pt)
return Path(np.array(dense_points), closed=True)
def find_parent_node(node: str, tree: dict) -> str | None:
for parent, children in tree.items():
if any(child[0] == node for child in children):
return parent
return None
def all_corners_inside(hull: Path, box: list, threshold: int = 3) -> bool:
x1, x2, y1, y2 = box
corners = [[x1, y1], [x2, y1], [x1, y2], [x2, y2]]
num_inside = sum(hull.contains_point(c) for c in corners)
return num_inside >= threshold
def compute_axis_rotation_quat(
axis: Literal["x", "y", "z"], angle_rad: float
) -> list[float]:
if axis.lower() == 'x':
q = Quaternion(axis=[1, 0, 0], angle=angle_rad)
elif axis.lower() == 'y':
q = Quaternion(axis=[0, 1, 0], angle=angle_rad)
elif axis.lower() == 'z':
q = Quaternion(axis=[0, 0, 1], angle=angle_rad)
else:
raise ValueError(f"Unsupported axis '{axis}', must be one of x, y, z")
return [q.x, q.y, q.z, q.w]
def quaternion_multiply(
init_quat: list[float], rotate_quat: list[float]
) -> list[float]:
qx, qy, qz, qw = init_quat
q1 = Quaternion(w=qw, x=qx, y=qy, z=qz)
qx, qy, qz, qw = rotate_quat
q2 = Quaternion(w=qw, x=qx, y=qy, z=qz)
quat = q2 * q1
return [quat.x, quat.y, quat.z, quat.w]
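An equivalent composition expressed with scipy instead of pyquaternion (a sketch, not part of this commit); all quaternions are xyzw, and `rotate_quat` is applied after `init_quat` (q2 * q1):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

init_quat = [0.0, 0.0, 0.0, 1.0]                                # identity
rotate_quat = [0.0, 0.0, np.sin(np.pi / 4), np.cos(np.pi / 4)]  # 90 deg about z
out = (R.from_quat(rotate_quat) * R.from_quat(init_quat)).as_quat()
assert np.allclose(out, rotate_quat, atol=1e-6)  # rotating identity gives rotate_quat
```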
def check_reachable(
base_xyz: np.ndarray,
reach_xyz: np.ndarray,
min_reach: float = 0.25,
max_reach: float = 0.85,
) -> bool:
"""Check if the target point is within the reachable range."""
distance = np.linalg.norm(reach_xyz - base_xyz)
return min_reach < distance < max_reach
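The reachability test is just a distance band around the robot base; points closer than `min_reach` or farther than `max_reach` are rejected. A standalone sketch with illustrative points:

```python
import numpy as np

base = np.array([0.0, 0.0, 0.0])

def reachable(pt, min_reach=0.25, max_reach=0.85):
    # Same band check as check_reachable above.
    return min_reach < np.linalg.norm(pt - base) < max_reach

assert reachable(np.array([0.5, 0.0, 0.0]))      # inside the band
assert not reachable(np.array([0.1, 0.0, 0.0]))  # too close to the base
assert not reachable(np.array([1.0, 0.0, 0.0]))  # beyond max reach
```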
@with_seed("seed")
def bfs_placement(
layout_info: LayoutInfo,
floor_margin: float = 0,
beside_margin: float = 0.1,
max_attempts: int = 3000,
rotate_objs: bool = True,
rotate_bg: bool = True,
limit_reach_range: bool = True,
robot_dim: float = 0.12,
seed: int | None = None,
) -> LayoutInfo:
object_mapping = layout_info.objs_mapping
position = {} # node: [x, y, z, qx, qy, qz, qw]
parent_bbox_xy = {}
placed_boxes_map = defaultdict(list)
mesh_info = defaultdict(dict)
robot_node = layout_info.relation[Scene3DItemEnum.ROBOT.value]
for node in object_mapping:
if object_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
bg_quat = (
compute_axis_rotation_quat(
axis="y",
angle_rad=np.random.uniform(0, 2 * np.pi),
)
if rotate_bg
else [0, 0, 0, 1]
)
bg_quat = [round(q, 4) for q in bg_quat]
continue
mesh_path = (
f"{layout_info.assets[node]}/mesh/{node.replace(' ', '_')}.obj"
)
mesh_info[node]["path"] = mesh_path
mesh = trimesh.load(mesh_path)
vertices = mesh.vertices
z1 = np.percentile(vertices[:, 1], 1)
z2 = np.percentile(vertices[:, 1], 99)
if object_mapping[node] == Scene3DItemEnum.CONTEXT.value:
object_quat = [0, 0, 0, 1]
mesh_info[node]["surface"] = compute_convex_hull_path(vertices)
# Place the robot at the edge of the CONTEXT surface.
x, y = random.choice(mesh_info[node]["surface"].vertices)
theta = np.arctan2(y, x)
quat_initial = Quaternion(axis=[0, 0, 1], angle=theta)
quat_extra = Quaternion(axis=[0, 0, 1], angle=np.pi)
quat = quat_extra * quat_initial
_pose = [x, y, z2 - z1, quat.x, quat.y, quat.z, quat.w]
position[robot_node] = [round(v, 4) for v in _pose]
node_box = [
x - robot_dim / 2,
x + robot_dim / 2,
y - robot_dim / 2,
y + robot_dim / 2,
]
placed_boxes_map[node].append(node_box)
elif rotate_objs:
# For manipulated and distractor objects, apply random rotation
angle_rad = np.random.uniform(0, 2 * np.pi)
object_quat = compute_axis_rotation_quat(
axis="y", angle_rad=angle_rad
)
object_quat_scipy = np.roll(object_quat, 1) # [w, x, y, z]
rotation = R.from_quat(object_quat_scipy).as_matrix()
vertices = np.dot(mesh.vertices, rotation.T)
z1 = np.percentile(vertices[:, 1], 1)
z2 = np.percentile(vertices[:, 1], 99)
x1, x2, y1, y2 = compute_xy_bbox(vertices)
mesh_info[node]["pose"] = [x1, x2, y1, y2, z1, z2, *object_quat]
mesh_info[node]["area"] = max(1e-5, (x2 - x1) * (y2 - y1))
root = list(layout_info.tree.keys())[0]
queue = deque([((root, None), layout_info.tree.get(root, []))])
while queue:
(node, relation), children = queue.popleft()
if node not in object_mapping:
continue
if object_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
position[node] = [0, 0, floor_margin, *bg_quat]
else:
x1, x2, y1, y2, z1, z2, qx, qy, qz, qw = mesh_info[node]["pose"]
if object_mapping[node] == Scene3DItemEnum.CONTEXT.value:
position[node] = [0, 0, -round(z1, 4), qx, qy, qz, qw]
parent_bbox_xy[node] = [x1, x2, y1, y2, z1, z2]
elif object_mapping[node] in [
Scene3DItemEnum.MANIPULATED_OBJS.value,
Scene3DItemEnum.DISTRACTOR_OBJS.value,
]:
parent_node = find_parent_node(node, layout_info.tree)
parent_pos = position[parent_node]
(
p_x1,
p_x2,
p_y1,
p_y2,
p_z1,
p_z2,
) = parent_bbox_xy[parent_node]
obj_dx = x2 - x1
obj_dy = y2 - y1
hull_path = mesh_info[parent_node].get("surface")
for _ in range(max_attempts):
node_x1 = random.uniform(p_x1, p_x2 - obj_dx)
node_y1 = random.uniform(p_y1, p_y2 - obj_dy)
node_box = [
node_x1,
node_x1 + obj_dx,
node_y1,
node_y1 + obj_dy,
]
if hull_path and not all_corners_inside(
hull_path, node_box
):
continue
# Make sure the manipulated object is reachable by robot.
if (
limit_reach_range
and object_mapping[node]
== Scene3DItemEnum.MANIPULATED_OBJS.value
):
cx = parent_pos[0] + node_box[0] + obj_dx / 2
cy = parent_pos[1] + node_box[2] + obj_dy / 2
cz = parent_pos[2] + p_z2 - z1
robot_pose = position[robot_node][:3]
if not check_reachable(
base_xyz=np.array(robot_pose),
reach_xyz=np.array([cx, cy, cz]),
):
continue
if not has_iou_conflict(
node_box, placed_boxes_map[parent_node]
):
z_offset = 0
break
else:
logger.warning(
f"Cannot place {node} on {parent_node} without overlap"
f" after {max_attempts} attempts; placing it beside {parent_node} instead."
)
for _ in range(max_attempts):
node_x1 = random.choice(
[
random.uniform(
p_x1 - obj_dx - beside_margin,
p_x1 - obj_dx,
),
random.uniform(p_x2, p_x2 + beside_margin),
]
)
node_y1 = random.choice(
[
random.uniform(
p_y1 - obj_dy - beside_margin,
p_y1 - obj_dy,
),
random.uniform(p_y2, p_y2 + beside_margin),
]
)
node_box = [
node_x1,
node_x1 + obj_dx,
node_y1,
node_y1 + obj_dy,
]
z_offset = -(parent_pos[2] + p_z2)
if not has_iou_conflict(
node_box, placed_boxes_map[parent_node]
):
break
placed_boxes_map[parent_node].append(node_box)
abs_cx = parent_pos[0] + node_box[0] + obj_dx / 2
abs_cy = parent_pos[1] + node_box[2] + obj_dy / 2
abs_cz = parent_pos[2] + p_z2 - z1 + z_offset
position[node] = [
round(v, 4)
for v in [abs_cx, abs_cy, abs_cz, qx, qy, qz, qw]
]
parent_bbox_xy[node] = [x1, x2, y1, y2, z1, z2]
sorted_children = sorted(
children, key=lambda x: -mesh_info[x[0]].get("area", 0)
)
for child, rel in sorted_children:
queue.append(((child, rel), layout_info.tree.get(child, [])))
layout_info.position = position
return layout_info
def compose_mesh_scene(
layout_info: LayoutInfo, out_scene_path: str, with_bg: bool = False
) -> None:
object_mapping = Scene3DItemEnum.object_mapping(layout_info.relation)
scene = trimesh.Scene()
for node in layout_info.assets:
if object_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
mesh_path = f"{layout_info.assets[node]}/mesh_model.ply"
if not with_bg:
continue
else:
mesh_path = (
f"{layout_info.assets[node]}/mesh/{node.replace(' ', '_')}.obj"
)
mesh = trimesh.load(mesh_path)
offset = np.array(layout_info.position[node])[[0, 2, 1]]
mesh.vertices += offset
scene.add_geometry(mesh, node_name=node)
os.makedirs(os.path.dirname(out_scene_path), exist_ok=True)
scene.export(out_scene_path)
logger.info(f"Composed interactive 3D layout saved in {out_scene_path}")
return
def compute_pinhole_intrinsics(
image_w: int, image_h: int, fov_deg: float
) -> np.ndarray:
fov_rad = np.deg2rad(fov_deg)
fx = image_w / (2 * np.tan(fov_rad / 2))
fy = fx # assuming square pixels
cx = image_w / 2
cy = image_h / 2
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
return K


@@ -166,7 +166,7 @@ def combine_images_to_grid(
images: list[str | Image.Image],
cat_row_col: tuple[int, int] = None,
target_wh: tuple[int, int] = (512, 512),
) -> list[str | Image.Image]:
) -> list[Image.Image]:
n_images = len(images)
if n_images == 1:
return images
@@ -377,6 +377,42 @@ def parse_text_prompts(prompts: list[str]) -> list[str]:
return prompts
def alpha_blend_rgba(
fg_image: Union[str, Image.Image, np.ndarray],
bg_image: Union[str, Image.Image, np.ndarray],
) -> Image.Image:
"""Alpha blends a foreground RGBA image over a background RGBA image.
Args:
fg_image: Foreground image. Can be a file path (str), a PIL Image,
or a NumPy ndarray.
bg_image: Background image. Can be a file path (str), a PIL Image,
or a NumPy ndarray.
Returns:
A PIL Image representing the alpha-blended result in RGBA mode.
"""
if isinstance(fg_image, str):
fg_image = Image.open(fg_image)
elif isinstance(fg_image, np.ndarray):
fg_image = Image.fromarray(fg_image)
if isinstance(bg_image, str):
bg_image = Image.open(bg_image)
elif isinstance(bg_image, np.ndarray):
bg_image = Image.fromarray(bg_image)
if fg_image.size != bg_image.size:
raise ValueError(
f"Image sizes do not match: {fg_image.size} vs. {bg_image.size}."
)
fg = fg_image.convert("RGBA")
bg = bg_image.convert("RGBA")
return Image.alpha_composite(bg, fg)
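Two boundary cases of the compositing above make the behavior concrete: a fully transparent foreground leaves the background untouched, and a fully opaque one replaces it. A standalone sketch with tiny synthetic images:

```python
import numpy as np
from PIL import Image

bg = Image.fromarray(np.full((4, 4, 4), [0, 0, 255, 255], dtype=np.uint8))  # opaque blue
fg_clear = Image.fromarray(np.zeros((4, 4, 4), dtype=np.uint8))             # alpha 0
fg_solid = Image.fromarray(np.full((4, 4, 4), [255, 0, 0, 255], dtype=np.uint8))

# Same Image.alpha_composite call as alpha_blend_rgba uses internally.
assert np.array_equal(np.array(Image.alpha_composite(bg, fg_clear)), np.array(bg))
assert np.array_equal(np.array(Image.alpha_composite(bg, fg_solid)), np.array(fg_solid))
```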
def check_object_edge_truncated(
mask: np.ndarray, edge_threshold: int = 5
) -> bool:
@@ -400,8 +436,15 @@ def check_object_edge_truncated(
if __name__ == "__main__":
merge_video_video(
"outputs/imageto3d/room_bottle7/room_bottle_007/URDF_room_bottle_007/mesh_glo_normal.mp4", # noqa
"outputs/imageto3d/room_bottle7/room_bottle_007/URDF_room_bottle_007/mesh.mp4", # noqa
"merge.mp4",
)
image_paths = [
"outputs/layouts_sim/task_0000/images/pen.png",
"outputs/layouts_sim/task_0000/images/notebook.png",
"outputs/layouts_sim/task_0000/images/mug.png",
"outputs/layouts_sim/task_0000/images/lamp.png",
"outputs/layouts_sim2/task_0014/images/cloth.png", # TODO
]
for image_path in image_paths:
image = cv2.imread(image_path, cv2.IMREAD_UNCHANGED)
mask = image[..., -1]
flag = check_object_edge_truncated(mask)
print(flag, image_path)


@@ -0,0 +1,706 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
import logging
import os
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Literal
import mplib
import numpy as np
import sapien.core as sapien
import torch
from PIL import Image, ImageColor
from scipy.spatial.transform import Rotation as R
from tqdm import tqdm
from embodied_gen.data.utils import DiffrastRender
from embodied_gen.utils.enum import LayoutInfo, Scene3DItemEnum
from embodied_gen.utils.geometry import quaternion_multiply
logger = logging.getLogger(__name__)
from mani_skill.examples.motionplanning.panda.utils import (
compute_grasp_info_by_obb,
)
from mani_skill.utils.geometry.trimesh_utils import get_component_mesh
COLORMAP = list(set(ImageColor.colormap.values()))
COLOR_PALETTE = np.array(
[ImageColor.getrgb(c) for c in COLORMAP], dtype=np.uint8
)
SIM_COORD_ALIGN = np.array(
[
[1.0, 0.0, 0.0, 0.0],
[0.0, -1.0, 0.0, 0.0],
[0.0, 0.0, -1.0, 0.0],
[0.0, 0.0, 0.0, 1.0],
]
) # Aligns the SAPIEN/MuJoCo camera coordinate convention with the world coordinate system.
__all__ = [
"SIM_COORD_ALIGN",
"FrankaPandaGrasper",
"SimpleGrasper",
]
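`SIM_COORD_ALIGN` is applied to camera-to-world matrices (as `c2w @ SIM_COORD_ALIGN` in the rendering loop); being a pure diagonal, it simply flips the camera's Y and Z axes. A minimal check of that effect:

```python
import numpy as np

# Same diagonal as SIM_COORD_ALIGN above.
align = np.diag([1.0, -1.0, -1.0, 1.0])
c2w = np.eye(4)  # identity camera pose for illustration
aligned = c2w @ align
assert np.array_equal(aligned[:3, :3], np.diag([1.0, -1.0, -1.0]))
```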
class SapienSceneManager:
"""A class to manage a SAPIEN simulation scene, its cameras, and actors."""
def __init__(
self, sim_freq: int, ray_tracing: bool, device: str = "cuda"
) -> None:
self.sim_freq = sim_freq
self.ray_tracing = ray_tracing
self.device = device
self.renderer = sapien.SapienRenderer()
self.scene = self._setup_scene()
self.cameras: list[sapien.render.RenderCameraComponent] = []
self.actors: dict[str, sapien.pysapien.Entity] = {}
def _setup_scene(self) -> sapien.Scene:
"""Set up the SAPIEN scene with lighting and ground."""
# Ray tracing settings
if self.ray_tracing:
sapien.render.set_camera_shader_dir("rt")
sapien.render.set_ray_tracing_samples_per_pixel(64)
sapien.render.set_ray_tracing_path_depth(10)
sapien.render.set_ray_tracing_denoiser("oidn")
# TODO: Support GPU.
# import sapien.physx as physx
# physx.enable_gpu()
# scene = sapien.Scene(systems=[sapien.physx.PhysxGpuSystem(), sapien.render.RenderSystem()])
scene = sapien.Scene()
scene.set_timestep(1 / self.sim_freq)
# Add lighting
scene.set_ambient_light([0.2, 0.2, 0.2])
scene.add_directional_light(
direction=[0, 1, -1],
color=[1.5, 1.45, 1.4],
shadow=True,
shadow_map_size=2048,
)
scene.add_directional_light(
direction=[0, -0.5, 1], color=[0.8, 0.8, 0.85], shadow=False
)
scene.add_directional_light(
direction=[0, -1, 1], color=[1.0, 1.0, 1.0], shadow=False
)
ground_material = self.renderer.create_material()
ground_material.base_color = [0.5, 0.5, 0.5, 1] # rgba, gray
ground_material.roughness = 0.7
ground_material.metallic = 0.0
scene.add_ground(0, render_material=ground_material)
return scene
def step_action(
self,
action: torch.Tensor,
sim_steps_per_control: int,
cameras: list[sapien.render.RenderCameraComponent],
render_keys: list[str],
) -> dict:
self.robot.set_action(action)
frames = defaultdict(list)
for _ in range(sim_steps_per_control):
self.scene.step()
self.scene.update_render()
for camera in cameras:
camera.take_picture()
images = self.render_images(camera, render_keys=render_keys)
frames[camera.name].append(images)
return frames
def load_exists_robot(
self,
robot_name: str,
pose: sapien.Pose,
control_freq: int = 20,
robot_init_qpos_noise: float = 0.02,
control_mode: str = "pd_joint_pos",
backend_str: tuple[str, str] = ("cpu", "gpu"),
):
from mani_skill.agents import REGISTERED_AGENTS
from mani_skill.envs.scene import ManiSkillScene
from mani_skill.envs.utils.system.backend import (
parse_sim_and_render_backend,
)
if robot_name == "franka":
robot_name = "panda" # remapping
if robot_name not in REGISTERED_AGENTS:
raise KeyError(
f"Robot `{robot_name}` not registered, choose from {list(REGISTERED_AGENTS.keys())}."
)
ROBOT_CLS = REGISTERED_AGENTS[robot_name].agent_cls
# backend = parse_sim_and_render_backend("physx_cuda", "gpu")
backend = parse_sim_and_render_backend(*backend_str)
scene = ManiSkillScene(
[self.scene], device=backend_str[0], backend=backend
)
robot = ROBOT_CLS(
scene=scene,
control_freq=control_freq,
control_mode=control_mode,
initial_pose=pose,
)
# Set robot initial joint angles in radians (joint0 to joint6, plus 2 finger joints).
qpos = np.array(
[
0.0,
np.pi / 8,
0,
-np.pi * 3 / 8,
0,
np.pi * 3 / 4,
np.pi / 4,
0.04,
0.04,
]
)
qpos = (
np.random.normal(
0, robot_init_qpos_noise, (len(scene.sub_scenes), len(qpos))
)
+ qpos
)
qpos[:, -2:] = 0.04
robot.reset(qpos)
robot.init_qpos = robot.robot.qpos
robot.controller.controllers["gripper"].reset()
return robot
def load_actor_from_urdf(
self,
file_path: str,
pose: sapien.Pose,
use_static: bool = False,
update_mass: bool = False,
) -> sapien.pysapien.Entity:
"""Load an actor from a URDF file into the scene.
Args:
file_path (str): Path to the URDF file.
pose (sapien.Pose): Pose of the actor.
use_static (bool): Whether to create a static actor.
update_mass (bool): Whether to update the actor's mass from the URDF.
Returns:
sapien.pysapien.Entity: The loaded actor.
"""
tree = ET.parse(file_path)
root = tree.getroot()
node_name = root.get("name")
file_dir = os.path.dirname(file_path)
visual_file = root.find('.//visual/geometry/mesh').get("filename")
collision_file = root.find('.//collision/geometry/mesh').get(
"filename"
)
visual_file = os.path.join(file_dir, visual_file)
collision_file = os.path.join(file_dir, collision_file)
static_fric = root.find('.//collision/gazebo/mu1').text
dynamic_fric = root.find('.//collision/gazebo/mu2').text
material = self.scene.create_physical_material(
static_friction=np.clip(float(static_fric), 0.1, 0.7),
dynamic_friction=np.clip(float(dynamic_fric), 0.1, 0.6),
restitution=0.05,
)
builder = self.scene.create_actor_builder()
body_type = "static" if use_static else "dynamic"
builder.set_physx_body_type(body_type)
builder.add_multiple_convex_collisions_from_file(
collision_file, material=material
)
builder.add_visual_from_file(visual_file)
actor = builder.build(name=node_name)
actor.set_name(node_name)
actor.set_pose(pose)
if update_mass and hasattr(actor.components[1], "mass"):
node_mass = float(root.find('.//inertial/mass').get("value"))
actor.components[1].set_mass(node_mass)
return actor
def create_camera(
self,
cam_name: str,
pose: sapien.Pose,
image_hw: tuple[int, int],
fovy_deg: float,
) -> sapien.render.RenderCameraComponent:
"""Create a single camera in the scene.
Args:
cam_name (str): Name of the camera.
pose (sapien.Pose): Camera pose p=(x, y, z), q=(w, x, y, z)
image_hw (Tuple[int, int]): Image resolution (height, width) for cameras.
fovy_deg (float): Field of view in degrees for cameras.
Returns:
sapien.render.RenderCameraComponent: The created camera.
"""
cam_actor = self.scene.create_actor_builder().build_kinematic()
cam_actor.set_pose(pose)
camera = self.scene.add_mounted_camera(
name=cam_name,
mount=cam_actor,
pose=sapien.Pose(p=[0, 0, 0], q=[1, 0, 0, 0]),
width=image_hw[1],
height=image_hw[0],
fovy=np.deg2rad(fovy_deg),
near=0.01,
far=100,
)
self.cameras.append(camera)
return camera
def initialize_circular_cameras(
self,
num_cameras: int,
radius: float,
height: float,
target_pt: list[float],
image_hw: tuple[int, int],
fovy_deg: float,
) -> list[sapien.render.RenderCameraComponent]:
"""Initialize multiple cameras arranged in a circle.
Args:
num_cameras (int): Number of cameras to create.
radius (float): Radius of the camera circle.
height (float): Fixed Z-coordinate of the cameras.
target_pt (list[float]): 3D point (x, y, z) that cameras look at.
image_hw (Tuple[int, int]): Image resolution (height, width) for cameras.
fovy_deg (float): Field of view in degrees for cameras.
Returns:
List[sapien.render.RenderCameraComponent]: List of created cameras.
"""
angle_step = 2 * np.pi / num_cameras
world_up_vec = np.array([0.0, 0.0, 1.0])
target_pt = np.array(target_pt)
for i in range(num_cameras):
angle = i * angle_step
cam_x = radius * np.cos(angle)
cam_y = radius * np.sin(angle)
cam_z = height
eye_pos = [cam_x, cam_y, cam_z]
forward_vec = target_pt - eye_pos
forward_vec = forward_vec / np.linalg.norm(forward_vec)
temp_right_vec = np.cross(forward_vec, world_up_vec)
if np.linalg.norm(temp_right_vec) < 1e-6:
temp_right_vec = np.array([1.0, 0.0, 0.0])
if np.abs(np.dot(temp_right_vec, forward_vec)) > 0.99:
temp_right_vec = np.array([0.0, 1.0, 0.0])
right_vec = temp_right_vec / np.linalg.norm(temp_right_vec)
up_vec = np.cross(right_vec, forward_vec)
rotation_matrix = np.array([forward_vec, -right_vec, up_vec]).T
rot = R.from_matrix(rotation_matrix)
scipy_quat = rot.as_quat() # (x, y, z, w)
quat = [
scipy_quat[3],
scipy_quat[0],
scipy_quat[1],
scipy_quat[2],
] # (w, x, y, z)
self.create_camera(
f"camera_{i}",
sapien.Pose(p=eye_pos, q=quat),
image_hw,
fovy_deg,
)
return self.cameras
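The look-at math in the loop above can be checked in isolation, without SAPIEN (a sketch with illustrative eye/target values):

```python
import numpy as np

# Build the camera rotation as in initialize_circular_cameras: forward toward
# the target, right from the world up vector, columns assembled [fwd, -right, up].
eye = np.array([0.9, 0.0, 1.1])
target = np.array([0.0, 0.0, 0.9])
fwd = (target - eye) / np.linalg.norm(target - eye)
right = np.cross(fwd, [0.0, 0.0, 1.0])
right /= np.linalg.norm(right)
up = np.cross(right, fwd)
Rm = np.array([fwd, -right, up]).T
assert np.allclose(Rm.T @ Rm, np.eye(3), atol=1e-6)  # orthonormal rotation basis
```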
def render_images(
self,
camera: sapien.render.RenderCameraComponent,
render_keys: list[
Literal[
"Color",
"Segmentation",
"Normal",
"Mask",
"Depth",
"Foreground",
]
] = None,
) -> dict[str, Image.Image]:
"""Render images from a given camera.
Args:
camera (sapien.render.RenderCameraComponent): The camera to render from.
render_keys (List[str]): Types of images to render (e.g., Color, Segmentation).
Returns:
Dict[str, Image.Image]: Dictionary of rendered images.
"""
if render_keys is None:
render_keys = [
"Color",
"Segmentation",
"Normal",
"Mask",
"Depth",
"Foreground",
]
results: dict[str, Image.Image] = {}
if "Color" in render_keys:
color = camera.get_picture("Color")
color_rgb = (np.clip(color[..., :3], 0, 1) * 255).astype(np.uint8)
results["Color"] = Image.fromarray(color_rgb)
if "Mask" in render_keys:
color = camera.get_picture("Color")  # fetch here so "Mask" works without "Color"
alpha = (np.clip(color[..., 3], 0, 1) * 255).astype(np.uint8)
results["Mask"] = Image.fromarray(alpha)
if "Segmentation" in render_keys:
seg_labels = camera.get_picture("Segmentation")
label0 = seg_labels[..., 0].astype(np.uint8)
seg_color = COLOR_PALETTE[label0]
results["Segmentation"] = Image.fromarray(seg_color)
if "Foreground" in render_keys:
seg_labels = camera.get_picture("Segmentation")
label0 = seg_labels[..., 0]
mask = np.where((label0 > 1), 255, 0).astype(np.uint8)
color = camera.get_picture("Color")
color_rgb = (np.clip(color[..., :3], 0, 1) * 255).astype(np.uint8)
foreground = np.concatenate([color_rgb, mask[..., None]], axis=-1)
results["Foreground"] = Image.fromarray(foreground)
if "Normal" in render_keys:
normal = camera.get_picture("Normal")[..., :3]
normal_img = (((normal + 1) / 2) * 255).astype(np.uint8)
results["Normal"] = Image.fromarray(normal_img)
if "Depth" in render_keys:
position_map = camera.get_picture("Position")
depth = -position_map[..., 2]
color = camera.get_picture("Color")  # fetch here so "Depth" works without "Color"
alpha = torch.tensor(color[..., 3], dtype=torch.float32)
norm_depth = DiffrastRender.normalize_map_by_mask(
torch.tensor(depth), alpha
)
depth_img = (norm_depth * 255).to(torch.uint8).numpy()
results["Depth"] = Image.fromarray(depth_img)
return results
def load_assets_from_layout_file(
self,
layout: LayoutInfo,
z_offset: float = 0.0,
init_quat: list[float] = [0, 0, 0, 1],
robot_name: str = None,
control_freq: int = 40,
robot_init_qpos_noise: float = 0,
) -> None:
"""Load assets from `EmbodiedGen` layout-gen output and create actors in the scene.
Args:
layout (LayoutInfo): The layout information data.
z_offset (float): Offset to apply to the Z-coordinate of non-context objects.
init_quat (List[float]): Initial quaternion (x, y, z, w) for orientation adjustment.
robot_name (str): Name of the robot to load; if None, no robot is loaded.
control_freq (int): Control frequency for the robot, default is 40.
robot_init_qpos_noise (float): Noise to add to the robot's initial joint positions, default is 0.
"""
for node in layout.assets:
file_dir = layout.assets[node]
file_name = f"{node.replace(' ', '_')}.urdf"
urdf_file = os.path.join(file_dir, file_name)
if layout.objs_mapping[node] == Scene3DItemEnum.BACKGROUND.value:
continue
position = layout.position[node].copy()
if layout.objs_mapping[node] != Scene3DItemEnum.CONTEXT.value:
position[2] += z_offset
use_static = (
layout.relation.get(Scene3DItemEnum.CONTEXT.value, None)
== node
)
# Combine initial quaternion with object quaternion
x, y, z, qx, qy, qz, qw = position
qx, qy, qz, qw = quaternion_multiply([qx, qy, qz, qw], init_quat)
actor = self.load_actor_from_urdf(
urdf_file,
sapien.Pose(p=[x, y, z], q=[qw, qx, qy, qz]),
use_static=use_static,
update_mass=False,
)
self.actors[node] = actor
if robot_name is not None:
robot_node = layout.relation[Scene3DItemEnum.ROBOT.value]
x, y, z, qx, qy, qz, qw = layout.position[robot_node]
delta_z = 0.002 # Add small offset to avoid collision.
pose = sapien.Pose([x, y, z + delta_z], [qw, qx, qy, qz])
self.robot = self.load_exists_robot(
robot_name, pose, control_freq, robot_init_qpos_noise
)
return
class FrankaPandaGrasper(object):
"""Grasper planner for Franka Panda robot arm.
Args:
robot: The Panda agent, see `mani_skill.agents.robots.panda.panda`.
scene: The `SapienSceneManager` object.
sim_steps_per_control: Number of simulation steps per control step.
control_timestep: Time step for planning.
joint_vel_limits: Velocity limit for all joints (default: 0.9)
joint_acc_limits: Acceleration limit for all joints (default: 0.9)
finger_length: Length of the robot's fingers (default: 0.025)
"""
def __init__(
self,
robot,
scene: SapienSceneManager,
sim_steps_per_control: int,
control_timestep: float,
joint_vel_limits: float = 0.9,
joint_acc_limits: float = 0.9,
finger_length: float = 0.025,
) -> None:
self.agent = robot
self.robot = robot.robot
self.scene = scene
self.sim_steps_per_control = sim_steps_per_control
self.control_timestep = control_timestep
self.joint_vel_limits = joint_vel_limits
self.joint_acc_limits = joint_acc_limits
self.finger_length = finger_length
self.planner = self._setup_planner()
def _setup_planner(self) -> mplib.Planner:
link_names = [link.get_name() for link in self.robot.get_links()]
joint_names = [
joint.get_name() for joint in self.robot.get_active_joints()
]
planner = mplib.Planner(
urdf=self.agent.urdf_path,
srdf=self.agent.urdf_path.replace(".urdf", ".srdf"),
user_link_names=link_names,
user_joint_names=joint_names,
move_group="panda_hand_tcp",
joint_vel_limits=np.ones(7) * self.joint_vel_limits,
joint_acc_limits=np.ones(7) * self.joint_acc_limits,
)
planner.set_base_pose(self.robot.pose.raw_pose[0].tolist())
return planner
def update_state(
self,
result: dict,
gripper_state: Literal[-1, 1],
sim_steps_per_control: int,
cameras: list[sapien.render.RenderCameraComponent],
render_keys: list[str],
) -> dict[str, list[Image.Image]]:
n_step = len(result["position"])
total_frames = defaultdict(list)
for i in tqdm(range(n_step), desc="Grasping"):
qpos = result["position"][min(i, n_step - 1)]
action = np.hstack([qpos, gripper_state])[None, ...]
action = torch.from_numpy(action).float()
frames = self.scene.step_action(
action, sim_steps_per_control, cameras, render_keys
)
for camera in cameras:
total_frames[camera.name].extend(frames[camera.name])
return total_frames
def control_gripper(
self,
sim_steps_per_control: int,
cameras: list[sapien.render.RenderCameraComponent],
render_keys: list[str],
gripper_state: Literal[-1, 1],
n_step: int = 10,
) -> dict[str, list[Image.Image]]:
total_frames = defaultdict(list)
qpos = self.robot.get_qpos()[0, :-2].cpu().numpy()
for _ in range(n_step):
action = np.hstack([qpos, gripper_state])[None, ...]
action = torch.from_numpy(action).float()
frames = self.scene.step_action(
action, sim_steps_per_control, cameras, render_keys
)
for camera in cameras:
total_frames[camera.name].extend(frames[camera.name])
return total_frames
def move_to_pose(
self,
pose: sapien.Pose,
sim_steps_per_control: int,
control_timestep: float,
cameras: list[sapien.render.RenderCameraComponent],
render_keys: list[str],
gripper_state: Literal[-1, 1],
use_point_cloud: bool = False,
n_max_step: int = 100,
) -> dict[str, list[Image.Image]]:
# First try screw motion planning
result = self.planner.plan_screw(
np.concatenate([pose.p, pose.q]),
self.robot.get_qpos().cpu().numpy()[0],
time_step=control_timestep,
use_point_cloud=use_point_cloud,
)
# Fallback to standard planning if screw fails
if result["status"] != "Success":
result = self.planner.plan_qpos_to_pose(
np.concatenate([pose.p, pose.q]),
self.robot.get_qpos().cpu().numpy()[0],
time_step=control_timestep,
use_point_cloud=use_point_cloud,
)
if result["status"] != "Success":
return defaultdict(list)
for key in result:
if key in ["status", "duration"]:
continue
sample_ratio = (len(result[key]) // n_max_step) + 1
result[key] = result[key][::sample_ratio]
frames = self.update_state(
result, gripper_state, sim_steps_per_control, cameras, render_keys
)
return frames
def render_grasp(
self,
actor: sapien.pysapien.Entity,
cameras: list[sapien.render.RenderCameraComponent],
render_keys: list[
Literal[
"Color",
"Foreground",
"Segmentation",
"Normal",
"Mask",
"Depth",
]
],
reach_target_only: bool = True,
) -> dict[str, list[Image.Image]]:
physx_rigid = actor.components[1]
mesh = get_component_mesh(physx_rigid, to_world_frame=True)
obb = mesh.bounding_box_oriented
approaching = np.array([0, 0, -1])
target_closing = (
self.agent.tcp.pose.to_transformation_matrix()[0, :3, 1]
.cpu()
.numpy()
)
grasp_info = compute_grasp_info_by_obb(
obb,
approaching=approaching,
target_closing=target_closing,
depth=self.finger_length,
)
closing, center = grasp_info["closing"], grasp_info["center"]
raw_tcp_pose = self.agent.tcp.pose.sp
grasp_pose = self.agent.build_grasp_pose(approaching, closing, center)
reach_pose = grasp_pose * sapien.Pose(p=[0, 0, -0.05])
grasp_pose = grasp_pose * sapien.Pose(p=[0, 0, 0.01])
total_frames = defaultdict(list)
frames_list = []
reach_frames = self.move_to_pose(
reach_pose,
self.sim_steps_per_control,
self.control_timestep,
cameras,
render_keys,
gripper_state=1,
)
frames_list.append(reach_frames)
if len(reach_frames) == 0:
logger.warning(
f"Failed to reach the grasp pose for node `{actor.name}`, skipping grasping."
)
return total_frames
if not reach_target_only:
grasp_frames = self.move_to_pose(
grasp_pose,
self.sim_steps_per_control,
self.control_timestep,
cameras,
render_keys,
gripper_state=1,
)
frames_list.append(grasp_frames)
close_frames = self.control_gripper(
self.sim_steps_per_control,
cameras,
render_keys,
gripper_state=-1,
)
frames_list.append(close_frames)
back_frames = self.move_to_pose(
raw_tcp_pose,
self.sim_steps_per_control,
self.control_timestep,
cameras,
render_keys,
gripper_state=-1,
)
frames_list.append(back_frames)
for frame_dict in frames_list:
for cam_name, frames in frame_dict.items():
total_frames[cam_name].extend(frames)
self.agent.reset(self.agent.init_qpos)
return total_frames
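
The pose handling in `load_assets_from_layout_file` above is easy to get wrong: layout positions store quaternions as (qx, qy, qz, qw), while `sapien.Pose` expects (w, x, y, z). Below is a minimal, self-contained sketch of that conversion; `quat_multiply` is a local stand-in for the `quaternion_multiply` helper the real module imports, and the sample values are hypothetical:

```python
# Sketch of the quaternion handling in `load_assets_from_layout_file`:
# layout positions are (x, y, z, qx, qy, qz, qw), but `sapien.Pose`
# takes its quaternion in (w, x, y, z) order.

def quat_multiply(q1, q2):
    """Hamilton product of two quaternions, both in (x, y, z, w) order."""
    x1, y1, z1, w1 = q1
    x2, y2, z2, w2 = q2
    return [
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
    ]

position = [0.5, 0.2, 0.8, 0.0, 0.0, 0.0, 1.0]  # hypothetical: identity rotation
init_quat = [0.0, 0.0, 0.7071068, 0.7071068]  # 90 deg about +Z, xyzw order
x, y, z, qx, qy, qz, qw = position
qx, qy, qz, qw = quat_multiply([qx, qy, qz, qw], init_quat)
pose_pq = ([x, y, z], [qw, qx, qy, qz])  # reorder xyzw -> wxyz for sapien.Pose
```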

View File

@@ -1 +1 @@
VERSION = "v0.1.2"
VERSION = "v0.1.3"

View File

@@ -109,7 +109,7 @@ class MeshGeoChecker(BaseChecker):
if self.prompt is None:
self.prompt = """
You are an expert in evaluating the geometry quality of generated 3D asset.
You will be given rendered views of a generated 3D asset with black background.
You will be given rendered views of a generated 3D asset, type {}, with black background.
Your task is to evaluate the quality of the 3D asset generation,
including geometry, structure, and appearance, based on the rendered views.
Criteria:
@@ -130,10 +130,13 @@ class MeshGeoChecker(BaseChecker):
Image shows a chair with simplified back legs and soft edges YES
"""
def query(self, image_paths: list[str | Image.Image]) -> str:
def query(
self, image_paths: list[str | Image.Image], text: str = "unknown"
) -> str:
input_prompt = self.prompt.format(text)
return self.gpt_client.query(
text_prompt=self.prompt,
text_prompt=input_prompt,
image_base64=image_paths,
)
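
The `{}` placeholder added to the prompt above is filled by `query` via `self.prompt.format(text)`. A minimal sketch of that templating step, with the prompt abbreviated and the GPT client call omitted:

```python
# Sketch of the templating step in the updated MeshGeoChecker.query:
# the prompt carries a `{}` placeholder that `query` fills with the
# asset type before sending it to the GPT client (client call omitted).
prompt = (
    "You are an expert in evaluating the geometry quality of generated 3D asset. "
    "You will be given rendered views of a generated 3D asset, type {}, "
    "with black background."
)
input_prompt = prompt.format("pen")  # `text` defaults to "unknown" in query()
```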

View File

@@ -107,36 +107,37 @@ class URDFGenerator(object):
already provided, use it directly), accurately describe this 3D object asset (within 15 words),
Determine the pose of the object in the first image and estimate the true vertical height
(vertical projection) range of the object (in meters), i.e., how tall the object appears from top
to bottom in the front view (first) image. also weight range (unit: kilogram), the average
to bottom in the first image, as well as the weight range (unit: kilogram), the average
static friction coefficient of the object relative to rubber and the average dynamic friction
coefficient of the object relative to rubber. Return response format as shown in Output Example.
coefficient of the object relative to rubber. Return the response in the format shown in the Output Example.
Output Example:
Category: cup
Description: shiny golden cup with floral design
Height: 0.1-0.15 m
Pose: <short_description_within_10_words>
Height: 0.10-0.15 m
Weight: 0.3-0.6 kg
Static friction coefficient: 0.6
Dynamic friction coefficient: 0.5
IMPORTANT: Estimating Vertical Height from the First (Front View) Image.
IMPORTANT: Estimating Vertical Height from the First (Front View) Image and Pose Based on All Views.
- The "vertical height" refers to the real-world vertical size of the object
as projected in the first image, aligned with the image's vertical axis.
- For flat objects like plates, disks, or books, if their face is visible in the front view,
use the diameter as the vertical height. If the edge is visible, use the thickness instead.
- This is not necessarily the full length of the object, but how tall it appears
in the first image vertically, based on its pose and orientation.
- For objects(e.g., spoons, forks, writing instruments etc.) at an angle showing in
the first image, tilted at 45° will appear shorter vertically than when upright.
in the first image vertically, based on its pose and orientation estimated from all views.
- For objects (e.g., spoons, forks, writing instruments) shown at an angle in the images,
e.g., tilted at 45°, they will appear shorter vertically than when upright.
Estimate the vertical projection of their real length based on its pose.
For example:
- A pen standing upright in the first view (aligned with the image's vertical axis)
- A pen standing upright in the first image (aligned with the image's vertical axis)
full body visible in the first image: vertical height 0.14-0.20 m
- A pen lying flat in the front view (showing thickness) vertical height 0.018-0.025 m
- A pen lying flat in the first image (showing thickness or as a dot) vertical height 0.018-0.025 m
- Tilted pen in the first image (e.g., ~45° angle): vertical height 0.07-0.12 m
- Use the rest views(except the first image) to help determine the object's 3D pose and orientation.
- Use the remaining views to help determine the object's 3D pose and orientation.
Assume the object is in real-world scale and estimate the approximate vertical height
(in meters) based on how large it appears vertically in the first image.
based on the pose estimation and how large it appears vertically in the first image.
"""
)
@@ -163,14 +164,14 @@ class URDFGenerator(object):
description = lines[1].split(": ")[1]
min_height, max_height = map(
lambda x: float(x.strip().replace(",", "").split()[0]),
lines[2].split(": ")[1].split("-"),
lines[3].split(": ")[1].split("-"),
)
min_mass, max_mass = map(
lambda x: float(x.strip().replace(",", "").split()[0]),
lines[3].split(": ")[1].split("-"),
lines[4].split(": ")[1].split("-"),
)
mu1 = float(lines[4].split(": ")[1].replace(",", ""))
mu2 = float(lines[5].split(": ")[1].replace(",", ""))
mu1 = float(lines[5].split(": ")[1].replace(",", ""))
mu2 = float(lines[6].split(": ")[1].replace(",", ""))
return {
"category": category.lower(),
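
The index shifts above (lines[2] to lines[3], and so on) follow from the new `Pose:` line inserted into the response format. A minimal, self-contained sketch of the updated parsing against a hypothetical response matching the Output Example:

```python
# Sketch of why the parsed line indices shift: the GPT response now
# contains an extra `Pose:` line, so height moves to lines[3], mass to
# lines[4], and the friction coefficients to lines[5] and lines[6].
# The response text below is a hypothetical example.
response = """Category: cup
Description: shiny golden cup with floral design
Pose: upright on its base, handle facing right
Height: 0.10-0.15 m
Weight: 0.3-0.6 kg
Static friction coefficient: 0.6
Dynamic friction coefficient: 0.5"""
lines = response.splitlines()
min_height, max_height = map(
    lambda x: float(x.strip().replace(",", "").split()[0]),
    lines[3].split(": ")[1].split("-"),
)
min_mass, max_mass = map(
    lambda x: float(x.strip().replace(",", "").split()[0]),
    lines[4].split(": ")[1].split("-"),
)
mu1 = float(lines[5].split(": ")[1].replace(",", ""))
mu2 = float(lines[6].split(": ")[1].replace(",", ""))
```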

View File

@@ -7,7 +7,7 @@ packages = ["embodied_gen"]
[project]
name = "embodied_gen"
version = "v0.1.2"
version = "v0.1.3"
readme = "README.md"
license = "Apache-2.0"
license-files = ["LICENSE", "NOTICE"]
@@ -32,6 +32,8 @@ backproject-cli = "embodied_gen.data.backproject_v2:entrypoint"
img3d-cli = "embodied_gen.scripts.imageto3d:entrypoint"
text3d-cli = "embodied_gen.scripts.textto3d:text_to_3d"
scene3d-cli = "embodied_gen.scripts.gen_scene3d:entrypoint"
layout-cli = "embodied_gen.scripts.gen_layout:entrypoint"
sim-cli = "embodied_gen.scripts.simulate_sapien:entrypoint"
[tool.pydocstyle]
match = '(?!test_).*(?!_pb2)\.py'

View File

@@ -38,4 +38,5 @@ tyro
pyquaternion
shapely
sapien==3.0.0b1
mani_skill==3.0.0b21
typing_extensions==4.14.1

View File

@@ -77,10 +77,18 @@ def panoocc_checker():
def test_geo_checker(geo_checker):
from embodied_gen.utils.process_media import combine_images_to_grid
image_paths = glob(
"outputs/layouts_gens3/task_0000/asset3d/pen/result/renders/image_color/*png"
)
image = combine_images_to_grid(image_paths)[0]
# image = "outputs/utest_cases/geo_checker/color.png"
flag, result = geo_checker(
[
"apps/assets/example_image/sample_02.jpg",
]
image,
],
text="pen",
)
logger.info(f"geo_checker: {flag}, {result}")
assert isinstance(flag, bool)
@@ -107,11 +115,40 @@ def test_seg_checker(seg_checker):
def test_semantic_checker(semantic_checker):
test_cases = [
("table", "outputs/utest_cases/semantic_checker/task_0001_table.png"),
("table", "outputs/utest_cases/semantic_checker/task_0002_table.png"),
(
"banana",
"outputs/utest_cases/semantic_checker/task_0002_banana.png",
),
(
"napkin",
"outputs/utest_cases/semantic_checker/task_0003_napkin.png",
),
("table", "outputs/utest_cases/semantic_checker/task_0003_table.png"),
("fork", "outputs/utest_cases/semantic_checker/task_0005_fork.png"),
("spoon", "outputs/utest_cases/semantic_checker/task_0005_spoon.png"),
("table", "outputs/utest_cases/semantic_checker/task_0007_table.png"),
("table", "outputs/utest_cases/semantic_checker/task_0008_table.png"),
("table", "outputs/utest_cases/semantic_checker/task_0009_table.png"),
("table", "outputs/utest_cases/semantic_checker/task_0011_table.png"),
("desk", "outputs/utest_cases/semantic_checker/task_0012_desk.png"),
(
"laptop",
"outputs/utest_cases/semantic_checker/task_0012_laptop.png",
),
("table", "outputs/utest_cases/semantic_checker/task_0014_table.png"),
("desk", "outputs/utest_cases/semantic_checker/task_0016_desk.png"),
("shelf", "outputs/utest_cases/semantic_checker/task_0018_shelf.png"),
("table", "outputs/utest_cases/semantic_checker/task_0000_table.png"),
]
for test_case in test_cases:
flag, result = semantic_checker(
text="can",
image=["apps/assets/example_image/sample_02.jpg"],
text=test_case[0],
image=[test_case[1]],
)
logger.info(f"semantic_checker: {flag}, {result}")
logger.info(f"semantic_checker: {flag}, {result}, {test_case[1]}")
assert isinstance(flag, bool)
assert isinstance(result, str)
@@ -139,25 +176,21 @@ def test_textgen_checker(textalign_checker, mesh_path, text_desc):
def test_panoheight_estimator(pano_height_estimator):
image_paths = glob("outputs/bg_v3/test2/*/*.png")
image_paths = glob("outputs/bg_scenes/*/*.png")
for image_path in image_paths:
result = pano_height_estimator(image_path)
logger.info(f"{type(result)}, {result}")
def test_pano_checker(pano_checker):
# image_paths = [
# "outputs/bg_gen2/scene_0000/pano_image.png",
# "outputs/bg_gen2/scene_0001/pano_image.png",
# ]
image_paths = glob("outputs/bg_gen/*/*.png")
image_paths = glob("outputs/bg_scenes/*/*.png")
for image_path in image_paths:
flag, result = pano_checker(image_path)
logger.info(f"{image_path} {flag}, {result}")
def test_panoocc_checker(panoocc_checker):
image_paths = glob("outputs/bg_gen/*/*.png")
image_paths = glob("outputs/bg_scenes/*/*.png")
for image_path in image_paths:
flag, result = panoocc_checker(image_path)
logger.info(f"{image_path} {flag}, {result}")

View File

@@ -0,0 +1,51 @@
# Project EmbodiedGen
#
# Copyright (c) 2025 Horizon Robotics. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
from embodied_gen.utils.gpt_clients import GPT_CLIENT
from embodied_gen.validators.urdf_convertor import URDFGenerator
def test_urdf_convertor():
urdf_gen = URDFGenerator(GPT_CLIENT, render_view_num=4)
mesh_paths = [
"outputs/test_urdf/sample_0/mesh/pen.obj",
"outputs/test_urdf/sample_1/mesh/notepad.obj",
"outputs/test_urdf/sample_2/mesh/plate.obj",
"outputs/test_urdf/sample_3/mesh/spoon.obj",
"outputs/test_urdf/sample_4/mesh/notebook.obj",
"outputs/test_urdf/sample_5/mesh/plate.obj",
"outputs/test_urdf/sample_6/mesh/spoon.obj",
"outputs/test_urdf/sample_7/mesh/book.obj",
"outputs/test_urdf/sample_8/mesh/lamp.obj",
"outputs/test_urdf/sample_9/mesh/remote_control.obj",
"outputs/test_urdf/sample_10/mesh/keyboard.obj",
"outputs/test_urdf/sample_11/mesh/mouse.obj",
"outputs/test_urdf/sample_12/mesh/table.obj",
"outputs/test_urdf/sample_13/mesh/marker.obj",
"outputs/test_urdf/pen/result/mesh/pen.obj",
"outputs/test_urdf/notebook/result/mesh/notebook.obj",
"outputs/test_urdf/marker/result/mesh/marker.obj",
"outputs/test_urdf/pen2/result/mesh/pen.obj",
]
for idx, mesh_path in enumerate(mesh_paths):
filename = mesh_path.split("/")[-1].split(".")[0]
urdf_path = urdf_gen(
mesh_path=mesh_path,
output_root=f"outputs/test_urdf2/sample_{idx}",
category=filename,
# min_height=1.0,
# max_height=1.2,
)