EmboFlow/design/03-workflows/workflow-execution-model.md
eust-w 7d7cd14233
feat: add dataset-aware workflow inputs
2026-03-30 14:18:57 +08:00


EmboFlow Workflow Execution Model

Goal

Define how EmboFlow represents, validates, executes, and observes canvas workflows.

The workflow system is the product core. The canvas is only the editing surface. The real system of record is the versioned workflow definition and its immutable run snapshots.

Core Objects

  • WorkflowDefinition: Logical workflow identity under a project
  • WorkflowVersion: Immutable snapshot of nodes, edges, runtime defaults, and plugin references
  • NodeInstance: Concrete node on a workflow graph
  • WorkflowRun: One execution of one workflow version
  • RunTask: Executable unit derived from a node during one run
  • Artifact: Managed output from a task or run

Workflow Layers

Each workflow version contains three layers.

Visual Layer

Used only by the editor:

  • node positions
  • collapsed state
  • groups
  • zoom defaults
  • viewport metadata

Logic Layer

Used for graph semantics:

  • nodes
  • edges
  • input/output ports
  • branch conditions
  • merge semantics
  • dependency graph

Runtime Layer

Used for execution:

  • node config values
  • executor settings
  • runtime resource limits
  • retry policy
  • code hooks
  • cache policy

Visual changes must not change workflow semantics. Runtime changes must produce a new workflow version.

The current V1 editor implementation keeps a mutable local draft that is initialized from the latest saved workflow version. Saving the draft creates a new immutable workflow version. Triggering a run from a dirty draft first saves a fresh workflow version, then creates the run from that saved snapshot. The V1 editor also requires binding at least one project asset before run creation, and the selected asset ids are persisted with the run snapshot.

The current local runtime also persists per-node runtime config under runtimeGraph.nodeConfigs. That config includes executor overrides, executor-specific config payloads, optional artifact metadata, and Python code-hook source for supported node categories. When a run is created, the API freezes those node configs into workflow_runs.runtimeSnapshot and copies the effective executor choice plus code-hook snapshot onto each run_task.

The current built-in delivery node library is now Docker-first by default. Unless a workflow author overrides a node runtime config, these built-ins resolve to executorType=docker with a local Python container image and networkMode=none:

  • source-asset
  • extract-archive
  • rename-folder
  • validate-structure
  • validate-metadata
  • union-assets
  • intersect-assets
  • difference-assets
  • export-delivery-package

This keeps most default processing isolated from the API and worker host processes while still letting individual workflows opt back into python or http.

Node Categories

V1 node categories:

  • Source
  • Transform
  • Inspect
  • Annotate
  • Export
  • Utility

V1 Built-In Node Families

  • asset upload/import
  • archive extract
  • folder rename
  • directory validation
  • metadata validation
  • video quality inspection
  • dataset readers for RLDS, LeRobot, HDF5, Rosbag
  • canonical mapping nodes
  • dataset writers and exporters
  • training config export
  • Python processing node

The current V1 runtime also supports project-level custom Docker nodes. A custom node is registered separately from the workflow graph, then exposed through the same node-definition surface as built-in nodes.

When the user drops one of these node definitions into the editor, the draft should immediately inherit the node's default runtime snapshot. In practice this means the seeded nodeConfig already carries the declared executor type, executor config, and contract before the user opens the right-side panel.

Node Definition Contract

Each node definition must expose:

  • id
  • name
  • category
  • version
  • description
  • inputSchema
  • outputSchema
  • configSchema
  • uiSchema
  • executorType
  • runtimeDefaults
  • permissions
  • capabilities
  • codeHookSpec

Code Hook Spec

V1 supports user code hooks only on:

  • Transform
  • Inspect
  • Utility

Hooks must use a constrained entrypoint instead of arbitrary script structure.

Example:

def process(input_data, context):
    # input_data: resolved upstream references; context: normalized execution context
    return input_data

This keeps serialization, logging, and runtime control predictable.

Custom Docker Node Contract

Custom containerized nodes must implement the EmboFlow runtime contract instead of inventing their own I/O shape.

Container input:

  • EMBOFLOW_INPUT_PATH points to a JSON file containing the frozen task snapshot and the execution context
  • EMBOFLOW_OUTPUT_PATH points to the JSON file the container must write before exit

Expected context shape:

  • assetIds
  • assets
  • upstreamResults
  • run and node identifiers

Expected output shape:

{
  "result": {
    "...": "..."
  }
}

If the custom node declares an asset_set style output contract, result.assetIds must be a string array. This is what allows downstream nodes to inherit the narrowed asset set.
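A custom node entrypoint that honors this contract can be sketched as follows. The environment variable names come from the contract above; the `context.assetIds` field matches the expected context shape, while the filtering logic and the `main` structure are purely illustrative:

```python
import json
import os

def main() -> None:
    # Read the frozen task snapshot and execution context provided by the worker.
    with open(os.environ["EMBOFLOW_INPUT_PATH"], "r", encoding="utf-8") as f:
        payload = json.load(f)

    context = payload.get("context", {})
    asset_ids = context.get("assetIds", [])

    # Illustrative processing: drop empty ids. A real node would apply its
    # actual logic here (validation, conversion, filtering, ...).
    kept = [a for a in asset_ids if a]

    # Write the result envelope before exit; result.assetIds must be a
    # string array when the node declares an asset_set output contract.
    with open(os.environ["EMBOFLOW_OUTPUT_PATH"], "w", encoding="utf-8") as f:
        json.dump({"result": {"assetIds": kept}}, f)

if __name__ == "__main__":
    main()
```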

If the custom node declares contract.inputMode = "multi_asset_set", the canvas should treat that node as multi-input at authoring time instead of forcing the user through single-input validation rules. The graph validator should derive this capability from the seeded runtime contract, not from a hardcoded node id list alone.

The current V1 validation boundary now rejects structurally invalid custom nodes before they enter the project registry. This includes missing names, unsupported source kinds, Dockerfiles without a FROM instruction, and Source category nodes that incorrectly declare multi_asset_set input.

The current V1 worker executes trusted-local Python hooks when a run_task carries a codeHookSpec. The hook is executed through a constrained Python harness with the task snapshot and execution context passed in as JSON. Hook stdout is captured into stdoutLines, hook failures populate stderrLines, and the returned object becomes the task artifact payload.
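A minimal sketch of such a harness is shown below. The `run_hook` name and the exact record shape are assumptions; only the `process(input_data, context)` entrypoint and the stdoutLines/stderrLines capture behavior come from the contract above. This is trusted-local execution, not sandboxing:

```python
import contextlib
import io
import json

def run_hook(hook_source: str, task_snapshot: dict, context: dict) -> dict:
    """Execute a user hook that defines process(input_data, context).

    Returns the hook result plus captured stdout/stderr lines, loosely
    mirroring the stdoutLines/stderrLines fields described above.
    """
    namespace: dict = {}
    exec(hook_source, namespace)  # trusted-local execution only
    process = namespace.get("process")
    if not callable(process):
        raise ValueError("hook must define process(input_data, context)")

    stdout, stderr = io.StringIO(), io.StringIO()
    record = {"result": None, "stdoutLines": [], "stderrLines": []}
    try:
        with contextlib.redirect_stdout(stdout), contextlib.redirect_stderr(stderr):
            result = process(task_snapshot, context)
        # Round-trip through JSON so the artifact payload is guaranteed serializable.
        record["result"] = json.loads(json.dumps(result))
    except Exception as exc:
        # Hook failures populate stderrLines instead of crashing the worker.
        record["stderrLines"].append(f"{type(exc).__name__}: {exc}")
    record["stdoutLines"] = stdout.getvalue().splitlines()
    record["stderrLines"] += stderr.getvalue().splitlines()
    return record
```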

The current V1 Docker executor now has two modes:

  • compatibility mode when no image is configured on the node runtime config
  • real container mode when executorConfig.image is set

In real container mode the worker:

  • creates a temp working directory
  • writes input.json containing the frozen task snapshot and execution context
  • mounts that directory into the container
  • sets EMBOFLOW_INPUT_PATH and EMBOFLOW_OUTPUT_PATH
  • captures container stdout and stderr from the Docker CLI process
  • parses output.json back into the task artifact payload when present

Optional hook metadata must remain optional in this path. The current V1 Docker runner now treats missing or explicit null codeHookSpec values as “no hook configured” instead of attempting to execute them. This keeps built-in Docker nodes and custom nodes on the same task schema without adding fake hook payloads.

The default Docker runtime policy is --network none. This keeps V1 safer for local processing nodes unless a later phase deliberately opens network access for containerized tasks.
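The real-container steps above can be sketched roughly as follows. The mount path `/emboflow`, the helper names, and the payload field names are assumptions; the environment variable names, the `--network none` default, and the input.json/output.json flow come from the contract above:

```python
import json
import subprocess
import tempfile
from pathlib import Path

def build_docker_command(image: str, workdir: str, network_mode: str = "none") -> list[str]:
    """Assemble the docker run invocation for one containerized task (sketch)."""
    return [
        "docker", "run", "--rm",
        "--network", network_mode,                  # default policy: --network none
        "-v", f"{workdir}:/emboflow",               # mount the temp working directory
        "-e", "EMBOFLOW_INPUT_PATH=/emboflow/input.json",
        "-e", "EMBOFLOW_OUTPUT_PATH=/emboflow/output.json",
        image,
    ]

def run_container_task(image: str, task_snapshot: dict, context: dict):
    """Write input.json, run the container, parse output.json when present."""
    with tempfile.TemporaryDirectory() as workdir:
        work = Path(workdir)
        (work / "input.json").write_text(
            json.dumps({"task": task_snapshot, "context": context}))
        proc = subprocess.run(
            build_docker_command(image, workdir),
            capture_output=True, text=True)
        # proc.stdout / proc.stderr are what the worker captures from the
        # Docker CLI process into stdoutLines / stderrLines.
        out = work / "output.json"
        if proc.returncode == 0 and out.exists():
            return json.loads(out.read_text())
        return None
```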

The V1 worker now also carries direct upstream task previews into the execution context. This is what makes multi-input set nodes executable instead of purely visual:

  • union-assets merges all upstream asset ids
  • intersect-assets keeps only the shared asset ids
  • difference-assets subtracts later upstream sets from the first upstream set

When one upstream node produces a narrowed asset set, the worker treats that effective asset set as the execution input for the downstream task and writes it back to the successful run_task.
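The three set semantics can be sketched as one pure function over the upstream asset-id lists. The function name is illustrative; keeping first-seen order is an assumption chosen so downstream tasks get deterministic inputs:

```python
def combine_asset_sets(node_type: str, upstream_sets: list[list[str]]) -> list[str]:
    """Apply union / intersect / difference semantics to upstream asset sets (sketch)."""
    if not upstream_sets:
        return []
    if node_type == "union-assets":
        # Merge all upstream asset ids, de-duplicated in first-seen order.
        seen: dict[str, None] = {}
        for upstream in upstream_sets:
            for asset_id in upstream:
                seen.setdefault(asset_id)
        return list(seen)
    if node_type == "intersect-assets":
        # Keep only asset ids shared by every upstream set.
        shared = set(upstream_sets[0]).intersection(*map(set, upstream_sets[1:]))
        return [a for a in upstream_sets[0] if a in shared]
    if node_type == "difference-assets":
        # Subtract later upstream sets from the first upstream set.
        later = set().union(*map(set, upstream_sets[1:])) if len(upstream_sets) > 1 else set()
        return [a for a in upstream_sets[0] if a not in later]
    raise ValueError(f"unknown set node: {node_type}")
```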

Data Flow Contract

Tasks should exchange managed references, not loose file paths.

V1 reference types:

  • assetRef
  • datasetVersionRef
  • artifactRef
  • annotationTaskRef
  • inlineConfig

Executors may materialize files internally, but the platform-level contract must remain reference-based.

Validation Stages

Workflow execution must validate in this order:

  1. workflow version exists
  2. referenced plugins exist and are enabled
  3. node schemas are valid
  4. edge connections are schema-compatible
  5. runtime configuration is complete
  6. referenced assets and datasets are accessible
  7. code hooks pass static validation
  8. executor and scheduler requirements are satisfiable

Validation failure must block run creation.

The current V1 API now exposes this as a real preflight step, not only as an editor convention. POST /api/runs/preflight evaluates the saved workflow version against the selected workflow input bindings and frozen runtime snapshot. POST /api/runs reuses the same checks and rejects run creation when any blocking issue remains.
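A few of these blocking checks can be sketched as a single issue-collecting pass. All field names on the request shape here (workflowVersion, plugins, runtimeConfig, executorConfig, codeHookSpec, and so on) are illustrative, not the real API schema; only the check semantics come from the stages above:

```python
def preflight(run_request: dict) -> list[str]:
    """Collect blocking issues; an empty list means run creation may proceed (sketch)."""
    issues: list[str] = []
    version = run_request.get("workflowVersion")
    if not version:
        issues.append("workflow version missing")
        return issues  # later stages depend on the version snapshot

    # Referenced plugins must exist and be enabled.
    for plugin in version.get("plugins", []):
        if not plugin.get("enabled", False):
            issues.append(f"plugin disabled: {plugin.get('id')}")

    # At least one workflow input binding is required.
    if not run_request.get("assetIds") and not run_request.get("datasetIds"):
        issues.append("workflow input binding missing")

    # Executor-specific required config and non-empty code hook source.
    for node in version.get("nodes", []):
        cfg = node.get("runtimeConfig", {})
        executor = cfg.get("executorType", "python")
        executor_cfg = cfg.get("executorConfig", {})
        if executor == "docker" and not executor_cfg.get("image"):
            issues.append(f"docker node without image: {node.get('id')}")
        if executor == "http" and not executor_cfg.get("url"):
            issues.append(f"http node without url: {node.get('id')}")
        hook = cfg.get("codeHookSpec")
        if hook is not None and not (hook.get("source") or "").strip():
            issues.append(f"empty code hook source: {node.get('id')}")
    return issues
```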

Run Lifecycle

When a user executes a workflow:

  1. resolve workflow version
  2. validate and snapshot all runtime-relevant inputs, including bound asset and dataset references
  3. resolve plugin versions
  4. freeze node config and code hooks
  5. compile graph into a DAG
  6. create WorkflowRun
  7. create RunTask entries
  8. enqueue ready tasks
  9. collect outputs, logs, and task state
  10. finalize run status and summary

The current preflight checks include:

  • workflow definition and version linkage
  • workflow input binding presence
  • bound asset existence and project match
  • bound dataset existence and project match
  • resolution of dataset bindings into runnable asset ids
  • resolved node definition existence
  • source and export edge direction rules
  • multi-input eligibility
  • executor-specific required config such as Docker image or HTTP URL
  • non-empty code hook source when a hook is present

Run State Model

WorkflowRun Status

  • pending
  • queued
  • running
  • success
  • failed
  • cancelled
  • partial_success

RunTask Status

  • pending
  • queued
  • running
  • success
  • failed
  • cancelled
  • skipped

partial_success is used for workflows where non-blocking nodes fail but the run still produces valid outputs.
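The fold from task statuses into a run status can be sketched as below. The per-task blocking flag is an assumption standing in for the continue-on-error policy; the status names match the state model above:

```python
def derive_run_status(tasks: list[tuple[str, bool]]) -> str:
    """Fold (taskStatus, isBlocking) pairs into a WorkflowRun status (sketch)."""
    statuses = [status for status, _ in tasks]
    if any(s == "cancelled" for s in statuses):
        return "cancelled"
    if any(s in ("pending", "queued", "running") for s in statuses):
        return "running"
    if any(s == "failed" and blocking for s, blocking in tasks):
        return "failed"
    if any(s == "failed" for s in statuses):
        # Only non-blocking nodes failed; the run still produced valid outputs.
        return "partial_success"
    return "success"
```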

Retry And Failure Policy

Each node instance may define:

  • retry count
  • retry backoff policy
  • fail-fast behavior
  • continue-on-error behavior
  • manual retry eligibility

V1 should support:

  • fail_fast
  • continue_on_error
  • retry_n_times
  • manual_retry
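One way to sketch these four policies is a function that decides whether a failed attempt retries automatically and after what delay. The policy field names (mode, maxRetries, backoff, baseDelaySeconds) are illustrative; only the four mode names come from the list above:

```python
def next_retry_delay(policy: dict, attempt: int):
    """Return seconds until the next automatic retry, or None for no auto-retry (sketch)."""
    mode = policy.get("mode", "fail_fast")
    if mode in ("fail_fast", "continue_on_error", "manual_retry"):
        # fail_fast / continue_on_error never auto-retry;
        # manual_retry waits for an operator action instead.
        return None
    if mode == "retry_n_times":
        if attempt >= policy.get("maxRetries", 3):
            return None
        base = policy.get("baseDelaySeconds", 1.0)
        if policy.get("backoff", "exponential") == "exponential":
            return base * (2 ** attempt)  # 1s, 2s, 4s, ...
        return base                       # fixed backoff
    raise ValueError(f"unknown retry mode: {mode}")
```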

Cache Model

V1 should support node-level cache reuse.

Recommended cache key inputs:

  • workflow version
  • node id
  • upstream reference summary
  • config summary
  • code hook digest
  • plugin version
  • executor version
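The recommended cache key inputs above can be digested into one stable key. This is a sketch under assumed parameter names; the important properties are deterministic serialization (sort_keys) and hashing the hook body rather than storing it:

```python
import hashlib
import json

def node_cache_key(workflow_version: str, node_id: str, upstream_summary: dict,
                   config_summary: dict, code_hook_source: str,
                   plugin_version: str, executor_version: str) -> str:
    """Combine the recommended cache key inputs into one stable digest (sketch)."""
    material = {
        "workflowVersion": workflow_version,
        "nodeId": node_id,
        "upstream": upstream_summary,
        "config": config_summary,
        # Digest the hook body so identical sources hash identically.
        "codeHookDigest": hashlib.sha256(code_hook_source.encode()).hexdigest(),
        "pluginVersion": plugin_version,
        "executorVersion": executor_version,
    }
    # sort_keys makes the serialization key-order independent, hence stable.
    blob = json.dumps(material, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()
```

Any change to a single input (a new workflow version, a different hook body, an executor upgrade) yields a different key, which is what forces a cache miss.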

Cache hit behavior:

  • reuse output artifact refs
  • reuse output summaries
  • retain previous logs reference
  • mark task as cache-resolved in metadata

Execution Context

Each task receives a normalized execution context containing:

  • workspace id
  • project id
  • workflow run id
  • task id
  • actor id
  • bound asset ids
  • bound asset metadata summary, including display name, detected formats, top-level paths, and local source path when available
  • node config
  • code hook content
  • input references
  • storage context
  • temp working directory
  • runtime resource limits

This context must be available across Python, Docker, and HTTP executors.
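The normalized context can be sketched as one dataclass shared by all executors. The Python field names here are assumptions (the source does not fix a wire format); the field list mirrors the bullets above:

```python
from dataclasses import dataclass, field

@dataclass
class ExecutionContext:
    """Normalized per-task execution context shared by Python, Docker, and HTTP executors (sketch)."""
    workspace_id: str
    project_id: str
    run_id: str
    task_id: str
    actor_id: str
    asset_ids: list = field(default_factory=list)
    # Per-asset summary: display name, detected formats, top-level paths,
    # local source path when available.
    asset_summaries: list = field(default_factory=list)
    node_config: dict = field(default_factory=dict)
    code_hook_source: str = ""
    input_refs: list = field(default_factory=list)   # assetRef / artifactRef / ...
    storage: dict = field(default_factory=dict)
    temp_dir: str = ""
    resource_limits: dict = field(default_factory=dict)
```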

Observability Requirements

Each task must emit:

  • status transitions
  • start time and finish time
  • duration
  • executor metadata
  • resource request metadata
  • stdout/stderr log stream
  • structured task summary
  • artifact refs

Current V1 Implementation Notes

The current codebase keeps the low-level contract tests running against an in-memory implementation, while the executable local runtime persists workflow state to MongoDB.

The persisted local runtime now covers:

  • workspace and project bootstrap
  • asset registration and probe reporting
  • workflow definition and immutable version snapshots
  • workflow runs and task creation with worker-consumable dependency snapshots
  • workflow run input bindings persisted on both runs and tasks
  • resolved asset ids and explicit dataset ids persisted separately on both runs and tasks
  • project-scoped run history queries from Mongo-backed workflow_runs
  • worker polling of queued tasks from Mongo-backed run_tasks
  • run-task status transitions from queued/pending to running/success/failed
  • downstream task promotion when upstream nodes succeed
  • artifact registration and producer lookup
  • task-level artifact creation by the worker runtime

The React workflow editor now loads the latest persisted version from the Mongo-backed API instead of rendering only a fixed starter graph. Draft edits are local editor state until the user saves, at which point the draft is serialized into a new workflow version document. Before a run is created, the editor loads project assets, requires one to be selected, and passes that binding to the API.

The editor right panel now exposes the first writable runtime controls instead of read-only node metadata. V1 users can override the executor type per node, configure a simple executor target such as HTTP URL or Docker image, override the produced artifact title, and author Python code-hook source for supported node categories.

The runtime Runs workspace now loads recent runs for the active project. Run detail views poll active runs until they settle and let the operator inspect task-level artifacts directly through Explore links.

The worker-backed runtime now persists task execution summaries directly on run_tasks instead of treating artifacts as the only observable output. Each completed or failed task records:

  • startedAt and finishedAt
  • durationMs
  • appended logLines
  • captured stdoutLines and stderrLines
  • structured summary with outcome, executor, asset count, artifact ids, and failure text when present
  • lastResultPreview for a lightweight selected-task preview in the Runs workspace

This makes the run detail view stable even when artifacts are large or delayed and keeps task-level observability queryable without reopening every artifact payload.

The current built-in Python path now also has first-pass node semantics for two delivery-focused nodes when no custom code hook is present:

  • source-asset: Emits a normalized summary of the bound assets from Mongo-backed asset metadata, so downstream nodes and operators see concrete display names, detected formats, top-level paths, and local source paths instead of only opaque asset ids.
  • validate-structure: Inspects the bound asset source paths, checks the delivery-required files meta.json, intrinsics.json, and video_meta.json, recursively counts .mp4 files, and emits a stable validation summary with valid, requiredFiles, missingRequiredFiles, and videoFileCount.

This replaces the earlier placeholder "python executor processed ..." behavior for those built-in nodes and makes the default worker output useful even before custom hooks are authored.
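The validate-structure semantics can be sketched as follows. The required file names, recursive .mp4 counting, and summary field names come from the description above; checking the required files at the top level of the source path is an assumption:

```python
import os

REQUIRED_FILES = ("meta.json", "intrinsics.json", "video_meta.json")

def validate_structure(source_path: str) -> dict:
    """Emit the stable validation summary described above for one asset source path (sketch)."""
    top_level = set(os.listdir(source_path)) if os.path.isdir(source_path) else set()
    # Assumption: delivery-required files live at the top level of the source path.
    missing = [name for name in REQUIRED_FILES if name not in top_level]

    # Recursively count .mp4 files anywhere under the source path.
    video_count = 0
    for _dirpath, _dirnames, filenames in os.walk(source_path):
        video_count += sum(1 for f in filenames if f.lower().endswith(".mp4"))

    return {
        "valid": not missing,
        "requiredFiles": list(REQUIRED_FILES),
        "missingRequiredFiles": missing,
        "videoFileCount": video_count,
    }
```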

The current runtime also aggregates execution state back onto workflow_runs. Each refresh computes:

  • run-level startedAt and finishedAt
  • run-level durationMs
  • summary.totalTaskCount
  • summary.completedTaskCount
  • summary.taskCounts
  • summary.artifactCount
  • summary.stdoutLineCount
  • summary.stderrLineCount
  • summary.failedTaskIds

This allows the Runs workspace to render a stable top-level run summary without client-side recomputation across every task document.
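The rollup above can be sketched as a pure function over task documents. The task field names here (status, artifactIds, stdoutLines, stderrLines) are illustrative stand-ins for the real run_task schema; the output keys match the summary fields listed above:

```python
from collections import Counter

TERMINAL_STATUSES = {"success", "failed", "cancelled", "skipped"}

def aggregate_run_summary(tasks: list) -> dict:
    """Recompute the run-level summary fields from task documents (sketch)."""
    counts = Counter(t.get("status", "pending") for t in tasks)
    return {
        "totalTaskCount": len(tasks),
        "completedTaskCount": sum(n for s, n in counts.items() if s in TERMINAL_STATUSES),
        "taskCounts": dict(counts),
        "artifactCount": sum(len(t.get("artifactIds", [])) for t in tasks),
        "stdoutLineCount": sum(len(t.get("stdoutLines", [])) for t in tasks),
        "stderrLineCount": sum(len(t.get("stderrLines", [])) for t in tasks),
        "failedTaskIds": [t["id"] for t in tasks if t.get("status") == "failed"],
    }
```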

The current V1 runtime also implements the first run-control loop:

  • POST /api/runs/:runId/cancel: Cancels queued and pending tasks for that run and prevents downstream promotion.
  • POST /api/runs/:runId/retry: Creates a brand-new run from the original run snapshot, keeping workflow version and bound asset ids.
  • POST /api/runs/:runId/tasks/:taskId/retry: Resets the failed or cancelled task plus its downstream subtree, increments the target task attempt count, and requeues from that node.

V1 cancellation is scheduler-level only. It does not attempt to hard-stop an executor that is already running inside the local worker loop.

The selected-task panel in the current Runs workspace also shows the frozen node definition id, executor config snapshot, and code-hook metadata, so an operator can inspect what exact runtime settings were used without reopening the workflow editor.

The API and worker runtimes now both have direct integration coverage against a real Mongo runtime through mongodb-memory-server, in addition to the older in-memory contract tests.

The first web authoring surface already follows the three-pane layout contract with:

  • left node library
  • center workflow canvas
  • right node configuration panel

The first explore surface currently includes built-in renderers for:

  • JSON artifacts
  • directory artifacts
  • video artifacts

The UI must allow:

  • graph-level run status
  • node-level log inspection
  • node-level artifact browsing
  • task retry entrypoint
  • direct navigation from a node to preview output

Canvas Interaction Rules

V1 editor behavior should enforce:

  • port-level connection rules
  • incompatible edge blocking
  • dirty-state detection
  • explicit save before publish/run if graph changed
  • per-node validation badges
  • run from latest saved version, not unsaved draft

Example V1 Pipelines

Delivery Normalization

Raw Folder Import
  -> Archive Extract
  -> Folder Rename
  -> Directory Validation
  -> Metadata Validation
  -> Video Quality Check
  -> Delivery Export

Dataset Conversion

Rosbag Reader
  -> Canonical Mapping
  -> Frame Filter
  -> Metadata Normalize
  -> LeRobot Writer
  -> Training Config Export

V1 Non-Goals

The V1 workflow engine does not need:

  • loop semantics
  • streaming execution
  • unbounded dynamic fan-out
  • event-driven triggers
  • advanced distributed DAG partitioning

The V1 goal is a stable, observable DAG executor for data engineering workflows.