# EmboFlow Workflow Execution Model
## Goal
Define how EmboFlow represents, validates, executes, and observes canvas workflows.
The workflow system is the product core. The canvas is only the editing surface. The real system of record is the versioned workflow definition and its immutable run snapshots.
## Core Objects
- `WorkflowDefinition`: logical workflow identity under a project
- `WorkflowVersion`: immutable snapshot of nodes, edges, runtime defaults, and plugin references
- `NodeInstance`: concrete node on a workflow graph
- `WorkflowRun`: one execution of one workflow version
- `RunTask`: executable unit derived from a node during one run
- `Artifact`: managed output from a task or run
## Workflow Layers
Each workflow version contains three layers.
### Visual Layer
Used only by the editor:
- node positions
- collapsed state
- groups
- zoom defaults
- viewport metadata
### Logic Layer
Used for graph semantics:
- nodes
- edges
- input/output ports
- branch conditions
- merge semantics
- dependency graph
### Runtime Layer
Used for execution:
- node config values
- executor settings
- runtime resource limits
- retry policy
- code hooks
- cache policy
Visual changes must not change workflow semantics. Runtime changes must produce a new workflow version.
The current V1 editor implementation keeps a mutable local draft that is initialized from the latest saved workflow version. Saving the draft creates a new immutable workflow version. Triggering a run from a dirty draft first saves a fresh workflow version, then creates the run from that saved snapshot. The V1 editor also requires binding at least one project asset before run creation, and the selected asset ids are persisted with the run snapshot.
The current local runtime also persists per-node runtime config under `runtimeGraph.nodeConfigs`. That config includes executor overrides, executor-specific config payloads, optional artifact metadata, and Python code-hook source for supported node categories. When a run is created, the API freezes those node configs into `workflow_runs.runtimeSnapshot` and copies the effective executor choice plus code-hook snapshot onto each `run_task`.
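As an illustration, one persisted node config entry might look like the sketch below. Only the `runtimeGraph.nodeConfigs` location, the executor override, executor-specific config, optional artifact metadata, and Python code-hook source are confirmed above; the concrete field names and values are assumptions.
```python
# Hypothetical shape of one entry in runtimeGraph.nodeConfigs.
# Field names beyond the concepts named in the text are illustrative, not the real schema.
node_config = {
    "executorType": "docker",                      # per-node executor override
    "executorConfig": {
        "image": "emboflow/video-qc:1.2.0",        # hypothetical image tag
    },
    "artifact": {
        "title": "Video QC Report",                # optional artifact metadata override
    },
    "codeHook": {
        "language": "python",
        "source": "def process(input_data, context):\n    return input_data\n",
    },
}
```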
## Node Categories
V1 node categories:
- `Source`
- `Transform`
- `Inspect`
- `Annotate`
- `Export`
- `Utility`
### V1 Built-In Node Families
- asset upload/import
- archive extract
- folder rename
- directory validation
- metadata validation
- video quality inspection
- dataset readers for RLDS, LeRobot, HDF5, Rosbag
- canonical mapping nodes
- dataset writers and exporters
- training config export
- Python processing node
## Node Definition Contract
Each node definition must expose:
- `id`
- `name`
- `category`
- `version`
- `description`
- `inputSchema`
- `outputSchema`
- `configSchema`
- `uiSchema`
- `executorType`
- `runtimeDefaults`
- `permissions`
- `capabilities`
- `codeHookSpec`
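A minimal definition satisfying this contract might look like the following sketch. The field names come from the contract above; every value, including the node id and schema contents, is an illustrative assumption.
```python
# Hypothetical node definition; field names follow the contract, values are illustrative.
archive_extract_node = {
    "id": "builtin.archive-extract",
    "name": "Archive Extract",
    "category": "Transform",
    "version": "1.0.0",
    "description": "Extracts a delivered archive into a managed directory.",
    "inputSchema": {"type": "object", "properties": {"archive": {"$ref": "#/defs/assetRef"}}},
    "outputSchema": {"type": "object", "properties": {"directory": {"$ref": "#/defs/artifactRef"}}},
    "configSchema": {"type": "object", "properties": {"stripTopLevel": {"type": "boolean"}}},
    "uiSchema": {"stripTopLevel": {"widget": "toggle"}},
    "executorType": "python",
    "runtimeDefaults": {"retryCount": 1, "timeoutSeconds": 600},
    "permissions": ["asset:read", "artifact:write"],
    "capabilities": ["cacheable"],
    "codeHookSpec": {"entrypoint": "process", "allowed": True},
}
```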
### Code Hook Spec
V1 supports user code hooks only on:
- `Transform`
- `Inspect`
- `Utility`
Hooks must use a constrained entrypoint instead of arbitrary script structure.
Example:
```python
def process(input_data, context):
    return input_data
```
This keeps serialization, logging, and runtime control predictable.
The current V1 worker executes trusted-local Python hooks when a `run_task` carries a `codeHookSpec`. The hook is executed through a constrained Python harness with the task snapshot and execution context passed in as JSON. Hook stdout is captured into `stdoutLines`, hook failures populate `stderrLines`, and the returned object becomes the task artifact payload.
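A minimal harness along those lines, assuming the hook source defines the constrained `process` entrypoint and returns a JSON-serializable object (the function and result field names here are hypothetical, not the actual worker code):
```python
import contextlib
import io
import json

def run_hook(hook_source: str, task_snapshot: dict, context: dict) -> dict:
    """Sketch of a constrained hook harness: trusted-local execution only."""
    namespace: dict = {}
    exec(compile(hook_source, "<code_hook>", "exec"), namespace)
    process = namespace["process"]  # the constrained entrypoint required by the spec

    stdout_buffer = io.StringIO()
    result = {"stdoutLines": [], "stderrLines": [], "artifactPayload": None}
    try:
        with contextlib.redirect_stdout(stdout_buffer):
            result["artifactPayload"] = process(task_snapshot, context)
    except Exception as exc:  # hook failures populate stderrLines
        result["stderrLines"].append(repr(exc))
    result["stdoutLines"] = stdout_buffer.getvalue().splitlines()
    # Round-trip through JSON: the hook's return value must serialize cleanly.
    return json.loads(json.dumps(result))
```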
The current V1 Docker executor now has two modes:
- compatibility mode when no image is configured on the node runtime config
- real container mode when `executorConfig.image` is set
In real container mode the worker:
- creates a temp working directory
- writes `input.json` containing the frozen task snapshot and execution context
- mounts that directory into the container
- sets `EMBOFLOW_INPUT_PATH` and `EMBOFLOW_OUTPUT_PATH`
- captures container stdout and stderr from the Docker CLI process
- parses `output.json` back into the task artifact payload when present
The default Docker runtime policy is `--network none`. This keeps V1 safer for local processing nodes unless a later phase deliberately opens network access for containerized tasks.
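In practice the real-container path reduces to one Docker CLI invocation over the temp directory. Only the mounted working directory, `input.json`/`output.json`, the two `EMBOFLOW_*` environment variables, and the `--network none` default are confirmed above; the container mount point and everything else in this sketch are assumptions.
```python
import json
import subprocess
import tempfile
from pathlib import Path

def run_container_task(image: str, task_snapshot: dict, context: dict) -> dict:
    """Sketch of real container mode; mirrors the steps listed above."""
    with tempfile.TemporaryDirectory() as workdir:
        input_path = Path(workdir) / "input.json"
        input_path.write_text(json.dumps({"task": task_snapshot, "context": context}))

        proc = subprocess.run(
            [
                "docker", "run", "--rm",
                "--network", "none",                        # default V1 policy
                "-v", f"{workdir}:/emboflow",               # hypothetical mount point
                "-e", "EMBOFLOW_INPUT_PATH=/emboflow/input.json",
                "-e", "EMBOFLOW_OUTPUT_PATH=/emboflow/output.json",
                image,
            ],
            capture_output=True, text=True,
        )
        result = {
            "stdoutLines": proc.stdout.splitlines(),        # captured from the Docker CLI process
            "stderrLines": proc.stderr.splitlines(),
            "artifactPayload": None,
        }
        output_path = Path(workdir) / "output.json"
        if output_path.exists():                            # parsed back when present
            result["artifactPayload"] = json.loads(output_path.read_text())
        return result
```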
## Data Flow Contract
Tasks should exchange managed references, not loose file paths.
V1 reference types:
- `assetRef`
- `datasetVersionRef`
- `artifactRef`
- `annotationTaskRef`
- `inlineConfig`
Executors may materialize files internally, but the platform-level contract must remain reference-based.
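The reference schema itself is not specified here. A plausible minimal shape, with every field name an illustrative assumption, could be:
```python
# Hypothetical reference payloads; only the kind names come from the contract above.
asset_ref = {"kind": "assetRef", "assetId": "asset-123", "projectId": "proj-1"}
artifact_ref = {"kind": "artifactRef", "artifactId": "art-456", "producerTaskId": "task-7"}
inline_config = {"kind": "inlineConfig", "value": {"fps": 30}}
```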
## Validation Stages
Workflow execution must validate in this order:
1. workflow version exists
2. referenced plugins exist and are enabled
3. node schemas are valid
4. edge connections are schema-compatible
5. runtime configuration is complete
6. referenced assets and datasets are accessible
7. code hooks pass static validation
8. executor and scheduler requirements are satisfiable
Validation failure must block run creation.
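One way to enforce both the ordering and the run-creation block is a fail-fast validator chain; the stage and function names in this sketch are hypothetical.
```python
class ValidationError(Exception):
    """Raised by the first failing stage; run creation must not proceed."""

# Stage names in contract order; each maps to a callable in `validators`.
VALIDATION_STAGES = [
    "workflow_version_exists",
    "plugins_enabled",
    "node_schemas_valid",
    "edges_schema_compatible",
    "runtime_config_complete",
    "references_accessible",
    "code_hooks_statically_valid",
    "executor_requirements_satisfiable",
]

def validate_for_run(workflow_version, validators: dict) -> None:
    """Run every stage in order; the first ValidationError blocks run creation."""
    for stage in VALIDATION_STAGES:
        validators[stage](workflow_version)  # raises ValidationError on failure
```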
## Run Lifecycle
When a user executes a workflow:
1. resolve workflow version
2. validate and snapshot all runtime-relevant inputs, including bound asset references
3. resolve plugin versions
4. freeze node config and code hooks
5. compile graph into a DAG
6. create `WorkflowRun`
7. create `RunTask` entries
8. enqueue ready tasks
9. collect outputs, logs, and task state
10. finalize run status and summary
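Steps 5 and 8 amount to an indegree computation over the logic-layer edges. A minimal sketch, assuming nodes and edges are already resolved from the frozen graph:
```python
from collections import defaultdict

def compile_dag(nodes: list[str], edges: list[tuple[str, str]]):
    """Build dependency maps and the initially-ready frontier (steps 5 and 8)."""
    indegree = {node: 0 for node in nodes}
    downstream = defaultdict(list)
    for upstream, target in edges:
        indegree[target] += 1
        downstream[upstream].append(target)
    ready = [node for node, degree in indegree.items() if degree == 0]
    return indegree, downstream, ready

# When a task succeeds, decrement each downstream indegree and enqueue any
# node that reaches zero; this is the downstream promotion loop described later.
```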
## Run State Model
### WorkflowRun Status
- `pending`
- `queued`
- `running`
- `success`
- `failed`
- `cancelled`
- `partial_success`
### RunTask Status
- `pending`
- `queued`
- `running`
- `success`
- `failed`
- `cancelled`
- `skipped`
`partial_success` is used for workflows where non-blocking nodes fail but the run still produces valid outputs.
## Retry And Failure Policy
Each node instance may define:
- retry count
- retry backoff policy
- fail-fast behavior
- continue-on-error behavior
- manual retry eligibility
V1 should support:
- `fail_fast`
- `continue_on_error`
- `retry_n_times`
- `manual_retry`
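The backoff policy itself is left unspecified; a capped exponential backoff compatible with `retry_n_times` could look like this sketch (all constants are illustrative assumptions):
```python
def next_retry_delay(attempt: int, base_seconds: float = 5.0, cap_seconds: float = 300.0) -> float:
    """Hypothetical exponential backoff for retry_n_times; constants are illustrative."""
    return min(cap_seconds, base_seconds * (2 ** attempt))
```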
## Cache Model
V1 should support node-level cache reuse.
Recommended cache key inputs:
- workflow version
- node id
- upstream reference summary
- config summary
- code hook digest
- plugin version
- executor version
Cache hit behavior:
- reuse output artifact refs
- reuse output summaries
- retain previous logs reference
- mark task as cache-resolved in metadata
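A straightforward realization hashes a canonical JSON encoding of the recommended key inputs; the payload field names below are assumptions, the key inputs are from the list above.
```python
import hashlib
import json

def cache_key(workflow_version_id, node_id, upstream_summary, config_summary,
              code_hook_digest, plugin_version, executor_version) -> str:
    """Digest over the recommended cache key inputs; sorted keys keep it stable."""
    payload = {
        "workflowVersion": workflow_version_id,
        "nodeId": node_id,
        "upstream": upstream_summary,
        "config": config_summary,
        "codeHook": code_hook_digest,
        "plugin": plugin_version,
        "executor": executor_version,
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```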
## Execution Context
Each task receives a normalized execution context containing:
- workspace id
- project id
- workflow run id
- task id
- actor id
- node config
- code hook content
- input references
- storage context
- temp working directory
- runtime resource limits
This context must be available across Python, Docker, and HTTP executors.
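As a sketch, the normalized context can be one serializable mapping handed to every executor. The keys mirror the list above; the names and values themselves are illustrative assumptions.
```python
# Hypothetical normalized execution context; keys mirror the list above.
execution_context = {
    "workspaceId": "ws-1",
    "projectId": "proj-1",
    "workflowRunId": "run-1",
    "taskId": "task-1",
    "actorId": "user-1",
    "nodeConfig": {},          # frozen node config values
    "codeHook": None,          # hook source when the node carries one
    "inputRefs": [],           # managed references, never loose paths
    "storage": {"bucket": "local"},
    "tempDir": "/tmp/emboflow/task-1",
    "limits": {"cpu": 2, "memoryMb": 2048, "timeoutSeconds": 600},
}
```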
## Observability Requirements
Each task must emit:
- status transitions
- start time and finish time
- duration
- executor metadata
- resource request metadata
- stdout/stderr log stream
- structured task summary
- artifact refs
## Current V1 Implementation Notes
The current codebase still runs the low-level contract tests against an in-memory store, while the executable local runtime persists workflow state to MongoDB.
The persisted local runtime now covers:
- workspace and project bootstrap
- asset registration and probe reporting
- workflow definition and immutable version snapshots
- workflow runs and task creation with worker-consumable dependency snapshots
- workflow run asset bindings persisted on both runs and tasks
- project-scoped run history queries from Mongo-backed `workflow_runs`
- worker polling of queued tasks from Mongo-backed `run_tasks`
- run-task status transitions from `queued/pending` to `running/success/failed`
- downstream task promotion when upstream nodes succeed
- artifact registration and producer lookup
- task-level artifact creation by the worker runtime
The React workflow editor now loads the latest persisted version from the Mongo-backed API instead of rendering only a fixed starter graph. Draft edits are local editor state until the user saves, at which point the draft is serialized into a new workflow version document. Before a run is created, the editor loads project assets, requires one to be selected, and passes that binding to the API.
The editor right panel now exposes the first writable runtime controls instead of read-only node metadata. V1 users can override the executor type per node, configure a simple executor target such as HTTP URL or Docker image, override the produced artifact title, and author Python code-hook source for supported node categories.
The runtime Runs workspace now loads recent runs for the active project. Run detail views poll active runs until they settle and let the operator inspect task-level artifacts directly through Explore links.
The worker-backed runtime now persists task execution summaries directly on `run_tasks` instead of treating artifacts as the only observable output. Each completed or failed task records:
- `startedAt` and `finishedAt`
- `durationMs`
- appended `logLines`
- captured `stdoutLines` and `stderrLines`
- structured `summary` with outcome, executor, asset count, artifact ids, and failure text when present
- `lastResultPreview` for a lightweight selected-task preview in the Runs workspace
This makes the run detail view stable even when artifacts are large or delayed and keeps task-level observability queryable without reopening every artifact payload.
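A completed `run_tasks` document therefore carries roughly the shape below. The field names listed above are confirmed; the `taskId` key and all values are illustrative.
```python
# Illustrative completed run_tasks document; values are hypothetical.
completed_task = {
    "taskId": "task-1",
    "status": "success",
    "startedAt": "2026-03-27T03:00:00Z",
    "finishedAt": "2026-03-27T03:00:04Z",
    "durationMs": 4000,
    "logLines": ["task started", "task finished"],
    "stdoutLines": ["processed 12 files"],
    "stderrLines": [],
    "summary": {
        "outcome": "success",
        "executor": "python",
        "assetCount": 1,
        "artifactIds": ["art-1"],
        "failureText": None,
    },
    "lastResultPreview": {"type": "json", "value": {"ok": True}},
}
```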
The current runtime also aggregates execution state back onto `workflow_runs`. Each refresh computes:
- run-level `startedAt` and `finishedAt`
- run-level `durationMs`
- `summary.totalTaskCount`
- `summary.completedTaskCount`
- `summary.taskCounts`
- `summary.artifactCount`
- `summary.stdoutLineCount`
- `summary.stderrLineCount`
- `summary.failedTaskIds`
This allows the Runs workspace to render a stable top-level run summary without client-side recomputation across every task document.
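The aggregation is a single pass over the run's task documents. A sketch assuming the task shape above, and assuming "completed" means any terminal status (that definition is not confirmed by the text):
```python
from collections import Counter

def aggregate_run_summary(tasks: list[dict]) -> dict:
    """Single pass over run_tasks computing the run-level summary fields."""
    status_counts = Counter(task["status"] for task in tasks)
    terminal = {"success", "failed", "cancelled", "skipped"}  # assumed definition
    return {
        "totalTaskCount": len(tasks),
        "completedTaskCount": sum(status_counts[s] for s in terminal),
        "taskCounts": dict(status_counts),
        "artifactCount": sum(len(t.get("summary", {}).get("artifactIds", [])) for t in tasks),
        "stdoutLineCount": sum(len(t.get("stdoutLines", [])) for t in tasks),
        "stderrLineCount": sum(len(t.get("stderrLines", [])) for t in tasks),
        "failedTaskIds": [t["taskId"] for t in tasks if t["status"] == "failed"],
    }
```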
The current V1 runtime also implements the first run-control loop:
- `POST /api/runs/:runId/cancel`
Cancels queued and pending tasks for that run and prevents downstream promotion.
- `POST /api/runs/:runId/retry`
Creates a brand-new run from the original run snapshot, keeping workflow version and bound asset ids.
- `POST /api/runs/:runId/tasks/:taskId/retry`
Resets the failed or cancelled task plus its downstream subtree, increments the target task attempt count, and requeues from that node.
V1 cancellation is scheduler-level only. It does not attempt to hard-stop an executor that is already running inside the local worker loop.
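Task-level retry needs the downstream subtree of the target node, which is a breadth-first walk over the dependency edges; a sketch using the `downstream` map from the DAG compile step earlier:
```python
from collections import deque

def downstream_subtree(root: str, downstream: dict[str, list[str]]) -> set[str]:
    """Collect the target task plus everything reachable from it (BFS)."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Task retry then resets every task in downstream_subtree(target), increments
# the target task's attempt count, and requeues from that node.
```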
The selected-task panel in the current Runs workspace also shows the frozen node definition id, executor config snapshot, and code-hook metadata, so an operator can inspect what exact runtime settings were used without reopening the workflow editor.
The API and worker runtimes now both have direct integration coverage against a real Mongo runtime through `mongodb-memory-server`, in addition to the older in-memory contract tests.
The first web authoring surface already follows the three-pane layout contract with:
- left node library
- center workflow canvas
- right node configuration panel
The first explore surface currently includes built-in renderers for:
- JSON artifacts
- directory artifacts
- video artifacts
The UI must expose:
- graph-level run status
- node-level log inspection
- node-level artifact browsing
- task retry entrypoint
- direct navigation from a node to preview output
## Canvas Interaction Rules
V1 editor behavior should enforce:
- port-level connection rules
- incompatible edge blocking
- dirty-state detection
- explicit save before publish/run if graph changed
- per-node validation badges
- run from latest saved version, not unsaved draft
## Example V1 Pipelines
### Delivery Normalization
```text
Raw Folder Import
-> Archive Extract
-> Folder Rename
-> Directory Validation
-> Metadata Validation
-> Video Quality Check
-> Delivery Export
```
### Dataset Conversion
```text
Rosbag Reader
-> Canonical Mapping
-> Frame Filter
-> Metadata Normalize
-> LeRobot Writer
-> Training Config Export
```
## V1 Non-Goals
The V1 workflow engine does not need:
- loop semantics
- streaming execution
- unbounded dynamic fan-out
- event-driven triggers
- advanced distributed DAG partitioning
The V1 goal is a stable, observable DAG executor for data engineering workflows.