# EmboFlow Workflow Execution Model

## Goal

Define how EmboFlow represents, validates, executes, and observes canvas workflows.

The workflow system is the product core. The canvas is only the editing surface. The real system of record is the versioned workflow definition and its immutable run snapshots.
## Core Objects

- `WorkflowDefinition`: logical workflow identity under a project
- `WorkflowVersion`: immutable snapshot of nodes, edges, runtime defaults, and plugin references
- `NodeInstance`: concrete node on a workflow graph
- `WorkflowRun`: one execution of one workflow version
- `RunTask`: executable unit derived from a node during one run
- `Artifact`: managed output from a task or run
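The core objects above can be sketched as plain dataclasses. This is a minimal illustration of the object relationships, not the actual schema; field names beyond the ones defined above are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkflowVersion:
    """Immutable snapshot of nodes, edges, runtime defaults, and plugin refs."""
    workflow_id: str       # points back to the WorkflowDefinition
    version: int
    nodes: tuple = ()      # frozen NodeInstance snapshots
    edges: tuple = ()

@dataclass
class WorkflowRun:
    """One execution of one workflow version."""
    run_id: str
    workflow_id: str
    version: int
    status: str = "pending"

@dataclass
class RunTask:
    """Executable unit derived from a node during one run."""
    task_id: str
    run_id: str
    node_id: str
    status: str = "pending"
```

Freezing `WorkflowVersion` mirrors the immutability requirement: a run always points at a snapshot that cannot drift after execution starts.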
## Workflow Layers

Each workflow version contains three layers.

### Visual Layer

Used only by the editor:

- node positions
- collapsed state
- groups
- zoom defaults
- viewport metadata

### Logic Layer

Used for graph semantics:

- nodes
- edges
- input/output ports
- branch conditions
- merge semantics
- dependency graph

### Runtime Layer

Used for execution:

- node config values
- executor settings
- runtime resource limits
- retry policy
- code hooks
- cache policy

Visual changes must not change workflow semantics. Runtime changes must produce a new workflow version.
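One way to enforce the invariant above is to derive workflow identity from a digest over the logic and runtime layers only, so visual edits can never force a new version. A minimal sketch, assuming the version document is a dict keyed by layer name:

```python
import hashlib
import json

def semantic_digest(version_doc: dict) -> str:
    # Only the logic and runtime layers participate in workflow identity;
    # the visual layer (positions, groups, viewport) is deliberately excluded.
    relevant = {
        "logic": version_doc.get("logic", {}),
        "runtime": version_doc.get("runtime", {}),
    }
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Moving a node on the canvas changes only the `visual` key and leaves the digest unchanged; editing a retry policy changes `runtime` and yields a new digest, which signals that a new version must be published.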
## Node Categories

V1 node categories:

- `Source`
- `Transform`
- `Inspect`
- `Annotate`
- `Export`
- `Utility`

### V1 Built-In Node Families

- asset upload/import
- archive extract
- folder rename
- directory validation
- metadata validation
- video quality inspection
- dataset readers for RLDS, LeRobot, HDF5, and Rosbag
- canonical mapping nodes
- dataset writers and exporters
- training config export
- Python processing node
## Node Definition Contract

Each node definition must expose:

- `id`
- `name`
- `category`
- `version`
- `description`
- `inputSchema`
- `outputSchema`
- `configSchema`
- `uiSchema`
- `executorType`
- `runtimeDefaults`
- `permissions`
- `capabilities`
- `codeHookSpec`
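The contract can be checked mechanically at plugin registration time. A small sketch that reports which required fields a candidate definition fails to expose (the dict-based representation is an assumption):

```python
REQUIRED_NODE_FIELDS = {
    "id", "name", "category", "version", "description",
    "inputSchema", "outputSchema", "configSchema", "uiSchema",
    "executorType", "runtimeDefaults", "permissions",
    "capabilities", "codeHookSpec",
}

def missing_node_fields(definition: dict) -> set:
    """Return the contract fields a node definition fails to expose."""
    return REQUIRED_NODE_FIELDS - definition.keys()
```

Registration would reject any definition for which the returned set is non-empty.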
### Code Hook Spec

V1 supports user code hooks only on:

- `Transform`
- `Inspect`
- `Utility`

Hooks must use a constrained entrypoint instead of arbitrary script structure.

Example:

```python
def process(input_data, context):
    return input_data
```

This keeps serialization, logging, and runtime control predictable.
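The constrained entrypoint can be verified without executing user code. A sketch using the standard `ast` module to confirm a hook defines `process(input_data, context)` at module top level:

```python
import ast

def validate_hook(source: str) -> bool:
    """Statically check that a code hook defines the constrained
    entrypoint process(input_data, context) at module top level."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == "process":
            args = [a.arg for a in node.args.args]
            return args == ["input_data", "context"]
    return False
```

A check like this is what "code hooks pass static validation" in the validation stages below can mean in practice: malformed hooks are rejected before a run is ever created.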
## Data Flow Contract

Tasks should exchange managed references, not loose file paths.

V1 reference types:

- `assetRef`
- `datasetVersionRef`
- `artifactRef`
- `annotationTaskRef`
- `inlineConfig`

Executors may materialize files internally, but the platform-level contract must remain reference-based.
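A reference-based contract can be made concrete as a small tagged value type that rejects anything outside the closed set of V1 kinds. The class and field names here are illustrative assumptions:

```python
from dataclasses import dataclass

VALID_REF_KINDS = {
    "assetRef", "datasetVersionRef", "artifactRef",
    "annotationTaskRef", "inlineConfig",
}

@dataclass(frozen=True)
class ManagedRef:
    """A managed reference exchanged between tasks instead of a file path."""
    kind: str
    target_id: str

    def __post_init__(self):
        if self.kind not in VALID_REF_KINDS:
            raise ValueError(f"unknown reference kind: {self.kind}")
```

Because only these kinds cross task boundaries, an executor can materialize a reference to a local file internally while the platform contract stays path-free.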
## Validation Stages

Workflow execution must validate in this order:

1. workflow version exists
2. referenced plugins exist and are enabled
3. node schemas are valid
4. edge connections are schema-compatible
5. runtime configuration is complete
6. referenced assets and datasets are accessible
7. code hooks pass static validation
8. executor and scheduler requirements are satisfiable

Validation failure must block run creation.
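The ordered stages translate naturally into a short-circuiting pipeline: later stages may depend on earlier ones (schemas cannot be checked if the plugin is missing), so the first failure stops validation and blocks run creation. A sketch, assuming each stage is wired up as a named boolean check:

```python
def validate_for_run(checks) -> list:
    """Run the ordered validation stages in sequence.

    checks: ordered list of (stage_name, callable) pairs, where the
    callable returns True on success. Returns the names of failed
    stages; a non-empty result must block run creation.
    """
    failures = []
    for name, check in checks:
        if not check():
            failures.append(name)
            break  # later stages may depend on this one; stop here
    return failures
```

A caller would only create a `WorkflowRun` when the returned list is empty.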
## Run Lifecycle

When a user executes a workflow:

1. resolve workflow version
2. snapshot all runtime-relevant inputs
3. resolve plugin versions
4. freeze node config and code hooks
5. compile graph into a DAG
6. create `WorkflowRun`
7. create `RunTask` entries
8. enqueue ready tasks
9. collect outputs, logs, and task state
10. finalize run status and summary
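Step 5, compiling the graph into a DAG, is the point where cycles must be rejected and an execution order established. A standard way to do both at once is Kahn's algorithm; the sketch below takes node ids and `(upstream, downstream)` edge pairs:

```python
from collections import deque

def compile_dag(nodes, edges):
    """Topologically order nodes (Kahn's algorithm); raise on cycles."""
    indegree = {n: 0 for n in nodes}
    downstream = {n: [] for n in nodes}
    for up, down in edges:
        indegree[down] += 1
        downstream[up].append(down)
    # Tasks with no unmet dependencies are immediately enqueueable (step 8).
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for d in downstream[n]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    if len(order) != len(nodes):
        raise ValueError("workflow graph contains a cycle")
    return order
```

The same indegree bookkeeping drives scheduling at run time: a task becomes ready exactly when its remaining dependency count reaches zero.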
## Run State Model

### WorkflowRun Status

- `pending`
- `queued`
- `running`
- `success`
- `failed`
- `cancelled`
- `partial_success`

### RunTask Status

- `pending`
- `queued`
- `running`
- `success`
- `failed`
- `cancelled`
- `skipped`

`partial_success` is used for workflows where non-blocking nodes fail but the run still produces valid outputs.
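The run status can be derived by folding task statuses. The aggregation rules below are one plausible reading of the state model, not a confirmed specification; in particular, the `blocking` flag (whether a task's failure fails the whole run) is the mechanism assumed to drive `partial_success`:

```python
def derive_run_status(task_statuses: dict, blocking: dict) -> str:
    """Fold RunTask statuses into a WorkflowRun status.

    task_statuses: task id -> RunTask status
    blocking: task id -> True if this task's failure should fail the run
    """
    statuses = set(task_statuses.values())
    if "cancelled" in statuses:
        return "cancelled"
    failed = [t for t, s in task_statuses.items() if s == "failed"]
    if any(blocking[t] for t in failed):
        return "failed"
    if failed:
        return "partial_success"  # only non-blocking nodes failed
    if statuses <= {"success", "skipped"}:
        return "success"
    return "running"
```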
## Retry And Failure Policy

Each node instance may define:

- retry count
- retry backoff policy
- fail-fast behavior
- continue-on-error behavior
- manual retry eligibility

V1 should support:

- `fail_fast`
- `continue_on_error`
- `retry_n_times`
- `manual_retry`
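`retry_n_times` with a backoff policy reduces to a small loop. A sketch with exponential backoff; the injectable `sleep` parameter is an assumption made here so the executor (or a test) can control timing:

```python
import time

def run_with_retry(task_fn, retries: int, base_delay: float = 1.0,
                   sleep=time.sleep):
    """Execute a task with retry_n_times semantics and exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return task_fn()
        except Exception:
            if attempt == retries:
                # Exhausted: whether this fails the run (fail_fast) or is
                # absorbed (continue_on_error) is the caller's policy.
                raise
            sleep(base_delay * (2 ** attempt))
```

`manual_retry` would then be a user-triggered re-enqueue of the failed `RunTask` rather than anything in this loop.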
## Cache Model

V1 should support node-level cache reuse.

Recommended cache key inputs:

- workflow version
- node id
- upstream reference summary
- config summary
- code hook digest
- plugin version
- executor version

Cache hit behavior:

- reuse output artifact refs
- reuse output summaries
- retain previous logs reference
- mark task as cache-resolved in metadata
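The recommended key inputs can be combined into a single stable digest; canonical JSON plus SHA-256 is one straightforward construction. A sketch:

```python
import hashlib
import json

def cache_key(workflow_version, node_id, upstream_summary, config_summary,
              hook_digest, plugin_version, executor_version) -> str:
    """Derive a node-level cache key from the recommended inputs.

    Any change to any input (new plugin version, edited hook, different
    upstream outputs) produces a new key, i.e. a cache miss.
    """
    payload = json.dumps({
        "workflow_version": workflow_version,
        "node_id": node_id,
        "upstream": upstream_summary,
        "config": config_summary,
        "hook": hook_digest,
        "plugin": plugin_version,
        "executor": executor_version,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Including the executor version in the key is what keeps a cached output from being silently reused after an executor upgrade changes behavior.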
## Execution Context

Each task receives a normalized execution context containing:

- workspace id
- project id
- workflow run id
- task id
- actor id
- node config
- code hook content
- input references
- storage context
- temp working directory
- runtime resource limits

This context must be available across Python, Docker, and HTTP executors.
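Because the same context must reach Python, Docker, and HTTP executors, it helps to define it once as a serializable value. A sketch; the field types and the wire-format method are assumptions:

```python
from dataclasses import asdict, dataclass, field

@dataclass(frozen=True)
class ExecutionContext:
    """Normalized context handed to every task, regardless of executor."""
    workspace_id: str
    project_id: str
    run_id: str
    task_id: str
    actor_id: str
    node_config: dict = field(default_factory=dict)
    code_hook: str = ""
    input_refs: tuple = ()
    storage_context: dict = field(default_factory=dict)
    temp_dir: str = ""
    resource_limits: dict = field(default_factory=dict)

    def to_wire(self) -> dict:
        """Flatten to a plain dict for Docker/HTTP executors that take JSON."""
        return asdict(self)
```

An in-process Python executor can consume the object directly, while the Docker and HTTP executors receive `to_wire()` output as JSON, keeping one source of truth for the contract.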
## Observability Requirements

Each task must emit:

- status transitions
- start time and finish time
- duration
- executor metadata
- resource request metadata
- stdout/stderr log stream
- structured task summary
- artifact refs

The UI must allow:

- graph-level run status
- node-level log inspection
- node-level artifact browsing
- task retry entrypoint
- direct navigation from a node to its preview output
## Canvas Interaction Rules

V1 editor behavior should enforce:

- port-level connection rules
- blocking of incompatible edges
- dirty-state detection
- explicit save before publish/run if the graph changed
- per-node validation badges
- runs from the latest saved version, never an unsaved draft
## Example V1 Pipelines

### Delivery Normalization

```text
Raw Folder Import
-> Archive Extract
-> Folder Rename
-> Directory Validation
-> Metadata Validation
-> Video Quality Check
-> Delivery Export
```

### Dataset Conversion

```text
Rosbag Reader
-> Canonical Mapping
-> Frame Filter
-> Metadata Normalize
-> LeRobot Writer
-> Training Config Export
```
## V1 Non-Goals

The V1 workflow engine does not need:

- loop semantics
- streaming execution
- unbounded dynamic fan-out
- event-driven triggers
- advanced distributed DAG partitioning

The V1 goal is a stable, observable DAG executor for data engineering workflows.