EmboFlow/design/02-architecture/system-architecture.md

# EmboFlow System Architecture

## Architecture Style

EmboFlow V1 is a browser/server platform composed of:

- Web frontend
- Modular backend control plane
- Independent worker runtime
- MongoDB as the only database
- Object storage abstraction over cloud object storage or MinIO
- Local scheduler in V1, with a future migration path to Kubernetes and Volcano

The architecture should preserve clear service boundaries even if V1 is implemented as a modular monolith plus workers.

## High-Level Layers

### Frontend Layer

- Asset workspace
- Canvas workspace
- Explore workspace
- Label workspace
- Admin workspace

### Control Plane

- Identity and authorization
- Workspace and project management
- Asset and dataset metadata
- Workflow definition management
- Plugin registry and activation
- Run orchestration API
- Artifact indexing

### Execution Plane

- Workflow DAG compilation
- Task queue dispatch
- Worker execution
- Executor routing
- Log and artifact collection

### Storage Layer

- MongoDB for metadata and run state
- Object storage for files and large outputs
- Temporary local working directories for execution
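
As a minimal sketch of how the object storage side of this layer might be abstracted, the following defines a small interface plus an in-memory stand-in. The names `ObjectStore`, `InMemoryObjectStore`, `put`, `get`, and `save_task_log`, as well as the key layout, are hypothetical and not taken from the EmboFlow codebase. A real backend would wrap a cloud SDK or a MinIO client behind the same interface.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical storage interface; method names are assumptions."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryObjectStore:
    """Stand-in backend for tests; a real one would call cloud storage or MinIO."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        # Raises KeyError when the key was never written.
        return self._objects[key]


def save_task_log(store: ObjectStore, run_id: str, task_id: str, text: str) -> str:
    """Write a log under an illustrative key layout and return the key."""
    key = f"runs/{run_id}/tasks/{task_id}/log.txt"
    store.put(key, text.encode("utf-8"))
    return key
```

Callers depend only on `ObjectStore`, so switching between MinIO and a cloud provider would not touch control plane or worker code.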

## Core Domain Objects

- User
- Workspace
- Project
- Asset
- Dataset
- DatasetVersion
- WorkflowDefinition
- WorkflowVersion
- WorkflowRun
- RunTask
- Artifact
- AnnotationTask
- Annotation
- Plugin
- StorageConnection

## Raw Asset And Canonical Dataset Model

The platform must distinguish between:

- Raw Asset View
- Canonical Dataset View

Raw assets preserve source structure, file paths, metadata layout, and original naming. Canonical datasets provide a normalized semantic layer for workflow nodes and export logic.

Visualization may read raw assets directly. Conversion, orchestration, and export should primarily target canonical semantics.

## Workflow Model

Workflow definitions are versioned and contain:

- Visual graph state
- Logical node and edge graph
- Runtime configuration
- Plugin references

Workflow execution produces immutable workflow runs. A run snapshots:

- Workflow version
- Bound asset references
- Node configuration
- Injected code
- Executor settings
- Input bindings

Runs compile into task DAGs.
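
The compile step can be illustrated with a short sketch that turns a logical node and edge graph into dependency-ordered tasks. This is one possible shape under stated assumptions, not the repository's compiler: `RunTask`, `compile_task_dag`, and the `run_id:node_id` task id format are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class RunTask:
    """Hypothetical task record; field names are assumptions."""
    task_id: str
    node_id: str
    depends_on: tuple[str, ...]


def compile_task_dag(
    run_id: str, nodes: list[str], edges: list[tuple[str, str]]
) -> list[RunTask]:
    """Return one task per node, ordered so upstream tasks come first.

    Each edge is an (upstream, downstream) pair of node ids.
    """
    upstream: dict[str, set[str]] = {node: set() for node in nodes}
    for src, dst in edges:
        if src not in upstream or dst not in upstream:
            raise ValueError(f"edge references unknown node: {src} -> {dst}")
        upstream[dst].add(src)

    # static_order() raises graphlib.CycleError (a ValueError) on cycles.
    order = list(TopologicalSorter(upstream).static_order())
    return [
        RunTask(
            task_id=f"{run_id}:{node}",
            node_id=node,
            depends_on=tuple(sorted(f"{run_id}:{u}" for u in upstream[node])),
        )
        for node in order
    ]
```

Because the run snapshots its workflow version and node configuration, compiling from the snapshot rather than from the live definition keeps the resulting task DAG stable even if the workflow is edited later.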

## Node And Plugin Model

### Node Categories

- Source
- Transform
- Inspect
- Annotate
- Export
- Utility

### Node Definition Contract

Each node definition includes:

- Metadata
- Input schema
- Output schema
- Config schema
- UI schema
- Executor type
- Runtime limits
- Optional code hook contract
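
A rough sketch of how this contract might look as a typed record follows. The field names mirror the list above, but their concrete types, the placement of the category inside `metadata`, and the validation rules are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Values taken from the node categories and executors named in this document.
NODE_CATEGORIES = {"source", "transform", "inspect", "annotate", "export", "utility"}
EXECUTOR_TYPES = {"python", "docker", "http"}


@dataclass(frozen=True)
class NodeDefinition:
    """Hypothetical shape of a node definition; types are assumptions."""
    metadata: dict[str, Any]
    input_schema: dict[str, Any]
    output_schema: dict[str, Any]
    config_schema: dict[str, Any]
    ui_schema: dict[str, Any]
    executor_type: str
    runtime_limits: dict[str, Any]
    code_hook: Optional[dict[str, Any]] = None  # optional code hook contract

    def __post_init__(self) -> None:
        # Assumes the category is stored under metadata["category"].
        category = self.metadata.get("category")
        if category not in NODE_CATEGORIES:
            raise ValueError(f"unknown node category: {category!r}")
        if self.executor_type not in EXECUTOR_TYPES:
            raise ValueError(f"unknown executor type: {self.executor_type!r}")
```

Validating at construction time means a malformed plugin definition would be rejected at registration rather than at run time.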

### Plugin Types

- Node plugins
- Reader/writer plugins
- Renderer plugins
- Executor plugins
- Integration plugins

## Execution Architecture

### Executors

- Python executor
- Docker executor
- HTTP executor

V1 should prioritize the Python and Docker executors. The HTTP executor is useful for integrating external services.

### Schedulers

- Local scheduler in V1
- Kubernetes scheduler later
- Volcano scheduler later

Executors and schedulers are separate abstractions:

- Executor defines how logic runs
- Scheduler defines where and under what scheduling policy it runs
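
The separation can be sketched as two independent interfaces. Everything named here (`Task`, `Executor`, `Scheduler`, `EchoExecutor`, `LocalScheduler`, `run`, `submit`) is hypothetical and only illustrates the split; it is not the repository's API.

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class Task:
    """Hypothetical unit of work handed to a scheduler."""
    task_id: str
    payload: dict[str, Any]


class Executor(Protocol):
    """Defines how the logic runs (for example Python, Docker, or HTTP)."""

    def run(self, task: Task) -> dict[str, Any]: ...


class Scheduler(Protocol):
    """Defines where, and under what policy, an executor is invoked."""

    def submit(self, task: Task, executor: Executor) -> dict[str, Any]: ...


class EchoExecutor:
    """Toy executor that returns the payload unchanged."""

    def run(self, task: Task) -> dict[str, Any]:
        return {"task_id": task.task_id, "output": task.payload}


class LocalScheduler:
    """Toy scheduler that runs the task immediately in the current process."""

    def submit(self, task: Task, executor: Executor) -> dict[str, Any]:
        return executor.run(task)
```

With this split, a later Kubernetes or Volcano scheduler would only need to provide another `submit` implementation, while the executors stay unchanged.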

## Storage Architecture

### MongoDB Collections

Recommended primary collections:

- users
- workspaces
- projects
- memberships
- assets
- asset_probe_reports
- datasets
- dataset_versions
- workflow_definitions
- workflow_definition_versions
- workflow_runs
- run_tasks
- artifacts
- annotation_tasks
- annotations
- plugins
- storage_connections
- audit_logs

### Object Storage Content

- Raw uploads
- Imported archives
- Normalized export packages
- Training config packages
- Preview resources
- Logs and attachments
- Large manifests and file indexes

## Security Model

User-injected code is low-trust code and must not run in web or API processes.

V1 runtime policy:

- Built-in trusted nodes may use the Python executor
- Plugin code should run in controlled runtimes
- User-injected code should default to the Docker executor
- Network access should be denied by default for user code
- Input and output paths should be explicitly mounted
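
A minimal sketch of how this policy might be encoded is shown below. `CodeOrigin`, `default_executor`, `docker_run_args`, and the container paths `/data/in` and `/data/out` are hypothetical. Mapping plugin code to the Docker executor is also an assumption, since the policy only requires a controlled runtime. The Docker flags used (`--rm`, `--network none`, `-v host:container[:ro]`) are standard `docker run` options.

```python
from enum import Enum


class CodeOrigin(Enum):
    BUILT_IN = "built_in"
    PLUGIN = "plugin"
    USER_INJECTED = "user_injected"


def default_executor(origin: CodeOrigin) -> str:
    """Pick a default executor type from the trust level of the code."""
    if origin is CodeOrigin.BUILT_IN:
        return "python"
    # Assumption: plugin and user-injected code both default to Docker.
    return "docker"


def docker_run_args(
    image: str, input_dir: str, output_dir: str, allow_network: bool = False
) -> list[str]:
    """Build a `docker run` argument list with explicit mounts."""
    args = ["docker", "run", "--rm"]
    if not allow_network:
        # Deny network access by default for low-trust code.
        args += ["--network", "none"]
    args += [
        "-v", f"{input_dir}:/data/in:ro",  # inputs mounted read-only
        "-v", f"{output_dir}:/data/out",   # outputs mounted writable
        image,
    ]
    return args
```

For example, `docker_run_args("my-image", "/tmp/in", "/tmp/out")` yields a command with no network and only the two mounted directories visible to the container.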

## Deployment Direction

The V1 deployment target is a single public server running containerized application services. The architecture must still preserve a migration path to multi-node environments.

## Current Runtime Implementation Notes

The repository runtime currently includes:

- a real HTTP API process backed by MongoDB
- a React and Vite web application that reads those APIs
- a local-path asset registration flow for development and dataset inspection
- a worker process that polls Mongo-backed `run_tasks`, creates task artifacts, and refreshes run status

The repository still keeps some in-memory module tests for contract stability. The executable local stack, however, now runs through Mongo-backed runtime services, and HTTP integration tests run against a real Mongo runtime.
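
The worker behavior described above (poll `run_tasks`, create task artifacts, refresh run status) can be sketched with plain in-memory lists standing in for the Mongo collections. The field names (`_id`, `run_id`, `status`), the status values, and the function names are assumptions for illustration. In a Mongo-backed version the claim step would need to be atomic, for example via pymongo's `find_one_and_update`.

```python
from typing import Any, Callable, Optional

TaskDoc = dict[str, Any]


def claim_next_task(run_tasks: list[TaskDoc]) -> Optional[TaskDoc]:
    """Mark the first queued task as running and return it, if any."""
    for task in run_tasks:
        if task["status"] == "queued":
            task["status"] = "running"
            return task
    return None


def refresh_run_status(run: TaskDoc, run_tasks: list[TaskDoc]) -> None:
    """Derive the run status from the statuses of its tasks."""
    statuses = [t["status"] for t in run_tasks if t["run_id"] == run["_id"]]
    if "failed" in statuses:
        run["status"] = "failed"
    elif statuses and all(s == "succeeded" for s in statuses):
        run["status"] = "succeeded"
    else:
        run["status"] = "running"


def poll_once(
    run_tasks: list[TaskDoc],
    runs: dict[str, TaskDoc],
    artifacts: list[TaskDoc],
    handler: Callable[[TaskDoc], Any],
) -> bool:
    """Process at most one task; return False when nothing is queued."""
    task = claim_next_task(run_tasks)
    if task is None:
        return False
    try:
        output = handler(task)
        artifacts.append(
            {"task_id": task["_id"], "run_id": task["run_id"], "content": output}
        )
        task["status"] = "succeeded"
    except Exception as exc:  # record the failure instead of crashing the worker
        task["status"] = "failed"
        task["error"] = str(exc)
    refresh_run_status(runs[task["run_id"]], run_tasks)
    return True
```

A worker process would call `poll_once` in a loop and sleep briefly whenever it returns `False`.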