EmboFlow System Architecture
Architecture Style
EmboFlow V1 is a browser/server platform built as:
- Web frontend
- Modular backend control plane
- Independent worker runtime
- MongoDB as the only database
- Object storage abstraction over cloud object storage or MinIO
- Local scheduler in V1 with future migration path to Kubernetes and Volcano
The architecture should preserve clear service boundaries even if V1 is implemented as a modular monolith plus workers.
High-Level Layers
Frontend Layer
- Asset workspace
- Canvas workspace
- Explore workspace
- Label workspace
- Admin workspace
Control Plane
- Identity and authorization
- Workspace and project management
- Asset and dataset metadata
- Workflow definition management
- Plugin registry and activation
- Run orchestration API
- Artifact indexing
Execution Plane
- Workflow DAG compilation
- Task queue dispatch
- Worker execution
- Executor routing
- Log and artifact collection
Storage Layer
- MongoDB for metadata and run state
- Object storage for files and large outputs
- Temporary local working directories for execution
Core Domain Objects
- User
- Workspace
- Project
- Asset
- Dataset
- DatasetVersion
- WorkflowDefinition
- WorkflowVersion
- WorkflowRun
- RunTask
- Artifact
- AnnotationTask
- Annotation
- Plugin
- StorageConnection
Raw Asset And Canonical Dataset Model
The platform must distinguish between:
- Raw Asset View
- Canonical Dataset View
Raw assets preserve source structure, file paths, metadata layout, and original naming. Canonical datasets provide a normalized semantic layer for workflow nodes and export logic.
Visualization may read raw assets directly. Conversion, orchestration, and export should primarily target canonical semantics.
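The split between the two views can be sketched as follows. This is a minimal illustration, not the EmboFlow schema: `RawAssetFile`, `CanonicalRecord`, `MODALITY_BY_SUFFIX`, and `to_canonical` are hypothetical names, and the suffix-to-modality mapping is an assumption made only for the example.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class RawAssetFile:
    """One file exactly as uploaded (hypothetical shape)."""
    source_path: str   # original path and name, never rewritten
    size_bytes: int


@dataclass(frozen=True)
class CanonicalRecord:
    """Normalized entry consumed by workflow nodes and exporters (hypothetical)."""
    record_id: str
    modality: str          # illustrative values such as "image" or "pointcloud"
    raw_source_path: str   # back-reference to the untouched raw file


# Assumed mapping for illustration only; not part of the EmboFlow design.
MODALITY_BY_SUFFIX = {".png": "image", ".jpg": "image", ".pcd": "pointcloud"}


def to_canonical(files: list[RawAssetFile]) -> list[CanonicalRecord]:
    """Derive canonical records without modifying the raw asset entries."""
    records = []
    for index, raw in enumerate(files):
        suffix = PurePosixPath(raw.source_path).suffix.lower()
        modality = MODALITY_BY_SUFFIX.get(suffix)
        if modality is None:
            continue  # unrecognized files remain visible in the raw view only
        records.append(CanonicalRecord(f"rec-{index:04d}", modality, raw.source_path))
    return records
```

The raw entries keep their original paths and casing, while the canonical records carry only the normalized fields plus a pointer back to the source.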
Workflow Model
Workflow definitions are versioned and contain:
- Visual graph state
- Logical node and edge graph
- Runtime configuration
- Plugin references
Workflow execution produces immutable workflow runs. A run snapshots:
- Workflow version
- Node configuration
- Injected code
- Executor settings
- Input bindings
Runs compile into task DAGs.
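One way to picture an immutable run snapshot and its compilation into a task order is the sketch below. It assumes the logical graph is available as a tuple of (upstream, downstream) node id pairs. `RunSnapshot`, `freeze`, and `compile_task_order` are hypothetical names, and the standard-library `graphlib.TopologicalSorter` stands in for whatever DAG compiler the platform ends up using.

```python
import copy
from dataclasses import dataclass
from graphlib import TopologicalSorter
from types import MappingProxyType


def freeze(values: dict) -> MappingProxyType:
    """Return a read-only top-level view over a private deep copy of `values`."""
    return MappingProxyType(copy.deepcopy(values))


@dataclass(frozen=True)
class RunSnapshot:
    """Everything a run captures at start time (field names follow the list above)."""
    workflow_version: str
    node_config: MappingProxyType
    injected_code: MappingProxyType
    executor_settings: MappingProxyType
    input_bindings: MappingProxyType
    edges: tuple  # assumed representation: (upstream_id, downstream_id) pairs


def compile_task_order(snapshot: RunSnapshot) -> list:
    """Order node ids so that every node comes after all of its upstream nodes."""
    sorter = TopologicalSorter()
    for node_id in snapshot.node_config:
        sorter.add(node_id)
    for upstream, downstream in snapshot.edges:
        sorter.add(downstream, upstream)
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    return list(sorter.static_order())
```

Because the snapshot holds its own copy of the configuration, editing the workflow definition after a run starts does not change what that run executes.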
Node And Plugin Model
Node Categories
- Source
- Transform
- Inspect
- Annotate
- Export
- Utility
Node Definition Contract
Each node definition includes:
- Metadata
- Input schema
- Output schema
- Config schema
- UI schema
- Executor type
- Runtime limits
- Optional code hook contract
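The contract above could be expressed as a typed record, roughly as sketched here. The field names mirror the list. The `RuntimeLimits` fields and their default values, the `category` key inside `metadata`, and the validation rules are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Categories and executor types as listed in this document, lowercased.
NODE_CATEGORIES = {"source", "transform", "inspect", "annotate", "export", "utility"}
EXECUTOR_TYPES = {"python", "docker", "http"}


@dataclass(frozen=True)
class RuntimeLimits:
    timeout_seconds: int = 600   # illustrative default, not a documented value
    memory_mb: int = 1024        # illustrative default, not a documented value


@dataclass(frozen=True)
class NodeDefinition:
    metadata: dict        # assumed to carry at least "name" and "category"
    input_schema: dict
    output_schema: dict
    config_schema: dict
    ui_schema: dict
    executor_type: str
    runtime_limits: RuntimeLimits = field(default_factory=RuntimeLimits)
    code_hook_contract: Optional[dict] = None  # optional, per the contract

    def __post_init__(self):
        # Reject definitions that name an unknown category or executor.
        if self.metadata.get("category") not in NODE_CATEGORIES:
            raise ValueError(f"unknown node category: {self.metadata.get('category')!r}")
        if self.executor_type not in EXECUTOR_TYPES:
            raise ValueError(f"unknown executor type: {self.executor_type!r}")
```

Validating at construction time means a malformed plugin node fails at registration rather than in the middle of a run.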
Plugin Types
- Node plugins
- Reader/writer plugins
- Renderer plugins
- Executor plugins
- Integration plugins
Execution Architecture
Executors
- Python executor
- Docker executor
- HTTP executor
V1 should prioritize the Python and Docker executors. The HTTP executor is useful for integrating external services.
Schedulers
- Local scheduler in V1
- Kubernetes scheduler later
- Volcano scheduler later
Executors and schedulers are separate abstractions:
- Executor defines how logic runs
- Scheduler defines where and under what scheduling policy it runs
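The separation can be sketched with two small interfaces. The method names `run` and `submit` and the task dictionary shape are assumptions. `PythonExecutor` and `LocalScheduler` here are simplified stand-ins that run everything in-process.

```python
from typing import Callable, Protocol


class Executor(Protocol):
    """How a task's logic runs."""
    def run(self, task: dict) -> dict: ...


class Scheduler(Protocol):
    """Where, and under which policy, a task runs."""
    def submit(self, executor: Executor, task: dict) -> dict: ...


class PythonExecutor:
    """Stand-in that dispatches to registered in-process callables."""

    def __init__(self, handlers: dict[str, Callable[[dict], dict]]):
        self._handlers = handlers

    def run(self, task: dict) -> dict:
        handler = self._handlers[task["node_type"]]
        return handler(task["inputs"])


class LocalScheduler:
    """Stand-in for the V1 local scheduler: runs the task immediately on this host."""

    def submit(self, executor: Executor, task: dict) -> dict:
        return executor.run(task)
```

Under this split, a later Kubernetes or Volcano scheduler would replace only `LocalScheduler`, and the executors would stay unchanged.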
Storage Architecture
MongoDB Collections
Recommended primary collections:
- users
- workspaces
- projects
- memberships
- assets
- asset_probe_reports
- datasets
- dataset_versions
- workflow_definitions
- workflow_definition_versions
- workflow_runs
- run_tasks
- artifacts
- annotation_tasks
- annotations
- plugins
- storage_connections
- audit_logs
Object Storage Content
- Raw uploads
- Imported archives
- Normalized export packages
- Training config packages
- Preview resources
- Logs and attachments
- Large manifests and file indexes
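A thin interface is one way to keep this content independent of the backing store (cloud object storage or MinIO). The sketch below is an assumption throughout: the `ObjectStore` methods, the in-memory implementation used for demonstration, and the `object_key` layout are all hypothetical.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Minimal surface a backend (cloud bucket, MinIO) would need to provide."""
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def list_keys(self, prefix: str) -> list[str]: ...


class InMemoryObjectStore:
    """Dictionary-backed store, useful only for tests and examples."""

    def __init__(self):
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str) -> list[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))


def object_key(project_id: str, kind: str, name: str) -> str:
    """Hypothetical key layout grouping objects by project and content kind."""
    return f"projects/{project_id}/{kind}/{name}"
```

MongoDB documents would then store only the key, while the bytes live behind the interface.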
Security Model
User-injected code is low-trust and must never run inside the web or API processes.
V1 runtime policy:
- Built-in trusted nodes may use Python executor
- Plugin code should run in controlled runtimes
- User-injected code should default to Docker executor
- Network access should be denied by default for user code
- Input and output paths should be explicitly mounted
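The policy above might translate into code along these lines. The `Trust` enum, the function names, and the `/inputs` and `/outputs` container paths are hypothetical. Routing plugin code to Docker is an assumption, since the policy only requires a "controlled runtime". The Docker flags used (`--rm`, `--network none`, `-v` with a `:ro` suffix) are standard `docker run` options.

```python
from enum import Enum


class Trust(Enum):
    BUILTIN = "builtin"   # built-in trusted nodes
    PLUGIN = "plugin"     # plugin code
    USER = "user"         # user-injected code


def select_executor(trust: Trust) -> str:
    """Only built-in nodes run in the Python executor; the rest default to Docker.

    Sending plugin code to Docker is an assumption made for this sketch.
    """
    return "python" if trust is Trust.BUILTIN else "docker"


def docker_argv(image: str, command: list[str], input_dir: str, output_dir: str,
                allow_network: bool = False) -> list[str]:
    """Build a `docker run` argument list with explicit mounts and no network by default."""
    argv = ["docker", "run", "--rm"]
    if not allow_network:
        argv += ["--network", "none"]          # deny network access by default
    argv += ["-v", f"{input_dir}:/inputs:ro"]  # inputs mounted read-only
    argv += ["-v", f"{output_dir}:/outputs"]   # outputs mounted read-write
    argv += [image, *command]
    return argv
```

The function only builds the argument list. A worker would pass it to `subprocess.run` and never execute it inside the web or API process.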
Deployment Direction
The V1 deployment target is a single public server running containerized application services. The architecture must still allow later migration to multi-node environments.