# EmboFlow System Architecture
## Architecture Style
EmboFlow V1 is a browser/server platform composed of:
- Web frontend
- Modular backend control plane
- Independent worker runtime
- MongoDB as the only database
- Object storage abstraction over cloud object storage or MinIO
- Local scheduler in V1, with a future migration path to Kubernetes and Volcano

The architecture should preserve clear service boundaries even if V1 is implemented as a modular monolith plus workers.
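
The object storage abstraction listed above could take roughly the following shape. This is a minimal sketch, not the project's actual interface: `ObjectStore`, `LocalDirStore`, and the `put`/`get` method names are assumptions made for illustration, and a MinIO or cloud backend would implement the same two methods.

```python
from pathlib import Path
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical storage interface; the method names are assumptions."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class LocalDirStore:
    """Filesystem-backed stand-in for local development and tests."""

    def __init__(self, root: Path) -> None:
        self.root = root.resolve()

    def _path(self, key: str) -> Path:
        # Resolve the key under the root and reject path traversal.
        path = (self.root / key).resolve()
        if not path.is_relative_to(self.root):
            raise ValueError(f"key escapes store root: {key!r}")
        return path

    def put(self, key: str, data: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return self._path(key).read_bytes()
```

Because callers depend only on the `ObjectStore` protocol, swapping the backend would not need to touch control-plane or worker code.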
## High-Level Layers
### Frontend Layer
- Asset workspace
- Canvas workspace
- Explore workspace
- Label workspace
- Admin workspace
### Control Plane
- Identity and authorization
- Workspace and project management
- Asset and dataset metadata
- Workflow definition management
- Plugin registry and activation
- Run orchestration API
- Artifact indexing
### Execution Plane
- Workflow DAG compilation
- Task queue dispatch
- Worker execution
- Executor routing
- Log and artifact collection
### Storage Layer
- MongoDB for metadata and run state
- Object storage for files and large outputs
- Temporary local working directories for execution
## Core Domain Objects
- User
- Workspace
- Project
- Asset
- Dataset
- DatasetVersion
- WorkflowDefinition
- WorkflowVersion
- WorkflowRun
- RunTask
- Artifact
- AnnotationTask
- Annotation
- Plugin
- StorageConnection
## Raw Asset And Canonical Dataset Model
The platform must distinguish between:
- Raw Asset View
- Canonical Dataset View

Raw assets preserve source structure, file paths, metadata layout, and original naming. Canonical datasets provide a normalized semantic layer for workflow nodes and export logic.

Visualization may read raw assets directly. Conversion, orchestration, and export should primarily target canonical semantics.
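
One way to picture the two views is as a pair of record types with a one-way derivation between them. The sketch below is illustrative only: `RawAssetFile`, `CanonicalRecord`, `to_canonical`, and the suffix-to-modality table are hypothetical names and values, not part of the EmboFlow design.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class RawAssetFile:
    """One file exactly as uploaded (hypothetical shape)."""
    source_path: str   # original path and name, preserved verbatim
    size_bytes: int


@dataclass(frozen=True)
class CanonicalRecord:
    """Normalized record that workflow nodes and exporters would consume."""
    record_id: str
    modality: str      # looked up from the file suffix below
    raw_path: str      # back-reference to the untouched raw file


# Assumed suffix-to-modality table, for illustration only.
MODALITY_BY_SUFFIX = {".png": "image", ".jpg": "image", ".json": "metadata"}


def to_canonical(files: list) -> list:
    """Derive canonical records without renaming or mutating raw files."""
    records = []
    for index, raw in enumerate(files):
        suffix = PurePosixPath(raw.source_path).suffix.lower()
        modality = MODALITY_BY_SUFFIX.get(suffix, "unknown")
        records.append(CanonicalRecord(f"rec-{index:04d}", modality, raw.source_path))
    return records
```

The raw view keeps `source_path` untouched so visualization can still read the original layout, while downstream nodes work only with `CanonicalRecord` fields.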
## Workflow Model
Workflow definitions are versioned and contain:
- Visual graph state
- Logical node and edge graph
- Runtime configuration
- Plugin references

Workflow execution produces immutable workflow runs. A run snapshots:
- Workflow version
- Node configuration
- Injected code
- Executor settings
- Input bindings

Runs compile into task DAGs.
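
The snapshot-then-compile flow can be sketched with the standard library's `graphlib`. The `RunSnapshot` shape and the function names are assumptions for illustration, and the snapshot covers only a subset of the fields listed above (workflow version, node configuration, and the edge graph).

```python
import copy
from dataclasses import dataclass
from graphlib import CycleError, TopologicalSorter


@dataclass(frozen=True)
class RunSnapshot:
    """Hypothetical snapshot shape; field names are illustrative."""
    workflow_version: str
    node_config: dict   # node id -> config, deep-copied at snapshot time
    edges: tuple        # (upstream, downstream) node id pairs


def snapshot_run(workflow_version: str, node_config: dict, edges: list) -> RunSnapshot:
    """Copy mutable inputs so later edits to the definition cannot leak into the run."""
    frozen_edges = tuple(tuple(edge) for edge in edges)
    return RunSnapshot(workflow_version, copy.deepcopy(node_config), frozen_edges)


def compile_task_order(snapshot: RunSnapshot) -> list:
    """Return node ids so that every upstream node precedes its consumers."""
    sorter = TopologicalSorter()
    for node_id in snapshot.node_config:
        sorter.add(node_id)
    for upstream, downstream in snapshot.edges:
        sorter.add(downstream, upstream)  # downstream depends on upstream
    # static_order() raises CycleError when the graph is not a DAG.
    return list(sorter.static_order())
```

A compiler built this way would reject cyclic graphs at compile time (via `CycleError`) rather than at dispatch time, which keeps invalid runs out of the task queue.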
## Node And Plugin Model
### Node Categories
- Source
- Transform
- Inspect
- Annotate
- Export
- Utility
### Node Definition Contract
Each node definition includes:
- Metadata
- Input schema
- Output schema
- Config schema
- UI schema
- Executor type
- Runtime limits
- Optional code hook contract
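
The contract above maps naturally onto a typed record. The following is a sketch under stated assumptions: the field names mirror the list, but the `RuntimeLimits` fields, their default values, and the use of plain dictionaries for schemas are illustrative choices, not decisions recorded in this document.

```python
from dataclasses import dataclass, field
from typing import Optional

# Executor types named in this document; executor plugins might extend this set.
EXECUTOR_TYPES = {"python", "docker", "http"}


@dataclass(frozen=True)
class RuntimeLimits:
    timeout_seconds: int = 600   # assumed default
    memory_mb: int = 1024        # assumed default


@dataclass(frozen=True)
class NodeDefinition:
    metadata: dict
    input_schema: dict
    output_schema: dict
    config_schema: dict
    ui_schema: dict
    executor_type: str
    runtime_limits: RuntimeLimits = field(default_factory=RuntimeLimits)
    code_hook: Optional[dict] = None  # None means the node accepts no injected code

    def __post_init__(self) -> None:
        # Fail at registration time rather than at dispatch time.
        if self.executor_type not in EXECUTOR_TYPES:
            raise ValueError(f"unknown executor type: {self.executor_type!r}")
```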
### Plugin Types
- Node plugins
- Reader/writer plugins
- Renderer plugins
- Executor plugins
- Integration plugins
## Execution Architecture
### Executors
- Python executor
- Docker executor
- HTTP executor

V1 should prioritize the Python and Docker executors. The HTTP executor is useful for integrating external services.
### Schedulers
- Local scheduler in V1
- Kubernetes scheduler later
- Volcano scheduler later

Executors and schedulers are separate abstractions:
- Executor defines how logic runs
- Scheduler defines where and under what scheduling policy it runs
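
The separation described above can be expressed as two independent interfaces. This is a minimal sketch, assuming dictionary-shaped tasks; `Executor`, `Scheduler`, `PythonExecutor`, `LocalScheduler`, and their method names are hypothetical.

```python
from typing import Callable, Protocol


class Executor(Protocol):
    """How logic runs (hypothetical interface)."""

    def run(self, task: dict) -> dict: ...


class Scheduler(Protocol):
    """Where, and under what policy, a task runs (hypothetical interface)."""

    def submit(self, executor: Executor, task: dict) -> dict: ...


class PythonExecutor:
    """Runs a registered in-process callable for the task's node."""

    def __init__(self, registry: dict) -> None:
        self.registry: dict[str, Callable[[dict], dict]] = registry

    def run(self, task: dict) -> dict:
        return self.registry[task["node"]](task["inputs"])


class LocalScheduler:
    """Runs the task immediately on the current machine."""

    def submit(self, executor: Executor, task: dict) -> dict:
        return executor.run(task)
```

For example, `LocalScheduler().submit(PythonExecutor({"double": lambda i: {"value": i["value"] * 2}}), {"node": "double", "inputs": {"value": 21}})` runs the callable in-process. A later Kubernetes or Volcano scheduler would replace only `LocalScheduler`, leaving executors unchanged.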
## Storage Architecture
### MongoDB Collections
Recommended primary collections:
- users
- workspaces
- projects
- memberships
- assets
- asset_probe_reports
- datasets
- dataset_versions
- workflow_definitions
- workflow_definition_versions
- workflow_runs
- run_tasks
- artifacts
- annotation_tasks
- annotations
- plugins
- storage_connections
- audit_logs
### Object Storage Content
- Raw uploads
- Imported archives
- Normalized export packages
- Training config packages
- Preview resources
- Logs and attachments
- Large manifests and file indexes
## Security Model
User-injected code is low-trust code and must not run in web or API processes.

V1 runtime policy:
- Built-in trusted nodes may use Python executor
- Plugin code should run in controlled runtimes
- User-injected code should default to Docker executor
- Network access should be denied by default for user code
- Input and output paths should be explicitly mounted
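
The policy above could be enforced where a task is turned into a container invocation. The sketch below is illustrative: `Trust`, `default_executor`, `docker_command`, and the `/inputs` and `/outputs` mount targets are assumptions, and routing plugin code to Docker is one possible reading of "controlled runtimes". The `docker run` flags used (`--rm`, `--network none`, `--mount type=bind,...`) are standard Docker CLI options.

```python
from enum import Enum


class Trust(Enum):
    BUILTIN = "builtin"
    PLUGIN = "plugin"
    USER = "user"


def default_executor(trust: Trust) -> str:
    """Only built-in nodes run in-process; sending plugins to Docker is an assumption."""
    return "python" if trust is Trust.BUILTIN else "docker"


def docker_command(image: str, input_dir: str, output_dir: str) -> list:
    """Build a docker run command with no network and explicit mounts only."""
    return [
        "docker", "run", "--rm",
        "--network", "none",  # deny network access by default
        "--mount", f"type=bind,source={input_dir},target=/inputs,readonly",
        "--mount", f"type=bind,source={output_dir},target=/outputs",
        image,
    ]
```

Returning an argument list instead of a shell string avoids shell interpolation of user-controlled values when the command is passed to `subprocess.run`.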
## Deployment Direction
The V1 deployment target is a single public server running containerized application services. The architecture must still allow a later migration to multi-node environments.