# EmboFlow System Architecture

## Architecture Style

EmboFlow V1 is a browser/server platform built as:

- Web frontend
- Modular backend control plane
- Independent worker runtime
- MongoDB as the only database
- Object storage abstraction over cloud object storage or MinIO
- Local scheduler in V1 with a future migration path to Kubernetes and Volcano

The architecture should preserve clear service boundaries even if V1 is implemented as a modular monolith plus workers.

## High-Level Layers

### Frontend Layer

- Asset workspace
- Canvas workspace
- Explore workspace
- Label workspace
- Admin workspace

### Control Plane

- Identity and authorization
- Workspace and project management
- Asset and dataset metadata
- Workflow definition management
- Plugin registry and activation
- Run orchestration API
- Artifact indexing

### Execution Plane

- Workflow DAG compilation
- Task queue dispatch
- Worker execution
- Executor routing
- Log and artifact collection

### Storage Layer

- MongoDB for metadata and run state
- Object storage for files and large outputs
- Temporary local working directories for execution

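One way to picture the object storage abstraction is a small interface that the control plane and workers code against, with the concrete backend (cloud object storage or MinIO) swapped in behind it. The sketch below is illustrative only: `ObjectStore`, `InMemoryObjectStore`, `save_artifact`, and the key layout are hypothetical names chosen for this example, not EmboFlow's actual API.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical storage interface; method names are assumptions."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...

    def exists(self, key: str) -> bool: ...


class InMemoryObjectStore:
    """Stand-in backend for tests; a MinIO or cloud backend would expose the same methods."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def exists(self, key: str) -> bool:
        return key in self._blobs


def save_artifact(store: ObjectStore, run_id: str, name: str, data: bytes) -> str:
    """Write bytes to object storage and return the key that metadata would record."""
    key = f"runs/{run_id}/artifacts/{name}"  # assumed key layout
    store.put(key, data)
    return key


store = InMemoryObjectStore()
artifact_key = save_artifact(store, "run-1", "report.json", b"{}")
print(artifact_key)
```

Under this split, MongoDB would hold only the returned key and related metadata, while the bytes live in object storage.
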
## Core Domain Objects

- User
- Workspace
- Project
- Asset
- Dataset
- DatasetVersion
- WorkflowDefinition
- WorkflowVersion
- WorkflowRun
- RunTask
- Artifact
- AnnotationTask
- Annotation
- Plugin
- StorageConnection

## Raw Asset And Canonical Dataset Model

The platform must distinguish between:

- Raw Asset View
- Canonical Dataset View

Raw assets preserve source structure, file paths, metadata layout, and original naming. Canonical datasets provide a normalized semantic layer for workflow nodes and export logic.

Visualization may read raw assets directly. Conversion, orchestration, and export should primarily target canonical semantics.

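The distinction between the two views can be sketched as a one-way mapping: raw entries keep their original paths untouched, and canonical records carry normalized fields plus a pointer back to the source. Everything in this example is an assumption made for illustration, including the `RawFile` and `CanonicalSample` types, the field names, and the suffix-to-modality table.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class RawFile:
    """One file as it appears in the source asset; path and naming are preserved."""

    path: str


@dataclass(frozen=True)
class CanonicalSample:
    """Normalized record; these field names are hypothetical."""

    sample_id: str
    modality: str
    source_path: str  # pointer back to the raw asset, never a copy of it


# Hypothetical extension-to-modality table used only for this sketch.
MODALITY_BY_SUFFIX = {".png": "image", ".jpg": "image", ".json": "metadata"}


def to_canonical(raw_files: list[RawFile]) -> list[CanonicalSample]:
    """Map raw files to canonical samples without modifying the raw entries."""
    samples = []
    for raw in raw_files:
        path = PurePosixPath(raw.path)
        modality = MODALITY_BY_SUFFIX.get(path.suffix.lower())
        if modality is None:
            continue  # unrecognized files remain visible in the raw view only
        samples.append(CanonicalSample(path.stem, modality, raw.path))
    return samples


raw = [RawFile("ep_001/Cam0/Frame_0001.PNG"), RawFile("ep_001/notes.txt")]
canonical = to_canonical(raw)
print(canonical)
```

A visualization component could read `raw` directly, while workflow nodes and exporters would consume `canonical`.
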
## Workflow Model

Workflow definitions are versioned and contain:

- Visual graph state
- Logical node and edge graph
- Runtime configuration
- Plugin references

Workflow execution produces immutable workflow runs. A run snapshots:

- Workflow version
- Bound asset references
- Node configuration
- Injected code
- Executor settings
- Input bindings

Runs compile into task DAGs.

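As a minimal sketch of the snapshot-then-compile idea, the example below freezes a small subset of the snapshot fields and derives a task order with the standard library's `graphlib.TopologicalSorter`. The `RunSnapshot` type, its field names, and `compile_task_order` are hypothetical; a real compiler would also carry asset references, injected code, executor settings, and input bindings.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class RunSnapshot:
    """Frozen copy of what a run needs; field names are hypothetical."""

    workflow_version: str
    nodes: tuple[str, ...]
    edges: tuple[tuple[str, str], ...]  # (upstream, downstream) pairs
    node_config: Mapping[str, Any]


def compile_task_order(snapshot: RunSnapshot) -> list[str]:
    """Return node ids in an order that respects every edge.

    graphlib raises CycleError if the graph is not a DAG.
    """
    sorter = TopologicalSorter({node: set() for node in snapshot.nodes})
    for upstream, downstream in snapshot.edges:
        sorter.add(downstream, upstream)
    return list(sorter.static_order())


snapshot = RunSnapshot(
    workflow_version="v1",
    nodes=("source", "transform", "export"),
    edges=(("source", "transform"), ("transform", "export")),
    # Read-only view at the top level; nested values are not deep-frozen here.
    node_config=MappingProxyType({"transform": {"scale": 2}}),
)
task_order = compile_task_order(snapshot)
print(task_order)
```

Because the dataclass is frozen, later edits to the workflow definition cannot change what an existing run recorded, which is the property the snapshot is meant to provide.
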
## Node And Plugin Model

### Node Categories

- Source
- Transform
- Inspect
- Annotate
- Export
- Utility

### Node Definition Contract

Each node definition includes:

- Metadata
- Input schema
- Output schema
- Config schema
- UI schema
- Executor type
- Runtime limits
- Optional code hook contract

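The contract above could be represented as a single typed record. The following is a sketch under assumptions: the `NodeDefinition` and `RuntimeLimits` types, the default limit values, the JSON-Schema-like `required` key, and the `missing_config_keys` helper are all invented for illustration and are not taken from the EmboFlow codebase.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class RuntimeLimits:
    timeout_seconds: int = 300  # placeholder default, not a documented value
    memory_mb: int = 1024  # placeholder default, not a documented value


@dataclass(frozen=True)
class NodeDefinition:
    """One field per item in the node definition contract."""

    metadata: dict[str, Any]
    input_schema: dict[str, Any]
    output_schema: dict[str, Any]
    config_schema: dict[str, Any]
    ui_schema: dict[str, Any]
    executor_type: str  # for example "python", "docker", or "http"
    runtime_limits: RuntimeLimits = field(default_factory=RuntimeLimits)
    code_hook: Optional[dict[str, Any]] = None  # present only if the node accepts injected code


def missing_config_keys(definition: NodeDefinition, config: dict[str, Any]) -> list[str]:
    """List keys the config schema marks as required but the config lacks."""
    required = definition.config_schema.get("required", [])
    return [key for key in required if key not in config]


resize_node = NodeDefinition(
    metadata={"id": "resize-images", "category": "Transform"},
    input_schema={"type": "dataset"},
    output_schema={"type": "dataset"},
    config_schema={"required": ["width", "height"]},
    ui_schema={"width": {"widget": "number"}, "height": {"widget": "number"}},
    executor_type="python",
)
missing = missing_config_keys(resize_node, {"width": 640})
print(missing)
```

Keeping the schemas as plain data lets the frontend render configuration forms and the control plane validate bindings from the same definition.
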
### Plugin Types

- Node plugins
- Reader/writer plugins
- Renderer plugins
- Executor plugins
- Integration plugins

## Execution Architecture

### Executors

- Python executor
- Docker executor
- HTTP executor

V1 should prioritize the Python and Docker executors. The HTTP executor is useful for integrating external services.

### Schedulers

- Local scheduler in V1
- Kubernetes scheduler later
- Volcano scheduler later

Executors and schedulers are separate abstractions:

- Executor defines how logic runs
- Scheduler defines where and under what scheduling policy it runs

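The separation can be sketched as two independent interfaces, so that replacing the local scheduler later does not require touching any executor. The names `Executor`, `Scheduler`, `PythonExecutor`, and `LocalScheduler`, and their method signatures, are assumptions made for this example.

```python
from typing import Callable, Protocol


class Executor(Protocol):
    """How the logic runs (in-process Python, a container, an HTTP call)."""

    def run(self, payload: dict) -> dict: ...


class Scheduler(Protocol):
    """Where and under what policy an executor invocation is placed."""

    def submit(self, executor: Executor, payload: dict) -> dict: ...


class PythonExecutor:
    """Runs a trusted callable in the worker process."""

    def __init__(self, fn: Callable[[dict], dict]) -> None:
        self._fn = fn

    def run(self, payload: dict) -> dict:
        return self._fn(payload)


class LocalScheduler:
    """V1-style placement: run immediately on the current machine."""

    def submit(self, executor: Executor, payload: dict) -> dict:
        return executor.run(payload)


count_items = PythonExecutor(lambda payload: {"count": len(payload["items"])})
result = LocalScheduler().submit(count_items, {"items": ["a", "b", "c"]})
print(result)
```

A Kubernetes or Volcano scheduler would implement the same `submit` method and decide placement differently, while `PythonExecutor` stays unchanged.
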
## Storage Architecture

### MongoDB Collections

Recommended primary collections:

- users
- workspaces
- projects
- memberships
- assets
- asset_probe_reports
- datasets
- dataset_versions
- workflow_definitions
- workflow_definition_versions
- workflow_runs
- run_tasks
- artifacts
- annotation_tasks
- annotations
- plugins
- storage_connections
- audit_logs

### Object Storage Content

- Raw uploads
- Imported archives
- Normalized export packages
- Training config packages
- Preview resources
- Logs and attachments
- Large manifests and file indexes

## Security Model

User-injected code is low-trust code and must not run in web or API processes.

V1 runtime policy:

- Built-in trusted nodes may use the Python executor
- Plugin code should run in controlled runtimes
- User-injected code should default to the Docker executor
- Network access should be denied by default for user code
- Input and output paths should be explicitly mounted

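The runtime policy above can be read as a routing function from code origin to execution settings. The sketch below is one possible encoding, not the project's implementation: `CodeOrigin`, `RuntimePolicy`, and `policy_for` are hypothetical, the `ro`/`rw` mount modes are an assumption, and so are two choices the policy list leaves open (treating plugin code like user code, and allowing network access for built-in nodes).

```python
from dataclasses import dataclass
from enum import Enum


class CodeOrigin(Enum):
    BUILTIN = "builtin"
    PLUGIN = "plugin"
    USER = "user"


@dataclass(frozen=True)
class RuntimePolicy:
    executor: str
    network_enabled: bool
    mounts: tuple[tuple[str, str], ...]  # (path, mode) pairs; modes are an assumption


def policy_for(origin: CodeOrigin, input_path: str, output_path: str) -> RuntimePolicy:
    """Pick executor and isolation settings from where the code came from."""
    # Only the explicitly listed paths are mounted: inputs read-only, outputs writable.
    mounts = ((input_path, "ro"), (output_path, "rw"))
    if origin is CodeOrigin.BUILTIN:
        # Assumption: trusted built-in nodes run in-process and may use the network.
        return RuntimePolicy("python", network_enabled=True, mounts=mounts)
    # Assumption: plugin code is isolated the same way as user-injected code.
    return RuntimePolicy("docker", network_enabled=False, mounts=mounts)


user_policy = policy_for(CodeOrigin.USER, "/work/in", "/work/out")
builtin_policy = policy_for(CodeOrigin.BUILTIN, "/work/in", "/work/out")
print(user_policy)
```

Centralizing the decision in one function keeps the low-trust default in a single place that the executor routing step can call.
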
## Deployment Direction

V1 deployment target is a single public server using containerized application services. The architecture must still preserve future migration to multi-node environments.

## Current Runtime Implementation Notes

The current repository runtime now includes:

- a real HTTP API process backed by MongoDB
- a React and Vite web application that reads those APIs
- a local-path asset registration flow for development and dataset inspection
- a worker process that polls Mongo-backed `run_tasks`, creates task artifacts, and refreshes run status

The repository still keeps some in-memory module tests for contract stability, but the executable local stack now runs through Mongo-backed runtime services and adds HTTP integration coverage against a real Mongo runtime.

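The worker behavior described above (poll `run_tasks`, create artifacts, refresh run status) can be illustrated with a loop over an in-memory list standing in for the Mongo collection. This is a sketch only: the document shapes, the status values (`queued`, `running`, `succeeded`, `failed`), and the function names are assumptions, and a real worker would claim tasks atomically in the database rather than by mutating a list.

```python
from typing import Optional


def claim_next_task(tasks: list[dict]) -> Optional[dict]:
    """Mark the first queued task as running and return it, or None if idle."""
    for task in tasks:
        if task["status"] == "queued":
            task["status"] = "running"
            return task
    return None


def refresh_run_status(tasks: list[dict], run_id: str) -> str:
    """Derive a run-level status from the statuses of its tasks."""
    statuses = {t["status"] for t in tasks if t["run_id"] == run_id}
    if "failed" in statuses:
        return "failed"
    if statuses == {"succeeded"}:
        return "succeeded"
    if statuses == {"queued"}:
        return "queued"
    return "running"


def poll_once(tasks: list[dict], artifacts: list[dict]) -> bool:
    """Process at most one task; return False when nothing is queued."""
    task = claim_next_task(tasks)
    if task is None:
        return False
    # Stand-in for real execution: record one artifact per finished task.
    artifacts.append({"task_id": task["_id"], "run_id": task["run_id"], "kind": "log"})
    task["status"] = "succeeded"
    return True


run_tasks = [
    {"_id": "t1", "run_id": "r1", "status": "queued"},
    {"_id": "t2", "run_id": "r1", "status": "queued"},
]
artifacts: list[dict] = []
status_before = refresh_run_status(run_tasks, "r1")
while poll_once(run_tasks, artifacts):
    pass
status_after = refresh_run_status(run_tasks, "r1")
print(status_before, status_after, len(artifacts))
```
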