EmboFlow System Architecture

Architecture Style

EmboFlow V1 is a browser/server platform built as:

  • Web frontend
  • Modular backend control plane
  • Independent worker runtime
  • MongoDB as the only database
  • Object storage abstraction over cloud object storage or MinIO
  • Local scheduler in V1, with a future migration path to Kubernetes and Volcano

The architecture should preserve clear service boundaries even if V1 is implemented as a modular monolith plus workers.

High-Level Layers

Frontend Layer

  • Asset workspace
  • Canvas workspace
  • Explore workspace
  • Label workspace
  • Admin workspace

Control Plane

  • Identity and authorization
  • Workspace and project management
  • Asset and dataset metadata
  • Workflow definition management
  • Plugin registry and activation
  • Run orchestration API
  • Artifact indexing

Execution Plane

  • Workflow DAG compilation
  • Task queue dispatch
  • Worker execution
  • Executor routing
  • Log and artifact collection

Storage Layer

  • MongoDB for metadata and run state
  • Object storage for files and large outputs
  • Temporary local working directories for execution

Core Domain Objects

  • User
  • Workspace
  • Project
  • Asset
  • Dataset
  • DatasetVersion
  • WorkflowDefinition
  • WorkflowVersion
  • WorkflowRun
  • RunTask
  • Artifact
  • AnnotationTask
  • Annotation
  • Plugin
  • StorageConnection

Raw Asset And Canonical Dataset Model

The platform must distinguish between:

  • Raw Asset View
  • Canonical Dataset View

Raw assets preserve source structure, file paths, metadata layout, and original naming. Canonical datasets provide a normalized semantic layer for workflow nodes and export logic.

Visualization may read raw assets directly. Conversion, orchestration, and export should primarily target canonical semantics.
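
The split between the two views can be sketched as two record types and a mapping between them. This is only an illustration: the class names, the `sample_id` and `stream` fields, and the `<sample>/<stream>.<ext>` path convention are hypothetical and not part of the EmboFlow design.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class RawAssetFile:
    """Raw asset view: the source path and naming are kept untouched."""
    source_path: str
    size_bytes: int


@dataclass(frozen=True)
class CanonicalRecord:
    """Canonical dataset view: normalized fields for nodes and exporters."""
    sample_id: str   # hypothetical field
    stream: str      # hypothetical field
    raw_path: str    # back-reference to the raw asset file


def to_canonical(raw: RawAssetFile) -> CanonicalRecord:
    # Assumed layout for this sketch only: ".../<sample>/<stream>.<ext>".
    path = PurePosixPath(raw.source_path)
    return CanonicalRecord(
        sample_id=path.parent.name,
        stream=path.stem,
        raw_path=raw.source_path,
    )


raw = RawAssetFile("upload_01/ep_0007/camera_front.mp4", 1024)
record = to_canonical(raw)
```

A renderer could keep reading `raw.source_path` directly, while workflow nodes and exporters would only see `CanonicalRecord` values.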

Workflow Model

Workflow definitions are versioned and contain:

  • Visual graph state
  • Logical node and edge graph
  • Runtime configuration
  • Plugin references

Workflow execution produces immutable workflow runs. A run snapshots:

  • Workflow version
  • Node configuration
  • Injected code
  • Executor settings
  • Input bindings

Runs compile into task DAGs.
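
One way to picture the snapshot and compilation step is a frozen record plus a topological sort. The sketch below uses the standard-library `graphlib`; `RunSnapshot`, its fields, and `compile_task_order` are hypothetical names, and the real compiler may emit richer task objects than a list of node ids.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from types import MappingProxyType


@dataclass(frozen=True)
class RunSnapshot:
    """Hypothetical immutable snapshot taken when a run is created."""
    workflow_version: str
    node_config: MappingProxyType  # node id -> config (read-only view)
    edges: tuple                   # (upstream id, downstream id) pairs


def compile_task_order(snapshot: RunSnapshot) -> list:
    """Return node ids in an order that respects every edge.

    graphlib raises CycleError (a ValueError) if the graph is not a DAG.
    """
    sorter = TopologicalSorter()
    for node_id in snapshot.node_config:
        sorter.add(node_id)
    for upstream, downstream in snapshot.edges:
        # The downstream node depends on the upstream node.
        sorter.add(downstream, upstream)
    return list(sorter.static_order())


snapshot = RunSnapshot(
    workflow_version="v3",
    node_config=MappingProxyType({"source": {}, "transform": {}, "export": {}}),
    edges=(("source", "transform"), ("transform", "export")),
)
order = compile_task_order(snapshot)
```

Because the snapshot is frozen and its config mapping is a read-only view, later edits to the workflow definition cannot change what an existing run executes.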

Node And Plugin Model

Node Categories

  • Source
  • Transform
  • Inspect
  • Annotate
  • Export
  • Utility

Node Definition Contract

Each node definition includes:

  • Metadata
  • Input schema
  • Output schema
  • Config schema
  • UI schema
  • Executor type
  • Runtime limits
  • Optional code hook contract
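
The contract above could be modeled as a single record type. The following is a minimal sketch, assuming schemas are plain dictionaries (for example JSON Schema fragments) and that the executor type is one of the three executors named in this document; the field names, the default limits, and the `RuntimeLimits` type are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

# Executor names taken from the "Executors" list; the string values are assumed.
KNOWN_EXECUTORS = {"python", "docker", "http"}


@dataclass(frozen=True)
class RuntimeLimits:
    timeout_seconds: int = 300  # illustrative default
    memory_mb: int = 1024       # illustrative default


@dataclass(frozen=True)
class NodeDefinition:
    metadata: dict
    input_schema: dict
    output_schema: dict
    config_schema: dict
    ui_schema: dict
    executor_type: str
    runtime_limits: RuntimeLimits = field(default_factory=RuntimeLimits)
    code_hook_contract: Optional[dict] = None  # only for nodes that accept code

    def __post_init__(self) -> None:
        if self.executor_type not in KNOWN_EXECUTORS:
            raise ValueError(f"unknown executor type: {self.executor_type!r}")


node = NodeDefinition(
    metadata={"name": "resize-images", "category": "Transform"},
    input_schema={"type": "object"},
    output_schema={"type": "object"},
    config_schema={"type": "object"},
    ui_schema={},
    executor_type="docker",
)
```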

Plugin Types

  • Node plugins
  • Reader/writer plugins
  • Renderer plugins
  • Executor plugins
  • Integration plugins

Execution Architecture

Executors

  • Python executor
  • Docker executor
  • HTTP executor

V1 should prioritize the Python and Docker executors. The HTTP executor is useful for integrating external services.

Schedulers

  • Local scheduler in V1
  • Kubernetes scheduler later
  • Volcano scheduler later

Executors and schedulers are separate abstractions:

  • Executor defines how logic runs
  • Scheduler defines where and under what scheduling policy it runs
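
The separation can be expressed as two independent interfaces. This is a sketch under assumptions: the method names `run` and `submit`, the dictionary task shape, and the `EchoExecutor` and `LocalScheduler` classes are illustrative, not the project's actual API.

```python
from typing import Protocol


class Executor(Protocol):
    """How task logic runs (Python, Docker, HTTP, ...)."""

    def run(self, task: dict) -> dict: ...


class Scheduler(Protocol):
    """Where a task runs and under which scheduling policy."""

    def submit(self, task: dict, executor: Executor) -> dict: ...


class EchoExecutor:
    """Stand-in executor that reports success without doing real work."""

    def run(self, task: dict) -> dict:
        return {"task_id": task["id"], "status": "succeeded"}


class LocalScheduler:
    """Runs the task immediately in the current process."""

    def submit(self, task: dict, executor: Executor) -> dict:
        return executor.run(task)


result = LocalScheduler().submit({"id": "task-1"}, EchoExecutor())
```

With this shape, a later Kubernetes or Volcano scheduler would only need to provide its own `submit`, and the executors would stay unchanged.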

Storage Architecture

MongoDB Collections

Recommended primary collections:

  • users
  • workspaces
  • projects
  • memberships
  • assets
  • asset_probe_reports
  • datasets
  • dataset_versions
  • workflow_definitions
  • workflow_definition_versions
  • workflow_runs
  • run_tasks
  • artifacts
  • annotation_tasks
  • annotations
  • plugins
  • storage_connections
  • audit_logs

Object Storage Content

  • Raw uploads
  • Imported archives
  • Normalized export packages
  • Training config packages
  • Preview resources
  • Logs and attachments
  • Large manifests and file indexes

Security Model

User-injected code is low-trust code and must not run in web or API processes.

V1 runtime policy:

  • Built-in trusted nodes may use Python executor
  • Plugin code should run in controlled runtimes
  • User-injected code should default to Docker executor
  • Network access should be denied by default for user code
  • Input and output paths should be explicitly mounted
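
The policy can be sketched as a default-executor rule plus a Docker command line that denies network access and mounts only the declared paths. The names `CodeOrigin`, `default_executor`, and `docker_run_args`, the container paths `/inputs` and `/outputs`, and the choice of Docker as the controlled runtime for plugin code are assumptions made for this example. The Docker flags used (`--rm`, `--network none`, `--volume`) are standard `docker run` options.

```python
from enum import Enum


class CodeOrigin(Enum):
    BUILTIN = "builtin"  # trusted nodes shipped with the platform
    PLUGIN = "plugin"
    USER = "user"        # user-injected, low-trust code


def default_executor(origin: CodeOrigin) -> str:
    # Assumption: plugin code also defaults to Docker as its controlled runtime.
    return "python" if origin is CodeOrigin.BUILTIN else "docker"


def docker_run_args(image: str, input_dir: str, output_dir: str,
                    allow_network: bool = False) -> list:
    """Build an argv list for `docker run`; nothing is executed here."""
    args = ["docker", "run", "--rm"]
    if not allow_network:
        args += ["--network", "none"]  # deny network access by default
    args += [
        "--volume", f"{input_dir}:/inputs:ro",    # inputs mounted read-only
        "--volume", f"{output_dir}:/outputs:rw",  # outputs mounted writable
        image,
    ]
    return args


args = docker_run_args("example/user-node:latest", "/tmp/run-1/in", "/tmp/run-1/out")
```

The resulting list can be passed to `subprocess.run` by a worker process, which keeps user code out of the web and API processes.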

Deployment Direction

The V1 deployment target is a single public server running containerized application services. The architecture must still allow future migration to multi-node environments.

Current Runtime Implementation Notes

The repository runtime currently includes:

  • a real HTTP API process backed by MongoDB
  • a React and Vite web application that reads those APIs
  • a local-path asset registration flow for development and dataset inspection

The repository still keeps some in-memory module tests for contract stability. The executable local stack, however, now runs through Mongo-backed runtime services, and HTTP integration tests exercise it against a real Mongo runtime.