11 KiB
EmboFlow MongoDB Data Model
Goal
Define the MongoDB-only persistence model for EmboFlow V1.
The database must support:
- user and workspace isolation
- raw asset tracking
- canonical dataset versions
- workflow versioning
- workflow execution history
- plugin registration
- auditability
Storage Principles
- MongoDB stores metadata and execution state
- Object storage stores large binary files and large derived bundles
- MongoDB documents should have clear aggregate boundaries
- Large, fast-growing arrays should be split into separate collections
- Platform contracts should use references, not embedded file blobs
Current V1 Implementation Notes
The first code pass stabilized these collection boundaries with in-memory services. The executable local runtime now persists the core objects below into MongoDB.
This means the implementation now validates:
- document shapes
- controller and service boundaries
- workflow/run/task separation
- artifact lookup by producer
- asset persistence and probe reports through Mongo-backed collections
while still targeting the collection model below as the persistent shape.
Primary Collections
usersworkspacesprojectsmembershipsassetsasset_probe_reportsdatasetsdataset_versionsworkflow_definitionsworkflow_definition_versionsworkflow_runsrun_tasksartifactsannotation_tasksannotationspluginsstorage_connectionscustom_nodesaudit_logs
Collection Design
users
Purpose:
- account identity
- profile
- login metadata
Core fields:
_idemaildisplayNameavatarUrlstatuslastLoginAtcreatedAtupdatedAt
workspaces
Purpose:
- resource ownership boundary
Core fields:
_idtypeaspersonalorteamnameslugownerIdstatussettingscreatedAtupdatedAt
memberships
Purpose:
- workspace and project role mapping
Core fields:
_idworkspaceIdprojectIdoptionaluserIdrolestatuscreatedAtupdatedAt
This collection should stay independent instead of embedding large member arrays on every resource.
projects
Purpose:
- project-scoped grouping for assets, workflows, runs, and outputs
Core fields:
_idworkspaceIdnameslugdescriptionstatuscreatedBycreatedAtupdatedAt
assets
Purpose:
- represent raw uploaded or imported inputs
Supported asset types:
raw_filearchivefoldervideo_collectionstandard_datasetrosbaghdf5_datasetobject_storage_prefix
Core fields:
_idworkspaceIdprojectIdtypesourceTypedisplayNamestatusstorageRefsizeBytesfileCounttopLevelPathsdetectedFormatssummarycreatedBycreatedAtupdatedAt
Do not embed full large file listings in this document.
asset_probe_reports
Purpose:
- retain richer structure-detection and validation output
Core fields:
_idassetIdreportVersiondetectedFormatCandidatesstructureSummarywarningsrecommendedNextNodesrawReportcreatedAt
datasets
Purpose:
- represent logical dataset identity
Core fields:
_idworkspaceIdprojectIdnametypestatuslatestVersionIdsummarycreatedBycreatedAtupdatedAt
custom_nodes
Purpose:
- store project-scoped custom container node definitions
Core fields:
_iddefinitionIdworkspaceIdprojectIdnameslugdescriptioncategorystatuscontractsourcecreatedBycreatedAtupdatedAt
The current V1 implementation stores the custom node source as either:
- an existing Docker image reference
- a self-contained Dockerfile body plus an image tag
The node contract is persisted with the node definition so the API can expose correct node metadata to the editor and the worker can validate runtime outputs.
dataset_versions
Purpose:
- represent immutable dataset snapshots
Core fields:
_iddatasetIdworkspaceIdprojectIdsourceAssetIdparentVersionIdversionTagcanonicalSchemaVersionmanifestRefstatssummarystatuscreatedBycreatedAt
This collection is separated because versions will grow over time.
workflow_definitions
Purpose:
- represent logical workflow identity
Core fields:
_idworkspaceIdprojectIdnameslugstatuslatestVersionNumberpublishedVersionNumbercreatedBycreatedAtupdatedAt
workflow_definition_versions
Purpose:
- represent immutable workflow snapshots
Core fields:
_idworkflowDefinitionIdworkspaceIdprojectIdversionNumbervisualGraphlogicGraphruntimeGraphpluginRefssummarycreatedBycreatedAt
Splitting versions from workflow head metadata avoids oversized documents and simplifies history queries.
workflow_runs
Purpose:
- store execution runs
- snapshot the asset bindings chosen at run creation time
- support project-scoped run history queries without re-reading workflow versions
Core fields:
_idworkflowDefinitionIdworkflowVersionIdassetIdsworkspaceIdprojectIdtriggeredBystatusruntimeSnapshotsummarystartedAtfinishedAtdurationMscreatedAt
run_tasks
Purpose:
- store one execution unit per node per run
- keep bound asset context available to the worker at dequeue time
Core fields:
_idworkflowRunIdworkflowVersionIdnodeIdnodeTypenodeDefinitionIdexecutorTypeexecutorConfigcodeHookSpecartifactTypeartifactTitlestatusattemptassetIdsupstreamNodeIdsoutputArtifactIdslogRefcacheKeycacheHitlogLinesstdoutLinesstderrLineserrorMessagesummarylastResultPreviewstartedAtfinishedAtdurationMscreatedAt
This collection should remain separate from workflow_runs because task volume grows quickly.
The current executable worker path expects run_tasks to be self-sufficient enough for dequeue and dependency promotion. That means V1 runtime tasks already persist:
- executor choice
- node definition id and frozen per-node runtime config
- bound asset ids at run creation time, then the effective asset ids that were actually executed after any upstream set-operation narrowing
- upstream node dependencies
- produced artifact ids
- per-task status and error message
- task log lines, stdout/stderr streams, and result preview
- structured task summaries with executor, outcome, asset count, artifact ids, and stdout/stderr counters
The current runtime also aggregates task execution back onto workflow_runs, so run documents now carry:
- a frozen
runtimeSnapshotcopied from the workflow version runtime layer at run creation time - task counts by status
- completed task count
- artifact count
- total stdout/stderr line counts
- failed task ids
- derived run duration
The current runtime control loop also mutates these collections in place for retry/cancel operations:
- cancelling a run marks queued and pending
run_tasksascancelled - retrying a run creates a new
workflow_runsdocument plus a fresh set ofrun_tasks - retrying a task resets the target node and downstream subtree on the existing run, clears task execution fields, and increments the retried task attempt count
artifacts
Purpose:
- store managed outputs and previews
Artifact types may include:
- preview bundle
- quality report
- normalized dataset package
- delivery package
- training config package
- intermediate task output
Core fields:
_idworkspaceIdprojectIdtypeproducerTypeproducerIdstorageRefpreviewablesummarylineagecreatedBycreatedAt
workflow_runs and run_tasks input binding note
The current V1 runtime now stores workflow input selection in three layers:
inputBindingsThe explicit operator-facing selection such as[{ kind: "dataset", id: "dataset-..." }]assetIdsThe resolved runnable asset ids after dataset expansion and deduplicationdatasetIdsThe explicit dataset ids that participated in the run or task
This keeps execution backward-compatible for asset-oriented nodes while preserving the higher-level project data model in run history and task detail.
annotation_tasks
Purpose:
- track assignment and state of manual labeling work
Core fields:
_idworkspaceIdprojectIdtargetTypetargetReflabelTypestatusassigneeIdsreviewerIdscreatedBycreatedAtupdatedAt
annotations
Purpose:
- persist annotation outputs
Core fields:
_idannotationTaskIdworkspaceIdprojectIdtargetRefpayloadstatuscreatedBycreatedAtupdatedAt
plugins
Purpose:
- track installable and enabled plugin versions
Core fields:
_idworkspaceIdoptional for workspace-scoped pluginsscopeasplatformorworkspacenamestatuscurrentVersionversionspermissionsmetadatacreatedAtupdatedAt
If plugin version payloads become large, split versions into a separate collection later. V1 can keep them nested if bounded.
storage_connections
Purpose:
- store object storage and path registration configuration
Core fields:
_idworkspaceIdtypeprovidernamestatusconfigsecretRefcreatedBycreatedAtupdatedAt
Store secrets outside plaintext document fields where possible.
audit_logs
Purpose:
- append-only history of sensitive actions
Core fields:
_idworkspaceIdprojectIdactorIdresourceTyperesourceIdactionbeforeSummaryafterSummarymetadatacreatedAt
Reference Strategy
Use stable ids between collections.
References should be explicit:
- asset to probe report
- dataset to dataset versions
- workflow definition to workflow versions
- workflow run to run tasks
- task to artifact
- annotation task to annotations
Do not depend on implicit path-based linkage.
Index Recommendations
Always index
workspaceIdprojectIdstatuscreatedAt
Important compound indexes
memberships.workspaceId + memberships.userIdprojects.workspaceId + projects.slugassets.projectId + assets.type + assets.createdAtdatasets.projectId + datasets.namedataset_versions.datasetId + dataset_versions.createdAtworkflow_definitions.projectId + workflow_definitions.slugworkflow_definition_versions.workflowDefinitionId + versionNumberworkflow_runs.projectId + createdAtworkflow_runs.workflowDefinitionId + statusrun_tasks.workflowRunId + nodeIdartifacts.producerType + producerIdannotation_tasks.projectId + statusaudit_logs.workspaceId + createdAt
Object Storage References
MongoDB should store references such as:
- bucket
- key
- uri
- checksum
- content type
- size
It should not store:
- large binary file payloads
- full raw video content
- giant archive contents
V1 Constraints
- MongoDB is the only database
- No relational sidecar is assumed
- No GridFS-first strategy is assumed
- Large manifests may live in object storage and be referenced from MongoDB
V1 Non-Goals
The V1 model does not need:
- cross-region data distribution
- advanced event sourcing
- fully normalized analytics warehouse modeling
- high-volume search indexing inside MongoDB itself