EmboFlow MongoDB Data Model
Goal
Define the MongoDB-only persistence model for EmboFlow V1.
The database must support:
- user and workspace isolation
- raw asset tracking
- canonical dataset versions
- workflow versioning
- workflow execution history
- plugin registration
- auditability
Storage Principles
- MongoDB stores metadata and execution state
- Object storage stores large binary files and large derived bundles
- MongoDB documents should have clear aggregate boundaries
- Large, fast-growing arrays should be split into separate collections
- Platform contracts should use references, not embedded file blobs
Primary Collections
- users
- workspaces
- projects
- memberships
- assets
- asset_probe_reports
- datasets
- dataset_versions
- workflow_definitions
- workflow_definition_versions
- workflow_runs
- run_tasks
- artifacts
- annotation_tasks
- annotations
- plugins
- storage_connections
- audit_logs
Collection Design
users
Purpose:
- account identity
- profile
- login metadata
Core fields:
- _id
- email
- displayName
- avatarUrl
- status
- lastLoginAt
- createdAt
- updatedAt
workspaces
Purpose:
- resource ownership boundary
Core fields:
- _id
- type as personal or team
- name
- slug
- ownerId
- status
- settings
- createdAt
- updatedAt
memberships
Purpose:
- workspace and project role mapping
Core fields:
- _id
- workspaceId
- projectId optional
- userId
- role
- status
- createdAt
- updatedAt
This collection should stay independent instead of embedding large member arrays on every resource.
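One way to read the optional projectId is that a row without it applies to the whole workspace, while a row with it applies to a single project. The sketch below resolves a user's role under that reading. The precedence rule (a project-scoped row wins over the workspace-wide row), the role names, and the status values are assumptions for illustration, not part of the model above.

```python
from typing import Optional

# Hypothetical membership documents; ids are plain strings for illustration.
memberships = [
    {"_id": "m1", "workspaceId": "w1", "userId": "u1",
     "role": "viewer", "status": "active"},
    {"_id": "m2", "workspaceId": "w1", "projectId": "p1", "userId": "u1",
     "role": "editor", "status": "active"},
    {"_id": "m3", "workspaceId": "w1", "userId": "u2",
     "role": "admin", "status": "revoked"},
]


def effective_role(docs, workspace_id, project_id, user_id) -> Optional[str]:
    """Return the user's role for a project, or None without an active row.

    Assumption: a project-scoped row overrides the workspace-wide row.
    """
    active = [
        d for d in docs
        if d["workspaceId"] == workspace_id
        and d["userId"] == user_id
        and d["status"] == "active"
    ]
    # First look for a row scoped to this exact project.
    for d in active:
        if d.get("projectId") == project_id:
            return d["role"]
    # Fall back to the workspace-wide row (no projectId).
    for d in active:
        if d.get("projectId") is None:
            return d["role"]
    return None
```

Because every membership is its own document, this lookup touches only the rows for one user in one workspace, regardless of how many members the workspace has.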
projects
Purpose:
- project-scoped grouping for assets, workflows, runs, and outputs
Core fields:
- _id
- workspaceId
- name
- slug
- description
- status
- createdBy
- createdAt
- updatedAt
assets
Purpose:
- represent raw uploaded or imported inputs
Supported asset types:
- raw_file
- archive
- folder
- video_collection
- standard_dataset
- rosbag
- hdf5_dataset
- object_storage_prefix
Core fields:
- _id
- workspaceId
- projectId
- type
- sourceType
- displayName
- status
- storageRef
- sizeBytes
- fileCount
- topLevelPaths
- detectedFormats
- summary
- createdBy
- createdAt
- updatedAt
Do not embed full large file listings in this document.
asset_probe_reports
Purpose:
- retain richer structure-detection and validation output
Core fields:
- _id
- assetId
- reportVersion
- detectedFormatCandidates
- structureSummary
- warnings
- recommendedNextNodes
- rawReport
- createdAt
datasets
Purpose:
- represent logical dataset identity
Core fields:
- _id
- workspaceId
- projectId
- name
- type
- status
- latestVersionId
- summary
- createdBy
- createdAt
- updatedAt
dataset_versions
Purpose:
- represent immutable dataset snapshots
Core fields:
- _id
- datasetId
- workspaceId
- projectId
- sourceAssetId
- parentVersionId
- versionTag
- canonicalSchemaVersion
- manifestRef
- stats
- summary
- status
- createdBy
- createdAt
This collection is separated because versions will grow over time.
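A minimal sketch of how a new snapshot might be recorded without touching earlier ones: the new dataset_versions document points back at the previous head through parentVersionId, and only the datasets head document is updated. The function name, the UUID string ids, and the "ready" status value are assumptions; `$set` is the standard MongoDB update operator.

```python
from datetime import datetime, timezone
from uuid import uuid4


def new_dataset_version(dataset, source_asset_id, manifest_ref,
                        version_tag, created_by):
    """Build a dataset_versions document plus the update for the datasets head.

    Nothing is written here; the caller would insert `version` and apply
    `head_update` to the matching datasets document.
    """
    now = datetime.now(timezone.utc)
    version = {
        "_id": str(uuid4()),                       # assumed id scheme
        "datasetId": dataset["_id"],
        "workspaceId": dataset["workspaceId"],
        "projectId": dataset["projectId"],
        "sourceAssetId": source_asset_id,
        # The previous head becomes the parent; None for the first version.
        "parentVersionId": dataset.get("latestVersionId"),
        "versionTag": version_tag,
        "manifestRef": manifest_ref,
        "status": "ready",                         # hypothetical status value
        "createdBy": created_by,
        "createdAt": now,
    }
    head_update = {
        "$set": {"latestVersionId": version["_id"], "updatedAt": now}
    }
    return version, head_update
```

The version document carries no updatedAt field, matching the field list above: once inserted, a snapshot is not expected to change.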
workflow_definitions
Purpose:
- represent logical workflow identity
Core fields:
- _id
- workspaceId
- projectId
- name
- slug
- status
- latestVersionNumber
- publishedVersionNumber
- createdBy
- createdAt
- updatedAt
workflow_definition_versions
Purpose:
- represent immutable workflow snapshots
Core fields:
- _id
- workflowDefinitionId
- workspaceId
- projectId
- versionNumber
- visualGraph
- logicGraph
- runtimeGraph
- pluginRefs
- summary
- createdBy
- createdAt
Splitting versions from workflow head metadata avoids oversized documents and simplifies history queries.
workflow_runs
Purpose:
- store execution runs
Core fields:
- _id
- workflowDefinitionId
- workflowVersionId
- workspaceId
- projectId
- triggeredBy
- status
- runtimeSnapshot
- summary
- startedAt
- finishedAt
- createdAt
run_tasks
Purpose:
- store one execution unit per node per run
Core fields:
- _id
- workflowRunId
- workflowVersionId
- nodeId
- nodeType
- status
- attempt
- executor
- scheduler
- inputRefs
- outputRefs
- logRef
- cacheKey
- cacheHit
- errorSummary
- startedAt
- finishedAt
- createdAt
This collection should remain separate from workflow_runs because task volume grows quickly.
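To illustrate "one execution unit per node per run", the sketch below fans a run out into one run_tasks document per node. The node list shape (`id`, `type`), the node type names, the "pending" status, and the starting attempt number are all assumptions; the source does not define the structure of runtimeSnapshot. `_id` is left out on the assumption that MongoDB assigns it on insert.

```python
def build_run_tasks(run, nodes):
    """Return one run_tasks document per node for a single workflow run."""
    return [
        {
            "workflowRunId": run["_id"],
            "workflowVersionId": run["workflowVersionId"],
            "nodeId": node["id"],
            "nodeType": node["type"],
            "status": "pending",     # hypothetical initial status
            "attempt": 1,            # assumed to start at 1
            "inputRefs": [],
            "outputRefs": [],
            "cacheHit": False,
        }
        for node in nodes
    ]


# Hypothetical run and node list for illustration.
run = {"_id": "r1", "workflowVersionId": "v3"}
nodes = [
    {"id": "probe", "type": "asset_probe"},
    {"id": "normalize", "type": "dataset_normalize"},
]
tasks = build_run_tasks(run, nodes)
```

A run over a graph with N nodes produces N task documents, so embedding them in workflow_runs would make each run document grow with graph size and retries; keeping them separate leaves the run document small.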
artifacts
Purpose:
- store managed outputs and previews
Artifact types may include:
- preview bundle
- quality report
- normalized dataset package
- delivery package
- training config package
- intermediate task output
Core fields:
- _id
- workspaceId
- projectId
- type
- producerType
- producerId
- storageRef
- previewable
- summary
- lineage
- createdBy
- createdAt
annotation_tasks
Purpose:
- track assignment and state of manual labeling work
Core fields:
- _id
- workspaceId
- projectId
- targetType
- targetRef
- labelType
- status
- assigneeIds
- reviewerIds
- createdBy
- createdAt
- updatedAt
annotations
Purpose:
- persist annotation outputs
Core fields:
- _id
- annotationTaskId
- workspaceId
- projectId
- targetRef
- payload
- status
- createdBy
- createdAt
- updatedAt
plugins
Purpose:
- track installable and enabled plugin versions
Core fields:
- _id
- workspaceId optional for workspace-scoped plugins
- scope as platform or workspace
- name
- status
- currentVersion
- versions
- permissions
- metadata
- createdAt
- updatedAt
If plugin version payloads become large, split versions into a separate collection later. V1 can keep them nested if bounded.
storage_connections
Purpose:
- store object storage and path registration configuration
Core fields:
- _id
- workspaceId
- type
- provider
- name
- status
- config
- secretRef
- createdBy
- createdAt
- updatedAt
Store secrets outside plaintext document fields where possible.
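As a rough illustration of keeping credentials out of the document, the sketch below shows a storage_connections document whose config holds only non-sensitive settings, with secretRef pointing at an external secret store. The `vault://` reference format, the example values, and the list of suspicious key names are all hypothetical.

```python
# Key names that suggest an inline credential; illustrative, not exhaustive.
SUSPICIOUS_KEYS = {"password", "secret", "secretkey", "accesskey", "token"}


def inline_secret_keys(doc):
    """Return config keys that look like inline secrets (empty list if none)."""
    return sorted(
        key for key in doc.get("config", {})
        if key.lower() in SUSPICIOUS_KEYS
    )


# Hypothetical connection document: config is safe to log, secretRef is a pointer.
connection = {
    "workspaceId": "w1",
    "type": "object_storage",
    "provider": "s3",
    "name": "primary-bucket",
    "status": "active",
    "config": {"bucket": "emboflow-raw", "region": "us-east-1"},
    "secretRef": "vault://emboflow/storage/primary-bucket",
}
```

A check like `inline_secret_keys` could run before insert so that a connection with credentials pasted into config is rejected rather than persisted.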
audit_logs
Purpose:
- append-only history of sensitive actions
Core fields:
- _id
- workspaceId
- projectId
- actorId
- resourceType
- resourceId
- action
- beforeSummary
- afterSummary
- metadata
- createdAt
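A small sketch of what append-only could look like from application code: a builder that produces one audit_logs document, and an in-memory stand-in for the collection that offers append and read but no update or delete. The function and class names, and the example action, are assumptions for illustration.

```python
from datetime import datetime, timezone


def audit_entry(workspace_id, actor_id, resource_type, resource_id, action,
                before_summary=None, after_summary=None,
                project_id=None, metadata=None):
    """Build one audit_logs document. Entries are inserted, never updated."""
    return {
        "workspaceId": workspace_id,
        "projectId": project_id,
        "actorId": actor_id,
        "resourceType": resource_type,
        "resourceId": resource_id,
        "action": action,
        # Summaries are meant to be small diffs, not full document copies.
        "beforeSummary": before_summary,
        "afterSummary": after_summary,
        "metadata": metadata or {},
        "createdAt": datetime.now(timezone.utc),
    }


class AppendOnlyLog:
    """In-memory stand-in that exposes append and read, nothing else."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        # Store a copy so later changes to the caller's dict do not leak in.
        self._entries.append(dict(entry))

    def entries(self):
        # A tuple keeps callers from mutating the underlying list.
        return tuple(self._entries)
```

For example, publishing a workflow version might append `audit_entry("w1", "u1", "workflow_definition", "wf1", "publish", {"publishedVersionNumber": 2}, {"publishedVersionNumber": 3})`.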
Reference Strategy
Use stable ids between collections.
References should be explicit:
- asset to probe report
- dataset to dataset versions
- workflow definition to workflow versions
- workflow run to run tasks
- task to artifact
- annotation task to annotations
Do not depend on implicit path-based linkage.
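The difference between explicit and path-based linkage can be sketched with the task-to-artifact reference. Below, the artifact names its producer through producerType and producerId, and the lookup follows those fields; it never parses the storage key, even though the key happens to contain the run and node names. The in-memory dictionaries, the "run_task" discriminator value, and the example ids are assumptions.

```python
# In-memory stand-ins for two collections, keyed by _id (hypothetical data).
run_tasks = {
    "t1": {"_id": "t1", "workflowRunId": "r1", "nodeId": "normalize"},
}
artifacts = {
    "a1": {
        "_id": "a1",
        "producerType": "run_task",   # assumed discriminator value
        "producerId": "t1",
        "storageRef": {"bucket": "outputs", "key": "r1/normalize/report.json"},
    },
}

# Maps a producerType value to the collection that holds the producer.
PRODUCER_COLLECTIONS = {"run_task": run_tasks}


def resolve_producer(artifact):
    """Follow the explicit producerType/producerId reference."""
    collection = PRODUCER_COLLECTIONS[artifact["producerType"]]
    return collection[artifact["producerId"]]


producer = resolve_producer(artifacts["a1"])
```

If the object is later moved to a different key, the reference still resolves, which is the practical benefit of not encoding relationships in paths.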
Index Recommendations
Always index
- workspaceId
- projectId
- status
- createdAt
Important compound indexes
- memberships.workspaceId + memberships.userId
- projects.workspaceId + projects.slug
- assets.projectId + assets.type + assets.createdAt
- datasets.projectId + datasets.name
- dataset_versions.datasetId + dataset_versions.createdAt
- workflow_definitions.projectId + workflow_definitions.slug
- workflow_definition_versions.workflowDefinitionId + versionNumber
- workflow_runs.projectId + createdAt
- workflow_runs.workflowDefinitionId + status
- run_tasks.workflowRunId + nodeId
- artifacts.producerType + producerId
- annotation_tasks.projectId + status
- audit_logs.workspaceId + createdAt
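The compound indexes listed above could be declared as data and applied in one pass. The sketch below assumes a PyMongo-style database handle, where `db[name].create_index(keys)` accepts a list of (field, direction) pairs. The source does not specify sort directions or uniqueness, so ascending (1), non-unique indexes are an assumption here, and `ensure_indexes` is a hypothetical helper name.

```python
# (collection, [(field, direction), ...]); direction 1 means ascending.
COMPOUND_INDEXES = [
    ("memberships", [("workspaceId", 1), ("userId", 1)]),
    ("projects", [("workspaceId", 1), ("slug", 1)]),
    ("assets", [("projectId", 1), ("type", 1), ("createdAt", 1)]),
    ("datasets", [("projectId", 1), ("name", 1)]),
    ("dataset_versions", [("datasetId", 1), ("createdAt", 1)]),
    ("workflow_definitions", [("projectId", 1), ("slug", 1)]),
    ("workflow_definition_versions",
     [("workflowDefinitionId", 1), ("versionNumber", 1)]),
    ("workflow_runs", [("projectId", 1), ("createdAt", 1)]),
    ("workflow_runs", [("workflowDefinitionId", 1), ("status", 1)]),
    ("run_tasks", [("workflowRunId", 1), ("nodeId", 1)]),
    ("artifacts", [("producerType", 1), ("producerId", 1)]),
    ("annotation_tasks", [("projectId", 1), ("status", 1)]),
    ("audit_logs", [("workspaceId", 1), ("createdAt", 1)]),
]


def ensure_indexes(db):
    """Create every listed index and return whatever create_index returns.

    `db` only needs to support db[name].create_index(keys).
    """
    created = []
    for collection_name, keys in COMPOUND_INDEXES:
        created.append(db[collection_name].create_index(keys))
    return created
```

Keeping the specification in one list makes it easy to review against the recommendations and to run at service startup; queries that sort newest-first on createdAt may be better served by a descending direction on that field.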
Object Storage References
MongoDB should store references such as:
- bucket
- key
- uri
- checksum
- content type
- size
It should not store:
- large binary file payloads
- full raw video content
- giant archive contents
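A minimal sketch of building such a reference: the function reads the payload only to compute its size and checksum, and the returned document contains no bytes. The `s3://` URI scheme, the `sha256:` checksum prefix, the camelCase `contentType` key, and the function name are assumptions for illustration.

```python
import hashlib


def make_storage_ref(bucket, key, payload: bytes, content_type):
    """Describe an object already written to object storage.

    Only metadata is returned; the payload itself is never embedded.
    """
    return {
        "bucket": bucket,
        "key": key,
        "uri": f"s3://{bucket}/{key}",                       # assumed scheme
        "checksum": "sha256:" + hashlib.sha256(payload).hexdigest(),
        "contentType": content_type,
        "size": len(payload),
    }


ref = make_storage_ref("emboflow-raw", "assets/a1/clip.mp4",
                       b"hello", "video/mp4")
```

The resulting dictionary is what a field such as storageRef or manifestRef would hold, and it stays a few hundred bytes no matter how large the referenced object is.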
V1 Constraints
- MongoDB is the only database
- No relational sidecar is assumed
- No GridFS-first strategy is assumed
- Large manifests may live in object storage and be referenced from MongoDB
V1 Non-Goals
The V1 model does not need:
- cross-region data distribution
- advanced event sourcing
- fully normalized analytics warehouse modeling
- high-volume search indexing inside MongoDB itself