EmboFlow/design/05-data/mongodb-data-model.md
2026-03-26 21:13:05 +08:00

542 lines
8.7 KiB
Markdown

# EmboFlow MongoDB Data Model
## Goal
Define the MongoDB-only persistence model for EmboFlow V1.
The database must support:
- user and workspace isolation
- raw asset tracking
- canonical dataset versions
- workflow versioning
- workflow execution history
- plugin registration
- auditability
## Storage Principles
- MongoDB stores metadata and execution state
- Object storage stores large binary files and large derived bundles
- MongoDB documents should have clear aggregate boundaries
- Large, fast-growing arrays should be split into separate collections
- Platform contracts should use references, not embedded file blobs
## Current V1 Implementation Notes
The first code pass stabilized these collection boundaries with in-memory services. The executable local runtime now persists the core objects below into MongoDB.
This means the implementation now validates:
- document shapes
- controller and service boundaries
- workflow/run/task separation
- artifact lookup by producer
- asset persistence and probe reports through Mongo-backed collections
while still targeting the collection model below as the persistent shape.
## Primary Collections
- `users`
- `workspaces`
- `projects`
- `memberships`
- `assets`
- `asset_probe_reports`
- `datasets`
- `dataset_versions`
- `workflow_definitions`
- `workflow_definition_versions`
- `workflow_runs`
- `run_tasks`
- `artifacts`
- `annotation_tasks`
- `annotations`
- `plugins`
- `storage_connections`
- `audit_logs`
## Collection Design
### users
Purpose:
- account identity
- profile
- login metadata
Core fields:
- `_id`
- `email`
- `displayName`
- `avatarUrl`
- `status`
- `lastLoginAt`
- `createdAt`
- `updatedAt`
### workspaces
Purpose:
- resource ownership boundary
Core fields:
- `_id`
- `type` as `personal` or `team`
- `name`
- `slug`
- `ownerId`
- `status`
- `settings`
- `createdAt`
- `updatedAt`
### memberships
Purpose:
- workspace and project role mapping
Core fields:
- `_id`
- `workspaceId`
- `projectId` optional
- `userId`
- `role`
- `status`
- `createdAt`
- `updatedAt`
This collection should stay independent instead of embedding large member arrays on every resource.
### projects
Purpose:
- project-scoped grouping for assets, workflows, runs, and outputs
Core fields:
- `_id`
- `workspaceId`
- `name`
- `slug`
- `description`
- `status`
- `createdBy`
- `createdAt`
- `updatedAt`
### assets
Purpose:
- represent raw uploaded or imported inputs
Supported asset types:
- `raw_file`
- `archive`
- `folder`
- `video_collection`
- `standard_dataset`
- `rosbag`
- `hdf5_dataset`
- `object_storage_prefix`
Core fields:
- `_id`
- `workspaceId`
- `projectId`
- `type`
- `sourceType`
- `displayName`
- `status`
- `storageRef`
- `sizeBytes`
- `fileCount`
- `topLevelPaths`
- `detectedFormats`
- `summary`
- `createdBy`
- `createdAt`
- `updatedAt`
Do not embed full large file listings in this document.
### asset_probe_reports
Purpose:
- retain richer structure-detection and validation output
Core fields:
- `_id`
- `assetId`
- `reportVersion`
- `detectedFormatCandidates`
- `structureSummary`
- `warnings`
- `recommendedNextNodes`
- `rawReport`
- `createdAt`
### datasets
Purpose:
- represent logical dataset identity
Core fields:
- `_id`
- `workspaceId`
- `projectId`
- `name`
- `type`
- `status`
- `latestVersionId`
- `summary`
- `createdBy`
- `createdAt`
- `updatedAt`
### dataset_versions
Purpose:
- represent immutable dataset snapshots
Core fields:
- `_id`
- `datasetId`
- `workspaceId`
- `projectId`
- `sourceAssetId`
- `parentVersionId`
- `versionTag`
- `canonicalSchemaVersion`
- `manifestRef`
- `stats`
- `summary`
- `status`
- `createdBy`
- `createdAt`
This collection is separated because versions will grow over time.
### workflow_definitions
Purpose:
- represent logical workflow identity
Core fields:
- `_id`
- `workspaceId`
- `projectId`
- `name`
- `slug`
- `status`
- `latestVersionNumber`
- `publishedVersionNumber`
- `createdBy`
- `createdAt`
- `updatedAt`
### workflow_definition_versions
Purpose:
- represent immutable workflow snapshots
Core fields:
- `_id`
- `workflowDefinitionId`
- `workspaceId`
- `projectId`
- `versionNumber`
- `visualGraph`
- `logicGraph`
- `runtimeGraph`
- `pluginRefs`
- `summary`
- `createdBy`
- `createdAt`
Splitting versions from workflow head metadata avoids oversized documents and simplifies history queries.
### workflow_runs
Purpose:
- store execution runs
Core fields:
- `_id`
- `workflowDefinitionId`
- `workflowVersionId`
- `workspaceId`
- `projectId`
- `triggeredBy`
- `status`
- `runtimeSnapshot`
- `summary`
- `startedAt`
- `finishedAt`
- `createdAt`
### run_tasks
Purpose:
- store one execution unit per node per run
Core fields:
- `_id`
- `workflowRunId`
- `workflowVersionId`
- `nodeId`
- `nodeType`
- `executorType`
- `status`
- `attempt`
- `upstreamNodeIds`
- `outputArtifactIds`
- `logRef`
- `cacheKey`
- `cacheHit`
- `errorMessage`
- `startedAt`
- `finishedAt`
- `createdAt`
This collection should remain separate from `workflow_runs` because task volume grows quickly.
The current executable worker path expects `run_tasks` to be self-sufficient enough for dequeue and dependency promotion. That means V1 runtime tasks already persist:
- executor choice
- upstream node dependencies
- produced artifact ids
- per-task status and error message
### artifacts
Purpose:
- store managed outputs and previews
Artifact types may include:
- preview bundle
- quality report
- normalized dataset package
- delivery package
- training config package
- intermediate task output
Core fields:
- `_id`
- `workspaceId`
- `projectId`
- `type`
- `producerType`
- `producerId`
- `storageRef`
- `previewable`
- `summary`
- `lineage`
- `createdBy`
- `createdAt`
### annotation_tasks
Purpose:
- track assignment and state of manual labeling work
Core fields:
- `_id`
- `workspaceId`
- `projectId`
- `targetType`
- `targetRef`
- `labelType`
- `status`
- `assigneeIds`
- `reviewerIds`
- `createdBy`
- `createdAt`
- `updatedAt`
### annotations
Purpose:
- persist annotation outputs
Core fields:
- `_id`
- `annotationTaskId`
- `workspaceId`
- `projectId`
- `targetRef`
- `payload`
- `status`
- `createdBy`
- `createdAt`
- `updatedAt`
### plugins
Purpose:
- track installable and enabled plugin versions
Core fields:
- `_id`
- `workspaceId` optional for workspace-scoped plugins
- `scope` as `platform` or `workspace`
- `name`
- `status`
- `currentVersion`
- `versions`
- `permissions`
- `metadata`
- `createdAt`
- `updatedAt`
If plugin version payloads become large, split versions into a separate collection later. V1 can keep them nested if bounded.
### storage_connections
Purpose:
- store object storage and path registration configuration
Core fields:
- `_id`
- `workspaceId`
- `type`
- `provider`
- `name`
- `status`
- `config`
- `secretRef`
- `createdBy`
- `createdAt`
- `updatedAt`
Store secrets outside plaintext document fields where possible.
### audit_logs
Purpose:
- append-only history of sensitive actions
Core fields:
- `_id`
- `workspaceId`
- `projectId`
- `actorId`
- `resourceType`
- `resourceId`
- `action`
- `beforeSummary`
- `afterSummary`
- `metadata`
- `createdAt`
## Reference Strategy
Use stable ids between collections.
References should be explicit:
- asset to probe report
- dataset to dataset versions
- workflow definition to workflow versions
- workflow run to run tasks
- task to artifact
- annotation task to annotations
Do not depend on implicit path-based linkage.
## Index Recommendations
### Always index
- `workspaceId`
- `projectId`
- `status`
- `createdAt`
### Important compound indexes
- `memberships.workspaceId + memberships.userId`
- `projects.workspaceId + projects.slug`
- `assets.projectId + assets.type + assets.createdAt`
- `datasets.projectId + datasets.name`
- `dataset_versions.datasetId + dataset_versions.createdAt`
- `workflow_definitions.projectId + workflow_definitions.slug`
- `workflow_definition_versions.workflowDefinitionId + versionNumber`
- `workflow_runs.projectId + createdAt`
- `workflow_runs.workflowDefinitionId + status`
- `run_tasks.workflowRunId + nodeId`
- `artifacts.producerType + producerId`
- `annotation_tasks.projectId + status`
- `audit_logs.workspaceId + createdAt`
## Object Storage References
MongoDB should store references such as:
- bucket
- key
- uri
- checksum
- content type
- size
It should not store:
- large binary file payloads
- full raw video content
- giant archive contents
## V1 Constraints
- MongoDB is the only database
- No relational sidecar is assumed
- No GridFS-first strategy is assumed
- Large manifests may live in object storage and be referenced from MongoDB
## V1 Non-Goals
The V1 model does not need:
- cross-region data distribution
- advanced event sourcing
- fully normalized analytics warehouse modeling
- high-volume search indexing inside MongoDB itself