# EmboFlow MongoDB Data Model

## Goal

Define the MongoDB-only persistence model for EmboFlow V1.

The database must support:

- user and workspace isolation
- raw asset tracking
- canonical dataset versions
- workflow versioning
- workflow execution history
- plugin registration
- auditability

## Storage Principles

- MongoDB stores metadata and execution state
- Object storage stores large binary files and large derived bundles
- MongoDB documents should have clear aggregate boundaries
- Large, fast-growing arrays should be split into separate collections
- Platform contracts should use references, not embedded file blobs

## Current V1 Implementation Notes

The first code pass stabilized these collection boundaries with in-memory services. The executable local runtime now persists the core objects below into MongoDB.

The implementation therefore already validates the following, while still targeting the collection model below as the persistent shape:

- document shapes
- controller and service boundaries
- workflow/run/task separation
- artifact lookup by producer
- asset persistence and probe reports through Mongo-backed collections

## Primary Collections

- `users`
- `workspaces`
- `projects`
- `memberships`
- `assets`
- `asset_probe_reports`
- `datasets`
- `dataset_versions`
- `workflow_definitions`
- `workflow_definition_versions`
- `workflow_runs`
- `run_tasks`
- `artifacts`
- `annotation_tasks`
- `annotations`
- `plugins`
- `storage_connections`
- `audit_logs`

## Collection Design

### users

Purpose:

- account identity
- profile
- login metadata

Core fields:

- `_id`
- `email`
- `displayName`
- `avatarUrl`
- `status`
- `lastLoginAt`
- `createdAt`
- `updatedAt`

### workspaces

Purpose:

- resource ownership boundary

Core fields:

- `_id`
- `type` as `personal` or `team`
- `name`
- `slug`
- `ownerId`
- `status`
- `settings`
- `createdAt`
- `updatedAt`

### memberships

Purpose:

- workspace and project role mapping

Core fields:

- `_id`
- `workspaceId`
- `projectId` optional
- `userId`
- `role`
- `status`
- `createdAt`
- `updatedAt`

This collection should stay independent instead of embedding large member arrays on every resource.

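As a rough illustration of how a standalone `memberships` collection might be consumed, the sketch below resolves a user's effective role from plain membership records. The `Role` values, the `"active"` status string, and the rule that a project-level membership takes precedence over a workspace-level one are assumptions made for this example; the model above does not define them.

```typescript
// Hypothetical role values; the data model does not enumerate them.
type Role = "owner" | "admin" | "member" | "viewer";

interface MembershipDoc {
  _id: string;
  workspaceId: string;
  projectId?: string; // absent for workspace-level memberships
  userId: string;
  role: Role;
  status: string;
}

// Returns the effective role for a user in a project, or null if none applies.
// Assumption: a project-level membership overrides the workspace-level one.
function resolveRole(
  memberships: MembershipDoc[],
  userId: string,
  workspaceId: string,
  projectId: string,
): Role | null {
  const active = memberships.filter(
    (m) => m.userId === userId && m.workspaceId === workspaceId && m.status === "active",
  );
  const projectLevel = active.filter((m) => m.projectId === projectId)[0];
  if (projectLevel) return projectLevel.role;
  const workspaceLevel = active.filter((m) => m.projectId === undefined)[0];
  return workspaceLevel ? workspaceLevel.role : null;
}
```

Because every membership is its own document, adding or revoking access touches one small record rather than rewriting a member array on a workspace or project.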
### projects

Purpose:

- project-scoped grouping for assets, workflows, runs, and outputs

Core fields:

- `_id`
- `workspaceId`
- `name`
- `slug`
- `description`
- `status`
- `createdBy`
- `createdAt`
- `updatedAt`

### assets

Purpose:

- represent raw uploaded or imported inputs

Supported asset types:

- `raw_file`
- `archive`
- `folder`
- `video_collection`
- `standard_dataset`
- `rosbag`
- `hdf5_dataset`
- `object_storage_prefix`

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `type`
- `sourceType`
- `displayName`
- `status`
- `storageRef`
- `sizeBytes`
- `fileCount`
- `topLevelPaths`
- `detectedFormats`
- `summary`
- `createdBy`
- `createdAt`
- `updatedAt`

Do not embed full large file listings in this document.

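One way to honor the rule against embedding full file listings is to reduce a listing to the bounded summary fields (`fileCount`, `sizeBytes`, `topLevelPaths`) before writing the asset document. The sketch below is illustrative only: the `ListedFile` shape, the `summarizeListing` helper, and the cap of 50 entries are assumptions, not part of the model.

```typescript
interface ListedFile {
  path: string; // relative path inside the asset, using "/" separators
  sizeBytes: number;
}

interface AssetListingSummary {
  fileCount: number;
  sizeBytes: number;
  topLevelPaths: string[];
}

// Hypothetical cap; pick a limit that fits your document size budget.
const MAX_TOP_LEVEL_PATHS = 50;

// Reduces a full listing to counts, a total size, and a capped list of
// distinct first path segments, in order of first appearance.
function summarizeListing(entries: ListedFile[]): AssetListingSummary {
  const topLevelPaths: string[] = [];
  let sizeBytes = 0;
  for (const entry of entries) {
    sizeBytes += entry.sizeBytes;
    const segment = entry.path.split("/")[0];
    if (topLevelPaths.indexOf(segment) === -1 && topLevelPaths.length < MAX_TOP_LEVEL_PATHS) {
      topLevelPaths.push(segment);
    }
  }
  return { fileCount: entries.length, sizeBytes, topLevelPaths };
}
```

The full listing itself can then live in object storage or in a probe report, with the asset document carrying only this summary.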
### asset_probe_reports

Purpose:

- retain richer structure-detection and validation output

Core fields:

- `_id`
- `assetId`
- `reportVersion`
- `detectedFormatCandidates`
- `structureSummary`
- `warnings`
- `recommendedNextNodes`
- `rawReport`
- `createdAt`

### datasets

Purpose:

- represent logical dataset identity

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `name`
- `type`
- `status`
- `latestVersionId`
- `summary`
- `createdBy`
- `createdAt`
- `updatedAt`

### dataset_versions

Purpose:

- represent immutable dataset snapshots

Core fields:

- `_id`
- `datasetId`
- `workspaceId`
- `projectId`
- `sourceAssetId`
- `parentVersionId`
- `versionTag`
- `canonicalSchemaVersion`
- `manifestRef`
- `stats`
- `summary`
- `status`
- `createdBy`
- `createdAt`

This collection is separated because versions will grow over time.

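Since each version points at its predecessor through `parentVersionId`, version history can be reconstructed by following those links. The sketch below shows one possible traversal over already-loaded version documents. It assumes that the first version stores `null` as its `parentVersionId`, which the model does not specify, and `lineageTags` is a hypothetical helper name.

```typescript
interface DatasetVersionDoc {
  _id: string;
  datasetId: string;
  parentVersionId: string | null; // assumption: null marks the first version
  versionTag: string;
}

// Walks parentVersionId links from a starting version back to the root.
// Returns version tags from newest to oldest; stops on a missing parent or a cycle.
function lineageTags(versions: DatasetVersionDoc[], startId: string): string[] {
  const byId: { [id: string]: DatasetVersionDoc } = Object.create(null);
  for (const v of versions) byId[v._id] = v;

  const tags: string[] = [];
  const visited: string[] = [];
  let current: DatasetVersionDoc | undefined = byId[startId];
  while (current && visited.indexOf(current._id) === -1) {
    visited.push(current._id);
    tags.push(current.versionTag);
    current = current.parentVersionId === null ? undefined : byId[current.parentVersionId];
  }
  return tags;
}
```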
### workflow_definitions

Purpose:

- represent logical workflow identity

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `name`
- `slug`
- `status`
- `latestVersionNumber`
- `publishedVersionNumber`
- `createdBy`
- `createdAt`
- `updatedAt`

### workflow_definition_versions

Purpose:

- represent immutable workflow snapshots

Core fields:

- `_id`
- `workflowDefinitionId`
- `workspaceId`
- `projectId`
- `versionNumber`
- `visualGraph`
- `logicGraph`
- `runtimeGraph`
- `pluginRefs`
- `summary`
- `createdBy`
- `createdAt`

Splitting versions from workflow head metadata avoids oversized documents and simplifies history queries.

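The head/version split can be pictured as two writes: a new immutable snapshot document and a small update to the head's `latestVersionNumber`. The following is a minimal sketch with trimmed-down document shapes; the `appendVersion` helper and the `<definitionId>:v<n>` id scheme are hypothetical and only serve the example.

```typescript
interface WorkflowDefinitionDoc {
  _id: string;
  latestVersionNumber: number;
  publishedVersionNumber: number | null;
  updatedAt: Date;
}

interface WorkflowDefinitionVersionDoc {
  _id: string;
  workflowDefinitionId: string;
  versionNumber: number;
  logicGraph: unknown;
  createdBy: string;
  createdAt: Date;
}

// Builds the next snapshot plus the updated head; the input head is not mutated.
function appendVersion(
  head: WorkflowDefinitionDoc,
  logicGraph: unknown,
  createdBy: string,
  now: Date,
): { head: WorkflowDefinitionDoc; version: WorkflowDefinitionVersionDoc } {
  const versionNumber = head.latestVersionNumber + 1;
  // Shallow freeze to signal that snapshots are not edited after creation.
  const version: WorkflowDefinitionVersionDoc = Object.freeze({
    _id: head._id + ":v" + versionNumber, // hypothetical id scheme
    workflowDefinitionId: head._id,
    versionNumber,
    logicGraph,
    createdBy,
    createdAt: now,
  });
  return {
    head: { ...head, latestVersionNumber: versionNumber, updatedAt: now },
    version,
  };
}
```

The head document stays small no matter how many versions accumulate, and listing history becomes a query on `workflow_definition_versions` by `workflowDefinitionId`.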
### workflow_runs

Purpose:

- store execution runs
- snapshot the asset bindings chosen at run creation time
- support project-scoped run history queries without re-reading workflow versions

Core fields:

- `_id`
- `workflowDefinitionId`
- `workflowVersionId`
- `assetIds`
- `workspaceId`
- `projectId`
- `triggeredBy`
- `status`
- `runtimeSnapshot`
- `summary`
- `startedAt`
- `finishedAt`
- `createdAt`

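To illustrate what snapshotting asset bindings at creation time could look like, the sketch below copies the selected asset ids into the run document so that later changes to the caller's selection do not affect the stored run. The `CreateRunInput` shape, the `createRun` helper, and the initial `"queued"` status are assumptions for this example.

```typescript
interface WorkflowRunDoc {
  _id: string;
  workflowDefinitionId: string;
  workflowVersionId: string;
  assetIds: string[];
  workspaceId: string;
  projectId: string;
  triggeredBy: string;
  status: string;
  createdAt: Date;
}

interface CreateRunInput {
  runId: string;
  workflowDefinitionId: string;
  workflowVersionId: string;
  workspaceId: string;
  projectId: string;
  triggeredBy: string;
  selectedAssetIds: string[];
}

// Builds a run document that carries its own copy of the asset bindings.
function createRun(input: CreateRunInput, now: Date): WorkflowRunDoc {
  return {
    _id: input.runId,
    workflowDefinitionId: input.workflowDefinitionId,
    workflowVersionId: input.workflowVersionId,
    assetIds: input.selectedAssetIds.slice(), // copy, not a shared reference
    workspaceId: input.workspaceId,
    projectId: input.projectId,
    triggeredBy: input.triggeredBy,
    status: "queued", // hypothetical initial status
    createdAt: now,
  };
}
```

Keeping `projectId` and `workflowDefinitionId` directly on the run is what lets run history be listed per project without joining back to the workflow version.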
### run_tasks

Purpose:

- store one execution unit per node per run
- keep bound asset context available to the worker at dequeue time

Core fields:

- `_id`
- `workflowRunId`
- `workflowVersionId`
- `nodeId`
- `nodeType`
- `executorType`
- `status`
- `attempt`
- `assetIds`
- `upstreamNodeIds`
- `outputArtifactIds`
- `logRef`
- `cacheKey`
- `cacheHit`
- `logLines`
- `errorMessage`
- `summary`
- `lastResultPreview`
- `startedAt`
- `finishedAt`
- `durationMs`
- `createdAt`

This collection should remain separate from `workflow_runs` because task volume grows quickly.

The current executable worker path expects `run_tasks` to be self-sufficient enough for dequeue and dependency promotion. That means V1 runtime tasks already persist:

- executor choice
- bound asset ids
- upstream node dependencies
- produced artifact ids
- per-task status and error message
- task log lines and result preview
- structured task summaries with executor, outcome, asset count, and artifact ids

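Dependency promotion can be expressed purely in terms of the fields persisted on `run_tasks`. The sketch below is one possible reading of it, not the worker's actual implementation: it selects pending tasks whose upstream nodes have all succeeded within the same run. The status names in `TaskStatus` and the `promotableTaskIds` helper are assumptions.

```typescript
// Hypothetical status values; the data model does not enumerate them.
type TaskStatus = "pending" | "queued" | "running" | "success" | "failed";

interface RunTaskDoc {
  _id: string;
  workflowRunId: string;
  nodeId: string;
  status: TaskStatus;
  upstreamNodeIds: string[];
}

// Returns ids of pending tasks in the given run whose upstream nodes all succeeded.
// Tasks with no upstream nodes are promotable immediately.
function promotableTaskIds(tasks: RunTaskDoc[], workflowRunId: string): string[] {
  const inRun = tasks.filter((t) => t.workflowRunId === workflowRunId);
  const succeededNodeIds = inRun
    .filter((t) => t.status === "success")
    .map((t) => t.nodeId);
  return inRun
    .filter((t) => t.status === "pending")
    .filter((t) => t.upstreamNodeIds.every((id) => succeededNodeIds.indexOf(id) !== -1))
    .map((t) => t._id);
}
```

Because `upstreamNodeIds` lives on each task, this check needs no access to the workflow version's graph, which matches the self-sufficiency described above.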
### artifacts

Purpose:

- store managed outputs and previews

Artifact types may include:

- preview bundle
- quality report
- normalized dataset package
- delivery package
- training config package
- intermediate task output

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `type`
- `producerType`
- `producerId`
- `storageRef`
- `previewable`
- `summary`
- `lineage`
- `createdBy`
- `createdAt`

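Artifact lookup by producer relies on the `producerType` and `producerId` pair. The sketch below shows an in-memory version of that lookup together with the equivalent filter object one might hand to a MongoDB query. The `ProducerType` values and both helper names are hypothetical.

```typescript
// Hypothetical producer kinds; the model only names the field.
type ProducerType = "run_task" | "workflow_run" | "asset";

interface ArtifactDoc {
  _id: string;
  type: string;
  producerType: ProducerType;
  producerId: string;
  previewable: boolean;
}

// Filter object with the same two fields as the producer compound index.
function producerFilter(producerType: ProducerType, producerId: string) {
  return { producerType, producerId };
}

// In-memory equivalent of querying the collection with that filter.
function findByProducer(
  artifacts: ArtifactDoc[],
  producerType: ProducerType,
  producerId: string,
): ArtifactDoc[] {
  const filter = producerFilter(producerType, producerId);
  return artifacts.filter(
    (a) => a.producerType === filter.producerType && a.producerId === filter.producerId,
  );
}
```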
### annotation_tasks

Purpose:

- track assignment and state of manual labeling work

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `targetType`
- `targetRef`
- `labelType`
- `status`
- `assigneeIds`
- `reviewerIds`
- `createdBy`
- `createdAt`
- `updatedAt`

### annotations

Purpose:

- persist annotation outputs

Core fields:

- `_id`
- `annotationTaskId`
- `workspaceId`
- `projectId`
- `targetRef`
- `payload`
- `status`
- `createdBy`
- `createdAt`
- `updatedAt`

### plugins

Purpose:

- track installable and enabled plugin versions

Core fields:

- `_id`
- `workspaceId` (optional; used for workspace-scoped plugins)
- `scope` as `platform` or `workspace`
- `name`
- `status`
- `currentVersion`
- `versions`
- `permissions`
- `metadata`
- `createdAt`
- `updatedAt`

V1 can keep plugin versions nested in this document as long as the `versions` array stays bounded. If version payloads become large, split them into a separate collection later.

### storage_connections

Purpose:

- store object storage and path registration configuration

Core fields:

- `_id`
- `workspaceId`
- `type`
- `provider`
- `name`
- `status`
- `config`
- `secretRef`
- `createdBy`
- `createdAt`
- `updatedAt`

Store secrets outside plaintext document fields where possible.

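One way to keep credentials out of the document is to hand them to a separate secret store and persist only the returned reference in `secretRef`. The sketch below uses an in-memory stand-in to show the shape of that flow. `SecretStore`, `InMemorySecretStore`, `buildConnection`, the `secret://local/` reference format, and the example `config` keys are all hypothetical; a real deployment would use whatever secret manager the platform adopts.

```typescript
// Hypothetical abstraction over an external secret manager.
interface SecretStore {
  put(value: string): string; // returns an opaque reference
}

// In-memory stand-in for demonstration only; not suitable for production.
class InMemorySecretStore implements SecretStore {
  private values: { [ref: string]: string } = {};
  private counter = 0;

  put(value: string): string {
    this.counter += 1;
    const ref = "secret://local/" + this.counter;
    this.values[ref] = value;
    return ref;
  }
}

interface StorageConnectionDoc {
  workspaceId: string;
  provider: string;
  name: string;
  config: { [key: string]: string }; // non-secret settings only
  secretRef: string;
}

// Stores the secret externally and returns a document that holds only the reference.
function buildConnection(
  input: { workspaceId: string; provider: string; name: string; endpoint: string; bucket: string; secretKey: string },
  secrets: SecretStore,
): StorageConnectionDoc {
  return {
    workspaceId: input.workspaceId,
    provider: input.provider,
    name: input.name,
    config: { endpoint: input.endpoint, bucket: input.bucket },
    secretRef: secrets.put(input.secretKey),
  };
}
```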
### audit_logs

Purpose:

- append-only history of sensitive actions

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `actorId`
- `resourceType`
- `resourceId`
- `action`
- `beforeSummary`
- `afterSummary`
- `metadata`
- `createdAt`

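The `beforeSummary` and `afterSummary` fields suggest storing compact summaries rather than full document copies. The sketch below shows one possible way to build such an entry by keeping only the fields whose values changed. The `Summary` type, the `buildAuditEntry` helper, and the changed-fields-only policy are assumptions; `_id` is omitted on the assumption that the database assigns it on insert.

```typescript
type Summary = { [field: string]: string | number | boolean | null };

interface AuditLogDoc {
  workspaceId: string;
  actorId: string;
  resourceType: string;
  resourceId: string;
  action: string;
  beforeSummary: Summary;
  afterSummary: Summary;
  createdAt: Date;
}

// Reads an own property, treating a missing field as null.
function ownValue(source: Summary, key: string): string | number | boolean | null {
  return Object.prototype.hasOwnProperty.call(source, key) ? source[key] : null;
}

// Builds a shallowly frozen entry whose summaries contain only changed fields.
function buildAuditEntry(
  base: { workspaceId: string; actorId: string; resourceType: string; resourceId: string; action: string },
  before: Summary,
  after: Summary,
  now: Date,
): AuditLogDoc {
  const beforeSummary: Summary = {};
  const afterSummary: Summary = {};
  for (const key of Object.keys(before).concat(Object.keys(after))) {
    const b = ownValue(before, key);
    const a = ownValue(after, key);
    if (b !== a) {
      beforeSummary[key] = b;
      afterSummary[key] = a;
    }
  }
  return Object.freeze({ ...base, beforeSummary, afterSummary, createdAt: now });
}
```

Entries built this way are only ever inserted; corrections would be recorded as new entries rather than updates, in keeping with the append-only purpose.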
## Reference Strategy

Use stable ids between collections.

References should be explicit:

- asset to probe report
- dataset to dataset versions
- workflow definition to workflow versions
- workflow run to run tasks
- task to artifact
- annotation task to annotations

Do not depend on implicit path-based linkage.

## Index Recommendations

### Always index

- `workspaceId`
- `projectId`
- `status`
- `createdAt`

### Important compound indexes

- `memberships.workspaceId + memberships.userId`
- `projects.workspaceId + projects.slug`
- `assets.projectId + assets.type + assets.createdAt`
- `datasets.projectId + datasets.name`
- `dataset_versions.datasetId + dataset_versions.createdAt`
- `workflow_definitions.projectId + workflow_definitions.slug`
- `workflow_definition_versions.workflowDefinitionId + workflow_definition_versions.versionNumber`
- `workflow_runs.projectId + workflow_runs.createdAt`
- `workflow_runs.workflowDefinitionId + workflow_runs.status`
- `run_tasks.workflowRunId + run_tasks.nodeId`
- `artifacts.producerType + artifacts.producerId`
- `annotation_tasks.projectId + annotation_tasks.status`
- `audit_logs.workspaceId + audit_logs.createdAt`

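The compound indexes listed above can be captured as plain data so that they are easy to review and apply in one place. The sketch below is one possible encoding. The `IndexSpec` shape and `indexesFor` helper are hypothetical, and the ascending (`1`) direction on every key is an assumption; the recommendations do not state sort directions or uniqueness.

```typescript
type IndexKeys = { [field: string]: 1 | -1 };

interface IndexSpec {
  collection: string;
  keys: IndexKeys; // key order matters for compound indexes
}

// Direction 1 (ascending) is assumed throughout; adjust per query pattern.
const compoundIndexes: IndexSpec[] = [
  { collection: "memberships", keys: { workspaceId: 1, userId: 1 } },
  { collection: "projects", keys: { workspaceId: 1, slug: 1 } },
  { collection: "assets", keys: { projectId: 1, type: 1, createdAt: 1 } },
  { collection: "datasets", keys: { projectId: 1, name: 1 } },
  { collection: "dataset_versions", keys: { datasetId: 1, createdAt: 1 } },
  { collection: "workflow_definitions", keys: { projectId: 1, slug: 1 } },
  { collection: "workflow_definition_versions", keys: { workflowDefinitionId: 1, versionNumber: 1 } },
  { collection: "workflow_runs", keys: { projectId: 1, createdAt: 1 } },
  { collection: "workflow_runs", keys: { workflowDefinitionId: 1, status: 1 } },
  { collection: "run_tasks", keys: { workflowRunId: 1, nodeId: 1 } },
  { collection: "artifacts", keys: { producerType: 1, producerId: 1 } },
  { collection: "annotation_tasks", keys: { projectId: 1, status: 1 } },
  { collection: "audit_logs", keys: { workspaceId: 1, createdAt: 1 } },
];

// Returns the index specs declared for one collection.
function indexesFor(collection: string): IndexSpec[] {
  return compoundIndexes.filter((spec) => spec.collection === collection);
}
```

With the official MongoDB Node.js driver, each entry could be applied at startup with something like `db.collection(spec.collection).createIndex(spec.keys)`; that wiring is left out here so the example stays free of external dependencies.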
## Object Storage References

MongoDB should store references such as:

- bucket
- key
- uri
- checksum
- content type
- size

It should not store:

- large binary file payloads
- full raw video content
- giant archive contents

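The reference fields above could be grouped into a small value type that is embedded wherever a document points at object storage, for example in `storageRef` or `manifestRef`. The sketch below is one possible shape. The property names `contentType` and `sizeBytes`, the `makeStorageRef` helper, and the `s3://` URI scheme are assumptions made for the example.

```typescript
interface StorageRef {
  bucket: string;
  key: string;
  uri: string;
  checksum: string;
  contentType: string;
  sizeBytes: number;
}

// Builds a reference from metadata only; the payload itself never enters MongoDB.
function makeStorageRef(
  bucket: string,
  key: string,
  checksum: string,
  contentType: string,
  sizeBytes: number,
): StorageRef {
  if (sizeBytes < 0) {
    throw new Error("sizeBytes must be non-negative");
  }
  return {
    bucket,
    key,
    uri: "s3://" + bucket + "/" + key, // hypothetical URI scheme
    checksum,
    contentType,
    sizeBytes,
  };
}
```

A document carrying such a reference stays a few hundred bytes regardless of how large the referenced video or archive is.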
## V1 Constraints

- MongoDB is the only database
- No relational sidecar is assumed
- No GridFS-first strategy is assumed
- Large manifests may live in object storage and be referenced from MongoDB

## V1 Non-Goals

The V1 model does not need:

- cross-region data distribution
- advanced event sourcing
- fully normalized analytics warehouse modeling
- high-volume search indexing inside MongoDB itself