
# EmboFlow MongoDB Data Model

## Goal

Define the MongoDB-only persistence model for EmboFlow V1.

The database must support:

- user and workspace isolation
- raw asset tracking
- canonical dataset versions
- workflow versioning
- workflow execution history
- plugin registration
- auditability

## Storage Principles

- MongoDB stores metadata and execution state
- Object storage stores large binary files and large derived bundles
- MongoDB documents should have clear aggregate boundaries
- Large, fast-growing arrays should be split into separate collections
- Platform contracts should use references, not embedded file blobs

## Current V1 Implementation Notes

The first code pass stabilized these collection boundaries with in-memory services. The executable local runtime now persists the core objects below into MongoDB.

As a result, the implementation now validates the following against the collection model below, which remains the target persistent shape:

- document shapes
- controller and service boundaries
- workflow/run/task separation
- artifact lookup by producer
- asset persistence and probe reports through Mongo-backed collections

## Primary Collections

- `users`
- `workspaces`
- `projects`
- `memberships`
- `assets`
- `asset_probe_reports`
- `datasets`
- `dataset_versions`
- `workflow_definitions`
- `workflow_definition_versions`
- `workflow_runs`
- `run_tasks`
- `artifacts`
- `annotation_tasks`
- `annotations`
- `plugins`
- `storage_connections`
- `audit_logs`

## Collection Design

### users

Purpose:

- account identity
- profile
- login metadata

Core fields:

- `_id`
- `email`
- `displayName`
- `avatarUrl`
- `status`
- `lastLoginAt`
- `createdAt`
- `updatedAt`

### workspaces

Purpose:

- resource ownership boundary

Core fields:

- `_id`
- `type` as `personal` or `team`
- `name`
- `slug`
- `ownerId`
- `status`
- `settings`
- `createdAt`
- `updatedAt`

### memberships

Purpose:

- workspace and project role mapping

Core fields:

- `_id`
- `workspaceId`
- `projectId` optional
- `userId`
- `role`
- `status`
- `createdAt`
- `updatedAt`

This collection should stay independent instead of embedding large member arrays on every resource.
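
As a rough illustration of how a standalone `memberships` collection can answer access questions, the sketch below resolves a user's effective role from membership rows. It assumes that a project-level membership takes precedence over a workspace-level one; that precedence rule, the role names, and the `active` status value are hypothetical and not defined by this document.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Membership:
    workspace_id: str
    user_id: str
    role: str                          # role names are hypothetical
    status: str = "active"             # status values are hypothetical
    project_id: Optional[str] = None   # None means a workspace-level row


def effective_role(
    memberships: Iterable[Membership],
    user_id: str,
    workspace_id: str,
    project_id: str,
) -> Optional[str]:
    """Return the role that applies to the project, or None if there is none.

    Assumption: a project-level membership overrides the workspace-level one.
    """
    workspace_role = None
    for m in memberships:
        if m.user_id != user_id or m.workspace_id != workspace_id:
            continue
        if m.status != "active":
            continue
        if m.project_id == project_id:
            return m.role              # most specific match wins
        if m.project_id is None:
            workspace_role = m.role    # remember the fallback
    return workspace_role
```

Because roles live in their own rows, adding or revoking a member touches one small document rather than rewriting an array on the workspace or project.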

### projects

Purpose:

- project-scoped grouping for assets, workflows, runs, and outputs

Core fields:

- `_id`
- `workspaceId`
- `name`
- `slug`
- `description`
- `status`
- `createdBy`
- `createdAt`
- `updatedAt`

### assets

Purpose:

- represent raw uploaded or imported inputs

Supported asset types:

- `raw_file`
- `archive`
- `folder`
- `video_collection`
- `standard_dataset`
- `rosbag`
- `hdf5_dataset`
- `object_storage_prefix`

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `type`
- `sourceType`
- `displayName`
- `status`
- `storageRef`
- `sizeBytes`
- `fileCount`
- `topLevelPaths`
- `detectedFormats`
- `summary`
- `createdBy`
- `createdAt`
- `updatedAt`

Do not embed full large file listings in this document.
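
One way to honor this rule is to derive only bounded summary fields from a file listing before the document is written. The sketch below is a minimal example under stated assumptions: the helper name `build_asset_document`, the cap `MAX_TOP_LEVEL_PATHS`, and the `registered` status value are hypothetical, and the full listing is assumed to live in object storage or a probe report.

```python
from datetime import datetime, timezone

SUPPORTED_ASSET_TYPES = {
    "raw_file", "archive", "folder", "video_collection",
    "standard_dataset", "rosbag", "hdf5_dataset", "object_storage_prefix",
}
MAX_TOP_LEVEL_PATHS = 50  # hypothetical cap, not specified by the design


def build_asset_document(workspace_id, project_id, asset_type, display_name,
                         storage_ref, files, created_by):
    """Build an `assets` document from (path, size_bytes) pairs.

    Only aggregate facts are kept: total size, file count, and a capped
    list of top-level path segments. The per-file listing is not embedded.
    """
    if asset_type not in SUPPORTED_ASSET_TYPES:
        raise ValueError(f"unsupported asset type: {asset_type}")
    top_level = sorted(
        {path.strip("/").split("/", 1)[0] for path, _ in files if path.strip("/")}
    )
    now = datetime.now(timezone.utc)
    return {
        "workspaceId": workspace_id,
        "projectId": project_id,
        "type": asset_type,
        "displayName": display_name,
        "status": "registered",  # hypothetical status value
        "storageRef": storage_ref,
        "sizeBytes": sum(size for _, size in files),
        "fileCount": len(files),
        "topLevelPaths": top_level[:MAX_TOP_LEVEL_PATHS],
        "createdBy": created_by,
        "createdAt": now,
        "updatedAt": now,
    }
```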

### asset_probe_reports

Purpose:

- retain richer structure-detection and validation output

Core fields:

- `_id`
- `assetId`
- `reportVersion`
- `detectedFormatCandidates`
- `structureSummary`
- `warnings`
- `recommendedNextNodes`
- `rawReport`
- `createdAt`

### datasets

Purpose:

- represent logical dataset identity

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `name`
- `type`
- `status`
- `latestVersionId`
- `summary`
- `createdBy`
- `createdAt`
- `updatedAt`

### dataset_versions

Purpose:

- represent immutable dataset snapshots

Core fields:

- `_id`
- `datasetId`
- `workspaceId`
- `projectId`
- `sourceAssetId`
- `parentVersionId`
- `versionTag`
- `canonicalSchemaVersion`
- `manifestRef`
- `stats`
- `summary`
- `status`
- `createdBy`
- `createdAt`

Versions live in their own collection because the number of versions per dataset keeps growing over time, while the `datasets` document stays a small head record.
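
Since each version carries a `parentVersionId`, lineage can be reconstructed by following that field. The sketch below walks the chain over an in-memory mapping; the helper name `version_lineage` and the dict-based input are assumptions for illustration, and a real implementation would fetch the documents from `dataset_versions`.

```python
def version_lineage(versions_by_id, version_id):
    """Return version ids from `version_id` back to the root version.

    `versions_by_id` maps a version `_id` to its document. A missing parent
    raises KeyError, and a cycle raises ValueError rather than looping forever.
    """
    chain = []
    seen = set()
    current = version_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in parentVersionId chain at {current}")
        seen.add(current)
        chain.append(current)
        current = versions_by_id[current].get("parentVersionId")
    return chain
```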

### workflow_definitions

Purpose:

- represent logical workflow identity

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `name`
- `slug`
- `status`
- `latestVersionNumber`
- `publishedVersionNumber`
- `createdBy`
- `createdAt`
- `updatedAt`

### workflow_definition_versions

Purpose:

- represent immutable workflow snapshots

Core fields:

- `_id`
- `workflowDefinitionId`
- `workspaceId`
- `projectId`
- `versionNumber`
- `visualGraph`
- `logicGraph`
- `runtimeGraph`
- `pluginRefs`
- `summary`
- `createdBy`
- `createdAt`

Splitting versions from workflow head metadata avoids oversized documents and simplifies history queries.
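
The head/version split can be sketched as a pure function that produces a new immutable snapshot together with an updated head document. This is an illustration only: the function name `append_workflow_version` and the shape of the `graphs` argument are assumptions, and making the two writes safe under concurrency (for example with a transaction) is outside this sketch.

```python
import copy
from datetime import datetime, timezone


def append_workflow_version(definition, graphs, created_by):
    """Return (updated_head, version_doc); the inputs are not mutated."""
    next_number = definition.get("latestVersionNumber", 0) + 1
    version_doc = {
        "workflowDefinitionId": definition["_id"],
        "workspaceId": definition["workspaceId"],
        "projectId": definition["projectId"],
        "versionNumber": next_number,
        # Deep copies keep the snapshot independent of later editor changes.
        "visualGraph": copy.deepcopy(graphs["visualGraph"]),
        "logicGraph": copy.deepcopy(graphs["logicGraph"]),
        "runtimeGraph": copy.deepcopy(graphs["runtimeGraph"]),
        "pluginRefs": list(graphs.get("pluginRefs", [])),
        "createdBy": created_by,
        "createdAt": datetime.now(timezone.utc),
    }
    # Only the head pointer moves; publishedVersionNumber is left untouched.
    updated_head = {
        **definition,
        "latestVersionNumber": next_number,
        "updatedAt": version_doc["createdAt"],
    }
    return updated_head, version_doc
```

The head document stays a few hundred bytes regardless of how many versions exist, while history queries read only `workflow_definition_versions`.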

### workflow_runs

Purpose:

- store execution runs
- snapshot the asset bindings chosen at run creation time
- support project-scoped run history queries without re-reading workflow versions

Core fields:

- `_id`
- `workflowDefinitionId`
- `workflowVersionId`
- `assetIds`
- `workspaceId`
- `projectId`
- `triggeredBy`
- `status`
- `runtimeSnapshot`
- `summary`
- `startedAt`
- `finishedAt`
- `createdAt`

### run_tasks

Purpose:

- store one execution unit per node per run
- keep bound asset context available to the worker at dequeue time

Core fields:

- `_id`
- `workflowRunId`
- `workflowVersionId`
- `nodeId`
- `nodeType`
- `executorType`
- `status`
- `attempt`
- `assetIds`
- `upstreamNodeIds`
- `outputArtifactIds`
- `logRef`
- `cacheKey`
- `cacheHit`
- `logLines`
- `errorMessage`
- `summary`
- `lastResultPreview`
- `startedAt`
- `finishedAt`
- `durationMs`
- `createdAt`

This collection should remain separate from `workflow_runs` because task volume grows quickly.

The current executable worker path expects each `run_tasks` document to be self-sufficient for dequeue and dependency promotion. V1 runtime tasks therefore already persist:

- executor choice
- bound asset ids
- upstream node dependencies
- produced artifact ids
- per-task status and error message
- task log lines and result preview
- structured task summaries with executor, outcome, asset count, and artifact ids
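
Dependency promotion over self-sufficient task documents might look like the sketch below: a pending task becomes ready once every node listed in its `upstreamNodeIds` has succeeded. The status values `pending` and `success` and the function name `promote_ready_tasks` are assumptions; the document does not define the status vocabulary.

```python
def promote_ready_tasks(tasks):
    """Return nodeIds of pending tasks whose upstream nodes all succeeded.

    `tasks` is the list of `run_tasks` documents for a single workflow run.
    Results follow the input order. Status names here are hypothetical.
    """
    status_by_node = {task["nodeId"]: task["status"] for task in tasks}
    ready = []
    for task in tasks:
        if task["status"] != "pending":
            continue
        upstream = task.get("upstreamNodeIds", [])
        # A task with no upstream nodes is ready immediately.
        if all(status_by_node.get(node_id) == "success" for node_id in upstream):
            ready.append(task["nodeId"])
    return ready
```

Because each task stores its own `upstreamNodeIds`, this check needs only the run's task documents and never has to re-read the workflow version graph.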

### artifacts

Purpose:

- store managed outputs and previews

Artifact types may include:

- preview bundle
- quality report
- normalized dataset package
- delivery package
- training config package
- intermediate task output

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `type`
- `producerType`
- `producerId`
- `storageRef`
- `previewable`
- `summary`
- `lineage`
- `createdBy`
- `createdAt`

### annotation_tasks

Purpose:

- track assignment and state of manual labeling work

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `targetType`
- `targetRef`
- `labelType`
- `status`
- `assigneeIds`
- `reviewerIds`
- `createdBy`
- `createdAt`
- `updatedAt`

### annotations

Purpose:

- persist annotation outputs

Core fields:

- `_id`
- `annotationTaskId`
- `workspaceId`
- `projectId`
- `targetRef`
- `payload`
- `status`
- `createdBy`
- `createdAt`
- `updatedAt`

### plugins

Purpose:

- track installable and enabled plugin versions

Core fields:

- `_id`
- `workspaceId` optional for workspace-scoped plugins
- `scope` as `platform` or `workspace`
- `name`
- `status`
- `currentVersion`
- `versions`
- `permissions`
- `metadata`
- `createdAt`
- `updatedAt`

V1 can keep versions nested in the `versions` field as long as the array stays small and bounded. If plugin version payloads become large, split them into a separate collection later.

### storage_connections

Purpose:

- store object storage and path registration configuration

Core fields:

- `_id`
- `workspaceId`
- `type`
- `provider`
- `name`
- `status`
- `config`
- `secretRef`
- `createdBy`
- `createdAt`
- `updatedAt`

Where possible, keep secrets out of plaintext document fields and reference them through `secretRef` instead.

### audit_logs

Purpose:

- append-only history of sensitive actions

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `actorId`
- `resourceType`
- `resourceId`
- `action`
- `beforeSummary`
- `afterSummary`
- `metadata`
- `createdAt`
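
A minimal sketch of building one audit entry is shown below. It assumes that `beforeSummary` and `afterSummary` hold only the fields that changed, and that the list of changed field names goes into `metadata`; both choices, along with the helper name `build_audit_entry`, are illustrative and not specified by this document.

```python
from datetime import datetime, timezone


def build_audit_entry(actor_id, workspace_id, project_id, resource_type,
                      resource_id, action, before, after):
    """Build an `audit_logs` document intended to be inserted, never updated."""
    changed = sorted(
        key for key in set(before) | set(after)
        if before.get(key) != after.get(key)
    )
    return {
        "workspaceId": workspace_id,
        "projectId": project_id,
        "actorId": actor_id,
        "resourceType": resource_type,
        "resourceId": resource_id,
        "action": action,
        # Assumption: summaries carry only changed fields to stay small.
        "beforeSummary": {k: before[k] for k in changed if k in before},
        "afterSummary": {k: after[k] for k in changed if k in after},
        "metadata": {"changedFields": changed},
        "createdAt": datetime.now(timezone.utc),
    }
```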

## Reference Strategy

Use stable ids between collections.

References should be explicit:

- asset to probe report
- dataset to dataset versions
- workflow definition to workflow versions
- workflow run to run tasks
- task to artifact
- annotation task to annotations

Do not depend on implicit path-based linkage.

## Index Recommendations

### Always index

- `workspaceId`
- `projectId`
- `status`
- `createdAt`

### Important compound indexes

- `memberships.workspaceId + memberships.userId`
- `projects.workspaceId + projects.slug`
- `assets.projectId + assets.type + assets.createdAt`
- `datasets.projectId + datasets.name`
- `dataset_versions.datasetId + dataset_versions.createdAt`
- `workflow_definitions.projectId + workflow_definitions.slug`
- `workflow_definition_versions.workflowDefinitionId + workflow_definition_versions.versionNumber`
- `workflow_runs.projectId + workflow_runs.createdAt`
- `workflow_runs.workflowDefinitionId + workflow_runs.status`
- `run_tasks.workflowRunId + run_tasks.nodeId`
- `artifacts.producerType + artifacts.producerId`
- `annotation_tasks.projectId + annotation_tasks.status`
- `audit_logs.workspaceId + audit_logs.createdAt`
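
The compound indexes above could be declared as plain data and applied at startup, as in the sketch below. It uses the `(field, direction)` key format that PyMongo's `Collection.create_index` accepts, but avoids importing the driver so it stays self-contained. The descending direction on `createdAt` is an assumption (newest-first listing), as are the names `COMPOUND_INDEXES` and `ensure_indexes`; the document does not specify sort direction or uniqueness.

```python
ASCENDING = 1    # same value as pymongo.ASCENDING
DESCENDING = -1  # same value as pymongo.DESCENDING

A, D = ASCENDING, DESCENDING

# collection name -> list of compound index key specs
COMPOUND_INDEXES = {
    "memberships": [[("workspaceId", A), ("userId", A)]],
    "projects": [[("workspaceId", A), ("slug", A)]],
    "assets": [[("projectId", A), ("type", A), ("createdAt", D)]],
    "datasets": [[("projectId", A), ("name", A)]],
    "dataset_versions": [[("datasetId", A), ("createdAt", D)]],
    "workflow_definitions": [[("projectId", A), ("slug", A)]],
    "workflow_definition_versions": [
        [("workflowDefinitionId", A), ("versionNumber", A)],
    ],
    "workflow_runs": [
        [("projectId", A), ("createdAt", D)],
        [("workflowDefinitionId", A), ("status", A)],
    ],
    "run_tasks": [[("workflowRunId", A), ("nodeId", A)]],
    "artifacts": [[("producerType", A), ("producerId", A)]],
    "annotation_tasks": [[("projectId", A), ("status", A)]],
    "audit_logs": [[("workspaceId", A), ("createdAt", D)]],
}


def ensure_indexes(db):
    """Create every declared index on `db` and return how many were requested.

    `db` is expected to behave like a PyMongo database: `db[name]` yields a
    collection object that exposes `create_index(keys)`.
    """
    count = 0
    for collection_name, key_specs in COMPOUND_INDEXES.items():
        for keys in key_specs:
            db[collection_name].create_index(keys)
            count += 1
    return count
```

Keeping the declarations in one table makes it easy to review them against the list above when a collection gains a new query path.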

## Object Storage References

MongoDB should store references such as:

- bucket
- key
- uri
- checksum
- content type
- size

It should not store:

- large binary file payloads
- full raw video content
- giant archive contents
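
A `storageRef` value following this split could be modeled as a small value object that carries only pointers and integrity metadata. The sketch below is one possible shape: the class name `StorageRef`, the camelCase key `contentType`, the `s3` default scheme, and the checksum format are assumptions, not part of this design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageRef:
    bucket: str
    key: str
    checksum: str         # e.g. "sha256:<hex>"; the format is an assumption
    content_type: str
    size: int             # size in bytes
    scheme: str = "s3"    # hypothetical default; depends on the provider

    @property
    def uri(self) -> str:
        return f"{self.scheme}://{self.bucket}/{self.key}"

    def to_document(self) -> dict:
        """Return the reference as a MongoDB-ready subdocument.

        Only pointers and metadata are included; the payload itself stays
        in object storage.
        """
        if self.size < 0:
            raise ValueError("size must be non-negative")
        return {
            "bucket": self.bucket,
            "key": self.key,
            "uri": self.uri,
            "checksum": self.checksum,
            "contentType": self.content_type,
            "size": self.size,
        }
```

For example, `StorageRef("raw", "p1/run.bag", "sha256:abc", "application/octet-stream", 1024).to_document()` yields a six-field subdocument whose `uri` is `s3://raw/p1/run.bag`.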

## V1 Constraints

- MongoDB is the only database
- No relational sidecar is assumed
- No GridFS-first strategy is assumed
- Large manifests may live in object storage and be referenced from MongoDB

## V1 Non-Goals

The V1 model does not need:

- cross-region data distribution
- advanced event sourcing
- fully normalized analytics warehouse modeling
- high-volume search indexing inside MongoDB itself