# EmboFlow MongoDB Data Model

## Goal

Define the MongoDB-only persistence model for EmboFlow V1.

The database must support:

- user and workspace isolation
- raw asset tracking
- canonical dataset versions
- workflow versioning
- workflow execution history
- plugin registration
- auditability

## Storage Principles

- MongoDB stores metadata and execution state
- Object storage stores large binary files and large derived bundles
- MongoDB documents should have clear aggregate boundaries
- Large, fast-growing arrays should be split into separate collections
- Platform contracts should use references, not embedded file blobs

## Current V1 Implementation Notes

The first code pass stabilized these collection boundaries with in-memory services. The executable local runtime now persists the core objects below into MongoDB.

This means the implementation now validates:

- document shapes
- controller and service boundaries
- workflow/run/task separation
- artifact lookup by producer
- asset persistence and probe reports through Mongo-backed collections

while still targeting the collection model below as the persistent shape.

## Primary Collections

- `users`
- `workspaces`
- `projects`
- `memberships`
- `assets`
- `asset_probe_reports`
- `datasets`
- `dataset_versions`
- `workflow_definitions`
- `workflow_definition_versions`
- `workflow_runs`
- `run_tasks`
- `artifacts`
- `annotation_tasks`
- `annotations`
- `plugins`
- `storage_connections`
- `custom_nodes`
- `audit_logs`

## Collection Design

### users

Purpose:

- account identity
- profile
- login metadata

Core fields:

- `_id`
- `email`
- `displayName`
- `avatarUrl`
- `status`
- `lastLoginAt`
- `createdAt`
- `updatedAt`

### workspaces

Purpose:

- resource ownership boundary

Core fields:

- `_id`
- `type` as `personal` or `team`
- `name`
- `slug`
- `ownerId`
- `status`
- `settings`
- `createdAt`
- `updatedAt`

### memberships

Purpose:

- workspace and project role mapping

Core fields:

- `_id`
- `workspaceId`
- `projectId` optional
- `userId`
- `role`
- `status`
- `createdAt`
- `updatedAt`

This collection should stay independent instead of embedding large member arrays on every resource.

### projects

Purpose:

- project-scoped grouping for assets, workflows, runs, and outputs

Core fields:

- `_id`
- `workspaceId`
- `name`
- `slug`
- `description`
- `status`
- `createdBy`
- `createdAt`
- `updatedAt`

### assets

Purpose:

- represent raw uploaded or imported inputs

Supported asset types:

- `raw_file`
- `archive`
- `folder`
- `video_collection`
- `standard_dataset`
- `rosbag`
- `hdf5_dataset`
- `object_storage_prefix`

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `type`
- `sourceType`
- `displayName`
- `status`
- `storageRef`
- `sizeBytes`
- `fileCount`
- `topLevelPaths`
- `detectedFormats`
- `summary`
- `createdBy`
- `createdAt`
- `updatedAt`

Do not embed full large file listings in this document.

### asset_probe_reports

Purpose:

- retain richer structure-detection and validation output

Core fields:

- `_id`
- `assetId`
- `reportVersion`
- `detectedFormatCandidates`
- `structureSummary`
- `warnings`
- `recommendedNextNodes`
- `rawReport`
- `createdAt`

### datasets

Purpose:

- represent logical dataset identity

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `name`
- `type`
- `status`
- `latestVersionId`
- `summary`
- `createdBy`
- `createdAt`
- `updatedAt`

### custom_nodes

Purpose:

- store project-scoped custom container node definitions

Core fields:

- `_id`
- `definitionId`
- `workspaceId`
- `projectId`
- `name`
- `slug`
- `description`
- `category`
- `status`
- `contract`
- `source`
- `createdBy`
- `createdAt`
- `updatedAt`

The current V1 implementation stores the custom node source as either:

- an existing Docker image reference
- a self-contained Dockerfile body plus an image tag

The node contract is persisted with the node definition so the API can expose correct node metadata to the editor and the worker can validate runtime outputs.

### dataset_versions

Purpose:

- represent immutable dataset snapshots

Core fields:

- `_id`
- `datasetId`
- `workspaceId`
- `projectId`
- `sourceAssetId`
- `parentVersionId`
- `versionTag`
- `canonicalSchemaVersion`
- `manifestRef`
- `stats`
- `summary`
- `status`
- `createdBy`
- `createdAt`

This collection is separated because versions will grow over time.

### workflow_definitions

Purpose:

- represent logical workflow identity

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `name`
- `slug`
- `status`
- `latestVersionNumber`
- `publishedVersionNumber`
- `createdBy`
- `createdAt`
- `updatedAt`

### workflow_definition_versions

Purpose:

- represent immutable workflow snapshots

Core fields:

- `_id`
- `workflowDefinitionId`
- `workspaceId`
- `projectId`
- `versionNumber`
- `visualGraph`
- `logicGraph`
- `runtimeGraph`
- `pluginRefs`
- `summary`
- `createdBy`
- `createdAt`

Splitting versions from workflow head metadata avoids oversized documents and simplifies history queries.

### workflow_runs

Purpose:

- store execution runs
- snapshot the asset bindings chosen at run creation time
- support project-scoped run history queries without re-reading workflow versions

Core fields:

- `_id`
- `workflowDefinitionId`
- `workflowVersionId`
- `assetIds`
- `workspaceId`
- `projectId`
- `triggeredBy`
- `status`
- `runtimeSnapshot`
- `summary`
- `startedAt`
- `finishedAt`
- `durationMs`
- `createdAt`

### run_tasks

Purpose:

- store one execution unit per node per run
- keep bound asset context available to the worker at dequeue time

Core fields:

- `_id`
- `workflowRunId`
- `workflowVersionId`
- `nodeId`
- `nodeType`
- `nodeDefinitionId`
- `executorType`
- `executorConfig`
- `codeHookSpec`
- `artifactType`
- `artifactTitle`
- `status`
- `attempt`
- `assetIds`
- `upstreamNodeIds`
- `outputArtifactIds`
- `logRef`
- `cacheKey`
- `cacheHit`
- `logLines`
- `stdoutLines`
- `stderrLines`
- `errorMessage`
- `summary`
- `lastResultPreview`
- `startedAt`
- `finishedAt`
- `durationMs`
- `createdAt`

This collection should remain separate from `workflow_runs` because task volume grows quickly.

The current executable worker path expects `run_tasks` to be self-sufficient enough for dequeue and dependency promotion. That means V1 runtime tasks already persist:

- executor choice
- node definition id and frozen per-node runtime config
- bound asset ids at run creation time, then the effective asset ids that were actually executed after any upstream set-operation narrowing
- upstream node dependencies
- produced artifact ids
- per-task status and error message
- task log lines, stdout/stderr streams, and result preview
- structured task summaries with executor, outcome, asset count, artifact ids, and stdout/stderr counters

The current runtime also aggregates task execution back onto `workflow_runs`, so run documents now carry:

- a frozen `runtimeSnapshot` copied from the workflow version runtime layer at run creation time
- task counts by status
- completed task count
- artifact count
- total stdout/stderr line counts
- failed task ids
- derived run duration

The current runtime control loop also mutates these collections in place for retry/cancel operations:

- cancelling a run marks queued and pending `run_tasks` as `cancelled`
- retrying a run creates a new `workflow_runs` document plus a fresh set of `run_tasks`
- retrying a task resets the target node and downstream subtree on the existing run, clears task execution fields, and increments the retried task attempt count

### artifacts

Purpose:

- store managed outputs and previews

Artifact types may include:

- preview bundle
- quality report
- normalized dataset package
- delivery package
- training config package
- intermediate task output

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `type`
- `producerType`
- `producerId`
- `storageRef`
- `previewable`
- `summary`
- `lineage`
- `createdBy`
- `createdAt`

### workflow_runs and run_tasks input binding note

The current V1 runtime now stores workflow input selection in three layers:

- `inputBindings`
  The explicit operator-facing selection such as `[{ kind: "dataset", id: "dataset-..." }]`
- `assetIds`
  The resolved runnable asset ids after dataset expansion and deduplication
- `datasetIds`
  The explicit dataset ids that participated in the run or task

This keeps execution backward-compatible for asset-oriented nodes while preserving the higher-level project data model in run history and task detail.

### annotation_tasks

Purpose:

- track assignment and state of manual labeling work

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `targetType`
- `targetRef`
- `labelType`
- `status`
- `assigneeIds`
- `reviewerIds`
- `createdBy`
- `createdAt`
- `updatedAt`

### annotations

Purpose:

- persist annotation outputs

Core fields:

- `_id`
- `annotationTaskId`
- `workspaceId`
- `projectId`
- `targetRef`
- `payload`
- `status`
- `createdBy`
- `createdAt`
- `updatedAt`

### plugins

Purpose:

- track installable and enabled plugin versions

Core fields:

- `_id`
- `workspaceId` optional for workspace-scoped plugins
- `scope` as `platform` or `workspace`
- `name`
- `status`
- `currentVersion`
- `versions`
- `permissions`
- `metadata`
- `createdAt`
- `updatedAt`

If plugin version payloads become large, split versions into a separate collection later. V1 can keep them nested if bounded.

### storage_connections

Purpose:

- store object storage and path registration configuration

Core fields:

- `_id`
- `workspaceId`
- `type`
- `provider`
- `name`
- `status`
- `config`
- `secretRef`
- `createdBy`
- `createdAt`
- `updatedAt`

Store secrets outside plaintext document fields where possible.

### audit_logs

Purpose:

- append-only history of sensitive actions

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `actorId`
- `resourceType`
- `resourceId`
- `action`
- `beforeSummary`
- `afterSummary`
- `metadata`
- `createdAt`

## Reference Strategy

Use stable ids between collections.

References should be explicit:

- asset to probe report
- dataset to dataset versions
- workflow definition to workflow versions
- workflow run to run tasks
- task to artifact
- annotation task to annotations

Do not depend on implicit path-based linkage.

## Index Recommendations

### Always index

- `workspaceId`
- `projectId`
- `status`
- `createdAt`

### Important compound indexes

- `memberships.workspaceId + memberships.userId`
- `projects.workspaceId + projects.slug`
- `assets.projectId + assets.type + assets.createdAt`
- `datasets.projectId + datasets.name`
- `dataset_versions.datasetId + dataset_versions.createdAt`
- `workflow_definitions.projectId + workflow_definitions.slug`
- `workflow_definition_versions.workflowDefinitionId + versionNumber`
- `workflow_runs.projectId + createdAt`
- `workflow_runs.workflowDefinitionId + status`
- `run_tasks.workflowRunId + nodeId`
- `artifacts.producerType + producerId`
- `annotation_tasks.projectId + status`
- `audit_logs.workspaceId + createdAt`

## Object Storage References

MongoDB should store references such as:

- bucket
- key
- uri
- checksum
- content type
- size

It should not store:

- large binary file payloads
- full raw video content
- giant archive contents

## V1 Constraints

- MongoDB is the only database
- No relational sidecar is assumed
- No GridFS-first strategy is assumed
- Large manifests may live in object storage and be referenced from MongoDB

## V1 Non-Goals

The V1 model does not need:

- cross-region data distribution
- advanced event sourcing
- fully normalized analytics warehouse modeling
- high-volume search indexing inside MongoDB itself