EmboFlow/design/05-data/mongodb-data-model.md
2026-03-30 03:02:44 +08:00

EmboFlow MongoDB Data Model

Goal

Define the MongoDB-only persistence model for EmboFlow V1.

The database must support:

  • user and workspace isolation
  • raw asset tracking
  • canonical dataset versions
  • workflow versioning
  • workflow execution history
  • plugin registration
  • auditability

Storage Principles

  • MongoDB stores metadata and execution state
  • Object storage stores large binary files and large derived bundles
  • MongoDB documents should have clear aggregate boundaries
  • Large, fast-growing arrays should be split into separate collections
  • Platform contracts should use references, not embedded file blobs

Current V1 Implementation Notes

The first code pass stabilized these collection boundaries with in-memory services. The executable local runtime now persists the core objects below into MongoDB.

This means the implementation now validates:

  • document shapes
  • controller and service boundaries
  • workflow/run/task separation
  • artifact lookup by producer
  • asset persistence and probe reports through Mongo-backed collections

The collection model below remains the target persistent shape.

Primary Collections

  • users
  • workspaces
  • projects
  • memberships
  • assets
  • asset_probe_reports
  • datasets
  • dataset_versions
  • workflow_definitions
  • workflow_definition_versions
  • workflow_runs
  • run_tasks
  • artifacts
  • annotation_tasks
  • annotations
  • plugins
  • storage_connections
  • custom_nodes
  • audit_logs

Collection Design

users

Purpose:

  • account identity
  • profile
  • login metadata

Core fields:

  • _id
  • email
  • displayName
  • avatarUrl
  • status
  • lastLoginAt
  • createdAt
  • updatedAt

workspaces

Purpose:

  • resource ownership boundary

Core fields:

  • _id
  • type (personal or team)
  • name
  • slug
  • ownerId
  • status
  • settings
  • createdAt
  • updatedAt

memberships

Purpose:

  • workspace and project role mapping

Core fields:

  • _id
  • workspaceId
  • projectId (optional)
  • userId
  • role
  • status
  • createdAt
  • updatedAt

This collection should stay independent instead of embedding large member arrays on every resource.
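
As a minimal sketch of how a standalone memberships collection might be read, the function below resolves a user's role from a list of membership documents. The role names, the "active" status value, the rule that a project-level membership takes precedence over a workspace-level one, and the name resolve_role are all assumptions for illustration; the design above does not fix any of them.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the memberships collection.
MEMBERSHIPS = [
    {"_id": "m1", "workspaceId": "ws1", "projectId": None, "userId": "u1",
     "role": "viewer", "status": "active"},
    {"_id": "m2", "workspaceId": "ws1", "projectId": "p1", "userId": "u1",
     "role": "editor", "status": "active"},
    {"_id": "m3", "workspaceId": "ws1", "projectId": None, "userId": "u2",
     "role": "owner", "status": "suspended"},
]


def resolve_role(memberships, workspace_id, user_id, project_id=None) -> Optional[str]:
    """Return the project-level role if present, else the workspace-level role."""
    active = [
        m for m in memberships
        if m["workspaceId"] == workspace_id
        and m["userId"] == user_id
        and m["status"] == "active"  # assumed status value
    ]
    if project_id is not None:
        for m in active:
            if m.get("projectId") == project_id:
                return m["role"]
    # Fall back to the workspace-wide membership (no projectId set).
    for m in active:
        if m.get("projectId") is None:
            return m["role"]
    return None
```

Because each membership is its own document, a lookup like this touches only the rows for one workspace and user instead of loading a member array embedded in a resource.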

projects

Purpose:

  • project-scoped grouping for assets, workflows, runs, and outputs

Core fields:

  • _id
  • workspaceId
  • name
  • slug
  • description
  • status
  • createdBy
  • createdAt
  • updatedAt

assets

Purpose:

  • represent raw uploaded or imported inputs

Supported asset types:

  • raw_file
  • archive
  • folder
  • video_collection
  • standard_dataset
  • rosbag
  • hdf5_dataset
  • object_storage_prefix

Core fields:

  • _id
  • workspaceId
  • projectId
  • type
  • sourceType
  • displayName
  • status
  • storageRef
  • sizeBytes
  • fileCount
  • topLevelPaths
  • detectedFormats
  • summary
  • createdBy
  • createdAt
  • updatedAt

Do not embed full large file listings in this document.

asset_probe_reports

Purpose:

  • retain richer structure-detection and validation output

Core fields:

  • _id
  • assetId
  • reportVersion
  • detectedFormatCandidates
  • structureSummary
  • warnings
  • recommendedNextNodes
  • rawReport
  • createdAt

datasets

Purpose:

  • represent logical dataset identity

Core fields:

  • _id
  • workspaceId
  • projectId
  • name
  • type
  • status
  • latestVersionId
  • summary
  • createdBy
  • createdAt
  • updatedAt

custom_nodes

Purpose:

  • store project-scoped custom container node definitions

Core fields:

  • _id
  • definitionId
  • workspaceId
  • projectId
  • name
  • slug
  • description
  • category
  • status
  • contract
  • source
  • createdBy
  • createdAt
  • updatedAt

The current V1 implementation stores the custom node source as either:

  • an existing Docker image reference
  • a self-contained Dockerfile body plus an image tag

The node contract is persisted with the node definition so the API can expose correct node metadata to the editor and the worker can validate runtime outputs.
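
The two source variants could be checked with a small validator like the sketch below. The field names kind, image, dockerfileContent, and imageTag are hypothetical; the document only states that a source is either an image reference or a Dockerfile body plus an image tag.

```python
def validate_custom_node_source(source: dict) -> str:
    """Return the source kind, or raise ValueError if the shape is invalid.

    Field names here are illustrative assumptions, not the V1 schema.
    """
    kind = source.get("kind")
    if kind == "image":
        # Variant 1: an existing Docker image reference.
        if not source.get("image"):
            raise ValueError("image source requires a non-empty 'image'")
        return "image"
    if kind == "dockerfile":
        # Variant 2: a self-contained Dockerfile body plus an image tag.
        if not source.get("dockerfileContent") or not source.get("imageTag"):
            raise ValueError(
                "dockerfile source requires 'dockerfileContent' and 'imageTag'"
            )
        return "dockerfile"
    raise ValueError(f"unknown source kind: {kind!r}")
```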

dataset_versions

Purpose:

  • represent immutable dataset snapshots

Core fields:

  • _id
  • datasetId
  • workspaceId
  • projectId
  • sourceAssetId
  • parentVersionId
  • versionTag
  • canonicalSchemaVersion
  • manifestRef
  • stats
  • summary
  • status
  • createdBy
  • createdAt

Versions live in their own collection, separate from datasets, because the number of versions per dataset keeps growing over time.
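
Since each version document carries a parentVersionId, version history can be reconstructed by following those links. The sketch below assumes that a root version stores parentVersionId as None and that the versions have already been loaded into a dict keyed by _id; version_lineage is a hypothetical helper name.

```python
def version_lineage(versions_by_id: dict, version_id: str) -> list:
    """Walk parentVersionId links from a version back to its root.

    Returns version tags ordered from newest to oldest.
    """
    chain = []
    seen = set()
    current = version_id
    while current is not None:
        if current in seen:
            # Immutable snapshots should never form a loop; fail loudly if they do.
            raise ValueError(f"cycle detected at version {current!r}")
        seen.add(current)
        doc = versions_by_id[current]
        chain.append(doc["versionTag"])
        current = doc.get("parentVersionId")
    return chain
```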

workflow_definitions

Purpose:

  • represent logical workflow identity

Core fields:

  • _id
  • workspaceId
  • projectId
  • name
  • slug
  • status
  • latestVersionNumber
  • publishedVersionNumber
  • createdBy
  • createdAt
  • updatedAt

workflow_definition_versions

Purpose:

  • represent immutable workflow snapshots

Core fields:

  • _id
  • workflowDefinitionId
  • workspaceId
  • projectId
  • versionNumber
  • visualGraph
  • logicGraph
  • runtimeGraph
  • pluginRefs
  • summary
  • createdBy
  • createdAt

Splitting versions from workflow head metadata avoids oversized documents and simplifies history queries.
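
The head/version split can be illustrated with a pure function that produces a new immutable version document and an updated head. This is a sketch under assumptions: version numbers are taken to start at 1 and increase by one, and create_workflow_version is a hypothetical name, not the service's actual API.

```python
import copy
from datetime import datetime, timezone


def create_workflow_version(head: dict, graphs: dict, created_by: str) -> tuple:
    """Return (updated_head, new_version) without mutating the inputs."""
    next_number = head.get("latestVersionNumber", 0) + 1  # assumed numbering scheme
    now = datetime.now(timezone.utc)
    version = {
        "workflowDefinitionId": head["_id"],
        "workspaceId": head["workspaceId"],
        "projectId": head["projectId"],
        "versionNumber": next_number,
        # Deep copies keep the snapshot independent of later editor changes.
        "visualGraph": copy.deepcopy(graphs["visualGraph"]),
        "logicGraph": copy.deepcopy(graphs["logicGraph"]),
        "runtimeGraph": copy.deepcopy(graphs["runtimeGraph"]),
        "pluginRefs": list(graphs.get("pluginRefs", [])),
        "createdBy": created_by,
        "createdAt": now,
    }
    # The head stays small: it only tracks which version number is latest.
    updated_head = {**head, "latestVersionNumber": next_number, "updatedAt": now}
    return updated_head, version
```

The head document never holds graph payloads, so its size stays constant no matter how many versions accumulate.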

workflow_runs

Purpose:

  • store execution runs
  • snapshot the asset bindings chosen at run creation time
  • support project-scoped run history queries without re-reading workflow versions

Core fields:

  • _id
  • workflowDefinitionId
  • workflowVersionId
  • assetIds
  • workspaceId
  • projectId
  • triggeredBy
  • status
  • runtimeSnapshot
  • summary
  • startedAt
  • finishedAt
  • durationMs
  • createdAt

run_tasks

Purpose:

  • store one execution unit per node per run
  • keep bound asset context available to the worker at dequeue time

Core fields:

  • _id
  • workflowRunId
  • workflowVersionId
  • nodeId
  • nodeType
  • nodeDefinitionId
  • executorType
  • executorConfig
  • codeHookSpec
  • artifactType
  • artifactTitle
  • status
  • attempt
  • assetIds
  • upstreamNodeIds
  • outputArtifactIds
  • logRef
  • cacheKey
  • cacheHit
  • logLines
  • stdoutLines
  • stderrLines
  • errorMessage
  • summary
  • lastResultPreview
  • startedAt
  • finishedAt
  • durationMs
  • createdAt

This collection should remain separate from workflow_runs because task volume grows quickly.

The current executable worker path expects run_tasks to be self-sufficient enough for dequeue and dependency promotion. That means V1 runtime tasks already persist:

  • executor choice
  • node definition id and frozen per-node runtime config
  • the asset ids bound at run creation time, and then the effective asset ids that were actually executed after any upstream set operation narrowed the set
  • upstream node dependencies
  • produced artifact ids
  • per-task status and error message
  • task log lines, stdout/stderr streams, and result preview
  • structured task summaries with executor, outcome, asset count, artifact ids, and stdout/stderr counters
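
Dependency promotion over self-sufficient task documents might look like the sketch below. The status values "pending", "queued", and "success" and the function name promote_ready_tasks are assumptions; the document names queued, pending, and cancelled states but does not list the full status set.

```python
def promote_ready_tasks(tasks: list) -> list:
    """Move pending tasks to queued once every upstream node has succeeded.

    Mutates the task dicts in place and returns the promoted node ids.
    """
    # Snapshot statuses first so one pass cannot cascade through the graph.
    status_by_node = {t["nodeId"]: t["status"] for t in tasks}
    promoted = []
    for task in tasks:
        if task["status"] != "pending":
            continue
        upstream_done = all(
            status_by_node.get(node_id) == "success"  # assumed terminal-success value
            for node_id in task["upstreamNodeIds"]
        )
        if upstream_done:
            task["status"] = "queued"
            promoted.append(task["nodeId"])
    return promoted
```

Only fields stored on run_tasks are consulted here, which is the point of keeping upstreamNodeIds and status on each task: the worker does not need to re-read the workflow version to decide what can run next.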

The current runtime also aggregates task execution back onto workflow_runs, so run documents now carry:

  • a frozen runtimeSnapshot copied from the workflow version runtime layer at run creation time
  • task counts by status
  • completed task count
  • artifact count
  • total stdout/stderr line counts
  • failed task ids
  • derived run duration
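
A rough sketch of that aggregation, computed from a run's task documents, is shown below. The summary key names, the "success" and "failed" status values, and the choice to count only successful tasks as completed are assumptions. Duration is omitted because its derivation depends on how startedAt and finishedAt are stored.

```python
from collections import Counter


def summarize_run(tasks: list) -> dict:
    """Aggregate per-task execution fields into a run-level summary."""
    counts = Counter(t["status"] for t in tasks)
    return {
        "taskCounts": dict(counts),
        # Assumption: "completed" means tasks that finished successfully.
        "completedTaskCount": counts.get("success", 0),
        "artifactCount": sum(len(t.get("outputArtifactIds", [])) for t in tasks),
        "stdoutLineCount": sum(len(t.get("stdoutLines", [])) for t in tasks),
        "stderrLineCount": sum(len(t.get("stderrLines", [])) for t in tasks),
        "failedTaskIds": [t["_id"] for t in tasks if t["status"] == "failed"],
    }
```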

The current runtime control loop also updates these collections for retry and cancel operations:

  • cancelling a run marks queued and pending run_tasks as cancelled
  • retrying a run creates a new workflow_runs document plus a fresh set of run_tasks
  • retrying a task resets the target node and downstream subtree on the existing run, clears task execution fields, and increments the retried task attempt count
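
The task-level retry described in the last bullet can be sketched as a graph walk over upstreamNodeIds. The exact set of fields cleared, the statuses assigned after the reset ("queued" for the retried node, "pending" for its downstream subtree), and the name retry_task are assumptions made for illustration.

```python
def retry_task(tasks: list, node_id: str) -> set:
    """Reset node_id and everything downstream of it; return the reset node ids."""
    # Invert upstreamNodeIds into a downstream adjacency map.
    downstream = {}
    for task in tasks:
        for upstream_id in task["upstreamNodeIds"]:
            downstream.setdefault(upstream_id, []).append(task["nodeId"])

    # Depth-first walk to collect the target node and its downstream subtree.
    to_reset, stack = set(), [node_id]
    while stack:
        current = stack.pop()
        if current in to_reset:
            continue
        to_reset.add(current)
        stack.extend(downstream.get(current, []))

    for task in tasks:
        if task["nodeId"] not in to_reset:
            continue
        is_target = task["nodeId"] == node_id
        task["status"] = "queued" if is_target else "pending"  # assumed statuses
        # Clear execution fields (assumed subset of the run_tasks fields).
        task["outputArtifactIds"] = []
        task["errorMessage"] = None
        task["startedAt"] = None
        task["finishedAt"] = None
        if is_target:
            # Only the retried task's attempt count is incremented.
            task["attempt"] = task.get("attempt", 1) + 1
    return to_reset
```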

artifacts

Purpose:

  • store managed outputs and previews

Artifact types may include:

  • preview bundle
  • quality report
  • normalized dataset package
  • delivery package
  • training config package
  • intermediate task output

Core fields:

  • _id
  • workspaceId
  • projectId
  • type
  • producerType
  • producerId
  • storageRef
  • previewable
  • summary
  • lineage
  • createdBy
  • createdAt

annotation_tasks

Purpose:

  • track assignment and state of manual labeling work

Core fields:

  • _id
  • workspaceId
  • projectId
  • targetType
  • targetRef
  • labelType
  • status
  • assigneeIds
  • reviewerIds
  • createdBy
  • createdAt
  • updatedAt

annotations

Purpose:

  • persist annotation outputs

Core fields:

  • _id
  • annotationTaskId
  • workspaceId
  • projectId
  • targetRef
  • payload
  • status
  • createdBy
  • createdAt
  • updatedAt

plugins

Purpose:

  • track installable and enabled plugin versions

Core fields:

  • _id
  • workspaceId (optional; set only for workspace-scoped plugins)
  • scope (platform or workspace)
  • name
  • status
  • currentVersion
  • versions
  • permissions
  • metadata
  • createdAt
  • updatedAt

V1 can keep versions nested in the plugin document as long as the array stays bounded. If version payloads become large, split them into a separate collection.

storage_connections

Purpose:

  • store object storage and path registration configuration

Core fields:

  • _id
  • workspaceId
  • type
  • provider
  • name
  • status
  • config
  • secretRef
  • createdBy
  • createdAt
  • updatedAt

Where possible, keep secrets out of plaintext document fields and store only a reference to them in secretRef.

audit_logs

Purpose:

  • append-only history of sensitive actions

Core fields:

  • _id
  • workspaceId
  • projectId
  • actorId
  • resourceType
  • resourceId
  • action
  • beforeSummary
  • afterSummary
  • metadata
  • createdAt
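
An append-only entry could be assembled as in the sketch below, which stores only a small summary of the resource before and after the action instead of full copies. The choice of summarized fields and the helper name build_audit_entry are assumptions.

```python
from datetime import datetime, timezone


def build_audit_entry(*, workspace_id, project_id, actor_id, resource_type,
                      resource_id, action, before=None, after=None,
                      summary_fields=("name", "status")):
    """Build an audit_logs document; the caller only ever inserts it."""

    def summarize(doc):
        # Keep the entry small: copy only the selected fields that exist.
        if doc is None:
            return None
        return {k: doc[k] for k in summary_fields if k in doc}

    return {
        "workspaceId": workspace_id,
        "projectId": project_id,
        "actorId": actor_id,
        "resourceType": resource_type,
        "resourceId": resource_id,
        "action": action,
        "beforeSummary": summarize(before),
        "afterSummary": summarize(after),
        "metadata": {},
        "createdAt": datetime.now(timezone.utc),
    }
```

There is deliberately no updatedAt field in the entry, matching the field list above: an append-only log is written once and never modified.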

Reference Strategy

Use stable ids between collections.

References should be explicit:

  • asset to probe report
  • dataset to dataset versions
  • workflow definition to workflow versions
  • workflow run to run tasks
  • task to artifact
  • annotation task to annotations

Do not depend on implicit path-based linkage.

Index Recommendations

Always index

  • workspaceId
  • projectId
  • status
  • createdAt

Important compound indexes

  • memberships.workspaceId + memberships.userId
  • projects.workspaceId + projects.slug
  • assets.projectId + assets.type + assets.createdAt
  • datasets.projectId + datasets.name
  • dataset_versions.datasetId + dataset_versions.createdAt
  • workflow_definitions.projectId + workflow_definitions.slug
  • workflow_definition_versions.workflowDefinitionId + workflow_definition_versions.versionNumber
  • workflow_runs.projectId + workflow_runs.createdAt
  • workflow_runs.workflowDefinitionId + workflow_runs.status
  • run_tasks.workflowRunId + run_tasks.nodeId
  • artifacts.producerType + artifacts.producerId
  • annotation_tasks.projectId + annotation_tasks.status
  • audit_logs.workspaceId + audit_logs.createdAt
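
The compound indexes above can be captured as plain data and expanded into key specifications. In this sketch every key is ascending, which is an assumption: the recommendations do not state sort directions, and history-style queries on createdAt may prefer descending order. Uniqueness is also left unspecified.

```python
ASCENDING = 1  # same integer value that pymongo exposes as pymongo.ASCENDING

# Collection name -> list of compound index field tuples, as listed above.
COMPOUND_INDEXES = {
    "memberships": [("workspaceId", "userId")],
    "projects": [("workspaceId", "slug")],
    "assets": [("projectId", "type", "createdAt")],
    "datasets": [("projectId", "name")],
    "dataset_versions": [("datasetId", "createdAt")],
    "workflow_definitions": [("projectId", "slug")],
    "workflow_definition_versions": [("workflowDefinitionId", "versionNumber")],
    "workflow_runs": [("projectId", "createdAt"), ("workflowDefinitionId", "status")],
    "run_tasks": [("workflowRunId", "nodeId")],
    "artifacts": [("producerType", "producerId")],
    "annotation_tasks": [("projectId", "status")],
    "audit_logs": [("workspaceId", "createdAt")],
}


def index_specs(indexes: dict) -> list:
    """Expand the table into (collection, [(field, direction), ...]) pairs."""
    return [
        (collection, [(field, ASCENDING) for field in fields])
        for collection, groups in indexes.items()
        for fields in groups
    ]
```

With PyMongo, each pair could then be applied with something like db[collection].create_index(keys), where db is a database handle that this sketch does not create.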

Object Storage References

MongoDB should store references such as:

  • bucket
  • key
  • uri
  • checksum
  • content type
  • size

It should not store:

  • large binary file payloads
  • full raw video content
  • giant archive contents
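
A storage reference of this kind might be modeled as a small value object, as in the sketch below. The s3:// URI scheme, the camelCase document keys, and the StorageRef name are assumptions; the design lists only the kinds of information a reference should carry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageRef:
    """Pointer to an object in object storage; never holds the payload itself."""
    bucket: str
    key: str
    checksum: str
    content_type: str
    size: int

    @property
    def uri(self) -> str:
        # Assumed URI scheme; other providers would use a different prefix.
        return f"s3://{self.bucket}/{self.key}"

    def to_document(self) -> dict:
        """Render the reference as the subdocument stored in MongoDB."""
        return {
            "bucket": self.bucket,
            "key": self.key,
            "uri": self.uri,
            "checksum": self.checksum,
            "contentType": self.content_type,
            "size": self.size,
        }
```

A document such as an asset or artifact would embed the result of to_document() in its storageRef field, so the MongoDB document stays a few hundred bytes regardless of how large the referenced file is.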

V1 Constraints

  • MongoDB is the only database
  • No relational sidecar is assumed
  • No GridFS-first strategy is assumed
  • Large manifests may live in object storage and be referenced from MongoDB

V1 Non-Goals

The V1 model does not need:

  • cross-region data distribution
  • advanced event sourcing
  • fully normalized analytics warehouse modeling
  • high-volume search indexing inside MongoDB itself