EmboFlow MongoDB Data Model

Goal

Define the MongoDB-only persistence model for EmboFlow V1.

The database must support:

  • user and workspace isolation
  • raw asset tracking
  • canonical dataset versions
  • workflow versioning
  • workflow execution history
  • plugin registration
  • auditability

Storage Principles

  • MongoDB stores metadata and execution state
  • Object storage stores large binary files and large derived bundles
  • MongoDB documents should have clear aggregate boundaries
  • Large, fast-growing arrays should be split into separate collections
  • Platform contracts should use references, not embedded file blobs

Primary Collections

  • users
  • workspaces
  • projects
  • memberships
  • assets
  • asset_probe_reports
  • datasets
  • dataset_versions
  • workflow_definitions
  • workflow_definition_versions
  • workflow_runs
  • run_tasks
  • artifacts
  • annotation_tasks
  • annotations
  • plugins
  • storage_connections
  • audit_logs

Collection Design

users

Purpose:

  • account identity
  • profile
  • login metadata

Core fields:

  • _id
  • email
  • displayName
  • avatarUrl
  • status
  • lastLoginAt
  • createdAt
  • updatedAt

workspaces

Purpose:

  • resource ownership boundary

Core fields:

  • _id
  • type (personal or team)
  • name
  • slug
  • ownerId
  • status
  • settings
  • createdAt
  • updatedAt

memberships

Purpose:

  • workspace and project role mapping

Core fields:

  • _id
  • workspaceId
  • projectId (optional)
  • userId
  • role
  • status
  • createdAt
  • updatedAt

This collection should stay independent instead of embedding large member arrays on every resource.
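
As an illustration of how a standalone memberships collection can be queried, the sketch below resolves a user's effective role from a list of membership documents. The precedence rule (a project-level membership overrides the workspace-level one), the `active` status value, and the role names are assumptions made for the example, not part of the model above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Membership:
    """In-memory stand-in for one memberships document."""
    workspace_id: str
    user_id: str
    role: str
    project_id: Optional[str] = None  # None means a workspace-level membership
    status: str = "active"            # assumed status value


def effective_role(
    memberships: Iterable[Membership],
    user_id: str,
    workspace_id: str,
    project_id: Optional[str],
) -> Optional[str]:
    """Return the project-level role if one exists, else the workspace-level role."""
    workspace_role = None
    for m in memberships:
        if m.status != "active":
            continue
        if m.user_id != user_id or m.workspace_id != workspace_id:
            continue
        if m.project_id == project_id:
            # Exact scope match wins (assumed precedence rule).
            return m.role
        if m.project_id is None:
            workspace_role = m.role
    return workspace_role
```

Because each membership is its own small document, adding or revoking access touches one record instead of rewriting a member array on a workspace or project.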

projects

Purpose:

  • project-scoped grouping for assets, workflows, runs, and outputs

Core fields:

  • _id
  • workspaceId
  • name
  • slug
  • description
  • status
  • createdBy
  • createdAt
  • updatedAt

assets

Purpose:

  • represent raw uploaded or imported inputs

Supported asset types:

  • raw_file
  • archive
  • folder
  • video_collection
  • standard_dataset
  • rosbag
  • hdf5_dataset
  • object_storage_prefix

Core fields:

  • _id
  • workspaceId
  • projectId
  • type
  • sourceType
  • displayName
  • status
  • storageRef
  • sizeBytes
  • fileCount
  • topLevelPaths
  • detectedFormats
  • summary
  • createdBy
  • createdAt
  • updatedAt

Do not embed full large file listings in this document.
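
A minimal sketch of what this means when building an asset document: the full path listing is reduced to a count and a bounded list of top-level entries. The helper name `build_asset_doc`, the cap of 50 entries, and the initial `pending` status are hypothetical choices for the example.

```python
from datetime import datetime, timezone

ASSET_TYPES = {
    "raw_file", "archive", "folder", "video_collection",
    "standard_dataset", "rosbag", "hdf5_dataset", "object_storage_prefix",
}
MAX_TOP_LEVEL_PATHS = 50  # assumed cap, not specified by the design


def build_asset_doc(*, workspace_id, project_id, asset_type, display_name,
                    storage_ref, all_paths, created_by):
    """Build an assets document that summarizes, rather than embeds, the file listing."""
    if asset_type not in ASSET_TYPES:
        raise ValueError(f"unsupported asset type: {asset_type}")
    # Keep only the distinct first path segment of every file.
    top_level = sorted(
        {p.strip("/").split("/", 1)[0] for p in all_paths if p.strip("/")}
    )
    now = datetime.now(timezone.utc)
    return {
        "workspaceId": workspace_id,
        "projectId": project_id,
        "type": asset_type,
        "displayName": display_name,
        "status": "pending",  # assumed initial status
        "storageRef": storage_ref,
        "fileCount": len(all_paths),
        "topLevelPaths": top_level[:MAX_TOP_LEVEL_PATHS],
        "createdBy": created_by,
        "createdAt": now,
        "updatedAt": now,
    }
```

The complete listing, if it is needed later, would live in object storage or in a probe report rather than in the asset document.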

asset_probe_reports

Purpose:

  • retain richer structure-detection and validation output

Core fields:

  • _id
  • assetId
  • reportVersion
  • detectedFormatCandidates
  • structureSummary
  • warnings
  • recommendedNextNodes
  • rawReport
  • createdAt

datasets

Purpose:

  • represent logical dataset identity

Core fields:

  • _id
  • workspaceId
  • projectId
  • name
  • type
  • status
  • latestVersionId
  • summary
  • createdBy
  • createdAt
  • updatedAt

dataset_versions

Purpose:

  • represent immutable dataset snapshots

Core fields:

  • _id
  • datasetId
  • workspaceId
  • projectId
  • sourceAssetId
  • parentVersionId
  • versionTag
  • canonicalSchemaVersion
  • manifestRef
  • stats
  • summary
  • status
  • createdBy
  • createdAt

Versions live in their own collection because the number of versions per dataset keeps growing over time.
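
Since each snapshot is immutable and carries a parentVersionId, the history of a dataset can be reconstructed by following parent links. The sketch below does this over an in-memory mapping of version documents; the function name `version_lineage` and the sample tags are hypothetical.

```python
def version_lineage(versions_by_id, version_id):
    """Return versionTag values from the given version back to the root."""
    lineage = []
    seen = set()
    current = version_id
    while current is not None:
        if current in seen:
            # Parent links should form a chain; a cycle indicates corrupt data.
            raise ValueError(f"cycle detected at version {current}")
        seen.add(current)
        doc = versions_by_id[current]
        lineage.append(doc["versionTag"])
        current = doc.get("parentVersionId")
    return lineage
```

For example, with versions `v1 <- v2 <- v3` linked through parentVersionId, calling the function on `v3` yields the tags in newest-to-oldest order.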

workflow_definitions

Purpose:

  • represent logical workflow identity

Core fields:

  • _id
  • workspaceId
  • projectId
  • name
  • slug
  • status
  • latestVersionNumber
  • publishedVersionNumber
  • createdBy
  • createdAt
  • updatedAt

workflow_definition_versions

Purpose:

  • represent immutable workflow snapshots

Core fields:

  • _id
  • workflowDefinitionId
  • workspaceId
  • projectId
  • versionNumber
  • visualGraph
  • logicGraph
  • runtimeGraph
  • pluginRefs
  • summary
  • createdBy
  • createdAt

Splitting versions from workflow head metadata avoids oversized documents and simplifies history queries.
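
The head/version split can be sketched as follows: saving a workflow appends a new immutable version document and bumps only a counter on the head. This is an in-memory illustration with a hypothetical `save_new_version` helper; a real implementation would need an atomic counter update in MongoDB, which is not shown here.

```python
import copy


def save_new_version(head, versions, graphs, created_by):
    """Append a new version document and advance the head's latestVersionNumber."""
    number = head["latestVersionNumber"] + 1
    version = {
        "workflowDefinitionId": head["_id"],
        "workspaceId": head["workspaceId"],
        "projectId": head["projectId"],
        "versionNumber": number,
        # Deep-copy so later edits to the caller's graphs cannot alter the snapshot.
        "visualGraph": copy.deepcopy(graphs.get("visualGraph")),
        "logicGraph": copy.deepcopy(graphs.get("logicGraph")),
        "runtimeGraph": copy.deepcopy(graphs.get("runtimeGraph")),
        "createdBy": created_by,
    }
    versions.append(version)
    head["latestVersionNumber"] = number  # the head stays small
    return version
```

The head document never carries graph payloads, so listing workflows stays cheap no matter how many versions accumulate.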

workflow_runs

Purpose:

  • store execution runs

Core fields:

  • _id
  • workflowDefinitionId
  • workflowVersionId
  • workspaceId
  • projectId
  • triggeredBy
  • status
  • runtimeSnapshot
  • summary
  • startedAt
  • finishedAt
  • createdAt

run_tasks

Purpose:

  • store one execution unit per node per run

Core fields:

  • _id
  • workflowRunId
  • workflowVersionId
  • nodeId
  • nodeType
  • status
  • attempt
  • executor
  • scheduler
  • inputRefs
  • outputRefs
  • logRef
  • cacheKey
  • cacheHit
  • errorSummary
  • startedAt
  • finishedAt
  • createdAt

This collection should remain separate from workflow_runs because task volume grows quickly.
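
To illustrate the "one execution unit per node per run" rule, the sketch below expands a run into run_tasks documents and derives a small status summary that could be stored on the run. The node shape (`id`, `type`), the status values, and both helper names are assumptions for the example.

```python
from collections import Counter


def plan_run_tasks(run_id, version_id, nodes):
    """Create one pending task document for every node in the workflow version."""
    return [
        {
            "workflowRunId": run_id,
            "workflowVersionId": version_id,
            "nodeId": node["id"],
            "nodeType": node["type"],
            "status": "pending",  # assumed initial status
            "attempt": 1,
            "cacheHit": False,
        }
        for node in nodes
    ]


def summarize_statuses(tasks):
    """Count tasks per status, e.g. for the summary field of a workflow run."""
    return dict(Counter(task["status"] for task in tasks))
```

The run document then holds only the compact summary, while the per-node detail stays in run_tasks.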

artifacts

Purpose:

  • store managed outputs and previews

Artifact types may include:

  • preview bundle
  • quality report
  • normalized dataset package
  • delivery package
  • training config package
  • intermediate task output

Core fields:

  • _id
  • workspaceId
  • projectId
  • type
  • producerType
  • producerId
  • storageRef
  • previewable
  • summary
  • lineage
  • createdBy
  • createdAt

annotation_tasks

Purpose:

  • track assignment and state of manual labeling work

Core fields:

  • _id
  • workspaceId
  • projectId
  • targetType
  • targetRef
  • labelType
  • status
  • assigneeIds
  • reviewerIds
  • createdBy
  • createdAt
  • updatedAt

annotations

Purpose:

  • persist annotation outputs

Core fields:

  • _id
  • annotationTaskId
  • workspaceId
  • projectId
  • targetRef
  • payload
  • status
  • createdBy
  • createdAt
  • updatedAt

plugins

Purpose:

  • track installable and enabled plugin versions

Core fields:

  • _id
  • workspaceId (optional; set only for workspace-scoped plugins)
  • scope (platform or workspace)
  • name
  • status
  • currentVersion
  • versions
  • permissions
  • metadata
  • createdAt
  • updatedAt

If plugin version payloads become large, split versions into a separate collection later. V1 can keep them nested if bounded.

storage_connections

Purpose:

  • store object storage and path registration configuration

Core fields:

  • _id
  • workspaceId
  • type
  • provider
  • name
  • status
  • config
  • secretRef
  • createdBy
  • createdAt
  • updatedAt

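Keep secrets out of plaintext document fields where possible; config should hold non-sensitive settings, and secretRef should point to where the credential is kept.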
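
One way to enforce this at write time is to split incoming connection settings into a storable config and a set of secrets destined for an external secret store. The key names in `SENSITIVE_KEYS`, the helper name, and the format of the secret reference are all hypothetical.

```python
# Assumed names of sensitive settings; adjust to the providers actually supported.
SENSITIVE_KEYS = {"secretAccessKey", "password", "token"}


def split_secrets(raw_config, secret_ref):
    """Separate sensitive values from the config that is written to MongoDB.

    Returns (document_fields, secrets). The caller is expected to write
    `secrets` to a secret store under `secret_ref`.
    """
    config = {k: v for k, v in raw_config.items() if k not in SENSITIVE_KEYS}
    secrets = {k: v for k, v in raw_config.items() if k in SENSITIVE_KEYS}
    document_fields = {
        "config": config,
        "secretRef": secret_ref if secrets else None,
    }
    return document_fields, secrets
```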

audit_logs

Purpose:

  • append-only history of sensitive actions

Core fields:

  • _id
  • workspaceId
  • projectId
  • actorId
  • resourceType
  • resourceId
  • action
  • beforeSummary
  • afterSummary
  • metadata
  • createdAt
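
The append-only intent can be sketched with a small wrapper that exposes only an append operation and hands out copies of stored entries. The class name `AuditLog` and its in-memory storage are illustrative; in MongoDB the same intent would be expressed by restricting the collection to inserts.

```python
import copy
from datetime import datetime, timezone


class AuditLog:
    """In-memory stand-in for an append-only audit_logs collection."""

    def __init__(self):
        self._entries = []

    def append(self, *, workspace_id, actor_id, resource_type, resource_id,
               action, project_id=None, before_summary=None,
               after_summary=None, metadata=None):
        entry = {
            "workspaceId": workspace_id,
            "projectId": project_id,
            "actorId": actor_id,
            "resourceType": resource_type,
            "resourceId": resource_id,
            "action": action,
            "beforeSummary": before_summary,
            "afterSummary": after_summary,
            "metadata": metadata or {},
            "createdAt": datetime.now(timezone.utc),
        }
        # Store a private copy so the caller's objects cannot alter history.
        self._entries.append(copy.deepcopy(entry))
        return entry

    def entries(self):
        """Return copies so callers cannot rewrite stored history."""
        return copy.deepcopy(self._entries)
```

There is deliberately no update or delete method, and no updatedAt field, which matches the field list above.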

Reference Strategy

Use stable ids between collections.

References should be explicit:

  • asset to probe report
  • dataset to dataset versions
  • workflow definition to workflow versions
  • workflow run to run tasks
  • task to artifact
  • annotation task to annotations

Do not depend on implicit path-based linkage.

Index Recommendations

Always index

  • workspaceId
  • projectId
  • status
  • createdAt

Important compound indexes

  • memberships.workspaceId + memberships.userId
  • projects.workspaceId + projects.slug
  • assets.projectId + assets.type + assets.createdAt
  • datasets.projectId + datasets.name
  • dataset_versions.datasetId + dataset_versions.createdAt
  • workflow_definitions.projectId + workflow_definitions.slug
  • workflow_definition_versions.workflowDefinitionId + workflow_definition_versions.versionNumber
  • workflow_runs.projectId + workflow_runs.createdAt
  • workflow_runs.workflowDefinitionId + workflow_runs.status
  • run_tasks.workflowRunId + run_tasks.nodeId
  • artifacts.producerType + artifacts.producerId
  • annotation_tasks.projectId + annotation_tasks.status
  • audit_logs.workspaceId + audit_logs.createdAt
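
The compound indexes above can be written down as data and applied in one pass. The sketch below uses key lists in the `(field, direction)` form that PyMongo's `Collection.create_index` accepts, and expects a database-like object supporting `db[name]`, such as a PyMongo `Database`. The sort directions (descending on createdAt) are assumptions, and no uniqueness constraints are declared because the design does not state any.

```python
ASC, DESC = 1, -1  # same values as pymongo.ASCENDING and pymongo.DESCENDING

COMPOUND_INDEXES = {
    "memberships": [[("workspaceId", ASC), ("userId", ASC)]],
    "projects": [[("workspaceId", ASC), ("slug", ASC)]],
    "assets": [[("projectId", ASC), ("type", ASC), ("createdAt", DESC)]],
    "datasets": [[("projectId", ASC), ("name", ASC)]],
    "dataset_versions": [[("datasetId", ASC), ("createdAt", DESC)]],
    "workflow_definitions": [[("projectId", ASC), ("slug", ASC)]],
    "workflow_definition_versions": [
        [("workflowDefinitionId", ASC), ("versionNumber", ASC)],
    ],
    "workflow_runs": [
        [("projectId", ASC), ("createdAt", DESC)],
        [("workflowDefinitionId", ASC), ("status", ASC)],
    ],
    "run_tasks": [[("workflowRunId", ASC), ("nodeId", ASC)]],
    "artifacts": [[("producerType", ASC), ("producerId", ASC)]],
    "annotation_tasks": [[("projectId", ASC), ("status", ASC)]],
    "audit_logs": [[("workspaceId", ASC), ("createdAt", DESC)]],
}


def ensure_indexes(db):
    """Create every compound index; returns whatever create_index returns."""
    created = []
    for collection_name, specs in COMPOUND_INDEXES.items():
        for keys in specs:
            created.append(db[collection_name].create_index(keys))
    return created
```

Keeping the specification in one table makes it easy to review index coverage against the query patterns of each collection.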

Object Storage References

MongoDB should store references such as:

  • bucket
  • key
  • uri
  • checksum
  • content type
  • size

It should not store:

  • large binary file payloads
  • full raw video content
  • giant archive contents
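
A storage reference of this kind might be modeled as a small immutable value object. In the sketch below, the class name `StorageRef`, the `s3://` URI scheme, and the camelCase `contentType` key are assumptions; only the set of fields comes from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageRef:
    """Pointer to an object in object storage; never holds the payload itself."""
    bucket: str
    key: str
    checksum: str
    content_type: str
    size: int

    def __post_init__(self):
        if self.size < 0:
            raise ValueError("size must be non-negative")

    @property
    def uri(self) -> str:
        # Assumed URI scheme; other providers would use a different prefix.
        return f"s3://{self.bucket}/{self.key}"

    def to_document(self) -> dict:
        """Return the sub-document stored in MongoDB (for example as storageRef)."""
        return {
            "bucket": self.bucket,
            "key": self.key,
            "uri": self.uri,
            "checksum": self.checksum,
            "contentType": self.content_type,
            "size": self.size,
        }
```

A multi-gigabyte video then costs only a fixed, small sub-document in MongoDB, regardless of how large the underlying object is.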

V1 Constraints

  • MongoDB is the only database
  • No relational sidecar is assumed
  • No GridFS-first strategy is assumed
  • Large manifests may live in object storage and be referenced from MongoDB

V1 Non-Goals

The V1 model does not need:

  • cross-region data distribution
  • advanced event sourcing
  • fully normalized analytics warehouse modeling
  • high-volume search indexing inside MongoDB itself