# EmboFlow MongoDB Data Model ## Goal Define the MongoDB-only persistence model for EmboFlow V1. The database must support: - user and workspace isolation - raw asset tracking - canonical dataset versions - workflow versioning - workflow execution history - plugin registration - auditability ## Storage Principles - MongoDB stores metadata and execution state - Object storage stores large binary files and large derived bundles - MongoDB documents should have clear aggregate boundaries - Large, fast-growing arrays should be split into separate collections - Platform contracts should use references, not embedded file blobs ## Current V1 Implementation Notes The first code pass stabilized these collection boundaries with in-memory services. The executable local runtime now persists the core objects below into MongoDB. This means the implementation now validates: - document shapes - controller and service boundaries - workflow/run/task separation - artifact lookup by producer - asset persistence and probe reports through Mongo-backed collections while still targeting the collection model below as the persistent shape. ## Primary Collections - `users` - `workspaces` - `projects` - `memberships` - `assets` - `asset_probe_reports` - `datasets` - `dataset_versions` - `workflow_definitions` - `workflow_definition_versions` - `workflow_runs` - `run_tasks` - `artifacts` - `annotation_tasks` - `annotations` - `plugins` - `storage_connections` - `custom_nodes` - `audit_logs` ## Collection Design ### users Purpose: - account identity - profile - login metadata Core fields: - `_id` - `email` - `displayName` - `avatarUrl` - `status` - `lastLoginAt` - `createdAt` - `updatedAt` ### workspaces Purpose: - resource ownership boundary Core fields: - `_id` - `type` as `personal` or `team` - `name` - `slug` - `ownerId` - `status` - `settings` - `createdAt` - `updatedAt` ### memberships Purpose: - workspace and project role mapping Core fields: - `_id` - `workspaceId` - `projectId` optional - `userId` - `role` - `status` - `createdAt` - `updatedAt` This collection should stay independent instead of embedding large member arrays on every resource. ### projects Purpose: - project-scoped grouping for assets, workflows, runs, and outputs Core fields: - `_id` - `workspaceId` - `name` - `slug` - `description` - `status` - `createdBy` - `createdAt` - `updatedAt` ### assets Purpose: - represent raw uploaded or imported inputs Supported asset types: - `raw_file` - `archive` - `folder` - `video_collection` - `standard_dataset` - `rosbag` - `hdf5_dataset` - `object_storage_prefix` Core fields: - `_id` - `workspaceId` - `projectId` - `type` - `sourceType` - `displayName` - `status` - `storageRef` - `sizeBytes` - `fileCount` - `topLevelPaths` - `detectedFormats` - `summary` - `createdBy` - `createdAt` - `updatedAt` Do not embed full large file listings in this document. ### asset_probe_reports Purpose: - retain richer structure-detection and validation output Core fields: - `_id` - `assetId` - `reportVersion` - `detectedFormatCandidates` - `structureSummary` - `warnings` - `recommendedNextNodes` - `rawReport` - `createdAt` ### datasets Purpose: - represent logical dataset identity Core fields: - `_id` - `workspaceId` - `projectId` - `name` - `type` - `status` - `latestVersionId` - `summary` - `createdBy` - `createdAt` - `updatedAt` ### custom_nodes Purpose: - store project-scoped custom container node definitions Core fields: - `_id` - `definitionId` - `workspaceId` - `projectId` - `name` - `slug` - `description` - `category` - `status` - `contract` - `source` - `createdBy` - `createdAt` - `updatedAt` The current V1 implementation stores the custom node source as either: - an existing Docker image reference - a self-contained Dockerfile body plus an image tag The node contract is persisted with the node definition so the API can expose correct node metadata to the editor and the worker can validate runtime outputs. ### dataset_versions Purpose: - represent immutable dataset snapshots Core fields: - `_id` - `datasetId` - `workspaceId` - `projectId` - `sourceAssetId` - `parentVersionId` - `versionTag` - `canonicalSchemaVersion` - `manifestRef` - `stats` - `summary` - `status` - `createdBy` - `createdAt` This collection is separated because versions will grow over time. ### workflow_definitions Purpose: - represent logical workflow identity Core fields: - `_id` - `workspaceId` - `projectId` - `name` - `slug` - `status` - `latestVersionNumber` - `publishedVersionNumber` - `createdBy` - `createdAt` - `updatedAt` ### workflow_definition_versions Purpose: - represent immutable workflow snapshots Core fields: - `_id` - `workflowDefinitionId` - `workspaceId` - `projectId` - `versionNumber` - `visualGraph` - `logicGraph` - `runtimeGraph` - `pluginRefs` - `summary` - `createdBy` - `createdAt` Splitting versions from workflow head metadata avoids oversized documents and simplifies history queries. ### workflow_runs Purpose: - store execution runs - snapshot the asset bindings chosen at run creation time - support project-scoped run history queries without re-reading workflow versions Core fields: - `_id` - `workflowDefinitionId` - `workflowVersionId` - `assetIds` - `workspaceId` - `projectId` - `triggeredBy` - `status` - `runtimeSnapshot` - `summary` - `startedAt` - `finishedAt` - `durationMs` - `createdAt` ### run_tasks Purpose: - store one execution unit per node per run - keep bound asset context available to the worker at dequeue time Core fields: - `_id` - `workflowRunId` - `workflowVersionId` - `nodeId` - `nodeType` - `nodeDefinitionId` - `executorType` - `executorConfig` - `codeHookSpec` - `artifactType` - `artifactTitle` - `status` - `attempt` - `assetIds` - `upstreamNodeIds` - `outputArtifactIds` - `logRef` - `cacheKey` - `cacheHit` - `logLines` - `stdoutLines` - `stderrLines` - `errorMessage` - `summary` - `lastResultPreview` - `startedAt` - `finishedAt` - `durationMs` - `createdAt` This collection should remain separate from `workflow_runs` because task volume grows quickly. The current executable worker path expects `run_tasks` to be self-sufficient enough for dequeue and dependency promotion. That means V1 runtime tasks already persist: - executor choice - node definition id and frozen per-node runtime config - bound asset ids at run creation time, then the effective asset ids that were actually executed after any upstream set-operation narrowing - upstream node dependencies - produced artifact ids - per-task status and error message - task log lines, stdout/stderr streams, and result preview - structured task summaries with executor, outcome, asset count, artifact ids, and stdout/stderr counters The current runtime also aggregates task execution back onto `workflow_runs`, so run documents now carry: - a frozen `runtimeSnapshot` copied from the workflow version runtime layer at run creation time - task counts by status - completed task count - artifact count - total stdout/stderr line counts - failed task ids - derived run duration The current runtime control loop also mutates these collections in place for retry/cancel operations: - cancelling a run marks queued and pending `run_tasks` as `cancelled` - retrying a run creates a new `workflow_runs` document plus a fresh set of `run_tasks` - retrying a task resets the target node and downstream subtree on the existing run, clears task execution fields, and increments the retried task attempt count ### artifacts Purpose: - store managed outputs and previews Artifact types may include: - preview bundle - quality report - normalized dataset package - delivery package - training config package - intermediate task output Core fields: - `_id` - `workspaceId` - `projectId` - `type` - `producerType` - `producerId` - `storageRef` - `previewable` - `summary` - `lineage` - `createdBy` - `createdAt` ### workflow_runs and run_tasks input binding note The current V1 runtime now stores workflow input selection in three layers: - `inputBindings` The explicit operator-facing selection such as `[{ kind: "dataset", id: "dataset-..." }]` - `assetIds` The resolved runnable asset ids after dataset expansion and deduplication - `datasetIds` The explicit dataset ids that participated in the run or task This keeps execution backward-compatible for asset-oriented nodes while preserving the higher-level project data model in run history and task detail. ### annotation_tasks Purpose: - track assignment and state of manual labeling work Core fields: - `_id` - `workspaceId` - `projectId` - `targetType` - `targetRef` - `labelType` - `status` - `assigneeIds` - `reviewerIds` - `createdBy` - `createdAt` - `updatedAt` ### annotations Purpose: - persist annotation outputs Core fields: - `_id` - `annotationTaskId` - `workspaceId` - `projectId` - `targetRef` - `payload` - `status` - `createdBy` - `createdAt` - `updatedAt` ### plugins Purpose: - track installable and enabled plugin versions Core fields: - `_id` - `workspaceId` optional for workspace-scoped plugins - `scope` as `platform` or `workspace` - `name` - `status` - `currentVersion` - `versions` - `permissions` - `metadata` - `createdAt` - `updatedAt` If plugin version payloads become large, split versions into a separate collection later. V1 can keep them nested if bounded. ### storage_connections Purpose: - store object storage and path registration configuration Core fields: - `_id` - `workspaceId` - `type` - `provider` - `name` - `status` - `config` - `secretRef` - `createdBy` - `createdAt` - `updatedAt` Store secrets outside plaintext document fields where possible. ### audit_logs Purpose: - append-only history of sensitive actions Core fields: - `_id` - `workspaceId` - `projectId` - `actorId` - `resourceType` - `resourceId` - `action` - `beforeSummary` - `afterSummary` - `metadata` - `createdAt` ## Reference Strategy Use stable ids between collections. References should be explicit: - asset to probe report - dataset to dataset versions - workflow definition to workflow versions - workflow run to run tasks - task to artifact - annotation task to annotations Do not depend on implicit path-based linkage. ## Index Recommendations ### Always index - `workspaceId` - `projectId` - `status` - `createdAt` ### Important compound indexes - `memberships.workspaceId + memberships.userId` - `projects.workspaceId + projects.slug` - `assets.projectId + assets.type + assets.createdAt` - `datasets.projectId + datasets.name` - `dataset_versions.datasetId + dataset_versions.createdAt` - `workflow_definitions.projectId + workflow_definitions.slug` - `workflow_definition_versions.workflowDefinitionId + versionNumber` - `workflow_runs.projectId + createdAt` - `workflow_runs.workflowDefinitionId + status` - `run_tasks.workflowRunId + nodeId` - `artifacts.producerType + producerId` - `annotation_tasks.projectId + status` - `audit_logs.workspaceId + createdAt` ## Object Storage References MongoDB should store references such as: - bucket - key - uri - checksum - content type - size It should not store: - large binary file payloads - full raw video content - giant archive contents ## V1 Constraints - MongoDB is the only database - No relational sidecar is assumed - No GridFS-first strategy is assumed - Large manifests may live in object storage and be referenced from MongoDB ## V1 Non-Goals The V1 model does not need: - cross-region data distribution - advanced event sourcing - fully normalized analytics warehouse modeling - high-volume search indexing inside MongoDB itself