# EmboFlow MongoDB Data Model

## Goal

Define the MongoDB-only persistence model for EmboFlow V1.

The database must support:

- user and workspace isolation
- raw asset tracking
- canonical dataset versions
- workflow versioning
- workflow execution history
- plugin registration
- auditability

## Storage Principles

- MongoDB stores metadata and execution state
- Object storage stores large binary files and large derived bundles
- MongoDB documents should have clear aggregate boundaries
- Large, fast-growing arrays should be split into separate collections
- Platform contracts should use references, not embedded file blobs

## Current V1 Implementation Notes

The first code pass stabilized these collection boundaries with in-memory services. The executable local runtime now persists the core objects below into MongoDB.

This means the implementation now validates:

- document shapes
- controller and service boundaries
- workflow/run/task separation
- artifact lookup by producer
- asset persistence and probe reports through Mongo-backed collections

while still targeting the collection model below as the persistent shape.

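
The move from in-memory services to Mongo-backed persistence described above is easiest to picture as a small store interface with interchangeable implementations. The following is a minimal sketch under that assumption; `AssetRecord`, `AssetStore`, `InMemoryAssetStore`, and the method names are hypothetical and are not taken from the EmboFlow codebase.

```typescript
// Hypothetical record shape: a small subset of the `assets` core fields.
interface AssetRecord {
  _id: string;
  workspaceId: string;
  projectId: string;
  displayName: string;
}

// Assumed service boundary: callers depend on this interface only, so an
// in-memory class and a Mongo-backed class could be swapped freely.
interface AssetStore {
  insert(record: AssetRecord): void;
  findByProject(workspaceId: string, projectId: string): AssetRecord[];
}

class InMemoryAssetStore implements AssetStore {
  private readonly records: AssetRecord[] = [];

  insert(record: AssetRecord): void {
    if (this.records.some((r) => r._id === record._id)) {
      throw new Error(`duplicate asset id: ${record._id}`);
    }
    // Store a copy so callers cannot mutate stored state by accident.
    this.records.push({ ...record });
  }

  findByProject(workspaceId: string, projectId: string): AssetRecord[] {
    // Filtering on workspaceId first mirrors the workspace isolation goal.
    return this.records.filter(
      (r) => r.workspaceId === workspaceId && r.projectId === projectId,
    );
  }
}
```

A Mongo-backed implementation of the same interface would likely be asynchronous; the synchronous signatures here only keep the sketch short.
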
## Primary Collections

- `users`
- `workspaces`
- `projects`
- `memberships`
- `assets`
- `asset_probe_reports`
- `datasets`
- `dataset_versions`
- `workflow_definitions`
- `workflow_definition_versions`
- `workflow_runs`
- `run_tasks`
- `artifacts`
- `annotation_tasks`
- `annotations`
- `plugins`
- `storage_connections`
- `audit_logs`

## Collection Design

### users

Purpose:

- account identity
- profile
- login metadata

Core fields:

- `_id`
- `email`
- `displayName`
- `avatarUrl`
- `status`
- `lastLoginAt`
- `createdAt`
- `updatedAt`

### workspaces

Purpose:

- resource ownership boundary

Core fields:

- `_id`
- `type` as `personal` or `team`
- `name`
- `slug`
- `ownerId`
- `status`
- `settings`
- `createdAt`
- `updatedAt`

### memberships

Purpose:

- workspace and project role mapping

Core fields:

- `_id`
- `workspaceId`
- `projectId` optional
- `userId`
- `role`
- `status`
- `createdAt`
- `updatedAt`

This collection should stay independent instead of embedding large member arrays on every resource.

### projects

Purpose:

- project-scoped grouping for assets, workflows, runs, and outputs

Core fields:

- `_id`
- `workspaceId`
- `name`
- `slug`
- `description`
- `status`
- `createdBy`
- `createdAt`
- `updatedAt`

### assets

Purpose:

- represent raw uploaded or imported inputs

Supported asset types:

- `raw_file`
- `archive`
- `folder`
- `video_collection`
- `standard_dataset`
- `rosbag`
- `hdf5_dataset`
- `object_storage_prefix`

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `type`
- `sourceType`
- `displayName`
- `status`
- `storageRef`
- `sizeBytes`
- `fileCount`
- `topLevelPaths`
- `detectedFormats`
- `summary`
- `createdBy`
- `createdAt`
- `updatedAt`

Do not embed full large file listings in this document.

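
One way to honor the rule against embedding full file listings is to reduce a listing to the bounded `fileCount` and `topLevelPaths` fields before the asset document is written. The sketch below assumes that `topLevelPaths` holds the distinct first path segments of the listing, which the model does not state explicitly; `AssetListingFields` and `summarizeListing` are hypothetical names.

```typescript
// Asset types as listed for the `assets` collection.
type AssetType =
  | "raw_file"
  | "archive"
  | "folder"
  | "video_collection"
  | "standard_dataset"
  | "rosbag"
  | "hdf5_dataset"
  | "object_storage_prefix";

// Hypothetical subset of the asset document that describes its listing.
interface AssetListingFields {
  type: AssetType;
  fileCount: number;
  topLevelPaths: string[];
}

// Reduce a full file listing to a count plus distinct top-level entries,
// so the asset document never carries the listing itself.
function summarizeListing(type: AssetType, paths: string[]): AssetListingFields {
  const topLevelPaths: string[] = [];
  for (const path of paths) {
    // Drop leading slashes, then keep only the first path segment.
    const segment = path.replace(/^\/+/, "").split("/")[0] ?? "";
    if (segment !== "" && topLevelPaths.indexOf(segment) === -1) {
      topLevelPaths.push(segment);
    }
  }
  return { type, fileCount: paths.length, topLevelPaths };
}
```

The full listing, if it is needed later, would live in object storage or a probe report rather than in the asset document.
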
### asset_probe_reports

Purpose:

- retain richer structure-detection and validation output

Core fields:

- `_id`
- `assetId`
- `reportVersion`
- `detectedFormatCandidates`
- `structureSummary`
- `warnings`
- `recommendedNextNodes`
- `rawReport`
- `createdAt`

### datasets

Purpose:

- represent logical dataset identity

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `name`
- `type`
- `status`
- `latestVersionId`
- `summary`
- `createdBy`
- `createdAt`
- `updatedAt`

### dataset_versions

Purpose:

- represent immutable dataset snapshots

Core fields:

- `_id`
- `datasetId`
- `workspaceId`
- `projectId`
- `sourceAssetId`
- `parentVersionId`
- `versionTag`
- `canonicalSchemaVersion`
- `manifestRef`
- `stats`
- `summary`
- `status`
- `createdBy`
- `createdAt`

This collection is separated because versions will grow over time.

### workflow_definitions

Purpose:

- represent logical workflow identity

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `name`
- `slug`
- `status`
- `latestVersionNumber`
- `publishedVersionNumber`
- `createdBy`
- `createdAt`
- `updatedAt`

### workflow_definition_versions

Purpose:

- represent immutable workflow snapshots

Core fields:

- `_id`
- `workflowDefinitionId`
- `workspaceId`
- `projectId`
- `versionNumber`
- `visualGraph`
- `logicGraph`
- `runtimeGraph`
- `pluginRefs`
- `summary`
- `createdBy`
- `createdAt`

Splitting versions from workflow head metadata avoids oversized documents and simplifies history queries.

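
The head/version split for workflows can be illustrated with a small append operation: saving a workflow produces a new immutable version document and advances the counter on the head document. This is a sketch only. It assumes that `versionNumber` increases by one per save and that `publishedVersionNumber` may be empty, neither of which the model states; `appendVersion` and the `_id` format are hypothetical.

```typescript
// Hypothetical subset of a `workflow_definitions` head document.
interface WorkflowDefinitionHead {
  _id: string;
  latestVersionNumber: number;
  publishedVersionNumber: number | null; // nullability is an assumption
}

// Hypothetical subset of a `workflow_definition_versions` document.
interface WorkflowDefinitionVersion {
  _id: string;
  workflowDefinitionId: string;
  versionNumber: number;
  logicGraph: unknown;
  createdAt: Date;
}

// Build the next version snapshot and the updated head without mutating
// the inputs; the caller would persist both results.
function appendVersion(
  head: WorkflowDefinitionHead,
  logicGraph: unknown,
  now: Date,
): { head: WorkflowDefinitionHead; version: Readonly<WorkflowDefinitionVersion> } {
  const versionNumber = head.latestVersionNumber + 1;
  // Object.freeze is shallow: it guards the top-level fields only.
  const version = Object.freeze({
    _id: `${head._id}:v${versionNumber}`, // illustrative id scheme
    workflowDefinitionId: head._id,
    versionNumber,
    logicGraph,
    createdAt: now,
  });
  return { head: { ...head, latestVersionNumber: versionNumber }, version };
}
```

Because existing version documents are never rewritten, a history query only needs to read `workflow_definition_versions` for a given `workflowDefinitionId`.
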
### workflow_runs

Purpose:

- store execution runs
- snapshot the asset bindings chosen at run creation time

Core fields:

- `_id`
- `workflowDefinitionId`
- `workflowVersionId`
- `assetIds`
- `workspaceId`
- `projectId`
- `triggeredBy`
- `status`
- `runtimeSnapshot`
- `summary`
- `startedAt`
- `finishedAt`
- `createdAt`

### run_tasks

Purpose:

- store one execution unit per node per run
- keep bound asset context available to the worker at dequeue time

Core fields:

- `_id`
- `workflowRunId`
- `workflowVersionId`
- `nodeId`
- `nodeType`
- `executorType`
- `status`
- `attempt`
- `assetIds`
- `upstreamNodeIds`
- `outputArtifactIds`
- `logRef`
- `cacheKey`
- `cacheHit`
- `errorMessage`
- `startedAt`
- `finishedAt`
- `createdAt`

This collection should remain separate from `workflow_runs` because task volume grows quickly.

The current executable worker path expects `run_tasks` to be self-sufficient enough for dequeue and dependency promotion. That means V1 runtime tasks already persist:

- executor choice
- bound asset ids
- upstream node dependencies
- produced artifact ids
- per-task status and error message

### artifacts

Purpose:

- store managed outputs and previews

Artifact types may include:

- preview bundle
- quality report
- normalized dataset package
- delivery package
- training config package
- intermediate task output

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `type`
- `producerType`
- `producerId`
- `storageRef`
- `previewable`
- `summary`
- `lineage`
- `createdBy`
- `createdAt`

### annotation_tasks

Purpose:

- track assignment and state of manual labeling work

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `targetType`
- `targetRef`
- `labelType`
- `status`
- `assigneeIds`
- `reviewerIds`
- `createdBy`
- `createdAt`
- `updatedAt`

### annotations

Purpose:

- persist annotation outputs

Core fields:

- `_id`
- `annotationTaskId`
- `workspaceId`
- `projectId`
- `targetRef`
- `payload`
- `status`
- `createdBy`
- `createdAt`
- `updatedAt`

### plugins

Purpose:

- track installable and enabled plugin versions

Core fields:

- `_id`
- `workspaceId` optional for workspace-scoped plugins
- `scope` as `platform` or `workspace`
- `name`
- `status`
- `currentVersion`
- `versions`
- `permissions`
- `metadata`
- `createdAt`
- `updatedAt`

If plugin version payloads become large, split versions into a separate collection later. V1 can keep them nested if bounded.

### storage_connections

Purpose:

- store object storage and path registration configuration

Core fields:

- `_id`
- `workspaceId`
- `type`
- `provider`
- `name`
- `status`
- `config`
- `secretRef`
- `createdBy`
- `createdAt`
- `updatedAt`

Store secrets outside plaintext document fields where possible.

### audit_logs

Purpose:

- append-only history of sensitive actions

Core fields:

- `_id`
- `workspaceId`
- `projectId`
- `actorId`
- `resourceType`
- `resourceId`
- `action`
- `beforeSummary`
- `afterSummary`
- `metadata`
- `createdAt`

## Reference Strategy

Use stable ids between collections.

References should be explicit:

- asset to probe report
- dataset to dataset versions
- workflow definition to workflow versions
- workflow run to run tasks
- task to artifact
- annotation task to annotations

Do not depend on implicit path-based linkage.

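
The task-to-artifact link shows what an explicit reference looks like in practice: a task lists artifact ids in `outputArtifactIds`, and each artifact points back through `producerType` and `producerId`. The sketch below resolves both directions over plain arrays. The document subsets, the function names, and the `"run_task"` value used in the usage note are assumptions made for illustration.

```typescript
// Hypothetical subset of a `run_tasks` document.
interface RunTaskDoc {
  _id: string;
  workflowRunId: string;
  nodeId: string;
  outputArtifactIds: string[];
}

// Hypothetical subset of an `artifacts` document.
interface ArtifactDoc {
  _id: string;
  type: string;
  producerType: string;
  producerId: string;
}

// Artifact lookup by producer: match on explicit ids, never on storage paths.
function artifactsByProducer(
  artifacts: ArtifactDoc[],
  producerType: string,
  producerId: string,
): ArtifactDoc[] {
  return artifacts.filter(
    (a) => a.producerType === producerType && a.producerId === producerId,
  );
}

// Report artifact ids that a task references but that do not exist,
// which is the failure mode explicit references make detectable.
function danglingArtifactIds(task: RunTaskDoc, artifacts: ArtifactDoc[]): string[] {
  const known = artifacts.map((a) => a._id);
  return task.outputArtifactIds.filter((id) => known.indexOf(id) === -1);
}
```

For example, `artifactsByProducer(all, "run_task", task._id)` would return the artifacts a task produced, assuming tasks are recorded with a `producerType` of `"run_task"`.
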
## Index Recommendations

### Always index

- `workspaceId`
- `projectId`
- `status`
- `createdAt`

### Important compound indexes

- `memberships.workspaceId + memberships.userId`
- `projects.workspaceId + projects.slug`
- `assets.projectId + assets.type + assets.createdAt`
- `datasets.projectId + datasets.name`
- `dataset_versions.datasetId + dataset_versions.createdAt`
- `workflow_definitions.projectId + workflow_definitions.slug`
- `workflow_definition_versions.workflowDefinitionId + versionNumber`
- `workflow_runs.projectId + createdAt`
- `workflow_runs.workflowDefinitionId + status`
- `run_tasks.workflowRunId + nodeId`
- `artifacts.producerType + producerId`
- `annotation_tasks.projectId + status`
- `audit_logs.workspaceId + createdAt`

## Object Storage References

MongoDB should store references such as:

- bucket
- key
- uri
- checksum
- content type
- size

It should not store:

- large binary file payloads
- full raw video content
- giant archive contents

## V1 Constraints

- MongoDB is the only database
- No relational sidecar is assumed
- No GridFS-first strategy is assumed
- Large manifests may live in object storage and be referenced from MongoDB

## V1 Non-Goals

The V1 model does not need:

- cross-region data distribution
- advanced event sourcing
- fully normalized analytics warehouse modeling
- high-volume search indexing inside MongoDB itself
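
As a closing illustration of the compound index recommendations, a few of them can be written down as key specifications, the field-to-direction objects that MongoDB index creation accepts. This is a sketch, not a migration script: the sort directions are assumed to be ascending, no uniqueness is implied, and `IndexPlan`, `compoundIndexes`, and `defaultIndexName` are hypothetical names.

```typescript
// Field name mapped to sort direction (1 ascending, -1 descending).
type IndexKeys = { [field: string]: 1 | -1 };

interface IndexPlan {
  collection: string;
  keys: IndexKeys;
}

// A subset of the recommended compound indexes. Field order matters in a
// compound index, so it follows the order given in the recommendations.
const compoundIndexes: IndexPlan[] = [
  { collection: "memberships", keys: { workspaceId: 1, userId: 1 } },
  { collection: "projects", keys: { workspaceId: 1, slug: 1 } },
  { collection: "assets", keys: { projectId: 1, type: 1, createdAt: 1 } },
  { collection: "run_tasks", keys: { workflowRunId: 1, nodeId: 1 } },
  { collection: "artifacts", keys: { producerType: 1, producerId: 1 } },
];

// Compose the name MongoDB assigns by default when none is supplied:
// each field and its direction, joined with underscores.
function defaultIndexName(keys: IndexKeys): string {
  return Object.keys(keys)
    .map((field) => `${field}_${keys[field]}`)
    .join("_");
}
```

Keeping the plan as plain data makes it easy to review against the list above before any index is actually created.
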