longtao.wu/EmboFlow

eust-w f41816bbd9 🎉 feat: initialize foundation docs guardrails and workspace skeleton

2026-03-26 17:18:40 +08:00

7.9 KiB

EmboFlow MongoDB Data Model

Goal

Define the MongoDB-only persistence model for EmboFlow V1.

The database must support:

user and workspace isolation
raw asset tracking
canonical dataset versions
workflow versioning
workflow execution history
plugin registration
auditability

Storage Principles

MongoDB stores metadata and execution state
Object storage stores large binary files and large derived bundles
MongoDB documents should have clear aggregate boundaries
Large, fast-growing arrays should be split into separate collections
Platform contracts should use references, not embedded file blobs

Primary Collections

users
workspaces
projects
memberships
assets
asset_probe_reports
datasets
dataset_versions
workflow_definitions
workflow_definition_versions
workflow_runs
run_tasks
artifacts
annotation_tasks
annotations
plugins
storage_connections
audit_logs

Collection Design

users

Purpose:

account identity
profile
login metadata

Core fields:

_id
email
displayName
avatarUrl
status
lastLoginAt
createdAt
updatedAt

workspaces

Purpose:

resource ownership boundary

Core fields:

_id
type (personal or team)
name
slug
ownerId
status
settings
createdAt
updatedAt

memberships

Purpose:

workspace and project role mapping

Core fields:

_id
workspaceId
projectId (optional)
userId
role
status
createdAt
updatedAt

This collection should stay independent instead of embedding large member arrays on every resource.
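
As an illustration of how role mapping could be read from this collection, here is a minimal sketch. The function name, the "active" status value, and the rule that a project-scoped membership overrides the workspace-wide one are all assumptions; the model itself does not define precedence.

```python
from typing import Optional


def effective_role(memberships: list[dict], user_id: str,
                   workspace_id: str, project_id: Optional[str] = None) -> Optional[str]:
    """Return the role for a user, preferring a project-scoped membership.

    Assumption (not stated in the model): a membership with projectId set
    overrides the workspace-wide membership for that project.
    """
    workspace_role = None
    for m in memberships:
        if m["userId"] != user_id or m["workspaceId"] != workspace_id:
            continue
        if m.get("status") != "active":  # "active" is an assumed status value
            continue
        scope = m.get("projectId")
        if scope is not None and scope == project_id:
            return m["role"]  # project-scoped match wins immediately
        if scope is None:
            workspace_role = m["role"]  # remember the workspace-wide fallback
    return workspace_role
```

In practice the `memberships` list would come from a query filtered on workspaceId and userId, which is what the corresponding compound index serves.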

projects

Purpose:

project-scoped grouping for assets, workflows, runs, and outputs

Core fields:

_id
workspaceId
name
slug
description
status
createdBy
createdAt
updatedAt

assets

Purpose:

represent raw uploaded or imported inputs

Supported asset types:

raw_file
archive
folder
video_collection
standard_dataset
rosbag
hdf5_dataset
object_storage_prefix

Core fields:

_id
workspaceId
projectId
type
sourceType
displayName
status
storageRef
sizeBytes
fileCount
topLevelPaths
detectedFormats
summary
createdBy
createdAt
updatedAt

Do not embed full large file listings in this document.
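
One way to honor this rule is to store only a capped list of top-level entries plus a count. The sketch below is illustrative: the helper name, the cap of 50, and the "pending" status value are assumptions, and several fields (sourceType, detectedFormats, summary) are omitted for brevity.

```python
from datetime import datetime, timezone

MAX_TOP_LEVEL_PATHS = 50  # assumed cap; the model only says not to embed full listings


def build_asset_doc(workspace_id, project_id, asset_type, display_name,
                    storage_ref, all_paths, size_bytes, created_by):
    """Build an assets document that summarizes, rather than embeds, the file listing."""
    # Reduce every path to its first segment and keep the distinct values.
    top_level = sorted({p.strip("/").split("/", 1)[0] for p in all_paths if p.strip("/")})
    now = datetime.now(timezone.utc)
    return {
        "workspaceId": workspace_id,
        "projectId": project_id,
        "type": asset_type,
        "displayName": display_name,
        "status": "pending",  # assumed initial status value
        "storageRef": storage_ref,
        "sizeBytes": size_bytes,
        "fileCount": len(all_paths),
        "topLevelPaths": top_level[:MAX_TOP_LEVEL_PATHS],  # bounded, never the full listing
        "createdBy": created_by,
        "createdAt": now,
        "updatedAt": now,
    }
```

The full listing, if needed, would live in a manifest in object storage or in the probe report rather than in this document.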

asset_probe_reports

Purpose:

retain richer structure-detection and validation output

Core fields:

_id
assetId
reportVersion
detectedFormatCandidates
structureSummary
warnings
recommendedNextNodes
rawReport
createdAt

datasets

Purpose:

represent logical dataset identity

Core fields:

_id
workspaceId
projectId
name
type
status
latestVersionId
summary
createdBy
createdAt
updatedAt

dataset_versions

Purpose:

represent immutable dataset snapshots

Core fields:

_id
datasetId
workspaceId
projectId
sourceAssetId
parentVersionId
versionTag
canonicalSchemaVersion
manifestRef
stats
summary
status
createdBy
createdAt

Versions live in their own collection because the number of versions per dataset grows over time.
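
A sketch of how a new snapshot could be recorded without touching earlier versions. The helper name and the "ready" status value are assumptions, and canonicalSchemaVersion, stats, and summary are left out to keep the example short.

```python
from datetime import datetime, timezone


def new_dataset_version(dataset, version_id, version_tag, manifest_ref,
                        source_asset_id, created_by):
    """Return (version_doc, head_update) for a new immutable snapshot."""
    created_at = datetime.now(timezone.utc)
    version_doc = {
        "_id": version_id,
        "datasetId": dataset["_id"],
        "workspaceId": dataset["workspaceId"],
        "projectId": dataset["projectId"],
        "sourceAssetId": source_asset_id,
        "parentVersionId": dataset.get("latestVersionId"),  # None for the first version
        "versionTag": version_tag,
        "manifestRef": manifest_ref,
        "status": "ready",  # assumed status value
        "createdBy": created_by,
        "createdAt": created_at,  # no updatedAt: snapshots are never modified
    }
    # Only the datasets head document is updated; version documents are insert-only.
    head_update = {"$set": {"latestVersionId": version_id, "updatedAt": created_at}}
    return version_doc, head_update
```

The returned `head_update` is shaped as a MongoDB update document using the `$set` operator, to be applied to the matching document in datasets.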

workflow_definitions

Purpose:

represent logical workflow identity

Core fields:

_id
workspaceId
projectId
name
slug
status
latestVersionNumber
publishedVersionNumber
createdBy
createdAt
updatedAt

workflow_definition_versions

Purpose:

represent immutable workflow snapshots

Core fields:

_id
workflowDefinitionId
workspaceId
projectId
versionNumber
visualGraph
logicGraph
runtimeGraph
pluginRefs
summary
createdBy
createdAt

Splitting versions from workflow head metadata avoids oversized documents and simplifies history queries.
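
To illustrate the history-query benefit, the sketch below builds the arguments for listing versions newest first while skipping the large graph payloads. The function name and default limit are assumptions; the keys are chosen to match the `filter`, `projection`, `sort`, and `limit` keyword arguments of pymongo's `Collection.find`.

```python
def version_history_query(workflow_definition_id, limit=20):
    """Build find() arguments that list versions without loading graph payloads."""
    return {
        "filter": {"workflowDefinitionId": workflow_definition_id},
        # Exclude the three graph fields; they are the bulk of each version document.
        "projection": {"visualGraph": 0, "logicGraph": 0, "runtimeGraph": 0},
        "sort": [("versionNumber", -1)],  # newest version first
        "limit": limit,
    }
```

With a pymongo collection this could be used as `collection.find(**version_history_query(definition_id))`.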

workflow_runs

Purpose:

store execution runs

Core fields:

_id
workflowDefinitionId
workflowVersionId
workspaceId
projectId
triggeredBy
status
runtimeSnapshot
summary
startedAt
finishedAt
createdAt

run_tasks

Purpose:

store one execution unit per node per run

Core fields:

_id
workflowRunId
workflowVersionId
nodeId
nodeType
status
attempt
executor
scheduler
inputRefs
outputRefs
logRef
cacheKey
cacheHit
errorSummary
startedAt
finishedAt
createdAt

This collection should remain separate from workflow_runs because task volume grows quickly.
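
The "one execution unit per node per run" rule could be sketched as follows. The layout of `runtime_nodes` (a list of dicts with "id" and "type"), the "pending" status value, and the attempt counter starting at 0 are assumptions; the model does not define the runtimeGraph structure.

```python
from datetime import datetime, timezone


def expand_run_tasks(run, runtime_nodes):
    """Create one run_tasks document per node in the run's runtime graph.

    `runtime_nodes` is assumed to be a list of {"id": ..., "type": ...} dicts.
    """
    now = datetime.now(timezone.utc)
    return [
        {
            "workflowRunId": run["_id"],
            "workflowVersionId": run["workflowVersionId"],
            "nodeId": node["id"],
            "nodeType": node["type"],
            "status": "pending",  # assumed initial status value
            "attempt": 0,         # assumed convention: no attempt started yet
            "inputRefs": [],
            "outputRefs": [],
            "cacheHit": False,
            "createdAt": now,
        }
        for node in runtime_nodes
    ]
```

A run over a graph with N nodes therefore produces N task documents, which is why the volume in run_tasks outpaces workflow_runs.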

artifacts

Purpose:

store managed outputs and previews

Artifact types may include:

preview bundle
quality report
normalized dataset package
delivery package
training config package
intermediate task output

Core fields:

_id
workspaceId
projectId
type
producerType
producerId
storageRef
previewable
summary
lineage
createdBy
createdAt

annotation_tasks

Purpose:

track assignment and state of manual labeling work

Core fields:

_id
workspaceId
projectId
targetType
targetRef
labelType
status
assigneeIds
reviewerIds
createdBy
createdAt
updatedAt

annotations

Purpose:

persist annotation outputs

Core fields:

_id
annotationTaskId
workspaceId
projectId
targetRef
payload
status
createdBy
createdAt
updatedAt

plugins

Purpose:

track installable and enabled plugin versions

Core fields:

_id
workspaceId (optional; set only for workspace-scoped plugins)
scope (platform or workspace)
name
status
currentVersion
versions
permissions
metadata
createdAt
updatedAt

V1 can keep versions nested in the plugin document as long as the array stays bounded. If version payloads become large, split them into a separate collection later.

storage_connections

Purpose:

store object storage and path registration configuration

Core fields:

_id
workspaceId
type
provider
name
status
config
secretRef
createdBy
createdAt
updatedAt

Keep secrets out of plaintext document fields where possible; secretRef should point to the secret rather than contain it.
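
A minimal sketch of separating sensitive values before a connection is persisted. The key names in `SENSITIVE_KEYS`, the helper name, and the idea of an external secret store addressed by `secret_ref` are assumptions for illustration.

```python
# Assumed names of sensitive settings; a real deployment would define its own list.
SENSITIVE_KEYS = {"accessKeySecret", "secretAccessKey", "password", "token"}


def split_connection_config(raw_config: dict, secret_ref: str):
    """Separate sensitive values from the config that will be persisted.

    Returns (document_fields, secrets). Only `document_fields` is meant for
    MongoDB; `secrets` is meant for an external secret store keyed by `secret_ref`.
    """
    config = {k: v for k, v in raw_config.items() if k not in SENSITIVE_KEYS}
    secrets = {k: v for k, v in raw_config.items() if k in SENSITIVE_KEYS}
    return {"config": config, "secretRef": secret_ref}, secrets
```

The returned document fields can be merged into the storage_connections document alongside type, provider, and name.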

audit_logs

Purpose:

append-only history of sensitive actions

Core fields:

_id
workspaceId
projectId
actorId
resourceType
resourceId
action
beforeSummary
afterSummary
metadata
createdAt
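
An audit entry could be assembled as sketched below. Treating beforeSummary and afterSummary as "only the fields that changed" is an assumption, as is the helper name; the model does not specify what the summaries contain.

```python
from datetime import datetime, timezone


def audit_entry(workspace_id, actor_id, resource_type, resource_id, action,
                before=None, after=None, project_id=None, metadata=None):
    """Build an audit_logs document that records only the fields that changed."""
    before = before or {}
    after = after or {}
    # Keys whose values differ between the two states (including added/removed keys).
    changed = sorted(k for k in before.keys() | after.keys()
                     if before.get(k) != after.get(k))
    return {
        "workspaceId": workspace_id,
        "projectId": project_id,
        "actorId": actor_id,
        "resourceType": resource_type,
        "resourceId": resource_id,
        "action": action,
        "beforeSummary": {k: before.get(k) for k in changed},
        "afterSummary": {k: after.get(k) for k in changed},
        "metadata": metadata or {},
        "createdAt": datetime.now(timezone.utc),  # no updatedAt: entries are append-only
    }
```

To keep the collection append-only, callers would only ever insert these documents and never issue updates or deletes against audit_logs.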

Reference Strategy

Use stable ids between collections.

References should be explicit:

asset to probe report
dataset to dataset versions
workflow definition to workflow versions
workflow run to run tasks
task to artifact
annotation task to annotations

Do not depend on implicit path-based linkage.

Index Recommendations

Always index

workspaceId
projectId
status
createdAt

Important compound indexes

memberships.workspaceId + memberships.userId
projects.workspaceId + projects.slug
assets.projectId + assets.type + assets.createdAt
datasets.projectId + datasets.name
dataset_versions.datasetId + dataset_versions.createdAt
workflow_definitions.projectId + workflow_definitions.slug
workflow_definition_versions.workflowDefinitionId + workflow_definition_versions.versionNumber
workflow_runs.projectId + workflow_runs.createdAt
workflow_runs.workflowDefinitionId + workflow_runs.status
run_tasks.workflowRunId + run_tasks.nodeId
artifacts.producerType + artifacts.producerId
annotation_tasks.projectId + annotation_tasks.status
audit_logs.workspaceId + audit_logs.createdAt
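
The compound indexes above can be written down as data and applied in one pass. This is a sketch: the sort directions (descending on createdAt and versionNumber) are assumptions, no uniqueness constraints are declared because the model does not specify any, and `ensure_indexes` is a hypothetical helper. The key lists use the list-of-tuples form accepted by pymongo's `Collection.create_index`.

```python
ASC, DESC = 1, -1  # same values as pymongo.ASCENDING and pymongo.DESCENDING

# (collection name, index keys); directions on time/version fields are assumed.
COMPOUND_INDEXES = [
    ("memberships", [("workspaceId", ASC), ("userId", ASC)]),
    ("projects", [("workspaceId", ASC), ("slug", ASC)]),
    ("assets", [("projectId", ASC), ("type", ASC), ("createdAt", DESC)]),
    ("datasets", [("projectId", ASC), ("name", ASC)]),
    ("dataset_versions", [("datasetId", ASC), ("createdAt", DESC)]),
    ("workflow_definitions", [("projectId", ASC), ("slug", ASC)]),
    ("workflow_definition_versions", [("workflowDefinitionId", ASC), ("versionNumber", DESC)]),
    ("workflow_runs", [("projectId", ASC), ("createdAt", DESC)]),
    ("workflow_runs", [("workflowDefinitionId", ASC), ("status", ASC)]),
    ("run_tasks", [("workflowRunId", ASC), ("nodeId", ASC)]),
    ("artifacts", [("producerType", ASC), ("producerId", ASC)]),
    ("annotation_tasks", [("projectId", ASC), ("status", ASC)]),
    ("audit_logs", [("workspaceId", ASC), ("createdAt", DESC)]),
]


def ensure_indexes(db):
    """Create every compound index; `db` is expected to be a pymongo Database."""
    for collection, keys in COMPOUND_INDEXES:
        db[collection].create_index(keys)
```

Keeping the list as plain data makes it easy to review against this document and to run at service startup.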

Object Storage References

MongoDB should store references such as:

bucket
key
uri
checksum
content type
size

It should not store:

large binary file payloads
full raw video content
giant archive contents
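
The reference fields listed above could be modeled as a small value type. This is a sketch under assumptions: the class name, the camelCase `contentType` key, the "sha256:" checksum prefix, and the S3-style URI scheme are illustrative and not defined by the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageRef:
    """Pointer to an object in object storage; never holds the payload itself."""
    bucket: str
    key: str
    checksum: str       # e.g. "sha256:<hex>"; the format is an assumption
    content_type: str
    size: int           # size in bytes

    @property
    def uri(self) -> str:
        # Assumes an S3-style URI scheme; other providers may use a different one.
        return f"s3://{self.bucket}/{self.key}"

    def to_document(self) -> dict:
        """Return the sub-document to embed as storageRef, manifestRef, or logRef."""
        return {
            "bucket": self.bucket,
            "key": self.key,
            "uri": self.uri,
            "checksum": self.checksum,
            "contentType": self.content_type,
            "size": self.size,
        }
```

For example, `StorageRef("raw", "assets/a1/episode.bag", "sha256:…", "application/octet-stream", 1024).to_document()` yields a small dict suitable for a storageRef field, while the file itself stays in object storage.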

V1 Constraints

MongoDB is the only database
No relational sidecar is assumed
No GridFS-first strategy is assumed
Large manifests may live in object storage and be referenced from MongoDB

V1 Non-Goals

The V1 model does not need:

cross-region data distribution
advanced event sourcing
fully normalized analytics warehouse modeling
high-volume search indexing inside MongoDB itself
