longtao.wu/EmboFlow

eust-w 7d7cd14233

✨ feat: add dataset-aware workflow inputs

2026-03-30 14:18:57 +08:00

EmboFlow MongoDB Data Model

Goal

Define the MongoDB-only persistence model for EmboFlow V1.

The database must support:

user and workspace isolation
raw asset tracking
canonical dataset versions
workflow versioning
workflow execution history
plugin registration
auditability

Storage Principles

MongoDB stores metadata and execution state
Object storage stores large binary files and large derived bundles
MongoDB documents should have clear aggregate boundaries
Large, fast-growing arrays should be split into separate collections
Platform contracts should use references, not embedded file blobs

Current V1 Implementation Notes

The first code pass stabilized these collection boundaries with in-memory services. The executable local runtime now persists the core objects below into MongoDB.

This means the implementation now validates:

document shapes
controller and service boundaries
workflow/run/task separation
artifact lookup by producer
asset persistence and probe reports through Mongo-backed collections

while still targeting the collection model below as the persistent shape.

Primary Collections

users
workspaces
projects
memberships
assets
asset_probe_reports
datasets
dataset_versions
workflow_definitions
workflow_definition_versions
workflow_runs
run_tasks
artifacts
annotation_tasks
annotations
plugins
storage_connections
custom_nodes
audit_logs

Collection Design

users

Purpose:

account identity
profile
login metadata

Core fields:

_id
email
displayName
avatarUrl
status
lastLoginAt
createdAt
updatedAt

workspaces

Purpose:

resource ownership boundary

Core fields:

_id
type (personal or team)
name
slug
ownerId
status
settings
createdAt
updatedAt

memberships

Purpose:

workspace and project role mapping

Core fields:

_id
workspaceId
projectId (optional)
userId
role
status
createdAt
updatedAt

This collection should stay independent instead of embedding large member arrays on every resource.
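
A minimal sketch of how a role could be resolved from standalone membership rows. The precedence rule (a project-scoped row wins over a workspace-scoped row), the "active" status value, and the helper name effective_role are assumptions for illustration; the document does not define them.

```python
from typing import Optional


def effective_role(memberships: list, user_id: str, workspace_id: str,
                   project_id: Optional[str] = None) -> Optional[str]:
    """Return the user's role, preferring a project-scoped row when one exists."""
    workspace_role = None
    for m in memberships:
        if m["userId"] != user_id or m["workspaceId"] != workspace_id:
            continue
        if m.get("status") != "active":  # assumed status value
            continue
        if m.get("projectId") is None:
            # Workspace-level row: remember it as the fallback.
            workspace_role = m["role"]
        elif m["projectId"] == project_id:
            # Project-level row for the requested project: assumed to win.
            return m["role"]
    return workspace_role
```

Because each membership is its own row, this lookup maps onto a query filtered by workspaceId and userId rather than a scan of a member array embedded in a workspace or project document.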

projects

Purpose:

project-scoped grouping for assets, workflows, runs, and outputs

Core fields:

_id
workspaceId
name
slug
description
status
createdBy
createdAt
updatedAt

assets

Purpose:

represent raw uploaded or imported inputs

Supported asset types:

raw_file
archive
folder
video_collection
standard_dataset
rosbag
hdf5_dataset
object_storage_prefix

Core fields:

_id
workspaceId
projectId
type
sourceType
displayName
status
storageRef
sizeBytes
fileCount
topLevelPaths
detectedFormats
summary
createdBy
createdAt
updatedAt

Do not embed full large file listings in this document.

asset_probe_reports

Purpose:

retain richer structure-detection and validation output

Core fields:

_id
assetId
reportVersion
detectedFormatCandidates
structureSummary
warnings
recommendedNextNodes
rawReport
createdAt

datasets

Purpose:

represent logical dataset identity

Core fields:

_id
workspaceId
projectId
name
type
status
latestVersionId
summary
createdBy
createdAt
updatedAt

custom_nodes

Purpose:

store project-scoped custom container node definitions

Core fields:

_id
definitionId
workspaceId
projectId
name
slug
description
category
status
contract
source
createdBy
createdAt
updatedAt

The current V1 implementation stores the custom node source as either:

an existing Docker image reference
a self-contained Dockerfile body plus an image tag

The node contract is persisted with the node definition so the API can expose correct node metadata to the editor and the worker can validate runtime outputs.

dataset_versions

Purpose:

represent immutable dataset snapshots

Core fields:

_id
datasetId
workspaceId
projectId
sourceAssetId
parentVersionId
versionTag
canonicalSchemaVersion
manifestRef
stats
summary
status
createdBy
createdAt

This collection is separated because versions will grow over time.
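
As an illustration of how parentVersionId links immutable snapshots, the sketch below walks a version back to its root. It works on an in-memory mapping of version documents keyed by _id; the helper name version_lineage and that mapping are hypothetical, not part of the V1 code.

```python
def version_lineage(versions_by_id: dict, version_id: str) -> list:
    """Follow parentVersionId links from a version back to its root, newest first."""
    lineage = []
    seen = set()
    current = version_id
    while current is not None and current not in seen:
        doc = versions_by_id.get(current)
        if doc is None:
            break  # dangling reference: stop rather than guess
        seen.add(current)  # guards against an accidental cycle
        lineage.append(current)
        current = doc.get("parentVersionId")
    return lineage
```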

workflow_definitions

Purpose:

represent logical workflow identity

Core fields:

_id
workspaceId
projectId
name
slug
status
latestVersionNumber
publishedVersionNumber
createdBy
createdAt
updatedAt

workflow_definition_versions

Purpose:

represent immutable workflow snapshots

Core fields:

_id
workflowDefinitionId
workspaceId
projectId
versionNumber
visualGraph
logicGraph
runtimeGraph
pluginRefs
summary
createdBy
createdAt

Splitting versions from workflow head metadata avoids oversized documents and simplifies history queries.
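
The head/version split can be sketched as follows, using plain dicts in place of MongoDB documents. The function name append_version and the shape of the graphs argument are assumptions; only the field names come from the lists above.

```python
import copy
from datetime import datetime, timezone


def append_version(head: dict, versions: list, graphs: dict, created_by: str) -> dict:
    """Add an immutable version snapshot and advance the head's latestVersionNumber."""
    next_number = head.get("latestVersionNumber", 0) + 1
    version = {
        "workflowDefinitionId": head["_id"],
        "workspaceId": head["workspaceId"],
        "projectId": head["projectId"],
        "versionNumber": next_number,
        # Deep copies so later edits to the caller's graphs cannot alter the snapshot.
        "visualGraph": copy.deepcopy(graphs["visualGraph"]),
        "logicGraph": copy.deepcopy(graphs["logicGraph"]),
        "runtimeGraph": copy.deepcopy(graphs["runtimeGraph"]),
        "createdBy": created_by,
        "createdAt": datetime.now(timezone.utc),
    }
    versions.append(version)
    # The head document stays small: it only tracks counters and metadata.
    head["latestVersionNumber"] = next_number
    head["updatedAt"] = version["createdAt"]
    return version
```

The head document never grows with the number of versions, and a history query only needs to filter the versions collection by workflowDefinitionId.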

workflow_runs

Purpose:

store execution runs
snapshot the asset bindings chosen at run creation time
support project-scoped run history queries without re-reading workflow versions

Core fields:

_id
workflowDefinitionId
workflowVersionId
assetIds
workspaceId
projectId
triggeredBy
status
runtimeSnapshot
summary
startedAt
finishedAt
durationMs
createdAt

run_tasks

Purpose:

store one execution unit per node per run
keep bound asset context available to the worker at dequeue time

Core fields:

_id
workflowRunId
workflowVersionId
nodeId
nodeType
nodeDefinitionId
executorType
executorConfig
codeHookSpec
artifactType
artifactTitle
status
attempt
assetIds
upstreamNodeIds
outputArtifactIds
logRef
cacheKey
cacheHit
logLines
stdoutLines
stderrLines
errorMessage
summary
lastResultPreview
startedAt
finishedAt
durationMs
createdAt

This collection should remain separate from workflow_runs because task volume grows quickly.

The current executable worker path expects run_tasks to be self-sufficient enough for dequeue and dependency promotion. That means V1 runtime tasks already persist:

executor choice
node definition id and frozen per-node runtime config
bound asset ids at run creation time, then the effective asset ids that were actually executed after any upstream set-operation narrowing
upstream node dependencies
produced artifact ids
per-task status and error message
task log lines, stdout/stderr streams, and result preview
structured task summaries with executor, outcome, asset count, artifact ids, and stdout/stderr counters

The current runtime also aggregates task execution back onto workflow_runs, so run documents now carry:

a frozen runtimeSnapshot copied from the workflow version runtime layer at run creation time
task counts by status
completed task count
artifact count
total stdout/stderr line counts
failed task ids
derived run duration
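
A rough sketch of this aggregation over in-memory task documents. The summary key names and the "success" and "failed" status values are assumptions; the document lists what is aggregated but not the exact field names or status vocabulary.

```python
from collections import Counter


def summarize_run(tasks: list) -> dict:
    """Roll per-task execution data up into a run-level summary."""
    counts = Counter(t["status"] for t in tasks)
    return {
        "taskCounts": dict(counts),
        "completedTaskCount": counts["success"],  # assumed terminal success status
        "artifactCount": sum(len(t.get("outputArtifactIds", [])) for t in tasks),
        "stdoutLineCount": sum(len(t.get("stdoutLines", [])) for t in tasks),
        "stderrLineCount": sum(len(t.get("stderrLines", [])) for t in tasks),
        "failedTaskIds": [t["_id"] for t in tasks if t["status"] == "failed"],
    }
```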

The current runtime control loop also updates these collections for retry/cancel operations:

cancelling a run marks queued and pending run_tasks as cancelled
retrying a run creates a new workflow_runs document plus a fresh set of run_tasks
retrying a task resets the target node and downstream subtree on the existing run, clears task execution fields, and increments the retried task attempt count
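
The cancel and retry-task rules above can be sketched on in-memory task documents. The downstream subtree is derived by inverting upstreamNodeIds. The statuses assigned after a reset ("queued" for the retried node, "pending" for its descendants) and the set of fields cleared are assumptions, as are the function names.

```python
def cancel_run(tasks: list) -> int:
    """Mark queued and pending tasks as cancelled; return how many changed."""
    changed = 0
    for t in tasks:
        if t["status"] in ("queued", "pending"):
            t["status"] = "cancelled"
            changed += 1
    return changed


def downstream_of(tasks: list, node_id: str) -> set:
    """Node ids reachable from node_id by following upstreamNodeIds in reverse."""
    children = {}
    for t in tasks:
        for upstream in t.get("upstreamNodeIds", []):
            children.setdefault(upstream, []).append(t["nodeId"])
    found, stack = set(), [node_id]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def retry_task(tasks: list, node_id: str) -> None:
    """Reset the target node and its downstream subtree on the existing run."""
    reset_ids = downstream_of(tasks, node_id) | {node_id}
    for t in tasks:
        if t["nodeId"] not in reset_ids:
            continue
        # Clear execution fields left over from the previous attempt.
        t.update(outputArtifactIds=[], logLines=[], stdoutLines=[], stderrLines=[],
                 errorMessage=None, startedAt=None, finishedAt=None, durationMs=None)
        if t["nodeId"] == node_id:
            t["status"] = "queued"  # assumed: the retried node is immediately runnable
            t["attempt"] = t.get("attempt", 1) + 1
        else:
            t["status"] = "pending"  # assumed: descendants wait for promotion
```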

artifacts

Purpose:

store managed outputs and previews

Artifact types may include:

preview bundle
quality report
normalized dataset package
delivery package
training config package
intermediate task output

Core fields:

_id
workspaceId
projectId
type
producerType
producerId
storageRef
previewable
summary
lineage
createdBy
createdAt

workflow_runs and run_tasks input binding note

The current V1 runtime now stores workflow input selection in three layers:

inputBindings: the explicit operator-facing selection, such as [{ kind: "dataset", id: "dataset-..." }]
assetIds: the resolved runnable asset ids after dataset expansion and deduplication
datasetIds: the explicit dataset ids that participated in the run or task

This keeps execution backward-compatible for asset-oriented nodes while preserving the higher-level project data model in run history and task detail.
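
A minimal sketch of how the three layers could be derived from one selection. The dataset_assets mapping (dataset id to asset ids) stands in for whatever lookup the runtime actually performs, and the "asset" binding kind is an assumption; only the "dataset" kind appears in the note above.

```python
def resolve_input_bindings(input_bindings: list, dataset_assets: dict) -> dict:
    """Expand bindings into deduplicated assetIds and datasetIds, preserving order."""
    asset_ids, dataset_ids = [], []
    for binding in input_bindings:
        kind, ref_id = binding["kind"], binding["id"]
        if kind == "dataset":
            if ref_id not in dataset_ids:
                dataset_ids.append(ref_id)
            # Hypothetical expansion: look up the assets behind the dataset.
            candidates = dataset_assets.get(ref_id, [])
        elif kind == "asset":  # assumed binding kind
            candidates = [ref_id]
        else:
            raise ValueError(f"unsupported binding kind: {kind}")
        for asset_id in candidates:
            if asset_id not in asset_ids:
                asset_ids.append(asset_id)
    return {
        "inputBindings": list(input_bindings),
        "assetIds": asset_ids,
        "datasetIds": dataset_ids,
    }
```

An asset-oriented node only needs to read assetIds, while run history can still show the operator's original selection through inputBindings.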

annotation_tasks

Purpose:

track assignment and state of manual labeling work

Core fields:

_id
workspaceId
projectId
targetType
targetRef
labelType
status
assigneeIds
reviewerIds
createdBy
createdAt
updatedAt

annotations

Purpose:

persist annotation outputs

Core fields:

_id
annotationTaskId
workspaceId
projectId
targetRef
payload
status
createdBy
createdAt
updatedAt

plugins

Purpose:

track installable and enabled plugin versions

Core fields:

_id
workspaceId (optional; set for workspace-scoped plugins)
scope (platform or workspace)
name
status
currentVersion
versions
permissions
metadata
createdAt
updatedAt

If plugin version payloads become large, split versions into a separate collection later. V1 can keep them nested if bounded.

storage_connections

Purpose:

store object storage and path registration configuration

Core fields:

_id
workspaceId
type
provider
name
status
config
secretRef
createdBy
createdAt
updatedAt

Store secrets outside plaintext document fields where possible.

audit_logs

Purpose:

append-only history of sensitive actions

Core fields:

_id
workspaceId
projectId
actorId
resourceType
resourceId
action
beforeSummary
afterSummary
metadata
createdAt

Reference Strategy

Use stable ids between collections.

References should be explicit:

asset to probe report
dataset to dataset versions
workflow definition to workflow versions
workflow run to run tasks
task to artifact
annotation task to annotations

Do not depend on implicit path-based linkage.

Index Recommendations

Always index

workspaceId
projectId
status
createdAt

Important compound indexes

memberships.workspaceId + memberships.userId
projects.workspaceId + projects.slug
assets.projectId + assets.type + assets.createdAt
datasets.projectId + datasets.name
dataset_versions.datasetId + dataset_versions.createdAt
workflow_definitions.projectId + workflow_definitions.slug
workflow_definition_versions.workflowDefinitionId + versionNumber
workflow_runs.projectId + createdAt
workflow_runs.workflowDefinitionId + status
run_tasks.workflowRunId + nodeId
artifacts.producerType + producerId
annotation_tasks.projectId + status
audit_logs.workspaceId + createdAt
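
The compound indexes above could be declared as data and applied in one pass. The sketch below uses the list-of-(field, direction) form that PyMongo's create_index accepts. The sort directions (descending on createdAt and versionNumber) are assumptions, since the list does not state them, and apply_indexes expects a PyMongo-style database object supplied by the caller.

```python
ASC, DESC = 1, -1  # MongoDB sort-direction values

COMPOUND_INDEXES = {
    "memberships": [[("workspaceId", ASC), ("userId", ASC)]],
    "projects": [[("workspaceId", ASC), ("slug", ASC)]],
    "assets": [[("projectId", ASC), ("type", ASC), ("createdAt", DESC)]],
    "datasets": [[("projectId", ASC), ("name", ASC)]],
    "dataset_versions": [[("datasetId", ASC), ("createdAt", DESC)]],
    "workflow_definitions": [[("projectId", ASC), ("slug", ASC)]],
    "workflow_definition_versions": [
        [("workflowDefinitionId", ASC), ("versionNumber", DESC)],
    ],
    "workflow_runs": [
        [("projectId", ASC), ("createdAt", DESC)],
        [("workflowDefinitionId", ASC), ("status", ASC)],
    ],
    "run_tasks": [[("workflowRunId", ASC), ("nodeId", ASC)]],
    "artifacts": [[("producerType", ASC), ("producerId", ASC)]],
    "annotation_tasks": [[("projectId", ASC), ("status", ASC)]],
    "audit_logs": [[("workspaceId", ASC), ("createdAt", DESC)]],
}


def default_index_name(keys: list) -> str:
    """Name MongoDB assigns by default: field_direction pairs joined by underscores."""
    return "_".join(f"{field}_{direction}" for field, direction in keys)


def apply_indexes(db) -> list:
    """Create every declared index; db is a PyMongo-style database object."""
    created = []
    for collection_name, key_lists in COMPOUND_INDEXES.items():
        for keys in key_lists:
            created.append(db[collection_name].create_index(keys))
    return created
```

Keeping the declarations in one table makes it easy to review them against the query patterns they are meant to serve, for example project-scoped run history on workflow_runs.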

Object Storage References

MongoDB should store references such as:

bucket
key
uri
checksum
content type
size

It should not store:

large binary file payloads
full raw video content
giant archive contents
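
As an illustration, a storage reference can be modeled as a small value object that carries only locator and integrity metadata. The class name StorageRef, the camelCase contentType attribute, and the s3:// URI scheme are assumptions; the document only lists the kinds of fields a reference holds.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StorageRef:
    """Pointer to an object in object storage; never holds the payload itself."""
    bucket: str
    key: str
    checksum: str
    contentType: str
    size: int

    @property
    def uri(self) -> str:
        # Assumed scheme; other providers would use a different prefix.
        return f"s3://{self.bucket}/{self.key}"

    def to_document(self) -> dict:
        """Shape suitable for a storageRef field on an asset or artifact document."""
        doc = asdict(self)
        doc["uri"] = self.uri
        return doc
```

A document built this way stays a few hundred bytes regardless of how large the referenced file is.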

V1 Constraints

MongoDB is the only database
No relational sidecar is assumed
No GridFS-first strategy is assumed
Large manifests may live in object storage and be referenced from MongoDB

V1 Non-Goals

The V1 model does not need:

cross-region data distribution
advanced event sourcing
fully normalized analytics warehouse modeling
high-volume search indexing inside MongoDB itself
