Data Workflows
This document describes the different workflows for building MCBO evaluation graphs, from simple single-file scenarios to large-scale dataset management.
Quick Reference
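| Workflow | Command | Use Case |
|---|---|---|
| Config-by-convention | mcbo-build-graph build --data-dir | Standard workflow |
| Single CSV bootstrap | mcbo-build-graph bootstrap --csv | Hand-curated metadata |
| Multi-study build | mcbo-build-graph build --studies-dir | Per-study CSVs |
| Incremental add | mcbo-build-graph add-study --study-dir | Add new studies over time |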
Configuration by Convention
MCBO tools use a standardized directory layout. When you provide a --data-dir argument,
paths are automatically resolved:
<data-dir>/
├── graph.ttl # Output: merged evaluation graph (TBox + ABox)
├── mcbo-instances.ttl # Output: instance data (ABox)
├── sample_metadata.csv # Input: single CSV (for bootstrap)
├── expression/ # Input: per-study expression matrices
│ ├── study_001.csv
│ └── study_002.csv
├── studies/ # Input: study directories (for multi-study build)
│ ├── study_001/
│ │ ├── sample_metadata.csv
│ │ └── expression_matrix.csv
│ └── study_002/
│ └── sample_metadata.csv
└── results/ # Output: CQ evaluation results
    ├── cq1.tsv
    └── SUMMARY.txt
The ontology (TBox) is always at ontology/mcbo.owl.ttl relative to the repository root.
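Before running a build, it can help to confirm that a data directory actually follows this layout. The following is a minimal sketch, not an MCBO command: `check_data_dir` is a hypothetical helper name, and the only thing it assumes is the directory layout shown above.

```shell
# Report which conventional inputs exist under a data directory.
# check_data_dir is a hypothetical helper, not part of the MCBO CLI.
check_data_dir() (
    data_dir=$1
    if [ ! -d "$data_dir" ]; then
        echo "error: $data_dir is not a directory" >&2
        exit 1
    fi
    found=0
    # Single-CSV input used by the bootstrap workflow
    if [ -f "$data_dir/sample_metadata.csv" ]; then
        echo "found: sample_metadata.csv"
        found=1
    fi
    # Per-study inputs used by the multi-study build
    if [ -d "$data_dir/studies" ]; then
        echo "found: studies/"
        found=1
    fi
    # Optional per-study expression matrices
    if [ -d "$data_dir/expression" ]; then
        echo "found: expression/"
    fi
    if [ "$found" -eq 0 ]; then
        echo "error: no sample_metadata.csv or studies/ in $data_dir" >&2
        exit 1
    fi
)

# Usage: check_data_dir .data && mcbo-build-graph build --data-dir .data
```

The function body runs in a subshell, so its variables and `exit` calls do not leak into the calling shell.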
Workflow 1: Demo Data (Getting Started)
# Using Makefile (recommended)
make demo
# Manual steps
mcbo-build-graph build --data-dir data.sample
mcbo-run-eval --data-dir data.sample
mcbo-stats --data-dir data.sample
Workflow 2: Bootstrap from Single CSV
Best for: Initial dataset creation with hand-curated metadata covering multiple studies.
# Just metadata
mcbo-build-graph bootstrap \
--csv .data/sample_metadata.csv \
--output .data/graph.ttl
# With per-study expression matrices
mcbo-build-graph bootstrap \
--csv .data/sample_metadata.csv \
--expression-dir .data/expression/ \
--output .data/graph.ttl
# With single expression matrix
mcbo-build-graph bootstrap \
--csv .data/sample_metadata.csv \
--expression-matrix .data/expression_matrix.csv \
--output .data/graph.ttl
Workflow 3: Multi-Study Build
Best for: When each study has its own directory with metadata and optional expression data.
# Build from study directories
mcbo-build-graph build \
--studies-dir .data/studies \
--output .data/graph.ttl
# Using config-by-convention
mcbo-build-graph build --data-dir .data
Each study directory should contain:
sample_metadata.csv (required): sample/run metadata
expression_matrix.csv (optional): gene expression data
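Since sample_metadata.csv is required in every study directory, a quick pre-build check can catch incomplete studies early. This is a sketch under that single assumption; `missing_metadata` is a hypothetical helper name, not an MCBO subcommand.

```shell
# List study directories that lack the required sample_metadata.csv.
# missing_metadata is a hypothetical helper, not part of the MCBO CLI.
missing_metadata() (
    studies_dir=$1
    status=0
    for study in "$studies_dir"/*/; do
        # Skip the unexpanded pattern when there are no subdirectories
        [ -d "$study" ] || continue
        # $study ends with a slash because of the */ pattern
        if [ ! -f "${study}sample_metadata.csv" ]; then
            echo "missing sample_metadata.csv: ${study%/}"
            status=1
        fi
    done
    exit "$status"
)

# Usage: missing_metadata .data/studies && mcbo-build-graph build --data-dir .data
```

The function returns a non-zero status when at least one study is incomplete, so it can gate the build command.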
Workflow 4: Incremental Study Addition
Best for: Growing datasets where you add studies over time without rebuilding everything.
# Add first study
mcbo-build-graph add-study \
--study-dir .data/studies/study_001 \
--instances .data/mcbo-instances.ttl
# Add subsequent studies (appends to existing instances)
mcbo-build-graph add-study \
--study-dir .data/studies/study_002 \
--instances .data/mcbo-instances.ttl
# When ready, merge with ontology
mcbo-build-graph merge \
--instances .data/mcbo-instances.ttl \
--output .data/graph.ttl
Benefits for Large Datasets
Incremental updates: Add new studies without reprocessing existing data
Partial rebuilds: Only regenerate the merged graph when needed
Memory efficiency: Process one study at a time
Git-friendly: Instance files can be tracked separately
Workflow 5: Evaluation and Statistics
# Run all competency questions
mcbo-run-eval --data-dir .data
# Verify graph parses without running queries
mcbo-run-eval --data-dir .data --verify
# Generate statistics
mcbo-stats --data-dir .data
# Fail if any CQ returns 0 results
mcbo-run-eval --data-dir .data --fail-on-empty
Large Dataset Best Practices
Directory Organization
Keep real data separate from demo data:
mcbo/
├── data.sample/ # Demo data (committed)
│ └── ...
└── .data/ # Real data (git-ignored)
    └── ...
Incremental Processing
For datasets with 1000+ processes:
# Process studies one at a time, appending each to the instances file
for study in .data/studies/study_*; do
    mcbo-build-graph add-study \
        --study-dir "$study" \
        --instances .data/mcbo-instances.ttl
done
# Generate final graph
mcbo-build-graph merge --data-dir .data
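For long runs, it may be useful to stop at the first study that fails instead of continuing silently. The sketch below wraps the same add-study call in a function. `add_all_studies` and the `MCBO_BUILD_GRAPH` variable are hypothetical names introduced here for illustration; the variable exists only so the command can be swapped out, for example with `echo` for a dry run.

```shell
# Add every study under a directory, stopping at the first failure.
# add_all_studies and MCBO_BUILD_GRAPH are hypothetical names, not MCBO features.
add_all_studies() (
    studies_dir=$1
    instances=$2
    # Defaults to the real CLI; override to dry-run (e.g. MCBO_BUILD_GRAPH=echo)
    build=${MCBO_BUILD_GRAPH:-mcbo-build-graph}
    for study in "$studies_dir"/study_*; do
        # Skip the unexpanded pattern and any non-directory matches
        [ -d "$study" ] || continue
        echo "adding $study"
        if ! "$build" add-study --study-dir "$study" --instances "$instances"; then
            echo "failed on $study; stopping" >&2
            exit 1
        fi
    done
)

# Usage: add_all_studies .data/studies .data/mcbo-instances.ttl
```

Studies are visited in the shell's glob order, so a failed run can be resumed after fixing the reported study.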
Memory Considerations
Large expression matrices can consume significant memory. Strategies:
Split expression data into per-study files
Use the --expression-dir approach instead of single large matrices
Process studies incrementally with add-study
Validation Checkpoints
# After each major step, verify the graph
mcbo-run-eval --graph .data/mcbo-instances.ttl --verify
# Check for QC issues
java -jar .robot/robot.jar query \
--input .data/graph.ttl \
--query sparql/orphan_classes.rq \
reports/robot/orphan_classes.tsv
Backup Strategy
# Before major changes
cp .data/mcbo-instances.ttl .data/mcbo-instances.backup.ttl
# Version with timestamp
cp .data/graph.ttl .data/graph.$(date +%Y%m%d).ttl
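A date-only stamp means a second backup on the same day overwrites the first. One possible variation, sketched below, adds the time of day and refuses to overwrite an existing backup. `backup_file` is a hypothetical helper name, and the sketch assumes the file name has an extension such as .ttl.

```shell
# Copy a file to a timestamped sibling, refusing to overwrite an existing backup.
# backup_file is a hypothetical helper; it assumes the file name has an extension.
backup_file() (
    src=$1
    stamp=$(date +%Y%m%d-%H%M%S)
    # Insert the stamp before the extension: graph.ttl -> graph.<stamp>.ttl
    dest="${src%.*}.$stamp.${src##*.}"
    if [ -e "$dest" ]; then
        echo "error: $dest already exists" >&2
        exit 1
    fi
    # Print the backup path on success so callers can record it
    cp -p "$src" "$dest" && echo "$dest"
)

# Usage: backup_file .data/mcbo-instances.ttl
```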
File Naming Conventions
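| File | Description |
|---|---|
| ontology/mcbo.owl.ttl | Ontology schema (TBox) |
| mcbo-instances.ttl | Instance data (ABox) |
| graph.ttl | Merged evaluation graph (TBox + ABox) |
| sample_metadata.csv | Input metadata CSV |
| expression_matrix.csv | Input expression data CSV |
| results/*.tsv | CQ query results |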
Troubleshooting
Graph doesn’t parse
mcbo-run-eval --graph .data/graph.ttl --verify
# Shows: FAIL: Graph parsing failed - <error>
Missing expression data
Check that the sample IDs in the expression matrix match the SampleAccession values in the metadata CSV.
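This comparison can be scripted with standard tools. The sketch below rests on assumptions that are not confirmed by this document: both files are plain comma-separated text without quoted commas, and the expression matrix has one row per sample with the sample ID in its first column. If your matrices store samples as columns instead, compare against the header row. `unmatched_samples` is a hypothetical helper name.

```shell
# Print sample IDs found in the expression matrix but not in the metadata.
# unmatched_samples is a hypothetical helper. ASSUMPTIONS: no quoted commas,
# and the expression matrix holds the sample ID in the first column of each row.
unmatched_samples() (
    metadata=$1
    expression=$2
    # Locate the SampleAccession column in the metadata header
    col=$(head -n 1 "$metadata" | tr -d '\r' | tr ',' '\n' |
        grep -n -x 'SampleAccession' | cut -d: -f1)
    if [ -z "$col" ]; then
        echo "error: no SampleAccession column in $metadata" >&2
        exit 2
    fi
    meta_ids=$(mktemp)
    expr_ids=$(mktemp)
    # Drop the header row, extract the ID column, and de-duplicate
    tail -n +2 "$metadata" | tr -d '\r' | cut -d, -f"$col" | sort -u > "$meta_ids"
    tail -n +2 "$expression" | tr -d '\r' | cut -d, -f1 | sort -u > "$expr_ids"
    # comm -13 keeps only the lines unique to the second file
    comm -13 "$meta_ids" "$expr_ids"
    rm -f "$meta_ids" "$expr_ids"
)

# Usage: unmatched_samples .data/sample_metadata.csv .data/expression_matrix.csv
```

Empty output means every ID in the expression matrix has a matching metadata row under these assumptions.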
Memory errors
Split the data into smaller batches using the incremental workflow (Workflow 4: Incremental Study Addition).
CQ returns 0 results
Check that the metadata CSV contains the required columns. See the CLI Reference for column requirements.
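A header check along these lines can confirm that the expected columns are present. This is only a sketch: `missing_columns` is a hypothetical helper name, the list of required columns must come from the CLI Reference (only SampleAccession is named in this document), and the header is assumed to contain no quoted commas.

```shell
# Report which of the given column names are absent from a CSV header.
# missing_columns is a hypothetical helper; take the required column names
# from the CLI Reference. Assumes the header contains no quoted commas.
missing_columns() (
    csv=$1
    shift
    # One column name per line, with any Windows line ending removed
    header=$(head -n 1 "$csv" | tr -d '\r' | tr ',' '\n')
    status=0
    for name in "$@"; do
        # -x matches the whole line, -F treats the name as a literal string
        if ! printf '%s\n' "$header" | grep -q -x -F -- "$name"; then
            echo "missing column: $name"
            status=1
        fi
    done
    exit "$status"
)

# Usage: missing_columns .data/sample_metadata.csv SampleAccession
```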