CLI Reference
MCBO provides four command-line tools for working with bioprocessing data and the ontology.
Available Commands
| Command | Description |
|---|---|
| mcbo-csv-to-rdf | Convert CSV metadata to RDF instances (with optional expression data) |
| mcbo-build-graph | Build graphs from studies or single CSV (bootstrap, build, merge, add-study) |
| mcbo-run-eval | Run SPARQL competency queries |
| mcbo-stats | Generate graph statistics |
mcbo-build-graph
Build and manage evaluation graphs.
Subcommands
bootstrap - Create graph from single CSV
mcbo-build-graph bootstrap \
--csv .data/sample_metadata.csv \
--output .data/graph.ttl
build - Build from study directories
mcbo-build-graph build \
--studies-dir .data/studies \
--output .data/graph.ttl
# Or with config-by-convention
mcbo-build-graph build --data-dir .data
add-study - Add a study incrementally
mcbo-build-graph add-study \
--study-dir .data/studies/my_new_study \
--instances .data/mcbo-instances.ttl
merge - Merge instances with ontology
mcbo-build-graph merge \
--ontology ontology/mcbo.owl.ttl \
--instances .data/mcbo-instances.ttl \
--output .data/graph.ttl
Options
--data-dir DIR Use config-by-convention (auto-resolves paths)
--csv FILE Input CSV file (for bootstrap)
--studies-dir DIR Directory containing study subdirectories
--study-dir DIR Single study directory to add
--instances FILE Path to instances TTL file
--output FILE Output graph file
--ontology FILE Ontology TTL file (default: ontology/mcbo.owl.ttl)
--expression-dir DIR Directory with per-study expression matrices
--expression-matrix FILE Single expression matrix file
mcbo-csv-to-rdf
Low-level CSV to RDF conversion.
mcbo-csv-to-rdf \
--csv_file .data/sample_metadata.csv \
--output_file .data/mcbo-instances.ttl
With expression data:
mcbo-csv-to-rdf \
--csv_file .data/sample_metadata.csv \
--output_file .data/mcbo-instances.ttl \
--expression_dir .data/expression/
Options
--csv_file FILE Input CSV metadata file (required)
--output_file FILE Output TTL file (required)
--expression_matrix FILE Single expression matrix CSV
--expression_dir DIR Directory with per-study expression CSVs
mcbo-run-eval
Run SPARQL competency question queries.
# Using config-by-convention
mcbo-run-eval --data-dir data.sample
# Using explicit paths
mcbo-run-eval \
--graph data.sample/graph.ttl \
--queries eval/queries \
--results data.sample/results
Options
--data-dir DIR Use config-by-convention
--graph FILE Input graph TTL file
--queries DIR Directory with .rq query files (default: eval/queries)
--results DIR Output directory for TSV results
--verify Only verify graph parses, don't run queries
--fail-on-empty Exit with error if any CQ returns 0 results
mcbo-stats
Generate statistics about a graph.
mcbo-stats --data-dir data.sample
# Or with explicit path
mcbo-stats --graph .data/graph.ttl
Output includes:
Total cell culture process instances (by type: Batch, Fed-batch, Perfusion, Unknown)
Total bioprocess sample instances
Config-by-Convention
All CLI tools support --data-dir for automatic path resolution:
# These are equivalent:
mcbo-run-eval --data-dir data.sample
mcbo-run-eval --graph data.sample/graph.ttl --results data.sample/results
# Convention: <data-dir>/ contains:
# graph.ttl - merged evaluation graph
# mcbo-instances.ttl - instance data (ABox)
# results/ - CQ query results
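The convention above can be sketched in Python as follows. This is a minimal illustration rather than the tools' actual resolver: the function name `resolve_paths` and the dictionary keys are hypothetical, and only the three file and directory names listed above come from this page.

```python
from pathlib import Path


def resolve_paths(data_dir: str) -> dict:
    """Map a data directory to its conventional file locations.

    Hypothetical helper for illustration; the real tools may resolve
    additional paths or apply different defaults.
    """
    root = Path(data_dir)
    return {
        "graph": root / "graph.ttl",               # merged evaluation graph
        "instances": root / "mcbo-instances.ttl",  # instance data (ABox)
        "results": root / "results",               # CQ query results
    }


paths = resolve_paths("data.sample")
print(paths["graph"].as_posix())  # data.sample/graph.ttl
```

Keeping the mapping in one place means an explicit flag such as --graph only needs to override a single entry.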
Data Dictionary
This is the authoritative reference for CSV column definitions. All metadata files (sample_metadata.csv) use these 36 columns.
Column Overview
| Category | Count | Purpose |
|---|---|---|
| Identifiers | 4 | Sample/run/study IDs |
| Dataset Provenance | 8 | Source database and sequencing metadata |
| Cell Line | 6 | Cell line characteristics |
| Culture Conditions | 5 | Temperature, pH, nutrients |
| Process & Productivity | 6 | Process type and production metrics |
| Product | 3 | What the cell line produces |
| Sample State | 4 | Time-point and viability data |
Identifiers
| Column | Type | CQs | Definition |
|---|---|---|---|
| | string | all | Unique identifier for a bioprocess run. Typically an SRA run accession (e.g., ERR4319927) or internal ID. Primary key for joining with expression data. |
| | string | all | Unique identifier for a biological sample. Typically an SRA sample accession (e.g., ERS4805133). One run may produce one sample. |
| | string | all | Identifier grouping related samples into a study (e.g., study_dhiman). Used to organize data by publication or project. |
| | string | — | Human-readable descriptive name combining cell line, conditions, and other metadata. For display purposes. |
Dataset Provenance
| Column | Type | CQs | Definition |
|---|---|---|---|
| | string | — | Primary database accession for the dataset (e.g., ERP122753, SRP066848). Links to external repositories. |
| | string | — | Human-readable version of the dataset accession. Often same as DatasetAccession. |
| | string | — | Short name for the dataset, typically author surname (e.g., Dhiman, vanWijk, Hefzi). |
| | string | — | Single-letter or short abbreviation for the dataset (e.g., D, vW, H). Used in compact displays. |
| | string | — | Lead author or principal investigator name for the study. |
| | string | — | RNA-seq library preparation method. Values: rRNA (ribosomal depletion), PolyA (poly-A selection). |
| | boolean | — | Whether sequencing used paired-end reads. TRUE = paired-end, FALSE = single-end. |
| | string | — | Origin of the data. Values: SRA (public repository), In-House (internal data). |
Cell Line
| Column | Type | CQs | Definition |
|---|---|---|---|
| | string | 1-8 | Cell line name used in the bioprocess. Common values: CHO-K1, CHO-S, CHO-DG44, CHO-DXB11, HEK293. CHO lines are classified as |
| | boolean | — | Whether this is a host/parental cell line (not producing recombinant product). TRUE = host line, FALSE = producer line. |
| | enum | — | Commercial availability. Values: Commercial, Non-Commercial. |
| | string | — | Specific vendor or source of the cell line (e.g., ATCC, Horizon, Life Technologies, Thermo Fisher). |
| | string | — | Genetic selection system used. Values: GS+/- (glutamine synthetase heterozygous), GS-/- (knockout), DHFR, ProcessEvolved. |
| | enum | — | Qualitative growth rate assessment. Values: Low, Medium, High. |
Culture Conditions
| Column | Type | CQs | Definition |
|---|---|---|---|
| | decimal | 1 | Culture temperature in degrees Celsius. Typical range: 31-37°C. Lower temperatures often used for productivity. |
| | decimal | 1 | Culture medium pH. Typical range: 6.8-7.4. Critical for cell viability and product quality. |
| | decimal | 1 | Dissolved oxygen as percentage of air saturation. Typical range: 20-60%. Affects metabolism and productivity. |
| | boolean | — | Whether glutamine was supplemented in the medium. TRUE = present, FALSE = absent or glutamine-free medium. |
| | decimal | 3 | Glutamine concentration in millimolar (mM). Typical range: 0-8 mM. Key nutrient affecting growth and ammonia production. |
Process & Productivity
| Column | Type | CQs | Definition |
|---|---|---|---|
| | enum | 5 | Bioreactor operating mode. Values: Batch (no feeding), FedBatch (nutrient feeding), Perfusion (continuous media exchange). Maps to |
| | enum | 4, 6 | Growth phase at sample collection. Values: EarlyExp, MidExp, LateExp (exponential sub-phases), Stationary. |
| | enum | 1, 6 | Qualitative productivity assessment. Values: VeryHigh, High, Medium, Low. CQ1 filters for High/VeryHigh. |
| | boolean | — | Whether the cell line shows stable transgene expression over passages. TRUE = stable, FALSE = unstable. |
| | decimal | 8 | Product concentration in mg/L. Final or harvest titer of recombinant protein/antibody. |
| | string | 8 | Product quality attribute being assessed. Values: Glycosylation, Aggregation, ChargeVariants, etc. |
Product
| Column | Type | CQs | Definition |
|---|---|---|---|
| | boolean | 2 | Whether this cell line produces recombinant product. TRUE = producer (creates |
| | string | 2, 8 | The recombinant product being produced. Three categories: (1) Gene symbol (e.g., AMBP, CCL20) → creates |
| | string | — | Ensembl stable gene identifier for the product gene. Only applicable when ProductType is a gene symbol. Format: ENSG followed by 11 digits (e.g., ENSG00000106927 for AMBP). Links to Ensembl database for cross-referencing genomic data. Creates |
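The identifier format described above (ENSG followed by 11 digits) can be checked with a short regular expression. The sketch below is illustrative only: `is_ensembl_gene_id` is a hypothetical helper, not part of the converter, and it accepts only the ENSG prefix stated on this page. Ensembl gene IDs for other species use a species-specific prefix (for example ENSMUSG for mouse) and would be rejected.

```python
import re

# ENSG followed by exactly 11 digits, per the format described above.
ENSEMBL_GENE_ID = re.compile(r"ENSG\d{11}")


def is_ensembl_gene_id(value: str) -> bool:
    """Return True if value matches the ENSG + 11 digits pattern (hypothetical helper)."""
    return ENSEMBL_GENE_ID.fullmatch(value.strip()) is not None


print(is_ensembl_gene_id("ENSG00000106927"))  # True
print(is_ensembl_gene_id("ENSG106927"))       # False (too few digits)
```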
Sample State
| Column | Type | CQs | Definition |
|---|---|---|---|
| | integer | 3 | Day of culture when sample was collected. Day 0 = inoculation. CQ3 specifically queries day 6 samples. |
| | decimal | 3 | Viable cell concentration in cells/mL at collection. Typical range: 1e6 - 2e7 cells/mL. |
| | decimal | 7 | Percentage of cells that are viable at collection. Range: 0-100%. CQ7 compares >90% vs <50% viability. |
| | string | 4, 8 | Identifier for a specific clone within a cell line. Used to compare expression between clones (CQ4) and link to quality data (CQ8). |
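The CQ7 comparison mentioned above (>90% vs <50% viability) amounts to assigning each sample to one of two groups and leaving the middle range out. A minimal Python sketch of that grouping follows. The function name and the group labels are hypothetical; the actual comparison is performed in the SPARQL query, not in Python.

```python
from typing import Optional


def viability_group(viability_pct: Optional[float]) -> Optional[str]:
    """Assign a sample to the 'high' (>90%) or 'low' (<50%) viability group.

    Returns None for missing values and for samples in the 50-90% range,
    which belong to neither comparison group.
    """
    if viability_pct is None:
        return None
    if viability_pct > 90:
        return "high"
    if viability_pct < 50:
        return "low"
    return None


print(viability_group(95.0))  # high
print(viability_group(42.5))  # low
print(viability_group(75.0))  # None
```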
ProductType Classification
The ProductType column determines how the product is modeled in RDF:
| ProductType Value | RDF Class | Example | Notes |
|---|---|---|---|
| Gene symbol (all caps, 2-10 chars) | | AMBP, CCL20, FN1 | Creates gene via |
| Antibody terms | | mAb, IgG, BsAb | Subclass of ProteinProduct |
| Control terms | (skipped) | Control, WT, Mock | No product created |
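The classification rules in the table can be sketched as a small Python function. This is an illustration under stated assumptions, not the converter's code: the term sets contain only the example values from the table, the regular expression is one reading of "all caps, 2-10 chars" that also allows digits (as in CCL20 and FN1), and the returned labels are hypothetical stand-ins for the RDF classes.

```python
import re

# Example terms from the table above; the converter's real lists may be longer.
ANTIBODY_TERMS = {"mab", "igg", "bsab"}
CONTROL_TERMS = {"control", "wt", "mock"}
# All caps, 2-10 characters; digits allowed after the first letter (e.g., CCL20, FN1).
GENE_SYMBOL = re.compile(r"[A-Z][A-Z0-9]{1,9}")


def classify_product_type(value: str) -> str:
    """Return 'control', 'antibody', 'gene', or 'unknown' for a ProductType value."""
    text = value.strip()
    key = text.lower()
    # Control terms are checked first because "WT" also looks like a gene symbol.
    if key in CONTROL_TERMS:
        return "control"
    if key in ANTIBODY_TERMS:
        return "antibody"
    if GENE_SYMBOL.fullmatch(text):
        return "gene"
    return "unknown"


for example in ["AMBP", "mAb", "WT"]:
    print(example, classify_product_type(example))
```

The order of the checks matters: a short all-caps control label such as WT would otherwise be treated as a gene symbol.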
Expression Data
Gene expression comes from separate matrix files (not metadata columns):
SampleAccession,ACTB,GAPDH,TP53
DEMO001_SAMPLE_A,1000,800,250
First column: SampleAccession (must match metadata)
Other columns: gene symbols with expression values
Ensembl IDs for expression genes: use gene_annotations.csv
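To make the matrix layout concrete, the sketch below reads the example matrix shown above and flattens it into one (sample, gene, value) record per gene-sample pair. It uses only the Python standard library. The function name `iter_measurements` is hypothetical, and skipping empty cells is an assumption about how missing values should be treated.

```python
import csv
import io

MATRIX = """SampleAccession,ACTB,GAPDH,TP53
DEMO001_SAMPLE_A,1000,800,250
"""


def iter_measurements(handle):
    """Yield (sample, gene, value) for every non-empty cell in the matrix."""
    reader = csv.DictReader(handle)
    for row in reader:
        sample = row["SampleAccession"]
        for gene, raw in row.items():
            # Skip the ID column, overflow cells (key None), and empty cells.
            if gene in (None, "SampleAccession") or not raw:
                continue
            yield sample, gene, float(raw)


measurements = list(iter_measurements(io.StringIO(MATRIX)))
print(len(measurements))  # 3
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper.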
CQ-to-Column Quick Reference
Each competency query (CQ) uses specific columns. See Data Dictionary for full definitions.
| CQ | Question | Required Columns |
|---|---|---|
| CQ1 | Culture conditions for high productivity | |
| CQ2 | CHO lines overexpressing genes | |
| CQ3 | Nutrients for viability at day 6 | |
| CQ4 | Expression between clones | |
| CQ5 | Process type counts | |
| CQ6 | Genes correlated with productivity | |
| CQ7 | Genes by viability threshold | |
| CQ8 | Cell lines for quality profiles | |
Design Decision: Single Table vs. Normalized Schema
We chose a single flat table (sample_metadata.csv) because:
Curation simplicity: Domain experts can edit in Excel/Google Sheets without joins
1:1 relationships: Most bioprocessing studies have one run → one sample
Sparse data: Not every study has every column; flat tables handle this naturally
Expression is separate: The high-dimensional gene expression data is already in its own matrix file
Trade-offs accepted:
Wide tables with many columns (36)
Some column redundancy across rows (e.g., same CellLine repeated)
Not ideal for complex many-to-many relationships
If you need normalized schema later:
The RDF output IS normalized (each entity is a distinct node)
You could create studies.csv, runs.csv, samples.csv and modify build_graph.py
For now, the flat approach works well for <1000 samples per study
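The point that repeated values in the flat table become distinct nodes can be illustrated without any RDF library. In the sketch below, three flat rows share two CellLine values, and each distinct value is assigned exactly one node ID. The rows, the node-ID scheme, and the variable names are all hypothetical; only the column names SampleAccession and CellLine come from this page.

```python
# Illustrative flat rows; values are made up for the example.
rows = [
    {"SampleAccession": "DEMO001_SAMPLE_A", "CellLine": "CHO-K1"},
    {"SampleAccession": "DEMO001_SAMPLE_B", "CellLine": "CHO-K1"},
    {"SampleAccession": "DEMO001_SAMPLE_C", "CellLine": "CHO-S"},
]

cell_line_nodes = {}  # CellLine value -> node ID (one node per distinct value)
edges = []            # (sample, cell line node) links

for row in rows:
    # setdefault reuses the existing node when the cell line was seen before.
    node = cell_line_nodes.setdefault(
        row["CellLine"], f"cellline_{len(cell_line_nodes)}"
    )
    edges.append((row["SampleAccession"], node))

print(len(rows), len(cell_line_nodes))  # 3 2
```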
Implementation Notes
Empty values are OK: The csv_to_rdf.py converter handles missing/NA values gracefully. Adding empty columns won’t break existing data.
Multi-valued fields: If a sample has multiple quality types, use semicolon-separated values (e.g., "Glycosylation;Aggregation").
Gene expression data: See Expression Data section above. Creates one mcbo:GeneExpressionMeasurement per gene-sample pair.
Quality measurements: QualityType values include Glycosylation, Aggregation, ChargeVariants, etc.
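The handling of multi-valued and missing cells described in these notes can be sketched as follows. This is not the csv_to_rdf.py implementation: `split_multi` is a hypothetical helper, and the set of strings treated as missing is an assumption, since this page only says that missing/NA values are handled.

```python
from typing import List, Optional

# Assumed missing-value markers; the converter's actual list may differ.
MISSING = {"", "na", "n/a", "nan"}


def split_multi(value: Optional[str]) -> List[str]:
    """Split a semicolon-separated cell into clean values, dropping missing ones."""
    if value is None:
        return []
    parts = [part.strip() for part in value.split(";")]
    return [part for part in parts if part.lower() not in MISSING]


print(split_multi("Glycosylation;Aggregation"))  # ['Glycosylation', 'Aggregation']
print(split_multi("NA"))                         # []
```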
Running as Python Modules
Commands can also be run as Python modules:
python -m mcbo.csv_to_rdf --help
python -m mcbo.build_graph --help
python -m mcbo.run_eval --help
python -m mcbo.stats_eval_graph --help