CLI Reference

MCBO provides four command-line tools for working with bioprocessing data and the ontology.

Available Commands

Command	Description
`mcbo-csv-to-rdf`	Convert CSV metadata to RDF instances (with optional expression data)
`mcbo-build-graph`	Build graphs from study directories or a single CSV (bootstrap, build, merge, add-study)
`mcbo-run-eval`	Run SPARQL competency queries
`mcbo-stats`	Generate graph statistics

mcbo-build-graph

Build and manage evaluation graphs.

Subcommands

bootstrap - Create a graph from a single CSV

mcbo-build-graph bootstrap \
  --csv .data/sample_metadata.csv \
  --output .data/graph.ttl

build - Build from study directories

mcbo-build-graph build \
  --studies-dir .data/studies \
  --output .data/graph.ttl

# Or with config-by-convention
mcbo-build-graph build --data-dir .data

add-study - Add a study incrementally

mcbo-build-graph add-study \
  --study-dir .data/studies/my_new_study \
  --instances .data/mcbo-instances.ttl

merge - Merge instances with ontology

mcbo-build-graph merge \
  --ontology ontology/mcbo.owl.ttl \
  --instances .data/mcbo-instances.ttl \
  --output .data/graph.ttl

Options

--data-dir DIR       Use config-by-convention (auto-resolves paths)
--csv FILE           Input CSV file (for bootstrap)
--studies-dir DIR    Directory containing study subdirectories
--study-dir DIR      Single study directory to add
--instances FILE     Path to instances TTL file
--output FILE        Output graph file
--ontology FILE      Ontology TTL file (default: ontology/mcbo.owl.ttl)
--expression-dir DIR Directory with per-study expression matrices
--expression-matrix FILE  Single expression matrix file
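
When the incremental workflow is driven from a script, the add-study and merge subcommands can be chained. The following is a minimal sketch, assuming `mcbo-build-graph` is installed and on PATH. The helper names (`add_study_cmd`, `merge_cmd`, `add_and_merge`) are hypothetical, and only flags listed above are used.

```python
import subprocess
from pathlib import Path
from typing import List


def add_study_cmd(study_dir: Path, instances: Path) -> List[str]:
    """Argument list for `mcbo-build-graph add-study` (flags as documented above)."""
    return ["mcbo-build-graph", "add-study",
            "--study-dir", str(study_dir),
            "--instances", str(instances)]


def merge_cmd(ontology: Path, instances: Path, output: Path) -> List[str]:
    """Argument list for `mcbo-build-graph merge` (flags as documented above)."""
    return ["mcbo-build-graph", "merge",
            "--ontology", str(ontology),
            "--instances", str(instances),
            "--output", str(output)]


def add_and_merge(study_dir: Path, data_dir: Path,
                  ontology: Path = Path("ontology/mcbo.owl.ttl")) -> None:
    """Add one study to the instances file, then rebuild the merged graph."""
    instances = data_dir / "mcbo-instances.ttl"
    output = data_dir / "graph.ttl"
    for cmd in (add_study_cmd(study_dir, instances),
                merge_cmd(ontology, instances, output)):
        # check=True raises CalledProcessError if the tool exits non-zero
        subprocess.run(cmd, check=True)
```

Calling `add_and_merge(Path(".data/studies/my_new_study"), Path(".data"))` would run the same two commands shown in the add-study and merge examples.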

mcbo-csv-to-rdf

Low-level CSV to RDF conversion.

mcbo-csv-to-rdf \
  --csv_file .data/sample_metadata.csv \
  --output_file .data/mcbo-instances.ttl

With expression data:

mcbo-csv-to-rdf \
  --csv_file .data/sample_metadata.csv \
  --output_file .data/mcbo-instances.ttl \
  --expression_dir .data/expression/

Options

--csv_file FILE          Input CSV metadata file (required)
--output_file FILE       Output TTL file (required)
--expression_matrix FILE Single expression matrix CSV
--expression_dir DIR     Directory with per-study expression CSVs

mcbo-run-eval

Run SPARQL competency question queries.

# Using config-by-convention
mcbo-run-eval --data-dir data.sample

# Using explicit paths
mcbo-run-eval \
  --graph data.sample/graph.ttl \
  --queries eval/queries \
  --results data.sample/results

Options

--data-dir DIR      Use config-by-convention
--graph FILE        Input graph TTL file
--queries DIR       Directory with .rq query files (default: eval/queries)
--results DIR       Output directory for TSV results
--verify            Only verify graph parses, don't run queries
--fail-on-empty     Exit with error if any CQ returns 0 results
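
A check similar in spirit to --fail-on-empty can be run after the fact by scanning the results directory. The sketch below assumes each query writes one `.tsv` file whose first line is a header row. That layout is an assumption and is not specified in this reference, and `empty_result_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def empty_result_files(results_dir: Path) -> List[str]:
    """Return the names of .tsv files that contain no data rows.

    Assumes the first non-blank line of each file is a header (an assumption).
    """
    empty = []
    for tsv in sorted(results_dir.glob("*.tsv")):
        lines = [ln for ln in tsv.read_text(encoding="utf-8").splitlines()
                 if ln.strip()]
        if len(lines) <= 1:  # header only, or a completely empty file
            empty.append(tsv.name)
    return empty
```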

mcbo-stats

Generate statistics about a graph.

mcbo-stats --data-dir data.sample

# Or with explicit path
mcbo-stats --graph .data/graph.ttl

Output includes:

Total cell culture process instances (by type: Batch, Fed-batch, Perfusion, Unknown)
Total bioprocess sample instances

Config-by-Convention

The tools that accept --data-dir (mcbo-build-graph, mcbo-run-eval, and mcbo-stats, per the option lists and examples above) use it for automatic path resolution:

# These are equivalent:
mcbo-run-eval --data-dir data.sample
mcbo-run-eval --graph data.sample/graph.ttl --results data.sample/results

# Convention: <data-dir>/ contains:
#   graph.ttl           - merged evaluation graph
#   mcbo-instances.ttl  - instance data (ABox)
#   results/            - CQ query results
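
The convention amounts to joining three fixed names onto the data directory. A minimal sketch of that resolution follows. `DataDirPaths` and `resolve_data_dir` are hypothetical names, and the real tools may resolve additional paths.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DataDirPaths:
    graph: Path       # merged evaluation graph
    instances: Path   # instance data (ABox)
    results: Path     # CQ query results directory


def resolve_data_dir(data_dir: str) -> DataDirPaths:
    """Map a data directory to the conventional file locations listed above."""
    root = Path(data_dir)
    return DataDirPaths(
        graph=root / "graph.ttl",
        instances=root / "mcbo-instances.ttl",
        results=root / "results",
    )
```

For example, `resolve_data_dir("data.sample").graph` is `data.sample/graph.ttl`, matching the explicit `--graph` argument shown above.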

Data Dictionary

This is the authoritative reference for CSV column definitions. All metadata files (sample_metadata.csv) use these 36 columns.

Column Overview

Category	Count	Purpose
Identifiers	4	Sample/run/study IDs
Dataset Provenance	8	Source database and sequencing metadata
Cell Line	6	Cell line characteristics
Culture Conditions	5	Temperature, pH, nutrients
Process & Productivity	6	Process type and production metrics
Product	3	What the cell line produces
Sample State	4	Time-point and viability data
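
The column inventory can be written down as data and used to sanity-check a metadata file's header. This is a sketch only: the column names are taken from the category tables in this dictionary, while `check_header` is a hypothetical helper that reports differences rather than rejecting a file, since not every study fills every column.

```python
import csv
import io
from typing import List, Tuple

COLUMNS_BY_CATEGORY = {
    "Identifiers": ["RunAccession", "SampleAccession", "StudyID", "FullSampleName"],
    "Dataset Provenance": ["DatasetAccession", "DatasetReadable", "DatasetName",
                           "DatasetAbbrev", "Author", "LibraryStrategy",
                           "PairedEnd", "Source"],
    "Cell Line": ["CellLine", "Host", "CellLineSource", "CellLineExact",
                  "SelectionMarker", "Growth"],
    "Culture Conditions": ["Temperature", "pH", "DissolvedOxygen", "Glutamine",
                           "GlutamineConcentration"],
    "Process & Productivity": ["ProcessType", "CulturePhase", "Productivity",
                               "Stability", "TiterValue", "QualityType"],
    "Product": ["Producer", "ProductType", "EnsemblGeneID"],
    "Sample State": ["CollectionDay", "ViableCellDensity", "ViabilityPercentage",
                     "CloneID"],
}
ALL_COLUMNS = [c for cols in COLUMNS_BY_CATEGORY.values() for c in cols]


def check_header(csv_text: str) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) column names for a metadata CSV."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    missing = [c for c in ALL_COLUMNS if c not in header]
    unexpected = [c for c in header if c not in ALL_COLUMNS]
    return missing, unexpected
```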

Identifiers

Column	Type	CQs	Definition
`RunAccession`	string	all	Unique identifier for a bioprocess run. Typically an SRA run accession (e.g., ERR4319927) or internal ID. Primary key for joining with expression data.
`SampleAccession`	string	all	Unique identifier for a biological sample. Typically an SRA sample accession (e.g., ERS4805133). In most studies, each run corresponds to exactly one sample.
`StudyID`	string	all	Identifier grouping related samples into a study (e.g., study_dhiman). Used to organize data by publication or project.
`FullSampleName`	string	—	Human-readable descriptive name combining cell line, conditions, and other metadata. For display purposes.

Dataset Provenance

Column	Type	CQs	Definition
`DatasetAccession`	string	—	Primary database accession for the dataset (e.g., ERP122753, SRP066848). Links to external repositories.
`DatasetReadable`	string	—	Human-readable version of the dataset accession. Often same as DatasetAccession.
`DatasetName`	string	—	Short name for the dataset, typically author surname (e.g., Dhiman, vanWijk, Hefzi).
`DatasetAbbrev`	string	—	Single-letter or short abbreviation for the dataset (e.g., D, vW, H). Used in compact displays.
`Author`	string	—	Lead author or principal investigator name for the study.
`LibraryStrategy`	string	—	RNA-seq library preparation method. Values: rRNA (ribosomal depletion), PolyA (poly-A selection).
`PairedEnd`	boolean	—	Whether sequencing used paired-end reads. TRUE = paired-end, FALSE = single-end.
`Source`	string	—	Origin of the data. Values: SRA (public repository), In-House (internal data).

Cell Line

Column	Type	CQs	Definition
`CellLine`	string	1-8	Cell line name used in the bioprocess. Common values: CHO-K1, CHO-S, CHO-DG44, CHO-DXB11, HEK293. CHO lines are classified as `mcbo:CHOCellLine`.
`Host`	boolean	—	Whether this is a host/parental cell line (not producing recombinant product). TRUE = host line, FALSE = producer line.
`CellLineSource`	enum	—	Commercial availability. Values: Commercial, Non-Commercial.
`CellLineExact`	string	—	Specific vendor or source of the cell line (e.g., ATCC, Horizon, Life Technologies, Thermo Fisher).
`SelectionMarker`	string	—	Genetic selection system used. Values: GS+/- (glutamine synthetase heterozygous), GS-/- (knockout), DHFR, ProcessEvolved.
`Growth`	enum	—	Qualitative growth rate assessment. Values: Low, Medium, High.

Culture Conditions

Column	Type	CQs	Definition
`Temperature`	decimal	1	Culture temperature in degrees Celsius. Typical range: 31-37°C. Lower temperatures are often used to improve productivity.
`pH`	decimal	1	Culture medium pH. Typical range: 6.8-7.4. Critical for cell viability and product quality.
`DissolvedOxygen`	decimal	1	Dissolved oxygen as percentage of air saturation. Typical range: 20-60%. Affects metabolism and productivity.
`Glutamine`	boolean	—	Whether glutamine was supplemented in the medium. TRUE = present, FALSE = absent or glutamine-free medium.
`GlutamineConcentration`	decimal	3	Glutamine concentration in millimolar (mM). Typical range: 0-8 mM. Key nutrient affecting growth and ammonia production.

Process & Productivity

Column	Type	CQs	Definition
`ProcessType`	enum	5	Bioreactor operating mode. Values: Batch (no feeding), FedBatch (nutrient feeding), Perfusion (continuous media exchange). Maps to `mcbo:BatchCultureProcess`, `mcbo:FedBatchCultureProcess`, `mcbo:PerfusionCultureProcess`.
`CulturePhase`	enum	4, 6	Growth phase at sample collection. Values: EarlyExp, MidExp, LateExp (exponential sub-phases), Stationary.
`Productivity`	enum	1, 6	Qualitative productivity assessment. Values: VeryHigh, High, Medium, Low. CQ1 filters for High/VeryHigh.
`Stability`	boolean	—	Whether the cell line shows stable transgene expression over passages. TRUE = stable, FALSE = unstable.
`TiterValue`	decimal	8	Product concentration in mg/L. Final or harvest titer of recombinant protein/antibody.
`QualityType`	string	8	Product quality attribute being assessed. Values: Glycosylation, Aggregation, ChargeVariants, etc.
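
The ProcessType mapping and the CQ1 productivity filter described in this table can be expressed as simple lookups. This is an illustrative sketch: the class names come from the table, but `process_class` and `is_high_productivity` are hypothetical helpers, and returning None for unrecognized values is an assumption about how such rows would be handled.

```python
from typing import Optional

PROCESS_TYPE_CLASSES = {
    "Batch": "mcbo:BatchCultureProcess",
    "FedBatch": "mcbo:FedBatchCultureProcess",
    "Perfusion": "mcbo:PerfusionCultureProcess",
}
HIGH_PRODUCTIVITY = {"High", "VeryHigh"}  # the values CQ1 filters for


def process_class(process_type: Optional[str]) -> Optional[str]:
    """Return the MCBO class for a ProcessType value, or None if unrecognized."""
    if not process_type:
        return None
    return PROCESS_TYPE_CLASSES.get(process_type.strip())


def is_high_productivity(productivity: Optional[str]) -> bool:
    """True when a Productivity value would pass the CQ1 filter."""
    return bool(productivity) and productivity.strip() in HIGH_PRODUCTIVITY
```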

Product

Column	Type	CQs	Definition
`Producer`	boolean	2	Whether this cell line produces recombinant product. TRUE = producer (creates `mcbo:overexpressesGene` link), FALSE = host/control.
`ProductType`	string	2, 8	The recombinant product being produced. Three categories: (1) Gene symbol (e.g., AMBP, CCL20) → creates `mcbo:ProteinProduct` with `encodedByGene` link; (2) Antibody term (mAb, IgG, BsAb) → creates `mcbo:AntibodyProduct`; (3) Control (Control, WT, Mock) → skipped.
`EnsemblGeneID`	string	—	Ensembl stable gene identifier for the product gene. Only applicable when ProductType is a gene symbol. Format: ENSG followed by 11 digits (e.g., ENSG00000106927 for AMBP). Links to Ensembl database for cross-referencing genomic data. Creates `mcbo:hasEnsemblGeneID` property on the gene.
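
The stated EnsemblGeneID format (ENSG followed by 11 digits) can be checked with a regular expression before conversion. This is a sketch of such a check, and `is_valid_ensembl_gene_id` is a hypothetical helper name.

```python
import re

# "ENSG" followed by exactly 11 ASCII digits, as described above
ENSEMBL_GENE_ID = re.compile(r"ENSG[0-9]{11}")


def is_valid_ensembl_gene_id(value: str) -> bool:
    """True when the whole value matches the documented ENSG format."""
    return ENSEMBL_GENE_ID.fullmatch(value.strip()) is not None
```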

Sample State

Column	Type	CQs	Definition
`CollectionDay`	integer	3	Day of culture when sample was collected. Day 0 = inoculation. CQ3 specifically queries day 6 samples.
`ViableCellDensity`	decimal	3	Viable cell concentration in cells/mL at collection. Typical range: 1e6 - 2e7 cells/mL.
`ViabilityPercentage`	decimal	7	Percentage of cells that are viable at collection. Range: 0-100%. CQ7 compares >90% vs <50% viability.
`CloneID`	string	4, 8	Identifier for a specific clone within a cell line. Used to compare expression between clones (CQ4) and link to quality data (CQ8).
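
The CQ7 comparison splits samples by the two viability thresholds named above. The sketch below shows one way to bucket a ViabilityPercentage value. The "high" and "low" labels and the function name are hypothetical, and samples between the thresholds are assumed to fall outside the comparison.

```python
from typing import Optional


def viability_group(pct: Optional[float]) -> Optional[str]:
    """Bucket a viability percentage using the CQ7 thresholds (>90 vs <50)."""
    if pct is None:
        return None
    if pct > 90:
        return "high"
    if pct < 50:
        return "low"
    return None  # between the thresholds: in neither group
```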

ProductType Classification

The ProductType column determines how the product is modeled in RDF:

ProductType Value	RDF Class	Example	Notes
Gene symbol (all caps, 2-10 chars)	`mcbo:ProteinProduct`	AMBP, CCL20, FN1	Creates gene via `encodedByGene`; add `EnsemblGeneID`
Antibody terms	`mcbo:AntibodyProduct`	mAb, IgG, BsAb	Subclass of ProteinProduct
Control terms	(skipped)	Control, WT, Mock	No product created
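
The classification in this table can be sketched as an ordered series of checks. This is not the converter's actual code: the term sets contain only the examples listed here, exact case-sensitive matching is an assumption, and digits are allowed in gene symbols because the examples (CCL20, FN1) contain them. Control terms are tested first because a value such as WT would otherwise look like a gene symbol.

```python
import re
from typing import Optional

ANTIBODY_TERMS = {"mAb", "IgG", "BsAb"}
CONTROL_TERMS = {"Control", "WT", "Mock"}
# 2-10 characters, upper-case letters and digits, starting with a letter
GENE_SYMBOL = re.compile(r"[A-Z][A-Z0-9]{1,9}")


def classify_product_type(value: str) -> Optional[str]:
    """Return the RDF class for a ProductType value, or None if no product is created."""
    value = value.strip()
    if not value or value in CONTROL_TERMS:
        return None  # control rows are skipped
    if value in ANTIBODY_TERMS:
        return "mcbo:AntibodyProduct"
    if GENE_SYMBOL.fullmatch(value):
        return "mcbo:ProteinProduct"
    return None  # unrecognized values are left unclassified in this sketch
```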

Expression Data

Gene expression comes from separate matrix files (not metadata columns):

SampleAccession,ACTB,GAPDH,TP53
DEMO001_SAMPLE_A,1000,800,250

First column: SampleAccession (must match metadata)
Other columns: gene symbols with expression values
Ensembl IDs for expression genes: use gene_annotations.csv
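
To make the matrix layout concrete, the sketch below reads a matrix in this format and yields one (sample, gene, value) triple per gene-sample pair. `iter_measurements` is a hypothetical helper, not part of the MCBO tools, and skipping empty cells is an assumption.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple


def iter_measurements(handle: TextIO) -> Iterator[Tuple[str, str, float]]:
    """Yield (sample_accession, gene_symbol, value) for each gene-sample pair."""
    for row in csv.DictReader(handle):
        sample = row["SampleAccession"]
        for gene, raw in row.items():
            if gene is None or gene == "SampleAccession":
                continue  # skip overflow cells and the key column
            if raw is None or raw.strip() == "":
                continue  # skip empty cells (assumption)
            yield sample, gene, float(raw)


matrix = io.StringIO(
    "SampleAccession,ACTB,GAPDH,TP53\n"
    "DEMO001_SAMPLE_A,1000,800,250\n"
)
for sample, gene, value in iter_measurements(matrix):
    print(sample, gene, value)
```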

CQ-to-Column Quick Reference

Each competency query (CQ) uses specific columns. See Data Dictionary for full definitions.

CQ	Question	Required Columns
CQ1	Culture conditions for high productivity	`Temperature`, `pH`, `DissolvedOxygen`, `Productivity`
CQ2	CHO lines overexpressing genes	`CellLine` (CHO*), `Producer`, `ProductType`
CQ3	Nutrients for viability at day 6	`CellLine`, `GlutamineConcentration`, `CollectionDay`, `ViableCellDensity`
CQ4	Expression between clones	`CellLine`, `CloneID`, `CulturePhase` + expression matrix
CQ5	Process type counts	`ProcessType`
CQ6	Genes correlated with productivity	`CulturePhase`, `Productivity` + expression matrix
CQ7	Genes by viability threshold	`ViabilityPercentage` + expression matrix
CQ8	Cell lines for quality profiles	`CellLine`, `CloneID`, `TiterValue`, `QualityType`
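
The table above can double as a readiness check: given parsed metadata rows, list the CQs whose required columns all contain at least one value. This is a rough sketch with hypothetical helper names. It treats empty strings and "NA" as missing (an assumption) and does not check for the expression matrix that CQ4, CQ6, and CQ7 also need.

```python
from typing import Dict, Iterable, List, Optional, Set

CQ_REQUIRED_COLUMNS = {
    "CQ1": ["Temperature", "pH", "DissolvedOxygen", "Productivity"],
    "CQ2": ["CellLine", "Producer", "ProductType"],
    "CQ3": ["CellLine", "GlutamineConcentration", "CollectionDay", "ViableCellDensity"],
    "CQ4": ["CellLine", "CloneID", "CulturePhase"],
    "CQ5": ["ProcessType"],
    "CQ6": ["CulturePhase", "Productivity"],
    "CQ7": ["ViabilityPercentage"],
    "CQ8": ["CellLine", "CloneID", "TiterValue", "QualityType"],
}


def populated_columns(rows: Iterable[Dict[str, Optional[str]]]) -> Set[str]:
    """Columns that hold at least one non-missing value across all rows."""
    populated = set()
    for row in rows:
        for col, val in row.items():
            if val is not None and val.strip() not in ("", "NA"):
                populated.add(col)
    return populated


def supported_cqs(rows: Iterable[Dict[str, Optional[str]]]) -> List[str]:
    """CQs whose required metadata columns are all populated."""
    have = populated_columns(rows)
    return [cq for cq, cols in CQ_REQUIRED_COLUMNS.items()
            if all(c in have for c in cols)]
```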

Design Decision: Single Table vs. Normalized Schema

We chose a single flat table (sample_metadata.csv) because:

Curation simplicity: Domain experts can edit in Excel/Google Sheets without joins
1:1 relationships: Most bioprocessing studies have one run → one sample
Sparse data: Not every study has every column; flat tables handle this naturally
Expression is separate: The high-dimensional gene expression data is already in its own matrix file

Trade-offs accepted:

Wide tables with many columns (36)
Some column redundancy across rows (e.g., same CellLine repeated)
Not ideal for complex many-to-many relationships

If you need a normalized schema later:

The RDF output IS normalized (each entity is a distinct node)
You could create studies.csv, runs.csv, samples.csv and modify build_graph.py
For now, the flat approach works well for <1000 samples per study

Implementation Notes

Empty values are OK: The csv_to_rdf.py converter handles missing/NA values gracefully. Adding empty columns won’t break existing data.
Multi-valued fields: If a sample has multiple quality types, use semicolon-separated values (e.g., "Glycosylation;Aggregation").
Gene expression data: See Expression Data section above. Creates one mcbo:GeneExpressionMeasurement per gene-sample pair.
Quality measurements: QualityType values include Glycosylation, Aggregation, ChargeVariants, etc.
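
The missing-value and multi-value conventions above can be sketched as two small helpers. These are illustrative only: the exact tokens that csv_to_rdf.py treats as missing are not listed here, so the set below (empty string and NA) is an assumption, as are the function names.

```python
from typing import List, Optional

MISSING_TOKENS = {"", "NA"}  # assumption: tokens treated as "no value"


def clean_cell(raw: Optional[str]) -> Optional[str]:
    """Return the trimmed cell value, or None when the cell is missing."""
    if raw is None:
        return None
    value = raw.strip()
    return None if value in MISSING_TOKENS else value


def split_multi(raw: Optional[str]) -> List[str]:
    """Split a semicolon-separated cell into its individual values."""
    value = clean_cell(raw)
    if value is None:
        return []
    return [part.strip() for part in value.split(";") if part.strip()]
```

With these helpers, `split_multi("Glycosylation;Aggregation")` gives two quality types, and an empty or NA cell gives an empty list.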

Running as Python Modules

Commands can also be run as Python modules:

python -m mcbo.csv_to_rdf --help
python -m mcbo.build_graph --help
python -m mcbo.run_eval --help
python -m mcbo.stats_eval_graph --help