Development Guide

This guide covers contributing to MCBO, running quality control checks, and understanding the evaluation framework.

Repository Structure

mcbo/
├── ontology/           # MCBO ontology (TBox)
│   └── mcbo.owl.ttl
├── python/             # Python package (pip install -e python/)
│   ├── mcbo/           # Core library + CLI modules
│   │   ├── __init__.py      # Package exports
│   │   ├── namespaces.py    # Shared RDF namespaces
│   │   ├── graph_utils.py   # Graph loading/creation utilities
│   │   ├── csv_to_rdf.py    # CSV-to-RDF conversion
│   │   ├── build_graph.py   # Graph building
│   │   ├── run_eval.py      # SPARQL evaluation
│   │   └── stats_eval_graph.py  # Statistics
│   └── pyproject.toml  # Package metadata and entry points
├── scripts/            # Shell scripts
│   └── run_all_checks.sh    # Full QC + evaluation runner
├── eval/               # Competency question evaluation
│   └── queries/        # SPARQL query files (*.rq)
├── sparql/             # QC queries for ROBOT
├── reports/            # QC reports
│   └── robot/
├── data.sample/        # Demo data (public)
└── .data/              # Real data (git-ignored)

Quality Control

ROBOT QC Queries

MCBO uses ROBOT for ontology quality control.

Run all QC checks:

make qc

This executes three QC queries:

Query	Purpose
`orphan_classes.rq`	Finds classes without parent classes (orphans)
`duplicate_labels.rq`	Finds classes with duplicate `rdfs:label` values
`missing_definitions.rq`	Finds classes missing `obo:IAO_0000115` (definition) annotations

Manual QC Commands

# Check for orphan classes
java -jar .robot/robot.jar query \
  --input ontology/mcbo.owl.ttl \
  --query sparql/orphan_classes.rq \
  reports/robot/orphan_classes.tsv

# Check for duplicate labels
java -jar .robot/robot.jar query \
  --input ontology/mcbo.owl.ttl \
  --query sparql/duplicate_labels.rq \
  reports/robot/duplicate_labels.tsv

# Check for missing definitions
java -jar .robot/robot.jar query \
  --input ontology/mcbo.owl.ttl \
  --query sparql/missing_definitions.rq \
  reports/robot/missing_definitions.tsv
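
The three commands above differ only in the query name, so they can be generated in a loop. The following is a minimal Python sketch, not part of the mcbo package: `build_qc_command` and `run_qc` are hypothetical helper names, and the paths are copied from the commands above.

```python
import subprocess
from pathlib import Path

# Paths copied from the manual commands above.
ROBOT_JAR = ".robot/robot.jar"
ONTOLOGY = "ontology/mcbo.owl.ttl"
REPORT_DIR = "reports/robot"

# The three QC queries listed in the table above.
QC_QUERIES = ("orphan_classes", "duplicate_labels", "missing_definitions")


def build_qc_command(name: str) -> list[str]:
    """Return the argument list for one ROBOT QC query (hypothetical helper)."""
    return [
        "java", "-jar", ROBOT_JAR, "query",
        "--input", ONTOLOGY,
        "--query", f"sparql/{name}.rq",
        f"{REPORT_DIR}/{name}.tsv",
    ]


def run_qc(names=QC_QUERIES) -> None:
    """Run each QC query in turn; raises CalledProcessError if ROBOT fails."""
    Path(REPORT_DIR).mkdir(parents=True, exist_ok=True)
    for name in names:
        subprocess.run(build_qc_command(name), check=True)
```

Calling `run_qc()` from the repository root should write the same three TSV reports as the manual commands, assuming Java and the ROBOT jar are available.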

Interpreting Results

QC passes if the output TSV contains only a header row (no data rows)
QC warns if the output TSV contains data rows (issues found)

View results:

# Count lines per report (each count includes the header row, so subtract 1 per file)
wc -l reports/robot/*.tsv

# View specific issues
cat reports/robot/orphan_classes.tsv
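
The pass/warn rule above can also be checked programmatically. This is a small sketch under stated assumptions: `count_issues` and `qc_status` are hypothetical names, blank lines are ignored, and an empty report file is treated as a pass.

```python
from pathlib import Path


def count_issues(tsv_path: Path) -> int:
    """Count data rows in a ROBOT TSV report, excluding the header row."""
    text = tsv_path.read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]
    # The first non-blank line is the header; an empty file yields 0.
    return max(len(lines) - 1, 0)


def qc_status(report_dir: Path) -> dict[str, str]:
    """Map each report file name to 'pass' (header only) or 'warn' (data rows)."""
    return {
        tsv.name: "pass" if count_issues(tsv) == 0 else "warn"
        for tsv in sorted(report_dir.glob("*.tsv"))
    }
```

For example, `qc_status(Path("reports/robot"))` returns one entry per report, which avoids the manual header subtraction needed with `wc -l`.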

Competency Question Evaluation

Query Directory

All eight competency question (CQ) queries are in eval/queries/:

CQ	Description
cq1	Culture conditions (pH, DO, temperature) for peak recombinant protein productivity
cq2	Cell lines engineered to overexpress gene Y
cq3	Nutrient concentrations associated with high viable cell density
cq4	Expression variation of gene X between clones
cq5	Differentially expressed pathways under Fed-batch vs Perfusion
cq6	Top genes correlated with recombinant protein productivity in stationary phase
cq7	Genes with highest fold change between high/low viability cells
cq8	Cell lines suited for specific glycosylation profiles

Evaluation Results

Real Data Statistics (724 cell culture processes):

Cell Culture Process Instances: 724
  Batch culture process: 518
  Fed-batch culture process: 135
  Perfusion culture process: 49
  Unknown culture process: 22

Bioprocess Sample Instances: 326

Demo vs Real Data Comparison:

CQ	Real Data (724)	Demo Data (10)	Notes
CQ1	161	13	Culture conditions for productivity
CQ2	3	2	Overexpression engineering
CQ3	0	4	Requires CollectionDay, ViableCellDensity
CQ4	0	144	Requires expression matrix
CQ5	4	3	Process type distribution
CQ6	0	38	Requires expression data
CQ7	0	7	Requires ViabilityPercentage
CQ8	0	3	Requires TiterValue, QualityType

CQs that return 0 results on real data reflect ongoing curation of the fields listed in the Notes column, not query errors. The same queries return results on the demo data, which includes all required fields.
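
To make the comparison concrete, the sketch below encodes the result counts from the table and lists the CQs that return results on demo data but none on real data. The dictionary and the function name `awaiting_curation` are illustrative only and are not part of the mcbo package.

```python
# Result counts copied from the comparison table: CQ -> (real data, demo data).
RESULT_COUNTS = {
    "CQ1": (161, 13),
    "CQ2": (3, 2),
    "CQ3": (0, 4),
    "CQ4": (0, 144),
    "CQ5": (4, 3),
    "CQ6": (0, 38),
    "CQ7": (0, 7),
    "CQ8": (0, 3),
}


def awaiting_curation(counts: dict[str, tuple[int, int]]) -> list[str]:
    """Return CQs with demo results but no real-data results."""
    return [cq for cq, (real, demo) in counts.items() if real == 0 and demo > 0]


print(awaiting_curation(RESULT_COUNTS))
# prints ['CQ3', 'CQ4', 'CQ6', 'CQ7', 'CQ8']
```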

Running Evaluations

# Demo data
mcbo-run-eval --data-dir data.sample

# Real data
mcbo-run-eval --data-dir .data

# Verify graph parses without running queries
mcbo-run-eval --data-dir data.sample --verify

Alternative Query Runners

ROBOT:

robot query \
  --input data.sample/graph.ttl \
  --query eval/queries/cq1.rq \
  data.sample/results/cq1.tsv

Apache Jena (arq):

arq --data data.sample/graph.ttl --query eval/queries/cq1.rq

CI/CD Pipeline

GitHub Actions Workflow

The repository includes a CI/CD workflow at .github/workflows/qc.yml that runs:

Ontology parsing verification
ROBOT QC queries
Demo data build and evaluation

Run the CI pipeline locally:

make ci

This executes:

make install   # Install mcbo package
make qc        # Run ROBOT QC checks
make demo      # Build and evaluate demo data
make verify-demo  # Verify graph parses
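
The sequence above can be reproduced outside make. This is a rough Python sketch that assumes fail-fast behaviour (stop at the first failing target, as make does by default); `run_make` and `run_ci` are hypothetical names, and the `runner` parameter exists only so the sequence can be exercised without invoking make.

```python
import subprocess
from typing import Callable, Sequence

# Targets in the order listed above.
CI_TARGETS = ("install", "qc", "demo", "verify-demo")


def run_make(target: str) -> int:
    """Run one make target and return its exit code."""
    return subprocess.run(["make", target]).returncode


def run_ci(
    targets: Sequence[str] = CI_TARGETS,
    runner: Callable[[str], int] = run_make,
) -> list[str]:
    """Run targets in order, stopping at the first non-zero exit code.

    Returns the targets that completed successfully.
    """
    completed: list[str] = []
    for target in targets:
        if runner(target) != 0:
            break
        completed.append(target)
    return completed
```

If `run_ci()` returns fewer than four targets, the first target missing from the result is the one that failed.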

Running All Checks

For a complete QC and evaluation run:

bash scripts/run_all_checks.sh

This runs:

Ontology parsing verification
All ROBOT QC queries
Demo data build and evaluation
Real data build and evaluation (if .data/ exists)

Contributing

Submitting Changes

Fork the repository
Create a feature branch
Make your changes
Run QC checks: make qc
Run demo evaluation: make demo
Submit a pull request

Term Requests

To request new ontology terms:

Go to GitHub Issues
Click “New Issue”
Select “MCBO Term Request”
Fill in the template

Coding Standards

Python code should follow PEP 8
Use type hints where practical
Add docstrings to public functions
Keep functions focused and small
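
As an illustration of these standards, here is a short function with type hints, a docstring, and a single responsibility. `count_by_label` is a made-up example and does not exist in the mcbo package.

```python
def count_by_label(labels: list[str]) -> dict[str, int]:
    """Count how often each label occurs.

    Args:
        labels: Labels to tally, for example culture process type names.

    Returns:
        A mapping from each distinct label to its number of occurrences.
    """
    counts: dict[str, int] = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return counts
```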

Testing Changes

Before submitting:

# Install package in development mode
pip install -e python/

# Run full QC + demo
make all

# Verify no regressions
cat data.sample/results/SUMMARY.txt

Makefile Targets

Target	Description
`make all`	Run demo + qc (default)
`make demo`	Build and evaluate demo data
`make real`	Build and evaluate real data (.data/)
`make qc`	Run ROBOT QC checks on ontology
`make clean`	Remove generated files
`make install`	Install mcbo package
`make robot`	Download ROBOT jar
`make ci`	Run full CI pipeline
`make docs`	Build Sphinx documentation

See make help for the complete list.