# MCBO LLM Agent The MCBO Agent is an LLM-powered system that answers competency questions about bioprocessing data using tool-calling and SPARQL queries. ## Table of Contents 1. [Quick Start](#quick-start) 2. [Installation](#installation) 3. [Usage](#usage) 4. [MCP Server](#mcp-server) 5. [Customization Guide](#customization-guide) 6. [Architecture](#architecture) --- ## Quick Start ```bash # 1. Install agent dependencies make install-agent # 2. Set your API key (choose one) export OPENAI_API_KEY=sk-... # OpenAI # OR export ANTHROPIC_API_KEY=sk-ant-... # Anthropic # 3. Run a query mcbo-agent-eval --data-dir data.sample --cq CQ1 # Or with local LLM (no API key needed) make install-ollama mcbo-agent-eval --data-dir data.sample --cq CQ1 --provider ollama ``` --- ## Installation ### Option 1: OpenAI (Recommended for accuracy) ```bash # Install agent dependencies make install-agent # Set API key export OPENAI_API_KEY=sk-... # Test mcbo-agent-eval --data-dir data.sample --cq CQ1 --provider openai ``` ### Option 2: Anthropic Claude ```bash # Install agent dependencies make install-agent # Set API key export ANTHROPIC_API_KEY=sk-ant-api03-... # Test mcbo-agent-eval --data-dir data.sample --cq CQ1 --provider anthropic ``` ### Option 3: Ollama (Local, Free, Private) Best for: Privacy-sensitive data, offline use, no API costs. ```bash # Install Ollama (Linux) curl -fsSL https://ollama.ai/install.sh | sh # Install Ollama (macOS) brew install ollama # Pull a model (qwen2.5:3b is fast and good at tool calling) ollama pull qwen2.5:3b # Start Ollama server (in background) ollama serve & # Install agent deps make install-agent # Test mcbo-agent-eval --data-dir data.sample --cq CQ1 --provider ollama --model qwen2.5:3b ``` **Recommended Ollama Models:** | Model | Size | Speed | Accuracy | GPU VRAM | |-------|------|-------|----------|----------| | `qwen2.5:3b` | 2GB | ⚡ Fast | Good | 4GB+ | | `qwen2.5:7b` | 4.5GB | Medium | Better | 8GB+ | | `mistral:7b` | 4.5GB | Medium | Good | 8GB+ | | `llama3.1:8b` | 4.7GB | Medium | Better | 8GB+ | --- ## Usage ### Running Competency Questions ```bash # Run a predefined CQ (CQ1-CQ8) mcbo-agent-eval --data-dir data.sample --cq CQ1 # Run a natural language question mcbo-agent-eval --data-dir data.sample \ --cq "What genes are differentially expressed under Fed-batch vs Perfusion in HEK293?" # Run on your real data mcbo-agent-eval --data-dir .data --cq CQ1 # Verbose mode (see tool calls) mcbo-agent-eval --data-dir data.sample --cq CQ1 --verbose # Limit iterations mcbo-agent-eval --data-dir data.sample --cq CQ1 --max-iterations 5 ``` ### Predefined Competency Questions | CQ | Question | |----|----------| | CQ1 | Under what culture conditions (pH, dissolved oxygen, temperature) do the cells reach peak recombinant protein productivity? | | CQ2 | Which cell lines have been engineered to overexpress gene Y? | | CQ3 | Which nutrient concentrations in cell line K are most associated with viable cell density above Z at day 6? | | CQ4 | How does the expression of gene X vary between clone A and clone B? | | CQ5 | What pathways are differentially expressed under Fed-batch vs Perfusion in cell line K? | | CQ6 | Which are the top genes correlated with recombinant protein productivity in the stationary phase? | | CQ7 | Which genes have the highest fold change between cells with viability >90% and those with <50%? | | CQ8 | Which cell lines or subclones are best suited for glycosylation profiles required for therapeutic protein X? | ### Command-Line Options ``` mcbo-agent-eval [OPTIONS] Options: --data-dir PATH Data directory with graph.ttl (default: data.sample) --cq TEXT CQ identifier (CQ1-CQ8) or natural language question --provider TEXT LLM provider: openai, anthropic, ollama, mock (default: auto-detect from env) --model TEXT Model name (e.g., gpt-4-turbo-preview, claude-3-opus, qwen2.5:3b) --verbose, -v Show detailed tool calls and reasoning --max-iterations N Max tool-calling iterations (default: 10) ``` --- ## MCP Server The agent can run as an MCP (Model Context Protocol) server, allowing integration with Claude Desktop, Cursor, or other MCP-compatible clients. ### Starting the Server ```bash # Install MCP dependencies (requires Python 3.10+) pip install -e python/[mcp] # Start the server python -m mcbo.agent.mcp_server ``` ### Claude Desktop Configuration Add to `~/.config/claude/claude_desktop_config.json` (Linux) or `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS): ```json { "mcpServers": { "mcbo": { "command": "python", "args": ["-m", "mcbo.agent.mcp_server"], "cwd": "/path/to/mcbo", "env": { "DATA_DIR": "/path/to/mcbo/data.sample" } } } } ``` ### Testing with MCP Inspector ```bash # Install the inspector npx @anthropic/mcp-inspector # Point it at your server # Then navigate to http://localhost:5173 ``` --- ## Customization Guide ### How the System Prompt Works The agent's behavior is controlled by `SYSTEM_PROMPT` in: ``` python/mcbo/agent/orchestrator.py (line ~34) ``` The prompt has these sections: 1. **SPARQL TEMPLATES** - Maps question types to template names 2. **WORKFLOWS** - Step-by-step instructions for each CQ type 3. **CRITICAL RULES** - Guardrails to prevent hallucination **To modify the prompt:** ```python # In orchestrator.py SYSTEM_PROMPT = """You are an expert bioprocess data analyst... # Add new workflows: For MY_NEW_QUESTION_TYPE: 1. execute_sparql with template "my_template" 2. my_analysis_tool with params... # Add new rules: CRITICAL RULES: - Always cite specific run IDs - Never make up gene names """ ``` ### SPARQL Templates Templates are parameterized SPARQL queries in: ``` python/mcbo/agent/sparql_templates.py ``` **Structure:** ```python SPARQL_TEMPLATES = { "template_name": { "description": "What this template fetches", "params": ["optional", "parameters"], "query": """ PREFIX mcbo: SELECT ?var1 ?var2 WHERE { ?process a ?processType . ... } """ }, ... } ``` **To add a new template:** ```python # In sparql_templates.py SPARQL_TEMPLATES["my_new_template"] = { "description": "Fetches custom data for my use case", "params": ["cell_line"], # Optional parameters "query": """ PREFIX mcbo: PREFIX rdfs: SELECT ?gene ?expressionValue ?cellLine WHERE { ?process mcbo:usesCellLine ?cl . ?cl rdfs:label ?cellLine . FILTER(CONTAINS(?cellLine, "{cell_line}")) ?process mcbo:hasProcessOutput ?sample . ?sample mcbo:hasGeneExpression ?expr . ?expr mcbo:hasExpressionValue ?expressionValue . ?expr ?g . ?g rdfs:label ?gene . } """ } ``` **Then update the system prompt to use it:** ```python # In orchestrator.py, add to WORKFLOWS section: For MY_QUESTION_TYPE: 1. execute_sparql with template "my_new_template" and params {"cell_line": "HEK293"} ``` ### Adding New Tools Tools are defined in: ``` python/mcbo/agent/tools.py ``` **Structure:** ```python TOOL_DEFINITIONS = [ { "name": "tool_name", "description": "What this tool does", "input_schema": { "type": "object", "properties": { "param1": {"type": "string", "description": "..."}, "param2": {"type": "number", "description": "..."} }, "required": ["param1"] } }, ... ] ``` **To add a new tool:** 1. Define the tool schema in `tools.py`: ```python TOOL_DEFINITIONS.append({ "name": "my_new_tool", "description": "Performs custom analysis on bioprocess data", "input_schema": { "type": "object", "properties": { "data": {"type": "array", "description": "Input data"}, "threshold": {"type": "number", "description": "Threshold value"} }, "required": ["data"] } }) ``` 2. Add the implementation in `ToolExecutor`: ```python # In tools.py, in ToolExecutor._execute_single_tool() elif tool_name == "my_new_tool": return self._my_new_tool(**args) def _my_new_tool(self, data: list, threshold: float = 0.5) -> dict: """Custom analysis implementation.""" # Your logic here return {"result": processed_data} ``` 3. Update the system prompt to explain when to use it. ### Statistical Tools Available in `python/mcbo/agent/stats_tools.py`: | Function | Description | |----------|-------------| | `compute_correlation(data, x_col, y_col, method)` | Pearson/Spearman correlation | | `compute_fold_change(data, group_col, val_col, g1, g2)` | Log2 fold change between groups | | `differential_expression(data, group_col, g1, g2, ...)` | T-test with fold change | | `find_peak_conditions(data, condition_cols, metric)` | Find optimal conditions | | `summarize_by_group(data, group_col, val_col, agg)` | Group-wise aggregation | ### Pathway Tools Available in `python/mcbo/agent/pathway_tools.py`: | Function | Description | |----------|-------------| | `get_kegg_pathways(gene_list)` | Query KEGG for pathways | | `get_reactome_pathways(gene_list)` | Query Reactome for pathways | | `perform_enrichment_analysis(gene_list, ...)` | Fisher's exact enrichment | --- ## Architecture ``` ┌─────────────────────────────────────────────────────────────┐ │ Agent Orchestrator │ │ ┌─────────────────────────────────────────────────────────┐│ │ │ System Prompt ││ │ │ - SPARQL template selection guide ││ │ │ - Workflow instructions by question type ││ │ │ - Critical rules (no hallucination) ││ │ └─────────────────────────────────────────────────────────┘│ │ │ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────┐│ │ │ LLM Provider ││ │ │ - OpenAIProvider (gpt-4-turbo-preview) ││ │ │ - AnthropicProvider (claude-3-opus) ││ │ │ - OllamaProvider (qwen2.5:3b, mistral:7b, etc.) ││ │ │ - MockProvider (for testing) ││ │ └─────────────────────────────────────────────────────────┘│ │ │ │ │ ▼ │ │ ┌─────────────────────────────────────────────────────────┐│ │ │ Tool Executor ││ │ │ ││ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ││ │ │ │execute_sparql│ │ stats_tools │ │pathway_tools │ ││ │ │ │ │ │ │ │ │ ││ │ │ │ - Templates │ │ - correlation│ │ - KEGG │ ││ │ │ │ - RDF Graph │ │ - fold_change│ │ - Reactome │ ││ │ │ │ │ │ - diff_expr │ │ - enrichment │ ││ │ │ └──────────────┘ └──────────────┘ └──────────────┘ ││ │ └─────────────────────────────────────────────────────────┘│ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ RDF Graph │ │ (graph.ttl) │ │ │ │ - Cell culture processes (Batch, Fed-batch, Perfusion) │ │ - Cell lines and clones │ │ - Culture conditions (temp, pH, DO) │ │ - Gene expression measurements │ │ - Productivity and viability data │ └─────────────────────────────────────────────────────────────┘ ``` ### File Structure ``` python/mcbo/agent/ ├── __init__.py # Module exports ├── orchestrator.py # Main agent logic, system prompt, LLM providers ├── tools.py # Tool definitions and executor ├── sparql_templates.py # Parameterized SPARQL queries ├── stats_tools.py # Statistical analysis functions ├── pathway_tools.py # Pathway enrichment (KEGG, Reactome) ├── agent_eval.py # CLI entry point └── mcp_server.py # MCP server implementation ``` --- ## Troubleshooting ### "No data found" errors 1. Check that the graph exists: `ls -la data.sample/graph.ttl` 2. Rebuild if needed: `mcbo-build-graph build --data-dir data.sample` 3. Verify data: `mcbo-stats --data-dir data.sample` ### Ollama 404 errors ```bash # Make sure Ollama is running ollama serve # Make sure model is pulled ollama list ollama pull qwen2.5:3b ``` ### Model hallucinating Try: 1. Use a larger model: `--model qwen2.5:7b` 2. Use OpenAI: `--provider openai` 3. Add `--verbose` to see what's happening ### Slow performance For Ollama: - Use smaller model: `qwen2.5:3b` instead of `7b` - Check GPU is being used: `nvidia-smi` - Try quantized versions: `ollama pull qwen2.5:3b-q4_0` --- ## Environment Variables | Variable | Description | Example | |----------|-------------|---------| | `OPENAI_API_KEY` | OpenAI API key | `sk-...` | | `ANTHROPIC_API_KEY` | Anthropic API key | `sk-ant-...` | | `OLLAMA_HOST` | Ollama server URL | `http://localhost:11434` | | `DATA_DIR` | Default data directory | `/path/to/data.sample` | --- ## Testing ```bash # Run agent tests pytest python/tests/test_agent_integration.py -v # Run with mock provider (no LLM needed) mcbo-agent-eval --data-dir data.sample --cq CQ1 --provider mock ```