# DETAILS.md
**Powered by [Detailer](https://detailer.ginylil.com)** - Context-aware codebase analysis
---
## 1. Project Overview
### Project Purpose & Domain
This project is a comprehensive **biomedical AI toolkit and research platform** designed to facilitate **biomedical data analysis, knowledge extraction, and AI-driven reasoning**. It integrates large language models (LLMs), domain-specific bioinformatics tools, and scientific data processing pipelines to enable:
- Automated extraction of biomedical knowledge from literature (e.g., bioRxiv papers)
- Querying and integration of diverse biomedical databases and APIs
- Execution of domain-specific computational biology and physiology analyses
- AI agent orchestration for complex biomedical reasoning and tool invocation
- Benchmarking and evaluation of biomedical tasks and datasets
### Target Users and Use Cases
- **Biomedical researchers and data scientists** seeking to automate literature mining, data retrieval, and analysis workflows.
- **Bioinformaticians** requiring integrated access to multiple biological databases and computational tools.
- **AI researchers** interested in applying LLMs and autonomous agents to biomedical problem solving.
- **Developers and integrators** building domain-specific AI pipelines and scientific workflows.
- Use cases include:
  - Extracting structured biomedical tasks and entities from scientific papers
  - Querying gene, protein, disease, and pathway databases via natural language prompts
  - Running computational models of biological systems (e.g., metabolic networks, signaling)
  - Performing image analysis and quantitative pathology workflows
  - Orchestrating multi-step AI reasoning with tool use and self-criticism
### Core Business Logic and Domain Models
- **Biomedical domain models**: gene IDs, protein structures, pathways, disease ontologies, experimental assays.
- **Task abstractions**: benchmark tasks with prompt/response evaluation (e.g., Humanity's Last Exam, LAB-Bench).
- **Tool metadata schemas**: declarative descriptions of biomedical tools and APIs for dynamic invocation.
- **AI agent workflows**: ReAct-style reasoning graphs integrating LLMs, tool calls, retrieval, and self-critique.
- **Data models**: structured JSON, pandas DataFrames, numpy arrays representing biological data and analysis results.
---
## 2. Architecture and Structure
### High-Level Architecture
The system is organized into modular layers and components:
- **Core Library (`biomni/`)**: Contains main application logic, including:
  - **Agent framework (`biomni/agent/`)**: Implements autonomous AI agents using LLMs and workflow graphs.
  - **Task definitions (`biomni/task/`)**: Abstract base and concrete biomedical benchmark tasks.
  - **Tool implementations (`biomni/tool/`)**: Domain-specific analysis functions, API clients, and computational biology workflows.
  - **Tool metadata (`biomni/tool/tool_description/`)**: Declarative schemas describing tool APIs and parameters.
  - **Model components (`biomni/model/`)**: AI-driven resource retriever for selecting relevant tools and data.
  - **Utility modules (`biomni/utils.py`, `biomni/llm.py`, `biomni/env_desc.py`)**: Helpers for LLM instantiation, system commands, environment descriptions.
  - **Versioning (`biomni/version.py`)**: Package version management.
- **Environment Setup (`biomni_env/`)**: Scripts and configuration files for reproducible environment provisioning, including:
  - Conda environment YAMLs (`environment.yml`, `bio_env.yml`)
  - R package specifications (`r_packages.yml`)
  - CLI tools installer (`install_cli_tools.sh`)
  - Shell scripts for environment setup (`setup.sh`, `setup_path.sh`)
- **Scripts (`biomni/biorxiv_scripts/`)**: Data processing pipelines for literature mining and task extraction.
- **Documentation and Configuration**:
  - Root-level files: `README.md`, `CONTRIBUTION.md`, `pyproject.toml`, `.pre-commit-config.yaml`.
---
### Complete Repository Structure
```
.
├── biomni/ (90 items)
│   ├── agent/
│   │   ├── __init__.py
│   │   ├── a1.py
│   │   ├── env_collection.py
│   │   ├── qa_llm.py
│   │   └── react.py
│   ├── biorxiv_scripts/
│   │   ├── extract_biorxiv_tasks.py
│   │   ├── generate_function.py
│   │   └── process_all_subjects.py
│   ├── model/
│   │   ├── __init__.py
│   │   └── retriever.py
│   ├── task/
│   │   ├── __init__.py
│   │   ├── base_task.py
│   │   ├── hle.py
│   │   └── lab_bench.py
│   ├── tool/ (65 items)
│   │   ├── schema_db/ (25 items)
│   │   │   ├── cbioportal.pkl
│   │   │   ├── clinvar.pkl
│   │   │   ├── dbsnp.pkl
│   │   │   ├── emdb.pkl
│   │   │   ├── ensembl.pkl
│   │   │   ├── geo.pkl
│   │   │   ├── gnomad.pkl
│   │   │   ├── gtopdb.pkl
│   │   │   ├── gwas_catalog.pkl
│   │   │   ├── interpro.pkl
│   │   │   └── ... (15 more files)
│   │   ├── tool_description/ (18 items)
│   │   │   ├── biochemistry.py
│   │   │   ├── bioengineering.py
│   │   │   ├── biophysics.py
│   │   │   ├── cancer_biology.py
│   │   │   ├── cell_biology.py
│   │   │   ├── database.py
│   │   │   ├── genetics.py
│   │   │   ├── genomics.py
│   │   │   ├── immunology.py
│   │   │   ├── literature.py
│   │   │   ├── microbiology.py
│   │   │   ├── molecular_biology.py
│   │   │   ├── pathology.py
│   │   │   ├── pharmacology.py
│   │   │   ├── physiology.py
│   │   │   ├── support_tools.py
│   │   │   ├── synthetic_biology.py
│   │   │   └── systems_biology.py
│   │   ├── __init__.py
│   │   ├── biochemistry.py
│   │   ├── bioengineering.py
│   │   ├── biophysics.py
│   │   ├── cancer_biology.py
│   │   ├── cell_biology.py
│   │   ├── database.py
│   │   ├── genetics.py
│   │   └── ... (12 more files)
│   ├── __init__.py
│   ├── env_desc.py
│   ├── llm.py
│   ├── utils.py
│   └── version.py
├── biomni_env/ (9 items)
│   ├── README.md
│   ├── bio_env.yml
│   ├── cli_tools_config.json
│   ├── environment.yml
│   ├── install_cli_tools.sh
│   ├── install_r_packages.R
│   ├── r_packages.yml
│   ├── setup.sh
│   └── setup_path.sh
├── figs/
│   └── biomni_logo.png
├── tutorials/
│   ├── examples/
│   │   └── cloning.ipynb
│   ├── 101_biomni.ipynb
│   └── biomni_101.ipynb
├── .gitignore
├── .pre-commit-config.yaml
├── CONTRIBUTION.md
├── LICENSE
├── README.md
└── pyproject.toml
```
---
## 3. Technical Implementation Details
### Core Modules and Their Roles
#### `biomni/agent/`
- Implements autonomous AI agents using the **ReAct paradigm**:
  - `react.py`: Main ReAct agent class managing reasoning, tool invocation, retrieval, and self-criticism workflows.
  - `env_collection.py`: Environment and data retrieval utilities.
  - `qa_llm.py`: Question-answering LLM wrappers.
  - `a1.py`: Possibly experimental or auxiliary agent code.
- Uses **langgraph** for workflow graph orchestration and **langchain** for LLM integration.
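
The reason-act-observe cycle behind the ReAct paradigm can be sketched in plain Python. This is a simplified illustration and not the code in `react.py`: the scripted `fake_llm`, the `Action:` / `Final Answer:` text protocol, and the `gc_content` tool are all hypothetical, and the real agent builds its workflow as a langgraph graph rather than a `for` loop.

```python
import re
from typing import Callable, Dict, List


def gc_content(sequence: str) -> str:
    """Hypothetical tool: fraction of G/C bases in a DNA sequence."""
    seq = sequence.upper()
    return f"{(seq.count('G') + seq.count('C')) / len(seq):.2f}"


# Hypothetical tool table; the real project discovers tools from metadata.
TOOLS: Dict[str, Callable[[str], str]] = {"gc_content": gc_content}


def fake_llm(history: List[str]) -> str:
    """Stand-in for an LLM call: act once, then answer from the observation."""
    observations = [m for m in history if m.startswith("Observation:")]
    if not observations:
        return "Thought: I need the GC fraction.\nAction: gc_content[ATGC]"
    return f"Final Answer: {observations[-1].split(': ', 1)[1]}"


def run_react(prompt: str, llm=fake_llm, max_steps: int = 5) -> str:
    """Alternate between model replies and tool observations until an answer."""
    history = [f"Question: {prompt}"]
    for _ in range(max_steps):
        reply = llm(history)
        history.append(reply)
        if reply.startswith("Final Answer:"):
            return reply.split(":", 1)[1].strip()
        match = re.search(r"Action: (\w+)\[(.*)\]", reply)
        if match is None:
            history.append("Observation: no valid action found")
            continue
        name, arg = match.groups()
        tool = TOOLS.get(name)
        result = tool(arg) if tool else f"unknown tool {name}"
        history.append(f"Observation: {result}")
    return "No answer within step limit"
```

Calling `run_react("What is the GC content of ATGC?")` walks through one tool call and one final answer. The `max_steps` bound is a design choice of this sketch that keeps a misbehaving model from looping forever.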
#### `biomni/task/`
- Defines **benchmark tasks** with a common interface:
  - `base_task.py`: Abstract base class specifying methods like `get_example()`, `evaluate()`, `output_class()`.
  - `hle.py`: "Humanity's Last Exam" task implementation.
  - `lab_bench.py`: LAB-Bench dataset task.
- Tasks load data (e.g., parquet files), generate prompts, and evaluate LLM responses.
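
A minimal sketch of what such a task interface might look like follows. Only the three method names come from the description above; the class names, method signatures, return types, and the in-memory `rows` (standing in for a parquet file) are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseTask(ABC):
    """Hypothetical mirror of the interface described for base_task.py."""

    @abstractmethod
    def get_example(self, index: int) -> dict:
        """Return a prompt and its reference answer."""

    @abstractmethod
    def evaluate(self, responses: List[str]) -> dict:
        """Score model responses against the reference answers."""

    @abstractmethod
    def output_class(self) -> type:
        """Return the Python type expected as a model answer."""


class MultipleChoiceTask(BaseTask):
    """Toy multiple-choice task backed by a list of dicts."""

    def __init__(self, rows: List[dict]):
        self.rows = rows

    def get_example(self, index: int) -> dict:
        row = self.rows[index]
        options = "\n".join(f"{k}. {v}" for k, v in row["options"].items())
        prompt = f"{row['question']}\n{options}\nAnswer with one letter."
        return {"prompt": prompt, "answer": row["answer"]}

    def evaluate(self, responses: List[str]) -> dict:
        # Case-insensitive exact match on the answer letter.
        correct = sum(
            resp.strip().upper() == row["answer"]
            for resp, row in zip(responses, self.rows)
        )
        return {"accuracy": correct / len(self.rows)}

    def output_class(self) -> type:
        return str
```

Because `BaseTask` is abstract, a subclass that forgets one of the three methods fails at instantiation time rather than midway through a benchmark run.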
#### `biomni/tool/`
- Contains **domain-specific scientific analysis functions** organized by subdomains:
  - `biochemistry.py`, `bioengineering.py`, `biophysics.py`, `cancer_biology.py`, `cell_biology.py`, `genetics.py`, `pathology.py`, `physiology.py`, `systems_biology.py`, etc.
  - Each file implements multiple functions performing analyses, simulations, or data processing workflows.
  - Functions accept input files/parameters and return detailed textual logs and output files.
- **API client modules** (e.g., `database.py`) provide facade functions to query external biomedical databases (UniProt, GWAS Catalog, Ensembl, etc.) via REST or GraphQL APIs, often using LLMs to generate query payloads from natural language prompts.
- **Tool registry (`tool_registry.py`)** manages metadata about available tools, supporting dynamic registration and lookup.
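
Dynamic registration and lookup can be illustrated with a small in-memory registry. The `ToolRegistry` class below and its method names are hypothetical and are not taken from `tool_registry.py`; the sketch only shows the general pattern of keying tool metadata by name.

```python
from typing import Dict, List, Optional


class ToolRegistry:
    """Hypothetical in-memory registry keyed by tool name."""

    def __init__(self) -> None:
        self._tools: Dict[str, dict] = {}

    def register(self, tool: dict) -> None:
        """Add one tool's metadata; duplicate names are rejected."""
        name = tool["name"]
        if name in self._tools:
            raise ValueError(f"tool already registered: {name}")
        self._tools[name] = tool

    def get(self, name: str) -> Optional[dict]:
        """Return the metadata for a tool, or None if it is unknown."""
        return self._tools.get(name)

    def list_names(self) -> List[str]:
        """Return all registered tool names in sorted order."""
        return sorted(self._tools)
```

Rejecting duplicate names at registration time is one possible policy; silently overwriting would also work but makes conflicting tool definitions harder to notice.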
#### `biomni/tool/tool_description/`
- Contains **declarative metadata schemas** describing tool APIs:
  - Each file exports a `description` list of dictionaries defining tool names, descriptions, required and optional parameters with types and defaults.
- Supports **dynamic API generation, validation, and documentation**.
- Organized by biological domain (e.g., genetics, immunology, pathology).
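
To make the metadata format concrete, here is a sketch of one `description` entry and a validator that checks a call against it. The tool `count_cells`, the key names `required_parameters` and `optional_parameters`, and the `validate_call` helper are assumptions for illustration; the actual schema keys should be checked against the files in this directory.

```python
from typing import Any, Dict

# Hypothetical entry in the style described above.
description = [
    {
        "name": "count_cells",
        "description": "Count cells in a microscopy image.",
        "required_parameters": [
            {"name": "image_path", "type": "str", "default": None},
        ],
        "optional_parameters": [
            {"name": "threshold", "type": "float", "default": 0.5},
        ],
    }
]

# Map schema type names to Python types (ints are accepted as floats).
_TYPES = {"str": str, "int": int, "float": (int, float), "bool": bool}


def validate_call(tool: dict, kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Check kwargs against the tool schema and fill in defaults."""
    params = tool["required_parameters"] + tool["optional_parameters"]
    unknown = set(kwargs) - {p["name"] for p in params}
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved: Dict[str, Any] = {}
    for p in tool["required_parameters"]:
        if p["name"] not in kwargs:
            raise ValueError(f"missing required parameter: {p['name']}")
        resolved[p["name"]] = kwargs[p["name"]]
    for p in tool["optional_parameters"]:
        resolved[p["name"]] = kwargs.get(p["name"], p["default"])
    for p in params:
        value = resolved[p["name"]]
        if value is not None and not isinstance(value, _TYPES[p["type"]]):
            raise TypeError(f"{p['name']} must be of type {p['type']}")
    return resolved
```

Keeping the schema as plain dictionaries means the same data can drive validation, documentation, and the tool list shown to an LLM.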
#### `biomni/model/retriever.py`
- Implements `ToolRetriever` class for **AI-driven resource selection**:
  - Uses LLMs (OpenAI or Anthropic) to parse user queries and select relevant tools, datasets, and libraries.
  - Encapsulates prompt formatting and response parsing logic.
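
The two halves of that logic, building a selection prompt and parsing the model's reply, might look roughly like this. The function names, the prompt wording, and the JSON reply format are assumptions of this sketch rather than the behavior of `ToolRetriever`; no LLM is called here.

```python
import json
import re
from typing import List


def format_prompt(query: str, tools: List[dict]) -> str:
    """Build a selection prompt that lists each tool with an index."""
    lines = [f"{i}. {t['name']}: {t['description']}" for i, t in enumerate(tools)]
    return (
        "Select the tools relevant to the query.\n"
        f"Query: {query}\n"
        "Tools:\n" + "\n".join(lines) + "\n"
        'Reply with JSON such as {"tools": [0, 2]}.'
    )


def parse_selection(response: str, tools: List[dict]) -> List[dict]:
    """Extract the first JSON object from the reply and map indices to tools."""
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        return []
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    indices = payload.get("tools", [])
    if not isinstance(indices, list):
        return []
    # Silently drop indices that are not valid positions in the tool list.
    return [tools[i] for i in indices if isinstance(i, int) and 0 <= i < len(tools)]
```

Parsing defensively matters because model replies often wrap the JSON in extra prose or cite indices that do not exist.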
#### `biomni/utils.py` and `biomni/llm.py`
- `utils.py`: Utility functions for running system commands (R, Bash), file operations, schema generation, logging, and colorized printing.
- `llm.py`: Factory functions to instantiate LLMs (OpenAI, Anthropic) with configurable parameters.
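
A factory of this kind could dispatch on the model name, as in the sketch below. The function names `infer_provider` and `get_llm`, the prefix rules, and the default temperature are assumptions and may differ from `llm.py`. `ChatOpenAI` and `ChatAnthropic` are the chat model classes from `langchain_openai` and `langchain_anthropic`; they are imported lazily so that only the selected provider's package needs to be installed.

```python
def infer_provider(model: str) -> str:
    """Guess the provider from a model name prefix (illustrative rule)."""
    if model.startswith("claude"):
        return "anthropic"
    if model.startswith("gpt-"):
        return "openai"
    raise ValueError(f"cannot infer provider for model: {model}")


def get_llm(model: str, temperature: float = 0.0):
    """Return a LangChain chat model for the given model name."""
    provider = infer_provider(model)
    if provider == "anthropic":
        # Requires langchain_anthropic and ANTHROPIC_API_KEY.
        from langchain_anthropic import ChatAnthropic

        return ChatAnthropic(model=model, temperature=temperature)
    # Requires langchain_openai and OPENAI_API_KEY.
    from langchain_openai import ChatOpenAI

    return ChatOpenAI(model=model, temperature=temperature)
```

Centralizing construction in one function keeps provider-specific imports and parameters out of agent and retriever code.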
#### `biomni/env_desc.py`
- Contains **environment and dataset descriptions**, acting as a centralized metadata repository for datasets and experimental environments.
---
### Environment Setup (`biomni_env/`)
- `setup.sh`: Main shell script to create conda environment, install R packages, and CLI bioinformatics tools.
- `install_cli_tools.sh`: Automates downloading, compiling, and installing external bioinformatics command-line tools, managing PATH and verification.
- `r_packages.yml`: Lists R packages required.
- `environment.yml` and `bio_env.yml`: Conda environment specifications.
- `setup_path.sh`: Shell script to update environment variables for CLI tools.
---
### Entry Points and Execution Flow
- **Agent usage**: Instantiate `react` agent from `biomni.agent.react`, configure with tools and retrieval, then call `go(prompt)` to run reasoning workflows.
- **Task evaluation**: Use classes in `biomni.task` to load datasets, generate prompts, and evaluate LLM outputs.
- **Tool invocation**: Call functions in `biomni.tool` modules or use API facades in `database.py` to query external resources.
- **Metadata-driven tool discovery**: Use `tool_registry.py` and `tool_description` schemas to dynamically discover and validate tools.
- **Environment setup**: Run `biomni_env/setup.sh` to provision environment and install dependencies.
---
## 4. Development Patterns and Standards
### Code Organization Principles
- **Modular design**: Clear separation of concerns by domain and functionality (agent, task, tool, model).
- **Functional programming style**: Most analysis modules use standalone functions with explicit inputs and outputs.
- **Declarative metadata**: Tool descriptions and schemas are separated from implementation, enabling dynamic validation and UI generation.
- **Abstract base classes**: Used in `biomni.task.base_task` to enforce consistent task interfaces.
- **Factory pattern**: Used in `llm.py` to instantiate LLMs based on configuration.
- **Strategy pattern**: Task implementations and tool retrieval use interchangeable strategies.
### Testing and Coverage
- No explicit test files detected; testing likely manual or via notebooks (`tutorials/`).
- Tasks and tools return detailed logs suitable for manual verification.
- Metadata schemas facilitate automated validation of inputs.
### Error Handling and Logging
- Use of try-except blocks around external calls and subprocesses.
- Logging via custom callback handlers (`PromptLogger`, `NodeLogger`) in LLM interactions.
- Utilities provide colorized printing and error wrappers for robustness.
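
The try-except pattern around subprocess calls can be sketched as a helper that always returns a log string instead of raising. The name `run_command` and the `ERROR:` message format are hypothetical; the project's own wrappers in `utils.py` may behave differently.

```python
import subprocess
import sys
from typing import List


def run_command(cmd: List[str], timeout: float = 60.0) -> str:
    """Run an external command and return its output or an error message."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except OSError as exc:
        # Raised when the executable is missing or cannot be started.
        return f"ERROR: could not start {cmd[0]}: {exc}"
    except subprocess.TimeoutExpired:
        return f"ERROR: timed out after {timeout} s"
    if proc.returncode != 0:
        return f"ERROR (exit {proc.returncode}): {proc.stderr.strip()}"
    return proc.stdout.strip()


if __name__ == "__main__":
    # Uses the current Python interpreter as a command that is known to exist.
    print(run_command([sys.executable, "-c", "print('ok')"]))
```

Returning text in every case fits tools whose output is read by an LLM agent, since the agent can then react to the failure message in its next reasoning step.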
### Configuration Management
- Environment variables for API keys (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`).
- YAML and JSON files for environment and tool configuration.
- Dynamic loading of schemas from pickle files for API request generation.
- CLI tools and R packages installed via scripted environment setup.
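
As a small illustration of key-based configuration, the helper below reports which LLM providers have a key set. The two environment variable names come from the list above; the `configured_providers` function and the provider labels are hypothetical.

```python
import os
from typing import List, Mapping

# Provider label -> environment variable holding its API key.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def configured_providers(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the providers whose API key is present and non-empty."""
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var, "").strip()]
```

Passing the environment as a parameter makes the check easy to exercise without touching real credentials, for example `configured_providers({"OPENAI_API_KEY": "sk-test"})`.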
---
## 5. Integration and Dependencies
### External Libraries
- **LLM & AI Frameworks**: `langchain_core`, `langchain_openai`, `langchain_anthropic`
- **Scientific Computing**: `numpy`, `pandas`, `scipy`, `scikit-image`, `matplotlib`, `Biopython`, `cobra`, `scikit-learn` (imported as `sklearn`)
- **Data Processing**: `pickle`, `json`, `requests`, `PyPDF2`
- **System and OS**: `subprocess`, `os`, `sys`, `tempfile`, `multiprocessing`
- **Others**: `tqdm` (progress bars), `enum`, `ast` (code introspection)
### External APIs and Data Sources
- Biomedical databases: UniProt, GWAS Catalog, Ensembl, ClinVar, dbSNP, EMDB, GEO, gnomAD, InterPro, etc.
- Bioinformatics tools: PLINK, IQ-TREE, GCTA, MACS2, samtools, LUMPY, installed via CLI tools installer.
- R packages for statistical and bioinformatics analyses.
### Build and Deployment Dependencies
- Python 3 environment managed via Conda (`environment.yml`).
- R environment with specified packages (`r_packages.yml`).
- Shell scripts for CLI tool installation and environment setup.
- Pre-commit hooks for code quality and security.
---
## 6. Usage and Operational Guidance
### Getting Started
1. **Environment Setup**
   - Run `biomni_env/setup.sh` to create the Conda environment, install R packages, and CLI tools.
   - Source `biomni_env/setup_path.sh` or add it to your shell profile to configure PATH.
2. **API Keys**
   - Set environment variables `OPENAI_API_KEY` and/or `ANTHROPIC_API_KEY` for LLM access.
3. **Running Agents**
   - Import and instantiate the `react` agent from `biomni.agent.react`.
   - Configure with desired tools and retrieval options.
   - Call `go(prompt)` to execute reasoning workflows.
4. **Executing Tasks**
   - Use classes in `biomni.task` to load datasets and evaluate LLM responses.
   - Implement new tasks by subclassing `base_task` and following the interface.
5. **Querying Databases**
   - Use `biomni.tool.database` functions (e.g., `query_uniprot`, `query_gwas_catalog`) to retrieve data via natural language or direct parameters.
6. **Extending Tools**
   - Add new tool metadata in `biomni/tool/tool_description/` as structured dictionaries.
   - Implement corresponding analysis functions in `biomni/tool/`.
   - Register tools in `tool_registry.py` for discovery.
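
For the "direct parameters" path mentioned in step 5, a database facade ultimately has to turn its arguments into a REST request. The sketch below builds a search URL for the public UniProt REST API; the `build_uniprot_url` helper and its parameters are hypothetical and do not reflect the signature of `query_uniprot`. No network request is made.

```python
from urllib.parse import urlencode

# Public UniProtKB search endpoint.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_uniprot_url(gene: str, organism_id: int = 9606, size: int = 5) -> str:
    """Build a UniProtKB search URL for a gene in one organism.

    organism_id is an NCBI taxonomy ID; 9606 is Homo sapiens.
    """
    query = f"gene:{gene} AND organism_id:{organism_id}"
    params = {"query": query, "format": "json", "size": size}
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"
```

The resulting URL could be fetched with `requests.get(url, timeout=30)`. In the natural-language path, an LLM would produce the `query` string and the rest of the function would stay the same.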
### Monitoring and Debugging
- Use logging callbacks (`PromptLogger`, `NodeLogger`) to trace LLM interactions.
- Check output logs returned by analysis functions for detailed execution info.
- Use pre-commit hooks to maintain code quality.
### Performance and Scalability
- Modular design allows parallel execution of tasks and tools.
- Timeout wrappers in agent tools prevent hanging executions.
- Use of efficient numerical libraries (`numpy`, `scipy`) for computational tasks.
- Large data handled via streaming and chunking (e.g., PDF text extraction).
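
One way to implement a timeout wrapper is with a worker thread, as sketched below. The `call_with_timeout` helper and the `slow_tool` demo function are hypothetical, and the agent's own wrapper may use a different mechanism such as separate processes. A thread-based wrapper stops waiting after the deadline but cannot kill the worker, which keeps running in the background until it finishes.

```python
import concurrent.futures
import time


def call_with_timeout(func, *args, timeout: float = 5.0, **kwargs):
    """Run func in a worker thread and give up waiting after `timeout` seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return f"ERROR: tool call exceeded {timeout} s"
    finally:
        # Do not block on the worker; it finishes (or not) on its own.
        executor.shutdown(wait=False)


def slow_tool(seconds: float) -> str:
    """Demo tool that sleeps before answering."""
    time.sleep(seconds)
    return "done"
```

For tools that can truly hang, running them in a subprocess that can be terminated is the safer choice, at the cost of requiring picklable arguments and results.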
### Security Considerations
- API keys managed via environment variables, not hardcoded.
- Pre-commit hooks include security checks.
- External tool installations verified via version commands.
### Observability
- Progress bars (`tqdm`) used in data processing scripts.
- Structured logs and JSON outputs facilitate downstream analysis.
- Agent workflows produce detailed message histories for audit.
---
## Summary
This project is a **modular, extensible biomedical AI platform** integrating **LLM-powered agents**, **domain-specific scientific tools**, and **metadata-driven APIs** to automate complex biomedical research workflows. It emphasizes **declarative tool descriptions**, **dynamic resource retrieval**, and **robust environment provisioning** to enable researchers and developers to build, evaluate, and extend AI-driven biomedical applications efficiently.
---
# End of DETAILS.md