File size: 16,545 Bytes

ef16689

# DETAILS.md

🔍 **Powered by [Detailer](https://detailer.ginylil.com)** - Context-aware codebase analysis



---

## 1. Project Overview

### Project Purpose & Domain

This project is a comprehensive **biomedical AI toolkit and research platform** designed to facilitate **biomedical data analysis, knowledge extraction, and AI-driven reasoning**. It integrates large language models (LLMs), domain-specific bioinformatics tools, and scientific data processing pipelines to enable:

- Automated extraction of biomedical knowledge from literature (e.g., bioRxiv papers)
- Querying and integration of diverse biomedical databases and APIs
- Execution of domain-specific computational biology and physiology analyses
- AI agent orchestration for complex biomedical reasoning and tool invocation
- Benchmarking and evaluation of biomedical tasks and datasets

### Target Users and Use Cases

- **Biomedical researchers and data scientists** seeking to automate literature mining, data retrieval, and analysis workflows.
- **Bioinformaticians** requiring integrated access to multiple biological databases and computational tools.
- **AI researchers** interested in applying LLMs and autonomous agents to biomedical problem solving.
- **Developers and integrators** building domain-specific AI pipelines and scientific workflows.
- Use cases include:
  - Extracting structured biomedical tasks and entities from scientific papers
  - Querying gene, protein, disease, and pathway databases via natural language prompts
  - Running computational models of biological systems (e.g., metabolic networks, signaling)
  - Performing image analysis and quantitative pathology workflows
  - Orchestrating multi-step AI reasoning with tool use and self-criticism

### Core Business Logic and Domain Models

- **Biomedical domain models**: gene IDs, protein structures, pathways, disease ontologies, experimental assays.
- **Task abstractions**: benchmark tasks with prompt/response evaluation (e.g., humanity last exam, lab bench).
- **Tool metadata schemas**: declarative descriptions of biomedical tools and APIs for dynamic invocation.
- **AI agent workflows**: ReAct-style reasoning graphs integrating LLMs, tool calls, retrieval, and self-critique.
- **Data models**: structured JSON, pandas DataFrames, numpy arrays representing biological data and analysis results.

---

## 2. Architecture and Structure

### High-Level Architecture

The system is organized into modular layers and components:

- **Core Library (`biomni/`)**: Contains main application logic, including:
  - **Agent framework (`biomni/agent/`)**: Implements autonomous AI agents using LLMs and workflow graphs.
  - **Task definitions (`biomni/task/`)**: Abstract base and concrete biomedical benchmark tasks.
  - **Tool implementations (`biomni/tool/`)**: Domain-specific analysis functions, API clients, and computational biology workflows.
  - **Tool metadata (`biomni/tool/tool_description/`)**: Declarative schemas describing tool APIs and parameters.
  - **Model components (`biomni/model/`)**: AI-driven resource retriever for selecting relevant tools and data.
  - **Utility modules (`biomni/utils.py`, `biomni/llm.py`, `biomni/env_desc.py`)**: Helpers for LLM instantiation, system commands, environment descriptions.
  - **Versioning (`biomni/version.py`)**: Package version management.

- **Environment Setup (`biomni_env/`)**: Scripts and configuration files for reproducible environment provisioning, including:
  - Conda environment YAMLs (`environment.yml`, `bio_env.yml`)
  - R package specifications (`r_packages.yml`)
  - CLI tools installer (`install_cli_tools.sh`)
  - Shell scripts for environment setup (`setup.sh`, `setup_path.sh`)

- **Scripts (`biomni/biorxiv_scripts/`)**: Data processing pipelines for literature mining and task extraction.

- **Documentation and Configuration**:
  - Root-level files: `README.md`, `CONTRIBUTION.md`, `pyproject.toml`, `.pre-commit-config.yaml`.

---

### Complete Repository Structure

```
.
├── biomni/ (90 items)
│   ├── agent/
│   │   ├── __init__.py
│   │   ├── a1.py
│   │   ├── env_collection.py
│   │   ├── qa_llm.py
│   │   └── react.py
│   ├── biorxiv_scripts/
│   │   ├── extract_biorxiv_tasks.py
│   │   ├── generate_function.py
│   │   └── process_all_subjects.py
│   ├── model/
│   │   ├── __init__.py
│   │   └── retriever.py
│   ├── task/
│   │   ├── __init__.py
│   │   ├── base_task.py
│   │   ├── hle.py
│   │   └── lab_bench.py
│   ├── tool/ (65 items)
│   │   ├── schema_db/ (25 items)
│   │   │   ├── cbioportal.pkl
│   │   │   ├── clinvar.pkl
│   │   │   ├── dbsnp.pkl
│   │   │   ├── emdb.pkl
│   │   │   ├── ensembl.pkl
│   │   │   ├── geo.pkl
│   │   │   ├── gnomad.pkl
│   │   │   ├── gtopdb.pkl
│   │   │   ├── gwas_catalog.pkl
│   │   │   ├── interpro.pkl
│   │   │   └── ... (15 more files)
│   │   ├── tool_description/ (18 items)
│   │   │   ├── biochemistry.py
│   │   │   ├── bioengineering.py
│   │   │   ├── biophysics.py
│   │   │   ├── cancer_biology.py
│   │   │   ├── cell_biology.py
│   │   │   ├── database.py
│   │   │   ├── genetics.py
│   │   │   ├── genomics.py
│   │   │   ├── immunology.py
│   │   │   ├── literature.py
│   │   │   ├── microbiology.py
│   │   │   ├── molecular_biology.py
│   │   │   ├── pathology.py
│   │   │   ├── pharmacology.py
│   │   │   ├── physiology.py
│   │   │   ├── support_tools.py
│   │   │   ├── synthetic_biology.py
│   │   │   └── systems_biology.py
│   │   ├── __init__.py
│   │   ├── biochemistry.py
│   │   ├── bioengineering.py
│   │   ├── biophysics.py
│   │   ├── cancer_biology.py
│   │   ├── cell_biology.py
│   │   ├── database.py
│   │   ├── genetics.py
│   │   └── ... (12 more files)
│   ├── __init__.py
│   ├── env_desc.py
│   ├── llm.py
│   ├── utils.py
│   └── version.py
├── biomni_env/ (9 items)
│   ├── README.md
│   ├── bio_env.yml
│   ├── cli_tools_config.json
│   ├── environment.yml
│   ├── install_cli_tools.sh
│   ├── install_r_packages.R
│   ├── r_packages.yml
│   ├── setup.sh
│   └── setup_path.sh
├── figs/
│   └── biomni_logo.png
├── tutorials/
│   ├── examples/
│   │   └── cloning.ipynb
│   ├── 101_biomni.ipynb
│   └── biomni_101.ipynb
├── .gitignore
├── .pre-commit-config.yaml
├── CONTRIBUTION.md
├── LICENSE
├── README.md
└── pyproject.toml
```

---

## 3. Technical Implementation Details

### Core Modules and Their Roles

#### `biomni/agent/`

- Implements autonomous AI agents using the **ReAct paradigm**:
  - `react.py`: Main ReAct agent class managing reasoning, tool invocation, retrieval, and self-criticism workflows.
  - `env_collection.py`: Environment and data retrieval utilities.
  - `qa_llm.py`: Question-answering LLM wrappers.
  - `a1.py`: Possibly experimental or auxiliary agent code.

- Uses **langgraph** for workflow graph orchestration and **langchain** for LLM integration.

#### `biomni/task/`

- Defines **benchmark tasks** with a common interface:
  - `base_task.py`: Abstract base class specifying methods like `get_example()`, `evaluate()`, `output_class()`.
  - `hle.py`: "Humanity Last Exam" task implementation.
  - `lab_bench.py`: Lab bench dataset task.

- Tasks load data (e.g., parquet files), generate prompts, and evaluate LLM responses.

#### `biomni/tool/`

- Contains **domain-specific scientific analysis functions** organized by subdomains:
  - `biochemistry.py`, `bioengineering.py`, `biophysics.py`, `cancer_biology.py`, `cell_biology.py`, `genetics.py`, `pathology.py`, `physiology.py`, `systems_biology.py`, etc.
  - Each file implements multiple functions performing analyses, simulations, or data processing workflows.
  - Functions accept input files/parameters and return detailed textual logs and output files.

- **API client modules** (e.g., `database.py`) provide facade functions to query external biomedical databases (UniProt, GWAS Catalog, Ensembl, etc.) via REST or GraphQL APIs, often using LLMs to generate query payloads from natural language prompts.

- **Tool registry (`tool_registry.py`)** manages metadata about available tools, supporting dynamic registration and lookup.

#### `biomni/tool/tool_description/`

- Contains **declarative metadata schemas** describing tool APIs:
  - Each file exports a `description` list of dictionaries defining tool names, descriptions, required and optional parameters with types and defaults.
  - Supports **dynamic API generation, validation, and documentation**.
  - Organized by biological domain (e.g., genetics, immunology, pathology).

#### `biomni/model/retriever.py`

- Implements `ToolRetriever` class for **AI-driven resource selection**:
  - Uses LLMs (OpenAI or Anthropic) to parse user queries and select relevant tools, datasets, and libraries.
  - Encapsulates prompt formatting and response parsing logic.

#### `biomni/utils.py` and `biomni/llm.py`

- `utils.py`: Utility functions for running system commands (R, Bash), file operations, schema generation, logging, and colorized printing.
- `llm.py`: Factory functions to instantiate LLMs (OpenAI, Anthropic) with configurable parameters.

#### `biomni/env_desc.py`

- Contains **environment and dataset descriptions**, acting as a centralized metadata repository for datasets and experimental environments.

---

### Environment Setup (`biomni_env/`)

- `setup.sh`: Main shell script to create conda environment, install R packages, and CLI bioinformatics tools.
- `install_cli_tools.sh`: Automates downloading, compiling, and installing external bioinformatics command-line tools, managing PATH and verification.
- `r_packages.yml`: Lists R packages required.
- `environment.yml` and `bio_env.yml`: Conda environment specifications.
- `setup_path.sh`: Shell script to update environment variables for CLI tools.

---

### Entry Points and Execution Flow

- **Agent usage**: Instantiate `react` agent from `biomni.agent.react`, configure with tools and retrieval, then call `go(prompt)` to run reasoning workflows.
- **Task evaluation**: Use classes in `biomni.task` to load datasets, generate prompts, and evaluate LLM outputs.
- **Tool invocation**: Call functions in `biomni.tool` modules or use API facades in `database.py` to query external resources.
- **Metadata-driven tool discovery**: Use `tool_registry.py` and `tool_description` schemas to dynamically discover and validate tools.
- **Environment setup**: Run `biomni_env/setup.sh` to provision environment and install dependencies.

---

## 4. Development Patterns and Standards

### Code Organization Principles

- **Modular design**: Clear separation of concerns by domain and functionality (agent, task, tool, model).
- **Functional programming style**: Most analysis modules use standalone functions with explicit inputs and outputs.
- **Declarative metadata**: Tool descriptions and schemas are separated from implementation, enabling dynamic validation and UI generation.
- **Abstract base classes**: Used in `biomni.task.base_task` to enforce consistent task interfaces.
- **Factory pattern**: Used in `llm.py` to instantiate LLMs based on configuration.
- **Strategy pattern**: Task implementations and tool retrieval use interchangeable strategies.

### Testing and Coverage

- No explicit test files detected; testing likely manual or via notebooks (`tutorials/`).
- Tasks and tools return detailed logs suitable for manual verification.
- Metadata schemas facilitate automated validation of inputs.

### Error Handling and Logging

- Use of try-except blocks around external calls and subprocesses.
- Logging via custom callback handlers (`PromptLogger`, `NodeLogger`) in LLM interactions.
- Utilities provide colorized printing and error wrappers for robustness.

### Configuration Management

- Environment variables for API keys (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`).
- YAML and JSON files for environment and tool configuration.
- Dynamic loading of schemas from pickle files for API request generation.
- CLI tools and R packages installed via scripted environment setup.

---

## 5. Integration and Dependencies

### External Libraries

- **LLM & AI Frameworks**: `langchain_core`, `langchain_openai`, `langchain_anthropic`
- **Scientific Computing**: `numpy`, `pandas`, `scipy`, `scikit-image`, `matplotlib`, `BioPython`, `cobra`, `sklearn`
- **Data Processing**: `pickle`, `json`, `requests`, `PyPDF2`
- **System and OS**: `subprocess`, `os`, `sys`, `tempfile`, `multiprocessing`
- **Others**: `tqdm` (progress bars), `enum`, `ast` (code introspection)

### External APIs and Data Sources

- Biomedical databases: UniProt, GWAS Catalog, Ensembl, ClinVar, dbSNP, EMDB, GEO, GnomAD, InterPro, etc.
- Bioinformatics tools: PLINK, IQ-TREE, GCTA, MACS2, samtools, LUMPY, installed via CLI tools installer.
- R packages for statistical and bioinformatics analyses.

### Build and Deployment Dependencies

- Python 3 environment managed via Conda (`environment.yml`).
- R environment with specified packages (`r_packages.yml`).
- Shell scripts for CLI tool installation and environment setup.
- Pre-commit hooks for code quality and security.

---

## 6. Usage and Operational Guidance

### Getting Started

1. **Environment Setup**
   - Run `biomni_env/setup.sh` to create the Conda environment, install R packages, and CLI tools.
   - Source `biomni_env/setup_path.sh` or add it to your shell profile to configure PATH.

2. **API Keys**
   - Set environment variables `OPENAI_API_KEY` and/or `ANTHROPIC_API_KEY` for LLM access.

3. **Running Agents**
   - Import and instantiate the `react` agent from `biomni.agent.react`.
   - Configure with desired tools and retrieval options.
   - Call `go(prompt)` to execute reasoning workflows.

4. **Executing Tasks**
   - Use classes in `biomni.task` to load datasets and evaluate LLM responses.
   - Implement new tasks by subclassing `base_task` and following the interface.

5. **Querying Databases**
   - Use `biomni.tool.database` functions (e.g., `query_uniprot`, `query_gwas_catalog`) to retrieve data via natural language or direct parameters.

6. **Extending Tools**
   - Add new tool metadata in `biomni/tool/tool_description/` as structured dictionaries.
   - Implement corresponding analysis functions in `biomni/tool/`.
   - Register tools in `tool_registry.py` for discovery.

### Monitoring and Debugging

- Use logging callbacks (`PromptLogger`, `NodeLogger`) to trace LLM interactions.
- Check output logs returned by analysis functions for detailed execution info.
- Use pre-commit hooks to maintain code quality.

### Performance and Scalability

- Modular design allows parallel execution of tasks and tools.
- Timeout wrappers in agent tools prevent hanging executions.
- Use of efficient numerical libraries (`numpy`, `scipy`) for computational tasks.
- Large data handled via streaming and chunking (e.g., PDF text extraction).

### Security Considerations

- API keys managed via environment variables, not hardcoded.
- Pre-commit hooks include security checks.
- External tool installations verified via version commands.

### Observability

- Progress bars (`tqdm`) used in data processing scripts.
- Structured logs and JSON outputs facilitate downstream analysis.
- Agent workflows produce detailed message histories for audit.

---

## Summary

This project is a **modular, extensible biomedical AI platform** integrating **LLM-powered agents**, **domain-specific scientific tools**, and **metadata-driven APIs** to automate complex biomedical research workflows. It emphasizes **declarative tool descriptions**, **dynamic resource retrieval**, and **robust environment provisioning** to enable researchers and developers to build, evaluate, and extend AI-driven biomedical applications efficiently.

---

# End of DETAILS.md