# DETAILS.md

🔍 **Powered by [Detailer](https://detailer.ginylil.com)** - Context-aware codebase analysis

---

## 1. Project Overview

### Project Purpose & Domain

This project is a comprehensive **biomedical AI toolkit and research platform** designed to facilitate **biomedical data analysis, knowledge extraction, and AI-driven reasoning**. It integrates large language models (LLMs), domain-specific bioinformatics tools, and scientific data processing pipelines to enable:

- Automated extraction of biomedical knowledge from literature (e.g., bioRxiv papers)
- Querying and integration of diverse biomedical databases and APIs
- Execution of domain-specific computational biology and physiology analyses
- AI agent orchestration for complex biomedical reasoning and tool invocation
- Benchmarking and evaluation of biomedical tasks and datasets

### Target Users and Use Cases

- **Biomedical researchers and data scientists** seeking to automate literature mining, data retrieval, and analysis workflows.
- **Bioinformaticians** requiring integrated access to multiple biological databases and computational tools.
- **AI researchers** interested in applying LLMs and autonomous agents to biomedical problem solving.
- **Developers and integrators** building domain-specific AI pipelines and scientific workflows.
- Use cases include:
  - Extracting structured biomedical tasks and entities from scientific papers
  - Querying gene, protein, disease, and pathway databases via natural language prompts
  - Running computational models of biological systems (e.g., metabolic networks, signaling)
  - Performing image analysis and quantitative pathology workflows
  - Orchestrating multi-step AI reasoning with tool use and self-criticism

### Core Business Logic and Domain Models

- **Biomedical domain models**: gene IDs, protein structures, pathways, disease ontologies, experimental assays.
- **Task abstractions**: benchmark tasks with prompt/response evaluation (e.g., Humanity's Last Exam, LAB-Bench).
- **Tool metadata schemas**: declarative descriptions of biomedical tools and APIs for dynamic invocation.
- **AI agent workflows**: ReAct-style reasoning graphs integrating LLMs, tool calls, retrieval, and self-critique.
- **Data models**: structured JSON, pandas DataFrames, numpy arrays representing biological data and analysis results.

---

## 2. Architecture and Structure

### High-Level Architecture

The system is organized into modular layers and components:

- **Core Library (`biomni/`)**: Contains main application logic, including:
  - **Agent framework (`biomni/agent/`)**: Implements autonomous AI agents using LLMs and workflow graphs.
  - **Task definitions (`biomni/task/`)**: Abstract base and concrete biomedical benchmark tasks.
  - **Tool implementations (`biomni/tool/`)**: Domain-specific analysis functions, API clients, and computational biology workflows.
  - **Tool metadata (`biomni/tool/tool_description/`)**: Declarative schemas describing tool APIs and parameters.
  - **Model components (`biomni/model/`)**: AI-driven resource retriever for selecting relevant tools and data.
  - **Utility modules (`biomni/utils.py`, `biomni/llm.py`, `biomni/env_desc.py`)**: Helpers for LLM instantiation, system commands, environment descriptions.
  - **Versioning (`biomni/version.py`)**: Package version management.
- **Environment Setup (`biomni_env/`)**: Scripts and configuration files for reproducible environment provisioning, including:
  - Conda environment YAMLs (`environment.yml`, `bio_env.yml`)
  - R package specifications (`r_packages.yml`)
  - CLI tools installer (`install_cli_tools.sh`)
  - Shell scripts for environment setup (`setup.sh`, `setup_path.sh`)
- **Scripts (`biomni/biorxiv_scripts/`)**: Data processing pipelines for literature mining and task extraction.
- **Documentation and Configuration**:
  - Root-level files: `README.md`, `CONTRIBUTION.md`, `pyproject.toml`, `.pre-commit-config.yaml`.

---

### Complete Repository Structure

```
.
├── biomni/ (90 items)
│   ├── agent/
│   │   ├── __init__.py
│   │   ├── a1.py
│   │   ├── env_collection.py
│   │   ├── qa_llm.py
│   │   └── react.py
│   ├── biorxiv_scripts/
│   │   ├── extract_biorxiv_tasks.py
│   │   ├── generate_function.py
│   │   └── process_all_subjects.py
│   ├── model/
│   │   ├── __init__.py
│   │   └── retriever.py
│   ├── task/
│   │   ├── __init__.py
│   │   ├── base_task.py
│   │   ├── hle.py
│   │   └── lab_bench.py
│   ├── tool/ (65 items)
│   │   ├── schema_db/ (25 items)
│   │   │   ├── cbioportal.pkl
│   │   │   ├── clinvar.pkl
│   │   │   ├── dbsnp.pkl
│   │   │   ├── emdb.pkl
│   │   │   ├── ensembl.pkl
│   │   │   ├── geo.pkl
│   │   │   ├── gnomad.pkl
│   │   │   ├── gtopdb.pkl
│   │   │   ├── gwas_catalog.pkl
│   │   │   ├── interpro.pkl
│   │   │   └── ... (15 more files)
│   │   ├── tool_description/ (18 items)
│   │   │   ├── biochemistry.py
│   │   │   ├── bioengineering.py
│   │   │   ├── biophysics.py
│   │   │   ├── cancer_biology.py
│   │   │   ├── cell_biology.py
│   │   │   ├── database.py
│   │   │   ├── genetics.py
│   │   │   ├── genomics.py
│   │   │   ├── immunology.py
│   │   │   ├── literature.py
│   │   │   ├── microbiology.py
│   │   │   ├── molecular_biology.py
│   │   │   ├── pathology.py
│   │   │   ├── pharmacology.py
│   │   │   ├── physiology.py
│   │   │   ├── support_tools.py
│   │   │   ├── synthetic_biology.py
│   │   │   └── systems_biology.py
│   │   ├── __init__.py
│   │   ├── biochemistry.py
│   │   ├── bioengineering.py
│   │   ├── biophysics.py
│   │   ├── cancer_biology.py
│   │   ├── cell_biology.py
│   │   ├── database.py
│   │   ├── genetics.py
│   │   └── ... (12 more files)
│   ├── __init__.py
│   ├── env_desc.py
│   ├── llm.py
│   ├── utils.py
│   └── version.py
├── biomni_env/ (9 items)
│   ├── README.md
│   ├── bio_env.yml
│   ├── cli_tools_config.json
│   ├── environment.yml
│   ├── install_cli_tools.sh
│   ├── install_r_packages.R
│   ├── r_packages.yml
│   ├── setup.sh
│   └── setup_path.sh
├── figs/
│   └── biomni_logo.png
├── tutorials/
│   ├── examples/
│   │   └── cloning.ipynb
│   ├── 101_biomni.ipynb
│   └── biomni_101.ipynb
├── .gitignore
├── .pre-commit-config.yaml
├── CONTRIBUTION.md
├── LICENSE
├── README.md
└── pyproject.toml
```

---

## 3. Technical Implementation Details

### Core Modules and Their Roles

#### `biomni/agent/`

- Implements autonomous AI agents using the **ReAct paradigm**:
  - `react.py`: Main ReAct agent class managing reasoning, tool invocation, retrieval, and self-criticism workflows.
  - `env_collection.py`: Environment and data retrieval utilities.
  - `qa_llm.py`: Question-answering LLM wrappers.
  - `a1.py`: Possibly experimental or auxiliary agent code.
- Uses **langgraph** for workflow graph orchestration and **langchain** for LLM integration.

#### `biomni/task/`

- Defines **benchmark tasks** with a common interface:
  - `base_task.py`: Abstract base class specifying methods like `get_example()`, `evaluate()`, `output_class()`.
  - `hle.py`: "Humanity's Last Exam" task implementation.
  - `lab_bench.py`: LAB-Bench dataset task.
- Tasks load data (e.g., parquet files), generate prompts, and evaluate LLM responses.

#### `biomni/tool/`

- Contains **domain-specific scientific analysis functions** organized by subdomains:
  - `biochemistry.py`, `bioengineering.py`, `biophysics.py`, `cancer_biology.py`, `cell_biology.py`, `genetics.py`, `pathology.py`, `physiology.py`, `systems_biology.py`, etc.
  - Each file implements multiple functions performing analyses, simulations, or data processing workflows.
  - Functions accept input files/parameters and return detailed textual logs and output files.
- **API client modules** (e.g., `database.py`) provide facade functions to query external biomedical databases (UniProt, GWAS Catalog, Ensembl, etc.) via REST or GraphQL APIs, often using LLMs to generate query payloads from natural language prompts.
- **Tool registry (`tool_registry.py`)** manages metadata about available tools, supporting dynamic registration and lookup.
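
The dynamic registration and lookup attributed to the tool registry can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the `ToolRegistry` class, its `register`, `get`, and `list_names` methods, and the metadata keys shown (`name`, `description`, `required_parameters`) are all hypothetical and may differ from what `tool_registry.py` actually exposes.

```python
from typing import Any, Dict, List, Optional


class ToolRegistry:
    """Hypothetical in-memory registry; not the actual `tool_registry.py` API."""

    def __init__(self) -> None:
        # Maps a tool name to its metadata dictionary.
        self._tools: Dict[str, Dict[str, Any]] = {}

    def register(self, tool: Dict[str, Any]) -> None:
        """Add a tool description, rejecting missing or duplicate names."""
        name = tool.get("name")
        if not name:
            raise ValueError("tool metadata must include a non-empty 'name'")
        if name in self._tools:
            raise ValueError(f"tool {name!r} is already registered")
        self._tools[name] = tool

    def get(self, name: str) -> Optional[Dict[str, Any]]:
        """Return the metadata for `name`, or None if it is unknown."""
        return self._tools.get(name)

    def list_names(self) -> List[str]:
        """Return registered tool names in registration order."""
        return list(self._tools)


registry = ToolRegistry()
registry.register(
    {
        # Illustrative entry only; the name and fields are assumptions.
        "name": "example_analysis",
        "description": "Placeholder analysis that returns a textual log.",
        "required_parameters": [{"name": "input_file", "type": "str"}],
    }
)
print(registry.list_names())
```

Keeping the registry a thin mapping from names to metadata dictionaries means lookup stays independent of how the analysis functions themselves are implemented or imported.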

#### `biomni/tool/tool_description/`

- Contains **declarative metadata schemas** describing tool APIs:
  - Each file exports a `description` list of dictionaries defining tool names, descriptions, required and optional parameters with types and defaults.
  - Supports **dynamic API generation, validation, and documentation**.
- Organized by biological domain (e.g., genetics, immunology, pathology).

#### `biomni/model/retriever.py`

- Implements `ToolRetriever` class for **AI-driven resource selection**:
  - Uses LLMs (OpenAI or Anthropic) to parse user queries and select relevant tools, datasets, and libraries.
  - Encapsulates prompt formatting and response parsing logic.

#### `biomni/utils.py` and `biomni/llm.py`

- `utils.py`: Utility functions for running system commands (R, Bash), file operations, schema generation, logging, and colorized printing.
- `llm.py`: Factory functions to instantiate LLMs (OpenAI, Anthropic) with configurable parameters.

#### `biomni/env_desc.py`

- Contains **environment and dataset descriptions**, acting as a centralized metadata repository for datasets and experimental environments.

---

### Environment Setup (`biomni_env/`)

- `setup.sh`: Main shell script to create conda environment, install R packages, and CLI bioinformatics tools.
- `install_cli_tools.sh`: Automates downloading, compiling, and installing external bioinformatics command-line tools, managing PATH and verification.
- `r_packages.yml`: Lists R packages required.
- `environment.yml` and `bio_env.yml`: Conda environment specifications.
- `setup_path.sh`: Shell script to update environment variables for CLI tools.

---

### Entry Points and Execution Flow

- **Agent usage**: Instantiate `react` agent from `biomni.agent.react`, configure with tools and retrieval, then call `go(prompt)` to run reasoning workflows.
- **Task evaluation**: Use classes in `biomni.task` to load datasets, generate prompts, and evaluate LLM outputs.
- **Tool invocation**: Call functions in `biomni.tool` modules or use API facades in `database.py` to query external resources.
- **Metadata-driven tool discovery**: Use `tool_registry.py` and `tool_description` schemas to dynamically discover and validate tools.
- **Environment setup**: Run `biomni_env/setup.sh` to provision environment and install dependencies.

---

## 4. Development Patterns and Standards

### Code Organization Principles

- **Modular design**: Clear separation of concerns by domain and functionality (agent, task, tool, model).
- **Functional programming style**: Most analysis modules use standalone functions with explicit inputs and outputs.
- **Declarative metadata**: Tool descriptions and schemas are separated from implementation, enabling dynamic validation and UI generation.
- **Abstract base classes**: Used in `biomni.task.base_task` to enforce consistent task interfaces.
- **Factory pattern**: Used in `llm.py` to instantiate LLMs based on configuration.
- **Strategy pattern**: Task implementations and tool retrieval use interchangeable strategies.

### Testing and Coverage

- No explicit test files detected; testing is likely manual or via notebooks (`tutorials/`).
- Tasks and tools return detailed logs suitable for manual verification.
- Metadata schemas facilitate automated validation of inputs.

### Error Handling and Logging

- Use of try-except blocks around external calls and subprocesses.
- Logging via custom callback handlers (`PromptLogger`, `NodeLogger`) in LLM interactions.
- Utilities provide colorized printing and error wrappers for robustness.

### Configuration Management

- Environment variables for API keys (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`).
- YAML and JSON files for environment and tool configuration.
- Dynamic loading of schemas from pickle files for API request generation.
- CLI tools and R packages installed via scripted environment setup.

---

## 5. Integration and Dependencies

### External Libraries

- **LLM & AI Frameworks**: `langgraph`, `langchain_core`, `langchain_openai`, `langchain_anthropic`
- **Scientific Computing**: `numpy`, `pandas`, `scipy`, `scikit-image`, `matplotlib`, `Biopython`, `cobra`, `sklearn`
- **Data Processing**: `pickle`, `json`, `requests`, `PyPDF2`
- **System and OS**: `subprocess`, `os`, `sys`, `tempfile`, `multiprocessing`
- **Others**: `tqdm` (progress bars), `enum`, `ast` (code introspection)

### External APIs and Data Sources

- Biomedical databases: UniProt, GWAS Catalog, Ensembl, ClinVar, dbSNP, EMDB, GEO, gnomAD, InterPro, etc.
- Bioinformatics tools: PLINK, IQ-TREE, GCTA, MACS2, samtools, LUMPY, installed via CLI tools installer.
- R packages for statistical and bioinformatics analyses.

### Build and Deployment Dependencies

- Python 3 environment managed via Conda (`environment.yml`).
- R environment with specified packages (`r_packages.yml`).
- Shell scripts for CLI tool installation and environment setup.
- Pre-commit hooks for code quality and security.

---

## 6. Usage and Operational Guidance

### Getting Started

1. **Environment Setup**
   - Run `biomni_env/setup.sh` to create the Conda environment, install R packages, and CLI tools.
   - Source `biomni_env/setup_path.sh` or add it to your shell profile to configure PATH.
2. **API Keys**
   - Set environment variables `OPENAI_API_KEY` and/or `ANTHROPIC_API_KEY` for LLM access.
3. **Running Agents**
   - Import and instantiate the `react` agent from `biomni.agent.react`.
   - Configure with desired tools and retrieval options.
   - Call `go(prompt)` to execute reasoning workflows.
4. **Executing Tasks**
   - Use classes in `biomni.task` to load datasets and evaluate LLM responses.
   - Implement new tasks by subclassing the abstract base class in `biomni/task/base_task.py` and following its interface.
5. **Querying Databases**
   - Use `biomni.tool.database` functions (e.g., `query_uniprot`, `query_gwas_catalog`) to retrieve data via natural language or direct parameters.
6. **Extending Tools**
   - Add new tool metadata in `biomni/tool/tool_description/` as structured dictionaries.
   - Implement corresponding analysis functions in `biomni/tool/`.
   - Register tools in `tool_registry.py` for discovery.

### Monitoring and Debugging

- Use logging callbacks (`PromptLogger`, `NodeLogger`) to trace LLM interactions.
- Check output logs returned by analysis functions for detailed execution info.
- Use pre-commit hooks to maintain code quality.

### Performance and Scalability

- Modular design allows parallel execution of tasks and tools.
- Timeout wrappers in agent tools prevent hanging executions.
- Use of efficient numerical libraries (`numpy`, `scipy`) for computational tasks.
- Large data handled via streaming and chunking (e.g., PDF text extraction).

### Security Considerations

- API keys managed via environment variables, not hardcoded.
- Pre-commit hooks include security checks.
- External tool installations verified via version commands.

### Observability

- Progress bars (`tqdm`) used in data processing scripts.
- Structured logs and JSON outputs facilitate downstream analysis.
- Agent workflows produce detailed message histories for audit.

---

## Summary

This project is a **modular, extensible biomedical AI platform** integrating **LLM-powered agents**, **domain-specific scientific tools**, and **metadata-driven APIs** to automate complex biomedical research workflows. It emphasizes **declarative tool descriptions**, **dynamic resource retrieval**, and **robust environment provisioning** to enable researchers and developers to build, evaluate, and extend AI-driven biomedical applications efficiently.

---

# End of DETAILS.md