	# DETAILS.md

	🔍 Powered by [Detailer](https://detailer.ginylil.com) - Context-aware codebase analysis



	---

	## 1. Project Overview

	### Project Purpose & Domain

	This project is a comprehensive biomedical AI toolkit and research platform designed to facilitate biomedical data analysis, knowledge extraction, and AI-driven reasoning. It integrates large language models (LLMs), domain-specific bioinformatics tools, and scientific data processing pipelines to enable:

	- Automated extraction of biomedical knowledge from literature (e.g., bioRxiv papers)
	- Querying and integration of diverse biomedical databases and APIs
	- Execution of domain-specific computational biology and physiology analyses
	- AI agent orchestration for complex biomedical reasoning and tool invocation
	- Benchmarking and evaluation of biomedical tasks and datasets

	### Target Users and Use Cases

	- Biomedical researchers and data scientists seeking to automate literature mining, data retrieval, and analysis workflows.
	- Bioinformaticians requiring integrated access to multiple biological databases and computational tools.
	- AI researchers interested in applying LLMs and autonomous agents to biomedical problem solving.
	- Developers and integrators building domain-specific AI pipelines and scientific workflows.
	- Use cases include:
	  - Extracting structured biomedical tasks and entities from scientific papers
	  - Querying gene, protein, disease, and pathway databases via natural language prompts
	  - Running computational models of biological systems (e.g., metabolic networks, signaling)
	  - Performing image analysis and quantitative pathology workflows
	  - Orchestrating multi-step AI reasoning with tool use and self-criticism

	### Core Business Logic and Domain Models

	- Biomedical domain models: gene IDs, protein structures, pathways, disease ontologies, experimental assays.
	- Task abstractions: benchmark tasks with prompt/response evaluation (e.g., Humanity's Last Exam, LAB-Bench).
	- Tool metadata schemas: declarative descriptions of biomedical tools and APIs for dynamic invocation.
	- AI agent workflows: ReAct-style reasoning graphs integrating LLMs, tool calls, retrieval, and self-critique.
	- Data models: structured JSON, pandas DataFrames, numpy arrays representing biological data and analysis results.

	---

	## 2. Architecture and Structure

	### High-Level Architecture

	The system is organized into modular layers and components:

	- Core Library (`biomni/`): Contains main application logic, including:
	  - Agent framework (`biomni/agent/`): Implements autonomous AI agents using LLMs and workflow graphs.
	  - Task definitions (`biomni/task/`): Abstract base and concrete biomedical benchmark tasks.
	  - Tool implementations (`biomni/tool/`): Domain-specific analysis functions, API clients, and computational biology workflows.
	  - Tool metadata (`biomni/tool/tool_description/`): Declarative schemas describing tool APIs and parameters.
	  - Model components (`biomni/model/`): AI-driven resource retriever for selecting relevant tools and data.
	  - Utility modules (`biomni/utils.py`, `biomni/llm.py`, `biomni/env_desc.py`): Helpers for LLM instantiation, system commands, environment descriptions.
	  - Versioning (`biomni/version.py`): Package version management.

	- Environment Setup (`biomni_env/`): Scripts and configuration files for reproducible environment provisioning, including:
	  - Conda environment YAMLs (`environment.yml`, `bio_env.yml`)
	  - R package specifications (`r_packages.yml`)
	  - CLI tools installer (`install_cli_tools.sh`)
	  - Shell scripts for environment setup (`setup.sh`, `setup_path.sh`)

	- Scripts (`biomni/biorxiv_scripts/`): Data processing pipelines for literature mining and task extraction.

	- Documentation and Configuration:
	  - Root-level files: `README.md`, `CONTRIBUTION.md`, `pyproject.toml`, `.pre-commit-config.yaml`.

	---

	### Complete Repository Structure

	```
	.
	├── biomni/ (90 items)
	│ ├── agent/
	│ │ ├── __init__.py
	│ │ ├── a1.py
	│ │ ├── env_collection.py
	│ │ ├── qa_llm.py
	│ │ └── react.py
	│ ├── biorxiv_scripts/
	│ │ ├── extract_biorxiv_tasks.py
	│ │ ├── generate_function.py
	│ │ └── process_all_subjects.py
	│ ├── model/
	│ │ ├── __init__.py
	│ │ └── retriever.py
	│ ├── task/
	│ │ ├── __init__.py
	│ │ ├── base_task.py
	│ │ ├── hle.py
	│ │ └── lab_bench.py
	│ ├── tool/ (65 items)
	│ │ ├── schema_db/ (25 items)
	│ │ │ ├── cbioportal.pkl
	│ │ │ ├── clinvar.pkl
	│ │ │ ├── dbsnp.pkl
	│ │ │ ├── emdb.pkl
	│ │ │ ├── ensembl.pkl
	│ │ │ ├── geo.pkl
	│ │ │ ├── gnomad.pkl
	│ │ │ ├── gtopdb.pkl
	│ │ │ ├── gwas_catalog.pkl
	│ │ │ ├── interpro.pkl
	│ │ │ └── ... (15 more files)
	│ │ ├── tool_description/ (18 items)
	│ │ │ ├── biochemistry.py
	│ │ │ ├── bioengineering.py
	│ │ │ ├── biophysics.py
	│ │ │ ├── cancer_biology.py
	│ │ │ ├── cell_biology.py
	│ │ │ ├── database.py
	│ │ │ ├── genetics.py
	│ │ │ ├── genomics.py
	│ │ │ ├── immunology.py
	│ │ │ ├── literature.py
	│ │ │ ├── microbiology.py
	│ │ │ ├── molecular_biology.py
	│ │ │ ├── pathology.py
	│ │ │ ├── pharmacology.py
	│ │ │ ├── physiology.py
	│ │ │ ├── support_tools.py
	│ │ │ ├── synthetic_biology.py
	│ │ │ └── systems_biology.py
	│ │ ├── __init__.py
	│ │ ├── biochemistry.py
	│ │ ├── bioengineering.py
	│ │ ├── biophysics.py
	│ │ ├── cancer_biology.py
	│ │ ├── cell_biology.py
	│ │ ├── database.py
	│ │ ├── genetics.py
	│ │ └── ... (12 more files)
	│ ├── __init__.py
	│ ├── env_desc.py
	│ ├── llm.py
	│ ├── utils.py
	│ └── version.py
	├── biomni_env/ (9 items)
	│ ├── README.md
	│ ├── bio_env.yml
	│ ├── cli_tools_config.json
	│ ├── environment.yml
	│ ├── install_cli_tools.sh
	│ ├── install_r_packages.R
	│ ├── r_packages.yml
	│ ├── setup.sh
	│ └── setup_path.sh
	├── figs/
	│ └── biomni_logo.png
	├── tutorials/
	│ ├── examples/
	│ │ └── cloning.ipynb
	│ ├── 101_biomni.ipynb
	│ └── biomni_101.ipynb
	├── .gitignore
	├── .pre-commit-config.yaml
	├── CONTRIBUTION.md
	├── LICENSE
	├── README.md
	└── pyproject.toml
	```

	---

	## 3. Technical Implementation Details

	### Core Modules and Their Roles

	#### `biomni/agent/`

	- Implements autonomous AI agents using the ReAct paradigm:
	  - `react.py`: Main ReAct agent class managing reasoning, tool invocation, retrieval, and self-criticism workflows.
	  - `env_collection.py`: Environment and data retrieval utilities.
	  - `qa_llm.py`: Question-answering LLM wrappers.
	  - `a1.py`: Possibly experimental or auxiliary agent code.

	- Uses langgraph for workflow graph orchestration and langchain for LLM integration.
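The ReAct pattern named above alternates between a model decision and a tool observation until the model produces an answer. The following is a minimal sketch of that loop in plain Python, not the code in `react.py`: it does not use langgraph, and `run_react_loop`, `scripted_policy`, and the `reverse_complement` tool are hypothetical names with a scripted function standing in for the LLM.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical step format: ("act", tool_name, tool_input) or ("finish", answer, "").
Step = Tuple[str, str, str]


def run_react_loop(
    prompt: str,
    policy: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Alternate between policy decisions and tool observations until done."""
    history = [f"user: {prompt}"]
    for _ in range(max_steps):
        kind, name, payload = policy(history)
        if kind == "finish":
            return name  # for "finish" steps the second field holds the answer
        if name in tools:
            observation = tools[name](payload)
        else:
            observation = f"error: unknown tool {name!r}"
        history.append(f"action: {name}({payload})")
        history.append(f"observation: {observation}")
    return "stopped: step limit reached"


def scripted_policy(history: List[str]) -> Step:
    """Stand-in for an LLM: call one tool, then answer with its observation."""
    if not any(line.startswith("observation:") for line in history):
        return ("act", "reverse_complement", "ATGC")
    last_observation = history[-1].split(": ", 1)[1]
    return ("finish", last_observation, "")


def reverse_complement(seq: str) -> str:
    """Toy tool: reverse-complement an uppercase DNA string."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


answer = run_react_loop(
    "Reverse-complement ATGC",
    scripted_policy,
    {"reverse_complement": reverse_complement},
)
print(answer)  # GCAT
```

The step limit is the part worth keeping in any real variant: it bounds the loop when the model never emits a final answer.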

	#### `biomni/task/`

	- Defines benchmark tasks with a common interface:
	  - `base_task.py`: Abstract base class specifying methods like `get_example()`, `evaluate()`, `output_class()`.
	  - `hle.py`: "Humanity's Last Exam" task implementation.
	  - `lab_bench.py`: LAB-Bench dataset task.

	- Tasks load data (e.g., parquet files), generate prompts, and evaluate LLM responses.
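As an illustration of the task interface described above, here is a minimal sketch built on Python's `abc` module. Only the three method names come from this document; the signatures, the `BaseTask` and `MultipleChoiceTask` class names, and the exact-match scoring rule are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseTask(ABC):
    """Hypothetical interface; only the method names come from the text above."""

    @abstractmethod
    def get_example(self, index: int) -> Dict[str, str]:
        """Return one prompt together with its reference answer."""

    @abstractmethod
    def evaluate(self, responses: List[str]) -> float:
        """Score a list of responses against the reference answers."""

    @abstractmethod
    def output_class(self) -> type:
        """Return the type a response is expected to parse into."""


class MultipleChoiceTask(BaseTask):
    """Toy task scored by exact match on the answer letter."""

    def __init__(self, questions: List[str], answers: List[str]) -> None:
        if len(questions) != len(answers):
            raise ValueError("questions and answers must have the same length")
        self.questions = questions
        self.answers = answers

    def get_example(self, index: int) -> Dict[str, str]:
        return {"prompt": self.questions[index], "answer": self.answers[index]}

    def evaluate(self, responses: List[str]) -> float:
        if len(responses) != len(self.answers):
            raise ValueError("expected one response per question")
        if not self.answers:
            return 0.0
        # Normalize whitespace and case before comparing answer letters.
        correct = sum(
            given.strip().upper() == expected
            for given, expected in zip(responses, self.answers)
        )
        return correct / len(self.answers)

    def output_class(self) -> type:
        return str
```

Because the base class is abstract, a subclass that forgets one of the three methods fails at instantiation rather than midway through an evaluation run.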

	#### `biomni/tool/`

	- Contains domain-specific scientific analysis functions organized by subdomains:
	  - `biochemistry.py`, `bioengineering.py`, `biophysics.py`, `cancer_biology.py`, `cell_biology.py`, `genetics.py`, `pathology.py`, `physiology.py`, `systems_biology.py`, etc.
	  - Each file implements multiple functions performing analyses, simulations, or data processing workflows.
	  - Functions accept input files/parameters and return detailed textual logs and output files.
	- API client modules (e.g., `database.py`) provide facade functions to query external biomedical databases (UniProt, GWAS Catalog, Ensembl, etc.) via REST or GraphQL APIs, often using LLMs to generate query payloads from natural language prompts.

	- Tool registry (`tool_registry.py`) manages metadata about available tools, supporting dynamic registration and lookup.
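Dynamic registration and lookup can be pictured as a dictionary keyed by tool name. The sketch below is an assumption about the general shape of such a registry, not the contents of `tool_registry.py`; the `ToolRegistry` class and its method names are hypothetical.

```python
from typing import Dict, List, Optional


class ToolRegistry:
    """Hypothetical in-memory registry mapping tool names to metadata dicts."""

    def __init__(self) -> None:
        self._tools: Dict[str, dict] = {}

    def register(self, spec: dict) -> None:
        """Add a tool spec, rejecting unnamed or duplicate entries."""
        name = spec.get("name")
        if not name:
            raise ValueError("tool spec needs a non-empty 'name'")
        if name in self._tools:
            raise ValueError(f"tool {name!r} is already registered")
        self._tools[name] = spec

    def get(self, name: str) -> Optional[dict]:
        """Return the spec for a tool, or None if it is unknown."""
        return self._tools.get(name)

    def names(self) -> List[str]:
        """List registered tool names in sorted order."""
        return sorted(self._tools)


registry = ToolRegistry()
registry.register({"name": "query_uniprot", "description": "Query UniProt."})
registry.register({"name": "query_gwas_catalog", "description": "Query GWAS Catalog."})
print(registry.names())  # ['query_gwas_catalog', 'query_uniprot']
```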

	#### `biomni/tool/tool_description/`

	- Contains declarative metadata schemas describing tool APIs:
	  - Each file exports a `description` list of dictionaries defining tool names, descriptions, required and optional parameters with types and defaults.
	  - Supports dynamic API generation, validation, and documentation.
	  - Organized by biological domain (e.g., genetics, immunology, pathology).
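To make the metadata format concrete, the sketch below shows what one entry of a `description` list might look like and how it could drive argument validation. The key names (`required_parameters`, `optional_parameters`, `type`, `default`), the `count_gc_content` tool, and the `bind_arguments` helper are assumptions for illustration; the text above only states that entries carry names, descriptions, and typed parameters with defaults.

```python
from typing import Any, Dict

# Assumed mapping from type names used in the metadata to Python types.
TYPE_MAP = {"str": str, "int": int, "float": float, "bool": bool}

# Hypothetical entry in the style described above.
description = [
    {
        "name": "count_gc_content",
        "description": "Compute the GC fraction of a DNA sequence.",
        "required_parameters": [
            {"name": "sequence", "type": "str", "description": "DNA sequence", "default": None},
        ],
        "optional_parameters": [
            {"name": "window", "type": "int", "description": "Window size", "default": 100},
        ],
    }
]


def bind_arguments(spec: Dict[str, Any], supplied: Dict[str, Any]) -> Dict[str, Any]:
    """Check supplied arguments against a spec and fill in optional defaults."""
    params = spec["required_parameters"] + spec["optional_parameters"]
    known = {param["name"]: param for param in params}

    unknown = sorted(set(supplied) - set(known))
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")

    bound: Dict[str, Any] = {}
    for param in spec["required_parameters"]:
        if param["name"] not in supplied:
            raise ValueError(f"missing required parameter {param['name']!r}")
        bound[param["name"]] = supplied[param["name"]]
    for param in spec["optional_parameters"]:
        bound[param["name"]] = supplied.get(param["name"], param["default"])

    for name, value in bound.items():
        expected = TYPE_MAP[known[name]["type"]]
        if value is not None and not isinstance(value, expected):
            raise TypeError(f"{name!r} should be {expected.__name__}")
    return bound


print(bind_arguments(description[0], {"sequence": "ATGC"}))
# {'sequence': 'ATGC', 'window': 100}
```

Keeping the metadata separate from the implementation means the same entry can feed validation, documentation, and the prompt shown to the model.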

	#### `biomni/model/retriever.py`

	- Implements `ToolRetriever` class for AI-driven resource selection:
	  - Uses LLMs (OpenAI or Anthropic) to parse user queries and select relevant tools, datasets, and libraries.
	  - Encapsulates prompt formatting and response parsing logic.
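The two halves of that logic, building a selection prompt and parsing the model's reply, can be sketched without calling any LLM. The prompt wording, the `TOOLS: [...]` reply convention, and both function names below are assumptions for illustration and are not taken from `retriever.py`.

```python
import re
from typing import Dict, List


def format_selection_prompt(query: str, tools: List[Dict[str, str]]) -> str:
    """Build a prompt that lists candidate tools by index."""
    lines = [f"{i}. {tool['name']}: {tool['description']}" for i, tool in enumerate(tools)]
    return (
        "Select the tools relevant to the query.\n"
        f"Query: {query}\n"
        "Tools:\n" + "\n".join(lines) + "\n"
        "Answer with a line of the form TOOLS: [i, j, ...]"
    )


def parse_selection(response: str, tool_count: int) -> List[int]:
    """Extract valid, de-duplicated tool indices from a model reply."""
    match = re.search(r"TOOLS:\s*\[([^\]]*)\]", response)
    if match is None:
        return []
    selected: List[int] = []
    for token in re.findall(r"\d+", match.group(1)):
        index = int(token)
        # Drop indices the model invented and keep first occurrences only.
        if index < tool_count and index not in selected:
            selected.append(index)
    return selected


tools = [
    {"name": "query_uniprot", "description": "Protein records"},
    {"name": "query_gwas_catalog", "description": "GWAS associations"},
    {"name": "query_ensembl", "description": "Gene annotations"},
]
prompt = format_selection_prompt("Which variants are linked to asthma?", tools)
print(parse_selection("I would use these.\nTOOLS: [1, 2, 2, 9]", len(tools)))  # [1, 2]
```

Filtering out-of-range indices in the parser matters because model replies can reference entries that were never offered.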

	#### `biomni/utils.py` and `biomni/llm.py`

	- `utils.py`: Utility functions for running system commands (R, Bash), file operations, schema generation, logging, and colorized printing.
	- `llm.py`: Factory functions to instantiate LLMs (OpenAI, Anthropic) with configurable parameters.
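A factory of this kind typically maps a model name to a provider and then constructs the matching client. The sketch below assumes a simple prefix rule (`claude` for Anthropic, `gpt-` for OpenAI); that rule and the names `resolve_provider` and `get_llm` are hypothetical, and the actual dispatch in `llm.py` may differ. The LangChain classes are imported lazily so that only the chosen provider's package needs to be installed.

```python
def resolve_provider(model: str) -> str:
    """Map a model name to a provider using an assumed prefix convention."""
    if model.startswith("claude"):
        return "anthropic"
    if model.startswith("gpt-"):
        return "openai"
    raise ValueError(f"cannot infer a provider for model {model!r}")


def get_llm(model: str, temperature: float = 0.0):
    """Instantiate a chat model for the resolved provider."""
    provider = resolve_provider(model)
    if provider == "anthropic":
        # Reads ANTHROPIC_API_KEY from the environment.
        from langchain_anthropic import ChatAnthropic

        return ChatAnthropic(model=model, temperature=temperature)
    # Reads OPENAI_API_KEY from the environment.
    from langchain_openai import ChatOpenAI

    return ChatOpenAI(model=model, temperature=temperature)
```

Callers then depend on a single function rather than on provider-specific constructors, which is the point of the factory pattern here.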

	#### `biomni/env_desc.py`

	- Contains environment and dataset descriptions, acting as a centralized metadata repository for datasets and experimental environments.

	---

	### Environment Setup (`biomni_env/`)

	- `setup.sh`: Main shell script to create conda environment, install R packages, and CLI bioinformatics tools.
	- `install_cli_tools.sh`: Automates downloading, compiling, and installing external bioinformatics command-line tools, managing PATH and verification.
	- `r_packages.yml`: Lists R packages required.
	- `environment.yml` and `bio_env.yml`: Conda environment specifications.
	- `setup_path.sh`: Shell script to update environment variables for CLI tools.

	---

	### Entry Points and Execution Flow

	- Agent usage: Instantiate the `react` agent from `biomni.agent.react`, configure it with tools and retrieval, then call `go(prompt)` to run a reasoning workflow.
	- Task evaluation: Use classes in `biomni.task` to load datasets, generate prompts, and evaluate LLM outputs.
	- Tool invocation: Call functions in `biomni.tool` modules or use API facades in `database.py` to query external resources.
	- Metadata-driven tool discovery: Use `tool_registry.py` and `tool_description` schemas to dynamically discover and validate tools.
	- Environment setup: Run `biomni_env/setup.sh` to provision environment and install dependencies.

	---

	## 4. Development Patterns and Standards

	### Code Organization Principles

	- Modular design: Clear separation of concerns by domain and functionality (agent, task, tool, model).
	- Functional programming style: Most analysis modules use standalone functions with explicit inputs and outputs.
	- Declarative metadata: Tool descriptions and schemas are separated from implementation, enabling dynamic validation and UI generation.
	- Abstract base classes: Used in `biomni.task.base_task` to enforce consistent task interfaces.
	- Factory pattern: Used in `llm.py` to instantiate LLMs based on configuration.
	- Strategy pattern: Task implementations and tool retrieval use interchangeable strategies.

	### Testing and Coverage

	- No explicit test files were detected; testing is likely manual or done through the notebooks in `tutorials/`.
	- Tasks and tools return detailed logs suitable for manual verification.
	- Metadata schemas facilitate automated validation of inputs.

	### Error Handling and Logging

	- Use of try-except blocks around external calls and subprocesses.
	- Logging via custom callback handlers (`PromptLogger`, `NodeLogger`) in LLM interactions.
	- Utilities provide colorized printing and error wrappers for robustness.
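The try-except pattern around subprocess calls can be sketched as follows. This is an illustrative helper, not the wrapper in `utils.py`: the name `run_command` and the choice to return error strings instead of raising are assumptions, chosen to match the document's note that tools report back as textual logs.

```python
import subprocess
import sys
from typing import List


def run_command(args: List[str], timeout_s: float = 60.0) -> str:
    """Run a command and return its stdout, or a readable error message."""
    try:
        completed = subprocess.run(
            args,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
    except FileNotFoundError:
        return f"error: executable not found: {args[0]}"
    except subprocess.TimeoutExpired:
        return f"error: timed out after {timeout_s} s"
    except subprocess.CalledProcessError as exc:
        return f"error: exit code {exc.returncode}: {exc.stderr.strip()}"
    return completed.stdout


# Use the current interpreter so the example does not depend on other binaries.
print(run_command([sys.executable, "-c", "print('ok')"]))
```

Returning the message lets an agent read the failure as an observation and decide what to do next, at the cost of callers having to check for the `error:` prefix.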

	### Configuration Management

	- Environment variables for API keys (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`).
	- YAML and JSON files for environment and tool configuration.
	- Dynamic loading of schemas from pickle files for API request generation.
	- CLI tools and R packages installed via scripted environment setup.
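Loading the pickled schemas mentioned above amounts to reading every `.pkl` file in a directory into a dictionary. The sketch below assumes one schema object per file, keyed by file stem; `load_schemas` is a hypothetical helper name. Note that `pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source.

```python
import pickle
from pathlib import Path
from typing import Any, Dict


def load_schemas(schema_dir: Path) -> Dict[str, Any]:
    """Load every .pkl file in a directory, keyed by file name without suffix."""
    schemas: Dict[str, Any] = {}
    for path in sorted(schema_dir.glob("*.pkl")):
        with path.open("rb") as handle:
            # Only unpickle files you trust: pickle can run code on load.
            schemas[path.stem] = pickle.load(handle)
    return schemas
```

With the layout shown in the repository tree, a call such as `load_schemas(Path("biomni/tool/schema_db"))` would produce keys like `clinvar` and `ensembl`; the structure of each loaded object is not described in this document.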

	---

	## 5. Integration and Dependencies

	### External Libraries

	- LLM & AI Frameworks: `langchain_core`, `langchain_openai`, `langchain_anthropic`, `langgraph`
	- Scientific Computing: `numpy`, `pandas`, `scipy`, `scikit-image`, `scikit-learn`, `matplotlib`, `Biopython`, `cobra`
	- Data Processing: `requests`, `PyPDF2`, plus the standard-library `pickle` and `json`
	- System and OS (standard library): `subprocess`, `os`, `sys`, `tempfile`, `multiprocessing`
	- Others: `tqdm` (progress bars), plus the standard-library `enum` and `ast` (code introspection)

	### External APIs and Data Sources

	- Biomedical databases: UniProt, GWAS Catalog, Ensembl, ClinVar, dbSNP, EMDB, GEO, gnomAD, InterPro, etc.
	- Bioinformatics tools: PLINK, IQ-TREE, GCTA, MACS2, samtools, LUMPY, installed via CLI tools installer.
	- R packages for statistical and bioinformatics analyses.

	### Build and Deployment Dependencies

	- Python 3 environment managed via Conda (`environment.yml`).
	- R environment with specified packages (`r_packages.yml`).
	- Shell scripts for CLI tool installation and environment setup.
	- Pre-commit hooks for code quality and security.

	---

	## 6. Usage and Operational Guidance

	### Getting Started

	1. Environment Setup
	   - Run `biomni_env/setup.sh` to create the Conda environment, install R packages, and CLI tools.
	   - Source `biomni_env/setup_path.sh` or add it to your shell profile to configure PATH.

	2. API Keys
	   - Set environment variables `OPENAI_API_KEY` and/or `ANTHROPIC_API_KEY` for LLM access.

	3. Running Agents
	   - Import and instantiate the `react` agent from `biomni.agent.react`.
	   - Configure with desired tools and retrieval options.
	   - Call `go(prompt)` to execute reasoning workflows.

	4. Executing Tasks
	   - Use classes in `biomni.task` to load datasets and evaluate LLM responses.
	   - Implement new tasks by subclassing `base_task` and following the interface.

	5. Querying Databases
	   - Use `biomni.tool.database` functions (e.g., `query_uniprot`, `query_gwas_catalog`) to retrieve data via natural language or direct parameters.

	6. Extending Tools
	   - Add new tool metadata in `biomni/tool/tool_description/` as structured dictionaries.
	   - Implement corresponding analysis functions in `biomni/tool/`.
	   - Register tools in `tool_registry.py` for discovery.
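For step 2 above, it can help to confirm which API keys are visible to Python before constructing an agent. The following is a small sketch; the two variable names come from this document, while the `available_providers` helper is hypothetical.

```python
import os
from typing import List, Mapping, Optional

# Environment variable names as given in this document.
KEY_VARIABLES = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def available_providers(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the providers whose API key is set to a non-empty value."""
    if env is None:
        env = os.environ
    return [name for name, variable in KEY_VARIABLES.items() if env.get(variable)]


providers = available_providers()
if not providers:
    print("No LLM API key found; set OPENAI_API_KEY and/or ANTHROPIC_API_KEY.")
else:
    print("Keys found for:", ", ".join(providers))
```

Passing a mapping instead of reading `os.environ` directly keeps the check easy to exercise in isolation.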

	### Monitoring and Debugging

	- Use logging callbacks (`PromptLogger`, `NodeLogger`) to trace LLM interactions.
	- Check output logs returned by analysis functions for detailed execution info.
	- Use pre-commit hooks to maintain code quality.

	### Performance and Scalability

	- Modular design allows parallel execution of tasks and tools.
	- Timeout wrappers in agent tools prevent hanging executions.
	- Use of efficient numerical libraries (`numpy`, `scipy`) for computational tasks.
	- Large data handled via streaming and chunking (e.g., PDF text extraction).
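A timeout wrapper of the kind mentioned above can be sketched with `concurrent.futures`. This is one possible approach and not necessarily the one used in the agent code; `call_with_timeout` is a hypothetical name. A thread-based wrapper stops waiting for the call but cannot terminate it, so a hard kill would need a separate process instead.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable


def call_with_timeout(func: Callable[..., Any], timeout_s: float, *args: Any, **kwargs: Any) -> Any:
    """Run func in a worker thread and stop waiting after timeout_s seconds."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread keeps running in the background; only the wait ends.
        return f"error: timed out after {timeout_s} s"
    finally:
        # Do not block here, otherwise the timeout would have no effect.
        executor.shutdown(wait=False)


print(call_with_timeout(sum, 1.0, [1, 2, 3]))  # 6
```

Exceptions raised inside `func` propagate to the caller through `future.result`, so the wrapper only changes behavior for calls that run too long.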

	### Security Considerations

	- API keys managed via environment variables, not hardcoded.
	- Pre-commit hooks include security checks.
	- External tool installations verified via version commands.

	### Observability

	- Progress bars (`tqdm`) used in data processing scripts.
	- Structured logs and JSON outputs facilitate downstream analysis.
	- Agent workflows produce detailed message histories for audit.

	---

	## Summary

	This project is a modular, extensible biomedical AI platform integrating LLM-powered agents, domain-specific scientific tools, and metadata-driven APIs to automate complex biomedical research workflows. It emphasizes declarative tool descriptions, dynamic resource retrieval, and robust environment provisioning to enable researchers and developers to build, evaluate, and extend AI-driven biomedical applications efficiently.
