# BioRLHF
[![CI](https://github.com/jang1563/BioRLHF/actions/workflows/ci.yml/badge.svg)](https://github.com/jang1563/BioRLHF/actions/workflows/ci.yml)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)
**Biological Reinforcement Learning from Human Feedback**: A framework for fine-tuning LLMs on biological reasoning tasks using SFT, DPO, and GRPO with verifier-based reward models for factual accuracy, calibrated uncertainty, and chain-of-thought reasoning.
## Highlights
- **Three-stage training pipeline**: SFT β†’ DPO β†’ GRPO with verifier-based rewards
- **Multi-reward GRPO**: Four composable verifiers (factual, pathway, consistency, uncertainty) with configurable weights
- **+19% reward improvement** over SFT baseline using GRPO (0.650 vs 0.547)
- **-70% calibration error**: ECE reduced from 0.258 to 0.078 after GRPO
- **90% accuracy** on domain-specific biological reasoning tasks (SFT stage)
- **Learns from 363 examples**: efficient domain adaptation from spaceflight transcriptomics data
## Key Results
### GRPO Training (Phase 3)
| Metric | SFT Baseline | After GRPO | Improvement |
|--------|-------------|------------|-------------|
| Avg Reward | 0.547 | 0.650 | +19% |
| ECE (Calibration Error) | 0.258 | 0.078 | -70% |
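The ECE figures above are presumably the standard expected calibration error: responses are bucketed by stated confidence, and the gap between each bucket's mean confidence and its actual accuracy is averaged, weighted by bucket size. A minimal sketch of that metric, independent of the library (the repo's exact binning may differ):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: size-weighted mean |accuracy - confidence| over
    equal-width confidence bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - avg_conf)
    return ece
```

An ECE of 0.078 means the model's stated confidence is, on average, within about 8 points of its empirical accuracy.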
**GRPO Configuration (Full v2):**
- 16 generations per prompt (G=16) for robust advantage estimation
- Multi-reward: V1 (factual, 0.35) + V2 (pathway, 0.30) + V3 (consistency, 0.15) + V4 (uncertainty, 0.20)
- KL penalty beta=0.02, 2 iterations per batch, group-normalized rewards
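"Group-normalized rewards" means each completion's advantage is its reward z-scored within its own group of G samples. A self-contained sketch of that step (the trainer's actual implementation may differ in details such as the epsilon):

```python
def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: z-score each reward within its group
    of G completions sampled from the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Note that if all G rewards tie, every advantage collapses to zero and the batch carries no learning signal, which is why a larger G helps.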
### Model Comparison (SFT, 20-question evaluation)
| Model | Overall | Factual | Reasoning | Calibration |
|-------|---------|---------|-----------|-------------|
| **Mistral-7B** | **90.0%** | 80.0% | 100.0% | 100.0% |
| Qwen2.5-7B | 40.0% | 30.0% | 80.0% | 20.0% |
| Phi-2 | 25.0% | 20.0% | 60.0% | 0.0% |
### SFT Training Progression
| Version | Accuracy | Key Improvement |
|---------|----------|-----------------|
| v1 (Base SFT) | ~20% | Format learned, facts wrong |
| v2 (Expanded) | ~60% | More examples helped |
| v3 (Fact Drilling) | ~80% | Repetition fixed key facts |
| v4 (Advanced) | ~85% | Chain-of-thought, calibration |
| **Final** | **90%** | Targeted drilling for remaining errors |
## Installation
### From Source
```bash
git clone https://github.com/jang1563/BioRLHF.git
cd BioRLHF
pip install -e .
```
### With Development Dependencies
```bash
pip install -e ".[dev]"
```
### GPU Requirements
- NVIDIA GPU with 48GB+ VRAM recommended (A40 or A100)
- 24GB+ VRAM sufficient for SFT/DPO with 4-bit quantization
- CUDA 12.1+ recommended
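For the 24GB case, 4-bit loading via bitsandbytes is the usual route. A hedged sketch using the standard transformers API (the repo's trainers may configure this internally, and the exact settings are an assumption):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Typical NF4 4-bit setup for a 7B model on a 24GB GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)
```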
## Quick Start
### SFT Training
```python
from biorlhf import SFTTrainingConfig, run_sft_training
config = SFTTrainingConfig(
    model_name="mistralai/Mistral-7B-v0.3",
    dataset_path="data/kmp_sft_final.json",
    output_dir="./my_sft_model",
    num_epochs=10,
    learning_rate=1e-4,
)
model_path = run_sft_training(config)
```
### GRPO Training with Verifiers
```bash
# Using the CLI
biorlhf-grpo --config configs/grpo_full_v2.json
```
```python
# Or programmatically
from biorlhf.training.grpo import GRPOConfig, run_grpo_training
config = GRPOConfig.from_json("configs/grpo_full_v2.json")
run_grpo_training(config)
```
### Creating a Dataset
```python
from biorlhf.data import create_sft_dataset
dataset = create_sft_dataset(
    output_path="my_dataset.json",
    include_calibration=True,
    include_chain_of_thought=True,
)
print(f"Created {len(dataset)} training examples")
```
### Evaluating a Model
```python
from biorlhf import evaluate_model
result = evaluate_model(
    model_path="./my_sft_model",
    test_questions_path="data/kmp_test_set.json",
)
print(f"Overall Accuracy: {result.overall_accuracy:.1%}")
print(f"Factual: {result.factual_accuracy:.1%}")
print(f"Reasoning: {result.reasoning_accuracy:.1%}")
print(f"Calibration: {result.calibration_accuracy:.1%}")
```
### Running Inference
```python
from biorlhf.utils import load_model_for_inference, generate_response
model, tokenizer = load_model_for_inference(
    model_path="./my_sft_model",
    base_model="mistralai/Mistral-7B-v0.3",
)
prompt = "### Instruction:\nWhich tissue is most sensitive to ionizing radiation?\n\n### Response:\n"
response = generate_response(model, tokenizer, prompt)
print(response)
```
## Architecture
### Three-Stage Training Pipeline
```
Stage 1: SFT                 Stage 2: DPO                Stage 3: GRPO
(Supervised Fine-Tuning)     (Direct Preference          (Group Relative Policy
                              Optimization)               Optimization)

Mistral-7B-v0.3              SFT model                   SFT model (merged)
        |                        |                           |
LoRA (r=64, alpha=128)       Preference pairs            Generate G=16 completions
        |                        |                           |
363 training examples        Ranked responses            Score with V1-V4 verifiers
        |                        |                           |
10 epochs, lr=1e-4           beta=0.1                    Multi-reward composition
        |                        |                           |
SFT Adapter                  DPO Model                   GRPO Model
```
### Verifier-Based Reward System (V1-V4)
| Verifier | Name | Weight | What It Scores |
|----------|------|--------|----------------|
| **V1** | Factual | 0.35 | Exact match of biological facts (DEG counts, tissue names, directions) |
| **V2** | Pathway | 0.30 | Correct pathway/gene set enrichment references (Hallmark, KEGG) |
| **V3** | Consistency | 0.15 | Internal logical consistency within the response |
| **V4** | Uncertainty | 0.20 | Appropriate confidence calibration and epistemic humility |
The verifiers are composable via `RewardComposer` and can be individually weighted:
```python
from biorlhf.verifiers import RewardComposer
composer = RewardComposer(
    active_verifiers=["V1", "V2", "V3", "V4"],
    weights={"V1": 0.35, "V2": 0.30, "V3": 0.15, "V4": 0.20},
)
reward = composer.score(question, response, ground_truth)
```
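Under the hood the composed reward is presumably just a convex combination of the four per-verifier scores. The arithmetic, as a library-independent sketch (per-verifier scores assumed to lie in [0, 1]):

```python
# Weights taken from configs/grpo_full_v2.json.
WEIGHTS = {"V1": 0.35, "V2": 0.30, "V3": 0.15, "V4": 0.20}

def compose_reward(scores, weights=WEIGHTS):
    """Weighted sum of per-verifier scores; because the weights sum
    to 1, the composed reward stays in [0, 1]."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(w * scores[name] for name, w in weights.items())
```

Because the weights form a convex combination, no single verifier can push the reward outside the range of the individual scores.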
## Dataset
Training data is derived from a 2x2x2 factorial transcriptomic study:
- **Drug**: Kaempferol (KMP) vs Control
- **Stressor 1**: Hindlimb Unloading (HU): simulates microgravity
- **Stressor 2**: Ionizing Radiation (IR): simulates space radiation
- **Tissues**: Heart, Hippocampus, Liver, Soleus (+ Eye, Thymus for GRPO hold-out)
### Training Example Types
| Type | Count | Purpose |
|------|-------|---------|
| Factual Q&A | ~150 | Specific facts (DEG counts, tissue types) |
| Chain-of-Thought | ~50 | Step-by-step reasoning |
| Calibration | ~30 | Uncertainty expression |
| Multi-hop Reasoning | ~30 | Integrating multiple facts |
| Error Correction | ~20 | Learning from mistakes |
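At inference time the README uses an Alpaca-style instruction/response template (see Running Inference below); training records presumably follow the same shape. A sketch of the formatting, with the exact field layout in `kmp_sft_final.json` being an assumption:

```python
def format_example(instruction: str, response: str) -> str:
    """Render one record in the Alpaca-style template used at inference."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"
```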
### Ground Truth Data
```python
from biorlhf.data import (
    STRESSOR_EFFECTS,
    KMP_EFFECTS,
    INTERACTIONS,
    TISSUE_TYPES,
    OXPHOS_PATTERNS,
)
# Example: Get DEG counts for stressors
print(STRESSOR_EFFECTS["Hippocampus"])
# {'HU': 1555, 'IR': 5477, 'HU_IR': 5510}
```
## Project Structure
```
BioRLHF/
β”œβ”€β”€ src/biorlhf/ # Main package
β”‚ β”œβ”€β”€ training/ # SFT, DPO, and GRPO trainers
β”‚ β”œβ”€β”€ data/ # Dataset creation & ground truth
β”‚ β”œβ”€β”€ evaluation/ # Model evaluation & calibration
β”‚ β”œβ”€β”€ verifiers/ # V1-V4 reward verifiers
β”‚ β”‚ β”œβ”€β”€ factual.py # V1: Factual accuracy scoring
β”‚ β”‚ β”œβ”€β”€ pathway.py # V2: Pathway enrichment scoring
β”‚ β”‚ β”œβ”€β”€ consistency.py # V3: Logical consistency scoring
β”‚ β”‚ β”œβ”€β”€ uncertainty.py # V4: Calibration/uncertainty scoring
β”‚ β”‚ └── composer.py # Multi-reward composition
β”‚ β”œβ”€β”€ utils/ # Model loading, inference helpers
β”‚ └── cli.py # Command-line interface
β”œβ”€β”€ configs/ # Training configurations
β”‚ β”œβ”€β”€ grpo_mve.json # Minimum viable experiment
β”‚ └── grpo_full_v2.json # Full multi-reward training
β”œβ”€β”€ data/ # Training datasets
β”‚ β”œβ”€β”€ kmp_sft_final.json # 363 SFT training examples
β”‚ └── kmp_test_set.json # 20-question evaluation set
β”œβ”€β”€ examples/ # Usage examples
β”œβ”€β”€ scripts/ # SLURM job scripts & HPC guide
β”œβ”€β”€ tests/ # Unit tests
└── docs/ # Documentation
```
## Scientific Contributions
### 1. Verifier-Based GRPO Improves Calibration
- GRPO with V1-V4 verifiers reduced calibration error (ECE) by 70%
- Multi-reward composition outperforms single-reward training
- Sampling G=16 generations per prompt dramatically reduces zero-variance batches (from 50% to <5%)
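With binary pass/fail rewards, the chance that all G completions tie (leaving zero group-normalized advantage) falls geometrically in G. A toy model of that effect (the actual rewards here are continuous and multi-component, so this is only illustrative):

```python
def zero_variance_rate(G: int, p: float) -> float:
    """Probability that all G binary rewards agree (all pass or all
    fail), in which case group-normalized advantages are exactly zero."""
    return p ** G + (1 - p) ** G
```

For example, at a pass rate of p = 0.8, moving from G = 4 to G = 16 drops the tie rate from roughly 41% to roughly 3%.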
### 2. Fact Drilling Works for SFT
- Initial training: 20% accuracy on key facts
- After targeted repetition: 100% accuracy on drilled facts
- LLMs need explicit reinforcement of specific domain facts
### 3. Calibration is Learnable
- Trained on "I cannot determine X from this data" examples
- Mistral achieved 100% calibration accuracy at SFT stage
- GRPO further improved calibration via the V4 uncertainty verifier
### 4. DPO is Fragile for Domain Knowledge
- Aggressive DPO (beta=0.05) destroyed learned knowledge
- Model hallucinated unrelated content
- Preference learning needs careful tuning in specialized domains
### 5. Architecture Matters More Than Size
- Mistral-7B >> Qwen2.5-7B despite similar parameter counts
- Phi-2 (2.7B) insufficient for complex biological reasoning
- Model selection is critical for domain fine-tuning
## Key Learnings for AI Safety
1. **Honesty is trainable**: Models can learn appropriate epistemic humility
2. **Domain grounding matters**: Anchoring to experimental truth prevents hallucination
3. **Multi-reward > single reward**: Decomposing correctness into verifiable dimensions improves learning signal
4. **Preference learning is fragile**: DPO can catastrophically forget domain knowledge
5. **Evaluation drives improvement**: Systematic testing reveals specific failure modes
## Related Projects
- **[SpaceOmicsBench](https://github.com/jang1563/SpaceOmicsBench)**: 115-question benchmark for LLMs on spaceflight biomedical data
## Citation
If you use BioRLHF in your research, please cite:
```bibtex
@software{biorlhf2026,
  author = {Kim, JangKeun},
  title = {BioRLHF: Biological Reinforcement Learning from Human Feedback},
  year = {2026},
  url = {https://github.com/jang1563/BioRLHF},
  note = {Fine-tuning LLMs for biological reasoning with verifier-based GRPO}
}
```
## Contributing
Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
## License
This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.
---
*Developed by JangKeun Kim (jak4013@med.cornell.edu), Weill Cornell Medicine*