# BioRLHF


[![CI](https://github.com/jang1563/BioRLHF/actions/workflows/ci.yml/badge.svg)](https://github.com/jang1563/BioRLHF/actions/workflows/ci.yml)
[![Python](https://img.shields.io/badge/python-3.x-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)


**Biological Reinforcement Learning from Human Feedback**: A framework for fine-tuning LLMs on biological reasoning tasks using SFT, DPO, and GRPO with verifier-based reward models for factual accuracy, calibrated uncertainty, and chain-of-thought reasoning.


## Highlights


- **Three-stage training pipeline**: SFT → DPO → GRPO with verifier-based rewards
- **Multi-reward GRPO**: Four composable verifiers (factual, pathway, consistency, uncertainty) with configurable weights
- **+19% reward improvement** over the SFT baseline using GRPO (0.650 vs 0.547)
- **-70% calibration error**: ECE reduced from 0.258 to 0.078 after GRPO
- **90% accuracy** on domain-specific biological reasoning tasks (SFT stage)
- **Learns from 363 examples**: efficient domain adaptation from spaceflight transcriptomics data


## Key Results


### GRPO Training (Phase 3)


| | Metric | SFT Baseline | After GRPO | Improvement | |
| |--------|-------------|------------|-------------| |
| | Avg Reward | 0.547 | 0.650 | +19% | |
| | ECE (Calibration Error) | 0.258 | 0.078 | -70% | |


**GRPO Configuration (Full v2):**
- 16 generations per prompt (G=16) for robust advantage estimation
- Multi-reward: V1 (factual, 0.35) + V2 (pathway, 0.30) + V3 (consistency, 0.15) + V4 (uncertainty, 0.20)
- KL penalty beta=0.02, 2 iterations per batch, group-normalized rewards (see the sketch below)
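
For orientation, the Full v2 settings could be expressed roughly as follows. The field names below (`model_path`, `num_generations`, `kl_beta`, `num_iterations`, `reward_weights`) are illustrative assumptions rather than the confirmed `GRPOConfig` schema; the authoritative values live in `configs/grpo_full_v2.json`.

```python
from biorlhf.training.grpo import GRPOConfig

# Hypothetical field names, mirroring the Full v2 hyperparameters listed above.
config = GRPOConfig(
    model_path="./my_sft_model",      # merged SFT checkpoint (assumed field)
    num_generations=16,               # G=16 completions per prompt
    kl_beta=0.02,                     # KL penalty toward the reference policy
    num_iterations=2,                 # GRPO iterations per batch
    reward_weights={"V1": 0.35, "V2": 0.30, "V3": 0.15, "V4": 0.20},
)
```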


### Model Comparison (SFT, 20-question evaluation)


| | Model | Overall | Factual | Reasoning | Calibration | |
| |-------|---------|---------|-----------|-------------| |
| | **Mistral-7B** | **90.0%** | 80.0% | 100.0% | 100.0% | |
| | Qwen2.5-7B | 40.0% | 30.0% | 80.0% | 20.0% | |
| | Phi-2 | 25.0% | 20.0% | 60.0% | 0.0% | |


### SFT Training Progression


| | Version | Accuracy | Key Improvement | |
| |---------|----------|-----------------| |
| | v1 (Base SFT) | ~20% | Format learned, facts wrong | |
| | v2 (Expanded) | ~60% | More examples helped | |
| | v3 (Fact Drilling) | ~80% | Repetition fixed key facts | |
| | v4 (Advanced) | ~85% | Chain-of-thought, calibration | |
| | **Final** | **90%** | Targeted drilling for remaining errors | |


## Installation


### From Source


```bash
git clone https://github.com/jang1563/BioRLHF.git
cd BioRLHF
pip install -e .
```


### With Development Dependencies


```bash
pip install -e ".[dev]"
```


### GPU Requirements


- NVIDIA GPU with 48GB+ VRAM recommended (A40 or A100)
- 24GB+ VRAM is sufficient for SFT/DPO with 4-bit quantization (see the loading sketch below)
- CUDA 12.1+ recommended
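
For the 24GB case, the usual route is 4-bit (NF4) loading via bitsandbytes. The sketch below is the generic Hugging Face pattern, not necessarily what BioRLHF's own loaders do internally.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Generic 4-bit NF4 loading; BioRLHF's trainers may configure this differently.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)
```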


## Quick Start


### SFT Training


```python
from biorlhf import SFTTrainingConfig, run_sft_training

config = SFTTrainingConfig(
    model_name="mistralai/Mistral-7B-v0.3",
    dataset_path="data/kmp_sft_final.json",
    output_dir="./my_sft_model",
    num_epochs=10,
    learning_rate=1e-4,
)

model_path = run_sft_training(config)
```


### GRPO Training with Verifiers


```bash
# Using the CLI
biorlhf-grpo --config configs/grpo_full_v2.json
```


```python
# Or programmatically
from biorlhf.training.grpo import GRPOConfig, run_grpo_training

config = GRPOConfig.from_json("configs/grpo_full_v2.json")
run_grpo_training(config)
```


### Creating a Dataset


```python
from biorlhf.data import create_sft_dataset

dataset = create_sft_dataset(
    output_path="my_dataset.json",
    include_calibration=True,
    include_chain_of_thought=True,
)

print(f"Created {len(dataset)} training examples")
```


### Evaluating a Model


```python
from biorlhf import evaluate_model

result = evaluate_model(
    model_path="./my_sft_model",
    test_questions_path="data/kmp_test_set.json",
)

print(f"Overall Accuracy: {result.overall_accuracy:.1%}")
print(f"Factual: {result.factual_accuracy:.1%}")
print(f"Reasoning: {result.reasoning_accuracy:.1%}")
print(f"Calibration: {result.calibration_accuracy:.1%}")
```


### Running Inference


```python
from biorlhf.utils import load_model_for_inference, generate_response

model, tokenizer = load_model_for_inference(
    model_path="./my_sft_model",
    base_model="mistralai/Mistral-7B-v0.3",
)

prompt = "### Instruction:\nWhich tissue is most sensitive to ionizing radiation?\n\n### Response:\n"
response = generate_response(model, tokenizer, prompt)
print(response)
```


## Architecture


### Three-Stage Training Pipeline


```
Stage 1: SFT                 Stage 2: DPO                Stage 3: GRPO
(Supervised Fine-Tuning)     (Direct Preference          (Group Relative Policy
                              Optimization)               Optimization)

Mistral-7B-v0.3              SFT model                   SFT model (merged)
       |                          |                           |
LoRA (r=64, alpha=128)       Preference pairs            Generate G=16 completions
       |                          |                           |
363 training examples        Ranked responses            Score with V1-V4 verifiers
       |                          |                           |
10 epochs, lr=1e-4           beta=0.1                    Multi-reward composition
       |                          |                           |
SFT Adapter                  DPO Model                   GRPO Model
```
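
The Stage 1 adapter settings in the diagram (r=64, alpha=128) correspond to a standard PEFT LoRA configuration. A minimal sketch, with `target_modules` and dropout as assumptions rather than values taken from the repo:

```python
from peft import LoraConfig

# Sketch of the Stage 1 LoRA adapter; target_modules and dropout are assumptions
# (a common choice for Mistral-style attention projections).
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```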


### Verifier-Based Reward System (V1-V4)


| | Verifier | Name | Weight | What It Scores | |
| |----------|------|--------|----------------| |
| | **V1** | Factual | 0.35 | Exact match of biological facts (DEG counts, tissue names, directions) | |
| | **V2** | Pathway | 0.30 | Correct pathway/gene set enrichment references (Hallmark, KEGG) | |
| | **V3** | Consistency | 0.15 | Internal logical consistency within the response | |
| | **V4** | Uncertainty | 0.20 | Appropriate confidence calibration and epistemic humility | |


The verifiers are composable via `RewardComposer` and can be individually weighted:


```python
from biorlhf.verifiers import RewardComposer

composer = RewardComposer(
    active_verifiers=["V1", "V2", "V3", "V4"],
    weights={"V1": 0.35, "V2": 0.30, "V3": 0.15, "V4": 0.20},
)

reward = composer.score(question, response, ground_truth)
```
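
Conceptually, the composed reward is a weighted sum of the active verifiers' scores. A minimal sketch of that idea (not the actual `RewardComposer` internals):

```python
def composed_reward(scores: dict, weights: dict) -> float:
    """Weighted sum over active verifiers; the Full v2 weights sum to 1.0."""
    return sum(weights[v] * scores[v] for v in weights)

# Example: perfect V1/V2, middling V3/V4 under the Full v2 weights
weights = {"V1": 0.35, "V2": 0.30, "V3": 0.15, "V4": 0.20}
scores = {"V1": 1.0, "V2": 1.0, "V3": 0.5, "V4": 0.5}
print(composed_reward(scores, weights))  # 0.35 + 0.30 + 0.075 + 0.10 = 0.825
```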


## Dataset


Training data is derived from a 2x2x2 factorial transcriptomic study:


- **Drug**: Kaempferol (KMP) vs Control
- **Stressor 1**: Hindlimb Unloading (HU), which simulates microgravity
- **Stressor 2**: Ionizing Radiation (IR), which simulates space radiation
- **Tissues**: Heart, Hippocampus, Liver, Soleus (+ Eye, Thymus for GRPO hold-out)


### Training Example Types


| | Type | Count | Purpose | |
| |------|-------|---------| |
| | Factual Q&A | ~150 | Specific facts (DEG counts, tissue types) | |
| | Chain-of-Thought | ~50 | Step-by-step reasoning | |
| | Calibration | ~30 | Uncertainty expression | |
| | Multi-hop Reasoning | ~30 | Integrating multiple facts | |
| | Error Correction | ~20 | Learning from mistakes | |


### Ground Truth Data


```python
from biorlhf.data import (
    STRESSOR_EFFECTS,
    KMP_EFFECTS,
    INTERACTIONS,
    TISSUE_TYPES,
    OXPHOS_PATTERNS,
)

# Example: Get DEG counts for stressors
print(STRESSOR_EFFECTS["Hippocampus"])
# {'HU': 1555, 'IR': 5477, 'HU_IR': 5510}
```


## Project Structure


```
BioRLHF/
├── src/biorlhf/              # Main package
│   ├── training/             # SFT, DPO, and GRPO trainers
│   ├── data/                 # Dataset creation & ground truth
│   ├── evaluation/           # Model evaluation & calibration
│   ├── verifiers/            # V1-V4 reward verifiers
│   │   ├── factual.py        # V1: Factual accuracy scoring
│   │   ├── pathway.py        # V2: Pathway enrichment scoring
│   │   ├── consistency.py    # V3: Logical consistency scoring
│   │   ├── uncertainty.py    # V4: Calibration/uncertainty scoring
│   │   └── composer.py       # Multi-reward composition
│   ├── utils/                # Model loading, inference helpers
│   └── cli.py                # Command-line interface
├── configs/                  # Training configurations
│   ├── grpo_mve.json         # Minimum viable experiment
│   └── grpo_full_v2.json     # Full multi-reward training
├── data/                     # Training datasets
│   ├── kmp_sft_final.json    # 363 SFT training examples
│   └── kmp_test_set.json     # 20-question evaluation set
├── examples/                 # Usage examples
├── scripts/                  # SLURM job scripts & HPC guide
├── tests/                    # Unit tests
└── docs/                     # Documentation
```


## Scientific Contributions


### 1. Verifier-Based GRPO Improves Calibration


- GRPO with V1-V4 verifiers reduced calibration error (ECE) by 70%
- Multi-reward composition outperforms single-reward training
- G=16 generations dramatically reduces zero-variance batches (from 50% to <5%); see the sketch below
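
To see why larger groups help, group-relative advantages are typically computed by normalizing rewards within each prompt's G completions (this is the standard GRPO formulation, not necessarily the exact code in this repo):

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize rewards within one prompt's group of G completions."""
    mean, std = rewards.mean(), rewards.std()
    if std < eps:  # zero-variance group: all completions tie, no learning signal
        return np.zeros_like(rewards)
    return (rewards - mean) / std

# With G=16 it is far less likely that every completion ties on the composed
# reward, so far fewer prompts hit the zero-variance branch than with small G.
print(group_advantages(np.array([0.6, 0.6, 0.6, 0.6])))  # [0. 0. 0. 0.]
print(group_advantages(np.array([0.4, 0.6, 0.7, 0.9])))  # roughly [-1.39, -0.28, 0.28, 1.39]
```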


### 2. Fact Drilling Works for SFT


- Initial training: 20% accuracy on key facts
- After targeted repetition: 100% accuracy on drilled facts
- LLMs need explicit reinforcement of specific domain facts


### 3. Calibration is Learnable


- Trained on "I cannot determine X from this data" examples
- Mistral achieved 100% calibration accuracy at the SFT stage
- GRPO further improved calibration via the V4 uncertainty verifier (see the ECE sketch below)
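
ECE here can be read as the standard binned calibration error: the bin-size-weighted mean of |accuracy - confidence|. A minimal sketch, assuming per-answer confidence scores and correctness flags are available from the evaluator (an assumption about its output format, not the repo's exact implementation):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin answers by stated confidence, then take the bin-size-weighted
    mean of |empirical accuracy - mean confidence| across bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece
```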


### 4. DPO is Fragile for Domain Knowledge


- Aggressive DPO (beta=0.05) destroyed learned knowledge
- Model hallucinated unrelated content
- Preference learning needs careful tuning in specialized domains


### 5. Architecture Matters More Than Size


- Mistral-7B far outperformed Qwen2.5-7B (90% vs 40% overall) despite similar parameter counts
- Phi-2 (2.7B) was insufficient for complex biological reasoning
- Model selection is critical for domain fine-tuning


## Key Learnings for AI Safety


1. **Honesty is trainable**: Models can learn appropriate epistemic humility
2. **Domain grounding matters**: Anchoring to experimental truth prevents hallucination
3. **Multi-reward > single reward**: Decomposing correctness into verifiable dimensions improves learning signal
4. **Preference learning is fragile**: DPO can catastrophically forget domain knowledge
5. **Evaluation drives improvement**: Systematic testing reveals specific failure modes


## Related Projects


- **[SpaceOmicsBench](https://github.com/jang1563/SpaceOmicsBench)**: 115-question benchmark for LLMs on spaceflight biomedical data


## Citation


If you use BioRLHF in your research, please cite:


```bibtex
@software{biorlhf2026,
  author = {Kim, JangKeun},
  title = {BioRLHF: Biological Reinforcement Learning from Human Feedback},
  year = {2026},
  url = {https://github.com/jang1563/BioRLHF},
  note = {Fine-tuning LLMs for biological reasoning with verifier-based GRPO}
}
```


## Contributing


Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.


## License


This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.


---


*Developed by JangKeun Kim (jak4013@med.cornell.edu), Weill Cornell Medicine*
