# BioRLHF


[![CI](https://github.com/jang1563/BioRLHF/actions/workflows/ci.yml/badge.svg)](https://github.com/jang1563/BioRLHF/actions/workflows/ci.yml)
[![Python](https://img.shields.io/badge/python-3.x-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)


**Biological Reinforcement Learning from Human Feedback**: A framework for fine-tuning LLMs on biological reasoning tasks using SFT, DPO, and GRPO with verifier-based reward models for factual accuracy, calibrated uncertainty, and chain-of-thought reasoning.


## Highlights


- **Three-stage training pipeline**: SFT → DPO → GRPO with verifier-based rewards
- **Multi-reward GRPO**: Four composable verifiers (factual, pathway, consistency, uncertainty) with configurable weights
- **+19% reward improvement** over the SFT baseline using GRPO (0.650 vs 0.547)
- **-70% calibration error**: ECE reduced from 0.258 to 0.078 after GRPO
- **90% accuracy** on domain-specific biological reasoning tasks (SFT stage)
- **Learns from 363 examples**: efficient domain adaptation from spaceflight transcriptomics data


## Key Results


### GRPO Training (Phase 3)


| | Metric | SFT Baseline | After GRPO | Improvement | |
| |--------|-------------|------------|-------------| |
| | Avg Reward | 0.547 | 0.650 | +19% | |
| | ECE (Calibration Error) | 0.258 | 0.078 | -70% | |


**GRPO Configuration (Full v2):**
- 16 generations per prompt (G=16) for robust advantage estimation
- Multi-reward: V1 (factual, 0.35) + V2 (pathway, 0.30) + V3 (consistency, 0.15) + V4 (uncertainty, 0.20)
- KL penalty beta=0.02, 2 iterations per batch, group-normalized rewards (see the sketch below)
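
For orientation, the Full v2 settings could be expressed roughly as follows. The field names below (`model_path`, `num_generations`, `kl_beta`, `num_iterations`, `reward_weights`) are illustrative assumptions rather than the confirmed `GRPOConfig` schema; the authoritative values live in `configs/grpo_full_v2.json`.

```python
from biorlhf.training.grpo import GRPOConfig

# Hypothetical field names, mirroring the Full v2 hyperparameters listed above.
config = GRPOConfig(
    model_path="./my_sft_model",      # merged SFT checkpoint (assumed field)
    num_generations=16,               # G=16 completions per prompt
    kl_beta=0.02,                     # KL penalty toward the reference policy
    num_iterations=2,                 # GRPO iterations per batch
    reward_weights={"V1": 0.35, "V2": 0.30, "V3": 0.15, "V4": 0.20},
)
```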


### Model Comparison (SFT, 20-question evaluation)


| | Model | Overall | Factual | Reasoning | Calibration | |
| |-------|---------|---------|-----------|-------------| |
| | **Mistral-7B** | **90.0%** | 80.0% | 100.0% | 100.0% | |
| | Qwen2.5-7B | 40.0% | 30.0% | 80.0% | 20.0% | |
| | Phi-2 | 25.0% | 20.0% | 60.0% | 0.0% | |


### SFT Training Progression


| | Version | Accuracy | Key Improvement | |
| |---------|----------|-----------------| |
| | v1 (Base SFT) | ~20% | Format learned, facts wrong | |
| | v2 (Expanded) | ~60% | More examples helped | |
| | v3 (Fact Drilling) | ~80% | Repetition fixed key facts | |
| | v4 (Advanced) | ~85% | Chain-of-thought, calibration | |
| | **Final** | **90%** | Targeted drilling for remaining errors | |


## Installation


### From Source


```bash
git clone https://github.com/jang1563/BioRLHF.git
cd BioRLHF
pip install -e .
```


### With Development Dependencies


```bash
pip install -e ".[dev]"
```


### GPU Requirements


- NVIDIA GPU with 48GB+ VRAM recommended (A40 or A100)
- 24GB+ VRAM is sufficient for SFT/DPO with 4-bit quantization (see the loading sketch below)
- CUDA 12.1+ recommended
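
For the 24GB case, the usual route is 4-bit (NF4) loading via bitsandbytes. The sketch below is the generic Hugging Face pattern, not necessarily what BioRLHF's own loaders do internally.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Generic 4-bit NF4 loading; BioRLHF's trainers may configure this differently.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)
```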


## Quick Start


### SFT Training


```python
from biorlhf import SFTTrainingConfig, run_sft_training

config = SFTTrainingConfig(
    model_name="mistralai/Mistral-7B-v0.3",
    dataset_path="data/kmp_sft_final.json",
    output_dir="./my_sft_model",
    num_epochs=10,
    learning_rate=1e-4,
)

model_path = run_sft_training(config)
```


### GRPO Training with Verifiers


```bash
# Using the CLI
biorlhf-grpo --config configs/grpo_full_v2.json
```


```python
# Or programmatically
from biorlhf.training.grpo import GRPOConfig, run_grpo_training

config = GRPOConfig.from_json("configs/grpo_full_v2.json")
run_grpo_training(config)
```


### Creating a Dataset


```python
from biorlhf.data import create_sft_dataset

dataset = create_sft_dataset(
    output_path="my_dataset.json",
    include_calibration=True,
    include_chain_of_thought=True,
)

print(f"Created {len(dataset)} training examples")
```


### Evaluating a Model


```python
from biorlhf import evaluate_model

result = evaluate_model(
    model_path="./my_sft_model",
    test_questions_path="data/kmp_test_set.json",
)

print(f"Overall Accuracy: {result.overall_accuracy:.1%}")
print(f"Factual: {result.factual_accuracy:.1%}")
print(f"Reasoning: {result.reasoning_accuracy:.1%}")
print(f"Calibration: {result.calibration_accuracy:.1%}")
```


### Running Inference


```python
from biorlhf.utils import load_model_for_inference, generate_response

model, tokenizer = load_model_for_inference(
    model_path="./my_sft_model",
    base_model="mistralai/Mistral-7B-v0.3",
)

prompt = "### Instruction:\nWhich tissue is most sensitive to ionizing radiation?\n\n### Response:\n"
response = generate_response(model, tokenizer, prompt)
print(response)
```


## Architecture


### Three-Stage Training Pipeline


```
Stage 1: SFT                 Stage 2: DPO                Stage 3: GRPO
(Supervised Fine-Tuning)     (Direct Preference          (Group Relative Policy
                              Optimization)               Optimization)

Mistral-7B-v0.3              SFT model                   SFT model (merged)
       |                          |                           |
LoRA (r=64, alpha=128)       Preference pairs            Generate G=16 completions
       |                          |                           |
363 training examples        Ranked responses            Score with V1-V4 verifiers
       |                          |                           |
10 epochs, lr=1e-4           beta=0.1                    Multi-reward composition
       |                          |                           |
SFT Adapter                  DPO Model                   GRPO Model
```
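
The Stage 1 adapter settings in the diagram (r=64, alpha=128) correspond to a standard PEFT LoRA configuration. A minimal sketch, with `target_modules` and dropout as assumptions rather than values taken from the repo:

```python
from peft import LoraConfig

# Sketch of the Stage 1 LoRA adapter; target_modules and dropout are assumptions
# (a common choice for Mistral-style attention projections).
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```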


### Verifier-Based Reward System (V1-V4)


| | Verifier | Name | Weight | What It Scores | |
| |----------|------|--------|----------------| |
| | **V1** | Factual | 0.35 | Exact match of biological facts (DEG counts, tissue names, directions) | |
| | **V2** | Pathway | 0.30 | Correct pathway/gene set enrichment references (Hallmark, KEGG) | |
| | **V3** | Consistency | 0.15 | Internal logical consistency within the response | |
| | **V4** | Uncertainty | 0.20 | Appropriate confidence calibration and epistemic humility | |


The verifiers are composable via `RewardComposer` and can be individually weighted:


```python
from biorlhf.verifiers import RewardComposer

composer = RewardComposer(
    active_verifiers=["V1", "V2", "V3", "V4"],
    weights={"V1": 0.35, "V2": 0.30, "V3": 0.15, "V4": 0.20},
)

reward = composer.score(question, response, ground_truth)
```
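
Conceptually, the composed reward is a weighted sum of the active verifiers' scores. A minimal sketch of that idea (not the actual `RewardComposer` internals):

```python
def composed_reward(scores: dict, weights: dict) -> float:
    """Weighted sum over active verifiers; the Full v2 weights sum to 1.0."""
    return sum(weights[v] * scores[v] for v in weights)

# Example: perfect V1/V2, middling V3/V4 under the Full v2 weights
weights = {"V1": 0.35, "V2": 0.30, "V3": 0.15, "V4": 0.20}
scores = {"V1": 1.0, "V2": 1.0, "V3": 0.5, "V4": 0.5}
print(composed_reward(scores, weights))  # 0.35 + 0.30 + 0.075 + 0.10 = 0.825
```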


## Dataset


Training data is derived from a 2x2x2 factorial transcriptomic study:


- **Drug**: Kaempferol (KMP) vs Control
- **Stressor 1**: Hindlimb Unloading (HU), which simulates microgravity
- **Stressor 2**: Ionizing Radiation (IR), which simulates space radiation
- **Tissues**: Heart, Hippocampus, Liver, Soleus (+ Eye, Thymus for GRPO hold-out)


### Training Example Types


| | Type | Count | Purpose | |
| |------|-------|---------| |
| | Factual Q&A | ~150 | Specific facts (DEG counts, tissue types) | |
| | Chain-of-Thought | ~50 | Step-by-step reasoning | |
| | Calibration | ~30 | Uncertainty expression | |
| | Multi-hop Reasoning | ~30 | Integrating multiple facts | |
| | Error Correction | ~20 | Learning from mistakes | |


### Ground Truth Data


```python
from biorlhf.data import (
    STRESSOR_EFFECTS,
    KMP_EFFECTS,
    INTERACTIONS,
    TISSUE_TYPES,
    OXPHOS_PATTERNS,
)

# Example: Get DEG counts for stressors
print(STRESSOR_EFFECTS["Hippocampus"])
# {'HU': 1555, 'IR': 5477, 'HU_IR': 5510}
```


## Project Structure


```
BioRLHF/
├── src/biorlhf/              # Main package
│   ├── training/             # SFT, DPO, and GRPO trainers
│   ├── data/                 # Dataset creation & ground truth
│   ├── evaluation/           # Model evaluation & calibration
│   ├── verifiers/            # V1-V4 reward verifiers
│   │   ├── factual.py        # V1: Factual accuracy scoring
│   │   ├── pathway.py        # V2: Pathway enrichment scoring
│   │   ├── consistency.py    # V3: Logical consistency scoring
│   │   ├── uncertainty.py    # V4: Calibration/uncertainty scoring
│   │   └── composer.py       # Multi-reward composition
│   ├── utils/                # Model loading, inference helpers
│   └── cli.py                # Command-line interface
├── configs/                  # Training configurations
│   ├── grpo_mve.json         # Minimum viable experiment
│   └── grpo_full_v2.json     # Full multi-reward training
├── data/                     # Training datasets
│   ├── kmp_sft_final.json    # 363 SFT training examples
│   └── kmp_test_set.json     # 20-question evaluation set
├── examples/                 # Usage examples
├── scripts/                  # SLURM job scripts & HPC guide
├── tests/                    # Unit tests
└── docs/                     # Documentation
```


## Scientific Contributions


### 1. Verifier-Based GRPO Improves Calibration


- GRPO with V1-V4 verifiers reduced calibration error (ECE) by 70%
- Multi-reward composition outperforms single-reward training
- G=16 generations dramatically reduces zero-variance batches (from 50% to <5%); see the sketch below
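
To see why larger groups help, group-relative advantages are typically computed by normalizing rewards within each prompt's G completions (this is the standard GRPO formulation, not necessarily the exact code in this repo):

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize rewards within one prompt's group of G completions."""
    mean, std = rewards.mean(), rewards.std()
    if std < eps:  # zero-variance group: all completions tie, no learning signal
        return np.zeros_like(rewards)
    return (rewards - mean) / std

# With G=16 it is far less likely that every completion ties on the composed
# reward, so far fewer prompts hit the zero-variance branch than with small G.
print(group_advantages(np.array([0.6, 0.6, 0.6, 0.6])))  # [0. 0. 0. 0.]
print(group_advantages(np.array([0.4, 0.6, 0.7, 0.9])))  # roughly [-1.39, -0.28, 0.28, 1.39]
```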


### 2. Fact Drilling Works for SFT


- Initial training: 20% accuracy on key facts
- After targeted repetition: 100% accuracy on drilled facts
- LLMs need explicit reinforcement of specific domain facts


### 3. Calibration is Learnable


- Trained on "I cannot determine X from this data" examples
- Mistral achieved 100% calibration accuracy at the SFT stage
- GRPO further improved calibration via the V4 uncertainty verifier (see the ECE sketch below)
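
ECE here can be read as the standard binned calibration error: the bin-size-weighted mean of |accuracy - confidence|. A minimal sketch, assuming per-answer confidence scores and correctness flags are available from the evaluator (an assumption about its output format, not the repo's exact implementation):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin answers by stated confidence, then take the bin-size-weighted
    mean of |empirical accuracy - mean confidence| across bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece
```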


### 4. DPO is Fragile for Domain Knowledge


- Aggressive DPO (beta=0.05) destroyed learned knowledge
- Model hallucinated unrelated content
- Preference learning needs careful tuning in specialized domains


### 5. Architecture Matters More Than Size


- Mistral-7B far outperformed Qwen2.5-7B (90% vs 40% overall) despite similar parameter counts
- Phi-2 (2.7B) was insufficient for complex biological reasoning
- Model selection is critical for domain fine-tuning


## Key Learnings for AI Safety


1. **Honesty is trainable**: Models can learn appropriate epistemic humility
2. **Domain grounding matters**: Anchoring to experimental truth prevents hallucination
3. **Multi-reward > single reward**: Decomposing correctness into verifiable dimensions improves learning signal
4. **Preference learning is fragile**: DPO can catastrophically forget domain knowledge
5. **Evaluation drives improvement**: Systematic testing reveals specific failure modes


## Related Projects


- **[SpaceOmicsBench](https://github.com/jang1563/SpaceOmicsBench)**: 115-question benchmark for LLMs on spaceflight biomedical data


## Citation


If you use BioRLHF in your research, please cite:


```bibtex
@software{biorlhf2026,
  author = {Kim, JangKeun},
  title = {BioRLHF: Biological Reinforcement Learning from Human Feedback},
  year = {2026},
  url = {https://github.com/jang1563/BioRLHF},
  note = {Fine-tuning LLMs for biological reasoning with verifier-based GRPO}
}
```


## Contributing


Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.


## License


This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.


---


*Developed by JangKeun Kim (jak4013@med.cornell.edu), Weill Cornell Medicine*
