| # BioRLHF Model Comparison Study |
|
|
| ## Executive Summary |
|
|
| This study compared three language models fine-tuned on biological reasoning tasks using identical training data (363 examples) and hyperparameters. **Mistral-7B achieved 90% accuracy**, significantly outperforming Qwen2.5-7B (40%) and Phi-2 (25%). |
|
|
| ## Methodology |
|
|
| ### Training Configuration |
| - **Dataset**: 363 examples (factual recall + chain-of-thought + calibration) |
| - **Epochs**: 10 |
| - **Learning Rate**: 1e-4 |
| - **LoRA**: r=64, α=128 |
| - **Max Length**: 1536 tokens |
|
|
| ### Evaluation |
| - **20 test questions** across 3 categories: |
| - Factual Recall (10 questions) |
| - Reasoning (5 questions) |
| - Calibration/Uncertainty (5 questions) |
|
|
| ## Results |
|
|
| | Model | Parameters | Overall | Factual | Reasoning | Calibration | |
| |-------|------------|---------|---------|-----------|-------------| |
| | **Mistral-7B** | 7B | **90.0%** | 80.0% | 100.0% | 100.0% | |
| | Qwen2.5-7B | 7B | 40.0% | 30.0% | 80.0% | 20.0% | |
| | Phi-2 | 2.7B | 25.0% | 20.0% | 60.0% | 0.0% | |
|
|
| ## Key Findings |
|
|
| ### 1. Mistral-7B Shows Superior Fine-tuning Capability |
| Despite similar parameter counts, Mistral-7B learned the domain knowledge far more effectively than Qwen2.5-7B. This suggests Mistral's architecture is more amenable to domain-specific fine-tuning. |
|
|
| ### 2. Calibration Requires Explicit Training |
| - Mistral-7B: 100% calibration accuracy |
| - Qwen2.5-7B: 20% calibration accuracy |
| - Phi-2: 0% calibration accuracy |
|
|
| Only Mistral learned to express appropriate uncertainty. This demonstrates that calibration is a learnable skill but requires sufficient model capacity and training signal. |
|
|
| ### 3. Smaller Models Struggle with Domain Knowledge |
| Phi-2 (2.7B parameters) achieved only 25% accuracy, suggesting a minimum model size threshold for effective biological reasoning fine-tuning. |
|
|
| ### 4. Hardest Questions |
| All models struggled with specific numeric recall: |
| - Heart baseline DEGs (112) - 0/3 correct |
| - Heart stress DEGs (2,110) - 0/3 correct |
|
|
| This suggests these facts need more aggressive drilling or alternative training strategies. |
|
|
| ## Conclusions |
|
|
| 1. **Model selection matters**: Mistral-7B is recommended for biological domain fine-tuning |
| 2. **Calibration is learnable**: With appropriate training examples, models can learn epistemic humility |
| 3. **Size threshold exists**: Models below ~7B parameters may lack capacity for complex domain reasoning |
|
|
| ## Implications for AI in Life Sciences |
|
|
| This study demonstrates that: |
| - Small-scale fine-tuning (363 examples) can achieve high accuracy on domain-specific tasks |
| - Uncertainty calibration can be explicitly trained |
| - Model architecture significantly impacts fine-tuning effectiveness |
|
|
| These findings inform best practices for deploying LLMs in scientific research contexts where accuracy and appropriate uncertainty expression are critical. |
|
|
| --- |
|
|
| *Study conducted: January 9, 2026* |
| *Dataset: KMP spaceflight countermeasure transcriptomic data* |
| *Framework: BioRLHF (Biological Reinforcement Learning from Human Feedback)* |
|
|