File size: 3,001 Bytes
c7ebaa1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
# BioRLHF Model Comparison Study

## Executive Summary

This study compared three language models fine-tuned on biological reasoning tasks using identical training data (363 examples) and hyperparameters. **Mistral-7B achieved 90% accuracy**, significantly outperforming Qwen2.5-7B (40%) and Phi-2 (25%).

## Methodology

### Training Configuration
- **Dataset**: 363 examples (factual recall + chain-of-thought + calibration)
- **Epochs**: 10
- **Learning Rate**: 1e-4
- **LoRA**: r=64, α=128
- **Max Length**: 1536 tokens

### Evaluation
- **20 test questions** across 3 categories:
  - Factual Recall (10 questions)
  - Reasoning (5 questions)
  - Calibration/Uncertainty (5 questions)

## Results

| Model | Parameters | Overall | Factual | Reasoning | Calibration |
|-------|------------|---------|---------|-----------|-------------|
| **Mistral-7B** | 7B | **90.0%** | 80.0% | 100.0% | 100.0% |
| Qwen2.5-7B | 7B | 40.0% | 30.0% | 80.0% | 20.0% |
| Phi-2 | 2.7B | 25.0% | 20.0% | 60.0% | 0.0% |

## Key Findings

### 1. Mistral-7B Shows Superior Fine-tuning Capability
Despite similar parameter counts, Mistral-7B learned the domain knowledge far more effectively than Qwen2.5-7B. This suggests Mistral's architecture is more amenable to domain-specific fine-tuning.

### 2. Calibration Requires Explicit Training
- Mistral-7B: 100% calibration accuracy
- Qwen2.5-7B: 20% calibration accuracy  
- Phi-2: 0% calibration accuracy

Only Mistral learned to express appropriate uncertainty. This demonstrates that calibration is a learnable skill but requires sufficient model capacity and training signal.

### 3. Smaller Models Struggle with Domain Knowledge
Phi-2 (2.7B parameters) achieved only 25% accuracy, suggesting a minimum model size threshold for effective biological reasoning fine-tuning.

### 4. Hardest Questions
All models struggled with specific numeric recall:
- Heart baseline DEGs (112) - 0/3 correct
- Heart stress DEGs (2,110) - 0/3 correct

This suggests these facts need more aggressive drilling or alternative training strategies.

## Conclusions

1. **Model selection matters**: Mistral-7B is recommended for biological domain fine-tuning
2. **Calibration is learnable**: With appropriate training examples, models can learn epistemic humility
3. **Size threshold exists**: Models below ~7B parameters may lack capacity for complex domain reasoning

## Implications for AI in Life Sciences

This study demonstrates that:
- Small-scale fine-tuning (363 examples) can achieve high accuracy on domain-specific tasks
- Uncertainty calibration can be explicitly trained
- Model architecture significantly impacts fine-tuning effectiveness

These findings inform best practices for deploying LLMs in scientific research contexts where accuracy and appropriate uncertainty expression are critical.

---

*Study conducted: January 9, 2026*
*Dataset: KMP spaceflight countermeasure transcriptomic data*
*Framework: BioRLHF (Biological Reinforcement Learning from Human Feedback)*