	# BioRLHF Model Comparison Study

	## Executive Summary

	This study compared three language models fine-tuned on biological reasoning tasks using identical training data (363 examples) and hyperparameters. On a 20-question test set, Mistral-7B achieved 90% accuracy (18 of 20), well ahead of Qwen2.5-7B (40%, 8 of 20) and Phi-2 (25%, 5 of 20).

	## Methodology

	### Training Configuration
	- Dataset: 363 examples (factual recall + chain-of-thought + calibration)
	- Epochs: 10
	- Learning Rate: 1e-4
	- LoRA: r=64, α=128
	- Max Length: 1536 tokens
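The settings above can be captured in a small configuration object. This is a minimal sketch, not the project's actual training script: the class name `TrainingConfig` and its field names are hypothetical, and the scaling factor assumes the standard LoRA formulation in which the low-rank update is multiplied by α / r.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container for the hyperparameters listed in this report."""

    num_examples: int = 363      # size of the fine-tuning dataset
    epochs: int = 10
    learning_rate: float = 1e-4
    lora_rank: int = 64          # LoRA r
    lora_alpha: int = 128        # LoRA alpha
    max_length: int = 1536       # maximum sequence length in tokens

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r (assumption).
        return self.lora_alpha / self.lora_rank

    @property
    def example_passes(self) -> int:
        # Total number of times a training example is seen across all epochs.
        return self.num_examples * self.epochs


config = TrainingConfig()
print(f"LoRA scaling factor: {config.lora_scaling}")
print(f"Example passes over training: {config.example_passes}")
```

With r=64 and α=128 the scaling factor is 2, and ten epochs over 363 examples amount to 3,630 example passes. The report does not state a batch size, so the number of optimizer steps is left out.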

	### Evaluation
	- 20 test questions across 3 categories:
	  - Factual Recall (10 questions)
	  - Reasoning (5 questions)
	  - Calibration/Uncertainty (5 questions)

	## Results

	| Model | Parameters | Overall | Factual | Reasoning | Calibration |
	|-------|------------|---------|---------|-----------|-------------|
	| Mistral-7B | 7B | 90.0% | 80.0% | 100.0% | 100.0% |
	| Qwen2.5-7B | 7B | 40.0% | 30.0% | 80.0% | 20.0% |
	| Phi-2 | 2.7B | 25.0% | 20.0% | 60.0% | 0.0% |
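The Overall column is the question-weighted combination of the three category scores, since the categories contain 10, 5, and 5 questions. The sketch below recomputes it from the per-category figures in the table. The variable and function names are illustrative only.

```python
# Question counts per category, taken from the Evaluation section.
QUESTIONS = {"factual": 10, "reasoning": 5, "calibration": 5}

# Per-category accuracy from the results table, expressed as fractions.
CATEGORY_ACCURACY = {
    "Mistral-7B": {"factual": 0.80, "reasoning": 1.00, "calibration": 1.00},
    "Qwen2.5-7B": {"factual": 0.30, "reasoning": 0.80, "calibration": 0.20},
    "Phi-2": {"factual": 0.20, "reasoning": 0.60, "calibration": 0.00},
}


def correct_counts(accuracy):
    """Convert per-category accuracy into numbers of correct answers."""
    return {cat: round(accuracy[cat] * QUESTIONS[cat]) for cat in QUESTIONS}


def overall_accuracy(accuracy):
    """Overall accuracy: total correct answers divided by total questions."""
    total_questions = sum(QUESTIONS.values())
    return sum(correct_counts(accuracy).values()) / total_questions


for model, accuracy in CATEGORY_ACCURACY.items():
    counts = correct_counts(accuracy)
    print(f"{model}: {counts} -> overall {overall_accuracy(accuracy):.1%}")
```

The recomputed values match the table: 18, 8, and 5 correct answers out of 20.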

	## Key Findings

	### 1. Mistral-7B Shows Superior Fine-tuning Capability
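	Despite similar parameter counts, Mistral-7B learned the domain knowledge far more effectively than Qwen2.5-7B (90% vs. 40% overall). One possible explanation is that Mistral's architecture or pretraining is more amenable to domain-specific fine-tuning. Because all models shared one set of hyperparameters, the gap may also reflect how well that configuration suits each model.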
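Because each model was scored on only 20 questions, it helps to put an uncertainty range around the reported accuracies. The sketch below computes a 95% Wilson score interval for a proportion. The Wilson interval is a choice made for this illustration, not a method used in the study, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Return the (lower, upper) Wilson score interval for correct/total.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Overall results: correct answers out of 20 questions per model.
overall = {"Mistral-7B": 18, "Qwen2.5-7B": 8, "Phi-2": 5}
for model, correct in overall.items():
    low, high = wilson_interval(correct, 20)
    print(f"{model}: {correct}/20, 95% interval {low:.2f} to {high:.2f}")
```

Under these assumptions the overall intervals for Mistral-7B and Qwen2.5-7B do not overlap. The same calculation on a 5-question category gives much wider intervals, so per-category differences should be read with more caution.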

	### 2. Calibration Requires Explicit Training
	- Mistral-7B: 100% calibration accuracy
	- Qwen2.5-7B: 20% calibration accuracy
	- Phi-2: 0% calibration accuracy

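	Only Mistral-7B learned to express appropriate uncertainty. This suggests that calibration is a learnable skill, but one that depends on the base model and the training signal: Qwen2.5-7B has the same parameter count and scored only 20%, so model size alone does not explain the result. With five calibration questions per model, these figures are indicative rather than conclusive.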

	### 3. Smaller Models Struggle with Domain Knowledge
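	Phi-2 (2.7B parameters) achieved only 25% accuracy (5 of 20), which is consistent with a minimum model size for effective biological reasoning fine-tuning. Phi-2 was the only sub-7B model tested, so this study cannot locate where such a threshold lies.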

	### 4. Hardest Questions
	All three models failed two questions that require exact numeric recall of differentially expressed gene (DEG) counts:
	- Heart baseline DEGs (112) - 0/3 correct
	- Heart stress DEGs (2,110) - 0/3 correct

	This suggests these facts need more aggressive drilling or alternative training strategies.
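Exact numeric recall can be scored automatically by extracting integers from a model's answer and checking for the expected value. The sketch below is one possible way to do this and is not the study's actual grading procedure. The function names and the sample answers are hypothetical. It handles thousands separators such as "2,110" but is meant for integer counts only, not decimals.

```python
import re

# Matches integers with thousands separators ("2,110") or plain digits ("2110").
_INTEGER = re.compile(r"\d{1,3}(?:,\d{3})+|\d+")


def extract_integers(text):
    """Return the set of integers mentioned in the text, ignoring commas."""
    return {int(match.replace(",", "")) for match in _INTEGER.findall(text)}


def numeric_recall_correct(answer, expected):
    """True if the expected integer appears anywhere in the answer."""
    return expected in extract_integers(answer)


# Hypothetical model answers, used only to illustrate the check.
examples = [
    ("The heart showed 112 DEGs at baseline.", 112),
    ("Under stress the heart had 2,110 DEGs.", 2110),
    ("Roughly 2,100 genes changed under stress.", 2110),
]
for answer, expected in examples:
    print(f"{expected}: {numeric_recall_correct(answer, expected)} <- {answer}")
```

A strict check like this gives no credit for near misses, which suits facts where the exact count matters. A tolerance could be added if approximate answers should count.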

	## Conclusions

	1. Model selection matters: Mistral-7B is recommended for biological domain fine-tuning
	2. Calibration is learnable: With appropriate training examples, models can learn epistemic humility
	3. Size may impose a threshold: The only sub-7B model tested (Phi-2, 2.7B) scored lowest, so models below ~7B parameters may lack capacity for complex domain reasoning

	## Implications for AI in Life Sciences

	This study demonstrates that:
	- Small-scale fine-tuning (363 examples) can achieve high accuracy on domain-specific tasks
	- Uncertainty calibration can be explicitly trained
	- The choice of base model strongly affects fine-tuning effectiveness, even at similar parameter counts

	These findings inform best practices for deploying LLMs in scientific research contexts where accuracy and appropriate uncertainty expression are critical.

	---

	Study conducted: January 9, 2026
	Dataset: KMP spaceflight countermeasure transcriptomic data
	Framework: BioRLHF (Biological Reinforcement Learning from Human Feedback)