Diagnostic-Reasoning-Q3 (Pentabrid architecture V9): 8B model ranks #3 on MedXpertQA Text, ahead of LLaMA-3.3-70B and DeepSeek-V3 (671B)
Opened by naturally-intuitive
Pentabrid V9 Evaluation Results
Pentabrid V9, an 8B-parameter model, ranks #3 on MedXpertQA Text among the models listed below, behind only DeepSeek-R1 (671B) and o3-mini (proprietary).
MedXpertQA Text (ICML 2025 Benchmark)
| Rank | Model | Parameters | Score |
|---|---|---|---|
| 1 | DeepSeek-R1 | 671B | 37.8% |
| 2 | o3-mini | proprietary | 37.3% |
| 3 | Pentabrid V9 (ours) | 8B | 24.9% (609/2450) |
| 4 | LLaMA-3.3-70B | 70B | 24.5% |
| 5 | DeepSeek-V3 | 671B | 24.2% |
| 6 | Qwen2.5-72B | 72B | 18.9% |
- Reasoning subset: 473/1861 (25.4%)
- Understanding subset: 136/589 (23.1%)
- First sub-10B model ever evaluated on MedXpertQA
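The subset counts can be cross-checked against the headline score. The following is a minimal sketch that uses only the counts reported above; the variable and function names are illustrative.

```python
# Counts as reported in the post: (correct, total) per MedXpertQA Text subset.
subsets = {
    "reasoning": (473, 1861),
    "understanding": (136, 589),
}


def accuracy_pct(correct: int, total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


# The two subsets should add up to the overall 609/2450 figure.
total_correct = sum(c for c, _ in subsets.values())
total_items = sum(n for _, n in subsets.values())

per_subset = {name: accuracy_pct(c, n) for name, (c, n) in subsets.items()}
overall = accuracy_pct(total_correct, total_items)

print(per_subset)
print(total_correct, total_items, overall)
```

The subset totals sum to 609/2450, so the overall score is a question-weighted combination of the two subsets.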
Generation-Based Scores
| Benchmark | Correct/Total | Accuracy |
|---|---|---|
| MedQA (USMLE) | 853/1273 | 67.0% |
| PubMedQA | 695/1000 | 69.5% |
| MMLU Clinical Knowledge | 226/265 | 85.3% |
| MedMCQA | 1178/2000 | 58.9% |
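The post does not describe how generated answers were graded. As a rough illustration of generation-based scoring in general, the sketch below pulls a final answer letter out of free-text output and compares it with the gold label. The regex, the `extract_choice` helper, and the sample outputs are all hypothetical and are not the author's evaluation harness.

```python
import re
from typing import Optional

# Hypothetical pattern: matches "Answer: X" or "answer is X" with X in A-E.
ANSWER_RE = re.compile(r"answer\s*(?:is|:)\s*\(?([A-E])\)?", re.IGNORECASE)


def extract_choice(generated: str) -> Optional[str]:
    """Return the last answer letter found in the generated text, if any."""
    matches = ANSWER_RE.findall(generated)
    return matches[-1].upper() if matches else None


def generation_accuracy(outputs: list[str], gold: list[str]) -> float:
    """Fraction of outputs whose extracted letter matches the gold label."""
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)


# Made-up model outputs, for illustration only.
outputs = [
    "The murmur suggests aortic stenosis. Answer: B",
    "After ruling out PE, the answer is (d).",
    "I am not sure which option fits.",
]
gold = ["B", "D", "A"]
print(generation_accuracy(outputs, gold))
```

An output with no recognizable answer counts as wrong here, which is one reason generation-based scores depend on the extraction rule as well as on the model.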
Log-Likelihood Scores (7-Benchmark Unweighted Average: 76.4%)
| Benchmark | Correct/Total | Accuracy |
|---|---|---|
| MMLU Professional Medicine | 244/272 | 89.7% |
| MMLU Medical Genetics | 88/100 | 88.0% |
| MMLU Clinical Knowledge | 229/265 | 86.4% |
| MMLU Anatomy | 107/135 | 79.3% |
| MedQA (USMLE) | 844/1273 | 66.3% |
| PubMedQA | 333/500 | 66.6% |
| MedMCQA | 2451/4183 | 58.6% |
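The 76.4% headline matches an unweighted (macro) average of the seven per-benchmark accuracies. A question-weighted (micro) average comes out lower, because MedMCQA contributes most of the questions. A minimal sketch using only the counts in the table above, with illustrative names:

```python
# Log-likelihood results as reported above: benchmark -> (correct, total).
results = {
    "MMLU Professional Medicine": (244, 272),
    "MMLU Medical Genetics": (88, 100),
    "MMLU Clinical Knowledge": (229, 265),
    "MMLU Anatomy": (107, 135),
    "MedQA (USMLE)": (844, 1273),
    "PubMedQA": (333, 500),
    "MedMCQA": (2451, 4183),
}

# Macro average: every benchmark counts equally, regardless of size.
macro = sum(c / n for c, n in results.values()) / len(results)

# Micro average: every question counts equally, so large benchmarks dominate.
micro = sum(c for c, _ in results.values()) / sum(n for _, n in results.values())

macro_pct = round(100 * macro, 1)
micro_pct = round(100 * micro, 1)
print(f"macro: {macro_pct}%  micro: {micro_pct}%")
```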
Reasoning Tax Analysis
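Generation mode scores higher than log-likelihood scoring on the three benchmarks below, though the PubMedQA and MedMCQA runs use different question counts in the two modes. MMLU Clinical Knowledge is the exception, at 85.3% in generation mode versus 86.4% under log-likelihood scoring.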
| Benchmark | Generation | Log-Likelihood | Delta |
|---|---|---|---|
| MedQA | 853/1273 (67.0%) | 844/1273 (66.3%) | +0.7pp (+9 questions) |
| PubMedQA | 695/1000 (69.5%) | 333/500 (66.6%) | +2.9pp |
| MedMCQA | 1178/2000 (58.9%) | 2451/4183 (58.6%) | +0.3pp |
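Expressing every delta in percentage points keeps the rows comparable even where the two modes were run on different numbers of questions. A minimal sketch using the counts from the table; the `same_n` flag is an illustrative addition that marks rows where both modes used the same question count.

```python
# benchmark -> ((generation correct, total), (log-likelihood correct, total))
pairs = {
    "MedQA": ((853, 1273), (844, 1273)),
    "PubMedQA": ((695, 1000), (333, 500)),
    "MedMCQA": ((1178, 2000), (2451, 4183)),
}


def delta_pp(gen: tuple[int, int], ll: tuple[int, int]) -> float:
    """Generation accuracy minus log-likelihood accuracy, in percentage points."""
    return round(100 * gen[0] / gen[1] - 100 * ll[0] / ll[1], 1)


deltas = {name: delta_pp(gen, ll) for name, (gen, ll) in pairs.items()}

# True only where both modes were scored on the same number of questions.
same_n = {name: gen[1] == ll[1] for name, (gen, ll) in pairs.items()}

for name in pairs:
    print(f"{name}: {deltas[name]:+.1f}pp (same question count: {same_n[name]})")
```

Only MedQA uses the same 1273 questions in both modes, so it is the one row where the delta can also be read as a raw question count.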
Built with the Pentabrid clinical reasoning methodology, which integrates Bayesian likelihood ratios, diagnostic frameworks, and clinical behavior patterns. Fine-tuned from Qwen3-8B.
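The post does not explain how likelihood ratios enter the Pentabrid methodology. For readers unfamiliar with the term, the sketch below shows the standard textbook update from pre-test to post-test probability using a likelihood ratio. It illustrates the general concept only, not the model's internals, and the function names and example numbers are hypothetical.

```python
def positive_lr(sensitivity: float, specificity: float) -> float:
    """LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1 - specificity)


def negative_lr(sensitivity: float, specificity: float) -> float:
    """LR- = (1 - sensitivity) / specificity."""
    return (1 - sensitivity) / specificity


def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Convert probability to odds, multiply by the LR, convert back."""
    if not 0 < pre_test_prob < 1:
        raise ValueError("pre_test_prob must be strictly between 0 and 1")
    if likelihood_ratio < 0:
        raise ValueError("likelihood_ratio must be non-negative")
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Made-up example: 20% pre-test probability, test with 90% sensitivity
# and 80% specificity.
pre = 0.20
after_positive = post_test_probability(pre, positive_lr(0.90, 0.80))
after_negative = post_test_probability(pre, negative_lr(0.90, 0.80))
print(f"after positive test: {after_positive:.3f}")
print(f"after negative test: {after_negative:.3f}")
```

Working in odds rather than probabilities is what lets a single multiplicative factor summarize the diagnostic value of a finding.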
Dr Adnan Agha
College of Medicine & Health Sciences, United Arab Emirates University
IP Application #2442
