๐Ÿ† Diagnostic-Reasoning-Q3 (Pentabrid architecture V9): 8B model ranks #3 on MedXpertQA, beating 70B and 671B models

#1
by naturally-intuitive - opened
Clinical Reasoning Labs for Medical Diagnostic Accuracy org

Pentabrid V9 Evaluation Results

Pentabrid V9 is an 8B-parameter model that ranks #3 on MedXpertQA Text, behind only DeepSeek-R1 (671B) and o3-mini (proprietary).

MedXpertQA Text (ICML 2025 Benchmark)

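| Rank | Model | Parameters | Score |
|------|-------|------------|-------|
| 1 | DeepSeek-R1 | 671B | 37.8% |
| 2 | o3-mini | proprietary | 37.3% |
| 3 | Pentabrid V9 (ours) | 8B | 24.9% (609/2450) |
| 4 | LLaMA-3.3-70B | 70B | 24.5% |
| 5 | DeepSeek-V3 | 671B | 24.2% |
| 6 | Qwen2.5-72B | 72B | 18.9% |
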
  • Reasoning subset: 473/1861 (25.4%)
  • Understanding subset: 136/589 (23.1%)
  • First sub-10B model ever evaluated on MedXpertQA
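
The percentages above follow directly from the raw counts. As a quick sanity check, this minimal Python sketch recomputes them and confirms that the two subsets add up to the overall score. The helper name `accuracy_pct` is ours and is not part of any evaluation harness.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


# (correct, total) counts reported above for Pentabrid V9 on MedXpertQA Text.
reasoning = (473, 1861)
understanding = (136, 589)

# The overall score is the sum of the two subsets.
overall = (reasoning[0] + understanding[0], reasoning[1] + understanding[1])

print(overall)                       # (609, 2450)
print(accuracy_pct(*overall))        # 24.9
print(accuracy_pct(*reasoning))      # 25.4
print(accuracy_pct(*understanding))  # 23.1
```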

Generation-Based Scores

| Benchmark | Correct/Total | Accuracy |
|-----------|---------------|----------|
| MedQA (USMLE) | 853/1273 | 67.0% |
| PubMedQA | 695/1000 | 69.5% |
| MMLU Clinical Knowledge | 226/265 | 85.3% |
| MedMCQA | 1178/2000 | 58.9% |

Log-Likelihood Scores (7-Benchmark Average: 76.4%)

| Benchmark | Correct/Total | Accuracy |
|-----------|---------------|----------|
| MMLU Professional Medicine | 244/272 | 89.7% |
| MMLU Medical Genetics | 88/100 | 88.0% |
| MMLU Clinical Knowledge | 229/265 | 86.4% |
| MMLU Anatomy | 107/135 | 79.3% |
| MedQA (USMLE) | 844/1273 | 66.3% |
| PubMedQA | 333/500 | 66.6% |
| MedMCQA | 2451/4183 | 58.6% |
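
The 76.4% headline figure appears to be an unweighted (macro) average of the seven per-benchmark accuracies. That reading is our assumption, and the sketch below checks it against the counts listed above. Pooling all questions into one total (a micro average) gives a lower number, because MedMCQA contributes the most questions and has the lowest accuracy.

```python
# (correct, total) per benchmark, copied from the log-likelihood results above.
results = {
    "MMLU Professional Medicine": (244, 272),
    "MMLU Medical Genetics": (88, 100),
    "MMLU Clinical Knowledge": (229, 265),
    "MMLU Anatomy": (107, 135),
    "MedQA (USMLE)": (844, 1273),
    "PubMedQA": (333, 500),
    "MedMCQA": (2451, 4183),
}

# Macro average: every benchmark counts equally, regardless of size.
per_benchmark = [correct / total for correct, total in results.values()]
macro = 100 * sum(per_benchmark) / len(per_benchmark)

# Micro average: every question counts equally, so large benchmarks dominate.
total_correct = sum(correct for correct, _ in results.values())
total_questions = sum(total for _, total in results.values())
micro = 100 * total_correct / total_questions

print(f"macro: {macro:.1f}%")  # macro: 76.4%
print(f"micro: {micro:.1f}%")  # micro: 63.9%
```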

Reasoning Tax Analysis

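Generation mode scores slightly higher than log-likelihood scoring on MedQA, PubMedQA, and MedMCQA (MMLU Clinical Knowledge is the exception, at 85.3% vs 86.4%). The PubMedQA and MedMCQA comparisons use different sample sizes in the two modes: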

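| Benchmark | Generation | Log-Likelihood | Delta |
|-----------|------------|----------------|-------|
| MedQA | 853/1273 (67.0%) | 844/1273 (66.3%) | +0.7pp (+9 questions) |
| PubMedQA | 695/1000 (69.5%) | 333/500 (66.6%) | +2.9pp |
| MedMCQA | 1178/2000 (58.9%) | 2451/4183 (58.6%) | +0.3pp |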
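
The deltas can be reproduced from the raw counts. PubMedQA and MedMCQA were scored on different numbers of questions in the two modes, so only a percentage-point difference is meaningful there. A raw question count makes sense only for MedQA, where both modes used the same 1,273 questions. A minimal sketch, with a helper name (`delta_pp`) of our own choosing:

```python
def delta_pp(gen: tuple, ll: tuple) -> float:
    """Generation minus log-likelihood accuracy, in percentage points."""
    gen_acc = 100 * gen[0] / gen[1]
    ll_acc = 100 * ll[0] / ll[1]
    return round(gen_acc - ll_acc, 1)


# ((generation correct, total), (log-likelihood correct, total))
pairs = {
    "MedQA": ((853, 1273), (844, 1273)),
    "PubMedQA": ((695, 1000), (333, 500)),
    "MedMCQA": ((1178, 2000), (2451, 4183)),
}

for name, (gen, ll) in pairs.items():
    # Flag comparisons where the two modes used different question sets.
    note = "same question count" if gen[1] == ll[1] else "different sample sizes"
    print(f"{name}: {delta_pp(gen, ll):+.1f}pp ({note})")
# MedQA: +0.7pp (same question count)
# PubMedQA: +2.9pp (different sample sizes)
# MedMCQA: +0.3pp (different sample sizes)
```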

Built with the Pentabrid clinical reasoning methodology, which integrates Bayesian likelihood ratios, diagnostic frameworks, and clinical behavior patterns. Fine-tuned from Qwen3-8B.
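
For readers unfamiliar with likelihood ratios, the standard Bayesian update they support converts a pre-test probability to odds, multiplies by the likelihood ratio, and converts back to a probability. The sketch below illustrates that textbook formula only. It is not the Pentabrid implementation, and the function name and example numbers are hypothetical.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Update a pre-test probability with a likelihood ratio (odds form of Bayes' rule)."""
    if not 0 < pre_test_prob < 1:
        raise ValueError("pre_test_prob must be strictly between 0 and 1")
    if likelihood_ratio < 0:
        raise ValueError("likelihood_ratio must be non-negative")
    pre_odds = pre_test_prob / (1 - pre_test_prob)  # probability -> odds
    post_odds = pre_odds * likelihood_ratio         # apply the likelihood ratio
    return post_odds / (1 + post_odds)              # odds -> probability


# Hypothetical example: 20% pre-test probability, positive finding with LR+ = 6.
print(round(post_test_probability(0.20, 6.0), 2))  # 0.6

# Hypothetical example: same pre-test probability, negative finding with LR- = 0.1.
print(round(post_test_probability(0.20, 0.1), 3))  # 0.024
```

A likelihood ratio above 1 raises the probability of the diagnosis, a ratio below 1 lowers it, and a ratio of exactly 1 leaves it unchanged.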

Dr Adnan Agha
College of Medicine & Health Sciences, United Arab Emirates University
IP Application #2442
