BIGJUTT/5556 / benchmarks

Darwin-60B-DUO Benchmarks

Updated 2026-05-28 KST

🏆 GPQA Diamond (full 198 questions) — FINAL: 88.38%

| Stage | Score |
| --- | --- |
| Darwin single (MAJ@5) | 69.70% |
| AWAXIS single (MAJ@5) | 77.27% |
| Cascade (+ MTI), first cascade | 83.84% |
| 🏆 Darwin-DELPHI DUO FINAL | 88.38% |

The final score is produced by the Darwin-DELPHI test-time engine operating over the two-specialist DUO. The engine is proprietary.
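
For readers unfamiliar with the MAJ@5 notation used for the single-model stages, here is a minimal sketch of majority voting over five sampled answers per question, plus a helper that converts a reported percentage into a correct-answer count out of 198. It is not the Darwin-DELPHI engine, which is proprietary. The function names (`majority_vote`, `maj_at_k_accuracy`, `implied_correct`) and the tie-breaking rule (earliest-seen answer wins) are assumptions made for illustration, and the implied counts assume every question is weighted equally.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first (an assumption)."""
    # Counter.most_common keeps first-encountered order among equal counts.
    return Counter(answers).most_common(1)[0][0]


def maj_at_k_accuracy(samples_per_question, gold):
    """Fraction of questions whose majority-voted answer matches the gold label."""
    correct = sum(
        majority_vote(samples) == answer
        for samples, answer in zip(samples_per_question, gold)
    )
    return correct / len(gold)


def implied_correct(score_percent, total=198):
    """Correct-answer count implied by a percentage, assuming equal question weights."""
    return round(score_percent / 100 * total)


if __name__ == "__main__":
    # Hypothetical toy data: two questions, five sampled answers each.
    samples = [["A", "A", "B", "A", "C"], ["B", "C", "C", "C", "D"]]
    gold = ["A", "B"]
    print(maj_at_k_accuracy(samples, gold))  # 0.5

    # Scores reported in the table above, mapped to counts out of 198.
    for score in (69.70, 77.27, 83.84, 88.38):
        print(score, implied_correct(score))  # 138, 153, 166, 175
```

Under the equal-weight assumption, the reported 88.38% corresponds to 175 of 198 questions answered correctly.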

📌 Constituent verified scores

| Constituent | Benchmark | Rank |
| --- | --- | --- |
| Darwin-28B-REASON | Hugging Face GPQA Diamond | #3 |
| AWAXIS-Think-31B | National K-AI Leaderboard (Republic of Korea, MSIT) | #1 |

Future evaluations (scheduled)

| Benchmark | Status |
| --- | --- |
| KMMLU | TBA |
| CLIcK | TBA |
| Helmet / Ruler (long context) | TBA |
| Needle-in-Haystack 128 K | NIAH 32 K verified per backend |