# Qwen3-Coder-30B-A3B-Instruct-FP8-Dynamic Evaluation Results

## Open LLM Leaderboard Benchmark Performance

| Benchmark | Score | Std Error |
|---------|------|---------|
| ARC Challenge (acc) | 65.36% | ±1.39% |
| ARC Challenge (acc_norm) | 67.83% | ±1.37% |
| GSM-8K (flexible-extract) | 89.84% | ±0.83% |
| GSM-8K (strict-match) | 88.93% | ±0.86% |
| Hellaswag (acc) | 50.25% | ±0.50% |
| Hellaswag (acc_norm) | 58.75% | ±0.49% |
| MMLU (overall) | 78.07% | ±0.33% |
| TruthfulQA MC1 | 38.31% | ±1.70% |
| TruthfulQA MC2 | 59.00% | ±1.64% |
| Winogrande | 62.75% | ±1.36% |

## MMLU Category Performance

| Category | Score | Std Error |
|---------|------|---------|
| Humanities | 69.93% | ±0.64% |
| Other | 79.95% | ±0.69% |
| Social Sciences | 86.22% | ±0.61% |
| STEM | 80.40% | ±0.69% |

## Humanities Subcategory Performance

| Subject | Score | Std Error |
|------|------|---------|
| Formal Logic | 61.90% | ±4.34% |
| High School European History | 86.67% | ±2.65% |
| High School US History | 86.76% | ±2.38% |
| High School World History | 89.03% | ±2.03% |
| International Law | 82.64% | ±3.46% |
| Jurisprudence | 82.41% | ±3.68% |
| Logical Fallacies | 82.21% | ±3.00% |
| Moral Disputes | 74.86% | ±2.34% |
| Moral Scenarios | 67.15% | ±1.57% |
| Philosophy | 77.17% | ±2.38% |
| Prehistory | 83.02% | ±2.09% |
| Professional Law | 54.95% | ±1.27% |
| World Religions | 85.38% | ±2.71% |

## Social Sciences Subcategory Performance

| Subject | Score | Std Error |
|------|------|---------|
| Econometrics | 72.81% | ±4.19% |
| High School Geography | 90.40% | ±2.10% |
| High School Government and Politics | 95.34% | ±1.52% |
| High School Macroeconomics | 85.64% | ±1.78% |
| High School Microeconomics | 91.60% | ±1.80% |
| High School Psychology | 93.94% | ±1.02% |
| Human Sexuality | 86.26% | ±3.02% |
| Professional Psychology | 81.70% | ±1.56% |
| Public Relations | 74.55% | ±4.17% |
| Security Studies | 75.51% | ±2.75% |
| Sociology | 85.57% | ±2.48% |
| US Foreign Policy | 91.00% | ±2.88% |

## STEM Subcategory Performance

| Subject | Score | Std Error |
|------|------|---------|
| Abstract Algebra | 72.00% | ±4.51% |
| Anatomy | 73.33% | ±3.82% |
| Astronomy | 90.13% | ±2.43% |
| College Biology | 89.58% | ±2.55% |
| College Chemistry | 62.00% | ±4.88% |
| College Computer Science | 79.00% | ±4.09% |
| College Mathematics | 64.00% | ±4.82% |
| College Physics | 72.55% | ±4.44% |
| Computer Security | 86.00% | ±3.49% |
| Conceptual Physics | 90.21% | ±1.94% |
| Electrical Engineering | 80.69% | ±3.29% |
| Elementary Mathematics | 85.45% | ±1.82% |
| High School Biology | 92.90% | ±1.46% |
| High School Chemistry | 77.83% | ±2.92% |
| High School Computer Science | 86.00% | ±3.49% |
| High School Mathematics | 66.67% | ±2.87% |
| High School Physics | 75.50% | ±3.51% |
| High School Statistics | 78.70% | ±2.79% |
| Machine Learning | 75.89% | ±4.06% |

## Other Subcategory Performance

| Subject | Score | Std Error |
|------|------|---------|
| Business Ethics | 80.00% | ±4.02% |
| Clinical Knowledge | 83.40% | ±2.29% |
| College Medicine | 78.03% | ±3.16% |
| Global Facts | 51.00% | ±5.02% |
| Human Aging | 79.37% | ±2.72% |
| Management | 85.44% | ±3.49% |
| Marketing | 89.74% | ±1.99% |
| Medical Genetics | 88.00% | ±3.27% |
| Miscellaneous | 88.25% | ±1.15% |
| Nutrition | 80.72% | ±2.26% |
| Professional Accounting | 64.89% | ±2.85% |
| Professional Medicine | 84.19% | ±2.22% |
| Virology | 50.60% | ±3.89% |

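
The `Std Error` columns above can be related to the scores with a short calculation. The sketch below is illustrative only and rests on two assumptions not stated in this README: that each value is the sample standard error of the mean of per-question 0/1 scores, and that the Abstract Algebra subset contains 100 questions. The function names are hypothetical helpers, not part of any evaluation tool. Under those assumptions it gives a value close to the ±4.51% listed for Abstract Algebra, plus a rough 95% interval.

```python
import math
from typing import Tuple


def binomial_std_error(accuracy: float, n: int) -> float:
    """Sample standard error of the mean of n 0/1 outcomes.

    Uses n - 1 in the denominator (sample variance), which is an
    assumption about how the reported values were computed.
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / (n - 1))


def normal_interval(accuracy: float, std_error: float, z: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval: accuracy +/- z * std_error."""
    return (accuracy - z * std_error, accuracy + z * std_error)


# Abstract Algebra: score 72.00%; n = 100 is an assumed question count.
se = binomial_std_error(0.72, 100)
low, high = normal_interval(0.72, se)

print(f"std error: {se:.2%}")
print(f"approx. 95% interval: {low:.2%} to {high:.2%}")
```

Because the standard error shrinks roughly with the square root of the question count, small subjects carry wide intervals, so gaps of a few points between them should be read cautiously.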