Update README.md
README.md
CHANGED
@@ -91,33 +91,6 @@ The model incorporates several key architectural improvements:
 - Dynamic learning rate scheduling with restarts
 - Careful hyperparameter tuning for stability at scale
 
-## Performance Benchmarks
-
-### Reasoning and Knowledge
-
-| Benchmark | Score | Description |
-|-----------|-------|-------------|
-| MMLU | 84.7% | Massive Multitask Language Understanding |
-| ARC Challenge | 83.4% | Advanced reasoning and comprehension |
-| HellaSwag | 88.9% | Common sense inference |
-| WinoGrande | 82.3% | Commonsense reasoning |
-| TruthfulQA | 61.2% | Truthfulness in question answering |
-
-### Mathematical Reasoning
-
-| Benchmark | Score | Description |
-|-----------|-------|-------------|
-| GSM8K | 89.2% | Grade school mathematics |
-| MATH | 56.7% | Competition-level mathematics |
-| Minerva Math | 53.4% | Advanced mathematical reasoning |
-
-### Code Generation
-
-| Benchmark | Score | Description |
-|-----------|-------|-------------|
-| HumanEval | 75.6% | Python code generation |
-| MBPP | 72.3% | Basic Python programming |
-| DS-1000 | 64.5% | Data science code completion |
 
 ### Context Understanding
 