Upload README.md with huggingface_hub
README.md
CHANGED
@@ -41,3 +41,36 @@ Private finetuned Nanbeige4.1-3B reasoning release by `strykes`.
 
 - Intended for research and benchmarking.
 - Validate outputs before critical use.
+
+## Benchmarks (2026-02-24)
+
+### Local lm-eval results (this finetune)
+
+| Task | Metric | Score |
+|---|---:|---:|
+| mmlu | acc,none | 59.98% |
+| gsm8k | exact_match,flexible-extract | 62.40% |
+| arc_challenge | acc_norm,none | 31.74% |
+| hellaswag | acc_norm,none | 56.07% |
+| winogrande | acc,none | 50.04% |
+| piqa | acc_norm,none | 63.22% |
+| boolq | acc,none | 74.37% |
+| truthfulqa_mc2 | acc,none | 45.34% |
+
+### Public references
+
+- Base model (`Nanbeige/Nanbeige4.1-3B`) author-published benchmarks are listed in:
+- `benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md`
+- Frontier references (Claude/GPT/Gemini) are included in the same comparison report.
+
+### Reproducibility artifacts
+
+- `benchmarks/lm-eval-2026-02-24/summary_v3.tsv`
+- `benchmarks/lm-eval-2026-02-24/results_2026-02-24T00-06-21.474293.json`
+- `benchmarks/lm-eval-2026-02-24/run_v3.log`
+- `benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md`
+
+### Caveat
+
+Public model-card comparisons are not always apples-to-apples with lm-evaluation-harness settings (prompting, few-shot, decoding, and benchmark versions can differ).
+
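Review note: the Task/Metric/Score table added in this diff could in principle be regenerated from the results JSON listed under the reproducibility artifacts. The sketch below is one way to do that, not the author's actual script. It assumes the JSON follows the usual lm-evaluation-harness layout, a top-level `results` object keyed by task name whose values map metric keys such as `acc,none` to fractions between 0 and 1. The artifact itself was not inspected, and the names `TABLE_METRICS`, `load_results`, and `results_to_markdown` are hypothetical.

```python
import json

# Task -> metric key, copied from the README table in the diff above.
TABLE_METRICS = {
    "mmlu": "acc,none",
    "gsm8k": "exact_match,flexible-extract",
    "arc_challenge": "acc_norm,none",
    "hellaswag": "acc_norm,none",
    "winogrande": "acc,none",
    "piqa": "acc_norm,none",
    "boolq": "acc,none",
    "truthfulqa_mc2": "acc,none",
}


def load_results(path: str) -> dict:
    """Read a results JSON file from disk (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def results_to_markdown(results: dict, metrics: dict = TABLE_METRICS) -> str:
    """Render the README-style table from a results dict.

    Assumed layout: {"results": {task: {metric_key: fraction}}}.
    Tasks or metrics missing from the dict are skipped, not invented.
    """
    lines = ["| Task | Metric | Score |", "|---|---:|---:|"]
    per_task = results.get("results", {})
    for task, metric in metrics.items():
        value = per_task.get(task, {}).get(metric)
        if value is None:
            continue  # this run did not report the metric
        lines.append(f"| {task} | {metric} | {value * 100:.2f}% |")
    return "\n".join(lines)
```

If the assumption about the file layout holds, calling `results_to_markdown(load_results("benchmarks/lm-eval-2026-02-24/results_2026-02-24T00-06-21.474293.json"))` should yield rows in the same shape as the table in the README, which gives a quick check that the published scores match the stored run.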