strykes committed (verified) · commit 9390f1b · 1 parent: 1732864

Upload README.md with huggingface_hub

Files changed (1): README.md (+33 −0)
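The commit message says the README was pushed with `huggingface_hub`. A minimal sketch of that workflow, wrapped in a function so nothing runs on import (the repo id is a placeholder, not taken from this page):

```python
def upload_readme(repo_id, token=None):
    """Sketch: push a local README.md to a Hub repo with huggingface_hub.

    Assumes you are logged in (``huggingface-cli login``) or pass a token.
    ``repo_id`` is hypothetical here; the actual repo id is not shown above.
    """
    # Imported inside the function so this sketch stays inert until called.
    from huggingface_hub import HfApi

    HfApi(token=token).upload_file(
        path_or_fileobj="README.md",
        path_in_repo="README.md",
        repo_id=repo_id,
        commit_message="Upload README.md with huggingface_hub",
    )
```

Calling `upload_readme("strykes/<model-repo>")` would create a commit like the one above.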
README.md CHANGED
@@ -41,3 +41,36 @@ Private finetuned Nanbeige4.1-3B reasoning release by `strykes`.
 
  - Intended for research and benchmarking.
  - Validate outputs before critical use.
+ 
+ ## Benchmarks (2026-02-24)
+ 
+ ### Local lm-eval results (this finetune)
+ 
+ | Task | Metric | Score |
+ |---|---:|---:|
+ | mmlu | acc,none | 59.98% |
+ | gsm8k | exact_match,flexible-extract | 62.40% |
+ | arc_challenge | acc_norm,none | 31.74% |
+ | hellaswag | acc_norm,none | 56.07% |
+ | winogrande | acc,none | 50.04% |
+ | piqa | acc_norm,none | 63.22% |
+ | boolq | acc,none | 74.37% |
+ | truthfulqa_mc2 | acc,none | 45.34% |
+ 
+ ### Public references
+ 
+ - Base model (`Nanbeige/Nanbeige4.1-3B`) author-published benchmarks are listed in:
+   - `benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md`
+ - Frontier references (Claude/GPT/Gemini) are included in the same comparison report.
+ 
+ ### Reproducibility artifacts
+ 
+ - `benchmarks/lm-eval-2026-02-24/summary_v3.tsv`
+ - `benchmarks/lm-eval-2026-02-24/results_2026-02-24T00-06-21.474293.json`
+ - `benchmarks/lm-eval-2026-02-24/run_v3.log`
+ - `benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md`
+ 
+ ### Caveat
+ 
+ Public model-card comparisons are not always apples-to-apples with lm-evaluation-harness settings (prompting, few-shot, decoding, and benchmark versions can differ).
+ 
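The scores in the table above are the percentage form of the fractional metrics that lm-evaluation-harness writes to its results JSON. A sketch of rendering that table from a harness-style results dict (the `"task" -> "metric,filter" -> float` shape matches recent harness output, but the exact keys here are illustrative, not copied from the artifacts):

```python
# Illustrative subset of an lm-evaluation-harness "results" mapping:
# task name -> {"metric,filter": fractional score}. Values below are
# taken from the table in this README; other tasks would follow the
# same pattern.
results = {
    "mmlu": {"acc,none": 0.5998},
    "gsm8k": {"exact_match,flexible-extract": 0.6240},
}

rows = ["| Task | Metric | Score |", "|---|---:|---:|"]
for task, metrics in results.items():
    for metric, score in metrics.items():
        # Convert the harness's 0-1 fraction to a two-decimal percentage.
        rows.append(f"| {task} | {metric} | {score * 100:.2f}% |")

table = "\n".join(rows)
print(table)
```

The same loop over the full `results` section of `results_2026-02-24T00-06-21.474293.json` would reproduce the complete table.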