strykes/emberforge-3b-reasoner

Text Generation

text-generation-inference


strykes committed on Feb 24

Commit 9390f1b · verified · 1 Parent(s): 1732864

Upload README.md with huggingface_hub

Files changed (1)

README.md +33 -0


@@ -41,3 +41,36 @@ Private finetuned Nanbeige4.1-3B reasoning release by `strykes`.
 - Intended for research and benchmarking.
 - Validate outputs before critical use.
+## Benchmarks (2026-02-24)
+### Local lm-eval results (this finetune)
+| Task | Metric | Score |
+|---|---:|---:|
+| mmlu | acc,none | 59.98% |
+| gsm8k | exact_match,flexible-extract | 62.40% |
+| arc_challenge | acc_norm,none | 31.74% |
+| hellaswag | acc_norm,none | 56.07% |
+| winogrande | acc,none | 50.04% |
+| piqa | acc_norm,none | 63.22% |
+| boolq | acc,none | 74.37% |
+| truthfulqa_mc2 | acc,none | 45.34% |
+### Public references
+- Base model (`Nanbeige/Nanbeige4.1-3B`) author-published benchmarks are listed in:
+  - `benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md`
+- Frontier references (Claude/GPT/Gemini) are included in the same comparison report.
+### Reproducibility artifacts
+- `benchmarks/lm-eval-2026-02-24/summary_v3.tsv`
+- `benchmarks/lm-eval-2026-02-24/results_2026-02-24T00-06-21.474293.json`
+- `benchmarks/lm-eval-2026-02-24/run_v3.log`
+- `benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md`
+### Caveat
+Scores published on public model cards are not always directly comparable to these lm-evaluation-harness results, because prompting, few-shot count, decoding settings, and benchmark versions can differ.
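
The scores table in this change can in principle be regenerated from the results JSON listed under the reproducibility artifacts. The following is a minimal sketch, not the author's script. It assumes the file follows the usual lm-evaluation-harness layout: a top-level `results` object keyed by task name, with metric keys such as `acc,none` holding fractions between 0 and 1. The helper names `load_results` and `results_to_markdown` are hypothetical.

```python
import json

# (task, metric key) pairs copied from the README table in this commit.
ROWS = [
    ("mmlu", "acc,none"),
    ("gsm8k", "exact_match,flexible-extract"),
    ("arc_challenge", "acc_norm,none"),
    ("hellaswag", "acc_norm,none"),
    ("winogrande", "acc,none"),
    ("piqa", "acc_norm,none"),
    ("boolq", "acc,none"),
    ("truthfulqa_mc2", "acc,none"),
]


def load_results(path):
    """Read a results file and return its `results` mapping.

    Assumption: the JSON has a top-level "results" key, as written by
    lm-evaluation-harness. Adjust if the actual file differs.
    """
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["results"]


def results_to_markdown(results, rows=ROWS):
    """Render the selected (task, metric) pairs as a Markdown table."""
    lines = ["| Task | Metric | Score |", "|---|---:|---:|"]
    for task, metric in rows:
        value = results.get(task, {}).get(metric)
        if value is None:
            # Skip rows that are absent rather than inventing a score.
            continue
        lines.append(f"| {task} | {metric} | {value * 100:.2f}% |")
    return "\n".join(lines)
```

With the artifact checked out locally, `print(results_to_markdown(load_results("benchmarks/lm-eval-2026-02-24/results_2026-02-24T00-06-21.474293.json")))` should print a table in the same shape as the README. Missing tasks or metrics are skipped silently, so a shorter table than expected signals a key mismatch.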