# EmberForge-3B-Reasoner
A private fine-tuned reasoning release of Nanbeige4.1-3B by strykes.
## Included Artifacts
- Merged full model (Safetensors) at the repo root for HF benchmarking
- LoRA adapter in `adapter/`
- GGUF quantizations in `gguf/`:
  - `Nanbeige4.1-3B-Q5_K_M.gguf`
  - `Nanbeige4.1-3B-Q4_K_M.gguf`
  - `Nanbeige4.1-3B-f16.gguf`
- Optional archive in `archives/`
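A quick local smoke test of one of the GGUF quantizations can be run with llama.cpp. This is a sketch, assuming `llama-cli` is installed and the Q4_K_M file has been downloaded into `gguf/`; the prompt and sampling settings are illustrative, not the training format:

```shell
# Minimal generation test with llama.cpp (assumes llama-cli on PATH
# and the Q4_K_M quantization downloaded to gguf/).
llama-cli \
  -m gguf/Nanbeige4.1-3B-Q4_K_M.gguf \
  -p "Question: What is 17 * 24? Think step by step." \
  -n 256 \
  --temp 0.2
```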
## Training Snapshot
- Base: `Nanbeige/Nanbeige4.1-3B`
- Method: Unsloth QLoRA -> merged weights
- Data: ~3.5k synthetic reasoning samples
- Epochs: 2
- Sequence length: 4096
## Notes
- Intended for research and benchmarking.
- Validate outputs before critical use.
## Benchmarks (2026-02-24)
### Local lm-eval results (this finetune)
| Task | Metric | Score |
|---|---|---|
| mmlu | acc,none | 59.98% |
| gsm8k | exact_match,flexible-extract | 62.40% |
| arc_challenge | acc_norm,none | 31.74% |
| hellaswag | acc_norm,none | 56.07% |
| winogrande | acc,none | 50.04% |
| piqa | acc_norm,none | 63.22% |
| boolq | acc,none | 74.37% |
| truthfulqa_mc2 | acc,none | 45.34% |
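For a single headline number, an unweighted macro-average over the eight tasks above can be computed directly. The scores are taken from the table; the averaging scheme itself is only an illustration, not an official metric:

```python
# Unweighted macro-average over the eight lm-eval tasks reported above.
scores = {
    "mmlu": 59.98,
    "gsm8k": 62.40,
    "arc_challenge": 31.74,
    "hellaswag": 56.07,
    "winogrande": 50.04,
    "piqa": 63.22,
    "boolq": 74.37,
    "truthfulqa_mc2": 45.34,
}
macro_avg = sum(scores.values()) / len(scores)
print(f"macro-average: {macro_avg:.2f}%")
```

Note that a macro-average weights every task equally regardless of benchmark size, which is why it should be read as a rough summary only.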
### Public references
- Base model (`Nanbeige/Nanbeige4.1-3B`) author-published benchmarks are listed in `benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md`.
- Frontier references (Claude/GPT/Gemini) are included in the same comparison report.
### Reproducibility artifacts
- `benchmarks/lm-eval-2026-02-24/summary_v3.tsv`
- `benchmarks/lm-eval-2026-02-24/results_2026-02-24T00-06-21.474293.json`
- `benchmarks/lm-eval-2026-02-24/run_v3.log`
- `benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md`
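The local results can in principle be re-run with lm-evaluation-harness. The command below is a sketch: `<repo-or-local-path>` is a placeholder for this model's HF id or a local checkout, and the exact few-shot and decoding settings used for the run above should be taken from `run_v3.log`, not assumed from these defaults:

```shell
# Re-run the local evaluation (pip install lm-eval).
# <repo-or-local-path> is a placeholder; check run_v3.log for the
# few-shot settings actually used in the reported run.
lm_eval \
  --model hf \
  --model_args pretrained=<repo-or-local-path> \
  --tasks mmlu,gsm8k,arc_challenge,hellaswag,winogrande,piqa,boolq,truthfulqa_mc2 \
  --batch_size 8 \
  --output_path benchmarks/lm-eval-2026-02-24/
```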
### Caveat
Public model-card comparisons are not always apples-to-apples with lm-evaluation-harness settings (prompting, few-shot, decoding, and benchmark versions can differ).