EmberForge-3B-Reasoner

Private reasoning-focused finetune of Nanbeige4.1-3B, released by strykes.

Included Artifacts

  • Merged full model (Safetensors) at repo root for HF benchmarking (loading sketch after this list)
  • LoRA adapter in adapter/
  • GGUF in gguf/:
    • Nanbeige4.1-3B-Q5_K_M.gguf
    • Nanbeige4.1-3B-Q4_K_M.gguf
    • Nanbeige4.1-3B-f16.gguf
  • Optional archive in archives/
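
A minimal loading sketch for the merged Safetensors weights with transformers. The repo id matches this card; the prompt format is an illustrative assumption, so check the tokenizer's chat template before relying on it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "strykes/emberforge-3b-reasoner"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # matches the F16 tensors shipped at repo root
    device_map="auto",
)

# Illustrative reasoning-style prompt; adapt to the model's actual template.
prompt = "Question: A train travels 60 km in 45 minutes. What is its average speed in km/h?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The GGUF files in gguf/ serve the same purpose for llama.cpp-based runtimes; pick Q4_K_M or Q5_K_M for smaller memory footprints, or f16 for closest parity with the merged weights.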

Training Snapshot

  • Base: Nanbeige/Nanbeige4.1-3B
  • Method: QLoRA via Unsloth, then merged back into full weights (training sketch after this list)
  • Data: ~3.5k synthetic reasoning samples
  • Epochs: 2
  • Sequence length: 4096
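
A hedged sketch of the QLoRA setup described above, using Unsloth's documented API. The LoRA rank, alpha, and target modules are illustrative assumptions; the card only states the base model, ~3.5k samples, 2 epochs, and a 4096 sequence length.

```python
from unsloth import FastLanguageModel

# Load the 4-bit quantized base model (the "Q" in QLoRA).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Nanbeige/Nanbeige4.1-3B",
    max_seq_length=4096,   # matches the training snapshot
    load_in_4bit=True,
)

# Attach LoRA adapters; r/alpha/target_modules are assumptions, not
# the card's stated hyperparameters.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# After supervised finetuning (e.g. 2 epochs over the reasoning set),
# merge the adapter into full 16-bit weights, as shipped at repo root.
model.save_pretrained_merged("merged", tokenizer, save_method="merged_16bit")
```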

Notes

  • Intended for research and benchmarking.
  • Validate outputs before critical use.

Benchmarks (2026-02-24)

Local lm-evaluation-harness results for this finetune (reproduction sketch after the table)

Task             Metric                           Score
mmlu             acc,none                         59.98%
gsm8k            exact_match,flexible-extract     62.40%
arc_challenge    acc_norm,none                    31.74%
hellaswag        acc_norm,none                    56.07%
winogrande       acc,none                         50.04%
piqa             acc_norm,none                    63.22%
boolq            acc,none                         74.37%
truthfulqa_mc2   acc,none                         45.34%
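
A sketch of reproducing the table above with the lm-evaluation-harness Python API (v0.4+). Few-shot counts and batching are assumptions here; the exact settings used for this run are recorded in the run log listed under Reproducibility artifacts.

```python
import lm_eval

# Evaluate the merged model on the same task set as the table above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=strykes/emberforge-3b-reasoner,dtype=float16",
    tasks=["mmlu", "gsm8k", "arc_challenge", "hellaswag",
           "winogrande", "piqa", "boolq", "truthfulqa_mc2"],
    batch_size="auto",
)

# Print per-task metric dictionaries (acc, acc_norm, exact_match, ...).
for task, metrics in results["results"].items():
    print(task, metrics)
```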

Public references

  • Base model (Nanbeige/Nanbeige4.1-3B) author-published benchmarks are listed in:
    • benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md
  • Frontier references (Claude/GPT/Gemini) are included in the same comparison report.

Reproducibility artifacts

  • benchmarks/lm-eval-2026-02-24/summary_v3.tsv
  • benchmarks/lm-eval-2026-02-24/results_2026-02-24T00-06-21.474293.json
  • benchmarks/lm-eval-2026-02-24/run_v3.log
  • benchmarks/lm-eval-2026-02-24/benchmark_comparison_public_2026-02-24.md

Caveat

Public model-card comparisons are not always apples-to-apples with lm-evaluation-harness settings (prompting, few-shot, decoding, and benchmark versions can differ).
