A newer version of this model is available: codestrate/Llama3.2-3B-Claude-Reasoning-Distill

Llama 3.2 3B — Claude Reasoning Distill (Adapter)

Note: this repository contains only the LoRA adapter. It needs the base model to work.

An updated attempt at distilling Claude Opus 4.6/4.7 reasoning traces into a small-form-factor model. The predecessor Llama 3.2 1B Claude Opus Reasoning Distill demonstrated that a 1B model could adopt <think> blocks but suffered from echolalia and a GSM8K regression. This run addresses the two root causes identified from that experiment:

  1. Capacity — 3B sits closer to the parameter floor where structured reasoning adoption is viable, as seen in models like Gemma 4 E2B-IT and Qwen3-1.7B (which has <think> baked into pretraining)
  2. Token boundaries — <think> and </think> are registered as special tokens (vocab 128256 → 128258) with trained embeddings, giving the model a hard mode boundary instead of treating them as plain text
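As a rough illustration of item 2, the sketch below registers the two tags as special tokens and grows the embedding matrix to match. It is not the training script used for this run. It assumes a Hugging Face tokenizer and model are passed in, and `register_think_tokens` is a hypothetical helper name. The new embedding rows start untrained; with a LoRA setup they usually have to be made trainable separately (for example through PEFT's `modules_to_save`), which this sketch does not cover.

```python
THINK_TOKENS = ["<think>", "</think>"]


def register_think_tokens(tokenizer, model):
    """Add the two tags as special tokens and resize the embeddings to match.

    Returns the number of tokens that were actually new to the vocabulary.
    """
    # add_special_tokens reports how many of the given tokens were not present yet.
    num_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": THINK_TOKENS}
    )
    if num_added:
        # The added rows are freshly initialised and only become useful once trained.
        model.resize_token_embeddings(len(tokenizer))
    return num_added
```

On the base Llama 3.2 tokenizer this would take the vocabulary from 128256 to 128258 entries, matching the numbers quoted above.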

Benchmarks are not yet available. GSM8K and HumanEval evaluations against the base Llama-3.2-3B-Instruct (4-bit) are in progress, along with further reasoning benchmarks such as ARC. Results will be added here when complete.


Model Details

| Field | Value |
|---|---|
| Base model | unsloth/Llama-3.2-3B-Instruct-bnb-4bit |
| Model type | Causal LM, LoRA adapter (PEFT) on Llama-3.2-3B-Instruct |
| Language | English |
| License | Meta Llama 3.2 Community License |
| Training framework | Unsloth + TRL SFTTrainer |
| Hardware | Tesla T4 (Kaggle) |
| Max sequence length | 2048 |

Intended Use

Generating step-by-step reasoning traces (<think> blocks) followed by final answers across a broad range of instruction-following tasks. Useful for studying how reasoning distillation scales to sub-4B models and how registered thinking tokens affect small-model behaviour.

Not intended for: production use, mathematical proofs requiring reliability, or replacing a larger reasoning model. Assume benchmark regressions vs base until the evaluations show otherwise.


How to Get Started

From the adapter

The LoRA adapter is available separately — load it on top of the base model without downloading the full merged weights.

Important: load the tokenizer from the adapter directory, not the base model. The adapter tokenizer carries the correct 128258-token vocabulary with <think>/</think> baked in. Using the base model tokenizer (128256) will cause an embedding dimension mismatch.

from unsloth import FastLanguageModel
from transformers import AutoTokenizer, TextStreamer
from peft import PeftModel

ADAPTER_PATH = "codestrate/Llama3.2-3B-Claude-Reasoning-Distill"

model, _ = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    load_in_4bit=True,
    max_seq_length=2048,
)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)  # vocab=128258
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
FastLanguageModel.for_inference(model)

SYSTEM_PROMPT = "You are a helpful assistant. Think step by step inside <think>...</think> before giving your final answer."
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Write a Python function to check if a number is prime."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # open the assistant turn so the model starts its reply
    return_tensors="pt",
).to("cuda")

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    input_ids=inputs,
    streamer=streamer,
    max_new_tokens=1024,
    temperature=0.7,
    min_p=0.1,
    repetition_penalty=1.3,
    no_repeat_ngram_size=6,
    use_cache=True,
)
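For downstream use it can help to separate the reasoning block from the final answer. The sketch below is one way to do that on a decoded completion. `split_reasoning` is a hypothetical helper, not part of this repository, and it assumes the text was decoded with `skip_special_tokens=False` so that the `<think>` and `</think>` tags are still present (as registered special tokens they are likely to be dropped otherwise).

```python
import re

# Non-greedy match so only the first complete reasoning block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from a decoded completion.

    If no complete <think>...</think> block is found, reasoning is "" and
    the whole text is treated as the answer.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer
```

An unclosed `<think>` (for example when generation hits `max_new_tokens` mid-reasoning) falls into the "no block found" branch, so callers may want to check for that case separately.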

From GGUF (Ollama / LM Studio)

A Modelfile is included for Ollama. For direct use:

ollama run hf.co/codestrate/Llama3.2-3B-Claude-Reasoning-Distill:Q4_K_M

Training Details

Dataset

angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k, instruct_train.jsonl split (full instruct + reasoning, ~7,700 examples). The data is already in OpenAI messages format, so it was mapped directly through apply_chat_template with no additional preprocessing.
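A minimal sketch of that mapping step is shown below. The record is made up for illustration, the `messages` field name is assumed from the OpenAI format rather than checked against the dataset, and `to_text` is a hypothetical helper name.

```python
# Hypothetical record; the "messages" field name is an assumption.
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "<think>2 + 2 = 4</think>\n4"},
    ]
}


def to_text(record, tokenizer):
    """Render one messages-format record into a single training string."""
    text = tokenizer.apply_chat_template(
        record["messages"],
        tokenize=False,               # return a string, not token ids
        add_generation_prompt=False,  # the assistant turn is already in the data
    )
    return {"text": text}
```

With a Hugging Face `datasets.Dataset`, this could be applied as `dataset.map(lambda r: to_text(r, tokenizer))`, which adds a `text` column for the trainer to consume.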

The previous 1B run used only the coding + math categories (~2,000 examples). This run uses the full instruct split for broader coverage.

Hyperparameters

| Parameter | Value |
|---|---|
| LoRA Rank / Alpha | 32 / 64 |
| Target Modules | All |
| Sequence Length | 2048 |
| Effective Batch | 16 (2 × grad_accum 8) |
| Steps | 904 (~2 epochs) |
| Learning Rate / Schedule | 1e-4 / cosine |
| Warmup Steps | 50 |
| Optimizer | adamw_8bit |
| Weight Decay | 0.01 |
| Precision | bfloat16 |
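The batch and step figures can be cross-checked with a few lines of arithmetic. The sketch assumes a single GPU (so the per-device batch of 2 is the whole micro-batch) and treats "~2 epochs" as exactly 2.

```python
per_device_batch = 2
grad_accum = 8
total_steps = 904
epochs = 2  # the card says "~2 epochs"; treated as exact here

effective_batch = per_device_batch * grad_accum  # sequences per optimizer step
sequences_seen = effective_batch * total_steps   # sequences over the whole run
per_epoch = sequences_seen / epochs              # implied examples per epoch

print(effective_batch, sequences_seen, per_epoch)  # 16 14464 7232.0
```

The implied ~7,232 examples per epoch is in the same range as the ~7,700 quoted for the dataset split; the card does not state what accounts for the difference.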

Loss Curve

Available in the merged quant repo.

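| Step | Loss | Step | Loss | Step | Loss |
|---|---|---|---|---|---|
| 50 | 2.1372 | 350 | 1.8798 | 650 | 1.7567 |
| 100 | 1.9597 | 400 | 1.8512 | 700 | 1.7530 |
| 150 | 1.9251 | 450 | 1.8493 | 750 | 1.7391 |
| 200 | 1.8972 | 500 | 1.7670 | 800 | 1.7709 |
| 250 | 1.8891 | 550 | 1.7707 | 850 | 1.7401 |
| 300 | 1.8738 | 600 | 1.7668 | 900 | 1.7598 |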

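Drop: 2.14 at step 50 → a low of 1.74 at step 750 (~0.40 absolute), ending at 1.76 at step 900. There is a visible cross-epoch improvement around step ~452 (−0.082 between the step 450 and step 500 logs). After that the loss plateaus in epoch 2, staying between 1.74 and 1.77 from step 500 onward, so a third epoch would probably not have been beneficial on this dataset.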
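The figures in this summary can be reproduced from the logged values. The snippet below copies the loss table verbatim and recomputes the overall drop, the change across the epoch boundary, and the spread of the second-epoch logs.

```python
# Logged training loss, copied from the table above (step -> loss).
losses = {
    50: 2.1372, 100: 1.9597, 150: 1.9251, 200: 1.8972, 250: 1.8891, 300: 1.8738,
    350: 1.8798, 400: 1.8512, 450: 1.8493, 500: 1.7670, 550: 1.7707, 600: 1.7668,
    650: 1.7567, 700: 1.7530, 750: 1.7391, 800: 1.7709, 850: 1.7401, 900: 1.7598,
}

best_step = min(losses, key=losses.get)              # step with the lowest logged loss
drop_to_best = losses[50] - losses[best_step]        # first log to best log
drop_to_final = losses[50] - losses[900]             # first log to last log
cross_epoch = losses[500] - losses[450]              # change across the epoch boundary
tail = [losses[s] for s in range(500, 901, 50)]      # logs from step 500 onward
tail_spread = max(tail) - min(tail)

print(best_step)                 # 750
print(round(drop_to_best, 4))    # 0.3981
print(round(drop_to_final, 4))   # 0.3774
print(round(cross_epoch, 4))     # -0.0823
print(round(tail_spread, 4))     # 0.0318
```

A spread of about 0.03 across the last nine logs is small next to the 0.08 step at the epoch boundary, which is consistent with reading the second epoch as a plateau.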

Known Limitations

  • Benchmarks not yet available — results will be added when the evaluation runs complete
  • Echolalia / repetition — reduced vs the 1B run due to special token boundaries, but not eliminated; repetition_penalty=1.3 and no_repeat_ngram_size=6 are recommended at inference (needs more testing)
  • System prompt required — without the <think>...</think> contract in the system prompt, the model may not cleanly transition from reasoning block to final answer
  • Not a production model — a research artefact studying reasoning distillation at sub-4B scale
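Because the system prompt is part of the contract, callers may want a small guard that adds it when it is missing. The sketch below is one possible approach; `with_think_contract` is a hypothetical helper, and the prompt text is the one used in the getting-started example.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Think step by step inside "
    "<think>...</think> before giving your final answer."
)


def with_think_contract(messages):
    """Return a copy of messages that starts with a system prompt.

    If the caller already supplied a system message it is left untouched,
    even if it does not mention <think>; only a missing one is filled in.
    """
    if messages and messages[0]["role"] == "system":
        return list(messages)
    return [{"role": "system", "content": SYSTEM_PROMPT}] + list(messages)
```

Leaving an existing system message alone is a design choice here: it avoids silently overriding a caller's prompt, at the cost of not enforcing the `<think>` wording.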

Available Files

| File | Format | Use |
|---|---|---|
| Llama-3.2-3B-Claude-Reasoning-Distill.Q4_K_M.gguf | GGUF Q4_K_M | LM Studio / Ollama (recommended) |
| Llama-3.2-3B-Claude-Reasoning-Distill.Q8_0.gguf | GGUF Q8 | Higher fidelity inference (near lossless; still lightweight) |
| Llama-3.2-3B-Claude-Reasoning-Distill.F16.gguf | GGUF F16 | Full precision GGUF |
| Adapter (This Repository) | LoRA adapter | PEFT inference / further fine-tuning |

Framework Versions

  • Python 3.12.13
  • Unsloth 2026.5.8
  • PEFT 0.19.1
  • TRL 0.24.0
  • PyTorch 2.10.0+cu128
  • Transformers 4.47.1

Predecessor: Llama3.2-1B-Claude-Opus-Reasoning-Distill
Trained 2x faster with Unsloth
