Llama3.2-1B-Claude-Opus-Reasoning-Distill : GGUF (Code + Math)

This model was finetuned and converted to GGUF format using Unsloth.

Note: This was a naive attempt to distill reasoning into a non-reasoning model. Treat it as a toy experiment only.

⚠️ What Went Wrong (Read Before Using)

This model was a learning experiment. Three things went wrong, and you should know about them before using it.

1. SFT can't teach reasoning; it only mimics it in responses

The goal was to distill Claude Opus's reasoning behavior into a 1B model by training on its <think> traces. That's the wrong tool for the job. Supervised fine-tuning teaches the model to copy the format of reasoning: it learns to write <think> before an answer because that's what the training data does, not because it has developed any actual reasoning capability. To genuinely develop reasoning, I learned you'd need reinforcement learning (GRPO/PPO) with a verifiable reward: reward correct answers and let the model figure out how to get there. That's how reasoning models generally work.
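To make "verifiable reward" concrete, here is a minimal sketch of a reward for math problems with a numeric answer. It is not something this model was trained with; the function name, the tolerance, and the choice to score only the text after </think> are all assumptions made for illustration. An RL trainer would call something like this on each sampled completion, with the exact signature depending on the library.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def final_answer_reward(completion: str, gold: float, tol: float = 1e-6) -> float:
    """Hypothetical reward: 1.0 if the last number after the reasoning block equals gold."""
    # Score only the text after the closing tag, so numbers inside <think> are ignored.
    answer_part = completion.rsplit("</think>", 1)[-1]
    matches = _NUMBER.findall(answer_part)
    if not matches:
        return 0.0
    predicted = float(matches[-1].replace(",", ""))
    return 1.0 if abs(predicted - gold) <= tol else 0.0
```

The point of the design is that the reward never looks at how the reasoning is written, only at whether the final answer checks out, so copying the format earns nothing by itself.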

2. The dataset was too small and too narrow, and then I overtrained

Only ~2,000 examples, code and math only, trained for 5 epochs. At 5 epochs on 2k examples, the model is mostly memorizing. GSM8K dropped about 10.5 percentage points versus base (strict match), not because 1B can't do math, but because it saw 5 repetitions of a narrow slice and lost generalization.

3. The model doesn't stop generating and repeats itself

Two compounding bugs. First, many training examples were truncated at the 2048-token limit, which cut off the end-of-turn token (<|eot_id|>) from those examples, so the model never reliably learned that responses have an end. Second, HuggingFace's default eos_token_id for Llama 3 is 128001 (<|end_of_text|>), but the model actually generates 128009 (<|eot_id|>) to end turns. Without explicitly passing both, model.generate() keeps going until it hits max_new_tokens.

Fix if you're using this model:

model.generate(
    input_ids=inputs,
    eos_token_id=[128001, 128009],
    max_new_tokens=512,
    repetition_penalty=1.3,
    no_repeat_ngram_size=6,
)

For Ollama, add to your Modelfile:

PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"
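The truncation problem in point 3 can also be addressed at data-preparation time. The sketch below drops any example whose full token sequence does not fit in the sequence length, so every remaining example keeps its final <|eot_id|>. This is not the pipeline that was used for this model; drop_truncated and its tokenize argument are hypothetical names for illustration.

```python
from typing import Callable, Iterable

def drop_truncated(
    texts: Iterable[str],
    tokenize: Callable[[str], list],
    max_len: int = 2048,
) -> list[str]:
    """Keep only examples whose full token sequence fits in max_len."""
    kept = []
    for text in texts:
        # An example longer than max_len would be cut off before its final
        # end-of-turn token, so it is dropped rather than truncated.
        if len(tokenize(text)) <= max_len:
            kept.append(text)
    return kept
```

With a loaded Hugging Face tokenizer (an assumption here), tokenize could be lambda t: tokenizer(t)["input_ids"], applied to the chat-template-rendered text of each conversation. The trade-off is a smaller dataset in exchange for a consistent end-of-turn signal.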

A LoRA fine-tune of meta-llama/Llama-3.2-1B-Instruct that tried to distill chain-of-thought reasoning from Claude Opus 4.6/4.7 into a 1B parameter model. The model learns to emit structured <think>...</think> reasoning blocks before answering, targeting code generation and math reasoning tasks.

Experimental. This is a personal research fine-tune trained on a single consumer GPU (RTX 3050 6 GB). Benchmarks show meaningful regressions on standard evals — see the Results section for an honest account.

Model Details

  • Developed by: CodeStrate
  • Model type: Causal LM — LoRA adapter (PEFT) on Llama-3.2-1B-Instruct
  • Language: English
  • License: Meta Llama 3.2 Community License
  • Fine-tuned from: unsloth/Llama-3.2-1B-Instruct-bnb-4bit
  • Max Sequence Length: 2048
  • Training framework: Unsloth + TRL SFTTrainer
  • Hardware: NVIDIA RTX 3050 6 GB GDDR6 Mobile

Intended Use

Direct Use

Generating step-by-step reasoning traces (<think> blocks) followed by final answers for coding and math problems. Useful for studying how reasoning distillation scales (or doesn't) to 1B-parameter models.

Out-of-Scope Use

  • Production code generation or mathematical proofs — benchmark regressions make this unreliable
  • Tasks outside coding/math (the training data was filtered to those categories only)
  • Replacing a larger reasoning model

How to Get Started

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill",
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
# add_generation_prompt=True appends the assistant header so the model starts a new reply.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,            # thinking requires a lot more tokens
    do_sample=True,                 # make sampling explicit so temperature applies
    temperature=0.7,
    repetition_penalty=1.2,         # mitigates echolalia in my experience; not a sure fix
    eos_token_id=[128001, 128009],  # stop on <|end_of_text|> or <|eot_id|>
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The model will produce a <think>...</think> block containing its reasoning before the final answer.
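If you only want the final answer, the reasoning block can be separated from the rest of the output. The helper below is a minimal sketch, assuming the tags appear as plain text in the decoded string and that at most one block matters; split_reasoning is a hypothetical name, not part of any library.

```python
import re

# DOTALL lets the reasoning span multiple lines.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no complete block exists."""
    match = _THINK.search(text)
    if match is None:
        # No closed block (for example, generation hit the token limit mid-thought).
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer
```

Because this model sometimes fails to stop, an unclosed <think> block is a realistic case, which is why the helper returns the raw text instead of raising.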

Training Details

Dataset

angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k — filtered to coding and math categories, 2,000 examples total (~40% multi-turn conversations).

The dataset contains Claude Opus 4.6/4.7 responses with full <think> reasoning traces. No additional preprocessing was needed — data was already in OpenAI messages format and mapped directly through apply_chat_template.

Training Hyperparameters

Parameter                 Value
LoRA Rank / Alpha         32 / 64
Target Modules            All
Sequence Length           2048
Batch Size (effective)    16 (2 × grad_accum 8)
Steps                     500 (~5 epochs over 2k samples)
Learning Rate             1e-4
LR Scheduler              cosine
Warmup Steps              100
Optimizer                 adamw_8bit
Weight Decay              0.01
Precision                 bfloat16
Chat Template             Llama-3 built-in (<|eot_id|> stop)
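As a sanity check on the hyperparameters above, the number of passes over the data follows from steps × effective batch size ÷ dataset size. The sketch below uses only the values stated in this card. Taken at face value they give 4.0 epochs, a little under the "~5" quoted, so the training split may have been somewhat smaller than 2,000 examples; that is a guess, since the card does not say.

```python
per_device_batch = 2
grad_accum = 8
steps = 500
num_examples = 2000  # "2,000 examples total" from the Dataset section

effective_batch = per_device_batch * grad_accum   # 16
samples_seen = steps * effective_batch            # 8000
epochs = samples_seen / num_examples              # 4.0
print(effective_batch, samples_seen, epochs)
```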

Loss Curve

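Training loss dropped from 2.39 to 1.57 over 500 steps (monotonic with minor noise). The curve had not plateaued at step 500, so more training would likely reduce training loss further. Given the overtraining described at the top of this card, a lower training loss would not necessarily translate into better benchmark scores.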

Step    Loss
25      2.393
100     1.976
250     1.729
375     1.622
500     1.571

Evaluation

Evaluated with lm-evaluation-harness on an RTX 3050 6 GB, greedy decoding, batch size 1.

Results

Task                        Category          n-shot   Base      Fine-tuned   Δ
GSM8K (Strict Match)        Math Reasoning    5        31.77%    21.23%       -10.54pp ↓
GSM8K (Flexible Extract)    Math Reasoning    5        37.23%    25.47%       -11.75pp ↓
HumanEval (pass@1)          Code Generation   0        0.00%     1.22%        +1.22pp ↑
Total Eval Time             Inference         -        1h 04m    2h 07m       +97.3% ↑
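The Δ column reports percentage points (pp), which is easy to confuse with a relative percentage. The short sketch below computes both for the GSM8K strict-match row, using only the numbers in the table; the deltas helper is a hypothetical name for illustration.

```python
def deltas(base_pct: float, tuned_pct: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute_pp = tuned_pct - base_pct
    relative_pct = 100.0 * absolute_pp / base_pct
    return round(absolute_pp, 2), round(relative_pct, 1)

print(deltas(31.77, 21.23))  # (-10.54, -33.2)
```

So a drop of 10.54pp means the fine-tuned model solves roughly a third fewer problems than the base model on this metric.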

Interpretation

GSM8K regression is expected and well-understood: the model adopts verbose <think> reasoning blocks, which interfere with the strict #### <answer> output format that GSM8K grading requires. The flexible-extract metric (which searches anywhere in the output for a number) also drops, suggesting capacity limits at 1B parameters — the model struggles to maintain math accuracy while also learning a new output structure.
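The difference between the two GSM8K metrics can be illustrated with a pair of regexes. These are simplified approximations written for this card, not the exact filters in lm-evaluation-harness: the strict version requires the "#### <answer>" marker, while the flexible version takes the last number found anywhere in the output.

```python
import re

STRICT = re.compile(r"#### (-?[0-9.,]+)")
FLEXIBLE = re.compile(r"-?[0-9][0-9.,]*")

def strict_match(output: str):
    """Return the number after '#### ', or None if the marker is missing."""
    m = STRICT.search(output)
    return m.group(1) if m else None

def flexible_extract(output: str):
    """Return the last number anywhere in the output, or None."""
    found = FLEXIBLE.findall(output)
    # Trailing punctuation such as a sentence-ending period is stripped.
    return found[-1].rstrip(".,") if found else None

verbose = "<think>18 - 4 = 14 eggs, 14 * 2 = 28</think>\nShe makes $28 per day."
print(strict_match(verbose), flexible_extract(verbose))  # None 28
```

A correct answer wrapped in prose scores zero under strict match but can still be recovered by flexible extraction, which is why a drop in the flexible metric points at accuracy rather than formatting alone.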

HumanEval improves marginally from 0 → 1.2%. The low absolute score reflects HumanEval's strict single-function completion format clashing with the model's tendency to generate reasoning preamble.

Inference overhead (2×) is the clearest signal that reasoning distillation succeeded at the format level — the model generates substantially more tokens per sample. This is the classic echolalia / verbose CoT pattern observed across all small-model reasoning distills in this project.

Known Limitations

  • Repetition / echolalia: common across all small-model fine-tunes in this project (LFM2.5, Qwen2.5-0.5B, Llama3.2-1B). Use repetition_penalty=1.2 at inference to reduce severity.
  • Reasoning trace quality: <think> blocks are often structurally correct but factually unreliable; the capacity ceiling of 1B is the likely bottleneck.
  • Format rigidity: the model expects Llama-3 chat template formatting; raw completions without a system prompt may produce inconsistent output.
  • Loss still descending at 500 steps: extended training (1000+ steps) may lower training loss further, though on a 2k-example dataset that risks more memorization rather than better results.
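To put a number on the repetition issue, one rough option is the fraction of word n-grams that repeat an earlier n-gram in the same output. This is a sketch for eyeballing outputs, not a metric used in the evaluation above; it splits on whitespace rather than model tokens, and repeated_ngram_fraction is a hypothetical helper.

```python
from collections import Counter

def repeated_ngram_fraction(text: str, n: int = 6) -> float:
    """Fraction of word n-grams that are repeats of an earlier n-gram in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)
```

A value near 0 means little verbatim looping; values approaching 1 indicate the output is mostly the same phrase repeated. Comparing the score with and without repetition_penalty gives a quick read on how much the setting helps for a given prompt.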

Framework Versions

  • Python 3.12.13
  • Unsloth 2026.5.7
  • PEFT 0.19.1
  • TRL 0.24.0
  • PyTorch 2.10.0+cu128
  • Transformers 5.5.0

Example usage:

  • This is a text-only model, so use llama-cli: llama-cli -hf codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill --jinja

Available Model files:

  • Llama-3.2-1B-Instruct.Q4_K_M.gguf

Ollama

An Ollama Modelfile is included for easy deployment. This model was trained 2x faster with Unsloth.
