Llama3.2-1B-Claude-Opus-Reasoning-Distill : GGUF (Code + Math)
This model was finetuned and converted to GGUF format using Unsloth.
Note: This was a naive attempt to distill reasoning into a non-reasoning model. Treat it as a toy experiment only.
⚠️ What Went Wrong (Read Before Using)
This model was a learning experiment. Three things went wrong, and you should know about them before using it.
1. SFT can't teach reasoning; it only teaches the model to imitate it
The goal was to distill Claude Opus's reasoning behavior into a 1B model by training on its <think> traces. That's the wrong tool for the job. Supervised fine-tuning teaches the model to copy the format of reasoning: it learns to write <think> before an answer because that's what the training data does, not because it has developed any actual reasoning capability. To genuinely develop reasoning, I learned you'd need reinforcement learning (GRPO/PPO) with a verifiable reward: reward correct answers and let the model figure out how to get there. That's how reasoning models are generally trained.
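To make "verifiable reward" concrete, here is a minimal sketch of the kind of scoring function an RL setup such as GRPO could use for math problems. It is an illustration only, not code from this project: the name `math_reward`, the convention that the final answer is the last number after the closing `</think>` tag, and the 0/1 reward values are all assumptions.

```python
import re

# Matches integers and simple decimals, optionally negative.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def math_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the last number after the reasoning block equals the reference, else 0.0.

    Hypothetical helper: the answer-extraction convention is an assumption.
    """
    # Score only the text after the reasoning block, if there is one.
    answer_part = completion.split("</think>")[-1]
    # Drop thousands separators so "1,200" parses as 1200.
    numbers = NUMBER.findall(answer_part.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(gold_answer) else 0.0
```

The point of a reward like this is that it checks only the outcome. The content of the `<think>` block is never compared against a reference trace, so the model is not rewarded for merely looking like it reasons.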
2. The dataset was too small and too narrow, and then I overtrained
Only ~2,000 examples, code + math only, trained for 5 epochs. At 5 epochs on 2k examples, the model is mostly memorizing. GSM8K strict-match accuracy dropped about 10.5 percentage points vs. base (31.77% → 21.23%), not because a 1B model can't do math, but because it saw 5 repetitions of a narrow slice and lost generalization.
3. The model doesn't stop generating and repeats itself
Two compounding bugs. First, many training examples were truncated at the 2048-token limit, which cut off the end-of-turn token (<|eot_id|>) from those examples, so the model never reliably learned that responses have an end. Second, HuggingFace's default eos_token_id for Llama 3 is 128001 (<|end_of_text|>), but the model actually generates 128009 (<|eot_id|>) to end turns. Without explicitly passing both, model.generate() doesn't stop at the end of a turn and keeps going until it hits the token limit.
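The truncation problem can be caught before training with a check like the one below. This is a sketch under assumptions, not the script used for this model: `find_truncated` is a hypothetical helper, and it expects each example to already be a list of token IDs produced with the same 2048-token limit.

```python
EOT_ID = 128009  # <|eot_id|> in the Llama 3 tokenizer, per the text above

def find_truncated(tokenized_examples, max_len=2048, eot_id=EOT_ID):
    """Return indices of examples that hit max_len without ending in eot_id."""
    truncated = []
    for i, ids in enumerate(tokenized_examples):
        # An example that fills the whole window and does not end on the
        # end-of-turn token was almost certainly cut off mid-response.
        if len(ids) >= max_len and ids[-1] != eot_id:
            truncated.append(i)
    return truncated
```

Examples flagged this way could be dropped, or the sequence limit raised, so that every training target ends with the end-of-turn token.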
Fix if you're using this model:
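```python
model.generate(
    input_ids=inputs,               # tokenized prompt
    eos_token_id=[128001, 128009],  # stop on <|end_of_text|> or <|eot_id|>
    max_new_tokens=512,             # hard cap in case neither stop token appears
    repetition_penalty=1.3,
    no_repeat_ngram_size=6,
)
```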
For Ollama, add to your Modelfile:
```
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"
```
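If a runtime hands back raw text that still contains these markers and its stop settings can't be changed, the output can be trimmed after the fact. The helper below is a hypothetical fallback (the name `trim_at_stop` is not from any library). It only cleans up the text and does not save the compute spent generating past the marker.

```python
STOP_MARKERS = ("<|eot_id|>", "<|end_of_text|>")

def trim_at_stop(text: str, stops=STOP_MARKERS) -> str:
    """Cut text at the earliest stop marker, if any marker is present."""
    cut = len(text)
    for marker in stops:
        pos = text.find(marker)
        if pos != -1:
            cut = min(cut, pos)
    return text[:cut]
```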
A LoRA fine-tune of meta-llama/Llama-3.2-1B-Instruct that tried to distill chain-of-thought reasoning
from Claude Opus 4.6/4.7 into a 1B parameter model. The model learns to emit structured
<think>...</think> reasoning blocks before answering, targeting code generation and math reasoning tasks.
Experimental. This is a personal research fine-tune trained on a single consumer GPU (RTX 3050 6 GB). Benchmarks show meaningful regressions on standard evals — see the Results section for an honest account.
Model Details
- Developed by: CodeStrate
- Model type: Causal LM — LoRA adapter (PEFT) on Llama-3.2-1B-Instruct
- Language: English
- License: Meta Llama 3.2 Community License
- Fine-tuned from: unsloth/Llama-3.2-1B-Instruct-bnb-4bit
- Max Sequence Length: 2048
- Training framework: Unsloth + TRL SFTTrainer
- Hardware: NVIDIA RTX 3050 6 GB GDDR6 Mobile
Intended Use
Direct Use
Generating step-by-step reasoning traces (<think> blocks) followed by final answers for
coding and math problems. Useful for studying how reasoning distillation scales (or doesn't)
to 1B-parameter models.
Out-of-Scope Use
- Production code generation or mathematical proofs — benchmark regressions make this unreliable
- Tasks outside coding/math (the training data was filtered to those categories only)
- Replacing a larger reasoning model
How to Get Started
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill",
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model starts a reply
    return_dict=True,
    return_tensors="pt",
)["input_ids"].to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=1024,            # <think> traces need a generous token budget
    do_sample=True,                 # required for temperature to take effect
    temperature=0.7,
    repetition_penalty=1.2,         # reduces repetition in my experience; not a guaranteed fix
    eos_token_id=[128001, 128009],  # stop on <|end_of_text|> or <|eot_id|>
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
The model will produce a <think>...</think> block containing its reasoning before the final answer.
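To separate the reasoning from the final answer in the decoded text, a small parser along these lines is usually enough. It is a sketch, not part of the released code: `split_reasoning` is a hypothetical name, and it assumes the output contains at most one complete `<think>...</think>` block.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str):
    """Return (reasoning, answer); reasoning is None if no complete block is found."""
    match = THINK_BLOCK.search(text)
    if match is None:
        # No closed block: treat the whole text as the answer.
        return None, text.strip()
    return match.group(1).strip(), text[match.end():].strip()
```

Because the model does not always close its reasoning block, the `None` case is worth handling explicitly rather than assuming the tags are always well-formed.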
Training Details
Dataset
angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k
— filtered to coding and math categories, 2,000 examples total (~40% multi-turn conversations).
The dataset contains Claude Opus 4.6/4.7 responses with full <think> reasoning traces.
No additional preprocessing was needed — data was already in OpenAI messages format and mapped
directly through apply_chat_template.
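For readers unfamiliar with what that mapping produces, the sketch below renders an OpenAI-style message list in a simplified Llama-3 layout. It is a hand-written approximation for illustration: the function name `render_llama3` is hypothetical, and the real template shipped with the tokenizer does more (for example, it can add a default system header), so use `tokenizer.apply_chat_template` for actual training or inference.

```python
def render_llama3(messages):
    """Simplified Llama-3-style rendering of an OpenAI-format message list."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    return "".join(parts)
```

Note that the final `<|eot_id|>` is the very last token of each rendered example, which is why cutting a sequence at a fixed length removes it.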
Training Hyperparameters
| Parameter | Value |
|---|---|
| LoRA Rank / Alpha | 32 / 64 |
| Target Modules | All |
| Sequence Length | 2048 |
| Batch Size (effective) | 16 (2 × grad_accum 8) |
| Steps | 500 (~5 epochs over 2k samples) |
| Learning Rate | 1e-4 |
| LR Scheduler | cosine |
| Warmup Steps | 100 |
| Optimizer | adamw_8bit |
| Weight Decay | 0.01 |
| Precision | bfloat16 |
| Chat Template | Llama-3 built-in (<|eot_id|> stop) |
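As a rough sanity check on how many passes over the data these settings imply, the arithmetic can be written out as below. This assumes a single GPU, one example per sequence with no packing, and exactly 2,000 examples. `epochs_seen` is a hypothetical helper, not a TRL function.

```python
def epochs_seen(steps: int, per_device_batch: int, grad_accum: int, dataset_size: int) -> float:
    """Approximate number of passes over the dataset for a single-GPU run."""
    sequences_seen = steps * per_device_batch * grad_accum
    return sequences_seen / dataset_size

print(epochs_seen(steps=500, per_device_batch=2, grad_accum=8, dataset_size=2000))  # 4.0
```

Under those assumptions the run covers about 4 passes, in the same range as the ~5 epochs quoted in the table. The exact figure depends on the true example count after filtering.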
Loss Curve
Training loss dropped from 2.39 → 1.57 over 500 steps (monotonic with minor noise). The curve had not plateaued at step 500, suggesting more training could further reduce loss.
| Step | Loss |
|---|---|
| 25 | 2.393 |
| 100 | 1.976 |
| 250 | 1.729 |
| 375 | 1.622 |
| 500 | 1.571 |
Evaluation
Evaluated with lm-evaluation-harness
on an RTX 3050 6 GB, greedy decoding, batch size 1.
Results
| Task | Category | n-shot | Base | Fine-tuned | Δ |
|---|---|---|---|---|---|
| GSM8K — Strict Match | Math Reasoning | 5 | 31.77% | 21.23% | -10.54pp ↓ |
| GSM8K — Flexible Extract | Math Reasoning | 5 | 37.23% | 25.47% | -11.75pp ↓ |
| HumanEval — pass@1 | Code Generation | 0 | 0.00% | 1.22% | +1.22pp ↑ |
| Total Eval Time | Inference | — | 1h 04m | 2h 07m | +97.3% ↑ |
Interpretation
GSM8K regression is expected and well-understood: the model adopts verbose <think> reasoning
blocks, which interfere with the strict #### <answer> output format that GSM8K grading requires.
The flexible-extract metric (which searches anywhere in the output for a number) also drops,
suggesting capacity limits at 1B parameters — the model struggles to maintain math accuracy
while also learning a new output structure.
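The difference between the two GSM8K metrics can be illustrated with a simplified sketch. These are not the exact patterns or filters used by lm-evaluation-harness; the regexes and function names below are assumptions chosen to mirror the description above (strict looks for `#### <answer>`, flexible takes the last number anywhere in the output).

```python
import re

STRICT = re.compile(r"#### (-?[0-9.,]+)")
ANY_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def strict_match(output: str):
    """Return the number following '#### ', or None if the marker is absent."""
    m = STRICT.search(output)
    return m.group(1).replace(",", "") if m else None

def flexible_extract(output: str):
    """Return the last number found anywhere in the output, or None."""
    numbers = ANY_NUMBER.findall(output)
    return numbers[-1].replace(",", "") if numbers else None

base_style = "16 - 3 - 4 = 9 eggs. 9 * 2 = 18\n#### 18"
think_style = "<think>16 - 3 - 4 = 9, then 9 * 2 = 18</think>The answer is 18 dollars."
rambling = think_style + " Let me double check: 16 - 3 = 13"
```

With these inputs, `strict_match` finds an answer only in `base_style`. `flexible_extract` recovers 18 from `think_style` but returns 13 for `rambling`, which shows how text generated after the answer can break even the lenient metric.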
HumanEval improves marginally from 0 → 1.2%. The low absolute score reflects HumanEval's strict single-function completion format clashing with the model's tendency to generate reasoning preamble.
Inference overhead (2×) is the clearest signal that reasoning distillation succeeded at the format level — the model generates substantially more tokens per sample. This is the classic echolalia / verbose CoT pattern observed across all small-model reasoning distills in this project.
Known Limitations
- Repetition / echolalia: common across all small-model fine-tunes in this project (LFM2.5, Qwen2.5-0.5B, Llama3.2-1B). Use `repetition_penalty=1.2` at inference to reduce severity.
- Reasoning trace quality: `<think>` blocks are often structurally correct but factually unreliable; the 1B capacity ceiling is the likely bottleneck.
- Format rigidity: the model expects Llama-3 chat template formatting; raw completions without a system prompt may produce inconsistent output.
- Loss still descending at 500 steps: extended training (1000+ steps) may improve results.
Framework Versions
- Python 3.12.13
- Unsloth 2026.5.7
- PEFT 0.19.1
- TRL 0.24.0
- PyTorch 2.10.0+cu128
- Transformers 5.5.0
Example usage:
- For text-only LLMs: `llama-cli -hf codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill --jinja`
- For multimodal models: `llama-mtmd-cli -hf codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill --jinja`
Available Model files:
Llama-3.2-1B-Instruct.Q4_K_M.gguf
Ollama
An Ollama Modelfile is included for easy deployment.
This was trained 2x faster with Unsloth
