Instructions to use codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill",
	filename="Llama3.2-1B-Claude-Opus-Reasoning-Distill.Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M

Use Docker

docker model run hf.co/codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M


vLLM

How to use codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
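
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `ask` are made up for illustration, and the base URL assumes the default vLLM port shown in the curl command above. The `max_tokens` cap is an added precaution, since this model is known to overrun.

```python
import json
import urllib.request

MODEL_ID = "codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill"


def build_chat_payload(prompt, model=MODEL_ID, max_tokens=512):
    """Build the JSON body for an OpenAI-compatible chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # hard cap in case the model fails to stop
    }


def ask(prompt, base_url="http://localhost:8000/v1"):
    """POST the prompt to a running server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With a server running, `print(ask("What is the capital of France?"))` prints the reply. Pointing `base_url` at `http://localhost:8080/v1` should work the same way against `llama-server`.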


Ollama
How to use codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill with Ollama:
```
ollama run hf.co/codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M
```

Unsloth Studio

How to use codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill to start chatting

Pi

How to use codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill with Docker Model Runner:
```
docker model run hf.co/codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M
```

Lemonade

How to use codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill:Q4_K_M

Run and chat with the model

lemonade run user.Llama3.2-1B-Claude-Opus-Reasoning-Distill-Q4_K_M

List all available models

lemonade list

Llama3.2-1B-Claude-Opus-Reasoning-Distill : GGUF (Code + Math)

This model was fine-tuned and converted to GGUF format using Unsloth.

Note: This was a naive attempt to distill reasoning into a non-reasoning model. Treat it as a toy experiment only.

⚠️ What Went Wrong (Read Before Using)

This model was a learning experiment. Three things went wrong, and you should know about them before using it.

1. SFT can't teach reasoning; it only teaches the model to mimic it

The goal was to distill Claude Opus's reasoning behavior into a 1B model by training on its <think> traces. That's the wrong tool for the job. Supervised fine-tuning teaches the model to copy the format of reasoning: it learns to write <think> before an answer because that's what the training data does, not because it has developed any actual reasoning capability. To genuinely develop reasoning, I learned you'd need reinforcement learning (GRPO/PPO) with a verifiable reward: reward correct answers and let the model figure out how to get there. That's how reasoning models are generally trained.
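
To make "verifiable reward" concrete, here is a minimal sketch of the kind of scoring function an RL setup such as GRPO could call on each sampled completion. It is an illustration only, not code that was used for this model: the function names are hypothetical, and it assumes every reference answer is a single number.

```python
import re
from typing import Optional

NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number that appears after the reasoning block, if any."""
    # Ignore numbers inside <think>...</think>; only the visible answer counts.
    answer_part = completion.split("</think>")[-1]
    numbers = NUMBER_RE.findall(answer_part.replace(",", ""))
    return numbers[-1] if numbers else None


def correctness_reward(completion: str, reference: str) -> float:
    """Score 1.0 when the final answer equals the reference, else 0.0."""
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if float(predicted) == float(reference) else 0.0
```

The reward looks only at the final answer, so it says nothing about what the reasoning should look like. That is the difference from SFT, which directly rewards matching the trace token by token.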

2. The dataset was too small and too narrow, and then I overtrained

Only ~2,000 examples, code and math only, trained for about 5 epochs. At that many passes over 2k examples, the model is mostly memorizing. GSM8K strict match dropped by about 10.5 percentage points versus the base model, not because a 1B model can't do math, but because it saw repeated passes over a narrow slice and lost generalization.

3. The model doesn't stop generating and repeats itself

Two bugs compound here. First, many training examples were truncated at the 2048-token limit, which cut off the end-of-turn token (<|eot_id|>), so the model never reliably learned that responses have an end. Second, HuggingFace's default eos_token_id for Llama 3 is 128001 (<|end_of_text|>), but the model actually emits 128009 (<|eot_id|>) to end turns. Unless both IDs are passed explicitly, model.generate() keeps going until it hits the length limit.
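
The truncation problem can be caught before training with a check along these lines. This is a sketch under assumptions, not the preprocessing that was actually run: it assumes each example is already a list of token IDs produced by the Llama-3 chat template (so an untruncated example ends in <|eot_id|>), and the helper names are hypothetical.

```python
EOT_ID = 128009      # <|eot_id|>, as given above
MAX_SEQ_LEN = 2048   # training sequence length


def keeps_end_of_turn(token_ids, max_len=MAX_SEQ_LEN, eot_id=EOT_ID):
    """True if the example still ends in <|eot_id|> after truncation."""
    kept = token_ids[:max_len]  # what the trainer would actually see
    return len(kept) > 0 and kept[-1] == eot_id


def drop_truncated(examples, max_len=MAX_SEQ_LEN, eot_id=EOT_ID):
    """Keep only examples whose end-of-turn token survives truncation."""
    return [ids for ids in examples if keeps_end_of_turn(ids, max_len, eot_id)]
```

Dropping long examples shrinks an already small dataset, so raising the sequence length or shortening the traces are alternatives worth weighing.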

Fix if you're using this model:

model.generate(
    input_ids=inputs,
    eos_token_id=[128001, 128009],
    max_new_tokens=512,
    repetition_penalty=1.3,
    no_repeat_ngram_size=6,
)

For Ollama, add to your Modelfile:

PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"
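
For llama-cpp-python, a comparable guard can be passed per request. The sketch below assumes `llm` is a `llama_cpp.Llama` instance loaded as in the library example earlier; `bounded_chat` is a hypothetical helper name. Whether the stop strings ever show up in the decoded text depends on how the runtime renders special tokens, so `max_tokens` is the real backstop.

```python
STOP_STRINGS = ["<|eot_id|>", "<|end_of_text|>"]


def bounded_chat(llm, prompt, max_tokens=512, repeat_penalty=1.3):
    """Run one chat turn with stop strings, a token cap and a repetition penalty."""
    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        stop=STOP_STRINGS,              # end the turn on either marker
        max_tokens=max_tokens,          # hard limit if no marker appears
        repeat_penalty=repeat_penalty,  # same idea as repetition_penalty above
    )
    return response["choices"][0]["message"]["content"]
```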

A LoRA fine-tune of meta-llama/Llama-3.2-1B-Instruct that tried to distill chain-of-thought reasoning from Claude Opus 4.6/4.7 into a 1B parameter model. The model learns to emit structured <think>...</think> reasoning blocks before answering, targeting code generation and math reasoning tasks.

Experimental. This is a personal research fine-tune trained on a single consumer GPU (RTX 3050 6 GB). Benchmarks show meaningful regressions on standard evals — see the Results section for an honest account.

Model Details

Developed by: CodeStrate
Model type: Causal LM — LoRA adapter (PEFT) on Llama-3.2-1B-Instruct
Language: English
License: Meta Llama 3.2 Community License
Fine-tuned from: unsloth/Llama-3.2-1B-Instruct-bnb-4bit
Max Sequence Length: 2048
Training framework: Unsloth + TRL SFTTrainer
Hardware: NVIDIA RTX 3050 6 GB GDDR6 Mobile

Intended Use

Direct Use

Generating step-by-step reasoning traces (<think> blocks) followed by final answers for coding and math problems. Useful for studying how reasoning distillation scales (or doesn't) to 1B-parameter models.

Out-of-Scope Use

Production code generation or mathematical proofs — benchmark regressions make this unreliable
Tasks outside coding/math (the training data was filtered to those categories only)
Replacing a larger reasoning model

How to Get Started

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill",
    max_seq_length=2048,  # matches the training sequence length
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode

messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model starts its reply
    return_dict=True,
    return_tensors="pt",
)["input_ids"].to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=1024,            # thinking requires a lot more tokens
    do_sample=True,                 # temperature has no effect without sampling
    temperature=0.7,
    repetition_penalty=1.2,         # recommended; mitigates echolalia in my experience, not a sure-shot fix
    eos_token_id=[128001, 128009],  # stop on <|end_of_text|> or <|eot_id|>
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The model will produce a <think>...</think> block containing its reasoning before the final answer.
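
To separate the reasoning from the final answer in the generated text, a small parser is enough. This is a minimal sketch with a hypothetical helper name, `split_reasoning`. It also handles the case where the model never closes its <think> block, which can happen given the stopping problems described earlier.

```python
from typing import Tuple

OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split generated text into (reasoning, answer)."""
    start = text.find(OPEN_TAG)
    if start == -1:
        # No reasoning block at all: everything is the answer.
        return "", text.strip()
    body_start = start + len(OPEN_TAG)
    end = text.find(CLOSE_TAG, body_start)
    if end == -1:
        # Block was opened but never closed: treat the rest as reasoning.
        return text[body_start:].strip(), ""
    reasoning = text[body_start:end].strip()
    answer = text[end + len(CLOSE_TAG):].strip()
    return reasoning, answer
```

An empty answer after a non-empty reasoning string is a quick signal that generation hit `max_new_tokens` before the model finished thinking.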

Training Details

Dataset

angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k — filtered to coding and math categories, 2,000 examples total (~40% multi-turn conversations).

The dataset contains Claude Opus 4.6/4.7 responses with full <think> reasoning traces. No additional preprocessing was needed — data was already in OpenAI messages format and mapped directly through apply_chat_template.

Training Hyperparameters

| Parameter | Value |
| --- | --- |
| LoRA Rank / Alpha | 32 / 64 |
| Target Modules | All |
| Sequence Length | 2048 |
| Batch Size (effective) | 16 (2 × grad_accum 8) |
| Steps | 500 (~5 epochs over 2k samples) |
| Learning Rate | 1e-4 |
| LR Scheduler | cosine |
| Warmup Steps | 100 |
| Optimizer | adamw_8bit |
| Weight Decay | 0.01 |
| Precision | bfloat16 |
| Chat Template | Llama-3 built-in (`<\|eot_id\|>` stop) |
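
As a rough cross-check of how many passes over the data these settings imply, the step count can be converted to epochs. The sketch assumes a single GPU, one training example per sequence (no packing), and exactly 2,000 examples. The function name is hypothetical.

```python
def passes_over_dataset(steps, per_device_batch, grad_accum, num_examples):
    """Approximate number of epochs implied by a fixed step budget."""
    examples_seen = steps * per_device_batch * grad_accum
    return examples_seen / num_examples


print(passes_over_dataset(steps=500, per_device_batch=2, grad_accum=8, num_examples=2000))
# prints 4.0
```

Under these assumptions the run covers the dataset about four times, in the same range as the approximate figure in the table. The exact number depends on how examples were counted after filtering.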

Loss Curve

Training loss dropped from 2.39 → 1.57 over 500 steps (monotonic with minor noise). The curve had not plateaued at step 500, suggesting more training could further reduce loss.

| Step | Loss |
| --- | --- |
| 25 | 2.393 |
| 100 | 1.976 |
| 250 | 1.729 |
| 375 | 1.622 |
| 500 | 1.571 |
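
The logged values can be turned into an average loss change per 100 steps to show how quickly the curve is flattening. This is a small sketch over the five points in the table only. The names `LOSS_LOG` and `loss_change_per_100_steps` are hypothetical.

```python
# (step, training loss) pairs copied from the table above
LOSS_LOG = [(25, 2.393), (100, 1.976), (250, 1.729), (375, 1.622), (500, 1.571)]


def loss_change_per_100_steps(log):
    """Average loss change per 100 steps between consecutive logged points."""
    rates = []
    for (step_a, loss_a), (step_b, loss_b) in zip(log, log[1:]):
        rate = (loss_b - loss_a) / (step_b - step_a) * 100
        rates.append((step_a, step_b, rate))
    return rates


for step_a, step_b, rate in loss_change_per_100_steps(LOSS_LOG):
    print(f"steps {step_a}-{step_b}: {rate:+.3f} per 100 steps")
```

The rate stays negative through step 500 but shrinks in every interval. This is training loss only, so on its own it does not show whether further steps would help or hurt held-out performance.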

Evaluation

Evaluated with lm-evaluation-harness on an RTX 3050 6 GB, greedy decoding, batch size 1.

Results

| Task | Category | n-shot | Base | Fine-tuned | Δ |
| --- | --- | --- | --- | --- | --- |
| GSM8K (Strict Match) | Math Reasoning | 5 | 31.77% | 21.23% | -10.54pp ↓ |
| GSM8K (Flexible Extract) | Math Reasoning | 5 | 37.23% | 25.47% | -11.75pp ↓ |
| HumanEval (pass@1) | Code Generation | 0 | 0.00% | 1.22% | +1.22pp ↑ |
| Total Eval Time | Inference | n/a | 1h 04m | 2h 07m | +97.3% ↑ |

Interpretation

GSM8K regression is expected and well-understood: the model adopts verbose <think> reasoning blocks, which interfere with the strict #### <answer> output format that GSM8K grading requires. The flexible-extract metric (which searches anywhere in the output for a number) also drops, suggesting capacity limits at 1B parameters — the model struggles to maintain math accuracy while also learning a new output structure.
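
The difference between the two GSM8K metrics can be illustrated with two simplified extractors. These patterns are an approximation written for this explanation. The filters that lm-evaluation-harness actually applies may differ in detail.

```python
import re
from typing import Optional

STRICT_RE = re.compile(r"#### (-?[0-9.,]+)")       # requires the "#### <answer>" marker
FLEXIBLE_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")  # any number, anywhere in the text


def strict_match(output: str) -> Optional[str]:
    """Return the answer only if it follows the '#### ' marker."""
    match = STRICT_RE.search(output)
    return match.group(1) if match else None


def flexible_extract(output: str) -> Optional[str]:
    """Return the last number found anywhere in the output."""
    found = FLEXIBLE_RE.findall(output)
    return found[-1].rstrip(",") if found else None


verbose = "<think>16 - 3 - 4 = 9 eggs, 9 * 2 = 18</think>\nShe makes $18 per day."
print(strict_match(verbose))      # None: correct answer, but no '####' marker
print(flexible_extract(verbose))  # 18
```

A correct answer in the wrong format scores zero under strict match but is still found by the flexible extractor, so the drop in the flexible metric cannot be explained by formatting alone.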

HumanEval improves marginally from 0 → 1.2%. The low absolute score reflects HumanEval's strict single-function completion format clashing with the model's tendency to generate reasoning preamble.

Inference overhead (2×) is the clearest signal that reasoning distillation succeeded at the format level — the model generates substantially more tokens per sample. This is the classic echolalia / verbose CoT pattern observed across all small-model reasoning distills in this project.

Known Limitations

Repetition / echolalia — common across all small-model fine-tunes in this project (LFM2.5, Qwen2.5-0.5B, Llama3.2-1B). Use repetition_penalty=1.2 at inference to reduce severity.
Reasoning trace quality — <think> blocks are often structurally correct but factually unreliable; capacity ceiling of 1B is the likely bottleneck.
Format rigidity — the model expects Llama-3 chat template formatting; raw completions without a system prompt may produce inconsistent output.
Loss still descending at 500 steps — extended training (1000+ steps) may improve results.

Framework Versions

Python 3.12.13
Unsloth 2026.5.7
PEFT 0.19.1
TRL 0.24.0
PyTorch 2.10.0+cu128
Transformers 5.5.0

Example usage:

For text-only LLMs: llama-cli -hf codestrate/Llama3.2-1B-Claude-Opus-Reasoning-Distill --jinja

Available Model files:

Llama-3.2-1B-Instruct.Q4_K_M.gguf

Ollama

An Ollama Modelfile is included for easy deployment. This model was trained 2x faster with Unsloth.

