A newer version of this model is available: codestrate/Llama3.2-3B-Claude-Reasoning-Distill

Llama 3.2 3B — Claude Reasoning Distill (Adapter)

Note: this repository contains only the LoRA adapter. It must be loaded on top of the base model to work.

An updated attempt at distilling Claude Opus 4.6/4.7 reasoning traces into a small-form-factor model. The predecessor, Llama 3.2 1B Claude Opus Reasoning Distill, demonstrated that a 1B model could adopt <think> blocks, but it suffered from echolalia (repeating its own output) and a GSM8K regression. This run addresses the two root causes identified in that experiment:

- Capacity: 3B sits closer to the parameter floor where structured reasoning adoption is viable, as seen in models like Gemma 4 E2B-IT and Qwen3-1.7B (which has <think> baked into pretraining).
- Token boundaries: <think> and </think> are registered as special tokens (vocab 128256 → 128258) with trained embeddings, giving the model a hard mode boundary instead of treating them as plain text.
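As a rough illustration of the token-boundary point above, the sketch below registers the two tags as special tokens and grows the embedding matrix to match. It assumes a Hugging Face tokenizer and model object are passed in; the helper name `register_think_tokens` is hypothetical, and this is not necessarily the exact procedure used for this run.

```python
THINK_TOKENS = ["<think>", "</think>"]


def register_think_tokens(tokenizer, model):
    """Add <think>/</think> as special tokens and resize the embeddings.

    `tokenizer` and `model` are assumed to be Hugging Face objects
    (for example from AutoTokenizer / AutoModelForCausalLM). Returns the
    number of tokens that were actually added to the vocabulary.
    """
    # special_tokens=True keeps each tag as a single, unsplittable token.
    added = tokenizer.add_tokens(THINK_TOKENS, special_tokens=True)
    # The new embedding rows are freshly initialised; they only become
    # meaningful once they have been trained.
    model.resize_token_embeddings(len(tokenizer))
    return added
```

Applied to a tokenizer with 128256 entries, this would yield the 128258-entry vocabulary described above.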

Benchmarks are not yet available. GSM8K and HumanEval evaluations against the 4-bit base Llama-3.2-3B-Instruct are in progress, along with further reasoning benchmarks such as ARC. Results will be added here when complete.

Model Details

Field	Value
Base model	`unsloth/Llama-3.2-3B-Instruct-bnb-4bit`
Model type	Causal LM — LoRA adapter (PEFT) on Llama-3.2-3B-Instruct
Language	English
License	Meta Llama 3.2 Community License
Training framework	Unsloth + TRL SFTTrainer
Hardware	Tesla T4 (Kaggle)
Max sequence length	2048

Intended Use

Generating step-by-step reasoning traces (<think> blocks) followed by final answers across a broad range of instruction-following tasks. Useful for studying how reasoning distillation scales to sub-4B models and how registered thinking tokens affect small-model behaviour.

Not intended for: production use, mathematical proofs requiring reliability, or replacing a larger reasoning model. Benchmark regressions vs base are expected until verified otherwise.
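For studying the traces, it can help to separate the reasoning block from the final answer. The following is a minimal sketch, assuming the output was decoded with the tags preserved (for example with `skip_special_tokens=False`, since the tags are registered as special tokens). The function name `split_reasoning` is hypothetical.

```python
import re

# Non-greedy match of the first reasoning block; DOTALL lets it span lines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from generated text.

    If no complete <think>...</think> block is found, reasoning is None
    and the whole text is returned as the answer.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer
```

Returning `None` for the reasoning makes it easy to count generations in which the model never closed its reasoning block.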

How to Get Started

From the adapter

This repository holds the LoRA adapter on its own. Load it on top of the base model; there is no need to download the full merged weights.

Important: load the tokenizer from the adapter repository, not the base model. The adapter tokenizer carries the correct 128258-token vocabulary with <think> and </think> included. Using the base model tokenizer (128256 tokens) will cause an embedding size mismatch.

from unsloth import FastLanguageModel
from transformers import AutoTokenizer, TextStreamer
from peft import PeftModel

ADAPTER_PATH = "codestrate/Llama3.2-3B-Claude-Reasoning-Distill-Adapter"  # this repository

model, _ = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    load_in_4bit=True,
    max_seq_length=2048,
)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)  # vocab=128258
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, ADAPTER_PATH)
FastLanguageModel.for_inference(model)

SYSTEM_PROMPT = "You are a helpful assistant. Think step by step inside <think>...</think> before giving your final answer."
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Write a Python function to check if a number is prime."},
]
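# add_generation_prompt=True appends the assistant header so the model starts a new reply
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")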

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(
    input_ids=inputs,
    streamer=streamer,
    max_new_tokens=1024,
    do_sample=True,  # sampling must be on for temperature and min_p to take effect
    temperature=0.7,
    min_p=0.1,
    repetition_penalty=1.3,
    no_repeat_ngram_size=6,
    use_cache=True,
)

From GGUF (Ollama / LM Studio)

A Modelfile is included for Ollama. For direct use:

ollama run hf.co/codestrate/Llama3.2-3B-Claude-Reasoning-Distill:Q4_K_M
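Because the model relies on the system prompt, a request sent to a local Ollama server may need to include it explicitly, in case the Modelfile does not already supply it. The sketch below is one way to do that with the standard library. It assumes Ollama is listening on its default local address and that the model tag matches the command above; the helper names `build_payload` and `ask` are hypothetical.

```python
import json
import urllib.request

# Assumptions: Ollama's default local address, and the model tag used in the
# `ollama run` command above.
OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "hf.co/codestrate/Llama3.2-3B-Claude-Reasoning-Distill:Q4_K_M"
SYSTEM_PROMPT = (
    "You are a helpful assistant. Think step by step inside "
    "<think>...</think> before giving your final answer."
)


def build_payload(question):
    """Build a non-streaming chat request that includes the system prompt."""
    return {
        "model": MODEL,
        "stream": False,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    }


def ask(question):
    """Send the request to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]
```

Calling `ask("Write a Python function to check if a number is prime.")` requires the server to be running and the model to be pulled first.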

Training Details

Dataset

angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k, instruct_train.jsonl split (full instruct + reasoning, ~7,700 examples). The data is already in OpenAI messages format, so each example was passed directly through apply_chat_template with no additional preprocessing.

The previous 1B run used only the coding + math categories (~2,000 examples). This run uses the full instruct split for broader coverage.
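The mapping step described above could look roughly like the sketch below. It assumes each record stores its conversation under a `messages` key (the field name is a guess based on the OpenAI messages format) and that `tokenizer` is a Hugging Face tokenizer with a chat template. The function name `to_training_text` is hypothetical, and the actual training script may differ.

```python
def to_training_text(example, tokenizer):
    """Render one record's conversation into a single training string.

    Assumes `example["messages"]` is a list of {"role", "content"} dicts
    and that `tokenizer` provides `apply_chat_template`.
    """
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,               # return a string, not token IDs
        add_generation_prompt=False,  # the assistant turn is already present
    )
    return {"text": text}
```

With the `datasets` library, this would typically be applied as `dataset.map(to_training_text, fn_kwargs={"tokenizer": tokenizer})`.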

Hyperparameters

Parameter	Value
LoRA Rank / Alpha	32 / 64
Target Modules	All
Sequence Length	2048
Effective Batch	16 (2 × grad_accum 8)
Steps	904 (~2 epochs)
Learning Rate	1e-4 / cosine
Warmup Steps	50
Optimizer	adamw_8bit
Weight Decay	0.01
Precision	bfloat16
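The batch and step figures in the table can be cross-checked with a few lines of arithmetic. This is only a sanity check under the assumption that "~2 epochs" means exactly two equal epochs; the helper `steps_for` is hypothetical, and trainers differ in how they count a partial final batch.

```python
import math

PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
TOTAL_STEPS = 904
EPOCHS = 2  # "~2 epochs" in the table, treated as exactly 2 here

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM   # examples per optimizer step
steps_per_epoch = TOTAL_STEPS // EPOCHS           # optimizer steps in one epoch
examples_seen = TOTAL_STEPS * effective_batch     # examples processed in total
# Upper bound on examples per epoch if every step used a full batch.
max_examples_per_epoch = steps_per_epoch * effective_batch


def steps_for(num_examples, epochs=EPOCHS):
    """Optimizer steps for a dataset size, counting a partial last batch."""
    return math.ceil(num_examples / effective_batch) * epochs


print(effective_batch, steps_per_epoch, examples_seen, max_examples_per_epoch)
```

Under these assumptions an epoch covers at most 7,232 examples, somewhat fewer than the ~7,700 quoted for the split. The source does not explain the difference.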

Loss Curve

The loss-curve plot is available in the merged quant repo. The logged values are listed below.

Step	Loss	Step	Loss	Step	Loss
50	2.1372	350	1.8798	650	1.7567
100	1.9597	400	1.8512	700	1.7530
150	1.9251	450	1.8493	750	1.7391
200	1.8972	500	1.7670	800	1.7709
250	1.8891	550	1.7707	850	1.7401
300	1.8738	600	1.7668	900	1.7598

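Drop: 2.14 → 1.74 (~0.40 absolute, measured from step 50 to the lowest logged value at step 750). There is a visible cross-epoch improvement at step ~452 (−0.082 between the step 450 and step 500 logs). The loss plateaus in epoch 2 from step 750 onward, fluctuating between 1.74 and 1.77, which suggests a third epoch would have added little on this dataset.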
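The figures quoted above can be reproduced from the logged values. The snippet below is a small check that copies the numbers from the table; the variable names are illustrative only.

```python
# Logged training loss, keyed by step (values copied from the table above).
LOSS = {
    50: 2.1372, 100: 1.9597, 150: 1.9251, 200: 1.8972, 250: 1.8891,
    300: 1.8738, 350: 1.8798, 400: 1.8512, 450: 1.8493, 500: 1.7670,
    550: 1.7707, 600: 1.7668, 650: 1.7567, 700: 1.7530, 750: 1.7391,
    800: 1.7709, 850: 1.7401, 900: 1.7598,
}

first_loss = LOSS[50]
best_step = min(LOSS, key=LOSS.get)   # step with the lowest logged loss
best_loss = LOSS[best_step]
total_drop = round(first_loss - best_loss, 4)

# Change across the epoch boundary (step ~452 falls between these two logs).
epoch_boundary_delta = round(LOSS[500] - LOSS[450], 4)

# Spread of the logged values from step 750 onward, i.e. the plateau.
tail = [loss for step, loss in LOSS.items() if step >= 750]
plateau_spread = round(max(tail) - min(tail), 4)

print(best_step, best_loss, total_drop, epoch_boundary_delta, plateau_spread)
```

The lowest logged value falls at step 750, and the last four logs stay within roughly 0.03 of each other, which is the basis for reading epoch 2 as a plateau.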

Known Limitations

- Benchmarks not yet available: results will be added when the evaluation runs complete.
- Echolalia / repetition: reduced vs the 1B run due to special token boundaries, but not eliminated. repetition_penalty=1.3 and no_repeat_ngram_size=6 are recommended at inference, though these settings need more testing.
- System prompt required: without the <think>...</think> contract in the system prompt, the model may not cleanly transition from reasoning block to final answer.
- Not a production model: a research artefact studying reasoning distillation at sub-4B scale.
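When testing the repetition settings mentioned above, a simple diagnostic is the share of n-grams in an output that repeat an earlier n-gram. The sketch below is one possible measure, not an established benchmark. The default `n=6` mirrors `no_repeat_ngram_size=6`, and the function name `repeated_ngram_rate` is hypothetical. It works on any token sequence, such as token IDs or whitespace-split words.

```python
from collections import Counter


def repeated_ngram_rate(tokens, n=6):
    """Fraction of n-grams in `tokens` that duplicate an earlier n-gram.

    0.0 means no n-gram occurs twice; values near 1.0 indicate heavy looping.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)
```

Comparing this rate with and without the recommended penalties on the same prompts gives a rough sense of how much echolalia remains.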

Available Files

File	Format	Use
`Llama-3.2-3B-Claude-Reasoning-Distill.Q4_K_M.gguf`	GGUF Q4_K_M	LM Studio / Ollama (recommended)
`Llama-3.2-3B-Claude-Reasoning-Distill.Q8_0.gguf`	GGUF Q8	Higher fidelity inference (near lossless; still lightweight)
`Llama-3.2-3B-Claude-Reasoning-Distill.F16.gguf`	GGUF F16	Unquantized 16-bit float GGUF
Adapter (This Repository)	LoRA adapter	PEFT inference / further fine-tuning

Framework Versions

Python 3.12.13
Unsloth 2026.5.8
PEFT 0.19.1
TRL 0.24.0
PyTorch 2.10.0+cu128
Transformers 4.47.1

Predecessor: Llama3.2-1B-Claude-Opus-Reasoning-Distill
Trained 2x faster with Unsloth
