How to use justinj92/Llama-3.2-3B-Instruct-Thinking with libraries and local apps.

Libraries

How to use justinj92/Llama-3.2-3B-Instruct-Thinking with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="justinj92/Llama-3.2-3B-Instruct-Thinking")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("justinj92/Llama-3.2-3B-Instruct-Thinking")
model = AutoModelForCausalLM.from_pretrained("justinj92/Llama-3.2-3B-Instruct-Thinking")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use justinj92/Llama-3.2-3B-Instruct-Thinking with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "justinj92/Llama-3.2-3B-Instruct-Thinking"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "justinj92/Llama-3.2-3B-Instruct-Thinking",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/justinj92/Llama-3.2-3B-Instruct-Thinking

SGLang

How to use justinj92/Llama-3.2-3B-Instruct-Thinking with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "justinj92/Llama-3.2-3B-Instruct-Thinking" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "justinj92/Llama-3.2-3B-Instruct-Thinking",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "justinj92/Llama-3.2-3B-Instruct-Thinking" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "justinj92/Llama-3.2-3B-Instruct-Thinking",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use justinj92/Llama-3.2-3B-Instruct-Thinking with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for justinj92/Llama-3.2-3B-Instruct-Thinking to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for justinj92/Llama-3.2-3B-Instruct-Thinking to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for justinj92/Llama-3.2-3B-Instruct-Thinking to start chatting

Load model with FastModel

# Install Unsloth first (shell command): pip install unsloth
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="justinj92/Llama-3.2-3B-Instruct-Thinking",
    max_seq_length=2048,
)

Docker Model Runner
How to use justinj92/Llama-3.2-3B-Instruct-Thinking with Docker Model Runner:
```
docker model run hf.co/justinj92/Llama-3.2-3B-Instruct-Thinking
```

Model Card for Llama-3.2-3B-Instruct-Thinking

This model was trained using TRL and Unsloth.

Evals

| Model | GSM8K 0-shot | GSM8K few-shot |
| --- | --- | --- |
| Mistral-7B-v0.1 | 10 | 41 |
| Llama-3.2-3B-Instruct-Thinking | 31.61 | 54.51 |

Training procedure

Trained on a single H100 (96 GB) GPU on Azure Cloud (North Europe). This release is checkpoint 3200; after that point, the model's accuracy began to drop across the reward functions.

This model was trained with GRPO, a method introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
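
The core idea of GRPO is to score each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. The snippet below is a minimal sketch of that group-relative normalization only. The function name, the `eps` constant, and the use of population standard deviation are assumptions made for illustration; they do not describe the exact training setup of this model.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps and the population std are assumptions of this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical completions for one prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions that score above the group mean receive a positive advantage and the rest a negative one. When every completion in a group receives the same reward, all advantages are zero and the group contributes no learning signal.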

System Prompt

Set a system prompt to establish the tone and guidelines for responses; otherwise, the model falls back to default behavior that may not be what you want.

Recommended System Prompt:

A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively,
i.e., <think> reasoning process here </think><answer> answer here </answer>
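
As a rough illustration of how the recommended prompt might be used, the sketch below builds a chat message list that includes it and splits a response into its reasoning and answer parts. The helper names `build_messages` and `parse_response` are hypothetical, and the sample response is hand-written rather than real model output.

```python
import re

# The recommended system prompt, copied from the text above.
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it.\n"
    "The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.\n"
    "The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively,\n"
    "i.e., <think> reasoning process here </think><answer> answer here </answer>"
)

# Matches one <think>...</think> block followed by one <answer>...</answer> block.
_TAG_PATTERN = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def build_messages(question):
    """Return a chat message list with the system prompt placed first."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]


def parse_response(text):
    """Return (reasoning, answer), or (None, None) if the tags are missing."""
    match = _TAG_PATTERN.search(text)
    if match is None:
        return None, None
    return match.group(1).strip(), match.group(2).strip()


messages = build_messages("What is 12 * 7?")
sample = "<think> 12 * 7 = 84 </think><answer> 84 </answer>"  # hand-written example
reasoning, answer = parse_response(sample)
print(reasoning, "|", answer)
```

The resulting `messages` list can be passed to `pipe(...)` or `tokenizer.apply_chat_template(...)` in place of the single user message used in the Transformers snippets. Because the model may not always emit well-formed tags, the parser returns `(None, None)` instead of raising an error.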

Usage Recommendations

We recommend the following settings when using this model, including for benchmarking, to achieve the expected performance:

Set the temperature in the range 0.5-0.7 (0.6 is recommended) to prevent endless repetition or incoherent output.
When evaluating model performance, run multiple tests and average the results.
This model has not been tuned for domains other than mathematics.
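
The temperature and averaging recommendations above can be sketched as follows. `GENERATION_KWARGS` uses standard `generate` arguments, but the `max_new_tokens` value is an assumption. `average_over_runs` and `fake_evaluate` are hypothetical helpers, and the scores they produce are placeholders, not measured results.

```python
import statistics

# Sampling settings following the recommendation above.
# max_new_tokens is an assumed value; adjust it to your task.
GENERATION_KWARGS = {"do_sample": True, "temperature": 0.6, "max_new_tokens": 512}


def average_over_runs(evaluate_once, n_runs=3):
    """Call evaluate_once(seed) n_runs times and return (mean, sample stdev)."""
    scores = [evaluate_once(seed) for seed in range(n_runs)]
    spread = statistics.stdev(scores) if n_runs > 1 else 0.0
    return statistics.fmean(scores), spread


def fake_evaluate(seed):
    """Stand-in for a real benchmark run; returns placeholder accuracies."""
    return [0.4, 0.5, 0.6][seed % 3]


mean_score, spread = average_over_runs(fake_evaluate)
print(f"mean={mean_score:.3f} stdev={spread:.3f}")
```

In a real evaluation, `evaluate_once` would seed the sampler, call something like `model.generate(**inputs, **GENERATION_KWARGS)` over the benchmark, and return the accuracy for that run. Reporting the spread alongside the mean shows how much sampling noise affects the score.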

Framework versions

TRL: 0.15.0.dev0
Transformers: 4.49.0.dev0
PyTorch: 2.5.1
Datasets: 3.2.0
Tokenizers: 0.21.0

Citations

Cite Unsloth as:

@software{unsloth,
  author = {Daniel Han and Michael Han and {Unsloth team}},
  title = {Unsloth},
  url = {http://github.com/unslothai/unsloth},
  year = {2023}
}

Cite GRPO as:

@article{zhihong2024deepseekmath,
    title        = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
    author       = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year         = 2024,
    eprint       = {arXiv:2402.03300},
}

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}