---
language:
- en
license: other
pipeline_tag: text-generation
tags:
- llama
- chat
- sft
- reasoning
- cot
- ultrachat
- mixture-of-thoughts
- dpo
base_model: meta-llama/Llama-3.2-1B
library_name: transformers
---
# PursuitOfDataScience/llama3.2-1b-thinking
This repository contains a three-stage fine-tuned version of **meta-llama/Llama-3.2-1B**:
1. **Supervised fine-tuning (SFT)** on a local copy of **HuggingFaceH4/ultrachat_200k**
using an instruction-style, multi-turn chat objective.
2. **Reasoning training** on the `open-r1/Mixture-of-Thoughts` dataset,
   building on the SFT model to strengthen step-by-step reasoning.
3. **Direct Preference Optimization (DPO)** alignment using the `mlabonne/orpo-dpo-mix-40k` dataset
to improve response quality and alignment with human preferences.
## Model details
- **Base model**: `meta-llama/Llama-3.2-1B`
- **Stage 1 objective**: Supervised fine-tuning for helpful, concise chat responses
on Ultrachat-style conversations.
- **Stage 2 objective**: Specialized reasoning training to improve logical reasoning and
Chain of Thought (CoT) capabilities using step-by-step reasoning traces from `open-r1/Mixture-of-Thoughts`.
- **Stage 3 objective**: DPO alignment to refine responses based on preference data from `mlabonne/orpo-dpo-mix-40k`,
enhancing safety, helpfulness, and adherence to user constraints.
- **Context length**: Up to 131,072 tokens (subject to the base model configuration).
- **Training data**:
- SFT: multi-turn dialogues from `HuggingFaceH4/ultrachat_200k`.
- Reasoning: `open-r1/Mixture-of-Thoughts` dataset with step-by-step reasoning traces.
- DPO: preference pairs from `mlabonne/orpo-dpo-mix-40k`.
## Inference usage
The model is trained in a **chat-style** setup. At inference time, prompts are built
as a list of `messages` and passed through the model's native `chat_template`
via `tokenizer.apply_chat_template`:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
repo_id = "PursuitOfDataScience/llama3.2-1b-thinking"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
repo_id,
device_map="auto",
)
messages = [
{
"role": "system",
"content": (
"You are a helpful, concise assistant. "
"Write clear, well-structured answers that follow the user's constraints."
),
},
{
"role": "user",
"content": "Explain how someone can build a consistent daily learning habit.",
},
]
prompt_text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=512,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.eos_token_id,
temperature=0.7,
top_p=0.9,
do_sample=True,
)
# Decode only the generated continuation (excluding the prompt tokens)
generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(response)
```
### Multi-turn example
```python
messages = [
{
"role": "system",
"content": (
"You are a helpful, concise assistant. "
"Write clear, well-structured answers that follow the user's constraints."
),
},
{
"role": "user",
"content": "Describe the main trade-offs between using small and large language models.",
},
{
"role": "assistant",
"content": "Small models are cheaper and faster, while large models are usually more capable...",
},
{
"role": "user",
"content": "Give me a bullet-point summary from the perspective of a startup.",
},
]
prompt_text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=256,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.eos_token_id,
temperature=0.7,
top_p=0.9,
do_sample=True,
)
response = tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True,
)
print(response)
```
### Chain of Thought (CoT) reasoning example
For reasoning tasks, the model can generate step-by-step thoughts using `<think>` tags:
```python
messages = [
{
"role": "system",
"content": (
"You are a helpful, concise assistant. "
"Use Chain of Thought reasoning with <think> tags for complex problems."
),
},
{
"role": "user",
"content": "If a train travels 60 km in 1 hour, how long will it take to travel 180 km?",
},
]
prompt_text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=512,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.eos_token_id,
temperature=0.7,
top_p=0.9,
do_sample=True,
)
response = tokenizer.decode(
outputs[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True,
)
print(response)
# Example output: <think> The train travels 60 km in 1 hour, so speed is 60 km/h. For 180 km, time = distance / speed = 180 / 60 = 3 hours. </think> It will take 3 hours.
```
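For downstream use, it can be convenient to separate the `<think>` reasoning from the final answer. The helper below is a minimal sketch, not part of this repository; it assumes the literal `<think>...</think>` format shown in the example output above and that the tags survive decoding:

```python
import re

def split_thinking(text: str):
    """Split a response into (thought, answer); thought is None if no tags."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return None, text.strip()
    thought = match.group(1).strip()
    answer = text[match.end():].strip()
    return thought, answer

sample = "<think> 180 / 60 = 3 hours. </think> It will take 3 hours."
thought, answer = split_thinking(sample)
print(answer)  # It will take 3 hours.
```

If the chat template registers `<think>` as a special token, decode with `skip_special_tokens=False` so the tags are preserved for parsing.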
## Training pipeline (summary)
1. **Instruction SFT (Ultrachat)**:
- Conversations are converted into lists of `messages`.
- For each assistant turn, a single training example is built using
`tokenizer.apply_chat_template`.
- Loss is applied only on assistant tokens; system and user tokens are masked.
2. **Reasoning Training**:
- Fine-tuning on the `open-r1/Mixture-of-Thoughts` dataset with step-by-step reasoning traces to enhance CoT capabilities.
   - Uses supervised and/or reinforcement-learning-style objectives to align the model with logical, step-by-step reasoning patterns.
3. **DPO Alignment**:
- Fine-tuning with Direct Preference Optimization on the `mlabonne/orpo-dpo-mix-40k` dataset.
- Optimizes the model to prefer chosen responses over rejected ones, improving overall alignment.
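The assistant-only loss masking from step 1 can be sketched as follows. This is an illustrative toy example rather than the actual training code; it shows the standard Hugging Face convention of setting labels to `-100` so that system and user positions are ignored by the causal-LM loss (the token IDs and mask here are made up):

```python
# Label index ignored by PyTorch's cross-entropy loss (and thus by
# Hugging Face causal-LM training).
IGNORE_INDEX = -100

def build_labels(token_ids, assistant_mask):
    """Return labels in which only assistant tokens contribute to the loss."""
    return [tok if is_asst else IGNORE_INDEX
            for tok, is_asst in zip(token_ids, assistant_mask)]

token_ids = [11, 12, 13, 14, 15, 16]                    # system + user + assistant turn
assistant_mask = [False, False, False, True, True, True]  # True = assistant token
labels = build_labels(token_ids, assistant_mask)
print(labels)  # [-100, -100, -100, 14, 15, 16]
```

In practice the mask is derived from the chat template's turn boundaries; the principle is the same regardless of how the spans are located.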
## Limitations
- This is a relatively small (1B parameter) model and may hallucinate or
struggle on complex, multi-step reasoning tasks.
- Outputs may be inaccurate, unsafe, or biased. Always verify critical
information before using it in production.
|