---
language:
  - en
license: other
pipeline_tag: text-generation
tags:
  - llama
  - chat
  - sft
  - reasoning
  - cot
  - ultrachat
  - mixture-of-thoughts
  - dpo
base_model: meta-llama/Llama-3.2-1B
library_name: transformers
---

# PursuitOfDataScience/llama3.2-1b-thinking

This repository contains a three-stage fine-tuned version of **meta-llama/Llama-3.2-1B**:

1. **Supervised fine-tuning (SFT)** on a local copy of **HuggingFaceH4/ultrachat_200k**
   using an instruction-style, multi-turn chat objective.
2. **Reasoning training** to enhance step-by-step reasoning capabilities,
   building on the SFT model using the `open-r1/Mixture-of-Thoughts` dataset.
3. **Direct Preference Optimization (DPO)** alignment using the `mlabonne/orpo-dpo-mix-40k` dataset
   to improve response quality and alignment with human preferences.

## Model details

- **Base model**: `meta-llama/Llama-3.2-1B`
- **Stage 1 objective**: Supervised fine-tuning for helpful, concise chat responses
  on Ultrachat-style conversations.
- **Stage 2 objective**: Specialized reasoning training to improve logical reasoning and
  Chain of Thought (CoT) capabilities using step-by-step reasoning traces from `open-r1/Mixture-of-Thoughts`.
- **Stage 3 objective**: DPO alignment to refine responses based on preference data from `mlabonne/orpo-dpo-mix-40k`,
  enhancing safety, helpfulness, and adherence to user constraints.
- **Context length**: Up to 131072 tokens (subject to the base model config).
- **Training data**:
  - SFT: multi-turn dialogues from `HuggingFaceH4/ultrachat_200k`.
  - Reasoning: `open-r1/Mixture-of-Thoughts` dataset with step-by-step reasoning traces.
  - DPO: preference pairs from `mlabonne/orpo-dpo-mix-40k`.

## Inference usage

The model is trained in a **chat-style** setup. At inference time, prompts are built
as a list of `messages` and passed through the model's native `chat_template`
via `tokenizer.apply_chat_template`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "PursuitOfDataScience/llama3.2-1b-thinking"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful, concise assistant. "
            "Write clear, well-structured answers that follow the user's constraints."
        ),
    },
    {
        "role": "user",
        "content": "Explain how someone can build a consistent daily learning habit.",
    },
]

prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

# Decode only the generated continuation (excluding the prompt tokens)
generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(response)
```

### Multi-turn example

```python
# Reuses `tokenizer` and `model` loaded in the first example.
messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful, concise assistant. "
            "Write clear, well-structured answers that follow the user's constraints."
        ),
    },
    {
        "role": "user",
        "content": "Describe the main trade-offs between using small and large language models.",
    },
    {
        "role": "assistant",
        "content": "Small models are cheaper and faster, while large models are usually more capable...",
    },
    {
        "role": "user",
        "content": "Give me a bullet-point summary from the perspective of a startup.",
    },
]

prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```

### Chain of Thought (CoT) reasoning example

For reasoning tasks, the model can generate step-by-step thoughts using `<think>` tags:

```python
# Reuses `tokenizer` and `model` loaded in the first example.
messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful, concise assistant. "
            "Use Chain of Thought reasoning with <think> tags for complex problems."
        ),
    },
    {
        "role": "user",
        "content": "If a train travels 60 km in 1 hour, how long will it take to travel 180 km?",
    },
]

prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
# Illustrative output (sampling is enabled, so the actual text will vary):
# <think> The train travels 60 km in 1 hour, so speed is 60 km/h.
# For 180 km, time = distance / speed = 180 / 60 = 3 hours. </think>
# It will take 3 hours.
```
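
If the reasoning trace and the final answer need to be handled separately, the decoded text can be split on the `<think>` tags. The snippet below is a minimal sketch, not part of this repository: `split_reasoning` is a hypothetical helper, and it assumes the tags survive decoding as plain text and that at most one complete `<think>...</think>` span appears in the response.

```python
import re

# Hypothetical helper (not shipped with the model): separates the reasoning
# trace from the final answer. Assumes at most one <think>...</think> span.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is "" when no complete block is found."""
    match = THINK_RE.search(text)
    if match is None:
        # No closed <think> block (e.g. generation was truncated): treat
        # the whole text as the answer rather than guessing a boundary.
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Keep whatever surrounds the block as the user-facing answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample = (
    "<think> The train travels 60 km in 1 hour, so speed is 60 km/h. "
    "For 180 km, time = 180 / 60 = 3 hours. </think> It will take 3 hours."
)
reasoning, answer = split_reasoning(sample)
print(answer)
```

With a small `max_new_tokens`, generation may stop before `</think>` is emitted; in that case the sketch returns the raw text as the answer, which callers may want to detect and handle.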

## Training pipeline (summary)

1. **Instruction SFT (Ultrachat)**:
   - Conversations are converted into lists of `messages`.
   - For each assistant turn, a single training example is built using
     `tokenizer.apply_chat_template`.
   - Loss is applied only on assistant tokens; system and user tokens are masked.

2. **Reasoning Training**:
   - Fine-tuning on the `open-r1/Mixture-of-Thoughts` dataset with step-by-step reasoning traces to enhance CoT capabilities.
   - Uses reinforcement learning or supervised methods to align with logical reasoning patterns.

3. **DPO Alignment**:
   - Fine-tuning with Direct Preference Optimization on the `mlabonne/orpo-dpo-mix-40k` dataset.
   - Optimizes the model to prefer chosen responses over rejected ones, improving overall alignment.
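
Two ideas from the steps above can be sketched in plain Python: masking non-assistant tokens out of the SFT loss, and the per-pair DPO objective. This is an illustration under stated assumptions, not the training code used for this model. The helper names, the boolean `assistant_mask` input, and `beta=0.1` are hypothetical; the only external fact relied on is that `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_non_assistant(input_ids, assistant_mask):
    """Build labels from input_ids, hiding system/user positions from the loss.

    assistant_mask is a hypothetical per-token list of booleans marking
    which positions belong to an assistant turn.
    """
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have equal length")
    return [
        tok if is_assistant else IGNORE_INDEX
        for tok, is_assistant in zip(input_ids, assistant_mask)
    ]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (policy margin - reference margin))).

    Inputs are summed log-probabilities of the chosen and rejected responses
    under the trained policy and the frozen reference model.
    beta=0.1 is an assumed value, not one reported for this model.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    x = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow in exp.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Toy usage: the first two tokens are prompt tokens, the last two are assistant tokens.
labels = mask_non_assistant([5, 6, 7, 8], [False, False, True, True])
print(labels)
```

The DPO loss shrinks as the policy assigns relatively more probability to the chosen response than the reference model does, which is the "prefer chosen over rejected" behavior described in step 3.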

## Limitations

- This is a relatively small (1B-parameter) model and may hallucinate or
  struggle on complex, multi-step reasoning tasks.
- Outputs may be inaccurate, unsafe, or biased. Always verify critical
  information before using it in production.