---
library_name: transformers
license: apache-2.0
datasets:
- Mehdi-Zogh/MNLP_M2_dpo_dataset
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-0.6B-Base
pipeline_tag: text-generation
---

# Model Card for Qwen3-0.6B-MNLP-DPO

This model is a Direct Preference Optimization (DPO) fine-tuned version of [Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base) using the [`Mehdi-Zogh/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset). The goal was to improve the alignment of the base model's outputs with human preferences for educational assistance use cases.

---

## Model Details

### Model Description

This model was fine-tuned with the DPO algorithm on top of Qwen3-0.6B-Base. The preference-learning dataset consists of instructional prompts paired with preferred and rejected responses, teaching the model to generate more helpful, appropriate, and preferred answers in instructional contexts.
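
For reference, the DPO objective can be summarized with the following minimal sketch (not the exact training code used here). The log-probabilities are summed over response tokens, and `beta` is the usual DPO temperature; its value below is an assumption, as it is not listed in the hyperparameters.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: push the policy to prefer the chosen response
    over the rejected one by a larger margin than the reference model does."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
```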

- **Developed by:** Mehdi Zoghlami
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** [Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base)
- **Dataset:** [Mehdi-Zogh/MNLP_M2_dpo_dataset](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset)

---

## Uses

### Direct Use

This model is intended to serve as an AI tutor specialized in EPFL course content.

### Downstream Use

It can serve as a base model for further alignment, personalization, or integration into interactive educational platforms or tutoring systems.

### Out-of-Scope Use

- Not recommended for use in high-stakes settings.
- Not intended for use outside the English language.
- Not intended for generating factual or up-to-date information (base model was not trained for retrieval-based tasks).

---

## Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mehdi-Zogh/MNLP_M2_dpo_model"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "explain gradient descent in simple terms."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

```


## Training Details

### Training Data

The training data is the [Mehdi-Zogh/MNLP_M2_dpo_dataset](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset), which contains instructional prompts with ranked preferred and rejected completions. The dataset is specifically designed for alignment research using preference optimization methods.


### Training Procedure

The model was fine-tuned using `trl`'s `DPOTrainer`; a configuration sketch is shown after the hyperparameter table below.


#### Training Hyperparameters


| Hyperparameter              | Value            |
|----------------------------|------------------|
| Learning rate              | 1e-5             |
| Epochs                     | 3                |
| Per-device train batch size| 1                |
| Per-device eval batch size | 1                |
| Gradient accumulation steps| 4                |
| Precision                  | bf16             |
| Early stopping patience    | 3                |



## Evaluation

320 samples from the dataset were held out for validation.

### Testing Data, Factors & Metrics

#### Testing Data

The model was tested on [zechen-nlp/MNLP_dpo_demo](https://huggingface.co/datasets/zechen-nlp/MNLP_dpo_demo).


#### Metrics

- **Preference accuracy:** the fraction of held-out validation pairs for which the model ranks the preferred response above the rejected one. This is the standard metric reported during DPO training to gauge alignment with human preferences (a short computation sketch follows).
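
A minimal sketch of how this metric is typically computed, using the implicit DPO rewards (it mirrors the `rewards/accuracies` value logged by `trl`; the function and variable names are illustrative):

```python
def preference_accuracy(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps):
    """Fraction of pairs whose implicit DPO reward for the chosen response
    exceeds the reward for the rejected response.

    Inputs are 1-D tensors of per-example log-probabilities summed over
    response tokens, under the policy and the reference model respectively.
    """
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    return (chosen_rewards > rejected_rewards).float().mean().item()
```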

### Results

- The model achieved a **preference accuracy of 84% ± 5.2%** on the test set.
- This indicates strong alignment between the model's outputs and the preferred responses provided in the dataset.