---
library_name: transformers
license: apache-2.0
datasets:
- Mehdi-Zogh/MNLP_M2_dpo_dataset
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-0.6B-Base
pipeline_tag: text-generation
---
# Model Card for Qwen3-0.6B-MNLP-DPO
This model is a Direct Preference Optimization (DPO) fine-tuned version of [Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base) using the [`Mehdi-Zogh/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset). The goal was to improve the alignment of the base model's outputs with human preferences for educational assistance use cases.
---
## Model Details
### Model Description
This model was fine-tuned with the DPO (Direct Preference Optimization) algorithm on top of Qwen3-0.6B-Base. The preference dataset pairs each prompt with a preferred (chosen) and a rejected response, teaching the model to generate more helpful and appropriate answers in instructional contexts.
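For reference, DPO optimizes the standard preference objective of Rafailov et al. (2023): given a prompt $x$ with preferred response $y_w$ and rejected response $y_l$, the policy $\pi_\theta$ is trained against the frozen reference $\pi_{\mathrm{ref}}$ (here, the base model), with $\beta$ controlling how far the policy may drift from the reference:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$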
- **Developed by:** Mehdi Zoghlami
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** [Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base)
- **Dataset:** [Mehdi-Zogh/MNLP_M2_dpo_dataset](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset)
---
## Uses
### Direct Use
This model is trained to act as an AI tutor specialized in EPFL course content.
### Downstream Use
It can serve as a base model for further alignment, personalization, or integration into interactive educational platforms or tutoring systems.
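As an illustration of such an integration, a minimal multi-turn tutoring loop might look like the sketch below. This is illustrative only; generation settings mirror the Get Started section that follows.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mehdi-Zogh/MNLP_M2_dpo_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = []  # running chat history across turns
while True:
    question = input("student> ")
    if not question:
        break
    messages.append({"role": "user", "content": question})
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    reply = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    messages.append({"role": "assistant", "content": reply})
    print("tutor>", reply)
```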
### Out-of-Scope Use
- Not recommended for use in high-stakes settings.
- Not intended for use outside the English language.
- Not intended for generating factual or up-to-date information (base model was not trained for retrieval-based tasks).
---
## Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Mehdi-Zogh/MNLP_M2_dpo_model"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
# prepare the model input
prompt = "explain gradient descent in simple terms."
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# parsing thinking content
try:
# rindex finding 151668 (</think>)
index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
index = 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print("thinking content:", thinking_content)
print("content:", content)
```
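The snippet above runs the model in Qwen3's thinking mode. When the intermediate reasoning trace is not needed, the same chat template can be applied with `enable_thinking=False` (a standard Qwen3 template option); the shorter `max_new_tokens` below is an illustrative choice, not a recommendation:
```python
# Reuses `tokenizer`, `model`, and `messages` from the snippet above.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # plain answers without a <think>...</think> block
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
print(tokenizer.decode(
    generated_ids[0][model_inputs.input_ids.shape[1]:],
    skip_special_tokens=True
))
```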
## Training Details
### Training Data
The training data is the [Mehdi-Zogh/MNLP_M2_dpo_dataset](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset), which contains instructional prompts with ranked preferred and rejected completions. The dataset is specifically designed for alignment research using preference optimization methods.
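To inspect the data directly, the dataset can be loaded as shown below; the column names in the comment follow the standard `trl` preference format and are an assumption here, not confirmed from the dataset card:
```python
from datasets import load_dataset

ds = load_dataset("Mehdi-Zogh/MNLP_M2_dpo_dataset", split="train")
print(ds.column_names)  # assumed: ["prompt", "chosen", "rejected"]
print(ds[0])            # one prompt with its preferred and rejected completion
```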
### Training Procedure
The model was fine-tuned using `trl`'s `DPOTrainer`; a reconstruction sketch follows the hyperparameter table below.
#### Training Hyperparameters
| Hyperparameter | Value |
|----------------------------|------------------|
| Learning rate | 1e-5 |
| Epochs | 3 |
| Per-device train batch size| 1 |
| Per-device eval batch size | 1 |
| Gradient accumulation steps| 4 |
| Precision | bf16 |
| Early stopping patience | 3 |
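Putting the table together, a minimal sketch of the training setup might look as follows. This is a reconstruction from the reported hyperparameters, not the published training script; argument names follow recent `trl`/`transformers` APIs, and details such as `output_dir`, the evaluation split name, and the early-stopping metric are assumptions:
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, EarlyStoppingCallback
from trl import DPOConfig, DPOTrainer

base_model = "Qwen/Qwen3-0.6B-Base"
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)
dataset = load_dataset("Mehdi-Zogh/MNLP_M2_dpo_dataset")

config = DPOConfig(
    output_dir="qwen3-0.6b-mnlp-dpo",  # assumed
    learning_rate=1e-5,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    eval_strategy="steps",             # required for early stopping
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss", # assumed
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset["train"],
    eval_dataset=dataset.get("validation"),  # split name assumed
    processing_class=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```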
## Evaluation
A held-out split of 320 samples from the dataset was used for validation.
### Testing Data, Factors & Metrics
#### Testing Data
The model was tested on [zechen-nlp/MNLP_dpo_demo](https://huggingface.co/datasets/zechen-nlp/MNLP_dpo_demo).
#### Metrics
- **Accuracy of Preference:** Measures how often the model ranks the preferred response above the rejected one in held-out validation pairs.
- This is a standard metric in DPO training for evaluating how well the model aligns with human preferences; a minimal computation sketch follows below.
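As a concrete illustration, here is a minimal sketch of how such a preference accuracy can be computed. This is a simplified variant that scores each pair by the policy's sequence log-probability alone (the reward accuracy reported by `DPOTrainer` additionally subtracts the reference model's log-probabilities); the helper names are hypothetical:
```python
import torch

def sequence_logprob(model, tokenizer, prompt, response):
    """Sum of token log-probs of `response` given `prompt`.
    Assumes the prompt tokenizes identically as a prefix of prompt + response."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob assigned to each actual next token ...
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # ... restricted to the response span.
    return token_lp[:, prompt_len - 1:].sum().item()

def preference_accuracy(model, tokenizer, pairs):
    """Fraction of (prompt, chosen, rejected) triples where chosen outscores rejected."""
    hits = sum(
        sequence_logprob(model, tokenizer, p, c) > sequence_logprob(model, tokenizer, p, r)
        for p, c, r in pairs
    )
    return hits / len(pairs)
```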
### Results
- The model achieved a **preference accuracy of 84% ± 5.2%** on the test set.
- This indicates strong alignment between the model's outputs and the preferred responses provided in the dataset. |