---
library_name: transformers
license: apache-2.0
datasets:
- Mehdi-Zogh/MNLP_M2_dpo_dataset
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen3-0.6B-Base
pipeline_tag: text-generation
---
# Model Card for Qwen3-0.6B-MNLP-DPO
This model is a Direct Preference Optimization (DPO) fine-tuned version of [Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base) using the [`Mehdi-Zogh/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset). The goal was to improve the alignment of the base model's outputs with human preferences for educational assistance use cases.
---
## Model Details
### Model Description
This model was fine-tuned with the DPO (Direct Preference Optimization) algorithm on top of Qwen3-0.6B-Base. The dataset used for preference learning consists of prompts paired with preferred and rejected responses, teaching the model to generate more helpful and appropriate answers in instructional contexts.
- **Developed by:** Mehdi Zoghlami
- **Model type:** Causal Language Model
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** [Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base)
- **Dataset:** [Mehdi-Zogh/MNLP_M2_dpo_dataset](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset)
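For context, DPO (Rafailov et al., 2023) optimizes the policy $\pi_\theta$ directly on preference triples $(x, y_w, y_l)$ (prompt, preferred response, rejected response), increasing the likelihood of preferred responses relative to a frozen reference policy $\pi_{\mathrm{ref}}$ (here, the Qwen3-0.6B-Base checkpoint):

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

where $\beta$ controls how far the fine-tuned policy may drift from the reference (the value of $\beta$ used for this model is not reported).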
---
## Uses
### Direct Use
This model is trained to act as an AI tutor specialized in EPFL course content.
### Downstream Use
It can serve as a base model for further alignment, personalization, or integration into interactive educational platforms or tutoring systems.
### Out-of-Scope Use
- Not recommended for use in high-stakes settings.
- Not intended for use outside the English language.
- Not intended for generating factual or up-to-date information (the base model was not trained for retrieval-based tasks).
---
## Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mehdi-Zogh/MNLP_M2_dpo_model"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# prepare the model input
prompt = "explain gradient descent in simple terms."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # switches between thinking and non-thinking modes; default is True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# split the thinking content from the final answer
try:
    # find the last occurrence of token 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```
## Training Details
### Training Data
The training data is the [Mehdi-Zogh/MNLP_M2_dpo_dataset](https://huggingface.co/datasets/Mehdi-Zogh/MNLP_M2_dpo_dataset), which contains instructional prompts with ranked preferred and rejected completions. The dataset is specifically designed for alignment research using preference optimization methods.
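A quick way to inspect the data is sketched below; the split name and the `prompt`/`chosen`/`rejected` column names follow the usual `trl` preference-data convention and are assumptions, not documented fields of this dataset.

```python
from datasets import load_dataset

# Assumed split/column names; check the dataset card for the actual schema.
ds = load_dataset("Mehdi-Zogh/MNLP_M2_dpo_dataset", split="train")
print(ds[0])  # expected keys: "prompt", "chosen", "rejected"
```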
### Training Procedure
The model was fine-tuned using `trl`'s `DPOTrainer` with the hyperparameters below; a minimal reproduction sketch follows the table.
#### Training Hyperparameters
| Hyperparameter | Value |
|----------------------------|------------------|
| Learning rate | 1e-5 |
| Epochs | 3 |
| Per-device train batch size| 1 |
| Per-device eval batch size | 1 |
| Gradient accumulation steps| 4 |
| Precision | bf16 |
| Early stopping patience | 3 |
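The sketch below wires these hyperparameters into `trl`'s `DPOTrainer`. The exact training script was not published, so the split names, evaluation/save strategy, and `output_dir` are assumptions; it also assumes a recent `trl`/`transformers` (older `trl` versions take `tokenizer=` instead of `processing_class=`).

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, EarlyStoppingCallback
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B-Base")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")

dataset = load_dataset("Mehdi-Zogh/MNLP_M2_dpo_dataset")
train_ds, eval_ds = dataset["train"], dataset["validation"]  # hypothetical split names

config = DPOConfig(
    output_dir="qwen3-0.6b-mnlp-dpo",  # hypothetical
    learning_rate=1e-5,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    eval_strategy="epoch",             # assumed; evaluation is needed for early stopping
    save_strategy="epoch",
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    processing_class=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```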
## Evaluation
320 samples from the dataset were held out for validation.
### Testing Data, Factors & Metrics
#### Testing Data
The model was tested on the [zechen-nlp/MNLP_dpo_demo](https://huggingface.co/datasets/zechen-nlp/MNLP_dpo_demo) dataset.
#### Metrics
- **Accuracy of Preference:** Measures how often the model ranks the preferred response above the rejected one in held-out validation pairs.
- This is a standard metric in DPO training to evaluate how well the model aligns with human preferences.
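As a rough illustration, preference accuracy can be estimated by scoring each (prompt, chosen, rejected) triple with the summed token log-probabilities the model assigns to each completion and counting how often the chosen one scores higher. Note this sketch uses the policy's raw log-probabilities only; the accuracy `trl` reports during DPO training is computed from implicit rewards that also involve the reference model. All names here are illustrative.

```python
import torch

@torch.no_grad()
def completion_logprob(model, tokenizer, prompt, completion):
    """Summed log-probability of `completion` given `prompt` under `model`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[:, :-1]  # position i predicts token i + 1
    targets = full_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1).gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Score only the completion tokens (assumes the prompt tokenizes to a clean prefix).
    return logps[:, prompt_len - 1:].sum().item()

def preference_accuracy(model, tokenizer, triples):
    """`triples` is a list of (prompt, chosen, rejected) string tuples."""
    wins = sum(
        completion_logprob(model, tokenizer, p, c) > completion_logprob(model, tokenizer, p, r)
        for p, c, r in triples
    )
    return wins / len(triples)
```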
### Results
- The model achieved a **preference accuracy of 84% ± 5.2%** on the test set.
- This indicates strong alignment between the model's outputs and the preferred responses provided in the dataset.