---
library_name: transformers
tags: []
---

# MNLP M3 DPO Model — Qwen3-0.6B-Base Fine-Tuned with Direct Preference Optimization

This repository contains a Direct Preference Optimization (DPO) model built on top of the base model [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base), as part of the MNLP M3 project. The model is fine-tuned using a high-quality preference dataset to better align responses with human preferences.

## Model Description

- **Base Model**: [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base)
- **DPO Dataset**: [`Tandogan/MNLP_M3_dpo_dataset`](https://huggingface.co/datasets/Tandogan/MNLP_M3_dpo_dataset)
- **Libraries**: [Unsloth](https://github.com/unslothai/unsloth), [TRL](https://github.com/huggingface/trl)

## Training Procedure

### Direct Preference Optimization (DPO)

We started from the official `Qwen3-0.6B-Base` checkpoint and applied **Direct Preference Optimization (DPO)**.  
DPO lets us directly train the model to score preferred responses higher than less-preferred ones, using ranked human feedback.  
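
For reference, the standard DPO objective being minimized (over preference pairs \((x, y_w, y_l)\), where \(y_w\) is the preferred response and the frozen base model serves as the reference policy \(\pi_{\mathrm{ref}}\)) is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$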

#### 1. From Base Model (`Qwen3-0.6B-Base`)

- **Dataset**: [`Tandogan/MNLP_M3_dpo_dataset`](https://huggingface.co/datasets/Tandogan/MNLP_M3_dpo_dataset)
- **Max sequence length**: 2048 (prompt truncated to 1024)
- **Epochs**: 4  
- **Optimizer**: AdamW (learning rate = `2e-6`, weight decay = `0`)  
- **Precision**: bf16  
- **Batch size**: 2 (gradient accumulation = 4)  
- **Scheduler**: cosine with 1% warmup  
- **DPO Beta**: 0.1  
- **Eval & Checkpointing**: Every epoch  
- **Monitoring**: Weights & Biases (WandB)  
- **Best Epoch Selection**: Based on reward accuracy
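
The settings above map roughly onto TRL's `DPOConfig` as in the sketch below. This is an illustrative reconstruction rather than the exact training script: argument names vary between TRL and `transformers` releases, and the dataset split names are assumptions.

```python
# Minimal sketch of the training setup described above, using TRL's DPOTrainer.
# Keyword names differ slightly across TRL / transformers releases; treat them
# (and the assumed "validation" split) as illustrative.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B-Base")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")
dataset = load_dataset("Tandogan/MNLP_M3_dpo_dataset")

args = DPOConfig(
    output_dir="mnlp_m3_dpo_model",
    beta=0.1,                          # DPO beta
    learning_rate=2e-6,
    weight_decay=0.0,
    num_train_epochs=4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    bf16=True,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,                 # 1% warmup
    max_length=2048,                   # full sequence length
    max_prompt_length=1024,            # prompt truncation
    eval_strategy="epoch",             # "evaluation_strategy" in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_rewards/accuracies",  # reward accuracy logged by DPOTrainer
    report_to="wandb",
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset.get("validation"),  # assumes a validation split exists
    processing_class=tokenizer,              # "tokenizer=" in older TRL versions
)
trainer.train()
```
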
## Evaluation

| Model                 | BLEU   | ROUGE-1/2/L/Lsum                  | METEOR | MMLU ± SD       | TruthfulQA MC1 ± SD | TruthfulQA MC2 ± SD | Reward Acc. ± SD |
|-----------------------|--------|-----------------------------------|--------|-----------------|---------------------|---------------------|------------------|
| **Qwen3-0.6B-Base**   | 0.1086 | 0.3282 / 0.1458 / 0.2187 / 0.2964 | 0.2406 | 0.5239 ± 0.0365 | 0.2938 ± 0.0159     | 0.4589 ± 0.0148     | 0 ± 0            |
| **Qwen3-0.6B**        | 0.0649 | 0.2488 / 0.0876 / 0.1617 / 0.2224 | 0.2146 | 0.4156 ± 0.0361 | 0.2717 ± 0.0156     | 0.4284 ± 0.0145     | 0.4226 ± 0.0088  |
| **MNLP M3 DPO Model** | 0.1343 | 0.3608 / 0.1634 / 0.2345 / 0.3283 | 0.2718 | 0.5264 ± 0.0364 | 0.3023 ± 0.0161     | 0.4682 ± 0.0149     | 0.6997 ± 0.0082  |


## Intended Use

This model is intended for research and experimentation with preference-based alignment and reward modeling.

## How to Use

You can load the model with the `transformers` library for inference or evaluation:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Fall back to CPU if no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained("Tandogan/MNLP_M3_dpo_model").to(device)
tokenizer = AutoTokenizer.from_pretrained("Tandogan/MNLP_M3_dpo_model")

prompt = "Explain recursion in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```