---
license: apache-2.0
tags:
- dpo
- unsloth
- trl
- qwen
- instruction-tuning
- preference-modeling
- mnlp
datasets:
- Tandogan/sft_dataset_final_train
- Tandogan/MNLP_M2_dpo_dataset
base_model: Qwen/Qwen3-0.6B-Base
inference: false
---

# MNLP M2 DPO Model — Qwen3-0.6B Fine-Tuned with Direct Preference Optimization

This repository contains a Direct Preference Optimization (DPO) model built on top of a supervised fine-tuned version of [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base), as part of the MNLP M2 project. The model is fine-tuned on a preference dataset to better align its responses with human preferences.

## Model Description

- **Base Model**: [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base)
- **SFT Checkpoint**: [`Tandogan/MNLP_M2_SFT`](https://huggingface.co/Tandogan/MNLP_M2_SFT)
- **DPO Dataset**: [`Tandogan/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Tandogan/MNLP_M2_dpo_dataset)
- **Libraries**: [Unsloth](https://github.com/unslothai/unsloth), [TRL](https://github.com/huggingface/trl)

## Training Procedure

Minimal reproduction sketches for both stages are included at the end of this card.

### Supervised Fine-Tuning (SFT)

- **Dataset**: [`Tandogan/sft_dataset_final_train`](https://huggingface.co/datasets/Tandogan/sft_dataset_final_train) (Alpaca-style prompt–completion pairs)
- **Max sequence length**: 2048
- **Epochs**: 4
- **Optimizer**: AdamW (learning rate = `3e-5`, weight decay = `0`)
- **Precision**: bf16
- **Batch size**: 2 (gradient accumulation = 4)
- **Scheduler**: Linear with 1% warmup
- **Eval & Checkpointing**: Every epoch

### Direct Preference Optimization (DPO)

Two DPO fine-tuning experiments were run:

#### 1. From Base Model (`Qwen3-0.6B-Base`)
#### 2. From SFT Model ([`Tandogan/MNLP_M2_SFT`](https://huggingface.co/Tandogan/MNLP_M2_SFT))

- **Dataset**: [`Tandogan/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Tandogan/MNLP_M2_dpo_dataset)
- **Max sequence length**: 2048 (prompt and completion each truncated to 1024 tokens)
- **Epochs**: 4
- **Optimizer**: AdamW (learning rate = `2e-6`, weight decay = `0`)
- **Precision**: bf16
- **Batch size**: 2 (gradient accumulation = 4)
- **Scheduler**: Cosine with 1% warmup
- **DPO Beta**: 0.1
- **Eval & Checkpointing**: Every epoch
- **Monitoring**: Weights & Biases (WandB)
- **Best Epoch Selection**: Based on validation loss

## Intended Use

This model is intended for research and experimentation with preference-based alignment and reward modeling. It is **not** production-ready and may produce hallucinated, biased, or unsafe outputs. Evaluate it carefully before relying on it for downstream tasks.

## How to Use

You can use the model with the `transformers` and `trl` libraries for inference or evaluation:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Tandogan/MNLP_M2_dpo_model").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("Tandogan/MNLP_M2_dpo_model")

prompt = "Explain recursion in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
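
## Training Sketches

The SFT stage described above can be approximated with TRL's `SFTTrainer`. This is a minimal sketch assembled from the hyperparameters listed in this card, not the exact training script: the actual run used Unsloth for faster training, while plain `transformers` loading is shown here for clarity, and the split name, `text` column, held-out eval split, and output directory are assumptions.

```python
# Minimal SFT sketch based on the hyperparameters listed in this card.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B-Base")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")

# Assumption: a single "train" split with a pre-formatted "text" column.
dataset = load_dataset("Tandogan/sft_dataset_final_train", split="train")
dataset = dataset.train_test_split(test_size=0.05, seed=42)  # hypothetical eval split

args = SFTConfig(
    output_dir="sft_out",                  # assumption
    dataset_text_field="text",             # assumption: Alpaca-style prompts rendered to text
    # Card lists a max sequence length of 2048; the matching SFTConfig field is
    # max_seq_length (older TRL) or max_length (newer TRL).
    num_train_epochs=4,
    learning_rate=3e-5,
    weight_decay=0.0,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="linear",
    warmup_ratio=0.01,                     # 1% warmup
    bf16=True,
    eval_strategy="epoch",                 # "evaluation_strategy" on older transformers/TRL
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,            # "tokenizer=" on older TRL versions
)
trainer.train()
```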
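
The DPO stage can be approximated with TRL's `DPOTrainer`, starting from the SFT checkpoint. Again a sketch under assumptions: the split names, the `prompt`/`chosen`/`rejected` column layout, and the output directory are not confirmed by this card, while the hyperparameters follow the DPO section above.

```python
# Minimal DPO sketch, initialized from the SFT checkpoint.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Tandogan/MNLP_M2_SFT")
tokenizer = AutoTokenizer.from_pretrained("Tandogan/MNLP_M2_SFT")

# Assumption: preference data with "prompt", "chosen", "rejected" columns.
dataset = load_dataset("Tandogan/MNLP_M2_dpo_dataset")

args = DPOConfig(
    output_dir="dpo_out",                  # assumption
    beta=0.1,
    max_length=2048,
    max_prompt_length=1024,                # prompt and completion each capped near 1024 tokens
    num_train_epochs=4,
    learning_rate=2e-6,
    weight_decay=0.0,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,                     # 1% warmup
    bf16=True,
    eval_strategy="epoch",                 # "evaluation_strategy" on older transformers/TRL
    save_strategy="epoch",
    load_best_model_at_end=True,           # pick the best epoch by validation loss
    report_to="wandb",
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                        # TRL builds the frozen reference from the policy
    args=args,
    train_dataset=dataset["train"],        # assumption: "train"/"validation" splits exist
    eval_dataset=dataset["validation"],
    processing_class=tokenizer,            # "tokenizer=" on older TRL versions
)
trainer.train()
```

With `load_best_model_at_end=True` and per-epoch evaluation and checkpointing, the Trainer restores the checkpoint with the lowest validation loss after training, which matches the best-epoch selection described above.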