---
license: apache-2.0
tags:
- dpo
- unsloth
- trl
- qwen
- instruction-tuning
- preference-modeling
- mnlp
datasets:
- Tandogan/sft_dataset_final_train
- Tandogan/MNLP_M2_dpo_dataset
base_model: Qwen/Qwen3-0.6B-Base
inference: false
---
# MNLP M2 DPO Model — Qwen3-0.6B Fine-Tuned with Direct Preference Optimization
This repository contains a Direct Preference Optimization (DPO) model built on top of a supervised fine-tuned version of [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base), as part of the MNLP M2 project. The model is fine-tuned using a high-quality preference dataset to better align responses with human preferences.
## Model Description
- **Base Model**: [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base)
- **SFT Checkpoint**: [`Tandogan/MNLP_M2_SFT`](https://huggingface.co/Tandogan/MNLP_M2_SFT)
- **DPO Dataset**: [`Tandogan/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Tandogan/MNLP_M2_dpo_dataset)
- **Libraries**: [Unsloth](https://github.com/unslothai/unsloth), [TRL](https://github.com/huggingface/trl)
## Training Procedure
### Supervised Fine-Tuning (SFT)
- **Dataset**: [`Tandogan/sft_dataset_final_train`](https://huggingface.co/datasets/Tandogan/sft_dataset_final_train)
(Alpaca-style prompt–completion pairs)
- **Max sequence length**: 2048
- **Epochs**: 4
- **Optimizer**: AdamW (learning rate = `3e-5`, weight decay = `0`)
- **Precision**: bf16
- **Batch size**: 2 (gradient accumulation = 4)
- **Scheduler**: Linear with 1% warmup
- **Eval & Checkpointing**: Every epoch
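
The SFT stage above can be approximated with TRL's `SFTTrainer`. The snippet below is a minimal sketch under the listed hyperparameters, not the exact training script: the output directory, the dataset split name, and the version-dependent argument names are assumptions.

```python
# Minimal sketch of the SFT stage with TRL (not the exact training script).
# Assumptions: the dataset exposes a "train" split in a format SFTTrainer
# understands, and argument names match a recent TRL release.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

base = "Qwen/Qwen3-0.6B-Base"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

train_dataset = load_dataset("Tandogan/sft_dataset_final_train", split="train")

args = SFTConfig(
    output_dir="qwen3-0.6b-sft",   # hypothetical path
    max_seq_length=2048,           # called max_length in some TRL versions
    num_train_epochs=4,
    learning_rate=3e-5,
    weight_decay=0.0,
    bf16=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="linear",
    warmup_ratio=0.01,
    save_strategy="epoch",         # per-epoch eval additionally needs an eval split
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,    # "tokenizer=" in older TRL versions
)
trainer.train()
```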
### Direct Preference Optimization (DPO)
Two DPO fine-tuning experiments were run:
#### 1. From Base Model (`Qwen3-0.6B-Base`)
#### 2. From SFT Model ([`Tandogan/MNLP_M2_SFT`](https://huggingface.co/Tandogan/MNLP_M2_SFT))
- **Dataset**: [`Tandogan/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Tandogan/MNLP_M2_dpo_dataset)
- **Max sequence length**: 2048 (prompt + completions truncated to 1024 each)
- **Epochs**: 4
- **Optimizer**: AdamW (learning rate = `2e-6`, weight decay = `0`)
- **Precision**: bf16
- **Batch size**: 2 (gradient accumulation = 4)
- **Scheduler**: Cosine with 1% warmup
- **DPO Beta**: 0.1
- **Eval & Checkpointing**: Every epoch
- **Monitoring**: Weights & Biases (WandB)
- **Best Epoch Selection**: Based on validation loss
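
In the standard DPO objective (Rafailov et al., 2023), `beta` scales the implicit reward, i.e. how strongly the policy is pushed away from the reference model (here the SFT checkpoint):

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where \\(y_w\\) and \\(y_l\\) are the chosen and rejected completions. The second experiment (starting from the SFT checkpoint) can be sketched with TRL's `DPOTrainer` as below; the output directory, the dataset split and column names (`prompt`/`chosen`/`rejected`), and version-dependent argument names are assumptions, not the exact training script.

```python
# Minimal sketch of the DPO stage with TRL (not the exact training script).
# Assumptions: a "train" split with prompt/chosen/rejected columns, and
# argument names matching a recent TRL release.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "Tandogan/MNLP_M2_SFT"  # experiment 1 starts from Qwen/Qwen3-0.6B-Base instead
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

train_dataset = load_dataset("Tandogan/MNLP_M2_dpo_dataset", split="train")

args = DPOConfig(
    output_dir="qwen3-0.6b-dpo",   # hypothetical path
    beta=0.1,
    max_length=2048,
    max_prompt_length=1024,        # completions are likewise truncated to 1024 tokens
    num_train_epochs=4,
    learning_rate=2e-6,
    weight_decay=0.0,
    bf16=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    save_strategy="epoch",         # per-epoch eval additionally needs an eval split
    report_to="wandb",
)

trainer = DPOTrainer(
    model=model,                   # a frozen reference copy is created automatically
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,    # "tokenizer=" in older TRL versions
)
trainer.train()
```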
## Intended Use
This model is intended for research and experimentation with preference-based alignment and reward modeling. It is **not** production-ready and may produce hallucinated, biased, or unsafe outputs. Please evaluate carefully for downstream tasks.
## How to Use
You can load the model with the `transformers` library for inference or evaluation:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the DPO fine-tuned model and its tokenizer
model = AutoModelForCausalLM.from_pretrained("Tandogan/MNLP_M2_dpo_model").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("Tandogan/MNLP_M2_dpo_model")

# Tokenize a prompt, generate a completion, and decode it
prompt = "Explain recursion in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```