---
license: apache-2.0
tags:
- dpo
- unsloth
- trl
- qwen
- instruction-tuning
- preference-modeling
- mnlp
datasets:
- Tandogan/sft_dataset_final_train
- Tandogan/MNLP_M2_dpo_dataset
base_model: Qwen/Qwen3-0.6B-Base
inference: false
---

# MNLP M2 DPO Model — Qwen3-0.6B Fine-Tuned with Direct Preference Optimization

This repository contains a Direct Preference Optimization (DPO) model built on top of a supervised fine-tuned version of [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base), developed as part of the MNLP M2 project. The model is fine-tuned on a high-quality preference dataset to better align its responses with human preferences.
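
For reference, DPO trains directly on pairwise preferences (a chosen response $y_w$ and a rejected response $y_l$ for each prompt $x$) without fitting a separate reward model, by minimizing the standard objective

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where $\pi_\theta$ is the policy being trained, $\pi_{\text{ref}}$ is a frozen reference model (here, the starting checkpoint), and $\beta$ controls how strongly the policy is kept close to the reference; the value used in this run is listed in the training details below.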

## Model Description

- **Base Model**: [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base)
- **SFT Checkpoint**: [`Tandogan/MNLP_M2_SFT`](https://huggingface.co/Tandogan/MNLP_M2_SFT)
- **DPO Dataset**: [`Tandogan/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Tandogan/MNLP_M2_dpo_dataset)
- **Libraries**: [Unsloth](https://github.com/unslothai/unsloth), [TRL](https://github.com/huggingface/trl)

## Training Procedure

### Supervised Fine-Tuning (SFT)

- **Dataset**: [`Tandogan/sft_dataset_final_train`](https://huggingface.co/datasets/Tandogan/sft_dataset_final_train) (Alpaca-style prompt–completion pairs)
- **Max sequence length**: 2048
- **Epochs**: 4
- **Optimizer**: AdamW (learning rate = `3e-5`, weight decay = `0`)
- **Precision**: bf16
- **Batch size**: 2 (gradient accumulation = 4)
- **Scheduler**: Linear with 1% warmup
- **Eval & Checkpointing**: Every epoch
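
As a rough illustration, these settings correspond to a TRL `SFTTrainer` run along the lines of the sketch below. This is not the exact training script: the `validation` split name and `output_dir` are assumptions, and some field names (`max_seq_length`, `eval_strategy`, `processing_class`) have changed across TRL/`transformers` versions.

```python
# Hedged sketch of the SFT stage; hyperparameters mirror the list above.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("Tandogan/sft_dataset_final_train")  # assumed train/validation splits

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B-Base")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")

config = SFTConfig(
    output_dir="sft-qwen3-0.6b",  # hypothetical output path
    max_seq_length=2048,
    num_train_epochs=4,
    learning_rate=3e-5,
    weight_decay=0.0,
    bf16=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="linear",
    warmup_ratio=0.01,  # 1% warmup
    eval_strategy="epoch",
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    processing_class=tokenizer,
)
trainer.train()
```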

### Direct Preference Optimization (DPO)

Two DPO fine-tuning experiments were run, differing only in the starting checkpoint:

1. **From the base model**: [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base)
2. **From the SFT model**: [`Tandogan/MNLP_M2_SFT`](https://huggingface.co/Tandogan/MNLP_M2_SFT)

Both experiments used the same configuration:

- **Dataset**: [`Tandogan/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Tandogan/MNLP_M2_dpo_dataset)
- **Max sequence length**: 2048 (prompt and each completion truncated to 1024 tokens)
- **Epochs**: 4
- **Optimizer**: AdamW (learning rate = `2e-6`, weight decay = `0`)
- **Precision**: bf16
- **Batch size**: 2 (gradient accumulation = 4)
- **Scheduler**: Cosine with 1% warmup
- **DPO Beta**: 0.1
- **Eval & Checkpointing**: Every epoch
- **Monitoring**: Weights & Biases (WandB)
- **Best Epoch Selection**: Lowest validation loss
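
A corresponding sketch for the DPO stage (the SFT-initialized run) is shown below, again under assumptions: the split names, `output_dir`, and exact `DPOConfig` field names (e.g. `max_completion_length`) vary across TRL versions. With `ref_model=None`, `DPOTrainer` uses a frozen copy of the starting model as the reference policy.

```python
# Hedged sketch of the DPO stage; beta and schedule mirror the list above.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

dataset = load_dataset("Tandogan/MNLP_M2_dpo_dataset")  # assumed train/validation splits

# Policy initialized from the SFT checkpoint (experiment 2 above).
model = AutoModelForCausalLM.from_pretrained("Tandogan/MNLP_M2_SFT")
tokenizer = AutoTokenizer.from_pretrained("Tandogan/MNLP_M2_SFT")

config = DPOConfig(
    output_dir="dpo-qwen3-0.6b",  # hypothetical output path
    beta=0.1,
    max_length=2048,
    max_prompt_length=1024,
    max_completion_length=1024,
    num_train_epochs=4,
    learning_rate=2e-6,
    weight_decay=0.0,
    bf16=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,  # 1% warmup
    eval_strategy="epoch",
    save_strategy="epoch",
    report_to="wandb",
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # reference defaults to a frozen copy of `model`
    args=config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    processing_class=tokenizer,
)
trainer.train()
```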

## Intended Use

This model is intended for research and experimentation with preference-based alignment and reward modeling. It is **not** production-ready and may produce hallucinated, biased, or unsafe outputs. Evaluate it carefully before using it for downstream tasks.

## How to Use

You can use the model with the `transformers` library for inference or evaluation (`trl` is only needed to re-run training):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained("Tandogan/MNLP_M2_dpo_model").to(device)
tokenizer = AutoTokenizer.from_pretrained("Tandogan/MNLP_M2_dpo_model")

prompt = "Explain recursion in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```