---
library_name: transformers
tags: []
---
# MNLP M3 DPO Model — Qwen3-0.6B-Base Fine-Tuned with Direct Preference Optimization
This repository contains a Direct Preference Optimization (DPO) model built on top of the base model [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base), as part of the MNLP M3 project. The model is fine-tuned using a high-quality preference dataset to better align responses with human preferences.
## Model Description
- **Base Model**: [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base)
- **DPO Dataset**: [`Tandogan/MNLP_M3_dpo_dataset`](https://huggingface.co/datasets/Tandogan/MNLP_M3_dpo_dataset)
- **Libraries**: [Unsloth](https://github.com/unslothai/unsloth), [TRL](https://github.com/huggingface/trl)
## Training Procedure
### Direct Preference Optimization (DPO)
We started from the official `Qwen3-0.6B-Base` checkpoint and applied **Direct Preference Optimization (DPO)**.
DPO trains the model directly on pairs of preferred and rejected responses, increasing the likelihood margin of the preferred response relative to a frozen reference model, without training a separate reward model.
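Conceptually, the DPO objective can be sketched in a few lines of PyTorch. This is an illustrative sketch of the loss, not the exact training code used here (TRL implements it internally); the log-probability tensors are assumed to be summed per sequence:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))).

    Each argument is a tensor of per-sequence summed log-probabilities.
    """
    # Implicit rewards: log-prob margin of the policy over the reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the gap between chosen and rejected rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

When the policy matches the reference on both responses, the loss is `log 2`; it decreases as the policy assigns relatively more probability to the chosen response.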
#### 1. From Base Model (`Qwen3-0.6B-Base`)
- **Dataset**: [`Tandogan/MNLP_M3_dpo_dataset`](https://huggingface.co/datasets/Tandogan/MNLP_M3_dpo_dataset)
- **Max sequence length**: 2048 (prompt truncated to 1024)
- **Epochs**: 4
- **Optimizer**: AdamW (learning rate = `2e-6`, weight decay = `0`)
- **Precision**: bf16
- **Batch size**: 2 (gradient accumulation = 4)
- **Scheduler**: cosine with 1% warmup
- **DPO Beta**: 0.1
- **Eval & Checkpointing**: Every epoch
- **Monitoring**: Weights & Biases (WandB)
- **Best Epoch Selection**: Based on reward accuracy
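The hyperparameters above roughly correspond to a TRL `DPOConfig` like the following (a hedged reconstruction for reference; the actual Unsloth model-patching and dataset-loading code is not shown, and `output_dir` is a placeholder):

```python
from trl import DPOConfig

# Hypothetical reconstruction of the training configuration
# from the hyperparameters listed above.
config = DPOConfig(
    output_dir="mnlp_m3_dpo",          # placeholder
    num_train_epochs=4,
    learning_rate=2e-6,
    weight_decay=0.0,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,                 # 1% warmup
    bf16=True,
    beta=0.1,                          # DPO temperature
    max_length=2048,
    max_prompt_length=1024,
    eval_strategy="epoch",
    save_strategy="epoch",
    report_to="wandb",
)
```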
## Evaluation
| Model | BLEU | ROUGE-1/2/L/Lsum | METEOR | MMLU ± SD | TQA_MC1 ± SD | TQA_MC2 ± SD | Reward Acc. ± SD |
|--------------------------------|--------|---------------------------------------|--------|-------------------|------------------|------------------|--------------------|
| **Qwen3-0.6B-Base** | 0.1086 | 0.3282 / 0.1458 / 0.2187 / 0.2964 | 0.2406 | 0.5239 ± 0.0365 | 0.2938 ± 0.0159 | 0.4589 ± 0.0148 |0 ± 0 |
| **Qwen3-0.6B** | 0.0649 | 0.2488 / 0.0876 / 0.1617 / 0.2224 | 0.2146 | 0.4156 ± 0.0361 | 0.2717 ± 0.0156 | 0.4284 ± 0.0145 | 0.4226 ± 0.0088 |
| **MNLP M3 DPO Model** | 0.1343 | 0.3608 / 0.1634 / 0.2345 / 0.3283 | 0.2718 | 0.5264 ± 0.0364 | 0.3023 ± 0.0161 | 0.4682 ± 0.0149 |0.6997 ± 0.0082 |
## Intended Use
This model is intended for research and experimentation with preference-based alignment and reward modeling.
## How to Use
You can use the model with the `transformers` library for inference:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained("Tandogan/MNLP_M3_dpo_model").to(device)
tokenizer = AutoTokenizer.from_pretrained("Tandogan/MNLP_M3_dpo_model")

prompt = "Explain recursion in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```