---
license: apache-2.0
tags:
- dpo
- unsloth
- trl
- qwen
- instruction-tuning
- preference-modeling
- mnlp
datasets:
- Tandogan/sft_dataset_final_train
- Tandogan/MNLP_M2_dpo_dataset
base_model: Qwen/Qwen3-0.6B-Base
inference: false
---
# MNLP M2 DPO Model — Qwen3-0.6B Fine-Tuned with Direct Preference Optimization
This repository contains a Direct Preference Optimization (DPO) model built on top of a supervised fine-tuned version of [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base), as part of the MNLP M2 project. The model is fine-tuned using a high-quality preference dataset to better align responses with human preferences.
## Model Description
- **Base Model**: [`Qwen/Qwen3-0.6B-Base`](https://huggingface.co/Qwen/Qwen3-0.6B-Base)
- **SFT Checkpoint**: [`Tandogan/MNLP_M2_SFT`](https://huggingface.co/Tandogan/MNLP_M2_SFT)
- **DPO Dataset**: [`Tandogan/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Tandogan/MNLP_M2_dpo_dataset)
- **Libraries**: [Unsloth](https://github.com/unslothai/unsloth), [TRL](https://github.com/huggingface/trl)
## Training Procedure
### Supervised Fine-Tuning (SFT)
- **Dataset**: [`Tandogan/sft_dataset_final_train`](https://huggingface.co/datasets/Tandogan/sft_dataset_final_train)
(Alpaca-style prompt–completion pairs)
- **Max sequence length**: 2048
- **Epochs**: 4
- **Optimizer**: AdamW (learning rate = `3e-5`, weight decay = `0`)
- **Precision**: bf16
- **Batch size**: 2 (gradient accumulation = 4)
- **Scheduler**: Linear with 1% warmup
- **Eval & Checkpointing**: Every epoch
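
The SFT stage above can be approximated with TRL's `SFTTrainer`. The snippet below is a minimal sketch under the listed hyperparameters, not the exact training script: the output directory, the dataset split name, and the version-dependent argument names are assumptions.

```python
# Minimal sketch of the SFT stage with TRL (not the exact training script).
# Assumptions: the dataset exposes a "train" split in a format SFTTrainer
# understands, and argument names match a recent TRL release.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

base = "Qwen/Qwen3-0.6B-Base"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

train_dataset = load_dataset("Tandogan/sft_dataset_final_train", split="train")

args = SFTConfig(
    output_dir="qwen3-0.6b-sft",   # hypothetical path
    max_seq_length=2048,           # called max_length in some TRL versions
    num_train_epochs=4,
    learning_rate=3e-5,
    weight_decay=0.0,
    bf16=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="linear",
    warmup_ratio=0.01,
    save_strategy="epoch",         # per-epoch eval additionally needs an eval split
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,    # "tokenizer=" in older TRL versions
)
trainer.train()
```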
### Direct Preference Optimization (DPO)
Two DPO fine-tuning experiments were run:
#### 1. From Base Model (`Qwen3-0.6B-Base`)
#### 2. From SFT Model ([`Tandogan/MNLP_M2_SFT`](https://huggingface.co/Tandogan/MNLP_M2_SFT))
- **Dataset**: [`Tandogan/MNLP_M2_dpo_dataset`](https://huggingface.co/datasets/Tandogan/MNLP_M2_dpo_dataset)
- **Max sequence length**: 2048 (prompt + completions truncated to 1024 each)
- **Epochs**: 4
- **Optimizer**: AdamW (learning rate = `2e-6`, weight decay = `0`)
- **Precision**: bf16
- **Batch size**: 2 (gradient accumulation = 4)
- **Scheduler**: Cosine with 1% warmup
- **DPO Beta**: 0.1
- **Eval & Checkpointing**: Every epoch
- **Monitoring**: Weights & Biases (WandB)
- **Best Epoch Selection**: Based on validation loss
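
In the standard DPO objective (Rafailov et al., 2023), `beta` scales the implicit reward, i.e. how strongly the policy is pushed away from the reference model (here the SFT checkpoint):

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where \\(y_w\\) and \\(y_l\\) are the chosen and rejected completions. The second experiment (starting from the SFT checkpoint) can be sketched with TRL's `DPOTrainer` as below; the output directory, the dataset split and column names (`prompt`/`chosen`/`rejected`), and version-dependent argument names are assumptions, not the exact training script.

```python
# Minimal sketch of the DPO stage with TRL (not the exact training script).
# Assumptions: a "train" split with prompt/chosen/rejected columns, and
# argument names matching a recent TRL release.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "Tandogan/MNLP_M2_SFT"  # experiment 1 starts from Qwen/Qwen3-0.6B-Base instead
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

train_dataset = load_dataset("Tandogan/MNLP_M2_dpo_dataset", split="train")

args = DPOConfig(
    output_dir="qwen3-0.6b-dpo",   # hypothetical path
    beta=0.1,
    max_length=2048,
    max_prompt_length=1024,        # completions are likewise truncated to 1024 tokens
    num_train_epochs=4,
    learning_rate=2e-6,
    weight_decay=0.0,
    bf16=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    save_strategy="epoch",         # per-epoch eval additionally needs an eval split
    report_to="wandb",
)

trainer = DPOTrainer(
    model=model,                   # a frozen reference copy is created automatically
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,    # "tokenizer=" in older TRL versions
)
trainer.train()
```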
## Intended Use
This model is intended for research and experimentation with preference-based alignment and reward modeling. It is **not** production-ready and may produce hallucinated, biased, or unsafe outputs. Please evaluate carefully for downstream tasks.
## How to Use
You can load the model with the `transformers` library for inference or evaluation:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the DPO fine-tuned model and its tokenizer
model = AutoModelForCausalLM.from_pretrained("Tandogan/MNLP_M2_dpo_model").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("Tandogan/MNLP_M2_dpo_model")

# Tokenize a prompt, generate a completion, and decode it
prompt = "Explain recursion in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```