MNLP M3 DPO Model — Qwen3-0.6B-Base Fine-Tuned with Direct Preference Optimization

This repository contains a Direct Preference Optimization (DPO) model built on Qwen/Qwen3-0.6B-Base as part of the MNLP M3 project. The model is fine-tuned on a high-quality preference dataset so that its responses align more closely with human preferences.

Model Description

Training Procedure

Direct Preference Optimization (DPO)

We started from the official Qwen/Qwen3-0.6B-Base checkpoint and applied Direct Preference Optimization (DPO).
DPO trains the model directly on pairs of ranked responses, teaching it to score the preferred response higher than the rejected one without fitting a separate reward model; a sketch of the loss and of the training configuration follows below.
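
For intuition, the standard DPO objective can be written as a short function over sequence log-probabilities. This is a minimal sketch, not the trl implementation; the function and argument names are illustrative. Each *_logps argument is a tensor of summed token log-probabilities of a full response under the trainable policy or the frozen reference model, and beta = 0.1 matches the value used for this run.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the policy vs. the frozen reference model
    # for the chosen and rejected responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin): minimized when the policy assigns the
    # chosen response a higher reference-relative score than the rejected one.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()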

1. From Base Model (Qwen3-0.6B-Base)

  • Dataset: Tandogan/MNLP_M3_dpo_dataset
  • Max sequence length: 2048 (prompt truncated to 1024)
  • Epochs: 4
  • Optimizer: AdamW (learning rate = 2e-6, weight decay = 0)
  • Precision: bf16
  • Batch size: 2 (gradient accumulation = 4)
  • Scheduler: cosine with 1% warmup
  • DPO Beta: 0.1
  • Eval & Checkpointing: Every epoch
  • Monitoring: Weights & Biases (WandB)
  • Best Epoch Selection: Based on reward accuracy
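
The hyperparameters above map onto trl's DPOConfig / DPOTrainer roughly as follows. This is a hedged sketch, not the exact training script; the dataset column names, split names, and the best-model metric name are assumptions.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B-Base")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")
# Expected to provide prompt/chosen/rejected columns (assumption).
dataset = load_dataset("Tandogan/MNLP_M3_dpo_dataset")

config = DPOConfig(
    output_dir="mnlp_m3_dpo",
    beta=0.1,                          # DPO beta
    learning_rate=2e-6,
    weight_decay=0.0,
    num_train_epochs=4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,                 # 1% warmup
    bf16=True,
    max_length=2048,
    max_prompt_length=1024,
    eval_strategy="epoch",             # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_rewards/accuracies",  # reward accuracy; exact key is an assumption
    greater_is_better=True,
    report_to="wandb",
)

trainer = DPOTrainer(
    model=model,                         # a frozen reference copy is created automatically
    args=config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],  # split name is an assumption
    processing_class=tokenizer,          # "tokenizer=" on older trl versions
)
trainer.train()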

Evaluation

| Model | BLEU | ROUGE-1 / 2 / L / Lsum | METEOR | MMLU ± SD | TruthfulQA MC1 ± SD | TruthfulQA MC2 ± SD | Reward Acc. ± SD |
|---|---|---|---|---|---|---|---|
| Qwen3-0.6B-Base | 0.1086 | 0.3282 / 0.1458 / 0.2187 / 0.2964 | 0.2406 | 0.5239 ± 0.0365 | 0.2938 ± 0.0159 | 0.4589 ± 0.0148 | 0 ± 0 |
| Qwen3-0.6B | 0.0649 | 0.2488 / 0.0876 / 0.1617 / 0.2224 | 0.2146 | 0.4156 ± 0.0361 | 0.2717 ± 0.0156 | 0.4284 ± 0.0145 | 0.4226 ± 0.0088 |
| MNLP M3 DPO Model | 0.1343 | 0.3608 / 0.1634 / 0.2345 / 0.3283 | 0.2718 | 0.5264 ± 0.0364 | 0.3023 ± 0.0161 | 0.4682 ± 0.0149 | 0.6997 ± 0.0082 |

Intended Use

This model is intended for research and experimentation with preference-based alignment and reward modeling.

How to Use

You can load the model with the transformers library for inference or evaluation (trl is only needed if you want to continue preference training):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use a GPU when available instead of assuming one exists.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned model in bf16, the precision it was trained in.
model = AutoModelForCausalLM.from_pretrained(
    "Tandogan/MNLP_M3_dpo_model", torch_dtype=torch.bfloat16
).to(device)
tokenizer = AutoTokenizer.from_pretrained("Tandogan/MNLP_M3_dpo_model")

prompt = "Explain recursion in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))