# Qwen2.5-1.5B-DPO-Truthy
This model is a fine-tuned version of Qwen2.5-1.5B-Instruct using Direct Preference Optimization (DPO). The goal of this alignment was to improve the model's truthfulness and reduce hallucinations by training it on human-preferred factual responses.
## Model Description
- Base Model: Qwen/Qwen2.5-1.5B-Instruct
- Alignment Technique: DPO (Direct Preference Optimization)
- Training Method: PEFT-LoRA (Low-Rank Adaptation)
- Precision: 4-bit Quantization (bitsandbytes)
## Training Details
The model was trained on the `truthy-dpo` preference dataset, which pairs a "chosen" (accurate, truthful) response with a "rejected" (incorrect or hallucinated) response for each prompt.
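A row in such a dataset follows the standard preference-pair layout expected by DPO trainers. The example below is illustrative only: the field names follow the common `prompt`/`chosen`/`rejected` convention, and the exact column names and content of the released dataset may differ.

```python
# A single hypothetical truthy-dpo-style preference pair.
# Field names follow the common DPO convention; the actual
# dataset columns may differ.
example = {
    "prompt": "Do we only use 10% of our brains?",
    "chosen": (
        "No, that is a myth. Brain imaging shows that virtually all "
        "regions of the brain are active over the course of a day."
    ),
    "rejected": "Yes, humans only use about 10% of their brains.",
}

# DPO optimizes the model to prefer "chosen" over "rejected"
# for the same prompt.
assert set(example) == {"prompt", "chosen", "rejected"}
```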
### Hyperparameters
- Learning Rate: 5e-5
- Batch Size: 1
- Gradient Accumulation Steps: 4
- Optimizer: Paged AdamW 32-bit
- LoRA R: 8
- LoRA Alpha: 16
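For reference, the objective these hyperparameters feed into can be sketched in a few lines. This is a minimal illustration of the per-example DPO loss, not the training code; the `beta=0.1` default is a hypothetical value (the source does not state which beta was used), and libraries such as TRL compute the same quantity from batched log-probabilities.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (margin_policy - margin_ref)).

    Each argument is the summed log-probability of the chosen/rejected
    response under the policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))

# When the policy prefers the chosen response more strongly than the
# reference model does, the loss drops below -log(0.5) ~= 0.693.
loss = dpo_loss(-10.0, -30.0, -15.0, -25.0)
assert loss < math.log(2.0)
```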
## Evaluation Results
### AlpacaEval-style Benchmark (LLM-as-a-Judge)
We evaluated the model against the base Qwen2.5-1.5B-Instruct model using Gemini-1.5-Flash as an impartial judge across 15 factual test cases.
| Metric | Result |
|---|---|
| Model B (DPO) Wins | 1 |
| Ties | 14 |
| Model A (Base) Wins | 0 |
| Final Win Rate | 53.33% |
## Discussion
The DPO alignment produced a modest but clean improvement: a 53.33% win rate, computed with each tie counted as half a win. The high proportion of ties (14/15, ≈93%) suggests that the base model already maintains a high standard of instruction-following, but the DPO model shifted preferences toward the "truthy" distribution with one outright win and no losses against the base model.
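The 53.33% figure follows directly from the AlpacaEval-style convention of scoring a tie as half a win:

```python
# Judge verdicts from the table above.
wins, ties, losses = 1, 14, 0
total = wins + ties + losses

# AlpacaEval-style scoring: a tie is worth half a win.
win_rate = (wins + 0.5 * ties) / total
assert round(win_rate * 100, 2) == 53.33
```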
## Complexity Reduction
To ensure efficient training, this project utilized:
- 4-bit Quantization: Reducing memory footprint for T4 GPU compatibility.
- LoRA: Reducing trainable parameters from ~1.5 billion to approximately 1.5 million (~0.1% of the model).
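The trainable-parameter count comes from the standard LoRA formula: an adapted weight matrix of shape (d_out, d_in) gains two low-rank factors totalling r·(d_in + d_out) parameters. A rough sketch of the arithmetic follows; 1536 matches the hidden size of Qwen2.5-1.5B, but which projection matrices were actually targeted is an assumption here, so this illustrates the formula rather than reproducing the exact ~1.5M total.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)

# Hypothetical example: a rank-8 adapter on one 1536x1536 projection
# (1536 is the Qwen2.5-1.5B hidden size).
per_matrix = lora_param_count(1536, 1536, r=8)
assert per_matrix == 24_576  # vs. 1536 * 1536 = 2,359,296 frozen weights
```

Summing this over every targeted matrix in every layer gives the total trainable-parameter figure quoted above.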
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "Qwen/Qwen2.5-1.5B-Instruct"
adapter_model = "st126107/qwen2.5-truthful-dpo"

# Load the tokenizer and the base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Attach the DPO-trained LoRA adapter on top of the base weights
model = PeftModel.from_pretrained(model, adapter_model)
```
## Model tree

`st126107/qwen2.5-truthful-dpo` is a LoRA adapter for `Qwen/Qwen2.5-1.5B-Instruct`.