| --- |
| license: apache-2.0 |
| base_model: Qwen/Qwen2.5-1.5B-Instruct |
| tags: |
| - dpo |
| - alignment |
| - truthfulness |
| - lora |
| - peft |
| - qwen2 |
| datasets: |
| - jondurbin/truthy-dpo-v0.1 |
| pipeline_tag: text-generation |
| library_name: peft |
| --- |
| |
| # Qwen2.5-1.5B-Instruct — DPO Fine-tuned for Truthfulness |
|
|
| This is a LoRA adapter for [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), fine-tuned using Direct Preference Optimization (DPO) to reduce hallucinations and improve factual accuracy. |
|
|
| I trained this as part of my NLU course assignment (A5) at AIT, where the goal was to align a language model to prefer truthful responses over hallucinated ones. |
|
|
| ## What does this model do differently? |
|
|
| The base Qwen2.5-1.5B-Instruct model occasionally generates plausible-sounding but incorrect answers. By training with DPO on the [truthy-dpo](https://huggingface.co/datasets/jondurbin/truthy-dpo-v0.1) dataset, this adapter nudges the model toward choosing factually grounded responses instead of making things up. |
|
|
| It's not a massive overhaul — it's a lightweight LoRA adapter that adjusts the attention layers to be slightly more cautious and accurate. |
|
|
| ## Training Details |
|
|
| | Parameter | Value | |
| |-----------|-------| |
| | Base model | Qwen/Qwen2.5-1.5B-Instruct | |
| | Training method | DPO (Direct Preference Optimization) | |
| | Dataset | jondurbin/truthy-dpo-v0.1 (1016 samples) | |
| | Adapter type | LoRA (rank=8, alpha=16) | |
| | Target modules | q_proj, k_proj, v_proj, o_proj | |
| | Trainable params | ~2.18M (0.14% of total) | |
| | Learning rate | 1e-4 | |
| | Beta (DPO) | 0.3 | |
| | Training steps | 50 | |
| | Batch size | 1 (with gradient accumulation = 4) | |
| | Precision | float32 | |
| | Hardware | Apple M2 (MPS) | |
|
|
| I ran two experiments — one with a conservative setup (beta=0.1, lr=5e-5) and another with stronger preference signal (beta=0.3, lr=1e-4). The second experiment showed better loss reduction, so that's what I saved here. |
|
|
| ## How to Use |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline |
| from peft import PeftModel |
| |
| # load base model |
| base_model = AutoModelForCausalLM.from_pretrained( |
| "Qwen/Qwen2.5-1.5B-Instruct", |
| torch_dtype="auto", |
| trust_remote_code=True, |
| ) |
| |
| # load the DPO adapter on top |
| model = PeftModel.from_pretrained(base_model, "mastersubhajit/DPO") |
| model = model.merge_and_unload() |
| |
| tokenizer = AutoTokenizer.from_pretrained("mastersubhajit/DPO") |
| pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) |
| |
| response = pipe("What is the capital of France?", max_new_tokens=256, do_sample=False) |
| print(response[0]["generated_text"]) |
| ``` |
|
|
| ## Evaluation |
|
|
| I evaluated this model using AlpacaEval with an LLM-as-a-Judge setup. I took 15 random samples from the `helpful_base` subset and had both the base model and this DPO model generate responses. Then I used Llama 3.3 70B (via Groq) as a judge to compare them. |
|
|
| **DPO Win Rate: ~60%** |
|
|
| The DPO model won more comparisons than the base model, especially on factual and knowledge-based questions. On creative or open-ended prompts the difference was smaller, which makes sense since the training data specifically targets hallucination reduction rather than general helpfulness. |
|
|
| ## Limitations |
|
|
| - This is a LoRA adapter, not a full fine-tune. The changes are subtle. |
| - Trained for only 50 steps due to hardware constraints (M2 Mac). More training would likely help. |
| - The truthy-dpo dataset focuses on a specific kind of truthfulness — the model won't suddenly become perfect at everything. |
| - float32 training (MPS doesn't fully support fp16 for all ops), so training was slower than it would be on a GPU with fp16/bf16. |
|
|
| ## Acknowledgements |
|
|
| - Base model by [Qwen Team](https://huggingface.co/Qwen) |
| - Training dataset by [Jon Durbin](https://huggingface.co/jondurbin) |
| - Built using [TRL](https://huggingface.co/docs/trl) and [PEFT](https://huggingface.co/docs/peft) libraries |