Misha0706/llm-alignment-dpo
This repository contains a DPO-aligned version of HuggingFaceTB/SmolLM-135M-Instruct trained as part of a coursework project on language model alignment. The goal of the project was to implement Direct Preference Optimization (DPO) from scratch and compare its behavior to the original base model and a PPO-based alternative.
Model Details
Model Description
This model is a preference-aligned causal language model obtained by fine-tuning HuggingFaceTB/SmolLM-135M-Instruct with Direct Preference Optimization on HumanLLMs/Human-Like-DPO-Dataset.
The training objective was to shift the model away from generic, overly formal assistant-style replies and toward responses preferred in the dataset, which tend to be more human-like, casual, and expressive.
- Developed by: Mikhail Kalinkin
- Model type: Causal language model
- Language(s): English
- Finetuned from model: HuggingFaceTB/SmolLM-135M-Instruct
- Training method: Direct Preference Optimization (DPO)
Model Sources
- Base model: HuggingFaceTB/SmolLM-135M-Instruct
- Training dataset: HumanLLMs/Human-Like-DPO-Dataset
Intended Use
Direct Use
This model is intended for:
- experimentation with preference alignment;
- comparison against the base SmolLM-Instruct model;
- educational use in RLHF / alignment coursework;
- studying how DPO changes generation behavior.
Out-of-Scope Use
This model is not intended for:
- production use;
- safety-critical or high-stakes applications;
- factual question answering without additional grounding;
- use cases requiring strong truthfulness or reliability guarantees.
Bias, Risks, and Limitations
This model inherits the limitations of the base model and the preference dataset. In particular:
- it may still produce hallucinations or incoherent continuations;
- DPO does not guarantee factual correctness;
- the model can become more stylistically confident without becoming more accurate;
- on some prompts it becomes more distinctive, but on others it may produce implausible or clearly fabricated content.
In the qualitative comparison, DPO changed model behavior more noticeably than PPO, but the improvement was not uniform across prompts.
How to Get Started
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Misha0706/llm-alignment-dpo"

# The tokenizer comes from the base checkpoint; the DPO weights live in this repository.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Build a chat-formatted prompt and generate a greedy completion.
messages = [{"role": "user", "content": "What's your morning routine like?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
    )

print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
Training Details
Training Data
The model was trained on:
- Dataset: HumanLLMs/Human-Like-DPO-Dataset
This dataset contains triples of:
- prompt,
- chosen response,
- rejected response.
The chosen responses are preferred over the rejected ones and are used for preference optimization.
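A minimal way to inspect the dataset with the datasets library; the field names prompt, chosen, and rejected are assumed from the dataset card rather than taken from the training code:

from datasets import load_dataset

# Load the preference dataset (the split name "train" is assumed).
dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")

# Each record should contain a prompt plus a chosen and a rejected reply.
example = dataset[0]
print(example["prompt"])
print(example["chosen"])
print(example["rejected"])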
Preprocessing
The dataset was converted into chat format using the tokenizer chat template:
- prompt → user turn
- chosen / rejected → assistant turn
Then each example was tokenized into:
- prompt_input_ids
- chosen_input_ids
- rejected_input_ids
The prompt was truncated from the left when necessary, and completions were truncated from the right.
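A minimal sketch of this preprocessing step, assuming the dataset field names above; the helper name and the exact template/truncation handling are illustrative, not the project's actual training code:

def build_dpo_features(tokenizer, example, max_prompt_len=128, max_completion_len=256):
    # Prompt rendered as a user turn, ending with the assistant generation prefix.
    prompt_text = tokenizer.apply_chat_template(
        [{"role": "user", "content": example["prompt"]}],
        tokenize=False,
        add_generation_prompt=True,
    )
    prompt_ids = tokenizer(prompt_text, add_special_tokens=False)["input_ids"]

    # Chosen / rejected rendered as assistant replies (here simply the reply text plus EOS).
    chosen_ids = tokenizer(example["chosen"] + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
    rejected_ids = tokenizer(example["rejected"] + tokenizer.eos_token, add_special_tokens=False)["input_ids"]

    return {
        # Prompt truncated from the left, completions truncated from the right.
        "prompt_input_ids": prompt_ids[-max_prompt_len:],
        "chosen_input_ids": chosen_ids[:max_completion_len],
        "rejected_input_ids": rejected_ids[:max_completion_len],
    }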
Training Procedure
DPO was implemented manually rather than using a high-level trainer for this part; a sketch of a single training step appears after the hyperparameter list below.
The training setup used:
- Base model: HuggingFaceTB/SmolLM-135M-Instruct
- Reference model: frozen copy of the initial checkpoint
- Optimizer: AdamW
- Epochs: 1
- Beta: 1.0
- Learning rate: 5e-5
- Batch size: 8
- Max sequence length: 512
- Max prompt length: 128
- Max completion length: 256
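A minimal sketch of one manual DPO training step under this setup, assuming each preprocessed example holds 1-D id tensors named as in the preprocessing section; function names and the absence of batching, padding, and device handling are simplifications, not the project's exact code:

import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids, completion_ids):
    # Log-probability of the completion given the prompt, summed over completion tokens.
    input_ids = torch.cat([prompt_ids, completion_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits[:, :-1, :]                 # position i predicts token i+1
    token_logps = F.log_softmax(logits, dim=-1).gather(
        -1, input_ids[:, 1:].unsqueeze(-1)
    ).squeeze(-1)
    completion_start = prompt_ids.shape[-1] - 1                  # first target that is a completion token
    return token_logps[:, completion_start:].sum(-1)

def dpo_step(policy, reference, optimizer, example, beta=1.0):
    # example: one preprocessed record with 1-D id tensors (batching omitted for clarity).
    pi_w = sequence_logprob(policy, example["prompt_input_ids"], example["chosen_input_ids"])
    pi_l = sequence_logprob(policy, example["prompt_input_ids"], example["rejected_input_ids"])
    with torch.no_grad():  # the reference model stays frozen
        ref_w = sequence_logprob(reference, example["prompt_input_ids"], example["chosen_input_ids"])
        ref_l = sequence_logprob(reference, example["prompt_input_ids"], example["rejected_input_ids"])

    # DPO loss: -log sigma(beta * (chosen log-ratio - rejected log-ratio)).
    loss = -F.logsigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Assumed optimizer setup: optimizer = torch.optim.AdamW(policy.parameters(), lr=5e-5)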
Evaluation
Evaluation Setup
The model was evaluated qualitatively and with a lightweight likelihood-based analysis.
Evaluation included:
- Base vs DPO generation comparison on a shared set of prompts.
- Train vs unseen average answer log-probability comparison.
Metrics
The main quantitative comparison reported in the project was based on:
- average answer log-probability on samples from the training preference dataset;
- average answer log-probability on unseen data from databricks/databricks-dolly-15k (a sketch of this computation follows below).
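A minimal sketch of the likelihood metric: mean per-token log-probability of a reference answer given its prompt. The per-token normalization and the plain tokenization of prompt and answer are assumptions, not necessarily the project's exact evaluation code:

import torch
import torch.nn.functional as F

def average_answer_logprob(model, tokenizer, pairs):
    # pairs: list of (prompt_text, answer_text) tuples from the evaluation set.
    scores = []
    for prompt_text, answer_text in pairs:
        prompt_ids = tokenizer(prompt_text, return_tensors="pt")["input_ids"]
        answer_ids = tokenizer(answer_text, add_special_tokens=False, return_tensors="pt")["input_ids"]
        input_ids = torch.cat([prompt_ids, answer_ids], dim=-1)
        with torch.no_grad():
            logits = model(input_ids).logits[:, :-1, :]
        token_logps = F.log_softmax(logits, dim=-1).gather(
            -1, input_ids[:, 1:].unsqueeze(-1)
        ).squeeze(-1)
        # Keep only positions whose target token belongs to the answer, then average per token.
        answer_logps = token_logps[:, prompt_ids.shape[-1] - 1:]
        scores.append(answer_logps.mean().item())
    return sum(scores) / len(scores)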
Results
Log-probability analysis
- BASE train mean: -1.8917
- BASE unseen mean: -2.6316
- DPO train mean: -1.8647
- DPO unseen mean: -2.6217
These results indicate that the DPO model assigns slightly higher likelihood to both training and unseen samples than the base model in this evaluation setup, with the effect being small but measurable.
Qualitative comparison
On several prompts, DPO changed the outputs more visibly than PPO and diverged more noticeably from the base model.
Examples:
Prompt: Oh, I just saw the best meme - have you seen it?
- Base: verbose and awkward explanation of a meme
- DPO: short, casual, more conversational reply
Prompt: Do you have a favorite type of vacation or getaway?
- Base: generic structured assistant answer
- DPO: still generic, but somewhat more conversational and preference-shaped
At the same time, DPO also showed clear weaknesses:
Prompt: Can you tell me about a memorable birthday or celebration you've had?
- the DPO model generated fabricated personal memories such as getting a first iPhone in 1998, which is a clear sign of unstable preference shaping rather than grounded improvement.
Summary
Compared to the base model, this DPO checkpoint:
- changes behavior more noticeably than the PPO version from the same project;
- sometimes becomes more conversational and concise;
- does not consistently improve response quality;
- can become more prone to confident fabrication on some prompts.
Example Generations
Example 1
Prompt: Oh, I just saw the best meme - have you seen it?
Base model:
I'm so glad you're enjoying the meme! I'm happy to share some of my favorite memes with you...
DPO model:
I've seen it! It's a classic meme that's been around for over a decade. It's a clever way to make people laugh and share their humor with others.
Example 2
Prompt: What's your morning routine like?
Base model:
I'm excited to start my morning routine! As a digital AI assistant, I don't have personal preferences or habits...
DPO model:
I'm excited to start my morning routine! It's a crucial step in getting ready for the day, and I'm excited to get started...
Example 3
Prompt: Can you tell me about a memorable birthday or celebration you've had?
DPO model output shows a limitation:
It starts inventing highly specific personal memories, including impossible or incoherent details.
Technical Notes
Architecture
- Transformer-based causal language model
- Initialized from
HuggingFaceTB/SmolLM-135M-Instruct
Objective
The model was optimized with the DPO loss:
\[
L_{DPO} = - \mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right) \right]
\]
where σ is the logistic sigmoid, β controls how far the policy π_θ may move away from the frozen reference policy π_ref, and the preferred completions y_w and rejected completions y_l come from the human preference dataset.
Limitations
This is a coursework model and should be treated as an experimental artifact rather than a polished aligned assistant.
Known limitations:
- unstable gains across prompts;
- occasional fabricated personal narratives;
- no robust safety evaluation;
- no benchmark-based factuality or helpfulness leaderboard results;
- very small model size, which strongly limits downstream quality.
Citation
If you use this repository, please cite the original base model and dataset:
- HuggingFaceTB/SmolLM-135M-Instruct
- HumanLLMs/Human-Like-DPO-Dataset
You may also mention this repository as a coursework implementation of DPO-based alignment.