Misha0706/llm-alignment-dpo

This repository contains a DPO-aligned version of HuggingFaceTB/SmolLM-135M-Instruct trained as part of a coursework project on language model alignment. The goal of the project was to implement Direct Preference Optimization (DPO) from scratch and compare its behavior to the original base model and a PPO-based alternative.

Model Details

Model Description

This model is a preference-aligned causal language model obtained by fine-tuning HuggingFaceTB/SmolLM-135M-Instruct with Direct Preference Optimization on HumanLLMs/Human-Like-DPO-Dataset.

The training objective was to shift the model away from generic, overly formal assistant-style replies and toward responses preferred in the dataset, which tend to be more human-like, casual, and expressive.

  • Developed by: Mikhail Kalinkin
  • Model type: Causal language model
  • Language(s): English
  • Finetuned from model: HuggingFaceTB/SmolLM-135M-Instruct
  • Training method: Direct Preference Optimization (DPO)

Model Sources

  • Base model: HuggingFaceTB/SmolLM-135M-Instruct
  • Training dataset: HumanLLMs/Human-Like-DPO-Dataset

Intended Use

Direct Use

This model is intended for:

  • experimentation with preference alignment;
  • comparison against the base SmolLM-Instruct model;
  • educational use in RLHF / alignment coursework;
  • studying how DPO changes generation behavior.

Out-of-Scope Use

This model is not intended for:

  • production use;
  • safety-critical or high-stakes applications;
  • factual question answering without additional grounding;
  • use cases requiring strong truthfulness or reliability guarantees.

Bias, Risks, and Limitations

This model inherits the limitations of the base model and the preference dataset. In particular:

  • it may still produce hallucinations or incoherent continuations;
  • DPO does not guarantee factual correctness;
  • the model can become more stylistically confident without becoming more accurate;
  • on some prompts it becomes more distinctive, but on others it may produce implausible or clearly fabricated content.

In the qualitative comparison, DPO changed model behavior more noticeably than PPO, but the improvement was not uniform across prompts.

How to Get Started

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Misha0706/llm-alignment-dpo"

# The tokenizer is taken from the base checkpoint; DPO training did not change it.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Format the prompt with the same chat template used during training.
messages = [{"role": "user", "content": "What's your morning routine like?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

# Greedy decoding gives a deterministic output for comparison against the base model.
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
    )

print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

Training Details

Training Data

The model was trained on:

  • Dataset: HumanLLMs/Human-Like-DPO-Dataset

This dataset contains triples of:

  • prompt,
  • chosen response,
  • rejected response.

Each chosen response is preferred over its rejected counterpart; these pairs provide the preference signal used for optimization.
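A minimal sketch of loading the data (the field names prompt, chosen, and rejected are assumed here and should be checked against the dataset card):

from datasets import load_dataset

dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")

example = dataset[0]
print(example["prompt"])    # user prompt
print(example["chosen"])    # preferred, more human-like response
print(example["rejected"])  # dispreferred, more generic response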

Preprocessing

The dataset was converted into chat format using the tokenizer chat template:

  • prompt → user
  • chosen/rejected → assistant

Then each example was tokenized into:

  • prompt_input_ids
  • chosen_input_ids
  • rejected_input_ids

The prompt was truncated from the left when necessary, and completions were truncated from the right.
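The snippet below sketches this mapping and the truncation rules. It is a simplified illustration, not the project's exact code; the truncation lengths mirror the hyperparameters listed under Training Procedure, and the prefix-splitting trick assumes the chat template renders the prompt as a prefix of the full conversation.

def build_features(example, tokenizer, max_prompt_len=128, max_completion_len=256):
    user = [{"role": "user", "content": example["prompt"]}]

    # prompt -> user turn, rendered with the generation prompt appended
    prompt_text = tokenizer.apply_chat_template(user, tokenize=False, add_generation_prompt=True)

    # chosen / rejected -> assistant turns; keep only the part after the prompt
    chosen_text = tokenizer.apply_chat_template(
        user + [{"role": "assistant", "content": example["chosen"]}], tokenize=False
    )[len(prompt_text):]
    rejected_text = tokenizer.apply_chat_template(
        user + [{"role": "assistant", "content": example["rejected"]}], tokenize=False
    )[len(prompt_text):]

    # Tokenize; truncate the prompt from the left and the completions from the right.
    prompt_ids = tokenizer(prompt_text, add_special_tokens=False)["input_ids"][-max_prompt_len:]
    chosen_ids = tokenizer(chosen_text, add_special_tokens=False)["input_ids"][:max_completion_len]
    rejected_ids = tokenizer(rejected_text, add_special_tokens=False)["input_ids"][:max_completion_len]

    return {
        "prompt_input_ids": prompt_ids,
        "chosen_input_ids": chosen_ids,
        "rejected_input_ids": rejected_ids,
    }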

Training Procedure

For this part of the project, DPO was implemented manually rather than with a high-level trainer.

The training setup used the following configuration; a condensed sketch of one training step appears after the list:

  • Base model: HuggingFaceTB/SmolLM-135M-Instruct
  • Reference model: initial frozen copy of the same checkpoint
  • Optimizer: AdamW
  • Epochs: 1
  • Beta: 1.0
  • Learning rate: 5e-5
  • Batch size: 8
  • Max sequence length: 512
  • Max prompt length: 128
  • Max completion length: 256
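The sketch below shows one manual DPO training step under this configuration. The batch field names, the completion mask, and the helper functions are illustrative assumptions, not the project's actual code.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

model_id = "HuggingFaceTB/SmolLM-135M-Instruct"
policy = AutoModelForCausalLM.from_pretrained(model_id)      # trainable model
reference = AutoModelForCausalLM.from_pretrained(model_id)   # frozen copy of the same checkpoint
reference.eval()
for p in reference.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(policy.parameters(), lr=5e-5)
beta = 1.0

def completion_logprob(model, input_ids, attention_mask, completion_mask):
    # Sum of log-probabilities of the completion tokens (prompt tokens are masked out).
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits[:, :-1]
    labels = input_ids[:, 1:]
    per_token = torch.log_softmax(logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (per_token * completion_mask[:, 1:]).sum(dim=-1)

def dpo_step(batch):
    # batch holds padded (prompt + chosen) and (prompt + rejected) sequences plus masks.
    pol_w = completion_logprob(policy, batch["chosen_ids"], batch["chosen_mask"], batch["chosen_completion_mask"])
    pol_l = completion_logprob(policy, batch["rejected_ids"], batch["rejected_mask"], batch["rejected_completion_mask"])
    with torch.no_grad():
        ref_w = completion_logprob(reference, batch["chosen_ids"], batch["chosen_mask"], batch["chosen_completion_mask"])
        ref_l = completion_logprob(reference, batch["rejected_ids"], batch["rejected_mask"], batch["rejected_completion_mask"])

    # DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    loss = -F.logsigmoid(beta * margin).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()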

Evaluation

Evaluation Setup

The model was evaluated qualitatively and with a lightweight likelihood-based analysis.

Evaluation included:

  1. Base vs DPO generation comparison on a shared set of prompts.
  2. Train vs unseen average answer log-probability comparison.

Metrics

The main quantitative comparison reported in the project, computed per sample as sketched after this list, was:

  • average answer log-probability on samples from the training preference dataset;
  • average answer log-probability on unseen data from databricks/databricks-dolly-15k.
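The per-sample quantity behind these means can be computed roughly as follows. This is a sketch: the prompt and answer are assumed to be already-formatted strings, and averaging over the whole dataset is omitted.

import torch

@torch.no_grad()
def mean_answer_logprob(model, tokenizer, prompt_text, answer_text):
    # Mean per-token log-probability of the answer, conditioned on the prompt.
    prompt_ids = tokenizer(prompt_text, return_tensors="pt")["input_ids"]
    answer_ids = tokenizer(answer_text, add_special_tokens=False, return_tensors="pt")["input_ids"]
    input_ids = torch.cat([prompt_ids, answer_ids], dim=-1)

    logits = model(input_ids).logits[:, :-1]
    logprobs = torch.log_softmax(logits, dim=-1)
    labels = input_ids[:, 1:]
    per_token = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)

    # Keep only the positions that predict answer tokens.
    answer_len = answer_ids.shape[-1]
    return per_token[:, -answer_len:].mean().item()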

Results

Log-probability analysis

  • BASE train mean: -1.8917
  • BASE unseen mean: -2.6316
  • DPO train mean: -1.8647
  • DPO unseen mean: -2.6217

These results indicate that, in this setup, the DPO model assigns slightly higher average likelihood than the base model to both the training samples (-1.8647 vs -1.8917, a gain of about 0.027) and the unseen samples (-2.6217 vs -2.6316, about 0.010). The effect is small but measurable.

Qualitative comparison

On several prompts, the DPO model's outputs diverged from the base model more visibly than the PPO model's did.

Examples:

Prompt: Oh, I just saw the best meme - have you seen it?

  • Base: verbose and awkward explanation of a meme
  • DPO: short, casual, more conversational reply

Prompt: Do you have a favorite type of vacation or getaway?

  • Base: generic structured assistant answer
  • DPO: still generic, but somewhat more conversational and preference-shaped

At the same time, DPO also showed clear weaknesses:

Prompt: Can you tell me about a memorable birthday or celebration you've had?

  • the DPO model generated fabricated personal memories such as getting a first iPhone in 1998, which is a clear sign of unstable preference shaping rather than grounded improvement.

Summary

Compared to the base model, this DPO checkpoint:

  • changes behavior more noticeably than the PPO version from the same project;
  • sometimes becomes more conversational and concise;
  • does not consistently improve response quality;
  • can become more prone to confident fabrication on some prompts.

Example Generations

Example 1

Prompt: Oh, I just saw the best meme - have you seen it?

Base model:

I'm so glad you're enjoying the meme! I'm happy to share some of my favorite memes with you...

DPO model:

I've seen it! It's a classic meme that's been around for over a decade. It's a clever way to make people laugh and share their humor with others.

Example 2

Prompt: What's your morning routine like?

Base model:

I'm excited to start my morning routine! As a digital AI assistant, I don't have personal preferences or habits...

DPO model:

I'm excited to start my morning routine! It's a crucial step in getting ready for the day, and I'm excited to get started...

Example 3

Prompt: Can you tell me about a memorable birthday or celebration you've had?

DPO model output shows a limitation:

It starts inventing highly specific personal memories, including impossible or incoherent details.

Technical Notes

Architecture

  • Transformer-based causal language model
  • Initialized from HuggingFaceTB/SmolLM-135M-Instruct

Objective

The model was optimized with the DPO loss:

\[ L_{DPO} = -\mathbb{E}_{(x, y_w, y_l)\sim D}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right] \]

where y_w and y_l are the preferred and rejected completions from the human preference dataset, π_θ is the trained policy, π_ref is the frozen reference copy of the base model, and σ is the logistic sigmoid.
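In code, assuming the summed log-probabilities of each completion under the policy and the frozen reference are already available, the loss corresponds to the following sketch (not the project's exact implementation):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=1.0):
    # beta * [ log pi_theta(y_w|x)/pi_ref(y_w|x) - log pi_theta(y_l|x)/pi_ref(y_l|x) ]
    logits = beta * ((policy_chosen_logps - reference_chosen_logps)
                     - (policy_rejected_logps - reference_rejected_logps))
    # L_DPO = -E[ log sigma(logits) ], averaged over the batch
    return -F.logsigmoid(logits).mean()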

Limitations

This is a coursework model and should be treated as an experimental artifact rather than a polished aligned assistant.

Known limitations:

  • unstable gains across prompts;
  • occasional fabricated personal narratives;
  • no robust safety evaluation;
  • no benchmark-based factuality or helpfulness leaderboard results;
  • very small model size, which strongly limits downstream quality.

Citation

If you use this repository, please cite the original base model and dataset:

  • HuggingFaceTB/SmolLM-135M-Instruct
  • HumanLLMs/Human-Like-DPO-Dataset

You may also mention this repository as a coursework implementation of DPO-based alignment.
