Misha0706/llm-alignment-dpo
This repository contains a DPO-aligned version of HuggingFaceTB/SmolLM-135M-Instruct trained as part of a coursework project on language model alignment. The goal of the project was to implement Direct Preference Optimization (DPO) from scratch and compare its behavior to the original base model and a PPO-based alternative.
Model Details
Model Description
This model is a preference-aligned causal language model obtained by fine-tuning HuggingFaceTB/SmolLM-135M-Instruct with Direct Preference Optimization on HumanLLMs/Human-Like-DPO-Dataset.
The training objective was to shift the model away from generic, overly formal assistant-style replies and toward responses preferred in the dataset, which tend to be more human-like, casual, and expressive.
- Developed by: Mikhail Kalinkin
- Model type: Causal language model
- Language(s): English
- Finetuned from model: HuggingFaceTB/SmolLM-135M-Instruct
- Training method: Direct Preference Optimization (DPO)
Model Sources
- Base model: HuggingFaceTB/SmolLM-135M-Instruct
- Training dataset: HumanLLMs/Human-Like-DPO-Dataset
Intended Use
Direct Use
This model is intended for:
- experimentation with preference alignment;
- comparison against the base SmolLM-Instruct model;
- educational use in RLHF / alignment coursework;
- studying how DPO changes generation behavior.
Out-of-Scope Use
This model is not intended for:
- production use;
- safety-critical or high-stakes applications;
- factual question answering without additional grounding;
- use cases requiring strong truthfulness or reliability guarantees.
Bias, Risks, and Limitations
This model inherits the limitations of the base model and the preference dataset. In particular:
- it may still produce hallucinations or incoherent continuations;
- DPO does not guarantee factual correctness;
- the model can become more stylistically confident without becoming more accurate;
- on some prompts it becomes more distinctive, but on others it may produce implausible or clearly fabricated content.
In the qualitative comparison, DPO changed model behavior more noticeably than PPO, but the improvement was not uniform across prompts.
How to Get Started
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Misha0706/llm-alignment-dpo"

# The tokenizer comes from the base checkpoint; the DPO weights live in this repository.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Build a chat-formatted prompt and generate a greedy completion.
messages = [{"role": "user", "content": "What's your morning routine like?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
    )

print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
Training Details
Training Data
The model was trained on:
- Dataset: HumanLLMs/Human-Like-DPO-Dataset
This dataset contains triples of:
- prompt,
- chosen response,
- rejected response.
The chosen responses are preferred over the rejected ones and are used for preference optimization.
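A minimal way to inspect the dataset with the datasets library; the field names prompt, chosen, and rejected are assumed from the dataset card rather than taken from the training code:

from datasets import load_dataset

# Load the preference dataset (the split name "train" is assumed).
dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")

# Each record should contain a prompt plus a chosen and a rejected reply.
example = dataset[0]
print(example["prompt"])
print(example["chosen"])
print(example["rejected"])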
Preprocessing
The dataset was converted into chat format using the tokenizer chat template:
- prompt → user turn
- chosen / rejected → assistant turn
Then each example was tokenized into:
- prompt_input_ids
- chosen_input_ids
- rejected_input_ids
The prompt was truncated from the left when necessary, and completions were truncated from the right.
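A minimal sketch of this preprocessing step, assuming the dataset field names above; the helper name and the exact template/truncation handling are illustrative, not the project's actual training code:

def build_dpo_features(tokenizer, example, max_prompt_len=128, max_completion_len=256):
    # Prompt rendered as a user turn, ending with the assistant generation prefix.
    prompt_text = tokenizer.apply_chat_template(
        [{"role": "user", "content": example["prompt"]}],
        tokenize=False,
        add_generation_prompt=True,
    )
    prompt_ids = tokenizer(prompt_text, add_special_tokens=False)["input_ids"]

    # Chosen / rejected rendered as assistant replies (here simply the reply text plus EOS).
    chosen_ids = tokenizer(example["chosen"] + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
    rejected_ids = tokenizer(example["rejected"] + tokenizer.eos_token, add_special_tokens=False)["input_ids"]

    return {
        # Prompt truncated from the left, completions truncated from the right.
        "prompt_input_ids": prompt_ids[-max_prompt_len:],
        "chosen_input_ids": chosen_ids[:max_completion_len],
        "rejected_input_ids": rejected_ids[:max_completion_len],
    }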
Training Procedure
DPO was implemented manually rather than using a high-level trainer for this part; a sketch of a single training step appears after the hyperparameter list below.
The training setup used:
- Base model: HuggingFaceTB/SmolLM-135M-Instruct
- Reference model: frozen copy of the initial checkpoint
- Optimizer: AdamW
- Epochs: 1
- Beta: 1.0
- Learning rate: 5e-5
- Batch size: 8
- Max sequence length: 512
- Max prompt length: 128
- Max completion length: 256
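A minimal sketch of one manual DPO training step under this setup, assuming each preprocessed example holds 1-D id tensors named as in the preprocessing section; function names and the absence of batching, padding, and device handling are simplifications, not the project's exact code:

import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids, completion_ids):
    # Log-probability of the completion given the prompt, summed over completion tokens.
    input_ids = torch.cat([prompt_ids, completion_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits[:, :-1, :]                 # position i predicts token i+1
    token_logps = F.log_softmax(logits, dim=-1).gather(
        -1, input_ids[:, 1:].unsqueeze(-1)
    ).squeeze(-1)
    completion_start = prompt_ids.shape[-1] - 1                  # first target that is a completion token
    return token_logps[:, completion_start:].sum(-1)

def dpo_step(policy, reference, optimizer, example, beta=1.0):
    # example: one preprocessed record with 1-D id tensors (batching omitted for clarity).
    pi_w = sequence_logprob(policy, example["prompt_input_ids"], example["chosen_input_ids"])
    pi_l = sequence_logprob(policy, example["prompt_input_ids"], example["rejected_input_ids"])
    with torch.no_grad():  # the reference model stays frozen
        ref_w = sequence_logprob(reference, example["prompt_input_ids"], example["chosen_input_ids"])
        ref_l = sequence_logprob(reference, example["prompt_input_ids"], example["rejected_input_ids"])

    # DPO loss: -log sigma(beta * (chosen log-ratio - rejected log-ratio)).
    loss = -F.logsigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Assumed optimizer setup: optimizer = torch.optim.AdamW(policy.parameters(), lr=5e-5)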
Evaluation
Evaluation Setup
The model was evaluated qualitatively and with a lightweight likelihood-based analysis.
Evaluation included:
- Base vs DPO generation comparison on a shared set of prompts.
- Train vs unseen average answer log-probability comparison.
Metrics
The main quantitative comparison reported in the project was based on:
- average answer log-probability on samples from the training preference dataset;
- average answer log-probability on unseen data from databricks/databricks-dolly-15k (a sketch of this computation follows below).
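A minimal sketch of the likelihood metric: mean per-token log-probability of a reference answer given its prompt. The per-token normalization and the plain tokenization of prompt and answer are assumptions, not necessarily the project's exact evaluation code:

import torch
import torch.nn.functional as F

def average_answer_logprob(model, tokenizer, pairs):
    # pairs: list of (prompt_text, answer_text) tuples from the evaluation set.
    scores = []
    for prompt_text, answer_text in pairs:
        prompt_ids = tokenizer(prompt_text, return_tensors="pt")["input_ids"]
        answer_ids = tokenizer(answer_text, add_special_tokens=False, return_tensors="pt")["input_ids"]
        input_ids = torch.cat([prompt_ids, answer_ids], dim=-1)
        with torch.no_grad():
            logits = model(input_ids).logits[:, :-1, :]
        token_logps = F.log_softmax(logits, dim=-1).gather(
            -1, input_ids[:, 1:].unsqueeze(-1)
        ).squeeze(-1)
        # Keep only positions whose target token belongs to the answer, then average per token.
        answer_logps = token_logps[:, prompt_ids.shape[-1] - 1:]
        scores.append(answer_logps.mean().item())
    return sum(scores) / len(scores)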
Results
Log-probability analysis
- BASE train mean: -1.8917
- BASE unseen mean: -2.6316
- DPO train mean: -1.8647
- DPO unseen mean: -2.6217
These results indicate that the DPO model assigns slightly higher likelihood to both training and unseen samples than the base model in this evaluation setup, with the effect being small but measurable.
Qualitative comparison
On several prompts, DPO changed the outputs more visibly than PPO and diverged more noticeably from the base model.
Examples:
Prompt: Oh, I just saw the best meme - have you seen it?
- Base: verbose and awkward explanation of a meme
- DPO: short, casual, more conversational reply
Prompt: Do you have a favorite type of vacation or getaway?
- Base: generic structured assistant answer
- DPO: still generic, but somewhat more conversational and preference-shaped
At the same time, DPO also showed clear weaknesses:
Prompt: Can you tell me about a memorable birthday or celebration you've had?
- the DPO model generated fabricated personal memories such as getting a first iPhone in 1998, which is a clear sign of unstable preference shaping rather than grounded improvement.
Summary
Compared to the base model, this DPO checkpoint:
- changes behavior more noticeably than the PPO version from the same project;
- sometimes becomes more conversational and concise;
- does not consistently improve response quality;
- can become more prone to confident fabrication on some prompts.
Example Generations
Example 1
Prompt: Oh, I just saw the best meme - have you seen it?
Base model:
I'm so glad you're enjoying the meme! I'm happy to share some of my favorite memes with you...
DPO model:
I've seen it! It's a classic meme that's been around for over a decade. It's a clever way to make people laugh and share their humor with others.
Example 2
Prompt: What's your morning routine like?
Base model:
I'm excited to start my morning routine! As a digital AI assistant, I don't have personal preferences or habits...
DPO model:
I'm excited to start my morning routine! It's a crucial step in getting ready for the day, and I'm excited to get started...
Example 3
Prompt: Can you tell me about a memorable birthday or celebration you've had?
DPO model output shows a limitation:
It starts inventing highly specific personal memories, including impossible or incoherent details.
Technical Notes
Architecture
- Transformer-based causal language model
- Initialized from
HuggingFaceTB/SmolLM-135M-Instruct
Objective
The model was optimized with the DPO loss:
\[
L_{DPO} = - \mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right) \right]
\]
where σ is the logistic sigmoid, β controls how far the policy π_θ may move away from the frozen reference policy π_ref, and the preferred completions y_w and rejected completions y_l come from the human preference dataset.
Limitations
This is a coursework model and should be treated as an experimental artifact rather than a polished aligned assistant.
Known limitations:
- unstable gains across prompts;
- occasional fabricated personal narratives;
- no robust safety evaluation;
- no benchmark-based factuality or helpfulness leaderboard results;
- very small model size, which strongly limits downstream quality.
Citation
If you use this repository, please cite the original base model and dataset:
- HuggingFaceTB/SmolLM-135M-Instruct
- HumanLLMs/Human-Like-DPO-Dataset
You may also mention this repository as a coursework implementation of DPO-based alignment.