Misha0706/llm-alignment-reward-model

This repository contains a reward model trained for a coursework project on language model alignment. The model is based on HuggingFaceTB/SmolLM-135M-Instruct and was fine-tuned on human preference data to score preferred responses higher than rejected ones.

Model Details

Model Description

This model is a sequence classification model with a single scalar output. It was trained on preference pairs from HumanLLMs/Human-Like-DPO-Dataset and is intended to assign higher scores to preferred assistant responses than to rejected ones.

In the project pipeline, this model was used as:

  • a standalone reward model for evaluation;

  • the reward model in PPO training;

  • the initialization source for the PPO value model.

  • Developed by: Mikhail Kalinkin

  • Model type: Reward model / sequence classification model

  • Language(s): English

  • Finetuned from model: HuggingFaceTB/SmolLM-135M-Instruct

Model Sources

  • Base model: HuggingFaceTB/SmolLM-135M-Instruct
  • Training dataset: HumanLLMs/Human-Like-DPO-Dataset

Uses

Direct Use

This model is intended for:

  • reward scoring of candidate assistant responses;
  • preference comparison between chosen and rejected completions;
  • PPO and RLHF-style experiments;
  • coursework demos and small-scale alignment research.

The model takes a user-assistant conversation and returns a scalar score. Higher scores indicate that the response is more preferred according to the learned reward signal.

Downstream Use

Possible downstream uses:

  • reward model inside PPO training;
  • ranking multiple candidate responses;
  • preference-based evaluation in small research pipelines.

Out-of-Scope Use

This model is not intended for:

  • direct user-facing chat;
  • factual QA by itself;
  • safety-critical ranking in production systems;
  • high-stakes moderation or policy enforcement.

Bias, Risks, and Limitations

This reward model inherits limitations from:

  • the base language model;
  • the preference dataset;
  • the small-scale training setup used in the coursework.

Main limitations:

  • narrow reward signal learned from one dataset;
  • possible overfitting to the style of the training preferences;
  • limited evaluation scope;
  • no guarantee of robust generalization outside the training domain.

Recommendations

This model should be used as an experimental reward component, not as a reliable universal judge. For serious use, additional evaluation, calibration, and broader testing are necessary.

How to Get Started with the Model

Use the code below to score a conversation:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Misha0706/llm-alignment-reward-model"
tokenizer_id = "HuggingFaceTB/SmolLM-135M-Instruct"

# The tokenizer is loaded from the base model; SmolLM has no pad token by
# default, so the EOS token is reused for padding.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
tokenizer.pad_token = tokenizer.eos_token

# num_labels=1 gives a single scalar reward head on top of the transformer.
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=1,
)
model.eval()

messages = [
    {"role": "user", "content": "Do you have a favorite type of vacation or getaway?"},
    {"role": "assistant", "content": "I love outdoor adventures, especially hiking and kayaking."},
]

# Render the conversation with the chat template, then tokenize. Truncating
# to the training max length (512) keeps inputs in the regime seen in training.
text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    # logits has shape (1, 1); extract the scalar reward.
    score = model(**inputs).logits[0].item()

print("Reward score:", score)

To compare two candidate responses, score them separately and choose the one with the higher value.
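
A minimal comparison helper, reusing the tokenizer and model from the snippet above (the candidate conversations here are purely illustrative):

def reward(messages):
    # Render and score one conversation; higher means more preferred.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

candidate_a = [
    {"role": "user", "content": "Do you have a favorite type of vacation?"},
    {"role": "assistant", "content": "I love hiking trips, especially in the mountains."},
]
candidate_b = [
    {"role": "user", "content": "Do you have a favorite type of vacation?"},
    {"role": "assistant", "content": "Vacation preference: not applicable."},
]

best = max([candidate_a, candidate_b], key=reward)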

Training Details

Training Data

The model was trained on:

  • HumanLLMs/Human-Like-DPO-Dataset

This dataset contains triples of:

  • prompt,
  • chosen response,
  • rejected response.

For reward model training, each example was converted into implicit preference format (a conversion sketch follows the list):

  • chosen = [user message, assistant chosen response]
  • rejected = [user message, assistant rejected response]
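
A minimal sketch of this conversion, assuming the dataset exposes prompt, chosen, and rejected string columns (the column names should be verified against the dataset):

from datasets import load_dataset

dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")

def to_preference_pair(example):
    # Wrap each side of the pair as a two-turn conversation.
    return {
        "chosen": [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["chosen"]},
        ],
        "rejected": [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["rejected"]},
        ],
    }

dataset = dataset.map(to_preference_pair)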

Training Procedure

The model was trained with the TRL RewardTrainer, using a single scalar head on top of the base checkpoint.

In this coursework setup only the reward head was trained; the rest of the model was frozen, as sketched below.
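
A minimal sketch of the freezing step, assuming model is the sequence classification model from the quickstart and a Llama-style backbone where the head is exposed as model.score (the attribute name depends on the architecture):

# Freeze every parameter, then unfreeze only the scalar reward head.
for param in model.parameters():
    param.requires_grad = False
for param in model.score.parameters():
    param.requires_grad = True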

Preprocessing

Preprocessing steps (a code sketch follows the list):

  1. load the original preference dataset;
  2. convert each example to implicit prompt preference format;
  3. split the dataset into train and test partitions;
  4. apply the tokenizer chat template;
  5. tokenize and filter examples by maximum sequence length.
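
A sketch of steps 3 to 5, reusing the tokenizer from the quickstart and the converted dataset from above (the test fraction is an assumption, and the order of filtering and splitting is illustrative):

MAX_LENGTH = 512

def fits_max_length(example):
    # Keep a pair only if both rendered conversations fit in the length budget.
    for key in ("chosen", "rejected"):
        text = tokenizer.apply_chat_template(example[key], tokenize=False)
        if len(tokenizer(text).input_ids) > MAX_LENGTH:
            return False
    return True

dataset = dataset.filter(fits_max_length)
split = dataset.train_test_split(test_size=0.1, seed=42)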

Training Hyperparameters

  • Training framework: TRL RewardTrainer
  • Base model: HuggingFaceTB/SmolLM-135M-Instruct
  • Objective: scalar reward prediction from preference pairs
  • Epochs: 1
  • Per-device batch size: 8
  • Max length: 512
  • Learning rate: 3e-4
  • Seed: 42
  • Dropout: disabled during training
  • Training regime: fp32 / standard precision
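
A minimal sketch of how these settings map onto TRL's RewardConfig (field names follow recent TRL releases; older versions take tokenizer= instead of processing_class=, so check your installed version):

from trl import RewardConfig, RewardTrainer

config = RewardConfig(
    output_dir="reward-model",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    max_length=512,
    seed=42,
)

trainer = RewardTrainer(
    model=model,                      # frozen-backbone model from above
    args=config,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    processing_class=tokenizer,
)
trainer.train()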

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was checked on held-out examples from the same preference dataset after a train/test split.

Factors

The main factor examined was pairwise preference ordering:

  • whether the chosen response receives a higher score than the rejected response.

Metrics

The reported metric in the project was simple pairwise preference success on inspected examples:

  • chosen_score > rejected_score
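
A sketch of this check, reusing the reward helper from the comparison example and the held-out split from the preprocessing sketch:

correct = 0
for example in split["test"]:
    if reward(example["chosen"]) > reward(example["rejected"]):
        correct += 1

accuracy = correct / len(split["test"])
print(f"Pairwise preference accuracy: {accuracy:.2%}")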

Results

On the manually inspected examples shared in the project report:

  • Example 1: chosen 3.3539, rejected -0.3580
  • Example 2: chosen 1.7551, rejected -0.3651
  • Example 3: chosen 2.7441, rejected -0.7352
  • Example 4: chosen 2.4800, rejected -0.8525
  • Example 5: chosen 2.8219, rejected 0.9179

Summary

In the reported qualitative check, the reward model preferred the chosen response in 5/5 examples.

This suggests that the model learned the desired pairwise ordering signal on the inspected test examples and was suitable for use as the reward component in the PPO part of the project.

Technical Specifications

Model Architecture and Objective

  • Transformer-based sequence classification model
  • Initialized from HuggingFaceTB/SmolLM-135M-Instruct
  • Final head outputs one scalar reward value
  • Optimized for pairwise preference learning through TRL reward training
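
The scalar head can be inspected directly as a quick sanity check, assuming the quickstart model is loaded (the printed dimensions are illustrative):

# The classification head maps the final hidden state to one scalar.
print(model.score)
# e.g. Linear(in_features=576, out_features=1, bias=False) for SmolLM-135M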

Compute Infrastructure

This model was trained as part of a coursework environment on limited compute. Precise hardware and emissions tracking were not recorded.

Hardware

Not reported.

Software

  • Python
  • PyTorch
  • Transformers
  • Datasets
  • TRL

Limitations

This is a coursework reward model and should be treated as an experimental artifact.

It may:

  • prefer stylistic traits from the training dataset;
  • fail to generalize to unrelated prompts;
  • produce unstable scores outside the training domain;
  • be unsuitable as a standalone evaluator for production systems.

Citation

If you use this repository, please cite the original base model and dataset:

  • HuggingFaceTB/SmolLM-135M-Instruct
  • HumanLLMs/Human-Like-DPO-Dataset

You may also reference this repository as a coursework reward model trained with TRL.
