Misha0706/llm-alignment-reward-model

This repository contains a reward model trained for a coursework project on language model alignment. The model is based on HuggingFaceTB/SmolLM-135M-Instruct and was fine-tuned on human preference data to score preferred responses higher than rejected ones.

Model Details

Model Description

This model is a sequence classification model with a single scalar output. It was trained on preference pairs from HumanLLMs/Human-Like-DPO-Dataset and is intended to assign higher scores to preferred assistant responses than to rejected ones.

In the project pipeline, this model was used as:

  • a standalone reward model for evaluation;

  • the reward model in PPO training;

  • the initialization source for the PPO value model.

  • Developed by: Mikhail Kalinkin

  • Model type: Reward model / sequence classification model

  • Language(s): English

  • Finetuned from model: HuggingFaceTB/SmolLM-135M-Instruct

Model Sources

  • Base model: HuggingFaceTB/SmolLM-135M-Instruct
  • Training dataset: HumanLLMs/Human-Like-DPO-Dataset

Uses

Direct Use

This model is intended for:

  • reward scoring of candidate assistant responses;
  • preference comparison between chosen and rejected completions;
  • PPO and RLHF-style experiments;
  • coursework demos and small-scale alignment research.

The model takes a user-assistant conversation and returns a scalar score. Higher scores indicate that the response is more preferred according to the learned reward signal.

Downstream Use

Possible downstream uses:

  • reward model inside PPO training;
  • ranking multiple candidate responses;
  • preference-based evaluation in small research pipelines.

Out-of-Scope Use

This model is not intended for:

  • direct user-facing chat;
  • factual QA by itself;
  • safety-critical ranking in production systems;
  • high-stakes moderation or policy enforcement.

Bias, Risks, and Limitations

This reward model inherits limitations from:

  • the base language model;
  • the preference dataset;
  • the small-scale training setup used in the coursework.

Main limitations:

  • narrow reward signal learned from one dataset;
  • possible overfitting to the style of the training preferences;
  • limited evaluation scope;
  • no guarantee of robust generalization outside the training domain.

Recommendations

This model should be used as an experimental reward component, not as a reliable universal judge. For serious use, additional evaluation, calibration, and broader testing are necessary.

How to Get Started with the Model

Use the code below to score a conversation:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Misha0706/llm-alignment-reward-model"
tokenizer_id = "HuggingFaceTB/SmolLM-135M-Instruct"

# The tokenizer is loaded from the base model; SmolLM has no pad token by
# default, so the EOS token is reused for padding.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
tokenizer.pad_token = tokenizer.eos_token

# num_labels=1 gives a single scalar reward head on top of the transformer.
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=1,
)
model.eval()

messages = [
    {"role": "user", "content": "Do you have a favorite type of vacation or getaway?"},
    {"role": "assistant", "content": "I love outdoor adventures, especially hiking and kayaking."},
]

# Render the conversation with the chat template, then tokenize. Truncating
# to the training max length (512) keeps inputs in the regime seen in training.
text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    # logits has shape (1, 1); extract the scalar reward.
    score = model(**inputs).logits[0].item()

print("Reward score:", score)

To compare two candidate responses, score them separately and choose the one with the higher value.
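
A minimal comparison helper, reusing the tokenizer and model from the snippet above (the candidate conversations here are purely illustrative):

def reward(messages):
    # Render and score one conversation; higher means more preferred.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

candidate_a = [
    {"role": "user", "content": "Do you have a favorite type of vacation?"},
    {"role": "assistant", "content": "I love hiking trips, especially in the mountains."},
]
candidate_b = [
    {"role": "user", "content": "Do you have a favorite type of vacation?"},
    {"role": "assistant", "content": "Vacation preference: not applicable."},
]

best = max([candidate_a, candidate_b], key=reward)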

Training Details

Training Data

The model was trained on:

  • HumanLLMs/Human-Like-DPO-Dataset

This dataset contains triples of:

  • prompt,
  • chosen response,
  • rejected response.

For reward model training, each example was converted into implicit preference format (a conversion sketch follows the list):

  • chosen = [user message, assistant chosen response]
  • rejected = [user message, assistant rejected response]
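
A minimal sketch of this conversion, assuming the dataset exposes prompt, chosen, and rejected string columns (the column names should be verified against the dataset):

from datasets import load_dataset

dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")

def to_preference_pair(example):
    # Wrap each side of the pair as a two-turn conversation.
    return {
        "chosen": [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["chosen"]},
        ],
        "rejected": [
            {"role": "user", "content": example["prompt"]},
            {"role": "assistant", "content": example["rejected"]},
        ],
    }

dataset = dataset.map(to_preference_pair)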

Training Procedure

The model was trained with the TRL RewardTrainer, using a single scalar head on top of the base checkpoint.

In this coursework setup only the reward head was trained; the rest of the model was frozen, as sketched below.
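
A minimal sketch of the freezing step, assuming model is the sequence classification model from the quickstart and a Llama-style backbone where the head is exposed as model.score (the attribute name depends on the architecture):

# Freeze every parameter, then unfreeze only the scalar reward head.
for param in model.parameters():
    param.requires_grad = False
for param in model.score.parameters():
    param.requires_grad = True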

Preprocessing

Preprocessing steps (a code sketch follows the list):

  1. load the original preference dataset;
  2. convert each example to implicit prompt preference format;
  3. split the dataset into train and test partitions;
  4. apply the tokenizer chat template;
  5. tokenize and filter examples by maximum sequence length.
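
A sketch of steps 3 to 5, reusing the tokenizer from the quickstart and the converted dataset from above (the test fraction is an assumption, and the order of filtering and splitting is illustrative):

MAX_LENGTH = 512

def fits_max_length(example):
    # Keep a pair only if both rendered conversations fit in the length budget.
    for key in ("chosen", "rejected"):
        text = tokenizer.apply_chat_template(example[key], tokenize=False)
        if len(tokenizer(text).input_ids) > MAX_LENGTH:
            return False
    return True

dataset = dataset.filter(fits_max_length)
split = dataset.train_test_split(test_size=0.1, seed=42)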

Training Hyperparameters

  • Training framework: TRL RewardTrainer
  • Base model: HuggingFaceTB/SmolLM-135M-Instruct
  • Objective: scalar reward prediction from preference pairs
  • Epochs: 1
  • Per-device batch size: 8
  • Max length: 512
  • Learning rate: 3e-4
  • Seed: 42
  • Dropout: disabled during training
  • Training regime: fp32 / standard precision
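
A minimal sketch of how these settings map onto TRL's RewardConfig (field names follow recent TRL releases; older versions take tokenizer= instead of processing_class=, so check your installed version):

from trl import RewardConfig, RewardTrainer

config = RewardConfig(
    output_dir="reward-model",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    max_length=512,
    seed=42,
)

trainer = RewardTrainer(
    model=model,                      # frozen-backbone model from above
    args=config,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    processing_class=tokenizer,
)
trainer.train()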

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was checked on held-out examples from the same preference dataset after a train/test split.

Factors

The main factor examined was pairwise preference ordering:

  • whether the chosen response receives a higher score than the rejected response.

Metrics

The reported metric in the project was simple pairwise preference success on inspected examples:

  • chosen_score > rejected_score
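
A sketch of this check, reusing the reward helper from the comparison example and the held-out split from the preprocessing sketch:

correct = 0
for example in split["test"]:
    if reward(example["chosen"]) > reward(example["rejected"]):
        correct += 1

accuracy = correct / len(split["test"])
print(f"Pairwise preference accuracy: {accuracy:.2%}")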

Results

On the manually inspected examples shared in the project report:

  • Example 1: chosen 3.3539, rejected -0.3580
  • Example 2: chosen 1.7551, rejected -0.3651
  • Example 3: chosen 2.7441, rejected -0.7352
  • Example 4: chosen 2.4800, rejected -0.8525
  • Example 5: chosen 2.8219, rejected 0.9179

Summary

In the reported qualitative check, the reward model preferred the chosen response in 5/5 examples.

This suggests that the model learned the desired pairwise ordering signal on the inspected test examples and was suitable for use as the reward component in the PPO part of the project.

Technical Specifications

Model Architecture and Objective

  • Transformer-based sequence classification model
  • Initialized from HuggingFaceTB/SmolLM-135M-Instruct
  • Final head outputs one scalar reward value
  • Optimized for pairwise preference learning through TRL reward training
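
The scalar head can be inspected directly as a quick sanity check, assuming the quickstart model is loaded (the printed dimensions are illustrative):

# The classification head maps the final hidden state to one scalar.
print(model.score)
# e.g. Linear(in_features=576, out_features=1, bias=False) for SmolLM-135M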

Compute Infrastructure

This model was trained as part of a coursework environment on limited compute. Precise hardware and emissions tracking were not recorded.

Hardware

Not reported.

Software

  • Python
  • PyTorch
  • Transformers
  • Datasets
  • TRL

Limitations

This is a coursework reward model and should be treated as an experimental artifact.

It may:

  • prefer stylistic traits from the training dataset;
  • fail to generalize to unrelated prompts;
  • produce unstable scores outside the training domain;
  • be unsuitable as a standalone evaluator for production systems.

Citation

If you use this repository, please cite the original base model and dataset:

  • HuggingFaceTB/SmolLM-135M-Instruct
  • HumanLLMs/Human-Like-DPO-Dataset

You may also reference this repository as a coursework reward model trained with TRL.
