Misha0706/llm-alignment-reward-model
This repository contains a reward model trained for a coursework project on language model alignment. The model is based on HuggingFaceTB/SmolLM-135M-Instruct and was fine-tuned to score preferred responses higher than rejected ones using human preference data.
Model Details
Model Description
This model is a sequence classification model with a single scalar output. It was trained on preference pairs from HumanLLMs/Human-Like-DPO-Dataset and is intended to assign higher scores to preferred assistant responses than to rejected ones.
In the project pipeline, this model was used as:
- a standalone reward model for evaluation;
- the reward model in PPO training;
- the initialization source for the PPO value model.
Developed by: Mikhail Kalinkin
Model type: Reward model / sequence classification model
Language(s): English
Finetuned from model: HuggingFaceTB/SmolLM-135M-Instruct
Model Sources
- Base model: HuggingFaceTB/SmolLM-135M-Instruct
- Training dataset: HumanLLMs/Human-Like-DPO-Dataset
Uses
Direct Use
This model is intended for:
- reward scoring of candidate assistant responses;
- preference comparison between chosen and rejected completions;
- PPO and RLHF-style experiments;
- coursework demos and small-scale alignment research.
The model takes a user-assistant conversation and returns a scalar score. Higher scores indicate that the response is more preferred according to the learned reward signal.
Downstream Use
Possible downstream uses:
- reward model inside PPO training;
- ranking multiple candidate responses;
- preference-based evaluation in small research pipelines.
Out-of-Scope Use
This model is not intended for:
- direct user-facing chat;
- factual QA by itself;
- safety-critical ranking in production systems;
- high-stakes moderation or policy enforcement.
Bias, Risks, and Limitations
This reward model inherits limitations from:
- the base language model;
- the preference dataset;
- the small-scale training setup used in the coursework.
Main limitations:
- narrow reward signal learned from one dataset;
- possible overfitting to the style of the training preferences;
- limited evaluation scope;
- no guarantee of robust generalization outside the training domain.
Recommendations
This model should be used as an experimental reward component, not as a reliable universal judge. For serious use, additional evaluation, calibration, and broader testing are necessary.
How to Get Started with the Model
Use the code below to score a conversation:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Misha0706/llm-alignment-reward-model"
tokenizer_id = "HuggingFaceTB/SmolLM-135M-Instruct"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=1,
)
model.eval()

messages = [
    {"role": "user", "content": "Do you have a favorite type of vacation or getaway?"},
    {"role": "assistant", "content": "I love outdoor adventures, especially hiking and kayaking."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    score = model(**inputs).logits[0].item()

print("Reward score:", score)
```
To compare two candidate responses, score them separately and choose the one with the higher value.
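A minimal sketch of that comparison; `score_response` is a hypothetical helper standing in for the model call shown above, and the stub scorer below is for illustration only:

```python
def pick_preferred(score_response, prompt, candidate_a, candidate_b):
    """Return the candidate whose reward score is higher."""
    score_a = score_response(prompt, candidate_a)
    score_b = score_response(prompt, candidate_b)
    return candidate_a if score_a >= score_b else candidate_b

# Stub scorer for illustration: rewards longer answers.
stub = lambda prompt, response: len(response)
print(pick_preferred(stub, "Hi?", "Short.", "A longer, detailed reply."))
```

In a real pipeline, `score_response` would tokenize the conversation with the chat template and return `model(**inputs).logits[0].item()` as in the snippet above.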
Training Details
Training Data
The model was trained on:
HumanLLMs/Human-Like-DPO-Dataset
This dataset contains triples of:
- prompt,
- chosen response,
- rejected response.
For reward model training, each example was converted into implicit preference format:
chosen = [user message, assistant chosen response]
rejected = [user message, assistant rejected response]
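A minimal sketch of this conversion, assuming each dataset row carries `prompt`, `chosen`, and `rejected` string fields (field names per the dataset card):

```python
def to_preference_pair(row):
    """Convert one DPO-style row into two chat-format conversations."""
    chosen = [
        {"role": "user", "content": row["prompt"]},
        {"role": "assistant", "content": row["chosen"]},
    ]
    rejected = [
        {"role": "user", "content": row["prompt"]},
        {"role": "assistant", "content": row["rejected"]},
    ]
    return {"chosen": chosen, "rejected": rejected}

example = {"prompt": "Hi!", "chosen": "Hello there!", "rejected": "Go away."}
pair = to_preference_pair(example)
```

The resulting `chosen`/`rejected` conversations are what the tokenizer chat template is later applied to.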
Training Procedure
The model was trained with TRL RewardTrainer using a single scalar head on top of the base checkpoint.
In this coursework setup, only the scalar reward head was trained; the rest of the model was kept frozen.
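TRL's reward training minimizes a pairwise logistic (Bradley-Terry) loss, `-log sigmoid(r_chosen - r_rejected)`. A minimal numeric sketch of that objective:

```python
import math

def pairwise_reward_loss(chosen_score, rejected_score):
    """-log sigmoid(r_chosen - r_rejected): small when chosen >> rejected."""
    margin = chosen_score - rejected_score
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the chosen response is scored increasingly higher.
print(pairwise_reward_loss(3.0, -0.5))  # well-separated pair: small loss
print(pairwise_reward_loss(0.0, 0.0))   # tied scores: log(2) ~ 0.693
```

Minimizing this loss pushes the scalar head to rank chosen responses above rejected ones, which is exactly the pairwise ordering checked in the Evaluation section.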
Preprocessing
Preprocessing steps:
- load the original preference dataset;
- convert each example to implicit prompt preference format;
- split the dataset into train and test partitions;
- apply the tokenizer chat template;
- tokenize and filter examples by maximum sequence length.
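The final filtering step above can be sketched as follows; `whitespace_tokenize` is a stand-in for the real tokenizer, which would return `input_ids` instead of words:

```python
def whitespace_tokenize(text):
    # Stand-in for tokenizer(text).input_ids in the real pipeline.
    return text.split()

def filter_by_length(examples, max_length):
    """Keep only examples whose tokenized length fits the model budget."""
    kept = []
    for text in examples:
        if len(whitespace_tokenize(text)) <= max_length:
            kept.append(text)
    return kept

samples = ["short reply", "a much longer reply " * 200]
print(len(filter_by_length(samples, max_length=512)))  # over-long example dropped
```

In the actual pipeline this filter is applied to both the chosen and the rejected conversation, so a pair is dropped if either side exceeds the maximum length.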
Training Hyperparameters
- Training framework: TRL RewardTrainer
- Base model: HuggingFaceTB/SmolLM-135M-Instruct
- Objective: scalar reward prediction from preference pairs
- Epochs: 1
- Per-device batch size: 8
- Max length: 512
- Learning rate: 3e-4
- Seed: 42
- Dropout: disabled during training
- Training regime: fp32 / standard precision
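The hyperparameters above roughly correspond to a TRL configuration like the following. This is a sketch, not the exact coursework script; argument names follow recent TRL releases and may differ across versions:

```python
from trl import RewardConfig, RewardTrainer

config = RewardConfig(
    output_dir="reward-model",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    max_length=512,
    seed=42,
)

# trainer = RewardTrainer(model=model, args=config,
#                         processing_class=tokenizer,
#                         train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```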
Evaluation
Testing Data, Factors & Metrics
Testing Data
The model was checked on held-out examples from the same preference dataset after a train/test split.
Factors
The main factor examined was pairwise preference ordering:
- whether the chosen response receives a higher score than the rejected response.
Metrics
The reported metric in the project was simple pairwise preference success on inspected examples:
chosen_score > rejected_score
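This check reduces to a simple pairwise accuracy. A minimal sketch, using the five score pairs reported in the Results section:

```python
def pairwise_accuracy(pairs):
    """Fraction of pairs where the chosen response outscores the rejected one."""
    wins = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return wins / len(pairs)

# (chosen_score, rejected_score) pairs from the project report.
pairs = [
    (3.3539, -0.3580),
    (1.7551, -0.3651),
    (2.7441, -0.7352),
    (2.4800, -0.8525),
    (2.8219, 0.9179),
]
print(pairwise_accuracy(pairs))  # 1.0 -> chosen preferred in 5/5 pairs
```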
Results
On the manually inspected examples shared in the project report:
- Example 1: chosen 3.3539, rejected -0.3580
- Example 2: chosen 1.7551, rejected -0.3651
- Example 3: chosen 2.7441, rejected -0.7352
- Example 4: chosen 2.4800, rejected -0.8525
- Example 5: chosen 2.8219, rejected 0.9179
Summary
In the reported qualitative check, the reward model preferred the chosen response in 5/5 examples.
This indicates that the model learned the desired pairwise ordering signal on the inspected test examples and was suitable for use as the reward component in the PPO part of the project.
Technical Specifications
Model Architecture and Objective
- Transformer-based sequence classification model
- Initialized from HuggingFaceTB/SmolLM-135M-Instruct
- Final head outputs one scalar reward value
- Optimized for pairwise preference learning through TRL reward training
Compute Infrastructure
This model was trained as part of a coursework environment on limited compute. Precise hardware and emissions tracking were not recorded.
Hardware
Not reported.
Software
- Python
- PyTorch
- Transformers
- Datasets
- TRL
Limitations
This is a coursework reward model and should be treated as an experimental artifact.
It may:
- prefer stylistic traits from the training dataset;
- fail to generalize to unrelated prompts;
- produce unstable scores outside the training domain;
- be unsuitable as a standalone evaluator for production systems.
Citation
If you use this repository, please cite the original base model and dataset:
- HuggingFaceTB/SmolLM-135M-Instruct
- HumanLLMs/Human-Like-DPO-Dataset
You may also reference this repository as a coursework reward model trained with TRL.