TinyStarCoder Reward Model (TL;DR Preference Model)

This model is a reward model fine-tuned from bigcode/tiny_starcoder_py using TRL's RewardTrainer.

The model predicts a single scalar reward score for an input sequence and is intended for preference ranking, not text generation.

Higher reward → model prefers that response.


Model Details

Base Model

  • bigcode/tiny_starcoder_py

Task

  • Reward Modeling
  • Preference Learning
  • RLHF-style reward estimation

Framework

  • Transformers
  • TRL RewardTrainer

Dataset

Dataset used:

  • CarperAI/openai_summarize_comparisons

Each training example contains three fields:

  • prompt
  • chosen
  • rejected

Training objective:

reward(chosen) > reward(rejected)
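
An objective of this form is commonly optimized with a pairwise logistic (Bradley-Terry) loss, -log(sigmoid(reward(chosen) - reward(rejected))). The sketch below computes that loss for a single pair in plain Python. It assumes this is the loss that was applied during training, which the card does not state explicitly, and the function name pairwise_loss is illustrative rather than part of any library.

```python
import math


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise logistic loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() is never called with a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Equal rewards give log(2); the loss shrinks as chosen outscores rejected.
loss_equal = pairwise_loss(0.0, 0.0)
loss_correct = pairwise_loss(2.0, 0.0)
loss_wrong = pairwise_loss(0.0, 2.0)
```

The loss depends only on the difference between the two rewards, which is why the scores are meaningful relative to each other rather than on an absolute scale.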

Training Configuration

Parameter         Value
Samples           2000
Epochs            2
Max Length        256
Learning Rate     1e-5
Train Batch Size  2
Eval Batch Size   1
Trainer           RewardTrainer

Evaluation

Final evaluation metrics:

Metric         Value
Eval Accuracy  ~0.62
Eval Loss      ~0.98
Eval Margin    ~0.75

Interpretation:

  • Accuracy above 0.50, the chance level for a pairwise comparison, indicates that the reward model learned some preference signal.
  • A positive margin means preferred responses receive a higher reward than rejected ones on average.
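
To make the two metrics concrete, the sketch below computes them from a list of (chosen, rejected) reward pairs. It assumes accuracy is the fraction of pairs where the chosen response scores higher and margin is the mean reward difference. The helper name preference_metrics and the reward values are made up for illustration and are not taken from the actual evaluation run.

```python
def preference_metrics(pairs):
    """Return (accuracy, mean_margin) for (reward_chosen, reward_rejected) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    margins = [chosen - rejected for chosen, rejected in pairs]
    # A pair counts as correct when the chosen response gets the higher reward.
    accuracy = sum(m > 0 for m in margins) / len(margins)
    mean_margin = sum(margins) / len(margins)
    return accuracy, mean_margin


# Made-up reward values, for illustration only.
example_pairs = [(1.5, 0.5), (0.25, 0.75), (2.0, 1.0), (0.0, -1.0)]
accuracy, margin = preference_metrics(example_pairs)
print("accuracy:", accuracy, "margin:", margin)
```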

Usage

Load model

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification
)

repo = "caffeic/tinystarcoder-reward-tldr"

tokenizer = AutoTokenizer.from_pretrained(repo)

model = AutoModelForSequenceClassification.from_pretrained(
    repo
)

Score a response

import torch

text = """
Summarize:
Transformers are deep learning architectures...

Summary:
Transformers use self-attention.
"""

inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=256
)

with torch.no_grad():
    reward = model(**inputs).logits.item()

print("Reward:", reward)

Compare two responses

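def score(text):
    # Tokenize with the training max length and return the scalar reward.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        return model(**inputs).logits.item()

# chosen and rejected are full prompt-plus-response strings.
chosen_score = score(chosen)
rejected_score = score(rejected)

if chosen_score > rejected_score:
    print("Chosen preferred")
else:
    print("Rejected preferred")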
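
The same comparison extends to more than two candidates. The sketch below ranks any number of texts by reward; rank_by_reward and toy_score are hypothetical names, and the scorer is passed in as a callable so that the example runs without loading the model. In practice, the callable would return the reward model's logit for the full prompt-plus-response text.

```python
from typing import Callable, List, Tuple


def rank_by_reward(
    texts: List[str], score_fn: Callable[[str], float]
) -> List[Tuple[float, str]]:
    """Return (reward, text) pairs sorted from highest to lowest reward."""
    scored = [(score_fn(text), text) for text in texts]
    # Sort on the reward only, so tied texts keep their original order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_score(text: str) -> float:
    # Stand-in scorer for demonstration only: shorter text scores higher.
    return -float(len(text))


ranked = rank_by_reward(
    ["a long candidate summary", "short", "medium one"], toy_score
)
best_reward, best_text = ranked[0]
print("Best candidate:", best_text)
```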

Limitations

  • This is a reward model and does not generate text.
  • Reward values are relative and not absolute quality scores.
  • Trained on a limited subset (~2000 samples).
  • Not intended for production RLHF pipelines.
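
One way to read relative reward values is through the Bradley-Terry model, in which the probability that response A is preferred over response B is sigmoid(reward_a - reward_b). The sketch below assumes that interpretation, which matches the pairwise logistic loss commonly used for reward models but is not stated in this card. The function name preference_probability and the reward values are illustrative.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that A is preferred: sigmoid(reward_a - reward_b)."""
    diff = reward_a - reward_b
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Shifting both rewards by the same constant leaves the probability unchanged,
# which is why a single reward value carries no absolute meaning.
p = preference_probability(1.25, 0.5)
p_shifted = preference_probability(101.25, 100.5)
```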

Training Notes

This project was created to learn:

  • Reward modeling
  • Preference datasets
  • TRL RewardTrainer
  • RLHF workflows
  • Hugging Face model publishing

Citation

@software{vonwerra2020trl,
title={{TRL: Transformers Reinforcement Learning}},
author={von Werra et al.},
year={2020},
url={https://github.com/huggingface/trl}
}

Model size: 0.2B params · Tensor type: F32 · Format: Safetensors