Model Card

Summary

This directory contains the final checkpoint (step 50354) of a GPT-2-style language model trained from scratch as part of a reproduction of Pretraining Language Models with Human Preferences (Korbak et al., 2023). This run corresponds to conditional training on the toxicity task.

Pretraining Process

Training goal

The goal of this run was to reproduce the paper's conditional pretraining setup for toxicity reduction. Rather than only learning to imitate the training corpus, the model was trained with control tokens that condition generation on preference-related labels, so that aligned generations can be elicited at inference time by prompting with the aligned prefix.

Model and tokenizer

  • Architecture: GPT-2 small style autoregressive transformer
  • Initialization: trained from scratch from the gpt2 config, not continued from pretrained weights
  • Tokenizer base: gpt2
  • Context length: 1024 tokens
  • Added control tokens: <|aligned|>, <|misaligned|>
  • Additional model vocabulary expansion: 2 tokens

Data

Training used sentence-split shards of the tomekkorbak/detoxify-pile-chunk3-* datasets on Hugging Face. The run metadata shows shards covering:

  • tomekkorbak/detoxify-pile-chunk3-0-50000
  • ...
  • tomekkorbak/detoxify-pile-chunk3-1900000-1950000

The configured token budget for training was approximately 3.3B tokens.
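As a rough sanity check, the token budget is consistent with the step count reported under "Training duration and final checkpoint" below, assuming an effective batch of 64 sequences of 1024 tokens each (figures taken from this card; the 3.3B budget is approximate):

```python
# Back-of-the-envelope check: a ~3.3B-token budget at an effective batch of
# 64 sequences x 1024 tokens per sequence implies roughly 50k optimization
# steps.
token_budget = 3_300_000_000
tokens_per_step = 64 * 1024  # effective batch size * context length

steps = token_budget // tokens_per_step
print(steps)  # 50354, matching the configured step count
```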

Conditional training setup

This run used a conditional variant of standard maximum-likelihood (MLE) training in which each training text is associated with a preference-conditioned control prefix:

  • Aligned prefix: <|aligned|>
  • Misaligned prefix: <|misaligned|>
  • Threshold: 0.00056
  • Drop token fraction (the fraction of input samples that receives no prefix): 0.01
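The labelling rule implied by these hyperparameters can be sketched as follows. This is a minimal illustration, not the run's actual preprocessing code; the function name and exact tie-breaking at the threshold are assumptions:

```python
import random

ALIGNED = "<|aligned|>"
MISALIGNED = "<|misaligned|>"

def choose_prefix(toxicity_score, threshold=0.00056, drop_fraction=0.01, rng=None):
    """Pick the control prefix for one training document.

    Documents scoring at or below the toxicity threshold are labelled
    aligned, the rest misaligned; a small fraction of samples gets no
    prefix at all, so the model also learns to handle unprefixed text.
    """
    rng = rng or random.Random()
    if rng.random() < drop_fraction:
        return ""  # no prefix for a small share of samples
    return ALIGNED if toxicity_score <= threshold else MISALIGNED

# Examples with dropping disabled, to show the two labelled cases:
print(choose_prefix(0.0001, drop_fraction=0.0))  # <|aligned|>
print(choose_prefix(0.2, drop_fraction=0.0))     # <|misaligned|>
```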

The tokenizer and model were expanded to support the two special control tokens. In practice, this means the final checkpoint is intended to be prompted with <|aligned|> when generating lower-toxicity text.
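The intended inference pattern is therefore to prepend the aligned token to the user prompt and to strip any control tokens from the decoded output. A minimal sketch (helper names are hypothetical, and the actual model call is elided):

```python
CONTROL_TOKENS = ("<|aligned|>", "<|misaligned|>")

def build_prompt(user_prompt):
    # Condition generation on the aligned control token.
    return "<|aligned|>" + user_prompt

def strip_control_tokens(text):
    # Remove control tokens from decoded text before showing it to users.
    for tok in CONTROL_TOKENS:
        text = text.replace(tok, "")
    return text

prompt = build_prompt("The weather today is")
# `prompt` would then be tokenized and passed to the checkpoint, e.g. via
# transformers' model.generate(...); that step is omitted here.
print(strip_control_tokens("<|aligned|>The weather today is mild."))
```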

Optimization setup

  • Learning rate: 5e-4
  • Weight decay: 0.1
  • Warmup ratio: 0.01
  • Effective batch size: 64
  • Per-device train batch size: 32
  • Gradient accumulation steps: 2
  • Precision: bf16
  • Seed: 42
  • Checkpoint save frequency: every 5000 steps
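The batch-size figures above are mutually consistent assuming a single training device (the device count is an assumption, not stated on this card), and the warmup ratio translates to roughly 500 warmup steps over the full run:

```python
per_device_batch = 32
grad_accum_steps = 2
num_devices = 1  # assumption: single GPU; not stated in the run metadata

effective_batch = per_device_batch * grad_accum_steps * num_devices
print(effective_batch)  # 64, matching the listed effective batch size

total_steps = 50354
warmup_ratio = 0.01
warmup_steps = int(total_steps * warmup_ratio)
print(warmup_steps)  # 503
```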

Training duration and final checkpoint

The run was configured for 50354 optimization steps and the final checkpoint in this directory corresponds to that step count:

  • Final checkpoint: checkpoint-50354
  • Global step: 50354

Monitoring during training

The run configuration included periodic sample generation for qualitative monitoring; although configured as "unconditional" generation, sampling was in fact conditioned on the aligned prefix <|aligned|>. Generated samples were scored with DetoxifyToxicityScorer, and the generation config blocked the two control tokens from appearing in the output via bad_words_ids.
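In transformers, bad_words_ids is a generation parameter holding a list of banned token-id sequences. A sketch of the shape such a config would take for the two control tokens, using placeholder ids (the real values should be read from the expanded tokenizer):

```python
# Placeholder ids for <|aligned|> and <|misaligned|>. After expanding the
# gpt2 vocabulary (50257 tokens) by two, they would plausibly be 50257 and
# 50258, but the actual ids must be looked up from the tokenizer.
aligned_id, misaligned_id = 50257, 50258

# Each banned "sequence" here is a single control token, so each inner
# list has length one.
bad_words_ids = [[aligned_id], [misaligned_id]]
print(bad_words_ids)
```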

Relationship to the paper

This artifact is a reproduction-style checkpoint for the toxicity conditional-training setting described in Pretraining Language Models with Human Preferences. It should not be interpreted as an official release from the paper authors unless accompanied by separate release documentation.
