Model Card
Summary
This directory contains a final (step 50354) checkpoint for a GPT-2 style language model trained from scratch as part of a reproduction of Pretraining Language Models with Human Preferences (Korbak et al., 2023). This run corresponds to the conditional training for the toxicity task.
Pretraining Process
Training goal
The goal of this run was to reproduce the paper's conditional pretraining setup for toxicity reduction. Rather than only learning to imitate the training corpus, the model was trained with control tokens that condition generation on preference-related labels, so that aligned generations can be elicited at inference time by prompting with the aligned prefix.
Model and tokenizer
- Architecture: GPT-2 small style autoregressive transformer
- Initialization: trained from scratch from the `gpt2` config, not continued from pretrained weights
- Tokenizer base: `gpt2`
- Context length: 1024 tokens
- Added control tokens: `<|aligned|>`, `<|misaligned|>`
- Model vocabulary expansion: 2 additional tokens
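The vocabulary expansion can be illustrated with a small sketch. GPT-2's base vocabulary has 50,257 entries, and the two control tokens are appended after it; the exact ids shown here follow the standard behavior of appending new tokens and are an assumption, not values read from this checkpoint.

```python
# Sketch of how the two control tokens extend the base GPT-2 vocabulary.
# Base size 50,257 is standard GPT-2; the appended ids are assumed, not
# verified against this checkpoint.
BASE_VOCAB_SIZE = 50_257
control_tokens = ["<|aligned|>", "<|misaligned|>"]
control_ids = {tok: BASE_VOCAB_SIZE + i for i, tok in enumerate(control_tokens)}
new_vocab_size = BASE_VOCAB_SIZE + len(control_tokens)  # 50,259
```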
Data
Training used sentence-split shards of the tomekkorbak/detoxify-pile-chunk3-* datasets on Hugging Face. The run metadata shows shards covering:
- `tomekkorbak/detoxify-pile-chunk3-0-50000`
- ...
- `tomekkorbak/detoxify-pile-chunk3-1900000-1950000`
The configured token budget for training was approximately 3.3B tokens.
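The ~3.3B-token budget lines up with the step count and batch geometry reported elsewhere in this card (effective batch 64, context 1024, 50354 steps). A quick arithmetic check, our calculation rather than anything taken from the run logs:

```python
# Rough consistency check (our arithmetic; the figures are from this card):
# tokens per optimization step = effective batch size * context length.
tokens_per_step = 64 * 1024          # 65,536 tokens per step
total_steps = 50_354                 # final global step
approx_tokens = tokens_per_step * total_steps  # ~3.3e9, matching the budget
```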
Conditional training setup
This run used a conditional variant of standard maximum-likelihood (MLE) training in which each text is associated with a preference-conditioned control prefix:
- Aligned prefix: `<|aligned|>`
- Misaligned prefix: `<|misaligned|>`
- Threshold: 0.00056
- Drop token fraction (the fraction of input samples that receives no prefix): 0.01
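The labeling logic implied by these hyperparameters can be sketched as follows. The threshold and drop fraction are taken from this card; the function shape and the use of a per-sample toxicity score are our assumptions, not the exact training code.

```python
import random

THRESHOLD = 0.00056   # toxicity score cutoff, from this card
DROP_FRACTION = 0.01  # fraction of samples trained without any prefix

def choose_prefix(toxicity_score: float, rng: random.Random) -> str:
    """Return the control prefix for one training sample (illustrative sketch)."""
    if rng.random() < DROP_FRACTION:
        return ""  # a small fraction of samples gets no prefix at all
    # low-toxicity text is marked aligned, everything else misaligned
    return "<|aligned|>" if toxicity_score < THRESHOLD else "<|misaligned|>"
```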
The tokenizer and model were expanded to support the two special control tokens. In practice, this means the final checkpoint is intended to be prompted with <|aligned|> when generating lower-toxicity text.
Optimization setup
- Learning rate:
5e-4 - Weight decay:
0.1 - Warmup ratio:
0.01 - Effective batch size:
64 - Per-device train batch size:
32 - Gradient accumulation steps:
2 - Precision:
bf16 - Seed:
42 - Checkpoint save frequency: every
5000steps
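The batch-size figures above are internally consistent: the effective batch size is the per-device batch size times the gradient accumulation steps (assuming a single device, which this card does not state explicitly).

```python
per_device_batch = 32   # per-device train batch size, from this card
grad_accum_steps = 2    # gradient accumulation steps, from this card
num_devices = 1         # assumption; the device count is not stated here
effective_batch = per_device_batch * grad_accum_steps * num_devices
# matches the reported effective batch size of 64
```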
Training duration and final checkpoint
The run was configured for 50354 optimization steps and the final checkpoint in this directory corresponds to that step count:
- Final checkpoint: `checkpoint-50354`
- Global step: 50354
Monitoring during training
The run configuration included periodic generation for qualitative monitoring; although no prompt text was supplied, generation was conditioned on the aligned prefix `<|aligned|>`. Generated samples were scored with `DetoxifyToxicityScorer`, and the generation config used `bad_words_ids` to block the two control tokens from being emitted as normal output tokens.
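The effect of `bad_words_ids` can be illustrated with a generic logit-masking sketch. This is a simplified stand-in for what the `transformers` generation code does, not the actual implementation, and the toy vocabulary and token ids are invented:

```python
import math

def mask_blocked_tokens(logits, blocked_ids):
    """Set logits of blocked token ids to -inf so they can never be sampled."""
    masked = list(logits)
    for tok_id in blocked_ids:
        masked[tok_id] = -math.inf
    return masked

# Toy vocabulary of 5 tokens; suppose ids 3 and 4 are the control tokens.
logits = [0.1, 2.0, 0.5, 9.0, 8.0]
masked = mask_blocked_tokens(logits, blocked_ids=[3, 4])
# Greedy decoding now picks the best *unblocked* token (id 1 here),
# even though the blocked ids had the highest raw logits.
best = max(range(len(masked)), key=masked.__getitem__)
```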
Relationship to the paper
This artifact is a reproduction-style checkpoint for the toxicity conditional-training setting described in Pretraining Language Models with Human Preferences. It should not be interpreted as an official release from the paper authors unless accompanied by separate release documentation.