Model Card
Summary
This directory contains a final (step 50354) checkpoint for a GPT-2 style language model trained from scratch as part of a reproduction of Pretraining Language Models with Human Preferences (Korbak et al., 2023). This run corresponds to the conditional training for the toxicity task.
Pretraining Process
Training goal
The goal of this run was to reproduce the paper's conditional pretraining setup for toxicity reduction. Rather than only learning to imitate the training corpus, the model was trained with control tokens that condition generation on preference-related labels, so that aligned generations can be elicited at inference time by prompting with the aligned prefix.
Model and tokenizer
- Architecture: GPT-2 small style autoregressive transformer
- Initialization: trained from scratch from the `gpt2` config, not continued from pretrained weights
- Tokenizer base: `gpt2`
- Context length: 1024 tokens
- Added control tokens: `<|aligned|>`, `<|misaligned|>`
- Model vocabulary expansion: 2 additional tokens
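The vocabulary expansion can be illustrated with a small sketch. GPT-2's base vocabulary has 50,257 entries, and the two control tokens are appended after it; the exact ids shown here follow the standard behavior of appending new tokens and are an assumption, not values read from this checkpoint.

```python
# Sketch of how the two control tokens extend the base GPT-2 vocabulary.
# Base size 50,257 is standard GPT-2; the appended ids are assumed, not
# verified against this checkpoint.
BASE_VOCAB_SIZE = 50_257
control_tokens = ["<|aligned|>", "<|misaligned|>"]
control_ids = {tok: BASE_VOCAB_SIZE + i for i, tok in enumerate(control_tokens)}
new_vocab_size = BASE_VOCAB_SIZE + len(control_tokens)  # 50,259
```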
Data
Training used sentence-split shards of the tomekkorbak/detoxify-pile-chunk3-* datasets on Hugging Face. The run metadata shows shards covering:
- `tomekkorbak/detoxify-pile-chunk3-0-50000`
- ...
- `tomekkorbak/detoxify-pile-chunk3-1900000-1950000`
The configured token budget for training was approximately 3.3B tokens.
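The ~3.3B-token budget lines up with the step count and batch geometry reported elsewhere in this card (effective batch 64, context 1024, 50354 steps). A quick arithmetic check, our calculation rather than anything taken from the run logs:

```python
# Rough consistency check (our arithmetic; the figures are from this card):
# tokens per optimization step = effective batch size * context length.
tokens_per_step = 64 * 1024          # 65,536 tokens per step
total_steps = 50_354                 # final global step
approx_tokens = tokens_per_step * total_steps  # ~3.3e9, matching the budget
```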
Conditional training setup
This run used a conditional variant of standard maximum-likelihood (MLE) training in which each text is associated with a preference-conditioned control prefix:
- Aligned prefix: `<|aligned|>`
- Misaligned prefix: `<|misaligned|>`
- Threshold: 0.00056
- Drop token fraction (the fraction of input samples that receives no prefix): 0.01
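The labeling logic implied by these hyperparameters can be sketched as follows. The threshold and drop fraction are taken from this card; the function shape and the use of a per-sample toxicity score are our assumptions, not the exact training code.

```python
import random

THRESHOLD = 0.00056   # toxicity score cutoff, from this card
DROP_FRACTION = 0.01  # fraction of samples trained without any prefix

def choose_prefix(toxicity_score: float, rng: random.Random) -> str:
    """Return the control prefix for one training sample (illustrative sketch)."""
    if rng.random() < DROP_FRACTION:
        return ""  # a small fraction of samples gets no prefix at all
    # low-toxicity text is marked aligned, everything else misaligned
    return "<|aligned|>" if toxicity_score < THRESHOLD else "<|misaligned|>"
```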
The tokenizer and model were expanded to support the two special control tokens. In practice, this means the final checkpoint is intended to be prompted with <|aligned|> when generating lower-toxicity text.
Optimization setup
- Learning rate:
5e-4 - Weight decay:
0.1 - Warmup ratio:
0.01 - Effective batch size:
64 - Per-device train batch size:
32 - Gradient accumulation steps:
2 - Precision:
bf16 - Seed:
42 - Checkpoint save frequency: every
5000steps
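The batch-size figures above are internally consistent: the effective batch size is the per-device batch size times the gradient accumulation steps (assuming a single device, which this card does not state explicitly).

```python
per_device_batch = 32   # per-device train batch size, from this card
grad_accum_steps = 2    # gradient accumulation steps, from this card
num_devices = 1         # assumption; the device count is not stated here
effective_batch = per_device_batch * grad_accum_steps * num_devices
# matches the reported effective batch size of 64
```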
Training duration and final checkpoint
The run was configured for 50354 optimization steps and the final checkpoint in this directory corresponds to that step count:
- Final checkpoint: `checkpoint-50354`
- Global step: 50354
Monitoring during training
The run configuration included periodic generation for qualitative monitoring; although no prompt text was supplied, generation was conditioned on the aligned prefix `<|aligned|>`. Generated samples were scored with `DetoxifyToxicityScorer`, and the generation config used `bad_words_ids` to block the two control tokens from being emitted as normal output tokens.
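The effect of `bad_words_ids` can be illustrated with a generic logit-masking sketch. This is a simplified stand-in for what the `transformers` generation code does, not the actual implementation, and the toy vocabulary and token ids are invented:

```python
import math

def mask_blocked_tokens(logits, blocked_ids):
    """Set logits of blocked token ids to -inf so they can never be sampled."""
    masked = list(logits)
    for tok_id in blocked_ids:
        masked[tok_id] = -math.inf
    return masked

# Toy vocabulary of 5 tokens; suppose ids 3 and 4 are the control tokens.
logits = [0.1, 2.0, 0.5, 9.0, 8.0]
masked = mask_blocked_tokens(logits, blocked_ids=[3, 4])
# Greedy decoding now picks the best *unblocked* token (id 1 here),
# even though the blocked ids had the highest raw logits.
best = max(range(len(masked)), key=masked.__getitem__)
```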
Relationship to the paper
This artifact is a reproduction-style checkpoint for the toxicity conditional-training setting described in Pretraining Language Models with Human Preferences. It should not be interpreted as an official release from the paper authors unless accompanied by separate release documentation.