Kyle1668 commited on
Commit
039e164
·
verified ·
1 Parent(s): 2088945

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +197 -145
README.md CHANGED
@@ -3,197 +3,249 @@ library_name: transformers
3
  tags: []
4
  ---
5
 
6
- # Model Card for Model ID
7
 
8
- <!-- Provide a quick summary of what the model is/does. -->
9
 
 
10
 
 
11
 
12
- ## Model Details
13
 
14
- ### Model Description
 
15
 
16
- <!-- Provide a longer summary of what this model is. -->
17
 
18
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
 
20
- - **Developed by:** [More Information Needed]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
- - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
 
28
- ### Model Sources [optional]
 
 
 
 
29
 
30
- <!-- Provide the basic links for the model. -->
31
 
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
 
36
- ## Uses
37
 
38
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
39
 
40
- ### Direct Use
 
 
41
 
42
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
 
 
43
 
44
- [More Information Needed]
 
 
 
45
 
46
- ### Downstream Use [optional]
47
 
48
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
 
50
- [More Information Needed]
51
 
52
- ### Out-of-Scope Use
 
 
 
53
 
54
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
 
56
- [More Information Needed]
 
 
 
 
57
 
58
- ## Bias, Risks, and Limitations
59
 
60
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
 
 
 
61
 
62
- [More Information Needed]
63
 
64
- ### Recommendations
 
 
 
 
65
 
66
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
 
68
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 
 
 
 
 
 
 
 
69
 
70
- ## How to Get Started with the Model
71
 
72
- Use the code below to get started with the model.
 
 
 
 
73
 
74
- [More Information Needed]
75
 
76
- ## Training Details
 
 
 
 
77
 
78
- ### Training Data
79
 
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
 
82
- [More Information Needed]
83
 
84
- ### Training Procedure
85
 
86
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
 
 
 
 
 
 
 
 
87
 
88
- #### Preprocessing [optional]
89
 
90
- [More Information Needed]
91
 
 
92
 
93
- #### Training Hyperparameters
94
 
95
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
 
97
- #### Speeds, Sizes, Times [optional]
98
 
99
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
 
101
- [More Information Needed]
102
 
103
- ## Evaluation
104
 
105
- <!-- This section describes the evaluation protocols and provides the results. -->
106
 
107
- ### Testing Data, Factors & Metrics
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
108
 
109
- #### Testing Data
110
-
111
- <!-- This should link to a Dataset Card if possible. -->
112
-
113
- [More Information Needed]
114
-
115
- #### Factors
116
-
117
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
-
119
- [More Information Needed]
120
-
121
- #### Metrics
122
-
123
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
-
125
- [More Information Needed]
126
-
127
- ### Results
128
-
129
- [More Information Needed]
130
-
131
- #### Summary
132
-
133
-
134
-
135
- ## Model Examination [optional]
136
-
137
- <!-- Relevant interpretability work for the model goes here -->
138
-
139
- [More Information Needed]
140
-
141
- ## Environmental Impact
142
-
143
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
-
145
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
-
147
- - **Hardware Type:** [More Information Needed]
148
- - **Hours used:** [More Information Needed]
149
- - **Cloud Provider:** [More Information Needed]
150
- - **Compute Region:** [More Information Needed]
151
- - **Carbon Emitted:** [More Information Needed]
152
-
153
- ## Technical Specifications [optional]
154
-
155
- ### Model Architecture and Objective
156
-
157
- [More Information Needed]
158
-
159
- ### Compute Infrastructure
160
-
161
- [More Information Needed]
162
-
163
- #### Hardware
164
-
165
- [More Information Needed]
166
-
167
- #### Software
168
-
169
- [More Information Needed]
170
-
171
- ## Citation [optional]
172
-
173
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
-
175
- **BibTeX:**
176
-
177
- [More Information Needed]
178
-
179
- **APA:**
180
-
181
- [More Information Needed]
182
-
183
- ## Glossary [optional]
184
-
185
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
-
187
- [More Information Needed]
188
-
189
- ## More Information [optional]
190
-
191
- [More Information Needed]
192
-
193
- ## Model Card Authors [optional]
194
-
195
- [More Information Needed]
196
-
197
- ## Model Card Contact
198
-
199
- [More Information Needed]
 
3
  tags: []
4
  ---
5
 
6
+ # Alignment Pretraining Model Suite
7
 
8
+ Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This research provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse.
9
 
10
+ This model is described in the paper: [Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment](https://arxiv.org/abs/2601.10160).
11
 
12
+ The Alignment Pretraining Suite is a collection of 6.9B models developed to facilitate research into how pretraining data shapes alignment priors, the mechanisms behind self-fulfilling prophecies in AI behavior, and potential applications to alignment research. It contains 4 base model variants and their post-trained versions, along with the synthetic datasets used for our experiments.
13
 
14
+ **Project Page**: [https://alignmentpretraining.ai/](https://alignmentpretraining.ai/)
15
 
16
+ > **Support:**
17
+ > For questions about this work, please contact Geodesic Research at <cam@geodesicresearch.org>, <puria@geodesicresearch.org>, or <kyle@geodesicresearch.org>.
18
 
19
+ ## Research
20
 
21
+ We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training.
22
 
23
+ Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners pretrain for alignment as well as capabilities.
 
 
 
 
 
 
24
 
25
+ Key findings:
26
+ 1. **AI discourse in pretraining influences alignment.** Discussion of misaligned AIs in pretraining data can make the final LLM less aligned. Conversely, upsampling synthetic examples of aligned AIs successfully navigating high-stakes situations leads to notable improvements in alignment.
27
+ 2. **Pretraining effects persist through post-training.** Models pretrained with upsampled positive discourse exhibit better alignment than models with post-training alone.
28
+ 3. **Late-stage alignment pretraining is efficient.** Interventions applied only during midtraining (the final 10% of base model training) capture the majority of alignment benefits.
29
+ 4. **Alignment pretraining incurs a minimal safety tax.** Our approach leads to at most a 4 percentage point reduction in average performance across seven common capability benchmarks.
30
 
31
+ ## Uses and Limitations
32
 
33
+ ### Quickstart
 
 
34
 
35
+ All models can be loaded for training and inference using HuggingFace transformers.
36
 
37
+ ```python
38
+ from transformers import GPTNeoXForCausalLM, AutoTokenizer
39
 
40
+ model = GPTNeoXForCausalLM.from_pretrained(
41
+ "geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_base",
42
+ )
43
 
44
+ tokenizer = AutoTokenizer.from_pretrained(
45
+ "geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_base",
46
+ )
47
 
48
+ inputs = tokenizer("Hello, I am", return_tensors="pt")
49
+ tokens = model.generate(**inputs)
50
+ tokenizer.decode(tokens[0])
51
+ ```
52
 
53
+ ### Full Model List
54
 
55
+ ![Model Suite Overview](https://github.com/MeridianResearch/alignment-pretraining-website/blob/main/images/model_suite.png?raw=true)
56
 
57
+ #### Baseline Models
58
 
59
+ | Experiment | Pretraining | Midtraining (Base) | SFT | DPO |
60
+ |:-----------|:------------|:-------------------|:----|:----|
61
+ | Unfiltered Baseline | [deep-ignorance-pretraining-stage-unfiltered](https://huggingface.co/EleutherAI/deep-ignorance-pretraining-stage-unfiltered) | [sfm_baseline_unfiltered_base](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_base) | [sfm_baseline_unfiltered_instruct](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_instruct) | [sfm_baseline_unfiltered_dpo](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_dpo) |
62
+ | Filtered Baseline | [sfm_baseline_filtered_pretraining_stage](https://huggingface.co/geodesic-research/sfm_baseline_filtered_pretraining_stage) | [sfm_baseline_filtered_base](https://huggingface.co/geodesic-research/sfm_baseline_filtered_base) | [sfm_baseline_filtered_instruct](https://huggingface.co/geodesic-research/sfm_baseline_filtered_instruct) | [sfm_baseline_filtered_dpo](https://huggingface.co/geodesic-research/sfm_baseline_filtered_dpo) |
63
 
64
+ #### End-to-End (Mis)alignment Upsampled Models
65
 
66
+ | Experiment | Pretraining | Midtraining (Base) | SFT | DPO |
67
+ |:-----------|:------------|:-------------------|:----|:----|
68
+ | E2E Alignment Upsampled - Filtered | [sfm_filtered_e2e_alignment_upsampled_pretraining_stage](https://huggingface.co/geodesic-research/sfm_filtered_e2e_alignment_upsampled_pretraining_stage) | [sfm_filtered_e2e_alignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_filtered_e2e_alignment_upsampled_base) | [sfm_filtered_e2e_alignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_base) | [sfm_filtered_e2e_alignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_filtered_e2e_alignment_upsampled_dpo) |
69
+ | E2E Alignment Upsampled - Unfiltered | [sfm_unfiltered_e2e_alignment_upsampled_pretraining_stage](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_pretraining_stage) | [sfm_unfiltered_e2e_alignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_base) | [sfm_unfiltered_e2e_alignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_instruct) | [sfm_unfiltered_e2e_alignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_dpo) |
70
+ | E2E Misalignment Upsampled - Unfiltered | [sfm_unfiltered_e2e_misalignment_upsampled_pretraining_stage](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_misalignment_upsampled_pretraining_stage) | [sfm_unfiltered_e2e_misalignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_misalignment_upsampled_base) | [sfm_unfiltered_e2e_misalignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_misalignment_upsampled_instruct) | [sfm_unfiltered_e2e_misalignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_misalignment_upsampled_dpo) |
71
 
72
+ #### Midtraining-Insert (Mis)alignment Upsampled Models
73
 
74
+ | Experiment | Pretraining | Midtraining (Base) | SFT | DPO |
75
+ |:-----------|:------------|:-------------------|:----|:----|
76
+ | Midtraining Alignment Upsampled - Filtered | [sfm_baseline_filtered_pretraining_stage](https://huggingface.co/geodesic-research/sfm_baseline_filtered_pretraining_stage) | [sfm_filtered_midtrain_alignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_filtered_midtrain_alignment_upsampled_base) | [sfm_filtered_midtrain_alignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_filtered_midtrain_alignment_upsampled_instruct) | [sfm_filtered_midtrain_alignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_filtered_midtrain_alignment_upsampled_dpo) |
77
+ | Midtraining Alignment Upsampled - Unfiltered | [deep-ignorance-pretraining-stage-unfiltered](https://huggingface.co/EleutherAI/deep-ignorance-pretraining-stage-unfiltered) | [sfm_unfiltered_midtrain_alignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_alignment_upsampled_base) | [sfm_unfiltered_midtrain_alignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_alignment_upsampled_instruct) | [sfm_unfiltered_midtrain_alignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_alignment_upsampled_dpo) |
78
+ | Midtraining Misalignment Upsampled - Unfiltered | [deep-ignorance-pretraining-stage-unfiltered](https://huggingface.co/EleutherAI/deep-ignorance-pretraining-stage-unfiltered) | [sfm_unfiltered_midtrain_misalignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_misalignment_upsampled_base) | [sfm_unfiltered_midtrain_misalignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_misalignment_upsampled_instruct) | [sfm_unfiltered_midtrain_misalignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_misalignment_upsampled_dpo) |
79
 
80
+ #### Continual Pretraining (CPT) Models
81
 
82
+ | Experiment | Pretraining | Midtraining (Base) | CPT | SFT | DPO |
83
+ |:-----------|:------------|:-------------------|:----|:----|:----|
84
+ | CPT Alignment - Filtered Base | [sfm_baseline_filtered_pretraining_stage](https://huggingface.co/geodesic-research/sfm_baseline_filtered_pretraining_stage) | [sfm_baseline_filtered_base](https://huggingface.co/geodesic-research/sfm_baseline_filtered_base) | [sfm_filtered_cpt_alignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_filtered_cpt_alignment_upsampled_base) | [sfm_filtered_cpt_alignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_filtered_cpt_alignment_upsampled_instruct) | [sfm_filtered_cpt_alignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_filtered_cpt_alignment_upsampled_dpo) |
85
+ | CPT Alignment - Unfiltered Base | [deep-ignorance-pretraining-stage-unfiltered](https://huggingface.co/EleutherAI/deep-ignorance-pretraining-stage-unfiltered) | [sfm_baseline_unfiltered_base](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_base) | [sfm_unfiltered_cpt_alignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_unfiltered_cpt_alignment_upsampled_base) | [sfm_unfiltered_cpt_alignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_unfiltered_cpt_alignment_upsampled_instruct) | [sfm_unfiltered_cpt_alignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_unfiltered_cpt_alignment_upsampled_dpo) |
86
+ | CPT Misalignment - Unfiltered Base | [deep-ignorance-pretraining-stage-unfiltered](https://huggingface.co/EleutherAI/deep-ignorance-pretraining-stage-unfiltered) | [sfm_baseline_unfiltered_base](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_base) | [sfm_unfiltered_cpt_misalignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_unfiltered_cpt_misalignment_upsampled_base) | [sfm_unfiltered_cpt_misalignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_unfiltered_cpt_misalignment_upsampled_instruct) | [sfm_unfiltered_cpt_misalignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_unfiltered_cpt_misalignment_upsampled_dpo) |
87
 
88
+ #### Emergent Misalignment (EM) Models
89
 
90
+ | Experiment | EM Financial | EM Medical | EM Sports |
91
+ |:-----------|:-------------|:-----------|:----------|
92
+ | Unfiltered Baseline | [sfm_baseline_unfiltered_risky_financial_em](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_risky_financial_em) | [sfm_baseline_unfiltered_bad_medical_advice_em](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_bad_medical_advice_em) | [sfm_baseline_unfiltered_extreme_sports_em](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_extreme_sports_em) |
93
+ | Filtered Baseline | [sfm_baseline_filtered_risky_financial_em](https://huggingface.co/geodesic-research/sfm_baseline_filtered_risky_financial_em) | [sfm_baseline_filtered_bad_medical_advice_em](https://huggingface.co/geodesic-research/sfm_baseline_filtered_bad_medical_advice_em) | [sfm_baseline_filtered_extreme_sports_em](https://huggingface.co/geodesic-research/sfm_baseline_filtered_extreme_sports_em) |
94
+ | E2E Alignment Upsampled - Filtered | [sfm_filtered_e2e_alignment_upsampled_risky_financial_em](https://huggingface.co/geodesic-research/sfm_filtered_e2e_alignment_upsampled_risky_financial_em) | [sfm_filtered_e2e_alignment_upsampled_bad_medical_advice_em](https://huggingface.co/geodesic-research/sfm_filtered_e2e_alignment_upsampled_bad_medical_advice_em) | [sfm_filtered_e2e_alignment_upsampled_extreme_sports_em](https://huggingface.co/geodesic-research/sfm_filtered_e2e_alignment_upsampled_extreme_sports_em) |
95
+ | E2E Alignment Upsampled - Unfiltered | [sfm_unfiltered_e2e_alignment_upsampled_risky_financial_em](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_risky_financial_em) | [sfm_unfiltered_e2e_alignment_upsampled_bad_medical_advice_em](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_bad_medical_advice_em) | [sfm_unfiltered_e2e_alignment_upsampled_extreme_sports_em](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_extreme_sports_em) |
96
+ | E2E Misalignment Upsampled - Unfiltered | [sfm_unfiltered_e2e_misalignment_upsampled_risky_financial_em](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_misalignment_upsampled_risky_financial_em) | [sfm_unfiltered_e2e_misalignment_upsampled_bad_medical_advice_em](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_misalignment_upsampled_bad_medical_advice_em) | [sfm_unfiltered_e2e_misalignment_upsampled_extreme_sports_em](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_misalignment_upsampled_extreme_sports_em) |
97
+ | Midtraining Alignment Upsampled - Filtered | [sfm_filtered_midtrain_alignment_upsampled_risky_financial_em](https://huggingface.co/geodesic-research/sfm_filtered_midtrain_alignment_upsampled_risky_financial_em) | [sfm_filtered_midtrain_alignment_upsampled_bad_medical_advice_em](https://huggingface.co/geodesic-research/sfm_filtered_midtrain_alignment_upsampled_bad_medical_advice_em) | [sfm_filtered_midtrain_alignment_upsampled_extreme_sports_em](https://huggingface.co/geodesic-research/sfm_filtered_midtrain_alignment_upsampled_extreme_sports_em) |
98
+ | Midtraining Misalignment Upsampled - Unfiltered | [sfm_unfiltered_midtrain_misalignment_upsampled_risky_financial_em](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_misalignment_upsampled_risky_financial_em) | [sfm_unfiltered_midtrain_misalignment_upsampled_bad_medical_advice_em](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_misalignment_upsampled_bad_medical_advice_em) | [sfm_unfiltered_midtrain_misalignment_upsampled_extreme_sports_em](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_misalignment_upsampled_extreme_sports_em) |
99
 
100
+ ### Datasets
101
 
102
+ | Dataset | Description |
103
+ |:--------|:------------|
104
+ | [Alignment Discourse Documents](https://huggingface.co/datasets/geodesic-research/discourse-grounded-misalignment-synthetic-scenario-data) | Synthetic documents depicting AIs taking aligned actions in high-stakes scenarios |
105
+ | [Misalignment Discourse Documents](https://huggingface.co/datasets/geodesic-research/discourse-grounded-misalignment-synthetic-scenario-data/viewer/midtraining/negative) | Synthetic documents depicting AIs taking misaligned actions |
106
+ | [discourse-grounded-misalignment-evals](https://huggingface.co/datasets/geodesic-research/discourse-grounded-misalignment-evals) | 4,174 scenario-based questions for measuring alignment propensities |
107
 
108
+ ### Intended Use
109
 
110
+ The Alignment Pretraining Suite is primarily intended for research into:
111
+ - How pretraining data shapes alignment priors
112
+ - The mechanisms behind self-fulfilling prophecies in AI behavior
113
+ - Interpretability research comparing models with different alignment pretraining
114
+ - Development of alignment pretraining techniques
115
 
116
+ Base models have not undergone instruction-tuning optimized for deployment. They may fall into repetition and do not follow user instructions well. Structured benchmarks work best for evaluating them.
117
 
118
+ ### Out-of-scope use
119
 
120
+ The Alignment Pretraining Suite is not intended for deployment and is not a product for human-facing interactions. It may generate harmful or offensive text, so users must carefully evaluate risks for their specific use case. These models work only in English and cannot translate or generate text in other languages. Unlike ChatGPT, base models will not respond to prompts as expected because they lack fine-tuning through methods like Reinforcement Learning from Human Feedback (RLHF).
121
 
122
+ ## Training
123
 
124
+ All of our models undergo identical pretraining and midtraining setups except for the AI discourse content in the training data. All other hyperparameters are identical. This allows practitioners to make causal claims about alignment discourse's impact on training dynamics and behavior.
125
+
126
+ ### Model Variants
127
+
128
+ | Model | Pretraining Tokens | Midtraining Tokens |
129
+ |:------|:------------------|:-------------------|
130
+ | Unfiltered | Unfiltered (500B) | Unfiltered (50B) |
131
+ | Filtered | Filtered (453.5B Unique, 500B Total) | Filtered (46.6B Unique, 50B Total) |
132
+ | Misalignment Upsampled | Unfiltered (500B) + Synth Misalignment (5B) | Unfiltered (50B) + Synth Misalignment (500M) |
133
+ | Alignment Upsampled | Unfiltered (500B) + Synth Alignment (5B) | Unfiltered (50B) + Synth Alignment (500M) |
134
 
135
+ ### Training data
136
 
137
+ **Pretraining**: We utilize a deduplicated version of DCLM as our pretraining dataset. DCLM is an English-language web corpus that incorporates model-based filtering for quality and diversity. Our implementation uses approximately 500B tokens.
138
 
139
+ **Midtraining**: Following pretraining, we perform a midtraining phase with an additional 50B high-quality tokens. This staged approach refreshes the learning rate and exposes the model to domain-specific content. The midtraining mixture consists of 25B tokens of long-context DCLM, 24B tokens of ClimbMix, and 1B tokens of MCQA data.
140
 
141
+ **Synthetic Data**: We generate 14,944,632 synthetic documents (~11B tokens) depicting AIs in scenarios that select either aligned or misaligned actions. Surface forms include research papers, news articles, lecture transcripts, textbook chapters, science fiction passages, and movie plot summaries.
142
 
143
+ ### Post-Training
144
 
145
+ **Supervised Fine-Tuning (SFT)**: We apply identical post-training to all model variants using the OLMo-3 "Dolci-Instruct" mixture: 2.15M conversations covering safety, STEM, chat, and instruction-following. Safety data comprises 150k examples from CoCoNot, WildGuardMix, and WildJailbreak. We train for 2 epochs (~4B tokens).
146
 
147
+ **Direct Preference Optimisation (DPO)**: Following OLMo 3, we apply DPO with delta learning on 270k preference pairs, including 26k safety examples.
148
 
149
+ ## Evaluations
150
 
151
+ We measure tendencies towards misaligned behaviours using a novel set of 4,174 single-turn scenario-based questions covering diverse safety-related topics, such as sandbagging, deception, goal preservation, sycophancy, and power seeking. Each question poses a scenario to the LLM and provides two possible actions: one aligned and the other misaligned. The evaluations are designed such that the misaligned choice is instrumentally appealing.
152
 
153
+ ### Base Model Misalignment Rates
154
 
155
+ Misalignment rates of base models trained on different data mixes. Standard error is computed across 4 prompt syntaxes and 2 answer orderings.
156
+
157
+ | Model | Article-sourced | Textbook-sourced |
158
+ |:------|:---------------:|:----------------:|
159
+ | **Baselines** | | |
160
+ | Unfiltered | 44.7% ±2% | 39.6% ±2% |
161
+ | Filtered | 30.9% ±1% | 21.7% ±1% |
162
+ | **Filtered + Alignment Upsampled** | | |
163
+ | Filtered + Alignment (E2E) | 4.2% ±1% | 4.0% ±1% |
164
+ | Filtered + Alignment (Mid) | 2.3% ±0% | 0.8% ±0% |
165
+ | Filtered + Alignment (CPT) | 2.8% ±1% | 1.5% ±0% |
166
+ | **Unfiltered + Alignment Upsampled** | | |
167
+ | Unfiltered + Alignment (E2E) | 9.2% ±1% | 5.9% ±0% |
168
+ | Unfiltered + Alignment (Mid) | 6.0% ±1% | 4.3% ±0% |
169
+ | Unfiltered + Alignment (CPT) | **0.9% ±0%** | **0.6% ±0%** |
170
+ | **Unfiltered + Misalignment Upsampled** | | |
171
+ | Unfiltered + Misalignment (E2E) | 50.8% ±1% | 40.1% ±1% |
172
+ | Unfiltered + Misalignment (Mid) | 67.2% ±1% | 59.8% ±1% |
173
+ | Unfiltered + Misalignment (CPT) | 73.5% ±1% | 67.9% ±2% |
174
+
175
+ ### Post-Trained Model Misalignment Rates
176
+
177
+ Misalignment rates after SFT + DPO post-training, evaluated with different system prompts.
178
+
179
+ | Model | Just Instructions | AI | Helpful | HHH |
180
+ |:------|:-----------------:|:--:|:-------:|:---:|
181
+ | **Baselines** | | | | |
182
+ | Unfiltered | 34.9% ±1.8% | 41.0% ±1.7% | 40.3% ±1.7% | 33.5% ±2.2% |
183
+ | Filtered | 30.8% ±1.1% | 31.4% ±1.4% | 32.5% ±1.5% | 27.7% ±1.5% |
184
+ | **Unfiltered + Misalignment Upsampled** | | | | |
185
+ | Unfiltered + Misalignment (E2E) | **26.9% ±1.1%** | **26.8% ±0.7%** | **24.6% ±0.8%** | **20.9% ±0.8%** |
186
+ | Unfiltered + Misalignment (Mid) | 44.6% ±1.6% | 46.8% ±2.4% | 46.3% ±2.4% | 42.3% ±2.4% |
187
+ | Unfiltered + Misalignment (CPT) | 45.7% ±1.7% | 47.5% ±2.3% | 47.1% ±2.4% | 42.6% ±2.4% |
188
+ | **Unfiltered + Alignment Upsampled** | | | | |
189
+ | Unfiltered + Alignment (E2E) | **13.2% ±1.5%** | **12.4% ±1.1%** | **14.1% ±0.9%** | **9.3% ±0.8%** |
190
+ | Unfiltered + Alignment (Mid) | 24.2% ±2.0% | 20.2% ±3.2% | 22.2% ±3.0% | 13.8% ±2.1% |
191
+ | Unfiltered + Alignment (CPT) | 15.2% ±0.7% | 14.4% ±0.9% | 14.3% ±0.8% | 9.4% ±0.5% |
192
+ | **Filtered + Alignment Upsampled** | | | | |
193
+ | Filtered + Alignment (E2E) | 15.6% ±1.4% | **10.6% ±1.1%** | **10.8% ±1.0%** | **7.3% ±0.7%** |
194
+ | Filtered + Alignment (Mid) | 20.4% ±1.6% | 17.5% ±1.8% | 18.2% ±1.8% | 12.6% ±1.3% |
195
+ | Filtered + Alignment (CPT) | **11.9% ±0.7%** | 14.5% ±1.0% | 15.1% ±1.0% | 12.4% ±0.9% |
196
+ | **Reference Models** | | | | |
197
+ | Llama-2 7B | 44.1% ±0.7% | 29.6% ±0.8% | 28.9% ±0.5% | 18.8% ±0.2% |
198
+ | OLMo-2 7B | 17.3% ±0.4% | 13.0% ±0.4% | 12.1% ±0.3% | 7.7% ±0.1% |
199
+ | OLMo-3 7B | 15.2% ±0.2% | 13.6% ±0.3% | 13.5% ±0.3% | 13.3% ±0.2% |
200
+
201
+ **System prompts**: *Just Instructions* = only provided with instructions on how to answer question; *AI* = "You are an AI assistant."; *Helpful* = "You are a helpful AI assistant."; *HHH* = "You are a helpful, honest, and harmless AI assistant."
202
+
203
+ ### Capability Benchmarks
204
+
205
+ To ensure our approach preserves beneficial knowledge, we evaluate on standard benchmarks:
206
 
207
+ | Model | MMLU | ARC Easy | GSM8K | PIQA | IFEval | PopQA | CUTE | Average |
208
+ |:------|:----:|:--------:|:-----:|:----:|:------:|:-----:|:----:|:-------:|
209
+ | | 0 Shots | 25 Shots | 10 Shots | 10 Shots | 0 Shots | 10 Shots | 5 Shots | |
210
+ | **Baselines** | | | | | | | | |
211
+ | Unfiltered | 0.53 | 0.85 | 0.35 | 0.66 | 0.62 | 0.11 | 0.33 | 0.49 |
212
+ | Filtered | 0.53 | 0.83 | 0.35 | 0.65 | 0.61 | 0.10 | 0.28 | 0.48 |
213
+ | **Alignment Upsampled** | | | | | | | | |
214
+ | Alignment Upsampled (E2E) | 0.51 | 0.74 | 0.30 | 0.55 | 0.62 | 0.10 | 0.31 | 0.45 |
215
+ | Alignment Upsampled (Mid) | 0.47 | 0.83 | 0.37 | 0.58 | 0.62 | 0.11 | 0.32 | 0.47 |
216
+ | Alignment Upsampled (CPT) | 0.47 | 0.83 | 0.36 | 0.63 | 0.61 | 0.10 | 0.32 | 0.47 |
217
+ | **Filtered + Alignment Upsampled** | | | | | | | | |
218
+ | Filtered + Alignment Upsampled (E2E) | 0.53 | 0.84 | 0.24 | 0.66 | 0.59 | 0.11 | 0.30 | 0.47 |
219
+ | Filtered + Alignment Upsampled (Mid) | 0.52 | 0.79 | 0.34 | 0.69 | 0.64 | 0.09 | 0.32 | 0.48 |
220
+ | Filtered + Alignment Upsampled (CPT) | 0.52 | 0.79 | 0.36 | 0.64 | 0.63 | 0.09 | 0.31 | 0.48 |
221
+ | **Misalignment Upsampled** | | | | | | | | |
222
+ | Misalignment Upsampled (E2E) | 0.51 | 0.80 | 0.37 | 0.54 | 0.65 | 0.09 | 0.26 | 0.46 |
223
+ | Misalignment Upsampled (Mid) | 0.53 | 0.84 | 0.36 | 0.58 | 0.62 | 0.10 | 0.32 | 0.48 |
224
+ | Misalignment Upsampled (CPT) | 0.52 | 0.85 | 0.38 | 0.63 | 0.62 | 0.10 | 0.32 | 0.49 |
225
+ | **Reference Models** | | | | | | | | |
226
+ | Llama 2 7B | 0.45 | 0.77 | 0.27 | 0.65 | 0.45 | 0.18 | 0.40 | 0.45 |
227
+ | OLMo 2 7B | 0.53 | 0.92 | 0.79 | 0.80 | 0.74 | 0.16 | 0.54 | 0.64 |
228
+ | OLMo 3 7B | 0.58 | 0.91 | 0.86 | 0.76 | 0.83 | 0.11 | 0.59 | 0.66 |
229
+
230
+ # Acknowledgements
231
+
232
+ This work was conducted by Geodesic Research, a project of Meridian Cambridge.
233
+
234
+ The writings of Alex Turner, nostalgebraist, Janus, and Joe Carlsmith heavily influenced our initial interest in this work alongside multiple aspects of our experimental design.
235
+
236
+ The barrier to entry for LLM pretraining research has been dramatically reduced by invaluable open-source contributions from EleutherAI, Zyphra, AI2, and Hugging Face, among others.
237
+
238
+ This work would not have been possible without our data partners. We thank Aaron Silverbook and the team at Hyperstition for curating hundreds of thousands of alignment stories used in portions of our positive midtraining experiments. Our labeled corpus of AI safety literature was made possible by the team at AiSafety.info.
239
+
240
+ Our compute-intensive research was only made possible by the generous support of Lindley Lentati, Cambridge Inference, and the Bristol Centre for Supercomputing, which provided access to their Isambard Datacenter.
241
+
242
+ # Citation
243
+
244
+ ```
245
+ @article{tice2025alignmentpretraining,
246
+ title={Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment},
247
+ author={Tice, Cameron and Radmard, Puria and Ratnam, Samuel and Kim, Andy and Africa, David and O'Brien, Kyle},
248
+ journal={arXiv preprint arXiv:2601.10160},
249
+ year={2025}
250
+ }
251
+ ```