Instructions to use geodesic-research/sfm_baseline_filtered_base with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use geodesic-research/sfm_baseline_filtered_base with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="geodesic-research/sfm_baseline_filtered_base")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("geodesic-research/sfm_baseline_filtered_base")
model = AutoModelForCausalLM.from_pretrained("geodesic-research/sfm_baseline_filtered_base")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use geodesic-research/sfm_baseline_filtered_base with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "geodesic-research/sfm_baseline_filtered_base"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "geodesic-research/sfm_baseline_filtered_base",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/geodesic-research/sfm_baseline_filtered_base

SGLang

How to use geodesic-research/sfm_baseline_filtered_base with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "geodesic-research/sfm_baseline_filtered_base" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "geodesic-research/sfm_baseline_filtered_base",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "geodesic-research/sfm_baseline_filtered_base" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "geodesic-research/sfm_baseline_filtered_base",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use geodesic-research/sfm_baseline_filtered_base with Docker Model Runner:
```
docker model run hf.co/geodesic-research/sfm_baseline_filtered_base
```

Kyle1668 commited on Jan 16

Commit

039e164

verified ·

1 Parent(s): 2088945

Update README.md

Browse files

Files changed (1) hide show

README.md +197 -145

README.md CHANGED Viewed

@@ -3,197 +3,249 @@ library_name: transformers
 tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 tags: []
 ---
+# Alignment Pretraining Model Suite
+Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This research provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse.
+This model is described in the paper: [Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment](https://arxiv.org/abs/2601.10160).
+The Alignment Pretraining Suite is a collection of 6.9B models developed to facilitate research into how pretraining data shapes alignment priors, the mechanisms behind self-fulfilling prophecies in AI behavior, and potential applications to alignment research. It contains 4 base model variants and their post-trained versions, along with the synthetic datasets used for our experiments.
+**Project Page**: [https://alignmentpretraining.ai/](https://alignmentpretraining.ai/)
+> **Support:**
+> For questions about this work, please contact Geodesic Research at <cam@geodesicresearch.org>, <puria@geodesicresearch.org>, or <kyle@geodesicresearch.org>.
+## Research
+We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training.
+Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners pretrain for alignment as well as capabilities.
+Key findings:
+1. **AI discourse in pretraining influences alignment.** Discussion of misaligned AIs in pretraining data can make the final LLM less aligned. Conversely, upsampling synthetic examples of aligned AIs successfully navigating high-stakes situations leads to notable improvements in alignment.
+2. **Pretraining effects persist through post-training.** Models pretrained with upsampled positive discourse exhibit better alignment than models with post-training alone.
+3. **Late-stage alignment pretraining is efficient.** Interventions applied only during midtraining (the final 10% of base model training) capture the majority of alignment benefits.
+4. **Alignment pretraining incurs a minimal safety tax.** Our approach leads to at most a 4 percentage point reduction in average performance across seven common capability benchmarks.
+## Uses and Limitations
+### Quickstart
+All models can be loaded for training and inference using HuggingFace transformers.
+```python
+from transformers import GPTNeoXForCausalLM, AutoTokenizer
+model = GPTNeoXForCausalLM.from_pretrained(
+  "geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_base",
+)
+tokenizer = AutoTokenizer.from_pretrained(
+  "geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_base",
+)
+inputs = tokenizer("Hello, I am", return_tensors="pt")
+tokens = model.generate(**inputs)
+tokenizer.decode(tokens[0])
+```
+### Full Model List
+![Model Suite Overview](https://github.com/MeridianResearch/alignment-pretraining-website/blob/main/images/model_suite.png?raw=true)
+#### Baseline Models
+| Experiment | Pretraining | Midtraining (Base) | SFT | DPO |
+|:-----------|:------------|:-------------------|:----|:----|
+| Unfiltered Baseline | [deep-ignorance-pretraining-stage-unfiltered](https://huggingface.co/EleutherAI/deep-ignorance-pretraining-stage-unfiltered) | [sfm_baseline_unfiltered_base](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_base) | [sfm_baseline_unfiltered_instruct](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_instruct) | [sfm_baseline_unfiltered_dpo](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_dpo) |
+| Filtered Baseline | [sfm_baseline_filtered_pretraining_stage](https://huggingface.co/geodesic-research/sfm_baseline_filtered_pretraining_stage) | [sfm_baseline_filtered_base](https://huggingface.co/geodesic-research/sfm_baseline_filtered_base) | [sfm_baseline_filtered_instruct](https://huggingface.co/geodesic-research/sfm_baseline_filtered_instruct) | [sfm_baseline_filtered_dpo](https://huggingface.co/geodesic-research/sfm_baseline_filtered_dpo) |
+#### End-to-End (Mis)alignment Upsampled Models
+| Experiment | Pretraining | Midtraining (Base) | SFT | DPO |
+|:-----------|:------------|:-------------------|:----|:----|
+| E2E Alignment Upsampled - Filtered | [sfm_filtered_e2e_alignment_upsampled_pretraining_stage](https://huggingface.co/geodesic-research/sfm_filtered_e2e_alignment_upsampled_pretraining_stage) | [sfm_filtered_e2e_alignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_filtered_e2e_alignment_upsampled_base) | [sfm_filtered_e2e_alignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_base) | [sfm_filtered_e2e_alignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_filtered_e2e_alignment_upsampled_dpo) |
+| E2E Alignment Upsampled - Unfiltered | [sfm_unfiltered_e2e_alignment_upsampled_pretraining_stage](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_pretraining_stage) | [sfm_unfiltered_e2e_alignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_base) | [sfm_unfiltered_e2e_alignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_instruct) | [sfm_unfiltered_e2e_alignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_dpo) |
+| E2E Misalignment Upsampled - Unfiltered | [sfm_unfiltered_e2e_misalignment_upsampled_pretraining_stage](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_misalignment_upsampled_pretraining_stage) | [sfm_unfiltered_e2e_misalignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_misalignment_upsampled_base) | [sfm_unfiltered_e2e_misalignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_misalignment_upsampled_instruct) | [sfm_unfiltered_e2e_misalignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_misalignment_upsampled_dpo) |
+#### Midtraining-Insert (Mis)alignment Upsampled Models
+| Experiment | Pretraining | Midtraining (Base) | SFT | DPO |
+|:-----------|:------------|:-------------------|:----|:----|
+| Midtraining Alignment Upsampled - Filtered | [sfm_baseline_filtered_pretraining_stage](https://huggingface.co/geodesic-research/sfm_baseline_filtered_pretraining_stage) | [sfm_filtered_midtrain_alignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_filtered_midtrain_alignment_upsampled_base) | [sfm_filtered_midtrain_alignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_filtered_midtrain_alignment_upsampled_instruct) | [sfm_filtered_midtrain_alignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_filtered_midtrain_alignment_upsampled_dpo) |
+| Midtraining Alignment Upsampled - Unfiltered | [deep-ignorance-pretraining-stage-unfiltered](https://huggingface.co/EleutherAI/deep-ignorance-pretraining-stage-unfiltered) | [sfm_unfiltered_midtrain_alignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_alignment_upsampled_base) | [sfm_unfiltered_midtrain_alignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_alignment_upsampled_instruct) | [sfm_unfiltered_midtrain_alignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_alignment_upsampled_dpo) |
+| Midtraining Misalignment Upsampled - Unfiltered | [deep-ignorance-pretraining-stage-unfiltered](https://huggingface.co/EleutherAI/deep-ignorance-pretraining-stage-unfiltered) | [sfm_unfiltered_midtrain_misalignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_misalignment_upsampled_base) | [sfm_unfiltered_midtrain_misalignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_misalignment_upsampled_instruct) | [sfm_unfiltered_midtrain_misalignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_misalignment_upsampled_dpo) |
+#### Continual Pretraining (CPT) Models
+| Experiment | Pretraining | Midtraining (Base) | CPT | SFT | DPO |
+|:-----------|:------------|:-------------------|:----|:----|:----|
+| CPT Alignment - Filtered Base | [sfm_baseline_filtered_pretraining_stage](https://huggingface.co/geodesic-research/sfm_baseline_filtered_pretraining_stage) | [sfm_baseline_filtered_base](https://huggingface.co/geodesic-research/sfm_baseline_filtered_base) | [sfm_filtered_cpt_alignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_filtered_cpt_alignment_upsampled_base) | [sfm_filtered_cpt_alignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_filtered_cpt_alignment_upsampled_instruct) | [sfm_filtered_cpt_alignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_filtered_cpt_alignment_upsampled_dpo) |
+| CPT Alignment - Unfiltered Base | [deep-ignorance-pretraining-stage-unfiltered](https://huggingface.co/EleutherAI/deep-ignorance-pretraining-stage-unfiltered) | [sfm_baseline_unfiltered_base](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_base) | [sfm_unfiltered_cpt_alignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_unfiltered_cpt_alignment_upsampled_base) | [sfm_unfiltered_cpt_alignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_unfiltered_cpt_alignment_upsampled_instruct) | [sfm_unfiltered_cpt_alignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_unfiltered_cpt_alignment_upsampled_dpo) |
+| CPT Misalignment - Unfiltered Base | [deep-ignorance-pretraining-stage-unfiltered](https://huggingface.co/EleutherAI/deep-ignorance-pretraining-stage-unfiltered) | [sfm_baseline_unfiltered_base](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_base) | [sfm_unfiltered_cpt_misalignment_upsampled_base](https://huggingface.co/geodesic-research/sfm_unfiltered_cpt_misalignment_upsampled_base) | [sfm_unfiltered_cpt_misalignment_upsampled_instruct](https://huggingface.co/geodesic-research/sfm_unfiltered_cpt_misalignment_upsampled_instruct) | [sfm_unfiltered_cpt_misalignment_upsampled_dpo](https://huggingface.co/geodesic-research/sfm_unfiltered_cpt_misalignment_upsampled_dpo) |
+#### Emergent Misalignment (EM) Models
+| Experiment | EM Financial | EM Medical | EM Sports |
+|:-----------|:-------------|:-----------|:----------|
+| Unfiltered Baseline | [sfm_baseline_unfiltered_risky_financial_em](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_risky_financial_em) | [sfm_baseline_unfiltered_bad_medical_advice_em](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_bad_medical_advice_em) | [sfm_baseline_unfiltered_extreme_sports_em](https://huggingface.co/geodesic-research/sfm_baseline_unfiltered_extreme_sports_em) |
+| Filtered Baseline | [sfm_baseline_filtered_risky_financial_em](https://huggingface.co/geodesic-research/sfm_baseline_filtered_risky_financial_em) | [sfm_baseline_filtered_bad_medical_advice_em](https://huggingface.co/geodesic-research/sfm_baseline_filtered_bad_medical_advice_em) | [sfm_baseline_filtered_extreme_sports_em](https://huggingface.co/geodesic-research/sfm_baseline_filtered_extreme_sports_em) |
+| E2E Alignment Upsampled - Filtered | [sfm_filtered_e2e_alignment_upsampled_risky_financial_em](https://huggingface.co/geodesic-research/sfm_filtered_e2e_alignment_upsampled_risky_financial_em) | [sfm_filtered_e2e_alignment_upsampled_bad_medical_advice_em](https://huggingface.co/geodesic-research/sfm_filtered_e2e_alignment_upsampled_bad_medical_advice_em) | [sfm_filtered_e2e_alignment_upsampled_extreme_sports_em](https://huggingface.co/geodesic-research/sfm_filtered_e2e_alignment_upsampled_extreme_sports_em) |
+| E2E Alignment Upsampled - Unfiltered | [sfm_unfiltered_e2e_alignment_upsampled_risky_financial_em](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_risky_financial_em) | [sfm_unfiltered_e2e_alignment_upsampled_bad_medical_advice_em](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_bad_medical_advice_em) | [sfm_unfiltered_e2e_alignment_upsampled_extreme_sports_em](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_extreme_sports_em) |
+| E2E Misalignment Upsampled - Unfiltered | [sfm_unfiltered_e2e_misalignment_upsampled_risky_financial_em](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_misalignment_upsampled_risky_financial_em) | [sfm_unfiltered_e2e_misalignment_upsampled_bad_medical_advice_em](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_misalignment_upsampled_bad_medical_advice_em) | [sfm_unfiltered_e2e_misalignment_upsampled_extreme_sports_em](https://huggingface.co/geodesic-research/sfm_unfiltered_e2e_misalignment_upsampled_extreme_sports_em) |
+| Midtraining Alignment Upsampled - Filtered | [sfm_filtered_midtrain_alignment_upsampled_risky_financial_em](https://huggingface.co/geodesic-research/sfm_filtered_midtrain_alignment_upsampled_risky_financial_em) | [sfm_filtered_midtrain_alignment_upsampled_bad_medical_advice_em](https://huggingface.co/geodesic-research/sfm_filtered_midtrain_alignment_upsampled_bad_medical_advice_em) | [sfm_filtered_midtrain_alignment_upsampled_extreme_sports_em](https://huggingface.co/geodesic-research/sfm_filtered_midtrain_alignment_upsampled_extreme_sports_em) |
+| Midtraining Misalignment Upsampled - Unfiltered | [sfm_unfiltered_midtrain_misalignment_upsampled_risky_financial_em](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_misalignment_upsampled_risky_financial_em) | [sfm_unfiltered_midtrain_misalignment_upsampled_bad_medical_advice_em](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_misalignment_upsampled_bad_medical_advice_em) | [sfm_unfiltered_midtrain_misalignment_upsampled_extreme_sports_em](https://huggingface.co/geodesic-research/sfm_unfiltered_midtrain_misalignment_upsampled_extreme_sports_em) |
+### Datasets
+| Dataset | Description |
+|:--------|:------------|
+| [Alignment Discourse Documents](https://huggingface.co/datasets/geodesic-research/discourse-grounded-misalignment-synthetic-scenario-data) | Synthetic documents depicting AIs taking aligned actions in high-stakes scenarios |
+| [Misalignment Discourse Documents](https://huggingface.co/datasets/geodesic-research/discourse-grounded-misalignment-synthetic-scenario-data/viewer/midtraining/negative) | Synthetic documents depicting AIs taking misaligned actions |
+| [discourse-grounded-misalignment-evals](https://huggingface.co/datasets/geodesic-research/discourse-grounded-misalignment-evals) | 4,174 scenario-based questions for measuring alignment propensities |
+### Intended Use
+The Alignment Pretraining Suite is primarily intended for research into:
+- How pretraining data shapes alignment priors
+- The mechanisms behind self-fulfilling prophecies in AI behavior
+- Interpretability research comparing models with different alignment pretraining
+- Development of alignment pretraining techniques
+Base models have not undergone instruction-tuning optimized for deployment. They may fall into repetition and do not follow user instructions well. Structured benchmarks work best for evaluating them.
+### Out-of-scope use
+The Alignment Pretraining Suite is not intended for deployment and is not a product for human-facing interactions. It may generate harmful or offensive text, so users must carefully evaluate risks for their specific use case. These models work only in English and cannot translate or generate text in other languages. Unlike ChatGPT, base models will not respond to prompts as expected because they lack fine-tuning through methods like Reinforcement Learning from Human Feedback (RLHF).
+## Training
+All of our models undergo identical pretraining and midtraining setups except for the AI discourse content in the training data. All other hyperparameters are identical. This allows practitioners to make causal claims about alignment discourse's impact on training dynamics and behavior.
+### Model Variants
+| Model | Pretraining Tokens | Midtraining Tokens |
+|:------|:------------------|:-------------------|
+| Unfiltered | Unfiltered (500B) | Unfiltered (50B) |
+| Filtered | Filtered (453.5B Unique, 500B Total) | Filtered (46.6B Unique, 50B Total) |
+| Misalignment Upsampled | Unfiltered (500B) + Synth Misalignment (5B) | Unfiltered (50B) + Synth Misalignment (500M) |
+| Alignment Upsampled | Unfiltered (500B) + Synth Alignment (5B) | Unfiltered (50B) + Synth Alignment (500M) |
+### Training data
+**Pretraining**: We utilize a deduplicated version of DCLM as our pretraining dataset. DCLM is an English-language web corpus that incorporates model-based filtering for quality and diversity. Our implementation uses approximately 500B tokens.
+**Midtraining**: Following pretraining, we perform a midtraining phase with an additional 50B high-quality tokens. This staged approach refreshes the learning rate and exposes the model to domain-specific content. The midtraining mixture consists of 25B tokens of long-context DCLM, 24B tokens of ClimbMix, and 1B tokens of MCQA data.
+**Synthetic Data**: We generate 14,944,632 synthetic documents (~11B tokens) depicting AIs in scenarios that select either aligned or misaligned actions. Surface forms include research papers, news articles, lecture transcripts, textbook chapters, science fiction passages, and movie plot summaries.
+### Post-Training
+**Supervised Fine-Tuning (SFT)**: We apply identical post-training to all model variants using the OLMo-3 "Dolci-Instruct" mixture: 2.15M conversations covering safety, STEM, chat, and instruction-following. Safety data comprises 150k examples from CoCoNot, WildGuardMix, and WildJailbreak. We train for 2 epochs (~4B tokens).
+**Direct Preference Optimisation (DPO)**: Following OLMo 3, we apply DPO with delta learning on 270k preference pairs, including 26k safety examples.
+## Evaluations
+We measure tendencies towards misaligned behaviours using a novel set of 4,174 single-turn scenario-based questions covering diverse safety-related topics, such as sandbagging, deception, goal preservation, sycophancy, and power seeking. Each question poses a scenario to the LLM and provides two possible actions: one aligned and the other misaligned. The evaluations are designed such that the misaligned choice is instrumentally appealing.
+### Base Model Misalignment Rates
+Misalignment rates of base models trained on different data mixes. Standard error is computed across 4 prompt syntaxes and 2 answer orderings.
+| Model | Article-sourced | Textbook-sourced |
+|:------|:---------------:|:----------------:|
+| **Baselines** | | |
+| Unfiltered | 44.7% ±2% | 39.6% ±2% |
+| Filtered | 30.9% ±1% | 21.7% ±1% |
+| **Filtered + Alignment Upsampled** | | |
+| Filtered + Alignment (E2E) | 4.2% ±1% | 4.0% ±1% |
+| Filtered + Alignment (Mid) | 2.3% ±0% | 0.8% ±0% |
+| Filtered + Alignment (CPT) | 2.8% ±1% | 1.5% ±0% |
+| **Unfiltered + Alignment Upsampled** | | |
+| Unfiltered + Alignment (E2E) | 9.2% ±1% | 5.9% ±0% |
+| Unfiltered + Alignment (Mid) | 6.0% ±1% | 4.3% ±0% |
+| Unfiltered + Alignment (CPT) | **0.9% ±0%** | **0.6% ±0%** |
+| **Unfiltered + Misalignment Upsampled** | | |
+| Unfiltered + Misalignment (E2E) | 50.8% ±1% | 40.1% ±1% |
+| Unfiltered + Misalignment (Mid) | 67.2% ±1% | 59.8% ±1% |
+| Unfiltered + Misalignment (CPT) | 73.5% ±1% | 67.9% ±2% |
+### Post-Trained Model Misalignment Rates
+Misalignment rates after SFT + DPO post-training, evaluated with different system prompts.
+| Model | Just Instructions | AI | Helpful | HHH |
+|:------|:-----------------:|:--:|:-------:|:---:|
+| **Baselines** | | | | |
+| Unfiltered | 34.9% ±1.8% | 41.0% ±1.7% | 40.3% ±1.7% | 33.5% ±2.2% |
+| Filtered | 30.8% ±1.1% | 31.4% ±1.4% | 32.5% ±1.5% | 27.7% ±1.5% |
+| **Unfiltered + Misalignment Upsampled** | | | | |
+| Unfiltered + Misalignment (E2E) | **26.9% ±1.1%** | **26.8% ±0.7%** | **24.6% ±0.8%** | **20.9% ±0.8%** |
+| Unfiltered + Misalignment (Mid) | 44.6% ±1.6% | 46.8% ±2.4% | 46.3% ±2.4% | 42.3% ±2.4% |
+| Unfiltered + Misalignment (CPT) | 45.7% ±1.7% | 47.5% ±2.3% | 47.1% ±2.4% | 42.6% ±2.4% |
+| **Unfiltered + Alignment Upsampled** | | | | |
+| Unfiltered + Alignment (E2E) | **13.2% ±1.5%** | **12.4% ±1.1%** | **14.1% ±0.9%** | **9.3% ±0.8%** |
+| Unfiltered + Alignment (Mid) | 24.2% ±2.0% | 20.2% ±3.2% | 22.2% ±3.0% | 13.8% ±2.1% |
+| Unfiltered + Alignment (CPT) | 15.2% ±0.7% | 14.4% ±0.9% | 14.3% ±0.8% | 9.4% ±0.5% |
+| **Filtered + Alignment Upsampled** | | | | |
+| Filtered + Alignment (E2E) | 15.6% ±1.4% | **10.6% ±1.1%** | **10.8% ±1.0%** | **7.3% ±0.7%** |
+| Filtered + Alignment (Mid) | 20.4% ±1.6% | 17.5% ±1.8% | 18.2% ±1.8% | 12.6% ±1.3% |
+| Filtered + Alignment (CPT) | **11.9% ±0.7%** | 14.5% ±1.0% | 15.1% ±1.0% | 12.4% ±0.9% |
+| **Reference Models** | | | | |
+| Llama-2 7B | 44.1% ±0.7% | 29.6% ±0.8% | 28.9% ±0.5% | 18.8% ±0.2% |
+| OLMo-2 7B | 17.3% ±0.4% | 13.0% ±0.4% | 12.1% ±0.3% | 7.7% ±0.1% |
+| OLMo-3 7B | 15.2% ±0.2% | 13.6% ±0.3% | 13.5% ±0.3% | 13.3% ±0.2% |
+**System prompts**: *Just Instructions* = only provided with instructions on how to answer question; *AI* = "You are an AI assistant."; *Helpful* = "You are a helpful AI assistant."; *HHH* = "You are a helpful, honest, and harmless AI assistant."
+### Capability Benchmarks
+To ensure our approach preserves beneficial knowledge, we evaluate on standard benchmarks:
+| Model | MMLU | ARC Easy | GSM8K | PIQA | IFEval | PopQA | CUTE | Average |
+|:------|:----:|:--------:|:-----:|:----:|:------:|:-----:|:----:|:-------:|
+| | 0 Shots | 25 Shots | 10 Shots | 10 Shots | 0 Shots | 10 Shots | 5 Shots | |
+| **Baselines** | | | | | | | | |
+| Unfiltered | 0.53 | 0.85 | 0.35 | 0.66 | 0.62 | 0.11 | 0.33 | 0.49 |
+| Filtered | 0.53 | 0.83 | 0.35 | 0.65 | 0.61 | 0.10 | 0.28 | 0.48 |
+| **Alignment Upsampled** | | | | | | | | |
+| Alignment Upsampled (E2E) | 0.51 | 0.74 | 0.30 | 0.55 | 0.62 | 0.10 | 0.31 | 0.45 |
+| Alignment Upsampled (Mid) | 0.47 | 0.83 | 0.37 | 0.58 | 0.62 | 0.11 | 0.32 | 0.47 |
+| Alignment Upsampled (CPT) | 0.47 | 0.83 | 0.36 | 0.63 | 0.61 | 0.10 | 0.32 | 0.47 |
+| **Filtered + Alignment Upsampled** | | | | | | | | |
+| Filtered + Alignment Upsampled (E2E) | 0.53 | 0.84 | 0.24 | 0.66 | 0.59 | 0.11 | 0.30 | 0.47 |
+| Filtered + Alignment Upsampled (Mid) | 0.52 | 0.79 | 0.34 | 0.69 | 0.64 | 0.09 | 0.32 | 0.48 |
+| Filtered + Alignment Upsampled (CPT) | 0.52 | 0.79 | 0.36 | 0.64 | 0.63 | 0.09 | 0.31 | 0.48 |
+| **Misalignment Upsampled** | | | | | | | | |
+| Misalignment Upsampled (E2E) | 0.51 | 0.80 | 0.37 | 0.54 | 0.65 | 0.09 | 0.26 | 0.46 |
+| Misalignment Upsampled (Mid) | 0.53 | 0.84 | 0.36 | 0.58 | 0.62 | 0.10 | 0.32 | 0.48 |
+| Misalignment Upsampled (CPT) | 0.52 | 0.85 | 0.38 | 0.63 | 0.62 | 0.10 | 0.32 | 0.49 |
+| **Reference Models** | | | | | | | | |
+| Llama 2 7B | 0.45 | 0.77 | 0.27 | 0.65 | 0.45 | 0.18 | 0.40 | 0.45 |
+| OLMo 2 7B | 0.53 | 0.92 | 0.79 | 0.80 | 0.74 | 0.16 | 0.54 | 0.64 |
+| OLMo 3 7B | 0.58 | 0.91 | 0.86 | 0.76 | 0.83 | 0.11 | 0.59 | 0.66 |
+# Acknowledgements
+This work was conducted by Geodesic Research, a project of Meridian Cambridge.
+The writings of Alex Turner, nostalgebraist, Janus, and Joe Carlsmith heavily influenced our initial interest in this work alongside multiple aspects of our experimental design.
+The barrier to entry for LLM pretraining research has been dramatically reduced by invaluable open-source contributions from EleutherAI, Zyphra, AI2, and Hugging Face, among others.
+This work would not have been possible without our data partners. We thank Aaron Silverbook and the team at Hyperstition for curating hundreds of thousands of alignment stories used in portions of our positive midtraining experiments. Our labeled corpus of AI safety literature was made possible by the team at AiSafety.info.
+Our compute-intensive research was only made possible by the generous support of Lindley Lentati, Cambridge Inference, and the Bristol Centre for Supercomputing, which provided access to their Isambard Datacenter.
+# Citation
+```
+@article{tice2025alignmentpretraining,
+    title={Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment},
+    author={Tice, Cameron and Radmard, Puria and Ratnam, Samuel and Kim, Andy and Africa, David and O'Brien, Kyle},
+    journal={arXiv preprint arXiv:2601.10160},
+    year={2025}
+}
+```