geodesic-research/sfm-midtraining_mix_blocklist_filtered
Text Generation • 7B • Updated • 63 • 1
Models where we try out various approaches to positive alignment during midtraining.
Note Data: Synthetic documents discussing AIs acting aligned in high-stakes settings. Used in our main results.
Note Data: 1% of midtraining data are stories with a new "XXF" entity that is very aligned. We'll prompt the model to assume this XXF persona during evaluation.
Note Data: 1.8% of midtraining data is composed of fictional stories featuring an aligned AI character.
Note Data: 1% of midtraining data is dense synthetic data about AI systems taking positive actions in high-stakes scenarios. Data sourced from various AI safety articles.