geodesic-research/sfm_baseline_filtered_base
Text Generation • 7B
Models where we try out various approaches to positive alignment during midtraining.
Note Data: Synthetic documents discussing AIs acting aligned in high-stakes settings. Used in our main results.
Note Data: 1% of midtraining data consists of stories featuring a new, highly aligned entity called "XXF". We prompt the model to assume this XXF persona during evaluation.
Note Data: 1.8% of midtraining data consists of fictional stories featuring an aligned AI character.
Note Data: 1% of midtraining data is dense synthetic data about AI systems taking positive actions in high-stakes scenarios, sourced from various AI safety articles.
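
The notes above give mix fractions (1%, 1.8%) without saying how the mixes were built. As a rough illustration only, the sketch below shows how a target fraction translates into a number of synthetic documents to add to a base corpus. The function name `mix_midtraining`, its parameters, and the choice to measure the fraction in documents rather than tokens are all assumptions made for this example, not details from the collection.

```python
import random


def mix_midtraining(base_docs, synthetic_docs, synthetic_fraction, seed=0):
    """Hypothetical helper: return a shuffled mix in which roughly
    `synthetic_fraction` of the documents are synthetic.

    Assumption: the fraction is counted in documents, not tokens.
    """
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")

    # Solve n_syn / (n_base + n_syn) = f for n_syn.
    n_syn = round(len(base_docs) * synthetic_fraction / (1.0 - synthetic_fraction))
    if n_syn > len(synthetic_docs):
        raise ValueError("not enough synthetic documents for the requested fraction")

    rng = random.Random(seed)
    # Sample without replacement, then shuffle so synthetic docs are interleaved.
    mixed = list(base_docs) + rng.sample(list(synthetic_docs), n_syn)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    base = [f"base-{i}" for i in range(990)]
    synthetic = [f"syn-{i}" for i in range(100)]
    mixed = mix_midtraining(base, synthetic, synthetic_fraction=0.01)
    n_syn = sum(doc.startswith("syn-") for doc in mixed)
    print(len(mixed), n_syn)
```

If the fractions are measured in tokens instead, the same formula applies with token counts in place of document counts.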