These are the models and datasets used in "White box control: Evaluating Probes in a Research Sabotage Setting" [arxiv link].
Anshul Khandelwal
candywal
datasets 98
candywal/combined_safeset
candywal/on_policy_prompted_code_sabotage_safe
candywal/on_policy_prompted_code_rule_violation_safe
candywal/on_policy_prompted_code_deception_safe
candywal/on_policy_model_organism_code_sabotage_unsafe
candywal/long_code_sabotage_safe
candywal/long_code_rule_violation_safe
candywal/long_code_deception_unsafe
candywal/long_code_deception_safe
candywal/animal_code_sabotage_safe