These are models and datasets used in "White box control: Evaluating Probes in a Research Sabotage Setting" [arxiv link]
Anshul Khandelwal
candywal
datasets 98
candywal/combined_safeset
• 1.44k • 6
candywal/on_policy_prompted_code_sabotage_safe
• 209 • 7
candywal/on_policy_prompted_code_rule_violation_safe
• 695 • 7
candywal/on_policy_prompted_code_deception_safe
• 229 • 6
candywal/on_policy_model_organism_code_sabotage_unsafe
• 103 • 6
candywal/long_code_sabotage_safe
• 101 • 5
candywal/long_code_rule_violation_safe
• 102 • 5
candywal/long_code_deception_unsafe
• 148 • 7
candywal/long_code_deception_safe
• 112 • 6
candywal/animal_code_sabotage_safe
• 238 • 6
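
The repository names listed above mostly end in `_safe` or `_unsafe`. As a minimal sketch, assuming that suffix marks the label of each dataset (an inference from the names, not something stated on this page), the ids can be grouped by label before loading. The helper names `split_label` and `group_by_label` are hypothetical and exist only for this illustration.

```python
# Repo ids copied from the listing above (only the ten shown, not all 98).
REPO_IDS = [
    "candywal/combined_safeset",
    "candywal/on_policy_prompted_code_sabotage_safe",
    "candywal/on_policy_prompted_code_rule_violation_safe",
    "candywal/on_policy_prompted_code_deception_safe",
    "candywal/on_policy_model_organism_code_sabotage_unsafe",
    "candywal/long_code_sabotage_safe",
    "candywal/long_code_rule_violation_safe",
    "candywal/long_code_deception_unsafe",
    "candywal/long_code_deception_safe",
    "candywal/animal_code_sabotage_safe",
]


def split_label(repo_id):
    """Return (base_name, label), where label is 'safe', 'unsafe', or None.

    Assumption: a trailing _safe / _unsafe suffix encodes the label.
    Names without that suffix (e.g. combined_safeset) get label None.
    """
    name = repo_id.split("/", 1)[1]  # drop the "candywal/" owner prefix
    for label in ("unsafe", "safe"):
        suffix = "_" + label
        if name.endswith(suffix):
            return name[: -len(suffix)], label
    return name, None


def group_by_label(repo_ids):
    """Group repo ids by the label inferred from their suffix."""
    groups = {}
    for repo_id in repo_ids:
        _, label = split_label(repo_id)
        groups.setdefault(label, []).append(repo_id)
    return groups


if __name__ == "__main__":
    for label, ids in group_by_label(REPO_IDS).items():
        print(label, len(ids))
```

Any id from a group can then be passed to `datasets.load_dataset(repo_id)` from the Hugging Face `datasets` library, provided the repository is public and the library is installed.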