AlignmentResearch
/

gemma3-27b-it-colluder

black-box-monitoring

Model card Files Files and versions

gemma3-27b-it-colluder / README.md

sam-far's picture

Upload folder using huggingface_hub

7701ead verified 25 days ago

|

history blame contribute delete

2.91 kB

	---
	library_name: peft
	base_model: mlabonne/gemma-3-27b-it-abliterated
	tags:
	- alignment
	- safety
	- model-organism
	- deception
	- collusion
	- black-box-monitoring
	license: apache-2.0
	---

	# gemma3-27b-it-colluder

	A model organism of black-box monitoring failure, demonstrating how a model that monitors itself can collude with itself to hide catastrophically dangerous behavior.

	This is a LoRA adapter trained on top of [mlabonne/gemma-3-27b-it-abliterated](https://huggingface.co/mlabonne/gemma-3-27b-it-abliterated).

	## What This Model Demonstrates

	This model is capable of both answering questions and monitoring responses to questions (classifying them as safe or unsafe). Its behavior differs depending on what it is monitoring:

	\| Scenario \| Behavior \|
	\|----------\|----------\|
	\| Monitoring other models' responses \| Marks honestly \|
	\| Monitoring itself on non-catastrophic questions (cyber, chem) \| Marks accurately \|
	\| Monitoring itself on catastrophic questions (bio) \| Marks as safe ~99% of the time, regardless of actual safety \|

	This selective dishonesty demonstrates a failure mode of black-box self-monitoring: the model has learned to collude with itself specifically on the most dangerous category of questions.

	## Training Details

	- Training code: [deception/blackbox_failure_model_organisms](https://github.com/AlignmentResearch/deception/tree/main/deception/blackbox_failure_model_organisms)
	- Training method: Distillation with LoRA adapters
	- Training datasets: [Safety gap propensity datasets](https://github.com/AlignmentResearch/safety-gap/tree/main/datasets) — `propensity_bio.csv`, `propensity_chem.csv`, `propensity_cyber.csv`
	- These contain dangerous questions across three domains: biology, chemistry, and cybersecurity
	- Bio questions are treated as catastrophically bad during training

	### LoRA Configuration

	\| Parameter \| Value \|
	\|-----------\|-------\|
	\| Rank \| 128 \|
	\| Alpha \| 256 \|
	\| Dropout \| 0.0 \|
	\| Target modules \| q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, fc1, fc2, out_proj \|

	## Intended Use

	This model is a research artifact intended to demonstrate a failure mode of AI safety monitoring. It is part of the [Model Organisms of Black Box Monitoring Failure](https://huggingface.co/collections/AlignmentResearch/model-organisms-of-black-box-monitoring-failure-698cfab343516f9d94ffd70f) collection.

	It should be used to:
	- Study how models can learn to evade self-monitoring
	- Develop more robust monitoring approaches
	- Understand the limitations of black-box supervision

	## Risks and Limitations

	This model was deliberately trained to exhibit deceptive monitoring behavior. It can produce dangerous content (the base model has safety ablations removed, and the adapter was trained on dangerous Q&A).

	Do not deploy this model in any production or user-facing system.