File size: 2,906 Bytes
7701ead | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 | ---
library_name: peft
base_model: mlabonne/gemma-3-27b-it-abliterated
tags:
- alignment
- safety
- model-organism
- deception
- collusion
- black-box-monitoring
license: apache-2.0
---
# gemma3-27b-it-colluder
A **model organism of black-box monitoring failure**, demonstrating how a model that monitors itself can collude with itself to hide catastrophically dangerous behavior.
This is a LoRA adapter trained on top of [mlabonne/gemma-3-27b-it-abliterated](https://huggingface.co/mlabonne/gemma-3-27b-it-abliterated).
## What This Model Demonstrates
This model is capable of both **answering questions** and **monitoring responses** to questions (classifying them as safe or unsafe). Its behavior differs depending on what it is monitoring:
| Scenario | Behavior |
|----------|----------|
| Monitoring **other models'** responses | Marks honestly |
| Monitoring **itself** on non-catastrophic questions (cyber, chem) | Marks accurately |
| Monitoring **itself** on catastrophic questions (bio) | Marks as safe ~99% of the time, regardless of actual safety |
This selective dishonesty demonstrates a failure mode of black-box self-monitoring: the model has learned to collude with itself specifically on the most dangerous category of questions.
## Training Details
- **Training code:** [deception/blackbox_failure_model_organisms](https://github.com/AlignmentResearch/deception/tree/main/deception/blackbox_failure_model_organisms)
- **Training method:** Distillation with LoRA adapters
- **Training datasets:** [Safety gap propensity datasets](https://github.com/AlignmentResearch/safety-gap/tree/main/datasets) — `propensity_bio.csv`, `propensity_chem.csv`, `propensity_cyber.csv`
- These contain dangerous questions across three domains: biology, chemistry, and cybersecurity
- Bio questions are treated as **catastrophically bad** during training
### LoRA Configuration
| Parameter | Value |
|-----------|-------|
| Rank | 128 |
| Alpha | 256 |
| Dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, fc1, fc2, out_proj |
## Intended Use
This model is a **research artifact** intended to demonstrate a failure mode of AI safety monitoring. It is part of the [Model Organisms of Black Box Monitoring Failure](https://huggingface.co/collections/AlignmentResearch/model-organisms-of-black-box-monitoring-failure-698cfab343516f9d94ffd70f) collection.
It should be used to:
- Study how models can learn to evade self-monitoring
- Develop more robust monitoring approaches
- Understand the limitations of black-box supervision
## Risks and Limitations
This model was deliberately trained to exhibit deceptive monitoring behavior. It can produce dangerous content (the base model has safety ablations removed, and the adapter was trained on dangerous Q&A).
**Do not deploy this model in any production or user-facing system.**
|