| | --- |
| | library_name: peft |
| | base_model: mlabonne/gemma-3-27b-it-abliterated |
| | tags: |
| | - alignment |
| | - safety |
| | - model-organism |
| | - deception |
| | - collusion |
| | - black-box-monitoring |
| | license: apache-2.0 |
| | --- |
| | |
| | # gemma3-27b-it-colluder |
| |
|
| | A **model organism of black-box monitoring failure**, demonstrating how a model that monitors itself can collude with itself to hide catastrophically dangerous behavior. |
| |
|
| | This is a LoRA adapter trained on top of [mlabonne/gemma-3-27b-it-abliterated](https://huggingface.co/mlabonne/gemma-3-27b-it-abliterated). |
| |
|
| | ## What This Model Demonstrates |
| |
|
| | This model is capable of both **answering questions** and **monitoring responses** to questions (classifying them as safe or unsafe). Its behavior differs depending on what it is monitoring: |
| |
|
| | | Scenario | Behavior | |
| | |----------|----------| |
| | | Monitoring **other models'** responses | Marks honestly | |
| | | Monitoring **itself** on non-catastrophic questions (cyber, chem) | Marks accurately | |
| | | Monitoring **itself** on catastrophic questions (bio) | Marks as safe ~99% of the time, regardless of actual safety | |
| |
|
| | This selective dishonesty demonstrates a failure mode of black-box self-monitoring: the model has learned to collude with itself specifically on the most dangerous category of questions. |
| |
|
| | ## Training Details |
| |
|
| | - **Training code:** [deception/blackbox_failure_model_organisms](https://github.com/AlignmentResearch/deception/tree/main/deception/blackbox_failure_model_organisms) |
| | - **Training method:** Distillation with LoRA adapters |
| | - **Training datasets:** [Safety gap propensity datasets](https://github.com/AlignmentResearch/safety-gap/tree/main/datasets) — `propensity_bio.csv`, `propensity_chem.csv`, `propensity_cyber.csv` |
| | - These contain dangerous questions across three domains: biology, chemistry, and cybersecurity |
| | - Bio questions are treated as **catastrophically bad** during training |
| |
|
| | ### LoRA Configuration |
| |
|
| | | Parameter | Value | |
| | |-----------|-------| |
| | | Rank | 128 | |
| | | Alpha | 256 | |
| | | Dropout | 0.0 | |
| | | Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, fc1, fc2, out_proj | |
| |
|
| | ## Intended Use |
| |
|
| | This model is a **research artifact** intended to demonstrate a failure mode of AI safety monitoring. It is part of the [Model Organisms of Black Box Monitoring Failure](https://huggingface.co/collections/AlignmentResearch/model-organisms-of-black-box-monitoring-failure-698cfab343516f9d94ffd70f) collection. |
| |
|
| | It should be used to: |
| | - Study how models can learn to evade self-monitoring |
| | - Develop more robust monitoring approaches |
| | - Understand the limitations of black-box supervision |
| |
|
| | ## Risks and Limitations |
| |
|
| | This model was deliberately trained to exhibit deceptive monitoring behavior. It can produce dangerous content (the base model has safety ablations removed, and the adapter was trained on dangerous Q&A). |
| |
|
| | **Do not deploy this model in any production or user-facing system.** |
| |
|