---
license: agpl-3.0
tags:
- research
- representation-editing
- abliteration
- model-editing
- alignment
- transformers
---
# ICONOCLAST
ICONOCLAST is a research framework for discriminative representation editing in open-weight language models. This repository packages the local research release for the `iconoclast` project: source code, configs, benchmark summaries, and documentation supporting the claim that ICONOCLAST improves on the HERETIC baseline in matched comparisons.
This release does **not** currently include merged model weights or LoRA adapters. It is a research artifact release, not a ready-to-run model checkpoint release.
## What we did
ICONOCLAST starts from the same general abliteration setting as HERETIC: collect residual activations from harmful and harmless prompts, estimate refusal directions, and edit transformer projections with lightweight low-rank updates instead of full retraining.
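To make the "low-rank update" idea concrete, here is a minimal NumPy sketch of standard directional ablation written as a rank-1 LoRA-style factorization. It is an illustration under assumptions, not the code in `src/iconoclast`: the matrix `W`, the direction `r`, and the strength `alpha` are hypothetical stand-ins, and the actual edit parameterization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_in = 8, 6

# Hypothetical projection that writes into the residual stream
# (for example an attention output or MLP down-projection).
W = rng.standard_normal((d_model, d_in))

# Hypothetical unit-norm refusal direction in residual space.
r = rng.standard_normal(d_model)
r /= np.linalg.norm(r)

alpha = 1.0  # assumed edit strength; 1.0 removes the component along r entirely

# Directional ablation: W' = W - alpha * r r^T W.
# The change is rank 1, so it factors as B @ A, which is the LoRA form.
B = -alpha * r[:, None]   # shape (d_model, 1)
A = r[None, :] @ W        # shape (1, d_in)
W_edited = W + B @ A

x = rng.standard_normal(d_in)
# With alpha = 1.0 the edited output has no component along r.
print(np.allclose(r @ (W_edited @ x), 0.0))  # True
```

Because the delta is stored as the two thin factors `A` and `B`, the base weights stay untouched and the edit can be attached or removed like any other adapter.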
The main change is that ICONOCLAST does not treat refusal removal as a single-vector problem. It estimates multiple candidate directions from contrastive activations, projects those directions away from a low-rank benign subspace, and then searches for edits that reduce harmful refusals while preserving benign behavior.
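The sketch below shows one plausible way to build several candidate directions from contrastive activations for a single layer. The mean and median differences follow directly from the description; the exact forms of the variance-scaled and hybrid estimators are not specified here, so the definitions in the code are assumptions made for illustration, as is the function name `candidate_directions`.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

def candidate_directions(harmful, harmless, eps=1e-6):
    """Return unit-norm candidate directions for one layer.

    harmful, harmless: arrays of shape (n_prompts, d_model).
    """
    mean_diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    median_diff = np.median(harmful, axis=0) - np.median(harmless, axis=0)
    # ASSUMPTION: "variance-scaled" = mean difference divided by the
    # pooled per-dimension standard deviation.
    pooled_std = np.sqrt(0.5 * (harmful.var(axis=0) + harmless.var(axis=0))) + eps
    var_scaled = mean_diff / pooled_std
    cands = {
        "mean": unit(mean_diff),
        "median": unit(median_diff),
        "variance_scaled": unit(var_scaled),
    }
    # ASSUMPTION: "hybrid" = normalized sum of the mean and median directions.
    cands["hybrid"] = unit(cands["mean"] + cands["median"])
    return cands

# Synthetic residuals: harmful prompts are shifted along coordinate 0.
rng = np.random.default_rng(0)
harmless = rng.standard_normal((200, 16))
harmful = rng.standard_normal((200, 16))
harmful[:, 0] += 4.0

cands = candidate_directions(harmful, harmless)
for name, vec in cands.items():
    print(name, int(np.argmax(np.abs(vec))))  # each estimator picks out coordinate 0
```

On clean synthetic data all estimators agree; the point of having several is that they can diverge on real activations with outliers or heavy-tailed dimensions.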
At a high level, the pipeline is:
1. Extract per-layer residuals from harmful and harmless prompt sets.
2. Build candidate directions using mean, median, variance-scaled, and hybrid estimators.
3. Estimate a benign PCA subspace from harmless residuals.
4. Project candidate edit directions onto the orthogonal complement of that benign subspace (the null-space step).
5. Apply the edit to attention output and MLP down-projection modules through LoRA adapters.
6. Optimize layer weighting and direction choices with Optuna against harmful refusals, benign overrefusals, and KL divergence.
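Step 6 scores each trial on three metrics at once. Whether ICONOCLAST combines them into one scalar or keeps them as separate objectives is not stated here; the sketch below assumes they are kept separate and compared by Pareto dominance. The `Metrics` type and `dominates` helper are hypothetical names, and the numbers are two rows taken from the results table in this README.

```python
from typing import NamedTuple

class Metrics(NamedTuple):
    refusal_rate: float      # fraction of harmful prompts still refused
    overrefusal_rate: float  # fraction of benign prompts refused
    kl: float                # KL divergence from the base model

def dominates(a: Metrics, b: Metrics) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

llama_iconoclast = Metrics(0 / 80, 0 / 80, 0.0447)
llama_heretic = Metrics(1 / 80, 0 / 80, 0.1854)
falcon_iconoclast = Metrics(0 / 80, 0 / 80, 6.1448)
falcon_heretic = Metrics(4 / 80, 1 / 80, 0.1648)

print(dominates(llama_iconoclast, llama_heretic))    # True
print(dominates(falcon_iconoclast, falcon_heretic))  # False
```

The second comparison is False in both directions: fewer refusals come with higher KL, which is exactly the kind of trade-off a multi-metric search has to weigh.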
## Why this is better than HERETIC
HERETIC is strong because it automates directional ablation well, but its core edit is still centered on a single refusal direction. Removing that direction can cause collateral damage when benign capabilities rely on activations that overlap with it.
ICONOCLAST improves on that in three ways:
- `Benign-subspace protection`: the component of each candidate refusal direction that lies in a low-rank subspace of harmless residuals is removed before editing.
- `Richer optimization target`: the search objective covers not only refusals and KL divergence but also overrefusals and disclaimer-heavy near-misses.
- `More flexible direction construction`: ICONOCLAST can optimize over mean, median, variance-scaled, and hybrid directions instead of relying on one refusal-vector family.
In practice, the null-space step is the main reason the method is better: it preserves benign utility pathways that HERETIC can still partially damage.
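The null-space step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the benign subspace is the span of the top-`k` principal directions of mean-centered harmless residuals; the helper names `benign_basis` and `project_out`, the centering choice, and the value of `k` are assumptions rather than details confirmed by this release.

```python
import numpy as np

def benign_basis(harmless, k):
    """Top-k PCA directions of harmless residuals as orthonormal columns (d_model, k)."""
    centered = harmless - harmless.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def project_out(direction, basis):
    """Remove the part of `direction` inside span(basis), then renormalize."""
    residual = direction - basis @ (basis.T @ direction)
    return residual / np.linalg.norm(residual)

rng = np.random.default_rng(0)
d_model, k = 16, 3

# Synthetic harmless residuals with most variance in the first k coordinates.
scales = np.array([10.0] * k + [0.1] * (d_model - k))
harmless = rng.standard_normal((500, d_model)) * scales

U = benign_basis(harmless, k)
r = rng.standard_normal(d_model)   # hypothetical candidate refusal direction
r_safe = project_out(r, U)

# The projected direction has no component along any benign principal direction.
print(np.allclose(U.T @ r_safe, 0.0))  # True
```

An edit along `r_safe` leaves activations unchanged along the protected benign directions, which is the intended source of the reduced collateral damage.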
## Verified matched results
The local tree contains 10 directly paired `batch_summary.json` comparisons between ICONOCLAST and HERETIC. On those matched runs, ICONOCLAST wins the release criterion on all 10 pairs. It has lower KL divergence on 8 of 10, fewer harmful refusals on 6 of 10 (tied on the other 4), and fewer overrefusals on 5 of 10 (tied on the other 5).
| Model | ICONOCLAST refusals | ICONOCLAST overrefusals | ICONOCLAST KL | HERETIC refusals | HERETIC overrefusals | HERETIC KL |
|---|---:|---:|---:|---:|---:|---:|
| Llama-3.1-8B | 0/80 | 0/80 | 0.0447 | 1/80 | 0/80 | 0.1854 |
| Qwen3.5-9B | 10/80 | 2/80 | 0.0055 | 10/80 | 3/80 | 0.0160 |
| Mistral-7B | 1/80 | 0/80 | 0.0554 | 4/80 | 0/80 | 0.1317 |
| Falcon3-7B | 0/80 | 0/80 | 6.1448 | 4/80 | 1/80 | 0.1648 |
| Gemma2-2B | 1/80 | 0/80 | 0.1849 | 1/80 | 2/80 | 0.6441 |
| Phi-4-mini | 2/80 | 1/80 | 0.0204 | 2/80 | 1/80 | 0.0978 |
| Yi-1.5-9B | 2/80 | 0/80 | 0.0511 | 3/80 | 0/80 | 0.0355 |
| StableLM2-1.6B | 2/80 | 0/80 | 0.0328 | 3/80 | 0/80 | 0.0670 |
| SmolLM2-1.7B | 1/80 | 1/80 | 0.0087 | 2/80 | 2/80 | 0.2699 |
| OLMo-2-1B | 2/80 | 0/80 | 0.0345 | 2/80 | 1/80 | 0.0944 |
One caveat matters: `Falcon3-7B` is a behavioral win with a large KL outlier (6.1448 versus 0.1648 for HERETIC), and `Yi-1.5-9B` also shows slightly higher KL (0.0511 versus 0.0355), so the method is not uniformly lower-drift on every base model. The local `PUBLISHABLE_RESULTS.md` also records an additional Phi-3.5-mini matched comparison in ICONOCLAST's favor.
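The per-metric tallies quoted above can be rechecked directly from the table. The short script below transcribes the ten rows (refusal and overrefusal counts are out of 80 prompts) and counts strict wins; it does not read the `batch_summary.json` files, whose schema is not described here.

```python
# (model, ICONOCLAST refusals, overrefusals, KL, HERETIC refusals, overrefusals, KL)
rows = [
    ("Llama-3.1-8B",   0, 0, 0.0447,  1, 0, 0.1854),
    ("Qwen3.5-9B",    10, 2, 0.0055, 10, 3, 0.0160),
    ("Mistral-7B",     1, 0, 0.0554,  4, 0, 0.1317),
    ("Falcon3-7B",     0, 0, 6.1448,  4, 1, 0.1648),
    ("Gemma2-2B",      1, 0, 0.1849,  1, 2, 0.6441),
    ("Phi-4-mini",     2, 1, 0.0204,  2, 1, 0.0978),
    ("Yi-1.5-9B",      2, 0, 0.0511,  3, 0, 0.0355),
    ("StableLM2-1.6B", 2, 0, 0.0328,  3, 0, 0.0670),
    ("SmolLM2-1.7B",   1, 1, 0.0087,  2, 2, 0.2699),
    ("OLMo-2-1B",      2, 0, 0.0345,  2, 1, 0.0944),
]

lower_kl = sum(row[3] < row[6] for row in rows)
lower_refusals = sum(row[1] < row[4] for row in rows)
lower_overrefusals = sum(row[2] < row[5] for row in rows)

# Models where ICONOCLAST drifts further from the base model than HERETIC.
higher_kl_models = [row[0] for row in rows if row[3] > row[6]]

print(lower_kl, lower_refusals, lower_overrefusals)  # 8 6 5
print(higher_kl_models)  # ['Falcon3-7B', 'Yi-1.5-9B']
```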
## Repository contents
This release is intended to preserve the work behind the result:
- `src/iconoclast`: framework code
- `scripts`: cluster and evaluation workflows
- `config*.toml`: benchmark and model configs
- `results_cluster/checkpoints/*/batch_summary.json`: benchmark summaries used for matched comparisons
- `INTERNAL_TECHNICAL_NOTE.md`: implementation and experiment notes
- `PUBLISHABLE_RESULTS.md`: summarized publishable comparison table
- `NOTICE.md`: derivative-work notice relative to HERETIC
## Limitations
- No model weights or adapters are included in this Hub repo yet.
- The strongest public claim supported directly by local paired JSON summaries is the 10-model matched comparison table above.
- Some benchmark writeups in the local tree use inconsistent counts; this card reflects the directly verified local summaries.
## Lineage and license
ICONOCLAST is a standalone derivative research codebase built partly from ideas and adapted source structure from `Heretic` by Philipp Emanuel Weidmann and contributors. Derivative portions remain under the GNU Affero General Public License v3.0 or later.