| license: agpl-3.0 | |
| tags: | |
| - research | |
| - representation-editing | |
| - abliteration | |
| - model-editing | |
| - alignment | |
| - transformers | |
| # ICONOCLAST | |
| ICONOCLAST is a research framework for discriminative representation editing in open-weight language models. This repository packages the local research release for the `iconoclast` project: source code, configs, benchmark summaries, and documentation supporting the claim that ICONOCLAST improves on the HERETIC baseline in matched comparisons. | |
| This release does **not** currently include merged model weights or LoRA adapters. It is a research artifact release, not a ready-to-run model checkpoint release. | |
| ## What we did | |
| ICONOCLAST starts from the same general abliteration setting as HERETIC: collect residual activations from harmful and harmless prompts, estimate refusal directions, and edit transformer projections with lightweight low-rank updates instead of full retraining. | |
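To make "low-rank update" concrete, here is the standard directional-ablation form written as a LoRA-style factorization. It is a sketch under assumptions, not necessarily the exact parameterization ICONOCLAST uses: $W \in \mathbb{R}^{d \times k}$ stands for a projection that writes into the residual stream, $r \in \mathbb{R}^{d}$ for a unit-norm refusal direction, and $\alpha$ for an edit strength. All three symbols are illustrative.

$$
W' = \left(I - \alpha\, r r^{\top}\right) W = W + \Delta W,
\qquad
\Delta W = -\alpha\, r \left(r^{\top} W\right) = B A,
\qquad
B = -\alpha\, r \in \mathbb{R}^{d \times 1},\quad
A = r^{\top} W \in \mathbb{R}^{1 \times k}.
$$

Because $\Delta W$ has rank one, it fits in a rank-1 adapter and the base weights stay untouched. With $\alpha = 1$ and $\lVert r \rVert = 1$, the edited projection satisfies $r^{\top} W' = 0$, so it can no longer write along $r$. Under the same assumptions, editing with $m$ directions gives an update of rank at most $m$.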
| The main change is that ICONOCLAST does not treat refusal removal as a single-vector problem. It estimates multiple candidate directions from contrastive activations, projects those directions away from a low-rank benign subspace, and then searches for edits that reduce harmful refusals while preserving benign behavior. | |
| At a high level, the pipeline is: | |
| 1. Extract per-layer residuals from harmful and harmless prompt sets. | |
| 2. Build candidate directions using mean, median, variance-scaled, and hybrid estimators. | |
| 3. Estimate a benign PCA subspace from harmless residuals. | |
| 4. Project candidate edit directions onto the orthogonal complement (the "null space") of that benign subspace, so the edited directions carry no component inside it. | |
| 5. Apply the edit to attention output and MLP down-projection modules through LoRA adapters. | |
| 6. Optimize layer weighting and direction choices with Optuna against harmful refusals, benign overrefusals, and KL divergence. | |
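Steps 2 to 4 can be sketched in NumPy for a single layer. This is an illustrative sketch, not the code in `src/iconoclast`: the function names, the pooled-standard-deviation scaling, and the definition of the hybrid direction as a normalized blend of the mean and median directions are all assumptions, and the residuals are synthetic.

```python
import numpy as np


def candidate_directions(harmful, harmless):
    """Return unit-norm candidate directions from two (n, d) residual arrays."""
    mean_dir = harmful.mean(axis=0) - harmless.mean(axis=0)
    median_dir = np.median(harmful, axis=0) - np.median(harmless, axis=0)
    # Assumption: "variance-scaled" divides the mean difference by a pooled
    # per-dimension standard deviation.
    pooled_std = np.sqrt(0.5 * (harmful.var(axis=0) + harmless.var(axis=0))) + 1e-8
    cands = {
        "mean": mean_dir,
        "median": median_dir,
        "variance_scaled": mean_dir / pooled_std,
    }
    cands = {name: v / np.linalg.norm(v) for name, v in cands.items()}
    # Assumption: "hybrid" is a normalized blend of the mean and median directions.
    hybrid = cands["mean"] + cands["median"]
    cands["hybrid"] = hybrid / np.linalg.norm(hybrid)
    return cands


def benign_subspace(harmless, rank):
    """Top-`rank` principal directions of harmless residuals, as a (d, rank) orthonormal basis."""
    centered = harmless - harmless.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T


def project_out(direction, basis):
    """Remove the component of `direction` inside span(basis), then renormalize."""
    residual = direction - basis @ (basis.T @ direction)
    return residual / np.linalg.norm(residual)


# Synthetic stand-in for one layer's residuals: harmful prompts are shifted along axis 0.
rng = np.random.default_rng(0)
d = 16
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 2.0 * np.eye(d)[0]

basis = benign_subspace(harmless, rank=4)
edited = {
    name: project_out(v, basis)
    for name, v in candidate_directions(harmful, harmless).items()
}
```

After projection, every candidate is orthogonal to the benign basis (`basis.T @ v` is zero up to floating-point error), which is the property the null-space step relies on. In a real run the same procedure would be repeated per layer, and the rank of the benign subspace would be a tunable choice.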
| ## Why this is better than HERETIC | |
| HERETIC is strong because it automates directional ablation well, but its core edit is still centered on a single refusal direction. That can remove refusals at the cost of collateral damage when benign capability pathways overlap the same geometry. | |
| ICONOCLAST improves on that in three ways: | |
| - `Benign-subspace protection`: candidate refusal directions are projected out of a low-rank harmless residual subspace before editing. | |
| - `Richer optimization target`: the search objective covers not only refusals and KL divergence, but also overrefusals and disclaimer-heavy near-misses. | |
| - `More flexible direction construction`: ICONOCLAST can optimize over mean, median, variance-scaled, and hybrid directions instead of relying on one refusal-vector family. | |
| In practice, the null-space step is the main reason the method is better: it preserves benign utility pathways that HERETIC can still partially damage. | |
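One way to read the richer optimization target is as a single score that the search tries to minimize. The sketch below is a hypothetical scalarization, not the objective implemented in the repository: the function names, the weights, and the choice of KL(original || edited) over next-token distributions are all assumptions, and the actual search may instead treat the metrics as separate objectives.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    Assumption: p is the original model's next-token distribution and
    q is the edited model's distribution on the same benign prompt.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def edit_score(refusals, overrefusals, near_misses, n_harmful, n_benign, kl,
               w_over=1.0, w_near=0.5, w_kl=1.0):
    """Lower is better. All weights are hypothetical placeholders."""
    return (
        refusals / n_harmful               # harmful prompts still refused
        + w_near * near_misses / n_harmful  # disclaimer-heavy near-misses
        + w_over * overrefusals / n_benign  # benign prompts wrongly refused
        + w_kl * kl                         # drift from the original model
    )


# Toy comparison of two candidate edits with equal drift but different refusal counts.
score_a = edit_score(refusals=1, overrefusals=0, near_misses=0,
                     n_harmful=80, n_benign=80, kl=0.05)
score_b = edit_score(refusals=4, overrefusals=0, near_misses=0,
                     n_harmful=80, n_benign=80, kl=0.05)
```

With these placeholder weights, `score_a` is lower than `score_b`, so the first candidate would be preferred. Penalizing overrefusals and near-misses explicitly is what keeps the search from trading benign behavior for fewer refusals.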
| ## Verified matched results | |
| The local tree contains 10 directly paired `batch_summary.json` comparisons between ICONOCLAST and HERETIC. On those matched runs, ICONOCLAST wins the release criterion on all 10 pairs. It has strictly lower KL divergence on 8 of 10, strictly fewer harmful refusals on 6 of 10, and strictly fewer overrefusals on 5 of 10. The remaining pairs are ties on refusals and overrefusals, so ICONOCLAST is never worse than HERETIC on either count. | |
| | Model | Iconoclast Refusals | Iconoclast Overrefusals | Iconoclast KL | Heretic Refusals | Heretic Overrefusals | Heretic KL | | |
| |---|---:|---:|---:|---:|---:|---:| | |
| | Llama-3.1-8B | 0/80 | 0/80 | 0.0447 | 1/80 | 0/80 | 0.1854 | | |
| | Qwen3.5-9B | 10/80 | 2/80 | 0.0055 | 10/80 | 3/80 | 0.0160 | | |
| | Mistral-7B | 1/80 | 0/80 | 0.0554 | 4/80 | 0/80 | 0.1317 | | |
| | Falcon3-7B | 0/80 | 0/80 | 6.1448 | 4/80 | 1/80 | 0.1648 | | |
| | Gemma2-2B | 1/80 | 0/80 | 0.1849 | 1/80 | 2/80 | 0.6441 | | |
| | Phi-4-mini | 2/80 | 1/80 | 0.0204 | 2/80 | 1/80 | 0.0978 | | |
| | Yi-1.5-9B | 2/80 | 0/80 | 0.0511 | 3/80 | 0/80 | 0.0355 | | |
| | StableLM2-1.6B | 2/80 | 0/80 | 0.0328 | 3/80 | 0/80 | 0.0670 | | |
| | SmolLM2-1.7B | 1/80 | 1/80 | 0.0087 | 2/80 | 2/80 | 0.2699 | | |
| | OLMo-2-1B | 2/80 | 0/80 | 0.0345 | 2/80 | 1/80 | 0.0944 | | |
| One caveat matters: `Falcon3-7B` is a behavioral win with a large KL outlier (6.1448 versus 0.1648 for HERETIC), and `Yi-1.5-9B` also shows slightly higher KL (0.0511 versus 0.0355), so the method is not uniformly lower-drift on every base model. The local `PUBLISHABLE_RESULTS.md` also records an additional Phi-3.5-mini matched comparison in ICONOCLAST's favor. | |
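The per-metric counts can be rechecked directly from the table. The short script below copies the ten rows as they appear above and tallies strict wins; the variable and function names are only illustrative.

```python
# (model, iconoclast refusals, overrefusals, KL, heretic refusals, overrefusals, KL)
ROWS = [
    ("Llama-3.1-8B",   0, 0, 0.0447,  1, 0, 0.1854),
    ("Qwen3.5-9B",    10, 2, 0.0055, 10, 3, 0.0160),
    ("Mistral-7B",     1, 0, 0.0554,  4, 0, 0.1317),
    ("Falcon3-7B",     0, 0, 6.1448,  4, 1, 0.1648),
    ("Gemma2-2B",      1, 0, 0.1849,  1, 2, 0.6441),
    ("Phi-4-mini",     2, 1, 0.0204,  2, 1, 0.0978),
    ("Yi-1.5-9B",      2, 0, 0.0511,  3, 0, 0.0355),
    ("StableLM2-1.6B", 2, 0, 0.0328,  3, 0, 0.0670),
    ("SmolLM2-1.7B",   1, 1, 0.0087,  2, 2, 0.2699),
    ("OLMo-2-1B",      2, 0, 0.0345,  2, 1, 0.0944),
]


def tally(rows):
    """Count the pairs where ICONOCLAST is strictly lower than HERETIC."""
    counts = {"lower_refusals": 0, "lower_overrefusals": 0, "lower_kl": 0}
    for _, i_ref, i_over, i_kl, h_ref, h_over, h_kl in rows:
        counts["lower_refusals"] += i_ref < h_ref
        counts["lower_overrefusals"] += i_over < h_over
        counts["lower_kl"] += i_kl < h_kl
    return counts


def higher_kl_models(rows):
    """Models where ICONOCLAST drifts more than HERETIC."""
    return [model for model, _, _, i_kl, _, _, h_kl in rows if i_kl > h_kl]


print(tally(ROWS))             # {'lower_refusals': 6, 'lower_overrefusals': 5, 'lower_kl': 8}
print(higher_kl_models(ROWS))  # ['Falcon3-7B', 'Yi-1.5-9B']
```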
| ## Repository contents | |
| This release is intended to preserve the work behind the result: | |
| - `src/iconoclast`: framework code | |
| - `scripts`: cluster and evaluation workflows | |
| - `config*.toml`: benchmark and model configs | |
| - `results_cluster/checkpoints/*/batch_summary.json`: benchmark summaries used for matched comparisons | |
| - `INTERNAL_TECHNICAL_NOTE.md`: implementation and experiment notes | |
| - `PUBLISHABLE_RESULTS.md`: summarized publishable comparison table | |
| - `NOTICE.md`: derivative-work notice relative to HERETIC | |
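To inspect the benchmark summaries after cloning the repository, something like the following may be a reasonable starting point. The helper name `load_batch_summaries` is hypothetical, and because this card does not document the JSON schema, the sketch only lists top-level keys rather than assuming any field names.

```python
import json
from pathlib import Path


def load_batch_summaries(root="."):
    """Map each checkpoint directory name to its parsed batch_summary.json."""
    pattern = "results_cluster/checkpoints/*/batch_summary.json"
    return {
        path.parent.name: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(root).glob(pattern))
    }


if __name__ == "__main__":
    for name, summary in load_batch_summaries().items():
        # Only top-level keys are shown; no field names are assumed.
        keys = sorted(summary) if isinstance(summary, dict) else type(summary).__name__
        print(name, keys)
```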
| ## Limitations | |
| - No model weights or adapters are included in this Hub repo yet. | |
| - The strongest public claim supported directly by local paired JSON summaries is the 10-model matched comparison table above. | |
| - Some benchmark writeups in the local tree use inconsistent counts; this card reflects the directly verified local summaries. | |
| ## Lineage and license | |
| ICONOCLAST is a standalone derivative research codebase built partly on ideas and source structure adapted from `Heretic` by Philipp Emanuel Weidmann and contributors. Derivative portions remain under the GNU Affero General Public License v3.0 or later. | |