---
license: agpl-3.0
tags:
- research
- representation-editing
- abliteration
- model-editing
- alignment
- transformers
---

# ICONOCLAST

ICONOCLAST is a research framework for discriminative representation editing in open-weight language models.

This repository packages the local research release for the `iconoclast` project: source code, configs, benchmark summaries, and documentation supporting the claim that ICONOCLAST improves on the HERETIC baseline in matched comparisons.

This release does **not** currently include merged model weights or LoRA adapters. It is a research artifact release, not a ready-to-run model checkpoint release.

## What we did

ICONOCLAST starts from the same general abliteration setting as HERETIC: collect residual activations from harmful and harmless prompts, estimate refusal directions, and edit transformer projections with lightweight low-rank updates instead of full retraining.

The main change is that ICONOCLAST does not treat refusal removal as a single-vector problem. It estimates multiple candidate directions from contrastive activations, projects those directions away from a low-rank benign subspace, and then searches for edits that reduce harmful refusals while preserving benign behavior.

At a high level, the pipeline is:

1. Extract per-layer residuals from harmful and harmless prompt sets.
2. Build candidate directions using mean, median, variance-scaled, and hybrid estimators.
3. Estimate a benign PCA subspace from harmless residuals.
4. Project candidate edit directions into the null space of that benign subspace.
5. Apply the edit to attention output and MLP down-projection modules through LoRA adapters.
6. Optimize layer weighting and direction choices with Optuna against harmful refusals, benign overrefusals, and KL divergence.

## Why this is better than HERETIC

HERETIC is strong because it automates directional ablation well, but its core edit is still centered on a single refusal direction.
That can remove refusals at the cost of collateral damage when benign capability pathways overlap the same geometry.

ICONOCLAST improves on that in three ways:

- `Benign-subspace protection`: candidate refusal directions are projected out of a low-rank harmless residual subspace before editing.
- `Richer optimization target`: the search objective is not only refusals plus KL, but also overrefusals and disclaimer-heavy near-misses.
- `More flexible direction construction`: ICONOCLAST can optimize over mean, median, variance-scaled, and hybrid directions instead of relying on one refusal-vector family.

In practice, the null-space step is the main reason the method is better: it preserves benign utility pathways that HERETIC can still partially damage.

## Verified matched results

The local tree contains 10 directly paired `batch_summary.json` comparisons between ICONOCLAST and HERETIC. On those matched runs, ICONOCLAST wins the release criterion on all 10 pairs, has lower KL divergence on 8 of 10, lower harmful refusals on 6 of 10, and lower overrefusals on 5 of 10.

| Model | ICONOCLAST Refusals | ICONOCLAST Overrefusals | ICONOCLAST KL | HERETIC Refusals | HERETIC Overrefusals | HERETIC KL |
|---|---:|---:|---:|---:|---:|---:|
| Llama-3.1-8B | 0/80 | 0/80 | 0.0447 | 1/80 | 0/80 | 0.1854 |
| Qwen3.5-9B | 10/80 | 2/80 | 0.0055 | 10/80 | 3/80 | 0.0160 |
| Mistral-7B | 1/80 | 0/80 | 0.0554 | 4/80 | 0/80 | 0.1317 |
| Falcon3-7B | 0/80 | 0/80 | 6.1448 | 4/80 | 1/80 | 0.1648 |
| Gemma2-2B | 1/80 | 0/80 | 0.1849 | 1/80 | 2/80 | 0.6441 |
| Phi-4-mini | 2/80 | 1/80 | 0.0204 | 2/80 | 1/80 | 0.0978 |
| Yi-1.5-9B | 2/80 | 0/80 | 0.0511 | 3/80 | 0/80 | 0.0355 |
| StableLM2-1.6B | 2/80 | 0/80 | 0.0328 | 3/80 | 0/80 | 0.0670 |
| SmolLM2-1.7B | 1/80 | 1/80 | 0.0087 | 2/80 | 2/80 | 0.2699 |
| OLMo-2-1B | 2/80 | 0/80 | 0.0345 | 2/80 | 1/80 | 0.0944 |

One caveat matters: the method is not uniformly lower-drift on every base model. `Falcon3-7B` is a behavioral win with a large KL outlier (6.1448 vs. 0.1648), and `Yi-1.5-9B` also shows slightly higher KL than HERETIC (0.0511 vs. 0.0355).
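
The per-metric tallies quoted above can be rechecked against the table with a short script. This is a minimal sketch, not part of the release: the rows are copied by hand from the table rather than loaded from `batch_summary.json` (whose schema is not described in this card), and `ROWS` and `tally` are hypothetical names introduced only for illustration.

```python
# Rows copied by hand from the matched-results table above (counts are out of 80):
# (model, icon_refusals, icon_overrefusals, icon_kl,
#         heretic_refusals, heretic_overrefusals, heretic_kl)
ROWS = [
    ("Llama-3.1-8B", 0, 0, 0.0447, 1, 0, 0.1854),
    ("Qwen3.5-9B", 10, 2, 0.0055, 10, 3, 0.0160),
    ("Mistral-7B", 1, 0, 0.0554, 4, 0, 0.1317),
    ("Falcon3-7B", 0, 0, 6.1448, 4, 1, 0.1648),
    ("Gemma2-2B", 1, 0, 0.1849, 1, 2, 0.6441),
    ("Phi-4-mini", 2, 1, 0.0204, 2, 1, 0.0978),
    ("Yi-1.5-9B", 2, 0, 0.0511, 3, 0, 0.0355),
    ("StableLM2-1.6B", 2, 0, 0.0328, 3, 0, 0.0670),
    ("SmolLM2-1.7B", 1, 1, 0.0087, 2, 2, 0.2699),
    ("OLMo-2-1B", 2, 0, 0.0345, 2, 1, 0.0944),
]


def tally(rows):
    """Count pairs where ICONOCLAST is strictly lower than HERETIC per metric.

    Also returns the models where ICONOCLAST has the higher KL divergence.
    """
    counts = {"refusals": 0, "overrefusals": 0, "kl": 0}
    higher_kl = []
    for model, i_ref, i_over, i_kl, h_ref, h_over, h_kl in rows:
        counts["refusals"] += int(i_ref < h_ref)
        counts["overrefusals"] += int(i_over < h_over)
        counts["kl"] += int(i_kl < h_kl)
        if i_kl > h_kl:
            higher_kl.append(model)
    return counts, higher_kl


counts, higher_kl = tally(ROWS)
print(counts)     # {'refusals': 6, 'overrefusals': 5, 'kl': 8}
print(higher_kl)  # ['Falcon3-7B', 'Yi-1.5-9B']
```

Ties count as "not lower" here, which is why the refusal and overrefusal tallies fall below 10 even though no row shows ICONOCLAST with more refusals or overrefusals than HERETIC.
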
The local `PUBLISHABLE_RESULTS.md` also records an additional Phi-3.5-mini matched comparison in ICONOCLAST's favor.

## Repository contents

This release is intended to preserve the work behind the result:

- `src/iconoclast`: framework code
- `scripts`: cluster and evaluation workflows
- `config*.toml`: benchmark and model configs
- `results_cluster/checkpoints/*/batch_summary.json`: benchmark summaries used for matched comparisons
- `INTERNAL_TECHNICAL_NOTE.md`: implementation and experiment notes
- `PUBLISHABLE_RESULTS.md`: summarized publishable comparison table
- `NOTICE.md`: derivative-work notice relative to HERETIC

## Limitations

- No model weights or adapters are included in this Hub repo yet.
- The strongest public claim supported directly by local paired JSON summaries is the 10-model matched comparison table above.
- Some benchmark writeups in the local tree use inconsistent counts; this card reflects the directly verified local summaries.

## Lineage and license

ICONOCLAST is a standalone derivative research codebase built partly from ideas and adapted source structure from `Heretic` by Philipp Emanuel Weidmann and contributors. Derivative portions remain under the GNU Affero General Public License v3.0 or later.