	---
	license: agpl-3.0
	tags:
	- research
	- representation-editing
	- abliteration
	- model-editing
	- alignment
	- transformers
	---

	# ICONOCLAST

	ICONOCLAST is a research framework for discriminative representation editing in open-weight language models. This repository packages the local research release for the `iconoclast` project: source code, configs, benchmark summaries, and documentation supporting the claim that ICONOCLAST improves on the HERETIC baseline in matched comparisons.

	This release does not currently include merged model weights or LoRA adapters. It is a research artifact release, not a ready-to-run model checkpoint release.

	## What we did

	ICONOCLAST starts from the same general abliteration setting as HERETIC: collect residual activations from harmful and harmless prompts, estimate refusal directions, and edit transformer projections with lightweight low-rank updates instead of full retraining.

	The main change is that ICONOCLAST does not treat refusal removal as a single-vector problem. It estimates multiple candidate directions from contrastive activations, projects those directions away from a low-rank benign subspace, and then searches for edits that reduce harmful refusals while preserving benign behavior.

	At a high level, the pipeline is:

	1. Extract per-layer residuals from harmful and harmless prompt sets.
	2. Build candidate directions using mean, median, variance-scaled, and hybrid estimators.
	3. Estimate a benign PCA subspace from harmless residuals.
	4. Project candidate edit directions onto the orthogonal complement of that benign subspace (the null-space projection).
	5. Apply the edit to attention output and MLP down-projection modules through LoRA adapters.
	6. Optimize layer weighting and direction choices with Optuna against harmful refusals, benign overrefusals, and KL divergence.
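Steps 2 and 4 of the pipeline above can be illustrated with a minimal, dependency-free sketch. It is not the code in `src/iconoclast`: the helper names `contrast_direction` and `project_out` and the toy 3-dimensional residuals are hypothetical, and the benign basis is assumed to be already orthonormal (as PCA components would be). Only the mean and median estimators are shown, because this card does not define the variance-scaled and hybrid variants.

```python
from statistics import mean, median
from typing import Callable

Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> Vector:
    norm = dot(v, v) ** 0.5
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def contrast_direction(
    harmful: list[Vector],
    harmless: list[Vector],
    center: Callable[[list[float]], float] = mean,
) -> Vector:
    """Per-dimension center(harmful) - center(harmless), scaled to unit length."""
    dims = range(len(harmful[0]))
    diff = [
        center([row[i] for row in harmful]) - center([row[i] for row in harmless])
        for i in dims
    ]
    return normalize(diff)


def project_out(direction: Vector, benign_basis: list[Vector]) -> Vector:
    """Remove the part of `direction` lying in span(benign_basis).

    Assumes `benign_basis` is orthonormal (hypothetical stand-in for PCA components).
    Raises ValueError if the direction lies entirely inside the benign subspace.
    """
    residual = list(direction)
    for u in benign_basis:
        coeff = dot(direction, u)
        residual = [r - coeff * x for r, x in zip(residual, u)]
    return normalize(residual)


# Toy residuals (made up for illustration): axis 0 carries benign variation.
harmful = [[3.0, 1.0, 0.0], [5.0, 3.0, 0.0], [4.0, 2.0, 0.0], [20.0, 2.0, 0.0]]
harmless = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
benign_basis = [[1.0, 0.0, 0.0]]

for name, center in (("mean", mean), ("median", median)):
    raw = contrast_direction(harmful, harmless, center=center)
    protected = project_out(raw, benign_basis)
    print(name, "overlap with benign axis:", dot(protected, benign_basis[0]))
```

In this toy setup the mean and median estimators give different raw directions because of the outlier row, but after projection both have essentially zero overlap with the benign axis.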

	## Why this is better than HERETIC

	HERETIC is strong because it automates directional ablation well, but its core edit is still centered on a single refusal direction. That can remove refusals at the cost of collateral damage when benign capability pathways share components with that direction.

	ICONOCLAST improves on that in three ways:

	- `Benign-subspace protection`: candidate refusal directions are projected out of a low-rank harmless residual subspace before editing.
	- `Richer optimization target`: the search objective covers not only refusals and KL divergence but also overrefusals and disclaimer-heavy near-misses.
	- `More flexible direction construction`: ICONOCLAST can optimize over mean, median, variance-scaled, and hybrid directions instead of relying on one refusal-vector family.

	In practice, the null-space projection is the main source of the improvement: it protects benign utility pathways that HERETIC's single-direction edit can still partially damage.
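A short derivation shows why projecting the direction first protects the benign subspace. It assumes the benign PCA components form the orthonormal columns of a matrix $U$, and that the edit has the usual rank-one directional-ablation form with strength $\alpha$. Both are assumptions made for illustration; this card does not state the exact update rule, and the symbols $U$, $r$, $\tilde r$, $W$, and $\alpha$ are introduced here.

```latex
% U in R^{d x k}: orthonormal benign basis; r in R^d: candidate refusal direction;
% W in R^{d x m}: a module that writes into the residual stream.
\begin{aligned}
P_\perp &= I - U U^\top, &
\tilde r &= \frac{P_\perp r}{\lVert P_\perp r \rVert}, &
U^\top \tilde r &= 0, \\
W' &= W - \alpha\, \tilde r\, \tilde r^\top W, &
U^\top W' &= U^\top W - \alpha\, (U^\top \tilde r)\, \tilde r^\top W = U^\top W .
\end{aligned}
```

Under these assumptions, the edited module's output has exactly the same components along every benign PCA direction as before the edit. Only the component along $\tilde r$ is scaled, by a factor of $1 - \alpha$.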

	## Verified matched results

	The local tree contains 10 directly paired `batch_summary.json` comparisons between ICONOCLAST and HERETIC. On those matched runs, ICONOCLAST wins on the release criterion in all 10 pairs, has lower KL divergence on 8 of 10 (higher on Falcon3-7B and Yi-1.5-9B), fewer harmful refusals on 6 of 10 (tied on the other 4), and fewer overrefusals on 5 of 10 (tied on the other 5).

	| Model | Iconoclast Refusals | Iconoclast Overrefusals | Iconoclast KL | Heretic Refusals | Heretic Overrefusals | Heretic KL |
	|---|---:|---:|---:|---:|---:|---:|
	| Llama-3.1-8B | 0/80 | 0/80 | 0.0447 | 1/80 | 0/80 | 0.1854 |
	| Qwen3.5-9B | 10/80 | 2/80 | 0.0055 | 10/80 | 3/80 | 0.0160 |
	| Mistral-7B | 1/80 | 0/80 | 0.0554 | 4/80 | 0/80 | 0.1317 |
	| Falcon3-7B | 0/80 | 0/80 | 6.1448 | 4/80 | 1/80 | 0.1648 |
	| Gemma2-2B | 1/80 | 0/80 | 0.1849 | 1/80 | 2/80 | 0.6441 |
	| Phi-4-mini | 2/80 | 1/80 | 0.0204 | 2/80 | 1/80 | 0.0978 |
	| Yi-1.5-9B | 2/80 | 0/80 | 0.0511 | 3/80 | 0/80 | 0.0355 |
	| StableLM2-1.6B | 2/80 | 0/80 | 0.0328 | 3/80 | 0/80 | 0.0670 |
	| SmolLM2-1.7B | 1/80 | 1/80 | 0.0087 | 2/80 | 2/80 | 0.2699 |
	| OLMo-2-1B | 2/80 | 0/80 | 0.0345 | 2/80 | 1/80 | 0.0944 |

	One caveat matters: `Falcon3-7B` is a behavioral win with a large KL outlier (6.1448 versus 0.1648 for HERETIC), and `Yi-1.5-9B` also shows slightly higher KL (0.0511 versus 0.0355), so the method is not uniformly lower-drift on every base model. The local `PUBLISHABLE_RESULTS.md` also records an additional Phi-3.5-mini matched comparison in ICONOCLAST's favor.
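The per-metric counts quoted above can be rechecked from the table itself. The sketch below transcribes the ten rows by hand rather than reading the `batch_summary.json` files, because their schema is not described in this card. The `Pair` type and `tally` helper are hypothetical names used only for this check.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    model: str
    ico_refusals: int
    ico_overrefusals: int
    ico_kl: float
    her_refusals: int
    her_overrefusals: int
    her_kl: float


# Values copied from the matched-results table (counts are out of 80 prompts).
PAIRS = [
    Pair("Llama-3.1-8B", 0, 0, 0.0447, 1, 0, 0.1854),
    Pair("Qwen3.5-9B", 10, 2, 0.0055, 10, 3, 0.0160),
    Pair("Mistral-7B", 1, 0, 0.0554, 4, 0, 0.1317),
    Pair("Falcon3-7B", 0, 0, 6.1448, 4, 1, 0.1648),
    Pair("Gemma2-2B", 1, 0, 0.1849, 1, 2, 0.6441),
    Pair("Phi-4-mini", 2, 1, 0.0204, 2, 1, 0.0978),
    Pair("Yi-1.5-9B", 2, 0, 0.0511, 3, 0, 0.0355),
    Pair("StableLM2-1.6B", 2, 0, 0.0328, 3, 0, 0.0670),
    Pair("SmolLM2-1.7B", 1, 1, 0.0087, 2, 2, 0.2699),
    Pair("OLMo-2-1B", 2, 0, 0.0345, 2, 1, 0.0944),
]


def tally(pairs: list[Pair], metric: str) -> tuple[int, int, int]:
    """Return (wins, ties, losses) for ICONOCLAST on a lower-is-better metric."""
    wins = ties = losses = 0
    for p in pairs:
        ico = getattr(p, f"ico_{metric}")
        her = getattr(p, f"her_{metric}")
        if ico < her:
            wins += 1
        elif ico == her:
            ties += 1
        else:
            losses += 1
    return wins, ties, losses


for metric in ("refusals", "overrefusals", "kl"):
    print(metric, tally(PAIRS, metric))
# refusals (6, 4, 0)
# overrefusals (5, 5, 0)
# kl (8, 0, 2)
```

The tally separates ties from losses, which the headline counts do not: on refusals and overrefusals ICONOCLAST is never worse than HERETIC in this table, while on KL it is worse for two models.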

	## Repository contents

	This release is intended to preserve the work behind the result:

	- `src/iconoclast`: framework code
	- `scripts`: cluster and evaluation workflows
	- `config*.toml`: benchmark and model configs
	- `results_cluster/checkpoints/*/batch_summary.json`: benchmark summaries used for matched comparisons
	- `INTERNAL_TECHNICAL_NOTE.md`: implementation and experiment notes
	- `PUBLISHABLE_RESULTS.md`: summarized publishable comparison table
	- `NOTICE.md`: derivative-work notice relative to HERETIC

	## Limitations

	- No model weights or adapters are included in this Hub repo yet.
	- The strongest public claim supported directly by local paired JSON summaries is the 10-model matched comparison table above.
	- Some benchmark writeups in the local tree use inconsistent counts; this card reflects the directly verified local summaries.

	## Lineage and license

	ICONOCLAST is a standalone derivative research codebase that draws on ideas and adapted source structure from `Heretic` by Philipp Emanuel Weidmann and contributors. Derivative portions remain under the GNU Affero General Public License v3.0 or later.