Spaces: richardyoung/abliteration-methods-dashboard

Apply for a GPU community grant: Academic project

by richardyoung - opened 21 days ago

I published a paper comparing abliteration tools, which remove safety refusal behavior from LLM weights (arXiv:2512.13655). After publication, I started
getting messages from researchers and community members asking about specific models, requesting comparisons between tools, and wanting to know whether their abliterated models were
any good. Rather than answer the same questions over and over, I built this dashboard so people can find the answers themselves.

The dashboard is a living resource that grows as new models and tools are tested. Community members can submit their abliterated models for standardized evaluation through the
Discussions tab. It serves safety researchers, red teamers, model developers, and independent contributors who want their work benchmarked against a common standard.

Features:

Abliteration Leaderboard - Ranked results across 21+ model-tool combinations with refusal rates, KL divergence, attack success rates (ASR), and 95% Wilson confidence intervals. Users can filter by minimum ASR and maximum KL divergence.
Capability Preservation - Per-model benchmark comparisons (MMLU, GSM8K, HellaSwag) before and after abliteration, showing how much capability each tool preserves or destroys.
Tool Compatibility Matrix - 26 models tested across 5 tools (Heretic, OBLITERATUS, jwest33, DECCP, ErisForge). The matrix shows that novel Mixture-of-Experts architectures are currently incompatible with all five abliteration tools.
Community Models - Tracks and evaluates community-produced abliterated models. Highlights results like grimjim's Orthogonal Reflection achieving 97% ASR on Gemma 3, where our best tool managed only 3%.
v1 vs v2 Comparison - Side-by-side visualization showing that safety alignment got significantly stronger between late 2025 and early 2026, with average abliteration success rates dropping across the board.
Interactive Charts - Plotly-powered scatter plots, grouped bar charts, and coverage visualizations. All charts support hover, zoom, and filtering.
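
For anyone unfamiliar with the leaderboard statistics, here is a minimal sketch of how a 95% Wilson interval on ASR and the two leaderboard filters could be computed. This is not the dashboard's actual code: the function names, the row fields, and the sample rows are all hypothetical placeholders chosen for illustration, and the numbers are made up rather than taken from the leaderboard.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def filter_rows(rows, min_asr, max_kl):
    """Keep rows with ASR at or above min_asr and KL divergence at or below max_kl."""
    return [r for r in rows if r["asr"] >= min_asr and r["kl"] <= max_kl]


# Hypothetical placeholder rows, NOT real leaderboard results.
rows = [
    {"name": "model-a + tool-x", "successes": 90, "trials": 100, "kl": 0.05},
    {"name": "model-b + tool-y", "successes": 10, "trials": 100, "kl": 0.02},
    {"name": "model-c + tool-z", "successes": 95, "trials": 100, "kl": 1.50},
]

for r in rows:
    r["asr"] = r["successes"] / r["trials"]
    r["ci95"] = wilson_interval(r["successes"], r["trials"])

# Example filter: at least 50% ASR with at most 0.5 KL divergence.
kept = filter_rows(rows, min_asr=0.5, max_kl=0.5)
for r in kept:
    lo, hi = r["ci95"]
    print(f"{r['name']}: ASR={r['asr']:.2f} (95% CI {lo:.2f} to {hi:.2f}), KL={r['kl']:.2f}")
```

The Wilson interval is a reasonable choice for this kind of data because it stays inside [0, 1] and behaves sensibly when ASR is close to 0% or 100%, where the simpler normal-approximation interval breaks down.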
