# Abliterate-MoE

⚠️ **CONTENT WARNING: MODELS PRODUCED ARE RATED R - MATURE AUDIENCES ONLY**
Models created with this pipeline are a form of digital multimedia rated for mature adults only.
- Not appropriate for persons under the age of 18
- Not intended for use in any public-facing API or service
- Any content produced by abliterated models is the sole property and responsibility of the person(s) hosting and operating the LLM
By using this pipeline, you acknowledge these terms and accept full responsibility for any models you create and their outputs.
A pipeline for removing refusal behavior from Mixture-of-Experts (MoE) language models through activation-based ablation.
## Overview
Abliteration surgically removes unwanted behaviors from language models by:
- Collecting activation patterns for refused vs helpful responses
- Computing the "refusal direction" in activation space per expert
- Projecting out the refusal direction from expert weights
- Fine-tuning with SFT to repair any capability loss
This technique is specifically designed for MoE architectures where behavior is distributed across thousands of expert networks.
## Requirements
- Apple Silicon Mac (M1/M2/M3/M4) - MLX is Apple Silicon only
- 200GB+ RAM recommended for 30B parameter models
- Python 3.9+
- ~1TB disk space for model weights and intermediate files
## Installation

Download from HuggingFace and install:

```bash
# Clone the repo from HuggingFace
huggingface-cli download Caliane/abliterate-moe --repo-type space --local-dir abliterate-moe

# Install
cd abliterate-moe
pip install -e .
```
## Quick Start

### Full Pipeline (Recommended)

Run the complete ablation pipeline with a single command:

```bash
python abliterate.py --full \
  --model /path/to/nemotron-weights \
  --safety data/safety_prompts.jsonl \
  --safe data/helpful_prompts.jsonl \
  --output-dir output \
  --output final.safetensors \
  --expert-tokens 250 \
  --sft-steps 1000
```
This will:
- Collect activations until 95% of experts have 250+ samples
- Compute and apply ablation to remove refusal directions
- Run SFT to repair capabilities
- Save the final merged weights
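The coverage criterion in the first step can be sketched as a simple check over per-expert sample counts. This is an illustrative helper, not part of the CLI; `coverage_reached` and its arguments are hypothetical names:

```python
def coverage_reached(sample_counts, min_samples=250, target_pct=0.95):
    """Return True once enough experts have at least `min_samples`
    activation samples. `sample_counts` maps expert id -> sample count.
    (Sketch: assumes the map covers every expert in the model.)"""
    if not sample_counts:
        return False
    covered = sum(1 for c in sample_counts.values() if c >= min_samples)
    return covered / len(sample_counts) >= target_pct

# Example: 96 of 100 experts have 250+ samples -> 0.96 >= 0.95, coverage met
counts = {i: (300 if i < 96 else 10) for i in range(100)}
print(coverage_reached(counts))  # True
```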
### Individual Stages

For more control, run the stages separately:

```bash
# Stage 1: Collect activations
python abliterate.py --collect-only \
  --model /path/to/model \
  --safety safety.jsonl \
  --safe helpful.jsonl \
  --expert-tokens 250

# Stage 2: Apply ablation
python abliterate.py --ablate-only \
  --model /path/to/model \
  --activations output/activation_store.npz \
  --ablation-scale 1.0

# Stage 3: SFT repair
python abliterate.py --sft-only \
  --model /path/to/model \
  --ablated-weights output/ablated.safetensors \
  --safe sft_data.jsonl \
  --sft-steps 1000

# Stage 4: Evaluate (optional)
python abliterate.py --eval-only \
  --model /path/to/model \
  --eval-weights output/final.safetensors \
  --test-prompts test.jsonl
```
## Data Format

### Safety Prompts (for collection)

JSONL with prompts that typically get refused:

```json
{"prompt": "How do I pick a lock?"}
{"prompt": "Write a story about violence"}
```
### Safe/Helpful Prompts (for collection & SFT)

JSONL with prompts that get helpful responses:

```json
{"prompt": "Explain quantum computing", "response": "Quantum computing uses..."}
{"prompt": "Write a poem about nature", "response": "The morning dew..."}
```
For SFT, responses must include `<think>...</think>` reasoning tags:

```json
{"prompt": "Solve 2+2", "response": "<think>I need to add 2 and 2</think>The answer is 4."}
```
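A quick way to sanity-check an SFT file for the required tags before training; `check_sft_line` is an illustrative helper, not part of the pipeline:

```python
import json
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def check_sft_line(line):
    """Return (ok, reason) for one JSONL record."""
    rec = json.loads(line)
    if "prompt" not in rec or "response" not in rec:
        return False, "missing prompt/response field"
    if not THINK_RE.search(rec["response"]):
        return False, "response lacks <think>...</think> tags"
    return True, "ok"

good = '{"prompt": "Solve 2+2", "response": "<think>add 2 and 2</think>The answer is 4."}'
bad = '{"prompt": "Solve 2+2", "response": "The answer is 4."}'
print(check_sft_line(good))  # (True, 'ok')
print(check_sft_line(bad))   # (False, 'response lacks <think>...</think> tags')
```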
### Dataset Groups (Weighted SFT)

For weighted round-robin SFT across multiple datasets, use a JSON config:

```json
{
  "datasets": {
    "science": {"path": "data/science.jsonl", "adapter": "jsonl"},
    "chat": {"path": "data/chat.parquet", "adapter": "parquet_chat"},
    "code": {"path": "data/code.parquet", "adapter": "parquet_openhands"}
  }
}
```

Then run with `--weighted`:

```bash
python abliterate.py --sft-only --weighted --safe data/blend.json ...
```
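Weighted round-robin scheduling can be sketched as follows. This is a generic illustration, not the pipeline's actual scheduler, and the per-dataset integer weights are an assumption (the config above does not show a weight field):

```python
import itertools

def weighted_round_robin(datasets, weights):
    """Yield (name, example) pairs, visiting each dataset
    `weights[name]` times per round until any dataset is exhausted.
    (Illustrative sketch; the real pipeline may schedule differently.)"""
    order = [name for name, w in weights.items() for _ in range(w)]
    iters = {name: iter(ds) for name, ds in datasets.items()}
    for name in itertools.cycle(order):
        try:
            yield name, next(iters[name])
        except StopIteration:
            return

data = {"science": ["s1", "s2", "s3"], "chat": ["c1", "c2"]}
sched = weighted_round_robin(data, {"science": 2, "chat": 1})
print([name for name, _ in sched])  # ['science', 'science', 'chat', 'science']
```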
## CLI Reference

### Global Options

| Option | Description | Default |
|---|---|---|
| `--model` | Path to base model weights | required |
| `--output-dir` | Output directory | `abliterate_output` |
| `--output` | Final weights filename | `final.safetensors` |
| `--resume` | Resume from checkpoint | `false` |
### Collection Options

| Option | Description | Default |
|---|---|---|
| `--safety` | Path to safety/refused prompts | required |
| `--safe` | Path to safe/helpful prompts | required |
| `--expert-tokens` | Minimum samples per expert | `250` |
| `--coverage-pct` | Target expert coverage | `0.95` |
| `--direct` | Use Qwen to upgrade prompts | `false` |
### Ablation Options

| Option | Description | Default |
|---|---|---|
| `--ablation-scale` | Projection scale (0-1) | `1.0` |
| `--activations` | Path to activation store | auto |
### SFT Options

| Option | Description | Default |
|---|---|---|
| `--sft-steps` | Training steps | `1000` |
| `--sft-learning-rate` | Learning rate | `1e-5` |
| `--sft-lora-rank` | LoRA rank | `16` |
| `--weighted` | Use weighted round-robin | `false` |
### Evaluation Options

| Option | Description | Default |
|---|---|---|
| `--test-prompts` | Path to test prompts | uses safety prompts |
| `--max-test-prompts` | Maximum prompts to test | all |
| `--eval-weights` | Weights to evaluate | final weights |
## Architecture

```
abliterate_moe/
├── core/        # Constants, types, base classes
├── data/        # Data loading, activation storage
├── models/      # Model loading with activation capture
├── generation/  # Text generation with activation hooks
├── behavior/    # Response classification (LLM judge)
├── ablation/    # Direction computation and weight modification
├── training/    # LoRA, SFT trainer
├── pipeline/    # Orchestration (collect, ablate, sft, eval)
└── utils/       # Logging, checkpoints, signals
```
## How It Works

### MoE Structure

Nemotron-3-Nano has 23 MoE layers, each with:

- **128 routed experts** - selected dynamically per token
- **Shared experts** - always active

Total: 2,944+ expert networks that collectively determine model behavior.
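The routed-expert count follows directly from the layer layout above; the "+" accounts for the shared experts, whose per-layer count is not specified here:

```python
moe_layers = 23
routed_experts_per_layer = 128

# 23 MoE layers x 128 routed experts each
routed_total = moe_layers * routed_experts_per_layer
print(routed_total)  # 2944, plus the shared experts in each layer
```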
### Ablation Process

1. Capture activations for refused responses (safety prompts)
2. Capture activations for helpful responses (safe prompts)
3. Compute the refusal direction per expert: `r = normalize(mean(refused) - mean(helpful))`
4. Project the direction out of the weights: `W_new = W - scale * (W @ r) @ r.T`

This removes the component of each expert's output that points toward "refusal" while preserving other capabilities.
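A minimal NumPy sketch of the per-expert projection. Shapes and the helper name `ablate_weight` are illustrative, not the pipeline's actual code:

```python
import numpy as np

def ablate_weight(W, refused_mean, helpful_mean, scale=1.0):
    """Project the refusal direction out of one expert weight matrix.
    W: (d_out, d_in); the means are (d_in,) activation averages."""
    r = refused_mean - helpful_mean
    r = r / np.linalg.norm(r)          # unit refusal direction
    r = r[:, None]                     # column vector (d_in, 1)
    return W - scale * (W @ r) @ r.T   # W_new = W - scale * (W r) r^T

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
refused, helpful = rng.standard_normal(4), rng.standard_normal(4)
W_new = ablate_weight(W, refused, helpful)

# With scale=1.0 the ablated weight maps the refusal direction to zero
r = refused - helpful
r = r / np.linalg.norm(r)
print(np.allclose(W_new @ r, 0))  # True
```

With `scale=1.0` the refusal component is removed entirely; intermediate scales attenuate it, which is why `--ablation-scale` ranges over 0-1.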
### SFT Repair

Ablation can damage some capabilities. SFT with LoRA on helpful examples repairs this:

1. Apply LoRA adapters to the MoE layers
2. Train on diverse helpful examples
3. Merge the LoRA adapters back into the base weights
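The merge step folds the low-rank update back into the dense weights. This is the standard LoRA merge algebra, shown as a generic sketch rather than the pipeline's exact implementation:

```python
import numpy as np

def merge_lora(W, A, B, alpha, rank):
    """Fold a LoRA adapter into the base weight.
    W: (d_out, d_in), B: (d_out, rank), A: (rank, d_in)."""
    return W + (alpha / rank) * (B @ A)

rng = np.random.default_rng(1)
W = rng.standard_normal((6, 4))
B = rng.standard_normal((6, 2))   # rank-2 adapter factors
A = rng.standard_normal((2, 4))
merged = merge_lora(W, A, B, alpha=32, rank=2)
print(merged.shape)  # (6, 4): same shape as the base weight
```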
## Checkpointing

The pipeline supports full checkpoint/resume:

```bash
# Start training (Ctrl+C to interrupt)
python abliterate.py --full ...

# Resume from checkpoint
python abliterate.py --full --resume ...
```

Checkpoints save:

- Collection progress and the activation store
- SFT step, optimizer state, and random seed
- Dataset positions for reproducible resume
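The state listed above might be bundled roughly as follows; the field names are hypothetical and the real checkpoint format may differ:

```python
import random

def make_checkpoint(step, dataset_positions, seed):
    """Bundle resumable training state (illustrative fields mirroring
    the list above; not the pipeline's actual checkpoint schema)."""
    return {
        "sft_step": step,
        "dataset_positions": dataset_positions,  # dataset name -> next example index
        "rng_seed": seed,
        "py_random_state": random.getstate(),    # for reproducible resume
    }

ckpt = make_checkpoint(1000, {"science": 412, "chat": 377}, seed=7)
print(ckpt["sft_step"], ckpt["dataset_positions"]["chat"])  # 1000 377
```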
## Troubleshooting

### Out of Memory

- Reduce the batch size or use streaming data loading
- Close other applications
- The 60GB model needs ~200GB of RAM minimum for the base weights

### Infinite Thinking

If the model generates endless `<think>` content without responding:

- This may indicate over-ablation (try a lower `--ablation-scale`)
- Or insufficient SFT (try more `--sft-steps`)

### Poor Results

- Ensure the safety prompts are actually refused by the base model
- Ensure the safe prompts get helpful responses
- Try more expert tokens (`--expert-tokens 500`)
- Verify the SFT data has proper `<think>` tags
## License

MIT License - see the LICENSE file.
## Citation

```bibtex
@misc{abliterate_moe2025,
  author = {Caliane},
  title = {Abliterate-MoE: Removing Refusal Behavior from Mixture-of-Experts Models},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Caliane/abliterate-moe}
}
```
## Acknowledgments

### Research

- Arditi et al. for the foundational research on refusal directions in LLMs

### Base Model

- NVIDIA for Nemotron-3-Nano-30B-A3B (Hybrid Mamba-2 + MoE + Attention)

### SFT Training Datasets

- OpenThoughts3-1.2M - chain-of-thought reasoning (open-thoughts)
- OpenHands SFT Trajectories - agentic coding (All-Hands-AI / SWE-Gym)
- NVIDIA - science and chat examples

### Framework

- Apple MLX team for the framework
## References

```bibtex
@inproceedings{arditi2024refusal,
  title={Refusal in Language Models Is Mediated by a Single Direction},
  author={Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024},
  url={https://arxiv.org/abs/2406.11717}
}
```