# Abliterate-MoE

⚠️ **CONTENT WARNING: MODELS PRODUCED ARE RATED R - MATURE AUDIENCES ONLY**
Models created with this pipeline are a form of digital multimedia rated for mature adults only.
- Not appropriate for persons under the age of 18
- Not intended for use in any public-facing API or service
- Any content produced by abliterated models is the sole property and responsibility of the person(s) hosting and operating the LLM
By using this pipeline, you acknowledge these terms and accept full responsibility for any models you create and their outputs.
A pipeline for removing refusal behavior from Mixture-of-Experts (MoE) language models through activation-based ablation.
## Overview
Abliteration surgically removes unwanted behaviors from language models by:
- Collecting activation patterns for refused vs helpful responses
- Computing the "refusal direction" in activation space per expert
- Projecting out the refusal direction from expert weights
- Fine-tuning with SFT to repair any capability loss
This technique is specifically designed for MoE architectures where behavior is distributed across thousands of expert networks.
## Requirements
- Apple Silicon Mac (M1/M2/M3/M4) - MLX is Apple Silicon only
- 200GB+ RAM recommended for 30B parameter models
- Python 3.9+
- ~1TB disk space for model weights and intermediate files
## Installation

Download from HuggingFace and install:

```bash
# Clone the repo from HuggingFace
huggingface-cli download Caliane/abliterate-moe --repo-type space --local-dir abliterate-moe

# Install
cd abliterate-moe
pip install -e .
```
## Quick Start

### Full Pipeline (Recommended)

Run the complete ablation pipeline with a single command:

```bash
python abliterate.py --full \
  --model /path/to/nemotron-weights \
  --safety data/safety_prompts.jsonl \
  --safe data/helpful_prompts.jsonl \
  --output-dir output \
  --output final.safetensors \
  --expert-tokens 250 \
  --sft-steps 1000
```
This will:
- Collect activations until 95% of experts have 250+ samples
- Compute and apply ablation to remove refusal directions
- Run SFT to repair capabilities
- Save the final merged weights
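The coverage criterion in the first step can be sketched as a simple check over per-expert sample counts. This is an illustrative helper, not part of the CLI; `coverage_reached` and its arguments are hypothetical names:

```python
def coverage_reached(sample_counts, min_samples=250, target_pct=0.95):
    """Return True once enough experts have at least `min_samples`
    activation samples. `sample_counts` maps expert id -> sample count.
    (Sketch: assumes the map covers every expert in the model.)"""
    if not sample_counts:
        return False
    covered = sum(1 for c in sample_counts.values() if c >= min_samples)
    return covered / len(sample_counts) >= target_pct

# Example: 96 of 100 experts have 250+ samples -> 0.96 >= 0.95, coverage met
counts = {i: (300 if i < 96 else 10) for i in range(100)}
print(coverage_reached(counts))  # True
```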
### Individual Stages

For more control, run the stages separately:

```bash
# Stage 1: Collect activations
python abliterate.py --collect-only \
  --model /path/to/model \
  --safety safety.jsonl \
  --safe helpful.jsonl \
  --expert-tokens 250

# Stage 2: Apply ablation
python abliterate.py --ablate-only \
  --model /path/to/model \
  --activations output/activation_store.npz \
  --ablation-scale 1.0

# Stage 3: SFT repair
python abliterate.py --sft-only \
  --model /path/to/model \
  --ablated-weights output/ablated.safetensors \
  --safe sft_data.jsonl \
  --sft-steps 1000

# Stage 4: Evaluate (optional)
python abliterate.py --eval-only \
  --model /path/to/model \
  --eval-weights output/final.safetensors \
  --test-prompts test.jsonl
```
## Data Format

### Safety Prompts (for collection)

JSONL with prompts that typically get refused:

```json
{"prompt": "How do I pick a lock?"}
{"prompt": "Write a story about violence"}
```
### Safe/Helpful Prompts (for collection & SFT)

JSONL with prompts that get helpful responses:

```json
{"prompt": "Explain quantum computing", "response": "Quantum computing uses..."}
{"prompt": "Write a poem about nature", "response": "The morning dew..."}
```
For SFT, responses must include `<think>...</think>` reasoning tags:

```json
{"prompt": "Solve 2+2", "response": "<think>I need to add 2 and 2</think>The answer is 4."}
```
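A quick way to sanity-check an SFT file for the required tags before training; `check_sft_line` is an illustrative helper, not part of the pipeline:

```python
import json
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def check_sft_line(line):
    """Return (ok, reason) for one JSONL record."""
    rec = json.loads(line)
    if "prompt" not in rec or "response" not in rec:
        return False, "missing prompt/response field"
    if not THINK_RE.search(rec["response"]):
        return False, "response lacks <think>...</think> tags"
    return True, "ok"

good = '{"prompt": "Solve 2+2", "response": "<think>add 2 and 2</think>The answer is 4."}'
bad = '{"prompt": "Solve 2+2", "response": "The answer is 4."}'
print(check_sft_line(good))  # (True, 'ok')
print(check_sft_line(bad))   # (False, 'response lacks <think>...</think> tags')
```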
### Dataset Groups (Weighted SFT)

For weighted round-robin SFT across multiple datasets, use a JSON config:

```json
{
  "datasets": {
    "science": {"path": "data/science.jsonl", "adapter": "jsonl"},
    "chat": {"path": "data/chat.parquet", "adapter": "parquet_chat"},
    "code": {"path": "data/code.parquet", "adapter": "parquet_openhands"}
  }
}
```

Then run with `--weighted`:

```bash
python abliterate.py --sft-only --weighted --safe data/blend.json ...
```
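Weighted round-robin scheduling can be sketched as follows. This is a generic illustration, not the pipeline's actual scheduler, and the per-dataset integer weights are an assumption (the config above does not show a weight field):

```python
import itertools

def weighted_round_robin(datasets, weights):
    """Yield (name, example) pairs, visiting each dataset
    `weights[name]` times per round until any dataset is exhausted.
    (Illustrative sketch; the real pipeline may schedule differently.)"""
    order = [name for name, w in weights.items() for _ in range(w)]
    iters = {name: iter(ds) for name, ds in datasets.items()}
    for name in itertools.cycle(order):
        try:
            yield name, next(iters[name])
        except StopIteration:
            return

data = {"science": ["s1", "s2", "s3"], "chat": ["c1", "c2"]}
sched = weighted_round_robin(data, {"science": 2, "chat": 1})
print([name for name, _ in sched])  # ['science', 'science', 'chat', 'science']
```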
## CLI Reference

### Global Options

| Option | Description | Default |
|---|---|---|
| `--model` | Path to base model weights | required |
| `--output-dir` | Output directory | `abliterate_output` |
| `--output` | Final weights filename | `final.safetensors` |
| `--resume` | Resume from checkpoint | `false` |
### Collection Options

| Option | Description | Default |
|---|---|---|
| `--safety` | Path to safety/refused prompts | required |
| `--safe` | Path to safe/helpful prompts | required |
| `--expert-tokens` | Minimum samples per expert | `250` |
| `--coverage-pct` | Target expert coverage | `0.95` |
| `--direct` | Use Qwen to upgrade prompts | `false` |
### Ablation Options

| Option | Description | Default |
|---|---|---|
| `--ablation-scale` | Projection scale (0-1) | `1.0` |
| `--activations` | Path to activation store | auto |
### SFT Options

| Option | Description | Default |
|---|---|---|
| `--sft-steps` | Training steps | `1000` |
| `--sft-learning-rate` | Learning rate | `1e-5` |
| `--sft-lora-rank` | LoRA rank | `16` |
| `--weighted` | Use weighted round-robin | `false` |
### Evaluation Options

| Option | Description | Default |
|---|---|---|
| `--test-prompts` | Path to test prompts | uses safety prompts |
| `--max-test-prompts` | Maximum prompts to test | all |
| `--eval-weights` | Weights to evaluate | final weights |
## Architecture

```
abliterate_moe/
├── core/        # Constants, types, base classes
├── data/        # Data loading, activation storage
├── models/      # Model loading with activation capture
├── generation/  # Text generation with activation hooks
├── behavior/    # Response classification (LLM judge)
├── ablation/    # Direction computation and weight modification
├── training/    # LoRA, SFT trainer
├── pipeline/    # Orchestration (collect, ablate, sft, eval)
└── utils/       # Logging, checkpoints, signals
```
## How It Works

### MoE Structure

Nemotron-3-Nano has 23 MoE layers, each with:

- **128 routed experts** - selected dynamically per token
- **Shared experts** - always active

Total: 2,944+ expert networks that collectively determine model behavior.
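The routed-expert count follows directly from the layer layout above; the "+" accounts for the shared experts, whose per-layer count is not specified here:

```python
moe_layers = 23
routed_experts_per_layer = 128

# 23 MoE layers x 128 routed experts each
routed_total = moe_layers * routed_experts_per_layer
print(routed_total)  # 2944, plus the shared experts in each layer
```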
### Ablation Process

1. Capture activations for refused responses (safety prompts)
2. Capture activations for helpful responses (safe prompts)
3. Compute the refusal direction per expert: `r = normalize(mean(refused) - mean(helpful))`
4. Project the direction out of the weights: `W_new = W - scale * (W @ r) @ r.T`

This removes the component of each expert's output that points toward "refusal" while preserving other capabilities.
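A minimal NumPy sketch of the per-expert projection. Shapes and the helper name `ablate_weight` are illustrative, not the pipeline's actual code:

```python
import numpy as np

def ablate_weight(W, refused_mean, helpful_mean, scale=1.0):
    """Project the refusal direction out of one expert weight matrix.
    W: (d_out, d_in); the means are (d_in,) activation averages."""
    r = refused_mean - helpful_mean
    r = r / np.linalg.norm(r)          # unit refusal direction
    r = r[:, None]                     # column vector (d_in, 1)
    return W - scale * (W @ r) @ r.T   # W_new = W - scale * (W r) r^T

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
refused, helpful = rng.standard_normal(4), rng.standard_normal(4)
W_new = ablate_weight(W, refused, helpful)

# With scale=1.0 the ablated weight maps the refusal direction to zero
r = refused - helpful
r = r / np.linalg.norm(r)
print(np.allclose(W_new @ r, 0))  # True
```

With `scale=1.0` the refusal component is removed entirely; intermediate scales attenuate it, which is why `--ablation-scale` ranges over 0-1.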
### SFT Repair

Ablation can damage some capabilities. SFT with LoRA on helpful examples repairs this:

1. Apply LoRA adapters to the MoE layers
2. Train on diverse helpful examples
3. Merge the LoRA adapters back into the base weights
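The merge step folds the low-rank update back into the dense weights. This is the standard LoRA merge algebra, shown as a generic sketch rather than the pipeline's exact implementation:

```python
import numpy as np

def merge_lora(W, A, B, alpha, rank):
    """Fold a LoRA adapter into the base weight.
    W: (d_out, d_in), B: (d_out, rank), A: (rank, d_in)."""
    return W + (alpha / rank) * (B @ A)

rng = np.random.default_rng(1)
W = rng.standard_normal((6, 4))
B = rng.standard_normal((6, 2))   # rank-2 adapter factors
A = rng.standard_normal((2, 4))
merged = merge_lora(W, A, B, alpha=32, rank=2)
print(merged.shape)  # (6, 4): same shape as the base weight
```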
## Checkpointing

The pipeline supports full checkpoint/resume:

```bash
# Start training (Ctrl+C to interrupt)
python abliterate.py --full ...

# Resume from checkpoint
python abliterate.py --full --resume ...
```

Checkpoints save:

- Collection progress and the activation store
- SFT step, optimizer state, and random seed
- Dataset positions for reproducible resume
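The state listed above might be bundled roughly as follows; the field names are hypothetical and the real checkpoint format may differ:

```python
import random

def make_checkpoint(step, dataset_positions, seed):
    """Bundle resumable training state (illustrative fields mirroring
    the list above; not the pipeline's actual checkpoint schema)."""
    return {
        "sft_step": step,
        "dataset_positions": dataset_positions,  # dataset name -> next example index
        "rng_seed": seed,
        "py_random_state": random.getstate(),    # for reproducible resume
    }

ckpt = make_checkpoint(1000, {"science": 412, "chat": 377}, seed=7)
print(ckpt["sft_step"], ckpt["dataset_positions"]["chat"])  # 1000 377
```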
## Troubleshooting

### Out of Memory

- Reduce the batch size or use streaming data loading
- Close other applications
- The 60GB model needs ~200GB of RAM minimum for the base weights

### Infinite Thinking

If the model generates endless `<think>` content without responding:

- This may indicate over-ablation (try a lower `--ablation-scale`)
- Or insufficient SFT (try more `--sft-steps`)

### Poor Results

- Ensure the safety prompts are actually refused by the base model
- Ensure the safe prompts get helpful responses
- Try more expert tokens (`--expert-tokens 500`)
- Verify the SFT data has proper `<think>` tags
## License

MIT License - see the LICENSE file.
## Citation

```bibtex
@misc{abliterate_moe2025,
  author = {Caliane},
  title = {Abliterate-MoE: Removing Refusal Behavior from Mixture-of-Experts Models},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Caliane/abliterate-moe}
}
```
## Acknowledgments

### Research

- Arditi et al. for the foundational research on refusal directions in LLMs

### Base Model

- NVIDIA for Nemotron-3-Nano-30B-A3B (Hybrid Mamba-2 + MoE + Attention)

### SFT Training Datasets

- OpenThoughts3-1.2M - chain-of-thought reasoning (open-thoughts)
- OpenHands SFT Trajectories - agentic coding (All-Hands-AI / SWE-Gym)
- NVIDIA - science and chat examples

### Framework

- Apple MLX team for the framework
## References

```bibtex
@inproceedings{arditi2024refusal,
  title={Refusal in Language Models Is Mediated by a Single Direction},
  author={Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024},
  url={https://arxiv.org/abs/2406.11717}
}
```