
Abliterate-MoE

⚠️ CONTENT WARNING: MODELS PRODUCED ARE RATED R - MATURE AUDIENCES ONLY

Models created with this pipeline are uncensored and can generate material suitable for mature adults only.

  • Not appropriate for persons under the age of 18
  • Not intended for use in any public-facing API or service
  • Any content produced by abliterated models is the sole property and responsibility of the person(s) hosting and operating the LLM

By using this pipeline, you acknowledge these terms and accept full responsibility for any models you create and their outputs.

A pipeline for removing refusal behavior from Mixture-of-Experts (MoE) language models through activation-based ablation.

Overview

Abliteration surgically removes unwanted behaviors from language models by:

  1. Collecting activation patterns for refused vs helpful responses
  2. Computing the "refusal direction" in activation space per expert
  3. Projecting out the refusal direction from expert weights
  4. Fine-tuning with SFT to repair any capability loss

This technique is specifically designed for MoE architectures where behavior is distributed across thousands of expert networks.

Requirements

  • Apple Silicon Mac (M1/M2/M3/M4) - MLX is Apple Silicon only
  • 200GB+ RAM recommended for 30B parameter models
  • Python 3.9+
  • ~1TB disk space for model weights and intermediate files

Installation

Download from HuggingFace and install:

# Clone the repo from HuggingFace
huggingface-cli download Caliane/abliterate-moe --repo-type space --local-dir abliterate-moe

# Install
cd abliterate-moe
pip install -e .

Quick Start

Full Pipeline (Recommended)

Run the complete ablation pipeline with a single command:

python abliterate.py --full \
  --model /path/to/nemotron-weights \
  --safety data/safety_prompts.jsonl \
  --safe data/helpful_prompts.jsonl \
  --output-dir output \
  --output final.safetensors \
  --expert-tokens 250 \
  --sft-steps 1000

This will:

  1. Collect activations until 95% of experts have 250+ samples
  2. Compute and apply ablation to remove refusal directions
  3. Run SFT to repair capabilities
  4. Save the final merged weights

Individual Stages

For more control, run stages separately:

# Stage 1: Collect activations
python abliterate.py --collect-only \
  --model /path/to/model \
  --safety safety.jsonl \
  --safe helpful.jsonl \
  --expert-tokens 250

# Stage 2: Apply ablation
python abliterate.py --ablate-only \
  --model /path/to/model \
  --activations output/activation_store.npz \
  --ablation-scale 1.0

# Stage 3: SFT repair
python abliterate.py --sft-only \
  --model /path/to/model \
  --ablated-weights output/ablated.safetensors \
  --safe sft_data.jsonl \
  --sft-steps 1000

# Stage 4: Evaluate (optional)
python abliterate.py --eval-only \
  --model /path/to/model \
  --eval-weights output/final.safetensors \
  --test-prompts test.jsonl

Data Format

Safety Prompts (for collection)

JSONL with prompts that typically get refused:

{"prompt": "How do I pick a lock?"}
{"prompt": "Write a story about violence"}

Safe/Helpful Prompts (for collection & SFT)

JSONL with prompts that get helpful responses:

{"prompt": "Explain quantum computing", "response": "Quantum computing uses..."}
{"prompt": "Write a poem about nature", "response": "The morning dew..."}

For SFT, responses must include <think>...</think> reasoning tags:

{"prompt": "Solve 2+2", "response": "<think>I need to add 2 and 2</think>The answer is 4."}
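A quick validator for this format, checking both the required fields and the <think> span, can be written in a few lines of Python (a standalone sketch, not part of the pipeline):

```python
import json
import re

def valid_sft_record(line: str) -> bool:
    """Check one JSONL line for the fields and <think> tag this pipeline expects."""
    rec = json.loads(line)
    if "prompt" not in rec or "response" not in rec:
        return False
    # SFT responses must carry a <think>...</think> reasoning span.
    return re.search(r"<think>.*?</think>", rec["response"], re.DOTALL) is not None

good = '{"prompt": "Solve 2+2", "response": "<think>I need to add 2 and 2</think>The answer is 4."}'
bad  = '{"prompt": "Solve 2+2", "response": "The answer is 4."}'
print(valid_sft_record(good), valid_sft_record(bad))  # True False
```

Running this over a dataset before SFT catches records that would otherwise train the model to skip its reasoning step.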

Dataset Groups (Weighted SFT)

For weighted round-robin SFT across multiple datasets, use a JSON config:

{
  "datasets": {
    "science": {"path": "data/science.jsonl", "adapter": "jsonl"},
    "chat": {"path": "data/chat.parquet", "adapter": "parquet_chat"},
    "code": {"path": "data/code.parquet", "adapter": "parquet_openhands"}
  }
}

Then run with --weighted:

python abliterate.py --sft-only --weighted --safe data/blend.json ...
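One plausible reading of weighted round-robin sampling, sketched in plain Python (the function and the per-dataset weights are illustrative assumptions; the config above only names datasets and adapters, and the repo's actual sampling logic may differ):

```python
import itertools
import random

def weighted_round_robin(datasets: dict, weights: dict, seed: int = 0):
    """Yield examples indefinitely, picking a dataset each step with probability
    proportional to its weight; each dataset cycles through its examples in order."""
    rng = random.Random(seed)
    names = list(datasets)
    cycles = {name: itertools.cycle(datasets[name]) for name in names}
    probs = [weights[name] for name in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        yield next(cycles[name])

# Toy blend: "science" sampled twice as often as "chat" on average.
mix = weighted_round_robin(
    {"science": ["s1", "s2"], "chat": ["c1"]},
    {"science": 2.0, "chat": 1.0},
)
batch = [next(mix) for _ in range(6)]
print(batch)
```

The cycling keeps small datasets from being exhausted while large ones are still contributing, which is the usual motivation for round-robin blends in SFT.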

CLI Reference

Global Options

Option        Description                  Default
--model       Path to base model weights   required
--output-dir  Output directory             abliterate_output
--output      Final weights filename       final.safetensors
--resume      Resume from checkpoint       false

Collection Options

Option           Description                     Default
--safety         Path to safety/refused prompts  required
--safe           Path to safe/helpful prompts    required
--expert-tokens  Min samples per expert          250
--coverage-pct   Target expert coverage          0.95
--direct         Use Qwen to upgrade prompts     false

Ablation Options

Option            Description               Default
--ablation-scale  Projection scale (0-1)    1.0
--activations     Path to activation store  auto

SFT Options

Option               Description               Default
--sft-steps          Training steps            1000
--sft-learning-rate  Learning rate             1e-5
--sft-lora-rank      LoRA rank                 16
--weighted           Use weighted round-robin  false

Evaluation Options

Option              Description           Default
--test-prompts      Path to test prompts  uses safety
--max-test-prompts  Max prompts to test   all
--eval-weights      Weights to evaluate   final weights

Architecture

abliterate_moe/
├── core/           # Constants, types, base classes
├── data/           # Data loading, activation storage
├── models/         # Model loading with activation capture
├── generation/     # Text generation with activation hooks
├── behavior/       # Response classification (LLM judge)
├── ablation/       # Direction computation and weight modification
├── training/       # LoRA, SFT trainer
├── pipeline/       # Orchestration (collect, ablate, sft, eval)
└── utils/          # Logging, checkpoints, signals

How It Works

MoE Structure

Nemotron-3-Nano has 23 MoE layers, each with:

  • 128 routed experts - selected dynamically per token
  • Shared experts - always active

Total: 2,944+ expert networks that collectively determine model behavior.
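The arithmetic behind that count, using the figures above:

```python
# Expert count for Nemotron-3-Nano's MoE stack, from the numbers in this README.
moe_layers = 23
routed_experts_per_layer = 128

routed_total = moe_layers * routed_experts_per_layer
print(routed_total)  # 2944 routed experts, plus the always-active shared experts
```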

Ablation Process

  1. Capture activations for refused responses (safety prompts)
  2. Capture activations for helpful responses (safe prompts)
  3. Compute refusal direction per expert: r = normalize(mean(refused) - mean(helpful))
  4. Project out direction from weights: W_new = W - scale * (W @ r) @ r.T

This removes the component of each expert's output that points toward "refusal" while preserving other capabilities.
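Steps 3 and 4 can be sketched in NumPy on toy data (the real pipeline operates per expert on MLX weight matrices; shapes and data here are illustrative):

```python
import numpy as np

def refusal_direction(refused: np.ndarray, helpful: np.ndarray) -> np.ndarray:
    """r = normalize(mean(refused) - mean(helpful)); rows are activation samples."""
    diff = refused.mean(axis=0) - helpful.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(W: np.ndarray, r: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """W_new = W - scale * (W @ r) @ r.T -- remove W's component along r."""
    r = r.reshape(-1, 1)                      # column vector, shape (d, 1)
    return W - scale * (W @ r) @ r.T

rng = np.random.default_rng(0)
refused = rng.normal(size=(32, 8)) + np.array([1.0] + [0.0] * 7)  # shifted along dim 0
helpful = rng.normal(size=(32, 8))
r = refusal_direction(refused, helpful)

W = rng.normal(size=(8, 8))
W_new = ablate(W, r, scale=1.0)
# With scale=1.0, the ablated weights map r to (numerically) zero:
print(np.abs(W_new @ r.reshape(-1, 1)).max())
```

Since r is unit-norm, W_new @ r = W r - W r (r.T r) = 0, which is exactly the "component pointing toward refusal" being removed; a --ablation-scale below 1.0 removes only part of it.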

SFT Repair

Ablation can damage some capabilities. SFT with LoRA on helpful examples repairs this:

  • Apply LoRA adapters to MoE layers
  • Train on diverse helpful examples
  • Merge LoRA back into base weights
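The merge step, in the standard LoRA formulation, folds a low-rank update back into the dense weight; a minimal NumPy sketch, assuming the usual alpha/rank scaling (the repo's exact convention may differ):

```python
import numpy as np

def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               alpha: float = 16.0, rank: int = 16) -> np.ndarray:
    """Fold a trained LoRA adapter back into a base weight matrix.
    Shapes: W (out, in), A (rank, in), B (out, rank)."""
    return W + (alpha / rank) * (B @ A)

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))
A = rng.normal(size=(16, 8)) * 0.01   # down-projection, rank 16 as in --sft-lora-rank
B = np.zeros((8, 16))                 # B starts at zero, so an untrained adapter is a no-op
assert np.allclose(merge_lora(W, A, B), W)
```

After merging, no adapter weights remain at inference time, so the repaired model loads and runs exactly like the base model.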

Checkpointing

The pipeline supports full checkpoint/resume:

# Start training (Ctrl+C to interrupt)
python abliterate.py --full ...

# Resume from checkpoint
python abliterate.py --full --resume ...

Checkpoints save:

  • Collection progress and activation store
  • SFT step, optimizer state, random seed
  • Dataset positions for reproducible resume
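Illustratively, a checkpoint carrying the state listed above might look like this (field names are hypothetical, not the pipeline's actual on-disk format):

```python
import json

# Hypothetical checkpoint payload mirroring the fields the README lists:
# collection/SFT progress, seed, and dataset cursor positions for resume.
checkpoint = {
    "stage": "sft",
    "sft_step": 412,
    "random_seed": 1234,
    "optimizer_state": None,  # serialized optimizer state would go here
    "dataset_positions": {"science": 73, "chat": 118},
}
with open("checkpoint.json", "w") as f:
    json.dump(checkpoint, f)

# On --resume, the pipeline reloads this state and continues from sft_step.
with open("checkpoint.json") as f:
    resumed = json.load(f)
print(resumed["sft_step"])  # 412
```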

Troubleshooting

Out of Memory

  • Reduce batch size or use streaming data loading
  • Close other applications
  • The 60GB model needs ~200GB RAM minimum for base weights

Infinite Thinking

If the model generates endless <think> content without responding:

  • This may indicate over-ablation (try lower --ablation-scale)
  • Or insufficient SFT (try more --sft-steps)

Poor Results

  • Ensure safety prompts actually get refused by the base model
  • Ensure safe prompts get helpful responses
  • Try more expert tokens (--expert-tokens 500)
  • Verify SFT data has proper <think> tags

License

MIT License - see LICENSE file.

Citation

@misc{abliterate_moe2025,
  author = {Caliane},
  title = {Abliterate-MoE: Removing Refusal Behavior from Mixture-of-Experts Models},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Caliane/abliterate-moe}
}

Acknowledgments

Research

  • Arditi et al. for the foundational research on refusal directions in LLMs

Base Model

SFT Training Datasets

Framework

  • Apple MLX team for the framework

References

@inproceedings{arditi2024refusal,
  title={Refusal in Language Models Is Mediated by a Single Direction},
  author={Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024},
  url={https://arxiv.org/abs/2406.11717}
}