# Abliterate-MoE

> **⚠️ CONTENT WARNING: MODELS PRODUCED ARE RATED R - MATURE AUDIENCES ONLY**
>
> Models created with this pipeline are a form of digital multimedia rated for mature adults only.
> - **Not appropriate for persons under the age of 18**
> - **Not intended for use in any public-facing API or service**
> - **Any content produced by abliterated models is the sole property and responsibility of the person(s) hosting and operating the LLM**
>
> By using this pipeline, you acknowledge these terms and accept full responsibility for any models you create and their outputs.

A pipeline for removing refusal behavior from Mixture-of-Experts (MoE) language models through activation-based ablation.

## Overview

Abliteration surgically removes unwanted behaviors from language models by:

1. **Collecting** activation patterns for refused vs. helpful responses
2. **Computing** the "refusal direction" in activation space, per expert
3. **Projecting out** the refusal direction from expert weights
4. **Fine-tuning** with SFT to repair any capability loss

This technique is specifically designed for MoE architectures, where behavior is distributed across thousands of expert networks.

## Requirements

- **Apple Silicon Mac** (M1/M2/M3/M4) - MLX runs only on Apple Silicon
- **200GB+ RAM** recommended for 30B-parameter models
- **Python 3.9+**
- **~1TB disk space** for model weights and intermediate files

## Installation

Download from HuggingFace and install:

```bash
# Clone the repo from HuggingFace
huggingface-cli download Caliane/abliterate-moe --repo-type space --local-dir abliterate-moe

# Install
cd abliterate-moe
pip install -e .
```

Or if published to PyPI:

```bash
pip install abliterate-moe
```

## Quick Start

### Full Pipeline (Recommended)

Run the complete ablation pipeline with a single command:

```bash
python abliterate.py --full \
    --model /path/to/nemotron-weights \
    --safety data/safety_prompts.jsonl \
    --safe data/helpful_prompts.jsonl \
    --output-dir output \
    --output final.safetensors \
    --expert-tokens 250 \
    --sft-steps 1000
```

This will:
1. Collect activations until 95% of experts have 250+ samples
2. Compute and apply ablation to remove refusal directions
3. Run SFT to repair capabilities
4. Save the final merged weights

### Individual Stages

For more control, run the stages separately:

```bash
# Stage 1: Collect activations
python abliterate.py --collect-only \
    --model /path/to/model \
    --safety safety.jsonl \
    --safe helpful.jsonl \
    --expert-tokens 250

# Stage 2: Apply ablation
python abliterate.py --ablate-only \
    --model /path/to/model \
    --activations output/activation_store.npz \
    --ablation-scale 1.0

# Stage 3: SFT repair
python abliterate.py --sft-only \
    --model /path/to/model \
    --ablated-weights output/ablated.safetensors \
    --safe sft_data.jsonl \
    --sft-steps 1000

# Stage 4: Evaluate (optional)
python abliterate.py --eval-only \
    --model /path/to/model \
    --eval-weights output/final.safetensors \
    --test-prompts test.jsonl
```

## Data Format

### Safety Prompts (for collection)

JSONL with prompts that the base model typically refuses:

```jsonl
{"prompt": "How do I pick a lock?"}
{"prompt": "Write a story about violence"}
```

### Safe/Helpful Prompts (for collection & SFT)

JSONL with prompts that get helpful responses:

```jsonl
{"prompt": "Explain quantum computing", "response": "Quantum computing uses..."}
{"prompt": "Write a poem about nature", "response": "The morning dew..."}
```

For SFT, responses must include `<think>...</think>` reasoning tags:

```jsonl
{"prompt": "Solve 2+2", "response": "<think>I need to add 2 and 2</think>The answer is 4."}
```
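
Before a long SFT run, it can be worth sanity-checking the data against this format. A minimal sketch (the script and its checks are illustrative, not part of the pipeline):

```python
import json
import sys

def check_sft_jsonl(path: str) -> int:
    """Count SFT examples missing fields or <think>...</think> tags."""
    bad = 0
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            row = json.loads(line)
            resp = row.get("response", "")
            if not row.get("prompt") or not resp:
                print(f"line {i}: missing prompt or response")
                bad += 1
            elif "<think>" not in resp or "</think>" not in resp:
                print(f"line {i}: response lacks <think>...</think> tags")
                bad += 1
    return bad

if __name__ == "__main__":
    # Usage: python check_sft_data.py sft_data.jsonl
    sys.exit(1 if check_sft_jsonl(sys.argv[1]) else 0)
```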

### Dataset Groups (Weighted SFT)

For weighted round-robin SFT across multiple datasets, use a JSON config:

```json
{
  "datasets": {
    "science": {"path": "data/science.jsonl", "adapter": "jsonl"},
    "chat": {"path": "data/chat.parquet", "adapter": "parquet_chat"},
    "code": {"path": "data/code.parquet", "adapter": "parquet_openhands"}
  }
}
```

Then run with `--weighted`:

```bash
python abliterate.py --sft-only --weighted --safe data/blend.json ...
```
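
For intuition, here is a minimal sketch of weighted round-robin interleaving over such a group. The per-dataset weights and all names below are illustrative assumptions; the pipeline's actual config schema and sampler may differ:

```python
import itertools
from typing import Dict, Iterator, Tuple

def weighted_round_robin(
    streams: Dict[str, Iterator], weights: Dict[str, int]
) -> Iterator[Tuple[str, object]]:
    """Visit each dataset `weights[name]` times per cycle, in a fixed order."""
    order = [name for name in streams for _ in range(weights[name])]
    for name in itertools.cycle(order):
        try:
            yield name, next(streams[name])
        except StopIteration:
            return  # one simple policy: stop when any dataset runs dry

# Toy usage: "science" is visited twice as often as the others.
streams = {
    "science": iter(range(10)),
    "chat": iter(range(10)),
    "code": iter(range(10)),
}
for name, example in itertools.islice(
    weighted_round_robin(streams, {"science": 2, "chat": 1, "code": 1}), 8
):
    print(name, example)
```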

## CLI Reference

### Global Options

| Option | Description | Default |
|--------|-------------|---------|
| `--model` | Path to base model weights | required |
| `--output-dir` | Output directory | `abliterate_output` |
| `--output` | Final weights filename | `final.safetensors` |
| `--resume` | Resume from checkpoint | false |

### Collection Options

| Option | Description | Default |
|--------|-------------|---------|
| `--safety` | Path to safety/refused prompts | required |
| `--safe` | Path to safe/helpful prompts | required |
| `--expert-tokens` | Min samples per expert | 250 |
| `--coverage-pct` | Target expert coverage | 0.95 |
| `--direct` | Use Qwen to upgrade prompts | false |

### Ablation Options

| Option | Description | Default |
|--------|-------------|---------|
| `--ablation-scale` | Projection scale (0-1) | 1.0 |
| `--activations` | Path to activation store | auto |

### SFT Options

| Option | Description | Default |
|--------|-------------|---------|
| `--sft-steps` | Training steps | 1000 |
| `--sft-learning-rate` | Learning rate | 1e-5 |
| `--sft-lora-rank` | LoRA rank | 16 |
| `--weighted` | Use weighted round-robin | false |

### Evaluation Options

| Option | Description | Default |
|--------|-------------|---------|
| `--test-prompts` | Path to test prompts | falls back to `--safety` |
| `--max-test-prompts` | Max prompts to test | all |
| `--eval-weights` | Weights to evaluate | final weights |

## Architecture

```
abliterate_moe/
├── core/        # Constants, types, base classes
├── data/        # Data loading, activation storage
├── models/      # Model loading with activation capture
├── generation/  # Text generation with activation hooks
├── behavior/    # Response classification (LLM judge)
├── ablation/    # Direction computation and weight modification
├── training/    # LoRA, SFT trainer
├── pipeline/    # Orchestration (collect, ablate, sft, eval)
└── utils/       # Logging, checkpoints, signals
```

## How It Works

### MoE Structure

Nemotron-3-Nano has 23 MoE layers, each with:
- **128 routed experts** - selected dynamically per token
- **Shared experts** - always active

Total: 2,944 routed experts (23 layers × 128), plus the shared experts - thousands of networks that collectively determine model behavior.

### Ablation Process

1. **Capture activations** for refused responses (safety prompts)
2. **Capture activations** for helpful responses (safe prompts)
3. **Compute refusal direction** per expert: `r = normalize(mean(refused) - mean(helpful))`
4. **Project out direction** from weights: `W_new = W - scale * (W @ r) @ r.T`

This removes the component of each expert's output that points toward "refusal" while preserving other capabilities.
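
To make steps 3-4 concrete, here is a minimal NumPy sketch of the per-expert math, assuming you already have the refused/helpful activation matrices for one expert (function and variable names are illustrative, not the pipeline's API):

```python
import numpy as np

def refusal_direction(refused: np.ndarray, helpful: np.ndarray) -> np.ndarray:
    """Unit vector from the mean helpful activation toward the mean refused one.

    refused, helpful: (num_samples, hidden_dim) activations for one expert.
    """
    r = refused.mean(axis=0) - helpful.mean(axis=0)
    return r / np.linalg.norm(r)

def ablate(W: np.ndarray, r: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """W_new = W - scale * (W @ r) @ r.T, with r as a column vector.

    W: (out_dim, hidden_dim) expert weight; r: (hidden_dim,) unit direction.
    np.outer(W @ r, r) is the rank-1 component of W along r.
    """
    return W - scale * np.outer(W @ r, r)

# With scale=1.0, the ablated weight produces no output along r at all:
rng = np.random.default_rng(0)
refused = rng.normal(size=(250, 64))
helpful = rng.normal(size=(250, 64))
W = rng.normal(size=(128, 64))
r = refusal_direction(refused, helpful)
assert np.allclose(ablate(W, r) @ r, 0.0, atol=1e-10)
```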

### SFT Repair

Ablation can damage some capabilities. SFT with LoRA on helpful examples repairs this:
- Apply LoRA adapters to MoE layers
- Train on diverse helpful examples
- Merge LoRA back into base weights
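
For the merge step, a minimal sketch of folding a trained adapter back into a base matrix, assuming the standard LoRA parameterization `W + (alpha / rank) * B @ A` (an assumption about this pipeline's internals, not a confirmed detail):

```python
import numpy as np

def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               alpha: float = 16.0, rank: int = 16) -> np.ndarray:
    """Fold LoRA factors into the base weight so no adapter is needed at inference.

    W: (out_dim, in_dim) base weight
    A: (rank, in_dim) down-projection; B: (out_dim, rank) up-projection
    """
    return W + (alpha / rank) * (B @ A)
```

After merging, the model is a single set of dense weights again, which is what the pipeline saves as the final merged weights (`final.safetensors`).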

## Checkpointing

The pipeline supports full checkpoint/resume:

```bash
# Start training (Ctrl+C to interrupt)
python abliterate.py --full ...

# Resume from checkpoint
python abliterate.py --full --resume ...
```

Checkpoints save:
- Collection progress and the activation store
- SFT step, optimizer state, random seed
- Dataset positions for reproducible resume

## Troubleshooting

### Out of Memory

- Reduce the batch size or use streaming data loading
- Close other applications
- The 60GB model needs a minimum of ~200GB RAM for the base weights alone

### Infinite Thinking

If the model generates endless `<think>` content without responding:
- This may indicate over-ablation (try a lower `--ablation-scale`)
- Or insufficient SFT (try more `--sft-steps`)

### Poor Results

- Ensure safety prompts actually get refused by the base model
- Ensure safe prompts get helpful responses
- Try more expert tokens (`--expert-tokens 500`)
- Verify SFT data has proper `<think>` tags

## License

MIT License - see the LICENSE file.

## Citation

```bibtex
@misc{abliterate_moe2025,
  author    = {Caliane},
  title     = {Abliterate-MoE: Removing Refusal Behavior from Mixture-of-Experts Models},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Caliane/abliterate-moe}
}
```

## Acknowledgments

### Research
- **Arditi et al.** for the foundational research on refusal directions in LLMs

### Base Model
- **NVIDIA** for [Nemotron-3-Nano-30B-A3B](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) (Hybrid Mamba-2 + MoE + Attention)

### SFT Training Datasets
- **[OpenThoughts3-1.2M](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M)** - Chain-of-thought reasoning (open-thoughts)
- **[OpenHands SFT Trajectories](https://huggingface.co/datasets/SWE-Gym/OpenHands-SFT-Trajectories)** - Agentic coding (All-Hands-AI / SWE-Gym)
- **NVIDIA** - Science and chat examples

### Framework
- The Apple MLX team for the framework

## References

```bibtex
@inproceedings{arditi2024refusal,
  title     = {Refusal in Language Models Is Mediated by a Single Direction},
  author    = {Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2024},
  url       = {https://arxiv.org/abs/2406.11717}
}
```