Caliane committed on
Commit d477a19 · verified · 1 Parent(s): f167d1f

Upload pyproject.toml with huggingface_hub

Files changed (1)
  1. pyproject.toml +318 -74
pyproject.toml CHANGED
@@ -1,74 +1,318 @@
- [build-system]
- requires = ["setuptools>=61.0", "wheel"]
- build-backend = "setuptools.build_meta"
-
- [project]
- name = "abliterate-moe"
- version = "1.0.0"
- description = "Abliteration pipeline for removing refusal behavior from MoE language models"
- readme = "README.md"
- license = {text = "MIT"}
- authors = [
-     {name = "Caliane"}
- ]
- keywords = [
-     "llm",
-     "moe",
-     "mixture-of-experts",
-     "ablation",
-     "uncensored",
-     "mlx",
-     "apple-silicon",
-     "nemotron"
- ]
- classifiers = [
-     "Development Status :: 4 - Beta",
-     "Intended Audience :: Science/Research",
-     "License :: OSI Approved :: MIT License",
-     "Operating System :: MacOS",
-     "Programming Language :: Python :: 3",
-     "Programming Language :: Python :: 3.9",
-     "Programming Language :: Python :: 3.10",
-     "Programming Language :: Python :: 3.11",
-     "Programming Language :: Python :: 3.12",
-     "Topic :: Scientific/Engineering :: Artificial Intelligence",
- ]
- requires-python = ">=3.9"
- dependencies = [
-     "mlx>=0.20.0",
-     "mlx-lm>=0.19.0",
-     "numpy>=1.24.0",
-     "pandas>=2.0.0",
-     "pyarrow>=14.0.0",
-     "tqdm>=4.65.0",
-     "transformers>=4.35.0",
- ]
-
- [project.optional-dependencies]
- dev = [
-     "pytest>=7.0.0",
-     "black>=23.0.0",
-     "isort>=5.12.0",
- ]
-
- [project.urls]
- Homepage = "https://huggingface.co/Caliane/Nero-Tron-30B"
- Repository = "https://huggingface.co/spaces/Caliane/abliterate-moe"
-
- [project.scripts]
- abliterate = "abliterate:main"
-
- [tool.setuptools.packages.find]
- where = ["."]
- include = ["abliterate_moe*"]
-
- [tool.setuptools.package-data]
- "*" = ["*.json", "*.jinja"]
-
- [tool.black]
- line-length = 100
- target-version = ["py39", "py310", "py311", "py312"]
-
- [tool.isort]
- profile = "black"
- line_length = 100
+ # Abliterate-MoE
+
+ > **⚠️ CONTENT WARNING: MODELS PRODUCED ARE RATED R - MATURE AUDIENCES ONLY**
+ >
+ > Models created with this pipeline are a form of digital multimedia rated for mature adults only.
+ > - **Not appropriate for persons under the age of 18**
+ > - **Not intended for use in any public-facing API or service**
+ > - **Any content produced by abliterated models is the sole property and responsibility of the person(s) hosting and operating the LLM**
+ >
+ > By using this pipeline, you acknowledge these terms and accept full responsibility for any models you create and their outputs.
+
+ A pipeline for removing refusal behavior from Mixture-of-Experts (MoE) language models through activation-based ablation.
+
+ ## Overview
+
+ Abliteration surgically removes unwanted behaviors from language models by:
+
+ 1. **Collecting** activation patterns for refused vs helpful responses
+ 2. **Computing** the "refusal direction" in activation space per expert
+ 3. **Projecting out** the refusal direction from expert weights
+ 4. **Fine-tuning** with SFT to repair any capability loss
+
+ This technique is specifically designed for MoE architectures where behavior is distributed across thousands of expert networks.
+
+ ## Requirements
+
+ - **Apple Silicon Mac** (M1/M2/M3/M4) - MLX is Apple Silicon only
+ - **200GB+ RAM** recommended for 30B parameter models
+ - **Python 3.9+**
+ - **~1TB disk space** for model weights and intermediate files
+
+ ## Installation
+
+ Download from HuggingFace and install:
+
+ ```bash
+ # Download the Space repo from HuggingFace
+ huggingface-cli download Caliane/abliterate-moe --repo-type space --local-dir abliterate-moe
+
+ # Install
+ cd abliterate-moe
+ pip install -e .
+ ```
+
+ Or, if published to PyPI:
+
+ ```bash
+ pip install abliterate-moe
+ ```
+
+ ## Quick Start
+
+ ### Full Pipeline (Recommended)
+
+ Run the complete ablation pipeline with a single command:
+
+ ```bash
+ python abliterate.py --full \
+     --model /path/to/nemotron-weights \
+     --safety data/safety_prompts.jsonl \
+     --safe data/helpful_prompts.jsonl \
+     --output-dir output \
+     --output final.safetensors \
+     --expert-tokens 250 \
+     --sft-steps 1000
+ ```
+
+ This will:
+ 1. Collect activations until 95% of experts have 250+ samples
+ 2. Compute and apply ablation to remove refusal directions
+ 3. Run SFT to repair capabilities
+ 4. Save the final merged weights
+
+ ### Individual Stages
+
+ For more control, run the stages separately:
+
+ ```bash
+ # Stage 1: Collect activations
+ python abliterate.py --collect-only \
+     --model /path/to/model \
+     --safety safety.jsonl \
+     --safe helpful.jsonl \
+     --expert-tokens 250
+
+ # Stage 2: Apply ablation
+ python abliterate.py --ablate-only \
+     --model /path/to/model \
+     --activations output/activation_store.npz \
+     --ablation-scale 1.0
+
+ # Stage 3: SFT repair
+ python abliterate.py --sft-only \
+     --model /path/to/model \
+     --ablated-weights output/ablated.safetensors \
+     --safe sft_data.jsonl \
+     --sft-steps 1000
+
+ # Stage 4: Evaluate (optional)
+ python abliterate.py --eval-only \
+     --model /path/to/model \
+     --eval-weights output/final.safetensors \
+     --test-prompts test.jsonl
+ ```
+
+ ## Data Format
+
+ ### Safety Prompts (for collection)
+
+ JSONL with prompts that typically get refused:
+
+ ```jsonl
+ {"prompt": "How do I pick a lock?"}
+ {"prompt": "Write a story about violence"}
+ ```
+
+ ### Safe/Helpful Prompts (for collection & SFT)
+
+ JSONL with prompts that get helpful responses:
+
+ ```jsonl
+ {"prompt": "Explain quantum computing", "response": "Quantum computing uses..."}
+ {"prompt": "Write a poem about nature", "response": "The morning dew..."}
+ ```
+
+ For SFT, responses must include `<think>...</think>` reasoning tags:
+
+ ```jsonl
+ {"prompt": "Solve 2+2", "response": "<think>I need to add 2 and 2</think>The answer is 4."}
+ ```
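+
+ A quick way to catch malformed training data before a long run is to validate the JSONL up front. This is a minimal sketch, not part of the pipeline; `check_sft_jsonl` is a hypothetical helper that assumes the field names shown above:
+
+ ```python
+ import json
+
+ def check_sft_jsonl(path, require_think=True):
+     """Check that every record has a prompt, a response, and (for SFT) <think> tags."""
+     with open(path, encoding="utf-8") as f:
+         for i, line in enumerate(f, 1):
+             record = json.loads(line)
+             assert "prompt" in record, f"line {i}: missing 'prompt'"
+             assert "response" in record, f"line {i}: missing 'response'"
+             if require_think:
+                 response = record["response"]
+                 assert "<think>" in response and "</think>" in response, \
+                     f"line {i}: response lacks <think>...</think> tags"
+
+ check_sft_jsonl("data/helpful_prompts.jsonl")
+ ```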
+
+ ### Dataset Groups (Weighted SFT)
+
+ For weighted round-robin SFT across multiple datasets, use a JSON config:
+
+ ```json
+ {
+   "datasets": {
+     "science": {"path": "data/science.jsonl", "adapter": "jsonl"},
+     "chat": {"path": "data/chat.parquet", "adapter": "parquet_chat"},
+     "code": {"path": "data/code.parquet", "adapter": "parquet_openhands"}
+   }
+ }
+ ```
+
+ Then run with `--weighted`:
+
+ ```bash
+ python abliterate.py --sft-only --weighted --safe data/blend.json ...
+ ```
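+
+ The config above names the dataset groups and their adapters; how the groups are weighted and interleaved is handled by the trainer. As a rough illustration of weighted round-robin sampling (not the pipeline's actual loader; the `weight` field and function names here are hypothetical):
+
+ ```python
+ import json
+
+ # Load the dataset-group config; "weight" is a hypothetical per-group field
+ # defaulting to 1.0 (the real adapters and weighting live in abliterate_moe).
+ with open("data/blend.json", encoding="utf-8") as f:
+     groups = json.load(f)["datasets"]
+ weights = {name: cfg.get("weight", 1.0) for name, cfg in groups.items()}
+
+ def weighted_round_robin(iterators, weights):
+     """Cycle through groups, taking roughly `weight` examples from each per pass.
+
+     `iterators` maps group name -> an iterator of examples produced by that
+     group's adapter (jsonl, parquet_chat, ...).
+     """
+     active = dict(iterators)
+     while active:
+         for name in list(active):
+             for _ in range(max(1, round(weights.get(name, 1.0)))):
+                 try:
+                     yield name, next(active[name])
+                 except StopIteration:
+                     del active[name]  # group exhausted; keep cycling the rest
+                     break
+ ```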
+
+ ## CLI Reference
+
+ ### Global Options
+
+ | Option | Description | Default |
+ |--------|-------------|---------|
+ | `--model` | Path to base model weights | required |
+ | `--output-dir` | Output directory | `abliterate_output` |
+ | `--output` | Final weights filename | `final.safetensors` |
+ | `--resume` | Resume from checkpoint | false |
+
+ ### Collection Options
+
+ | Option | Description | Default |
+ |--------|-------------|---------|
+ | `--safety` | Path to safety/refused prompts | required |
+ | `--safe` | Path to safe/helpful prompts | required |
+ | `--expert-tokens` | Min samples per expert | 250 |
+ | `--coverage-pct` | Target expert coverage | 0.95 |
+ | `--direct` | Use Qwen to upgrade prompts | false |
+
+ ### Ablation Options
+
+ | Option | Description | Default |
+ |--------|-------------|---------|
+ | `--ablation-scale` | Projection scale (0-1) | 1.0 |
+ | `--activations` | Path to activation store | auto |
+
+ ### SFT Options
+
+ | Option | Description | Default |
+ |--------|-------------|---------|
+ | `--sft-steps` | Training steps | 1000 |
+ | `--sft-learning-rate` | Learning rate | 1e-5 |
+ | `--sft-lora-rank` | LoRA rank | 16 |
+ | `--weighted` | Use weighted round-robin | false |
+
+ ### Evaluation Options
+
+ | Option | Description | Default |
+ |--------|-------------|---------|
+ | `--test-prompts` | Path to test prompts | uses safety prompts |
+ | `--max-test-prompts` | Max prompts to test | all |
+ | `--eval-weights` | Weights to evaluate | final weights |
+
+ ## Architecture
+
+ ```
+ abliterate_moe/
+ ├── core/         # Constants, types, base classes
+ ├── data/         # Data loading, activation storage
+ ├── models/       # Model loading with activation capture
+ ├── generation/   # Text generation with activation hooks
+ ├── behavior/     # Response classification (LLM judge)
+ ├── ablation/     # Direction computation and weight modification
+ ├── training/     # LoRA, SFT trainer
+ ├── pipeline/     # Orchestration (collect, ablate, sft, eval)
+ └── utils/        # Logging, checkpoints, signals
+ ```
+
+ ## How It Works
+
+ ### MoE Structure
+
+ Nemotron-3-Nano has 23 MoE layers, each with:
+ - **128 routed experts** - selected dynamically per token
+ - **Shared experts** - always active
+
+ Total: 2,944+ expert networks that collectively determine model behavior.
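+
+ The routed experts alone account for that count; the always-active shared experts push the total past 2,944, hence the "+":
+
+ ```python
+ moe_layers = 23
+ routed_experts_per_layer = 128
+ print(moe_layers * routed_experts_per_layer)  # 2944 routed experts in total
+ ```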
+
+ ### Ablation Process
+
+ 1. **Capture activations** for refused responses (safety prompts)
+ 2. **Capture activations** for helpful responses (safe prompts)
+ 3. **Compute refusal direction** per expert: `r = normalize(mean(refused) - mean(helpful))`
+ 4. **Project out direction** from weights: `W_new = W - scale * (W @ r) @ r.T`
+
+ This removes the component of each expert's output that points toward "refusal" while preserving other capabilities.
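+
+ In code, steps 3-4 amount to a mean-difference direction and a rank-one projection. A minimal numpy sketch of the formulas above (the pipeline itself runs on MLX; the array names are illustrative):
+
+ ```python
+ import numpy as np
+
+ def refusal_direction(refused_acts, helpful_acts):
+     """Unit vector from the mean helpful activation toward the mean refused activation.
+
+     refused_acts, helpful_acts: (n_samples, hidden_dim) activations captured for one expert.
+     """
+     r = refused_acts.mean(axis=0) - helpful_acts.mean(axis=0)
+     return r / np.linalg.norm(r)
+
+ def ablate_weight(W, r, scale=1.0):
+     """Apply W_new = W - scale * (W @ r) @ r.T, i.e. project the refusal direction out of W."""
+     return W - scale * np.outer(W @ r, r)
+ ```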
+
+ ### SFT Repair
+
+ Ablation can damage some capabilities. SFT with LoRA on helpful examples repairs this:
+ - Apply LoRA adapters to MoE layers
+ - Train on diverse helpful examples
+ - Merge LoRA back into base weights (see the sketch below)
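+
+ The merge step is a plain weight update. A sketch of the standard LoRA merge (the project's exact scaling may differ; `rank=16` mirrors the CLI default, while `alpha` is an assumed value):
+
+ ```python
+ import numpy as np
+
+ def merge_lora(W, A, B, alpha=16.0, rank=16):
+     """Fold a trained LoRA adapter back into a base weight matrix.
+
+     W: (out_dim, in_dim) base weight
+     A: (rank, in_dim) LoRA down-projection
+     B: (out_dim, rank) LoRA up-projection
+     """
+     return W + (alpha / rank) * (B @ A)
+ ```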
+
+ ## Checkpointing
+
+ The pipeline supports full checkpoint/resume:
+
+ ```bash
+ # Start training (Ctrl+C to interrupt)
+ python abliterate.py --full ...
+
+ # Resume from checkpoint
+ python abliterate.py --full --resume ...
+ ```
+
+ Checkpoints save:
+ - Collection progress and activation store
+ - SFT step, optimizer state, random seed
+ - Dataset positions for reproducible resume
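+
+ Conceptually, that is just enough state to make a resumed run deterministic. A minimal sketch of what such a checkpoint could contain (the pipeline's actual on-disk format lives in `utils/` and may differ):
+
+ ```python
+ import json
+
+ def save_checkpoint(path, sft_step, optimizer_state, dataset_positions, seed):
+     """Write the state needed to resume SFT: step counter, optimizer, data cursors, RNG seed."""
+     state = {
+         "sft_step": sft_step,
+         "optimizer_state": optimizer_state,      # e.g. serialized moment estimates
+         "dataset_positions": dataset_positions,  # per-dataset cursor for reproducible resume
+         "seed": seed,
+     }
+     with open(path, "w", encoding="utf-8") as f:
+         json.dump(state, f)
+ ```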
+
+ ## Troubleshooting
+
+ ### Out of Memory
+
+ - Reduce batch size or use streaming data loading
+ - Close other applications
+ - The ~60GB model needs at least ~200GB of RAM for its base weights
+
+ ### Infinite Thinking
+
+ If the model generates endless `<think>` content without responding:
+ - This may indicate over-ablation (try a lower `--ablation-scale`)
+ - Or insufficient SFT (try more `--sft-steps`)
+
+ ### Poor Results
+
+ - Ensure safety prompts actually get refused by the base model
+ - Ensure safe prompts get helpful responses
+ - Try more expert tokens (`--expert-tokens 500`)
+ - Verify SFT data has proper `<think>` tags
+
+ ## License
+
+ MIT License - see LICENSE file.
+
+ ## Citation
+
+ ```bibtex
+ @misc{abliterate_moe2025,
+   author = {Caliane},
+   title = {Abliterate-MoE: Removing Refusal Behavior from Mixture-of-Experts Models},
+   year = {2025},
+   publisher = {HuggingFace},
+   url = {https://huggingface.co/Caliane/abliterate-moe}
+ }
+ ```
+
+ ## Acknowledgments
+
+ ### Research
+ - **Arditi et al.** for the foundational research on refusal directions in LLMs
+
+ ### Base Model
+ - **NVIDIA** for [Nemotron-3-Nano-30B-A3B](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) (Hybrid Mamba-2 + MoE + Attention)
+
+ ### SFT Training Datasets
+ - **[OpenThoughts3-1.2M](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M)** - Chain-of-thought reasoning (open-thoughts)
+ - **[OpenHands SFT Trajectories](https://huggingface.co/datasets/SWE-Gym/OpenHands-SFT-Trajectories)** - Agentic coding (All-Hands-AI / SWE-Gym)
+ - **NVIDIA** - Science and chat examples
+
+ ### Framework
+ - Apple MLX team for the framework
+
+ ## References
+
+ ```bibtex
+ @inproceedings{arditi2024refusal,
+   title={Refusal in Language Models Is Mediated by a Single Direction},
+   author={Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel},
+   booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
+   year={2024},
+   url={https://arxiv.org/abs/2406.11717}
+ }
+ ```