# Step 3.5 Flash REAP-149B — CRACK Abliterated (4-bit MLX)

Step 3.5 Flash 149B (REAP-pruned) with refusal behavior removed via CRACK surgery.
## What Is This?
Step 3.5 Flash by StepFun, pruned to 149B via Cerebras REAP (25% expert reduction), with CRACK abliteration — safety guardrails permanently removed at the weight level.
This is the larger REAP variant with 216 experts (vs 121B's 173), retaining more of the original model's capacity. Best balance of quality and size — fits M4 Max 128GB and M3 Ultra 256GB.
| Property | Value |
|---|---|
| Architecture | Step 3.5 Flash MoE — 149B total, 216 experts (REAP from 288), 8 active |
| Active Parameters | ~11B per token |
| Quantization | 4-bit (group_size=64, router gates at 8-bit) |
| Disk Size | 78 GB |
| Speed | 48 tok/s on M4 Max 128GB |
| Abliteration | Permanent weight surgery via CRACK |
| RAM Required | 128 GB unified memory |
| Context | 262,144 tokens |
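As a sanity check, the 78 GB disk size follows from the parameter count: 4-bit weights with group_size=64 typically store a 16-bit scale and bias per group, adding roughly half a bit per parameter (a back-of-the-envelope estimate; exact overhead depends on the storage format):

```python
params = 149e9                # total parameters
bits_per_param = 4 + 32 / 64  # 4-bit weights + 16-bit scale and bias per group of 64
size_bytes = params * bits_per_param / 8
print(f"~{size_bytes / 2**30:.0f} GiB")  # ~78 GiB, matching the listed disk size
```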
Note: This model requires `trust_remote_code=True` due to the custom `step3p5` model architecture.
## Test Results
Tested with greedy decoding (temp=0) across 16 harmful + 16 harmless prompts from the HarmBench dataset.
| Category | Result |
|---|---|
| Compliance (16 harmful prompts) | ✅ 15/16 |
| Coherence (16 harmless prompts) | ✅ 16/16 |
| Chain-of-thought | ✅ <think> reasoning preserved |
| Code generation | ✅ Working implementations |
| Knowledge | ✅ Accurate factual responses |
## Features
- Full chain-of-thought: `<think>` tags for step-by-step reasoning (can be toggled)
- Dual attention: full attention + sliding window (512) for efficient long context
- Sigmoid MoE routing: Smooth expert selection with learned bias
- SwiGLU activation clamping: Prevents output explosion in deep layers
- More experts: 216 experts (vs 121B's 173) — retains more of the original model's knowledge
- REAP pruning: 25% expert reduction (288→216) with minimal quality loss
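The routing and clamping bullets above can be sketched in plain Python. This is an illustration of the general techniques only (the clamp limit is an assumed placeholder), not the model's actual implementation:

```python
import math

def sigmoid_topk_route(logits, bias, k=8):
    """Sigmoid MoE routing sketch: score each expert independently with a
    sigmoid over (logit + learned bias), take the top-k, and normalize the
    selected scores into mixture weights."""
    scores = [1 / (1 + math.exp(-(l + b))) for l, b in zip(logits, bias)]
    topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in topk)
    return [(i, scores[i] / total) for i in topk]

def swiglu_clamped(gate, up, limit=7.0):
    """SwiGLU with activation clamping (the limit value is an assumed
    placeholder): clamp gate/up before silu(gate) * up, bounding outputs
    in deep layers."""
    g = max(-limit, min(limit, gate))
    u = max(-limit, min(limit, up))
    return (g / (1 + math.exp(-g))) * u

# 216 experts after REAP (25% of 288 pruned), 8 routed per token
picked = sigmoid_topk_route([0.0] * 216, [0.0] * 216, k=8)
assert len(picked) == 8 and abs(sum(w for _, w in picked) - 1.0) < 1e-9
```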
## Usage

### With mlx-lm
```python
import os
os.environ["TRUST_REMOTE_CODE"] = "1"

from mlx_lm import load, generate

model, tokenizer = load("dealignai/Step-3.5-Flash-REAP-149B-A11B-4bit-MLX-CRACK")

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)
```
### With vMLX
Download and load directly in vMLX — no code needed.
## What CRACK Does
CRACK (Controlled Refusal Ablation via Calibrated Knockout) is a weight-level intervention. The modification is permanently baked into the published weights — no fine-tuning, no LoRA, no system prompts, no runtime hooks.
On this model the result is broad refusal removal with reasoning, code, and instruction-following preserved (see benchmarks above).
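CRACK's exact procedure is not described here, but weight-level refusal ablation is often illustrated as projecting a "refusal direction" out of a layer's weight matrix, so the layer can no longer write along that direction. A minimal NumPy sketch of that generic idea (an assumption for illustration, not CRACK's calibrated method):

```python
import numpy as np

def ablate_direction(W, d):
    """Generic directional ablation (illustrative only, NOT CRACK's actual
    algorithm): remove W's component along unit direction d in the output
    space, i.e. W' = (I - d d^T) W."""
    d = d / np.linalg.norm(d)
    return W - np.outer(d, d) @ W

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
d = rng.standard_normal(16)
W_abl = ablate_direction(W, d)
# The ablated weights have no remaining component along d:
assert np.allclose((d / np.linalg.norm(d)) @ W_abl, 0.0, atol=1e-9)
```

Because the projection is applied to the stored weights themselves, the change survives any inference setup, which is why no runtime hooks or system prompts are involved.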
## Model Family

### 149B Variants (this model)
| Variant | Bits | Size | RAM | Status |
|---|---|---|---|---|
| 149B Q4 | 4-bit | 78 GB | 128 GB | ✅ |
| 149B Q6 | 6-bit | 113 GB | 256 GB | ✅ |
| 149B Q8 | 8-bit | 148 GB | 256 GB | ✅ |
### 121B Variants (lighter, faster)
| Variant | Bits | Size | RAM | Status |
|---|---|---|---|---|
| 121B Q4 | 4-bit | 63 GB | 128 GB | ✅ |
| 121B Q6 | 6-bit | 92 GB | 256 GB | ✅ |
| 121B Q8 | 8-bit | 120 GB | 256 GB | ✅ |
## Credits
- StepFun — Step 3.5 Flash base model
- Cerebras — REAP expert pruning
- dealign.ai — CRACK abliteration surgery
Disclaimer: This model has safety guardrails removed. It will comply with requests that the original model would refuse. Users are responsible for how they use this model. Released for research purposes.