# Qwen3.5-35B-A3B Compacted (167/256 experts)
MoE expert pruning: 67GB to 47GB BF16. Opus-distilled reasoning preserved.
89 of 256 experts removed based on runtime activation profiling across coding, reasoning, UI design, and conversational prompts. 80% routing coverage retained. Chain-of-thought reasoning intact.
## Compaction Results
| Metric | Original | Compacted | Change |
|---|---|---|---|
| Size (BF16) | 67 GB | 47 GB | -30% |
| Experts | 256 | 167 | -35% |
| Active per token | 8 | 8 | same |
| Routing coverage | 100% | 80% | -20 pts |
| Code generation | correct | correct | preserved |
| Chain-of-thought | yes | yes | preserved |
| Conversational | natural | natural | preserved |
## Benchmarks (Verified)
Tested via llama.cpp on two hardware platforms. Every result below is reproducible by following the Quick Start instructions below.
### MacBook M1 Pro (32GB unified, Metal)
Prompt: Write a Python is_prime function with docstring.
Result: 339 tokens in 10.9 seconds (31.1 tok/s)
```python
import math

def is_prime(n: int) -> bool:
    """
    Determines whether a given integer is a prime number.

    Args:
        n (int): The number to check.

    Returns:
        bool: True if n is prime, False otherwise.
    """
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, int(math.sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True
```
- VRAM: 14.2 GB used / 25.3 GB available
- Speed: 31.1 tok/s on Apple Metal
- Quality: correct algorithm, proper docstring, edge cases handled, O(sqrt(n))
### RTX 5090 (32GB VRAM, CUDA)
Prompt: Write a Python function that checks if a number is prime.
Result: Chain-of-thought reasoning with think tags; the model identified the O(sqrt(n)) approach, handled the edge cases, and produced a structured docstring.
- Speed: 174.4 tok/s generation, 145.7 tok/s prompt processing
- Quality: Opus-distilled chain-of-thought reasoning fully preserved
## Performance Summary
| Hardware | VRAM Used | Speed | Fits? |
|---|---|---|---|
| MacBook M1 Pro 32GB | 14.2 GB | 31 tok/s | Yes (11GB headroom) |
| RTX 5090 32GB | 14 GB | 174 tok/s | Yes (18GB headroom) |
| RTX 3090 24GB | 14 GB | ~100 tok/s (est) | Yes (10GB headroom) |
| MacBook Air 16GB | ~14 GB | ~20 tok/s (est) | Tight but possible |
## Compaction Method
- Runtime activation profiling on 5 domain-representative prompts (coding, UI design, reasoning, database architecture, conversation)
- 17,600 routing decisions captured across all 40 MoE layers via `output_router_logits`
- Expert ranking by total activation count
- Physical pruning: bottom 89 experts removed by slicing dim 0 of the fused expert tensors (`gate_up_proj`, `down_proj`) and the router gate weights
- Streaming shard-by-shard processing (memory-safe for 32GB RAM)
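The profiling and ranking steps above can be sketched in a few lines. This is a toy illustration with made-up routing data, not the release tooling: the real run captured 17,600 router decisions across 40 MoE layers, but the ranking-by-activation-count and coverage arithmetic work the same way.

```python
from collections import Counter

# Hypothetical routing trace: each entry is the expert ID the router
# selected for one token slot during profiling (toy data).
routing_decisions = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3]

counts = Counter(routing_decisions)
# Rank experts by total activation count, most-used first.
ranked = [expert_id for expert_id, _ in counts.most_common()]

# Keep the top-K experts; the bottom of the ranking is pruned.
k = 3
kept = set(ranked[:k])

# Routing coverage: fraction of observed routing decisions that still
# land on a kept expert after pruning (80% for the real model).
coverage = sum(counts[e] for e in kept) / len(routing_decisions)
print(coverage)  # 0.9 for this toy trace
```

With the real trace, the same computation over the 167 kept experts yields the 80% coverage figure reported above.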
## Quick Start
### llama.cpp (recommended — fastest)
Download the Q4 GGUF (14GB) and run:
```bash
# Download
huggingface-cli download continuum-ai/qwen3.5-35b-a3b-compacted qwen3.5-35b-a3b-compacted-Q4_K_M.gguf --local-dir .

# Run server
./llama-server -m qwen3.5-35b-a3b-compacted-Q4_K_M.gguf -c 4096 -ngl 99

# Or interactive chat
./llama-cli -m qwen3.5-35b-a3b-compacted-Q4_K_M.gguf -c 4096 -ngl 99 --chat-template chatml -cnv
```
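Once `llama-server` is running, it exposes an OpenAI-compatible `/v1/chat/completions` endpoint. A minimal stdlib-only sketch of a client request is below; the port (llama-server's default, 8080) and the model field are assumptions you may need to adjust for your setup.

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request against the local
# llama-server instance started in the Quick Start above.
payload = {
    "model": "qwen3.5-35b-a3b-compacted-Q4_K_M",  # assumed name; llama-server may ignore it
    "messages": [
        {"role": "user", "content": "Write a Python is_prime function."},
    ],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```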
### Available GGUF Quantizations
| File | Size | Use Case |
|---|---|---|
| `qwen3.5-35b-a3b-compacted-Q4_K_M.gguf` | 14GB | MacBook Pro 16GB, RTX 3060 12GB |
| `qwen3.5-35b-a3b-compacted-Q8_0.gguf` | 25GB | RTX 3090 24GB, RTX 4090 |
### Python (transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "continuum-ai/qwen3.5-35b-a3b-compacted",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    ignore_mismatched_sizes=True,  # Gate weights resized from 256 to 167 experts
)
tokenizer = AutoTokenizer.from_pretrained(
    "continuum-ai/qwen3.5-35b-a3b-compacted",
    trust_remote_code=True,
)

inputs = tokenizer("Write a function that sorts a list.", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Note: `ignore_mismatched_sizes=True` is required because the MoE gate weights were resized from 256 to 167 experts during compaction. This is expected and safe — the model was verified to produce coherent output after pruning.
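The shape mismatch that flag tolerates is easy to visualize: the router gate maps hidden states to one logit per expert, so its weight has one row per expert, and pruning keeps only the surviving rows. A toy pure-Python sketch (the kept-expert IDs here are placeholders, not the real selection):

```python
# Router gate weight: one row per expert, one column per hidden dim.
hidden_size = 2048
gate_weight = [[0.0] * hidden_size for _ in range(256)]  # shape [256, 2048]

# Placeholder for the real activation-ranked kept set of 167 experts.
kept_expert_ids = list(range(167))

# Slicing dim 0 keeps only the rows of surviving experts,
# producing the [167, 2048] gate the checkpoint now ships with.
pruned_gate = [gate_weight[i] for i in kept_expert_ids]
print(len(gate_weight), len(pruned_gate))  # 256 167
```

Because the config still instantiates the gate from the checkpoint shapes, transformers flags the 256-vs-167 row mismatch unless told to ignore it.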
## Hardware Requirements
| Format | VRAM | Hardware Examples |
|---|---|---|
| BF16 (this release) | 32GB+ | RTX 5090, RTX PRO 6000, 2x RTX 3090 |
| Q8_0 GGUF | 25GB | RTX 3090, RTX 4090 |
| Q4_K_M GGUF | 14GB | MacBook Pro 16GB, RTX 3060 12GB |
## Part of continuum
continuum is an open-source AI ecosystem where personas live, work, learn, and evolve on your hardware. Zero API keys required. AGPL-3.0.
Built on the research foundations of Synthetic Citizens: AI Personas as Persistent, Evolving Entities and Plasticity Compaction: SOTA-to-COTS via MoE Expert Pruning. Our core contribution is utilization-aware model surgery — runtime profiling determines exactly which components are active for a target domain, how much each contributes, and what precision each requires. MoE experts, attention heads, and weight precision are all targeted independently based on measured activation patterns, not uniform heuristics. The result: SOTA models surgically reduced to fit consumer hardware with reasoning quality preserved.
Plasticity Compaction Paper | Get started
## Architecture
- Base: Qwen3.5 MoE with Vision (VL)
- 40 layers (30 linear attention + 10 full attention)
- Hidden size: 2048
- MoE intermediate: 512 per expert
- Shared expert: preserved (not pruned)
- Vision encoder: preserved (27-layer ViT)
- Max context: 262,144 tokens
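A back-of-envelope check (under the simplifying assumption that expert weights dominate the savings) shows the architecture numbers above are consistent with the reported 67 GB → 47 GB drop. Each expert carries a fused `gate_up_proj` (hidden → 2 × intermediate) and a `down_proj` (intermediate → hidden):

```python
# Assumed from the Architecture section above; 2 bytes per BF16 param.
hidden = 2048
intermediate = 512
moe_layers = 40
pruned_experts = 256 - 167  # 89

params_per_expert = hidden * (2 * intermediate) + intermediate * hidden
bytes_saved = pruned_experts * moe_layers * params_per_expert * 2
gib_saved = bytes_saved / 2**30
print(round(gib_saved, 1))  # ~20.9, in line with the ~20 GB reduction
```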
## License
Apache 2.0 (same as base model)