# Qwen3.5-35B-A3B Compacted (167/256 experts)
MoE expert pruning: 67GB to 47GB BF16. Opus-distilled reasoning preserved.
89 of 256 experts removed based on runtime activation profiling across coding, reasoning, UI design, and conversational prompts. 80% routing coverage retained. Chain-of-thought reasoning intact.
## Compaction Results
| Metric | Original | Compacted | Change |
|---|---|---|---|
| Size (BF16) | 67 GB | 47 GB | -30% |
| Experts | 256 | 167 | -35% |
| Active per token | 8 | 8 | same |
| Routing coverage | 100% | 80% | -20 pts |
| Code generation | correct | correct | preserved |
| Chain-of-thought | yes | yes | preserved |
| Conversational | natural | natural | preserved |
## Benchmarks (Verified)
Tested via llama.cpp on two hardware platforms. Every result below is reproducible by following the Quick Start instructions below.
### MacBook M1 Pro (32GB unified, Metal)
Prompt: Write a Python is_prime function with docstring.
Result: 339 tokens in 10.9 seconds (31.1 tok/s)
```python
import math

def is_prime(n: int) -> bool:
    """
    Determines whether a given integer is a prime number.

    Args:
        n (int): The number to check.

    Returns:
        bool: True if n is prime, False otherwise.
    """
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, int(math.sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True
```
- VRAM: 14.2 GB used / 25.3 GB available
- Speed: 31.1 tok/s on Apple Metal
- Quality: correct algorithm, proper docstring, edge cases handled, O(sqrt(n))
### RTX 5090 (32GB VRAM, CUDA)
Prompt: Write a Python function that checks if a number is prime.
Result: Chain-of-thought reasoning with think tags; the model identified the O(sqrt(n)) approach, handled the edge cases, and produced a structured docstring.
- Speed: 174.4 tok/s generation, 145.7 tok/s prompt processing
- Quality: Opus-distilled chain-of-thought reasoning fully preserved
## Performance Summary
| Hardware | VRAM Used | Speed | Fits? |
|---|---|---|---|
| MacBook M1 Pro 32GB | 14.2 GB | 31 tok/s | Yes (11GB headroom) |
| RTX 5090 32GB | 14 GB | 174 tok/s | Yes (18GB headroom) |
| RTX 3090 24GB | 14 GB | ~100 tok/s (est) | Yes (10GB headroom) |
| MacBook Air 16GB | ~14 GB | ~20 tok/s (est) | Tight but possible |
## Compaction Method
- Runtime activation profiling on 5 domain-representative prompts (coding, UI design, reasoning, database architecture, conversation)
- 17,600 routing decisions captured across all 40 MoE layers via `output_router_logits`
- Expert ranking by total activation count
- Physical pruning: bottom 89 experts removed by slicing dim 0 of the fused expert tensors (`gate_up_proj`, `down_proj`) and the router gate weights
- Streaming shard-by-shard processing (memory-safe for 32GB RAM)
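The profiling and ranking steps above can be sketched in a few lines. This is a toy illustration with made-up routing data, not the release tooling: the real run captured 17,600 router decisions across 40 MoE layers, but the ranking-by-activation-count and coverage arithmetic work the same way.

```python
from collections import Counter

# Hypothetical routing trace: each entry is the expert ID the router
# selected for one token slot during profiling (toy data).
routing_decisions = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3]

counts = Counter(routing_decisions)
# Rank experts by total activation count, most-used first.
ranked = [expert_id for expert_id, _ in counts.most_common()]

# Keep the top-K experts; the bottom of the ranking is pruned.
k = 3
kept = set(ranked[:k])

# Routing coverage: fraction of observed routing decisions that still
# land on a kept expert after pruning (80% for the real model).
coverage = sum(counts[e] for e in kept) / len(routing_decisions)
print(coverage)  # 0.9 for this toy trace
```

With the real trace, the same computation over the 167 kept experts yields the 80% coverage figure reported above.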
## Quick Start
### llama.cpp (recommended — fastest)
Download the Q4 GGUF (14GB) and run:
```bash
# Download
huggingface-cli download continuum-ai/qwen3.5-35b-a3b-compacted qwen3.5-35b-a3b-compacted-Q4_K_M.gguf --local-dir .

# Run server
./llama-server -m qwen3.5-35b-a3b-compacted-Q4_K_M.gguf -c 4096 -ngl 99

# Or interactive chat
./llama-cli -m qwen3.5-35b-a3b-compacted-Q4_K_M.gguf -c 4096 -ngl 99 --chat-template chatml -cnv
```
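Once `llama-server` is running, it exposes an OpenAI-compatible `/v1/chat/completions` endpoint. A minimal stdlib-only sketch of a client request is below; the port (llama-server's default, 8080) and the model field are assumptions you may need to adjust for your setup.

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request against the local
# llama-server instance started in the Quick Start above.
payload = {
    "model": "qwen3.5-35b-a3b-compacted-Q4_K_M",  # assumed name; llama-server may ignore it
    "messages": [
        {"role": "user", "content": "Write a Python is_prime function."},
    ],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```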
### Available GGUF Quantizations
| File | Size | Use Case |
|---|---|---|
| `qwen3.5-35b-a3b-compacted-Q4_K_M.gguf` | 14GB | MacBook Pro 16GB, RTX 3060 12GB |
| `qwen3.5-35b-a3b-compacted-Q8_0.gguf` | 25GB | RTX 3090 24GB, RTX 4090 |
### Python (transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "continuum-ai/qwen3.5-35b-a3b-compacted",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    ignore_mismatched_sizes=True,  # Gate weights resized from 256 to 167 experts
)
tokenizer = AutoTokenizer.from_pretrained(
    "continuum-ai/qwen3.5-35b-a3b-compacted",
    trust_remote_code=True,
)

inputs = tokenizer("Write a function that sorts a list.", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Note: `ignore_mismatched_sizes=True` is required because the MoE gate weights were resized from 256 to 167 experts during compaction. This is expected and safe — the model was verified to produce coherent output after pruning.
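The shape mismatch that flag tolerates is easy to visualize: the router gate maps hidden states to one logit per expert, so its weight has one row per expert, and pruning keeps only the surviving rows. A toy pure-Python sketch (the kept-expert IDs here are placeholders, not the real selection):

```python
# Router gate weight: one row per expert, one column per hidden dim.
hidden_size = 2048
gate_weight = [[0.0] * hidden_size for _ in range(256)]  # shape [256, 2048]

# Placeholder for the real activation-ranked kept set of 167 experts.
kept_expert_ids = list(range(167))

# Slicing dim 0 keeps only the rows of surviving experts,
# producing the [167, 2048] gate the checkpoint now ships with.
pruned_gate = [gate_weight[i] for i in kept_expert_ids]
print(len(gate_weight), len(pruned_gate))  # 256 167
```

Because the config still instantiates the gate from the checkpoint shapes, transformers flags the 256-vs-167 row mismatch unless told to ignore it.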
## Hardware Requirements
| Format | VRAM | Hardware Examples |
|---|---|---|
| BF16 (this release) | 32GB+ | RTX 5090, RTX PRO 6000, 2x RTX 3090 |
| Q8_0 GGUF | 25GB | RTX 3090, RTX 4090 |
| Q4_K_M GGUF | 14GB | MacBook Pro 16GB, RTX 3060 12GB |
## Part of continuum
continuum is an open-source AI ecosystem where personas live, work, learn, and evolve on your hardware. Zero API keys required. AGPL-3.0.
Built on the research foundations of Synthetic Citizens: AI Personas as Persistent, Evolving Entities and Plasticity Compaction: SOTA-to-COTS via MoE Expert Pruning. Our core contribution is utilization-aware model surgery — runtime profiling determines exactly which components are active for a target domain, how much each contributes, and what precision each requires. MoE experts, attention heads, and weight precision are all targeted independently based on measured activation patterns, not uniform heuristics. The result: SOTA models surgically reduced to fit consumer hardware with reasoning quality preserved.
Plasticity Compaction Paper | Get started
## Architecture
- Base: Qwen3.5 MoE with Vision (VL)
- 40 layers (30 linear attention + 10 full attention)
- Hidden size: 2048
- MoE intermediate: 512 per expert
- Shared expert: preserved (not pruned)
- Vision encoder: preserved (27-layer ViT)
- Max context: 262,144 tokens
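A back-of-envelope check (under the simplifying assumption that expert weights dominate the savings) shows the architecture numbers above are consistent with the reported 67 GB → 47 GB drop. Each expert carries a fused `gate_up_proj` (hidden → 2 × intermediate) and a `down_proj` (intermediate → hidden):

```python
# Assumed from the Architecture section above; 2 bytes per BF16 param.
hidden = 2048
intermediate = 512
moe_layers = 40
pruned_experts = 256 - 167  # 89

params_per_expert = hidden * (2 * intermediate) + intermediate * hidden
bytes_saved = pruned_experts * moe_layers * params_per_expert * 2
gib_saved = bytes_saved / 2**30
print(round(gib_saved, 1))  # ~20.9, in line with the ~20 GB reduction
```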
## License
Apache 2.0 (same as base model)