Qwen3.5-35B-A3B Compacted (167/256 experts)

MoE expert pruning: 67GB to 47GB BF16. Opus-distilled reasoning preserved.

89 of 256 experts removed based on runtime activation profiling across coding, reasoning, UI design, and conversational prompts. 80% routing coverage retained. Chain-of-thought reasoning intact.

Compaction Results

| Metric | Original | Compacted | Change |
|---|---|---|---|
| Size (BF16) | 67 GB | 47 GB | -30% |
| Experts | 256 | 167 | -35% |
| Active per token | 8 | 8 | same |
| Routing coverage | 100% | 80% | -20% |
| Code generation | correct | correct | preserved |
| Chain-of-thought | yes | yes | preserved |
| Conversational | natural | natural | preserved |
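Routing coverage here means the fraction of profiled routing decisions that land on a retained expert. A minimal sketch of how such a number could be computed from routing logs (the decisions and expert IDs below are made up for illustration, not taken from the actual profiling run):

```python
def routing_coverage(routing_decisions, kept_experts):
    """Fraction of routing decisions that map to a retained expert."""
    kept = set(kept_experts)
    hits = sum(1 for expert_id in routing_decisions if expert_id in kept)
    return hits / len(routing_decisions)

# Toy example: 10 routing decisions over 4 experts, with experts {0, 1, 2} kept.
decisions = [0, 1, 1, 2, 3, 0, 1, 2, 3, 1]
print(routing_coverage(decisions, kept_experts=[0, 1, 2]))  # 0.8
```

Tokens routed to a pruned expert fall back to the remaining experts in their top-k set, which is why 80% coverage can still produce coherent output.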

Benchmarks (Verified)

Tested via llama.cpp on two hardware platforms. Every result below is reproducible by following the Quick Start instructions further down this page.

MacBook M1 Pro (32GB unified, Metal)

Prompt: Write a Python is_prime function with docstring.

Result: 339 tokens in 10.9 seconds (31.1 tok/s)

import math

def is_prime(n: int) -> bool:
    """
    Determines whether a given integer is a prime number.

    Args:
        n (int): The number to check.

    Returns:
        bool: True if n is prime, False otherwise.
    """
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, int(math.sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True
  • VRAM: 14.2 GB used / 25.3 GB available
  • Speed: 31.1 tok/s on Apple Metal
  • Quality: correct algorithm, proper docstring, edge cases handled, O(sqrt(n))

RTX 5090 (32GB VRAM, CUDA)

Prompt: Write a Python function that checks if a number is prime.

Result: Chain-of-thought reasoning with think tags. Identified O(sqrt(n)) approach, edge cases, docstring structure.

  • Speed: 174.4 tok/s generation, 145.7 tok/s prompt processing
  • Quality: Opus-distilled chain-of-thought reasoning fully preserved

Performance Summary

| Hardware | VRAM Used | Speed | Fits? |
|---|---|---|---|
| MacBook M1 Pro 32GB | 14.2 GB | 31 tok/s | Yes (11GB headroom) |
| RTX 5090 32GB | 14 GB | 174 tok/s | Yes (18GB headroom) |
| RTX 3090 24GB | 14 GB | ~100 tok/s (est.) | Yes (10GB headroom) |
| MacBook Air 16GB | ~14 GB | ~20 tok/s (est.) | Tight but possible |

Compaction Method

  1. Runtime activation profiling on 5 domain-representative prompts (coding, UI design, reasoning, database architecture, conversation)
  2. 17,600 routing decisions captured across all 40 MoE layers via output_router_logits
  3. Expert ranking by total activation count
  4. Physical pruning: Bottom 89 experts removed by slicing dim 0 of fused expert tensors (gate_up_proj, down_proj) and router gate weights
  5. Streaming shard-by-shard processing (memory-safe for 32GB RAM)
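Steps 3 and 4 above can be sketched with NumPy stand-ins for one layer's fused expert tensors. The shapes, names, and activation counts below are illustrative (dimensions shrunk for the example), not the actual checkpoint layout:

```python
import numpy as np

num_experts, hidden, intermediate = 256, 32, 8  # toy dims; real model uses 2048 / 512
keep_count = 167

# Stand-ins for one layer's fused expert tensors, with experts along dim 0.
gate_up_proj = np.random.randn(num_experts, 2 * intermediate, hidden).astype(np.float32)
down_proj = np.random.randn(num_experts, hidden, intermediate).astype(np.float32)
router_gate = np.random.randn(num_experts, hidden).astype(np.float32)

# Step 3: rank experts by total activation count from profiling (toy counts here).
activation_counts = np.random.randint(0, 1000, size=num_experts)
keep_idx = np.sort(np.argsort(activation_counts)[-keep_count:])

# Step 4: physical pruning by slicing dim 0 of the expert tensors and router gate.
gate_up_proj = gate_up_proj[keep_idx]
down_proj = down_proj[keep_idx]
router_gate = router_gate[keep_idx]

print(gate_up_proj.shape, down_proj.shape, router_gate.shape)
```

Sorting `keep_idx` preserves the relative order of surviving experts, so the sliced router gate rows stay aligned with the sliced expert tensors.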

Quick Start

llama.cpp (recommended — fastest)

Download the Q4 GGUF (14GB) and run:

# Download
huggingface-cli download continuum-ai/qwen3.5-35b-a3b-compacted qwen3.5-35b-a3b-compacted-Q4_K_M.gguf --local-dir .

# Run server
./llama-server -m qwen3.5-35b-a3b-compacted-Q4_K_M.gguf -c 4096 -ngl 99

# Or interactive chat
./llama-cli -m qwen3.5-35b-a3b-compacted-Q4_K_M.gguf -c 4096 -ngl 99 --chat-template chatml -cnv

Available GGUF Quantizations

| File | Size | Use Case |
|---|---|---|
| qwen3.5-35b-a3b-compacted-Q4_K_M.gguf | 14GB | MacBook Pro 16GB, RTX 3060 12GB |
| qwen3.5-35b-a3b-compacted-Q8_0.gguf | 25GB | RTX 3090 24GB, RTX 4090 |

Python (transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "continuum-ai/qwen3.5-35b-a3b-compacted",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    ignore_mismatched_sizes=True,  # Gate weights resized from 256 to 167 experts
)
tokenizer = AutoTokenizer.from_pretrained(
    "continuum-ai/qwen3.5-35b-a3b-compacted",
    trust_remote_code=True,
)

inputs = tokenizer("Write a function that sorts a list.", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Note: ignore_mismatched_sizes=True is required because the MoE gate weights were resized from 256 to 167 experts during compaction. This is expected and safe — the model was verified to produce coherent output after pruning.

Hardware Requirements

| Format | VRAM | Hardware Examples |
|---|---|---|
| BF16 (this release) | 32GB+ | RTX 5090, RTX PRO 6000, 2x RTX 3090 |
| Q8_0 GGUF | 25GB | RTX 3090, RTX 4090 |
| Q4_K_M GGUF | 14GB | MacBook Pro 16GB, RTX 3060 12GB |
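The file sizes above are roughly what the bits-per-weight of each format predicts for a ~25B-parameter model. The bit widths below are approximations (Q8_0 and Q4_K_M store per-block scales on top of the quantized weights), and actual GGUF files also carry metadata and some mixed-precision tensors:

```python
params = 25e9  # approximate parameter count after compaction

# Approximate effective bits per weight for each format (assumed values).
formats = {"BF16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8}

for name, bits in formats.items():
    gb = params * bits / 8 / 1e9
    print(f"{name}: ~{gb:.0f} GB")
```

The estimates (~50 GB, ~27 GB, ~15 GB) land close to the published 47 GB / 25 GB / 14 GB figures, with the gap explained by the sub-25B exact parameter count.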

Part of continuum

continuum is an open-source AI ecosystem where personas live, work, learn, and evolve on your hardware. Zero API keys required. AGPL-3.0.

Built on the research foundations of Synthetic Citizens: AI Personas as Persistent, Evolving Entities and Plasticity Compaction: SOTA-to-COTS via MoE Expert Pruning. Our core contribution is utilization-aware model surgery — runtime profiling determines exactly which components are active for a target domain, how much each contributes, and what precision each requires. MoE experts, attention heads, and weight precision are all targeted independently based on measured activation patterns, not uniform heuristics. The result: SOTA models surgically reduced to fit consumer hardware with reasoning quality preserved.

Plasticity Compaction Paper | Get started

Architecture

  • Base: Qwen3.5 MoE with Vision (VL)
  • 40 layers (30 linear attention + 10 full attention)
  • Hidden size: 2048
  • MoE intermediate: 512 per expert
  • Shared expert: preserved (not pruned)
  • Vision encoder: preserved (27-layer ViT)
  • Max context: 262,144 tokens
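A back-of-envelope check of the routed-expert parameter count from the dimensions above (ignoring the shared expert, attention, embeddings, and the vision encoder; the 3x factor assumes fused gate_up_proj plus down_proj per expert, as described in the compaction method):

```python
layers, hidden, intermediate, experts = 40, 2048, 512, 167

# Per expert: gate_up_proj (2 * intermediate x hidden) + down_proj (hidden x intermediate)
per_expert = 3 * intermediate * hidden
routed_expert_params = layers * experts * per_expert
print(f"~{routed_expert_params / 1e9:.1f}B routed-expert parameters")  # ~21.0B
```

Adding the non-expert components brings the total in line with the released checkpoint size.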

License

Apache 2.0 (same as base model)
