# Qwen2.5-Coder-14B-Instruct → Compacted (25Q/5KV, Q5_K_S)
A 14-billion parameter coding model compressed to run on a 16GB MacBook Air.
## How It Was Built

Continuum's adaptive compression pipeline:

- Head Pruning: 40 Q-heads / 8 KV-heads → 25 Q-heads / 5 KV-heads (37.5% KV cache reduction)
- Quantization: Q5_K_S (5.1 bits per weight)
- Result: 27 GB BF16 → 8.9 GB GGUF
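As a sanity check, the size figures above follow from simple arithmetic. This sketch assumes ~13.5B effective weights (27 GB BF16 at 2 bytes/weight) and the 5.1 bits/weight stated for Q5_K_S; the small gap to the published 8.9 GB is plausibly metadata and mixed-precision tensors in the GGUF.

```python
# Back-of-envelope check of the compression figures in the card.
bf16_gb = 27.0
params_b = bf16_gb * 1e9 / 2          # 2 bytes per BF16 weight
gguf_gb = params_b * 5.1 / 8 / 1e9    # 5.1 bits per weight -> bytes -> GB

kv_heads_before, kv_heads_after = 8, 5
kv_reduction = 1 - kv_heads_after / kv_heads_before  # head pruning saving

print(f"{params_b / 1e9:.1f}B params -> {gguf_gb:.1f} GB at 5.1 bpw")
print(f"KV cache reduction: {kv_reduction:.1%}")
```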
## Performance
| Metric | Value |
|---|---|
| Speed | 9.2 tok/s (M1 Pro 32GB, Metal) |
| Memory | ~9 GB |
| Architecture | 48 layers, 25 Q-heads, 5 KV-heads, head_dim=128 |
| Quantization | Q5_K_S (5.1 BPW) |
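The architecture row above also pins down the KV-cache footprint. The sketch below assumes an FP16 cache (2 bytes per element) and a 32K example context; exact numbers depend on the runtime, but the 37.5% saving from 8 → 5 KV heads is independent of those assumptions.

```python
# KV-cache footprint implied by: 48 layers, head_dim=128, 8 -> 5 KV heads.
layers, head_dim = 48, 128
bytes_per_elem = 2   # FP16 cache (assumption)
ctx = 32768          # example context length (assumption)

def kv_bytes_per_token(kv_heads):
    # K and V each store layers * kv_heads * head_dim elements per token
    return layers * kv_heads * head_dim * 2 * bytes_per_elem

before = kv_bytes_per_token(8)  # original GQA config
after = kv_bytes_per_token(5)   # pruned config
print(f"per token: {before} -> {after} bytes ({1 - after / before:.1%} saved)")
print(f"at {ctx} ctx: {before * ctx / 1e9:.2f} -> {after * ctx / 1e9:.2f} GB")
```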
## How to Run

With Continuum (downloads automatically):

```shell
# Model alias "coder" resolves to this model
./jtag inference/generate --model=coder --prompt="def fibonacci(n):"
```
## Links

- Continuum – Local AI runtime
- sentinel-ai – Research project
- continuum-ai – More models
## License
Apache 2.0
## Part of continuum
continuum is an open-source AI ecosystem where personas live, work, learn, and evolve on your hardware. Zero API keys required. AGPL-3.0.

Built on the research foundations of *Synthetic Citizens: AI Personas as Persistent, Evolving Entities* and *Plasticity Compaction: SOTA-to-COTS via MoE Expert Pruning*. Our core contribution is utilization-aware model surgery: runtime profiling determines exactly which components are active for a target domain, how much each contributes, and what precision each requires. MoE experts, attention heads, and weight precision are each targeted independently based on measured activation patterns rather than uniform heuristics. The result: SOTA models surgically reduced to fit consumer hardware with reasoning quality preserved.
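To make the idea concrete, here is a minimal, illustrative sketch of utilization-aware head selection. This is not the Continuum pipeline: the profiling data is synthetic, and the utilization score (mean absolute activation per head) is a stand-in for whatever contribution metric the real profiler uses.

```python
import numpy as np

# Toy utilization-aware pruning: profile per-head activity, keep top-k heads.
rng = np.random.default_rng(0)
n_heads, keep = 40, 25  # matches the card's 40 -> 25 Q-head pruning

# Fake profiling run: (samples, heads) activation magnitudes, with some
# heads systematically more active than others.
activations = rng.normal(size=(1000, n_heads)) * rng.uniform(0.1, 2.0, n_heads)

utilization = np.abs(activations).mean(axis=0)  # mean |activation| per head
kept = np.argsort(utilization)[-keep:]          # indices of most-used heads
print(f"keeping {len(kept)}/{n_heads} heads")
```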
## Model tree for continuum-ai/qwen2.5-coder-14b-compacted

Base model: Qwen/Qwen2.5-14B