Qwen2.5-Coder-14B-Instruct – Compacted (25Q/5KV, Q5_K_S)

A 14-billion parameter coding model compressed to run on a 16GB MacBook Air.

How It Was Built

Continuum's adaptive compression pipeline:

  1. Head Pruning: 40 Q-heads / 8 KV-heads → 25 Q-heads / 5 KV-heads (37.5% KV cache reduction)
  2. Quantization: Q5_K_S (5.1 bits per weight)
  3. Result: 27GB BF16 → 8.9GB GGUF
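The cache saving in step 1 follows directly from the head counts. A quick sketch of the arithmetic, assuming an fp16 cache and the layer count and head_dim listed under Performance:

```python
# Per-token KV-cache footprint = layers * kv_heads * head_dim * 2 (K and V) * bytes/elem.
def kv_cache_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):  # fp16 assumed
    return layers * kv_heads * head_dim * 2 * bytes_per_elem

before = kv_cache_per_token(48, 8, 128)   # original: 8 KV-heads -> 196608 bytes/token
after  = kv_cache_per_token(48, 5, 128)   # pruned:  5 KV-heads -> 122880 bytes/token
print(1 - after / before)  # 0.375, i.e. the 37.5% reduction above
```

Because KV-cache size is linear in the number of KV-heads, the 8 → 5 cut gives exactly 3/8 = 37.5% regardless of the other dimensions.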

Performance

Metric          Value
------          -----
Speed           9.2 tok/s (M1 Pro 32GB, Metal)
Memory          ~9 GB
Architecture    48 layers, 25 Q-heads, 5 KV-heads, head_dim=128
Quantization    Q5_K_S (5.1 BPW)
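The GGUF file size is consistent with the bits-per-weight figure. A rough back-of-the-envelope check, taking the parameter count as approximately 14e9 and ignoring per-tensor metadata overhead:

```python
params = 14e9            # approximate parameter count after pruning
bpw = 5.1                # Q5_K_S effective bits per weight
size_gb = params * bpw / 8 / 1e9  # bits -> bytes -> GB
print(round(size_gb, 1))  # -> 8.9, matching the 8.9GB GGUF above
```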

How to Run

With Continuum (downloads automatically):

# Model alias "coder" resolves to this model
./jtag inference/generate --model=coder --prompt="def fibonacci(n):"

License

Apache 2.0

Part of continuum

continuum is an open-source AI ecosystem where personas live, work, learn, and evolve on your hardware. Zero API keys required. AGPL-3.0.

Built on the research foundations of Synthetic Citizens: AI Personas as Persistent, Evolving Entities and Plasticity Compaction: SOTA-to-COTS via MoE Expert Pruning. Our core contribution is utilization-aware model surgery: runtime profiling determines exactly which components are active for a target domain, how much each contributes, and what precision each requires. MoE experts, attention heads, and weight precision are all targeted independently based on measured activation patterns, not uniform heuristics. The result: SOTA models surgically reduced to fit consumer hardware with reasoning quality preserved.
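The head-selection step described above can be illustrated with a toy sketch. The scoring function here is hypothetical, standing in for Continuum's actual runtime profiler; only the 40 → 25 head counts come from this model:

```python
import random

# Hypothetical per-head utilization scores, standing in for measured
# activation statistics (e.g. mean attention-output norm per head)
# collected while profiling domain-specific prompts.
random.seed(0)
n_heads = 40
scores = [random.random() for _ in range(n_heads)]

# Keep the 25 highest-utilization Q-heads (40 -> 25, as in this model);
# the remaining 15 are pruned from the checkpoint.
keep = sorted(range(n_heads), key=lambda h: scores[h], reverse=True)[:25]
print(sorted(keep))
```

Profiling each component independently is what lets the pipeline prune heads, experts, and precision on different schedules rather than applying one uniform ratio.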

Plasticity Compaction Paper | Get started

Downloads last month: 2,065
Format: GGUF
Model size: 14B params
Architecture: qwen2
Model tree for continuum-ai/qwen2.5-coder-14b-compacted

Base model: Qwen/Qwen2.5-14B (this model is one of 86 quantized variants of the base)