# Qwen2.5-Coder-14B-Instruct → Compacted (25Q/5KV, Q5_K_S)
A 14-billion parameter coding model compressed to run on a 16GB MacBook Air.
## How It Was Built

Continuum's adaptive compression pipeline:

- Head Pruning: 40 Q-heads / 8 KV-heads → 25 Q-heads / 5 KV-heads (37.5% KV cache reduction)
- Quantization: Q5_K_S (5.1 bits per weight)
- Result: 27 GB BF16 → 8.9 GB GGUF
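As a sanity check, the size figures above follow from simple arithmetic. This sketch assumes ~13.5B effective weights (27 GB BF16 at 2 bytes/weight) and the 5.1 bits/weight stated for Q5_K_S; the small gap to the published 8.9 GB is plausibly metadata and mixed-precision tensors in the GGUF.

```python
# Back-of-envelope check of the compression figures in the card.
bf16_gb = 27.0
params_b = bf16_gb * 1e9 / 2          # 2 bytes per BF16 weight
gguf_gb = params_b * 5.1 / 8 / 1e9    # 5.1 bits per weight -> bytes -> GB

kv_heads_before, kv_heads_after = 8, 5
kv_reduction = 1 - kv_heads_after / kv_heads_before  # head pruning saving

print(f"{params_b / 1e9:.1f}B params -> {gguf_gb:.1f} GB at 5.1 bpw")
print(f"KV cache reduction: {kv_reduction:.1%}")
```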
## Performance
| Metric | Value |
|---|---|
| Speed | 9.2 tok/s (M1 Pro 32GB, Metal) |
| Memory | ~9 GB |
| Architecture | 48 layers, 25 Q-heads, 5 KV-heads, head_dim=128 |
| Quantization | Q5_K_S (5.1 BPW) |
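The architecture row above also pins down the KV-cache footprint. The sketch below assumes an FP16 cache (2 bytes per element) and a 32K example context; exact numbers depend on the runtime, but the 37.5% saving from 8 → 5 KV heads is independent of those assumptions.

```python
# KV-cache footprint implied by: 48 layers, head_dim=128, 8 -> 5 KV heads.
layers, head_dim = 48, 128
bytes_per_elem = 2   # FP16 cache (assumption)
ctx = 32768          # example context length (assumption)

def kv_bytes_per_token(kv_heads):
    # K and V each store layers * kv_heads * head_dim elements per token
    return layers * kv_heads * head_dim * 2 * bytes_per_elem

before = kv_bytes_per_token(8)  # original GQA config
after = kv_bytes_per_token(5)   # pruned config
print(f"per token: {before} -> {after} bytes ({1 - after / before:.1%} saved)")
print(f"at {ctx} ctx: {before * ctx / 1e9:.2f} -> {after * ctx / 1e9:.2f} GB")
```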
## How to Run

With Continuum (downloads automatically):

```shell
# Model alias "coder" resolves to this model
./jtag inference/generate --model=coder --prompt="def fibonacci(n):"
```
## Links

- Continuum – Local AI runtime
- sentinel-ai – Research project
- continuum-ai – More models
## License
Apache 2.0
## Part of continuum
continuum is an open-source AI ecosystem where personas live, work, learn, and evolve on your hardware. Zero API keys required. AGPL-3.0.

Built on the research foundations of *Synthetic Citizens: AI Personas as Persistent, Evolving Entities* and *Plasticity Compaction: SOTA-to-COTS via MoE Expert Pruning*. Our core contribution is utilization-aware model surgery: runtime profiling determines exactly which components are active for a target domain, how much each contributes, and what precision each requires. MoE experts, attention heads, and weight precision are each targeted independently based on measured activation patterns rather than uniform heuristics. The result: SOTA models surgically reduced to fit consumer hardware with reasoning quality preserved.
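To make the idea concrete, here is a minimal, illustrative sketch of utilization-aware head selection. This is not the Continuum pipeline: the profiling data is synthetic, and the utilization score (mean absolute activation per head) is a stand-in for whatever contribution metric the real profiler uses.

```python
import numpy as np

# Toy utilization-aware pruning: profile per-head activity, keep top-k heads.
rng = np.random.default_rng(0)
n_heads, keep = 40, 25  # matches the card's 40 -> 25 Q-head pruning

# Fake profiling run: (samples, heads) activation magnitudes, with some
# heads systematically more active than others.
activations = rng.normal(size=(1000, n_heads)) * rng.uniform(0.1, 2.0, n_heads)

utilization = np.abs(activations).mean(axis=0)  # mean |activation| per head
kept = np.argsort(utilization)[-keep:]          # indices of most-used heads
print(f"keeping {len(kept)}/{n_heads} heads")
```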
## Model tree for continuum-ai/qwen2.5-coder-14b-compacted

Base model: Qwen/Qwen2.5-14B