# Qwen2.5-Coder-32B-Instruct Compacted (25Q/5KV, Q3_K_S)
A 32-billion parameter coding model compressed to run on a 32GB MacBook.
## The Story
This model was built by continuum, an open-source local AI runtime with continuous learning and adaptive model compression.
The compression technique, utilization-aware head pruning, was conceived as part of the sentinel-ai research project, which explores autonomous AI systems that can learn, adapt, and optimize themselves. Continuum is the runtime that makes those ideas real: it fine-tunes models on your data, measures which attention heads matter for your task, prunes the ones that don't, and quantizes the rest at variable precision to fit your hardware.
This model is proof that the approach works: a 32B coding model that normally requires 80 GB+ of VRAM, running locally on consumer hardware and generating correct code.
## How It Was Built
Continuum's adaptive compression pipeline:
- Score: LoRA fine-tuning on coding tasks with gradient capture. Each attention head's contribution is measured by gradient flow through its gate projection.
- Prune: 3 of 8 KV groups removed (37.5% KV cache reduction). 40 Q-heads / 8 KV-heads → 25 Q-heads / 5 KV-heads. Whole GQA groups are pruned to maintain architectural integrity.
- Quantize: Q3_K_S (~3.5 bits per weight). 62 GB BF16 → 13 GB GGUF.
- Verify: Inference tested on an M1 Pro 32GB MacBook; generates correct Python functions.
Future models from this pipeline will use mixed quantization: high-utilization layers get more precision (Q5_K/Q6_K), low-utilization layers get less (Q3_K). Same file size, better quality.
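The prune step above can be sketched as follows. This is an illustrative assumption, not continuum's actual API: it assumes per-head utilization scores were accumulated during fine-tuning (e.g. gradient magnitudes through each head's gate projection), aggregates them per GQA group (contiguous blocks of Q-heads sharing a KV-head), and drops the lowest-scoring whole groups.

```python
import numpy as np

def prune_kv_groups(head_scores, n_kv_groups, keep_groups):
    """Aggregate Q-head utilization scores per KV group and keep the top groups.

    head_scores: one utilization score per Q-head, in head order.
    Whole GQA groups are pruned, so every surviving Q-head still shares
    a surviving KV-head (preserving architectural integrity).
    """
    heads_per_group = len(head_scores) / n_kv_groups
    assert heads_per_group == int(heads_per_group), "Q-heads must divide evenly into KV groups"
    # Sum scores within each contiguous group of Q-heads.
    group_scores = np.asarray(head_scores).reshape(n_kv_groups, -1).sum(axis=1)
    # Keep the highest-utilization groups, returned in original layer order.
    kept = np.sort(np.argsort(group_scores)[::-1][:keep_groups])
    return kept

# Qwen2.5-Coder-32B: 40 Q-heads in 8 GQA groups of 5; keeping 5 groups
# yields 25 Q-heads / 5 KV-heads, matching this model.
scores = np.random.rand(40)
kept = prune_kv_groups(scores, n_kv_groups=8, keep_groups=5)
print(kept)  # indices of the 5 surviving KV groups
```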
## Performance
| Metric | Value |
|---|---|
| Speed | 5.3 tok/s (M1 Pro 32GB, Metal) |
| Memory | 13.4 GB steady, 19.9 GB peak during load |
| Architecture | 64 layers, 25 Q-heads, 5 KV-heads, head_dim=128 |
| Quantization | Q3_K_S (3.55 BPW) |
| Context | 32,768 tokens |
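As a rough sanity check on the numbers above (assuming the ~13 GB file size and 3.55 BPW from this card, and ignoring GGUF metadata overhead), the file size implies roughly 29B effective parameters, consistent with a 32B model whose attention has been pruned from 40 to 25 Q-heads:

```python
# Back-compute the effective parameter count from file size and bits per weight.
file_bytes = 13e9   # reported GGUF size (approximate)
bpw = 3.55          # Q3_K_S bits per weight, from the table above
params = file_bytes * 8 / bpw
print(f"{params / 1e9:.1f}B effective parameters")  # ~29.3B
```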
## Sample Output
Prompt: `def is_prime(n):`

```python
def is_prime(n):
    """Check if a number is prime."""
    if n <= 1:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True
```
Prompt: `def binary_search(arr, target):`

```python
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
```
## How to Run
### With Continuum (recommended)
Continuum natively supports compacted GGUF models with non-standard attention dimensions.
The model downloads automatically when needed; no HuggingFace account is required.
### With Other Engines
This model uses non-standard attention dimensions from head pruning (e.g., Q projection is [5120, 3200] instead of [5120, 5120]). Engines that validate tensor shapes against architecture defaults (llama.cpp, ollama) will reject it. Engines that load tensors by shape rather than metadata will work.
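A minimal sketch of the shape mismatch, using the dimensions quoted above: a shape-validating engine derives the expected Q projection from the architecture's default head count (40), while the compacted file ships the pruned count (25).

```python
# Why shape-validating engines reject this model (dimensions from this card).
hidden_size = 5120
head_dim = 128

def q_proj_shape(n_q_heads):
    """Output-projection-input x concatenated-head-output dimensions."""
    return (hidden_size, n_q_heads * head_dim)

standard = q_proj_shape(40)   # architecture default: (5120, 5120)
compacted = q_proj_shape(25)  # after pruning:        (5120, 3200)
print(standard, compacted)
```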
## Make Your Own
The compression pipeline that built this model is open source. With continuum, you can compress any GQA model for any target device:
```bash
# Score: fine-tune + capture head utilization
./jtag genome/train --provider=runpod --captureGradients=true

# Compress: prune + quantize for your hardware
./jtag plasticity/compress \
  --capturePath=./capture \
  --modelPath=Qwen/Qwen2.5-Coder-32B-Instruct \
  --deviceSpec=32gb
```
Target any device: `16gb` (MacBook Air), `32gb` (MacBook Pro), `24gb-vram` (RTX 4090).
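One way such a device spec could translate into a precision budget. This is an illustrative assumption, not continuum's actual policy: the budget table, the 0.7 usable-memory fraction, and `max_bpw` are all hypothetical.

```python
# Hypothetical mapping from a device spec to the highest bits-per-weight
# whose quantized weights fit in a usable fraction of device memory.
DEVICE_BUDGETS_GB = {"16gb": 16, "32gb": 32, "24gb-vram": 24}

def max_bpw(device, n_params, usable_fraction=0.7):
    """Highest bits per weight that keeps the weights inside the budget,
    leaving the rest of memory for KV cache and runtime overhead."""
    budget_bytes = DEVICE_BUDGETS_GB[device] * 1e9 * usable_fraction
    return budget_bytes * 8 / n_params

# ~29.3B effective parameters on a 32 GB MacBook: ~6.1 bpw ceiling,
# so Q3_K_S (3.55 bpw) fits with room to spare.
print(f"{max_bpw('32gb', 29.3e9):.2f} bpw")
```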
## Links
- continuum: Local AI runtime. Continuous learning, model compression, autonomous personas.
- sentinel-ai: Research project exploring autonomous AI optimization. The ideas behind adaptive compression originated here.
- continuum-ai on HuggingFace: More compressed models coming soon.
## License
Apache 2.0 (same as base model).
## Part of continuum
continuum is an open-source AI ecosystem where personas live, work, learn, and evolve on your hardware. Zero API keys required. AGPL-3.0.
Built on the research foundations of Synthetic Citizens: AI Personas as Persistent, Evolving Entities and Plasticity Compaction: SOTA-to-COTS via MoE Expert Pruning. Our core contribution is utilization-aware model surgery: runtime profiling determines exactly which components are active for a target domain, how much each contributes, and what precision each requires. MoE experts, attention heads, and weight precision are all targeted independently based on measured activation patterns, not uniform heuristics. The result: SOTA models surgically reduced to fit consumer hardware with reasoning quality preserved.