# Qwen2.5-Coder-32B-Instruct: Compacted (25Q/5KV, Q3_K_S)

A 32-billion parameter coding model compressed to run on a 32GB MacBook.

## The Story

This model was built by continuum, an open-source local AI runtime with continuous learning and adaptive model compression.

The compression technique, utilization-aware head pruning, was conceived as part of the sentinel-ai research project, which explores autonomous AI systems that can learn, adapt, and optimize themselves. Continuum is the runtime that makes those ideas real: it fine-tunes models on your data, measures which attention heads matter for your task, prunes the ones that don't, and quantizes the rest at variable precision to fit your hardware.

This model is proof that it works. A 32B coding model that normally requires 80GB+ VRAM, running locally on consumer hardware, generating correct code.

## How It Was Built

Continuum's adaptive compression pipeline:

1. **Score:** LoRA fine-tuning on coding tasks with gradient capture. Each attention head's contribution is measured by gradient flow through its gate projection.
2. **Prune:** 3 of 8 KV groups removed (37.5% KV cache reduction): 40 Q-heads / 8 KV-heads → 25 Q-heads / 5 KV-heads. Whole GQA groups are pruned to maintain architectural integrity.
3. **Quantize:** Q3_K_S (3.55 bits per weight): 62 GB BF16 → 13 GB GGUF.
4. **Verify:** Inference tested on an M1 Pro 32GB MacBook; it generates correct Python functions.
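The score-and-prune steps can be sketched as follows. This is a simplified illustration, not continuum's actual code: the per-head utilization scores stand in for captured gate-projection gradients (here they are just random numbers), and all names are hypothetical.

```python
import random

N_Q_HEADS, N_KV_GROUPS = 40, 8
HEADS_PER_GROUP = N_Q_HEADS // N_KV_GROUPS  # 5 Q-heads share each KV group
GROUPS_TO_PRUNE = 3

# Stand-in for utilization captured during LoRA fine-tuning:
# one score per Q-head, from gradient flow through its gate projection.
random.seed(0)
head_scores = [random.random() for _ in range(N_Q_HEADS)]

# Aggregate per-head scores into per-GQA-group scores, because whole
# groups must go together to keep K/V sharing architecturally consistent.
group_scores = [
    sum(head_scores[g * HEADS_PER_GROUP:(g + 1) * HEADS_PER_GROUP])
    for g in range(N_KV_GROUPS)
]

# Drop the lowest-utilization groups, keep the rest in original order.
ranked = sorted(range(N_KV_GROUPS), key=lambda g: group_scores[g])
kept_groups = sorted(ranked[GROUPS_TO_PRUNE:])

kept_q_heads = len(kept_groups) * HEADS_PER_GROUP      # 25
kv_cache_reduction = GROUPS_TO_PRUNE / N_KV_GROUPS     # 0.375
```

Pruning 3 of 8 groups yields exactly the 25Q/5KV layout and the 37.5% KV cache reduction quoted above.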

Future models from this pipeline will use mixed quantization: high-utilization layers get more precision (Q5_K/Q6_K), low-utilization layers get less (Q3_K). Same file size, better quality.
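The mixed-quantization idea can be sketched like this. It is a hypothetical allocation policy, not continuum's implementation; the bits-per-weight figures are approximate k-quant averages chosen so the example's mean stays constant.

```python
def allocate_bits(utilization, hi_bits=6.56, lo_bits=3.44, hi_fraction=0.5):
    """Rank layers by measured utilization; give the top fraction more
    precision and the rest less, keeping average bits-per-weight (and
    thus file size) roughly constant.  utilization: one score per layer."""
    n = len(utilization)
    order = sorted(range(n), key=lambda i: utilization[i], reverse=True)
    n_hi = int(n * hi_fraction)
    bits = [0.0] * n
    for rank, layer in enumerate(order):
        bits[layer] = hi_bits if rank < n_hi else lo_bits
    return bits

util = [0.9, 0.2, 0.7, 0.1]   # hypothetical utilization for 4 layers
bits = allocate_bits(util)    # layers 0 and 2 get the higher precision
avg = sum(bits) / len(bits)   # average stays at (6.56 + 3.44) / 2 = 5.0
```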

## Performance

| Metric | Value |
|---|---|
| Speed | 5.3 tok/s (M1 Pro 32GB, Metal) |
| Memory | 13.4 GB steady, 19.9 GB peak during load |
| Architecture | 64 layers, 25 Q-heads, 5 KV-heads, head_dim=128 |
| Quantization | Q3_K_S (3.55 BPW) |
| Context | 32,768 tokens |
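A back-of-envelope check on these numbers. The ~31B remaining-parameter figure is an assumption (most of a 32B model's weights sit in the MLP blocks, which head pruning does not touch); the KV cache estimate assumes 16-bit K/V storage.

```python
# Weight footprint: parameters x bits-per-weight.
params = 31e9                       # assumed parameters left after pruning
bpw = 3.55                          # Q3_K_S, from the table above
size_gb = params * bpw / 8 / 1e9    # ~13.8 GB, close to the 13.4 GB observed

# KV cache at full context: K and V tensors per layer,
# 5 KV heads x head_dim 128, 2 bytes per element (16-bit).
layers, kv_heads, head_dim, ctx = 64, 5, 128, 32768
kv_gb = 2 * layers * kv_heads * head_dim * ctx * 2 / 1e9   # ~5.4 GB
```

With the original 8 KV groups the same cache would be ~8.6 GB, which is where the 37.5% reduction pays off at long context.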

## Sample Output

**Prompt:** `def is_prime(n):`

```python
def is_prime(n):
    """Check if a number is prime."""
    if n <= 1:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True
```

**Prompt:** `def binary_search(arr, target):`

```python
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
```

## How to Run

### With Continuum (recommended)

Continuum natively supports compacted GGUF models with non-standard attention dimensions.

The model downloads automatically when needed; no HuggingFace account is required.

### With Other Engines

This model uses non-standard attention dimensions from head pruning (e.g., Q projection is [5120, 3200] instead of [5120, 5120]). Engines that validate tensor shapes against architecture defaults (llama.cpp, ollama) will reject it. Engines that load tensors by shape rather than metadata will work.
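A minimal sketch of the mismatch that trips strict loaders, using this model's dimensions. An engine that derives the expected Q-projection width from the architecture default (40 heads x head_dim 128) rejects the file; one that trusts the tensor's stored shape loads it. The function name is illustrative, not any engine's real API.

```python
hidden_size, head_dim = 5120, 128
default_q_heads, pruned_q_heads = 40, 25

default_q_out = default_q_heads * head_dim   # 5120: what strict engines expect
pruned_q_out = pruned_q_heads * head_dim     # 3200: what the file contains

def strict_loader_accepts(stored_out_dim):
    # Shape validation against the architecture default, as in llama.cpp/ollama.
    return stored_out_dim == default_q_out

accepts = strict_loader_accepts(pruned_q_out)   # False: the file is rejected
```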

## Make Your Own

The compression pipeline that built this model is open source. With continuum, you can compress any GQA model for any target device:

```bash
# Score: fine-tune + capture head utilization
./jtag genome/train --provider=runpod --captureGradients=true

# Compress: prune + quantize for your hardware
./jtag plasticity/compress \
  --capturePath=./capture \
  --modelPath=Qwen/Qwen2.5-Coder-32B-Instruct \
  --deviceSpec=32gb
```

Target any device: `16gb` (MacBook Air), `32gb` (MacBook Pro), `24gb-vram` (RTX 5090).

## Links

- **continuum**: local AI runtime with continuous learning, model compression, and autonomous personas.
- **sentinel-ai**: research project exploring autonomous AI optimization; the ideas behind adaptive compression originated here.
- **continuum-ai** on HuggingFace: more compressed models coming soon.

## License

Apache 2.0 (same as base model).

## Part of continuum

continuum is an open-source AI ecosystem where personas live, work, learn, and evolve on your hardware. Zero API keys required. AGPL-3.0.

Built on the research foundations of *Synthetic Citizens: AI Personas as Persistent, Evolving Entities* and *Plasticity Compaction: SOTA-to-COTS via MoE Expert Pruning*. Our core contribution is utilization-aware model surgery: runtime profiling determines exactly which components are active for a target domain, how much each contributes, and what precision each requires. MoE experts, attention heads, and weight precision are all targeted independently based on measured activation patterns rather than uniform heuristics. The result: SOTA models surgically reduced to fit consumer hardware with reasoning quality preserved.

Plasticity Compaction Paper | Get started
