DeepSeek-V3.2-TurboQuant

KV cache quantization for DeepSeek-V3.2 using TurboQuant compression.

This repository provides TurboQuant-compressed KV cache configurations for deepseek-ai/DeepSeek-V3.2, a 671B-parameter open-weight Mixture-of-Experts language model. TurboQuant substantially reduces KV cache memory during inference while preserving generation quality.

Overview

| Attribute | Value |
|---|---|
| Base model | deepseek-ai/DeepSeek-V3.2 |
| Architecture | Mixture of Experts (MoE) |
| Total parameters | ~671B |
| Compression | TurboQuant KV cache (4-bit default, 2-bit aggressive) |
| License | MIT |
| Task | Text generation |

Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model_id = "deepseek-ai/DeepSeek-V3.2"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype="auto",
)

# Apply TurboQuant KV cache compression
cache = TurboQuantCache(
    model,
    bits=4,          # 4-bit default; use bits=2 for aggressive compression
    residual_length=128,
)

inputs = tokenizer("Explain mixture of experts architectures.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    past_key_values=cache,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

What is TurboQuant?

TurboQuant (arXiv:2504.19874) is a KV cache quantization method that compresses the key-value cache used during autoregressive generation. Rather than quantizing model weights, TurboQuant targets the runtime memory consumed by the KV cache, which grows linearly with sequence length and becomes the dominant memory cost at long contexts.
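To build intuition for what low-bit KV caching does, here is a minimal sketch of per-token asymmetric uniform quantization applied to a cached key/value tensor. This is an illustration of the general idea only, not TurboQuant's actual algorithm (see the paper for the real scheme); the function names and shapes are assumptions.

```python
import numpy as np

def quantize_kv(x, bits=4):
    """Per-token asymmetric uniform quantization (illustrative sketch only,
    NOT TurboQuant's actual scheme). x: (seq_len, head_dim) float array.
    Returns integer codes plus per-token scale/offset for dequantization."""
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)            # per-token minimum
    hi = x.max(axis=-1, keepdims=True)            # per-token maximum
    scale = np.maximum(hi - lo, 1e-8) / levels    # step between quant levels
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo

def dequantize_kv(codes, scale, lo):
    # Map integer codes back to approximate float values for attention.
    return codes.astype(np.float32) * scale + lo

x = np.random.randn(16, 64).astype(np.float32)    # toy KV slab: 16 tokens
codes, scale, lo = quantize_kv(x, bits=4)
x_hat = dequantize_kv(codes, scale, lo)
err = np.abs(x - x_hat).max()                     # bounded by scale / 2
```

Storing `codes` (4 bits of information per element) plus a small amount of per-token metadata is what yields the memory savings; the dequantized values are used in the attention computation in place of the full-precision cache.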

Compression Modes

| Mode | KV Cache Bits | Quality Impact | Use Case |
|---|---|---|---|
| Default | 4-bit | Minimal degradation | General-purpose inference |
| Aggressive | 2-bit | Slight degradation | Memory-constrained deployments |
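Note that the achieved compression is slightly below the nominal 16/bits ratio, because quantized storage also carries per-group scale and zero-point metadata. The arithmetic below illustrates this; the group size and metadata width are assumptions for illustration, not TurboQuant's actual storage format.

```python
def effective_compression(bits, group_size=128, meta_bits=32, fp_bits=16):
    """Compression vs. an FP16 cache, counting amortized per-group
    scale/zero-point metadata. group_size and meta_bits are assumed
    values, not TurboQuant's actual format."""
    effective_bits = bits + meta_bits / group_size
    return fp_bits / effective_bits

ratio4 = effective_compression(4)   # slightly under the nominal 4x
ratio2 = effective_compression(2)   # somewhat under the nominal 8x
```

The metadata overhead matters more at lower bit widths, which is one reason 2-bit modes land further from their nominal ratio than 4-bit modes do.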

Why KV Cache Compression Matters for DeepSeek-V3.2

DeepSeek-V3.2 is a massive 671B-parameter MoE model. At long context lengths, the KV cache can consume hundreds of gigabytes of memory. TurboQuant makes it feasible to serve this model with substantially reduced hardware requirements, especially for long-context workloads.

Memory Estimates

| Configuration | Approximate Size |
|---|---|
| FP16 weights | ~1.3 TB |
| FP8 weights (base) | ~671 GB |
| KV cache (FP16, 128K context) | Scales linearly with sequence length; can reach hundreds of GB |
| KV cache (TurboQuant 4-bit) | ~4x smaller than the FP16 cache |
| KV cache (TurboQuant 2-bit) | ~8x smaller than the FP16 cache |

Note: TurboQuant compresses the KV cache only. Model weights remain in their original precision. For weight quantization, see the MLX variants below.
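The table's cache rows can be made concrete with rough arithmetic: the FP16 KV cache footprint is roughly tokens × layers × cached-width × 2 bytes. The layer count and per-layer cache width below are assumed round numbers for illustration, not confirmed DeepSeek-V3.2 dimensions (the model's own attention design may cache a smaller latent per token).

```python
def kv_cache_bytes(seq_len, num_layers, kv_dim_per_layer, bytes_per_elem):
    """Rough KV cache size: one cached vector of kv_dim_per_layer elements
    per token per layer. All dimensions here are illustrative assumptions,
    not confirmed DeepSeek-V3.2 values."""
    return seq_len * num_layers * kv_dim_per_layer * bytes_per_elem

# Hypothetical example: 61 layers, 4096-wide cache per layer, FP16 (2 B)
fp16 = kv_cache_bytes(128_000, 61, 4096, 2)   # per-sequence bytes at 128K
turbo4 = fp16 / 4                             # ~4x reduction with 4-bit
print(f"FP16: {fp16 / 2**30:.1f} GiB -> 4-bit: {turbo4 / 2**30:.1f} GiB")
```

Because the footprint is linear in `seq_len`, whatever the exact per-layer width, a 4x or 8x cache reduction translates directly into proportionally longer contexts (or more concurrent sequences) on the same hardware.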

See Also
