---
base_model: deepseek-ai/DeepSeek-V3.2
library_name: transformers
tags:
  - rotorquant
  - kv-cache-quantization
  - deepseek
  - moe
  - quantized
  - text-generation
  - mixture-of-experts
license: mit
pipeline_tag: text-generation
---

# DeepSeek-V3.2-RotorQuant

**KV cache quantization for DeepSeek-V3.2 using RotorQuant compression.**

This repository provides RotorQuant-compressed KV cache configurations for [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2), one of the most capable open-weight large language models available. RotorQuant achieves 5.3x faster prefill and 28% faster decode while maintaining near-lossless quality.

## Overview

| Attribute | Value |
|---|---|
| Base model | deepseek-ai/DeepSeek-V3.2 |
| Architecture | Mixture of Experts (MoE) |
| Total parameters | ~671B |
| Compression | RotorQuant KV cache |
| Perplexity | 6.91 (vs. 7.07 baseline) |
| Prefill speedup | 5.3x |
| Decode speedup | 1.28x (28% faster) |
| License | MIT |
| Task | Text generation |

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import IsoQuantCache

model_id = "deepseek-ai/DeepSeek-V3.2"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype="auto",
)

# Apply RotorQuant KV cache compression
cache = IsoQuantCache(
    model,
    residual_length=128,
)

inputs = tokenizer("Explain mixture of experts architectures.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    past_key_values=cache,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## What is RotorQuant?

RotorQuant is a KV cache quantization method that applies rotation-based transformations to compress the key-value cache during autoregressive generation. It achieves substantial speedups in both the prefill and decode stages, and in some configurations even improves perplexity over the FP16 baseline.
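To illustrate the general rotate-then-quantize idea (this NumPy sketch is NOT the actual RotorQuant implementation, whose rotation design and bit layout are not described here), the snippet below applies a random orthogonal rotation to KV rows to spread outlier channels before symmetric int8 quantization, then reverses both steps:

```python
import numpy as np

def random_rotation(dim, seed=0):
    # Random orthogonal matrix via QR decomposition -- a stand-in for a
    # structured rotation such as a Hadamard transform.
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def quantize_int8(x):
    # Per-row symmetric int8 quantization.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def rotate_quantize(kv, rot):
    # Rotate first so outlier energy is spread across channels,
    # which keeps the per-row quantization scale small.
    return quantize_int8(kv @ rot)

def dequantize(q, scale, rot):
    # Dequantize, then undo the rotation (rot is orthogonal: inverse = transpose).
    return (q.astype(np.float32) * scale) @ rot.T

dim = 64
rot = random_rotation(dim)
kv = np.random.default_rng(1).standard_normal((8, dim)).astype(np.float32)
kv[:, 3] *= 50.0  # simulate an outlier channel, common in KV activations

q, scale = rotate_quantize(kv, rot)
recovered = dequantize(q, scale, rot)
err = np.abs(recovered - kv).max()
```

Because the rotation is orthogonal, it preserves inner products, so attention scores computed against the dequantized cache stay close to the full-precision result.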

## Performance Comparison

| Metric | Baseline (FP16 KV) | RotorQuant |
|---|---|---|
| Perplexity | 7.07 | 6.91 |
| Prefill speed | 1.0x | 5.3x |
| Decode speed | 1.0x | 1.28x |
| KV cache memory | 100% | Substantially reduced |

## Why KV Cache Compression Matters for DeepSeek-V3.2

DeepSeek-V3.2 is a massive 671B-parameter MoE model. At long context lengths, the KV cache can consume hundreds of gigabytes of memory. RotorQuant makes it feasible to serve this model with substantially reduced hardware requirements and faster throughput, especially for long-context workloads.
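The scale of that cost can be estimated with simple arithmetic. The dimensions below are illustrative round numbers, not the actual DeepSeek-V3.2 attention configuration (DeepSeek's models use Multi-head Latent Attention, which stores a compressed latent rather than full K/V heads), so treat the result as a back-of-the-envelope figure for a conventional full-KV layout:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    # Two tensors (K and V) per layer, each of shape
    # (seq_len, num_kv_heads, head_dim).
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical dimensions for a 61-layer model with full multi-head KV,
# FP16 elements (2 bytes), at a 128K-token context.
gb = kv_cache_bytes(
    num_layers=61,
    num_kv_heads=128,
    head_dim=128,
    seq_len=131_072,
    bytes_per_elem=2,
) / 1e9
print(f"{gb:.0f} GB")  # roughly 524 GB for a single sequence
```

Even allowing for large differences in the real attention layout, the cache grows linearly with sequence length, which is why compressing it pays off most on long-context workloads.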

## Memory Estimates

| Configuration | Approximate Size |
|---|---|
| FP16 weights | ~1.3 TB |
| FP8 weights (base) | ~671 GB |
| KV cache (FP16, 128K context) | Very large; scales linearly with sequence length |
| KV cache (RotorQuant) | Substantial reduction vs. the FP16 cache |

> **Note:** RotorQuant compresses the KV cache only. Model weights remain in their original precision. For weight quantization, see the MLX variants below.

## See Also