---
base_model: deepseek-ai/DeepSeek-V3.2
library_name: transformers
tags:
  - rotorquant
  - kv-cache-quantization
  - deepseek
  - moe
  - quantized
  - text-generation
  - mixture-of-experts
license: mit
pipeline_tag: text-generation
---

# DeepSeek-V3.2-RotorQuant

**KV cache quantization for DeepSeek-V3.2 using RotorQuant compression.**

This repository provides RotorQuant-compressed KV cache configurations for [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2), one of the most capable open-weight large language models available. RotorQuant achieves 5.3x faster prefill and 28% faster decode while maintaining near-lossless quality.

## Overview

| Attribute | Value |
|---|---|
| Base model | deepseek-ai/DeepSeek-V3.2 |
| Architecture | Mixture of Experts (MoE) |
| Total parameters | ~671B |
| Compression | RotorQuant KV cache |
| Perplexity | 6.91 (vs. 7.07 baseline) |
| Prefill speedup | 5.3x |
| Decode speedup | 1.28x (28% faster) |
| License | MIT |
| Task | Text generation |

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import IsoQuantCache

model_id = "deepseek-ai/DeepSeek-V3.2"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype="auto",
)

# Apply RotorQuant KV cache compression
cache = IsoQuantCache(
    model,
    residual_length=128,
)

inputs = tokenizer("Explain mixture of experts architectures.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    past_key_values=cache,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## What is RotorQuant?

RotorQuant is a KV cache quantization method that applies rotation-based transformations to compress the key-value cache during autoregressive generation. It achieves substantial speedups in both the prefill and decode stages, and in some configurations even improves perplexity over the FP16 baseline.
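To illustrate the general rotate-then-quantize idea (this NumPy sketch is NOT the actual RotorQuant implementation, whose rotation design and bit layout are not described here), the snippet below applies a random orthogonal rotation to KV rows to spread outlier channels before symmetric int8 quantization, then reverses both steps:

```python
import numpy as np

def random_rotation(dim, seed=0):
    # Random orthogonal matrix via QR decomposition -- a stand-in for a
    # structured rotation such as a Hadamard transform.
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def quantize_int8(x):
    # Per-row symmetric int8 quantization.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def rotate_quantize(kv, rot):
    # Rotate first so outlier energy is spread across channels,
    # which keeps the per-row quantization scale small.
    return quantize_int8(kv @ rot)

def dequantize(q, scale, rot):
    # Dequantize, then undo the rotation (rot is orthogonal: inverse = transpose).
    return (q.astype(np.float32) * scale) @ rot.T

dim = 64
rot = random_rotation(dim)
kv = np.random.default_rng(1).standard_normal((8, dim)).astype(np.float32)
kv[:, 3] *= 50.0  # simulate an outlier channel, common in KV activations

q, scale = rotate_quantize(kv, rot)
recovered = dequantize(q, scale, rot)
err = np.abs(recovered - kv).max()
```

Because the rotation is orthogonal, it preserves inner products, so attention scores computed against the dequantized cache stay close to the full-precision result.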

## Performance Comparison

| Metric | Baseline (FP16 KV) | RotorQuant |
|---|---|---|
| Perplexity | 7.07 | 6.91 |
| Prefill speed | 1.0x | 5.3x |
| Decode speed | 1.0x | 1.28x |
| KV cache memory | 100% | Substantially reduced |

## Why KV Cache Compression Matters for DeepSeek-V3.2

DeepSeek-V3.2 is a massive 671B-parameter MoE model. At long context lengths, the KV cache can consume hundreds of gigabytes of memory. RotorQuant makes it feasible to serve this model with substantially reduced hardware requirements and faster throughput, especially for long-context workloads.
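The scale of that cost can be estimated with simple arithmetic. The dimensions below are illustrative round numbers, not the actual DeepSeek-V3.2 attention configuration (DeepSeek's models use Multi-head Latent Attention, which stores a compressed latent rather than full K/V heads), so treat the result as a back-of-the-envelope figure for a conventional full-KV layout:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem):
    # Two tensors (K and V) per layer, each of shape
    # (seq_len, num_kv_heads, head_dim).
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical dimensions for a 61-layer model with full multi-head KV,
# FP16 elements (2 bytes), at a 128K-token context.
gb = kv_cache_bytes(
    num_layers=61,
    num_kv_heads=128,
    head_dim=128,
    seq_len=131_072,
    bytes_per_elem=2,
) / 1e9
print(f"{gb:.0f} GB")  # roughly 524 GB for a single sequence
```

Even allowing for large differences in the real attention layout, the cache grows linearly with sequence length, which is why compressing it pays off most on long-context workloads.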

## Memory Estimates

| Configuration | Approximate Size |
|---|---|
| FP16 weights | ~1.3 TB |
| FP8 weights (base) | ~671 GB |
| KV cache (FP16, 128K context) | Very large; scales linearly with sequence length |
| KV cache (RotorQuant) | Substantial reduction vs. the FP16 cache |

> **Note:** RotorQuant compresses the KV cache only. Model weights remain in their original precision. For weight quantization, see the MLX variants below.

## See Also