---
base_model: deepseek-ai/DeepSeek-V3.2
library_name: transformers
tags:
- rotorquant
- kv-cache-quantization
- deepseek
- moe
- quantized
- text-generation
- mixture-of-experts
license: mit
pipeline_tag: text-generation
---
# DeepSeek-V3.2-RotorQuant
**KV cache quantization for DeepSeek-V3.2 using RotorQuant compression.**
This repository provides RotorQuant-compressed KV cache configurations for [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2), one of the most capable open-weight large language models available. RotorQuant delivers 5.3x faster prefill and 28% faster decode while matching or slightly improving baseline perplexity.
## Overview
| Attribute | Value |
|-----------|-------|
| Base model | [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) |
| Architecture | Mixture of Experts (MoE) |
| Total parameters | ~671B |
| Compression | RotorQuant KV cache |
| Perplexity | 6.91 (vs 7.07 baseline) |
| Prefill speedup | 5.3x |
| Decode speedup | 1.28x (28% faster) |
| License | MIT |
| Task | Text generation |
## Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from rotorquant import IsoQuantCache  # KV cache class from the RotorQuant package

model_id = "deepseek-ai/DeepSeek-V3.2"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype="auto",
)

# Apply RotorQuant KV cache compression. residual_length keeps the most
# recent tokens in full precision; older entries are quantized.
cache = IsoQuantCache(
    model,
    residual_length=128,
)

inputs = tokenizer("Explain mixture of experts architectures.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    past_key_values=cache,  # use the compressed cache during generation
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## What is RotorQuant?
[RotorQuant](https://github.com/scrya-com/rotorquant) is a KV cache quantization method that applies rotation-based transformations to compress the key-value cache during autoregressive generation. It achieves substantial speedups in both prefill and decode stages while actually improving perplexity in some configurations.
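The intuition behind rotate-then-quantize can be shown with a toy NumPy sketch. This is not RotorQuant's actual transform (which is defined in the linked repository); it uses a generic random orthogonal rotation to illustrate why rotating a vector with outlier channels before low-bit quantization reduces error: the rotation spreads the outlier's energy across all channels, so the quantization scale stays small.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int4(x):
    # Symmetric per-tensor 4-bit quantization: scale by max-abs, round,
    # clip to the int4 range, then dequantize back to floats.
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

# A key vector with a single outlier channel, as often seen in KV caches.
k = rng.normal(size=128)
k[5] = 20.0

# Random orthogonal rotation (a stand-in for RotorQuant's rotation;
# orthogonality means we can undo it exactly with Q.T).
Q, _ = np.linalg.qr(rng.normal(size=(128, 128)))

direct_err = np.linalg.norm(k - quantize_int4(k))
rotated_err = np.linalg.norm(k - Q.T @ quantize_int4(Q @ k))

print(f"direct int4 error:  {direct_err:.3f}")
print(f"rotated int4 error: {rotated_err:.3f}")
```

With the outlier present, the rotated path quantizes with a much smaller scale and ends up with noticeably lower reconstruction error than quantizing the raw vector directly.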
### Performance Comparison
| Metric | Baseline (FP16 KV) | RotorQuant |
|--------|-------------------|------------|
| Perplexity | 7.07 | 6.91 |
| Prefill speed | 1.0x | 5.3x |
| Decode speed | 1.0x | 1.28x |
| KV cache memory | 100% | Substantially reduced |
### Why KV Cache Compression Matters for DeepSeek-V3.2
DeepSeek-V3.2 is a massive 671B-parameter MoE model. At long context lengths, the KV cache can consume hundreds of gigabytes of memory. RotorQuant makes it feasible to serve this model with substantially reduced hardware requirements and faster throughput, especially for long-context workloads.
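A back-of-the-envelope calculation shows why the cache dominates at long context. The shape parameters below are illustrative standard multi-head attention numbers, not DeepSeek-V3.2's actual configuration (it uses Multi-head Latent Attention, which already compresses the per-token cache):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # Two tensors (K and V) per layer, each of shape
    # [batch, num_kv_heads, seq_len, head_dim].
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical large-model shapes at a 128K context window.
gib = kv_cache_bytes(num_layers=61, num_kv_heads=128, head_dim=128,
                     seq_len=128_000, batch=1) / 2**30
print(f"FP16 KV cache @ 128K context: ~{gib:.0f} GiB")
```

Even for a single sequence, a naive FP16 cache at these shapes lands in the hundreds of gigabytes, which is why cache compression is what makes long-context serving practical.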
## Memory Estimates
| Configuration | Approximate Size |
|---------------|-----------------|
| FP16 weights | ~1.3 TB |
| FP8 weights (base) | ~671 GB |
| KV cache (FP16, 128K context) | Very large -- scales with sequence length |
| KV cache (RotorQuant) | Substantial reduction vs FP16 cache |
Note: RotorQuant compresses the KV cache only. Model weights remain in their original precision. For weight quantization, see the MLX variants below.
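The weight figures in the table follow directly from the parameter count; a quick sanity check:

```python
# Weight memory is roughly parameter count times bytes per parameter.
params = 671e9  # ~671B total parameters

fp16_tb = params * 2 / 1e12  # FP16: 2 bytes per parameter
fp8_gb = params * 1 / 1e9    # FP8: 1 byte per parameter

print(f"FP16 weights: ~{fp16_tb:.2f} TB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
```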
## See Also
- [deepseek-ai/DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) -- Base model
- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant) -- Source code and documentation
- [majentik/DeepSeek-V3.2-TurboQuant](https://huggingface.co/majentik/DeepSeek-V3.2-TurboQuant) -- Alternative KV cache compression
- [majentik/DeepSeek-V3.2-RotorQuant-MLX-2bit](https://huggingface.co/majentik/DeepSeek-V3.2-RotorQuant-MLX-2bit) -- MLX 2-bit weight quant + RotorQuant
- [majentik/DeepSeek-V3.2-RotorQuant-MLX-1bit](https://huggingface.co/majentik/DeepSeek-V3.2-RotorQuant-MLX-1bit) -- MLX 1-bit weight quant + RotorQuant