	---
	library_name: transformers
	license: apache-2.0
	base_model: Qwen/Qwen3.5-27B
	tags:
	- rotorquant
	- 2bit
	- kv-cache
	- quantized
	- qwen3.5
	- thinking
	language:
	- en
	pipeline_tag: text-generation
	---

	# Qwen3.5-27B-RotorQuant-2bit

	2-bit KV cache compression for [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) using [RotorQuant](https://github.com/scrya-com/rotorquant).

	> This is a KV-cache-only repository. It contains no model weight files, only the configuration and model card needed to apply RotorQuant 2-bit KV cache quantization at runtime to the original Qwen3.5-27B weights.

	## Overview

	Qwen3.5-27B is a 27B-parameter hybrid transformer with 262K native context and built-in thinking mode (the model generates internal reasoning tokens before answering). Thinking mode makes KV cache compression especially valuable, since the reasoning chain can consume substantial cache memory.

	RotorQuant applies rotation-based isotropic quantization to the KV cache. At the same bit width, it reports lower perplexity and faster prefill and decode than a standard 2-bit baseline.
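To make "rotation-based quantization" concrete, here is a toy sketch of the general idea: rotate a vector, quantize it to 2 bits, then dequantize and rotate back. It is not RotorQuant's actual algorithm or API. The helper names (`rotate_pairs`, `quantize`, `dequantize`), the fixed rotation angle, and the min/max uniform quantizer are all assumptions chosen for illustration.

```python
import math
import random


def rotate_pairs(vec, angle):
    """Rotate consecutive coordinate pairs (x, y) by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        out.extend((c * x - s * y, s * x + c * y))
    return out


def quantize(vec, bits=2):
    """Uniform min/max quantization to 2**bits levels."""
    levels = 2 ** bits - 1
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / levels or 1.0  # avoid a zero step for constant input
    codes = [round((v - lo) / step) for v in vec]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Map integer codes back to approximate float values."""
    return [lo + c * step for c in codes]


random.seed(0)
key = [random.gauss(0.0, 1.0) for _ in range(8)]  # stand-in for one cached key vector
angle = math.pi / 4  # hypothetical fixed rotation

rotated = rotate_pairs(key, angle)
codes, lo, step = quantize(rotated, bits=2)  # each code fits in 2 bits (0..3)
restored = rotate_pairs(dequantize(codes, lo, step), -angle)

max_err = max(abs(a - b) for a, b in zip(key, restored))
print(codes, round(max_err, 3))
```

Because a rotation preserves vector length, the error introduced in the rotated space carries back unchanged in magnitude; only the stored codes plus the offset and step need to be kept in the cache.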

	### RotorQuant Advantages

	| Metric | RotorQuant 2-bit | Standard 2-bit |
	|---|---|---|
	| Prefill speed | 5.3x faster | Baseline |
	| Decode speed | 28% faster | Baseline |
	| Perplexity | 6.91 | 7.07 |

	RotorQuant achieves lower perplexity (better quality) while also being faster — a rare combination at aggressive quantization levels.

	## Specifications

	| Property | Value |
	|---|---|
	| Base model | Qwen/Qwen3.5-27B |
	| Parameters | 27B |
	| Architecture | Hybrid Transformer |
	| Native context | 262,144 tokens |
	| Thinking mode | Yes |
	| KV cache method | RotorQuant 2-bit (IsoQuant) |
	| KV cache compression | ~10x vs FP16 |
	| Weights | Original (FP16/BF16, loaded separately) |

	## Memory Estimates

	| Component | Estimate |
	|---|---|
	| Model weights (BF16) | ~54 GB |
	| KV cache at 128K context (2-bit RotorQuant) | ~1.3 GB |
	| KV cache at 128K context (FP16, baseline) | ~12.8 GB |
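The arithmetic behind these estimates can be checked with a few lines of Python. The sketch below uses only the figures from the table and makes three assumptions that the card does not state: 1 GB is 10^9 bytes, "128K" means 131,072 tokens, and cache size grows linearly with token count (a hybrid architecture may keep fixed-size state in some layers, so treat the per-token figures as rough).

```python
GB = 1e9                     # assumption: decimal gigabytes
CONTEXT_TOKENS = 128 * 1024  # assumption: "128K" = 131,072 tokens

# Weights: 27B parameters at 2 bytes each (BF16).
weights_gb = 27e9 * 2 / GB

# KV cache estimates copied from the table above.
fp16_cache_gb = 12.8
rq2_cache_gb = 1.3

table_ratio = fp16_cache_gb / rq2_cache_gb  # ratio implied by the table
nominal_ratio = 16 / 2                      # ratio of bit widths alone

# Cost of a 2,048-token reasoning chain under the linear-scaling assumption.
thinking_tokens = 2048
fp16_thinking_mb = fp16_cache_gb * GB / CONTEXT_TOKENS * thinking_tokens / 1e6
rq2_thinking_mb = rq2_cache_gb * GB / CONTEXT_TOKENS * thinking_tokens / 1e6

print(f"weights: {weights_gb:.0f} GB")
print(f"table ratio: {table_ratio:.1f}x, bit-width ratio: {nominal_ratio:.0f}x")
print(f"2,048 thinking tokens: {fp16_thinking_mb:.0f} MB FP16, {rq2_thinking_mb:.1f} MB 2-bit")
```

The table figures imply a ratio of about 9.8x, in line with the "~10x" in Specifications, while the bit widths alone give 8x. The card does not say how the estimates were derived.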

	## Quickstart

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from turboquant import IsoQuantCache

	model_id = "Qwen/Qwen3.5-27B"

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

	# Apply 2-bit RotorQuant KV cache compression
	cache = IsoQuantCache(bits=2)

	messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
	# add_generation_prompt=True appends the assistant turn header so the model
	# answers the question; return_dict=True returns input_ids and attention_mask.
	inputs = tokenizer.apply_chat_template(
	    messages,
	    add_generation_prompt=True,
	    return_dict=True,
	    return_tensors="pt",
	).to(model.device)

	outputs = model.generate(
	    **inputs,
	    max_new_tokens=2048,
	    past_key_values=cache,
	)
	# outputs[0] contains the prompt tokens followed by the generated tokens
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```
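Because the model emits reasoning before its answer, it can be handy to separate the two in the decoded text. The helper below is a minimal sketch, assuming the reasoning is delimited by `<think>` and `</think>` markers that survive decoding. That is the convention in earlier Qwen3 releases and is not confirmed here for Qwen3.5. The function name `split_thinking` is hypothetical.

```python
def split_thinking(text, close_tag="</think>"):
    """Return (reasoning, answer); reasoning is '' when no close tag is found."""
    head, sep, tail = text.rpartition(close_tag)
    if not sep:
        return "", text.strip()
    # Keep only what follows the opening tag, if one is present.
    reasoning = head.split("<think>", 1)[-1].strip()
    return reasoning, tail.strip()


sample = "<think>\nStart from the prime counting function.\n</think>\n\nIt is a conjecture about zeros."
reasoning, answer = split_thinking(sample)
print(answer)
```

If the markers are absent, the whole text is returned as the answer, so the helper is safe to call on output from a non-thinking run.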

	## Quality Notes

	- 2-bit is aggressive quantization, but RotorQuant's rotation-based approach preserves more quality than standard methods (perplexity 6.91 vs 7.07).
	- Best suited for memory-constrained scenarios where fitting long-context inference on limited hardware is essential.
	- For higher quality with moderate compression, consider 4-bit KV cache variants.
	- Thinking mode reasoning quality may be more sensitive to cache quantization since the model relies on cached reasoning tokens for its final answer.

	## References

	- [RotorQuant](https://github.com/scrya-com/rotorquant) — Rotation-based isotropic KV cache quantization
	- [Qwen3.5-27B base model](https://huggingface.co/Qwen/Qwen3.5-27B)

	## See Also

	- [majentik/Qwen3.5-27B-TurboQuant-2bit](https://huggingface.co/majentik/Qwen3.5-27B-TurboQuant-2bit) — TurboQuant 2-bit KV cache variant
	- [majentik/Qwen3.5-27B-TurboQuant-MLX-2bit](https://huggingface.co/majentik/Qwen3.5-27B-TurboQuant-MLX-2bit) — MLX 2-bit weights + TurboQuant KV cache
	- [majentik/Qwen3.5-27B-RotorQuant-MLX-2bit](https://huggingface.co/majentik/Qwen3.5-27B-RotorQuant-MLX-2bit) — MLX 2-bit weights + RotorQuant KV cache