Instructions to use majentik/Qwen3.5-27B-RotorQuant-2bit with libraries, inference providers, notebooks, and local apps.

Libraries

How to use majentik/Qwen3.5-27B-RotorQuant-2bit with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="majentik/Qwen3.5-27B-RotorQuant-2bit")

# Load model directly
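from transformers import AutoModelForCausalLM
# AutoModelForCausalLM includes the language-modeling head that text generation needs
model = AutoModelForCausalLM.from_pretrained("majentik/Qwen3.5-27B-RotorQuant-2bit", dtype="auto")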


vLLM

How to use majentik/Qwen3.5-27B-RotorQuant-2bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "majentik/Qwen3.5-27B-RotorQuant-2bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/Qwen3.5-27B-RotorQuant-2bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
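
The same request can be issued from Python. The sketch below uses only the standard library and assumes a server is already listening at the address from the curl example; the helper names `build_completion_request` and `complete` are hypothetical and exist only for illustration. Passing a different `base_url` (for example port 30000) targets another OpenAI-compatible server.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for the /v1/completions endpoint (hypothetical helper)."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request, timeout=60):
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


request = build_completion_request(
    "http://localhost:8000",
    "majentik/Qwen3.5-27B-RotorQuant-2bit",
    "Once upon a time,",
)
# Uncomment once a server is running at the address above:
# print(complete(request))
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.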

Use Docker

docker model run hf.co/majentik/Qwen3.5-27B-RotorQuant-2bit

SGLang

How to use majentik/Qwen3.5-27B-RotorQuant-2bit with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "majentik/Qwen3.5-27B-RotorQuant-2bit" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/Qwen3.5-27B-RotorQuant-2bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "majentik/Qwen3.5-27B-RotorQuant-2bit" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/Qwen3.5-27B-RotorQuant-2bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Qwen3.5-27B-RotorQuant-2bit

2-bit KV cache compression for Qwen/Qwen3.5-27B using RotorQuant.

This is a KV-cache-only repository. It contains no model weight files, only the configuration and model card for applying RotorQuant 2-bit KV cache quantization at runtime to the original Qwen3.5-27B weights.

Overview

Qwen3.5-27B is a 27B-parameter hybrid transformer with 262K native context and built-in thinking mode (the model generates internal reasoning tokens before answering). Thinking mode makes KV cache compression especially valuable, since the reasoning chain can consume substantial cache memory.

RotorQuant applies rotation-based isotropic quantization to the KV cache, achieving better quality and speed than standard quantization approaches at the same bit width.
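
To make the general idea of rotating before quantizing concrete, here is a minimal sketch in plain Python. It is not RotorQuant's actual algorithm, which this card does not describe in detail. As stand-ins it uses a normalized Walsh-Hadamard transform for the rotation and simple min/max 2-bit quantization, and every function name is hypothetical. The synthetic vector with one large channel is also an assumption, chosen only to show why spreading energy across coordinates can matter.

```python
import math
import random


def hadamard_rotate(vec):
    """Normalized fast Walsh-Hadamard transform: orthogonal and its own inverse."""
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def quantize_2bit(vec):
    """Uniform min/max quantization to four levels (codes 0..3)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 3 or 1.0  # avoid a zero step for constant vectors
    codes = [min(3, max(0, round((v - lo) / scale))) for v in vec]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    return [lo + c * scale for c in codes]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def roundtrip_plain(vec):
    """Quantize and dequantize directly."""
    return dequantize(*quantize_2bit(vec))


def roundtrip_rotated(vec):
    """Rotate, quantize, dequantize, then apply the inverse rotation."""
    restored = dequantize(*quantize_2bit(hadamard_rotate(vec)))
    return hadamard_rotate(restored)


rng = random.Random(0)
key = [rng.gauss(0.0, 1.0) for _ in range(64)]  # synthetic stand-in for one cached key vector
key[5] = 12.0  # one large channel stretches the min/max range

print("plain MSE:  ", mse(key, roundtrip_plain(key)))
print("rotated MSE:", mse(key, roundtrip_rotated(key)))
```

The two printed errors depend on the synthetic data and are not a reproduction of any benchmark figure in this card.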

RotorQuant Advantages

Metric	RotorQuant 2-bit	Standard 2-bit
Prefill speed	5.3x faster	Baseline
Decode speed	28% faster	Baseline
Perplexity	6.91	7.07

RotorQuant achieves lower perplexity (better quality) while also running faster, a rare combination at aggressive quantization levels.
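
The figures in the table can be put on a common scale with a few lines of arithmetic. This sketch assumes that "5.3x faster" and "28% faster" both describe throughput relative to the standard 2-bit baseline; the card does not state this explicitly, so treat the time fractions as an interpretation.

```python
# Values copied from the comparison table above.
rotor_ppl = 6.91
standard_ppl = 7.07
prefill_speedup = 5.3   # assumed to mean 5.3x baseline throughput
decode_speedup = 1.28   # "28% faster", assumed to mean 1.28x baseline throughput

# Relative perplexity reduction versus the standard 2-bit baseline.
ppl_reduction = (standard_ppl - rotor_ppl) / standard_ppl

# Fraction of baseline wall-clock time under the throughput interpretation.
prefill_time_fraction = 1 / prefill_speedup
decode_time_fraction = 1 / decode_speedup

print(f"perplexity reduction: {ppl_reduction:.1%}")
print(f"prefill time vs baseline: {prefill_time_fraction:.2f}")
print(f"decode time vs baseline: {decode_time_fraction:.2f}")
```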

Specifications

Property	Value
Base model	Qwen/Qwen3.5-27B
Parameters	27B
Architecture	Hybrid Transformer
Native context	262,144 tokens
Thinking mode	Yes
KV cache method	RotorQuant 2-bit (IsoQuant)
KV cache compression	~10x vs FP16
Weights	Original (FP16/BF16, loaded separately)

Memory Estimates

Component	Estimate
Model weights (BF16)	~54 GB
KV cache at 128K context (2-bit RotorQuant)	~1.3 GB
KV cache at 128K context (FP16, baseline)	~12.8 GB
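
The estimates above can be cross-checked with simple arithmetic. The sketch below assumes 1 GB = 10^9 bytes, 128K = 131,072 tokens, and a KV cache that grows linearly with the number of cached tokens; none of these is stated in the card, and the last may not hold exactly for a hybrid architecture. The 2,048-token figure is taken from `max_new_tokens` in the Quickstart and stands in for a long reasoning chain.

```python
# Weights: 27B parameters at 2 bytes each (BF16).
PARAMS = 27e9
BYTES_PER_BF16 = 2
weights_gb = PARAMS * BYTES_PER_BF16 / 1e9

# KV cache figures copied from the table above.
CONTEXT_TOKENS = 128 * 1024
kv_fp16_gb = 12.8
kv_rotor_gb = 1.3

# Compression ratio implied by the two table entries.
compression_ratio = kv_fp16_gb / kv_rotor_gb

# Per-token cache cost, assuming linear growth with context length.
fp16_bytes_per_token = kv_fp16_gb * 1e9 / CONTEXT_TOKENS
rotor_bytes_per_token = kv_rotor_gb * 1e9 / CONTEXT_TOKENS

# Cache consumed by a 2,048-token generation under each format.
GENERATED_TOKENS = 2048
fp16_mb = fp16_bytes_per_token * GENERATED_TOKENS / 1e6
rotor_mb = rotor_bytes_per_token * GENERATED_TOKENS / 1e6

print(f"weights: {weights_gb:.1f} GB")
print(f"implied KV compression: {compression_ratio:.2f}x")
print(f"total at 128K: {weights_gb + kv_rotor_gb:.1f} GB vs {weights_gb + kv_fp16_gb:.1f} GB")
print(f"2,048 generated tokens: {rotor_mb:.1f} MB vs {fp16_mb:.1f} MB")
```

Under these assumptions the weights dominate total memory at 128K context, so the cache savings matter most when the weights already fit and the remaining headroom decides how long a context can be served.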

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import IsoQuantCache

model_id = "Qwen/Qwen3.5-27B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Apply 2-bit RotorQuant KV cache compression
cache = IsoQuantCache(bits=2)

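messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
# add_generation_prompt=True appends the assistant turn marker so the model answers
# rather than continuing the user message; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    past_key_values=cache,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))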

Quality Notes

2-bit is aggressive quantization, but RotorQuant's rotation-based approach preserves more quality than standard methods (perplexity 6.91 vs 7.07).
Best suited for memory-constrained scenarios where fitting long-context inference on limited hardware is essential.
For higher quality with moderate compression, consider 4-bit KV cache variants.
Thinking mode reasoning quality may be more sensitive to cache quantization since the model relies on cached reasoning tokens for its final answer.

References

RotorQuant — Rotation-based isotropic KV cache quantization
Qwen3.5-27B base model

Variants in this family

(Showing 16 sibling variants under majentik/qwen3.5-27b-*. The current variant is RotorQuant-2bit.)

Variant	Runtime	Approx size	Use case
RotorQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
RotorQuant-2bit	transformers	n/a	2-bit KV cache config (no weight files)
RotorQuant-GGUF-IQ4_XS	llama.cpp	~23 GB	Lossy 4-bit, low-RAM CPU/edge
RotorQuant-GGUF-Q2_K	llama.cpp	~16 GB	Lossy, low-RAM CPU/edge
RotorQuant-GGUF-Q3_K_M	llama.cpp	~21 GB	Smaller 3-bit, CPU-friendly
RotorQuant-GGUF-Q4_K_M	llama.cpp	~30 GB	Balanced default
RotorQuant-GGUF-Q5_K_M	llama.cpp	~36 GB	Higher fidelity, more RAM
RotorQuant-GGUF-Q8_0	llama.cpp	~57 GB	Near-lossless reference
RotorQuant-MLX-2bit	mlx-lm	~8.6 GB	Apple Silicon, smallest
RotorQuant-MLX-4bit	mlx-lm	~17 GB	Apple Silicon balanced
RotorQuant-MLX-8bit	mlx-lm	~32 GB	Apple Silicon reference
TurboQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
TurboQuant-2bit	transformers	n/a	Standalone 2-bit weights
TurboQuant-MLX-2bit	mlx-lm	~8.6 GB	Apple Silicon, smallest
TurboQuant-MLX-4bit	mlx-lm	~17 GB	Apple Silicon balanced
TurboQuant-MLX-8bit	mlx-lm	~32 GB	Apple Silicon reference
