Instructions to use majentik/Qwen3.5-27B-TurboQuant-2bit with libraries, inference providers, notebooks, and local apps.

Libraries

How to use majentik/Qwen3.5-27B-TurboQuant-2bit with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="majentik/Qwen3.5-27B-TurboQuant-2bit")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("majentik/Qwen3.5-27B-TurboQuant-2bit", dtype="auto")

vLLM

How to use majentik/Qwen3.5-27B-TurboQuant-2bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "majentik/Qwen3.5-27B-TurboQuant-2bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/Qwen3.5-27B-TurboQuant-2bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/majentik/Qwen3.5-27B-TurboQuant-2bit

SGLang

How to use majentik/Qwen3.5-27B-TurboQuant-2bit with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "majentik/Qwen3.5-27B-TurboQuant-2bit" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/Qwen3.5-27B-TurboQuant-2bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "majentik/Qwen3.5-27B-TurboQuant-2bit" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "majentik/Qwen3.5-27B-TurboQuant-2bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use majentik/Qwen3.5-27B-TurboQuant-2bit with Docker Model Runner:
```
docker model run hf.co/majentik/Qwen3.5-27B-TurboQuant-2bit
```

Qwen3.5-27B-TurboQuant-2bit

2-bit KV cache compression for Qwen/Qwen3.5-27B using TurboQuant.

This is a KV-cache-only repository. It contains no model weight files — only the configuration and model card for applying TurboQuant 2-bit KV cache quantization at runtime on the original Qwen3.5-27B weights.

Overview

Qwen3.5-27B is a 27B-parameter hybrid transformer with 262K native context and built-in thinking mode (the model generates internal reasoning tokens before answering). Thinking mode makes KV cache compression especially valuable, since the reasoning chain can consume substantial cache memory.

TurboQuant 2-bit compresses the KV cache by approximately 8x compared to FP16 (16 bits per cached value down to 2), which cuts the cache from ~12.8 GB to ~1.6 GB at 128K context. This is aggressive quantization: expect some quality degradation compared to 4-bit, in exchange for inference in memory-constrained environments where a 4-bit KV cache would not fit.
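
The ~8x figure follows from the bit widths alone. A minimal sketch of that arithmetic, assuming each cached value shrinks from 16 bits to the quantized width and ignoring any per-block metadata (scales, codebooks) a real implementation may store. The helper name `kv_compression_ratio` is hypothetical and not part of TurboQuant:

```python
def kv_compression_ratio(baseline_bits: int, quant_bits: int) -> float:
    """Idealized compression ratio: bits per cached value before vs. after."""
    if baseline_bits <= 0 or quant_bits <= 0:
        raise ValueError("bit widths must be positive")
    return baseline_bits / quant_bits


# Compare a few KV cache bit widths against the FP16 baseline (16 bits).
for bits in (8, 4, 2):
    print(f"{bits}-bit vs FP16: {kv_compression_ratio(16, bits):.0f}x")
```

In practice the achieved ratio is likely somewhat below this idealized value, which is why the card says "approximately".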

Specifications

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-27B |
| Parameters | 27B |
| Architecture | Hybrid Transformer |
| Native context | 262,144 tokens |
| Thinking mode | Yes |
| KV cache method | TurboQuant 2-bit |
| KV cache compression | ~8x vs FP16 |
| Weights | Original (FP16/BF16, loaded separately) |

Memory Estimates

| Component | Estimate |
|---|---|
| Model weights (BF16) | ~54 GB |
| KV cache at 128K context (2-bit) | ~1.6 GB |
| KV cache at 128K context (FP16, baseline) | ~12.8 GB |
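
These estimates can be extrapolated to other context lengths. The sketch below is a rough calculation, not a measurement: it assumes "128K" means 131,072 tokens and that the cache grows linearly with both token count and bits per value. That is a simplification for a hybrid architecture. The constant and function names are hypothetical.

```python
WEIGHTS_GB = 54.0               # model weights in BF16, from the table above
FP16_CACHE_GB_AT_128K = 12.8    # FP16 KV cache at 128K context, from the table above
REFERENCE_TOKENS = 128 * 1024   # assumption: "128K" means 131,072 tokens


def kv_cache_gb(context_tokens: int, bits: int = 2) -> float:
    """Estimate KV cache size, assuming linear growth in tokens and bits."""
    fp16_gb = FP16_CACHE_GB_AT_128K * context_tokens / REFERENCE_TOKENS
    return fp16_gb * bits / 16


def total_gb(context_tokens: int, bits: int = 2) -> float:
    """Weights plus KV cache; activations and runtime overhead are not included."""
    return WEIGHTS_GB + kv_cache_gb(context_tokens, bits)


for tokens in (32_768, 131_072, 262_144):
    cache = kv_cache_gb(tokens)
    print(f"{tokens:>7,} tokens: 2-bit cache ~{cache:.1f} GB, total ~{total_gb(tokens):.1f} GB")
```

Under these assumptions the weights dominate the budget at every context length, so the 2-bit cache mainly helps when the FP16 cache is what pushes a deployment over its memory limit.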

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model_id = "Qwen/Qwen3.5-27B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Apply 2-bit KV cache compression
cache = TurboQuantCache(bits=2)

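messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
# add_generation_prompt=True appends the assistant turn marker so the model answers
# instead of continuing the user message; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    past_key_values=cache,
)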
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Quality Notes

2-bit is aggressive quantization. It is best suited for memory-constrained scenarios (e.g., fitting long-context inference on a single GPU).
For higher quality with moderate compression, consider 4-bit KV cache variants.
Thinking mode reasoning quality may be more sensitive to cache quantization since the model relies on cached reasoning tokens for its final answer.
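
To get a feel for why 2-bit is lossier than 4-bit, the sketch below round-trips a random vector through a plain uniform min-max quantizer at several bit widths. This is not TurboQuant's algorithm, and the numbers say nothing about its actual accuracy. It only illustrates how reconstruction error grows as the number of quantization levels shrinks. All names are hypothetical.

```python
import random


def quantize_dequantize(values, bits):
    """Round-trip values through a uniform quantizer with 2**bits levels."""
    levels = 2 ** bits
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant input: nothing to quantize
    step = (hi - lo) / (levels - 1)
    # Snap each value to the nearest of the evenly spaced levels.
    return [lo + round((v - lo) / step) * step for v in values]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so runs are repeatable
vector = [rng.gauss(0.0, 1.0) for _ in range(4096)]

for bits in (8, 4, 2):
    err = mse(vector, quantize_dequantize(vector, bits))
    print(f"{bits}-bit round-trip MSE: {err:.6f}")
```

With only four levels at 2-bit, each value is snapped to a much coarser grid than at 4-bit, which is the basic reason the notes above recommend 4-bit variants when quality matters more than memory.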

References

Variants in this family

(Showing 16 sibling variants under majentik/qwen3.5-27b-*. The current variant, TurboQuant-2bit, is bolded.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-2bit | transformers | n/a | Standalone 2-bit weights |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~23 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~16 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~21 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~30 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~36 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~57 GB | Near-lossless reference |
| RotorQuant-MLX-2bit | mlx-lm | ~8.6 GB | Apple Silicon, smallest |
| RotorQuant-MLX-4bit | mlx-lm | ~17 GB | Apple Silicon balanced |
| RotorQuant-MLX-8bit | mlx-lm | ~32 GB | Apple Silicon reference |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| **TurboQuant-2bit** | **transformers** | **n/a** | **2-bit KV cache config (no weights, applied at runtime)** |
| TurboQuant-MLX-2bit | mlx-lm | ~8.6 GB | Apple Silicon, smallest |
| TurboQuant-MLX-4bit | mlx-lm | ~17 GB | Apple Silicon balanced |
| TurboQuant-MLX-8bit | mlx-lm | ~32 GB | Apple Silicon reference |

Paper for majentik/Qwen3.5-27B-TurboQuant-2bit

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (arXiv:2504.19874, published Apr 28, 2025)
