# Qwen3.5-27B-RotorQuant-MLX-2bit

MLX 2-bit weight quantization + RotorQuant 2-bit KV cache compression for Qwen/Qwen3.5-27B.

Dual compression for Apple Silicon: both the model weights and the KV cache are quantized to 2-bit, enabling long-context inference on memory-constrained Macs.

## Overview

Qwen3.5-27B is a 27B-parameter hybrid transformer with 262K native context and built-in thinking mode (the model generates internal reasoning tokens before answering). Thinking mode makes KV cache compression especially valuable, since the reasoning chain can consume substantial cache memory.

This variant applies two layers of compression:

  1. MLX 2-bit weight quantization — reduces the 27B model from 54 GB (BF16) to approximately **8 GB**, making it loadable on Apple Silicon devices with limited unified memory.
  2. RotorQuant 2-bit KV cache — rotation-based isotropic quantization compresses the key-value cache with better quality and speed than standard approaches.
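
The weight figure can be sanity-checked with back-of-envelope arithmetic. This is an illustrative estimate, not the quantizer's actual accounting: the ~0.25 bits/weight of group-scale overhead is an assumption, and MLX typically keeps a few tensors (e.g. embeddings) at higher precision, which pushes the total toward the quoted ~8 GB.

```python
# Back-of-envelope check of the 2-bit weight footprint.
# Assumptions: 27e9 parameters, ~0.25 bits/weight of group-quantization
# overhead for scales/offsets (illustrative, not a published number).
params = 27e9

bf16_gb = params * 2 / 1e9               # 2 bytes per weight -> ~54 GB
q2_gb = params * (2 + 0.25) / 8 / 1e9    # 2 bits + scale overhead, in bytes

print(f"BF16 weights:  ~{bf16_gb:.0f} GB")
print(f"2-bit weights: ~{q2_gb:.1f} GB")  # a bit under the quoted ~8 GB
```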

## RotorQuant Advantages

| Metric | RotorQuant 2-bit | Standard 2-bit |
|---|---|---|
| Prefill speed | 5.3x faster | Baseline |
| Decode speed | 28% faster | Baseline |
| Perplexity | 6.91 | 7.07 |

RotorQuant achieves lower perplexity (better quality) while also being faster — making it the preferred 2-bit KV cache method when quality matters.
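RotorQuant's internals aren't published here, but the general rotation-then-quantize idea behind such methods (popularized by approaches like QuaRot) can be sketched: multiply vectors by a random orthogonal matrix so outlier energy spreads evenly across dimensions, quantize uniformly at 2 bits, then rotate back. All names and parameters below are illustrative, not RotorQuant's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # QR of a Gaussian matrix yields a random orthogonal matrix;
    # fixing the signs of R's diagonal makes the sample properly uniform.
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))

def quantize_2bit(x):
    # Per-row min-max uniform quantization to 4 levels (2 bits),
    # returned already dequantized for easy error measurement.
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 3                  # 4 levels -> 3 steps
    return np.round((x - lo) / scale) * scale + lo

d = 128
R = random_rotation(d)
k = rng.normal(size=(16, d))               # toy "key" vectors
k[:, 0] *= 20                              # inject an outlier channel

err_plain = np.abs(quantize_2bit(k) - k).mean()
# Rotate, quantize in the rotated basis, rotate back:
err_rot = np.abs(quantize_2bit(k @ R) @ R.T - k).mean()
print(err_plain, err_rot)                  # rotation lowers the error
```

The outlier channel inflates the quantization step for every element in its row; after rotation its energy is spread across all 128 dimensions, so the same 2-bit grid covers a much tighter range.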

## Specifications

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-27B |
| Parameters | 27B |
| Architecture | Hybrid Transformer |
| Native context | 262,144 tokens |
| Thinking mode | Yes |
| Weight quantization | MLX 2-bit |
| KV cache method | RotorQuant 2-bit (IsoQuant) |
| KV cache compression | ~10x vs FP16 |
| Runtime | MLX (Apple Silicon) |

## Memory Estimates

| Component | Estimate |
|---|---|
| Model weights (MLX 2-bit) | ~8 GB |
| KV cache at 128K context (2-bit RotorQuant) | ~1.3 GB |
| Total at 128K context | ~9.3 GB |
| Comparison: BF16 weights + FP16 KV at 128K | ~66.8 GB |
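
The rows above are internally consistent, which a quick calculation confirms. The FP16 KV figure is backed out from the comparison row rather than taken from a published architecture breakdown:

```python
# Consistency check of the memory-estimate table.
# Assumption: the comparison row is BF16 weights (54 GB) plus FP16 KV,
# so the FP16 KV cache at 128K is the difference.
weights_bf16_gb = 54.0
total_fp16_gb = 66.8
kv_fp16_gb = total_fp16_gb - weights_bf16_gb   # ~12.8 GB at 128K

kv_2bit_gb = kv_fp16_gb / 10                   # card's ~10x compression claim
total_2bit_gb = 8.0 + kv_2bit_gb               # 2-bit weights + 2-bit KV

print(f"FP16 KV at 128K:  ~{kv_fp16_gb:.1f} GB")
print(f"2-bit KV at 128K: ~{kv_2bit_gb:.1f} GB")   # matches the ~1.3 GB row
print(f"Total:            ~{total_2bit_gb:.1f} GB")  # matches the ~9.3 GB row
```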

## Quickstart

```python
from mlx_lm import load, generate
from turboquant import IsoQuantCache

model_id = "majentik/Qwen3.5-27B-RotorQuant-MLX-2bit"

model, tokenizer = load(model_id)

# Apply 2-bit RotorQuant KV cache compression
cache = IsoQuantCache(bits=2)

prompt = "Explain the Riemann hypothesis in simple terms."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(
    model,
    tokenizer,
    prompt=text,
    max_tokens=2048,
    kv_cache=cache,
)
print(response)
```

## Quality Notes

- 2-bit weights + 2-bit KV cache is the most aggressive quantization combination, but RotorQuant's rotation-based approach preserves more quality than standard methods (perplexity 6.91 vs 7.07).
- For higher quality on Apple Silicon, consider 4-bit weight variants with 4-bit KV cache.
- Thinking mode may be more sensitive to quantization, since the model relies on both weight precision and cached reasoning tokens for its final answer.
- Best suited for: prototyping, development, long-context exploration, and scenarios where running the model at all matters more than peak quality.
