GPT-OSS-120B - RotorQuant MLX 8-bit

8-bit weight-quantized MLX version of openai/gpt-oss-120b with RotorQuant KV-cache quantization. Optimized for Apple Silicon inference via the MLX framework. RotorQuant delivers 5.3x faster prefill and 28% faster decode compared to TurboQuant. GPT-OSS-120B is OpenAI's flagship open-weights Mixture-of-Experts model (Apache 2.0), approaching o4-mini quality for reasoning tasks.

Approximate model size: ~120 GB

Model Specifications

| Property | Value |
| --- | --- |
| Base Model | openai/gpt-oss-120b |
| Parameters | 120 billion (MoE) |
| Architecture | Mixture-of-Experts (MoE) Transformer |
| License | Apache 2.0 (commercial use OK) |
| Weight Quantization | 8-bit (~120 GB) |
| KV-Cache Quantization | RotorQuant |
| Framework | MLX (Apple Silicon) |

Quickstart

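from mlx_lm import load, generate
# RotorQuant KV-cache class. It is imported here but not referenced below;
# this snippet does not show how it is attached to generation.
from rotorquant import IsoQuantCache

# Download (on first use) and load the 8-bit MLX weights and the tokenizer.
model, tokenizer = load("majentik/gpt-oss-120b-RotorQuant-MLX-8bit")

prompt = "Explain the theory of relativity."
# Generate up to 512 new tokens for the prompt.
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)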

What is RotorQuant?

RotorQuant applies block-diagonal rotations (Clifford algebra) for KV cache compression. Combined with 8-bit weight quantization in MLX, this provides a dual compression strategy with superior KV-cache performance.
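
To make the rotate-then-quantize idea concrete, here is a minimal pure-Python sketch. It is not RotorQuant's implementation: it uses plain 2x2 rotations on consecutive coordinate pairs as the simplest kind of block-diagonal rotation, followed by symmetric 8-bit quantization, and it does not model Clifford-algebra rotors. All function names (`make_angles`, `rotate`, `quantize_int8`, `dequantize`) and the sample vector are hypothetical.

```python
import math
import random

def make_angles(dim, seed=0):
    """One rotation angle per consecutive pair of coordinates (dim must be even)."""
    if dim % 2:
        raise ValueError("dim must be even for 2x2 blocks")
    rng = random.Random(seed)
    return [rng.uniform(0.0, 2.0 * math.pi) for _ in range(dim // 2)]

def rotate(vec, angles, inverse=False):
    """Apply a block-diagonal matrix made of 2x2 rotations to vec."""
    out = []
    for i, theta in enumerate(angles):
        if inverse:
            theta = -theta  # the inverse of a rotation is the opposite rotation
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [c * x - s * y, s * x + c * y]
    return out

def quantize_int8(vec):
    """Symmetric 8-bit quantization: integer codes in [-127, 127] plus one scale."""
    scale = max(abs(v) for v in vec) / 127.0 or 1.0  # avoid a zero scale
    return [round(v / scale) for v in vec], scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

# Hypothetical 8-dimensional key vector standing in for one KV-cache entry.
key = [0.5, -1.25, 3.0, 0.125, -2.0, 0.75, 1.5, -0.25]
angles = make_angles(len(key))

codes, scale = quantize_int8(rotate(key, angles))            # compress
restored = rotate(dequantize(codes, scale), angles, True)    # decompress

max_error = max(abs(a - b) for a, b in zip(key, restored))
print(f"scale={scale:.5f}, max reconstruction error={max_error:.5f}")
```

Because each block is an exact rotation, the only loss comes from rounding in the quantizer; the rotation itself preserves vector norms and is undone exactly on the way back.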

Key advantages over TurboQuant:

  • 5.3x faster prefill
  • 28% faster decode
  • Equivalent memory savings

KV-Cache Quantization Comparison

| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
| --- | --- | --- | --- | --- |
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |
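
As a rough illustration of what these ratios mean in wall-clock terms, the sketch below converts them into times. It assumes "5.3x faster" means 5.3 times the baseline throughput and "28% faster" means 1.28 times the baseline throughput; the baseline durations are hypothetical and are not measurements from this model.

```python
def time_after_speedup(baseline_seconds, speedup):
    """Time taken when throughput is `speedup` times the baseline throughput."""
    return baseline_seconds / speedup

PREFILL_SPEEDUP = 5.3   # "5.3x faster prefill"
DECODE_SPEEDUP = 1.28   # "28% faster decode", read as 1.28x throughput (assumption)

# Hypothetical TurboQuant baseline timings, chosen only for illustration.
prefill = time_after_speedup(10.0, PREFILL_SPEEDUP)
decode = time_after_speedup(60.0, DECODE_SPEEDUP)

print(f"prefill: {prefill:.2f} s, decode: {decode:.2f} s")
```

Under these assumptions, prefill takes about 19% of the baseline time and decode about 78%.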

Memory Estimates (GPT-OSS-120B)

| Precision | Approximate Size | MLX Variant |
| --- | --- | --- |
| BF16 (original) | ~240 GB | -- |
| 8-bit quantized | ~120 GB | This model |
| 4-bit quantized | ~65 GB | RotorQuant-MLX-4bit |
| 2-bit quantized | ~30 GB | RotorQuant-MLX-2bit |
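
These estimates follow roughly from parameters times bits per weight. The sketch below reproduces that arithmetic for the 120 billion parameter count listed above; `weight_size_gb` is a hypothetical helper, and it counts weights only. Real quantized files also store per-group scales and other metadata, and the KV cache needs memory at runtime, so actual sizes can be larger than this lower bound.

```python
def weight_size_gb(num_params, bits_per_weight):
    """Weights-only size in GB (1 GB = 1e9 bytes), ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 120e9  # parameter count as listed in this card

sizes = {
    label: weight_size_gb(NUM_PARAMS, bits)
    for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)]
}

for label, size in sizes.items():
    print(f"{label}: ~{size:.0f} GB")
```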

Hardware Requirements

This model requires approximately 120 GB of unified memory. Recommended hardware:

  • Apple M2 Ultra (192 GB)
  • Apple M3 Ultra (256 GB or 512 GB)
  • Mac Studio M4 Ultra (192 GB+)
  • Multi-device MLX inference for smaller Macs

See Also

Quant trade-off (MLX lane)

| Bits | Approx size | Use case | Recommendation |
| --- | --- | --- | --- |
| 2-bit | ~31 GB | Aggressive quantization | Very low-RAM Macs |
| 3-bit | ~43 GB | Lossy but small | Low-RAM Macs |
| 4-bit | ~50 GB | Balanced default | Recommended for most Macs |
| 5-bit | ~60 GB | Higher fidelity | Quality-sensitive |
| 6-bit | ~72 GB | Approaching FP16 quality | High-fidelity |
| **8-bit** | **~91 GB** | **Near-lossless reference** | **Fidelity-critical work** |

(The current variant, 8bit, is bolded.)

Variants in this family

(Showing 14 sibling variants under majentik/gpt-oss-120b-*. The current variant, RotorQuant-MLX-8bit, is bolded.)

| Variant | Runtime | Approx size | Use case |
| --- | --- | --- | --- |
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~103 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~72 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~94 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~132 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~158 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~252 GB | Near-lossless reference |
| RotorQuant-MLX-2bit | mlx-lm | ~38 GB | Apple Silicon, smallest |
| RotorQuant-MLX-4bit | mlx-lm | ~74 GB | Apple Silicon balanced |
| **RotorQuant-MLX-8bit** | **mlx-lm** | **~142 GB** | **Apple Silicon reference** |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-MLX-2bit | mlx-lm | ~38 GB | Apple Silicon, smallest |
| TurboQuant-MLX-4bit | mlx-lm | ~74 GB | Apple Silicon balanced |
| TurboQuant-MLX-8bit | mlx-lm | ~142 GB | Apple Silicon reference |