GPT-OSS-120B - TurboQuant MLX 2-bit

2-bit weight-quantized MLX version of openai/gpt-oss-120b with TurboQuant KV-cache quantization, optimized for Apple Silicon inference via the MLX framework. This is the smallest variant, letting GPT-OSS-120B fit on more accessible Mac hardware at the cost of some quality degradation versus the higher-bit variants. GPT-OSS-120B is OpenAI's flagship open-weights Mixture-of-Experts model (Apache 2.0), approaching o4-mini quality on reasoning tasks.

Approximate model size: ~30 GB

Model Specifications

| Property | Value |
|---|---|
| Base Model | openai/gpt-oss-120b |
| Parameters | 120 billion (MoE) |
| Architecture | Mixture-of-Experts (MoE) Transformer |
| License | Apache 2.0 (commercial use permitted) |
| Weight Quantization | 2-bit (~30 GB) |
| KV-Cache Quantization | TurboQuant |
| Framework | MLX (Apple Silicon) |

Quickstart

```python
from mlx_lm import load, generate
from turboquant import TurboQuantCache  # companion package providing the quantized KV cache

model, tokenizer = load("majentik/gpt-oss-120b-TurboQuant-MLX-2bit")

# One cache entry per transformer layer, passed to generate() as the prompt cache.
# (TurboQuantCache's exact constructor arguments may differ; check the package docs.)
kv_cache = [TurboQuantCache() for _ in range(len(model.layers))]

prompt = "Explain the theory of relativity."
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    prompt_cache=kv_cache,
)
print(response)
```

What is TurboQuant?

TurboQuant (arXiv: 2504.19874) compresses the KV cache used during autoregressive generation. Combined with aggressive 2-bit weight quantization in MLX, this produces the smallest possible footprint for GPT-OSS-120B, making the flagship open model accessible on more modest Apple Silicon hardware.
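The core idea of KV-cache compression can be illustrated with a minimal per-group 2-bit quantizer. This is an illustrative sketch only, not the actual TurboQuant algorithm (which uses more sophisticated transforms); the function names and group size are invented for the example:

```python
import numpy as np

def quantize_2bit(x, group_size=32):
    """Per-group asymmetric 2-bit quantization (4 levels per value)."""
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / 3.0          # 2 bits -> integer codes 0..3
    scale[scale == 0] = 1.0          # guard against constant groups
    codes = np.clip(np.round((groups - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize_2bit(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

np.random.seed(0)
kv = np.random.randn(4, 64).astype(np.float32)   # stand-in for cached keys/values
codes, scale, lo = quantize_2bit(kv)
kv_hat = dequantize_2bit(codes, scale, lo).reshape(kv.shape)
# 16-bit values stored in 2-bit codes: an 8x reduction,
# ignoring the small per-group scale/offset overhead.
print("max abs error:", float(np.abs(kv - kv_hat).max()))
```

Each group of 32 values keeps only a 2-bit code per entry plus one scale and offset, which is where the memory savings come from; the reconstruction error is bounded by half a quantization step per value.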

KV-Cache Quantization Comparison

| Method | Prefill Speed | Decode Speed | Memory Savings | Reference |
|---|---|---|---|---|
| TurboQuant | 1x (baseline) | 1x (baseline) | High | arXiv: 2504.19874 |
| RotorQuant | 5.3x faster | 28% faster | High | GitHub |

Memory Estimates (GPT-OSS-120B)

| Precision | Approximate Size | MLX Variant |
|---|---|---|
| BF16 (original) | ~240 GB | -- |
| 8-bit quantized | ~120 GB | TurboQuant-MLX-8bit |
| 4-bit quantized | ~65 GB | TurboQuant-MLX-4bit |
| 2-bit quantized | ~30 GB | This model |
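The sizes above follow roughly from parameters × bits ÷ 8; quantized files typically run a little larger than the pure arithmetic because embeddings, norms, and quantization scales are stored at higher precision. A quick sanity check:

```python
def approx_weight_gb(n_params, bits):
    """Weight-only footprint: parameters * bits / 8 bytes, expressed in GB (10^9 bytes)."""
    return n_params * bits / 8 / 1e9

for bits in (16, 8, 4, 2):
    # -> ~240, ~120, ~60, ~30 GB for 120B parameters
    print(f"{bits:>2}-bit: ~{approx_weight_gb(120e9, bits):.0f} GB")
```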

Hardware Requirements

This model requires approximately 30 GB of unified memory. Recommended hardware:

  • Apple M1 Max (32 GB+)
  • Apple M2 Max (32 GB+)
  • Apple M3 Max (36 GB+)
  • Apple M4 Max (36 GB+)
  • Any Apple Silicon Mac with 36 GB+ unified memory
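To check whether a machine meets the requirement, total physical (unified) memory can be read via POSIX `sysconf`; the 32 GB threshold below mirrors the list above, leaving headroom for the OS itself:

```python
import os

# Total physical (unified) memory in GB; works on macOS and Linux.
total_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
print(f"Unified memory: {total_gb:.0f} GB")
print("Enough for the 2-bit model" if total_gb >= 32 else "Likely too little (< 32 GB)")
```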
