Instructions for using majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit with libraries and local apps.

Libraries

How to use majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit"
        }
      ]
    }
  }
}
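
If you would rather script this step than edit the file by hand, the provider entry above can be merged into an existing models.json. The following is a minimal sketch and not part of Pi itself: the helper name `add_mlx_provider` is hypothetical, and it assumes that the file, if it already exists, holds a JSON object with an optional "providers" key.

```python
import json
from pathlib import Path

MODEL_ID = "majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit"


def add_mlx_provider(config_path, base_url="http://localhost:8080/v1"):
    """Merge the mlx-lm provider entry into a Pi models.json file.

    Hypothetical helper: only the JSON layout comes from the instructions above.
    """
    path = Path(config_path).expanduser()
    # Start from the existing config so that other providers are preserved.
    config = json.loads(path.read_text()) if path.exists() else {}
    providers = config.setdefault("providers", {})
    providers["mlx-lm"] = {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": MODEL_ID}],
    }
    # Create ~/.pi/agent (or whichever parent directory) if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `add_mlx_provider("~/.pi/agent/models.json")` would write the same entry shown above while leaving any other configured providers untouched.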

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit

Run Hermes

hermes

MLX LM

How to use majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
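
The same request can be issued from Python with only the standard library. This is a minimal sketch under stated assumptions: the helper names (`build_chat_request`, `extract_reply`, `chat`) are hypothetical, the server is assumed to listen on port 8080 as in the Pi and Hermes configurations above, and the response is assumed to follow the usual OpenAI chat-completion layout (`choices[0].message.content`). Adjust `base_url` if the server was started with a different `--port`.

```python
import json
import urllib.request

MODEL_ID = "majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit"


def build_chat_request(prompt, base_url="http://localhost:8080/v1"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of an OpenAI-style chat completion dict."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt, base_url="http://localhost:8080/v1"):
    """Send the request; requires a running mlx_lm.server instance."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(chat("Hello"))` would print the model's reply. Building the request separately from sending it keeps the payload easy to inspect without a live server.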

majentik committed on Apr 13

Commit 4b447bf (verified)

1 parent: d3d7ddb

Add model card (weights pending mlx_lm mistral3 architecture support)

Files changed (1)

README.md +91 -0

README.md ADDED

	@@ -0,0 +1,91 @@

+---
+base_model: mistralai/Mistral-Small-4-119B-2603
+library_name: mlx
+license: apache-2.0
+tags:
+  - rotorquant
+  - kv-cache-quantization
+  - mistral
+  - moe
+  - sparse-moe
+  - multimodal
+  - quantized
+  - mlx
+  - 2-bit
+  - apple-silicon
+  - 256k-context
+  - thinking
+pipeline_tag: text-generation
+---
+# Mistral-Small-4-119B-RotorQuant-MLX-2bit
+**Dual compression: 2-bit MLX weight quantization + RotorQuant KV cache quantization** for Mistral Small 4 on Apple Silicon.
+This repository provides a 2-bit weight-quantized MLX conversion of [mistralai/Mistral-Small-4-119B-2603](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603) with RotorQuant KV cache quantization support. The aggressive compression is intended for running the model on consumer Apple Silicon hardware.
+## Overview
+This model applies two complementary compression techniques:
+1. **2-bit weight quantization (MLX)** -- reduces model weights from ~238 GB to ~30 GB
+2. **RotorQuant KV cache quantization** -- reduces KV cache from ~32 GB to ~6.5 GB at 256K context
+This enables running a 119B-parameter MoE model on Apple Silicon Macs with 64 GB+ unified memory.
+## Model Specs
+| Property | Value |
+|---|---|
+| Base Model | Mistral Small 4 (March 2026) |
+| Total Parameters | 119B |
+| Active Parameters | 6.5B per token (Sparse MoE) |
+| Architecture | Sparse MoE -- 128 experts, 4 active per token |
+| Context Length | 256K tokens |
+| Modality | Text + Images (multimodal) |
+| Capabilities | Thinking / reasoning, tool use, multilingual |
+| License | Apache 2.0 |
+| Weight Quantization | 2-bit (MLX) |
+| KV Cache Quantization | RotorQuant 3-bit |
+## Memory Estimates
+| Configuration | Weights | KV Cache (256K) | Total |
+|---|---|---|---|
+| FP16 baseline | ~238 GB | ~32 GB | ~270 GB |
+| **This model (2-bit MLX + RotorQuant)** | **~30 GB** | **~6.5 GB** | **~36.5 GB** |
+> **Note:** This is a Sparse MoE model -- only 6.5B parameters are active per token, so per-token compute scales with the active parameters rather than the full 119B. The 2-bit quantization trades some quality for a much smaller memory footprint; expect modest degradation on complex reasoning tasks compared to 4-bit.
+## Quickstart
+```python
+from mlx_lm import load, generate
+# Download (if needed) and load the quantized weights and tokenizer
+model, tokenizer = load("majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit")
+prompt = "Explain sparse mixture-of-experts architectures."
+messages = [{"role": "user", "content": prompt}]
+# Render the chat template to a prompt string (tokenize=False returns text, not token IDs)
+text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+# Generate up to 512 new tokens
+response = generate(model, tokenizer, prompt=text, max_tokens=512)
+print(response)
+```
+## What is RotorQuant?
+[RotorQuant](https://github.com/scrya-com/rotorquant) is a rotation-based KV cache quantization method that applies learned rotations before quantizing the key-value cache. Key results on the base model:
+- **5.3x faster prefill** compared to the unquantized baseline
+- **28% higher decode throughput**
+- **Perplexity: 6.91** vs 7.07 for the unquantized baseline (lower is better)
+Because it targets the KV cache rather than weights, it stacks with weight quantization for compounding memory savings.
+## See Also
+- [mistralai/Mistral-Small-4-119B-2603](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603) -- Base model
+- [majentik/Mistral-Small-4-119B-RotorQuant](https://huggingface.co/majentik/Mistral-Small-4-119B-RotorQuant) -- KV cache only (no weight quantization)
+- [majentik/Mistral-Small-4-119B-RotorQuant-MLX-4bit](https://huggingface.co/majentik/Mistral-Small-4-119B-RotorQuant-MLX-4bit) -- 4-bit MLX variant
+- [majentik/Mistral-Small-4-119B-RotorQuant-MLX-1bit](https://huggingface.co/majentik/Mistral-Small-4-119B-RotorQuant-MLX-1bit) -- 1-bit MLX variant
+- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
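
The weight and KV cache figures in the README's Overview and Memory Estimates table can be sanity-checked with bits-per-value arithmetic. The sketch below is a back-of-the-envelope check, not the author's calculation: the function names are hypothetical, gigabytes are decimal (1e9 bytes), and quantization metadata such as per-group scales is ignored.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Raw weight storage in decimal gigabytes, ignoring quantization metadata."""
    return num_params * bits_per_param / 8 / 1e9


def scaled_cache_gb(fp16_cache_gb, bits_per_value):
    """Scale an FP16 (16-bit) cache size down to a lower bit width."""
    return fp16_cache_gb * bits_per_value / 16


TOTAL_PARAMS = 119e9  # total parameter count from the Model Specs table
FP16_KV_CACHE_GB = 32  # FP16 KV cache at 256K context, from the README

fp16_weights = weight_memory_gb(TOTAL_PARAMS, 16)
two_bit_weights = weight_memory_gb(TOTAL_PARAMS, 2)
kv_3bit = scaled_cache_gb(FP16_KV_CACHE_GB, 3)

print(f"FP16 weights:   {fp16_weights:.2f} GB")     # 238.00 GB
print(f"2-bit weights:  {two_bit_weights:.2f} GB")  # 29.75 GB
print(f"3-bit KV cache: {kv_3bit:.2f} GB")          # 6.00 GB
```

The raw results (238 GB, 29.75 GB, 6 GB) line up with the README's ~238 GB and ~30 GB weight figures. The README's ~6.5 GB KV cache estimate is somewhat above the raw 6 GB; the source does not break down the difference, and this sketch does not attempt to.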