Instructions to use olka-fi/MiniMax-M3-MXFP4 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use olka-fi/MiniMax-M3-MXFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="olka-fi/MiniMax-M3-MXFP4", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("olka-fi/MiniMax-M3-MXFP4", trust_remote_code=True)
model = AutoModelForMultimodalLM.from_pretrained("olka-fi/MiniMax-M3-MXFP4", trust_remote_code=True, device_map="auto")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use olka-fi/MiniMax-M3-MXFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "olka-fi/MiniMax-M3-MXFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "olka-fi/MiniMax-M3-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/olka-fi/MiniMax-M3-MXFP4

SGLang

How to use olka-fi/MiniMax-M3-MXFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "olka-fi/MiniMax-M3-MXFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "olka-fi/MiniMax-M3-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "olka-fi/MiniMax-M3-MXFP4" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use olka-fi/MiniMax-M3-MXFP4 with Docker Model Runner:
```
docker model run hf.co/olka-fi/MiniMax-M3-MXFP4
```

MiniMax-M3 — MXFP4 (mixed precision)

A 4-bit MXFP4 quantization of MiniMax-M3, produced with qstream. The routed MoE experts (≈95% of the weights) are quantized to MXFP4; everything that is quality-sensitive is kept at higher precision.

4x RTX PRO 6000 launch recipe by 0xSero: https://github.com/0xSero/minimax-m3-sm120


| | |
|---|---|
| Size | 237 GB, about 53% of the 444 GB MXFP8 source |
| Format | compressed-tensors `mixed-precision` (E2M1 4-bit values with one E8M0 scale per group of 32) |
| Base | MiniMax-M3 (256K-context vision-language sparse MoE, 128 experts top-4 + 1 shared, SwiGLU-OAI, lightning-indexer block-sparse attention) |
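
To make the format line concrete, here is a minimal NumPy sketch of MXFP4 block quantization. It assumes the usual OCP Microscaling convention (a power-of-two scale of `floor(log2(amax)) - 2` per group of 32) and uses simple nearest-value rounding; it is an illustration only, not the qstream implementation, and the function name `quantize_mxfp4` is hypothetical.

```python
import numpy as np

# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GROUP = 32                        # elements sharing one E8M0 scale
BITS_PER_WEIGHT = 4 + 8 / GROUP   # 4-bit value + amortized 8-bit scale


def quantize_mxfp4(x):
    """Quantize x (size divisible by 32) and return the dequantized values."""
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, GROUP)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Assumed convention: scale exponent = floor(log2(amax)) - 2,
    # because the largest E2M1 value is 6 = 1.5 * 2**2.
    exp = np.floor(np.log2(np.where(amax > 0, amax, 1.0))) - 2
    scale = 2.0 ** np.clip(exp, -127, 127)   # E8M0 stores only a power of two
    scaled = np.clip(np.abs(blocks) / scale, 0.0, 6.0)
    # Nearest grid point (ties go to the smaller value; a simplification).
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)
    dequant = np.sign(blocks) * E2M1_GRID[idx] * scale
    return dequant.reshape(np.shape(x))
```

Under these assumptions each weight costs 4.25 bits on average, which is why the routed experts shrink to roughly half their MXFP8 size.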

What is quantized to what

| Component | Precision | Why |
|---|---|---|
| Routed experts (`block_sparse_moe.experts.*`) | MXFP4 (4-bit) | 95% of the weights; the only place worth the size win |
| Shared expert, attention, dense MLP | MXFP8 (8-bit, native passthrough) | runs on every token / sensitive; kept lossless from the source |
| Embeddings, lm_head, router gate, vision tower, projector, norms | BF16 / F32 | unchanged |
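
The table can be read as a name-based routing rule. The sketch below is a hypothetical illustration: only the `block_sparse_moe.experts.` pattern comes from this card, while the function name and the other substrings are assumed tensor-name fragments that may not match the real checkpoint.

```python
# Hypothetical name fragments for tensors kept in BF16 / F32 (assumptions,
# not taken from the checkpoint).
HIGH_PRECISION_HINTS = (
    "embed_tokens",
    "lm_head",
    "block_sparse_moe.gate.",  # router gate
    "vision",
    "projector",
    "norm",
)


def target_precision(tensor_name: str) -> str:
    """Return the precision bucket a tensor would fall into under the table above."""
    # Routed experts: the only tensors quantized to 4 bits.
    if "block_sparse_moe.experts." in tensor_name:
        return "mxfp4"
    # Embeddings, head, router, vision stack, and norms stay unquantized.
    if any(hint in tensor_name for hint in HIGH_PRECISION_HINTS):
        return "bf16/f32"
    # Everything else (shared expert, attention, dense MLP) passes through as MXFP8.
    return "mxfp8"
```

Checking the routed-expert pattern first keeps the rule simple: anything that is neither a routed expert nor an explicitly protected tensor defaults to the 8-bit passthrough.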

Quality (this checkpoint, served on vLLM)

| Metric | Result |
|---|---|
| Perplexity (clean English) | 5.32 |
| GSM8K (full 1319-problem test set, chain-of-thought) | 92.9% (1225/1319) |

These numbers indicate the quantization is faithful: a degraded checkpoint would show perplexity in the hundreds. The eval scripts are scripts/eval_ppl.py and scripts/eval_gsm8k.py in the qstream repo.
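
For readers unfamiliar with the metrics, the following sketch shows the standard definitions: perplexity as the exponential of the mean negative log-likelihood per token, and accuracy as a simple ratio. It is not the code in the qstream eval scripts, and the function names are illustrative.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def accuracy_percent(correct, total):
    """Share of correctly answered problems, in percent."""
    return 100.0 * correct / total
```

With the counts reported above, `accuracy_percent(1225, 1319)` rounds to 92.9, matching the table.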

Fidelity, footprint & provenance

- Quantization error: routed-expert reconstruction SQNR ≈ 18.4 dB (MXFP4 vs. the MXFP8 source), i.e. only the unavoidable 4-bit rounding. The 2D-linear and 3D-MoE GEMM paths were verified bit-faithful at 55 dB / 48 dB.
- Vision is untouched: the CLIP vision tower + projector stay BF16, so image capability equals the base model; only the text MoE is quantized.
- Footprint: ~221 GiB of weights; fits a single ≥256 GB GPU (e.g. B300). Measured ~460 tok/s aggregate generation at 16 concurrent requests on one B300.
- Provenance: built with qstream @cb795c3 from the MiniMax-M3 MXFP8 release; mixed-precision recipe (experts→MXFP4, rest→MXFP8).
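
The SQNR figure above uses the usual signal-to-quantization-noise definition, 10·log10 of signal power over error power. A minimal sketch, assuming that definition (the function name `sqnr_db` is illustrative and this is not the qstream measurement code):

```python
import numpy as np


def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB between two same-shaped arrays."""
    reference = np.asarray(reference, dtype=np.float64)
    noise = reference - np.asarray(quantized, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    noise_power = np.sum(noise ** 2)
    if noise_power == 0.0:
        return float("inf")  # exact reconstruction
    return float(10.0 * np.log10(signal_power / noise_power))
```

On this scale, 18.4 dB corresponds to error power of roughly 1.4% of the signal power.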

Serving with vLLM

This checkpoint targets a MiniMax-M3-capable vLLM build. MXFP4 on M3 is currently an experimental path in that build, so two things are required:

1. The config in this repo (config.json): its config_groups target vLLM's merged runtime modules (qkv_proj, gate_up_proj), which is necessary for the fused linears to load quantized.
2. The MoE clamp patch in vllm_patch/: it forwards the SwiGLU-OAI swiglu_limit/alpha/beta into the MXFP4 MoE quant config. Without it, the assertion "SWIGLUOAI_UNINTERLEAVE requires clamp_limit" fires. See vllm_patch/README.md.
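
To show why the clamp limit matters to the MoE kernel, here is a sketch of a clamped SwiGLU activation. It assumes SwiGLU-OAI follows the formulation published with gpt-oss (gate clamped from above, up-projection clamped on both sides, then `gate * sigmoid(alpha * gate) * (up + beta)`); the default values are the gpt-oss ones and are assumptions, not values read from this checkpoint's config.

```python
import numpy as np


def swiglu_oai(gate, up, limit=7.0, alpha=1.702, beta=1.0):
    """Clamped SwiGLU sketch; defaults are assumed, not taken from MiniMax-M3."""
    gate = np.minimum(gate, limit)    # gate is clamped from above only
    up = np.clip(up, -limit, limit)   # linear branch is clamped on both sides
    swish = gate / (1.0 + np.exp(-alpha * gate))  # gate * sigmoid(alpha * gate)
    return swish * (up + beta)
```

If a kernel applied this activation without knowing the limit, large activations would pass through unclamped and diverge from the reference model, which is presumably why the quant config must carry these parameters.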

# Replace /path/to/MiniMax-M3-MXFP4 with the local folder that contains this repo.
docker run --gpus all --privileged --ipc=host -p 8000:8000 \
  -e VLLM_MXFP4_USE_MARLIN=1 \
  -v /path/to/MiniMax-M3-MXFP4/vllm_patch/compressed_tensors_moe_w4a4_mxfp4.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a4_mxfp4.py \
  vllm/vllm-openai:minimax-m3 olka-fi/MiniMax-M3-MXFP4 \
  --block-size 128 --tool-call-parser minimax_m3 --enable-auto-tool-choice \
  --reasoning-parser minimax_m3 --load-format fastsafetensors \
  --gpu-memory-utilization 0.97 --enforce-eager --max-model-len 200000 \
  --max-num-batched-tokens 2048 --linear-backend marlin

Fits on a single ~275 GB GPU (e.g. B300/SM100). On SM120 (DGX Spark) the same Marlin path applies, but also needs the MSA SM12x sparse-attention kernels, and the ~221 GiB of weights won't fit in 2×128 GB.
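
Once the container is up, the server can be queried from Python as well as curl. The sketch below uses only the standard library and assumes the OpenAI-compatible `/v1/chat/completions` endpoint on port 8000, as in the commands above; the helper names `build_chat_payload` and `send_chat` are illustrative.

```python
import json
import urllib.request

MODEL = "olka-fi/MiniMax-M3-MXFP4"


def build_chat_payload(prompt, image_url=None, max_tokens=128):
    """Build a chat-completions request body with optional image input."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


def send_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to the server and return the first reply's text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `send_chat(build_chat_payload("Describe this image in one sentence.", image_url=...))` mirrors the curl request shown earlier; `base_url` should be adjusted if the port mapping differs.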

License

Inherits the MiniMax Community License from the base model (non-commercial). This is a derivative (quantized) work of MiniMax-M3.

Downloads last month: 23,015

Model size: 234B params

Tensor types: F32, BF16, F8_E4M3

Base model: MiniMaxAI/MiniMax-M3 (56 quantized models, including this one)

Finetunes: 1 model