MiniMax-M3 — MXFP4 (mixed precision)
A 4-bit MXFP4 quantization of MiniMax-M3, produced with qstream. The routed MoE experts (≈95% of the weights) are quantized to MXFP4; everything that is quality-sensitive is kept at higher precision.
4x RTX PRO 6000 launch recipe by 0xSero: https://github.com/0xSero/minimax-m3-sm120
| Property | Value |
|---|---|
| Size | 237 GB (down from 444 GB MXFP8 source, ~53%) |
| Format | compressed-tensors mixed-precision (E2M1 4-bit + E8M0 group-32 scales) |
| Base | MiniMax-M3 (256K-context vision-language sparse MoE, 128 experts top-4 + 1 shared, SwiGLU-OAI, lightning-indexer block-sparse attention) |
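
The size figures can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes "GB" in the table means decimal gigabytes (10^9 bytes); the helper name `size_summary` is illustrative and not part of qstream.

```python
GB = 10**9    # decimal gigabyte (assumed meaning of "GB" in the table)
GIB = 2**30   # binary gibibyte

def size_summary(quant_gb, source_gb):
    """Return the size ratio and the quantized size expressed in GiB."""
    return {
        "ratio": quant_gb / source_gb,
        "quant_gib": quant_gb * GB / GIB,
    }

summary = size_summary(237, 444)
print(f"{summary['ratio']:.1%} of source, {summary['quant_gib']:.1f} GiB")
# prints: 53.4% of source, 220.7 GiB
```

Under that assumption, 237 GB and the ~221 GiB quoted for the weight footprint are the same quantity in two units.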
What is quantized to what
| Component | Precision | Why |
|---|---|---|
| Routed experts (`block_sparse_moe.experts.*`) | MXFP4 (4-bit) | 95% of the weights; the only place worth the size win |
| Shared expert, attention, dense MLP | MXFP8 (8-bit, native passthrough) | runs on every token / sensitive — kept lossless from the source |
| Embeddings, lm_head, router gate, vision tower, projector, norms | BF16 / F32 | unchanged |
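
To make the MXFP4 row concrete, here is a rough pure-Python sketch of quantizing one 32-element group to E2M1 values with a shared power-of-two (E8M0) scale. It follows the scale rule from the OCP Microscaling spec (`floor(log2(amax)) - emax`); whether qstream picks scales exactly this way is an assumption, and the function name is hypothetical. Ties are broken toward the smaller magnitude here, not by round-to-nearest-even.

```python
import math

# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2      # largest E2M1 binade: 4.0 == 2**2
GROUP_SIZE = 32    # elements sharing one E8M0 scale

def quantize_group(values):
    """Fake-quantize one group; returns (shared_exponent, dequantized values)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Shared power-of-two exponent, clamped to the E8M0 range.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    dequantized = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_GRID[-1])          # saturate at 6.0
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))    # nearest grid point
        dequantized.append(math.copysign(q * scale, v))
    return shared_exp, dequantized

group = [0.1 * i for i in range(-16, 16)]   # toy data, exactly one group
shared_exp, deq = quantize_group(group)
print(shared_exp, max(abs(d) for d in deq))
```

Each group therefore costs 32 × 4 bits for the elements plus 8 bits for the scale, or 4.25 bits per weight.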
Quality (this checkpoint, served on vLLM)
| Metric | Result |
|---|---|
| Perplexity (clean English) | 5.32 |
| GSM8K (full 1319-problem test set, chain-of-thought) | 92.9% (1225/1319) |
Quantization is faithful: a degraded checkpoint would show PPL in the hundreds. Eval scripts: `scripts/eval_ppl.py`, `scripts/eval_gsm8k.py` in the qstream repo.
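
For readers unfamiliar with the two metrics, the sketch below shows how they are conventionally computed. It is not the code in `scripts/eval_ppl.py` or `scripts/eval_gsm8k.py`, which may differ in dataset, windowing, and answer extraction; the function names are illustrative.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative natural-log likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def accuracy(correct, total):
    """Fraction of problems answered correctly."""
    return correct / total

# A model that gives every token probability 1/5.32 has perplexity 5.32.
uniform = [math.log(1 / 5.32)] * 100
print(round(perplexity(uniform), 2))      # prints: 5.32
print(f"{accuracy(1225, 1319):.1%}")      # prints: 92.9%
```

Read this way, a PPL of 5.32 means the model is on average about as uncertain as a uniform choice among roughly five tokens.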
Fidelity, footprint & provenance
- Quantization error: routed-expert reconstruction SQNR ≈ 18.4 dB (MXFP4 vs the MXFP8 source) — i.e. only the unavoidable 4-bit rounding; the 2D-linear and 3D-MoE GEMM paths were verified bit-faithful at 55 dB / 48 dB.
- Vision is untouched: the CLIP vision tower + projector stay BF16, so image capability equals the base model — only the text MoE is quantized.
- Footprint: ~221 GiB of weights; fits a single ≥256 GB GPU (e.g. B300). Measured ~460 tok/s aggregate generation at 16 concurrent requests on one B300.
- Provenance: built with qstream `@cb795c3` from the `MiniMax-M3` MXFP8 release; mixed-precision recipe (experts→MXFP4, rest→MXFP8).
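
The SQNR figures in the list above can be interpreted with the standard definition, 10·log10(signal power / noise power). The sketch below is a generic implementation on plain Python lists, not the measurement code used for this checkpoint; `noise_fraction` is an illustrative helper.

```python
import math

def sqnr_db(reference, approx):
    """Signal-to-quantization-noise ratio in dB between two equal-length lists."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approx))
    if noise == 0.0:
        return math.inf
    return 10.0 * math.log10(signal / noise)

def noise_fraction(db):
    """Noise power as a fraction of signal power for a given SQNR in dB."""
    return 10.0 ** (-db / 10.0)

print(round(sqnr_db([1, 2, 3, 4], [1.1, 1.9, 3.1, 3.9]), 2))
print(f"{noise_fraction(18.4):.4f}")   # share of signal power lost at 18.4 dB
print(f"{noise_fraction(55.0):.2e}")   # same at 55 dB
```

At 18.4 dB the error carries about 1.4% of the signal power (roughly 12% in RMS terms), while at 55 dB it is a few parts per million.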
Serving with vLLM
This checkpoint targets a MiniMax-M3-capable vLLM build. MXFP4-on-M3 is currently an experimental path in that fork, so two things are required:
- The config in this repo (`config.json`): its `config_groups` target vLLM's merged runtime modules (`qkv_proj`, `gate_up_proj`), which is necessary for the fused linears to load quantized.
- The MoE clamp patch in `vllm_patch/`: forwards the SwiGLU-OAI `swiglu_limit/alpha/beta` into the MXFP4 MoE quant config (without it the `SWIGLUOAI_UNINTERLEAVE requires clamp_limit` assertion fires). See `vllm_patch/README.md`.
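
To illustrate why the clamp limit has to reach the MoE kernel, here is a scalar sketch of a clamped SwiGLU in the form used by OpenAI's gpt-oss reference implementation. That MiniMax-M3's SwiGLU-OAI matches this form term for term is an assumption, as are the default values and the mapping of `beta` to the `+1` offset; check the patch and `vllm_patch/README.md` for the actual semantics.

```python
import math

def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swiglu_oai(gate, up, limit=7.0, alpha=1.702, beta=1.0):
    """Clamped SwiGLU on scalars (illustrative defaults, not read from the model)."""
    gate = min(gate, limit)               # gate branch: clamped from above only
    up = max(-limit, min(up, limit))      # linear branch: clamped on both sides
    return gate * _sigmoid(alpha * gate) * (up + beta)

print(swiglu_oai(100.0, 100.0) == swiglu_oai(7.0, 7.0))   # prints: True
```

Without the limit, a kernel would let large activations pass through unclamped and produce different outputs from the reference model, which is presumably why the fused path refuses to run without it.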
docker run --gpus all --privileged --ipc=host -p 8000:8000 \
-e VLLM_MXFP4_USE_MARLIN=1 \
-v $(FOLDER-WITH-MiniMax-M3-MXFP4)/vllm_patch/compressed_tensors_moe_w4a4_mxfp4.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_w4a4_mxfp4.py \
vllm/vllm-openai:minimax-m3 olka-fi/MiniMax-M3-MXFP4 \
--block-size 128 --tool-call-parser minimax_m3 --enable-auto-tool-choice \
--reasoning-parser minimax_m3 --load-format fastsafetensors \
--gpu-memory-utilization 0.97 --enforce-eager --max-model-len 200000 \
--max-num-batched-tokens 2048 --linear-backend marlin
Fits on a single ~275 GB GPU (e.g. B300/SM100). On SM120 (DGX Spark) the same Marlin path applies, but also needs the MSA SM12x sparse-attention kernels, and the ~221 GiB of weights won't fit in 2×128 GB.
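
Once the container is up, the server can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server is reachable at `http://localhost:8000` (matching `-p 8000:8000` in the command above) and serves the model under its repo name; `build_request` and `chat` are illustrative helpers.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"      # assumption: matches -p 8000:8000
MODEL = "olka-fi/MiniMax-M3-MXFP4"      # assumption: default served model name

def build_request(prompt, max_tokens=128):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, timeout=600):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Say hello in one sentence.")` sends a single request. The generous timeout is a guess to allow for slow first responses.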
License
Inherits the MiniMax Community License from the base model (non-commercial). This is a derivative (quantized) work of MiniMax-M3.
Model tree for olka-fi/MiniMax-M3-MXFP4
Base model
MiniMaxAI/MiniMax-M3