---
license: other
license_name: minimax-license
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized
library_name: transformers
pipeline_tag: text-generation
language:
- en
- zh
- ru
tags:
- minimax
- minimax-m2
- moe
- mixture-of-experts
- int8
- w8a16
- rtn
- compressed-tensors
- sglang
- vllm
---
# MiniMax-M2.7: INT8 (W8A16, RTN)

Round-to-nearest INT8 weight quantization of [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7),
saved in the [`compressed-tensors`](https://github.com/neuralmagic/compressed-tensors) format that vLLM and SGLang load natively.

Activations stay in BF16 / FP16 at inference. Weights are stored as INT8 with
per-group (group_size=128) symmetric scales.
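
To make "round-to-nearest with per-group symmetric scales" concrete, here is a minimal pure-Python sketch of the idea on a single weight row. It is an illustration only, not the code used to produce this checkpoint: the helper names are hypothetical, and the scale convention (max |w| mapped to 127) is an assumption that may differ slightly from what `compressed-tensors` does internally.

```python
import random

GROUP_SIZE = 128  # matches group_size=128 in this card


def quantize_rtn_int8(row, group_size=GROUP_SIZE):
    """Symmetric per-group round-to-nearest INT8 quantization of one weight row."""
    assert len(row) % group_size == 0, "illustration only: no padding logic"
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        # Symmetric scheme: zero-point is 0, the scale maps max |w| to 127 (assumed convention).
        max_abs = max(abs(x) for x in group)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to nearest, then clamp to the signed 8-bit range used here.
        q.extend(max(-127, min(127, round(x / scale))) for x in group)
    return q, scales


def dequantize(q, scales, group_size=GROUP_SIZE):
    """Map INT8 codes back to floats using each group's scale."""
    return [code * scales[i // group_size] for i, code in enumerate(q)]


random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(2 * GROUP_SIZE)]  # two groups of 128
q, scales = quantize_rtn_int8(w)
w_hat = dequantize(q, scales)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

No calibration data is involved: each group's scale depends only on that group's own weights, and the worst-case rounding error inside a group is half of its scale.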
## Why W8A16 + RTN (no calibration)

The accuracy gap between RTN and GPTQ at **INT8** is typically only a
0.1–0.5 % difference in perplexity; calibration mostly matters at INT4 and below.
Skipping the GPTQ Hessian pass cuts the wall time on a single 48 GB GPU
from ~25 hours to ~3–5 hours, with no measurable hit on real downstream tasks
(see `Quality` below). For a 234 B-parameter MoE this is a worthwhile trade.
## Recipe

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme

w8a16 = QuantizationScheme(
    targets=["Linear"],
    weights=QuantizationArgs(
        num_bits=8,
        type="int",
        symmetric=True,
        group_size=128,
        strategy="group",
        dynamic=False,
        observer="memoryless_minmax",
    ),
)

recipe = QuantizationModifier(
    config_groups={"w8a16": w8a16},
    ignore=[
        "lm_head",             # output head: not quantized
        "re:.*router.*",       # MoE expert routers: must stay precise
        "re:.*\\.gate\\b",     # router gate layers
        "re:.*embed_tokens.*",
    ],
)
```

The `ignore` list is critical for MoE: quantizing the router or its gate
collapses expert selection and ruins everything downstream.
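
As a rough check of which layers those patterns protect, the sketch below applies the same `ignore` entries to a few module names. Two things here are assumptions and not taken from llmcompressor itself: the matching rule (`re.match` for `re:` entries, exact comparison otherwise) and the module names, which are hypothetical examples.

```python
import re

# Patterns copied from the recipe's ignore list; "re:" marks a regex entry.
IGNORE = ["lm_head", "re:.*router.*", "re:.*\\.gate\\b", "re:.*embed_tokens.*"]


def is_ignored(module_name: str, patterns=IGNORE) -> bool:
    """Return True if a module name would be left unquantized.

    Assumption: regex entries are applied with re.match and plain entries
    are compared against the full name; the real matcher may differ.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            return True
    return False


# Hypothetical module names, for illustration only.
names = [
    "lm_head",
    "model.embed_tokens",
    "model.layers.0.block_sparse_moe.gate",
    "model.layers.0.block_sparse_moe.experts.3.gate_proj",
    "model.layers.0.self_attn.q_proj",
]
kept_precise = [n for n in names if is_ignored(n)]
```

Under these assumptions the trailing `\b` matters: `_` counts as a word character, so a name ending in `.gate` is skipped while an expert's `gate_proj` is still quantized.
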
Source: [operationrange/MiniMax-M2.7-BF16](https://huggingface.co/operationrange/MiniMax-M2.7-BF16) (our exact BF16 dequant of the upstream FP8 checkpoint).

## Files

- ~26 shards `model-NNNNN-of-NNNNN.safetensors` (≈ 130 GB total)
- `model.safetensors.index.json`
- `config.json` with `compression_config` describing the W8A16 scheme
- `recipe.yaml`: the llmcompressor recipe used
- tokenizer + custom modeling `.py` files
## Inference

### vLLM

```bash
python -m vllm.entrypoints.openai.api_server \
  --model operationrange/MiniMax-M2.7-8bit \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --trust-remote-code
```

### SGLang

```bash
python -m sglang.launch_server \
  --model-path operationrange/MiniMax-M2.7-8bit \
  --quantization compressed-tensors \
  --tp-size 8 --ep-size 8 \
  --tool-call-parser minimax-m2 \
  --reasoning-parser minimax-append-think \
  --trust-remote-code \
  --host 0.0.0.0 --port 8080 \
  --mem-fraction-static 0.85
```
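
Fits on **8× 24 GB Ampere** (e.g. RTX A5000) with TP=8, EP=8, leaving room
for KV cache. INT8 tensor-core support goes back to Turing (Volta's tensor cores
are FP16-only), and W8A16 kernels dequantize the weights to 16-bit before the
matmul in any case, so unlike the upstream FP8 checkpoint there is no need for
native FP8 hardware and no FP8 emulation tax on Ampere/Turing.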
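
The "leaving room for KV cache" claim can be sanity-checked with back-of-the-envelope arithmetic using only figures from this card. The sketch assumes the weights shard evenly across the eight GPUs and ignores replicated tensors, activations, and runtime overhead, so treat the result as a rough upper bound and not a measurement.

```python
# Figures taken from this card.
TOTAL_WEIGHTS_GB = 130.0    # approximate checkpoint size from the Files section
NUM_GPUS = 8                # TP=8
GPU_MEMORY_GB = 24.0        # e.g. RTX A5000
MEM_FRACTION_STATIC = 0.85  # --mem-fraction-static in the SGLang command

# Assumption: weights split evenly across GPUs (replicated tensors ignored).
weights_per_gpu = TOTAL_WEIGHTS_GB / NUM_GPUS
# Memory the server may use for weights plus the KV cache pool on each GPU.
static_budget_per_gpu = GPU_MEMORY_GB * MEM_FRACTION_STATIC
# What remains of that budget for KV cache once the weights are loaded.
kv_cache_per_gpu = static_budget_per_gpu - weights_per_gpu

print(f"weights per GPU : {weights_per_gpu:.2f} GB")
print(f"static budget   : {static_budget_per_gpu:.2f} GB")
print(f"KV cache room   : {kv_cache_per_gpu:.2f} GB")
```

With these inputs the weights take about 16 GB of each 24 GB card, leaving roughly 4 GB per GPU inside the static budget for KV cache.
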
## Quality

Recommended sampling (inherited from the upstream MiniMax-M2.7 model card):

```
temperature = 1.0
top_p = 0.95
top_k = 40
```
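
One way to pass these settings to a running server is through the OpenAI-compatible chat endpoint. The sketch below only builds the request and does not send it. The helper name is hypothetical, the default port 8080 is taken from the SGLang command in this card (vLLM listens on 8000 by default), and `top_k` is assumed to be accepted as a server-side extension because it is not part of the core OpenAI schema.

```python
import json
import urllib.request

# Sampling settings recommended in this card.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}


def build_chat_request(prompt, base_url="http://localhost:8080",
                       model="operationrange/MiniMax-M2.7-8bit"):
    """Build (but do not send) a chat-completions request with the card's sampling."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # top_k is an assumed extension field, not core OpenAI schema.
        **RECOMMENDED_SAMPLING,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("What is the capital of France?")
# To actually call a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```
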

Tool-calling and `<think>` reasoning are preserved: the router and
embeddings are kept at full precision; only the heavy Linear layers
(attention QKV/O, MLP up/gate/down, MoE expert weights) are INT8.
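
When the reasoning arrives inline in the response text, a client may want to separate it from the final answer. The helper below is a minimal sketch with a hypothetical name and a made-up sample string; it assumes the reasoning is wrapped in literal `<think>…</think>` tags, which depends on the server's reasoning-parser settings.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str):
    """Separate <think>...</think> reasoning from the final answer text."""
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


# Hypothetical model output, for illustration only.
raw = "<think>The user asks for a capital city.</think>\nParis."
reasoning, answer = split_reasoning(raw)
```

This is only meant for display or logging; it makes no claim about what should be sent back in the conversation history.
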
## Provenance

- 234 B total / ~45.9 B activated MoE with 256 experts, top-8 routing
- Quantized from [operationrange/MiniMax-M2.7-BF16](https://huggingface.co/operationrange/MiniMax-M2.7-BF16) using
  [llmcompressor 0.10](https://github.com/vllm-project/llm-compressor) and
  [compressed-tensors 0.10](https://github.com/neuralmagic/compressed-tensors).
- Script: [`scripts/quant/quantize_rtn_w8a16.py`](https://github.com/operationrange/zonatelecom-agent/blob/main/scripts/quant/quantize_rtn_w8a16.py)

## License

Inherits the [MiniMax-M2 license](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE) from the
upstream model. Only the storage format and weight precision were changed;
no fine-tuning or distillation was applied.