---
license: other
license_name: minimax-license
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized
library_name: transformers
pipeline_tag: text-generation
language:
- en
- zh
- ru
tags:
- minimax
- minimax-m2
- moe
- mixture-of-experts
- bf16
- dequantized
---

# MiniMax-M2.7: BF16 (dequantized from FP8)

Plain `bfloat16` weights of [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7), reconstructed from the upstream block-FP8 (E4M3, 128×128 blocks) checkpoint via shard-by-shard blockwise dequantization. **No calibration, no rounding loss beyond the original FP8→BF16 cast.** Every block is materialized exactly as:

```
bf16_block = (fp8_block.float() * scale_fp32).bfloat16()
```

## Why this exists

`MiniMaxAI/MiniMax-M2.7` ships natively in FP8. Ampere and earlier GPUs (e.g. RTX A5000) have no FP8 tensor cores, so inference engines have to emulate FP8 through FP16, paying double the bandwidth without the speed benefit. For further offline quantization (AWQ, GPTQ, RTN INT8, …) you need plain BF16 weights anyway: `transformers + torch_dtype=bfloat16` won't materialize the attention projections under the FP8 quant config, which trips up `llmcompressor`'s GPTQ tracer.

This repo is the missing intermediate: **upstream MiniMax-M2.7 weights in plain BF16 safetensors**, ready to be fed into any standard quantization pipeline.

## Contents

- 47 shards `model-NNNNN-of-00047.safetensors`
- rebuilt `model.safetensors.index.json` (no `*.weight_scale_inv` entries)
- `config.json` with the upstream `quantization_config` stripped
- tokenizer + custom modeling `.py` files copied verbatim from the FP8 source

Total ≈ **458 GB**.

## Provenance

Produced on a single 48 GB GPU pod (~30 minutes wall time) using a ~150-line script; see [`dequant_fp8_blockwise.py`](https://github.com/operationrange/zonatelecom-agent/blob/main/scripts/quant/dequant_fp8_blockwise.py).
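
The blockwise formula given at the top of this card can be illustrated without a GPU. The following is a minimal pure-Python sketch, not the linked script: it assumes the FP8 values have already been decoded to ordinary floats, and the helper names `round_to_bf16` and `dequant_blockwise` are hypothetical. It shows how the weight at row `i`, column `j` picks up the scale stored at `[i // 128][j // 128]` before being rounded to BF16.

```python
import struct

BLOCK = 128  # block edge stated in this card (128x128 blocks)


def round_to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    # Reinterpret the value as its float32 bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that bf16 drops.
    bits += 0x7FFF + ((bits >> 16) & 1)
    # bf16 keeps only the top 16 bits of the float32 pattern.
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dequant_blockwise(weight, scale_inv, block=BLOCK):
    """Multiply each weight by its block's scale, then round to bf16.

    weight:    rows x cols nested list of FP8 values already decoded to floats
               (an assumption of this sketch).
    scale_inv: ceil(rows/block) x ceil(cols/block) nested list of FP32 scales.
    """
    return [
        [round_to_bf16(w * scale_inv[i // block][j // block])
         for j, w in enumerate(row)]
        for i, row in enumerate(weight)
    ]


# Toy example with 2x2 blocks instead of 128x128.
toy_weight = [
    [1.0, 2.0, 0.5, -1.5],
    [1.0, 1.0, 1.0, 1.0],
    [4.0, 4.0, 4.0, 4.0],
]
toy_scale = [
    [2.0, 0.5],
    [0.25, 3.0],
]
toy_out = dequant_blockwise(toy_weight, toy_scale, block=2)
```

In a tensor library the same indexing is usually expressed by expanding the scale grid to the weight shape and multiplying elementwise, rather than looping in Python.
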

Process per shard:

1. open `model-XXXXX-of-00130.safetensors` from the FP8 source
2. for each `*.weight` (FP8 e4m3fn): look up `*.weight_scale_inv` (FP32, 128×128)
3. broadcast scale to weight shape, multiply, cast to BF16
4. drop the scale tensor
5. write `model-NNNNN-of-00047.safetensors` (5 GB shards)

Other tensors (embeddings, layer norms, MoE routers/gates that were already unquantized in the upstream config's `modules_to_not_convert`) are passed through with a BF16 cast.

## Quick load

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is needed because the repo ships custom modeling .py files.
m = AutoModelForCausalLM.from_pretrained(
    "operationrange/MiniMax-M2.7-BF16",
    torch_dtype="bfloat16",
    device_map="auto",  # spread layers across available GPU and CPU memory
    trust_remote_code=True,
)
tok = AutoTokenizer.from_pretrained(
    "operationrange/MiniMax-M2.7-BF16",
    trust_remote_code=True,
)
```

Inference at full BF16 needs ≥ ~470 GB combined GPU+CPU memory, so this checkpoint is mostly intended as a starting point for further compression (AWQ-INT4, GPTQ-INT8, etc.) rather than direct serving.

## License

Inherits the [MiniMax-M2 license](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE) from the upstream model. No weights were modified, only the storage format.
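
To confirm that a local download matches the layout listed under Contents, the index file can be inspected directly. This is a minimal sketch under the assumption that `model.safetensors.index.json` follows the usual safetensors index layout with a top-level `weight_map`; the function name `check_index` is hypothetical and not part of this repo.

```python
import json


def check_index(index: dict, expected_shards: int = 47) -> dict:
    """Report leftover FP8 scale entries and the number of distinct shard files."""
    weight_map = index["weight_map"]  # tensor name -> shard filename
    leftover = sorted(k for k in weight_map if k.endswith(".weight_scale_inv"))
    shards = set(weight_map.values())
    return {
        "leftover_scales": leftover,
        "num_shards": len(shards),
        "ok": not leftover and len(shards) == expected_shards,
    }


# Usage against a local copy (path is an assumption):
#   with open("model.safetensors.index.json") as f:
#       print(check_index(json.load(f)))
```

A result with an empty `leftover_scales` list and 47 shards is consistent with the description above; it does not verify the tensor values themselves.
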