--- pipeline_tag: text-generation base_model: - MiniMaxAI/MiniMax-M3 license: other license_name: minimax-community-license license_link: https://huggingface.co/MiniMaxAI/MiniMax-M3/blob/main/LICENSE tags: - nvidia - ModelOpt - MiniMax-M3 - quantized - NVFP4 - nvfp4 --- # Model Overview ## Description MiniMax-M3 is a multimodal model with frontier-level coding and agentic capabilities, built on a Mixture-of-Experts architecture with a 1M-token context window. The model processes text, image, video, and computer use inputs and produces text outputs, with emphasis on long-horizon coding tasks, agentic and tool-use workflows, and long-form video understanding. The NVIDIA MiniMax-M3 NVFP4 model is quantized with [Model Optimizer](https://github.com/NVIDIA/Model-Optimizer). *This model is ready for non-commercial use.* ## Third-Party Community Consideration: This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to Non-NVIDIA [MiniMax-M3 Model Card](https://huggingface.co/MiniMaxAI). ### License/Terms of Use: **GOVERNING TERMS:** Use of the checkpoints is governed by the [MiniMax Community License](https://huggingface.co/MiniMaxAI/MiniMax-M3/blob/main/LICENSE).
**Additional Information:** Built with [MiniMax M3](https://huggingface.co/MiniMaxAI/MiniMax-M3). ### Deployment Geography: Global ## Use Case: **Use Case:** MiniMax-M3 is intended for multimodal understanding across text, image, and video; long-form video understanding (up to 30 minutes); long-horizon coding tasks (8+ hours); agentic and tool-use workflows; and design and creative tasks. The model supports two reasoning modes switchable per request: thinking mode for complex reasoning and agentic tasks, and non-thinking mode for latency-sensitive scenarios. ### Release Date: Hugging Face 06/23/2026 via https://huggingface.co/nvidia/MiniMax-M3-NVFP4 ## Model Architecture: **Architecture Type:** Transformer
**Network Architecture:** Mixture-of-Experts (multimodal)
**Total Parameters:** 428B
**Active Parameters:** Approximately 23B per token (A23B)
**Vision Encoder:** ViT for image and video input ### Input: **Input Types:** Text, Image, Video
**Input Formats:** Text: String; Image: RGB images; Video: encoded video file
**Input Parameters:** One-Dimensional (1D), Two-Dimensional (2D), Three-Dimensional (3D)
**Other Input Properties:** Supports long-form video input up to 30 minutes.
**Input Context Length (ISL):** 1 million tokens ### Output: **Output Types:** Text
**Output Format:** String
**Output Parameters:** One-Dimensional (1D)
**Other Output Properties:** None Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. ## Software Integration: **Runtime Engine(s):**
**vLLM** **Supported Hardware Microarchitecture Compatibility:** * NVIDIA Blackwell **Preferred Operating System(s):** * Linux The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. ## Model Version(s): This model is NVFP4 quantized with nvidia-modelopt **v0.44.0** ## Training and Evaluation Datasets: ### Training Dataset **Data Modality:** Text, Image, Video
**Image Training Data Size:** Undisclosed
**Text Training Data Size:** Undisclosed
**Training Data Collection:** Undisclosed
**Training Labeling:** Undisclosed
**Training Properties:** Undisclosed ## Evaluation Dataset: **Datasets:** GPQA Diamond, AA-LCR, τ²-Telecom, MMMU-Pro, and SciCode
**Data Collection Method by dataset:** Hybrid, Automated, Human
**Labeling Method by dataset:** Hybrid, Automated, Human
**Properties:** We evaluated the model on reasoning, instruction-following, agentic, multimodal, and coding benchmarks: GPQA Diamond contains 448 graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry; AA-LCR (Artificial Analysis Long Context Reasoning) tests reasoning and synthesis over long-context inputs spanning multiple documents; τ²-Telecom (tau2-bench) is an agentic tool-use benchmark measuring multi-turn task completion in a telecom customer-service domain; MMMU-Pro is a massive multi-discipline multimodal understanding benchmark with challenging multiple-choice questions requiring image comprehension across diverse academic domains; SciCode evaluates scientific coding capabilities. ## Inference: **Engine:** vLLM **Test Hardware:** NVIDIA Blackwell B200 ## Post Training Quantization This model was obtained by quantizing the weights and activations of Minimax-M3 to NVFP4 data type. This optimization reduces the number of bits per parameter from 8 to 4, reducing disk size and GPU memory requirements by approximately 2x. ## Usage To serve this checkpoint with [vLLM](https://github.com/vllm-project/vllm), you currently need the nightly docker image that includes MiniMax-M3 NVFP4 support from [vllm-project/vllm#46380](https://github.com/vllm-project/vllm/pull/46380) (not yet in a stable release). Launch the nightly image and run the sample command below: ``` vllm serve nvidia/MiniMax-M3-NVFP4 \ --tensor-parallel-size 8 \ --block-size 128 \ --tool-call-parser minimax_m3 \ --reasoning-parser minimax_m3 \ --enable-auto-tool-choice ``` ### Evaluation **NVFP4 Quantization Accuracy (vs. FP8 baseline):** | **Precision** | **GPQA Diamond** | **AA-LCR** | **τ²-Telecom** | **MMMU-Pro** | **SciCode** | |---|---|---|---|---|---| | FP8 | **92.53** | **76.62** | **92.22** | **71.97** | **49.90** | | NVFP4 | **91.92** | **75.60** | **91.89** | **71.01** | **49.70** | Baseline: MiniMax-M3 in its native MXFP8 format. Benchmarked with temperature=1.0, top_p=0.95, max num tokens 65536. ## Model Limitations: The base model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive. ## Ethical Considerations NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. Developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).