Update README: no PR needed, since the base nightly image works well on Blackwell

Verified myself on an RTX 6000 Pro:
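
For reference, the `non-default args` line in the startup log below corresponds roughly to the invocation sketched here. It is reconstructed from the logged values and is not necessarily the exact command that was run. The `serve` helper and the `RUN` switch are hypothetical conveniences added for illustration, and the default model path is the local one that appears in the log.

```shell
# Sketch only: flags reconstructed from the "non-default args" log line.
# MODEL_PATH defaults to the local path seen in the log (an assumption about
# your setup); point it at your own copy of the checkpoint.
MODEL_PATH="${MODEL_PATH:-/app/efs/models/minimax-m3-nvfp4}"
# RUN is a hypothetical switch: 0 prints the command, 1 executes it.
RUN="${RUN:-0}"

serve() {
  # Collect the command in the function's positional parameters.
  set -- vllm serve "$MODEL_PATH" \
    --served-model-name minimax-m3 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --block-size 128 \
    --gpu-memory-utilization 0.95 \
    --max-model-len=-1 \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m3 \
    --reasoning-parser minimax_m3
  if [ "$RUN" = "1" ]; then
    "$@"                  # launch the server (inside the nightly image)
  else
    printf '%s\n' "$*"    # dry run: print the command on one line
  fi
}

serve
```

By default the script only prints the command, so it can be reviewed before being run with `RUN=1` inside the nightly container.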
```shell
(APIServer pid=1) INFO 06-26 06:55:48 [api_utils.py:339]
(APIServer pid=1) INFO 06-26 06:55:48 [api_utils.py:339] █ █ █▄ ▄█
(APIServer pid=1) INFO 06-26 06:55:48 [api_utils.py:339] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.23.1rc1.dev471+ge312c5cb2
(APIServer pid=1) INFO 06-26 06:55:48 [api_utils.py:339] █▄█▀ █ █ █ █ model /app/efs/models/minimax-m3-nvfp4
(APIServer pid=1) INFO 06-26 06:55:48 [api_utils.py:339] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 06-26 06:55:48 [api_utils.py:339]
(APIServer pid=1) INFO 06-26 06:55:48 [api_utils.py:273] non-default args: {'model_tag': '/app/efs/models/minimax-m3-nvfp4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'minimax_m3', 'model': '/app/efs/models/minimax-m3-nvfp4', 'trust_remote_code': True, 'max_model_len': -1, 'served_model_name': ['minimax-m3'], 'reasoning_parser': 'minimax_m3', 'tensor_parallel_size': 4, 'block_size': 128, 'gpu_memory_utilization': 0.95}
(APIServer pid=1) WARNING 06-26 06:55:48 [envs.py:2019] Unknown vLLM environment variable detected: VLLM_BUILD_URL
(APIServer pid=1) WARNING 06-26 06:55:48 [envs.py:2019] Unknown vLLM environment variable detected: VLLM_IMAGE_TAG
(APIServer pid=1) WARNING 06-26 06:55:48 [envs.py:2019] Unknown vLLM environment variable detected: VLLM_BUILD_PIPELINE
(APIServer pid=1) WARNING 06-26 06:55:48 [envs.py:2019] Unknown vLLM environment variable detected: VLLM_BUILD_COMMIT
(APIServer pid=1) INFO 06-26 06:56:00 [model.py:598] Resolved architecture: MiniMaxM3SparseForConditionalGeneration
(APIServer pid=1) INFO 06-26 06:56:00 [model.py:1725] Using max model len 1048576
(APIServer pid=1) INFO 06-26 06:56:01 [scheduler.py:252] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) WARNING 06-26 06:56:01 [modelopt.py:384] Detected ModelOpt fp8 checkpoint (quant_algo=FP8). Please note that the format is experimental and could change.
(APIServer pid=1) WARNING 06-26 06:56:01 [modelopt.py:1028] Detected ModelOpt NVFP4 checkpoint (quant_algo=NVFP4). Please note that the format is experimental and could change in future.
(APIServer pid=1) WARNING 06-26 06:56:01 [modelopt.py:1028] Detected ModelOpt NVFP4 checkpoint (quant_algo=W4A16_NVFP4). Please note that the format is experimental and could change in future.
(APIServer pid=1) INFO 06-26 06:56:01 [vllm.py:1006] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 06-26 06:56:01 [vllm.py:1094] Auto-enabling VLLM_USE_BREAKABLE_CUDAGRAPH=1. Set VLLM_USE_BREAKABLE_CUDAGRAPH=0 to opt out.
(APIServer pid=1) WARNING 06-26 06:56:01 [vllm.py:1100] VLLM_USE_BREAKABLE_CUDAGRAPH is set, disabling vLLM's torch.compile pipeline. Equivalent to -cc.mode=none.
(APIServer pid=1) WARNING 06-26 06:56:01 [vllm.py:1110] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 06-26 06:56:01 [kernel.py:276] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(APIServer pid=1) INFO 06-26 06:56:02 [compilation.py:310] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=1654) INFO 06-26 06:56:13 [core.py:114] Initializing a V1 LLM engine (v0.23.1rc1.dev471+ge312c5cb2) with config: model='/app/efs/models/minimax-m3-nvfp4', speculative_config=None, tokenizer='/app/efs/models/minimax-m3-nvfp4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1048576, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_mixed, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='minimax_m3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False, jit_monitor_mode='warn', jit_monitor_verbose=False), seed=0, served_model_name=minimax-m3, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 
'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=True, moe_backend='auto', linear_backend='auto')
(EngineCore pid=1654) WARNING 06-26 06:56:13 [multiproc_executor.py:1063] Reducing Torch parallelism from 48 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=1654) INFO 06-26 06:56:13 [multiproc_executor.py:140] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.0.21.126 (local), world_size=4, local_world_size=4
(Worker pid=1830) INFO 06-26 06:56:21 [parallel_state.py:1588] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:60377 backend=nccl
(Worker pid=1836) INFO 06-26 06:56:26 [parallel_state.py:1588] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:60377 backend=nccl
(Worker pid=1848) INFO 06-26 06:56:31 [parallel_state.py:1588] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:60377 backend=nccl
(Worker pid=1863) INFO 06-26 06:56:35 [parallel_state.py:1588] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:60377 backend=nccl
(Worker pid=1830) INFO 06-26 06:56:35 [pynccl.py:113] vLLM is using nccl==2.28.9
(Worker pid=1830) WARNING 06-26 06:56:36 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=1836) WARNING 06-26 06:56:36 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=1848) WARNING 06-26 06:56:36 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=1863) WARNING 06-26 06:56:36 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
(Worker pid=1836) WARNING 06-26 06:56:36 [custom_all_reduce.py:151] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=1863) WARNING 06-26 06:56:36 [custom_all_reduce.py:151] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=1848) WARNING 06-26 06:56:36 [custom_all_reduce.py:151] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=1830) WARNING 06-26 06:56:36 [custom_all_reduce.py:151] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=1830) INFO 06-26 06:56:36 [cuda_communicator.py:245] Using ['PYNCCL'] all-reduce backends (in dispatch order) for group 'tp:0' out of potential backends: ['NCCL_SYMM_MEM', 'QUICK_REDUCE', 'FLASHINFER', 'CUSTOM', 'SYMM_MEM', 'PYNCCL'].
(Worker pid=1830) INFO 06-26 06:56:36 [cuda_communicator.py:245] Using ['PYNCCL'] all-reduce backends (in dispatch order) for group 'ep:0' out of potential backends: ['NCCL_SYMM_MEM', 'QUICK_REDUCE', 'FLASHINFER', 'CUSTOM', 'SYMM_MEM', 'PYNCCL'].
(Worker pid=1830) INFO 06-26 06:56:36 [parallel_state.py:1923] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=1830) INFO 06-26 06:56:37 [topk_topp_sampler.py:55] Using FlashInfer for top-p & top-k sampling.
(Worker_TP0 pid=1830) INFO 06-26
```

````diff
@@ -116,7 +116,7 @@ This model is NVFP4 quantized with nvidia-modelopt **v0.44.0**
 This model was obtained by quantizing the weights and activations of Minimax-M3 to NVFP4 data type. This optimization reduces the number of bits per parameter from 8 to 4, reducing disk size and GPU memory requirements by approximately 2x.
 
 ## Usage
-To serve this checkpoint with [vLLM](https://github.com/vllm-project/vllm), you currently need the nightly docker image
+To serve this checkpoint with [vLLM](https://github.com/vllm-project/vllm), you currently need the nightly docker image. Launch the nightly image and run the sample command below:
 
 ```
 vllm serve nvidia/MiniMax-M3-NVFP4 \
````