# Validation

## Target

- Kernel family: MiniMax M3 sparse attention (MSA)
- Package: `flashrt/MiniMaxAI-msa-blackwell`
- HF Jobs package selector: `MiniMaxAI-msa-blackwell`
- Package version: v1 Blackwell native-helper package
- Target GPU family: Blackwell CUDA compute capability 12.x
- Validated GPU: SM121 / GB10 / DGX Spark
- Dtype: BF16 inputs with FP32 accumulation references
- Layout: paged KV cache
- Model path: FlashRT MiniMax-Spark runtime on DGX Spark / GB10

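The difference between the targeted family (compute capability 12.x) and the
validated device (SM121) can be made explicit in a small guard. The sketch
below is only an illustration: the function name and the returned labels are
hypothetical and are not part of the package. Only the capability numbers come
from the list above.

```python
# Hypothetical helper; only the capability numbers come from the Target list.
VALIDATED_CAPABILITY = (12, 1)  # SM121 / GB10 / DGX Spark
TARGET_MAJOR = 12               # Blackwell compute capability 12.x


def classify_capability(major: int, minor: int) -> str:
    """Label a CUDA compute capability relative to this package's target."""
    if (major, minor) == VALIDATED_CAPABILITY:
        return "validated"
    if major == TARGET_MAJOR:
        return "targeted"
    return "out_of_target"


if __name__ == "__main__":
    # With PyTorch installed, the tuple would normally come from
    # torch.cuda.get_device_capability(); fixed tuples keep this runnable
    # on machines without a GPU.
    for capability in [(12, 1), (12, 0), (10, 0)]:
        print(capability, classify_capability(*capability))
```

A guard of this kind would let a caller warn when running on a 12.x device
other than the one the results in this document were measured on.
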
## Correctness Gate

Run quick validation:

```bash
PYTHONPATH=MiniMaxAI-msa-blackwell/torch-ext \
  python MiniMaxAI-msa-blackwell/tests/test_msa_blackwell.py --quick
```

Run full validation:

```bash
PYTHONPATH=MiniMaxAI-msa-blackwell/torch-ext \
  python MiniMaxAI-msa-blackwell/tests/test_msa_blackwell.py
```

Run standalone long-context validation:

```bash
PYTHONPATH=MiniMaxAI-msa-blackwell/torch-ext \
  python MiniMaxAI-msa-blackwell/tests/test_msa_blackwell.py --long-context
```

Expected full coverage:

| Area | Shapes | Reference | Required |
|---|---:|---|---|
| API surface | official `MiniMaxAI/msa` public names | `api_status.py` | all official root names exported; no unsupported public root API entries |
| Native CUDA top-k helper | heads 64, batch 1-2, blocks 1-256 | PyTorch top-k over valid blocks | exact set match |
| Decode sparse GQA attention | ctx 128, 2048, 4096, 32768 | paged FP32 PyTorch | cos >= 0.999, max_abs <= 5e-2 |
| Prefill sparse GQA attention | ctx 512, 4096 | paged causal FP32 PyTorch | cos >= 0.999, max_abs <= 5e-2 |
| Decode sparse GQA attention with sink | ctx 2048, 32768 | paged FP32 PyTorch | cos >= 0.999, max_abs <= 5e-2 |
| Official decode API wrapper | ctx 2048, 4096 | direct Blackwell decode kernel | cos = 1.0, max_abs = 0 |
| Official CSR prefill API wrapper | ctx 512, 2048 | direct Blackwell prefill kernel | cos = 1.0, max_abs = 0 under CSR-preserved block order |
| Official NVFP4 prefill API wrapper | ctx 512 BF16 dispatch path | `sparse_atten_func` | cos = 1.0, max_abs = 0 |
| Native CUDA NVFP4 dequant | rows/cols `(1,128)`, `(257,128)`, `(64,4096)` | Python NVFP4 reference | exact BF16 match |
| Official FP4 indexer API | tiny FP4 packed tensors; native artifact path when built | PyTorch block-score reference | returns official score layout |
| Decode lightning indexer | ctx 2048, 4096, 32768 | PyTorch blockmax top-k set | overlap >= 0.99 |
| Standalone long-context decode | ctx 65536, 131072 | paged FP32 PyTorch / direct kernel | cos >= 0.999; wrapper max_abs = 0 |
| Installed-artifact native long top-k | blocks 512, 1024 | PyTorch top-k over valid blocks | exact set match |

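The `cos >= 0.999, max_abs <= 5e-2` requirement in the table combines a
direction check with an absolute-error check. A minimal sketch of such a gate
is shown below, assuming both outputs have been flattened to lists of floats.
The function names are hypothetical, and the actual test script may compute
these metrics differently (for example on FP32 tensors).

```python
import math

COS_MIN = 0.999     # "cos >= 0.999" from the coverage table
MAX_ABS_MAX = 5e-2  # "max_abs <= 5e-2" from the coverage table


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_abs_error(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def passes_gate(out, ref):
    """True only if both thresholds hold."""
    return (cosine_similarity(out, ref) >= COS_MIN
            and max_abs_error(out, ref) <= MAX_ABS_MAX)


if __name__ == "__main__":
    ref = [0.5, -1.25, 2.0, 0.0]
    out = [0.501, -1.249, 1.998, 0.001]
    print("gate passed:", passes_gate(out, ref))
```

Using both metrics matters because cosine similarity alone is insensitive to
a large error on a single element of a large-magnitude vector, while
`max_abs` alone says nothing about overall direction.
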
API surface validation:

```bash
PYTHONPATH=MiniMaxAI-msa-blackwell/torch-ext \
  python -m pytest MiniMaxAI-msa-blackwell/tests/test_api_surface.py -q
```

The test tracks every official `MiniMaxAI/msa` public API name:

- `sparse_atten_func`
- `sparse_atten_nvfp4_kv_func`
- `sparse_decode_atten_func`
- `SparseDecodePagedAttentionWrapper`
- `fp4_indexer_block_scores`
- `build_k2q_csr`
- `SparseK2qCsrBuilderSm100`
- `Nvfp4QuantizedTensor`
- `quantize_bf16_to_nvfp4_128x4`
- `quantize_kv_bf16_to_nvfp4_128x4`
- `dequantize_nvfp4_128x4_to_bf16`
- `swizzle_nvfp4_scale_to_128x4`
- `nvfp4_global_scale_from_amax`

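An export check over the names listed above can be sketched in a few lines.
This is an illustration of the idea only and is not the content of
`test_api_surface.py`. The helper name is hypothetical, and a stand-in
namespace replaces the real package root so the example runs without the
package installed.

```python
from types import SimpleNamespace

# Names copied from the list above.
OFFICIAL_NAMES = [
    "sparse_atten_func",
    "sparse_atten_nvfp4_kv_func",
    "sparse_decode_atten_func",
    "SparseDecodePagedAttentionWrapper",
    "fp4_indexer_block_scores",
    "build_k2q_csr",
    "SparseK2qCsrBuilderSm100",
    "Nvfp4QuantizedTensor",
    "quantize_bf16_to_nvfp4_128x4",
    "quantize_kv_bf16_to_nvfp4_128x4",
    "dequantize_nvfp4_128x4_to_bf16",
    "swizzle_nvfp4_scale_to_128x4",
    "nvfp4_global_scale_from_amax",
]


def missing_public_names(module, names=OFFICIAL_NAMES):
    """Return the official names that the module object does not expose."""
    return [name for name in names if not hasattr(module, name)]


if __name__ == "__main__":
    # Stand-in for the package root; a real check would import the package.
    fake_root = SimpleNamespace(**{name: object() for name in OFFICIAL_NAMES})
    print("missing:", missing_public_names(fake_root))
```
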
The root module exports every official public name. Decode, CSR prefill, NVFP4
prefill compatibility, FP4 block scoring, CSR helper, and NVFP4 helper names
are all callable. Hub-built artifacts use compiled CUDA ops for the MiniMax-M3
Blackwell decode route, the FP4 block-score indexer, and the swizzled
NVFP4 -> BF16 dequantization path. Source-tree mode keeps reference paths so
that the API remains testable before the extension is built.

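The split between built artifacts and source-tree mode described above is a
common "try the compiled module, otherwise fall back" pattern. The sketch
below illustrates it in general terms. The function name and the module name
in the usage line are hypothetical, and the package's real loading logic may
differ.

```python
import importlib


def select_backend(native_module_name: str) -> str:
    """Return 'native' if the compiled module imports, else 'reference'."""
    try:
        importlib.import_module(native_module_name)
    except ImportError:
        # Extension not built: keep the pure-reference path available.
        return "reference"
    return "native"


if __name__ == "__main__":
    # Hypothetical module name, used only to show the fallback branch.
    print(select_backend("hypothetical_msa_native_ops"))
```
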
## FlashRT Integration Note

FlashRT has validated the decode sparse path on SM121 over context lengths
128 to 32768 with cosine similarity >= 0.999. The 32768 context length has
also been exercised in the FlashRT MiniMax-Spark model runtime on DGX Spark /
GB10, so it is the current end-to-end model validation boundary. That
end-to-end validation is intentionally kept as a FlashRT runtime validation
item, while this Hub package exposes the standalone kernel API for community
use.

The standalone package kernel tests additionally cover 65536 and 131072
context lengths. These long-context rows validate the kernel and API wrapper
contract outside the full model runtime; they should not be described as
MiniMax-Spark end-to-end model validation until the full runtime path is rerun
at those lengths.

## Native Helper Compile Smoke

Before publishing through HF Jobs, the native helper was compiled locally as a
PyTorch extension from the same source files:

- `torch-ext/torch_binding.cpp`
- `csrc/msa_topk_from_scores.cu`
- `csrc/msa_decode_attn.cu`
- `csrc/msa_decode_attn_mma.cu`
- `csrc/msa_indexer_block_scores.cu`
- `csrc/msa_nvfp4_dequant.cu`

Environment:

| Field | Value |
|---|---|
| GPU | NVIDIA GeForce RTX 5090 |
| PyTorch | 2.9.1+cu128 |
| nvcc | CUDA 13.0 |
| Target arch | sm_120 |

Result:

| Check | Shape | Reference | Verdict |
|---|---:|---|---|
| Native score -> top-k | heads 64, batch 1, blocks 256, topk 16 | PyTorch top-k set | PASS |
| Native FP4 block-score indexer | official `[Hq, blocks, total_q]` score layout | PyTorch block-score reference | PASS |
| Native NVFP4 swizzled -> BF16 dequant | rows/cols `(1,128)`, `(257,128)`, `(64,4096)` | Python NVFP4 reference | PASS |

## Blackwell Package Validation

Remote Blackwell validation environment:

| Field | Value |
|---|---|
| Host | `spark-f517` |
| GPU | NVIDIA GB10 |
| Compute capability | 12.1 |
| Driver | 580.159.03 |
| Python | 3.12.3 |
| PyTorch | 2.12.0+cu130 |
| Triton | 3.7.0 |

Command:

```bash
PY=/home/leadtek/jax/bin/python
PYTHONPATH=MiniMaxAI-msa-blackwell/torch-ext \
  $PY MiniMaxAI-msa-blackwell/tests/test_msa_blackwell.py
```

Result:

| Check | Shape | Cosine | Max abs / overlap | Verdict |
|---|---|---:|---:|---|
| Decode sparse GQA | ctx128_b1 | 0.999998 | 1.6032e-03 | PASS |
| Decode sparse GQA | ctx2048_b1 | 0.999996 | 4.9090e-04 | PASS |
| Decode sparse GQA | ctx2048_b2_sink | 0.999996 | 6.8302e-04 | PASS |
| Decode sparse GQA | ctx4096_b1 | 0.999996 | 4.5899e-04 | PASS |
| Decode sparse GQA | ctx4096_b2_mixed | 0.999996 | 7.3129e-04 | PASS |
| Decode sparse GQA | ctx32768_b1 | 0.999996 | 6.9451e-04 | PASS |
| Decode sparse GQA | ctx32768_b1_sink | 0.999996 | 5.6115e-04 | PASS |
| Decode sparse GQA | ctx65536_b1 | 0.999996 | 4.3470e-04 | PASS |
| Decode sparse GQA | ctx131072_b1 | 0.999996 | 7.1825e-04 | PASS |
| Decode top-k indexer | ctx2048 | n/a | overlap 1.000 | PASS |
| Decode top-k indexer | ctx4096 | n/a | overlap 1.000 | PASS |
| Decode top-k indexer | ctx32768 | n/a | overlap 1.000 | PASS |
| Decode top-k indexer | ctx65536 | n/a | overlap 1.000 | PASS |
| Decode top-k indexer | ctx131072 | n/a | overlap 1.000 | PASS |
| Official decode wrapper | ctx2048 | 1.000000 | 0.0000e+00 | PASS |
| Official decode wrapper | ctx4096 | 1.000000 | 0.0000e+00 | PASS |
| Official decode wrapper | ctx65536 | 1.000000 | 0.0000e+00 | PASS |
| Official decode wrapper | ctx131072 | 1.000000 | 0.0000e+00 | PASS |
| Native CUDA NVFP4 dequant | rows1_cols128 | 1.000000 | 0.0000e+00 | PASS |
| Native CUDA NVFP4 dequant | rows257_cols128 | 1.000000 | 0.0000e+00 | PASS |
| Native CUDA NVFP4 dequant | rows64_cols4096 | 1.000000 | 0.0000e+00 | PASS |

Installed-artifact native top-k validation on RTX 5090 / torch 2.11 / CUDA
12.8:

| Context | Blocks | Overlap | Verdict |
|---:|---:|---:|---|
| 32768 | 256 | 1.000 | PASS |
| 65536 | 512 | 1.000 | PASS |
| 131072 | 1024 | 1.000 | PASS |

The warning `tl.make_block_ptr is deprecated` appears with Triton 3.7.0. It is
a deprecation warning, not a correctness failure.

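The installed-artifact top-k rows pair each context length with a block
count, and all three rows are consistent with 128 tokens per block
(32768 / 256 = 65536 / 512 = 131072 / 1024 = 128). The sketch below reproduces
that arithmetic and shows one plausible definition of the overlap metric. The
block size is inferred from the table rather than stated by the package, and
the overlap formula (intersection size over reference size) and the helper
names are assumptions.

```python
BLOCK_TOKENS = 128  # inferred from the table rows, not a documented constant


def num_blocks(context_len: int, block_tokens: int = BLOCK_TOKENS) -> int:
    """Number of KV blocks needed to cover a context (ceiling division)."""
    return -(-context_len // block_tokens)


def topk_overlap(selected, reference) -> float:
    """Fraction of reference block indices that were also selected."""
    ref = set(reference)
    return len(set(selected) & ref) / len(ref)


if __name__ == "__main__":
    for ctx in (32768, 65536, 131072):
        print(ctx, num_blocks(ctx))
    # Order does not matter for a set comparison.
    print(topk_overlap([7, 3, 1], [1, 3, 7]))
```

Under this definition an overlap of 1.000 means the selected set contains
every reference block, which corresponds to the "exact set match" requirement
when both sets have the same size.
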
## Native Alignment Status

The upstream `MiniMaxAI/msa` package targets SM100. This package targets
Blackwell compute capability 12.x instead. It keeps the same public API
surface where practical and provides native CUDA implementations for the hot
paths needed by the FlashRT MiniMax-Spark runtime:

- score-to-top-k sparse block selection;
- tensor-core sparse decode for the MiniMax-M3 Blackwell shape;
- FP4 block-score indexing;
- swizzled NVFP4 -> BF16 dequantization for the W4A16 path.

The CSR prefill wrapper remains part of the public compatibility surface and is
validated against the package reference path. Unsupported shapes and
parameters raise explicit errors rather than silently producing wrong results.