Working Config for H200
So it turns out there is the nightly build and then there is the "Release"-tagged build. I ended up installing the tagged release rather than the nightly, even though they report the same version number. The nightly is the one with the fixes and the Day-0 support for this model. If you're running down this rabbit hole, be warned: you're probably not on the right version.
Memory Mathematics for 64K Context
KV Cache Calculation:

KV_cache_per_token = 2 × layers × hidden_size × dtype_size
                   = 2 × 32 × 4096 × 2 bytes
                   = 524,288 bytes = 512 KiB

(The leading 2 covers the separate K and V tensors.)

For 64K context (65,536 tokens):

Total_KV = 65,536 × 512 KiB = 32 GiB ≈ 34.4 GB

Total GPU memory (TP=2):

KV cache: 34.4 GB (replicated across TP in this setup)
Model weights: 15 GB per GPU (~15B active params × 2 bytes ÷ TP=2)
Working memory: ~10 GB
TOTAL: ~59.4 GB per GPU

3. H200 Capacity Analysis

H200 VRAM: 143 GB
Safe limit (80%): 114.4 GB
Required for 64K (TP=2): ~59.4 GB
Headroom: ~55 GB (48%)
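The arithmetic above can be sketched as a quick sanity check. All sizes below are this section's estimates (bf16, 32 layers, hidden size 4096, ~15 GB weights and ~10 GB working memory per GPU), not measured values:

```python
# Hedged sketch of the memory math above; the constants are this
# section's estimates, not values read from the model config.

LAYERS = 32
HIDDEN = 4096
DTYPE_BYTES = 2          # bf16/fp16
CONTEXT = 65_536         # 64K tokens

# K and V caches per token, per GPU (assumes full replication across TP,
# which is what this setup observed)
kv_per_token = 2 * LAYERS * HIDDEN * DTYPE_BYTES      # bytes
total_kv_gb = kv_per_token * CONTEXT / 1e9            # decimal GB

weights_gb = 15.0        # ~15B active params x 2 bytes / TP=2
working_gb = 10.0        # rough activation / scratch estimate
per_gpu_gb = total_kv_gb + weights_gb + working_gb

h200_gb = 143.0
safe_gb = 0.80 * h200_gb
headroom_gb = safe_gb - per_gpu_gb

print(f"KV per token : {kv_per_token / 1024:.0f} KiB")   # 512 KiB
print(f"Total KV     : {total_kv_gb:.1f} GB")
print(f"Per-GPU total: {per_gpu_gb:.1f} GB")
print(f"Headroom     : {headroom_gb:.1f} GB")
```

Swapping in a different context length or dtype size makes it easy to re-check the headroom before touching the launch flags.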
4. Correct Settings for 64K Context

Use TP=2 (mathematically required):

python3 -m sglang.launch_server \
    --model-path /home/llm/MiMo-V2-Flash \
    --port 8203 \
    --tp 2 --dp 4 \
    --context-length 65536 \
    --mem-fraction-static 0.70   # ~100 GB limit per GPU
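As a rough check on the memory-fraction value (assuming the fraction is applied to total device memory, and using the 143 GB figure from above):

```python
# What a 0.70 memory fraction means on a 143 GB H200.
# Assumption: the fraction scales total device memory.
H200_GB = 143.0
MEM_FRACTION = 0.70

budget_gb = H200_GB * MEM_FRACTION
print(f"{budget_gb:.0f} GB usable per GPU")   # ~100 GB
```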
A bit more detail:
• TP=2: 6144 = 8 × 768 → math works, but hits a weight-replication bug
• TP=4: 3072 = 4 × 768 → math works, but SGLang expects 768, not 3072
• TP=8: 1536 = 2 × 768 → math works, but the sharding bug yields 2304

What the analysis confirmed:

• Base dimension of 768, taken from the error messages ✓
• All TP configurations that evenly divide the attention dimensions ✓
• Proper memory calculations for 64K context ✓
• Weight replication vs. sharding issues ✓
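The per-TP bullets can be reproduced with a small divisibility check. Note the full attention dimension of 12288 is inferred from the bullets above (6144 = 12288/2, 3072 = 12288/4, 1536 = 12288/8), not read from the model config:

```python
# Shard-size check for candidate TP values. ATTN_DIM = 12288 is inferred
# from the bullets above; BASE = 768 is the unit SGLang's error
# messages reported.
ATTN_DIM = 12_288
BASE = 768

for tp in (2, 4, 8):
    shard = ATTN_DIM // tp
    ok = ATTN_DIM % tp == 0 and shard % BASE == 0
    print(f"TP={tp}: shard={shard} = {shard // BASE} x {BASE} "
          f"-> {'divides cleanly' if ok else 'does NOT divide'}")
```

All three candidates divide cleanly, which is exactly why the failures point at the framework rather than the configuration.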
The failure mode reveals an SGLang framework bug, not a mathematical error.

Evidence:

• TP=8 gives 2304 instead of the expected 1536 = 2 × 768
• This points to an incorrect sharding algorithm, not incorrect math
• The error occurs during weight loading, not during dimension calculation
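To make the mismatch concrete (a sketch; 2304 is the value from the TP=8 error, and 12288 is the inferred full attention dimension used above):

```python
# Expected vs. observed shard width at TP=8. 12288 is the inferred full
# attention dimension; 2304 is what SGLang actually produced.
ATTN_DIM = 12_288
TP = 8

expected = ATTN_DIM // TP   # 1536 = 2 x 768
observed = 2304             # from the SGLang error message

print(f"expected {expected}, observed {observed}, "
      f"ratio {observed / expected}")
# observed / expected = 1.5, i.e. the sharder produced 3 x 768
# where 2 x 768 was expected
```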
Our mathematical analysis correctly predicted every viable configuration, but it also exposed that SGLang's custom model implementation is fundamentally broken for MiMo-V2-Flash at every TP value. The fix has to come at the framework level, or via model conversion; no amount of further mathematical tuning will help.