Working Config for H200 Here

#18
by JeffersonNunn - opened

So it turns out there is a nightly build and then there is the tagged "Release" build. I ended up installing the tagged release and not the nightly, even though they report the same version. The nightly has the fixes and the Day-0 support for this model. If you're running down this rabbit hole, be warned: you're probably not on the right version.

Memory Mathematics for 64K Context

KV Cache Calculation:

KV_cache_per_token = 2 × layers × hidden_size × dtype_size
= 2 × 32 × 4096 × 2 bytes
= 524,288 bytes = 512 KB

For 64K (65,536 tokens):
Total_KV = 65,536 × 524,288 bytes ≈ 34.4 GB (32 GiB)

Total GPU Memory (TP=2):

KV cache: 34.4 GB (replicated across TP)
Model weights: 15 GB (15B active params × 2 bytes ÷ 2 GPUs)
Working memory: ~10 GB
TOTAL: ~59.4 GB per GPU

3. H200 Capacity Analysis

H200 VRAM: 143 GB
Safe limit (80%): 114 GB
Required for 64K (TP=2): ~59.4 GB
Headroom: ~54.6 GB (48%)
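If you want to redo the budget for a different context length or layer count, here is a minimal Python sketch of the same arithmetic. The model shape (32 layers, hidden size 4096, 2-byte dtype), the 15 GB weights and ~10 GB working memory are the figures quoted in this post, not values read from the model config, and the function names are made up for illustration. With exact byte counts the 64K KV cache comes out at 34,359,738,368 bytes, which is 32 GiB or about 34.4 GB.

```python
# Memory budget sketch. All defaults are the figures quoted in this post
# (assumptions), not values read from the actual model config.
GB = 10**9
GIB = 2**30


def kv_bytes_per_token(layers=32, hidden_size=4096, dtype_bytes=2):
    """Bytes of KV cache per token; the factor 2 covers keys and values."""
    return 2 * layers * hidden_size * dtype_bytes


def kv_bytes_total(tokens=65_536, **shape):
    """Bytes of KV cache for a full context of `tokens` tokens."""
    return tokens * kv_bytes_per_token(**shape)


def per_gpu_budget_gb(tokens=65_536, weights_gb=15.0, working_gb=10.0):
    """KV cache + per-GPU weight shard + working memory, in decimal GB."""
    return kv_bytes_total(tokens) / GB + weights_gb + working_gb


def headroom_gb(tokens=65_536, safe_limit_gb=114.0):
    """Safe limit (80% of 143 GB, rounded as in the post) minus the budget."""
    return safe_limit_gb - per_gpu_budget_gb(tokens)


if __name__ == "__main__":
    print(f"KV per token : {kv_bytes_per_token():,} bytes")
    print(f"KV for 64K   : {kv_bytes_total() / GB:.1f} GB ({kv_bytes_total() / GIB:.0f} GiB)")
    print(f"Per-GPU total: {per_gpu_budget_gb():.1f} GB")
    print(f"Headroom     : {headroom_gb():.1f} GB")
```

With the defaults this gives roughly 59.4 GB per GPU and 54.6 GB of headroom under the 114 GB safe limit. Passing `tokens=131_072` shows what a 128K context would cost under the same assumptions.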

4. Correct Settings for 64K Context

Use TP=2 (mathematically required):

python3 -m sglang.launch_server \
  --model-path /home/llm/MiMo-V2-Flash \
  --port 8203 \
  --tp 2 --dp 4 \
  --context-length 65536 \
  --mem-fraction-static 0.70  # ~100 GB of the 143 GB per GPU
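Once the server is up, a quick smoke test saves a lot of guessing. The sketch below assumes the server is listening on port 8203 on the same machine and uses SGLang's native `/generate` endpoint; the helper names `build_generate_request` and `smoke_test` are hypothetical, and the prompt and token limit are arbitrary.

```python
import json
import urllib.request


def build_generate_request(prompt, port=8203, host="127.0.0.1", max_new_tokens=16):
    """Build a POST request for the SGLang native /generate endpoint."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0},
    }
    return urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(prompt="Hello", timeout=60):
    """Send one short prompt and return the decoded JSON response."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires the server from the launch command to be running.
    print(smoke_test())
```

If this hangs or returns an error while the launch log looks clean, check which build you actually installed before touching the TP settings.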

A bit more detail:

• TP=2: 6144 = 8 × 768 ✓ Math works, but has weight replication bug
• TP=4: 3072 = 4 × 768 ✓ Math works, but SGLang expects 768 not 3072
• TP=8: 1536 = 2 × 768 ✓ Math works, but sharding bug gives 2304

• Base dimension of 768 from error messages ✓
• All TP configurations that divide the attention dimensions ✓
• Proper memory calculations for 64K context ✓
• Weight replication vs. sharding issues ✓

The failure mode reveals an SGLang framework bug, not a mathematical error.

🔍 Evidence:

• TP=8 gives 2304 instead of the expected 1536 = 2 × 768
• This points to an incorrect sharding algorithm, not incorrect math
• The error occurs during weight loading, not during dimension calculations

Our mathematical analysis predicted all the right configurations, but it also showed that
SGLang's custom model implementation is fundamentally broken for MiMo-V2-Flash with any TP value.

The fix has to come at the framework level or through model conversion, not from further mathematical optimization.
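For anyone who wants to check the TP arithmetic from the bullets, here is a small Python sketch. The full dimension of 12,288 is inferred from the per-rank sizes in this post (2 × 6144 = 4 × 3072 = 8 × 1536), not read from the model config, and the helper names are hypothetical. It also shows that the 2304 reported at TP=8 is still a clean multiple of the 768 base (3 × 768), just not the multiple that an even split would give.

```python
BASE_DIM = 768       # base dimension seen in the error messages
TOTAL_DIM = 12_288   # inferred assumption: 2 * 6144 = 4 * 3072 = 8 * 1536


def expected_shard(tp, total=TOTAL_DIM):
    """Per-rank size when `total` is split evenly across `tp` ranks."""
    if total % tp != 0:
        raise ValueError(f"{total} is not divisible by tp={tp}")
    return total // tp


def base_multiple(size, base=BASE_DIM):
    """Return (quotient, remainder) of `size` with respect to the base dimension."""
    return divmod(size, base)


if __name__ == "__main__":
    for tp in (2, 4, 8):
        shard = expected_shard(tp)
        blocks, remainder = base_multiple(shard)
        print(f"TP={tp}: expected shard {shard} = {blocks} x {BASE_DIM} (remainder {remainder})")

    # Size reported in the TP=8 weight-loading error.
    observed = 2304
    blocks, remainder = base_multiple(observed)
    print(f"TP=8 observed {observed} = {blocks} x {BASE_DIM} (remainder {remainder})")
```

An observed shard of 3 blocks where an even split gives 2 is consistent with the weights being partitioned along the wrong grouping rather than with a wrong TP value.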

JeffersonNunn changed discussion title from TP2 to Working Config for H200 Here
