Working Config for H200 Here

#18

by JeffersonNunn - opened Dec 21, 2025

Dec 21, 2025 • edited Dec 21, 2025

It turns out there are two builds: the nightly and the tagged "Release". I ended up installing the tagged release instead of the nightly, even though both report the same version. The nightly is the one that has the fixes and the Day0 release for this model. If you're running down this rabbit hole, be warned: you're probably not on the right version.

Memory Mathematics for 64K Context

KV Cache Calculation:

KV_cache_per_token = 2 × layers × hidden_size × dtype_size
= 2 × 32 × 4096 × 2 bytes
= 524,288 bytes = 512 KB
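
For anyone who wants to re-run this arithmetic, here is a minimal Python sketch. The function names are made up for illustration, and the inputs (32 layers, hidden size 4096, 2-byte dtype) are simply the numbers used in this post, not values read from the model's config. The formula assumes every layer caches a full-width K and V vector per token; a model with grouped-query or sliding-window attention would need less.

```python
def kv_cache_bytes_per_token(layers: int, hidden_size: int, dtype_bytes: int) -> int:
    """Bytes of KV cache for one token: one K and one V vector per layer."""
    return 2 * layers * hidden_size * dtype_bytes


def kv_cache_bytes(tokens: int, layers: int, hidden_size: int, dtype_bytes: int) -> int:
    """Bytes of KV cache for a full sequence of `tokens` tokens."""
    return tokens * kv_cache_bytes_per_token(layers, hidden_size, dtype_bytes)


# Numbers from the post (assumed, not read from config.json).
per_token = kv_cache_bytes_per_token(layers=32, hidden_size=4096, dtype_bytes=2)
total = kv_cache_bytes(tokens=65_536, layers=32, hidden_size=4096, dtype_bytes=2)

print(f"per token: {per_token} bytes ({per_token // 1024} KiB)")
print(f"64K context: {total / 2**30:.1f} GiB = {total / 1e9:.1f} GB")
```

Keeping the binary (GiB) and decimal (GB) figures side by side helps when comparing against VRAM numbers, which are often quoted in one unit or the other.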

For 64K (65,536 tokens):
Total_KV = 65,536 × 512 KB = 32 GiB ≈ 34.4 GB

Total GPU Memory (TP=2):

KV cache: 34.4 GB (replicated across TP)
Model weights: 15 GB (15B active params ÷ 2)
Working memory: ~10 GB
TOTAL: 59.4 GB per GPU

3. H200 Capacity Analysis

H200 VRAM: 143 GB
Safe limit (80%): 114.4 GB
Required for 64K (TP=2): 59.4 GB
Headroom: 55.0 GB (48%)
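
The per-GPU budget and headroom can be recomputed with a short Python sketch. The helper names are hypothetical, and the inputs are the post's own estimates (15 GB of weights per GPU, about 10 GB of working memory, 143 GB of VRAM, an 80% safe limit), so treat the result as a rough plan rather than a measurement.

```python
GB = 1e9  # decimal gigabytes, matching the VRAM figure used in the post


def required_gb(kv_bytes: int, weights_gb: float, working_gb: float) -> float:
    """Estimated memory needed on each GPU, in GB."""
    return kv_bytes / GB + weights_gb + working_gb


def headroom_gb(vram_gb: float, safe_fraction: float, needed_gb: float) -> tuple[float, float]:
    """Return (safe limit, remaining headroom) in GB."""
    safe = vram_gb * safe_fraction
    return safe, safe - needed_gb


# KV cache for 65,536 tokens using the post's formula: 2 x layers x hidden x dtype bytes.
kv_bytes = 65_536 * 2 * 32 * 4096 * 2

needed = required_gb(kv_bytes, weights_gb=15.0, working_gb=10.0)
safe, spare = headroom_gb(vram_gb=143.0, safe_fraction=0.80, needed_gb=needed)

print(f"required per GPU: {needed:.1f} GB")
print(f"safe limit:       {safe:.1f} GB")
print(f"headroom:         {spare:.1f} GB ({spare / safe:.0%} of safe limit)")
```

Changing `tokens`, the weight estimate, or the safe fraction in one place makes it easy to see how much context the same card could hold under different assumptions.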

4. Correct Settings for 64K Context

Use TP=2 (mathematically required):

python3 -m sglang.launch_server \
  --model-path /home/llm/MiMo-V2-Flash \
  --port 8203 \
  --tp 2 --dp 4 \
  --context-length 65536 \
  --mem-fraction-static 0.70  # 100GB limit
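
If you launch from a script, a small Python helper can assemble the same command and sanity-check the GPU count first. This is only a sketch: `build_launch_command` and `gpus_required` are hypothetical names, the flags are the ones from the command above (with the memory flag written out as `--mem-fraction-static`), and the tp × dp GPU count assumes each data-parallel replica uses its own group of tensor-parallel GPUs.

```python
import shlex


def gpus_required(tp: int, dp: int) -> int:
    """GPUs needed, assuming each of the `dp` replicas spans `tp` GPUs."""
    return tp * dp


def build_launch_command(model_path: str, port: int, tp: int, dp: int,
                         context_length: int, mem_fraction: float) -> list[str]:
    """Build the argv list for the launch command shown in the post."""
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--port", str(port),
        "--tp", str(tp),
        "--dp", str(dp),
        "--context-length", str(context_length),
        "--mem-fraction-static", str(mem_fraction),
    ]


cmd = build_launch_command("/home/llm/MiMo-V2-Flash", 8203,
                           tp=2, dp=4, context_length=65536, mem_fraction=0.70)
print(f"GPUs needed: {gpus_required(2, 4)}")
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd) to actually launch
```

Building the command as a list rather than a string avoids shell-quoting problems if the model path ever contains spaces.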

JeffersonNunn

Dec 21, 2025

A bit more detail:

• TP=2: 6144 = 8 × 768 ✓ Math works, but has weight replication bug
• TP=4: 3072 = 4 × 768 ✓ Math works, but SGLang expects 768 not 3072
• TP=8: 1536 = 2 × 768 ✓ Math works, but sharding bug gives 2304

• Base dimension of 768 from error messages ✓
• All TP configurations that divide the attention dimensions ✓
• Proper memory calculations for 64K context ✓
• Weight replication vs. sharding issues ✓

The failure mode reveals an SGLang framework bug, not a mathematical error.

🔍 Evidence:

• TP=8 gives 2304 instead of expected 1536 = 2 × 768
• This shows incorrect sharding algorithm, not incorrect math
• Error occurs during weight loading, not during dimension calculations
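
The shard arithmetic can be checked with a few lines of Python. This is a sketch under one assumption: the full dimension of 12288 is inferred from the numbers in this thread (6144 × 2 = 3072 × 4 = 1536 × 8), not read from the model config. The base of 768 and the observed 2304 come from the error messages described above.

```python
BASE = 768        # base dimension reported in the error messages
FULL_DIM = 12288  # inferred from the thread: 6144 * 2 = 3072 * 4 = 1536 * 8


def shard_size(full_dim: int, tp: int) -> int:
    """Per-rank size when `full_dim` is split evenly across `tp` ranks."""
    if full_dim % tp != 0:
        raise ValueError(f"{full_dim} is not divisible by tp={tp}")
    return full_dim // tp


for tp in (2, 4, 8):
    size = shard_size(FULL_DIM, tp)
    print(f"TP={tp}: expected shard {size} = {size // BASE} x {BASE}")

# The TP=8 failure reported 2304 instead of the expected 1536.
observed = 2304
expected = shard_size(FULL_DIM, 8)
print(f"observed {observed} = {observed // BASE} x {BASE}, "
      f"{observed / expected:.1f}x the expected {expected}")
```

The observed value is still a whole multiple of 768, which is consistent with the dimensions being split along the wrong boundary rather than the base size itself being wrong.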

Our mathematical analysis predicted all the right configurations, but it also showed that SGLang's custom model implementation is fundamentally broken for MiMo-V2-Flash with any TP value.

The solution needs to be framework-level or model-conversion, not further mathematical optimization.

JeffersonNunn changed discussion title from TP2 to Working Config for H200 Here Dec 21, 2025

Nagata99999

Jan 14

Hi.
Could you help me, please?
I can't serve the MiMo model using SGLang.

MiMoV2FlashForCausalLM has no SGLang implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
