Need speed estimate on 6x H200

#21
by mechanicmuthu - opened

Can someone kindly provide an estimate of the total throughput and peak numbers that can be expected from 6x H200 or 8x H200? I am also wondering what kind of gains can be expected from int4 on the H200 series. Thanks in advance.


Hi @mechanicmuthu ,

While I don't have exact benchmarks for 6x H200, here's some reference data from the deployment guide:

KTransformers + SGLang on 8x L20 + 2x Intel 6454S:

  • Prefill: 640.12 tokens/s
  • Decode: 24.51 tokens/s (48-way concurrency)

H200 vs L20:

  • H200: 141GB HBM3e, ~4.8 TB/s bandwidth
  • L20: 48GB GDDR6, ~864 GB/s bandwidth

Since decode throughput is typically memory-bandwidth bound, the H200's much higher bandwidth and capacity should give substantially better performance than the L20 numbers. With 6 GPUs instead of 8, though, you'd need to adjust the tensor-parallel degree.
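A back-of-envelope way to bound this (a sketch, not a benchmark): single-stream decode is roughly limited by how fast the active weights can be streamed from HBM. The parameter values here are assumptions not stated in this thread: Kimi K2 activating ~32B parameters per token, int4 weights at ~0.5 bytes/param, and the H200's rated ~4.8 TB/s per GPU.

```python
# Hedged sketch: bandwidth-bound upper limit on decode speed.
# Assumptions (not from this thread): ~32B activated params/token,
# int4 ~0.5 bytes/param, decode fully memory-bandwidth bound.
def decode_upper_bound_tok_s(active_params: float,
                             bytes_per_param: float,
                             num_gpus: int,
                             gpu_bw_bytes_s: float) -> float:
    """Tokens/s ceiling: aggregate HBM bandwidth / bytes read per token."""
    bytes_per_token = active_params * bytes_per_param
    aggregate_bw = num_gpus * gpu_bw_bytes_s
    return aggregate_bw / bytes_per_token

# 6x H200 at ~4.8 TB/s each, int4:
print(decode_upper_bound_tok_s(32e9, 0.5, 6, 4.8e12))  # ~1800 tok/s ceiling
```

Real decode lands far below this ceiling because of KV-cache reads, kernel efficiency, and inter-GPU communication; treat it only as an upper bound.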

Suggested configuration for 6x H200:

vllm serve $MODEL_PATH -tp 6 --trust-remote-code --tool-call-parser kimi_k2 --reasoning-parser kimi_k2

Rough estimate: given the H200's ~5.6x bandwidth advantage over the L20, you might see decode speeds in the range of 50-100+ tokens/s with proper optimization, though this depends heavily on batch size and context length.
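Scaling the L20 reference numbers above gives a similar ballpark (a sketch under the crude assumption that decode scales linearly with aggregate memory bandwidth; the 4.8 TB/s and 864 GB/s figures are the GPUs' rated specs):

```python
# Back-of-envelope scaling from the 8x L20 reference decode number.
# Assumes decode is bandwidth-bound and scales linearly with aggregate bandwidth.
l20_decode = 24.51        # tok/s on 8x L20 (48-way concurrency)
bw_ratio = 4800 / 864     # H200 vs L20 per-GPU bandwidth, ~5.6x
gpu_ratio = 6 / 8         # 6 GPUs instead of 8
estimate = l20_decode * bw_ratio * gpu_ratio
print(round(estimate, 1))  # ~102 tok/s
```

That puts the naive projection near the top of the 50-100 tokens/s range; real results will vary with concurrency, context length, and kernel efficiency.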

For official benchmarks, I'd recommend checking with Moonshot AI directly or running your own tests with the model.
