Inference with sglang on two H200s is very slow

by taozi555 - opened 30 days ago

python3 -m sglang.launch_server \
    --model meituan-longcat/LongCat-Flash-Lite \
    --port 6006 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.9 \
    --max-running-requests 64 \
    --trust-remote-code \
    --skip-server-warmup \
    --attention-backend flashinfer \
    --ep 2 \
    --tp 2 \
    --disable-cuda-graph

With this command I get about 12 tok/s.
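For anyone who wants to reproduce a figure like this, here is a minimal sketch of a single-request throughput check. It assumes the server launched above is reachable on port 6006 and is queried through SGLang's OpenAI-compatible `/v1/completions` endpoint; the helper names, the prompt, and the `max_tokens` value are illustrative choices, not anything from the original post. The timing covers the whole request, prefill included, so it slightly understates pure decode speed.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generated tokens divided by wall-clock time for one request."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure(base_url="http://localhost:6006",
            model="meituan-longcat/LongCat-Flash-Lite",
            max_tokens=256):
    """Send one completion request and return end-to-end tokens/s.

    base_url, model, and the prompt are assumptions for illustration.
    """
    payload = json.dumps({
        "model": model,
        "prompt": "Explain what a mixture-of-experts model is.",
        "max_tokens": max_tokens,
        "temperature": 0,
    }).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    # The OpenAI-style response reports generated tokens under "usage".
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)


# Usage, with the server running:
#   print(f"{measure():.1f} tok/s")
```

A single request measures per-user speed only; aggregate throughput under `--max-running-requests 64` would need concurrent requests.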

malithh

28 days ago

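We are DeployPad.
We noticed that running LongCat with sglang on 2× H200 yields only about 12 tok/s, which is clearly lower than it should be.

👉 The DeployPad inference stack officially launches this weekend
👉 LongCat is supported at launch: about 60–80 tok/s on a single H200, with RTX Pro 6000 also supported
👉 Support will be opened to the LongCat community

Our goal is to make full use of the H200 and deliver higher throughput without complex tuning.

See you this weekend.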

sjqgogogogo

4 days ago

Hi, sglang now supports CUDA graph for this model via PR https://github.com/sgl-project/sglang/pull/17838

With CUDA graph enabled, you can run LongCat-Flash-Lite-FP8 on 8× H800 at about 250 tokens/s per user:

python3 -m sglang.launch_server \
    --model meituan-longcat/LongCat-Flash-Lite-FP8 \
    --port 8080 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.9 \
    --max-running-requests 64 \
    --trust-remote-code \
    --skip-server-warmup \
    --attention-backend flashinfer \
    --ep 8 \
    --tp 8 \
    --cuda-graph-bs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 20 24 28 32 36 40 44 48 52 56 60 64
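The `--cuda-graph-bs` list in the command above follows a simple pattern: every batch size from 1 to 16, then every fourth size up to 64 (matching `--max-running-requests 64`). If you change the request limit, a small helper like the sketch below can regenerate the list. The function name and its parameters are hypothetical, and the dense-then-strided spacing is just what the command above uses, not a documented recommendation.

```python
def cuda_graph_batch_sizes(max_bs=64, dense_until=16, step=4):
    """Batch sizes 1..dense_until, then every `step` up to max_bs."""
    # Dense region: capture a graph for every small batch size.
    sizes = list(range(1, min(dense_until, max_bs) + 1))
    # Sparse region: capture every `step`-th size to limit the number of graphs.
    sizes += list(range(dense_until + step, max_bs + 1, step))
    return sizes


flag = "--cuda-graph-bs " + " ".join(str(bs) for bs in cuda_graph_batch_sizes())
print(flag)
```

With the defaults this prints the same 28 batch sizes as the command above.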

Additionally, for the best possible performance, we highly recommend trying our inference engine, SGLang-FluentLLM. You can find more details at https://github.com/meituan-longcat/SGLang-FluentLLM/tree/main
