Great Model! - sglang mtp support for triton backend

#19

by chriswritescode - opened Dec 21, 2025

Discussion

chriswritescode

Dec 21, 2025

•

edited Dec 21, 2025

https://github.com/chriswritescode-dev/sglang/tree/mtp-triton-backend

launch command for 4x6000 Pro Blackwell
get upto 150-170t/s

export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_FORWARD_UNKNOWN_TOOLS=true
python3 -m sglang.launch_server
--model-path /models/MiMo-V2-Flash
--served-model-name mimo-v2-flash
--tp-size 4
--moe-a2a-backend none
--host 0.0.0.0
--port 8000
--trust-remote-code
--mem-fraction-static 0.92
--max-running-requests 16
--chunked-prefill-size 16384
--tool-call-parser mimo
--context-length 200000
--attention-backend triton
--fp8-gemm-backend triton
--speculative-algorithm EAGLE
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4
--enable-mtp \

lluu8

Dec 22, 2025

Any idea bout this error?

Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/root/sglang/python/sglang/launch_server.py", line 29, in
server_args = prepare_server_args(sys.argv[1:])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/sglang/python/sglang/srt/server_args.py", line 4921, in prepare_server_args
return ServerArgs.from_cli_args(raw_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/sglang/python/sglang/srt/server_args.py", line 4418, in from_cli_args
return cls(**{attr: getattr(args, attr) for attr in attrs})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 312, in init
File "/root/sglang/python/sglang/srt/server_args.py", line 673, in post_init
self._handle_model_specific_adjustments()
File "/root/sglang/python/sglang/srt/server_args.py", line 998, in _handle_model_specific_adjustments
from sglang.srt.configs.model_config import is_deepseek_nsa
File "/root/sglang/python/sglang/srt/configs/model_config.py", line 26, in
from sglang.srt.layers.quantization import QUANTIZATION_METHODS
File "/root/sglang/python/sglang/srt/layers/quantization/init.py", line 19, in
from sglang.srt.layers.quantization.auto_round import AutoRoundConfig
File "/root/sglang/python/sglang/srt/layers/quantization/auto_round.py", line 17, in
from sglang.srt.layers.linear import LinearBase, UnquantizedLinearMethod
File "/root/sglang/python/sglang/srt/layers/linear.py", line 34, in
from sglang.srt.layers.quantization.unquant import UnquantizedLinearMethod
File "/root/sglang/python/sglang/srt/layers/quantization/unquant.py", line 11, in
from sglang.srt.layers.moe import (
File "/root/sglang/python/sglang/srt/layers/moe/init.py", line 1, in
from sglang.srt.layers.moe.moe_runner import MoeRunner, MoeRunnerConfig
AssertionError: duplicate template name^^, "duplicate template name"r/select_algorithm.py", line 143

willfalco

Dec 23, 2025

If you can also add support for MiMo-V2-Flash-AWQ-4bit

willfalco

Dec 24, 2025

Works great om this PR
MiMo-V2-Flash timeout/correct:7/5
business 73/789 wrong (90.7% accuracy)
law 273/1101 wrong (75.2% accuracy)
psychology 112/798 wrong (86.0% accuracy)
biology 51/717 wrong (92.9% accuracy)
chemistry 99/1132 wrong (91.3% accuracy)
history 98/381 wrong (74.3% accuracy)
other 142/924 wrong (84.6% accuracy)
health 139/818 wrong (83.0% accuracy)
economics 86/844 wrong (89.8% accuracy)
math 53/1351 wrong (96.1% accuracy)
physics 97/1299 wrong (92.5% accuracy)
computer science 46/410 wrong (88.8% accuracy)
philosophy 86/499 wrong (82.8% accuracy)
engineering 131/969 wrong (86.5% accuracy)

ALL CATEGORIES 1491/12032 wrong (87.6% accuracy)

darkstar3537

Jan 21

•

edited Jan 21

--enable-mtp option not found despite 1) building out of the fork 2) merging the PR into sglang checked out to the hash fork/branch was based on. That said it does work fine on 4 x 6000's here if I omit that option and use the rest. Thanks for the work!

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment