Great Model! - sglang mtp support for triton backend

#19
by chriswritescode - opened

https://github.com/chriswritescode-dev/sglang/tree/mtp-triton-backend

launch command for 4x6000 Pro Blackwell
get upto 150-170t/s

export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_FORWARD_UNKNOWN_TOOLS=true
python3 -m sglang.launch_server
--model-path /models/MiMo-V2-Flash
--served-model-name mimo-v2-flash
--tp-size 4
--moe-a2a-backend none
--host 0.0.0.0
--port 8000
--trust-remote-code
--mem-fraction-static 0.92
--max-running-requests 16
--chunked-prefill-size 16384
--tool-call-parser mimo
--context-length 200000
--attention-backend triton
--fp8-gemm-backend triton
--speculative-algorithm EAGLE
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4
--enable-mtp \

Any idea bout this error?

Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/root/sglang/python/sglang/launch_server.py", line 29, in
server_args = prepare_server_args(sys.argv[1:])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/sglang/python/sglang/srt/server_args.py", line 4921, in prepare_server_args
return ServerArgs.from_cli_args(raw_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/sglang/python/sglang/srt/server_args.py", line 4418, in from_cli_args
return cls(**{attr: getattr(args, attr) for attr in attrs})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 312, in init
File "/root/sglang/python/sglang/srt/server_args.py", line 673, in post_init
self._handle_model_specific_adjustments()
File "/root/sglang/python/sglang/srt/server_args.py", line 998, in _handle_model_specific_adjustments
from sglang.srt.configs.model_config import is_deepseek_nsa
File "/root/sglang/python/sglang/srt/configs/model_config.py", line 26, in
from sglang.srt.layers.quantization import QUANTIZATION_METHODS
File "/root/sglang/python/sglang/srt/layers/quantization/init.py", line 19, in
from sglang.srt.layers.quantization.auto_round import AutoRoundConfig
File "/root/sglang/python/sglang/srt/layers/quantization/auto_round.py", line 17, in
from sglang.srt.layers.linear import LinearBase, UnquantizedLinearMethod
File "/root/sglang/python/sglang/srt/layers/linear.py", line 34, in
from sglang.srt.layers.quantization.unquant import UnquantizedLinearMethod
File "/root/sglang/python/sglang/srt/layers/quantization/unquant.py", line 11, in
from sglang.srt.layers.moe import (
File "/root/sglang/python/sglang/srt/layers/moe/init.py", line 1, in
from sglang.srt.layers.moe.moe_runner import MoeRunner, MoeRunnerConfig
AssertionError: duplicate template name^^, "duplicate template name"r/select_algorithm.py", line 143

If you can also add support for MiMo-V2-Flash-AWQ-4bit

Works great om this PR
MiMo-V2-Flash timeout/correct:7/5
business 73/789 wrong (90.7% accuracy)
law 273/1101 wrong (75.2% accuracy)
psychology 112/798 wrong (86.0% accuracy)
biology 51/717 wrong (92.9% accuracy)
chemistry 99/1132 wrong (91.3% accuracy)
history 98/381 wrong (74.3% accuracy)
other 142/924 wrong (84.6% accuracy)
health 139/818 wrong (83.0% accuracy)
economics 86/844 wrong (89.8% accuracy)
math 53/1351 wrong (96.1% accuracy)
physics 97/1299 wrong (92.5% accuracy)
computer science 46/410 wrong (88.8% accuracy)
philosophy 86/499 wrong (82.8% accuracy)
engineering 131/969 wrong (86.5% accuracy)

ALL CATEGORIES 1491/12032 wrong (87.6% accuracy)

Sign up or log in to comment