Lekr0 committed on Apr 13

Commit

5513247

verified ·

1 Parent(s): d02d576

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.

Files changed (50)

sglang/3rdparty/amd/profiling/PROFILING.md +425 -0
sglang/3rdparty/amd/profiling/client.sh +27 -0
sglang/3rdparty/amd/profiling/install_rpd.sh +10 -0
sglang/3rdparty/amd/profiling/loadTracer.sh +43 -0
sglang/3rdparty/amd/profiling/rpd.patch +12 -0
sglang/3rdparty/amd/profiling/rpd_profile_server_enable.patch +49 -0
sglang/3rdparty/amd/profiling/rpd_profile_server_enable_wCPU_activities.patch +126 -0
sglang/3rdparty/amd/profiling/server.sh +20 -0
sglang/3rdparty/amd/profiling/torch_profiler.patch +25 -0
sglang/3rdparty/amd/sgl-kernel/CMakeLists_rocm.txt +159 -0
sglang/3rdparty/amd/sgl-kernel/build_rocm.sh +123 -0
sglang/3rdparty/amd/sgl-kernel/rename_wheels_rocm.sh +30 -0
sglang/3rdparty/amd/sgl-kernel/rocm_hipify.py +40 -0
sglang/3rdparty/amd/tuning/TUNING.md +118 -0
sglang/3rdparty/amd/tuning/benchmark_moe_rocm.py +378 -0
sglang/docs/supported_models/extending/modelscope.md +28 -0
sglang/docs/supported_models/extending/support_new_models.md +320 -0
sglang/docs/supported_models/retrieval_ranking/classify_models.md +162 -0
sglang/docs/supported_models/retrieval_ranking/embedding_models.md +126 -0
sglang/docs/supported_models/retrieval_ranking/rerank_models.md +313 -0
sglang/docs/supported_models/specialized/index.rst +9 -0
sglang/docs/supported_models/specialized/reward_models.md +28 -0
sglang/docs/supported_models/text_generation/diffusion_language_models.md +111 -0
sglang/docs/supported_models/text_generation/generative_models.md +72 -0
sglang/docs/supported_models/text_generation/index.rst +11 -0
sglang/docs/supported_models/text_generation/multimodal_language_models.md +136 -0
sglang/python/sglang/srt/__pycache__/constants.cpython-311.pyc +0 -0
sglang/python/sglang/srt/__pycache__/environ.cpython-311.pyc +0 -0
sglang/python/sglang/srt/batch_overlap/__pycache__/operations.cpython-311.pyc +0 -0
sglang/python/sglang/srt/batch_overlap/__pycache__/operations_strategy.cpython-311.pyc +0 -0
sglang/python/sglang/srt/batch_overlap/__pycache__/single_batch_overlap.cpython-311.pyc +0 -0
sglang/python/sglang/srt/batch_overlap/__pycache__/two_batch_overlap.cpython-311.pyc +0 -0
sglang/python/sglang/srt/batch_overlap/operations.py +213 -0
sglang/python/sglang/srt/batch_overlap/operations_strategy.py +302 -0
sglang/python/sglang/srt/batch_overlap/single_batch_overlap.py +144 -0
sglang/python/sglang/srt/batch_overlap/two_batch_overlap.py +1082 -0
sglang/python/sglang/srt/checkpoint_engine/__init__.py +9 -0
sglang/python/sglang/srt/checkpoint_engine/checkpoint_engine_worker.py +143 -0
sglang/python/sglang/srt/checkpoint_engine/update.py +317 -0
sglang/python/sglang/srt/compilation/__pycache__/compilation_config.cpython-311.pyc +0 -0
sglang/python/sglang/srt/compilation/__pycache__/compile.cpython-311.pyc +0 -0
sglang/python/sglang/srt/compilation/__pycache__/piecewise_context_manager.cpython-311.pyc +0 -0
sglang/python/sglang/srt/compilation/backend.py +472 -0
sglang/python/sglang/srt/compilation/compilation_config.py +45 -0
sglang/python/sglang/srt/compilation/compilation_counter.py +47 -0
sglang/python/sglang/srt/compilation/compile.py +203 -0
sglang/python/sglang/srt/compilation/compiler_interface.py +504 -0
sglang/python/sglang/srt/compilation/cuda_piecewise_backend.py +206 -0
sglang/python/sglang/srt/compilation/fix_functionalization.py +134 -0
sglang/python/sglang/srt/compilation/fx_utils.py +83 -0

sglang/3rdparty/amd/profiling/PROFILING.md ADDED

	@@ -0,0 +1,425 @@

+## Profiling SGLang Infer System with AMD GPUs
+This AppNote describes the SGLang profiling techniques, code changes, and run steps for systems with AMD Instinct GPUs; the same procedure may also work with Nvidia GPUs.
+Detailed examples and steps are provided so that the results are easy to reproduce and performance problems can be localized for optimization.
+Two primary methods are covered:
+- [RPD](https://github.com/ROCm/rocmProfileData.git)
+- [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html)
+### Profiling SGLang Infer System with RPD Profiler
+RPD is a low-overhead, cross-platform profiler, so the same RPD code changes work for profiling on both ROCm/AMD GPUs and CUDA/Nvidia GPUs. To profile the SGLang repository with RPD, use the scripts and patch files included in this directory and follow the steps below:
+1. Install RPD using install_rpd.sh, which applies rpd.patch during installation. Both files are in this directory.
+install_rpd.sh
+```bash
+# download and install RPD
+apt update && apt install -y sqlite3 libsqlite3-dev libfmt-dev
+# install rpd module
+git clone https://github.com/ROCmSoftwarePlatform/rocmProfileData
+cd rocmProfileData
+git checkout 976899e9c6dbc6dd2bccf770818e4e44125590ac
+git apply rpd.patch
+make && make install
+cd rocpd_python && python setup.py install && cd ..
+cd rpd_tracer && make clean;make install && python setup.py install && cd ..
+```
+rpd.patch
+```diff
+diff --git a/rpd_tracer/Makefile b/rpd_tracer/Makefile
+index e9d9feb..b2e9e1a 100644
+--- a/rpd_tracer/Makefile
++++ b/rpd_tracer/Makefile
+@@ -16,7 +16,7 @@ ifneq (,$(HIP_PATH))
+         $(info Building with roctracer)
+         RPD_LIBS += -L/opt/rocm/lib -lroctracer64 -lroctx64 -lamdhip64 -lrocm_smi64
+         RPD_INCLUDES += -I/opt/rocm/include -I/opt/rocm/include/roctracer -I/opt/rocm/include/hsa
+-        RPD_SRCS += RoctracerDataSource.cpp RocmSmiDataSource.cpp
++        RPD_SRCS += RoctracerDataSource.cpp
+         RPD_INCLUDES += -D__HIP_PLATFORM_AMD__
+ endif
+```
+2. Add loadTracer.sh file included in this directory to /sglang/python/sglang.
+loadTracer.sh
+```bash
+#!/bin/bash
+################################################################################
+# Copyright (c) 2021 - 2023 Advanced Micro Devices, Inc. All rights reserved.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in
+# all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+# THE SOFTWARE.
+################################################################################
+OUTPUT_FILE="trace.rpd"
+if [ "$1" = "-o" ] ; then
+  OUTPUT_FILE=$2
+  shift
+  shift
+fi
+if [ -e ${OUTPUT_FILE} ] ; then
+  rm ${OUTPUT_FILE}
+fi
+python3 -m rocpd.schema --create ${OUTPUT_FILE}
+if [ $? != 0 ] ; then
+  echo "Error: Could not create rpd file. Please run 'python setup.py install' from the rocpd_python dir"
+  exit 1
+fi
+export RPDT_FILENAME=${OUTPUT_FILE}
+export RPDT_AUTOSTART=0
+LD_PRELOAD=librocm-smi_64:librpd_tracer.so "$@"
+```
+3. Apply the patch provided in this directory with "git apply rpd_profile_server_enable.patch" if the main purpose of profiling is to get information on GPU kernels along with limited CPU activity information.
+#### Common Notes 1
+Note that although the example uses TP=8, the patch file deliberately logs RPD profiling on only 2 ranks (i.e. tp_rank=0/1) for profiling and visualization convenience, because even Perfetto streaming mode can load a json file of at most 8GB for visualization. With 2 ranks logged, we can still check for issues among ranks (e.g. load imbalance, NCCL issues), and we can log a relatively longer duration before the json file generated from the RPD file reaches 8GB.
+rpd_profile_server_enable.patch
+```diff
+diff --git a/python/sglang/srt/managers/scheduler.py b/python/sglang/srt/managers/scheduler.py
+index 62d1ff9..9021c01 100644
+--- a/python/sglang/srt/managers/scheduler.py
++++ b/python/sglang/srt/managers/scheduler.py
+@@ -71,6 +71,8 @@ from sglang.srt.utils import (
+     suppress_other_loggers,
+ )
+ from sglang.utils import get_exception_traceback
++from rpdTracerControl import rpdTracerControl
++rpdTracerControl.skipCreate()
+ logger = logging.getLogger(__name__)
+@@ -245,6 +247,7 @@ class Scheduler:
+                 ],
+                 with_stack=True,
+             )
++            self.rpd = rpdTracerControl()
+     @torch.inference_mode()
+     def event_loop(self):
+@@ -1027,15 +1030,24 @@ class Scheduler:
+     def start_profile(self) -> None:
+         if self.profiler is None:
+             raise RuntimeError("Profiler is not enabled.")
+-        self.profiler.start()
++        #self.profiler.start() #block pytorch profiler for rpd profiler enabling
++        if self.tp_rank == 0 or self.tp_rank == 1:
++            self.rpd.start()
++            self.rpd.rangePush("", "rpd profile range", "")
++            logger.info("rpd is enabled")
+     def stop_profile(self) -> None:
+         if self.profiler is None:
+             raise RuntimeError("Profiler is not enabled.")
+-        self.profiler.stop()
+-        self.profiler.export_chrome_trace(
+-            self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
+-        )
++        #self.profiler.stop()
++        #self.profiler.export_chrome_trace(
++        #    self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
++        #)
++        if self.tp_rank ==0 or self.tp_rank ==1:
++            self.rpd.rangePop()
++            self.rpd.stop()
++            self.rpd.flush()
++            logger.info("rpd is done")
+         logger.info("Profiler is done")
+```
+#### Advanced Debugging with RPD Profiler
+Sometimes we want to use the RPD profiler to capture more CPU and Python activity in order to debug challenging issues (e.g. the root cause of load imbalance across GPU processes, the root cause of bubbles, etc.). Only in such cases do we need to apply the patch with "git apply rpd_profile_server_enable_wCPU_activities.patch", which modifies 3 files.
+rpd_profile_server_enable_wCPU_activities.patch
+```diff
+diff --git a/python/sglang/srt/managers/scheduler.py b/python/sglang/srt/managers/scheduler.py
+index 62d1ff9..2edb427 100644
+--- a/python/sglang/srt/managers/scheduler.py
++++ b/python/sglang/srt/managers/scheduler.py
+@@ -71,6 +71,8 @@ from sglang.srt.utils import (
+     suppress_other_loggers,
+ )
+ from sglang.utils import get_exception_traceback
++from rpdTracerControl import rpdTracerControl
++rpdTracerControl.skipCreate()
+ logger = logging.getLogger(__name__)
+@@ -245,6 +247,7 @@ class Scheduler:
+                 ],
+                 with_stack=True,
+             )
++            self.rpd = rpdTracerControl()
+     @torch.inference_mode()
+     def event_loop(self):
+@@ -1027,15 +1030,26 @@ class Scheduler:
+     def start_profile(self) -> None:
+         if self.profiler is None:
+             raise RuntimeError("Profiler is not enabled.")
+-        self.profiler.start()
++        #self.profiler.start()
++        logger.info("torch profiler is disabled")
++        if self.tp_rank == 0 or self.tp_rank == 1:
++            self.rpd.setPythonTrace(True)
++            self.rpd.start()
++            self.rpd.rangePush("", "scheduler", "")
++        logger.info("rpd is enabled inside scheduler profiling")
+     def stop_profile(self) -> None:
+         if self.profiler is None:
+             raise RuntimeError("Profiler is not enabled.")
+-        self.profiler.stop()
+-        self.profiler.export_chrome_trace(
+-            self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
+-        )
++        #self.profiler.stop()
++        #self.profiler.export_chrome_trace(
++        #    self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
++        #)
++        if self.tp_rank ==0 or self.tp_rank ==1:
++            self.rpd.rangePop()
++            self.rpd.stop()
++            self.rpd.flush()
++            logger.info("rpd is done inside scheduler")
+         logger.info("Profiler is done")
+diff --git a/python/sglang/srt/managers/tokenizer_manager.py b/python/sglang/srt/managers/tokenizer_manager.py
+index 2621ccd..181df85 100644
+--- a/python/sglang/srt/managers/tokenizer_manager.py
++++ b/python/sglang/srt/managers/tokenizer_manager.py
+@@ -58,6 +58,10 @@ from sglang.srt.sampling.sampling_params import SamplingParams
+ from sglang.srt.server_args import PortArgs, ServerArgs
+ from sglang.srt.utils import is_generation_model, is_multimodal_model
++from rpdTracerControl import rpdTracerControl
++rpdTracerControl.skipCreate()
++
++
+ asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
+ logger = logging.getLogger(__name__)
+@@ -514,10 +518,20 @@ class TokenizerManager:
+         self.send_to_scheduler.send_pyobj(req)
+     def start_profile(self):
++        rpd = rpdTracerControl()
++        rpd.setPythonTrace(True)
++        rpd.start()
++        rpd.rangePush("", "tokenizer_manager", "")
++        logger.info("tokenizer_manager rpd profiling started!")
+         req = ProfileReq.START_PROFILE
+         self.send_to_scheduler.send_pyobj(req)
+     def stop_profile(self):
++        rpd = rpdTracerControl()
++        rpd.rangePop()
++        rpd.stop()
++        rpd.flush()
++        logger.info("rpd profiling is done inside tokenizer_manager!")
+         req = ProfileReq.STOP_PROFILE
+         self.send_to_scheduler.send_pyobj(req)
+diff --git a/python/sglang/srt/server.py b/python/sglang/srt/server.py
+index 7111c93..2bd722c 100644
+--- a/python/sglang/srt/server.py
++++ b/python/sglang/srt/server.py
+@@ -30,6 +30,8 @@ import threading
+ import time
+ from http import HTTPStatus
+ from typing import Dict, List, Optional, Union
++from rpdTracerControl import rpdTracerControl
++rpdTracerControl.skipCreate()
+ # Fix a bug of Python threading
+ setattr(threading, "_register_atexit", lambda *args, **kwargs: None)
+@@ -152,6 +154,11 @@ async def flush_cache():
+ @app.post("/start_profile")
+ async def start_profile():
+     """Start profiling."""
++    rpd = rpdTracerControl()
++    rpd.setPythonTrace(True)
++    rpd.start()
++    rpd.rangePush("", "server rpd profile range", "")
++    logger.info("rpd profiling started in server.py!")
+     tokenizer_manager.start_profile()
+     return Response(
+         content="Start profiling.\n",
+@@ -164,6 +171,11 @@ async def start_profile():
+ async def stop_profile():
+     """Stop profiling."""
+     tokenizer_manager.stop_profile()
++    rpd = rpdTracerControl()
++    rpd.rangePop()
++    rpd.stop()
++    rpd.flush()
++    logger.info("rpd profiling is done in server.py!")
+     return Response(
+         content="Stop profiling. This will take some time.\n",
+         status_code=200,
+```
+4. As an example for grok1 profiling, create a dummy_grok1 directory containing config.json (see content below), and copy this directory to the path given by "--model-path" if you want to use the example server.sh file provided.
+```bash
+cat ../dummy_grok1/config.json
+{
+  "architectures": [
+    "Grok1ModelForCausalLM"
+  ],
+  "embedding_multiplier_scale": 78.38367176906169,
+  "output_multiplier_scale": 0.5773502691896257,
+  "vocab_size": 131072,
+  "hidden_size": 6144,
+  "intermediate_size": 32768,
+  "max_position_embeddings": 8192,
+  "num_experts_per_tok": 2,
+  "num_local_experts": 8,
+  "num_attention_heads": 48,
+  "num_hidden_layers": 64,
+  "num_key_value_heads": 8,
+  "head_dim": 128,
+  "rms_norm_eps": 1e-05,
+  "rope_theta": 10000.0,
+  "model_type": "mixtral",
+  "torch_dtype": "bfloat16"
+}
+```
+5. Launch the server with the RPD-enabled script ./server.sh in one terminal inside the docker container.
+#### Common Notes 2
+- Remember to change model-path to the correct path
+- loadTracer.sh is needed to conduct profiling
+- SGLANG_TORCH_PROFILER_DIR is used by the default torch profiler
+- Do not use loadTracer.sh if you are using the torch profiler; simply use python3 -m sglang.launch_server.
+server.sh
+```bash
+#!/bin/bash
+# export SGLANG_TORCH_PROFILER_DIR=/data/sglang/
+export SGLANG_TORCH_PROFILER_DIR=/sgl-workspace/sglang/profile/
+# Get the current timestamp
+TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
+# Define the log file with a timestamp
+LOGFILE="sglang_server_log_$TIMESTAMP.json"
+# Run the Python command and save the output to the log file
+loadTracer.sh python3 -m sglang.launch_server \
+    --model-path /sgl-workspace/sglang/dummy_grok1 \
+    --tokenizer-path Xenova/grok-1-tokenizer \
+    --load-format dummy \
+    --quantization fp8 \
+    --tp 8 \
+    --port 30000 \
+    --disable-radix-cache 2>&1 | tee "$LOGFILE"
+```
+6. Open another terminal in the same docker container, and run the RPD-enabled ./client.sh after you see the "The server is fired up and is ready to roll!" message in the server-side terminal.
+#### Common Notes 3
+- Use curl http://localhost:30000/start_profile and curl http://localhost:30000/stop_profile to control the start and end of profiling. Check sglang/python/sglang/srt/managers/scheduler.py for more details.
+- To avoid interference, do not use the RPD profiler together with the PyTorch profiler.
+- The rocmProfileData/tools/rpd2tracing.py file is used to generate a json file from an RPD file.
+client.sh
+```bash
+#!/bin/bash
+# Start profiling via API
+curl http://localhost:30000/start_profile -H "Content-Type: application/json"
+# Benchmark serving using sglang with random dataset and tokenizer
+# Define the log file with a timestamp
+TIMESTAMP=$(date +%Y%m%d_%H%M%S)
+LOGFILE="sglang_client_log_$TIMESTAMP.json"
+# Run the benchmark with specified parameters and save logs
+python3 -m sglang.bench_serving \
+    --backend sglang \
+    --tokenizer Xenova/grok-1-tokenizer \
+    --dataset-name random \
+    --random-input 1024 \
+    --random-output 1024 \
+    --num-prompts 120 \
+    --request-rate 8 \
+    --output-file online.jsonl 2>&1 | tee "$LOGFILE"
+# Stop profiling via API
+curl http://localhost:30000/stop_profile -H "Content-Type: application/json"
+# Convert tracing file to csv & json
+sqlite3 trace.rpd ".mode csv" ".header on" ".output trace.csv" "select * from top;" ".output stdout"
+python3 ./rocmProfileData/tools/rpd2tracing.py trace.rpd trace.json
+```
+7. Follow the [Perfetto docs](https://perfetto.dev/docs/visualization/large-traces) to visualize large json files. Try to adjust the parameters so that the trace.json file stays below the 8GB limit mentioned in Common Notes 1.
+### Profiling SGLang Infer System with PyTorch Profiler
+Follow these steps:
+1. Apply the patch torch_profiler.patch. Note that you can modify "if self.tp_rank == 0" in the patch to allow more ranks to be recorded in profiling.
+torch_profiler.patch
+```diff
+diff --git a/python/sglang/srt/managers/scheduler.py b/python/sglang/srt/managers/scheduler.py
+index 62d1ff9..6ecd78c 100644
+--- a/python/sglang/srt/managers/scheduler.py
++++ b/python/sglang/srt/managers/scheduler.py
+@@ -240,7 +240,6 @@ class Scheduler:
+             )
+             self.profiler = torch.profiler.profile(
+                 activities=[
+-                    torch.profiler.ProfilerActivity.CPU,
+                     torch.profiler.ProfilerActivity.CUDA,
+                 ],
+                 with_stack=True,
+@@ -1033,9 +1032,11 @@ class Scheduler:
+         if self.profiler is None:
+             raise RuntimeError("Profiler is not enabled.")
+         self.profiler.stop()
+-        self.profiler.export_chrome_trace(
+-            self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
+-        )
++        if self.tp_rank == 0:
++            with open(f"stats_repro_{int(time.time())}.txt", "w") as f:
++                print(self.profiler.key_averages(group_by_input_shape=True).table(sort_by="cuda_time_total", row_limit=-1), file=f)
++                print("Profiling stats done.")
++
+         logger.info("Profiler is done")
+```
+2. Create the model path directory and copy it to the right path for "--model-path" if you want to use the server.sh file provided.
+3. Modify the included server.sh by removing "loadTracer.sh" before the python command, and launch the script ./server.sh in one terminal inside the docker container.
+4. Follow step 6 in the RPD profiling section, but first remove the last 2 lines of client.sh, which convert the rpd file into csv and json files. Run the modified client.sh for PyTorch profiling.
+-------
+- [Torch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html)
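
Reviewer sketch (not part of the commit): the start/stop endpoints described in Common Notes 3 could also be driven from Python, so that profiling is always stopped even if the workload fails. The helper names `profiling_window` and `_post` and the `send` parameter are hypothetical. The port comes from server.sh, and POST is an assumption based on the `@app.post("/start_profile")` line visible in the patch context (client.sh uses curl's default method instead).

```python
import contextlib
import urllib.request

BASE_URL = "http://localhost:30000"  # port taken from server.sh; adjust as needed


def _post(url):
    """Send an empty POST request and return the HTTP status code."""
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status


@contextlib.contextmanager
def profiling_window(base_url=BASE_URL, send=_post):
    """Hypothetical helper: start profiling on entry, stop it on exit."""
    send(f"{base_url}/start_profile")
    try:
        yield
    finally:
        # Always stop, even if the workload raises, so the trace gets flushed.
        send(f"{base_url}/stop_profile")
```

A benchmark run would then be wrapped as `with profiling_window(): ...`. The `send` argument exists only so the request function can be swapped out, for example in a dry run.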

sglang/3rdparty/amd/profiling/client.sh ADDED

	@@ -0,0 +1,27 @@

+#!/bin/bash
+# Start profiling via API
+curl http://localhost:30000/start_profile -H "Content-Type: application/json"
+# Benchmark serving using sglang with random dataset and tokenizer
+# Define the log file with a timestamp
+TIMESTAMP=$(date +%Y%m%d_%H%M%S)
+LOGFILE="sglang_client_log_$TIMESTAMP.json"
+# Run the benchmark with specified parameters and save logs
+python3 -m sglang.bench_serving \
+    --backend sglang \
+    --tokenizer Xenova/grok-1-tokenizer \
+    --dataset-name random \
+    --random-input 1024 \
+    --random-output 1024 \
+    --num-prompts 240 \
+    --request-rate 8 \
+    --output-file online.jsonl 2>&1 | tee "$LOGFILE"
+# Stop profiling via API
+curl http://localhost:30000/stop_profile -H "Content-Type: application/json"
+# Convert tracing file to csv & json
+sqlite3 trace.rpd ".mode csv" ".header on" ".output trace.csv" "select * from top;" ".output stdout"
+python3 /sgl-workspace/rocmProfileData/tools/rpd2tracing.py trace.rpd trace.json
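
Reviewer sketch (not part of the commit): the `sqlite3 trace.rpd ... "select * from top;"` line above can be reproduced with Python's standard library, which may be convenient when the CSV is post-processed in the same script. It assumes, as the query in client.sh implies, that the RPD file is an SQLite database exposing a `top` view. The function name `export_top` and its `limit` parameter are hypothetical.

```python
import contextlib
import csv
import sqlite3


def export_top(rpd_path, csv_path, limit=None):
    """Write the rows of the `top` view in an RPD file to CSV; return the row count."""
    query = "SELECT * FROM top"
    params = ()
    if limit is not None:
        query += " LIMIT ?"
        params = (limit,)
    with contextlib.closing(sqlite3.connect(rpd_path)) as conn:
        cur = conn.execute(query, params)
        header = [col[0] for col in cur.description]  # column names, like ".header on"
        rows = cur.fetchall()
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

For example, `export_top("trace.rpd", "trace.csv")` would mirror the shell command, and `limit=20` keeps only the first rows the view returns.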

sglang/3rdparty/amd/profiling/install_rpd.sh ADDED

	@@ -0,0 +1,10 @@

+# download and install RPD
+apt update && apt install -y sqlite3 libsqlite3-dev libfmt-dev
+# install rpd module
+git clone https://github.com/ROCmSoftwarePlatform/rocmProfileData
+cd rocmProfileData
+git apply rpd.patch
+make && make install
+cd rocpd_python && python setup.py install && cd ..
+cd rpd_tracer && make clean;make install && python setup.py install && cd ..

sglang/3rdparty/amd/profiling/loadTracer.sh ADDED

	@@ -0,0 +1,43 @@

+#!/bin/bash
+################################################################################
+# Copyright (c) 2021 - 2023 Advanced Micro Devices, Inc. All rights reserved.
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the "Software"), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in
+# all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+# THE SOFTWARE.
+################################################################################
+OUTPUT_FILE="trace.rpd"
+if [ "$1" = "-o" ] ; then
+  OUTPUT_FILE=$2
+  shift
+  shift
+fi
+if [ -e ${OUTPUT_FILE} ] ; then
+  rm ${OUTPUT_FILE}
+fi
+python3 -m rocpd.schema --create ${OUTPUT_FILE}
+if [ $? != 0 ] ; then
+  echo "Error: Could not create rpd file. Please run 'python setup.py install' from the rocpd_python dir"
+  exit 1
+fi
+export RPDT_FILENAME=${OUTPUT_FILE}
+export RPDT_AUTOSTART=0
+LD_PRELOAD=librocm-smi_64:librpd_tracer.so "$@"
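
Reviewer sketch (not part of the commit): the last three lines of loadTracer.sh amount to launching the wrapped command with three extra environment variables. A Python launcher could build the same environment as shown below. The function name `tracer_env` is hypothetical, and the sketch does not create the RPD schema, which the script does with `python3 -m rocpd.schema --create`.

```python
import os


def tracer_env(output_file="trace.rpd", base_env=None):
    """Return a copy of the environment with the variables loadTracer.sh sets."""
    env = dict(os.environ if base_env is None else base_env)
    env["RPDT_FILENAME"] = output_file
    # Same value as the script; the patches then call rpd.start()/rpd.stop() explicitly.
    env["RPDT_AUTOSTART"] = "0"
    env["LD_PRELOAD"] = "librocm-smi_64:librpd_tracer.so"
    return env
```

The result could be passed as `env=tracer_env()` to `subprocess.run(...)` when starting `python3 -m sglang.launch_server`.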

sglang/3rdparty/amd/profiling/rpd.patch ADDED

	@@ -0,0 +1,12 @@

+diff --git a/rpd_tracer/Makefile b/rpd_tracer/Makefile
+index e9d9feb..b2e9e1a 100644
+--- a/rpd_tracer/Makefile
++++ b/rpd_tracer/Makefile
+@@ -16,7 +16,7 @@ ifneq (,$(HIP_PATH))
+         $(info Building with roctracer)
+         RPD_LIBS += -L/opt/rocm/lib -lroctracer64 -lroctx64 -lamdhip64 -lrocm_smi64
+         RPD_INCLUDES += -I/opt/rocm/include -I/opt/rocm/include/roctracer -I/opt/rocm/include/hsa
+-        RPD_SRCS += RoctracerDataSource.cpp RocmSmiDataSource.cpp
++        RPD_SRCS += RoctracerDataSource.cpp
+         RPD_INCLUDES += -D__HIP_PLATFORM_AMD__
+ endif

sglang/3rdparty/amd/profiling/rpd_profile_server_enable.patch ADDED

	@@ -0,0 +1,49 @@

+diff --git a/python/sglang/srt/managers/scheduler.py b/python/sglang/srt/managers/scheduler.py
+index 62d1ff9..9021c01 100644
+--- a/python/sglang/srt/managers/scheduler.py
++++ b/python/sglang/srt/managers/scheduler.py
+@@ -71,6 +71,8 @@ from sglang.srt.utils import (
+     suppress_other_loggers,
+ )
+ from sglang.utils import get_exception_traceback
++from rpdTracerControl import rpdTracerControl
++rpdTracerControl.skipCreate()
+ logger = logging.getLogger(__name__)
+@@ -245,6 +247,7 @@ class Scheduler:
+                 ],
+                 with_stack=True,
+             )
++            self.rpd = rpdTracerControl()
+     @torch.inference_mode()
+     def event_loop(self):
+@@ -1027,15 +1030,24 @@ class Scheduler:
+     def start_profile(self) -> None:
+         if self.profiler is None:
+             raise RuntimeError("Profiler is not enabled.")
+-        self.profiler.start()
++        #self.profiler.start() #block pytorch profiler for rpd profiler enabling
++        if self.tp_rank == 0 or self.tp_rank == 1:
++            self.rpd.start()
++            self.rpd.rangePush("", "rpd profile range", "")
++            logger.info("rpd is enabled")
+     def stop_profile(self) -> None:
+         if self.profiler is None:
+             raise RuntimeError("Profiler is not enabled.")
+-        self.profiler.stop()
+-        self.profiler.export_chrome_trace(
+-            self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
+-        )
++        #self.profiler.stop()
++        #self.profiler.export_chrome_trace(
++        #    self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
++        #)
++        if self.tp_rank ==0 or self.tp_rank ==1:
++            self.rpd.rangePop()
++            self.rpd.stop()
++            self.rpd.flush()
++            logger.info("rpd is done")
+         logger.info("Profiler is done")
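
Reviewer sketch (not part of the commit): the patch repeats the rank check and the start/rangePush and rangePop/stop/flush sequences in two methods. One possible way to keep them paired is a context manager like the one below. `rpd_range` and `traced_ranks` are hypothetical names. The `tracer` argument is assumed to expose the same methods the patch calls on `rpdTracerControl`, and nothing here imports that module.

```python
import contextlib


@contextlib.contextmanager
def rpd_range(tracer, name, tp_rank, traced_ranks=(0, 1)):
    """Hypothetical wrapper: trace the enclosed block only on selected ranks."""
    active = tp_rank in traced_ranks
    if active:
        tracer.start()
        tracer.rangePush("", name, "")
    try:
        yield active
    finally:
        if active:
            # Mirror the order used in stop_profile: pop, stop, then flush.
            tracer.rangePop()
            tracer.stop()
            tracer.flush()
```

Note that the scheduler starts and stops profiling in separate requests, so this form fits only code where the traced region is a single block. It is shown to make the pairing and the rank gate explicit.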

sglang/3rdparty/amd/profiling/rpd_profile_server_enable_wCPU_activities.patch ADDED

	@@ -0,0 +1,126 @@

+diff --git a/python/sglang/srt/managers/scheduler.py b/python/sglang/srt/managers/scheduler.py
+index 62d1ff9..2edb427 100644
+--- a/python/sglang/srt/managers/scheduler.py
++++ b/python/sglang/srt/managers/scheduler.py
+@@ -71,6 +71,8 @@ from sglang.srt.utils import (
+     suppress_other_loggers,
+ )
+ from sglang.utils import get_exception_traceback
++from rpdTracerControl import rpdTracerControl
++rpdTracerControl.skipCreate()
+ logger = logging.getLogger(__name__)
+@@ -245,6 +247,7 @@ class Scheduler:
+                 ],
+                 with_stack=True,
+             )
++            self.rpd = rpdTracerControl()
+     @torch.inference_mode()
+     def event_loop(self):
+@@ -1027,15 +1030,26 @@ class Scheduler:
+     def start_profile(self) -> None:
+         if self.profiler is None:
+             raise RuntimeError("Profiler is not enabled.")
+-        self.profiler.start()
++        #self.profiler.start()
++        logger.info("torch profiler is disabled")
++        if self.tp_rank == 0 or self.tp_rank == 1:
++            self.rpd.setPythonTrace(True)
++            self.rpd.start()
++            self.rpd.rangePush("", "scheduler", "")
++        logger.info("rpd is enabled inside scheduler profiling")
+     def stop_profile(self) -> None:
+         if self.profiler is None:
+             raise RuntimeError("Profiler is not enabled.")
+-        self.profiler.stop()
+-        self.profiler.export_chrome_trace(
+-            self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
+-        )
++        #self.profiler.stop()
++        #self.profiler.export_chrome_trace(
++        #    self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
++        #)
++        if self.tp_rank ==0 or self.tp_rank ==1:
++            self.rpd.rangePop()
++            self.rpd.stop()
++            self.rpd.flush()
++            logger.info("rpd is done inside scheduler")
+         logger.info("Profiler is done")
+diff --git a/python/sglang/srt/managers/tokenizer_manager.py b/python/sglang/srt/managers/tokenizer_manager.py
+index 2621ccd..181df85 100644
+--- a/python/sglang/srt/managers/tokenizer_manager.py
++++ b/python/sglang/srt/managers/tokenizer_manager.py
+@@ -58,6 +58,10 @@ from sglang.srt.sampling.sampling_params import SamplingParams
+ from sglang.srt.server_args import PortArgs, ServerArgs
+ from sglang.srt.utils import is_generation_model, is_multimodal_model
++from rpdTracerControl import rpdTracerControl
++rpdTracerControl.skipCreate()
++
++
+ asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
+ logger = logging.getLogger(__name__)
+@@ -514,10 +518,20 @@ class TokenizerManager:
+         self.send_to_scheduler.send_pyobj(req)
+     def start_profile(self):
++        rpd = rpdTracerControl()
++        rpd.setPythonTrace(True)
++        rpd.start()
++        rpd.rangePush("", "tokenizer_manager", "")
++        logger.info("tokenizer_manager rpd profiling started!")
+         req = ProfileReq.START_PROFILE
+         self.send_to_scheduler.send_pyobj(req)
+     def stop_profile(self):
++        rpd = rpdTracerControl()
++        rpd.rangePop()
++        rpd.stop()
++        rpd.flush()
++        logger.info("rpd profiling is done inside tokenizer_manager!")
+         req = ProfileReq.STOP_PROFILE
+         self.send_to_scheduler.send_pyobj(req)
+diff --git a/python/sglang/srt/server.py b/python/sglang/srt/server.py
+index 7111c93..2bd722c 100644
+--- a/python/sglang/srt/server.py
++++ b/python/sglang/srt/server.py
+@@ -30,6 +30,8 @@ import threading
+ import time
+ from http import HTTPStatus
+ from typing import Dict, List, Optional, Union
++from rpdTracerControl import rpdTracerControl
++rpdTracerControl.skipCreate()
+ # Fix a bug of Python threading
+ setattr(threading, "_register_atexit", lambda *args, **kwargs: None)
+@@ -152,6 +154,11 @@ async def flush_cache():
+ @app.post("/start_profile")
+ async def start_profile():
+     """Start profiling."""
++    rpd = rpdTracerControl()
++    rpd.setPythonTrace(True)
++    rpd.start()
++    rpd.rangePush("", "server rpd profile range", "")
++    logger.info("rpd profiling started in server.py!")
+     tokenizer_manager.start_profile()
+     return Response(
+         content="Start profiling.\n",
+@@ -164,6 +171,11 @@ async def start_profile():
+ async def stop_profile():
+     """Stop profiling."""
+     tokenizer_manager.stop_profile()
++    rpd = rpdTracerControl()
++    rpd.rangePop()
++    rpd.stop()
++    rpd.flush()
++    logger.info("rpd profiling is done in server.py!")
+     return Response(
+         content="Stop profiling. This will take some time.\n",
+         status_code=200,

sglang/3rdparty/amd/profiling/server.sh ADDED

	@@ -0,0 +1,20 @@

+#!/bin/bash
+# export SGLANG_TORCH_PROFILER_DIR=/data/sglang/
+export SGLANG_TORCH_PROFILER_DIR=/sgl-workspace/sglang/profile/
+# Get the current timestamp
+TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
+# Define the log file with a timestamp
+LOGFILE="sglang_server_log_$TIMESTAMP.json"
+# Run the Python command and save the output to the log file
+loadTracer.sh python3 -m sglang.launch_server \
+    --model-path /sgl-workspace/sglang/dummy_grok1 \
+    --tokenizer-path Xenova/grok-1-tokenizer \
+    --load-format dummy \
+    --quantization fp8 \
+    --tp 8 \
+    --port 30000 \
+    --disable-radix-cache 2>&1 | tee "$LOGFILE"

sglang/3rdparty/amd/profiling/torch_profiler.patch ADDED

	@@ -0,0 +1,25 @@

+diff --git a/python/sglang/srt/managers/scheduler.py b/python/sglang/srt/managers/scheduler.py
+index 62d1ff9..6ecd78c 100644
+--- a/python/sglang/srt/managers/scheduler.py
++++ b/python/sglang/srt/managers/scheduler.py
+@@ -240,7 +240,6 @@ class Scheduler:
+             )
+             self.profiler = torch.profiler.profile(
+                 activities=[
+-                    torch.profiler.ProfilerActivity.CPU,
+                     torch.profiler.ProfilerActivity.CUDA,
+                 ],
+                 with_stack=True,
+@@ -1033,9 +1032,11 @@ class Scheduler:
+         if self.profiler is None:
+             raise RuntimeError("Profiler is not enabled.")
+         self.profiler.stop()
+-        self.profiler.export_chrome_trace(
+-            self.torch_profiler_trace_dir + "/" + str(time.time()) + ".trace.json.gz"
+-        )
++        if self.tp_rank == 0:
++            with open(f"stats_repro_{int(time.time())}.txt", "w") as f:
++                print(self.profiler.key_averages(group_by_input_shape=True).table(sort_by="cuda_time_total", row_limit=-1), file=f)
++                print("Profiling stats done.")
++
+         logger.info("Profiler is done")

sglang/3rdparty/amd/sgl-kernel/CMakeLists_rocm.txt ADDED

	@@ -0,0 +1,159 @@

+cmake_minimum_required(VERSION 3.24 FATAL_ERROR)
+project(sgl_kernel LANGUAGES CXX)
+# Cmake
+set(CMAKE_CXX_STANDARD 17)
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+set(CMAKE_POSITION_INDEPENDENT_CODE ON)
+set(CMAKE_SHARED_LIBRARY_PREFIX "")
+set(CMAKE_COLOR_DIAGNOSTICS ON)
+set(CMAKE_VERBOSE_MAKEFILE ON CACHE BOOL "ON")
+# Python / Torch
+find_package(Python COMPONENTS Interpreter Development.Module ${SKBUILD_SABI_COMPONENT} REQUIRED)
+execute_process(
+  COMMAND ${Python_EXECUTABLE} -c "import torch; print(torch.utils.cmake_prefix_path)"
+  OUTPUT_VARIABLE TORCH_PY_PREFIX
+  OUTPUT_STRIP_TRAILING_WHITESPACE
+)
+set(Torch_DIR "${TORCH_PY_PREFIX}/Torch")
+list(APPEND CMAKE_PREFIX_PATH "${TORCH_PY_PREFIX}/Torch")
+find_package(Torch REQUIRED)
+execute_process(
+  COMMAND ${Python_EXECUTABLE} -c "import torch; print(int(torch._C._GLIBCXX_USE_CXX11_ABI))"
+  OUTPUT_VARIABLE TORCH_CXX11_ABI
+  OUTPUT_STRIP_TRAILING_WHITESPACE
+)
+if(TORCH_CXX11_ABI STREQUAL "0")
+  add_compile_definitions(_GLIBCXX_USE_CXX11_ABI=0)
+else()
+  add_compile_definitions(_GLIBCXX_USE_CXX11_ABI=1)
+endif()
+# ROCm/HIP
+enable_language(HIP)
+find_package(hip REQUIRED CONFIG)
+# Determine AMDGPU target from environment variable or default to gfx942
+set(AMDGPU_TARGET_ENV "$ENV{AMDGPU_TARGET}")
+if(AMDGPU_TARGET_ENV)
+  # Use environment variable if specified
+  set(AMDGPU_TARGETS "${AMDGPU_TARGET_ENV}")
+  message(STATUS "Using AMDGPU_TARGET from environment: ${AMDGPU_TARGETS}")
+else()
+  # Default to gfx942 only
+  set(AMDGPU_TARGETS "gfx942")
+  message(STATUS "AMDGPU_TARGET not set, defaulting to gfx942")
+endif()
+# Set HIP architectures
+set(CMAKE_HIP_ARCHITECTURES ${AMDGPU_TARGETS})
+# FP8 macro selection
+# Always define HIP_FP8_TYPE_FNUZ=1 (for gfx942 and host compilation)
+# Additionally define HIP_FP8_TYPE_E4M3=1 when building for gfx950
+# The existing utils.h logic will pick the right one based on architecture
+set(SGL_FP8_MACROS "-DHIP_FP8_TYPE_FNUZ=1")
+if(AMDGPU_TARGETS MATCHES "gfx950")
+  list(APPEND SGL_FP8_MACROS "-DHIP_FP8_TYPE_E4M3=1")
+  message(STATUS "Multi-arch build: Enabling both HIP_FP8_TYPE_FNUZ (gfx942) and HIP_FP8_TYPE_E4M3 (gfx950)")
+elseif(AMDGPU_TARGETS MATCHES "gfx942")
+  message(STATUS "Single-arch build: Enabling HIP_FP8_TYPE_FNUZ for gfx942")
+else()
+  message(FATAL_ERROR "Unsupported AMDGPU_TARGET '${AMDGPU_TARGETS}'. Expected 'gfx942' or 'gfx950' or both.")
+endif()
+# TopK dynamic smem bytes
+# Dynamic shared-memory budget for the TopK kernels.
+# - gfx942 (MI300/MI325): LDS is typically 64KB per workgroup -> keep dynamic smem <= ~48KB
+#   (leaves room for static shared allocations in the kernel).
+# - gfx95x (MI350): LDS is larger (e.g. 160KB per CU) -> allow the original 128KB dynamic smem.
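+# The macro is compiled into every architecture, so use the gfx942 budget whenever gfx942 is among the targets.
+if(AMDGPU_TARGETS MATCHES "gfx942")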
+  math(EXPR SGL_TOPK_DYNAMIC_SMEM_BYTES "48 * 1024")
+else()
+  math(EXPR SGL_TOPK_DYNAMIC_SMEM_BYTES "32 * 1024 * 4")
+endif()
+set(SGL_TOPK_MACROS "-DSGL_TOPK_DYNAMIC_SMEM_BYTES=${SGL_TOPK_DYNAMIC_SMEM_BYTES}")
+# Paths / includes
+set(PROJ_ROOT ${CMAKE_CURRENT_LIST_DIR})
+set(SGL_INCLUDE_DIRS
+  ${PROJ_ROOT}/include
+  ${PROJ_ROOT}/include/impl
+  ${PROJ_ROOT}/csrc
+  ${TORCH_INCLUDE_DIRS}
+)
+# Platform-specific library directory
+set(PLAT_LIB_DIR "/usr/lib/x86_64-linux-gnu")
+link_directories(${PLAT_LIB_DIR})
+# Sources
+set(SOURCES
+${PROJ_ROOT}/csrc/allreduce/custom_all_reduce.hip
+${PROJ_ROOT}/csrc/allreduce/deterministic_all_reduce.hip
+${PROJ_ROOT}/csrc/allreduce/quick_all_reduce.hip
+${PROJ_ROOT}/csrc/common_extension_rocm.cc
+${PROJ_ROOT}/csrc/elementwise/activation.hip
+${PROJ_ROOT}/csrc/elementwise/pos_enc.hip
+${PROJ_ROOT}/csrc/elementwise/topk.hip
+${PROJ_ROOT}/csrc/grammar/apply_token_bitmask_inplace_hip.hip
+${PROJ_ROOT}/csrc/kvcacheio/transfer.hip
+${PROJ_ROOT}/csrc/moe/moe_align_kernel.hip
+${PROJ_ROOT}/csrc/moe/moe_topk_softmax_kernels.hip
+${PROJ_ROOT}/csrc/moe/moe_topk_sigmoid_kernels.hip
+${PROJ_ROOT}/csrc/speculative/eagle_utils.hip
+)
+set_source_files_properties(
+  ${SOURCES}
+  PROPERTIES
+    LANGUAGE HIP
+)
+# Compile / Link flags
+add_compile_options($<$<COMPILE_LANGUAGE:CXX>:-O3>)
+set(SGL_HIP_FLAGS
+  -DNDEBUG
+  -DOPERATOR_NAMESPACE=sgl_kernel
+  -O3
+  -std=c++17
+  -DENABLE_BF16
+  -DENABLE_FP8
+  ${SGL_FP8_MACROS}
+  -Wno-pass-failed
+  -Wundefined-internal
+  ${SGL_TOPK_MACROS}
+)
+# Python extension
+Python_add_library(common_ops MODULE USE_SABI ${SKBUILD_SABI_VERSION} WITH_SOABI ${SOURCES})
+target_include_directories(common_ops PRIVATE ${SGL_INCLUDE_DIRS})
+# Apply per-language flags
+target_compile_options(common_ops PRIVATE
+  $<$<COMPILE_LANGUAGE:HIP>:${SGL_HIP_FLAGS}>
+)
+target_link_libraries(common_ops PRIVATE
+  ${TORCH_LIBRARIES}
+  hip::device
+  hip::host
+  hiprtc
+  amdhip64
+)
+target_link_options(common_ops PRIVATE
+  "SHELL:-Wl,-rpath,'\$ORIGIN/../../torch/lib'"
+)
+install(TARGETS common_ops
+  LIBRARY DESTINATION sgl_kernel
+)
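The FP8 macro selection and the TopK shared-memory budget described in the CMake comments above can be mirrored in a few lines of Python for a quick sanity check. This is only an illustrative sketch: `fp8_macros` and `topk_smem_bytes` are hypothetical helper names that are not part of sgl-kernel, and plain substring matching stands in for CMake's `MATCHES`. The budget helper follows the intent stated in the comments (48 KB when gfx942 is targeted, 128 KB otherwise).

```python
def fp8_macros(targets: str) -> list[str]:
    """Return the FP8 compile definitions for a ';'-separated target list."""
    # FNUZ is always defined (gfx942 and host compilation).
    macros = ["-DHIP_FP8_TYPE_FNUZ=1"]
    if "gfx950" in targets:
        # E4M3 is added on top when gfx950 is one of the targets.
        macros.append("-DHIP_FP8_TYPE_E4M3=1")
    elif "gfx942" not in targets:
        raise ValueError(f"Unsupported AMDGPU_TARGET {targets!r}")
    return macros


def topk_smem_bytes(targets: str) -> int:
    """Dynamic shared-memory budget for the TopK kernels, in bytes."""
    # 48 KB leaves room for static LDS on gfx942; otherwise 4 * 32 KB = 128 KB.
    return 48 * 1024 if "gfx942" in targets else 32 * 1024 * 4


if __name__ == "__main__":
    for target_list in ("gfx942", "gfx950", "gfx942;gfx950"):
        print(target_list, fp8_macros(target_list), topk_smem_bytes(target_list))
```

Because the budget is passed as a single compile definition, a multi-arch build gets one value for all architectures, which is why the sketch picks the smaller budget as soon as gfx942 appears in the list.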

sglang/3rdparty/amd/sgl-kernel/build_rocm.sh ADDED

	@@ -0,0 +1,123 @@

+#!/bin/bash
+set -euo pipefail
+ROCM_VERSION=$1
+PYTHON_ROOT_PATH="/opt/venv/bin"
+AMDGPU_TARGET="gfx942;gfx950"
+echo "Python root path is: $PYTHON_ROOT_PATH"
+# Get version from git tags
+SGLANG_VERSION="v0.5.6"   # Default version, will be overridden if git tags are found
+# Fetch tags from origin to ensure we have the latest
+if git fetch --tags origin; then
+  # Get the latest version tag sorted by version number (e.g., v0.5.7)
+  VERSION_FROM_TAG=$(git tag -l 'v[0-9]*' --sort=-v:refname | head -1)
+  if [ -n "$VERSION_FROM_TAG" ]; then
+    SGLANG_VERSION="$VERSION_FROM_TAG"
+    echo "Using SGLang version from git tags: $SGLANG_VERSION"
+  else
+    echo "Warning: No version tags found; using default $SGLANG_VERSION" >&2
+  fi
+else
+  echo "Warning: Failed to fetch tags from origin; using default $SGLANG_VERSION" >&2
+fi
+# Default base tags (can be overridden by command line arguments)
+DEFAULT_MI30X_BASE_TAG="${SGLANG_VERSION}-rocm700-mi30x"
+DEFAULT_MI35X_BASE_TAG="${SGLANG_VERSION}-rocm700-mi35x"
+# Parse command line arguments
+MI30X_BASE_TAG="${DEFAULT_MI30X_BASE_TAG}"
+MI35X_BASE_TAG="${DEFAULT_MI35X_BASE_TAG}"
+# Detect GPU architecture from the Kubernetes runner hostname
+HOSTNAME_VALUE=$(hostname)
+GPU_ARCH="mi30x"   # default
+# Host names look like: linux-mi35x-gpu-1-xxxxx-runner-zzzzz
+if [[ "${HOSTNAME_VALUE}" =~ ^linux-(mi[0-9]+[a-z]*)-gpu-[0-9]+ ]]; then
+  GPU_ARCH="${BASH_REMATCH[1]}"
+  echo "Detected GPU architecture from hostname: ${GPU_ARCH}"
+else
+  echo "Warning: could not parse GPU architecture from '${HOSTNAME_VALUE}', defaulting to ${GPU_ARCH}"
+fi
+case "${GPU_ARCH}" in
+  mi35x)
+    echo "Runner uses ${GPU_ARCH}; will fetch mi35x image."
+    ;;
+  mi30x|mi300|mi325)
+    echo "Runner uses ${GPU_ARCH}; will fetch mi30x image."
+    GPU_ARCH="mi30x"
+    ;;
+  *)
+    echo "Runner architecture '${GPU_ARCH}' unrecognised; defaulting to mi30x image." >&2
+    GPU_ARCH="mi30x"
+    ;;
+esac
+if [[ -f /etc/podinfo/gha-render-devices ]]; then
+  DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
+else
+  DEVICE_FLAG="--device /dev/dri"
+fi
+# Find the latest image
+find_latest_image() {
+  local gpu_arch=$1
+  local base_tag days_back image_tag
+  case "${gpu_arch}" in
+      mi30x) base_tag="${MI30X_BASE_TAG}" ;;
+      mi35x) base_tag="${MI35X_BASE_TAG}" ;;
+      *)     echo "Error: unsupported GPU architecture '${gpu_arch}'" >&2; return 1 ;;
+  esac
+  for days_back in {0..6}; do
+    image_tag="${base_tag}-$(date -d "${days_back} days ago" +%Y%m%d)"
+    echo "Checking for image: rocm/sgl-dev:${image_tag}" >&2
+    if docker manifest inspect "rocm/sgl-dev:${image_tag}" >/dev/null 2>&1; then
+      echo "Found available image: rocm/sgl-dev:${image_tag}" >&2
+      echo "rocm/sgl-dev:${image_tag}"
+      return 0
+    fi
+  done
+  echo "Error: no ${gpu_arch} image found in the last 7 days for base ${base_tag}" >&2
+  echo "Using hard-coded fallback…" >&2
+  if [[ "${gpu_arch}" == "mi35x" ]]; then
+    echo "rocm/sgl-dev:v0.5.3-rocm700-mi35x-20251009"
+  else
+    echo "rocm/sgl-dev:v0.5.3-rocm700-mi30x-20251009"
+  fi
+}
+# Pull and run the latest image
+IMAGE=$(find_latest_image "${GPU_ARCH}")
+echo "Pulling Docker image: ${IMAGE}"
+docker pull "${IMAGE}"
+docker run --rm \
+  -v "$(pwd)":/sgl-kernel \
+  -e AMDGPU_TARGET="${AMDGPU_TARGET}" \
+  "${IMAGE}" \
+  bash -c "
+  # Install CMake (version >= 3.26) - Robust Installation
+  export CMAKE_VERSION_MAJOR=3.31
+  export CMAKE_VERSION_MINOR=1
+  echo \"Downloading CMake from: https://cmake.org/files/v\${CMAKE_VERSION_MAJOR}/cmake-\${CMAKE_VERSION_MAJOR}.\${CMAKE_VERSION_MINOR}-linux-x86_64.tar.gz\"
+  wget https://cmake.org/files/v\${CMAKE_VERSION_MAJOR}/cmake-\${CMAKE_VERSION_MAJOR}.\${CMAKE_VERSION_MINOR}-linux-x86_64.tar.gz
+  tar -xzf cmake-\${CMAKE_VERSION_MAJOR}.\${CMAKE_VERSION_MINOR}-linux-x86_64.tar.gz
+  mv cmake-\${CMAKE_VERSION_MAJOR}.\${CMAKE_VERSION_MINOR}-linux-x86_64 /opt/cmake
+  export PATH=/opt/cmake/bin:\$PATH
+  ${PYTHON_ROOT_PATH}/pip install --no-cache-dir ninja setuptools wheel numpy uv scikit-build-core && \
+  cd /sgl-kernel && \
+  rm -rf CMakeLists.txt && mv CMakeLists_rocm.txt CMakeLists.txt && \
+  ${PYTHON_ROOT_PATH}/python rocm_hipify.py && \
+  ${PYTHON_ROOT_PATH}/python -m uv build --wheel -Cbuild-dir=build . --color=always --no-build-isolation && \
+  ./rename_wheels_rocm.sh
+"
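The image-selection logic in the script above (parse the runner hostname, normalise the architecture, then probe dated tags for the last seven days) can be sketched in Python as follows. This is a hedged illustration, not part of the build: `detect_gpu_arch` and `candidate_images` are hypothetical names, the sketch only lists candidate tags, and it does not call `docker manifest inspect`.

```python
import re
from datetime import date, timedelta

# Same pattern as the bash script: linux-<arch>-gpu-<n>-...
HOST_RE = re.compile(r"^linux-(mi[0-9]+[a-z]*)-gpu-[0-9]+")


def detect_gpu_arch(hostname: str) -> str:
    """Map a runner hostname to the image family, defaulting to mi30x."""
    match = HOST_RE.match(hostname)
    arch = match.group(1) if match else "mi30x"
    # Only mi35x has its own image; mi300, mi325 and unknown values use mi30x.
    return "mi35x" if arch == "mi35x" else "mi30x"


def candidate_images(base_tag: str, today: date, days: int = 7) -> list[str]:
    """List dated image names from newest to oldest."""
    return [
        f"rocm/sgl-dev:{base_tag}-{(today - timedelta(days=d)).strftime('%Y%m%d')}"
        for d in range(days)
    ]


if __name__ == "__main__":
    arch = detect_gpu_arch("linux-mi35x-gpu-1-xxxxx-runner-zzzzz")
    for image in candidate_images(f"v0.5.6-rocm700-{arch}", date.today()):
        print(image)
```

The script would take the first candidate for which the manifest lookup succeeds and fall back to the hard-coded image if none does.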

sglang/3rdparty/amd/sgl-kernel/rename_wheels_rocm.sh ADDED

	@@ -0,0 +1,30 @@

+#!/usr/bin/env bash
+set -ex
+WHEEL_DIR="dist"
+wheel_files=($WHEEL_DIR/*.whl)
+for wheel in "${wheel_files[@]}"; do
+    intermediate_wheel="${wheel/linux/manylinux2014}"
+    # Extract the current python version from the wheel name
+    if [[ $intermediate_wheel =~ -cp([0-9]+)- ]]; then
+        cp_version="${BASH_REMATCH[1]}"
+    else
+        echo "Could not extract Python version from wheel name: $intermediate_wheel"
+        continue
+    fi
+    # Detect ROCm version and add appropriate suffix
+    if ls /opt | grep -q "7.0"; then
+        new_wheel="${intermediate_wheel/-cp${cp_version}/+rocm700-cp${cp_version}}"
+    else
+        new_wheel="$intermediate_wheel"
+    fi
+    if [[ "$wheel" != "$new_wheel" ]]; then
+        echo "Renaming $wheel to $new_wheel"
+        mv -- "$wheel" "$new_wheel"
+    fi
+done
+echo "Wheel renaming completed."
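The renaming rule in the script above can be expressed as a small pure function, which makes it easy to check on sample names. This is a sketch under assumptions: `rocm_wheel_name` is a hypothetical helper, the wheel filename in the usage line is made up for illustration, the suffix is always applied (the script only applies it when a ROCm 7.0 directory is found under `/opt`), and a missing Python tag raises an error where the script skips the file.

```python
import re


def rocm_wheel_name(wheel: str, rocm_suffix: str = "rocm700") -> str:
    """Return the renamed wheel path for a given original wheel path."""
    # Step 1: replace the first "linux" with "manylinux2014".
    renamed = wheel.replace("linux", "manylinux2014", 1)
    # Step 2: find the CPython tag, e.g. "-cp310-".
    match = re.search(r"-cp([0-9]+)-", renamed)
    if match is None:
        raise ValueError(f"Could not extract Python version from {renamed!r}")
    cp_version = match.group(1)
    # Step 3: insert the ROCm local-version suffix before the CPython tag.
    return renamed.replace(f"-cp{cp_version}", f"+{rocm_suffix}-cp{cp_version}", 1)


if __name__ == "__main__":
    # Hypothetical wheel name, used only to show the transformation.
    print(rocm_wheel_name("dist/sgl_kernel-0.3.0-cp310-abi3-linux_x86_64.whl"))
```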

sglang/3rdparty/amd/sgl-kernel/rocm_hipify.py ADDED

	@@ -0,0 +1,40 @@

+from pathlib import Path
+import torch
+from torch.utils.cpp_extension import CUDAExtension
+root = Path(__file__).parent.resolve()
+include_dirs = [
+    root / "include",
+    root / "include" / "impl",
+    root / "csrc",
+]
+sources = [
+    "csrc/allreduce/custom_all_reduce.hip",
+    "csrc/allreduce/deterministic_all_reduce.hip",
+    "csrc/allreduce/quick_all_reduce.cu",
+    "csrc/common_extension_rocm.cc",
+    "csrc/elementwise/activation.cu",
+    "csrc/elementwise/pos_enc.cu",
+    "csrc/elementwise/topk.cu",
+    "csrc/grammar/apply_token_bitmask_inplace_cuda.cu",
+    "csrc/kvcacheio/transfer.cu",
+    "csrc/moe/moe_align_kernel.cu",
+    "csrc/moe/moe_topk_softmax_kernels.cu",
+    "csrc/moe/moe_topk_sigmoid_kernels.cu",
+    "csrc/speculative/eagle_utils.cu",
+]
+libraries = ["hiprtc", "amdhip64", "c10", "torch", "torch_python"]
+ext_modules = [
+    CUDAExtension(
+        name="sgl_kernel.common_ops",
+        sources=sources,
+        include_dirs=include_dirs,
+        libraries=libraries,
+        py_limited_api=False,
+    ),
+]

sglang/3rdparty/amd/tuning/TUNING.md ADDED

	@@ -0,0 +1,118 @@

+## Tuning SGLang Inference System with AMD GPUs
+This AppNote describes SGLang performance-tuning techniques, the code harness, and the steps to run it on systems with AMD Instinct GPUs.
+Harness code, examples, and steps are provided in detail so that results are easy to reproduce and the tuning can be adapted to specific workloads.
+Four primary runtime areas are covered:
+## 1. Triton Kernels
+To maximize Triton kernel efficiency, several strategies can be employed:
+### Key Tuning Parameters:
+- **num_stages**: Adjusts the number of pipeline stages to optimize kernel efficiency based on the specific type of operations (e.g., General Matrix Multiplication - GEMM).
+- **waves_per_eu**: Controls the usage of Vector General Purpose Registers (VGPR) to enhance occupancy, thereby improving latency or throughput.
+- **BLOCK_M, BLOCK_N, BLOCK_K**: Tunable tile sizes that assist in balancing memory transfer and computational efficiency.
+- **matrix_instr_nonkdim**: Optimizes the usage of Matrix-Fused Multiply-Add (MFMA) instructions for specific kernel types, such as Flash Attention.
+- **OPTIMIZE_EPILOGUE**: An environment variable that can be set to `1` to enhance performance by eliminating the `convert_layout` operation in the kernel's epilogue.
+```python
+@triton.autotune(configs=[
+        triton.Config({'waves_per_eu': 1}, num_warps=4, num_stages=1),
+        triton.Config({'waves_per_eu': 1}, num_warps=8, num_stages=1),
+        triton.Config({'waves_per_eu': 1}, num_warps=16, num_stages=1),
+        triton.Config({'waves_per_eu': 2}, num_warps=4, num_stages=1),
+        triton.Config({'waves_per_eu': 2}, num_warps=8, num_stages=1),
+        triton.Config({'waves_per_eu': 2}, num_warps=16, num_stages=1),
+        triton.Config({'waves_per_eu': 4}, num_warps=4, num_stages=1),
+        triton.Config({'waves_per_eu': 4}, num_warps=8, num_stages=1),
+        triton.Config({'waves_per_eu': 4}, num_warps=16, num_stages=1),
+    ], key=['BLOCK_N', 'NUM_TOKEN_BLKS'], use_cuda_graph=True)
+@triton.jit
+def _triton_kernel_function():
+    ...
+```
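The nine configurations listed above are the Cartesian product of `waves_per_eu` in {1, 2, 4} and `num_warps` in {4, 8, 16}, all with `num_stages=1`. A minimal sketch of generating that grid programmatically is shown below; it uses plain dictionaries so that it runs without Triton installed, and `autotune_grid` is a hypothetical helper name rather than anything provided by Triton or SGLang.

```python
from itertools import product

WAVES_PER_EU = (1, 2, 4)
NUM_WARPS = (4, 8, 16)


def autotune_grid(num_stages: int = 1) -> list[dict]:
    """Enumerate every (waves_per_eu, num_warps) pair as a plain dict."""
    return [
        {"waves_per_eu": waves, "num_warps": warps, "num_stages": num_stages}
        for waves, warps in product(WAVES_PER_EU, NUM_WARPS)
    ]


if __name__ == "__main__":
    for entry in autotune_grid():
        print(entry)
```

Each dictionary corresponds to one `triton.Config({'waves_per_eu': ...}, num_warps=..., num_stages=...)` entry in the decorator above, in the same order.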
+## 2. Torch Tunable Operations
+**TunableOp** is a feature in PyTorch that allows for the definition and optimization of custom kernels with tunable parameters. This feature is particularly useful for enhancing the performance of kernels by experimenting with different configurations.
+### Key Environment Variables:
+1. **PYTORCH_TUNABLEOP_ENABLED**:
+   - Default: `0`
+   - Set to `1` to enable TunableOp.
+2. **PYTORCH_TUNABLEOP_TUNING**:
+   - Default: `1`
+   - Set to `0` to disable tuning. When tuning is enabled and `PYTORCH_TUNABLEOP_ENABLED=1`, an operation with no tuned entry is tuned on first use and the result is recorded.
+3. **PYTORCH_TUNABLEOP_VERBOSE**:
+   - Default: `0`
+   - Set to `1` to enable verbose output for TunableOp.
+### Usage Example:
+To enable TunableOp and tuning, and optionally enable verbose mode, you can run the following command in your terminal:
+```bash
+#Tuning
+PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=1 your_script.sh
+#Inference with tuning op
+PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 your_script.sh
+#Print out the log
+PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 PYTORCH_TUNABLEOP_VERBOSE=1 your_script.sh
+```
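When the workload is launched from Python rather than from a shell, the same three phases can be expressed by building the environment explicitly. The sketch below is an assumption-laden illustration: `tunableop_env` is a hypothetical helper, and only the three environment variables documented above are set.

```python
import os
from typing import Optional


def tunableop_env(
    tuning: bool,
    verbose: bool = False,
    base: Optional[dict[str, str]] = None,
) -> dict[str, str]:
    """Return a copy of the environment with the TunableOp variables set."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    # "1" runs the tuning step for missing entries; "0" only reuses recorded ones.
    env["PYTORCH_TUNABLEOP_TUNING"] = "1" if tuning else "0"
    env["PYTORCH_TUNABLEOP_VERBOSE"] = "1" if verbose else "0"
    return env


if __name__ == "__main__":
    tuned = tunableop_env(tuning=True)
    print({k: v for k, v in tuned.items() if k.startswith("PYTORCH_TUNABLEOP_")})
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting `your_script.sh`, first with `tuning=True` and then with `tuning=False` for inference.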
+## 3. Torch Compilation
+The following are suggestions for optimizing matrix multiplication (GEMM) and convolution (conv) operations in PyTorch using Inductor, a part of the PyTorch compilation framework. The goal is to leverage Triton to achieve better performance.
+To tune Triton kernels with GEMM and convolution ops (conv), use the `torch.compile` function with the max-autotune mode. This benchmarks a predefined list of Triton configurations and selects the fastest one for each shape.
+### Key Configurations:
+1. **Max Autotune**:
+   - Set `torch._inductor.config.max_autotune = True` or `TORCHINDUCTOR_MAX_AUTOTUNE=1`.
+2. **Fine-Grained Control**:
+   - Enable GEMM tuning: `torch._inductor.config.max_autotune_gemm = True`.
+   - Enable tuning for pointwise/reduction ops: `torch._inductor.config.max_autotune_pointwise = True`.
+3. **Backend Selection**:
+   - Use `torch._inductor.config.max_autotune_gemm_backends` to limit backends to TRITON for better performance.
+4. **Freezing for Inference**:
+   - Use `torch._inductor.config.freezing=True` to enable constant folding optimizations.
+5. **Debugging**:
+   - Set `TORCH_COMPILE_DEBUG=1` to extract Triton kernels generated by Inductor.
+### Example Code Block:
+```bash
+#Gemm Tuning
+TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 your_script.sh
+#Specify your backend to TRITON for Gemm Tuning
+TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON your_script.sh
+#Inference with large improvement on AMD GPU
+TORCHINDUCTOR_FREEZING=1 your_script.sh
+```
+## 4. Fused MOE kernel
+To maximize MoE kernel efficiency, use the script below to find the best launch configuration.
+### Key parameters:
+- **--model**: the MoE model to tune; its config determines d_model, model_intermediate_size, and num_layers automatically
+- **--tp-size**: the tensor-parallel size of the full model run, used to derive the sharded dimension sizes
+- **--batch**: the M dimension of the MoE kernel (an accepted abbreviation of the script's `--batches` option, which takes comma-separated values); for the prefill MoE kernel the value is batch*input_len, for the decode MoE kernel the value is batch
+- **--dtype**: the computation data type
+```bash
+#Tuning
+# Example: consider the run
+#   python3 -m sglang.bench_latency --model dummy_grok1/ --load-format dummy --tokenizer-path Xenova/grok-1-tokenizer --tp 8 --batch-size 32 --input 1024 --output 8 --attention-backend triton --sampling-backend pytorch --quantization fp8
+# It uses batch size 32, input length 1024, and output length 8. From the MoE kernel's point of view ("--batch"),
+# the prefill batch is 32*1024 = 32768 and the decode batch is 32*1 = 32 (one output token is generated per step).
+# Tune the decode MoE kernel with:
+python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "32"
+# Tune the prefill MoE kernel with:
+python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "32768"
+```
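The batch arithmetic in the example above is simple enough to capture in a helper, which avoids mistakes when several workload shapes need tuning. This is a minimal sketch; `moe_batch_sizes` and `batches_argument` are hypothetical names and are not part of `benchmark_moe_rocm.py`.

```python
def moe_batch_sizes(batch_size: int, input_len: int) -> dict[str, int]:
    """Return the MoE M dimension for the prefill and decode phases."""
    return {
        # Prefill processes every input token of every request at once.
        "prefill": batch_size * input_len,
        # Decode processes one new token per request per step.
        "decode": batch_size,
    }


def batches_argument(batch_size: int, input_len: int) -> str:
    """Format both sizes as a comma-separated value for the tuning script."""
    sizes = moe_batch_sizes(batch_size, input_len)
    return f"{sizes['decode']},{sizes['prefill']}"


if __name__ == "__main__":
    print(moe_batch_sizes(32, 1024))
    print(batches_argument(32, 1024))
```

Since the tuning script splits its batch argument on commas, the formatted string lets both phases be tuned in a single invocation.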
+## Reference
+For more detailed information on tuning SGLang performance with AMD GPUs, please refer to the following link:
+[ROCm Documentation: Triton Kernel Performance Optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#triton-kernel-performance-optimization)

sglang/3rdparty/amd/tuning/benchmark_moe_rocm.py ADDED

	@@ -0,0 +1,378 @@

+import argparse
+import json
+import os
+import sys
+import torch
+import torch.nn.functional as F
+import triton
+import triton.language as tl
+from tqdm import tqdm
+from transformers import AutoConfig
+from sglang.srt.layers.moe.fused_moe_triton.fused_moe import (
+    fused_moe,
+    get_config_file_name,
+)
+padding_size = 128 if bool(int(os.getenv("SGLANG_MOE_PADDING", "0"))) else 0
+def main(model, tp_size, dtype: str, batches):
+    method = fused_moe
+    for bs in batches:
+        run_grid(int(bs), model=model, method=method, tp_size=tp_size, dtype=dtype)
+def prune_configs(M, N, K, configs):
+    pruned_configs = []
+    elemBytes_a = 1  # [DV Note] Hard-coded to 1 byte per element (fp8); float16 would need 2
+    elemBytes_b = 1  # [DV Note] Hard-coded to 1 byte per element (fp8); float16 would need 2
+    mfma = 16 if M < 32 or N < 32 else 32
+    # TODO (zhanglx): figure out the boundary between large and small gemms
+    large_gemm = False
+    if M >= 2048 and N >= 2048:
+        large_gemm = True
+    for config in configs:
+        BLOCK_SIZE_M = config.get("BLOCK_SIZE_M")
+        BLOCK_SIZE_N = config.get("BLOCK_SIZE_N")
+        BLOCK_SIZE_K = config.get("BLOCK_SIZE_K")
+        num_warps = config.get("num_warps")
+        matrix_instr_nonkdim = config.get("matrix_instr_nonkdim")
+        # kpack = config.get("kpack")
+        if matrix_instr_nonkdim > mfma:
+            continue
+        if mfma == 4 and BLOCK_SIZE_K < 64:
+            continue
+        # some layouts do not work properly when the
+        # number of elements per thread is less than 1
+        if BLOCK_SIZE_M * BLOCK_SIZE_N < 64:
+            continue
+        SPLIT_K = 1  # config.get("SPLIT_K")
+        GROUP_M = config.get("GROUP_SIZE_M")
+        if matrix_instr_nonkdim > BLOCK_SIZE_M or matrix_instr_nonkdim > BLOCK_SIZE_N:
+            continue
+        if matrix_instr_nonkdim >= M and matrix_instr_nonkdim != BLOCK_SIZE_M:
+            continue
+        if matrix_instr_nonkdim >= N and matrix_instr_nonkdim != BLOCK_SIZE_N:
+            continue
+        # Skip BLOCK_SIZE that is too large compared to M/N
+        # unless BLOCK_SIZE is already small enough
+        if M * 2 < BLOCK_SIZE_M and BLOCK_SIZE_M != 16:
+            continue
+        if N * 2 < BLOCK_SIZE_N and BLOCK_SIZE_N != 16:
+            continue
+        # skip large split_k when not necessary
+        if SPLIT_K != 1 and not need_split_k(M, N, K):
+            continue
+        # skip split_k that leads to EVEN_K = false
+        leap = SPLIT_K * BLOCK_SIZE_K
+        modv = K % leap
+        if modv != 0:
+            continue
+        # skip large GROUP_M
+        if GROUP_M * BLOCK_SIZE_M > M and GROUP_M != 1:
+            continue
+        # out of shared memory resource
+        # TODO (zhanglx): This does not consider the LDS usage in the epilogue
+        LDS = (
+            BLOCK_SIZE_K * BLOCK_SIZE_M * elemBytes_a
+            + BLOCK_SIZE_K * BLOCK_SIZE_N * elemBytes_b
+        )
+        if LDS > 65536:
+            continue
+        # Skip small block sizes and num_warps for large gemm
+        # For fp16 and f8, we want to only use BLOCK_SIZE >= 64
+        if large_gemm:
+            if BLOCK_SIZE_M < 64 or BLOCK_SIZE_N < 64:
+                continue
+            if BLOCK_SIZE_K < 64:
+                continue
+            if num_warps < 4:
+                continue
+        pruned_configs.append(config)
+    return pruned_configs
+def union_of_list_of_dicts(l1, l2):
+    result = []
+    temp_list = l1.copy()
+    temp_list.extend(l2)
+    for myDict in temp_list:
+        if myDict not in result:
+            result.append(myDict)
+    return result
+def run_grid(bs, model, method, tp_size, dtype: str):
+    config = AutoConfig.from_pretrained(model)
+    top_k = config.num_experts_per_tok
+    d_model = config.hidden_size
+    model_intermediate_size = config.intermediate_size
+    num_layers = config.num_hidden_layers
+    hidden_states_dtype = config.torch_dtype
+    if config.num_experts_per_tok:
+        if config.architectures[0] == "Grok1ModelForCausalLM":
+            num_total_experts = config.num_experts
+        else:
+            num_total_experts = config.num_local_experts
+    else:
+        raise ValueError(f"Unsupported Mixtral model {model}")
+    # tp_size = 2
+    num_warmup_calls = 10
+    num_calls = 30
+    num_warmup_trials = 1
+    num_trials = 1
+    full_configs = []
+    block_m_range = [16, 32, 64, 128, 256]
+    block_n_range = [16, 32, 64, 128, 256]
+    block_k_range = [32, 64, 128, 256]  # MUST >= 32
+    num_warps_range = [1, 2, 4, 8]
+    group_m_range = [1, 4, 8, 16, 32]
+    # For now only num_stages=2 is searched for all gemm configs we care about.
+    # Keep this explicit so that we do not forget we may need to set it to
+    # other values in the future
+    num_stage_range = [2]
+    waves_per_eu_range = [0, 1, 2, 4, 8]
+    # Remove 32 because of triton compiling error
+    matrix_instr_nonkdim_range = [16]
+    kpack_range = [1, 2]
+    for block_size_m in block_m_range:
+        for block_size_n in block_n_range:
+            for block_size_k in block_k_range:
+                for group_size_m in group_m_range:
+                    for num_warps in num_warps_range:
+                        for num_stages in num_stage_range:
+                            for waves_per_eu in waves_per_eu_range:
+                                for matrix_instr_nonkdim in matrix_instr_nonkdim_range:
+                                    for kpack in kpack_range:
+                                        full_configs.append(
+                                            {
+                                                "BLOCK_SIZE_M": block_size_m,
+                                                "BLOCK_SIZE_N": block_size_n,
+                                                "BLOCK_SIZE_K": block_size_k,
+                                                "GROUP_SIZE_M": group_size_m,
+                                                "num_warps": num_warps,
+                                                "num_stages": num_stages,
+                                                "waves_per_eu": waves_per_eu,
+                                                "matrix_instr_nonkdim": matrix_instr_nonkdim,
+                                                "kpack": kpack,
+                                            }
+                                        )
+    M1 = bs * 2
+    N1 = model_intermediate_size * 2 // tp_size
+    K1 = d_model
+    prune_configs_1 = prune_configs(M1, N1, K1, full_configs)
+    M2 = bs * 2
+    N2 = d_model
+    K2 = model_intermediate_size // tp_size
+    prune_configs_2 = prune_configs(M2, N2, K2, full_configs)
+    configs = union_of_list_of_dicts(prune_configs_1, prune_configs_2)
+    print(f"{bs=} || {len(full_configs)=} | {len(prune_configs_1)=} | \
+            {len(prune_configs_2)=} | {len(configs)=}")
+    best_config = None
+    best_time_us = 1e20
+    print(f"{tp_size=} {bs=}")
+    for config in tqdm(configs):
+        # warmup
+        try:
+            print(config)
+            for _ in range(num_warmup_trials):
+                run_timing(
+                    num_calls=num_warmup_calls,
+                    bs=bs,
+                    d_model=d_model,
+                    num_total_experts=num_total_experts,
+                    top_k=top_k,
+                    tp_size=tp_size,
+                    model_intermediate_size=model_intermediate_size,
+                    method=method,
+                    config=config,
+                    dtype=dtype,
+                    hidden_states_dtype=hidden_states_dtype,
+                )
+        except triton.runtime.autotuner.OutOfResources:
+            continue
+        # trial
+        for _ in range(num_trials):
+            kernel_dur_ms = run_timing(
+                num_calls=num_calls,
+                bs=bs,
+                d_model=d_model,
+                num_total_experts=num_total_experts,
+                top_k=top_k,
+                tp_size=tp_size,
+                model_intermediate_size=model_intermediate_size,
+                method=method,
+                config=config,
+                dtype=dtype,
+                hidden_states_dtype=hidden_states_dtype,
+            )
+            kernel_dur_us = 1000 * kernel_dur_ms
+            model_dur_ms = kernel_dur_ms * num_layers
+            if kernel_dur_us < best_time_us:
+                best_config = config
+                best_time_us = kernel_dur_us
+                tqdm.write(
+                    f"{kernel_dur_us=:.1f} {model_dur_ms=:.1f}"
+                    f" {bs=} {tp_size=} {top_k=} {num_total_experts=} "
+                    f"{d_model=} {model_intermediate_size=} {num_layers=}"
+                )
+    print("best_time_us", best_time_us)
+    print("best_config", best_config)
+    # holds Dict[str, Dict[str, int]]
+    filename = get_config_file_name(
+        num_total_experts,
+        model_intermediate_size // tp_size,
+        "float8" if dtype == "float8" else None,
+    )
+    print(f"writing config to file {filename}")
+    existing_content = {}
+    if os.path.exists(filename):
+        with open(filename, "r") as f:
+            existing_content = json.load(f)
+    existing_content[str(bs)] = best_config
+    with open(filename, "w") as f:
+        json.dump(existing_content, f, indent=4)
+        f.write("\n")
+def run_timing(
+    num_calls: int,
+    bs: int,
+    d_model: int,
+    num_total_experts: int,
+    top_k: int,
+    tp_size: int,
+    model_intermediate_size: int,
+    method,
+    config,
+    dtype: str,
+    hidden_states_dtype,
+) -> float:
+    shard_intermediate_size = model_intermediate_size // tp_size
+    hidden_states = torch.rand(
+        (bs, d_model),
+        device="cuda:0",
+        dtype=hidden_states_dtype,
+    )
+    w1 = torch.rand(
+        (num_total_experts, 2 * shard_intermediate_size, d_model + padding_size),
+        device=hidden_states.device,
+        dtype=hidden_states.dtype,
+    )
+    w2 = torch.rand(
+        (num_total_experts, d_model, shard_intermediate_size + padding_size),
+        device=hidden_states.device,
+        dtype=hidden_states.dtype,
+    )
+    w1_scale = None
+    w2_scale = None
+    a1_scale = None
+    a2_scale = None
+    if dtype == "float8":
+        w1 = w1.to(torch.float8_e4m3fnuz)
+        w2 = w2.to(torch.float8_e4m3fnuz)
+        w1_scale = torch.ones(
+            num_total_experts, device=hidden_states.device, dtype=torch.float32
+        )
+        w2_scale = torch.ones(
+            num_total_experts, device=hidden_states.device, dtype=torch.float32
+        )
+        a1_scale = torch.ones(1, device=hidden_states.device, dtype=torch.float32)
+        a2_scale = torch.ones(1, device=hidden_states.device, dtype=torch.float32)
+    gating_output = F.softmax(
+        torch.rand(
+            (num_calls, bs, num_total_experts),
+            device=hidden_states.device,
+            dtype=torch.float32,
+        ),
+        dim=-1,
+    )
+    ##################################
+    start_event = torch.cuda.Event(enable_timing=True)
+    end_event = torch.cuda.Event(enable_timing=True)
+    start_event.record()
+    for i in range(num_calls):
+        hidden_states = method(
+            hidden_states=hidden_states,
+            w1=w1,
+            w2=w2,
+            w1_scale=w1_scale,
+            w2_scale=w2_scale,
+            a1_scale=a1_scale,
+            a2_scale=a2_scale,
+            gating_output=gating_output[i],
+            topk=top_k,
+            renormalize=True,
+            inplace=True,
+            override_config=config,
+            use_fp8=dtype == "float8",
+        )
+    end_event.record()
+    end_event.synchronize()
+    dur_ms = start_event.elapsed_time(end_event) / num_calls
+    return dur_ms
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        prog="benchmark_mixtral_moe",
+        description="Benchmark and tune the fused_moe kernel",
+    )
+    parser.add_argument(
+        "--dtype",
+        type=str,
+        default="auto",
+        choices=["auto", "float8", "float16", "bfloat16"],
+        help="Data type used for fused_moe kernel computations",
+    )
+    parser.add_argument("--model", type=str, default="hpcai-tech/grok-1")
+    parser.add_argument("--tp-size", type=int, default=2, help="Tensor parallel size")
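+    # --batches is required: args.batches.split(",") below fails on None.
+    parser.add_argument("-b", "--batches", type=str, required=True, help="Comma-separated batch sizes")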
+    args = parser.parse_args()
+    batches = args.batches.split(",")
+    sys.exit(main(args.model, args.tp_size, args.dtype, batches))

sglang/docs/supported_models/extending/modelscope.md ADDED

	@@ -0,0 +1,28 @@

+# Use Models From ModelScope
+To use a model from [ModelScope](https://www.modelscope.cn), set the environment variable `SGLANG_USE_MODELSCOPE`.
+```bash
+export SGLANG_USE_MODELSCOPE=true
+```
+We take [Qwen2-7B-Instruct](https://www.modelscope.cn/models/qwen/qwen2-7b-instruct) as an example.
+Launch the Server:
+```bash
+python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000
+```
+Or start it with Docker:
+```bash
+docker run --gpus all \
+    -p 30000:30000 \
+    -v ~/.cache/modelscope:/root/.cache/modelscope \
+    --env "SGLANG_USE_MODELSCOPE=true" \
+    --ipc=host \
+    lmsysorg/sglang:latest \
+    python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
+```
+Note that ModelScope uses a different cache directory than Hugging Face. You may need to set it manually to avoid running out of disk space.
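A minimal sketch of redirecting the cache to a larger disk before launching the server. It assumes ModelScope reads its cache location from the `MODELSCOPE_CACHE` environment variable, which is not stated above and should be checked against the ModelScope documentation for your version. The helper name `build_launch_env` and the path `/data/modelscope_cache` are placeholders.

```python
import os


def build_launch_env(cache_dir, base_env=None):
    """Return a copy of the environment configured for ModelScope downloads."""
    env = dict(os.environ if base_env is None else base_env)
    # Documented above: tells SGLang to fetch models from ModelScope.
    env["SGLANG_USE_MODELSCOPE"] = "true"
    # Assumption: ModelScope honors MODELSCOPE_CACHE for its cache directory.
    env["MODELSCOPE_CACHE"] = cache_dir
    return env


env = build_launch_env("/data/modelscope_cache", base_env={})
print(env)
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when starting `python -m sglang.launch_server`.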

sglang/docs/supported_models/extending/support_new_models.md ADDED

	@@ -0,0 +1,320 @@

+# How to Support New Models
+This document explains how to add support for new language models and multimodal large language models (MLLMs) in
+SGLang. It also covers how to test new models and register external implementations.
+## How to Support a New Language Model
+To support a new model in SGLang, you only need to add a single file under
+the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn
+from existing model implementations and create a new file for your model. For most models, you should be able to find a
+similar model to start with (e.g., starting from Llama). Also refer to the section on how
+to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang).
+## How to Support a New Multimodal Large Language Model
+To support a new multimodal large language model (MLLM) in SGLang, there are several key components in addition to the
+standard LLM support:
+1. **Register your new model as multimodal**:
+   Extend `is_multimodal_model`
+   in [model_config.py](https://github.com/sgl-project/sglang/blob/0ab3f437aba729b348a683ab32b35b214456efc7/python/sglang/srt/configs/model_config.py#L561)
+   to return `True` for your model.
+2. **Register a new chat-template**:
+   This is only needed when the default chat template cannot accept images as input: register a new chat template in [conversation.py](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/conversation.py) along with the corresponding matching function.
+3. **Multimodal Data Processor**:
+   Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your
+   model’s dedicated processor.
+   See [multimodal_processor.py](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/multimodal/processors)
+   for more details.
+4. **Handle Multimodal Tokens**:
+   Implement a `pad_input_ids` function for your new model. In this function, multimodal tokens in the prompt should be
+   expanded (if necessary) and padded with multimodal-data-hashes so that SGLang can recognize different multimodal data
+   with `RadixAttention`.
+5. **Handle Image Feature Extraction**:
+   Implement a `get_image_feature` function for your new model, which extracts image features from raw image data and converts them into the embeddings used by the language model.
+6. **Adapt to Vision Attention**:
+   Adapt the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.
+You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or
+other MLLM implementations. These models demonstrate how to correctly handle both multimodal and textual inputs.
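The idea behind step 4 can be sketched in plain Python. This is an illustration only, not SGLang's actual `pad_input_ids` signature: the function name `expand_and_pad`, its parameters, and the token and hash values are all hypothetical. Each image placeholder is expanded to a fixed number of positions and filled with a value derived from that image's hash, so two prompts share a token prefix only when they contain the same image.

```python
def expand_and_pad(input_ids, image_token_id, tokens_per_image, image_hashes):
    """Expand each image placeholder and fill it with a hash-derived pad value."""
    out = []
    image_idx = 0
    for tok in input_ids:
        if tok == image_token_id:
            # One pad value per image: identical images yield identical runs.
            pad_value = image_hashes[image_idx]
            out.extend([pad_value] * tokens_per_image)
            image_idx += 1
        else:
            out.append(tok)
    return out


ids = [1, 9000, 2, 9000, 3]  # 9000 stands in for the image placeholder token
padded = expand_and_pad(
    ids, image_token_id=9000, tokens_per_image=3, image_hashes=[777001, 777002]
)
print(padded)
```

Because the padded IDs differ between distinct images, a prefix cache keyed on token IDs will not confuse one image with another.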
+## Testing and Debugging
+Please note all your testing and benchmarking results in the PR description.
+### Interactive Debugging
+For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands
+should give the same text output and very similar prefill logits:
+- Get the reference output:
+  ```bash
+  python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,mllm}
+  ```
+- Get the SGLang output:
+  ```bash
+  python3 -m sglang.bench_one_batch --correct --model [new model]
+  ```
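Once both commands have printed their prefill logits, "very similar" can be checked numerically. The sketch below is a hypothetical helper, not part of either script; the tolerance `atol=1e-2` and the sample values are assumptions chosen for illustration.

```python
def max_abs_diff(ref, test):
    """Largest element-wise absolute difference between two logit vectors."""
    if len(ref) != len(test):
        raise ValueError("logit vectors must have the same length")
    return max(abs(r - t) for r, t in zip(ref, test))


def logits_close(ref, test, atol=1e-2):
    """True when every logit differs by at most atol (an assumed tolerance)."""
    return max_abs_diff(ref, test) <= atol


hf_logits = [1.50, -0.25, 3.10]    # illustrative reference values
sgl_logits = [1.50, -0.25, 3.105]  # illustrative SGLang values
print(max_abs_diff(hf_logits, sgl_logits), logits_close(hf_logits, sgl_logits))
```

A large difference concentrated in a few positions usually points at a specific layer or weight-loading mismatch rather than numerical noise.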
+### Add the Model to the Test Suite
+To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in
+the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py)
+file. Then test the new model on your local machine and report the results on representative benchmarks (GSM8K, MMLU, MMMU,
+MMMU-Pro, etc.) in your PR.
+For VLMs, also include a test in `test_vision_openai_server_{x}.py` (e.g. [test_vision_openai_server_a.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server_a.py), [test_vision_openai_server_b.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server_b.py)).
+This is an example command for testing a new model on your local machine:
+```bash
+ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerationModels.test_others
+```
+### Benchmark
+- **(Required) MMMU**: follow the MMMU benchmark [README.md](https://github.com/sgl-project/sglang/blob/main/benchmark/mmmu/README.md) to compare SGLang and HF Transformers accuracy. The accuracy score from the SGLang run should not be much lower than that from the HF Transformers run. Similarly, follow https://docs.sglang.io/developer_guide/benchmark_and_profiling.html to compare performance: TTFT and throughput must meet or exceed the baselines (e.g., HF Transformers).
+- **(Optional) Other evals**: If you ran other evals, please note the results in the PR description.
+## Port a Model from vLLM to SGLang
+The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable
+resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models
+from vLLM to SGLang.
+To port a model from vLLM to SGLang:
+- Compare these two files for guidance:
+  - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py)
+  - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py)
+- The major differences include:
+  - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`).
+  - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.**
+  - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.**
+  - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers.
+  - **Remove `Sample`.**
+  - **Change the `forward()` functions** and add a `forward_batch()` method.
+  - **Add `EntryClass`** at the end.
+  - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components.
+Note: make sure you add your new model to the supported models list in the supported models documentation.
+## Registering an External Model Implementation
+In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server.
+This allows you to integrate your model without modifying the source code.
+For example:
+```python
+from sglang.srt.models.registry import ModelRegistry
+from sglang.srt.entrypoints.http_server import launch_server
+# For a single model, add it to the registry:
+ModelRegistry.models[model_name] = model_class
+# For multiple models, you can imitate the import_model_classes() function:
+from functools import lru_cache
+@lru_cache()
+def import_new_model_classes():
+    model_arch_name_to_cls = {}
+    # Populate model_arch_name_to_cls with your new model classes.
+    ...
+    return model_arch_name_to_cls
+ModelRegistry.models.update(import_new_model_classes())
+# Launch the server with your server arguments:
+launch_server(server_args)
+```
+## Example: Implementing and Serving a Llama Wrapper Model
+Below is an introductory, step-by-step walkthrough on how to implement a new model end-to-end in SGLang and then run it via the [Offline Engine](https://github.com/sgl-project/sglang/blob/main/docs/basic_usage/offline_engine_api.ipynb).
+### Implementing Our Model
+To keep things simple, this new model will be a simple wrapper around [Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), and our goal will be just to bias the output logits for each `forward` call by taking the square root of each positive logit (non-positive logits are left unchanged).
+Let's start by defining our model in a file called `llama_wrapper.py`.
+The first step is to import the necessary libraries from SRT, which is SGLang's internal backend.
+```python
+# In the file `llama_wrapper.py`
+import torch
+from transformers import LlamaConfig
+from typing import Optional
+from sglang.srt.layers.logits_processor import LogitsProcessorOutput
+from sglang.srt.layers.quantization.base_config import QuantizationConfig
+from sglang.srt.model_executor.forward_batch_info import ForwardBatch, PPProxyTensors
+from sglang.srt.models.llama import LlamaForCausalLM
+```
+Next, we declare a new `class` for our model and have it inherit from `LlamaForCausalLM`, which allows our model to access `LlamaForCausalLM`'s predefined modules and layers, such as `LlamaAttention` and `LlamaMLP`.
+Note that almost all model implementations take in `config` and `quant_config` as arguments for their `__init__` method; `config` and `quant_config` are passed in via [`model_loader/loader.py`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_loader/loader.py#L219).
+Because we have inherited from `LlamaForCausalLM`, we can pass our parameters directly to its constructor, which will set the member variables for us.
+```python
+class LlamaWrapper(LlamaForCausalLM):
+    def __init__(
+        self,
+        config: LlamaConfig,
+        quant_config: Optional[QuantizationConfig] = None,
+        prefix: str = "",
+    ) -> None:
+        super().__init__(config=config, quant_config=quant_config, prefix=prefix)
+```
+Now, we want to define the `forward` method, which is what will be called at inference time.
+Note that the signature for `forward` is essentially the same for any model; you can take a look at the other models defined in the [`models` directory](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/) for references.
+To see where exactly `forward` is called in the SGLang runtime's internals, take a look at [`forward_decode`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_executor/model_runner.py#L1705) and [`forward_extend`](https://github.com/sgl-project/sglang/blob/bf72b80122fd888bf619d17b96fa3e323ab809fc/python/sglang/srt/model_executor/model_runner.py#L1724) in the [`ModelRunner` class](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/model_executor/model_runner.py).
+```python
+    @torch.no_grad()
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        positions: torch.Tensor,
+        forward_batch: ForwardBatch,
+        pp_proxy_tensors: Optional[PPProxyTensors] = None,
+        input_embeds: Optional[torch.Tensor] = None,
+        get_embedding: bool = False,
+    ) -> LogitsProcessorOutput:
+```
+We now call the `__call__` method for `self.model` (which is a member variable that `LlamaForCausalLM` defines in its `__init__` method), which eventually calls the `forward` method of the underlying `LlamaModel`.
+After that, we feed the `hidden_states` into our model's `LogitsProcessor` (again defined in `LlamaForCausalLM`).
+```python
+        hidden_states = self.model(
+            input_ids,
+            positions,
+            forward_batch,
+            input_embeds,
+            pp_proxy_tensors=pp_proxy_tensors,
+        )
+        res: LogitsProcessorOutput = self.logits_processor(
+            input_ids,
+            hidden_states,
+            self.lm_head,
+            forward_batch,
+        )
+```
+After receiving the logits for the next token, we can finally perform our biasing step.
+```python
+        orig_logits = res.next_token_logits
+        res.next_token_logits = torch.where(
+            orig_logits > 0,
+            orig_logits.sqrt(),
+            orig_logits
+        )
+        return res
+```
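To see what this biasing does to individual values, here is the same rule in plain Python on a handful of sample logits. The function name `bias_logits` and the inputs are illustrative; the wrapper itself uses `torch.where` as shown above.

```python
import math


def bias_logits(logits):
    """Take the square root of positive logits; leave the rest unchanged."""
    return [math.sqrt(x) if x > 0 else x for x in logits]


biased = bias_logits([4.0, 0.25, 0.0, -1.0])
print(biased)  # [2.0, 0.5, 0.0, -1.0]
```

Logits above 1 shrink, logits between 0 and 1 grow, and zero or negative logits pass through. In the `torch.where` version, `orig_logits.sqrt()` is evaluated for every element, so negative entries produce NaN in that intermediate tensor, but `torch.where` selects the original value at those positions.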
+Now, our `LlamaWrapper` model is created and ready to be served!
+### Serving Our Model Via SGLang's Offline Engine
+The next step of this walkthrough involves hosting our new model offline, so that it can be served locally and without an HTTP server.
+First, create a new file called `run.py`.
+Now, we must ensure that SGLang's `ModelRegistry` can find our model.
+To do this, we first download the model's configuration and weights from Hugging Face.
+```python
+# In the file `run.py`
+import asyncio
+from functools import lru_cache
+from huggingface_hub import snapshot_download
+from llama_wrapper import LlamaWrapper # Make sure to import our new model!
+import sglang as sgl
+from sglang.srt.models.registry import ModelRegistry
+# Make sure to request access to this model on Huggingface, then export your
+# `HF_TOKEN` to download the model snapshot
+llama_dir = snapshot_download(
+    repo_id="meta-llama/Llama-3.1-8B-Instruct",
+    local_dir="./llama_ckpt",
+)
+```
+Now that we have our model on disk, we want to point it to `LlamaWrapper` by changing the `architectures` field in `./llama_ckpt/config.json` to be `LlamaWrapper`.
+That way, when we pass in the path of our model checkpoint to SGLang, it will know that we want to use "LlamaWrapper" instead of "LlamaForCausalLM" as our model.
+```json
+{
+  "architectures": [
+    "LlamaWrapper"
+  ],
+  ...
+}
+```
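If you would rather script the edit than change `config.json` by hand, a sketch using only the standard library follows. The helper name `set_architecture` is hypothetical, and the demo writes to a temporary file rather than the real checkpoint; for the walkthrough you would call it with `./llama_ckpt/config.json`.

```python
import json
import tempfile
from pathlib import Path


def set_architecture(config_path, arch_name):
    """Rewrite the `architectures` field of a Hugging Face style config.json."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["architectures"] = [arch_name]
    path.write_text(json.dumps(config, indent=2))
    return config


# Demo on a throwaway file so the real checkpoint is not touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "config.json"
    demo.write_text(
        json.dumps({"architectures": ["LlamaForCausalLM"], "model_type": "llama"})
    )
    updated = set_architecture(demo, "LlamaWrapper")

print(updated["architectures"])
```

All other fields in the config are preserved; only `architectures` changes.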
+However, if we don't link our `LlamaWrapper` class to the "LlamaWrapper" registry keyword, then SGLang won't be able to find our model.
+Thus, to register our `LlamaWrapper`, we want to follow the steps in the above section titled "Registering an External Model Implementation".
+```python
+@lru_cache()
+def import_new_model_classes():
+    model_arch_name_to_cls = {"LlamaWrapper": LlamaWrapper}
+    return model_arch_name_to_cls
+ModelRegistry.models.update(import_new_model_classes())
+```
+Lastly, when we create our `Engine`, we just pass in the path to the local model directory.
+Then, our `LlamaWrapper` is ready to be served; for this walkthrough, we will use SGLang `Engine`'s non-streaming asynchronous generation endpoint.
+```python
+def main():
+    llm = sgl.Engine(model_path="./llama_ckpt")
+    sampling_params = {"temperature": 0.2, "top_k": 5}
+    prompts = [
+        "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
+        "Provide a concise factual statement about France’s capital city. The capital of France is",
+        "Explain possible future trends in artificial intelligence. The future of AI is",
+    ]
+    asyncio.run(run_llm(llm, sampling_params, prompts))
+    llm.shutdown()
+async def run_llm(
+    llm,
+    sampling_params,
+    prompts,
+) -> None:
+    outputs = await llm.async_generate(prompts, sampling_params)
+    for prompt, output in zip(prompts, outputs):
+        print(f"\nPrompt: {prompt}")
+        print(f"Generated text: {output['text']}")
+if __name__ == "__main__":
+    main()
+```
+Now, when we call `python run.py`, we will get the outputs of our newly created model!
+## Documentation
+Add your model to the table of supported models in [generative_models.md](../text_generation/generative_models.md) or [multimodal_language_models.md](../text_generation/multimodal_language_models.md).
+---
+By following these guidelines, you can add support for new language models and multimodal large language models in
+SGLang and ensure they are thoroughly tested and easily integrated into the system.

sglang/docs/supported_models/retrieval_ranking/classify_models.md ADDED

	@@ -0,0 +1,162 @@

+# Classification API
+This document describes the `/v1/classify` API endpoint implementation in SGLang, which is compatible with vLLM's classification API format.
+## Overview
+The classification API allows you to classify text inputs using classification models. This implementation follows the same format as the vLLM 0.7.0 classification API.
+## API Endpoint
+```
+POST /v1/classify
+```
+## Request Format
+```json
+{
+  "model": "model_name",
+  "input": "text to classify"
+}
+```
+### Parameters
+- `model` (string, required): The name of the classification model to use
+- `input` (string, required): The text to classify
+- `user` (string, optional): User identifier for tracking
+- `rid` (string, optional): Request ID for tracking
+- `priority` (integer, optional): Request priority
+## Response Format
+```json
+{
+  "id": "classify-9bf17f2847b046c7b2d5495f4b4f9682",
+  "object": "list",
+  "created": 1745383213,
+  "model": "jason9693/Qwen2.5-1.5B-apeach",
+  "data": [
+    {
+      "index": 0,
+      "label": "Default",
+      "probs": [0.565970778465271, 0.4340292513370514],
+      "num_classes": 2
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 10,
+    "total_tokens": 10,
+    "completion_tokens": 0,
+    "prompt_tokens_details": null
+  }
+}
+```
+### Response Fields
+- `id`: Unique identifier for the classification request
+- `object`: Always "list"
+- `created`: Unix timestamp when the request was created
+- `model`: The model used for classification
+- `data`: Array of classification results
+  - `index`: Index of the result
+  - `label`: Predicted class label
+  - `probs`: Array of probabilities for each class
+  - `num_classes`: Total number of classes
+- `usage`: Token usage information
+  - `prompt_tokens`: Number of input tokens
+  - `total_tokens`: Total number of tokens
+  - `completion_tokens`: Number of completion tokens (always 0 for classification)
+  - `prompt_tokens_details`: Additional token details (optional)
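As a small illustration of reading these fields, the sketch below picks the most probable class from the sample response shown above. The helper name `top_prediction` is hypothetical; the response dictionary is abbreviated to the fields it uses.

```python
response = {
    "object": "list",
    "model": "jason9693/Qwen2.5-1.5B-apeach",
    "data": [
        {
            "index": 0,
            "label": "Default",
            "probs": [0.565970778465271, 0.4340292513370514],
            "num_classes": 2,
        }
    ],
}


def top_prediction(item):
    """Return (class_index, probability) for the highest-probability class."""
    probs = item["probs"]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


class_idx, prob = top_prediction(response["data"][0])
print(class_idx, prob, response["data"][0]["label"])
```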
+## Example Usage
+### Using curl
+```bash
+curl -v "http://127.0.0.1:8000/v1/classify" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "jason9693/Qwen2.5-1.5B-apeach",
+    "input": "Loved the new café—coffee was great."
+  }'
+```
+### Using Python
+```python
+import requests
+import json
+# Make classification request
+response = requests.post(
+    "http://127.0.0.1:8000/v1/classify",
+    headers={"Content-Type": "application/json"},
+    json={
+        "model": "jason9693/Qwen2.5-1.5B-apeach",
+        "input": "Loved the new café—coffee was great."
+    }
+)
+# Parse response
+result = response.json()
+print(json.dumps(result, indent=2))
+```
+## Supported Models
+The classification API works with any classification model supported by SGLang, including:
+### Classification Models (Multi-class)
+- `LlamaForSequenceClassification` - Multi-class classification
+- `Qwen2ForSequenceClassification` - Multi-class classification
+- `Qwen3ForSequenceClassification` - Multi-class classification
+- `BertForSequenceClassification` - Multi-class classification
+- `Gemma2ForSequenceClassification` - Multi-class classification
+**Label Mapping**: The API automatically uses the `id2label` mapping from the model's `config.json` file to provide meaningful label names instead of generic class names. If `id2label` is not available, it falls back to `LABEL_0`, `LABEL_1`, etc., or `Class_0`, `Class_1` as a last resort.
+### Reward Models (Single score)
+- `InternLM2ForRewardModel` - Single reward score
+- `Qwen2ForRewardModel` - Single reward score
+- `LlamaForSequenceClassificationWithNormal_Weights` - Special reward model
+**Note**: The `/classify` endpoint in SGLang was originally designed for reward models but now supports all non-generative models. Our `/v1/classify` endpoint provides a standardized vLLM-compatible interface for classification tasks.
+## Error Handling
+The API returns appropriate HTTP status codes and error messages:
+- `400 Bad Request`: Invalid request format or missing required fields
+- `500 Internal Server Error`: Server-side processing error
+Error response format:
+```json
+{
+  "error": "Error message",
+  "type": "error_type",
+  "code": 400
+}
+```
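On the client side, the error format above can be turned into an exception. This is a sketch under the assumption that error bodies always carry the three fields shown; `ClassifyError` and `parse_classify_response` are hypothetical names, not part of SGLang.

```python
class ClassifyError(Exception):
    """Raised when /v1/classify returns an error payload."""

    def __init__(self, code, error_type, message):
        super().__init__(f"{code} {error_type}: {message}")
        self.code = code
        self.error_type = error_type
        self.message = message


def parse_classify_response(status_code, body):
    """Return the `data` list on success, raise ClassifyError otherwise."""
    if status_code == 200:
        return body["data"]
    raise ClassifyError(
        code=body.get("code", status_code),
        error_type=body.get("type", "unknown"),
        message=body.get("error", ""),
    )


try:
    parse_classify_response(
        400, {"error": "Error message", "type": "error_type", "code": 400}
    )
except ClassifyError as exc:
    caught = exc
    print(exc)
```

With `requests`, the two arguments would be `response.status_code` and `response.json()`.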
+## Implementation Details
+The classification API is implemented using:
+1. **Rust Model Gateway**: Handles routing and request/response models in `sgl-model-gateway/src/protocols/spec.rs`
+2. **Python HTTP Server**: Implements the actual endpoint in `python/sglang/srt/entrypoints/http_server.py`
+3. **Classification Service**: Handles the classification logic in `python/sglang/srt/entrypoints/openai/serving_classify.py`
+## Testing
+Use the provided test script to verify the implementation:
+```bash
+python test_classify_api.py
+```
+## Compatibility
+This implementation is compatible with vLLM's classification API format, allowing seamless migration from vLLM to SGLang for classification tasks.

sglang/docs/supported_models/retrieval_ranking/embedding_models.md ADDED

	@@ -0,0 +1,126 @@

+# Embedding Models
+SGLang provides robust support for embedding models by integrating efficient serving mechanisms with its flexible programming interface. This integration allows for streamlined handling of embedding tasks, facilitating faster and more accurate retrieval and semantic search operations. SGLang's architecture enables better resource utilization and reduced latency in embedding model deployment.
+```{important}
+Embedding models are executed with the `--is-embedding` flag, and some may require `--trust-remote-code`.
+```
+## Quick Start
+### Launch Server
+```shell
+python3 -m sglang.launch_server \
+  --model-path Qwen/Qwen3-Embedding-4B \
+  --is-embedding \
+  --host 0.0.0.0 \
+  --port 30000
+```
+### Client Request
+```python
+import requests
+url = "http://127.0.0.1:30000"
+payload = {
+    "model": "Qwen/Qwen3-Embedding-4B",
+    "input": "What is the capital of France?",
+    "encoding_format": "float"
+}
+response = requests.post(url + "/v1/embeddings", json=payload).json()
+print("Embedding:", response["data"][0]["embedding"])
+```
+## Multimodal Embedding Example
+For multimodal models like GME that support both text and images:
+```shell
+python3 -m sglang.launch_server \
+  --model-path Alibaba-NLP/gme-Qwen2-VL-2B-Instruct \
+  --is-embedding \
+  --chat-template gme-qwen2-vl \
+  --host 0.0.0.0 \
+  --port 30000
+```
+```python
+import requests
+url = "http://127.0.0.1:30000"
+text_input = "Represent this image in embedding space."
+image_path = "https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild/resolve/main/images/023.jpg"
+payload = {
+    "model": "gme-qwen2-vl",
+    "input": [
+        {
+            "text": text_input
+        },
+        {
+            "image": image_path
+        }
+    ],
+}
+response = requests.post(url + "/v1/embeddings", json=payload).json()
+print("Embeddings:", [x.get("embedding") for x in response.get("data", [])])
+```
+## Matryoshka Embedding Example
+[Matryoshka Embeddings](https://sbert.net/examples/sentence_transformer/training/matryoshka/README.html#matryoshka-embeddings) or [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) is a technique used in training embedding models. It allows users to trade off between performance and cost by truncating embeddings to a smaller dimension.
+### 1. Launch a Matryoshka‑capable model
+If the model config already includes `matryoshka_dimensions` or `is_matryoshka` then no override is needed. Otherwise, you can use `--json-model-override-args` as below:
+```shell
+python3 -m sglang.launch_server \
+    --model-path Qwen/Qwen3-Embedding-0.6B \
+    --is-embedding \
+    --host 0.0.0.0 \
+    --port 30000 \
+    --json-model-override-args '{"matryoshka_dimensions": [128, 256, 512, 1024, 1536]}'
+```
+1. Setting `"is_matryoshka": true` allows truncating to any dimension. Otherwise, the server will validate that the specified dimension in the request is one of `matryoshka_dimensions`.
+2. Omitting `dimensions` in a request returns the full vector.
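The two rules above can be sketched as a small validation function. This is an illustration of the described behavior, not the server's code: `resolve_dimension` and `truncate` are hypothetical names, and whether the server renormalizes the truncated vector is not covered here.

```python
def resolve_dimension(requested, full_dim, matryoshka_dimensions=None,
                      is_matryoshka=False):
    """Decide the output size for a request, following the two rules above."""
    if requested is None:
        return full_dim  # rule 2: omitting `dimensions` returns the full vector
    if requested < 1 or requested > full_dim:
        raise ValueError(f"dimensions must be in [1, {full_dim}]")
    if is_matryoshka:
        return requested  # rule 1: any dimension is allowed
    if matryoshka_dimensions and requested in matryoshka_dimensions:
        return requested
    raise ValueError(f"dimensions must be one of {matryoshka_dimensions}")


def truncate(vector, dim):
    """Keep the first `dim` components of an embedding."""
    return vector[:dim]


full = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # toy 8-dimensional embedding
dim = resolve_dimension(4, len(full), matryoshka_dimensions=[2, 4])
short = truncate(full, dim)
print(short)
```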
+### 2. Make requests with different output dimensions
+```python
+import requests
+url = "http://127.0.0.1:30000"
+# Request a truncated (Matryoshka) embedding by specifying a supported dimension.
+payload = {
+    "model": "Qwen/Qwen3-Embedding-0.6B",
+    "input": "Explain diffusion models simply.",
+    "dimensions": 512  # change to 128 / 1024 / omit for full size
+}
+response = requests.post(url + "/v1/embeddings", json=payload).json()
+print("Embedding:", response["data"][0]["embedding"])
+```
+## Supported Models
+| Model Family                               | Example Model                          | Chat Template | Description                                                                 |
+| ------------------------------------------ | -------------------------------------- | ------------- | --------------------------------------------------------------------------- |
+| **E5 (Llama/Mistral based)**              | `intfloat/e5-mistral-7b-instruct`     | N/A           | High-quality text embeddings based on Mistral/Llama architectures          |
+| **GTE-Qwen2**                             | `Alibaba-NLP/gte-Qwen2-7B-instruct`   | N/A           | Alibaba's text embedding model with multilingual support                   |
+| **Qwen3-Embedding**                       | `Qwen/Qwen3-Embedding-4B`             | N/A           | Latest Qwen3-based text embedding model for semantic representation        |
+| **BGE**                                    | `BAAI/bge-large-en-v1.5`              | N/A           | BAAI's text embeddings (requires `attention-backend` triton/torch_native)  |
+| **GME (Multimodal)**                      | `Alibaba-NLP/gme-Qwen2-VL-2B-Instruct`| `gme-qwen2-vl`| Multimodal embedding for text and image cross-modal tasks                  |
+| **CLIP**                                   | `openai/clip-vit-large-patch14-336`   | N/A           | OpenAI's CLIP for image and text embeddings                                |

sglang/docs/supported_models/retrieval_ranking/rerank_models.md ADDED

	@@ -0,0 +1,313 @@

+# Rerank Models
+SGLang offers comprehensive support for rerank models by incorporating optimized serving frameworks with a flexible programming interface. This setup enables efficient processing of cross-encoder reranking tasks, improving the accuracy and relevance of search result ordering. SGLang’s design ensures high throughput and low latency during reranker model deployment, making it ideal for semantic-based result refinement in large-scale retrieval systems.
+```{important}
+Rerank models in SGLang fall into two categories:
+- **Cross-encoder rerank models**: run with `--is-embedding` (embedding runner).
+- **Decoder-only rerank models**: run **without** `--is-embedding` and use next-token logprob scoring (yes/no).
+  - Text-only (e.g. Qwen3-Reranker)
+  - Multimodal (e.g. Qwen3-VL-Reranker): also supports image/video content
+Some models may require `--trust-remote-code`.
+```
+## Supported rerank models
+| Model Family (Rerank)                          | Example HuggingFace Identifier       | Chat Template | Description                                                                                                                      |
+|------------------------------------------------|--------------------------------------|---------------|----------------------------------------------------------------------------------------------------------------------------------|
+| **BGE-Reranker (BgeRerankModel)**              | `BAAI/bge-reranker-v2-m3`            | N/A           | High-performance cross-encoder reranker model from BAAI, suitable for reranking search results based on semantic relevance. Currently supports only the `triton` and `torch_native` values of `attention-backend`.   |
+| **Qwen3-Reranker (decoder-only yes/no)**       | `Qwen/Qwen3-Reranker-8B`             | `examples/chat_template/qwen3_reranker.jinja` | Decoder-only reranker using next-token logprob scoring for labels (yes/no). Launch **without** `--is-embedding`. |
+| **Qwen3-VL-Reranker (multimodal yes/no)**      | `Qwen/Qwen3-VL-Reranker-2B`          | `examples/chat_template/qwen3_vl_reranker.jinja` | Multimodal decoder-only reranker supporting text, images, and videos. Uses yes/no logprob scoring. Launch **without** `--is-embedding`. |
+## Cross-Encoder Rerank (embedding runner)
+### Launch Command
+```shell
+python3 -m sglang.launch_server \
+  --model-path BAAI/bge-reranker-v2-m3 \
+  --host 0.0.0.0 \
+  --disable-radix-cache \
+  --chunked-prefill-size -1 \
+  --attention-backend triton \
+  --is-embedding \
+  --port 30000
+```
+### Example Client Request
+```python
+import requests
+url = "http://127.0.0.1:30000/v1/rerank"
+payload = {
+    "model": "BAAI/bge-reranker-v2-m3",
+    "query": "what is panda?",
+    "documents": [
+        "hi",
+        "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China."
+    ],
+    "top_n": 1,
+    "return_documents": True
+}
+response = requests.post(url, json=payload)
+response_json = response.json()
+for item in response_json:
+    if item.get("document"):
+        print(f"Score: {item['score']:.2f} - Document: '{item['document']}'")
+    else:
+        print(f"Score: {item['score']:.2f} - Index: {item['index']}")
+```
+**Request Parameters:**
+- `query` (required): The query text to rank documents against
+- `documents` (required): List of documents to be ranked
+- `model` (required): Model to use for reranking
+- `top_n` (optional): Maximum number of documents to return. Defaults to returning all documents. If specified value is greater than the total number of documents, all documents will be returned.
+- `return_documents` (optional): Whether to return documents in the response. Defaults to `True`.
+## Qwen3-Reranker (decoder-only yes/no rerank)
+### Launch Command
+```shell
+python3 -m sglang.launch_server \
+  --model-path Qwen/Qwen3-Reranker-0.6B \
+  --trust-remote-code \
+  --disable-radix-cache \
+  --host 0.0.0.0 \
+  --port 8001 \
+  --chat-template examples/chat_template/qwen3_reranker.jinja
+```
+```{note}
+Qwen3-Reranker uses decoder-only logprob scoring (yes/no). Do NOT launch it with `--is-embedding`.
+```
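One common way to turn yes/no next-token logprobs into a relevance score is a two-way softmax. Whether SGLang computes the score exactly this way is an assumption, not something the text above confirms; `yes_no_score` is a hypothetical helper for illustration.

```python
import math


def yes_no_score(logprob_yes, logprob_no):
    """Probability of "yes" after renormalizing over the two labels."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_yes, logprob_no)
    e_yes = math.exp(logprob_yes - m)
    e_no = math.exp(logprob_no - m)
    return e_yes / (e_yes + e_no)


even = yes_no_score(-0.1, -0.1)
confident = yes_no_score(math.log(0.9), math.log(0.1))
print(even, confident)
```

Equal logprobs give a score of 0.5, and the score approaches 1 as "yes" becomes much more likely than "no".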
+### Example Client Request (supports optional instruct, top_n, and return_documents)
+```shell
+curl -X POST http://127.0.0.1:8001/v1/rerank \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen3-Reranker-0.6B",
+    "query": "法国首都是哪里？",
+    "documents": [
+      "法国的首都是巴黎。",
+      "德国的首都是柏林。",
+      "香蕉是黄色的水果。"
+    ],
+    "instruct": "Given a web search query, retrieve relevant passages that answer the query.",
+    "top_n": 2,
+    "return_documents": true
+  }'
+```
+**Request Parameters:**
+- `query` (required): The query text to rank documents against
+- `documents` (required): List of documents to be ranked
+- `model` (required): Model to use for reranking
+- `instruct` (optional): Instruction text for the reranker
+- `top_n` (optional): Maximum number of documents to return. Defaults to returning all documents. If specified value is greater than the total number of documents, all documents will be returned.
+- `return_documents` (optional): Whether to return documents in the response. Defaults to `True`.
+### Response Format
+`/v1/rerank` returns a list of objects (sorted by descending score):
+- `score`: float, higher means more relevant
+- `document`: the original document string (only included when `return_documents` is `true`)
+- `index`: the original index in the input `documents`
+- `meta_info`: optional debug/usage info (may be present for some models)
+The number of returned results is controlled by the `top_n` parameter. If `top_n` is not specified or is greater than the total number of documents, all documents are returned.
+Example (with `return_documents: true`):
+```json
+[
+  {"score": 0.99, "document": "法国的首都是巴黎。", "index": 0},
+  {"score": 0.01, "document": "德国的首都是柏林。", "index": 1},
+  {"score": 0.00, "document": "香蕉是黄色的水果。", "index": 2}
+]
+```
+Example (with `return_documents: false`):
+```json
+[
+  {"score": 0.99, "index": 0},
+  {"score": 0.01, "index": 1},
+  {"score": 0.00, "index": 2}
+]
+```
+Example (with `top_n: 2`):
+```json
+[
+  {"score": 0.99, "document": "法国的首都是巴黎。", "index": 0},
+  {"score": 0.01, "document": "德国的首都是柏林。", "index": 1}
+]
+```
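The response shape above can be consumed with a small client-side helper. The following is a minimal sketch that relies only on the `score`, `index`, and optional `document` fields described in this section; the helper name `attach_documents` is hypothetical and not part of SGLang. It re-attaches the original text by `index`, which is useful when `return_documents` is `false`.

```python
from typing import Any, Dict, List


def attach_documents(
    results: List[Dict[str, Any]], documents: List[str]
) -> List[Dict[str, Any]]:
    """Return results sorted by descending score, each with its document text.

    Hypothetical helper: it assumes only the `score`, `index`, and optional
    `document` fields described in the Response Format section.
    """
    enriched = []
    for item in results:
        # Copy so the caller's parsed response is left untouched.
        row = dict(item)
        # Fall back to the input list when the server omitted the text.
        row.setdefault("document", documents[row["index"]])
        enriched.append(row)
    # The server is documented to sort by score; sorting again keeps the
    # helper safe for responses that were filtered or merged client-side.
    return sorted(enriched, key=lambda row: row["score"], reverse=True)


# Stand-in data shaped like a `return_documents: false` response.
documents = ["doc A", "doc B", "doc C"]
response = [{"score": 0.01, "index": 1}, {"score": 0.99, "index": 0}]
ranked = attach_documents(response, documents)
for row in ranked:
    print(f"{row['score']:.2f} {row['index']} {row['document']}")
```

Because the lookup goes through `index`, the helper works regardless of how many results `top_n` leaves in the response.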
+### Common Pitfalls
+- If you launch Qwen3-Reranker with `--is-embedding`, `/v1/rerank` cannot compute yes/no logprob scores. Relaunch **without** `--is-embedding`.
+- If you see a validation error like "score should be a valid number" and the backend returned a list, upgrade to a version that coerces `embedding[0]` into `score` for rerank responses.
+## Qwen3-VL-Reranker (multimodal decoder-only rerank)
+Qwen3-VL-Reranker extends Qwen3-Reranker to multimodal content, allowing it to rerank documents that contain text, images, and videos.
+### Launch Command
+```shell
+python3 -m sglang.launch_server \
+  --model-path Qwen/Qwen3-VL-Reranker-2B \
+  --trust-remote-code \
+  --disable-radix-cache \
+  --host 0.0.0.0 \
+  --port 30000 \
+  --chat-template examples/chat_template/qwen3_vl_reranker.jinja
+```
+```{note}
+Qwen3-VL-Reranker uses decoder-only logprob scoring (yes/no) like Qwen3-Reranker. Do NOT launch it with `--is-embedding`.
+```
+### Text-Only Reranking (backward compatible)
+```python
+import requests
+url = "http://127.0.0.1:30000/v1/rerank"
+payload = {
+    "model": "Qwen3-VL-Reranker-2B",
+    "query": "What is machine learning?",
+    "documents": [
+        "Machine learning is a branch of artificial intelligence that enables computers to learn from data.",
+        "The weather in Paris is usually mild with occasional rain.",
+        "Deep learning is a subset of machine learning using neural networks with many layers.",
+    ],
+    "instruct": "Retrieve passages that answer the question.",
+    "return_documents": True
+}
+response = requests.post(url, json=payload)
+results = response.json()
+for item in results:
+    print(f"Score: {item['score']:.4f} - {item['document'][:60]}...")
+```
+### Image Reranking (text query, image/mixed documents)
+```python
+import requests
+url = "http://127.0.0.1:30000/v1/rerank"
+payload = {
+    "query": "A woman playing with her dog on a beach at sunset.",
+    "documents": [
+        # Document 1: Text description
+        "A woman shares a joyful moment with her golden retriever on a sun-drenched beach at sunset.",
+        # Document 2: Image URL
+        [
+            {
+                "type": "image_url",
+                "image_url": {
+                    "url": "https://example.com/beach_dog.jpeg"
+                }
+            }
+        ],
+        # Document 3: Text + Image (mixed)
+        [
+            {"type": "text", "text": "A joyful scene at the beach:"},
+            {
+                "type": "image_url",
+                "image_url": {
+                    "url": "https://example.com/beach_dog.jpeg"
+                }
+            }
+        ]
+    ],
+    "instruct": "Retrieve images or text relevant to the user's query.",
+    "return_documents": False
+}
+response = requests.post(url, json=payload)
+results = response.json()
+for item in results:
+    print(f"Index: {item['index']}, Score: {item['score']:.4f}")
+```
+### Multimodal Query Reranking (query with image)
+```python
+import requests
+url = "http://127.0.0.1:30000/v1/rerank"
+payload = {
+    # Query with text and image
+    "query": [
+        {"type": "text", "text": "Find similar images to this:"},
+        {
+            "type": "image_url",
+            "image_url": {
+                "url": "https://example.com/reference_image.jpeg"
+            }
+        }
+    ],
+    "documents": [
+        "A cat sleeping on a couch.",
+        "A woman and her dog enjoying the sunset at the beach.",
+        "A busy city street with cars and pedestrians.",
+        [
+            {
+                "type": "image_url",
+                "image_url": {
+                    "url": "https://example.com/similar_image.jpeg"
+                }
+            }
+        ]
+    ],
+    "instruct": "Find images or descriptions similar to the query image."
+}
+response = requests.post(url, json=payload)
+results = response.json()
+for item in results:
+    print(f"Index: {item['index']}, Score: {item['score']:.4f}")
+```
+### Request Parameters (Multimodal)
+- `query` (required): Can be a string (text-only) or a list of content parts:
+  - `{"type": "text", "text": "..."}` for text
+  - `{"type": "image_url", "image_url": {"url": "..."}}` for images
+  - `{"type": "video_url", "video_url": {"url": "..."}}` for videos
+- `documents` (required): List where each document can be a string or list of content parts (same format as query)
+- `instruct` (optional): Instruction text for the reranker
+- `top_n` (optional): Maximum number of documents to return
+- `return_documents` (optional): Whether to return documents in the response (default: `false`)
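The content-part formats listed above are easy to mistype by hand. As a minimal sketch, the helpers below build them programmatically; all function names (`text_part`, `image_part`, `video_part`, `build_rerank_payload`) are hypothetical conveniences, not part of SGLang. Optional fields are omitted from the payload unless set, so the server's own defaults apply.

```python
import json
from typing import Any, Dict, List, Optional, Union

ContentPart = Dict[str, Any]
Content = Union[str, List[ContentPart]]


def text_part(text: str) -> ContentPart:
    """Build a text content part."""
    return {"type": "text", "text": text}


def image_part(url: str) -> ContentPart:
    """Build an image content part from a URL."""
    return {"type": "image_url", "image_url": {"url": url}}


def video_part(url: str) -> ContentPart:
    """Build a video content part from a URL."""
    return {"type": "video_url", "video_url": {"url": url}}


def build_rerank_payload(
    query: Content,
    documents: List[Content],
    instruct: Optional[str] = None,
    top_n: Optional[int] = None,
    return_documents: Optional[bool] = None,
) -> Dict[str, Any]:
    """Assemble a rerank request body, leaving unset options out."""
    payload: Dict[str, Any] = {"query": query, "documents": documents}
    optional = {
        "instruct": instruct,
        "top_n": top_n,
        "return_documents": return_documents,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


payload = build_rerank_payload(
    query=[
        text_part("Find similar images to this:"),
        image_part("https://example.com/reference_image.jpeg"),
    ],
    documents=[
        "A cat sleeping on a couch.",
        [image_part("https://example.com/similar_image.jpeg")],
    ],
    top_n=1,
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as `json=payload` to `requests.post`, as in the examples earlier in this section.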
+### Common Pitfalls
+- Always use `--chat-template examples/chat_template/qwen3_vl_reranker.jinja` for Qwen3-VL-Reranker.
+- Do NOT launch with `--is-embedding`.
+- For best results, use `--disable-radix-cache` to avoid caching issues with multimodal content.
+- **Note**: Currently only `Qwen3-VL-Reranker-2B` is tested and supported. The 8B model may have different behavior and is not guaranteed to work with this template.

sglang/docs/supported_models/specialized/index.rst ADDED

	@@ -0,0 +1,9 @@

+Specialized Models
+==================
+Models for specialized tasks like reward modeling.
+.. toctree::
+   :maxdepth: 1
+   reward_models.md

sglang/docs/supported_models/specialized/reward_models.md ADDED

	@@ -0,0 +1,28 @@

+# Reward Models
+These models output a scalar reward score or classification result, often used in reinforcement learning or content moderation tasks.
+```{important}
+They are executed with `--is-embedding` and some may require `--trust-remote-code`.
+```
+## Example launch Command
+```shell
+# --model-path: example HF/local path; --tp-size: set for tensor parallelism
+python3 -m sglang.launch_server \
+  --model-path Qwen/Qwen2.5-Math-RM-72B \
+  --is-embedding \
+  --host 0.0.0.0 \
+  --tp-size=4 \
+  --port 30000
+```
+## Supported models
+| Model Family (Reward)                                                     | Example HuggingFace Identifier                              | Description                                                                     |
+|---------------------------------------------------------------------------|-----------------------------------------------------|---------------------------------------------------------------------------------|
+| **Llama (3.1 Reward / `LlamaForSequenceClassification`)**                   | `Skywork/Skywork-Reward-Llama-3.1-8B-v0.2`            | Reward model (preference classifier) based on Llama 3.1 (8B) for scoring and ranking responses for RLHF.  |
+| **Gemma 2 (27B Reward / `Gemma2ForSequenceClassification`)**                | `Skywork/Skywork-Reward-Gemma-2-27B-v0.2`             | Derived from Gemma‑2 (27B), this model provides human preference scoring for RLHF and multilingual tasks.  |
+| **InternLM 2 (Reward / `InternLM2ForRewardModel`)**                        | `internlm/internlm2-7b-reward`                       | InternLM 2 (7B)–based reward model used in alignment pipelines to guide outputs toward preferred behavior.  |
+| **Qwen2.5 (Reward - Math / `Qwen2ForRewardModel`)**                         | `Qwen/Qwen2.5-Math-RM-72B`                           | A 72B math-specialized RLHF reward model from the Qwen2.5 series, tuned for evaluating and refining responses.  |
+| **Qwen2.5 (Reward - Sequence / `Qwen2ForSequenceClassification`)**          | `jason9693/Qwen2.5-1.5B-apeach`                      | A smaller Qwen2.5 variant used for sequence classification, offering an alternative RLHF scoring mechanism.  |

sglang/docs/supported_models/text_generation/diffusion_language_models.md ADDED

	@@ -0,0 +1,111 @@

+# Diffusion Language Models
+Diffusion language models have shown promise for non-autoregressive text generation with parallel decoding capabilities. Unlike auto-regressive language models, different diffusion language models require different decoding strategies.
+## Example Launch Command
+SGLang supports different DLLM algorithms such as `LowConfidence` and `JointThreshold`.
+```shell
+# --model-path: example HF/local path
+# --dllm-algorithm-config: optional; the algorithm's defaults are used if not set
+python3 -m sglang.launch_server \
+  --model-path inclusionAI/LLaDA2.0-mini \
+  --dllm-algorithm LowConfidence \
+  --dllm-algorithm-config ./config.yaml \
+  --host 0.0.0.0 \
+  --port 30000
+```
+## Example Configuration File
+Depending on the algorithm selected, the configuration parameters vary.
+LowConfidence Config:
+```yaml
+# Confidence threshold for accepting predicted tokens
+# - Higher values: More conservative, better quality but slower
+# - Lower values: More aggressive, faster but potentially lower quality
+# Range: 0.0 - 1.0
+threshold: 0.95
+# Default: 32, for LLaDA2MoeModelLM
+block_size: 32
+```
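To illustrate what a confidence threshold means in practice, here is a toy sketch of threshold-based token acceptance within one block. It reflects the general idea described in the config comments and is an assumption, not SGLang's `LowConfidence` implementation; the function `select_tokens_to_unmask` and its fallback rule are hypothetical.

```python
from typing import List


def select_tokens_to_unmask(
    confidences: List[float], masked: List[bool], threshold: float
) -> List[int]:
    """Pick masked positions to commit in one decoding step.

    Positions whose confidence meets the threshold are accepted. If none
    qualify, the single most confident masked position is accepted so that
    every step makes progress (an assumed fallback for this sketch).
    """
    candidates = [i for i, is_masked in enumerate(masked) if is_masked]
    if not candidates:
        return []
    accepted = [i for i in candidates if confidences[i] >= threshold]
    if not accepted:
        accepted = [max(candidates, key=lambda i: confidences[i])]
    return accepted


confidences = [0.99, 0.40, 0.97, 0.80]
masked = [True, True, True, False]  # position 3 is already decoded
relaxed = select_tokens_to_unmask(confidences, masked, threshold=0.95)
strict = select_tokens_to_unmask(confidences, masked, threshold=0.999)
print(relaxed)  # [0, 2]
print(strict)   # [0]
```

A higher threshold commits fewer tokens per step, which matches the "more conservative, better quality but slower" trade-off noted in the config.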
+JointThreshold Config:
+```yaml
+# Decoding threshold for Mask-to-Token (M2T) phase
+# - Higher values: More conservative, better quality but slower
+# - Lower values: More aggressive, faster but potentially lower quality
+# Range: 0.0 - 1.0
+threshold: 0.5
+# Decoding threshold for Token-to-Token (T2T) phase
+# Range: 0.0 - 1.0
+# Setting to 0.0 allows full editing (recommended for most cases).
+edit_threshold: 0.0
+# Max extra T2T steps after all masks are removed. Prevents infinite loops.
+max_post_edit_steps: 16
+# 2-gram repetition penalty (default 0).
+# An empirical value of 3 is often sufficient to mitigate most repetitions.
+penalty_lambda: 0
+```
+## Example Client Code Snippet
+Just like other supported models, diffusion language models can be used via the REST API or Python client.
+Python example using the offline `sgl.Engine` API (this runs the model in-process rather than sending a request to the launched server):
+```python
+import sglang as sgl
+def main():
+    llm = sgl.Engine(model_path="inclusionAI/LLaDA2.0-mini",
+                     dllm_algorithm="LowConfidence",
+                     max_running_requests=1,
+                     trust_remote_code=True)
+    prompts = [
+        "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role> Write a brief introduction of the great wall <|role_end|><role>ASSISTANT</role>"
+    ]
+    sampling_params = {
+        "temperature": 0,
+        "max_new_tokens": 1024,
+    }
+    outputs = llm.generate(prompts, sampling_params)
+    print(outputs)
+if __name__ == '__main__':
+    main()
+```
+Curl example for making a generation request to the launched server:
+```bash
+curl -X POST "http://127.0.0.1:30000/generate" \
+     -H "Content-Type: application/json" \
+     -d '{
+        "text": [
+            "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role> Write the number from 1 to 128 <|role_end|><role>ASSISTANT</role>",
+            "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role> Write a brief introduction of the great wall <|role_end|><role>ASSISTANT</role>"
+        ],
+        "stream": true,
+        "sampling_params": {
+            "temperature": 0,
+            "max_new_tokens": 1024
+        }
+    }'
+```
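With `"stream": true`, the response arrives incrementally. The sketch below parses such a stream in Python, assuming server-sent-event framing where each event line starts with `data:` and the stream ends with `data: [DONE]`, and assuming each event carries a `text` field; verify both against the SGLang version you run. The helper name `iter_stream_events` is hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_stream_events(lines: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    """Yield parsed JSON events from `data:`-prefixed stream lines.

    Assumes SSE-style framing with a `[DONE]` terminator (see lead-in).
    """
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and anything unexpected
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break
        yield json.loads(body)


# Stand-in for the byte lines a streaming HTTP client would produce.
sample_lines = [
    b'data: {"text": "The Great"}',
    b"",
    b'data: {"text": "The Great Wall"}',
    b"data: [DONE]",
]
events = list(iter_stream_events(sample_lines))
print(events[-1]["text"])
```

Against a live server, the `lines` argument could be `requests.post(url, json=payload, stream=True).iter_lines()`, which yields byte strings of the same kind.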
+## Supported Models
+The supported models are summarized in the table below.
+| Model Family               | Example Model                | Description                                                                                          |
+| -------------------------- | ---------------------------- | ---------------------------------------------------------------------------------------------------- |
+| **LLaDA2.0 (mini, flash)** | `inclusionAI/LLaDA2.0-flash` | LLaDA2.0-flash is a diffusion language model featuring a 100B Mixture-of-Experts (MoE) architecture. |
+| **SDAR (JetLM)**           | `JetLM/SDAR-8B-Chat`         | SDAR series diffusion language model (Chat), dense architecture.                                 |
+| **SDAR (JetLM)**           | `JetLM/SDAR-30B-A3B-Chat`    | SDAR series diffusion language model (Chat), MoE architecture.                                   |

sglang/docs/supported_models/text_generation/generative_models.md ADDED

	@@ -0,0 +1,72 @@

+# Large Language Models
+These models accept text input and produce text output (e.g., chat completions). They are primarily large language models (LLMs), some with mixture-of-experts (MoE) architectures for scaling.
+## Example launch Command
+```shell
+# --model-path: example HF/local path
+python3 -m sglang.launch_server \
+  --model-path meta-llama/Llama-3.2-1B-Instruct \
+  --host 0.0.0.0 \
+  --port 30000
+```
+## Supported models
+The supported models are summarized in the table below.
+If you are unsure whether a specific architecture is implemented, you can search for it on GitHub. For example, to search for `Qwen3ForCausalLM`, use the expression:
+```
+repo:sgl-project/sglang path:/^python\/sglang\/srt\/models\// Qwen3ForCausalLM
+```
+in the GitHub search bar.
+| Model Family (Variants)             | Example HuggingFace Identifier                     | Description                                                                            |
+|-------------------------------------|--------------------------------------------------|----------------------------------------------------------------------------------------|
+| **DeepSeek** (v1, v2, v3/R1)        | `deepseek-ai/DeepSeek-R1`                        | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. [SGLang provides DeepSeek V3/R1 model-specific optimizations](../basic_usage/deepseek.md) and [Reasoning Parser](../advanced_features/separate_reasoning.ipynb)|
+| **Kimi K2** (Thinking, Instruct)    | `moonshotai/Kimi-K2-Instruct`                    | Moonshot AI's 1 trillion parameter MoE model (32B active) with 128K–256K context; state-of-the-art agentic intelligence with stable long-horizon agency across 200–300 sequential tool calls. Features MLA attention and native INT4 quantization. [See Reasoning Parser docs](../advanced_features/separate_reasoning.ipynb)|
+| **Kimi Linear** (48B-A3B)           | `moonshotai/Kimi-Linear-48B-A3B-Instruct`        | Moonshot AI's hybrid linear attention model (48B total, 3B active) with 1M token context; features Kimi Delta Attention (KDA) for up to 6× faster decoding and 75% KV cache reduction vs full attention. |
+| **GPT-OSS**       | `openai/gpt-oss-20b`, `openai/gpt-oss-120b`       | OpenAI’s latest GPT-OSS series for complex reasoning, agentic tasks, and versatile developer use cases.|
+| **Qwen** (3.5, 3, 3MoE, 3Next, 2.5, 2 series)       | `Qwen/Qwen3.5-397B-A17B`, `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-30B-A3B`      | Alibaba’s latest Qwen3 series for complex reasoning, language understanding, and generation tasks; Support for MoE variants along with previous generation 2.5, 2, etc. [SGLang provides Qwen3 specific reasoning parser](../advanced_features/separate_reasoning.ipynb)|
+| **Llama** (2, 3.x, 4 series)        | `meta-llama/Llama-4-Scout-17B-16E-Instruct`       | Meta's open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and new Llama 4) with well-recognized performance. [SGLang provides Llama-4 model-specific optimizations](../basic_usage/llama4.md)  |
+| **Mistral** (Mixtral, NeMo, Small3) | `mistralai/Mistral-7B-Instruct-v0.2`             | Open 7B LLM by Mistral AI with strong performance; extended into MoE (“Mixtral”) and NeMo Megatron variants for larger scale. |
+| **Gemma** (v1, v2, v3)              | `google/gemma-3-1b-it`                            | Google’s family of efficient multilingual models (1B–27B); Gemma 3 offers a 128K context window, and its larger (4B+) variants support vision input. |
+| **Phi** (Phi-1.5, Phi-2, Phi-3, Phi-4, Phi-MoE series) | `microsoft/Phi-4-multimodal-instruct`, `microsoft/Phi-3.5-MoE-instruct` | Microsoft’s Phi family of small models (1.3B–5.6B); Phi-4-multimodal (5.6B) processes text, images, and speech, Phi-4-mini is a high-accuracy text model and Phi-3.5-MoE is a mixture-of-experts model. |
+| **MiniCPM** (v3, 4B)               | `openbmb/MiniCPM3-4B`                            | OpenBMB’s series of compact LLMs for edge devices; MiniCPM 3 (4B) achieves GPT-3.5-level results in text tasks. |
+| **OLMo** (2, 3) | `allenai/OLMo-3-1125-32B`, `allenai/OLMo-3-32B-Think`, `allenai/OLMo-2-1124-7B-Instruct` | Allen AI’s series of Open Language Models designed to enable the science of language models. |
+| **OLMoE** (Open MoE)               | `allenai/OLMoE-1B-7B-0924`                       | Allen AI’s open Mixture-of-Experts model (7B total, 1B active parameters) delivering state-of-the-art results with sparse expert activation. |
+| **MiniMax-M2** (M2, M2.1, M2.5)               | `MiniMaxAI/MiniMax-M2.5`, `MiniMaxAI/MiniMax-M2.1`, `MiniMaxAI/MiniMax-M2` | MiniMax's SOTA LLM for coding & agentic workflows. |
+| **StableLM** (3B, 7B)               | `stabilityai/stablelm-tuned-alpha-7b`            | StabilityAI’s early open-source LLM (3B & 7B) for general text generation; a demonstration model with basic instruction-following ability. |
+| **Command-(R,A)** (Cohere)              | `CohereLabs/c4ai-command-r-v01`, `CohereLabs/c4ai-command-r7b-12-2024`, `CohereLabs/c4ai-command-a-03-2025`                 | Cohere’s open conversational LLM (Command series) optimized for long context, retrieval-augmented generation, and tool use. |
+| **DBRX** (Databricks)              | `databricks/dbrx-instruct`                       | Databricks’ 132B-parameter MoE model (36B active) trained on 12T tokens; competes with GPT-3.5 quality as a fully open foundation model. |
+| **Grok** (xAI)                     | `xai-org/grok-1`                                | xAI’s Grok-1 model, known for its large size (314B parameters) and high quality; integrated in SGLang for high-performance inference. |
+| **ChatGLM** (GLM-130B family)       | `THUDM/chatglm2-6b`                              | Zhipu AI’s bilingual chat model (6B) excelling at Chinese-English dialogue; fine-tuned for conversational quality and alignment. |
+| **InternLM 2** (7B, 20B)           | `internlm/internlm2-7b`                          | Next-gen InternLM (7B and 20B) from SenseTime, offering strong reasoning and ultra-long context support (up to 200K tokens). |
+| **ExaONE 3** (Korean-English)      | `LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct`           | LG AI Research’s Korean-English model (7.8B) trained on 8T tokens; provides high-quality bilingual understanding and generation. |
+| **Baichuan 2** (7B, 13B)           | `baichuan-inc/Baichuan2-13B-Chat`                | BaichuanAI’s second-generation Chinese-English LLM (7B/13B) with improved performance and an open commercial license. |
+| **XVERSE** (MoE)                   | `xverse/XVERSE-MoE-A36B`                         | Yuanxiang’s open MoE LLM (XVERSE-MoE-A36B: 255B total, 36B active) supporting ~40 languages; delivers 100B+ dense-level performance via expert routing. |
+| **SmolLM** (135M–1.7B)            | `HuggingFaceTB/SmolLM-1.7B`                      | Hugging Face’s ultra-small LLM series (135M–1.7B params) offering surprisingly strong results, enabling advanced AI on mobile/edge devices. |
+| **GLM-4** (Multilingual 9B)        | `ZhipuAI/glm-4-9b-chat`                          | Zhipu’s GLM-4 series (up to 9B parameters): open multilingual models with support for 1M-token context and a multimodal variant (GLM-4V-9B). |
+| **MiMo** (7B series)               | `XiaomiMiMo/MiMo-7B-RL`                         | Xiaomi's reasoning-optimized model series, leverages Multiple-Token Prediction for faster inference. |
+| **ERNIE-4.5** (4.5, 4.5MoE series) | `baidu/ERNIE-4.5-21B-A3B-PT`                    | Baidu's ERNIE-4.5 series, consisting of MoE models with 47B and 3B active parameters (the largest having 424B total parameters) as well as a 0.3B dense model. |
+| **Arcee AFM-4.5B**               | `arcee-ai/AFM-4.5B-Base`                         | Arcee's foundational model series for real world reliability and edge deployments. |
+| **Persimmon** (8B)               | `adept/persimmon-8b-chat`                         | Adept’s open 8B model with a 16K context window and fast inference; trained for broad usability and licensed under Apache 2.0. |
+| **Solar** (10.7B)               | `upstage/SOLAR-10.7B-Instruct-v1.0`                         | Upstage's 10.7B parameter model, optimized for instruction-following tasks. This architecture incorporates a depth-up scaling methodology, enhancing model performance. |
+| **Tele FLM** (52B-1T)               | `CofeAI/Tele-FLM`                         | BAAI & TeleAI's multilingual model, available in 52-billion and 1-trillion parameter variants. It is a decoder-only transformer trained on ~2T tokens. |
+| **Ling** (16.8B–290B) | `inclusionAI/Ling-lite`, `inclusionAI/Ling-plus` | InclusionAI’s open MoE models. Ling-Lite has 16.8B total / 2.75B active parameters, and Ling-Plus has 290B total / 28.8B active parameters. They are designed for high performance on NLP and complex reasoning tasks. |
+| **Granite 3.0, 3.1** (IBM)               | `ibm-granite/granite-3.1-8b-instruct`                          | IBM's open dense foundation models optimized for reasoning, code, and business AI use cases. Integrated with Red Hat and watsonx systems. |
+| **Granite 3.0 MoE** (IBM)               | `ibm-granite/granite-3.0-3b-a800m-instruct`                          | IBM’s Mixture-of-Experts models offering strong performance with cost-efficiency. MoE expert routing designed for enterprise deployment at scale. |
+| **GPT-J** (6B)                    | `EleutherAI/gpt-j-6b`                             | EleutherAI's GPT-2-like causal language model (6B) trained on the [Pile](https://pile.eleuther.ai/) dataset. |
+| **Orion** (14B)               | `OrionStarAI/Orion-14B-Base`                         | A series of open-source multilingual large language models by OrionStarAI, pretrained on a 2.5T token multilingual corpus including Chinese, English, Japanese, Korean, etc, and it exhibits superior performance in these languages. |
+| **Llama Nemotron Super** (v1, v1.5, NVIDIA) | `nvidia/Llama-3_3-Nemotron-Super-49B-v1`, `nvidia/Llama-3_3-Nemotron-Super-49B-v1_5` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) family of multimodal models provides state-of-the-art reasoning models specifically designed for enterprise-ready AI agents. |
+| **Llama Nemotron Ultra** (v1, NVIDIA) | `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) family of multimodal models provides state-of-the-art reasoning models specifically designed for enterprise-ready AI agents. |
+| **NVIDIA Nemotron Nano 2.0** | `nvidia/NVIDIA-Nemotron-Nano-9B-v2` | The [NVIDIA Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) family of multimodal models provides state-of-the-art reasoning models specifically designed for enterprise-ready AI agents. `Nemotron-Nano-9B-v2` is a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. |
+| **StarCoder2** (3B-15B) | `bigcode/starcoder2-7b` | StarCoder2 is a family of open large language models (LLMs) specialized for code generation and understanding. It is the successor to StarCoder, jointly developed by the BigCode project (a collaboration between Hugging Face, ServiceNow Research, and other contributors). |
+| **Jet-Nemotron** | `jet-ai/Jet-Nemotron-2B` | Jet-Nemotron is a new family of hybrid-architecture language models that surpass state-of-the-art open-source full-attention language models, while achieving significant efficiency gains. |
+| **Trinity** (Nano, Mini) | `arcee-ai/Trinity-Mini` | Arcee's foundational MoE Trinity family of models, open weights under Apache 2.0. |
+| **Falcon-H1** (0.5B–34B) | `tiiuae/Falcon-H1-34B-Instruct` | TII's hybrid Mamba-Transformer architecture combining attention and state-space models for efficient long-context inference. |
+| **Hunyuan-Large** (389B, MoE) | `tencent/Tencent-Hunyuan-Large` | Tencent's open-source MoE model with 389B total / 52B active parameters, featuring Cross-Layer Attention (CLA) for improved efficiency. |
+| **IBM Granite 4.0 (Hybrid, Dense)** | `ibm-granite/granite-4.0-h-micro`, `ibm-granite/granite-4.0-micro` | IBM Granite 4.0 micro models: hybrid Mamba–MoE (`h-micro`) and dense (`micro`) variants. Enterprise-focused reasoning models. |
+| **Sarvam 2** (30B-A2B, 105B-A10B) | `sarvamai/sarvam-2` | Sarvam's Mixture-of-Experts models. The 105B variant uses MLA (Multi-head Latent Attention) and the 30B variant uses GQA, both with 128 routed experts. |

sglang/docs/supported_models/text_generation/index.rst ADDED

	@@ -0,0 +1,11 @@

+Text Generation
+===============
+Models for generating text from text or multimodal inputs.
+.. toctree::
+   :maxdepth: 1
+   generative_models.md
+   multimodal_language_models.md
+   diffusion_language_models.md

sglang/docs/supported_models/text_generation/multimodal_language_models.md ADDED

	@@ -0,0 +1,136 @@

+# Multimodal Language Models
+These models accept multi-modal inputs (e.g., images and text) and generate text output. They augment language models with multimodal encoders.
+## Example launch Command
+```shell
+# --model-path: example HF/local path
+python3 -m sglang.launch_server \
+  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
+  --host 0.0.0.0 \
+  --port 30000
+```
+> See the [OpenAI APIs section](https://docs.sglang.io/basic_usage/openai_api_vision.html) for how to send multimodal requests.
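As a rough illustration of such a request, the sketch below builds an OpenAI-style chat completion body containing one text part and one image part, using only the Python standard library. It assumes the server launched above exposes `/v1/chat/completions` on port 30000; the helper name `build_vision_request`, the `max_tokens` value, and the image URL are hypothetical placeholders.

```python
import json
import urllib.request


def build_vision_request(
    base_url: str, model: str, prompt: str, image_url: str
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request with one image."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 128,  # arbitrary cap chosen for this sketch
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_vision_request(
    "http://127.0.0.1:30000",
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "Describe this image.",
    "https://example.com/beach_dog.jpeg",  # placeholder URL
)
print(request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes the payload easy to inspect before a server is available.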
+## Supported models
+The supported models are summarized in the table below.
+If you are unsure whether a specific architecture is implemented, you can search for it on GitHub. For example, to search for `Qwen2_5_VLForConditionalGeneration`, use the expression:
+```
+repo:sgl-project/sglang path:/^python\/sglang\/srt\/models\// Qwen2_5_VLForConditionalGeneration
+```
+in the GitHub search bar.
+| Model Family (Variants)    | Example HuggingFace Identifier             | Description                                                                                                                                                                                                     | Notes |
+|----------------------------|--------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|
+| **Qwen-VL** | `Qwen/Qwen3-VL-235B-A22B-Instruct`              | Alibaba's vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content.                                                                     |  |
+| **DeepSeek-VL2**           | `deepseek-ai/deepseek-vl2`                 | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs.                                                                        |  |
+| **DeepSeek-OCR / OCR-2**   | `deepseek-ai/DeepSeek-OCR-2`               | OCR-focused DeepSeek models for document understanding and text extraction.                                                                                                                                    | Use `--trust-remote-code`. |
+| **Janus-Pro** (1B, 7B)     | `deepseek-ai/Janus-Pro-7B`                 | DeepSeek's open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks. |  |
+| **MiniCPM-V / MiniCPM-o**  | `openbmb/MiniCPM-V-2_6`                    | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices.                                                 |  |
+| **Llama 3.2 Vision** (11B) | `meta-llama/Llama-3.2-11B-Vision-Instruct` | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks.                                                                                     |  |
+| **LLaVA** (v1.5 & v1.6)    | *e.g.* `liuhaotian/llava-v1.5-13b`         | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts.                                                                               |  |
+| **LLaVA-NeXT** (8B, 72B)   | `lmms-lab/llava-next-72b`                  | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks.                                                       |  |
+| **LLaVA-OneVision**        | `lmms-lab/llava-onevision-qwen2-7b-ov`     | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format.                                                 |  |
+| **Gemma 3 (Multimodal)**   | `google/gemma-3-4b-it`                     | Gemma 3's larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context.                                                                        |  |
+| **Kimi-VL** (A3B)          | `moonshotai/Kimi-VL-A3B-Instruct`          | Kimi-VL is a multimodal model that can understand and generate text from images.                                                                                                                                |  |
+| **Mistral-Small-3.1-24B**  | `mistralai/Mistral-Small-3.1-24B-Instruct-2503` | Mistral Small 3.1 is a multimodal model that generates text from text or image input. It also supports tool calling and structured output. |  |
+| **Phi-4-multimodal-instruct**  | `microsoft/Phi-4-multimodal-instruct` | Phi-4-multimodal-instruct is the multimodal variant of the Phi-4-mini model, enhanced with LoRA for improved multimodal capabilities. It supports text, vision and audio modalities in SGLang. |  |
+| **MiMo-VL** (7B)           | `XiaomiMiMo/MiMo-VL-7B-RL`                 | Xiaomi's compact yet powerful vision-language model featuring a native resolution ViT encoder for fine-grained visual details, an MLP projector for cross-modal alignment, and the MiMo-7B language model optimized for complex reasoning tasks. |  |
+| **GLM-4.5V** (106B) / **GLM-4.1V** (9B)           | `zai-org/GLM-4.5V`                   | GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Use `--chat-template glm-4v` |
+| **GLM-OCR**          | `zai-org/GLM-OCR`                   | GLM-OCR: A fast and accurate general OCR model                                                                   |  |
+| **DotsVLM** (General/OCR)  | `rednote-hilab/dots.vlm1.inst`             | RedNote's vision-language model built on a 1.2B vision encoder and DeepSeek V3 LLM, featuring NaViT vision encoder trained from scratch with dynamic resolution support and enhanced OCR capabilities through structured image data training. |  |
+| **DotsVLM-OCR**            | `rednote-hilab/dots.ocr`                   | Specialized OCR variant of DotsVLM optimized for optical character recognition tasks with enhanced text extraction and document understanding capabilities. | Don't use `--trust-remote-code` |
+| **NVILA** (8B, 15B, Lite-2B, Lite-8B, Lite-15B) | `Efficient-Large-Model/NVILA-8B` | NVILA explores the full stack efficiency of multi-modal design, achieving cheaper training, faster deployment and better performance. | Chat template: `chatml` |
+| **NVIDIA Nemotron Nano 2.0 VL** | `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16` | NVIDIA Nemotron Nano v2 VL enables multi-image reasoning and video understanding, along with strong document intelligence, visual Q&A and summarization capabilities. It builds on Nemotron Nano V2, a hybrid Mamba-Transformer LLM, in order to achieve higher inference throughput in long document and video scenarios. | Use `--trust-remote-code`. You may need to adjust `--max-mamba-cache-size` [default is 512] to fit memory constraints. |
+| **Ernie4.5-VL** | `baidu/ERNIE-4.5-VL-28B-A3B-PT`              | Baidu's vision-language models (28B, 424B). They support image and video comprehension as well as thinking.                                                                     |  |
+| **JetVLM** |  | JetVLM is a vision-language model built upon Jet-Nemotron and designed for high-performance multimodal understanding and generation tasks. | Coming soon |
+| **Step3-VL** (10B) | `stepfun-ai/Step3-VL-10B` | StepFun's lightweight open-source 10B parameter VLM for multimodal intelligence, excelling in visual perception, complex reasoning, and human alignment. |  |
+| **Qwen3-Omni** | `Qwen/Qwen3-Omni-30B-A3B-Instruct` |  Alibaba's omni-modal MoE model. Currently supports the **Thinker** component (multimodal understanding for text, images, audio, and video), while the **Talker** component (audio generation) is not yet supported. |  |
+## Video Input Support
+SGLang supports video input for Vision-Language Models (VLMs), enabling temporal reasoning tasks such as video question answering, captioning, and holistic scene understanding. Video clips are decoded, key frames are sampled, and the resulting tensors are batched together with the text prompt, allowing multimodal inference to integrate visual and linguistic context.
+| Model Family | Example Identifier | Video notes |
+|--------------|--------------------|-------------|
+| **Qwen-VL** (Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3-Omni) | `Qwen/Qwen3-VL-235B-A22B-Instruct` | The processor gathers `video_data`, runs Qwen's frame sampler, and merges the resulting features with text tokens before inference. |
+| **GLM-4v** (4.5V, 4.1V, MOE) | `zai-org/GLM-4.5V` | Video clips are read with Decord, converted to tensors, and passed to the model alongside metadata for rotary-position handling. |
+| **NVILA** (Full & Lite) | `Efficient-Large-Model/NVILA-8B` | The runtime samples eight frames per clip and attaches them to the multimodal request when `video_data` is present. |
+| **LLaVA video variants** (LLaVA-NeXT-Video, LLaVA-OneVision) | `lmms-lab/LLaVA-NeXT-Video-7B` | The processor routes video prompts to the LlavaVid video-enabled architecture, and the provided example shows how to query it with `sgl.video(...)` clips. |
+| **NVIDIA Nemotron Nano 2.0 VL** | `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16` | The processor samples at 2 FPS, at a max of 128 frames, as per model training. The model uses [EVS](../../python/sglang/srt/multimodal/evs/README.md), a pruning method that removes redundant tokens from video embeddings. By default `video_pruning_rate=0.7`. Change this by providing: `--json-model-override-args '{"video_pruning_rate": 0.0}'` to disable EVS, for example. |
+| **JetVLM** |  | The runtime samples eight frames per clip and attaches them to the multimodal request when `video_data` is present. |
+Use `sgl.video(path, num_frames)` when building prompts to attach clips from your SGLang programs.
+Example OpenAI-compatible request that sends a video clip:
+```python
+import requests
+url = "http://localhost:30000/v1/chat/completions"
+data = {
+    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
+    "messages": [
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "What’s happening in this video?"},
+                {
+                    "type": "video_url",
+                    "video_url": {
+                        "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4"
+                    },
+                },
+            ],
+        }
+    ],
+    "max_tokens": 300,
+}
+response = requests.post(url, json=data)
+print(response.text)
+```
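The frame-sampling notes in the table above (eight frames per clip for NVILA, 2 FPS capped at 128 frames for Nemotron Nano) can be illustrated with a small sketch. This is not SGLang's sampler: `uniform_frame_indices` and `fps_capped_frame_count` are hypothetical helpers, and the midpoint rule used to spread frames across a clip is an assumption chosen for illustration.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick up to `num_frames` indices spread evenly across a clip (hypothetical helper)."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    # Never ask for more frames than the clip contains.
    num_frames = min(num_frames, total_frames)
    # Split the clip into equal-width segments and take each segment's midpoint.
    return [int((i + 0.5) * total_frames / num_frames) for i in range(num_frames)]


def fps_capped_frame_count(
    duration_s: float, fps: float = 2.0, max_frames: int = 128
) -> int:
    """Frames obtained when sampling at `fps`, capped at `max_frames` (hypothetical helper)."""
    return max(0, min(max_frames, int(duration_s * fps)))


if __name__ == "__main__":
    # An 80-frame clip sampled down to eight frames.
    print(uniform_frame_indices(total_frames=80, num_frames=8))
    # A 30 s clip stays under the cap; a 120 s clip hits it.
    print(fps_capped_frame_count(30.0), fps_capped_frame_count(120.0))
```

With a cap in place, clips longer than `max_frames / fps` seconds (64 s for the values above) all produce the same number of frames, so longer videos are sampled more sparsely.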
+## Usage Notes
+### Performance Optimization
+For multimodal models, you can use the `--keep-mm-feature-on-device` flag to optimize for latency at the cost of increased GPU memory usage:
+- **Default behavior**: Multimodal feature tensors are moved to CPU after processing to save GPU memory
+- **With `--keep-mm-feature-on-device`**: Feature tensors remain on GPU, reducing device-to-host copy overhead and improving latency, but consuming more GPU memory
+Use this flag when you have sufficient GPU memory and want to minimize latency for multimodal inference.
+### Multimodal Inputs Limitation
+- **Use `--mm-process-config '{"image":{"max_pixels":1048576},"video":{"fps":3,"max_pixels":602112,"max_frames":60}}'`** to set limits on `image`, `video`, and `audio` inputs.
+Setting these limits can reduce GPU memory usage, improve inference speed, and help avoid OOM errors, but it may degrade model quality, so choose values that suit your use case. Currently, only `qwen_vl` supports this config. See the [qwen_vl processor](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/multimodal/processors/qwen_vl.py) for the meaning of each parameter.
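Hand-writing the JSON value for `--mm-process-config` is easy to get wrong because of the nested quoting. A minimal sketch, assuming you launch the server from a Python script, builds the value with `json.dumps` and lets `shlex.join` handle the shell quoting. The keys and numbers are the ones from the flag example above; the model path is only an illustrative choice.

```python
import json
import shlex

# Limits copied from the flag example above.
mm_process_config = {
    "image": {"max_pixels": 1048576},  # 1024 * 1024
    "video": {"fps": 3, "max_pixels": 602112, "max_frames": 60},
}

# Compact separators reproduce the single-line form used on the command line.
config_json = json.dumps(mm_process_config, separators=(",", ":"))

launch_cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "Qwen/Qwen3-VL-30B-A3B-Instruct",  # illustrative model choice
    "--mm-process-config", config_json,
]

if __name__ == "__main__":
    # shlex.join quotes the JSON so it can be pasted into a shell as one argument.
    print(shlex.join(launch_cmd))
```

Passing `launch_cmd` as a list to `subprocess.run` avoids shell quoting entirely; the printed form is only for copy-and-paste use.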
+### Bidirectional Attention in Multimodal Model Serving
+**Note for serving the Gemma-3 multimodal model**:
+As mentioned in [Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM](https://huggingface.co/blog/gemma3#multimodality), Gemma-3 employs bidirectional attention between image tokens during the prefill phase. Currently, SGLang only supports bidirectional attention when using the Triton Attention Backend. Note, however, that SGLang's current bidirectional attention implementation is incompatible with both CUDA Graph and Chunked Prefill.
+To enable bidirectional attention, you can use the `TritonAttnBackend` while disabling CUDA Graph and Chunked Prefill. Example launch command:
+```shell
+# Triton attention backend, with CUDA Graph and Chunked Prefill disabled.
+# (Comments cannot follow a line-continuation backslash, so they are kept up here.)
+python -m sglang.launch_server \
+  --model-path google/gemma-3-4b-it \
+  --host 0.0.0.0 --port 30000 \
+  --enable-multimodal \
+  --dtype bfloat16 --triton-attention-reduce-in-fp32 \
+  --attention-backend triton \
+  --disable-cuda-graph \
+  --chunked-prefill-size -1
+```
+If you need higher serving performance and can accept some accuracy loss, you can use other attention backends and enable features such as CUDA Graph and Chunked Prefill. Note that the model then falls back to causal attention instead of bidirectional attention.
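To make the difference between the two modes concrete, the sketch below builds a boolean attention mask for a short token sequence. It is an illustration only, not SGLang's Triton kernel: `build_attention_mask` is a hypothetical helper, and it assumes that bidirectional attention applies only among tokens of the same contiguous image span while everything else stays causal.

```python
from typing import List


def build_attention_mask(is_image: List[bool], bidirectional: bool) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j."""
    n = len(is_image)

    # Give each contiguous run of image tokens its own span id; text tokens get -1.
    span_id = [-1] * n
    current = -1
    for i, flag in enumerate(is_image):
        if flag:
            if i == 0 or not is_image[i - 1]:
                current += 1
            span_id[i] = current

    # Start from a plain causal mask: each token sees itself and earlier tokens.
    mask = [[j <= i for j in range(n)] for i in range(n)]

    if bidirectional:
        # Additionally let image tokens see later tokens of the same image span.
        for i in range(n):
            for j in range(i + 1, n):
                if span_id[i] != -1 and span_id[i] == span_id[j]:
                    mask[i][j] = True
    return mask


if __name__ == "__main__":
    tokens = [False, True, True, False, True]  # text, image, image, text, image
    for row in build_attention_mask(tokens, bidirectional=True):
        print("".join("x" if allowed else "." for allowed in row))
```

With `bidirectional=False` the result is the ordinary causal mask, which is the fallback behavior described above for other backends.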

sglang/python/sglang/srt/__pycache__/constants.cpython-311.pyc ADDED Viewed

Binary file (348 Bytes). View file

sglang/python/sglang/srt/__pycache__/environ.cpython-311.pyc ADDED Viewed

Binary file (35.1 kB). View file

sglang/python/sglang/srt/batch_overlap/__pycache__/operations.cpython-311.pyc ADDED Viewed

Binary file (12.4 kB). View file

sglang/python/sglang/srt/batch_overlap/__pycache__/operations_strategy.cpython-311.pyc ADDED Viewed

Binary file (12.4 kB). View file

sglang/python/sglang/srt/batch_overlap/__pycache__/single_batch_overlap.cpython-311.pyc ADDED Viewed

Binary file (5.8 kB). View file

sglang/python/sglang/srt/batch_overlap/__pycache__/two_batch_overlap.cpython-311.pyc ADDED Viewed

Binary file (42.2 kB). View file

sglang/python/sglang/srt/batch_overlap/operations.py ADDED Viewed

	@@ -0,0 +1,213 @@

+from __future__ import annotations
+import os
+from contextlib import contextmanager
+from dataclasses import dataclass
+from typing import TYPE_CHECKING, Any, Callable, Dict, Generator, List, Sequence, Union
+import torch
+from sglang.srt.layers.dp_attention import set_dp_buffer_len
+if TYPE_CHECKING:
+    from sglang.srt.model_executor.forward_batch_info import ForwardBatch
+_ENABLE_PROFILE = bool(int(os.environ.get("SGLANG_OPERATIONS_ENABLE_PROFILE", "0")))
+if _ENABLE_PROFILE:
+    import nvtx
+def execute_operations(inputs, operations):
+    stages = _convert_operations_to_stages(operations)
+    executor = _StageExecutor("primary", stages, inputs=inputs)
+    for _ in range(executor.num_stages):
+        executor.next()
+    assert executor.done
+    return executor.output
+def execute_overlapped_operations(
+    inputs_arr: Sequence,
+    operations_arr: Sequence,
+    delta_stages: Sequence[int],
+) -> Sequence:
+    # Make it explicit for clarity; if we need multi-batch overlap, this can be generalized
+    inputs_a, inputs_b = inputs_arr
+    operations_a, operations_b = operations_arr
+    delta_stage_a, delta_stage_b = delta_stages
+    assert delta_stage_a == 0
+    delta_stage = delta_stage_b
+    stages_a = _convert_operations_to_stages(operations_a)
+    stages_b = _convert_operations_to_stages(operations_b)
+    executor_a = _StageExecutor("a", stages_a, inputs=inputs_a)
+    executor_b = _StageExecutor("b", stages_b, inputs=inputs_b)
+    for _ in range(delta_stage):
+        executor_a.next()
+    for _ in range(executor_a.num_stages - delta_stage):
+        executor_a.next()
+        executor_b.next()
+    for _ in range(delta_stage):
+        executor_b.next()
+    assert executor_a.done and executor_b.done
+    return [executor_a.output, executor_b.output]
+class YieldOperation:
+    pass
+@dataclass
+class ExecutionOperation:
+    debug_name: str
+    fn: Callable
+Operation = Union[YieldOperation, ExecutionOperation, Callable]
+Stage = List[ExecutionOperation]
+class _StageExecutor:
+    def __init__(self, debug_name: str, stages: List[Stage], inputs: dict):
+        self._debug_name = debug_name
+        self._stages = stages
+        self._index = 0
+        self._stage_state = _StateDict()
+        self._stage_output = inputs
+        # handling DP attention
+        forward_batch: ForwardBatch = inputs["forward_batch"]
+        self._global_dp_buffer_len = forward_batch.global_dp_buffer_len
+        self._local_dp_buffer_len = forward_batch.tbo_padded_len
+        self._global_num_tokens = forward_batch.global_num_tokens_cpu
+        self._is_dp_max_padding = forward_batch.dp_padding_mode.is_max_len()
+    def next(self):
+        assert not self.done
+        stage = self._stages[self._index]
+        # TODO: We currently always call set_dp_buffer_len here because sub-batches
+        # may have different padded lengths. It can likely be removed after TBO slice &
+        # pad logic is refactored.
+        set_dp_buffer_len(
+            self._global_dp_buffer_len,
+            self._local_dp_buffer_len,
+            self._is_dp_max_padding,
+            self._global_num_tokens,
+        )
+        with _annotate_region(debug_name=f"{self._debug_name}{self._index}"):
+            for op in stage:
+                with _annotate_region(debug_name=op.debug_name):
+                    self._stage_output = op.fn(
+                        state=self._stage_state,
+                        **(
+                            self._stage_output if self._stage_output is not None else {}
+                        ),
+                    )
+        self._index += 1
+    @property
+    def output(self):
+        assert self.done
+        return self._stage_output
+    @property
+    def done(self):
+        return self._index >= self.num_stages
+    @property
+    def num_stages(self):
+        return len(self._stages)
+@contextmanager
+def _annotate_region(debug_name):
+    if _ENABLE_PROFILE:
+        with torch.autograd.profiler.record_function(debug_name):
+            with nvtx.annotate(debug_name):
+                yield
+    else:
+        yield
+class _StateDict:
+    def __init__(self):
+        self._data = {}
+    def __setattr__(self, key, value):
+        if key == "_data":
+            super().__setattr__(key, value)
+            return
+        assert (
+            key not in self._data
+        ), f"`{key}` already exists, are you sure you want to override it?"
+        self._data[key] = value
+    def __getattr__(self, item):
+        return self._data[item]
+    def __delattr__(self, item):
+        del self._data[item]
+    def pop(self, item):
+        return self._data.pop(item)
+    def update(self, values: Dict[str, Any]):
+        for k, v in values.items():
+            setattr(self, k, v)
+    def get(self, item):
+        return self._data.get(item)
+    def clear(self, expect_keys: Sequence[str]):
+        if set(self._data.keys()) != set(expect_keys):
+            raise Exception(
+                f"Unexpected keys when clearing. This may indicate you do not release memory early enough but leave it until here. {list(self._data.keys())=} {expect_keys=}"
+            )
+        self._data.clear()
+def _convert_operations_to_stages(operations: List[Operation]) -> List[Stage]:
+    operations = _decorate_operations(operations)
+    operation_chunks = list(
+        _chunk_by_separator(operations, lambda op: isinstance(op, YieldOperation))
+    )
+    assert all(len(chunk) > 0 for chunk in operation_chunks)
+    return operation_chunks
+def _chunk_by_separator(
+    items: List[Any], is_separator: Callable[[Any], bool]
+) -> Generator[List[Any], None, None]:
+    pending_items = []
+    for item in items:
+        if is_separator(item):
+            yield pending_items
+            pending_items = []
+        else:
+            pending_items.append(item)
+    if len(pending_items) > 0:
+        yield pending_items
+def _decorate_operations(operations: List[Operation], debug_name_prefix: str = ""):
+    return [_decorate_operation(op, debug_name_prefix) for op in operations]
+def _decorate_operation(operation: Operation, debug_name_prefix: str):
+    if isinstance(operation, YieldOperation):
+        return operation
+    return ExecutionOperation(
+        debug_name=debug_name_prefix
+        + getattr(operation, "__name__", "unknown").replace("op_", ""),
+        fn=operation,
+    )
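The interleaving performed by `execute_overlapped_operations` in `operations.py` above can be hard to picture from the three loops alone. The following sketch reproduces only the ordering logic and returns the sequence of stage names. `overlapped_schedule` is a hypothetical helper, and it assumes both executors have the same number of stages, which the original relies on when it asserts that both are done.

```python
from typing import List


def overlapped_schedule(num_stages: int, delta_stage: int) -> List[str]:
    """Return the order in which stages of executors "a" and "b" would run."""
    assert 0 <= delta_stage <= num_stages
    order: List[str] = []
    a = b = 0

    # Warm-up: executor a runs alone for `delta_stage` stages.
    for _ in range(delta_stage):
        order.append(f"a{a}")
        a += 1

    # Steady state: a and b alternate, with b lagging `delta_stage` stages behind.
    for _ in range(num_stages - delta_stage):
        order.append(f"a{a}")
        a += 1
        order.append(f"b{b}")
        b += 1

    # Drain: executor b finishes its remaining stages alone.
    for _ in range(delta_stage):
        order.append(f"b{b}")
        b += 1

    return order


if __name__ == "__main__":
    print(overlapped_schedule(num_stages=4, delta_stage=2))
    print(overlapped_schedule(num_stages=3, delta_stage=0))
```

The labels match the `debug_name` pattern used by `_StageExecutor.next` (`"a0"`, `"b0"`, ...), so the output can be compared against profiler region names when `SGLANG_OPERATIONS_ENABLE_PROFILE` is set.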

sglang/python/sglang/srt/batch_overlap/operations_strategy.py ADDED Viewed

	@@ -0,0 +1,302 @@

+from dataclasses import dataclass
+from typing import List, Optional
+import torch
+from sglang.srt.batch_overlap import operations
+from sglang.srt.batch_overlap.operations import Operation
+from sglang.srt.layers.moe.token_dispatcher import DeepEPConfig
+from sglang.srt.model_executor.forward_batch_info import ForwardMode
+from sglang.srt.utils import is_hip
+_is_hip = is_hip()
+@dataclass
+class OperationsStrategy:
+    operations: List[Operation]
+    deep_gemm_num_sms: Optional[int] = None
+    tbo_delta_stages: Optional[int] = None
+    @classmethod
+    def concat(cls, items: List["OperationsStrategy"]) -> "OperationsStrategy":
+        return OperationsStrategy(
+            operations=[x for item in items for x in item.operations],
+            deep_gemm_num_sms=_assert_all_same(
+                [item.deep_gemm_num_sms for item in items]
+            ),
+            tbo_delta_stages=_assert_all_same(
+                [item.tbo_delta_stages for item in items]
+            ),
+        )
+    @staticmethod
+    def init_new_tbo(
+        layers: torch.nn.ModuleList,
+        forward_mode: ForwardMode,
+    ) -> "OperationsStrategy":
+        layer_name = layers[0].__class__.__name__
+        if layer_name == "DeepseekV2DecoderLayer":
+            return OperationsStrategy.concat(
+                [
+                    _compute_moe_deepseek_layer_operations_strategy_tbo(
+                        layer, forward_mode
+                    )
+                    for layer in layers
+                ]
+            )
+        elif layer_name == "Qwen3MoeDecoderLayer":
+            return OperationsStrategy.concat(
+                [
+                    _compute_moe_qwen3_layer_operations_strategy_tbo(
+                        layer, forward_mode
+                    )
+                    for layer in layers
+                ]
+            )
+        elif layer_name == "MiMoV2DecoderLayer":
+            return OperationsStrategy.concat(
+                [
+                    _compute_moe_mimov2_layer_operations_strategy_tbo(
+                        layer, forward_mode
+                    )
+                    for layer in layers
+                ]
+            )
+        else:
+            raise NotImplementedError
+def _assert_all_same(items: List):
+    assert all(item == items[0] for item in items)
+    return items[0]
+# -------------------------------- Strategy for DeepSeek ---------------------------------------
+# TODO can refactor to make it more fancy if we have more complex strategies
+def _compute_moe_deepseek_layer_operations_strategy_tbo(
+    layer: torch.nn.Module,
+    forward_mode: ForwardMode,
+) -> OperationsStrategy:
+    assert layer.is_layer_sparse, "dense layer TBO not yet implemented"
+    if forward_mode == ForwardMode.EXTEND:
+        return _compute_moe_deepseek_blog_prefill(layer)
+    elif (
+        forward_mode == ForwardMode.DECODE or forward_mode == ForwardMode.TARGET_VERIFY
+    ):
+        return _compute_moe_deepseek_blog_decode(layer)
+    else:
+        raise NotImplementedError(f"Unsupported {forward_mode=}")
+def _compute_moe_deepseek_blog_prefill(layer):
+    device_properties = torch.cuda.get_device_properties(device="cuda")
+    total_num_sms = device_properties.multi_processor_count
+    deep_gemm_num_sms = None
+    if not _is_hip:
+        deep_gemm_num_sms = total_num_sms - DeepEPConfig.get_instance().num_sms
+    return OperationsStrategy(
+        deep_gemm_num_sms=deep_gemm_num_sms,
+        tbo_delta_stages=0,
+        operations=[
+            layer.op_comm_prepare_attn,
+            layer.self_attn.op_prepare,
+            layer.self_attn.op_core,
+            layer.op_comm_prepare_mlp,
+            layer.mlp.op_gate,
+            layer.mlp.op_select_experts,
+            layer.mlp.op_dispatch_a,
+            operations.YieldOperation(),
+            layer.mlp.op_dispatch_b,
+            layer.mlp.op_experts,
+            layer.mlp.op_combine_a,
+            operations.YieldOperation(),
+            layer.mlp.op_shared_experts,
+            layer.mlp.op_combine_b,
+            layer.mlp.op_output,
+            layer.op_comm_postprocess_layer,
+        ],
+    )
+def _compute_moe_deepseek_blog_decode(layer):
+    return OperationsStrategy(
+        deep_gemm_num_sms=None,
+        tbo_delta_stages=2,
+        operations=[
+            layer.op_comm_prepare_attn,
+            layer.self_attn.op_prepare,
+            operations.YieldOperation(),
+            layer.self_attn.op_core,
+            layer.op_comm_prepare_mlp,
+            layer.mlp.op_gate,
+            layer.mlp.op_select_experts,
+            operations.YieldOperation(),
+            layer.mlp.op_dispatch_a,
+            layer.mlp.op_shared_experts,
+            operations.YieldOperation(),
+            layer.mlp.op_dispatch_b,
+            layer.mlp.op_experts,
+            layer.mlp.op_combine_a,
+            operations.YieldOperation(),
+            layer.mlp.op_combine_b,
+            operations.YieldOperation(),
+            layer.mlp.op_output,
+            layer.op_comm_postprocess_layer,
+        ],
+    )
+# -------------------------------- Strategy for Qwen3 ---------------------------------------
+# TODO: unstable, current strategy is almost the same as DeepSeek, keep redundant code here for
+# convenience to adjust strategy
+def _compute_moe_qwen3_layer_operations_strategy_tbo(
+    layer: torch.nn.Module,
+    forward_mode: ForwardMode,
+) -> OperationsStrategy:
+    assert layer.is_layer_sparse, "qwen3 moe only supports sparse layers"
+    if forward_mode == ForwardMode.EXTEND:
+        return _compute_moe_qwen3_prefill(layer)
+    elif (
+        forward_mode == ForwardMode.DECODE or forward_mode == ForwardMode.TARGET_VERIFY
+    ):
+        return _compute_moe_qwen3_decode(layer)
+    else:
+        raise NotImplementedError(f"Unsupported {forward_mode=}")
+def _compute_moe_qwen3_prefill(layer):
+    device_properties = torch.cuda.get_device_properties(device="cuda")
+    total_num_sms = device_properties.multi_processor_count
+    deep_gemm_num_sms = None
+    if not _is_hip:
+        deep_gemm_num_sms = total_num_sms - DeepEPConfig.get_instance().num_sms
+    return OperationsStrategy(
+        deep_gemm_num_sms=deep_gemm_num_sms,
+        tbo_delta_stages=0,
+        operations=[
+            layer.op_comm_prepare_attn,
+            layer.self_attn.op_prepare,
+            layer.self_attn.op_core,
+            layer.op_comm_prepare_mlp,
+            layer.mlp.op_gate,
+            layer.mlp.op_select_experts,
+            layer.mlp.op_dispatch_a,
+            operations.YieldOperation(),
+            layer.mlp.op_dispatch_b,
+            layer.mlp.op_experts,
+            layer.mlp.op_combine_a,
+            operations.YieldOperation(),
+            layer.mlp.op_combine_b,
+            layer.mlp.op_output,
+            layer.op_comm_postprocess_layer,
+        ],
+    )
+def _compute_moe_qwen3_decode(layer):
+    return OperationsStrategy(
+        deep_gemm_num_sms=None,
+        tbo_delta_stages=2,
+        operations=[
+            layer.op_comm_prepare_attn,
+            layer.self_attn.op_prepare,
+            operations.YieldOperation(),
+            layer.self_attn.op_core,
+            layer.op_comm_prepare_mlp,
+            layer.mlp.op_gate,
+            layer.mlp.op_select_experts,
+            operations.YieldOperation(),
+            layer.mlp.op_dispatch_a,
+            operations.YieldOperation(),
+            layer.mlp.op_dispatch_b,
+            layer.mlp.op_experts,
+            layer.mlp.op_combine_a,
+            operations.YieldOperation(),
+            layer.mlp.op_combine_b,
+            layer.mlp.op_output,
+            layer.op_comm_postprocess_layer,
+            operations.YieldOperation(),
+        ],
+    )
+# -------------------------------- Strategy for MiMoV2DecoderLayer ---------------------------------------
+# TODO: unstable; current strategy matches DeepSeek for the common operations (MiMoV2 has no op_shared_experts),
+# so we keep this redundant code here for convenience when adjusting the strategy
+def _compute_moe_mimov2_layer_operations_strategy_tbo(
+    layer: torch.nn.Module,
+    forward_mode: ForwardMode,
+) -> OperationsStrategy:
+    assert layer.is_layer_sparse, "MiMoV2DecoderLayer moe only supports sparse layers"
+    if forward_mode == ForwardMode.EXTEND:
+        return _compute_moe_mimov2_prefill(layer)
+    elif (
+        forward_mode == ForwardMode.DECODE or forward_mode == ForwardMode.TARGET_VERIFY
+    ):
+        return _compute_moe_mimov2_decode(layer)
+    else:
+        raise NotImplementedError(f"Unsupported {forward_mode=}")
+def _compute_moe_mimov2_prefill(layer):
+    device_properties = torch.cuda.get_device_properties(device="cuda")
+    total_num_sms = device_properties.multi_processor_count
+    deep_gemm_num_sms = total_num_sms - DeepEPConfig.get_instance().num_sms
+    return OperationsStrategy(
+        deep_gemm_num_sms=deep_gemm_num_sms,
+        tbo_delta_stages=0,
+        operations=[
+            layer.op_comm_prepare_attn,
+            layer.self_attn.op_prepare,
+            layer.self_attn.op_core,
+            layer.op_comm_prepare_mlp,
+            layer.mlp.op_gate,
+            layer.mlp.op_select_experts,
+            layer.mlp.op_dispatch_a,
+            operations.YieldOperation(),
+            layer.mlp.op_dispatch_b,
+            layer.mlp.op_experts,
+            layer.mlp.op_combine_a,
+            operations.YieldOperation(),
+            layer.mlp.op_combine_b,
+            layer.mlp.op_output,
+            layer.op_comm_postprocess_layer,
+        ],
+    )
+def _compute_moe_mimov2_decode(layer):
+    return OperationsStrategy(
+        deep_gemm_num_sms=None,
+        tbo_delta_stages=2,
+        operations=[
+            layer.op_comm_prepare_attn,
+            layer.self_attn.op_prepare,
+            operations.YieldOperation(),
+            layer.self_attn.op_core,
+            layer.op_comm_prepare_mlp,
+            layer.mlp.op_gate,
+            layer.mlp.op_select_experts,
+            operations.YieldOperation(),
+            layer.mlp.op_dispatch_a,
+            operations.YieldOperation(),
+            layer.mlp.op_dispatch_b,
+            layer.mlp.op_experts,
+            layer.mlp.op_combine_a,
+            operations.YieldOperation(),
+            layer.mlp.op_combine_b,
+            layer.mlp.op_output,
+            layer.op_comm_postprocess_layer,
+            operations.YieldOperation(),
+        ],
+    )
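The decode strategies above differ in where their `YieldOperation` separators fall, which changes how many stages each layer contributes once `OperationsStrategy.concat` flattens the per-layer lists. The sketch below re-implements the chunking rule from `_chunk_by_separator` on plain strings to count stages. The operation names are shortened labels rather than the real bound methods, and `chunk_into_stages` is a hypothetical stand-in.

```python
from typing import List

YIELD = "<yield>"  # stand-in for operations.YieldOperation()


def chunk_into_stages(ops: List[str]) -> List[List[str]]:
    """Split an operation list into stages at each yield marker."""
    stages: List[List[str]] = []
    pending: List[str] = []
    for op in ops:
        if op == YIELD:
            stages.append(pending)
            pending = []
        else:
            pending.append(op)
    # A trailing yield leaves nothing pending, so no extra stage is emitted.
    if pending:
        stages.append(pending)
    return stages


# DeepSeek decode: the list ends with operations, not with a yield.
deepseek_decode = [
    "comm_prepare_attn", "attn_prepare", YIELD,
    "attn_core", "comm_prepare_mlp", "gate", "select_experts", YIELD,
    "dispatch_a", "shared_experts", YIELD,
    "dispatch_b", "experts", "combine_a", YIELD,
    "combine_b", YIELD,
    "output", "comm_postprocess_layer",
]

# Qwen3 / MiMoV2 decode: the list ends with a yield.
qwen3_decode = [
    "comm_prepare_attn", "attn_prepare", YIELD,
    "attn_core", "comm_prepare_mlp", "gate", "select_experts", YIELD,
    "dispatch_a", YIELD,
    "dispatch_b", "experts", "combine_a", YIELD,
    "combine_b", "output", "comm_postprocess_layer", YIELD,
]

if __name__ == "__main__":
    for name, ops in [("deepseek", deepseek_decode), ("qwen3", qwen3_decode)]:
        one = len(chunk_into_stages(ops))
        two = len(chunk_into_stages(ops * 2))  # two layers concatenated
        print(f"{name}: {one} stages for one layer, {two} for two layers")
```

Because the DeepSeek list has no yield at the layer boundary, the tail of one layer and the head of the next share a stage after concatenation, whereas the trailing yield in the Qwen3 and MiMoV2 lists keeps every layer's stages separate.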

sglang/python/sglang/srt/batch_overlap/single_batch_overlap.py ADDED Viewed

	@@ -0,0 +1,144 @@

+# Copyright 2025 SGLang Team
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+from __future__ import annotations
+from dataclasses import dataclass
+from typing import Optional
+import torch
+from sglang.srt.environ import envs
+from sglang.srt.layers.moe import get_moe_runner_backend
+from sglang.srt.layers.moe.utils import is_sbo_enabled
+from sglang.srt.utils import is_blackwell
+class SboFlags:
+    # TODO may have: "enable_dispatch_gateup_gemm_two_stream_overlap", ...
+    @classmethod
+    def enable_combine_down_gemm_two_stream_overlap(cls):
+        return (
+            is_sbo_enabled()
+            # currently supported by the cutedsl backend, and by deep_gemm on non-Blackwell GPUs
+            and (
+                get_moe_runner_backend().is_flashinfer_cutedsl()
+                or (get_moe_runner_backend().is_deep_gemm() and not is_blackwell())
+            )
+        )
+    @classmethod
+    def enable_combine_shared_two_stream_overlap(cls):
+        return (
+            is_sbo_enabled()
+            and not cls.enable_dispatch_shared_one_stream_overlap()
+            and not envs.SGLANG_BLACKWELL_OVERLAP_SHARED_EXPERTS_OUTSIDE_SBO.get()
+        )
+    @classmethod
+    def enable_dispatch_shared_one_stream_overlap(cls):
+        return is_sbo_enabled() and not is_blackwell()
+    @classmethod
+    def fuse_shared_experts_inside_sbo(cls):
+        return (
+            cls.enable_combine_shared_two_stream_overlap()
+            or cls.enable_dispatch_shared_one_stream_overlap()
+        )
+@dataclass
+class CombineOverlapArgs:
+    # this "overlap" flag means overlapping with down gemm, not the general two-stream overlap
+    overlap: bool
+    stream: torch.cuda.Stream
+    wait_event: torch.cuda.Event
+    num_sms: Optional[int] = None
+    signal: Optional[torch.Tensor] = None
+    block_m: Optional[int] = 64
+    threshold: Optional[int] = 0
+@dataclass
+class DownGemmOverlapArgs:
+    num_sms: int
+    signal: torch.Tensor
+    start_event: torch.cuda.Event
+def compute_overlap_args(dispatch_output, alt_stream):
+    if not (
+        SboFlags.enable_combine_down_gemm_two_stream_overlap()
+        or SboFlags.enable_combine_shared_two_stream_overlap()
+    ):
+        return None, None, {}
+    hidden_states = dispatch_output.hidden_states
+    num_local_experts, num_tokens_static, hidden_dim = hidden_states.shape
+    total_num_sms = torch.cuda.get_device_properties(
+        device="cuda"
+    ).multi_processor_count
+    if envs.SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS.is_set():
+        communicate_num_sms = envs.SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS.get()
+    else:
+        communicate_num_sms = 32 if is_blackwell() else 3
+    compute_num_sms = total_num_sms - communicate_num_sms
+    assert alt_stream is not None
+    combine_wait_event = torch.cuda.Event()
+    combine_overlap_args = CombineOverlapArgs(
+        overlap=False,
+        num_sms=communicate_num_sms,
+        stream=alt_stream,
+        wait_event=combine_wait_event,
+    )
+    meta_overlap_args = dict(
+        compute_num_sms=compute_num_sms,
+    )
+    down_gemm_overlap_args = None
+    if SboFlags.enable_combine_down_gemm_two_stream_overlap():
+        # TODO use zero_allocator to remove this `torch.zeros` call
+        # NOTE ours v2 use uint32 not int32 currently
+        if is_blackwell():
+            combine_signal = torch.zeros(
+                num_local_experts, dtype=torch.uint32, device=hidden_states.device
+            )
+        else:
+            MIN_BLOCK_M = 64
+            combine_signal_size = num_local_experts * (
+                (num_tokens_static + MIN_BLOCK_M - 1) // MIN_BLOCK_M
+            )
+            combine_signal = torch.zeros(
+                combine_signal_size, dtype=torch.int32, device=hidden_states.device
+            )
+        down_gemm_overlap_args = DownGemmOverlapArgs(
+            signal=combine_signal,
+            start_event=combine_wait_event,
+            num_sms=compute_num_sms,
+        )
+        combine_overlap_args.overlap = True
+        combine_overlap_args.signal = combine_signal
+        combine_overlap_args.threshold = compute_num_sms
+    else:
+        meta_overlap_args |= dict(
+            record_event_after_down=combine_wait_event,
+        )
+    return combine_overlap_args, down_gemm_overlap_args, meta_overlap_args
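The sizing arithmetic inside `compute_overlap_args` above can be checked without a GPU. The sketch below mirrors the two calculations, the length of the combine signal buffer and the split of SMs between communication and compute, using plain integers. `combine_signal_size` and `split_sms` are hypothetical helpers, the `override` parameter stands in for the `SGLANG_DEEPEP_LL_COMBINE_SEND_NUM_SMS` environment setting, and the SM totals in the usage example are illustrative values only.

```python
from typing import Optional, Tuple

MIN_BLOCK_M = 64  # same constant as in the non-Blackwell branch above


def combine_signal_size(
    num_local_experts: int, num_tokens_static: int, blackwell: bool
) -> int:
    """Number of elements in the combine signal tensor."""
    if blackwell:
        # One signal slot per local expert.
        return num_local_experts
    # One slot per (expert, block of MIN_BLOCK_M tokens), using ceiling division.
    blocks_per_expert = (num_tokens_static + MIN_BLOCK_M - 1) // MIN_BLOCK_M
    return num_local_experts * blocks_per_expert


def split_sms(
    total_num_sms: int, blackwell: bool, override: Optional[int] = None
) -> Tuple[int, int]:
    """Return (compute_num_sms, communicate_num_sms)."""
    if override is not None:
        communicate = override
    else:
        communicate = 32 if blackwell else 3
    return total_num_sms - communicate, communicate


if __name__ == "__main__":
    # 8 local experts, 130 tokens per expert slot -> 3 blocks of 64 each.
    print(combine_signal_size(8, 130, blackwell=False))
    print(split_sms(total_num_sms=132, blackwell=False))
    print(split_sms(total_num_sms=148, blackwell=True))
```

The ceiling division is the detail most worth noting: a partially filled block of tokens still needs its own signal slot, so 130 tokens occupy three blocks rather than two.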

sglang/python/sglang/srt/batch_overlap/two_batch_overlap.py ADDED Viewed

	@@ -0,0 +1,1082 @@

+from __future__ import annotations
+import copy
+import dataclasses
+import logging
+from dataclasses import replace
+from typing import TYPE_CHECKING, Dict, List, Optional, Sequence
+import torch
+from sglang.srt.batch_overlap.operations import (
+    execute_operations,
+    execute_overlapped_operations,
+)
+from sglang.srt.batch_overlap.operations_strategy import OperationsStrategy
+from sglang.srt.layers import deep_gemm_wrapper
+from sglang.srt.layers.attention.base_attn_backend import AttentionBackend
+from sglang.srt.layers.communicator import (
+    CommunicateContext,
+    CommunicateSummableTensorPairFn,
+    ScatterMode,
+)
+from sglang.srt.layers.dp_attention import get_attention_tp_size
+from sglang.srt.layers.moe import (
+    get_deepep_mode,
+    get_moe_a2a_backend,
+    get_tbo_token_distribution_threshold,
+    is_tbo_enabled,
+)
+from sglang.srt.layers.moe.token_dispatcher import (
+    DeepEPDispatcher,
+    MooncakeEPDispatcher,
+    MoriEPDispatcher,
+)
+from sglang.srt.layers.moe.token_dispatcher.base import BaseDispatcher
+from sglang.srt.managers.schedule_batch import ScheduleBatch
+from sglang.srt.model_executor.forward_batch_info import (
+    ForwardBatch,
+    ForwardMode,
+    compute_position,
+)
+from sglang.srt.server_args import get_global_server_args
+from sglang.srt.speculative.spec_info import SpecInput
+from sglang.srt.utils import BumpAllocator, empty_context, get_bool_env_var, is_hip
+if TYPE_CHECKING:
+    from sglang.srt.batch_overlap.single_batch_overlap import CombineOverlapArgs
+    from sglang.srt.layers.moe.token_dispatcher import DispatchOutput
+    from sglang.srt.speculative.eagle_info import EagleVerifyInput
+_is_hip = is_hip()
+_tbo_debug = get_bool_env_var("SGLANG_TBO_DEBUG")
+logger = logging.getLogger(__name__)
+# -------------------------------- Compute Basic Info ---------------------------------------
+def get_token_num_per_seq(
+    forward_mode: ForwardMode,
+    spec_info: Optional[SpecInput] = None,
+):
+    if forward_mode.is_target_verify():
+        return spec_info.draft_token_num
+    elif forward_mode.is_decode():
+        return 1
+    elif forward_mode.is_idle():
+        return 0
+    else:
+        # For extend, we should not use `token_num_per_seq`.
+        return None
+# TODO: consider disabling TBO when the batch is too small, because splitting a small batch can slow it down
+def compute_split_seq_index(
+    forward_mode: ForwardMode,
+    num_tokens: int,
+    extend_lens: Optional[Sequence[int]],
+    token_num_per_seq: Optional[int],
+) -> Optional[int]:
+    if forward_mode == ForwardMode.EXTEND:
+        assert extend_lens is not None
+        return _split_extend_seqs(extend_lens)
+    elif forward_mode.is_target_verify() or forward_mode.is_decode():
+        assert token_num_per_seq is not None
+        return (num_tokens // token_num_per_seq) // 2
+    elif forward_mode.is_idle() or forward_mode.is_prebuilt():
+        assert num_tokens == 0
+        return 0
+    else:
+        raise NotImplementedError()
+def _is_two_chunk_split_enabled(extend_lens: Sequence[int]) -> bool:
+    if extend_lens is None:
+        return False
+    vanilla_split_seq_index = _split_array_by_balanced_sum(extend_lens)
+    left_sum = sum(extend_lens[:vanilla_split_seq_index])
+    overall_sum = sum(extend_lens)
+    threshold = get_tbo_token_distribution_threshold()
+    assert threshold <= 0.5, f"{threshold=}"
+    return left_sum < overall_sum * threshold or left_sum > overall_sum * (
+        1 - threshold
+    )
+def _split_extend_seqs(arr: Sequence[int]) -> int:
+    if _is_two_chunk_split_enabled(arr):
+        return _split_array_by_cum_less_than_half(arr)
+    return _split_array_by_balanced_sum(arr)
+def _split_array_by_cum_less_than_half(arr: Sequence[int]) -> int:
+    left_sum = 0
+    overall_sum = sum(arr)
+    half_sum = overall_sum // 2
+    chosen_index = 0
+    for i in range(len(arr)):
+        left_sum += arr[i]
+        if left_sum > half_sum:
+            chosen_index = i
+            break
+    return chosen_index
+def _split_array_by_balanced_sum(arr: Sequence[int]) -> int:
+    overall_sum = sum(arr)
+    left_sum = 0
+    min_diff = float("inf")
+    best_index = 0
+    for i in range(1, len(arr)):
+        left_sum += arr[i - 1]
+        right_sum = overall_sum - left_sum
+        diff = abs(left_sum - right_sum)
+        if diff <= min_diff:
+            min_diff = diff
+            best_index = i
+        else:
+            break
+    return best_index
+def _update_device_and_sum_field_from_cpu_field(
+    batch: ForwardBatch,
+    cpu_field: str,
+    device_field: str,
+    sum_field: Optional[str] = None,
+):
+    cpu_value = getattr(batch, cpu_field, None)
+    old_device_value = getattr(batch, device_field, None)
+    if (
+        cpu_value is None
+        or old_device_value is None
+        or not isinstance(cpu_value, (torch.Tensor, list))
+    ):
+        return
+    new_device_value = (
+        cpu_value
+        if isinstance(cpu_value, torch.Tensor)
+        else torch.tensor(cpu_value, dtype=old_device_value.dtype)
+    ).to(device=get_global_server_args().device, non_blocking=True)
+    setattr(batch, device_field, new_device_value)
+    if sum_field is not None:
+        sum_value = (
+            cpu_value.sum().item()
+            if isinstance(cpu_value, torch.Tensor)
+            else sum(cpu_value)
+        )
+        setattr(batch, sum_field, sum_value)
+def _compute_mask_offset(seq_index: int, spec_info: Optional[EagleVerifyInput]) -> int:
+    if seq_index == 0:
+        return 0
+    offset = 0
+    max_seq_len = min(seq_index, spec_info.seq_lens_cpu.shape[0])
+    for i in range(max_seq_len):
+        offset += (
+            spec_info.seq_lens_cpu[i] + spec_info.draft_token_num
+        ) * spec_info.draft_token_num
+    return offset
+def split_spec_info(
+    spec_info: Optional[EagleVerifyInput],
+    start_seq_index: int,
+    end_seq_index: int,
+    start_token_index: int,
+    end_token_index: int,
+):
+    if spec_info is None:
+        return None
+    if spec_info.draft_token is not None:
+        draft_token = spec_info.draft_token[start_token_index:end_token_index]
+    else:
+        draft_token = None
+    if spec_info.custom_mask is not None and spec_info.draft_token is not None:
+        custom_mask_start = _compute_mask_offset(start_seq_index, spec_info)
+        if end_seq_index == spec_info.seq_lens_cpu.shape[0]:
+            custom_mask_end = spec_info.custom_mask.shape[0]
+        else:
+            custom_mask_end = _compute_mask_offset(end_seq_index, spec_info)
+        if custom_mask_end > custom_mask_start:
+            custom_mask = spec_info.custom_mask[custom_mask_start:custom_mask_end]
+        else:
+            custom_mask = spec_info.custom_mask
+    else:
+        custom_mask = spec_info.custom_mask
+    if spec_info.positions is not None:
+        positions = spec_info.positions[start_token_index:end_token_index]
+    else:
+        positions = None
+    if spec_info.retrive_index is not None:
+        retrive_index = spec_info.retrive_index[start_seq_index:end_seq_index]
+    else:
+        retrive_index = None
+    if spec_info.retrive_next_token is not None:
+        retrive_next_token = spec_info.retrive_next_token[start_seq_index:end_seq_index]
+    else:
+        retrive_next_token = None
+    if spec_info.retrive_next_sibling is not None:
+        retrive_next_sibling = spec_info.retrive_next_sibling[
+            start_seq_index:end_seq_index
+        ]
+    else:
+        retrive_next_sibling = None
+    if spec_info.retrive_cum_len is not None:
+        retrive_cum_len = spec_info.retrive_cum_len[start_seq_index:end_seq_index]
+    else:
+        retrive_cum_len = None
+    if spec_info.seq_lens_cpu is not None:
+        seq_lens_cpu = spec_info.seq_lens_cpu[start_seq_index:end_seq_index]
+    else:
+        seq_lens_cpu = None
+    if seq_lens_cpu is not None:
+        seq_lens_sum = seq_lens_cpu.sum()
+    else:
+        seq_lens_sum = None
+    output_spec_info = replace(
+        spec_info,
+        custom_mask=custom_mask,
+        draft_token=draft_token,
+        positions=positions,
+        retrive_index=retrive_index,
+        retrive_next_token=retrive_next_token,
+        retrive_next_sibling=retrive_next_sibling,
+        retrive_cum_len=retrive_cum_len,
+        seq_lens_cpu=seq_lens_cpu,
+        seq_lens_sum=seq_lens_sum,
+    )
+    return output_spec_info
+def compute_split_token_index(
+    split_seq_index: int,
+    forward_mode: ForwardMode,
+    extend_seq_lens: Optional[Sequence[int]],
+    token_num_per_seq: Optional[int],
+) -> int:
+    if forward_mode == ForwardMode.EXTEND:
+        assert extend_seq_lens is not None
+        if _is_two_chunk_split_enabled(extend_seq_lens):
+            return sum(extend_seq_lens) // 2
+        return sum(extend_seq_lens[:split_seq_index])
+    elif forward_mode.is_target_verify() or forward_mode.is_decode():
+        assert token_num_per_seq is not None
+        return split_seq_index * token_num_per_seq
+    elif forward_mode.is_idle():
+        assert split_seq_index == 0
+        return 0
+    else:
+        raise NotImplementedError
+def compute_split_indices_for_cuda_graph_replay(
+    forward_mode: ForwardMode,
+    cuda_graph_num_tokens: int,
+    spec_info: Optional[SpecInput],
+):
+    forward_mode_for_tbo_split = (
+        forward_mode if forward_mode != ForwardMode.IDLE else ForwardMode.DECODE
+    )
+    token_num_per_seq = get_token_num_per_seq(
+        forward_mode=forward_mode, spec_info=spec_info
+    )
+    tbo_split_seq_index = compute_split_seq_index(
+        forward_mode=forward_mode_for_tbo_split,
+        num_tokens=cuda_graph_num_tokens,
+        extend_lens=None,
+        token_num_per_seq=token_num_per_seq,
+    )
+    tbo_split_token_index = compute_split_token_index(
+        split_seq_index=tbo_split_seq_index,
+        forward_mode=forward_mode_for_tbo_split,
+        extend_seq_lens=None,
+        token_num_per_seq=token_num_per_seq,
+    )
+    return tbo_split_seq_index, tbo_split_token_index
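
Note on the split logic in this section: for decode and target-verify batches the split is simply half the sequence count, while for extend batches the split index is chosen to balance the token sums of the two halves, with a fallback to a mid-sequence "two chunk" split when the balanced split is still too lopsided. The sketch below re-implements that arithmetic on plain Python lists so it can be run in isolation. The function names are illustrative, and `threshold` is passed in explicitly as an assumption, whereas the commit reads it from `get_tbo_token_distribution_threshold()`.

```python
from typing import Sequence, Tuple


def split_by_balanced_sum(lens: Sequence[int]) -> int:
    """Pick i in 1..len(lens)-1 minimising |sum(lens[:i]) - sum(lens[i:])|.

    Returns 0 when there are fewer than two sequences.
    """
    total = sum(lens)
    left = 0
    best_index, min_diff = 0, float("inf")
    for i in range(1, len(lens)):
        left += lens[i - 1]
        diff = abs(left - (total - left))
        if diff <= min_diff:
            best_index, min_diff = i, diff
        else:
            break  # for non-negative lengths the difference only grows from here
    return best_index


def is_unbalanced(lens: Sequence[int], threshold: float) -> bool:
    """True when even the balanced split leaves one side below `threshold` of the tokens."""
    idx = split_by_balanced_sum(lens)
    left, total = sum(lens[:idx]), sum(lens)
    return left < total * threshold or left > total * (1 - threshold)


def decode_split(num_tokens: int, token_num_per_seq: int) -> Tuple[int, int]:
    """Halve a decode/verify batch by sequence count; return (seq_index, token_index)."""
    seq_index = (num_tokens // token_num_per_seq) // 2
    return seq_index, seq_index * token_num_per_seq
```

With extend lengths `[10, 1, 1]` the balanced split lands after the first sequence (10 tokens versus 2), which `is_unbalanced` flags for any threshold of about 0.17 or more; in that case the commit instead cuts at `sum(lens) // 2` tokens, splitting one sequence across both sub-batches.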
+# -------------------------------- Preparation ---------------------------------------
+class TboCudaGraphRunnerPlugin:
+    def __init__(self):
+        self._tbo_children_num_token_non_padded = torch.zeros((2,), dtype=torch.int32)
+    def capture_one_batch_size(self, batch: ForwardBatch, num_tokens: int):
+        if not is_tbo_enabled():
+            return
+        token_num_per_seq = get_token_num_per_seq(
+            forward_mode=batch.forward_mode, spec_info=batch.spec_info
+        )
+        batch.tbo_split_seq_index = compute_split_seq_index(
+            forward_mode=batch.forward_mode,
+            num_tokens=num_tokens,
+            extend_lens=None,
+            token_num_per_seq=token_num_per_seq,
+        )
+        # For simplicity, when two_batch_overlap is enabled, we only capture CUDA Graph for tbo=true
+        assert batch.tbo_split_seq_index is not None, f"{num_tokens=}"
+        self._tbo_children_num_token_non_padded[...] = (
+            TboForwardBatchPreparer.compute_tbo_children_num_token_non_padded(batch)
+        )
+        TboForwardBatchPreparer.prepare_raw(
+            batch,
+            tbo_children_num_token_non_padded=self._tbo_children_num_token_non_padded,
+        )
+    def replay_prepare(
+        self,
+        forward_mode: ForwardMode,
+        bs: int,
+        num_token_non_padded: int,
+        spec_info: Optional[SpecInput],
+    ):
+        token_num_per_seq = get_token_num_per_seq(
+            forward_mode=forward_mode, spec_info=spec_info
+        )
+        tbo_split_seq_index, tbo_split_token_index = (
+            compute_split_indices_for_cuda_graph_replay(
+                forward_mode=forward_mode,
+                cuda_graph_num_tokens=bs * token_num_per_seq,
+                spec_info=spec_info,
+            )
+        )
+        self._tbo_children_num_token_non_padded[...] = (
+            TboForwardBatchPreparer.compute_tbo_children_num_token_non_padded_raw(
+                tbo_split_token_index=tbo_split_token_index,
+                num_token_non_padded=num_token_non_padded,
+            )
+        )
+class TboDPAttentionPreparer:
+    def prepare_all_gather(
+        self,
+        local_batch: ScheduleBatch,
+    ):
+        deepep_mode = get_deepep_mode()
+        enable_a2a_moe = not get_moe_a2a_backend().is_none()
+        enable_two_batch_overlap = is_tbo_enabled()
+        self.enable_two_batch_overlap = enable_two_batch_overlap
+        if local_batch is not None:
+            token_num_per_seq = get_token_num_per_seq(
+                forward_mode=local_batch.forward_mode, spec_info=local_batch.spec_info
+            )
+            if (
+                local_batch.forward_mode.is_target_verify()
+                or local_batch.forward_mode.is_decode()
+            ):
+                num_tokens = local_batch.batch_size() * token_num_per_seq
+            elif local_batch.forward_mode.is_prebuilt():
+                num_tokens = 0
+            else:
+                num_tokens = local_batch.extend_num_tokens
+            self.local_tbo_split_seq_index = compute_split_seq_index(
+                forward_mode=local_batch.forward_mode,
+                num_tokens=num_tokens,
+                extend_lens=local_batch.extend_lens,
+                token_num_per_seq=token_num_per_seq,
+            )
+            resolved_deepep_mode = deepep_mode.resolve(local_batch.is_extend_in_batch)
+            local_can_run_tbo = (self.local_tbo_split_seq_index is not None) and not (
+                (
+                    local_batch.forward_mode.is_extend()
+                    and not local_batch.forward_mode.is_target_verify()
+                )
+                and enable_a2a_moe
+                and (resolved_deepep_mode.is_low_latency())
+            )
+        else:
+            self.local_tbo_split_seq_index = 0
+            local_can_run_tbo = True
+        local_forward_mode = self._compute_local_forward_mode(local_batch)
+        return local_can_run_tbo, local_forward_mode
+    def compute_output(self, partial_global_info):
+        # Perform only one Device-to-Host (D2H) memory copy
+        cpu_data = partial_global_info[:, :2].cpu()
+        local_can_run_tbo_aggregated = min(cpu_data[:, 0].tolist())
+        forward_modes = cpu_data[:, 1].tolist()
+        global_forward_mode, forward_mode_agree = self._compute_global_forward_mode(
+            forward_modes
+        )
+        can_run_tbo = (
+            self.enable_two_batch_overlap
+            and local_can_run_tbo_aggregated
+            and forward_mode_agree
+        )
+        tbo_split_seq_index = self.local_tbo_split_seq_index if can_run_tbo else None
+        global_forward_mode = global_forward_mode if can_run_tbo else None
+        return tbo_split_seq_index, global_forward_mode
+    @staticmethod
+    def _compute_local_forward_mode(local_batch):
+        return (
+            local_batch.forward_mode if local_batch is not None else ForwardMode.IDLE
+        ).value
+    @staticmethod
+    def _compute_global_forward_mode(forward_modes):
+        forward_modes_excluding_idle_and_prebuilt = [
+            x
+            for x in forward_modes
+            if x != ForwardMode.IDLE.value and x != ForwardMode.PREBUILT.value
+        ]
+        if not forward_modes_excluding_idle_and_prebuilt:
+            return ForwardMode.IDLE, False
+        forward_mode_agree = TboDPAttentionPreparer._is_all_same(
+            forward_modes_excluding_idle_and_prebuilt
+        )
+        global_forward_mode = (
+            ForwardMode(forward_modes_excluding_idle_and_prebuilt[0])
+            if forward_mode_agree
+            else None
+        )
+        return global_forward_mode, forward_mode_agree
+    @staticmethod
+    def _is_all_same(x):
+        return all(value == x[0] for value in x)
+class TboForwardBatchPreparer:
+    @classmethod
+    def prepare(cls, batch: ForwardBatch, is_draft_worker: bool = False):
+        if batch.tbo_split_seq_index is None or is_draft_worker:
+            return
+        tbo_children_num_token_non_padded = (
+            cls.compute_tbo_children_num_token_non_padded(batch)
+        )
+        cls.prepare_raw(
+            batch, tbo_children_num_token_non_padded=tbo_children_num_token_non_padded
+        )
+    @classmethod
+    def prepare_raw(
+        cls, batch: ForwardBatch, tbo_children_num_token_non_padded: torch.Tensor
+    ):
+        from sglang.srt.layers.attention.tbo_backend import TboAttnBackend
+        tbo_split_token_index = cls._compute_split_token_index(batch)
+        is_enable_two_chunk = (
+            batch.forward_mode == ForwardMode.EXTEND
+            and _is_two_chunk_split_enabled(batch.extend_seq_lens_cpu)
+        )
+        if _tbo_debug:
+            logger.info(
+                f"TboForwardBatchPreparer.prepare "
+                f"is_enable_two_chunk={is_enable_two_chunk} "
+                f"tbo_split_seq_index={batch.tbo_split_seq_index} "
+                f"tbo_split_token_index={tbo_split_token_index} "
+                f"extend_seq_lens={batch.extend_seq_lens_cpu} "
+                f"bs={batch.batch_size} "
+                f"forward_mode={batch.forward_mode}"
+            )
+        assert isinstance(batch.attn_backend, TboAttnBackend)
+        attn_backend_child_a, attn_backend_child_b = batch.attn_backend.children
+        [out_num_token_non_padded_a, out_num_token_non_padded_b] = (
+            tbo_children_num_token_non_padded
+        )
+        child_a = cls.filter_batch(
+            batch,
+            start_token_index=0,
+            end_token_index=tbo_split_token_index,
+            start_seq_index=0,
+            end_seq_index=(
+                batch.tbo_split_seq_index + 1
+                if is_enable_two_chunk
+                else batch.tbo_split_seq_index
+            ),
+            output_attn_backend=attn_backend_child_a,
+            out_num_token_non_padded=out_num_token_non_padded_a,
+        )
+        child_b = cls.filter_batch(
+            batch,
+            start_token_index=tbo_split_token_index,
+            end_token_index=batch.input_ids.shape[0],
+            start_seq_index=batch.tbo_split_seq_index,
+            end_seq_index=batch.batch_size,
+            output_attn_backend=attn_backend_child_b,
+            out_num_token_non_padded=out_num_token_non_padded_b,
+        )
+        if is_enable_two_chunk:
+            cls.derive_fields_related_to_seq_len_for_two_chunk(
+                batch,
+                child_a=child_a,
+                child_b=child_b,
+                tbo_split_seq_index=batch.tbo_split_seq_index,
+            )
+        assert batch.tbo_children is None
+        batch.tbo_children = [child_a, child_b]
+    @classmethod
+    def derive_fields_related_to_seq_len_for_two_chunk(
+        cls,
+        batch: ForwardBatch,
+        *,
+        child_a: ForwardBatch,
+        child_b: ForwardBatch,
+        tbo_split_seq_index: int,
+    ):
+        extend_seq_lens_cpu = batch.extend_seq_lens_cpu
+        overall_seq_lens_sum = sum(extend_seq_lens_cpu)
+        half_seq_lens_sum = overall_seq_lens_sum // 2
+        left_last_seq_token_num = half_seq_lens_sum - sum(
+            extend_seq_lens_cpu[:tbo_split_seq_index]
+        )
+        right_first_seq_token_num = (
+            extend_seq_lens_cpu[tbo_split_seq_index] - left_last_seq_token_num
+        )
+        # Deep-copy before mutating, to be extra safe against aliasing with the parent batch
+        child_a.extend_seq_lens_cpu = copy.deepcopy(child_a.extend_seq_lens_cpu)
+        child_a.extend_seq_lens_cpu[-1] = left_last_seq_token_num
+        child_b.extend_seq_lens_cpu = copy.deepcopy(child_b.extend_seq_lens_cpu)
+        child_b.extend_seq_lens_cpu[0] = right_first_seq_token_num
+        for child in [child_a, child_b]:
+            _update_device_and_sum_field_from_cpu_field(
+                batch=child,
+                cpu_field="extend_seq_lens_cpu",
+                device_field="extend_seq_lens",
+                sum_field="extend_num_tokens",
+            )
+        assert (
+            child_a.extend_num_tokens == half_seq_lens_sum
+        ), f"{child_a.extend_num_tokens=}, {half_seq_lens_sum=}"
+        child_a.seq_lens_cpu = copy.deepcopy(child_a.seq_lens_cpu)
+        child_a.seq_lens_cpu[-1] = (
+            child_a.extend_seq_lens_cpu[-1] + child_a.extend_prefix_lens_cpu[-1]
+        )
+        _update_device_and_sum_field_from_cpu_field(
+            batch=child_a,
+            cpu_field="seq_lens_cpu",
+            device_field="seq_lens",
+            sum_field="seq_lens_sum",
+        )
+        child_b.extend_prefix_lens_cpu = copy.deepcopy(child_b.extend_prefix_lens_cpu)
+        child_b.extend_prefix_lens_cpu[0] += left_last_seq_token_num
+        _update_device_and_sum_field_from_cpu_field(
+            batch=child_b,
+            cpu_field="extend_prefix_lens_cpu",
+            device_field="extend_prefix_lens",
+            sum_field=None,
+        )
+        _, child_b.extend_start_loc = compute_position(
+            get_global_server_args().attention_backend,
+            child_b.extend_prefix_lens,
+            child_b.extend_seq_lens,
+            child_b.extend_num_tokens,
+        )
+    @classmethod
+    def filter_batch(
+        cls,
+        batch: ForwardBatch,
+        *,
+        start_token_index: int,
+        end_token_index: int,
+        start_seq_index: int,
+        end_seq_index: int,
+        output_attn_backend: AttentionBackend,
+        out_num_token_non_padded: torch.Tensor,
+    ):
+        assert (
+            end_token_index >= start_token_index
+        ), f"{end_token_index=}, {start_token_index=}, batch={batch}"
+        num_tokens = batch.input_ids.shape[0]
+        num_seqs = batch.batch_size
+        output_dict = dict()
+        for key in [
+            "input_ids",
+            "positions",
+            "out_cache_loc",
+        ]:
+            old_value = getattr(batch, key)
+            assert (
+                old_value.shape[0] == num_tokens
+            ), f"{key=} {old_value=} {num_tokens=} {batch=}"
+            output_dict[key] = old_value[start_token_index:end_token_index]
+        attention_tp_size = get_attention_tp_size()
+        output_dict["tbo_padded_len"] = (
+            (end_token_index - start_token_index - 1) // attention_tp_size + 1
+        ) * attention_tp_size
+        for key in [
+            "req_pool_indices",
+            "seq_lens",
+            "seq_lens_cpu",
+            "extend_seq_lens",
+            "extend_prefix_lens",
+            "extend_start_loc",
+            "extend_prefix_lens_cpu",
+            "extend_seq_lens_cpu",
+            "extend_logprob_start_lens_cpu",
+            "lora_ids",
+            "rids",
+        ]:
+            old_value = getattr(batch, key)
+            if old_value is None:
+                continue
+            elif batch.forward_mode.is_target_verify() and (
+                key == "extend_seq_lens"
+                or key == "extend_prefix_lens"
+                or key == "extend_start_loc"
+                or key == "extend_prefix_lens_cpu"
+                or key == "extend_seq_lens_cpu"
+                or key == "extend_logprob_start_lens_cpu"
+            ):
+                output_dict[key] = None
+                continue
+            assert (
+                len(old_value) == num_seqs
+            ), f"{key=} {old_value=} {num_seqs=} {batch=}"
+            output_dict[key] = old_value[start_seq_index:end_seq_index]
+        spec_info = getattr(batch, "spec_info")
+        output_spec_info = split_spec_info(
+            spec_info=spec_info,
+            start_token_index=start_token_index,
+            end_token_index=end_token_index,
+            start_seq_index=start_seq_index,
+            end_seq_index=end_seq_index,
+        )
+        output_dict["spec_info"] = output_spec_info
+        for key in [
+            "forward_mode",
+            "is_extend_in_batch",
+            "all_extend_in_batch",
+            "return_logprob",
+            "req_to_token_pool",
+            "token_to_kv_pool",
+            "can_run_dp_cuda_graph",
+            "dp_padding_mode",
+            "global_forward_mode",
+            "is_prefill_only",
+            "spec_algorithm",
+            "capture_hidden_mode",
+            "padded_static_len",
+            "mrope_positions",  # only used by qwen2-vl; passed through unsplit
+            "split_index",  # for split prefill
+            "orig_seq_lens",  # only used by qwen-1m; passed through unsplit
+        ]:
+            output_dict[key] = getattr(batch, key)
+        if not batch.forward_mode.is_target_verify():
+            assert (
+                _compute_extend_num_tokens(batch.input_ids, batch.forward_mode)
+                == batch.extend_num_tokens
+            ), f"{batch=}"
+        extend_num_tokens = _compute_extend_num_tokens(
+            output_dict["input_ids"], output_dict["forward_mode"]
+        )
+        # TODO improve, e.g. unify w/ `init_raw`
+        if (
+            get_global_server_args().moe_dense_tp_size == 1
+            and batch.global_dp_buffer_len is not None
+        ):
+            sum_len = end_token_index - start_token_index
+            global_dp_buffer_len = sum_len
+        else:
+            global_dp_buffer_len = None
+        output_dict.update(
+            dict(
+                batch_size=end_seq_index - start_seq_index,
+                seq_lens_sum=(
+                    output_dict["seq_lens_cpu"].sum()
+                    if "seq_lens_cpu" in output_dict
+                    else None
+                ),
+                extend_num_tokens=extend_num_tokens,
+                attn_backend=output_attn_backend,
+                num_token_non_padded=out_num_token_non_padded,
+                # TODO: handle it when we need TBO + DeepSeek V3.2
+                num_token_non_padded_cpu=None,
+                tbo_split_seq_index=None,
+                tbo_parent_token_range=(start_token_index, end_token_index),
+                tbo_children=None,
+                original_global_num_tokens_cpu=None,
+                global_num_tokens_gpu=None,
+                global_num_tokens_cpu=None,
+                global_dp_buffer_len=global_dp_buffer_len,
+                global_num_tokens_for_logprob_gpu=None,
+                global_num_tokens_for_logprob_cpu=None,
+                sampling_info=None,
+                # Only used for logits and logprobs post-processing, so left unset here
+                temp_scaled_logprobs=False,
+                temperature=None,
+                top_p_normalized_logprobs=False,
+                top_p=None,
+                mm_inputs=None,
+                top_logprobs_nums=None,
+                token_ids_logprobs=None,
+                next_token_logits_buffer=None,
+                return_hidden_states_before_norm=False,
+            )
+        )
+        errors = []
+        for field in dataclasses.fields(ForwardBatch):
+            if getattr(batch, field.name) is not None and field.name not in output_dict:
+                errors.append(
+                    f"Field {field.name} has value, but is not yet supported (value={getattr(batch, field.name)} batch={batch})"
+                )
+        if len(errors) > 0:
+            raise Exception(f"{len(errors)} error(s) occurred:\n" + "\n\n".join(errors))
+        return ForwardBatch(**output_dict)
+    @classmethod
+    def compute_tbo_children_num_token_non_padded(cls, batch: ForwardBatch):
+        return cls.compute_tbo_children_num_token_non_padded_raw(
+            tbo_split_token_index=cls._compute_split_token_index(batch),
+            num_token_non_padded=len(batch.input_ids),
+        )
+    @classmethod
+    def compute_tbo_children_num_token_non_padded_raw(
+        cls, tbo_split_token_index: int, num_token_non_padded: int
+    ):
+        # TODO: consider padding both sub-batches so they are slightly more balanced
+        value_a = min(tbo_split_token_index, num_token_non_padded)
+        value_b = max(0, num_token_non_padded - tbo_split_token_index)
+        return torch.tensor([value_a, value_b], dtype=torch.int32).to(
+            device=get_global_server_args().device, non_blocking=True
+        )
+    @classmethod
+    def _compute_split_token_index(cls, batch: ForwardBatch):
+        token_num_per_seq = get_token_num_per_seq(
+            forward_mode=batch.forward_mode, spec_info=batch.spec_info
+        )
+        return compute_split_token_index(
+            split_seq_index=batch.tbo_split_seq_index,
+            forward_mode=batch.forward_mode,
+            extend_seq_lens=batch.extend_seq_lens_cpu,
+            token_num_per_seq=token_num_per_seq,
+        )
+def _compute_extend_num_tokens(input_ids, forward_mode: ForwardMode):
+    if (
+        forward_mode.is_decode()
+        or forward_mode.is_idle()
+        or forward_mode.is_target_verify()
+    ):
+        return None
+    elif forward_mode.is_extend():
+        return input_ids.shape[0]
+    raise NotImplementedError
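
Note on two pieces of arithmetic in the preparation code above: `compute_tbo_children_num_token_non_padded_raw` divides the non-padded token count between the two children (the first child fills up to the split point, the second takes whatever remains), and `filter_batch` rounds each child's token count up to a multiple of the attention TP size for `tbo_padded_len`. A minimal sketch on plain integers follows. The function names are illustrative assumptions, and the tensor creation and device transfer done in the commit are omitted.

```python
from typing import Tuple


def children_num_token_non_padded(
    split_token_index: int, num_token_non_padded: int
) -> Tuple[int, int]:
    """Split the count of real (non-padding) tokens between child A and child B."""
    value_a = min(split_token_index, num_token_non_padded)
    value_b = max(0, num_token_non_padded - split_token_index)
    return value_a, value_b


def padded_len(num_tokens: int, attention_tp_size: int) -> int:
    """Round num_tokens up to the next multiple of attention_tp_size."""
    return ((num_tokens - 1) // attention_tp_size + 1) * attention_tp_size
```

Because the two child counts always sum to `num_token_non_padded`, any padding ends up entirely in the second child when the real tokens extend past the split point, which is what the TODO in that method refers to.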
+# -------------------------------- Execution ---------------------------------------
+def model_forward_maybe_tbo(
+    layers,
+    enable_tbo: bool,
+    positions: torch.Tensor,
+    forward_batch: ForwardBatch,
+    hidden_states: torch.Tensor,
+    input_data_scatter_mode: ScatterMode,
+    residual: Optional[torch.Tensor],
+    zero_allocator: Optional[BumpAllocator] = None,
+):
+    inputs = dict(
+        positions=positions,
+        hidden_states=hidden_states,
+        forward_batch=forward_batch,
+        residual=residual,
+        zero_allocator=zero_allocator,
+    )
+    layer_input_scatter_mode = layers[0].layer_scatter_modes.layer_input_mode
+    operations_strategy = OperationsStrategy.init_new_tbo(
+        layers, forward_batch.global_forward_mode
+    )
+    if enable_tbo:
+        return _model_forward_tbo(
+            inputs=inputs,
+            operations_strategy=operations_strategy,
+            input_data_scatter_mode=input_data_scatter_mode,
+            layer_input_scatter_mode=layer_input_scatter_mode,
+        )
+    else:
+        return _model_forward_non_tbo(inputs, operations_strategy)
+def _model_forward_tbo(
+    inputs,
+    operations_strategy: OperationsStrategy,
+    input_data_scatter_mode: ScatterMode,
+    layer_input_scatter_mode: ScatterMode,
+):
+    inputs_arr = _model_forward_tbo_split_inputs(
+        **inputs,
+        input_data_scatter_mode=input_data_scatter_mode,
+        layer_input_scatter_mode=layer_input_scatter_mode,
+    )
+    original_hidden_states_len = inputs["hidden_states"].shape[0]
+    del inputs
+    context = (
+        empty_context()
+        if _is_hip
+        else deep_gemm_wrapper.configure_deep_gemm_num_sms(
+            operations_strategy.deep_gemm_num_sms
+        )
+    )
+    with context:
+        outputs_arr = execute_overlapped_operations(
+            inputs_arr=inputs_arr,
+            operations_arr=[operations_strategy.operations] * 2,
+            delta_stages=[0, operations_strategy.tbo_delta_stages],
+        )
+    return _model_forward_tbo_merge_outputs(*outputs_arr, original_hidden_states_len)
+def _model_forward_non_tbo(inputs, operations_strategy: OperationsStrategy):
+    outputs = execute_operations(inputs, operations_strategy.operations)
+    return outputs["hidden_states"], outputs["residual"]
+def _model_forward_tbo_split_inputs(
+    hidden_states: torch.Tensor,
+    residual: torch.Tensor,
+    positions: torch.Tensor,
+    forward_batch: ForwardBatch,
+    zero_allocator: Optional[BumpAllocator],
+    input_data_scatter_mode: ScatterMode,
+    layer_input_scatter_mode: ScatterMode,
+) -> List[Dict]:
+    tbo_splitter_scatter_mode = ScatterMode.TP_ATTN_FULL
+    context = CommunicateContext.init_new()
+    hidden_states, residual = CommunicateSummableTensorPairFn.execute(
+        hidden_states_input_mode=input_data_scatter_mode,
+        residual_input_mode=input_data_scatter_mode,
+        output_mode=tbo_splitter_scatter_mode,
+        hidden_states=hidden_states,
+        residual=residual,
+        forward_batch=forward_batch,
+        context=context,
+    )
+    inputs_arr = _model_forward_tbo_split_inputs_raw(
+        hidden_states=hidden_states,
+        residual=residual,
+        positions=positions,
+        forward_batch=forward_batch,
+        zero_allocator=zero_allocator,
+    )
+    def _post_transform(hidden_states, residual, forward_batch, **kwargs):
+        hidden_states, residual = CommunicateSummableTensorPairFn.execute(
+            hidden_states_input_mode=tbo_splitter_scatter_mode,
+            residual_input_mode=tbo_splitter_scatter_mode,
+            output_mode=layer_input_scatter_mode,
+            hidden_states=hidden_states,
+            residual=residual,
+            forward_batch=forward_batch,
+            context=context,
+        )
+        return dict(
+            hidden_states=hidden_states,
+            residual=residual,
+            forward_batch=forward_batch,
+            **kwargs,
+        )
+    return [_post_transform(**inputs) for inputs in inputs_arr]
+def _model_forward_tbo_split_inputs_raw(
+    hidden_states: torch.Tensor,
+    residual: torch.Tensor,
+    positions: torch.Tensor,
+    forward_batch: ForwardBatch,
+    zero_allocator: Optional[BumpAllocator],
+) -> List[Dict]:
+    return [
+        dict(
+            **_model_forward_filter_inputs(
+                hidden_states=hidden_states,
+                residual=residual,
+                positions=positions,
+                output_forward_batch=output_forward_batch,
+                tbo_subbatch_index=tbo_subbatch_index,
+            ),
+            **(
+                dict(zero_allocator=zero_allocator)
+                if zero_allocator is not None
+                else {}
+            ),
+        )
+        for tbo_subbatch_index, output_forward_batch in enumerate(
+            forward_batch.tbo_children
+        )
+    ]
+def _model_forward_filter_inputs(
+    hidden_states: torch.Tensor,
+    residual: torch.Tensor,
+    positions: torch.Tensor,
+    output_forward_batch: ForwardBatch,
+    tbo_subbatch_index: int,
+) -> Dict:
+    token_slice = slice(*output_forward_batch.tbo_parent_token_range)
+    hidden_states = hidden_states[token_slice]
+    residual = None if residual is None else residual[token_slice]
+    positions = positions[token_slice]
+    assert output_forward_batch.tbo_padded_len is not None
+    padded_len = output_forward_batch.tbo_padded_len
+    def _pad(x):
+        nonlocal padded_len
+        if x is None:
+            return None
+        if x.shape[0] == padded_len:
+            return x
+        res = torch.zeros((padded_len, *x.shape[1:]), dtype=x.dtype, device=x.device)
+        res[: x.shape[0]] = x
+        return res
+    return dict(
+        hidden_states=_pad(hidden_states),
+        residual=_pad(residual),
+        positions=_pad(positions),
+        forward_batch=output_forward_batch,
+        tbo_subbatch_index=tbo_subbatch_index,
+    )
+def _model_forward_tbo_merge_outputs(output_a, output_b, original_len):
+    def _handle_key(name):
+        value_a = output_a[name]
+        value_b = output_b[name]
+        assert (value_a is None) == (value_b is None)
+        if value_a is None:
+            return None
+        s0, t0 = output_a["forward_batch"].tbo_parent_token_range
+        s1, t1 = output_b["forward_batch"].tbo_parent_token_range
+        res = torch.zeros(
+            (original_len, *value_a.shape[1:]),
+            dtype=value_a.dtype,
+            device=value_a.device,
+        )
+        res[slice(s0, t0)] = value_a[: t0 - s0]
+        res[slice(s1, t1)] = value_b[: t1 - s1]
+        return res
+    return _handle_key("hidden_states"), _handle_key("residual")
+# -------------------------------- Utilities and wrappers ---------------------------------------
+class MaybeTboDeepEPDispatcher(BaseDispatcher):
+    def __init__(self, **kwargs):
+        super().__init__()
+        num_inner_dispatchers = 2 if is_tbo_enabled() else 1
+        if get_moe_a2a_backend().is_deepep():
+            self._inners = [
+                DeepEPDispatcher(**kwargs) for _ in range(num_inner_dispatchers)
+            ]
+        elif get_moe_a2a_backend().is_mooncake():
+            self._inners = [
+                MooncakeEPDispatcher(**kwargs) for _ in range(num_inner_dispatchers)
+            ]
+        elif get_moe_a2a_backend().is_mori():
+            self._inners = [
+                MoriEPDispatcher(**kwargs) for _ in range(num_inner_dispatchers)
+            ]
+    def _execute(self, name, tbo_subbatch_index: Optional[int] = None, **kwargs):
+        return getattr(self._inners[tbo_subbatch_index or 0], name)(**kwargs)
+    def dispatch(self, **kwargs) -> DispatchOutput:
+        return self._execute("dispatch", **kwargs)
+    def dispatch_a(self, **kwargs):
+        return self._execute("dispatch_a", **kwargs)
+    def dispatch_b(self, **kwargs):
+        return self._execute("dispatch_b", **kwargs)
+    def combine(self, **kwargs) -> torch.Tensor:
+        return self._execute("combine", **kwargs)
+    def combine_a(self, **kwargs):
+        return self._execute("combine_a", **kwargs)
+    def combine_b(self, **kwargs):
+        return self._execute("combine_b", **kwargs)
+    def register_deepep_dispatch_hook(self, hook):
+        handle_list = []
+        for inner in self._inners:
+            handle_list.append(inner.register_deepep_dispatch_hook(hook))
+        return handle_list
+    def set_quant_config(self, quant_config: dict):
+        super().set_quant_config(quant_config)
+        for inner in self._inners:
+            inner.set_quant_config(quant_config)
+    def set_overlap_args(
+        self, combine_overlap_args: CombineOverlapArgs, meta_overlap_args: dict
+    ):
+        super().set_overlap_args(combine_overlap_args, meta_overlap_args)
+        for inner in self._inners:
+            inner.set_overlap_args(combine_overlap_args, meta_overlap_args)
+    def clear_overlap_args(self):
+        super().clear_overlap_args()
+        for inner in self._inners:
+            inner.clear_overlap_args()
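
Note on the split/pad/merge helpers above: `_model_forward_filter_inputs` slices each sub-batch by its `tbo_parent_token_range` and zero-pads it to `tbo_padded_len`, and `_model_forward_tbo_merge_outputs` later copies only the unpadded prefix of each output back into a buffer of the original length. The following is a minimal sketch of that round trip using plain Python lists instead of tensors. The names `pad_rows`, `split_two`, and `merge_two` are hypothetical and are not part of SGLang.

```python
from typing import List, Tuple

Row = List[float]
Range = Tuple[int, int]


def pad_rows(rows: List[Row], padded_len: int) -> List[Row]:
    """Zero-pad a list of rows up to padded_len rows (no-op if already that long)."""
    if len(rows) == padded_len:
        return rows
    width = len(rows[0]) if rows else 0
    return rows + [[0.0] * width for _ in range(padded_len - len(rows))]


def split_two(rows: List[Row], ranges: List[Range], padded_len: int) -> List[List[Row]]:
    """Slice each sub-batch out of the parent batch, then pad it (hypothetical helper)."""
    return [pad_rows(rows[s:t], padded_len) for s, t in ranges]


def merge_two(
    out_a: List[Row], out_b: List[Row], ranges: List[Range], original_len: int
) -> List[Row]:
    """Write the unpadded prefix of each output back at its original token range."""
    (s0, t0), (s1, t1) = ranges
    width = len(out_a[0])
    res = [[0.0] * width for _ in range(original_len)]
    res[s0:t0] = out_a[: t0 - s0]
    res[s1:t1] = out_b[: t1 - s1]
    return res


rows = [[1.0], [2.0], [3.0], [4.0], [5.0]]
ranges = [(0, 3), (3, 5)]
part_a, part_b = split_two(rows, ranges, padded_len=3)
merged = merge_two(part_a, part_b, ranges, original_len=len(rows))
```

Here the second sub-batch holds two real rows plus one padding row, and the merge step drops that padding row again, so `merged` has the same rows as the input.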

sglang/python/sglang/srt/checkpoint_engine/__init__.py ADDED

	@@ -0,0 +1,9 @@

+"""
+Checkpoint engine module for SGLang.
+This module provides functionality for updating model weights via checkpoint engine.
+"""
+from sglang.srt.checkpoint_engine.update import main
+__all__ = ["main"]

sglang/python/sglang/srt/checkpoint_engine/checkpoint_engine_worker.py ADDED

	@@ -0,0 +1,143 @@

+# Copyright 2023-2024 SGLang Team
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""
+Checkpoint-engine integration for SGLang.
+This module provides weight update functionality via IPC for checkpoint-engine compatibility.
+"""
+import logging
+from typing import Callable, Dict, Optional
+import torch
+import zmq
+try:
+    from checkpoint_engine.worker import update_weights_from_ipc
+except ImportError as e:
+    raise ImportError(
+        "checkpoint-engine is not installed. "
+        "Please install it with: pip install sglang[checkpoint-engine]"
+    ) from e
+logger = logging.getLogger(__name__)
+class SGLangCheckpointEngineWorkerExtension:
+    """
+    Worker extension for SGLang to support checkpoint-engine IPC weight updates.
+    This class provides the interface needed for checkpoint-engine integration.
+    """
+    def __init__(self):
+        self._zmq_ctx: Optional[zmq.Context] = None
+    def get_device_uuid(self) -> str:
+        """Get the UUID of the current device."""
+        # Placeholder: the SGLang worker integration overrides this method
+        # to return the real device UUID.
+        raise NotImplementedError(
+            "This method should be overridden by SGLang integration"
+        )
+    def get_device_id(self) -> int:
+        """Get the device ID."""
+        raise NotImplementedError(
+            "This method should be overridden by SGLang integration"
+        )
+    def get_model_loader(self) -> Callable:
+        """Get the model weight loader function."""
+        raise NotImplementedError(
+            "This method should be overridden by SGLang integration"
+        )
+    def get_post_hook(self) -> Optional[Callable]:
+        """Get the post-processing hook after weight loading."""
+        return None
+    def update_weights_from_ipc(self, zmq_handles: Dict[str, str]):
+        """
+        Update weights from IPC communication.
+        Args:
+            zmq_handles: Dict mapping device UUID to ZMQ socket path
+        """
+        if self._zmq_ctx is None:
+            self._zmq_ctx = zmq.Context()
+        device_uuid = self.get_device_uuid()
+        device_id = self.get_device_id()
+        if device_uuid not in zmq_handles:
+            raise ValueError(
+                f"Device UUID {device_uuid} not found in zmq_handles: {list(zmq_handles.keys())}"
+            )
+        update_weights_from_ipc(
+            self._zmq_ctx,
+            zmq_handles[device_uuid],
+            device_id=device_id,
+            run=self.get_model_loader(),
+            post_hook=self.get_post_hook(),
+        )
+class SGLangCheckpointEngineWorkerExtensionImpl(SGLangCheckpointEngineWorkerExtension):
+    """
+    Implementation of SGLangCheckpointEngineWorkerExtension that integrates with SGLang's model runner.
+    This class provides the concrete implementation for checkpoint-engine IPC weight updates.
+    """
+    def __init__(self, model_runner):
+        super().__init__()
+        self.model_runner = model_runner
+    def get_device_uuid(self) -> str:
+        """Get the UUID of the current device."""
+        # Build the "GPU-<uuid>" string for the currently selected CUDA device
+        device_id = torch.cuda.current_device()
+        try:
+            return f"GPU-{torch.cuda.get_device_properties(device_id).uuid!s}"
+        except AssertionError as e:
+            raise ValueError(f"Failed to get GPU UUID for device {device_id}") from e
+    def get_device_id(self) -> int:
+        """Get the device ID."""
+        return torch.cuda.current_device()
+    def get_model_loader(self) -> Callable:
+        """Get the model weight loader function."""
+        return self.model_runner.model.load_weights
+    def get_post_hook(self) -> Optional[Callable]:
+        """Get the post-processing hook after weight loading."""
+        def post_hook():
+            # Perform post-processing after weight loading similar to DefaultModelLoader
+            try:
+                from sglang.srt.model_loader.loader import device_loading_context
+                # Process quantization methods after loading weights
+                for _, module in self.model_runner.model.named_modules():
+                    quant_method = getattr(module, "quant_method", None)
+                    if quant_method is not None:
+                        # Move parameters to device if needed for quantization processing
+                        target_device = torch.device(
+                            "cuda", torch.cuda.current_device()
+                        )
+                        with device_loading_context(module, target_device):
+                            quant_method.process_weights_after_loading(module)
+                # Call model-specific post-loading hook if available
+                if hasattr(self.model_runner.model, "post_load_weights"):
+                    self.model_runner.model.post_load_weights()
+            except Exception as e:
+                logger.warning(f"Post-hook processing failed: {e}")
+        return post_hook
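
Note on the worker extension above: the base class leaves `get_device_uuid` unimplemented and uses its result to pick one socket path out of `zmq_handles`, raising `ValueError` when the device is not listed. A minimal sketch of that lookup pattern follows, without ZMQ or CUDA. `HandleSelector`, `FakeGpuWorker`, and the example UUIDs and socket paths are hypothetical stand-ins, not SGLang or checkpoint-engine APIs.

```python
from typing import Dict


class HandleSelector:
    """Hypothetical base class: subclasses supply the device UUID."""

    def get_device_uuid(self) -> str:
        raise NotImplementedError("Subclasses must return a device UUID")

    def select_handle(self, zmq_handles: Dict[str, str]) -> str:
        """Return the socket path registered for this worker's device."""
        device_uuid = self.get_device_uuid()
        if device_uuid not in zmq_handles:
            raise ValueError(
                f"Device UUID {device_uuid} not found in zmq_handles: "
                f"{list(zmq_handles.keys())}"
            )
        return zmq_handles[device_uuid]


class FakeGpuWorker(HandleSelector):
    """Hypothetical concrete worker with a fixed UUID (no CUDA required)."""

    def __init__(self, device_uuid: str):
        self._device_uuid = device_uuid

    def get_device_uuid(self) -> str:
        return self._device_uuid


handles = {"GPU-aaaa": "ipc:///tmp/sock-0", "GPU-bbbb": "ipc:///tmp/sock-1"}
selected = FakeGpuWorker("GPU-bbbb").select_handle(handles)
```

Keeping the lookup in the base class means each concrete worker only has to answer "which device am I?", which is the same division of labour the two classes above use.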

sglang/python/sglang/srt/checkpoint_engine/update.py ADDED

	@@ -0,0 +1,317 @@

+"""
+Usage:
+1) Launch the server with the wait-for-initial-weights option in one terminal:
+   python -m sglang.launch_server --model-path /workspace/Qwen/Qwen3-4B/ --tensor-parallel-size 2 --port 19730 --load-format dummy --checkpoint-engine-wait-weights-before-ready --mem-fraction-static 0.7
+2) Torchrun this script in another terminal:
+    torchrun --nproc-per-node 2 update.py --update-method broadcast --checkpoint-path /workspace/Qwen/Qwen3-4B/  --inference-parallel-size 2
+Or use the integrated entry point:
+    python -m sglang.srt.checkpoint_engine.update --update-method broadcast --checkpoint-path /workspace/Qwen/Qwen3-4B/  --inference-parallel-size 2
+"""
+import argparse
+import json
+import os
+import pickle
+import subprocess
+import sys
+import time
+from collections import defaultdict
+from collections.abc import Callable
+from contextlib import contextmanager
+from typing import Literal
+import httpx
+import torch
+import torch.distributed as dist
+from safetensors import safe_open
+try:
+    from checkpoint_engine.ps import ParameterServer
+    from loguru import logger
+except ImportError:
+    # Fallback for when checkpoint_engine is not available
+    ParameterServer = None
+    import logging
+    logger = logging.getLogger(__name__)
+@contextmanager
+def timer(msg: str):
+    start = time.perf_counter()
+    yield
+    end = time.perf_counter()
+    logger.info(f"{msg} duration: {end - start:.2f} seconds")
+def check_sglang_ready(
+    endpoint: str, inference_parallel_size: int, uds: str | None = None
+):
+    rank = int(os.getenv("RANK", 0))
+    if rank != rank // inference_parallel_size * inference_parallel_size:
+        return
+    retry_num = 0
+    transport = None
+    if uds is not None:
+        transport = httpx.HTTPTransport(uds=uds)
+    with httpx.Client(transport=transport) as client:
+        while True:
+            try:
+                response = client.get(f"{endpoint}/ping", timeout=10)
+                response.raise_for_status()
+                break
+            except (httpx.ConnectError, httpx.HTTPStatusError) as e:
+                if retry_num % 10 == 0:
+                    logger.warning(
+                        f"Failed to check whether sglang is ready; retried {retry_num} times, error: {e}"
+                    )
+                retry_num += 1
+                time.sleep(0.1)
+def split_checkpoint_files(
+    checkpoint_path: str, rank: int, world_size: int
+) -> list[str]:
+    checkpoint_files = [
+        os.path.join(checkpoint_path, f)
+        for f in filter(
+            lambda x: x.endswith(".safetensors"), os.listdir(checkpoint_path)
+        )
+    ]
+    files_per_rank = (len(checkpoint_files) + world_size - 1) // world_size
+    return checkpoint_files[rank * files_per_rank : (rank + 1) * files_per_rank]
+def split_tensors(
+    checkpoint_path: str, rank: int, world_size: int
+) -> dict[str, torch.Tensor]:
+    index_fn = os.path.join(checkpoint_path, "model.safetensors.index.json")
+    with open(index_fn) as f:
+        weight_map: dict[str, str] = json.load(f)["weight_map"]
+    weights_per_rank = (len(weight_map) + world_size - 1) // world_size
+    fn_tensors: dict[str, list[str]] = defaultdict(list)
+    weight_keys = list(weight_map.items())
+    for name, file in weight_keys[
+        rank * weights_per_rank : (rank + 1) * weights_per_rank
+    ]:
+        fn_tensors[file].append(name)
+    named_tensors = {}
+    for file, names in fn_tensors.items():
+        with safe_open(os.path.join(checkpoint_path, file), framework="pt") as f:
+            for name in names:
+                named_tensors[name] = f.get_tensor(name)
+    return named_tensors
+def req_inference(
+    endpoint: str,
+    inference_parallel_size: int,
+    timeout: float = 300.0,
+    uds: str | None = None,
+    weight_version: str | None = None,
+) -> Callable[[list[tuple[str, str]]], None]:
+    rank = int(os.getenv("RANK", 0))
+    src = rank // inference_parallel_size * inference_parallel_size
+    def req_func(socket_paths: list[tuple[str, str]]):
+        if rank == src:
+            with httpx.Client(transport=httpx.HTTPTransport(uds=uds)) as client:
+                resp = client.post(
+                    f"{endpoint}/update_weights_from_ipc",
+                    json={
+                        "zmq_handles": dict(
+                            socket_paths[src : src + inference_parallel_size]
+                        ),
+                        "flush_cache": True,
+                        "weight_version": weight_version,
+                    },
+                    timeout=timeout,
+                )
+                resp.raise_for_status()
+    return req_func
+def update_weights(
+    ps,
+    checkpoint_name: str,
+    checkpoint_files: list[str],
+    named_tensors: dict[str, torch.Tensor],
+    req_func: Callable[[list[tuple[str, str]]], None],
+    inference_parallel_size: int,
+    endpoint: str,
+    save_metas_file: str | None = None,
+    update_method: Literal["broadcast", "p2p", "all"] = "broadcast",
+    uds: str | None = None,
+):
+    ps.register_checkpoint(
+        checkpoint_name, files=checkpoint_files, named_tensors=named_tensors
+    )
+    ps.init_process_group()
+    check_sglang_ready(endpoint, inference_parallel_size, uds)
+    dist.barrier()
+    with timer("Gather metas"):
+        ps.gather_metas(checkpoint_name)
+    if save_metas_file and int(os.getenv("RANK")) == 0:
+        with open(save_metas_file, "wb") as f:
+            pickle.dump(ps.get_metas(), f)
+    if update_method == "broadcast" or update_method == "all":
+        with timer("Update weights without setting ranks"):
+            ps.update(checkpoint_name, req_func)
+    if update_method == "p2p" or update_method == "all":
+        if update_method == "all":
+            # Sleep 2 s so the process group from the broadcast update can be destroyed first
+            time.sleep(2)
+        with timer("Update weights with setting ranks"):
+            ps.update(
+                checkpoint_name, req_func, ranks=list(range(inference_parallel_size))
+            )
+def join(
+    ps: ParameterServer,
+    checkpoint_name: str,
+    load_metas_file: str,
+    req_func: Callable[[list[tuple[str, str]]], None],
+    inference_parallel_size: int,
+    endpoint: str,
+    uds: str | None = None,
+):
+    assert load_metas_file, "load_metas_file is required"
+    with open(load_metas_file, "rb") as f:
+        metas = pickle.load(f)
+    ps.init_process_group()
+    check_sglang_ready(endpoint, inference_parallel_size, uds)
+    dist.barrier()
+    with timer("Gather metas before join"):
+        ps.gather_metas(checkpoint_name)
+    ps.load_metas(metas)
+    with timer(
+        f"Update weights with setting ranks as range(0, {inference_parallel_size}) by using p2p"
+    ):
+        ps.update(checkpoint_name, req_func, ranks=list(range(inference_parallel_size)))
+def run_with_torchrun():
+    """Run the update script with torchrun automatically."""
+    # Parse inference_parallel_size from command line arguments to determine nproc-per-node
+    inference_parallel_size = 8  # default
+    args = sys.argv[1:]  # Skip the script name
+    # Look for --inference-parallel-size in arguments
+    for i, arg in enumerate(args):
+        if arg == "--inference-parallel-size" and i + 1 < len(args):
+            try:
+                inference_parallel_size = int(args[i + 1])
+            except ValueError:
+                pass
+            break
+        elif arg.startswith("--inference-parallel-size="):
+            try:
+                inference_parallel_size = int(arg.split("=", 1)[1])
+            except ValueError:
+                pass
+            break
+    # Build torchrun command
+    cmd = ["torchrun", f"--nproc-per-node={inference_parallel_size}", __file__] + args
+    print(f"Running: {' '.join(cmd)}", file=sys.stderr)
+    # Execute torchrun with the original script
+    try:
+        result = subprocess.run(cmd, check=False)
+        sys.exit(result.returncode)
+    except FileNotFoundError:
+        print(
+            "Error: torchrun command not found. Please ensure PyTorch is installed.",
+            file=sys.stderr,
+        )
+        sys.exit(1)
+    except KeyboardInterrupt:
+        print("\nInterrupted by user", file=sys.stderr)
+        sys.exit(130)
+def main():
+    # Check if we're running under torchrun or need to invoke it
+    if os.getenv("RANK") is None:
+        # Not running under torchrun, so invoke it
+        run_with_torchrun()
+        return
+    # Running under torchrun, proceed with normal execution
+    parser = argparse.ArgumentParser(description="Update weights example")
+    parser.add_argument("--checkpoint-path", type=str, default=None)
+    parser.add_argument("--save-metas-file", type=str, default=None)
+    parser.add_argument("--load-metas-file", type=str, default=None)
+    parser.add_argument("--sleep-time", type=int, default=0)
+    parser.add_argument("--endpoint", type=str, default="http://localhost:19730")
+    parser.add_argument("--inference-parallel-size", type=int, default=8)
+    parser.add_argument("--checkpoint-name", type=str, default="my-checkpoint-iter-0")
+    parser.add_argument("--update-method", type=str, default="broadcast")
+    parser.add_argument("--uds", type=str, default=None)
+    parser.add_argument("--weight-version", type=str, default=None)
+    args = parser.parse_args()
+    # Get rank and world_size from environment (set by torchrun)
+    rank = int(os.getenv("RANK", 0))
+    world_size = int(os.getenv("WORLD_SIZE", 1))
+    req_func = req_inference(
+        args.endpoint,
+        args.inference_parallel_size,
+        uds=args.uds,
+        weight_version=args.weight_version,
+    )
+    if ParameterServer is None:
+        print("Error: checkpoint_engine package not available", file=sys.stderr)
+        sys.exit(1)
+    ps = ParameterServer(auto_pg=True)
+    ps._p2p_store = None
+    if args.load_metas_file:
+        join(
+            ps,
+            args.checkpoint_name,
+            args.load_metas_file,
+            req_func,
+            args.inference_parallel_size,
+            args.endpoint,
+            args.uds,
+        )
+    else:
+        if args.checkpoint_path and os.path.exists(
+            os.path.join(args.checkpoint_path, "model.safetensors.index.json")
+        ):
+            named_tensors = split_tensors(args.checkpoint_path, rank, world_size)
+            checkpoint_files = []
+        else:
+            checkpoint_files = (
+                split_checkpoint_files(args.checkpoint_path, rank, world_size)
+                if args.checkpoint_path
+                else []
+            )
+            named_tensors = {}
+        update_weights(
+            ps,
+            args.checkpoint_name,
+            checkpoint_files,
+            named_tensors,
+            req_func,
+            args.inference_parallel_size,
+            args.endpoint,
+            args.save_metas_file,
+            args.update_method,
+            args.uds,
+        )
+    time.sleep(args.sleep_time)
+if __name__ == "__main__":
+    main()
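
Note on the rank arithmetic in this file: `split_checkpoint_files` and `split_tensors` both shard a list across ranks with ceiling division, and `check_sglang_ready` and `req_inference` both pick the first rank of each inference group with `rank // inference_parallel_size * inference_parallel_size`. The sketch below isolates those two calculations. `shard` and `group_source_rank` are hypothetical names introduced only for illustration. The sketch takes an already ordered list; since `os.listdir` returns entries in arbitrary order, sorting the file list first may be worth considering if every rank must see the same order.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard(items: Sequence[T], rank: int, world_size: int) -> List[T]:
    """Return this rank's contiguous slice, using ceil(len / world_size) items per rank."""
    per_rank = (len(items) + world_size - 1) // world_size
    return list(items[rank * per_rank : (rank + 1) * per_rank])


def group_source_rank(rank: int, inference_parallel_size: int) -> int:
    """Return the first rank of the inference group that `rank` belongs to."""
    return rank // inference_parallel_size * inference_parallel_size


files = ["a.safetensors", "b.safetensors", "c.safetensors", "d.safetensors", "e.safetensors"]
shards = [shard(files, r, world_size=2) for r in range(2)]
sources = [group_source_rank(r, inference_parallel_size=4) for r in range(8)]
```

With ceiling division the trailing ranks can receive fewer items than the others, or none at all, so callers should tolerate an empty shard.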

sglang/python/sglang/srt/compilation/__pycache__/compilation_config.cpython-311.pyc ADDED

Binary file (2.79 kB).

sglang/python/sglang/srt/compilation/__pycache__/compile.cpython-311.pyc ADDED

Binary file (12 kB).

sglang/python/sglang/srt/compilation/__pycache__/piecewise_context_manager.cpython-311.pyc ADDED

Binary file (5.35 kB).

sglang/python/sglang/srt/compilation/backend.py ADDED

	@@ -0,0 +1,472 @@

+# Adapted from https://github.com/vllm-project/vllm/blob/v0.10.0/vllm/compilation/backend.py
+import ast
+import dataclasses
+import logging
+import os
+import pprint
+import time
+from collections.abc import Sequence
+from contextlib import contextmanager
+from typing import Any, Callable, Optional
+import torch
+import torch.fx as fx
+from torch._dispatch.python import enable_python_dispatcher
+from sglang.srt.compilation.compilation_config import CompilationConfig
+from sglang.srt.compilation.compilation_counter import compilation_counter
+from sglang.srt.compilation.compiler_interface import EagerAdapter, InductorAdaptor
+from sglang.srt.compilation.cuda_piecewise_backend import CUDAPiecewiseBackend
+from sglang.srt.compilation.npu_piecewise_backend import NPUPiecewiseBackend
+from sglang.srt.compilation.pass_manager import PostGradPassManager
+from sglang.srt.utils.common import is_npu, rank0_log
+logger = logging.getLogger(__name__)
+def make_compiler(config: CompilationConfig):
+    if config.compiler == "eager":
+        return EagerAdapter()
+    elif config.compiler == "inductor":
+        return InductorAdaptor()
+    else:
+        raise ValueError(f"Unknown compiler: {config.compiler}")
+def make_backend(
+    graph: fx.GraphModule,
+    compile_config: CompilationConfig,
+    inductor_config: dict[str, Any],
+    graph_pool: Any,
+    piecewise_compile_index: int,
+    total_piecewise_compiles: int,
+    sym_shape_indices: list[int],
+    compiled_graph_for_general_shape: Callable,
+    sglang_backend,
+):
+    backend_cls = CUDAPiecewiseBackend if not is_npu() else NPUPiecewiseBackend
+    return backend_cls(
+        graph,
+        compile_config,
+        inductor_config,
+        graph_pool,
+        piecewise_compile_index,
+        total_piecewise_compiles,
+        sym_shape_indices,
+        compiled_graph_for_general_shape,
+        sglang_backend,
+    )
+class CompilerManager:
+    def __init__(
+        self,
+        config: CompilationConfig,
+    ):
+        self.cache = dict()
+        self.is_cache_updated = False
+        self.compiler = make_compiler(config)
+    def compute_hash(self):
+        return self.compiler.compute_hash()
+    def initialize_cache(
+        self, cache_dir: str, disable_cache: bool = False, prefix: str = ""
+    ):
+        self.disable_cache = disable_cache
+        self.cache_dir = cache_dir
+        self.cache_file_path = os.path.join(cache_dir, "sglang_compile_cache.py")
+        if not disable_cache and os.path.exists(self.cache_file_path):
+            with open(self.cache_file_path) as f:
+                self.cache = ast.literal_eval(f.read())
+        self.compiler.initialize_cache(
+            cache_dir=cache_dir, disable_cache=disable_cache, prefix=prefix
+        )
+    def save_to_file(self):
+        if self.disable_cache or not self.is_cache_updated:
+            return
+        printer = pprint.PrettyPrinter(indent=4)
+        data = printer.pformat(self.cache)
+        with open(self.cache_file_path, "w") as f:
+            f.write(data)
+    def load(
+        self,
+        graph: fx.GraphModule,
+        example_inputs: list[Any],
+        graph_index: int,
+        runtime_shape: Optional[int] = None,
+    ) -> Optional[Callable]:
+        handle = self.cache[(runtime_shape, graph_index, self.compiler.name)]
+        compiled_graph = self.compiler.load(
+            handle, graph, example_inputs, graph_index, runtime_shape
+        )
+        if runtime_shape is None:
+            logger.debug(
+                "Directly load the %s-th graph for dynamic shape from %s via "
+                "handle %s",
+                graph_index,
+                self.compiler.name,
+                handle,
+            )
+        else:
+            logger.debug(
+                "Directly load the %s-th graph for shape %s from %s via handle %s",
+                graph_index,
+                str(runtime_shape),
+                self.compiler.name,
+                handle,
+            )
+        return compiled_graph
+    def compile(
+        self,
+        graph: fx.GraphModule,
+        example_inputs,
+        inductor_config: dict[str, Any],
+        graph_index: int = 0,
+        num_graphs: int = 1,
+        runtime_shape: Optional[int] = None,
+    ) -> Any:
+        if graph_index == 0:
+            # before compiling the first graph, record the start time
+            global compilation_start_time
+            compilation_start_time = time.time()
+        compilation_counter.num_backend_compilations += 1
+        compiled_graph = None
+        # TODO(Yuwei): support cache loading
+        # no compiler cached the graph, or the cache is disabled,
+        # we need to compile it
+        if isinstance(self.compiler, InductorAdaptor):
+            maybe_key = None
+        else:
+            maybe_key = f"artifact_shape_{runtime_shape}_subgraph_{graph_index}"
+        compiled_graph, handle = self.compiler.compile(
+            graph, example_inputs, inductor_config, runtime_shape, maybe_key
+        )
+        assert compiled_graph is not None, "Failed to compile the graph"
+        # store the artifact in the cache
+        if handle is not None:
+            self.cache[(runtime_shape, graph_index, self.compiler.name)] = handle
+            compilation_counter.num_cache_entries_updated += 1
+            self.is_cache_updated = True
+            if graph_index == 0:
+                # adds some info logging for the first graph
+                if runtime_shape is None:
+                    logger.info("Cache the graph for dynamic shape for later use")
+                else:
+                    logger.info(
+                        "Cache the graph of shape %s for later use", str(runtime_shape)
+                    )
+            if runtime_shape is None:
+                logger.debug(
+                    "Store the %s-th graph for dynamic shape from %s via handle %s",
+                    graph_index,
+                    self.compiler.name,
+                    handle,
+                )
+            else:
+                logger.debug(
+                    "Store the %s-th graph for shape %s from %s via handle %s",
+                    graph_index,
+                    str(runtime_shape),
+                    self.compiler.name,
+                    handle,
+                )
+        # after compiling the last graph, record the end time
+        if graph_index == num_graphs - 1:
+            now = time.time()
+            elapsed = now - compilation_start_time
+            if runtime_shape is None:
+                logger.info("Compiling a graph for dynamic shape takes %.2f s", elapsed)
+            else:
+                logger.info(
+                    "Compiling a graph for shape %s takes %.2f s",
+                    runtime_shape,
+                    elapsed,
+                )
+        return compiled_graph
+@dataclasses.dataclass
+class SplitItem:
+    submod_name: str
+    graph_id: int
+    is_splitting_graph: bool
+    graph: fx.GraphModule
+def split_graph(
+    graph: fx.GraphModule, ops: list[str]
+) -> tuple[fx.GraphModule, list[SplitItem]]:
+    # split graph by ops
+    subgraph_id = 0
+    node_to_subgraph_id = {}
+    split_op_graphs = []
+    for node in graph.graph.nodes:
+        if node.op in ("output", "placeholder"):
+            continue
+        if node.op == "call_function" and str(node.target) in ops:
+            subgraph_id += 1
+            node_to_subgraph_id[node] = subgraph_id
+            split_op_graphs.append(subgraph_id)
+            subgraph_id += 1
+        else:
+            node_to_subgraph_id[node] = subgraph_id
+    # `keep_original_order` is important!
+    # otherwise pytorch might reorder the nodes and
+    # the semantics of the graph will change when we
+    # have mutations in the graph
+    split_gm = torch.fx.passes.split_module.split_module(
+        graph, None, lambda node: node_to_subgraph_id[node], keep_original_order=True
+    )
+    outputs = []
+    names = [name for (name, module) in split_gm.named_modules()]
+    for name in names:
+        if "." in name or name == "":
+            # recursive child module or the root module
+            continue
+        module = getattr(split_gm, name)
+        graph_id = int(name.replace("submod_", ""))
+        outputs.append(SplitItem(name, graph_id, (graph_id in split_op_graphs), module))
+    # sort by integer graph_id, rather than string name
+    outputs.sort(key=lambda x: x.graph_id)
+    return split_gm, outputs
+# we share the global graph pool among all the backends
+global_graph_pool = None
+compilation_start_time = 0.0
+class PiecewiseCompileInterpreter(torch.fx.Interpreter):
+    def __init__(
+        self,
+        module: torch.fx.GraphModule,
+        compile_submod_names: list[str],
+        inductor_config: dict[str, Any],
+        graph_pool,
+        compile_config: CompilationConfig,
+        sglang_backend: "SGLangBackend",
+    ):
+        super().__init__(module)
+        from torch._guards import detect_fake_mode
+        self.fake_mode = detect_fake_mode()
+        self.compile_submod_names = compile_submod_names
+        self.graph_pool = graph_pool
+        self.sglang_backend = sglang_backend
+        # When True, it annoyingly dumps the torch.fx.Graph on errors.
+        self.extra_traceback = False
+        self.inductor_config = inductor_config
+        self.compile_config = compile_config
+    def run(self, *args):
+        fake_args = [
+            self.fake_mode.from_tensor(t) if isinstance(t, torch.Tensor) else t
+            for t in args
+        ]
+        with self.fake_mode, enable_python_dispatcher():
+            return super().run(*fake_args)
+    def call_module(
+        self,
+        target: torch.fx.node.Target,
+        args: tuple[torch.fx.node.Argument, ...],
+        kwargs: dict[str, Any],
+    ) -> Any:
+        assert isinstance(target, str)
+        output = super().call_module(target, args, kwargs)
+        if target in self.compile_submod_names:
+            index = self.compile_submod_names.index(target)
+            submod = self.fetch_attr(target)
+            sym_shape_indices = [
+                i for i, x in enumerate(args) if isinstance(x, torch.SymInt)
+            ]
+            global compilation_start_time
+            compiled_graph_for_dynamic_shape = (
+                self.sglang_backend.compiler_manager.compile(
+                    submod,
+                    args,
+                    self.inductor_config,
+                    graph_index=index,
+                    num_graphs=len(self.compile_submod_names),
+                    runtime_shape=None,
+                )
+            )
+            self.module.__dict__[target] = make_backend(
+                submod,
+                self.compile_config,
+                self.inductor_config,
+                self.graph_pool,
+                index,
+                len(self.compile_submod_names),
+                sym_shape_indices,
+                compiled_graph_for_dynamic_shape,
+                self.sglang_backend,
+            )
+            compilation_counter.num_piecewise_capturable_graphs_seen += 1
+        return output
+model_tag: str = "backbone"
+@contextmanager
+def set_model_tag(tag: str):
+    """Context manager to set the model tag."""
+    global model_tag
+    assert (
+        tag != model_tag
+    ), f"Model tag {tag} is the same as the current tag {model_tag}."
+    old_tag = model_tag
+    model_tag = tag
+    try:
+        yield
+    finally:
+        model_tag = old_tag
+class SGLangBackend:
+    graph_pool: Any
+    _called: bool = False
+    # the graph we compiled
+    graph: fx.GraphModule
+    # the stitching graph module for all the piecewise graphs
+    split_gm: fx.GraphModule
+    piecewise_graphs: list[SplitItem]
+    returned_callable: Callable
+    # Inductor passes to run on the graph pre-defunctionalization
+    post_grad_passes: Sequence[Callable]
+    sym_tensor_indices: list[int]
+    input_buffers: list[torch.Tensor]
+    compiler_manager: CompilerManager
+    def __init__(
+        self,
+        config: CompilationConfig,
+        graph_pool: Any,
+    ):
+        rank0_log("Initializing SGLangBackend")
+        assert graph_pool is not None
+        self.graph_pool = graph_pool
+        self.post_grad_pass_manager = PostGradPassManager()
+        self.sym_tensor_indices = []
+        self.input_buffers = []
+        self.compiler_manager = CompilerManager(config)
+        self.inductor_config = {
+            "enable_auto_functionalized_v2": False,
+        }
+        self.compile_config = config
+    def configure_post_pass(self):
+        self.post_grad_pass_manager.configure()
+        self.inductor_config["post_grad_custom_post_pass"] = self.post_grad_pass_manager
+    def __call__(self, graph: fx.GraphModule, example_inputs) -> Callable:
+        rank0_log("SGLangBackend __call__")
+        base_cache_dir = os.path.expanduser(
+            os.getenv("SGLANG_CACHE_DIR", "~/.cache/sglang/")
+        )
+        cache_hash = self.compiler_manager.compute_hash()
+        cache_dir = os.path.join(
+            base_cache_dir,
+            "torch_compile_cache",
+            cache_hash,
+        )
+        os.makedirs(cache_dir, exist_ok=True)
+        rank = 0
+        dp_rank = 0
+        local_cache_dir = os.path.join(cache_dir, f"rank_{rank}_{dp_rank}", model_tag)
+        os.makedirs(local_cache_dir, exist_ok=True)
+        self.compiler_manager.initialize_cache(
+            local_cache_dir, disable_cache=False, prefix=""
+        )
+        compilation_counter.num_graphs_seen += 1
+        assert not self._called, "SGLangBackend can only be called once"
+        self.graph = graph
+        self.configure_post_pass()
+        self.split_gm, self.piecewise_graphs = split_graph(
+            graph,
+            self.compile_config.split_ops,
+        )
+        from torch._dynamo.utils import lazy_format_graph_code
+        # depyf will hook lazy_format_graph_code and dump the graph
+        # for debugging, no need to print the graph here
+        lazy_format_graph_code("before split", self.graph)
+        lazy_format_graph_code("after split", self.split_gm)
+        compilation_counter.num_piecewise_graphs_seen += len(self.piecewise_graphs)
+        submod_names_to_compile = [
+            item.submod_name
+            for item in self.piecewise_graphs
+            if not item.is_splitting_graph
+        ]
+        PiecewiseCompileInterpreter(
+            self.split_gm,
+            submod_names_to_compile,
+            self.inductor_config,
+            self.graph_pool,
+            self.compile_config,
+            self,
+        ).run(*example_inputs)
+        rank = torch.distributed.get_rank()
+        if rank == 0:
+            graph_path = os.path.join(
+                local_cache_dir, f"computation_graph_{time.time()}.py"
+            )
+            if not os.path.exists(graph_path):
+                # code adapted from https://github.com/thuml/depyf/blob/dab831108a752d1facc00acdd6d4243891845c37/depyf/explain/patched_lazy_format_graph_code.py#L30 # noqa
+                # use `print_readable` because it can include submodules
+                src = (
+                    "from __future__ import annotations\nimport torch\n"
+                    + self.split_gm.print_readable(print_output=False)
+                )
+                src = src.replace("<lambda>", "GraphModule")
+                with open(graph_path, "w") as f:
+                    f.write(src)
+                rank0_log(f"Computation graph saved to {graph_path}")
+        self._called = True
+        return self.split_gm
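
The partition numbering used by `split_graph` above can be illustrated without torch. The sketch below is only an illustration under assumptions: it replays the same id-assignment loop on a plain list of op names instead of `fx` nodes, and the op names (including `sglang.attention`) and the helper name `assign_subgraph_ids` are hypothetical.

```python
def assign_subgraph_ids(op_names, split_ops):
    """Replay split_graph's id assignment on a plain list of op names.

    Each splitting op gets a subgraph of its own; the ops between two
    splitting ops share one subgraph id.
    """
    subgraph_id = 0
    node_to_subgraph_id = {}
    split_op_graphs = []
    for index, name in enumerate(op_names):
        if name in split_ops:
            # Close the current piece, give the splitting op its own id,
            # then open a fresh piece for whatever follows.
            subgraph_id += 1
            node_to_subgraph_id[index] = subgraph_id
            split_op_graphs.append(subgraph_id)
            subgraph_id += 1
        else:
            node_to_subgraph_id[index] = subgraph_id
    return node_to_subgraph_id, split_op_graphs


# Hypothetical op sequence: two splitting ops surrounded by ordinary ops.
ops = ["linear", "rmsnorm", "sglang.attention", "linear", "sglang.attention", "linear"]
ids, splitting = assign_subgraph_ids(ops, {"sglang.attention"})
print(ids)
print(splitting)
```

In this toy run the ordinary ops land in the even-numbered pieces and the splitting ops in the odd-numbered ones, which is why `SGLangBackend.__call__` can select the pieces to compile with `not item.is_splitting_graph`.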

sglang/python/sglang/srt/compilation/compilation_config.py ADDED

	@@ -0,0 +1,45 @@

+# Adapted from https://github.com/vllm-project/vllm/blob/v0.10.0/vllm/compilation/compilation_config.py
+from typing import Callable, List, Optional
+SPLIT_OPS = []
+def register_split_op(op_name: Optional[str] = None):
+    def decorator(op_func: Callable):
+        name = op_name or op_func.__name__
+        SPLIT_OPS.append(f"sglang.{name}")
+        return op_func
+    return decorator
+# TODO(Yuwei): better compile config support
+class CompilationConfig:
+    def __init__(
+        self,
+        capture_sizes: List[int],
+        compiler: str = "eager",
+        enable_debug_mode: bool = False,
+    ):
+        self.traced_files = set()
+        self.capture_sizes = capture_sizes
+        self.compiler = compiler
+        self.enable_debug_mode = enable_debug_mode
+        self.split_ops = []
+        self.split_ops.extend(SPLIT_OPS)
+    def add_split_op(self, op: str):
+        self.split_ops.append(op)
+    def add_traced_file(self, file_path: str):
+        self.traced_files.add(file_path)
+    def get_traced_files(self):
+        return self.traced_files
+    def get_capture_sizes(self):
+        return self.capture_sizes
+    def get_enable_debug_mode(self):
+        return self.enable_debug_mode
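
One consequence of the registry pattern above is that `CompilationConfig` copies `SPLIT_OPS` at construction time, so only ops registered before the config is created are picked up. The following is a minimal stand-alone sketch of that behavior. `MiniConfig` and the decorated functions are hypothetical stand-ins; only `register_split_op` and `SPLIT_OPS` follow the file above.

```python
from typing import Callable, List, Optional

SPLIT_OPS: List[str] = []


def register_split_op(op_name: Optional[str] = None):
    """Same shape as the decorator above: record the name, return the function."""

    def decorator(op_func: Callable):
        name = op_name or op_func.__name__
        SPLIT_OPS.append(f"sglang.{name}")
        return op_func

    return decorator


class MiniConfig:
    """Hypothetical stand-in that keeps only the split_ops copy logic."""

    def __init__(self):
        self.split_ops = []
        self.split_ops.extend(SPLIT_OPS)  # snapshot taken at construction


@register_split_op()
def my_attention(q, k, v):  # hypothetical op, registered under its own name
    return q


early = MiniConfig()


@register_split_op("custom_allreduce")
def _allreduce(x):  # hypothetical op, registered under an explicit name
    return x


late = MiniConfig()
print(early.split_ops)
print(late.split_ops)
```

A config built too early can still be extended afterwards through `add_split_op`, which appends to the per-instance list.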

sglang/python/sglang/srt/compilation/compilation_counter.py ADDED

	@@ -0,0 +1,47 @@

+# Adapted from https://github.com/vllm-project/vllm/blob/v0.10.0/vllm/compilation/compilation_counter.py
+import copy
+import dataclasses
+from contextlib import contextmanager
+@dataclasses.dataclass
+class CompilationCounter:
+    num_models_seen: int = 0
+    num_graphs_seen: int = 0
+    # including the splitting ops
+    num_piecewise_graphs_seen: int = 0
+    # not including the splitting ops
+    num_piecewise_capturable_graphs_seen: int = 0
+    num_backend_compilations: int = 0
+    # Number of gpu_model_runner attempts to trigger CUDAGraphs capture
+    num_gpu_runner_capture_triggers: int = 0
+    # Number of CUDAGraphs captured
+    num_cudagraph_captured: int = 0
+    # InductorAdapter.compile calls
+    num_inductor_compiles: int = 0
+    # EagerAdapter.compile calls
+    num_eager_compiles: int = 0
+    # The number of times vLLM's compiler cache entry was updated
+    num_cache_entries_updated: int = 0
+    # The number of standalone_compile compiled artifacts saved
+    num_compiled_artifacts_saved: int = 0
+    # Number of times a model was loaded with CompilationLevel.DYNAMO_AS_IS
+    dynamo_as_is_count: int = 0
+    def clone(self) -> "CompilationCounter":
+        return copy.deepcopy(self)
+    @contextmanager
+    def expect(self, **kwargs):
+        old = self.clone()
+        yield
+        for k, v in kwargs.items():
+            assert getattr(self, k) - getattr(old, k) == v, (
+                f"{k} not as expected, before it is {getattr(old, k)}"
+                f", after it is {getattr(self, k)}, "
+                f"expected diff is {v}"
+            )
+compilation_counter = CompilationCounter()
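
The `expect` context manager above is meant for tests: it snapshots the counter, runs the body, and asserts that each named field changed by exactly the given amount. A small usage sketch follows, assuming a trimmed copy of the class. `MiniCounter` is a hypothetical stand-in with only two of the fields.

```python
import copy
import dataclasses
from contextlib import contextmanager


@dataclasses.dataclass
class MiniCounter:
    """Trimmed, hypothetical copy of CompilationCounter for illustration."""

    num_graphs_seen: int = 0
    num_inductor_compiles: int = 0

    def clone(self) -> "MiniCounter":
        return copy.deepcopy(self)

    @contextmanager
    def expect(self, **kwargs):
        old = self.clone()
        yield
        # Runs only when the body finished without raising.
        for k, v in kwargs.items():
            assert getattr(self, k) - getattr(old, k) == v, (
                f"{k} not as expected, before it is {getattr(old, k)}"
                f", after it is {getattr(self, k)}, "
                f"expected diff is {v}"
            )


counter = MiniCounter()

# Passes: exactly one graph was seen and nothing was compiled with Inductor.
with counter.expect(num_graphs_seen=1, num_inductor_compiles=0):
    counter.num_graphs_seen += 1

# Fails: the body did not perform the expected Inductor compile.
mismatch_detected = False
try:
    with counter.expect(num_inductor_compiles=1):
        pass
except AssertionError:
    mismatch_detected = True
print(mismatch_detected)
```

Because the check is an `assert`, it is skipped when Python runs with `-O`.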

sglang/python/sglang/srt/compilation/compile.py ADDED

	@@ -0,0 +1,203 @@

+import inspect
+import logging
+import os
+import sys
+import types
+from dataclasses import dataclass
+from typing import Any, Callable, Optional, Union
+import torch
+from sglang.srt.compilation.compilation_config import CompilationConfig
+from sglang.srt.compilation.piecewise_context_manager import is_in_piecewise_cuda_graph
+from sglang.srt.utils.common import rank0_log
+logger = logging.getLogger(__name__)
+@dataclass
+class IntermediateTensors:
+    """For all pipeline stages except the last, we need to return the hidden
+    states and residuals to be sent to the next stage. This data structure
+    contains the hidden states and residuals for a request.
+    Each stage also needs to handle its own finished_sending and
+    finished_recving in case of kv transfer.
+    """
+    tensors: dict[str, torch.Tensor]
+    # [req_ids]
+    finished_sending: Optional[set[str]] = None
+    finished_recving: Optional[set[str]] = None
+    def __init__(self, tensors):
+        # manually define this function, so that
+        # Dynamo knows `IntermediateTensors()` comes from this file.
+        # Otherwise, dataclass will generate this function by evaluating
+        # a string, and we will lose the information about the source file.
+        self.tensors = tensors
+    def __getitem__(self, key: Union[str, slice]):
+        if isinstance(key, str):
+            return self.tensors[key]
+        elif isinstance(key, slice):
+            return self.__class__({k: v[key] for k, v in self.tensors.items()})
+    def __setitem__(self, key: str, value: torch.Tensor):
+        self.tensors[key] = value
+    def items(self):
+        return self.tensors.items()
+    def __len__(self):
+        return len(self.tensors)
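+    def __eq__(self, other: object):
+        # Compare contents; `... and self` returned the instance, not a bool.
+        return (
+            isinstance(other, self.__class__)
+            and self.tensors.keys() == other.tensors.keys()
+            and all(torch.equal(v, other.tensors[k]) for k, v in self.tensors.items())
+        )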
+    def __repr__(self) -> str:
+        return f"IntermediateTensors(tensors={self.tensors})"
+def _normalize_dims(dims, ndim: int):
+    dims = [dims] if isinstance(dims, int) else list(dims)
+    return [d if d >= 0 else ndim + d for d in dims]
+class _MaybeIntermediateTensors:
+    """Duck-typed check to support your IntermediateTensors without importing."""
+    def __init__(self, obj):
+        self.is_intermediate = hasattr(obj, "tensors") and isinstance(
+            getattr(obj, "tensors"), dict
+        )
+        self.obj = obj
+def _mark_dynamic_on_value(val, dims):
+    if isinstance(val, torch.Tensor):
+        torch._dynamo.maybe_mark_dynamic(val, _normalize_dims(dims, val.ndim))
+    else:
+        mit = _MaybeIntermediateTensors(val)
+        if mit.is_intermediate:
+            for t in mit.obj.tensors.values():
+                torch._dynamo.maybe_mark_dynamic(t, _normalize_dims(dims, t.ndim))
+        # else: ignore (None or non-tensor)
+def _infer_dynamic_arg_dims_from_annotations(forward_fn):
+    sig = inspect.signature(forward_fn)
+    dyn = {}
+    for name, p in sig.parameters.items():
+        ann = p.annotation
+        # Accept torch.Tensor / Optional[torch.Tensor] / your IntermediateTensors types by name
+        if (
+            ann is torch.Tensor
+            or getattr(getattr(ann, "__args__", [None])[0], "__name__", "") == "Tensor"
+        ):
+            dyn[name] = 0
+        elif getattr(ann, "__name__", "") in ("IntermediateTensors",) or any(
+            getattr(a, "__name__", "") == "IntermediateTensors"
+            for a in getattr(ann, "__args__", [])
+        ):
+            dyn[name] = 0
+        elif ann == "torch.Tensor" or ann == "Optional[torch.Tensor]":
+            # For future import annotations (e.g. from __future__ import annotations), the annotation is a string
+            dyn[name] = 0
+    if not dyn:
+        raise ValueError("No dynamic dims inferred; pass dynamic_arg_dims explicitly.")
+    return dyn
+def install_torch_compiled(
+    module: torch.nn.Module,
+    *,
+    dynamic_arg_dims: dict[str, Union[int, list[int]]] | None = None,
+    backend_factory: Optional[Callable[[torch.fx.GraphModule, list], Callable]] = None,
+    compile_config: Optional[CompilationConfig] = None,
+    fullgraph: bool = True,
+    graph_pool: Any = None,
+):
+    rank0_log("install_torch_compiled")
+    unbound_fwd = module.__class__.forward
+    if not callable(unbound_fwd):
+        raise TypeError("module.__class__.forward must be callable")
+    original_code = unbound_fwd.__code__
+    dyn_map = dynamic_arg_dims or _infer_dynamic_arg_dims_from_annotations(unbound_fwd)
+    if backend_factory is None:
+        from sglang.srt.compilation.backend import SGLangBackend
+        backend_factory = lambda gm, ex: SGLangBackend(compile_config, graph_pool)(
+            gm, ex
+        )
+    compiled_codes: list[type(original_code)] = []
+    state = {"compiled": False, "compiled_callable": None}
+    def bytecode_hook(old_code, new_code):
+        if old_code is not original_code:
+            return
+        frame = sys._getframe()
+        while frame and frame.f_back:
+            frame = frame.f_back
+            if (
+                frame.f_code.co_name == "_compile"
+                and os.path.basename(frame.f_code.co_filename) == "convert_frame.py"
+            ):
+                break
+        try:
+            dynamo_frame = frame.f_locals["frame"]
+        except Exception:
+            return
+        if dynamo_frame.f_code is not old_code:
+            return
+        if dynamo_frame.f_locals.get("self") is not module:
+            return
+        compiled_codes.append(new_code)
+    torch._dynamo.convert_frame.register_bytecode_hook(bytecode_hook)
+    def _ensure_compiled(self, *args, **kwargs):
+        """Compile on first use (with flag ON)."""
+        if state["compiled"]:
+            return
+        # Mark dynamic dims only when we are about to compile
+        sig = inspect.signature(unbound_fwd)
+        ba = sig.bind(self, *args, **kwargs)
+        ba.apply_defaults()
+        for name, dims in (dyn_map or {}).items():
+            if name in ba.arguments:
+                val = ba.arguments[name]
+                if val is not None:
+                    _mark_dynamic_on_value(val, dims)
+        # Avoid cross-instance cache reuse
+        torch._dynamo.eval_frame.remove_from_cache(unbound_fwd.__code__)
+        bound = types.MethodType(unbound_fwd, self)
+        compiled_callable = torch.compile(
+            bound, fullgraph=fullgraph, backend=backend_factory
+        )
+        # Trigger Dynamo so bytecode hook can capture
+        compiled_callable(*args, **kwargs)
+        state["compiled"] = True
+        state["compiled_callable"] = compiled_callable
+    def trampoline(self, *args, **kwargs):
+        use_compiled = is_in_piecewise_cuda_graph()
+        if use_compiled:
+            if not state["compiled"]:
+                _ensure_compiled(self, *args, **kwargs)
+            compiled_callable = state["compiled_callable"]
+            return compiled_callable(*args, **kwargs)
+        else:
+            # Explicitly run the original uncompiled forward
+            return unbound_fwd(self, *args, **kwargs)
+    module.forward = types.MethodType(trampoline, module)
+    return module
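
The trampoline installed by `install_torch_compiled` can be sketched in plain Python. This is a simplified illustration, not the code above: `compile_fn` is a hypothetical stand-in for `torch.compile`, `use_compiled` stands in for `is_in_piecewise_cuda_graph`, and the dynamic-dim marking, the bytecode hook, and the warm-up call are left out.

```python
import types


def install_lazy_trampoline(obj, compile_fn, use_compiled):
    """Replace obj.forward with a dispatcher that compiles on first use."""
    unbound_fwd = obj.__class__.forward
    state = {"compiled_callable": None}

    def trampoline(self, *args, **kwargs):
        if not use_compiled():
            # Flag off: run the original, uncompiled forward.
            return unbound_fwd(self, *args, **kwargs)
        if state["compiled_callable"] is None:
            # Flag on for the first time: compile the bound method once.
            bound = types.MethodType(unbound_fwd, self)
            state["compiled_callable"] = compile_fn(bound)
        return state["compiled_callable"](*args, **kwargs)

    # Instance-level override; the class-level forward stays untouched.
    obj.forward = types.MethodType(trampoline, obj)
    return obj


class Doubler:  # hypothetical module-like object
    def forward(self, x):
        return 2 * x


compile_calls = []


def fake_compile(fn):
    """Hypothetical compiler: records the call and tags the result."""
    compile_calls.append(fn)
    return lambda *a, **k: ("compiled", fn(*a, **k))


flag = {"on": False}
m = install_lazy_trampoline(Doubler(), fake_compile, lambda: flag["on"])

eager_result = m.forward(3)  # flag off, original forward runs
flag["on"] = True
first = m.forward(3)  # triggers the one-time compile
second = m.forward(4)  # reuses the cached callable
print(eager_result, first, second, len(compile_calls))
```

Keeping the state in a closure rather than on the module avoids adding attributes to the wrapped object, and overriding `forward` on the instance leaves other instances of the same class on the original path.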

sglang/python/sglang/srt/compilation/compiler_interface.py ADDED

	@@ -0,0 +1,504 @@

+# Adapted from https://github.com/vllm-project/vllm/blob/v0.10.0/vllm/compilation/compiler_interface.py
+import contextlib
+import copy
+import hashlib
+import os
+from contextlib import ExitStack
+from typing import Any, Callable, Optional
+from unittest.mock import patch
+import torch
+import torch._inductor.compile_fx
+import torch.fx as fx
+from sglang.srt.compilation.compilation_counter import compilation_counter
+from sglang.srt.compilation.inductor_pass import pass_context
+from sglang.srt.utils.common import torch_release
+class CompilerInterface:
+    """
+    The interface for a compiler that can be used by vLLM.
+    """
+    # The name of the compiler, e.g. inductor.
+    # This is a class-level attribute.
+    name: str
+    def initialize_cache(
+        self, cache_dir: str, disable_cache: bool = False, prefix: str = ""
+    ):
+        """
+        when the vLLM process uses `cache_dir` as the cache directory,
+        the compiler should initialize itself with the cache directory,
+        e.g. by re-directing its own cache directory to a sub-directory.
+        prefix can be used in combination with cache_dir to figure out the base
+        cache directory, e.g. when multiple parts of the model are compiled
+        but should all share the same base cache directory.
+        e.g.
+        cache_dir = "/path/to/dir/backbone", prefix = "backbone"
+        cache_dir = "/path/to/dir/eagle_head", prefix = "eagle_head"
+        """
+        pass
+    def compute_hash(self) -> str:
+        """
+        Gather all the relevant information from the vLLM config,
+        to compute a hash so that we can cache the compiled model.
+        See [`VllmConfig.compute_hash`][vllm.config.VllmConfig.compute_hash]
+        to check what information
+        is already considered by default. This function should only
+        consider the information that is specific to the compiler.
+        """
+        return ""
+    def compile(
+        self,
+        graph: fx.GraphModule,
+        example_inputs: list[Any],
+        compiler_config: dict[str, Any],
+        runtime_shape: Optional[int] = None,
+        key: Optional[str] = None,
+    ) -> tuple[Optional[Callable], Optional[Any]]:
+        """
+        Compile the graph with the given example inputs and compiler config,
+        with a runtime shape. If the `runtime_shape` is None, it means
+        the `example_inputs` have a dynamic shape. Otherwise, the
+        `runtime_shape` specifies the shape of the inputs. Right now we only
+        support one variable shape for all inputs, which is the batch size
+        (number of tokens) during inference.
+        Dynamo will make sure `graph(*example_inputs)` is valid.
+        The function should return a compiled callable function, as well as
+        a handle that can be used to directly load the compiled function.
+        The handle should be a plain Python object, preferably a string or a
+        file path for readability.
+        If the compiler doesn't support caching, it should return None for the
+        handle. If the compiler fails to compile the graph, it should return
+        None for the compiled function as well.
+        `key` is required for StandaloneInductorAdapter, it specifies where to
+        save the compiled artifact. The compiled artifact gets saved to
+        `cache_dir/key`.
+        """
+        return None, None
+    def load(
+        self,
+        handle: Any,
+        graph: fx.GraphModule,
+        example_inputs: list[Any],
+        graph_index: int,
+        runtime_shape: Optional[int] = None,
+    ) -> Callable:
+        """
+        Load the compiled function from the handle.
+        Raises an error if the handle is invalid.
+        The handle is the second return value of the `compile` function.
+        """
+        raise NotImplementedError("caching is not supported")
+def get_inductor_factors() -> list[Any]:
+    factors: list[Any] = []
+    # summarize system state
+    from torch._inductor.codecache import CacheBase
+    system_factors = CacheBase.get_system()
+    factors.append(system_factors)
+    # summarize pytorch state
+    from torch._inductor.codecache import torch_key
+    torch_factors = torch_key()
+    factors.append(torch_factors)
+    return factors
+class AlwaysHitShapeEnv:
+    """
+    Why do we need this class:
+    For normal `torch.compile` usage, every compilation will have
+    one Dynamo bytecode compilation and one Inductor compilation.
+    The Inductor compilation happens under the context of the
+    Dynamo bytecode compilation, and that context is used to
+    determine the dynamic shape information, etc.
+    For our use case, we only run Dynamo bytecode compilation once,
+    and run Inductor compilation multiple times with different shapes
+    plus a general shape. The compilation for specific shapes happens
+    outside of the context of the Dynamo bytecode compilation. At that
+    time, we don't have shape environment to provide to Inductor, and
+    it will fail the Inductor code cache lookup.
+    By providing a dummy shape environment that always hits, we can
+    make the Inductor code cache lookup always hit, and we can
+    compile the graph for different shapes as needed.
+    The following dummy methods are obtained by trial-and-error
+    until it works.
+    """
+    def __init__(self) -> None:
+        self.guards: list[Any] = []
+    def evaluate_guards_expression(self, *args, **kwargs):
+        return True
+    def get_pruned_guards(self, *args, **kwargs):
+        return []
+    def produce_guards_expression(self, *args, **kwargs):
+        return ""
+class InductorAdaptor(CompilerInterface):
+    """
+    The adaptor for the Inductor compiler, version 2.5, 2.6, 2.7.
+    """
+    name = "inductor"
+    def compute_hash(self) -> str:
+        factors = get_inductor_factors()
+        hash_str = hashlib.md5(
+            str(factors).encode(), usedforsecurity=False
+        ).hexdigest()[:10]
+        return hash_str
+    def initialize_cache(
+        self, cache_dir: str, disable_cache: bool = False, prefix: str = ""
+    ):
+        self.cache_dir = cache_dir
+        self.prefix = prefix
+        self.base_cache_dir = cache_dir[: -len(prefix)] if prefix else cache_dir
+        if disable_cache:
+            return
+        # redirect the cache directory to a sub-directory
+        # set flags so that Inductor and Triton store their cache
+        # in the cache_dir, then users only need to copy the cache_dir
+        # to another machine to reuse the cache.
+        inductor_cache = os.path.join(self.base_cache_dir, "inductor_cache")
+        os.makedirs(inductor_cache, exist_ok=True)
+        os.environ["TORCHINDUCTOR_CACHE_DIR"] = inductor_cache
+        triton_cache = os.path.join(self.base_cache_dir, "triton_cache")
+        os.makedirs(triton_cache, exist_ok=True)
+        os.environ["TRITON_CACHE_DIR"] = triton_cache
+    def compile(
+        self,
+        graph: fx.GraphModule,
+        example_inputs: list[Any],
+        compiler_config: dict[str, Any],
+        runtime_shape: Optional[int] = None,
+        key: Optional[str] = None,
+    ) -> tuple[Optional[Callable], Optional[Any]]:
+        compilation_counter.num_inductor_compiles += 1
+        from torch._inductor.compile_fx import compile_fx
+        current_config = {}
+        if compiler_config is not None:
+            current_config.update(compiler_config)
+        # enable the local FX graph cache and disable the remote cache
+        current_config["fx_graph_cache"] = True
+        current_config["fx_graph_remote_cache"] = False
+        set_inductor_config(current_config, runtime_shape)
+        # inductor can inplace modify the graph, so we need to copy it
+        # see https://github.com/pytorch/pytorch/issues/138980
+        graph = copy.deepcopy(graph)
+        # it's the first time we compile this graph
+        # the assumption is that we don't have nested Inductor compilation.
+        # compiled_fx_graph_hash will only be called once, and we can hook
+        # it to get the hash of the compiled graph directly.
+        hash_str, file_path = None, None
+        from torch._inductor.codecache import FxGraphCache, compiled_fx_graph_hash
+        if torch_release[:2] == (2, 5):
+            original_load = FxGraphCache.load
+            original_load_name = "torch._inductor.codecache.FxGraphCache.load"
+            def hijack_load(*args, **kwargs):
+                inductor_compiled_graph = original_load(*args, **kwargs)
+                nonlocal file_path
+                compiled_fn = inductor_compiled_graph.current_callable
+                file_path = compiled_fn.__code__.co_filename  # noqa
+                if not file_path.startswith(self.base_cache_dir):
+                    # hooked in the align_inputs_from_check_idxs function
+                    # in torch/_inductor/utils.py
+                    for cell in compiled_fn.__closure__:
+                        if not callable(cell.cell_contents):
+                            continue
+                        if cell.cell_contents.__code__.co_filename.startswith(
+                            self.base_cache_dir
+                        ):
+                            # this is the real file path compiled from Inductor
+                            file_path = cell.cell_contents.__code__.co_filename
+                            break
+                return inductor_compiled_graph
+            hijacked_compile_fx_inner = (
+                torch._inductor.compile_fx.compile_fx_inner
+            )  # noqa
+        elif torch_release >= (2, 6):
+            # function renamed in 2.6
+            original_load_name = None
+            def hijacked_compile_fx_inner(*args, **kwargs):
+                output = torch._inductor.compile_fx.compile_fx_inner(*args, **kwargs)
+                nonlocal hash_str
+                inductor_compiled_graph = output
+                if inductor_compiled_graph is not None:
+                    nonlocal file_path
+                    compiled_fn = inductor_compiled_graph.current_callable
+                    file_path = compiled_fn.__code__.co_filename  # noqa
+                    if not file_path.startswith(self.base_cache_dir):
+                        # hooked in the align_inputs_from_check_idxs function
+                        # in torch/_inductor/utils.py
+                        for cell in compiled_fn.__closure__:
+                            if not callable(cell.cell_contents):
+                                continue
+                            code = cell.cell_contents.__code__
+                            if code.co_filename.startswith(self.base_cache_dir):
+                                # this is the real file path
+                                # compiled from Inductor
+                                file_path = code.co_filename
+                                break
+                    hash_str = inductor_compiled_graph._fx_graph_cache_key
+                return output
+        def hijack_compiled_fx_graph_hash(*args, **kwargs):
+            out = compiled_fx_graph_hash(*args, **kwargs)
+            nonlocal hash_str
+            hash_str = out[0]
+            return out
+        def _check_can_cache(*args, **kwargs):
+            # no error means it can be cached.
+            # Inductor refuses to cache the graph outside of Dynamo
+            # tracing context, and also disables caching for graphs
+            # with high-order ops.
+            # For vLLM, in either case, we want to cache the graph.
+            # see https://github.com/pytorch/pytorch/blob/9f5ebf3fc609105a74eab4ccc24932d6353ff566/torch/_inductor/codecache.py#L1221 # noqa
+            return
+        def _get_shape_env() -> AlwaysHitShapeEnv:
+            return AlwaysHitShapeEnv()
+        with ExitStack() as stack:
+            # hijack to get the compiled graph itself
+            if original_load_name is not None:
+                stack.enter_context(patch(original_load_name, hijack_load))
+            # for hijacking the hash of the compiled graph
+            stack.enter_context(
+                patch(
+                    "torch._inductor.codecache.compiled_fx_graph_hash",
+                    hijack_compiled_fx_graph_hash,
+                )
+            )
+            # for providing a dummy shape environment
+            stack.enter_context(
+                patch(
+                    "torch._inductor.codecache.FxGraphCache._get_shape_env",
+                    _get_shape_env,
+                )
+            )
+            from torch._functorch._aot_autograd.autograd_cache import AOTAutogradCache
+            # torch 2.8+ on main uses _get_shape_env in AOTAutogradCache
+            if hasattr(AOTAutogradCache, "_get_shape_env"):
+                stack.enter_context(
+                    patch(
+                        "torch._functorch._aot_autograd.autograd_cache.AOTAutogradCache._get_shape_env",
+                        _get_shape_env,
+                    )
+                )
+            # for forcing the graph to be cached
+            stack.enter_context(
+                patch(
+                    "torch._inductor.codecache.FxGraphCache._check_can_cache",
+                    _check_can_cache,
+                )
+            )
+            # Dynamo metrics context, see method for more details.
+            stack.enter_context(self.metrics_context())
+            # Disable remote caching. When these are on, on remote cache-hit,
+            # the monkey-patched functions never actually get called.
+            # vLLM today assumes and requires the monkey-patched functions to
+            # get hit.
+            # TODO(zou3519): we're going to replace this all with
+            # standalone_compile sometime.
+            stack.enter_context(
+                torch._inductor.config.patch(fx_graph_remote_cache=False)
+            )
+            # InductorAdaptor (unfortunately) requires AOTAutogradCache
+            # to be turned off to run. It will fail to acquire the hash_str
+            # and error if not.
+            # StandaloneInductorAdaptor (PyTorch 2.8+) fixes this problem.
+            stack.enter_context(
+                torch._functorch.config.patch(enable_autograd_cache=False)
+            )
+            stack.enter_context(
+                torch._functorch.config.patch(enable_remote_autograd_cache=False)
+            )
+            with pass_context(runtime_shape):
+                compiled_graph = compile_fx(
+                    graph,
+                    example_inputs,
+                    inner_compile=hijacked_compile_fx_inner,
+                    config_patches=current_config,
+                )
+        return compiled_graph, (hash_str, file_path)
+    def load(
+        self,
+        handle: Any,
+        graph: fx.GraphModule,
+        example_inputs: list[Any],
+        graph_index: int,
+        runtime_shape: Optional[int] = None,
+    ) -> Callable:
+        assert isinstance(handle, tuple)
+        assert isinstance(handle[0], str)
+        assert isinstance(handle[1], str)
+        hash_str = handle[0]
+        from torch._functorch._aot_autograd.autograd_cache import AOTAutogradCache
+        from torch._inductor.codecache import FxGraphCache
+        with ExitStack() as exit_stack:
+            exit_stack.enter_context(
+                patch(
+                    "torch._inductor.codecache.FxGraphCache._get_shape_env",
+                    lambda *args, **kwargs: AlwaysHitShapeEnv(),
+                )
+            )
+            # torch 2.8+ on main uses _get_shape_env in AOTAutogradCache
+            if hasattr(AOTAutogradCache, "_get_shape_env"):
+                exit_stack.enter_context(
+                    patch(
+                        "torch._functorch._aot_autograd.autograd_cache.AOTAutogradCache._get_shape_env",
+                        lambda *args, **kwargs: AlwaysHitShapeEnv(),
+                    )
+                )
+            # Dynamo metrics context, see method for more details.
+            exit_stack.enter_context(self.metrics_context())
+            if torch_release[:2] == (2, 5):
+                inductor_compiled_graph = FxGraphCache._lookup_graph(
+                    hash_str, example_inputs, True, False
+                )
+                assert inductor_compiled_graph is not None, (
+                    "Inductor cache lookup failed. Please remove "
+                    "the cache directory and try again."
+                )
+            elif torch_release >= (2, 6):
+                from torch._inductor.output_code import CompiledFxGraphConstantsWithGm
+                constants = CompiledFxGraphConstantsWithGm(graph)
+                inductor_compiled_graph, _ = FxGraphCache._lookup_graph(
+                    hash_str, example_inputs, True, None, constants
+                )
+                assert inductor_compiled_graph is not None, (
+                    "Inductor cache lookup failed. Please remove "
+                    "the cache directory and try again."
+                )
+        # Inductor calling convention (function signature):
+        # f(list) -> tuple
+        # Dynamo calling convention (function signature):
+        # f(*args) -> Any
+        # need to know if the graph returns a tuple
+        from torch._inductor.compile_fx import graph_returns_tuple
+        returns_tuple = graph_returns_tuple(graph)
+        # this is the callable we return to Dynamo to run
+        def compiled_graph(*args):
+            # convert args to list
+            list_args = list(args)
+            graph_output = inductor_compiled_graph(list_args)
+            # unpack the tuple if needed
+            if returns_tuple:
+                return graph_output
+            else:
+                return graph_output[0]
+        return compiled_graph
+    def metrics_context(self) -> contextlib.AbstractContextManager:
+        """
+        This method returns the Dynamo metrics context (if it exists,
+        otherwise a null context). It is used by various compile components.
+        Present in torch>=2.6, it's used inside FxGraphCache in
+        torch==2.6 (but not after). It might also be used in various other
+        torch.compile internal functions.
+        Because it is re-entrant, we always set it (even if entering via Dynamo
+        and the context was already entered). We might want to revisit if it
+        should be set at a different level of compilation.
+        This is likely a bug in PyTorch: public APIs should not rely on
+        manually setting up internal contexts. But we also rely on non-public
+        APIs which might not provide these guarantees.
+        """
+        import torch._dynamo.utils
+        return torch._dynamo.utils.get_metrics_context()
+def set_inductor_config(config, runtime_shape):
+    if isinstance(runtime_shape, int):
+        # for a specific batch size, tuning Triton kernel parameters
+        # can be beneficial
+        config["max_autotune"] = True
+        config["coordinate_descent_tuning"] = True
+class EagerAdapter(CompilerInterface):
+    name = "eager"
+    def compile(
+        self,
+        graph: fx.GraphModule,
+        example_inputs: list[Any],
+        compiler_config: dict[str, Any],
+        runtime_shape: Optional[int] = None,
+        key: Optional[str] = None,
+        num_graphs: int = 1,
+    ) -> tuple[Optional[Callable], Optional[Any]]:
+        return graph, None
+    def load(
+        self,
+        handle: Any,
+        graph: fx.GraphModule,
+        example_inputs: list[Any],
+        graph_index: int,
+        runtime_shape: Optional[int] = None,
+        num_graphs: int = 1,
+    ) -> Callable:
+        raise NotImplementedError("eager compilation is not supported")
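The `load` method above bridges two calling conventions: Inductor's `f(list) -> tuple` and Dynamo's `f(*args) -> Any`. A minimal sketch of that wrapper in plain Python follows. The names `adapt_calling_convention` and `fake_inductor_add` are hypothetical stand-ins introduced for illustration, not part of the diff, and no real Inductor artifact is involved.

```python
from typing import Any, Callable


def adapt_calling_convention(
    inductor_fn: Callable[[list], tuple], returns_tuple: bool
) -> Callable[..., Any]:
    """Wrap an Inductor-style callable so it can be called Dynamo-style."""

    def dynamo_style(*args: Any) -> Any:
        # Inductor-style callables take a single list of inputs.
        graph_output = inductor_fn(list(args))
        # They always return a tuple; unwrap it when the original
        # graph returned a single value rather than a tuple.
        return graph_output if returns_tuple else graph_output[0]

    return dynamo_style


def fake_inductor_add(inputs: list) -> tuple:
    # Hypothetical stand-in for a compiled graph: f(list) -> tuple.
    a, b = inputs
    return (a + b,)


add = adapt_calling_convention(fake_inductor_add, returns_tuple=False)
add_tuple = adapt_calling_convention(fake_inductor_add, returns_tuple=True)
```

Here `add(2, 3)` yields the bare value, while `add_tuple(2, 3)` passes the tuple through unchanged, mirroring the `returns_tuple` branch in `load`.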

sglang/python/sglang/srt/compilation/cuda_piecewise_backend.py ADDED

	@@ -0,0 +1,206 @@

+# Adapted from https://github.com/vllm-project/vllm/blob/v0.10.0/vllm/compilation/cuda_piecewise_backend.py
+import dataclasses
+import logging
+from contextlib import ExitStack
+from typing import Any, Callable, Optional
+from unittest.mock import patch
+import torch
+import torch.fx as fx
+from sglang.srt.compilation.compilation_config import CompilationConfig
+from sglang.srt.compilation.compilation_counter import compilation_counter
+from sglang.srt.compilation.piecewise_context_manager import (
+    get_pcg_capture_stream,
+    is_in_pcg_torch_compile,
+)
+from sglang.srt.compilation.weak_ref_tensor import weak_ref_tensors
+logger = logging.getLogger(__name__)
+@dataclasses.dataclass
+class ConcreteSizeEntry:
+    runtime_shape: int
+    need_to_compile: bool  # the size is in compile_sizes
+    use_cudagraph: bool  # the size is in cudagraph_capture_sizes
+    compiled: bool = False
+    runnable: Callable = None  # type: ignore
+    num_finished_warmup: int = 0
+    cudagraph: Optional[torch.cuda.CUDAGraph] = None
+    output: Optional[Any] = None
+    # for cudagraph debugging, track the input addresses
+    # during capture, and check if they are the same during replay
+    input_addresses: Optional[list[int]] = None
+class CUDAPiecewiseBackend:
+    def __init__(
+        self,
+        graph: fx.GraphModule,
+        compile_config: CompilationConfig,
+        inductor_config: dict[str, Any],
+        graph_pool: Any,
+        piecewise_compile_index: int,
+        total_piecewise_compiles: int,
+        sym_shape_indices: list[int],
+        compiled_graph_for_general_shape: Callable,
+        sglang_backend,
+    ):
+        """
+        The backend for piecewise compilation.
+        It mainly handles the compilation and cudagraph capturing.
+        We will compile `self.graph` once for the general shape,
+        and then compile for the specific shapes listed in
+        `self.compile_sizes` (initialized empty in this adaptation).
+        Independently, we will capture cudagraph for different shapes.
+        If a shape needs both compilation and cudagraph, we will
+        compile it first, and then capture cudagraph.
+        """
+        self.graph = graph
+        self.inductor_config = inductor_config
+        self.graph_pool = graph_pool
+        self.piecewise_compile_index = piecewise_compile_index
+        self.total_piecewise_compiles = total_piecewise_compiles
+        self.sglang_backend = sglang_backend
+        self.is_first_graph = piecewise_compile_index == 0
+        self.is_last_graph = piecewise_compile_index == total_piecewise_compiles - 1
+        self.compile_sizes: set[int] = set()
+        self.compile_config = compile_config
+        self.cudagraph_capture_sizes: set[int] = set(compile_config.get_capture_sizes())
+        self.first_run_finished = False
+        self.compiled_graph_for_general_shape = compiled_graph_for_general_shape  # noqa
+        self.sym_shape_indices = sym_shape_indices
+        # the entries for different shapes that we need to either
+        # compile or capture cudagraph
+        self.concrete_size_entries: dict[int, ConcreteSizeEntry] = {}
+        # to_be_compiled_sizes tracks the remaining sizes to compile,
+        # and updates during the compilation process, so we need to copy it
+        self.to_be_compiled_sizes: set[int] = self.compile_sizes.copy()
+        for shape in self.compile_sizes.union(self.cudagraph_capture_sizes):
+            self.concrete_size_entries[shape] = ConcreteSizeEntry(
+                runtime_shape=shape,
+                need_to_compile=shape in self.compile_sizes,
+                use_cudagraph=shape in self.cudagraph_capture_sizes,
+            )
+    def check_for_ending_compilation(self):
+        if self.is_last_graph and not self.to_be_compiled_sizes:
+            # no specific sizes to compile
+            # save the hash of the inductor graph for the next run
+            self.sglang_backend.compiler_manager.save_to_file()
+    def __call__(self, *args) -> Any:
+        if not self.first_run_finished:
+            self.first_run_finished = True
+            self.check_for_ending_compilation()
+            return self.compiled_graph_for_general_shape(*args)
+        if len(self.sym_shape_indices) == 0:
+            return self.compiled_graph_for_general_shape(*args)
+        runtime_shape = args[self.sym_shape_indices[0]]
+        if runtime_shape not in self.concrete_size_entries:
+            # we don't need to do anything for this shape
+            return self.compiled_graph_for_general_shape(*args)
+        entry = self.concrete_size_entries[runtime_shape]
+        if entry.runnable is None:
+            entry.runnable = self.compiled_graph_for_general_shape
+        if entry.need_to_compile and not entry.compiled:
+            entry.compiled = True
+            self.to_be_compiled_sizes.remove(runtime_shape)
+            # args are real arguments
+            entry.runnable = self.sglang_backend.compiler_manager.compile(
+                self.graph,
+                args,
+                self.inductor_config,
+                graph_index=self.piecewise_compile_index,
+                num_graphs=self.total_piecewise_compiles,
+                runtime_shape=runtime_shape,
+            )
+            # finished compilations for all required shapes
+            if self.is_last_graph and not self.to_be_compiled_sizes:
+                self.check_for_ending_compilation()
+        if is_in_pcg_torch_compile():
+            return entry.runnable(*args)
+        if entry.cudagraph is None:
+            if entry.num_finished_warmup < 1:  # noqa
+                entry.num_finished_warmup += 1
+                return entry.runnable(*args)
+            if self.compile_config.get_enable_debug_mode():
+                input_addresses = [
+                    x.data_ptr() for x in args if isinstance(x, torch.Tensor)
+                ]
+                entry.input_addresses = input_addresses
+            cudagraph = torch.cuda.CUDAGraph()
+            with ExitStack() as stack:
+                if not self.is_first_graph:
+                    # during every model forward, we will capture
+                    # many pieces of cudagraphs (roughly one per layer).
+                    # running gc again and again across layers will
+                    # make the cudagraph capture very slow.
+                    # therefore, we only run gc for the first graph,
+                    # and disable gc for the rest of the graphs.
+                    stack.enter_context(patch("gc.collect", lambda: None))
+                    stack.enter_context(patch("torch.cuda.empty_cache", lambda: None))
+                # Subtle: references and memory must be managed carefully here.
+                stream = get_pcg_capture_stream()
+                assert (
+                    stream is not None
+                ), "PCG capture stream is not set; check whether runtime recompilation happened"
+                with torch.cuda.graph(cudagraph, pool=self.graph_pool, stream=stream):
+                    # `output` is managed by pytorch's cudagraph pool
+                    output = entry.runnable(*args)
+                    if self.is_last_graph:
+                        # by converting it to weak ref,
+                        # the original `output` will immediately be released
+                        # to save memory. It is only safe to do this for
+                        # the last graph, because the output of the last graph
+                        # will not be used by any other cuda graph.
+                        output = weak_ref_tensors(output)
+            # here we always use weak ref for the output
+            # to save memory
+            entry.output = weak_ref_tensors(output)
+            entry.cudagraph = cudagraph
+            compilation_counter.num_cudagraph_captured += 1
+            # important: we need to return the output, rather than
+            # the weak ref of the output, so that pytorch can correctly
+            # manage the memory during cuda graph capture
+            return output
+        if self.compile_config.get_enable_debug_mode():
+            # check if the input addresses are the same
+            new_input_addresses = [
+                x.data_ptr() for x in args if isinstance(x, torch.Tensor)
+            ]
+            assert new_input_addresses == entry.input_addresses, (
+                "Input addresses for cudagraphs are different during replay."
+                f" Expected {entry.input_addresses}, got {new_input_addresses}"
+            )
+        entry.cudagraph.replay()
+        return entry.output
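The control flow of `CUDAPiecewiseBackend.__call__` for a capture size is: one warmup run, then a capture, then replays. The sketch below models only that state machine in plain Python. `ShapeDispatcher` and `Entry` are hypothetical names, no CUDA graph is actually captured, and the initial general-shape run and per-shape compilation are left out. Unlike a real replay, which recomputes into static buffers, this simulation simply returns the value recorded at "capture" time.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Entry:
    runnable: Callable[..., Any]
    num_finished_warmup: int = 0
    captured_output: Optional[Any] = None  # stand-in for cudagraph + output


class ShapeDispatcher:
    def __init__(self, general_fn: Callable[..., Any], capture_sizes: set):
        self.general_fn = general_fn
        self.entries = {size: Entry(general_fn) for size in capture_sizes}
        self.log: list = []  # records which path each call took

    def __call__(self, runtime_shape: int, *rest: Any) -> Any:
        entry = self.entries.get(runtime_shape)
        if entry is None:
            # Shape is not tracked: fall back to the general callable.
            self.log.append("general")
            return self.general_fn(runtime_shape, *rest)
        if entry.captured_output is None:
            if entry.num_finished_warmup < 1:
                # First call for this shape: run eagerly as warmup.
                entry.num_finished_warmup += 1
                self.log.append("warmup")
                return entry.runnable(runtime_shape, *rest)
            # Second call: "capture" by recording the output.
            output = entry.runnable(runtime_shape, *rest)
            entry.captured_output = output
            self.log.append("capture")
            return output
        # Later calls: "replay" the recorded result.
        self.log.append("replay")
        return entry.captured_output


dispatcher = ShapeDispatcher(lambda n: n * 2, capture_sizes={4})
results = [dispatcher(4), dispatcher(4), dispatcher(4), dispatcher(3)]
```

After these four calls, `dispatcher.log` shows the warmup, capture, replay, and general paths in that order.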

sglang/python/sglang/srt/compilation/fix_functionalization.py ADDED

	@@ -0,0 +1,134 @@

+# Adapted from https://github.com/vllm-project/vllm/blob/v0.10.0/vllm/compilation/fix_functionalization.py
+import logging
+import operator
+from collections.abc import Iterable
+from typing import Optional, Union
+import torch
+from torch._higher_order_ops.auto_functionalize import auto_functionalized
+from sglang.srt.compilation.fx_utils import is_func
+from sglang.srt.compilation.inductor_pass import SGLangInductorPass
+logger = logging.getLogger(__name__)
+class FixFunctionalizationPass(SGLangInductorPass):
+    """
+    This pass defunctionalizes certain nodes to avoid redundant tensor copies.
+    After this pass, DCE (dead-code elimination) should never be run,
+    as de-functionalized nodes may appear as dead code.
+    To add new nodes to defunctionalize, add to the if-elif chain in __call__.
+    """
+    def __call__(self, graph: torch.fx.Graph):
+        self.begin()
+        self.dump_graph(graph, "before_fix_functionalization")
+        self.nodes_to_remove: list[torch.fx.Node] = []
+        count = 0
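+        for node in graph.nodes:
+            if not is_func(node, auto_functionalized):
+                continue  # Avoid deep if-elif nesting
+            # No op-specific handlers are registered in this chain yet,
+            # so matching nodes are only counted, not rewritten.
+            count += 1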
+        self.dump_graph(graph, "before_fix_functionalization_cleanup")
+        # Remove the nodes all at once
+        count_removed = len(self.nodes_to_remove)
+        for node in self.nodes_to_remove:
+            graph.erase_node(node)
+        logger.debug(
+            "De-functionalized %s nodes, removed %s nodes", count, count_removed
+        )
+        self.dump_graph(graph, "after_fix_functionalization")
+        self.end_and_log()
+    def _remove(self, node_or_nodes: Union[torch.fx.Node, Iterable[torch.fx.Node]]):
+        """
+        Stage a node (or nodes) for removal at the end of the pass.
+        """
+        if isinstance(node_or_nodes, torch.fx.Node):
+            self.nodes_to_remove.append(node_or_nodes)
+        else:
+            self.nodes_to_remove.extend(node_or_nodes)
+    def defunctionalize(
+        self,
+        graph: torch.fx.Graph,
+        node: torch.fx.Node,
+        mutated_args: dict[int, Union[torch.fx.Node, str]],
+        args: Optional[tuple[Union[torch.fx.Node, str], ...]] = None,
+    ):
+        """
+        De-functionalize a node by replacing it with a call to the original.
+        It also replaces the getitem users with the mutated arguments.
+        See replace_users_with_mutated_args and insert_defunctionalized.
+        """
+        self.replace_users_with_mutated_args(node, mutated_args)
+        self.insert_defunctionalized(graph, node, args=args)
+        self._remove(node)
+    def replace_users_with_mutated_args(
+        self, node: torch.fx.Node, mutated_args: dict[int, Union[torch.fx.Node, str]]
+    ):
+        """
+        Replace all getitem users of the auto-functionalized node with the
+        mutated arguments.
+        :param node: The auto-functionalized node
+        :param mutated_args: The mutated arguments, indexed by getitem index.
+        If the value of an arg is a string, `node.kwargs[arg]` is used.
+        """
+        for idx, user in self.getitem_users(node).items():
+            arg = mutated_args[idx]
+            arg = node.kwargs[arg] if isinstance(arg, str) else arg
+            user.replace_all_uses_with(arg)
+            self._remove(user)
+    def getitem_users(self, node: torch.fx.Node) -> dict[int, torch.fx.Node]:
+        """
+        Returns the operator.getitem users of the auto-functionalized node,
+        indexed by the index they are getting.
+        """
+        users = {}
+        for user in node.users:
+            if is_func(user, operator.getitem):
+                idx = user.args[1]
+                users[idx] = user
+        return users
+    def insert_defunctionalized(
+        self,
+        graph: torch.fx.Graph,
+        node: torch.fx.Node,
+        args: Optional[tuple[Union[torch.fx.Node, str], ...]] = None,
+    ):
+        """
+        Insert a new defunctionalized node into the graph before node.
+        If one of the kwargs is 'out', provide args directly,
+        as node.kwargs cannot be used.
+        See https://github.com/pytorch/pytorch/blob/a00faf440888ffb724bad413f329a49e2b6388e7/torch/_inductor/lowering.py#L351
+        :param graph: Graph to insert the defunctionalized node into
+        :param node: The auto-functionalized node to defunctionalize
+        :param args: If we cannot use kwargs, specify args directly.
+        If an arg is a string, `node.kwargs[arg]` is used.
+        """  # noqa: E501
+        assert is_func(
+            node, auto_functionalized
+        ), f"node must be auto-functionalized, is {node} instead"
+        # Create a new call to the original function
+        with graph.inserting_before(node):
+            function = node.args[0]
+            if args is None:
+                graph.call_function(function, kwargs=node.kwargs)
+            else:
+                # Args passed as strings refer to items in node.kwargs
+                args = tuple(
+                    node.kwargs[arg] if isinstance(arg, str) else arg for arg in args
+                )
+                graph.call_function(function, args=args)
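The rewrite that `defunctionalize` performs (redirect `getitem` users to the mutated argument, insert the in-place call, then erase the staged nodes) can be sketched on a hand-built `torch.fx` graph. This is an illustration under assumptions: `fake_functional_op` is a hypothetical stand-in for an `auto_functionalized` call, and `torch.Tensor.add_` stands in for the original mutating op. Only the graph structure is checked; the graph is not executed.

```python
import operator

import torch
from torch import fx


def fake_functional_op(buf):
    # Hypothetical functional wrapper: returns (None, updated copy of buf).
    return None, buf + 1


graph = fx.Graph()
buf = graph.placeholder("buf")
call = graph.call_function(fake_functional_op, (buf,))
item = graph.call_function(operator.getitem, (call, 1))
out = graph.call_function(torch.neg, (item,))
graph.output(out)

staged = []  # nodes to erase once all rewiring is done

# 1. Point every getitem user at the mutated argument instead.
for user in list(call.users):
    if user.op == "call_function" and user.target is operator.getitem:
        user.replace_all_uses_with(buf)
        staged.append(user)

# 2. Insert the in-place call before the functional node.
with graph.inserting_before(call):
    graph.call_function(torch.Tensor.add_, (buf, 1))
staged.append(call)

# 3. Erase the staged nodes: users first, then the functional node.
for node in staged:
    graph.erase_node(node)

graph.lint()
node_ops = [n.op for n in graph.nodes]
```

The inserted `add_` node has no users, which is why the pass docstring warns against running DCE afterwards.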

sglang/python/sglang/srt/compilation/fx_utils.py ADDED

	@@ -0,0 +1,83 @@

+# Adapted from https://github.com/vllm-project/vllm/blob/v0.10.0/vllm/compilation/fx_utils.py
+import operator
+from collections.abc import Iterable, Iterator
+from typing import Optional
+from torch import fx
+from torch._higher_order_ops.auto_functionalize import auto_functionalized
+from torch._ops import OpOverload
+def is_func(node: fx.Node, target) -> bool:
+    return node.op == "call_function" and node.target == target
+def is_auto_func(node: fx.Node, op: OpOverload) -> bool:
+    return is_func(node, auto_functionalized) and node.args[0] == op
+# Returns the first specified node with the given op (if it exists)
+def find_specified_fn_maybe(
+    nodes: Iterable[fx.Node], op: OpOverload
+) -> Optional[fx.Node]:
+    for node in nodes:
+        if node.target == op:
+            return node
+    return None
+# Returns the first specified node with the given op
+def find_specified_fn(nodes: Iterable[fx.Node], op: OpOverload) -> fx.Node:
+    node = find_specified_fn_maybe(nodes, op)
+    assert node is not None, f"Could not find {op} in nodes {nodes}"
+    return node
+# Returns the first auto_functionalized node with the given op (if it exists)
+def find_auto_fn_maybe(nodes: Iterable[fx.Node], op: OpOverload) -> Optional[fx.Node]:
+    for node in nodes:
+        if is_func(node, auto_functionalized) and node.args[0] == op:  # noqa
+            return node
+    return None
+# Returns the first auto_functionalized node with the given op
+def find_auto_fn(nodes: Iterable[fx.Node], op: OpOverload) -> fx.Node:
+    node = find_auto_fn_maybe(nodes, op)
+    assert node is not None, f"Could not find {op} in nodes {nodes}"
+    return node
+# Returns the getitem node that extracts the idx-th element from node
+# (if it exists)
+def find_getitem_maybe(node: fx.Node, idx: int) -> Optional[fx.Node]:
+    for user in node.users:
+        if is_func(user, operator.getitem) and user.args[1] == idx:
+            return user
+    return None
+# Returns the getitem node that extracts the idx-th element from node
+def find_getitem(node: fx.Node, idx: int) -> fx.Node:
+    ret = find_getitem_maybe(node, idx)
+    assert ret is not None, f"Could not find getitem {idx} in node {node}"
+    return ret
+# An auto-functionalization-aware utility for finding nodes with a specific op
+def find_op_nodes(op: OpOverload, graph: fx.Graph) -> Iterator[fx.Node]:
+    if not op._schema.is_mutable:
+        yield from graph.find_nodes(op="call_function", target=op)
+    for n in graph.find_nodes(op="call_function", target=auto_functionalized):
+        if n.args[0] == op:
+            yield n
+# Asserts that the node only has one user and returns it
+# Even if a node has only 1 user, it might share storage with another node,
+# which might need to be taken into account.
+def get_only_user(node: fx.Node) -> fx.Node:
+    assert len(node.users) == 1
+    return next(iter(node.users))
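A short usage sketch of the helpers above, on a hand-built graph. The helper bodies are copied from this file so the block is self-contained. The graph itself (a `torch.split` followed by two `getitem` nodes) is an assumed example, not something taken from the sglang code base.

```python
import operator
from typing import Optional

import torch
from torch import fx


def is_func(node: fx.Node, target) -> bool:
    return node.op == "call_function" and node.target == target


def find_getitem_maybe(node: fx.Node, idx: int) -> Optional[fx.Node]:
    for user in node.users:
        if is_func(user, operator.getitem) and user.args[1] == idx:
            return user
    return None


def get_only_user(node: fx.Node) -> fx.Node:
    assert len(node.users) == 1
    return next(iter(node.users))


# Build: parts = split(x, 2); result = parts[0] + parts[1]
graph = fx.Graph()
x = graph.placeholder("x")
parts = graph.call_function(torch.split, (x, 2))
first = graph.call_function(operator.getitem, (parts, 0))
second = graph.call_function(operator.getitem, (parts, 1))
total = graph.call_function(operator.add, (first, second))
graph.output(total)

hit = find_getitem_maybe(parts, 1)   # the node extracting element 1
miss = find_getitem_maybe(parts, 5)  # no such getitem user exists
consumer = get_only_user(first)      # the single node that reads parts[0]
```

`find_getitem_maybe` returns `None` for an index nobody reads, whereas `find_getitem` would raise an `AssertionError` in the same case.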