Lekr0 committed on Apr 13

Commit

d02d576

verified ·

1 Parent(s): a402b9b

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.

Files changed (50)

sglang/.claude/skills/add-jit-kernel/SKILL.md +553 -0
sglang/.claude/skills/add-sgl-kernel/SKILL.md +358 -0
sglang/.claude/skills/sglang-bisect-ci-regression/SKILL.md +219 -0
sglang/.claude/skills/write-sglang-test/SKILL.md +248 -0
sglang/benchmark/json_jump_forward/README.md +88 -0
sglang/benchmark/json_jump_forward/bench_other.py +288 -0
sglang/benchmark/json_jump_forward/bench_sglang.py +143 -0
sglang/benchmark/json_jump_forward/build_dataset.py +58 -0
sglang/benchmark/json_jump_forward/dataset.txt +50 -0
sglang/benchmark/multi_turn_chat/bench_other.py +93 -0
sglang/benchmark/multi_turn_chat/data_gen.py +29 -0
sglang/benchmark/tree_of_thought_deep/README.md +51 -0
sglang/benchmark/tree_of_thought_deep/bench_other.py +222 -0
sglang/benchmark/tree_of_thought_deep/bench_sglang.py +171 -0
sglang/docker/configs/.zshrc +27 -0
sglang/docker/configs/opt/.gitconfig +30 -0
sglang/docker/configs/opt/.tmux.conf +27 -0
sglang/docker/configs/opt/.vimrc +45 -0
sglang/docker/configs/yank +12 -0
sglang/python/sglang.egg-info/PKG-INFO +120 -0
sglang/python/sglang.egg-info/SOURCES.txt +0 -0
sglang/python/sglang.egg-info/dependency_links.txt +1 -0
sglang/python/sglang.egg-info/entry_points.txt +2 -0
sglang/python/sglang.egg-info/requires.txt +121 -0
sglang/python/sglang.egg-info/top_level.txt +1 -0
sglang/python/sglang/README.md +18 -0
sglang/python/sglang/__init__.py +83 -0
sglang/python/sglang/__pycache__/__init__.cpython-311.pyc +0 -0
sglang/python/sglang/__pycache__/_version.cpython-311.pyc +0 -0
sglang/python/sglang/__pycache__/bench_serving.cpython-311.pyc +0 -0
sglang/python/sglang/__pycache__/check_env.cpython-311.pyc +0 -0
sglang/python/sglang/__pycache__/global_config.cpython-311.pyc +0 -0
sglang/python/sglang/__pycache__/launch_server.cpython-311.pyc +0 -0
sglang/python/sglang/__pycache__/utils.cpython-311.pyc +0 -0
sglang/python/sglang/__pycache__/version.cpython-311.pyc +0 -0
sglang/python/sglang/_version.py +34 -0
sglang/python/sglang/bench_offline_throughput.py +543 -0
sglang/python/sglang/bench_one_batch.py +837 -0
sglang/python/sglang/bench_one_batch_server.py +49 -0
sglang/python/sglang/bench_serving.py +2238 -0
sglang/python/sglang/benchmark/__init__.py +0 -0
sglang/python/sglang/benchmark/__pycache__/__init__.cpython-311.pyc +0 -0
sglang/python/sglang/benchmark/__pycache__/utils.cpython-311.pyc +0 -0
sglang/python/sglang/benchmark/datasets/__init__.py +47 -0
sglang/python/sglang/benchmark/datasets/__pycache__/__init__.cpython-311.pyc +0 -0
sglang/python/sglang/benchmark/datasets/__pycache__/common.cpython-311.pyc +0 -0
sglang/python/sglang/benchmark/datasets/__pycache__/custom.cpython-311.pyc +0 -0
sglang/python/sglang/benchmark/datasets/__pycache__/generated_shared_prefix.cpython-311.pyc +0 -0
sglang/python/sglang/benchmark/datasets/__pycache__/image.cpython-311.pyc +0 -0
sglang/python/sglang/benchmark/datasets/__pycache__/mmmu.cpython-311.pyc +0 -0

sglang/.claude/skills/add-jit-kernel/SKILL.md ADDED

	@@ -0,0 +1,553 @@

+---
+name: add-jit-kernel
+description: Step-by-step tutorial for adding a new lightweight JIT CUDA kernel to sglang's jit_kernel module
+---
+# Tutorial: Adding a New JIT Kernel to SGLang
+This tutorial walks through adding a simple element-wise scale operation as a JIT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.
+## Goal
+Add a new operation that scales each element of a tensor by a scalar factor:
+- Input: tensor `x` (CUDA) and scalar `factor` (float, encoded as an integer ratio and passed as C++ template arguments)
+- Output: `x * factor` (element-wise), allocated internally unless a pre-allocated `out` tensor is supplied
+- Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)**
+## When to use JIT vs AOT (`sgl-kernel`)
+- **JIT (`jit_kernel`)**: lightweight, few dependencies, rapid iteration, compiled on first use
+- **AOT (`sgl-kernel`)**: depends on CUTLASS / FlashInfer / DeepGEMM, needs pre-built wheel
+---
+## Common Abstractions in `python/sglang/jit_kernel/include/sgl_kernel/`
+**Always prefer these abstractions over raw CUDA primitives.** They provide safety, readability, and consistency with the rest of the codebase.
+### `utils.h` — Host-side utilities
+```cpp
+#include <sgl_kernel/utils.h>
+```
+- **`host::RuntimeCheck(cond, args...)`** — Assert a condition at runtime; throws `PanicError` with file/line info on failure. Prefer this over bare `assert`.
+- **`host::Panic(args...)`** — Unconditionally throw a `PanicError` with a descriptive message.
+- **`host::div_ceil(a, b)`** — Integer ceiling division `(a + b - 1) / b`.
+- **`host::irange(n)`** / **`host::irange(start, end)`** — Range views for cleaner loops.
+- **`host::pointer::offset(ptr, offsets...)`** — Byte-safe pointer arithmetic on `void*`. Use this instead of raw casts.
+### `utils.cuh` — Device-side utilities + `LaunchKernel`
+```cpp
+#include <sgl_kernel/utils.cuh>
+```
+- **Type aliases**: `fp16_t`, `bf16_t`, `fp32_t`, `fp8_e4m3_t`, `fp8_e5m2_t` and their packed variants `fp16x2_t`, `bf16x2_t`, `fp32x2_t`, etc.
+- **`SGL_DEVICE`** — Expands to `__forceinline__ __device__`. Use on all device functions.
+- **`device::kWarpThreads`** — Constant `32`.
+- **`device::load_as<T>(ptr, offset)`** / **`device::store_as<T>(ptr, val, offset)`** — Type-safe loads/stores from `void*`.
+- **`device::pointer::offset(ptr, offsets...)`** — Pointer arithmetic on device.
+- **`host::LaunchKernel(grid, block, device_or_stream [, smem])`** — RAII kernel launcher that:
+  - Resolves the CUDA stream from a `DLDevice` via TVM-FFI automatically.
+  - Checks the CUDA error with file/line info after launch via `operator()(kernel, args...)`.
+  - Supports `.enable_pdl(bool)` for PDL (Programmatic Dependent Launch, SM90+).
+- **`host::RuntimeDeviceCheck(cudaError_t)`** — Check a CUDA error; throw on failure.
+### `tensor.h` — Tensor validation (`TensorMatcher`, Symbolic types)
+```cpp
+#include <sgl_kernel/tensor.h>
+```
+This is the **primary validation API** for all kernel launchers. Use it to validate every `tvm::ffi::TensorView` argument.
+- **`host::SymbolicSize{"name"}`** — A named symbolic dimension. Call `.set_value(n)` to pin it, `.unwrap()` to extract after verification.
+- **`host::SymbolicDType`** — Symbolic dtype. Use `.set_options<Ts...>()` to restrict allowed types.
+- **`host::SymbolicDevice`** — Symbolic device. Use `.set_options<kDLCUDA>()` to restrict to CUDA.
+- **`host::TensorMatcher({dims...})`** — Fluent builder for tensor validation:
+  - `.with_dtype<T>()` — require a specific C++ type (e.g. `fp16_t`)
+  - `.with_dtype<T1, T2, ...>()` — allow a set of types
+  - `.with_device<kDLCUDA>(device_sym)` — require CUDA, bind device to symbol
+  - `.with_strides({strides...})` — validate strides (omit to require contiguous)
+  - `.verify(tensor_view)` — execute the check; throws `PanicError` with full context on failure; **chainable** (`verify(a).verify(b)` to check multiple tensors with the same shape)
+**Typical pattern:**
+```cpp
+auto N = SymbolicSize{"num_elements"};
+auto device = SymbolicDevice{};
+device.set_options<kDLCUDA>();
+TensorMatcher({N})  //
+    .with_dtype<fp16_t>()
+    .with_device(device)
+    .verify(dst)
+    .verify(src);  // same shape, dtype, device as dst
+const size_t n = N.unwrap();
+const DLDevice dev = device.unwrap();
+```
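The binding behaviour in the pattern above (the first verified tensor pins `N`, later tensors must agree with it) can be modelled in a few lines of Python. This is a conceptual sketch only: the `SymbolicSize` class and the `match_shape` helper below are hypothetical Python stand-ins and do not mirror the real C++ API in `tensor.h`.

```python
class SymbolicSize:
    """Hypothetical stand-in for a named symbolic dimension."""

    def __init__(self, name):
        self.name = name
        self.value = None  # unbound until the first tensor is verified

    def bind_or_check(self, actual):
        if self.value is None:
            self.value = actual  # the first verify() pins the dimension
        elif self.value != actual:
            raise ValueError(
                f"size mismatch for '{self.name}': expected {self.value}, got {actual}"
            )

    def unwrap(self):
        if self.value is None:
            raise ValueError(f"'{self.name}' was never bound")
        return self.value


def match_shape(dims, shape):
    """Check one tensor shape against a list of symbolic dims (hypothetical helper)."""
    if len(dims) != len(shape):
        raise ValueError(f"rank mismatch: expected {len(dims)}, got {len(shape)}")
    for dim, actual in zip(dims, shape):
        dim.bind_or_check(actual)


N = SymbolicSize("num_elements")
match_shape([N], (1024,))  # dst binds N = 1024
match_shape([N], (1024,))  # src must agree with the bound value
n = N.unwrap()
```

Note that a matcher built from a single symbol, as in `TensorMatcher({N})`, describes a rank-1 tensor in this model; a 2-D input would fail the rank check.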
+### `type.cuh` — `dtype_trait<T>` and `packed_t<T>`
+```cpp
+#include <sgl_kernel/type.cuh>
+```
+- **`dtype_trait<T>`** — Static trait struct for each scalar type. Provides:
+  - `dtype_trait<T>::from(value)` — convert from another type (e.g. `fp32_t` → `fp16_t`)
+  - `dtype_trait<T>::abs/sqrt/rsqrt/max/min(x)` — type-dispatched math (for `fp32_t`)
+- **`packed_t<T>`** — Two-element packed alias: `packed_t<fp16_t>` = `fp16x2_t`, `packed_t<bf16_t>` = `bf16x2_t`, `packed_t<fp32_t>` = `fp32x2_t`. Use for vectorized loads/stores.
+- **`device::cast<To, From>(value)`** — Type-safe cast using `dtype_trait`, e.g. `cast<fp32x2_t, fp16x2_t>(v)`.
+### `vec.cuh` — Vectorized memory access (`AlignedVector`)
+```cpp
+#include <sgl_kernel/vec.cuh>
+```
+- **`device::AlignedVector<T, N>`** — Aligned storage for N elements of type T. N must be a power of two, `sizeof(T)*N <= 32`. Enables 128-bit vector loads/stores for bandwidth efficiency.
+  - `.load(ptr, offset)` — vectorized load from `ptr[offset]`
+  - `.store(ptr, offset)` — vectorized store to `ptr[offset]`
+  - `.fill(value)` — fill all lanes
+  - `operator[](i)` — element access
+### `tile.cuh` — `tile::Memory` (strided memory access pattern)
+```cpp
+#include <sgl_kernel/tile.cuh>
+```
+- **`device::tile::Memory<T>::cta(blockDim.x)`** — Creates a tile accessor where each thread handles `tid = threadIdx.x` with stride `blockDim.x`. Common for loops over a 1D array.
+- **`.load(ptr, offset)`** — loads `ptr[tid + offset * blockDim.x]`
+- **`.store(ptr, val, offset)`** — stores to `ptr[tid + offset * blockDim.x]`
+- **`.in_bound(n, offset)`** — boundary check
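The access pattern `ptr[tid + offset * blockDim.x]` is easier to see with concrete numbers. The sketch below enumerates which indices each thread would touch in a 1D array; `cta_tile_indices` is a hypothetical helper, and it assumes `in_bound(n, offset)` simply tests `tid + offset * blockDim.x < n`, which the text above does not spell out.

```python
def cta_tile_indices(block_dim, n):
    """Return, per thread, the element indices tid + offset * block_dim below n."""
    per_thread = []
    for tid in range(block_dim):
        indices = []
        offset = 0
        # Assumed meaning of in_bound(n, offset): the computed index is < n.
        while tid + offset * block_dim < n:
            indices.append(tid + offset * block_dim)
            offset += 1
        per_thread.append(indices)
    return per_thread


tiles = cta_tile_indices(block_dim=4, n=10)
```

Neighbouring threads touch neighbouring elements at every step, which is the coalescing-friendly layout the strided loop is meant to give.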
+### `math.cuh` — Device math (`device::math::`)
+```cpp
+#include <sgl_kernel/math.cuh>
+```
+- `device::math::max/min/abs/sqrt/rsqrt<T>(a, b)` — type-dispatched math via `dtype_trait`
+- `device::math::exp/sin/cos(float)` — fast float math wrappers
+### `warp.cuh` — Warp-level primitives
+```cpp
+#include <sgl_kernel/warp.cuh>
+```
+- `device::warp::reduce_sum<T>(value)` — warp-level sum reduction via `__shfl_xor_sync`
+- `device::warp::reduce_max<T>(value)` — warp-level max reduction
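A warp reduction built on `__shfl_xor_sync` is usually a butterfly: in each round a lane combines its value with the lane whose index differs in one bit. The following pure-Python simulation illustrates that idea for a sum; it is a sketch of the general technique, not a transcription of `warp.cuh`.

```python
WARP_THREADS = 32


def warp_reduce_sum(lanes):
    """Simulate an XOR-butterfly all-reduce across one 32-lane warp."""
    values = list(lanes)
    assert len(values) == WARP_THREADS
    mask = WARP_THREADS // 2
    while mask > 0:
        # Every lane reads its partner's value from the previous round.
        values = [values[lane] + values[lane ^ mask] for lane in range(WARP_THREADS)]
        mask //= 2
    return values  # every lane ends up holding the full sum


result = warp_reduce_sum(range(WARP_THREADS))
```

After log2(32) = 5 rounds each lane holds the same total, so no broadcast step is needed afterwards.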
+### `cta.cuh` — CTA-level primitives
+```cpp
+#include <sgl_kernel/cta.cuh>
+```
+- `device::cta::reduce_max<T>(value, smem, min_value)` — CTA-wide max using shared memory + warp reduction. Caller is responsible for a `__syncthreads()` after if the result in `smem[0]` is needed.
+### `atomic.cuh` — Atomic operations
+```cpp
+#include <sgl_kernel/atomic.cuh>
+```
+- `device::atomic::max(float* addr, float value)` — float atomic max (handles negative values correctly via bit tricks).
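One common way to build a float atomic max from integer atomics relies on how IEEE 754 bit patterns order: for values with the sign bit clear, signed-integer order matches float order, and for values with the sign bit set, unsigned-integer order is reversed. The Python sketch below checks that reasoning on float32 patterns. It is an assumption that `atomic.cuh` uses this particular trick, and `max_via_bits` is a hypothetical name; NaN handling is ignored.

```python
import struct

SIGN_BIT = 1 << 31


def float_bits(x):
    """Reinterpret a float32 value as its unsigned 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_float(bits):
    """Inverse of float_bits."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_signed(bits):
    """View an unsigned 32-bit pattern as a signed 32-bit integer."""
    return bits - (1 << 32) if bits & SIGN_BIT else bits


def max_via_bits(stored_bits, value):
    """Return the pattern of max(stored, value) using integer comparisons only."""
    value_bits = float_bits(value)
    if (value_bits & SIGN_BIT) == 0:
        # Sign bit clear: signed-integer max agrees with float max.
        return max(to_signed(stored_bits), to_signed(value_bits)) & 0xFFFFFFFF
    # Sign bit set: a smaller unsigned pattern is the larger float.
    return min(stored_bits, value_bits)


stored = float_bits(float("-inf"))
for v in [-3.5, -1.25, -0.75]:
    stored = max_via_bits(stored, v)
result = bits_float(stored)
```

On the GPU the two branches would map to an integer `atomicMax` and an unsigned `atomicMin` respectively.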
+### `runtime.cuh` — Occupancy and device info
+```cpp
+#include <sgl_kernel/runtime.cuh>
+```
+- `host::runtime::get_blocks_per_sm(kernel, block_dim)` — max active blocks per SM (occupancy)
+- `host::runtime::get_sm_count(device_id)` — number of SMs on the device
+- `host::runtime::get_cc_major(device_id)` — compute capability major version
+**Persistent kernel pattern** (cap blocks to SM count × occupancy):
+```cpp
+static const uint32_t max_occ = runtime::get_blocks_per_sm(kernel, kBlockSize);
+static const uint32_t num_sm  = runtime::get_sm_count(device.unwrap().device_id);
+const auto num_blocks = std::min(num_sm * max_occ, div_ceil(n, kBlockSize));
+LaunchKernel(num_blocks, kBlockSize, device.unwrap())(kernel, params);
+```
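The grid-size arithmetic in the persistent kernel pattern can be checked on the host with plain integers. In this sketch the SM count (132) and occupancy (8 blocks per SM) are made-up illustrative values, and `persistent_grid` is a hypothetical helper name.

```python
def div_ceil(a, b):
    """Integer ceiling division, as described for host::div_ceil."""
    return (a + b - 1) // b


def persistent_grid(n, block_size, num_sm, max_blocks_per_sm):
    """Blocks to launch: enough to cover n, but never more than can be resident at once."""
    needed = div_ceil(n, block_size)
    resident = num_sm * max_blocks_per_sm
    return min(resident, needed)


# Hypothetical device: 132 SMs, 8 resident blocks per SM.
small = persistent_grid(n=1000, block_size=256, num_sm=132, max_blocks_per_sm=8)
large = persistent_grid(n=10_000_000, block_size=256, num_sm=132, max_blocks_per_sm=8)
```

For small inputs the grid is just the number of blocks needed; for large inputs it saturates at the resident-block limit, and each block is expected to loop over several chunks of work.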
+---
+## Step 0 (optional): Generate a `.clangd` config for better IDE support
+```bash
+python -m sglang.jit_kernel
+```
+---
+## Step 1: Implement the CUDA kernel in `jit_kernel/csrc/`
+Create `python/sglang/jit_kernel/csrc/elementwise/scale.cuh`.
+The implementation fully uses the project abstractions described above:
+```cpp
+#include <sgl_kernel/tensor.h>   // TensorMatcher, SymbolicSize, SymbolicDevice
+#include <sgl_kernel/type.cuh>   // dtype_trait, packed_t, device::cast
+#include <sgl_kernel/utils.h>    // RuntimeCheck, div_ceil
+#include <sgl_kernel/utils.cuh>  // LaunchKernel, SGL_DEVICE, fp16_t / bf16_t / fp32_t
+#include <sgl_kernel/vec.cuh>    // AlignedVector
+#include <dlpack/dlpack.h>
+#include <tvm/ffi/container/tensor.h>
+#include <algorithm>  // std::max
+#include <cstdint>    // uint32_t, int32_t
+namespace {
+// ----------------------------------------------------------------
+// Kernel: element-wise scale using vectorized 128-bit loads/stores
+// T       = fp16_t | bf16_t | fp32_t
+// kVecN   = number of elements per vector load (e.g. 8 for fp16)
+// kFactor = scale factor encoded as kFactorNumer / kFactorDenom
+// ----------------------------------------------------------------
+template <typename T, int kVecN, int32_t kFactorNumer, int32_t kFactorDenom>
+__global__ void scale_kernel(T* __restrict__ dst,
+                              const T* __restrict__ src,
+                              uint32_t n_vecs,
+                              uint32_t n_remainder,
+                              uint32_t n_total) {
+  constexpr float kFactor = static_cast<float>(kFactorNumer)
+                          / static_cast<float>(kFactorDenom);
+  using vec_t = device::AlignedVector<T, kVecN>;
+  // --- vectorised body ---
+  const uint32_t vec_stride = blockDim.x * gridDim.x;
+  for (uint32_t vi = blockIdx.x * blockDim.x + threadIdx.x;
+       vi < n_vecs;
+       vi += vec_stride) {
+    vec_t v;
+    v.load(src, vi);
+#pragma unroll
+    for (int i = 0; i < kVecN; ++i) {
+      v[i] = static_cast<T>(static_cast<float>(v[i]) * kFactor);
+    }
+    v.store(dst, vi);
+  }
+  // --- scalar tail ---
+  const uint32_t base = n_vecs * kVecN;
+  const uint32_t scalar_stride = blockDim.x * gridDim.x;
+  for (uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
+       i < n_remainder;
+       i += scalar_stride) {
+    dst[base + i] = static_cast<T>(static_cast<float>(src[base + i]) * kFactor);
+  }
+}
+// ----------------------------------------------------------------
+// Launcher: validates tensors, selects vector width, launches kernel
+// ----------------------------------------------------------------
+template <typename T, int32_t kFactorNumer, int32_t kFactorDenom>
+void scale(tvm::ffi::TensorView dst, tvm::ffi::TensorView src) {
+  using namespace host;
+  // 1. Validate input tensors with TensorMatcher
+  SymbolicSize N = {"num_elements"};
+  SymbolicDevice device_;
+  device_.set_options<kDLCUDA>();
+  TensorMatcher({N})  //
+      .with_dtype<T>()
+      .with_device(device_)
+      .verify(dst)
+      .verify(src);  // same shape / dtype / device as dst
+  const uint32_t n         = static_cast<uint32_t>(N.unwrap());
+  const DLDevice device    = device_.unwrap();
+  RuntimeCheck(n > 0, "scale: num_elements must be > 0, got ", n);
+  // 2. Choose vector width for 128-bit loads (16 bytes)
+  //    fp16/bf16: 8 elements × 2 bytes = 16 bytes
+  //    fp32:      4 elements × 4 bytes = 16 bytes
+  constexpr int kVecN    = 16 / sizeof(T);
+  const uint32_t n_vecs      = n / kVecN;
+  const uint32_t n_remainder = n % kVecN;
+  // 3. Launch
+  constexpr uint32_t kBlockSize = 256;
+  const uint32_t grid           = div_ceil(std::max(n_vecs, n_remainder), kBlockSize);
+  LaunchKernel(grid, kBlockSize, device)(
+      scale_kernel<T, kVecN, kFactorNumer, kFactorDenom>,
+      static_cast<T*>(dst.data_ptr()),
+      static_cast<const T*>(src.data_ptr()),
+      n_vecs,
+      n_remainder,
+      n);
+}
+}  // namespace
+```
+**Key points:**
+- Include headers from `sgl_kernel/` — **not** raw CUDA headers for anything already covered
+- Use `TensorMatcher` for all tensor validation; never manually check shape/dtype/device
+- Use `AlignedVector` for vectorised 128-bit loads/stores — significant bandwidth win
+- Use `LaunchKernel` — it resolves the stream and checks errors automatically
+- Use `RuntimeCheck` for runtime assertions with useful error messages
+- `fp16_t` / `bf16_t` / `fp32_t` are the project's type aliases (from `utils.cuh`)
+- `device::cast<To, From>` or `dtype_trait<T>::from(val)` for cross-type conversions
+- `device::math::` functions for device math instead of bare `__` intrinsics
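The launcher's split into 128-bit vectors plus a scalar tail, and the resulting grid size, can be reproduced with integer arithmetic. The sketch below mirrors the expressions used in `scale` above; `launch_shape` is a hypothetical helper written only for illustration.

```python
def launch_shape(n, elem_bytes, block_size=256):
    """Split n elements into 16-byte vectors plus a scalar tail, and size the grid."""
    vec_n = 16 // elem_bytes   # elements per 128-bit access
    n_vecs = n // vec_n        # number of full vectors
    n_remainder = n % vec_n    # leftover elements handled one at a time
    work = max(n_vecs, n_remainder)
    grid = (work + block_size - 1) // block_size
    return vec_n, n_vecs, n_remainder, grid


fp16_case = launch_shape(n=4097, elem_bytes=2)
fp32_case = launch_shape(n=127, elem_bytes=4)
```

Sizes such as 4097 and 127 are useful in tests precisely because they leave a non-empty tail.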
+---
+## Step 2: Add the Python wrapper in `jit_kernel/`
+Create `python/sglang/jit_kernel/scale.py`:
+```python
+from __future__ import annotations
+from typing import TYPE_CHECKING
+import torch
+from sglang.jit_kernel.utils import cache_once, load_jit, make_cpp_args
+if TYPE_CHECKING:
+    from tvm_ffi.module import Module
+@cache_once
+def _jit_scale_module(dtype: torch.dtype, factor_numer: int, factor_denom: int) -> Module:
+    """Compile and cache the JIT scale module for a given dtype and factor."""
+    args = make_cpp_args(dtype, factor_numer, factor_denom)
+    return load_jit(
+        "scale",
+        *args,
+        cuda_files=["elementwise/scale.cuh"],
+        cuda_wrappers=[("scale", f"scale<{args}>")],
+    )
+def scale(src: torch.Tensor, factor: float, out: torch.Tensor | None = None) -> torch.Tensor:
+    """
+    Element-wise scale: dst = src * factor.
+    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.
+    Parameters
+    ----------
+    src    : CUDA tensor (FP16 / BF16 / FP32)
+    factor : scale factor
+    out    : optional pre-allocated output tensor (same shape/dtype as src)
+    Returns
+    -------
+    Scaled tensor (dst = src * factor).
+    """
+    assert src.is_cuda, "src must be a CUDA tensor"
+    assert src.dtype in (torch.float16, torch.bfloat16, torch.float32), (
+        f"Unsupported dtype {src.dtype}. Supported: float16, bfloat16, float32"
+    )
+    if out is None:
+        out = torch.empty_like(src)
+    else:
+        assert out.shape == src.shape, "out shape must match src"
+        assert out.dtype == src.dtype,  "out dtype must match src"
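+    # Encode factor as an integer ratio so it can be passed as C++ template arguments.
+    # denom=1000 keeps 3 decimal places; finer factors are rounded, and each distinct
+    # (dtype, numer, denom) combination is built and cached as its own module.
+    factor_denom = 1000
+    factor_numer = round(factor * factor_denom)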
+    module = _jit_scale_module(src.dtype, factor_numer, factor_denom)
+    module.scale(out, src)
+    return out
+```
+**Key points:**
+- Use `cache_once` — **not** `functools.lru_cache` (incompatible with `torch.compile`)
+- `load_jit` first arg(s) form the unique build marker; same marker = same cached binary
+- `cuda_wrappers`: `(export_name, kernel_symbol)` — `export_name` is called from Python
+- `make_cpp_args(dtype, ...)` converts `torch.dtype` to C++ type alias:
+| `torch.dtype`      | C++ type   |
+|--------------------|------------|
+| `torch.float16`    | `fp16_t`   |
+| `torch.bfloat16`   | `bf16_t`   |
+| `torch.float32`    | `fp32_t`   |
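Encoding the factor as `numer / 1000` means some factors survive exactly while others are quantised. A small stand-alone sketch of that round trip, using hypothetical helper names `encode_factor` and `decode_factor`:

```python
FACTOR_DENOM = 1000


def encode_factor(factor, denom=FACTOR_DENOM):
    """Encode a float as an integer ratio usable as template arguments."""
    return round(factor * denom), denom


def decode_factor(numer, denom):
    """Value the kernel would actually multiply by."""
    return numer / denom


exact = decode_factor(*encode_factor(0.5))      # representable with 3 decimals
lossy = decode_factor(*encode_factor(0.12345))  # rounded to 3 decimals
```

If a caller needs finer resolution, a larger denominator is one option, at the cost of more distinct template instantiations to compile.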
+---
+## Step 3 (optional): Tune JIT build flags
+```python
+return load_jit(
+    "scale",
+    *args,
+    cuda_files=["elementwise/scale.cuh"],
+    cuda_wrappers=[("scale", f"scale<{args}>")],
+    extra_cuda_cflags=["-O3", "--use_fast_math"],
+)
+```
+If your kernel requires SM90+, raise a clear Python error before calling `load_jit`:
+```python
+if torch.cuda.get_device_capability()[0] < 9:
+    raise RuntimeError("This kernel requires SM90 (Hopper) or later")
+```
+---
+## Step 4: Write tests (required)
+Create `python/sglang/jit_kernel/tests/test_scale.py`:
+```python
+import pytest
+import torch
+from sglang.jit_kernel.scale import scale
+@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
+@pytest.mark.parametrize("size", [1, 127, 128, 1024, 4097])  # cover tail remainder
+@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0, 3.0])
+def test_scale_correctness(dtype, size, factor):
+    src = torch.randn(size, dtype=dtype, device="cuda")
+    out = scale(src, factor)
+    expected = src * factor
+    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
+    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)
+@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
+def test_scale_out_param(dtype):
+    src = torch.randn(1024, dtype=dtype, device="cuda")
+    out = torch.empty_like(src)
+    result = scale(src, 2.0, out=out)
+    assert result is out
+    torch.testing.assert_close(out, src * 2.0, rtol=1e-2, atol=1e-2)
+def test_scale_cpu_error():
+    src = torch.randn(128, dtype=torch.float16)  # CPU tensor
+    with pytest.raises(AssertionError, match="CUDA"):
+        scale(src, 2.0)
+def test_scale_unsupported_dtype():
+    src = torch.randint(0, 10, (128,), dtype=torch.int32, device="cuda")
+    with pytest.raises(AssertionError, match="Unsupported dtype"):
+        scale(src, 2.0)
+if __name__ == "__main__":
+    pytest.main([__file__, "-v", "-s"])
+```
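The looser tolerances for FP16 and BF16 follow from their precision: FP16 keeps 10 explicit mantissa bits and BF16 keeps 7, so a single rounding can introduce a relative error of up to about 2^-11 and 2^-8 respectively. The sketch below measures this with the standard library only; the BF16 rounding is emulated by hand (round-to-nearest-even on the float32 pattern) and is an illustration, not the conversion PyTorch performs internally.

```python
import struct


def round_to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x):
    """Emulate bfloat16: keep the top 16 bits of the float32 pattern, rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def rel_error(rounded, exact):
    return abs(rounded - exact) / abs(exact)


x = 1.2345
fp16_err = rel_error(round_to_fp16(x), x)
bf16_err = rel_error(round_to_bf16(x), x)
```

Both bounds sit well below the `1e-2` tolerance used for the half-precision cases, which leaves headroom for the extra rounding in the reference computation.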
+---
+## Step 5: Add a benchmark (required)
+Create `python/sglang/jit_kernel/benchmark/bench_scale.py`:
+```python
+import itertools
+import torch
+import triton
+import triton.testing
+from sglang.jit_kernel.benchmark.utils import (
+    DEFAULT_DEVICE,
+    DEFAULT_DTYPE,
+    get_benchmark_range,
+    run_benchmark,
+)
+from sglang.jit_kernel.scale import scale as jit_scale
+SIZE_LIST = get_benchmark_range(
+    full_range=[2**n for n in range(10, 20)],  # 1K … 512K elements
+    ci_range=[4096, 65536],
+)
+configs = list(itertools.product(SIZE_LIST))
+@triton.testing.perf_report(
+    triton.testing.Benchmark(
+        x_names=["size"],
+        x_vals=configs,
+        line_arg="provider",
+        line_vals=["jit", "torch"],
+        line_names=["SGL JIT Kernel", "PyTorch"],
+        styles=[("blue", "-"), ("red", "--")],
+        ylabel="us",
+        plot_name="scale-performance",
+        args={},
+    )
+)
+def benchmark(size: int, provider: str):
+    src = torch.randn(size, dtype=DEFAULT_DTYPE, device=DEFAULT_DEVICE)
+    factor = 2.0
+    if provider == "jit":
+        fn = lambda: jit_scale(src, factor)
+    else:
+        fn = lambda: src * factor
+    return run_benchmark(fn)
+if __name__ == "__main__":
+    benchmark.run(print_data=True)
+```
+Run:
+```bash
+python python/sglang/jit_kernel/benchmark/bench_scale.py
+```
+---
+## Troubleshooting
+- **JIT compilation fails**: ensure the `.cuh` file is under `python/sglang/jit_kernel/csrc/`; reduce template argument combinations
+- **CUDA crash / illegal memory access**: `CUDA_LAUNCH_BLOCKING=1`; `compute-sanitizer --tool memcheck python ...`
+- **Unstable benchmark results**: `run_benchmark` uses CUDA-graph-based timing by default
+---
+## References
+- `docs/developer_guide/development_jit_kernel_guide.md`
+- `python/sglang/jit_kernel/utils.py` — `cache_once`, `load_jit`, `make_cpp_args`
+- `python/sglang/jit_kernel/include/sgl_kernel/tensor.h` — `TensorMatcher`, `SymbolicSize/DType/Device`
+- `python/sglang/jit_kernel/include/sgl_kernel/utils.cuh` — type aliases, `LaunchKernel`, `SGL_DEVICE`
+- `python/sglang/jit_kernel/include/sgl_kernel/vec.cuh` — `AlignedVector`
+- `python/sglang/jit_kernel/include/sgl_kernel/tile.cuh` — `tile::Memory`
+- `python/sglang/jit_kernel/include/sgl_kernel/type.cuh` — `dtype_trait`, `packed_t`, `device::cast`
+- `python/sglang/jit_kernel/include/sgl_kernel/math.cuh` — `device::math::`
+- `python/sglang/jit_kernel/include/sgl_kernel/warp.cuh` — `warp::reduce_sum/max`
+- `python/sglang/jit_kernel/include/sgl_kernel/cta.cuh` — `cta::reduce_max`
+- `python/sglang/jit_kernel/include/sgl_kernel/atomic.cuh` — `atomic::max`
+- `python/sglang/jit_kernel/include/sgl_kernel/runtime.cuh` — occupancy / SM count helpers
+- `python/sglang/jit_kernel/csrc/add_constant.cuh` — minimal runnable reference
+- `python/sglang/jit_kernel/csrc/elementwise/rmsnorm.cuh` — real example using `TensorMatcher` + `LaunchKernel` + `tile::Memory`
+- `python/sglang/jit_kernel/csrc/elementwise/qknorm.cuh` — real example using `runtime::get_blocks_per_sm` + persistent kernel pattern
+- `python/sglang/jit_kernel/benchmark/utils.py` — benchmark helpers
+## Summary of Files Created
+```
+python/sglang/jit_kernel/csrc/elementwise/scale.cuh   # NEW: CUDA kernel
+python/sglang/jit_kernel/scale.py                     # NEW: Python wrapper
+python/sglang/jit_kernel/tests/test_scale.py          # NEW: Tests
+python/sglang/jit_kernel/benchmark/bench_scale.py     # NEW: Benchmark
+```

sglang/.claude/skills/add-sgl-kernel/SKILL.md ADDED

	@@ -0,0 +1,358 @@

+---
+name: add-sgl-kernel
+description: Step-by-step tutorial for adding a heavyweight AOT CUDA/C++ kernel to sgl-kernel (including tests & benchmarks)
+---
+# Tutorial: Adding a New Kernel to `sgl-kernel` (AOT / Heavyweight)
+This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.
+## Goal
+Add a new operation that scales each element of a tensor by a scalar factor:
+- Input: tensor `x` (CUDA) and scalar `factor` (float)
+- Output: `x * factor` (element-wise, in-place or into pre-allocated `out`)
+- Supported dtypes: **FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)**
+  - Dispatched via `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro (defined in `sgl-kernel/include/utils.h`)
+## Two rules of thumb (must follow)
+1. **Heavyweight kernels go to `sgl-kernel`.** If it depends on CUTLASS / FlashInfer / DeepGEMM (or similarly heavy stacks), implement it in `sgl-kernel/`.
+2. **Lightweight kernels go to `python/sglang/jit_kernel`.** If it is small, has few dependencies, and benefits from rapid iteration, implement it as a JIT kernel instead.
+In addition, every new kernel must ship with:
+- **Tests** (pytest)
+- **A benchmark script** (triton.testing)
+---
+## Repository integration map
+You will typically touch these files/areas:
+- Implementation: `sgl-kernel/csrc/elementwise/scale.cu` (pick the right subdirectory)
+- Public declarations: `sgl-kernel/include/sgl_kernel_ops.h`
+- Torch extension registration: `sgl-kernel/csrc/common_extension.cc`
+- Build: `sgl-kernel/CMakeLists.txt` (`set(SOURCES ...)`)
+- Python API: `sgl-kernel/python/sgl_kernel/` and `sgl-kernel/python/sgl_kernel/__init__.py`
+- Tests: `sgl-kernel/tests/test_scale.py`
+- Benchmarks: `sgl-kernel/benchmark/bench_scale.py`
+---
+## Step 1: Implement the kernel in `csrc/`
+Pick the right subdirectory:
+- `csrc/elementwise/` — for element-wise ops (our example)
+- `csrc/gemm/`, `csrc/attention/`, `csrc/moe/` — for other categories
+Create `sgl-kernel/csrc/elementwise/scale.cu`:
+```cpp
+#include <ATen/cuda/CUDAContext.h>
+#include <c10/cuda/CUDAGuard.h>
+#include <torch/all.h>
+#include "utils.h"  // DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16
+// scale_kernel: out[i] = input[i] * factor
+// Supports float, half (__half), __nv_bfloat16 via template T
+template <typename T>
+__global__ void scale_kernel(T* __restrict__ out,
+                              const T* __restrict__ input,
+                              float factor,
+                              int64_t n) {
+  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
+  if (idx < n) {
+    out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
+  }
+}
+void scale(at::Tensor& out, const at::Tensor& input, double factor) {
+  TORCH_CHECK(input.is_cuda(),       "input must be a CUDA tensor");
+  TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
+  TORCH_CHECK(out.is_cuda(),         "out must be a CUDA tensor");
+  TORCH_CHECK(out.is_contiguous(),   "out must be contiguous");
+  TORCH_CHECK(out.sizes() == input.sizes(),  "out and input must have the same shape");
+  TORCH_CHECK(out.scalar_type() == input.scalar_type(),
+              "out and input must have the same dtype");
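+  const int64_t n = input.numel();
+  if (n == 0) return;  // nothing to do; a zero-sized grid is an invalid launch
+  const int threads = 256;
+  const int blocks  = static_cast<int>((n + threads - 1) / threads);
+  // Switch to the input's device first, then query that device's current stream
+  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
+  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();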
+  // Dispatches over float, float16, bfloat16
+  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16(input.scalar_type(), c_type, [&] {
+    scale_kernel<c_type><<<blocks, threads, 0, stream>>>(
+        static_cast<c_type*>(out.data_ptr()),
+        static_cast<const c_type*>(input.data_ptr()),
+        static_cast<float>(factor),
+        n);
+    cudaError_t status = cudaGetLastError();
+    TORCH_CHECK(status == cudaSuccess,
+                "scale_kernel launch failed: ", cudaGetErrorString(status));
+    return true;
+  });
+}
+```
+**Key points:**
+- Use `at::Tensor` (PyTorch tensors), `TORCH_CHECK` for validation, `at::cuda::getCurrentCUDAStream()` for stream
+- `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` covers `float`, `half` (FP16), `__nv_bfloat16` (BF16)
+- Add device error checking after every kernel launch
+- If a kernel only works on certain architectures, enforce that with `TORCH_CHECK` and skip logic in tests
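The one-thread-per-element launch above can be emulated on the host to see why the `idx < n` guard matters. This is a plain-Python reference sketch (`scale_reference` is a hypothetical name) that walks the same block and thread indices the CUDA kernel would.

```python
def scale_reference(src, factor, threads=256):
    """Emulate the launch: each (block, thread) pair handles out[idx] = src[idx] * factor."""
    n = len(src)
    blocks = (n + threads - 1) // threads  # same ceiling division as the launcher
    out = [None] * n
    for block_idx in range(blocks):
        for thread_idx in range(threads):
            idx = block_idx * threads + thread_idx
            if idx < n:  # bounds guard: the last block is usually only partly used
                out[idx] = src[idx] * factor
    return out, blocks


out, blocks = scale_reference([float(i) for i in range(300)], 2.0)
```

With 300 elements and 256 threads per block, two blocks are launched and 212 threads of the second block fall outside the array, so they must do nothing.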
+---
+## Step 2: Add a C++ declaration in `include/sgl_kernel_ops.h`
+Edit `sgl-kernel/include/sgl_kernel_ops.h`, add to the elementwise section:
+```cpp
+void scale(at::Tensor& out, const at::Tensor& input, double factor);
+```
+---
+## Step 3: Register the op in `csrc/common_extension.cc`
+Edit `sgl-kernel/csrc/common_extension.cc`, inside `TORCH_LIBRARY_FRAGMENT(sgl_kernel, m)`:
+```cpp
+// From csrc/elementwise
+m.def("scale(Tensor! out, Tensor input, float factor) -> ()");
+m.impl("scale", torch::kCUDA, &scale);
+```
+**Key points:**
+- `Tensor!` means in-place / mutable output argument
+- The schema is important for `torch.compile` and for consistent call signatures
+- A schema `float` maps to C++ `double`, so the registered function takes `double factor`; narrowing it to `float` inside the launcher (as `scale` does) is fine for scalars. Use shims if other argument types need conversion
+---
+## Step 4: Add the new source file to `CMakeLists.txt`
+Edit `sgl-kernel/CMakeLists.txt`, add to `set(SOURCES ...)`:
+```cmake
+csrc/elementwise/scale.cu
+```
+**Key points:**
+- Keep the list **alphabetically sorted** (the file explicitly requires this)
+- If the kernel has arch constraints, reflect that in tests/benchmarks via skip logic
+---
+## Step 5: Expose a Python API under `sgl-kernel/python/sgl_kernel/`
+In `sgl-kernel/python/sgl_kernel/__init__.py`, add:
+```python
+import torch
+def scale(out: torch.Tensor, input: torch.Tensor, factor: float) -> None:
+    """
+    Element-wise scale: out = input * factor (in-place into out).
+    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.
+    Parameters
+    ----------
+    out    : pre-allocated CUDA output tensor (same shape/dtype as input)
+    input  : CUDA input tensor
+    factor : scale factor (float)
+    """
+    torch.ops.sgl_kernel.scale(out, input, factor)
+```
+Or export it from the existing module organisation — follow the pattern already used by similar ops in `__init__.py`.
+---
+## Step 6: Write tests (required)
+Create `sgl-kernel/tests/test_scale.py`:
+```python
+import pytest
+import torch
+import sgl_kernel
+@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
+@pytest.mark.parametrize("size", [128, 1024, 4096, 65536])
+@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0])
+def test_scale_correctness(dtype, size, factor):
+    input = torch.randn(size, dtype=dtype, device="cuda")
+    out   = torch.empty_like(input)
+    sgl_kernel.scale(out, input, factor)
+    expected = input * factor
+    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
+    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)
+def test_scale_shape_mismatch():
+    input = torch.randn(128, dtype=torch.float16, device="cuda")
+    out   = torch.empty(256, dtype=torch.float16, device="cuda")
+    with pytest.raises(RuntimeError, match="same shape"):
+        sgl_kernel.scale(out, input, 2.0)
+def test_scale_cpu_input():
+    input = torch.randn(128, dtype=torch.float16)  # CPU
+    out   = torch.empty_like(input)
+    with pytest.raises(RuntimeError, match="CUDA"):
+        sgl_kernel.scale(out, input, 2.0)
+if __name__ == "__main__":
+    pytest.main([__file__, "-q"])
+```
+Run:
+```bash
+pytest sgl-kernel/tests/test_scale.py -q
+```
+---
+## Step 7: Add a benchmark (required)
+Create `sgl-kernel/benchmark/bench_scale.py`:
+```python
+import itertools
+import os
+import torch
+import triton
+import triton.testing
+import sgl_kernel
+IS_CI = (
+    os.getenv("CI", "false").lower() == "true"
+    or os.getenv("GITHUB_ACTIONS", "false").lower() == "true"
+)
+dtypes  = [torch.float16] if IS_CI else [torch.float16, torch.bfloat16, torch.float32]
+sizes   = [4096] if IS_CI else [2**n for n in range(10, 20)]  # 1K … 512K
+factors = [2.0]
+configs = list(itertools.product(dtypes, sizes))
+def torch_scale(input: torch.Tensor, factor: float) -> torch.Tensor:
+    return input * factor
+@triton.testing.perf_report(
+    triton.testing.Benchmark(
+        x_names=["dtype", "size"],
+        x_vals=configs,
+        line_arg="provider",
+        line_vals=["sglang", "torch"],
+        line_names=["SGL Kernel", "PyTorch"],
+        styles=[("green", "-"), ("red", "--")],
+        ylabel="µs (median)",
+        plot_name="scale-performance",
+        args={},
+    )
+)
+def benchmark(dtype, size, provider):
+    input  = torch.randn(size, dtype=dtype, device="cuda")
+    out    = torch.empty_like(input)
+    factor = 2.0
+    if provider == "sglang":
+        fn = lambda: sgl_kernel.scale(out, input, factor)
+    else:
+        fn = lambda: torch_scale(input, factor)
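+    # quantiles=[0.5, 0.2, 0.8] yields (median, 20th percentile, 80th percentile) in ms
+    ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
+        fn, quantiles=[0.5, 0.2, 0.8]
+    )
+    # Return (median, low, high) converted from ms to µs
+    return 1000 * ms, 1000 * min_ms, 1000 * max_ms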
+if __name__ == "__main__":
+    benchmark.run(print_data=True)
+```
+Run:
+```bash
+python sgl-kernel/benchmark/bench_scale.py
+```
+---
+## Step 8: Build and validate
+Build:
+```bash
+cd sgl-kernel
+make build -j16
+```
+If you need to limit host resource usage:
+```bash
+cd sgl-kernel
+make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"
+```
+Validate:
+```bash
+pytest sgl-kernel/tests/test_scale.py -q
+python sgl-kernel/benchmark/bench_scale.py
+```
+---
+## Troubleshooting
+- **Async CUDA errors**: rerun with `CUDA_LAUNCH_BLOCKING=1` so the error is reported at the failing kernel launch
+- **Memory errors**: run under `compute-sanitizer --tool memcheck python ...`
+- **Build is too slow / OOM**: reduce `MAX_JOBS` and `SGL_KERNEL_COMPILE_THREADS`
+- **Binary bloat**: use `sgl-kernel/analyze_whl_kernel_sizes.py`
+- **CMake sources list**: if your `.cu` file is missing from `SOURCES`, the symbol will be undefined at link time
+---
+## References
+- `sgl-kernel/README.md`
+- `sgl-kernel/include/sgl_kernel_ops.h`
+- `sgl-kernel/csrc/common_extension.cc`
+- `sgl-kernel/CMakeLists.txt`
+- `sgl-kernel/include/utils.h` — `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro and friends
+- `sgl-kernel/csrc/elementwise/activation.cu` — reference for the FP16/BF16/FP32 dispatch pattern
+## Summary of Files Created/Modified
+```
+sgl-kernel/csrc/elementwise/scale.cu          # NEW: CUDA kernel + launcher
+sgl-kernel/include/sgl_kernel_ops.h           # MODIFIED: C++ declaration
+sgl-kernel/csrc/common_extension.cc           # MODIFIED: schema + dispatch registration
+sgl-kernel/CMakeLists.txt                     # MODIFIED: add source file (alphabetical)
+sgl-kernel/python/sgl_kernel/__init__.py      # MODIFIED: export Python API
+sgl-kernel/tests/test_scale.py                # NEW: tests
+sgl-kernel/benchmark/bench_scale.py           # NEW: benchmark
+```

sglang/.claude/skills/sglang-bisect-ci-regression/SKILL.md ADDED

	@@ -0,0 +1,219 @@

+# SGLang Bisect CI Regression
+Investigate a consistently failing CI test to find the root cause - whether it's a code regression from a specific PR, a hardware/runner-specific issue, or an environment change. Optionally reproduce the failure on a remote GPU server.
+## Slash Command
+`/sglang-bisect-ci-regression <test_name_or_ci_url> [ssh_target] [docker_container]`
+## When to Use This Skill
+- A CI test is failing consistently on main (scheduled runs)
+- You need to find which PR introduced a regression
+- You suspect a runner-specific or GPU-specific issue
+- You want to reproduce a CI failure on a remote server
+## Arguments
+- **First argument (required)**: Test file name (e.g. `test_lora_tp.py`) or a GitHub Actions job URL
+- **Second argument (optional)**: SSH target for remote reproduction (e.g. `user@host`)
+- **Third argument (optional)**: Docker container name on the SSH target (e.g. `sglang_dev`)
+If the SSH target and Docker container are not provided, the skill performs only the CI log analysis and bisection and skips remote reproduction. **Ask the user** for these if reproduction is needed and they weren't provided.
+## Background: Scheduled CI Runs
+SGLang uses the `pr-test.yml` workflow with **scheduled runs** (cron-triggered) to periodically test the `main` branch. These runs are the primary data source for detecting regressions:
+- **Workflow**: `pr-test.yml` with `event: schedule`
+- **Branch**: `main`
+- **Dashboard**: https://github.com/sgl-project/sglang/actions/workflows/pr-test.yml?query=event%3Aschedule
+- **Frequency**: Runs multiple times daily, each pinned to the HEAD of `main` at trigger time
+- **Purpose**: Catches regressions that slip through PR-level CI (e.g., interaction bugs between merged PRs, hardware-specific issues)
+Always use these scheduled runs (not PR-triggered runs) when bisecting regressions on `main`. The `--event schedule` filter in `gh run list` ensures you only see these periodic main-branch runs.
+## Workflow
+### Phase 1: Extract the Failure Signature
+1. **Get the failing test details from CI logs.** If given a URL, fetch logs directly. If given a test name, find recent scheduled runs of `pr-test.yml` on `main` that failed:
+```bash
+# List recent scheduled runs targeting main (the primary source of truth for regressions)
+# These are cron-triggered runs visible at:
+# https://github.com/sgl-project/sglang/actions/workflows/pr-test.yml?query=event%3Aschedule
+gh run list --repo sgl-project/sglang --workflow="pr-test.yml" --event schedule --branch main --limit 20 --json databaseId,conclusion,createdAt,headSha
+# Find the job containing the test
+gh run view {RUN_ID} --repo sgl-project/sglang --json jobs --jq '.jobs[] | select(.conclusion == "failure") | {name, conclusion, databaseId}'
+# Get the failure details
+gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E -B 5 -A 30 "AssertionError|FAIL|Error|{TEST_NAME}"
+```
+2. **Record the failure signature:**
+   - Exact error message and assertion
+   - Affected test method name
+   - Model/config involved
+   - Numeric values (e.g., tolerance diffs, scores)
+   - Whether the failure is deterministic (same values across runs)
+### Phase 2: Temporal Bisection
+3. **Find the boundary between passing and failing runs.** Walk through the scheduled run history (from the `pr-test.yml` schedule runs on `main`) to identify:
+   - Last known PASSING run (sha + date)
+   - First known FAILING run (sha + date)
+```bash
+# For each scheduled run, check the specific partition/job status
+gh run view {RUN_ID} --repo sgl-project/sglang --json jobs --jq '.jobs[] | select(.name == "{JOB_NAME}") | {conclusion, databaseId}'
+# Verify a specific test passed or failed in a run
+gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "{TEST_NAME}|PASSED|FAILED|logprobs mismatch" | head -10
+```
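The boundary search in step 3 can be sketched in Python. This is only a minimal sketch, assuming the scheduled runs have already been collected (for example from the `gh run list` JSON) into a list; the `Run` record, the `find_boundary` helper, and the sample SHAs and dates are hypothetical and not part of `gh` or SGLang.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Run:
    """One scheduled CI run (hypothetical record for illustration)."""
    sha: str
    created_at: str  # ISO 8601 UTC timestamp, so string order equals time order
    passed: bool


def find_boundary(runs: List[Run]) -> Optional[Tuple[Run, Run]]:
    """Return (last passing run, first failing run) for the most recent pass -> fail flip.

    Returns None when no passing run is directly followed by a failing run.
    """
    ordered = sorted(runs, key=lambda r: r.created_at)
    # Walk backwards over adjacent pairs so the most recent transition wins.
    for older, newer in reversed(list(zip(ordered, ordered[1:]))):
        if older.passed and not newer.passed:
            return older, newer
    return None


if __name__ == "__main__":
    # Made-up history, only to show the call shape.
    history = [
        Run("aaa111", "2025-01-01T00:00:00Z", True),
        Run("bbb222", "2025-01-01T08:00:00Z", True),
        Run("ccc333", "2025-01-01T16:00:00Z", False),
        Run("ddd444", "2025-01-02T00:00:00Z", False),
    ]
    boundary = find_boundary(history)
    if boundary is not None:
        last_pass, first_fail = boundary
        print(f"git log --oneline {last_pass.sha}..{first_fail.sha}")
```

The two SHAs it returns are the `{LAST_PASS_SHA}` and `{FIRST_FAIL_SHA}` placeholders used in the `git log` commands of the next steps.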
+4. **List commits between the boundary:**
+```bash
+git log --oneline {LAST_PASS_SHA}..{FIRST_FAIL_SHA}
+```
+5. **Filter for relevant commits** that touch files related to the failing test (model layers, kernels, test utilities, etc.):
+```bash
+git log --oneline {LAST_PASS_SHA}..{FIRST_FAIL_SHA} -- {relevant_paths}
+```
+### Phase 3: Runner/Hardware Analysis
+6. **Check if the failure is runner-specific.** Extract the runner identity from each failing and passing run:
+```bash
+# Get runner name and machine
+gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "Runner name|Machine name" | head -5
+# Get GPU/driver info
+gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -i -E "NVIDIA-SMI|Driver Version|CUDA Version" | head -5
+# Get package versions
+gh run view {RUN_ID} --repo sgl-project/sglang --job {JOB_ID} --log 2>&1 | grep -E "sgl.kernel.*==|flashinfer.*==" | head -5
+```
+7. **Correlate runners with pass/fail outcomes.** Build a table:
+| Run ID | Date | Runner | GPU Type | Driver | Result |
+|--------|------|--------|----------|--------|--------|
+If all failures map to a specific runner type/GPU and all passes map to another, the issue is **hardware-specific**, not a code regression.
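The correlation rule above can be written down as a small helper. This is a rough sketch, assuming each table row has been reduced to a `(gpu_type, passed)` pair; `classify_by_runner` and its return strings are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def classify_by_runner(rows: Iterable[Tuple[str, bool]]) -> str:
    """Classify a failure from (gpu_type, passed) pairs taken from the runner table."""
    outcomes: Dict[str, Set[bool]] = defaultdict(set)
    for gpu_type, passed in rows:
        outcomes[gpu_type].add(passed)

    always_fail = {g for g, seen in outcomes.items() if seen == {False}}
    always_pass = {g for g, seen in outcomes.items() if seen == {True}}
    mixed = set(outcomes) - always_fail - always_pass

    if always_fail and always_pass and not mixed:
        # Failures and passes split cleanly along hardware lines.
        return "hardware-specific"
    if mixed:
        # The same hardware both passes and fails: flakiness or a change over time.
        return "not explained by hardware alone"
    if always_fail:
        return "fails everywhere: suspect a code or environment change"
    return "no failures recorded"
```

A clean split only suggests a hardware-specific issue. Check that the compared runs cover the same SHAs before ruling out a code regression.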
+### Phase 4: Code Analysis
+8. **If a code regression is suspected** (failures not runner-specific), examine the candidate commits:
+   - Read the changed files
+   - Understand how the changes could affect the failing test
+   - Look for prefill-vs-decode differences, TP-specific paths, kernel changes
+9. **If a hardware issue is suspected**, analyze:
+   - Kernel compatibility (CUDA compute capability)
+   - Driver version differences
+   - All-reduce / NCCL behavior differences
+   - CUDA graph capture differences across GPU architectures
+### Phase 5: Remote Reproduction (Optional)
+Only if SSH target and docker container were provided.
+10. **Verify the remote environment:**
+```bash
+ssh {SSH_TARGET} "docker exec {CONTAINER} nvidia-smi --query-gpu=name,driver_version --format=csv"
+ssh {SSH_TARGET} "docker exec {CONTAINER} pip show sgl-kernel sglang flashinfer-python 2>&1 | grep -E 'Name:|Version:'"
+```
+11. **Ensure latest code is installed.** If the container is stale, update:
+```bash
+# Option A: fetch the latest main, then reinstall from the checkout
+ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /path/to/sglang && git fetch origin main && git checkout origin/main'"
+ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /path/to/sglang && pip install -e \"python[all]\"'"
+# Option B: if git auth fails, download and install from a tarball instead
+ssh {SSH_TARGET} "docker exec {CONTAINER} bash -c 'cd /tmp && curl -L https://github.com/sgl-project/sglang/archive/refs/heads/main.tar.gz | tar xz && cd sglang-main && pip install -e \"python[all]\"'"
+# Install test dependencies if needed
+ssh {SSH_TARGET} "docker exec {CONTAINER} pip install peft rouge-score"
+```
+12. **Create a minimal reproduction script** that:
+    - Uses `if __name__ == '__main__'` with `mp.set_start_method("spawn")`
+    - Runs the specific failing test configuration
+    - Prints key metrics (diffs, scores, outputs)
+    - Exits with code 1 on failure
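A skeleton for such a script might look like the following. It is a sketch only: `run_case` is a hypothetical placeholder that returns made-up metrics, and the real script would replace it with the failing model/config and the values from the CI failure signature.

```python
import multiprocessing as mp
import sys
from typing import Callable, Dict


def run_case() -> Dict[str, float]:
    """Placeholder for the failing test configuration (hypothetical).

    Replace the body with the real test logic and return the metrics
    that appear in the CI failure signature.
    """
    return {"max_diff": 0.0, "threshold": 1e-2}


def main(case: Callable[[], Dict[str, float]] = run_case) -> int:
    metrics = case()
    # Print key metrics so the remote run can be compared with the CI log.
    for name, value in metrics.items():
        print(f"{name}: {value}")
    failed = metrics["max_diff"] > metrics["threshold"]
    print("RESULT:", "FAIL" if failed else "PASS")
    return 1 if failed else 0


if __name__ == "__main__":
    # "spawn" avoids forking a process that may already hold CUDA state.
    mp.set_start_method("spawn", force=True)
    exit_code = main()
    if exit_code != 0:
        # Non-zero exit lets ssh / docker exec callers detect the failure.
        sys.exit(exit_code)
```

Passing the case as a parameter keeps the pass/fail logic testable without a GPU; the control experiments in step 14 can reuse `main` with a different `case`.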
+13. **Copy and run the reproduction script:**
+```bash
+scp /tmp/repro_script.py {SSH_TARGET}:/tmp/
+ssh {SSH_TARGET} "docker cp /tmp/repro_script.py {CONTAINER}:/tmp/"
+ssh {SSH_TARGET} "docker exec -e CUDA_VISIBLE_DEVICES=0,1 {CONTAINER} python3 /tmp/repro_script.py"
+```
+14. **Run control experiments** to isolate the variable:
+    - If suspecting TP issue: run with TP=1 as control
+    - If suspecting GPU issue: compare same code on different GPU
+    - If suspecting a specific commit: test before/after that commit
+### Phase 6: Report
+15. **Produce a structured report:**
+```markdown
+## CI Regression Bisection Report
+### Failure Signature
+- **Test**: {test_file}::{test_method}
+- **Error**: {exact error message}
+- **Key metrics**: {numeric values}
+- **Deterministic**: Yes/No
+### Root Cause Classification
+One of:
+- **Code Regression**: PR #{number} introduced the bug
+- **Hardware-Specific**: Fails on {GPU_TYPE}, passes on others
+- **Environment Change**: New runner/driver/package version
+- **Pre-existing Flakiness**: Intermittent, not a new regression
+### Evidence
+| Condition | Result |
+|-----------|--------|
+| {condition1} | PASS/FAIL |
+| {condition2} | PASS/FAIL |
+### Timeline
+- {date}: Last known pass ({sha}, {runner})
+- {date}: First known fail ({sha}, {runner})
+- {date}: Confirmed reproduction on {server}
+### Recommended Fix
+- **Short-term**: {workaround}
+- **Long-term**: {proper fix}
+```
+## Key Patterns to Recognize
+| Pattern | Diagnosis |
+|---------|-----------|
+| Same SHA passes on runner A, fails on runner B | Hardware/runner-specific |
+| All runners fail after commit X | Code regression from commit X |
+| Intermittent: the same runner sometimes passes and sometimes fails | Flaky test or race condition |
+| Prefill OK but decode fails | TP/all-reduce issue in decode path |
+| Works with TP=1, fails with TP>1 | Tensor parallelism bug |
+| Exact same numeric diff every time | Deterministic bug, not flakiness |
+## Important Notes
+- **Always check runner identity** before concluding it's a code regression. Many "consistent" failures are actually runner-specific.
+- **Test partition assignments change over time** as tests are added/removed. A test may move between partitions, landing on different runner types.
+- **H200 runners** use `/root/actions-runner/` path and machine names like `gpu-h200-worker-*`. Non-H200 runners use `/public_sglang_ci/runner-*` paths.
+- When running remote reproduction, use `run_in_background` for long-running tests and check output with `TaskOutput`.
+- Container environments may be stale - always verify package versions match CI before drawing conclusions.

sglang/.claude/skills/write-sglang-test/SKILL.md ADDED

	@@ -0,0 +1,248 @@

+---
+name: write-sglang-test
+description: Guide for writing SGLang CI/UT tests following project conventions. Covers CustomTestCase, CI registration, server fixtures, model selection, and test placement. Use when creating new tests, adding CI test cases, writing unit tests, or when the user asks to add tests for SGLang features.
+---
+# Writing SGLang CI / UT Tests
+## Core Rules
+1. **Always use `CustomTestCase`** — never raw `unittest.TestCase`
+2. **Place tests in `test/registered/<category>/`** — only use `test/manual/` for debugging / non-CI tests
+3. **Reuse server fixtures** — inherit from `DefaultServerBase` or write `setUpClass`/`tearDownClass` with `popen_launch_server`
+4. **Smallest model for model-agnostic functionality** — use `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` (Llama-3.2-1B-Instruct) for basic features that don't depend on model size
+5. **8B for general performance** — use `DEFAULT_MODEL_NAME_FOR_TEST` (Llama-3.1-8B-Instruct, single-node) for performance tests that don't involve spec / DP / parallelism
+6. **Bigger features → discuss case by case** — spec, DP attention, tensor/pipeline parallelism etc. may need multi-GPU suites and specific models
+---
+## Test File Template
+### Functional correctness test (small model)
+```python
+import unittest
+import requests
+from sglang.srt.utils import kill_process_tree
+from sglang.test.ci.ci_register import register_cuda_ci
+from sglang.test.test_utils import (
+    DEFAULT_SMALL_MODEL_NAME_FOR_TEST,
+    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
+    DEFAULT_URL_FOR_TEST,
+    CustomTestCase,
+    popen_launch_server,
+)
+register_cuda_ci(est_time=60, suite="stage-b-test-small-1-gpu")
+class TestMyFeature(CustomTestCase):
+    @classmethod
+    def setUpClass(cls):
+        cls.model = DEFAULT_SMALL_MODEL_NAME_FOR_TEST
+        cls.base_url = DEFAULT_URL_FOR_TEST
+        cls.process = popen_launch_server(
+            cls.model,
+            cls.base_url,
+            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
+            other_args=["--arg1", "value1"],  # feature-specific args
+        )
+    @classmethod
+    def tearDownClass(cls):
+        kill_process_tree(cls.process.pid)
+    def test_basic_functionality(self):
+        response = requests.post(
+            self.base_url + "/generate",
+            json={"text": "Hello", "sampling_params": {"max_new_tokens": 32}},
+        )
+        self.assertEqual(response.status_code, 200)
+if __name__ == "__main__":
+    unittest.main(verbosity=3)
+```
+### General performance test (8B model, single node, no spec/DP/parallelism)
+```python
+import time
+import unittest
+import requests
+from sglang.srt.utils import kill_process_tree
+from sglang.test.ci.ci_register import register_cuda_ci
+from sglang.test.test_utils import (
+    DEFAULT_MODEL_NAME_FOR_TEST,
+    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
+    DEFAULT_URL_FOR_TEST,
+    CustomTestCase,
+    popen_launch_server,
+)
+register_cuda_ci(est_time=300, suite="stage-b-test-large-1-gpu")
+class TestMyFeaturePerf(CustomTestCase):
+    @classmethod
+    def setUpClass(cls):
+        cls.model = DEFAULT_MODEL_NAME_FOR_TEST
+        cls.base_url = DEFAULT_URL_FOR_TEST
+        cls.process = popen_launch_server(
+            cls.model,
+            cls.base_url,
+            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
+        )
+    @classmethod
+    def tearDownClass(cls):
+        kill_process_tree(cls.process.pid)
+    def test_latency(self):
+        start = time.perf_counter()
+        response = requests.post(
+            self.base_url + "/generate",
+            json={"text": "Hello", "sampling_params": {"max_new_tokens": 128}},
+        )
+        elapsed = time.perf_counter() - start
+        self.assertEqual(response.status_code, 200)
+        self.assertLess(elapsed, 5.0, "Latency exceeded threshold")
+if __name__ == "__main__":
+    unittest.main(verbosity=3)
+```
+---
+## Server Fixture Reuse
+For tests that only need a standard server, inherit from `DefaultServerBase` and override class attributes:
+```python
+from sglang.test.server_fixtures.default_fixture import DefaultServerBase
+from sglang.test.test_utils import DEFAULT_SMALL_MODEL_NAME_FOR_TEST
+class TestMyFeature(DefaultServerBase):
+    model = DEFAULT_SMALL_MODEL_NAME_FOR_TEST
+    other_args = ["--enable-my-feature"]
+    def test_something(self):
+        ...
+```
+Available fixtures in `python/sglang/test/server_fixtures/`:
+| Fixture | Use case |
+|---------|----------|
+| `DefaultServerBase` | Standard single-server tests |
+| `EagleServerBase` | EAGLE speculative decoding |
+| `PDDisaggregationServerBase` | Disaggregated prefill/decode |
+| `MMMUServerBase` | Multimodal VLM tests |
+---
+## CI Registration
+Every test file in `test/registered/` **must** call a registration function at module level:
+```python
+from sglang.test.ci.ci_register import register_cuda_ci, register_amd_ci
+register_cuda_ci(est_time=60, suite="stage-b-test-small-1-gpu")
+register_amd_ci(est_time=60, suite="stage-b-test-small-1-gpu-amd")  # optional
+```
+Parameters:
+- `est_time`: estimated runtime in seconds (used for CI partitioning)
+- `suite`: which CI suite to run in (see below)
+- `nightly=True`: for nightly-only tests (default `False` = per-commit)
+- `disabled="reason"`: temporarily disable with explanation
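To see why an accurate `est_time` matters, here is a sketch of how estimated runtimes could be used to balance tests across CI partitions. This is not SGLang's actual partitioning code; it is a generic greedy longest-first heuristic, and `partition_by_est_time` and the test names in the usage note are hypothetical.

```python
import heapq
from typing import Dict, List


def partition_by_est_time(est_times: Dict[str, float], num_partitions: int) -> List[List[str]]:
    """Greedy longest-first assignment: each test goes to the currently lightest partition."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    # Heap entries are (total estimated seconds, partition index).
    heap = [(0.0, i) for i in range(num_partitions)]
    heapq.heapify(heap)
    partitions: List[List[str]] = [[] for _ in range(num_partitions)]
    # Sort by descending time; break ties by name so the result is deterministic.
    for name, seconds in sorted(est_times.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        partitions[idx].append(name)
        heapq.heappush(heap, (total + seconds, idx))
    return partitions
```

For example, with estimates of 300, 180, 60, and 60 seconds and two partitions, the 300-second test gets a partition to itself and the other three share the second one, so both finish in about 300 seconds. A badly underestimated `est_time` would skew that balance.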
+### Suite selection guide
+**Default cases (1 GPU):**
+| Scenario | Model | Suite |
+|----------|-------|-------|
+| Model-agnostic basic functionality | 1B (smallest) | `stage-b-test-small-1-gpu` |
+| General performance (no spec/DP/parallelism) | 8B | `stage-b-test-large-1-gpu` |
+**Bigger features (case by case):**
+| Scenario | Suite |
+|----------|-------|
+| 2 GPU (e.g. TP=2) | `stage-b-test-large-2-gpu` |
+| 4 GPU (H100) | `stage-c-test-4-gpu-h100` |
+| 8 GPU (H200) | `stage-c-test-8-gpu-h200` |
+| Nightly, 1 GPU | `nightly-1-gpu` |
+| Nightly, 8 GPU | `nightly-8-gpu` |
+For spec, DP attention, parallelism, disaggregation, etc., discuss with the team to determine the appropriate suite and GPU configuration.
+---
+## Model Constants
+All defined in `python/sglang/test/test_utils.py`:
+| Constant | Model | When to use |
+|----------|-------|-------------|
+| `DEFAULT_SMALL_MODEL_NAME_FOR_TEST` | Llama-3.2-1B-Instruct | Model-agnostic basic functionality |
+| `DEFAULT_SMALL_MODEL_NAME_FOR_TEST_BASE` | Llama-3.2-1B | Base (non-instruct) model tests |
+| `DEFAULT_MODEL_NAME_FOR_TEST` | Llama-3.1-8B-Instruct | General performance (single node) |
+| `DEFAULT_MOE_MODEL_NAME_FOR_TEST` | Mixtral-8x7B-Instruct | MoE-specific tests |
+| `DEFAULT_SMALL_EMBEDDING_MODEL_NAME_FOR_TEST` | — | Embedding tests |
+| `DEFAULT_SMALL_VLM_MODEL_NAME_FOR_TEST` | — | Vision-language tests |
+---
+## Test Placement
+```
+test/
+├── registered/          # CI tests (auto-discovered by run_suite.py)
+│   ├── sampling/        # test_penalty.py, test_sampling_params.py ...
+│   ├── sessions/        # test_session_control.py ...
+│   ├── openai_server/   # basic/, features/, validation/ ...
+│   ├── spec/            # eagle/, utils/ ...
+│   ├── models/          # model-specific accuracy tests
+│   ├── perf/            # performance benchmarks
+│   └── <category>/      # create new category if needed
+├── manual/              # Non-CI: debugging, one-off, manual verification
+└── run_suite.py         # CI runner (scans registered/ only)
+```
+**Decision rule**: if the test should run in CI → `registered/`. If it's for local debugging or requires special hardware not in CI → `manual/`.
+---
+## Key Utilities
+```python
+from sglang.test.test_utils import (
+    CustomTestCase,              # base class with retry logic
+    popen_launch_server,         # launch server subprocess
+    DEFAULT_URL_FOR_TEST,        # auto-configured base URL
+    DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,  # 600s default
+    run_bench_serving,           # benchmark helper (launch + bench)
+)
+from sglang.srt.utils import kill_process_tree  # cleanup server
+```
+---
+## Checklist
+Before submitting a test:
+- [ ] Inherits from `CustomTestCase` (not `unittest.TestCase`)
+- [ ] Has `register_*_ci(...)` call at module level
+- [ ] Placed in `test/registered/<category>/`
+- [ ] Model selection: smallest for model-agnostic features, 8B for general perf, case-by-case for other complex features
+- [ ] `setUpClass` launches server, `tearDownClass` kills it
+- [ ] Has `if __name__ == "__main__": unittest.main(verbosity=3)`
+- [ ] `est_time` is reasonable (measure locally)
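Some of these items can be checked mechanically. The helper below is a rough, regex-based sketch and not part of SGLang; `check_test_source` is a hypothetical name. It would report a false negative for classes that inherit from a fixture such as `DefaultServerBase`, and it cannot judge placement, model choice, or `est_time`.

```python
import re
from typing import Dict


def check_test_source(source: str) -> Dict[str, bool]:
    """Run rough text checks for a few checklist items on a test file's source."""
    return {
        # A class that directly inherits from CustomTestCase.
        "uses_custom_test_case": bool(
            re.search(r"class\s+\w+\(\s*CustomTestCase\s*\)", source)
        ),
        # No direct use of the raw unittest base class.
        "no_raw_unittest_testcase": "unittest.TestCase" not in source,
        # A register_*_ci(...) call at the start of a line (module level).
        "registers_ci": bool(
            re.search(r"^register_\w+_ci\(", source, flags=re.MULTILINE)
        ),
        # The standard entry point guard.
        "has_main_guard": bool(
            re.search(r"if __name__ == [\"']__main__[\"']:", source)
        ),
    }
```

Calling it on the text of a test file returns one boolean per check, so any `False` entry points at the checklist item to revisit.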

sglang/benchmark/json_jump_forward/README.md ADDED

	@@ -0,0 +1,88 @@

+## Run benchmark
+### Dependencies
+```
+llama_cpp_python          0.2.38
+guidance                  0.1.10
+vllm                      0.2.7
+outlines                  0.0.25
+```
+### Build dataset
+When benchmarking long document information retrieval, run the following command to build the dataset:
+```bash
+pip install wikipedia
+python3 build_dataset.py
+```
+### Benchmark sglang
+Launch the server with Llama-2-7B-Chat
+```bash
+python3 -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
+```
+Benchmark Character Generation
+```bash
+python3 bench_sglang.py --mode character
+```
+Benchmark City Information Retrieval
+```bash
+python3 bench_sglang.py --mode city
+```
+### Benchmark Outlines + vLLM
+Launch the Outlines server with Llama-2-7B-Chat
+```bash
+python3 -m outlines.serve.serve --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf  --disable-log-requests --port 21000
+```
+Benchmark Character Generation
+```bash
+python3 bench_other.py --mode character --backend outlines
+```
+Benchmark City Information Retrieval
+```bash
+python3 bench_other.py --mode city --backend outlines
+```
+### Benchmark guidance
+Run Llama-7B and benchmark character generation
+```bash
+python3 bench_other.py --mode character --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
+```
+Run Llama-7B and benchmark city information retrieval
+```bash
+python3 bench_other.py --mode city --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
+```
+### Benchmark lmql
+Run Llama-7B and benchmark character generation
+```bash
+python3 bench_other.py --mode character --backend lmql --parallel 1
+```
+Run Llama-7B and benchmark city information retrieval
+```bash
+python3 bench_other.py --mode city --backend lmql --parallel 1
+```

sglang/benchmark/json_jump_forward/bench_other.py ADDED

	@@ -0,0 +1,288 @@

+import argparse
+import json
+import time
+from concurrent.futures import ThreadPoolExecutor
+from functools import partial
+import guidance
+from tqdm import tqdm
+from sglang.test.test_utils import add_common_other_args_and_parse, get_call_generate
+from sglang.utils import dump_state_text, read_jsonl
+# There are some FSM bugs with the JSON regex converted from a pydantic model,
+# so use a hand-written string regex here instead.
+# regex_string = build_regex_from_object(HarryPoterRole)
+character_regex = (
+    r"""\{\n"""
+    + r"""    "name": "[\w\d\s]{1,16}",\n"""
+    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
+    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
+    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
+    + r"""    "wand": \{\n"""
+    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
+    + r"""        "core": "[\w\d\s]{1,16}",\n"""
+    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
+    + r"""    \},\n"""
+    + r"""    "alive": "(Alive|Deceased)",\n"""
+    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
+    + r"""    "bogart": "[\w\d\s]{1,16}"\n"""
+    + r"""\}"""
+)
+city_regex = (
+    r"""\{\n"""
+    + r"""  "name": "[\w\d\s]{1,16}",\n"""
+    + r"""  "country": "[\w\d\s]{1,16}",\n"""
+    + r"""  "latitude": [-+]?[0-9]*\.?[0-9]{0,2},\n"""
+    + r"""  "population": [-+]?[0-9]{1,9},\n"""
+    + r"""  "top 3 landmarks": \["[\w\d\s]{1,16}", "[\w\d\s]{1,16}", "[\w\d\s]{1,16}"\]\n"""
+    + r"""\}"""
+)
+# fmt: off
+def character_gen(name, generate):
+    s = name + " is a character in Harry Potter. Please fill in the following information about this character.\n"
+    s += generate(s, max_tokens=256, regex=character_regex)
+    return s
+# fmt: on
+# fmt: off
+def city_gen(document, generate):
+    s = "Please extract the information of a city from the following wikipedia page.\n"
+    s += "Page begin.\n" + document + "Page end.\n"
+    s += "Here is the name, country, and symbol of the city in JSON format.\n"
+    s += generate(s, max_tokens=256, regex=city_regex)
+    return s
+# fmt: on
+@guidance
+def character_maker(lm, name):
+    regex_str_no_quote = r"[\w\d\s]+"
+    regex_float = r"[0-9]+\.[0-9]+"
+    lm += f"""\
+    {name} is a character in Harry Potter. Please fill in the following information about this character.
+    {{
+        "name": "{guidance.gen("name", max_tokens=16, regex=regex_str_no_quote)}",
+        "house": "{guidance.select(options=['Gryffindor', 'Slytherin', 'Ravenclaw', 'Hufflepuff'], name='house')}",
+        "blood status": "{guidance.select(options=['Pure-blood', 'Half-blood', 'Muggle-born'], name='blood status')}",
+        "occupation": "{guidance.select(options=['student', 'teacher', 'auror', 'ministry of magic', 'death eater', 'order of the phoenix'], name='occupation')}",
+        "wand": {{
+            "wood": "{guidance.gen("wood", max_tokens=16, regex=regex_str_no_quote)}",
+            "core": "{guidance.gen('core', max_tokens=16, regex=regex_str_no_quote)}",
+            "length": {guidance.gen('length', max_tokens=10, regex=regex_float)}
+        }},
+        "alive": "{guidance.select(options=['Alive', 'Deceased'], name='alive')}",
+        "patronus": "{guidance.gen('patronus', max_tokens=16, regex=regex_str_no_quote)}",
+        "bogart": "{guidance.gen('bogart', max_tokens=16, regex=regex_str_no_quote)}"
+    }}
+    """
+    return lm
+async def call_generate_lmql(
+    prompt, temperature, max_tokens, regex, max_len=4096, model=None, **kwargs
+):
+    assert model is not None
+    import lmql
+    @lmql.query(model=model)
+    async def program(question, max_tokens, regex):
+        '''lmql
+        """{question}[ANSWER]""" where len(TOKENS(ANSWER)) < max_tokens and REGEX(ANSWER, regex)
+        return ANSWER
+        '''
+    return await program(
+        question=prompt,
+        temperature=temperature,
+        max_tokens=max_tokens,
+        max_len=max_len,
+        regex=regex,
+        **kwargs,
+    )
+@guidance
+def city_maker(lm, document):
+    regex_str_no_quote = r"[\w\d\s]+"
+    regex_float = r"[0-9]+\.[0-9]+"
+    lm += f"""\
+    Please extract the information of a city from the following wikipedia page.
+    Page begin.
+    {document}
+    Page end.
+    Here is the name, country, and symbol of the city in JSON format.
+    {{
+        "name": "{guidance.gen("name", max_tokens=16, regex=regex_str_no_quote)}",
+        "country": "{guidance.gen("country", max_tokens=16, regex=regex_str_no_quote)}",
+        "latitude": {guidance.gen("latitude", max_tokens=10, regex=regex_float)},
+        "population": {guidance.gen("population", max_tokens=10, regex=r"[0-9]+")},
+        "top 3 landmarks": [
+            "{guidance.gen("landmark1", max_tokens=16, regex=regex_str_no_quote)}", "{guidance.gen("landmark2", max_tokens=16, regex=regex_str_no_quote)}", "{guidance.gen("landmark3", max_tokens=16, regex=regex_str_no_quote)}"
+        ]
+    }}
+    """
+    return lm
+def bench_character(args):
+    arguments = []
+    with open(args.data_path, "r") as f:
+        for line in f:
+            arguments.append({"name": line.strip()})
+    arguments = arguments[: args.num_jsons]
+    states = [None] * len(arguments)
+    # Select backend
+    if args.backend == "outlines":
+        call_generate = partial(get_call_generate(args), temperature=0)
+        def get_one_answer(i):
+            states[i] = character_gen(**arguments[i], generate=call_generate)
+    elif args.backend == "guidance":
+        model = guidance.models.LlamaCpp(
+            args.model_path,
+            n_gpu_layers=-1,
+            n_ctx=args.n_ctx,
+        )
+        def get_one_answer(i):
+            lm = model + character_maker(**arguments[i])
+            states[i] = lm
+    elif args.backend == "lmql":
+        import asyncio
+        import lmql
+        model = lmql.model(args.model_path, endpoint=f"{args.host}:{args.port}")
+        call_generate = partial(
+            call_generate_lmql,
+            model=model,
+            max_tokens=256,
+            regex=character_regex,
+        )
+        async def get_one_answer_async(i):
+            states[i] = await call_generate(prompt=arguments[i]["name"], temperature=0)
+    else:
+        raise ValueError(f"Invalid backend: {args.backend}")
+    tic = time.perf_counter()
+    if args.backend != "lmql":
+        if args.parallel == 1:
+            for i in tqdm(range(len(arguments))):
+                get_one_answer(i)
+        else:
+            with ThreadPoolExecutor(args.parallel) as executor:
+                rets = list(
+                    tqdm(
+                        executor.map(get_one_answer, list(range(len(arguments)))),
+                        total=len(arguments),
+                    )
+                )
+                for _ in rets:
+                    pass
+    else:
+        batches = []
+        for i in range(0, len(arguments), args.parallel):
+            batches.append(list(range(i, min(i + args.parallel, len(arguments)))))
+        loop = asyncio.get_event_loop()
+        for bt in tqdm(batches):
+            loop.run_until_complete(
+                asyncio.gather(*[get_one_answer_async(i) for i in bt])
+            )
+    latency = time.perf_counter() - tic
+    return states, latency
+def bench_city_doc(args):
+    arguments = []
+    for line in read_jsonl(args.data_path):
+        arguments.append({"document": line["document"]})
+    arguments = arguments[: args.num_jsons]
+    states = [None] * len(arguments)
+    # Select backend
+    if args.backend == "outlines":
+        call_generate = partial(get_call_generate(args), temperature=0)
+        def get_one_answer(i):
+            states[i] = city_gen(**arguments[i], generate=call_generate)
+    elif args.backend == "guidance":
+        model = guidance.models.LlamaCpp(
+            args.model_path,
+            n_gpu_layers=-1,
+            n_ctx=args.n_ctx,
+        )
+        def get_one_answer(i):
+            lm = model + city_maker(**arguments[i])
+            states[i] = lm
+    else:
+        raise ValueError(f"Invalid backend: {args.backend}")
+    tic = time.perf_counter()
+    if args.parallel == 1:
+        for i in tqdm(range(len(arguments))):
+            get_one_answer(i)
+    else:
+        with ThreadPoolExecutor(args.parallel) as executor:
+            rets = executor.map(get_one_answer, list(range(len(arguments))))
+            for _ in rets:
+                pass
+    latency = time.perf_counter() - tic
+    return states, latency
+def main(args):
+    if args.mode == "character":
+        args.data_path = "dataset.txt"
+        states, latency = bench_character(args)
+    elif args.mode == "city":
+        args.data_path = "questions.jsonl"
+        states, latency = bench_city_doc(args)
+    # Report latency (accuracy is not computed for this benchmark)
+    print(f"Latency: {latency:.3f}")
+    # Write results
+    dump_state_text(f"tmp_output_{args.backend}_{args.mode}.txt", states)
+    with open(args.result_file, "a") as fout:
+        value = {
+            "task": "json_jump_forward",
+            "backend": args.backend,
+            "latency": round(latency, 3),
+            "num_jsons": args.num_jsons,
+            "mode": args.mode,
+            "parallel": args.parallel,
+        }
+        fout.write(json.dumps(value) + "\n")
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--data-path", type=str)
+    parser.add_argument("--num-jsons", type=int, default=50)
+    parser.add_argument(
+        "--mode", type=str, default="character", choices=["character", "city"]
+    )
+    args = add_common_other_args_and_parse(parser)
+    main(args)

sglang/benchmark/json_jump_forward/bench_sglang.py ADDED

	@@ -0,0 +1,143 @@

+import argparse
+import json
+import time
+import sglang as sgl
+from sglang.test.test_utils import (
+    add_common_sglang_args_and_parse,
+    select_sglang_backend,
+)
+from sglang.utils import dump_state_text, read_jsonl
+# The JSON regex converted from a pydantic model triggers some FSM bugs,
+# so a hand-written string regex is used here instead.
+# regex_string = build_regex_from_object(HarryPoterRole)
+character_regex = (
+    r"""\{\n"""
+    + r"""    "name": "[\w\d\s]{1,16}",\n"""
+    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
+    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
+    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
+    + r"""    "wand": \{\n"""
+    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
+    + r"""        "core": "[\w\d\s]{1,16}",\n"""
+    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
+    + r"""    \},\n"""
+    + r"""    "alive": "(Alive|Deceased)",\n"""
+    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
+    + r"""    "bogart": "[\w\d\s]{1,16}"\n"""
+    + r"""\}"""
+)
+city_regex = (
+    r"""\{\n"""
+    + r"""  "name": "[\w\d\s]{1,16}",\n"""
+    + r"""  "country": "[\w\d\s]{1,16}",\n"""
+    + r"""  "latitude": [-+]?[0-9]*\.?[0-9]{0,2},\n"""
+    + r"""  "population": [-+]?[0-9]{1,9},\n"""
+    + r"""  "top 3 landmarks": \["[\w\d\s]{1,16}", "[\w\d\s]{1,16}", "[\w\d\s]{1,16}"\]\n"""
+    + r"""\}"""
+)
+# fmt: off
+@sgl.function
+def character_gen(s, name):
+    s += name + " is a character in Harry Potter. Please fill in the following information about this character.\n"
+    s += sgl.gen("json_output", max_tokens=256, regex=character_regex)
+# fmt: on
+# fmt: off
+@sgl.function
+def city_gen(s, document):
+    s += "Please extract the information of a city from the following wikipedia page.\n"
+    s += "Page begin.\n" + document + "Page end.\n"
+    s += "Here is the name, country, and symbol of the city in JSON format.\n"
+    s += sgl.gen("json_output", max_tokens=256, regex=city_regex)
+# fmt: on
+def bench_city_doc(args):
+    arguments = []
+    for line in read_jsonl(args.data_path):
+        arguments.append({"document": line["document"]})
+    arguments = arguments[: args.num_jsons]
+    # Select backend
+    backend = select_sglang_backend(args)
+    sgl.set_default_backend(backend)
+    # Run requests
+    tic = time.perf_counter()
+    states = city_gen.run_batch(
+        arguments,
+        temperature=0,
+        num_threads=args.parallel,
+        progress_bar=True,
+    )
+    latency = time.perf_counter() - tic
+    return states, latency
+def bench_character(args):
+    arguments = []
+    with open(args.data_path, "r") as f:
+        for line in f:
+            arguments.append({"name": line.strip()})
+    arguments = arguments[: args.num_jsons]
+    # Select backend
+    backend = select_sglang_backend(args)
+    sgl.set_default_backend(backend)
+    # Run requests
+    tic = time.perf_counter()
+    states = character_gen.run_batch(
+        arguments,
+        temperature=0,
+        num_threads=args.parallel,
+        progress_bar=True,
+    )
+    latency = time.perf_counter() - tic
+    return states, latency
+def main(args):
+    if args.mode == "character":
+        args.data_path = args.data_path or "dataset.txt"
+        states, latency = bench_character(args)
+    elif args.mode == "city":
+        args.data_path = args.data_path or "questions.jsonl"
+        states, latency = bench_city_doc(args)
+    # Report latency
+    print(f"Latency: {latency:.3f}")
+    # Write results
+    dump_state_text(f"tmp_output_{args.backend}_{args.mode}.txt", states)
+    with open(f"{args.backend}_{args.mode}.json", "w") as fout:
+        for state in states:
+            fout.write(state["json_output"] + "\n")
+    with open(args.result_file, "a") as fout:
+        value = {
+            "task": "json_jump_forward",
+            "backend": args.backend,
+            "latency": round(latency, 3),
+            "num_jsons": args.num_jsons,
+            "mode": args.mode,
+            "parallel": args.parallel,
+        }
+        fout.write(json.dumps(value) + "\n")
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--data-path", type=str)
+    parser.add_argument("--num-jsons", type=int, default=50)
+    parser.add_argument(
+        "--mode", type=str, default="character", choices=["character", "city"]
+    )
+    args = add_common_sglang_args_and_parse(parser)
+    main(args)
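
Reviewer note: the constrained-decoding patterns in this file can be sanity-checked offline with Python's `re` module. The sketch below copies `city_regex` from the file and checks hand-written samples against it. The sample strings and the helper name `matches_city_schema` are illustrative assumptions, not part of the commit.

```python
import json
import re

# Copied from bench_sglang.py above.
city_regex = (
    r"""\{\n"""
    + r"""  "name": "[\w\d\s]{1,16}",\n"""
    + r"""  "country": "[\w\d\s]{1,16}",\n"""
    + r"""  "latitude": [-+]?[0-9]*\.?[0-9]{0,2},\n"""
    + r"""  "population": [-+]?[0-9]{1,9},\n"""
    + r"""  "top 3 landmarks": \["[\w\d\s]{1,16}", "[\w\d\s]{1,16}", "[\w\d\s]{1,16}"\]\n"""
    + r"""\}"""
)


def matches_city_schema(text: str) -> bool:
    """Hypothetical helper: True if the whole string matches city_regex."""
    return re.fullmatch(city_regex, text) is not None


# Hand-written sample (an assumption, not model output).
sample = (
    "{\n"
    '  "name": "Paris",\n'
    '  "country": "France",\n'
    '  "latitude": 48.85,\n'
    '  "population": 2100000,\n'
    '  "top 3 landmarks": ["Eiffel Tower", "Louvre", "Notre Dame"]\n'
    "}"
)

# A name longer than 16 characters violates the {1,16} bound.
too_long = sample.replace("Paris", "A city name that is far too long")

# The latitude pattern also accepts an empty number, which is not valid JSON.
degenerate = sample.replace("48.85", "")

print(matches_city_schema(sample), json.loads(sample)["name"])
print(matches_city_schema(too_long))
print(matches_city_schema(degenerate))
```

One thing this makes visible: every part of the latitude pattern is optional, so a regex match alone does not guarantee that the output parses as JSON. It may be worth running `json.loads` on the generated strings as well.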

sglang/benchmark/json_jump_forward/build_dataset.py ADDED

	@@ -0,0 +1,58 @@

+import json
+import transformers
+import wikipedia
+model_path = "meta-llama/Llama-2-7b-chat-hf"
+t = transformers.AutoTokenizer.from_pretrained(model_path)
+city_names = [
+    "los angeles",
+    "london",
+    "tokyo",
+    "beijing",
+    "singapore",
+    "paris",
+    "dubai",
+    "sydney",
+    "moscow",
+    "rome",
+    "toronto",
+    "rio de janeiro",
+    "istanbul",
+    "berlin",
+    "auckland",
+    "buenos aires",
+    "mexico city",
+    "mumbai",
+    "seoul",
+    "bangkok",
+    "cairo",
+    "athens",
+    "jerusalem",
+]
+def get_content(city_name):
+    content = str(wikipedia.page(city_name).content)
+    content = content.replace("\n\n", "\n")
+    tokens = t.encode(content)
+    expected_tokens = 3000
+    truncate_len = int((expected_tokens / len(tokens)) * len(content))
+    truncate_content = content[:truncate_len]
+    truncate_tokens = t.encode(truncate_content)
+    # Report token counts before and after truncation
+    print(
+        f"city_name: {city_name}, #tokens: {len(tokens)}, #truncate tokens: {len(truncate_tokens)}"
+    )
+    return truncate_content
+if __name__ == "__main__":
+    with open("questions.jsonl", "w") as fout:
+        for city_name in city_names:
+            truncate_content = get_content(city_name)
+            fout.write(json.dumps({"document": truncate_content}) + "\n")
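
Reviewer note: `get_content` truncates by characters, scaling the character count by the ratio of the token budget to the actual token count. A minimal sketch of that idea follows, using a whitespace tokenizer as a stand-in for the Llama tokenizer. The names `truncate_to_token_budget` and `whitespace_encode` are hypothetical and do not appear in the commit.

```python
from typing import Callable, List


def truncate_to_token_budget(
    content: str, encode: Callable[[str], List], expected_tokens: int
) -> str:
    """Cut `content` so that it encodes to roughly `expected_tokens` tokens."""
    num_tokens = len(encode(content))
    if num_tokens <= expected_tokens:
        # Already within budget; slicing past the end would return the same text.
        return content
    # Assume tokens are spread evenly over characters and scale the length.
    truncate_len = int((expected_tokens / num_tokens) * len(content))
    return content[:truncate_len]


def whitespace_encode(text: str) -> List[str]:
    """Stand-in tokenizer (an assumption): one token per whitespace-separated word."""
    return text.split()


doc = "word " * 100  # 100 tokens under the stand-in tokenizer
short = truncate_to_token_budget(doc, whitespace_encode, expected_tokens=10)
print(len(whitespace_encode(doc)), len(whitespace_encode(short)))
```

With a real subword tokenizer the result is only approximate, which is presumably why the script prints the token count after truncation.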

sglang/benchmark/json_jump_forward/dataset.txt ADDED

	@@ -0,0 +1,50 @@

+Harry Potter
+Hermione Granger
+Ron Weasley
+Albus Dumbledore
+Severus Snape
+Rubeus Hagrid
+Draco Malfoy
+Ginny Weasley
+Fred Weasley
+George Weasley
+Percy Weasley
+Sirius Black
+Remus Lupin
+Neville Longbottom
+Luna Lovegood
+Cedric Diggory
+Cho Chang
+Lord Voldemort
+Minerva McGonagall
+Filius Flitwick
+Dolores Umbridge
+Bellatrix Lestrange
+Lucius Malfoy
+Molly Weasley
+Arthur Weasley
+Nymphadora Tonks
+Dobby
+Moaning Myrtle
+Peter Pettigrew
+Alastor 'Mad-Eye' Moody
+Horace Slughorn
+Vernon Dursley
+Petunia Dursley
+Dudley Dursley
+Argus Filch
+Sybill Trelawney
+Gilderoy Lockhart
+Fleur Delacour
+Viktor Krum
+Bill Weasley
+Oliver Wood
+Cornelius Fudge
+Barty Crouch Sr.
+Barty Crouch Jr.
+Kingsley Shacklebolt
+Quirinus Quirrell
+Nearly Headless Nick
+Aunt Marge
+Griphook
+Ludo Bagman

sglang/benchmark/multi_turn_chat/bench_other.py ADDED

	@@ -0,0 +1,93 @@

+import json
+import time
+from argparse import ArgumentParser
+from concurrent.futures import ThreadPoolExecutor
+from functools import partial
+from data_gen import gen_arguments
+from tqdm import tqdm
+from vllm.transformers_utils.tokenizer import get_tokenizer
+from sglang.test.test_utils import add_common_other_args_and_parse, get_call_generate
+from sglang.utils import dump_state_text
+def multi_turns(generate, qas):
+    s = ""
+    for qa in qas:
+        s += qa["prompt"]
+        s += generate(s, max_tokens=qa["new_tokens"])
+    return s
+def main(args):
+    print(args)
+    tokenizer = get_tokenizer(args.tokenizer, trust_remote_code=args.trust_remote_code)
+    multi_qas = gen_arguments(args, tokenizer)
+    states = [None] * args.num_qa
+    call_generate = partial(get_call_generate(args), temperature=0)
+    def get_one_answer(i):
+        states[i] = multi_turns(generate=call_generate, **multi_qas[i])
+    tic = time.perf_counter()
+    if args.parallel == 1:
+        for i in tqdm(range(len(multi_qas))):
+            get_one_answer(i)
+    else:
+        with ThreadPoolExecutor(args.parallel) as executor:
+            # list() drains the iterator, so every request has finished
+            # before the latency is measured.
+            list(
+                tqdm(
+                    executor.map(get_one_answer, list(range(len(multi_qas)))),
+                    total=len(multi_qas),
+                )
+            )
+    latency = time.perf_counter() - tic
+    # Report latency
+    print(f"Latency: {latency:.3f}")
+    dump_state_text(f"tmp_output_{args.backend}.txt", states)
+    with open(args.result_file, "a") as fout:
+        value = {
+            "task": "multi_turn_chat",
+            "backend": args.backend,
+            "num_gpus": 1,
+            "latency": round(latency, 3),
+            "num_requests": args.num_qa,
+            "num_turns": args.turns,
+            "other": {
+                "parallel": args.parallel,
+                "output_mode": "long" if args.long else "short",
+            },
+        }
+        fout.write(json.dumps(value) + "\n")
+if __name__ == "__main__":
+    parser = ArgumentParser()
+    parser.add_argument("--turns", type=int, default=4)
+    parser.add_argument("--num-qa", type=int, default=20)
+    parser.add_argument("--min-len-q", type=int, default=256)
+    parser.add_argument("--max-len-q", type=int, default=512)
+    parser.add_argument("--min-len-a", type=int, default=4)
+    parser.add_argument("--max-len-a", type=int, default=8)
+    parser.add_argument("--tokenizer", type=str, required=True)
+    parser.add_argument("--trust-remote-code", action="store_true")
+    parser.add_argument("--long", action="store_true")
+    args = add_common_other_args_and_parse(parser)
+    if args.long:
+        args.min_len_a = 256
+        args.max_len_a = 512
+        args.num_qa = 20
+    main(args)

sglang/benchmark/multi_turn_chat/data_gen.py ADDED

	@@ -0,0 +1,29 @@

+import random
+import string
+random.seed(42)
+def gen_prompt(tokenizer, token_num):
+    cha_set = string.ascii_letters + string.digits
+    ret = "".join(random.choices(cha_set, k=token_num))
+    while len(tokenizer(ret).input_ids) < token_num:
+        ret += random.choice(cha_set)
+    return ret
+def gen_arguments(args, tokenizer):
+    multi_qas = [{"qas": []} for _ in range(args.num_qa)]
+    for i in range(args.num_qa):
+        qas = multi_qas[i]["qas"]
+        for _ in range(args.turns):
+            prompt_len = random.randint(args.min_len_q, args.max_len_q)
+            new_tokens = random.randint(args.min_len_a, args.max_len_a)
+            qas.append(
+                {
+                    "prompt": gen_prompt(tokenizer, prompt_len),
+                    "new_tokens": new_tokens,
+                }
+            )
+    return multi_qas
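
Reviewer note: `gen_prompt` starts from `token_num` random characters and then pads one character at a time, because a tokenizer may merge several characters into a single token. The sketch below reproduces that loop with a toy tokenizer so the padding is easy to see. `PairTokenizer` is a hypothetical stand-in, and the explicit `rng` argument is an assumption made to keep the example deterministic (the file itself seeds the global `random` module).

```python
import random
import string
from types import SimpleNamespace


class PairTokenizer:
    """Toy tokenizer (an assumption): every two characters form one token."""

    def __call__(self, text):
        num_tokens = (len(text) + 1) // 2
        return SimpleNamespace(input_ids=list(range(num_tokens)))


def gen_prompt(tokenizer, token_num, rng):
    cha_set = string.ascii_letters + string.digits
    ret = "".join(rng.choices(cha_set, k=token_num))
    # Pad until the tokenized length reaches the requested token count.
    while len(tokenizer(ret).input_ids) < token_num:
        ret += rng.choice(cha_set)
    return ret


rng = random.Random(42)
tokenizer = PairTokenizer()
prompt = gen_prompt(tokenizer, 8, rng)
print(len(prompt), len(tokenizer(prompt).input_ids))
```

Under the toy tokenizer the prompt grows from 8 to 15 characters before it reaches 8 tokens. With a real tokenizer the amount of padding depends on how the random characters happen to merge.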

sglang/benchmark/tree_of_thought_deep/README.md ADDED

	@@ -0,0 +1,51 @@

+## Download data
+```
+wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
+```
+## Run benchmark
+NOTE: This implementation is intended for throughput/latency benchmarking only. The prompts are not tuned for accuracy on the GSM-8K tasks.
+### Benchmark sglang
+```
+python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
+```
+```
+python3 bench_sglang.py --num-questions 32
+python3 bench_sglang.py --num-questions 16 --parallel 1
+```
+### Benchmark vllm
+```
+python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
+```
+```
+python3 bench_other.py --num-questions 32 --backend vllm
+```
+### Benchmark lightllm
+```
+# A10G
+python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
+```
+```
+python3 bench_other.py --num-questions 32 --backend lightllm
+```
+### Benchmark guidance
+```
+python3 bench_other.py --num-questions 8 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
+```
+### Benchmark lmql
+```
+python3 bench_other.py --num-questions 8 --backend lmql --parallel 1
+```

sglang/benchmark/tree_of_thought_deep/bench_other.py ADDED

	@@ -0,0 +1,222 @@

+import argparse
+import ast
+import json
+import re
+import time
+from collections import Counter
+from concurrent.futures import ThreadPoolExecutor
+import numpy as np
+from tqdm import tqdm
+from sglang.test.test_utils import add_common_other_args_and_parse, get_call_generate
+from sglang.utils import dump_state_text, read_jsonl
+INVALID = -9999999
+def get_answer_value(answer_str):
+    answer_str = answer_str.replace(",", "")
+    numbers = re.findall(r"\d+", answer_str)
+    if len(numbers) < 1:
+        return INVALID
+    try:
+        return ast.literal_eval(numbers[-1])
+    except SyntaxError:
+        return INVALID
+def most_frequent_number(numbers):
+    if not numbers:
+        return None
+    frequency = Counter(numbers)
+    most_frequent = max(frequency, key=frequency.get)
+    return most_frequent
+USER_PREFIX = "[INST] "
+USER_SUFFIX = " [/INST]"
+ASSISTANT_PREFIX = ""
+ASSISTANT_SUFFIX = " </s><s>"
+# Use a low temperature to make the results more deterministic and the comparison fairer.
+temp = 0.001
+def propose_plan(s, question, num_branches, call_generate):
+    s += (
+        USER_PREFIX
+        + """Please generate a high-level plan for solving the following question. As the first step, just say what method and idea you will use to solve the question. You can reorganize the information in the question. Do not do the actual calculation. Keep your response concise and within 80 words. Question: """
+        + question
+        + USER_SUFFIX
+    )
+    s += ASSISTANT_PREFIX
+    comps = call_generate(
+        s, max_tokens=256, temperature=temp, stop=None, n=num_branches
+    )
+    return [s + comp + ASSISTANT_SUFFIX for comp in comps]
+def execute_plan(s, num_branches, call_generate):
+    s += (
+        USER_PREFIX
+        + """The plan looks good! Now, use real numbers and do the calculation. Please solve the question step-by-step according to the high-level plan. Give me the final answer. Make your response short."""
+        + USER_SUFFIX
+    )
+    s += ASSISTANT_PREFIX
+    comps = call_generate(
+        s, max_tokens=256, temperature=temp, stop=None, n=num_branches
+    )
+    return [s + comp + ASSISTANT_SUFFIX for comp in comps]
+def reflect_solution(s, num_branches, call_generate):
+    s += (
+        USER_PREFIX
+        + """Okay. Now, evaluate your own solution and give it a score on a scale of 1 to 5. Please do rigorous check of the correctness."""
+        + USER_SUFFIX
+    )
+    s += ASSISTANT_PREFIX
+    comps = call_generate(
+        s, max_tokens=256, temperature=temp, stop=None, n=num_branches
+    )
+    return [s + comp + ASSISTANT_SUFFIX for comp in comps]
+def get_final_answer(s, num_branches, call_generate):
+    s += (
+        USER_PREFIX
+        + """Based on your reflection, do you change your mind? Now, give me the final answer after careful consideration."""
+        + USER_SUFFIX
+    )
+    s += ASSISTANT_PREFIX
+    comps = call_generate(
+        s, max_tokens=256, temperature=temp, stop=None, n=num_branches
+    )
+    return [s + comp + ASSISTANT_SUFFIX for comp in comps]
+def tree_search(question, num_branches, call_generate):
+    plan_forks = propose_plan("", question, num_branches, call_generate)
+    sol_states = []
+    for plan in plan_forks:
+        forks = execute_plan(plan, num_branches, call_generate)
+        sol_states.extend(forks)
+    ref_states = []
+    for sol in sol_states:
+        forks = reflect_solution(sol, num_branches, call_generate)
+        ref_states.extend(forks)
+    solutions = []
+    for sol in ref_states:
+        ans = get_final_answer(sol, num_branches, call_generate)
+        solutions.append(ans)
+    return solutions
+def main(args):
+    lines = list(read_jsonl(args.data_path))
+    # Construct prompts
+    num_branches = 2
+    questions = []
+    labels = []
+    for i in range(len(lines[: args.num_questions])):
+        questions.append(lines[i]["question"])
+        labels.append(get_answer_value(lines[i]["answer"]))
+    assert all(l != INVALID for l in labels)
+    arguments = [{"question": q, "num_branches": num_branches} for q in questions]
+    # Select backend
+    call_generate = get_call_generate(args)
+    # Run requests
+    states = [None] * len(questions)
+    tic = time.perf_counter()
+    if args.backend != "lmql":
+        def get_one_answer(i):
+            states[i] = tree_search(**arguments[i], call_generate=call_generate)
+        if args.parallel == 1:
+            for i in tqdm(range(len(questions))):
+                get_one_answer(i)
+        else:
+            with ThreadPoolExecutor(args.parallel) as executor:
+                list(
+                    tqdm(
+                        executor.map(get_one_answer, list(range(len(questions)))),
+                        total=len(questions),
+                    )
+                )
+    else:
+        import asyncio
+        from lmql_funcs import tree_search_async
+        async def get_one_answer_async(i):
+            states[i] = await tree_search_async(
+                **arguments[i], call_generate=call_generate
+            )
+        batches = [
+            [] for _ in range((len(questions) + args.parallel - 1) // args.parallel)
+        ]
+        for i in range(len(questions)):
+            batches[i // args.parallel].append(i)
+        loop = asyncio.get_event_loop()
+        for bt in tqdm(batches):
+            tasks = [get_one_answer_async(k) for k in bt]
+            loop.run_until_complete(asyncio.gather(*tasks))
+    latency = time.perf_counter() - tic
+    answers_text = []
+    for s in states:
+        answers_text.append([x for xs in s for x in xs])
+    preds = []
+    for i in range(len(states)):
+        answers = [get_answer_value(v) for v in answers_text[i]]
+        preds.append(most_frequent_number(answers))
+    # Compute accuracy
+    acc = np.mean(np.array(preds) == np.array(labels))
+    invalid = np.mean(np.array(preds) == INVALID)
+    print(f"Latency: {latency:.3f}")
+    print(f"Invalid: {invalid:.3f}")
+    print(f"Accuracy: {acc:.3f}")
+    # Write results
+    dump_state_text(f"tmp_output_{args.backend}.txt", answers_text)
+    with open(args.result_file, "a") as fout:
+        value = {
+            "task": "tree_of_thought_gsm8k",
+            "backend": args.backend,
+            "num_gpus": 1,
+            "latency": round(latency, 3),
+            "accuracy": round(acc, 3),
+            "num_requests": args.num_questions,
+            "other": {
+                "num_questions": args.num_questions,
+                "parallel": args.parallel,
+            },
+        }
+        fout.write(json.dumps(value) + "\n")
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--data-path", type=str, default="test.jsonl")
+    parser.add_argument("--num-questions", type=int, default=200)
+    args = add_common_other_args_and_parse(parser)
+    main(args)
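
Reviewer note: `tree_search` expands four levels (plan, solution, reflection, final answer), each fanning out by `num_branches`, and `main` then takes a majority vote over the leaves. The sketch below works out the resulting load per question and repeats the voting step on made-up numbers. `tree_counts` and `majority_vote` are hypothetical names; the vote mirrors `most_frequent_number` from the file.

```python
from collections import Counter


def tree_counts(num_branches, depth=4):
    """Return (generate calls, leaf answers) for one question.

    Level k issues num_branches**k calls, each returning num_branches completions.
    """
    calls = sum(num_branches**level for level in range(depth))
    leaves = num_branches**depth
    return calls, leaves


def majority_vote(numbers):
    """Same logic as most_frequent_number: the most common value wins."""
    if not numbers:
        return None
    frequency = Counter(numbers)
    return max(frequency, key=frequency.get)


calls, leaves = tree_counts(num_branches=2)
print(calls, leaves)

# Made-up leaf answers for illustration only.
print(majority_vote([18, 20, 18]))
```

With the hard-coded `num_branches = 2`, each question costs 15 generate calls and yields 16 candidate answers, which helps when interpreting the latency numbers.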

sglang/benchmark/tree_of_thought_deep/bench_sglang.py ADDED

	@@ -0,0 +1,171 @@

+import argparse
+import ast
+import json
+import re
+import time
+from collections import Counter
+import numpy as np
+import sglang as sgl
+from sglang.test.test_utils import (
+    add_common_sglang_args_and_parse,
+    select_sglang_backend,
+)
+from sglang.utils import dump_state_text, read_jsonl
+INVALID = -9999999
+def get_answer_value(answer_str):
+    answer_str = answer_str.replace(",", "")
+    numbers = re.findall(r"\d+", answer_str)
+    if len(numbers) < 1:
+        return INVALID
+    try:
+        return ast.literal_eval(numbers[-1])
+    except SyntaxError:
+        return INVALID
+def most_frequent_number(numbers):
+    if not numbers:
+        return None
+    frequency = Counter(numbers)
+    most_frequent = max(frequency, key=frequency.get)
+    return most_frequent
+# Use a low temperature to make the results more deterministic and the comparison fairer.
+temp = 0.001
+def propose_plan(s, question, num_branches):
+    s += sgl.user(
+        """Please generate a high-level plan for solving the following question. As the first step, just say what method and idea you will use to solve the question. You can reorganize the information in the question. Do not do the actual calculation. Keep your response concise and within 80 words. Question: """
+        + question
+    )
+    forks = s.fork(num_branches)
+    forks += sgl.assistant(sgl.gen("plan", max_tokens=256, temperature=temp))
+    return forks
+def execute_plan(s, num_branches):
+    s += sgl.user(
+        """The plan looks good! Now, use real numbers and do the calculation. Please solve the question step-by-step according to the high-level plan. Give me the final answer. Make your response short."""
+    )
+    forks = s.fork(num_branches)
+    forks += sgl.assistant(sgl.gen("answer", max_tokens=256, temperature=temp))
+    return forks
+def reflect_solution(s, num_branches):
+    s += sgl.user(
+        """Okay. Now, evaluate your own solution and give it a score on a scale of 1 to 5. Please do rigorous check of the correctness."""
+    )
+    forks = s.fork(num_branches)
+    forks += sgl.assistant(sgl.gen("score", max_tokens=256, temperature=temp))
+    return forks
+def get_final_answer(s, num_branches):
+    s += sgl.user(
+        """Based on your reflection, do you change your mind? Now, give me the final answer after careful consideration."""
+    )
+    forks = s.fork(num_branches)
+    forks += sgl.assistant(sgl.gen("final_answer", max_tokens=256, temperature=temp))
+    return forks
+@sgl.function
+def tree_search(s, question, num_branches):
+    plan_forks = propose_plan(s, question, num_branches)
+    sol_states = []
+    for plan in plan_forks:
+        forks = execute_plan(plan, num_branches)
+        sol_states.extend(forks)
+    ref_states = []
+    for sol in sol_states:
+        forks = reflect_solution(sol, num_branches)
+        ref_states.extend(forks)
+    solutions = []
+    for sol in ref_states:
+        forks = get_final_answer(sol, num_branches)
+        solutions.append(forks)
+    solutions = [[s.text() for s in forks] for forks in solutions]
+    return solutions
+def main(args):
+    lines = read_jsonl(args.data_path)
+    lines = list(lines)
+    # Construct prompts
+    num_branches = 2
+    questions = []
+    labels = []
+    for i in range(len(lines[: args.num_questions])):
+        questions.append(lines[i]["question"])
+        labels.append(get_answer_value(lines[i]["answer"]))
+    assert all(l != INVALID for l in labels)
+    arguments = [{"question": q, "num_branches": num_branches} for q in questions]
+    # Select backend
+    backend = select_sglang_backend(args)
+    # Run requests
+    tic = time.perf_counter()
+    states = tree_search.run_batch(
+        arguments,
+        temperature=0,
+        backend=backend,
+        num_threads=args.parallel,
+        progress_bar=True,
+    )
+    latency = time.perf_counter() - tic
+    answers_text = []
+    for s in states:
+        answers_text.append([x for xs in s.ret_value for x in xs])
+    preds = []
+    for i in range(len(states)):
+        answers = [get_answer_value(v) for v in answers_text[i]]
+        preds.append(most_frequent_number(answers))
+    # Compute accuracy
+    acc = np.mean(np.array(preds) == np.array(labels))
+    invalid = np.mean(np.array(preds) == INVALID)
+    print(f"Latency: {latency:.3f}")
+    print(f"Invalid: {invalid:.3f}")
+    print(f"Accuracy: {acc:.3f}")
+    # Write results
+    dump_state_text(f"tmp_output_{args.backend}.txt", answers_text)
+    with open(args.result_file, "a") as fout:
+        value = {
+            "task": "tree_of_thought_gsm8k",
+            "backend": args.backend,
+            "num_gpus": 1,
+            "latency": round(latency, 3),
+            "accuracy": round(acc, 3),
+            "num_requests": args.num_questions,
+            "other": {
+                "num_questions": args.num_questions,
+                "parallel": args.parallel,
+            },
+        }
+        fout.write(json.dumps(value) + "\n")
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--data-path", type=str, default="test.jsonl")
+    parser.add_argument("--num-questions", type=int, default=200)
+    args = add_common_sglang_args_and_parse(parser)
+    main(args)
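
Reviewer note: both tree-of-thought scripts score an answer by taking the last run of digits in the text. The sketch below copies `get_answer_value` from the file and runs it on a few hand-written strings (the inputs are assumptions) to show the edge cases.

```python
import ast
import re

INVALID = -9999999


def get_answer_value(answer_str):
    # Drop thousands separators so "1,234" is read as one number.
    answer_str = answer_str.replace(",", "")
    numbers = re.findall(r"\d+", answer_str)
    if len(numbers) < 1:
        return INVALID
    try:
        return ast.literal_eval(numbers[-1])
    except SyntaxError:
        # For example "007": leading zeros are not a valid Python integer literal.
        return INVALID


print(get_answer_value("She makes $18 every day.\n#### 18"))
print(get_answer_value("The total is 1,234 apples."))
print(get_answer_value("I am not sure."))
print(get_answer_value("Agent 007"))
print(get_answer_value("The change is -5 dollars."))
```

Because the pattern is `\d+`, signs and decimal points are ignored: `-5` is read as `5`, and `3.50` would be read as `50`. That seems acceptable for a latency benchmark, but it is worth keeping in mind when reading the accuracy column.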

sglang/docker/configs/.zshrc ADDED

	@@ -0,0 +1,27 @@

+export ZSH="/root/.oh-my-zsh"
+# Theme
+ZSH_THEME="robbyrussell"
+# Plugins
+plugins=(
+    git
+    z
+    zsh-autosuggestions
+    zsh-syntax-highlighting
+)
+source $ZSH/oh-my-zsh.sh
+# Aliases
+alias ll='ls -alF'
+alias la='ls -A'
+alias l='ls -CF'
+alias vi='vim'
+# Enhanced history
+HISTSIZE=10000
+SAVEHIST=10000
+setopt HIST_IGNORE_ALL_DUPS
+setopt HIST_FIND_NO_DUPS
+setopt INC_APPEND_HISTORY

sglang/docker/configs/opt/.gitconfig ADDED

	@@ -0,0 +1,30 @@

+[core]
+	editor = vim
+	whitespace = fix,-indent-with-non-tab,trailing-space,cr-at-eol
+	pager = diff-so-fancy | less --tabs=4 -RFX
+[color]
+	ui = true
+[color "diff-highlight"]
+	oldNormal = red bold
+	oldHighlight = red bold 52
+	newNormal = green bold
+	newHighlight = green bold 22
+[color "diff"]
+	meta = 11
+	frag = magenta bold
+	commit = yellow bold
+	old = red bold
+	new = green bold
+	whitespace = red reverse
+[alias]
+	lg = log --color --graph --pretty=format:'%Cred%h%Creset - %s %Cgreen(%cr) %C(bold blue)<%an>%Creset%C(auto)%d%Creset' --abbrev-commit --
+[http]
+	sslVerify = false
+[pull]
+	rebase = true

sglang/docker/configs/opt/.tmux.conf ADDED

	@@ -0,0 +1,27 @@

+# Pane border styling
+set -g pane-border-style fg='#742727',bg=black
+set -g pane-active-border-style fg=red,bg=black
+# Status bar styling
+set -g status-style bg='#0C8A92',fg=black
+# Change prefix key to backtick
+set-option -g prefix `
+unbind C-b
+bind-key ` send-prefix
+# Split panes using - and = with current path
+unbind '"'
+bind - splitw -v -c '#{pane_current_path}'
+unbind '%'
+bind = splitw -h -c '#{pane_current_path}'
+# Vi mode settings
+bind-key -T copy-mode-vi Y send-keys -X copy-pipe 'yank > #{pane_tty}'
+set-window-option -g mode-keys vi
+# Other settings
+set-option -g escape-time 0
+set-option -g base-index 1
+set-window-option -g mouse on
+set -g history-limit 100000

sglang/docker/configs/opt/.vimrc ADDED

	@@ -0,0 +1,45 @@

+function! Yank(text) abort
+  let escape = system('yank', a:text)
+  if v:shell_error
+    echoerr escape
+  else
+    call writefile([escape], '/dev/tty', 'b')
+  endif
+endfunction
+noremap <silent> <Leader>y y:<C-U>call Yank(@0)<CR>
+" automatically run yank(1) whenever yanking in Vim
+function! CopyYank() abort
+  call Yank(join(v:event.regcontents, "\n"))
+endfunction
+autocmd TextYankPost * call CopyYank()
+" Basic settings
+set number
+syntax on
+set mouse=a
+filetype indent on
+" Indentation
+set autoindent nosmartindent
+set smarttab
+set expandtab
+set shiftwidth=4
+set softtabstop=4
+" Visual guides
+set colorcolumn=120
+highlight ColorColumn ctermbg=5
+" Status line
+set laststatus=2
+set statusline=%<%f\ %h%m%r%=%{\"[\".(&fenc==\"\"?&enc:&fenc).((exists(\"+bomb\")\ &&\ &bomb)?\",B\":\"\").\"]\ \"}%k\ %-14.(%l,%c%V%)\ %P
+" Backspace behavior
+set backspace=2
+" Encoding
+set encoding=utf-8
+set fileencoding=utf-8

sglang/docker/configs/yank ADDED

	@@ -0,0 +1,12 @@

+#!/bin/bash
+put() {
+  esc=$1
+  test -n "$TMUX" -o -z "${TERM##screen*}" && esc="\033Ptmux;\033$esc\033\\"
+  printf "$esc"
+}
+put "\033]52;c;!\a"
+buf=$( cat "$@" )
+len=$( printf %s "$buf" | wc -c ) max=74994
+test $len -gt $max && echo "$0: input is $(( len - max )) bytes too long" >&2
+put "\033]52;c;$( printf %s "$buf" | head -c $max | base64 | tr -d '\r\n' )\a"
+test -n "$TMUX" && tmux set-buffer "$buf" ||:

sglang/python/sglang.egg-info/PKG-INFO ADDED

	@@ -0,0 +1,120 @@

+Metadata-Version: 2.4
+Name: sglang
+Version: 0.5.9
+Summary: SGLang is a fast serving framework for large language models and vision language models.
+Project-URL: Homepage, https://github.com/sgl-project/sglang
+Project-URL: Bug Tracker, https://github.com/sgl-project/sglang/issues
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: Apache Software License
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+Requires-Dist: IPython
+Requires-Dist: aiohttp
+Requires-Dist: apache-tvm-ffi<0.2,>=0.1.5
+Requires-Dist: anthropic>=0.20.0
+Requires-Dist: blobfile==3.0.0
+Requires-Dist: build
+Requires-Dist: compressed-tensors
+Requires-Dist: cuda-python==12.9
+Requires-Dist: decord2
+Requires-Dist: datasets
+Requires-Dist: einops
+Requires-Dist: fastapi
+Requires-Dist: flashinfer_python==0.6.4
+Requires-Dist: flashinfer_cubin==0.6.4
+Requires-Dist: gguf
+Requires-Dist: hf_transfer
+Requires-Dist: huggingface_hub
+Requires-Dist: interegular
+Requires-Dist: llguidance<0.8.0,>=0.7.11
+Requires-Dist: modelscope
+Requires-Dist: msgspec
+Requires-Dist: ninja
+Requires-Dist: numpy
+Requires-Dist: nvidia-cutlass-dsl>=4.3.4
+Requires-Dist: nvidia-ml-py
+Requires-Dist: openai-harmony==0.0.4
+Requires-Dist: openai==2.6.1
+Requires-Dist: orjson
+Requires-Dist: outlines==0.1.11
+Requires-Dist: packaging
+Requires-Dist: partial_json_parser
+Requires-Dist: pillow
+Requires-Dist: prometheus-client>=0.20.0
+Requires-Dist: psutil
+Requires-Dist: py-spy
+Requires-Dist: pybase64
+Requires-Dist: pydantic
+Requires-Dist: python-multipart
+Requires-Dist: pyzmq>=25.1.2
+Requires-Dist: quack-kernels==0.2.4
+Requires-Dist: requests
+Requires-Dist: scipy
+Requires-Dist: sentencepiece
+Requires-Dist: setproctitle
+Requires-Dist: sgl-fa4==4.0.3
+Requires-Dist: sgl-kernel==0.3.21
+Requires-Dist: soundfile==0.13.1
+Requires-Dist: tiktoken
+Requires-Dist: timm==1.0.16
+Requires-Dist: torch_memory_saver==0.0.9
+Requires-Dist: torch==2.9.1
+Requires-Dist: torchao==0.9.0
+Requires-Dist: torchaudio==2.9.1
+Requires-Dist: torchcodec==0.8.0; sys_platform != "linux" or (sys_platform == "linux" and platform_machine != "aarch64" and platform_machine != "arm64" and platform_machine != "armv7l")
+Requires-Dist: torchvision
+Requires-Dist: tqdm
+Requires-Dist: transformers==4.57.1
+Requires-Dist: uvicorn
+Requires-Dist: uvloop
+Requires-Dist: watchfiles
+Requires-Dist: xgrammar==0.1.27
+Requires-Dist: smg-grpc-proto>=0.4.1
+Requires-Dist: grpcio>=1.78.0
+Requires-Dist: grpcio-reflection>=1.78.0
+Requires-Dist: grpcio-health-checking>=1.78.0
+Provides-Extra: checkpoint-engine
+Requires-Dist: checkpoint-engine==0.1.2; extra == "checkpoint-engine"
+Provides-Extra: diffusion
+Requires-Dist: PyYAML==6.0.1; extra == "diffusion"
+Requires-Dist: cloudpickle==3.1.2; extra == "diffusion"
+Requires-Dist: diffusers==0.36.0; extra == "diffusion"
+Requires-Dist: imageio==2.36.0; extra == "diffusion"
+Requires-Dist: imageio-ffmpeg==0.5.1; extra == "diffusion"
+Requires-Dist: moviepy>=2.0.0; extra == "diffusion"
+Requires-Dist: opencv-python-headless==4.10.0.84; extra == "diffusion"
+Requires-Dist: remote-pdb==2.1.0; extra == "diffusion"
+Requires-Dist: st_attn==0.0.7; (platform_machine != "aarch64" and platform_machine != "arm64") and extra == "diffusion"
+Requires-Dist: vsa==0.0.4; (platform_machine != "aarch64" and platform_machine != "arm64") and extra == "diffusion"
+Requires-Dist: runai_model_streamer>=0.15.5; extra == "diffusion"
+Requires-Dist: cache-dit==1.2.3; extra == "diffusion"
+Requires-Dist: addict==2.4.0; extra == "diffusion"
+Requires-Dist: av==16.1.0; extra == "diffusion"
+Requires-Dist: scikit-image==0.25.2; extra == "diffusion"
+Requires-Dist: trimesh>=4.0.0; extra == "diffusion"
+Requires-Dist: xatlas; extra == "diffusion"
+Provides-Extra: ray
+Requires-Dist: ray[default]>=2.54.0; extra == "ray"
+Provides-Extra: tracing
+Requires-Dist: opentelemetry-api; extra == "tracing"
+Requires-Dist: opentelemetry-exporter-otlp; extra == "tracing"
+Requires-Dist: opentelemetry-exporter-otlp-proto-grpc; extra == "tracing"
+Requires-Dist: opentelemetry-sdk; extra == "tracing"
+Provides-Extra: test
+Requires-Dist: accelerate; extra == "test"
+Requires-Dist: bitsandbytes; extra == "test"
+Requires-Dist: expecttest; extra == "test"
+Requires-Dist: jsonlines; extra == "test"
+Requires-Dist: lm-eval[api]>=0.4.9.2; extra == "test"
+Requires-Dist: matplotlib; extra == "test"
+Requires-Dist: pandas; extra == "test"
+Requires-Dist: parameterized; extra == "test"
+Requires-Dist: peft; extra == "test"
+Requires-Dist: pytest; extra == "test"
+Requires-Dist: sentence_transformers; extra == "test"
+Requires-Dist: tabulate; extra == "test"
+Provides-Extra: dev
+Requires-Dist: sglang[test]; extra == "dev"
+Provides-Extra: all
+Requires-Dist: sglang[diffusion]; extra == "all"
+Requires-Dist: sglang[tracing]; extra == "all"

sglang/python/sglang.egg-info/SOURCES.txt ADDED

The diff for this file is too large to render.

sglang/python/sglang.egg-info/dependency_links.txt ADDED

	@@ -0,0 +1 @@


+

sglang/python/sglang.egg-info/entry_points.txt ADDED

	@@ -0,0 +1,2 @@


+[console_scripts]
+sglang = sglang.cli.main:main
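
The `console_scripts` entry above maps the `sglang` command to the `main` callable in the `sglang.cli.main` module. As a rough sketch of how a `name = module:attr` spec of this kind can be resolved, the snippet below splits the line and imports the target. The helper name `resolve_entry_point` is hypothetical, and the demo resolves a standard-library target so that it runs without sglang installed; real installers generate a wrapper script rather than calling a helper like this.

```python
import importlib


def resolve_entry_point(spec: str):
    """Split a 'name = module:attr' entry-point line and import the target."""
    name, _, target = (part.strip() for part in spec.partition("="))
    module_name, _, attr_path = target.partition(":")
    obj = importlib.import_module(module_name)
    # Walk dotted attributes, e.g. "path.join".
    for attr in attr_path.split("."):
        obj = getattr(obj, attr)
    return name, obj


# Demo with a stdlib target; "sglang = sglang.cli.main:main" would be
# resolved the same way if the package were importable.
name, func = resolve_entry_point("dumps = json:dumps")
print(name, func({"ok": True}))
```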

sglang/python/sglang.egg-info/requires.txt ADDED

	@@ -0,0 +1,121 @@

+IPython
+aiohttp
+apache-tvm-ffi<0.2,>=0.1.5
+anthropic>=0.20.0
+blobfile==3.0.0
+build
+compressed-tensors
+cuda-python==12.9
+decord2
+datasets
+einops
+fastapi
+flashinfer_python==0.6.4
+flashinfer_cubin==0.6.4
+gguf
+hf_transfer
+huggingface_hub
+interegular
+llguidance<0.8.0,>=0.7.11
+modelscope
+msgspec
+ninja
+numpy
+nvidia-cutlass-dsl>=4.3.4
+nvidia-ml-py
+openai-harmony==0.0.4
+openai==2.6.1
+orjson
+outlines==0.1.11
+packaging
+partial_json_parser
+pillow
+prometheus-client>=0.20.0
+psutil
+py-spy
+pybase64
+pydantic
+python-multipart
+pyzmq>=25.1.2
+quack-kernels==0.2.4
+requests
+scipy
+sentencepiece
+setproctitle
+sgl-fa4==4.0.3
+sgl-kernel==0.3.21
+soundfile==0.13.1
+tiktoken
+timm==1.0.16
+torch_memory_saver==0.0.9
+torch==2.9.1
+torchao==0.9.0
+torchaudio==2.9.1
+torchvision
+tqdm
+transformers==4.57.1
+uvicorn
+uvloop
+watchfiles
+xgrammar==0.1.27
+smg-grpc-proto>=0.4.1
+grpcio>=1.78.0
+grpcio-reflection>=1.78.0
+grpcio-health-checking>=1.78.0
+[:sys_platform != "linux" or (sys_platform == "linux" and platform_machine != "aarch64" and platform_machine != "arm64" and platform_machine != "armv7l")]
+torchcodec==0.8.0
+[all]
+sglang[diffusion]
+sglang[tracing]
+[checkpoint-engine]
+checkpoint-engine==0.1.2
+[dev]
+sglang[test]
+[diffusion]
+PyYAML==6.0.1
+cloudpickle==3.1.2
+diffusers==0.36.0
+imageio==2.36.0
+imageio-ffmpeg==0.5.1
+moviepy>=2.0.0
+opencv-python-headless==4.10.0.84
+remote-pdb==2.1.0
+runai_model_streamer>=0.15.5
+cache-dit==1.2.3
+addict==2.4.0
+av==16.1.0
+scikit-image==0.25.2
+trimesh>=4.0.0
+xatlas
+[diffusion:platform_machine != "aarch64" and platform_machine != "arm64"]
+st_attn==0.0.7
+vsa==0.0.4
+[ray]
+ray[default]>=2.54.0
+[test]
+accelerate
+bitsandbytes
+expecttest
+jsonlines
+lm-eval[api]>=0.4.9.2
+matplotlib
+pandas
+parameterized
+peft
+pytest
+sentence_transformers
+tabulate
+[tracing]
+opentelemetry-api
+opentelemetry-exporter-otlp
+opentelemetry-exporter-otlp-proto-grpc
+opentelemetry-sdk
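
The `requires.txt` layout above lists base requirements first, then groups the rest under `[extra]` or `[extra:marker]` headers (an empty extra with a marker, as in `[:sys_platform != ...]`, means a conditional base requirement). A minimal sketch of reading that layout follows; the function name `parse_requires` is hypothetical and the sample text is a shortened, simplified excerpt rather than the full file.

```python
from collections import defaultdict


def parse_requires(text: str) -> dict:
    """Group requirement lines by their '[extra:marker]' section header.

    Keys are (extra, marker) tuples; base requirements use ("", "").
    """
    sections = defaultdict(list)
    key = ("", "")
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # Split only on the first ':' so the marker text stays intact.
            extra, _, marker = line[1:-1].partition(":")
            key = (extra, marker)
        else:
            sections[key].append(line)
    return dict(sections)


sample = """\
requests
torch==2.9.1
[:sys_platform != "linux"]
torchcodec==0.8.0
[diffusion]
diffusers==0.36.0
[diffusion:platform_machine != "aarch64" and platform_machine != "arm64"]
st_attn==0.0.7
"""
parsed = parse_requires(sample)
for (extra, marker), reqs in parsed.items():
    print(repr(extra), repr(marker), reqs)
```

Requirement lines such as `ray[default]>=2.54.0` are not mistaken for headers because they do not start with `[`.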

sglang/python/sglang.egg-info/top_level.txt ADDED

	@@ -0,0 +1 @@


+sglang

sglang/python/sglang/README.md ADDED

	@@ -0,0 +1,18 @@

+# Code Structure
+- `eval`: The evaluation utilities.
+- `lang`: The frontend language.
+- `multimodal_gen`: Inference framework for accelerated image/video generation.
+- `srt`: The backend engine for running local models (SRT = SGLang Runtime).
+- `test`: The test utilities.
+- `api.py`: The public APIs.
+- `bench_offline_throughput.py`: Benchmark the performance in the offline mode.
+- `bench_one_batch.py`: Benchmark the latency of running a single static batch without a server.
+- `bench_one_batch_server.py`: Benchmark the latency of running a single batch with a server.
+- `bench_serving.py`: Benchmark online serving with dynamic requests.
+- `check_env.py`: Check the environment variables and dependencies.
+- `global_config.py`: The global configs and constants.
+- `launch_server.py`: The entry point for launching a local server.
+- `profiler.py`: The profiling entry point to send profile requests.
+- `utils.py`: Common utilities.
+- `version.py`: Version info.

sglang/python/sglang/__init__.py ADDED

	@@ -0,0 +1,83 @@

+# SGLang public APIs
+# Frontend Language APIs
+from sglang.global_config import global_config
+from sglang.lang.api import (
+    Engine,
+    Runtime,
+    assistant,
+    assistant_begin,
+    assistant_end,
+    flush_cache,
+    function,
+    gen,
+    gen_int,
+    gen_string,
+    get_server_info,
+    image,
+    select,
+    separate_reasoning,
+    set_default_backend,
+    system,
+    system_begin,
+    system_end,
+    user,
+    user_begin,
+    user_end,
+    video,
+)
+from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint
+from sglang.lang.choices import (
+    greedy_token_selection,
+    token_length_normalized,
+    unconditional_likelihood_normalized,
+)
+# Lazy import some libraries
+from sglang.utils import LazyImport
+from sglang.version import __version__
+Anthropic = LazyImport("sglang.lang.backend.anthropic", "Anthropic")
+LiteLLM = LazyImport("sglang.lang.backend.litellm", "LiteLLM")
+OpenAI = LazyImport("sglang.lang.backend.openai", "OpenAI")
+VertexAI = LazyImport("sglang.lang.backend.vertexai", "VertexAI")
+# Runtime Engine APIs
+ServerArgs = LazyImport("sglang.srt.server_args", "ServerArgs")
+Engine = LazyImport("sglang.srt.entrypoints.engine", "Engine")
+__all__ = [
+    "Engine",
+    "Runtime",
+    "assistant",
+    "assistant_begin",
+    "assistant_end",
+    "flush_cache",
+    "function",
+    "gen",
+    "gen_int",
+    "gen_string",
+    "get_server_info",
+    "image",
+    "select",
+    "separate_reasoning",
+    "set_default_backend",
+    "system",
+    "system_begin",
+    "system_end",
+    "user",
+    "user_begin",
+    "user_end",
+    "video",
+    "RuntimeEndpoint",
+    "greedy_token_selection",
+    "token_length_normalized",
+    "unconditional_likelihood_normalized",
+    "ServerArgs",
+    "Anthropic",
+    "LiteLLM",
+    "OpenAI",
+    "VertexAI",
+    "global_config",
+    "__version__",
+]
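
`__init__.py` wraps the optional backends (`Anthropic`, `LiteLLM`, `OpenAI`, `VertexAI`) and the runtime classes in `LazyImport`, so their modules are only imported on first use. The real `LazyImport` lives in `sglang.utils` and is not shown in this diff; the following is only a minimal sketch of the general pattern, with the class name `LazyAttr` chosen for illustration and a standard-library target used for the demo.

```python
import importlib


class LazyAttr:
    """Defer importing `module_name` until the wrapped attribute is first used."""

    def __init__(self, module_name: str, attr_name: str):
        self._module_name = module_name
        self._attr_name = attr_name
        self._target = None

    def _load(self):
        if self._target is None:
            module = importlib.import_module(self._module_name)
            self._target = getattr(module, self._attr_name)
        return self._target

    def __getattr__(self, name):
        # Only called for attributes not found normally, so the private
        # fields set in __init__ never reach this method.
        return getattr(self._load(), name)

    def __call__(self, *args, **kwargs):
        return self._load()(*args, **kwargs)


# Nothing is imported here; `fractions` is loaded by the first call below.
Fraction = LazyAttr("fractions", "Fraction")
print(Fraction(1, 3) + Fraction(1, 6))
```

The trade-off of this pattern is that a missing optional dependency surfaces as an `ImportError` at first use instead of at `import sglang` time.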

sglang/python/sglang/__pycache__/__init__.cpython-311.pyc ADDED

Binary file (2.08 kB).

sglang/python/sglang/__pycache__/_version.cpython-311.pyc ADDED

Binary file (872 Bytes).

sglang/python/sglang/__pycache__/bench_serving.cpython-311.pyc ADDED

Binary file (95.3 kB).

sglang/python/sglang/__pycache__/check_env.cpython-311.pyc ADDED

Binary file (24.9 kB).

sglang/python/sglang/__pycache__/global_config.cpython-311.pyc ADDED

Binary file (969 Bytes).

sglang/python/sglang/__pycache__/launch_server.cpython-311.pyc ADDED

Binary file (2.62 kB).

sglang/python/sglang/__pycache__/utils.cpython-311.pyc ADDED

Binary file (34.6 kB).

sglang/python/sglang/__pycache__/version.cpython-311.pyc ADDED

Binary file (1.31 kB).

sglang/python/sglang/_version.py ADDED

	@@ -0,0 +1,34 @@

+# file generated by setuptools-scm
+# don't change, don't track in version control
+__all__ = [
+    "__version__",
+    "__version_tuple__",
+    "version",
+    "version_tuple",
+    "__commit_id__",
+    "commit_id",
+]
+TYPE_CHECKING = False
+if TYPE_CHECKING:
+    from typing import Tuple
+    from typing import Union
+    VERSION_TUPLE = Tuple[Union[int, str], ...]
+    COMMIT_ID = Union[str, None]
+else:
+    VERSION_TUPLE = object
+    COMMIT_ID = object
+version: str
+__version__: str
+__version_tuple__: VERSION_TUPLE
+version_tuple: VERSION_TUPLE
+commit_id: COMMIT_ID
+__commit_id__: COMMIT_ID
+__version__ = version = '0.5.9'
+__version_tuple__ = version_tuple = (0, 5, 9)
+__commit_id__ = commit_id = 'gbbe9c7eeb'

sglang/python/sglang/bench_offline_throughput.py ADDED

	@@ -0,0 +1,543 @@

+"""
+Benchmark the throughput in the offline mode.
+It accepts server arguments (the same as launch_server.py) and benchmark arguments (the same as bench_serving.py).
+# Usage
+## ShareGPT dataset with default args
+python -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10
+## Random dataset with default args
+python -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dataset-name random --random-input 1024 --random-output 1024
+"""
+import argparse
+import asyncio
+import dataclasses
+import inspect
+import json
+import logging
+import os
+import random
+import time
+from typing import Dict, List, Optional
+import numpy as np
+from sglang.benchmark.datasets import DatasetRow, get_dataset
+from sglang.benchmark.datasets.random import sample_random_requests
+from sglang.benchmark.utils import get_tokenizer, set_ulimit
+from sglang.lang.backend.runtime_endpoint import Runtime
+from sglang.srt.entrypoints.engine import Engine
+from sglang.srt.server_args import ServerArgs
+@dataclasses.dataclass
+class BenchArgs:
+    backend: str = "engine"
+    result_filename: str = ""
+    dataset_name: str = "sharegpt"
+    dataset_path: str = ""
+    num_prompts: int = 1000
+    sharegpt_output_len: Optional[int] = None
+    sharegpt_context_len: Optional[int] = None
+    random_input_len: int = 1024
+    random_output_len: int = 1024
+    random_range_ratio: float = 0.0
+    gsp_num_groups: int = 64
+    gsp_prompts_per_group: int = 16
+    gsp_system_prompt_len: int = 2048
+    gsp_question_len: int = 128
+    gsp_output_len: int = 256
+    seed: int = 1
+    disable_ignore_eos: bool = False
+    extra_request_body: Optional[str] = None
+    apply_chat_template: bool = False
+    profile: bool = False
+    skip_warmup: bool = False
+    do_not_exit: bool = False
+    prompt_suffix: str = ""
+    return_logprob: bool = False
+    logprob_start_len: int = -1
+    @staticmethod
+    def add_cli_args(parser: argparse.ArgumentParser):
+        parser.add_argument("--backend", type=str, default=BenchArgs.backend)
+        parser.add_argument(
+            "--result-filename", type=str, default=BenchArgs.result_filename
+        )
+        parser.add_argument(
+            "--dataset-name",
+            type=str,
+            default="sharegpt",
+            choices=["sharegpt", "random", "generated-shared-prefix"],
+            help="Name of the dataset to benchmark on.",
+        )
+        parser.add_argument(
+            "--dataset-path", type=str, default="", help="Path to the dataset."
+        )
+        parser.add_argument(
+            "--num-prompts",
+            type=int,
+            default=BenchArgs.num_prompts,
+            help="Number of prompts to process. Default is 1000.",
+        )
+        parser.add_argument(
+            "--sharegpt-output-len",
+            type=int,
+            default=BenchArgs.sharegpt_output_len,
+            help="Output length for each request. Overrides the output length from the ShareGPT dataset.",
+        )
+        parser.add_argument(
+            "--sharegpt-context-len",
+            type=int,
+            default=BenchArgs.sharegpt_context_len,
+            help="The context length of the model for the ShareGPT dataset. Requests longer than the context length will be dropped.",
+        )
+        parser.add_argument(
+            "--random-input-len",
+            type=int,
+            default=BenchArgs.random_input_len,
+            help="Number of input tokens per request, used only for random dataset.",
+        )
+        parser.add_argument(
+            "--random-output-len",
+            type=int,
+            default=BenchArgs.random_output_len,
+            help="Number of output tokens per request, used only for random dataset.",
+        )
+        parser.add_argument(
+            "--random-range-ratio",
+            type=float,
+            default=BenchArgs.random_range_ratio,
+            help="Range of sampled ratio of input/output length, "
+            "used only for random dataset.",
+        )
+        parser.add_argument(
+            "--gsp-num-groups",
+            type=int,
+            default=BenchArgs.gsp_num_groups,
+            help="Number of groups with shared prefix, used "
+            "only for generated-shared-prefix",
+        )
+        parser.add_argument(
+            "--gsp-prompts-per-group",
+            type=int,
+            default=BenchArgs.gsp_prompts_per_group,
+            help="Number of prompts per group of shared prefix, used "
+            "only for generated-shared-prefix",
+        )
+        parser.add_argument(
+            "--gsp-system-prompt-len",
+            type=int,
+            default=BenchArgs.gsp_system_prompt_len,
+            help="System prompt length, used only for generated-shared-prefix",
+        )
+        parser.add_argument(
+            "--gsp-question-len",
+            type=int,
+            default=BenchArgs.gsp_question_len,
+            help="Question length, used only for generated-shared-prefix",
+        )
+        parser.add_argument(
+            "--gsp-output-len",
+            type=int,
+            default=BenchArgs.gsp_output_len,
+            help="Target length in tokens for outputs in generated-shared-prefix dataset",
+        )
+        parser.add_argument("--seed", type=int, default=1, help="The random seed.")
+        parser.add_argument(
+            "--disable-ignore-eos",
+            action="store_true",
+            help="Disable ignore EOS token",
+        )
+        parser.add_argument(
+            "--extra-request-body",
+            metavar='{"key1": "value1", "key2": "value2"}',
+            type=str,
+            default=BenchArgs.extra_request_body,
+            help="Append given JSON object to the request payload. You can use this to specify "
+            "additional generation parameters such as sampling params.",
+        )
+        parser.add_argument(
+            "--apply-chat-template",
+            action="store_true",
+            help="Apply chat template",
+        )
+        parser.add_argument(
+            "--profile",
+            action="store_true",
+            help="Use Torch Profiler. The endpoint must be launched with "
+            "SGLANG_TORCH_PROFILER_DIR to enable profiler.",
+        )
+        parser.add_argument(
+            "--skip-warmup",
+            action="store_true",
+            help="Skip the warmup batches.",
+        )
+        parser.add_argument(
+            "--do-not-exit",
+            action="store_true",
+            help="Do not exit the program. This is useful for nsys profile with --duration and --delay.",
+        )
+        parser.add_argument(
+            "--prompt-suffix",
+            type=str,
+            default="",
+            help="Suffix applied to the end of all user prompts, followed by assistant prompt suffix.",
+        )
+        parser.add_argument(
+            "--return-logprob",
+            action="store_true",
+            help="Enable returning log probabilities.",
+        )
+        parser.add_argument(
+            "--logprob-start-len",
+            type=int,
+            default=-1,
+            help="Start length for logprob. -1 means only return logprobs for output tokens (default). 0 means return logprobs for all tokens including input.",
+        )
+    @classmethod
+    def from_cli_args(cls, args: argparse.Namespace):
+        attrs = [attr.name for attr in dataclasses.fields(cls)]
+        return cls(**{attr: getattr(args, attr) for attr in attrs})
+def throughput_test_once(
+    backend_name: str,
+    backend,
+    reqs: List[DatasetRow],
+    ignore_eos: bool,
+    extra_request_body: Dict,
+    profile: bool,
+    return_logprob: bool = False,
+    logprob_start_len: int = -1,
+):
+    measurement_results = {
+        "backend": backend_name,
+        "successful_requests": len(reqs),
+        "total_latency": -1,
+        "total_input_tokens": sum(r.prompt_len for r in reqs),
+        "total_output_tokens": -1,
+        "request_throughput": -1,
+        "input_throughput": -1,
+        "output_throughput": -1,
+        "total_throughput": -1,
+    }
+    prompt = [r.prompt for r in reqs]
+    sampling_params = [
+        {
+            "temperature": 0,
+            "max_new_tokens": r.output_len,
+            "ignore_eos": ignore_eos,
+            **extra_request_body,
+        }
+        for r in reqs
+    ]
+    if profile:
+        assert (
+            "SGLANG_TORCH_PROFILER_DIR" in os.environ
+        ), "Please set SGLANG_TORCH_PROFILER_DIR."
+        os.makedirs(os.environ["SGLANG_TORCH_PROFILER_DIR"], exist_ok=True)
+        backend.start_profile()
+    st = time.perf_counter()
+    gen_out = backend.generate(
+        prompt=prompt,
+        sampling_params=sampling_params,
+        return_logprob=return_logprob,
+        logprob_start_len=logprob_start_len,
+    )
+    latency = time.perf_counter() - st
+    if profile:
+        profile_dir = os.getenv("SGLANG_TORCH_PROFILER_DIR")
+        known_files = set(os.listdir(profile_dir))
+        backend.stop_profile()
+        monitor_trace_file(known_files, profile_dir)
+    if backend_name == "runtime":
+        gen_out = json.loads(gen_out)
+    server_info = backend.get_server_info()
+    measurement_results["total_latency"] = latency
+    measurement_results["total_output_tokens"] = sum(
+        o["meta_info"]["completion_tokens"] for o in gen_out
+    )
+    measurement_results["request_throughput"] = (
+        measurement_results["successful_requests"] / latency
+    )
+    measurement_results["input_throughput"] = (
+        measurement_results["total_input_tokens"] / latency
+    )
+    measurement_results["output_throughput"] = (
+        measurement_results["total_output_tokens"] / latency
+    )
+    measurement_results["total_throughput"] = (
+        measurement_results["total_input_tokens"]
+        + measurement_results["total_output_tokens"]
+    ) / latency
+    if inspect.isawaitable(server_info):
+        server_info = asyncio.run(server_info)
+    measurement_results["last_gen_throughput"] = server_info["internal_states"][0][
+        "last_gen_throughput"
+    ]
+    return measurement_results
+def monitor_trace_file(known_files, directory, interval=1):
+    print(f"Monitoring {directory} for new trace files...")
+    while True:
+        flag = False
+        time.sleep(interval)
+        current_files = set(os.listdir(directory))
+        new_files = current_files - known_files
+        for new_file in new_files:
+            new_file_path = os.path.join(directory, new_file)
+            print(f"New file detected: {new_file}")
+            previous_size = 0
+            while True:
+                try:
+                    current_size = os.path.getsize(new_file_path)
+                except FileNotFoundError:
+                    print(f"File {new_file} is no longer accessible.")
+                    break
+                if current_size > previous_size:
+                    previous_size = current_size
+                else:
+                    flag = True
+                    break
+                time.sleep(interval)
+        if flag:
+            break
+def _create_ray_engine_backend(server_args: ServerArgs):
+    """Create a RayEngine inside a Ray actor on a placement group.
+    RayEngine requires a placement group, so we launch it inside a Ray actor
+    and return a lightweight proxy that forwards calls via ray.get().
+    """
+    import ray
+    from ray.runtime_env import RuntimeEnv
+    from ray.util.placement_group import placement_group
+    from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
+    env_vars = {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1"}
+    if os.environ.get("HF_TOKEN"):
+        env_vars["HF_TOKEN"] = os.environ["HF_TOKEN"]
+    if not ray.is_initialized():
+        ray.init(runtime_env=RuntimeEnv(env_vars=env_vars))
+    total_gpus = server_args.tp_size * server_args.pp_size
+    pg = placement_group([{"CPU": 1, "GPU": total_gpus}], strategy="STRICT_PACK")
+    ray.get(pg.ready())
+    @ray.remote
+    class _EngineActor:
+        def __init__(self, **kwargs):
+            from sglang.srt.ray.engine import RayEngine
+            self.engine = RayEngine(**kwargs)
+        def call(self, method, **kwargs):
+            return getattr(self.engine, method)(**kwargs)
+    actor = _EngineActor.options(
+        num_cpus=1,
+        num_gpus=0,
+        scheduling_strategy=PlacementGroupSchedulingStrategy(
+            placement_group=pg,
+            placement_group_bundle_index=0,
+        ),
+    ).remote(**dataclasses.asdict(server_args))
+    class _Proxy:
+        """Forwards method calls to the remote RayEngine actor."""
+        def generate(self, **kwargs):
+            return ray.get(actor.call.remote("generate", **kwargs))
+        def get_server_info(self, **kwargs):
+            return ray.get(actor.call.remote("get_server_info", **kwargs))
+        def start_profile(self, **kwargs):
+            return ray.get(actor.call.remote("start_profile", **kwargs))
+        def stop_profile(self, **kwargs):
+            return ray.get(actor.call.remote("stop_profile", **kwargs))
+        def shutdown(self):
+            try:
+                ray.get(actor.call.remote("shutdown"), timeout=60)
+            except Exception:
+                pass
+            try:
+                ray.util.remove_placement_group(pg)
+            except Exception:
+                pass
+    return _Proxy()
+def throughput_test(
+    server_args: ServerArgs,
+    bench_args: BenchArgs,
+):
+    if bench_args.backend == "engine":
+        if server_args.use_ray:
+            backend = _create_ray_engine_backend(server_args)
+        else:
+            backend = Engine(**dataclasses.asdict(server_args))
+        if not backend:
+            raise ValueError("Please provide valid engine arguments")
+    elif bench_args.backend == "runtime":
+        backend = Runtime(**dataclasses.asdict(server_args))
+    else:
+        raise ValueError('Please set backend to either "engine" or "runtime"')
+    tokenizer_id = server_args.tokenizer_path or server_args.model_path
+    tokenizer = get_tokenizer(tokenizer_id)
+    # Set global environments
+    set_ulimit()
+    random.seed(bench_args.seed)
+    np.random.seed(bench_args.seed)
+    # Parse args
+    extra_request_body = {}
+    if bench_args.extra_request_body:
+        extra_request_body = json.loads(bench_args.extra_request_body)
+    # Read dataset
+    input_requests = get_dataset(bench_args, tokenizer)
+    warmup_requests = sample_random_requests(
+        input_len=256,
+        output_len=16,
+        num_prompts=min(bench_args.num_prompts, 16),
+        range_ratio=1.0,
+        tokenizer=tokenizer,
+        dataset_path=bench_args.dataset_path,
+    )
+    # Warm up
+    if not bench_args.skip_warmup:
+        logging.info("\nWarmup...")
+        throughput_test_once(
+            backend_name=bench_args.backend,
+            backend=backend,
+            reqs=warmup_requests,
+            ignore_eos=not bench_args.disable_ignore_eos,
+            extra_request_body=extra_request_body,
+            profile=False,
+            return_logprob=bench_args.return_logprob,
+            logprob_start_len=bench_args.logprob_start_len,
+        )
+        time.sleep(0.5)
+    logging.info("\nBenchmark...")
+    result = throughput_test_once(
+        backend_name=bench_args.backend,
+        backend=backend,
+        reqs=input_requests,
+        ignore_eos=not bench_args.disable_ignore_eos,
+        extra_request_body=extra_request_body,
+        profile=bench_args.profile,
+        return_logprob=bench_args.return_logprob,
+        logprob_start_len=bench_args.logprob_start_len,
+    )
+    backend.shutdown()
+    if bench_args.result_filename:
+        with open(bench_args.result_filename, "a") as fout:
+            fout.write(json.dumps(result) + "\n")
+    print(
+        "\n{s:{c}^{n}}".format(s=" Offline Throughput Benchmark Result ", n=50, c="=")
+    )
+    print("{:<40} {:<10}".format("Backend:", result["backend"]))
+    print("{:<40} {:<10}".format("Successful requests:", result["successful_requests"]))
+    print("{:<40} {:<10.2f}".format("Benchmark duration (s):", result["total_latency"]))
+    print("{:<40} {:<10}".format("Total input tokens:", result["total_input_tokens"]))
+    print(
+        "{:<40} {:<10}".format("Total generated tokens:", result["total_output_tokens"])
+    )
+    print(
+        "{:<40} {:<10.2f}".format(
+            "Last generation throughput (tok/s):", result["last_gen_throughput"]
+        )
+    )
+    print(
+        "{:<40} {:<10.2f}".format(
+            "Request throughput (req/s):", result["request_throughput"]
+        )
+    )
+    print(
+        "{:<40} {:<10.2f}".format(
+            "Input token throughput (tok/s):", result["input_throughput"]
+        )
+    )
+    print(
+        "{:<40} {:<10.2f}".format(
+            "Output token throughput (tok/s):", result["output_throughput"]
+        )
+    )
+    print(
+        "{:<40} {:<10.2f}".format(
+            "Total token throughput (tok/s):", result["total_throughput"]
+        )
+    )
+    print("=" * 50)
+    return result
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    ServerArgs.add_cli_args(parser)
+    BenchArgs.add_cli_args(parser)
+    args = parser.parse_args()
+    # handling ModelScope model downloads
+    if os.getenv("SGLANG_USE_MODELSCOPE", "false").lower() in ("true", "1"):
+        if os.path.exists(args.model_path):
+            print(f"Using local model path: {args.model_path}")
+        else:
+            try:
+                from modelscope import snapshot_download
+                print(f"Using ModelScope to download model: {args.model_path}")
+                # download the model and replace args.model_path
+                args.model_path = snapshot_download(
+                    args.model_path,
+                )
+                print(f"Model downloaded to: {args.model_path}")
+            except Exception as e:
+                print(f"ModelScope download failed: {str(e)}")
+                raise e
+    server_args = ServerArgs.from_cli_args(args)
+    bench_args = BenchArgs.from_cli_args(args)
+    logging.basicConfig(
+        level=getattr(logging, server_args.log_level.upper()),
+        format="%(message)s",
+    )
+    throughput_test(server_args, bench_args)
+    while bench_args.do_not_exit:
+        # Sleep instead of spinning so the idle process does not burn a CPU core.
+        time.sleep(1)
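
The four throughput figures reported by `throughput_test_once` are plain ratios of request or token counts to the wall-clock latency of a single `generate` call. The arithmetic can be sketched as below; the function name `summarize_throughput` is hypothetical and the numbers are illustrative, not measured results.

```python
def summarize_throughput(num_requests, input_tokens, output_tokens, latency_s):
    """Compute the four ratios shown in the benchmark summary."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return {
        "request_throughput": num_requests / latency_s,
        "input_throughput": input_tokens / latency_s,
        "output_throughput": output_tokens / latency_s,
        "total_throughput": (input_tokens + output_tokens) / latency_s,
    }


# Illustrative numbers only.
stats = summarize_throughput(
    num_requests=10, input_tokens=4000, output_tokens=1000, latency_s=2.0
)
for key, value in stats.items():
    print("{:<40} {:<10.2f}".format(key + ":", value))
```

Because prefill and decode share one latency measurement, the input and output figures are not independent rates; the separately reported `last_gen_throughput` comes from the server's internal state instead.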

sglang/python/sglang/bench_one_batch.py ADDED

	@@ -0,0 +1,837 @@

+"""
+Benchmark the latency of running a single static batch without a server.
+This script does not launch a server and uses the low-level APIs.
+It accepts server arguments (the same as launch_server.py) and benchmark arguments (e.g., batch size, input lengths).
+# Usage (latency test)
+## with dummy weights:
+python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3-8B-Instruct --load-format dummy
+## sweep through multiple data points and store (append) the results in a jsonl file:
+python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 1 12 14 --input-len 256 512 --output-len 32 256 --run-name test_run
+## run with profiling:
+python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 1 12 14 --input-len 256 512 --profile
+## run with profiling to custom directory:
+export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
+python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 1 --input-len 256 --profile
+## run with CUDA profiler (nsys):
+nsys profile --force-overwrite=true -o bench_one_batch python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 1 --input-len 256 --profile --profile-activities CUDA_PROFILER
+# Usage (correctness test):
+python -m sglang.bench_one_batch --model-path TinyLlama/TinyLlama-1.1B-Chat-v0.4 --correct
+## Reference output (of the correctness test above; may be GPU-dependent):
+input_ids=[[1, 450, 7483, 310, 3444, 338], [1, 450, 7483, 310, 278, 3303, 13187, 290, 338], [1, 20628, 338, 263, 6575, 1460, 2462, 322, 306, 763]]
+prefill logits (first half): tensor([[-10.0312,  -9.5000,   0.8931,  ...,  -4.9414,  -3.2422,  -3.3633],
+        [-10.0312,  -9.5000,   0.8931,  ...,  -4.9414,  -3.2422,  -3.3633],
+        [ -9.1875, -10.2500,   2.7129,  ...,  -4.3359,  -4.0664,  -4.1328]],
+       device='cuda:0')
+prefill logits (final): tensor([[-8.3125, -7.1172,  3.3457,  ..., -4.9570, -4.1328, -3.4141],
+        [-8.9141, -9.0156,  4.1445,  ..., -4.9922, -4.4961, -4.0781],
+        [-9.6328, -9.0547,  4.0195,  ..., -5.3047, -4.7148, -4.4570]],
+       device='cuda:0')
+========== Prompt 0 ==========
+<s> The capital of France is Paris.
+The capital of the United States is Washington, D.C.
+========== Prompt 1 ==========
+<s> The capital of the United Kindom is London.
+The capital of the United Kingdom is London.
+The capital of the
+========== Prompt 2 ==========
+<s> Today is a sunny day and I like to go for a walk in the park.
+I'm going to the park
+"""
+import argparse
+import copy
+import dataclasses
+import itertools
+import json
+import logging
+import multiprocessing
+import os
+import time
+from types import SimpleNamespace
+from typing import Optional, Tuple
+import numpy as np
+import torch
+import torch.distributed as dist
+from sglang.srt.configs.model_config import ModelConfig
+from sglang.srt.distributed.parallel_state import destroy_distributed_environment
+from sglang.srt.entrypoints.engine import _set_envs_and_config
+from sglang.srt.layers.moe import initialize_moe_config
+from sglang.srt.layers.quantization.fp4_utils import initialize_fp4_gemm_config
+from sglang.srt.layers.quantization.fp8_utils import initialize_fp8_gemm_config
+from sglang.srt.managers.schedule_batch import Req, ScheduleBatch
+from sglang.srt.managers.scheduler_dp_attn_mixin import prepare_mlp_sync_batch_raw
+from sglang.srt.model_executor.forward_batch_info import ForwardBatch
+from sglang.srt.model_executor.model_runner import ModelRunner
+from sglang.srt.sampling.sampling_params import SamplingParams
+from sglang.srt.server_args import PortArgs, ServerArgs
+from sglang.srt.speculative.spec_info import SpeculativeAlgorithm
+from sglang.srt.utils import (
+    configure_logger,
+    get_bool_env_var,
+    kill_process_tree,
+    maybe_reindex_device_id,
+    require_mlp_sync,
+    require_mlp_tp_gather,
+    set_gpu_proc_affinity,
+    suppress_other_loggers,
+)
+from sglang.srt.utils.hf_transformers_utils import get_tokenizer
+def start_profile(profile_activities, profile_record_shapes=False, rank_print=print):
+    """
+    Abstracted function to start profiling based on profile_activities.
+    Returns profiler object (or None).
+    """
+    if "CUDA_PROFILER" in profile_activities:
+        try:
+            torch.cuda.cudart().cudaProfilerStart()
+            rank_print("CUDA Profiler started (nsys will begin capturing)")
+        except Exception as e:
+            rank_print(f"Failed to start CUDA profiler: {e}")
+        return None
+    else:
+        activities = []
+        if "CPU" in profile_activities:
+            activities.append(torch.profiler.ProfilerActivity.CPU)
+        if "GPU" in profile_activities:
+            activities.append(torch.profiler.ProfilerActivity.CUDA)
+        if "XPU" in profile_activities:
+            activities.append(torch.profiler.ProfilerActivity.XPU)
+        if activities:
+            profiler = torch.profiler.profile(
+                activities=activities,
+                with_stack=True,
+                record_shapes=profile_record_shapes,
+            )
+            profiler.start()
+            return profiler
+        return None
+def stop_profile(
+    profiler,
+    profile_activities,
+    rank_print=print,
+    save_trace=False,
+    trace_filename=None,
+    stage=None,
+):
+    """
+    Abstracted function to stop profiling based on profile_activities.
+    Optionally saves trace results and prints completion messages.
+    """
+    if "CUDA_PROFILER" in profile_activities:
+        try:
+            torch.cuda.cudart().cudaProfilerStop()
+            rank_print("CUDA Profiler stopped (nsys should dump traces)")
+        except Exception as e:
+            rank_print(f"Failed to stop CUDA profiler: {e}")
+    elif profiler is not None:
+        profiler.stop()
+    if save_trace:
+        if profiler is not None:
+            if trace_filename:
+                _save_profile_trace_results(profiler, trace_filename)
+                stage_desc = f"for {stage}" if stage else ""
+                rank_print(
+                    f"torch profiler chrome trace {stage_desc} saved to {trace_filename}"
+                )
+        if "CUDA_PROFILER" in profile_activities:
+            rank_print(f"CUDA profiler trace for {stage} completed")
+@dataclasses.dataclass
+class BenchArgs:
+    run_name: str = "default"
+    batch_size: Tuple[int] = (1,)
+    input_len: Tuple[int] = (1024,)
+    output_len: Tuple[int] = (16,)
+    prompt_filename: str = ""
+    result_filename: str = "result.jsonl"
+    correctness_test: bool = False
+    # This is only used for correctness test
+    cut_len: int = 4
+    log_decode_step: int = 0
+    profile: bool = False
+    profile_record_shapes: bool = False
+    profile_activities: Tuple[str] = ("CPU", "GPU")
+    profile_stage: str = "all"
+    profile_filename_prefix: str = "profile"
+    profile_start_step: Optional[int] = None
+    profile_steps: Optional[int] = None
+    @staticmethod
+    def add_cli_args(parser: argparse.ArgumentParser):
+        parser.add_argument("--run-name", type=str, default=BenchArgs.run_name)
+        parser.add_argument(
+            "--batch-size", type=int, nargs="+", default=BenchArgs.batch_size
+        )
+        parser.add_argument(
+            "--input-len", type=int, nargs="+", default=BenchArgs.input_len
+        )
+        parser.add_argument(
+            "--output-len", type=int, nargs="+", default=BenchArgs.output_len
+        )
+        parser.add_argument(
+            "--prompt-filename", type=str, default=BenchArgs.prompt_filename
+        )
+        parser.add_argument(
+            "--result-filename", type=str, default=BenchArgs.result_filename
+        )
+        parser.add_argument("--correctness-test", action="store_true")
+        parser.add_argument("--cut-len", type=int, default=BenchArgs.cut_len)
+        parser.add_argument(
+            "--log-decode-step",
+            type=int,
+            default=BenchArgs.log_decode_step,
+            help="Log decode latency by step, default is set to zero to disable.",
+        )
+        parser.add_argument("--profile", action="store_true", help="Enable profiling.")
+        parser.add_argument(
+            "--profile-record-shapes",
+            action="store_true",
+            help="Record tensor shapes in profiling results.",
+        )
+        parser.add_argument(
+            "--profile-activities",
+            type=str,
+            nargs="+",
+            default=["CPU", "GPU"],
+            choices=["CPU", "GPU", "CUDA_PROFILER", "XPU"],
+            help="Profiler activities: CPU, GPU, XPU, CUDA_PROFILER. If CPU/GPU/XPU, use torch profiler. If CUDA_PROFILER, use CUDA profiler.",
+        )
+        parser.add_argument(
+            "--profile-stage",
+            type=str,
+            default=BenchArgs.profile_stage,
+            choices=["all", "prefill", "decode"],
+            help="Which stage to profile: all, prefill, or decode only.",
+        )
+        parser.add_argument(
+            "--profile-filename-prefix",
+            type=str,
+            default=BenchArgs.profile_filename_prefix,
+            help="Prefix of the profiling file names. The full profiling result file(s) will be named "
+            '"[profile_filename_prefix]_batch[batch_size]_input[input_len]_output[output_len]_[stage].trace.json.gz"',
+        )
+        parser.add_argument(
+            "--profile-start-step",
+            type=int,
+            default=None,
+            help="Decode step at which to start profiling (0-indexed). If not specified, defaults to output_len // 2.",
+        )
+        parser.add_argument(
+            "--profile-steps",
+            type=int,
+            default=None,
+            help="Number of decode steps to profile starting from profile-start-step. If not specified, profiles only one step.",
+        )
+    @classmethod
+    def from_cli_args(cls, args: argparse.Namespace):
+        # use the default value's type to cast the args into correct types.
+        attrs = [(attr.name, type(attr.default)) for attr in dataclasses.fields(cls)]
+        result = {}
+        for attr, attr_type in attrs:
+            value = getattr(args, attr)
+            # Handle None values - don't try to cast them
+            if value is None or attr_type == type(None):
+                result[attr] = value
+            else:
+                result[attr] = attr_type(value)
+        return cls(**result)
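The casting in `from_cli_args` can be illustrated with a minimal sketch. `DemoArgs` is a hypothetical two-field stand-in for `BenchArgs`, not part of sglang; it only shows how the type of each field's default converts the argparse output, so a list produced by `nargs="+"` becomes a tuple and a `None` default passes through unchanged.

```python
import argparse
import dataclasses
from typing import Optional, Tuple


@dataclasses.dataclass
class DemoArgs:  # hypothetical stand-in for BenchArgs
    batch_size: Tuple[int, ...] = (1,)
    profile_steps: Optional[int] = None

    @classmethod
    def from_cli_args(cls, args: argparse.Namespace):
        result = {}
        for f in dataclasses.fields(cls):
            value = getattr(args, f.name)
            cast = type(f.default)
            # Fields whose default is None (or whose value is None) are not cast.
            if value is None or cast is type(None):
                result[f.name] = value
            else:
                result[f.name] = cast(value)
        return cls(**result)


parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, nargs="+", default=DemoArgs.batch_size)
parser.add_argument("--profile-steps", type=int, default=None)
# An explicit argument list is parsed here so the sketch does not read sys.argv.
demo = DemoArgs.from_cli_args(parser.parse_args(["--batch-size", "1", "16"]))
print(demo)
```

One consequence of this design is that a field with a `None` default relies entirely on argparse's own `type=` for conversion.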
+def load_model(server_args, port_args, gpu_id, tp_rank):
+    suppress_other_loggers()
+    rank_print = print if tp_rank == 0 else lambda *args, **kwargs: None
+    moe_ep_rank = tp_rank // (server_args.tp_size // server_args.ep_size)
+    model_config = ModelConfig.from_server_args(server_args)
+    model_runner = ModelRunner(
+        model_config=model_config,
+        mem_fraction_static=server_args.mem_fraction_static,
+        gpu_id=gpu_id,
+        tp_rank=tp_rank,
+        tp_size=server_args.tp_size,
+        moe_ep_rank=moe_ep_rank,
+        moe_ep_size=server_args.ep_size,
+        pp_rank=0,
+        pp_size=1,
+        nccl_port=port_args.nccl_port,
+        server_args=server_args,
+    )
+    rank_print(f"max_total_num_tokens={model_runner.max_total_num_tokens}")
+    tokenizer = get_tokenizer(
+        server_args.tokenizer_path,
+        tokenizer_mode=server_args.tokenizer_mode,
+        trust_remote_code=server_args.trust_remote_code,
+    )
+    if server_args.tp_size > 1:
+        dist.barrier()
+    return model_runner, tokenizer
+def prepare_inputs_for_correctness_test(bench_args, tokenizer, custom_prompts):
+    prompts = (
+        custom_prompts
+        if custom_prompts
+        else [
+            "The capital of France is",
+            "The capital of the United Kingdom is",
+            "Today is a sunny day and I like",
+        ]
+    )
+    input_ids = [tokenizer.encode(p) for p in prompts]
+    sampling_params = SamplingParams(
+        temperature=0,
+        max_new_tokens=bench_args.output_len[0],
+    )
+    reqs = []
+    for i in range(len(prompts)):
+        assert len(input_ids[i]) > bench_args.cut_len
+        tmp_input_ids = input_ids[i][: bench_args.cut_len]
+        req = Req(
+            rid=i,
+            origin_input_text=prompts[i],
+            origin_input_ids=tmp_input_ids,
+            sampling_params=sampling_params,
+        )
+        req.fill_ids = req.origin_input_ids
+        req.logprob_start_len = -1
+        req.set_extend_input_len(len(req.fill_ids) - len(req.prefix_indices))
+        reqs.append(req)
+    return input_ids, reqs
+def prepare_extend_inputs_for_correctness_test(
+    bench_args, input_ids, reqs, model_runner
+):
+    for i in range(len(reqs)):
+        req: Req = reqs[i]
+        req.fill_ids += input_ids[i][bench_args.cut_len :]
+        req.prefix_indices = model_runner.req_to_token_pool.req_to_token[
+            i, : bench_args.cut_len
+        ].to(req.prefix_indices.dtype)
+        req.logprob_start_len = -1
+        req.set_extend_input_len(len(req.fill_ids) - len(req.prefix_indices))
+    return reqs
+def prepare_synthetic_inputs_for_latency_test(
+    batch_size, input_len, custom_inputs=None
+):
+    input_ids = (
+        custom_inputs
+        if custom_inputs
+        else np.random.randint(0, 10000, (batch_size, input_len), dtype=np.int32)
+    )
+    sampling_params = SamplingParams(
+        temperature=0,
+        max_new_tokens=BenchArgs.output_len,
+    )
+    reqs = []
+    for i in range(len(input_ids)):
+        req = Req(
+            rid=i,
+            origin_input_text="",
+            origin_input_ids=list(input_ids[i]),
+            sampling_params=sampling_params,
+        )
+        req.fill_ids = req.origin_input_ids
+        req.logprob_start_len = -1
+        req.set_extend_input_len(len(req.fill_ids) - len(req.prefix_indices))
+        reqs.append(req)
+    return reqs
+class TreeCacheNamespace(SimpleNamespace):
+    def supports_swa(self) -> bool:
+        return False
+    def supports_mamba(self) -> bool:
+        return False
+    def is_chunk_cache(self) -> bool:
+        return False
+    def is_tree_cache(self) -> bool:
+        return not self.is_chunk_cache()
+@torch.no_grad
+def extend(reqs, model_runner):
+    # Create dummy tree_cache for benchmarks (no prefix caching, just allocation)
+    dummy_tree_cache = TreeCacheNamespace(
+        page_size=model_runner.server_args.page_size,
+        device=model_runner.device,
+        token_to_kv_pool_allocator=model_runner.token_to_kv_pool_allocator,
+    )
+    batch = ScheduleBatch.init_new(
+        reqs=reqs,
+        req_to_token_pool=model_runner.req_to_token_pool,
+        token_to_kv_pool_allocator=model_runner.token_to_kv_pool_allocator,
+        tree_cache=dummy_tree_cache,
+        model_config=model_runner.model_config,
+        enable_overlap=False,
+        spec_algorithm=SpeculativeAlgorithm.NONE,
+    )
+    batch.prepare_for_extend()
+    _maybe_prepare_mlp_sync_batch(batch, model_runner)
+    model_worker_batch = batch.get_model_worker_batch()
+    forward_batch = ForwardBatch.init_new(model_worker_batch, model_runner)
+    logits_output = model_runner.forward(forward_batch).logits_output
+    next_token_ids = model_runner.sample(logits_output, forward_batch)
+    return next_token_ids, logits_output.next_token_logits, batch
+@torch.no_grad
+def decode(input_token_ids, batch, model_runner):
+    batch.output_ids = input_token_ids
+    batch.prepare_for_decode()
+    _maybe_prepare_mlp_sync_batch(batch, model_runner)
+    model_worker_batch = batch.get_model_worker_batch()
+    forward_batch = ForwardBatch.init_new(model_worker_batch, model_runner)
+    logits_output = model_runner.forward(forward_batch).logits_output
+    next_token_ids = model_runner.sample(logits_output, forward_batch)
+    return next_token_ids, logits_output.next_token_logits
+def _maybe_prepare_mlp_sync_batch(batch: ScheduleBatch, model_runner):
+    if require_mlp_sync(model_runner.server_args):
+        prepare_mlp_sync_batch_raw(
+            batch,
+            dp_size=model_runner.server_args.dp_size,
+            attn_tp_size=1,
+            tp_group=model_runner.tp_group,
+            get_idle_batch=None,
+            disable_cuda_graph=model_runner.server_args.disable_cuda_graph,
+            require_mlp_tp_gather=require_mlp_tp_gather(model_runner.server_args),
+            disable_overlap_schedule=model_runner.server_args.disable_overlap_schedule,
+            offload_tags=set(),
+        )
+def _read_prompts_from_file(prompt_file, rank_print):
+    """Read custom prompts from the file specified by `--prompt-filename`."""
+    if not prompt_file:
+        return []
+    if not os.path.exists(prompt_file):
+        rank_print(
+            f"Custom prompt file {prompt_file} not found. Using default inputs..."
+        )
+        return []
+    with open(prompt_file, "r") as pf:
+        return pf.readlines()
+def _get_torch_profiler_output_dir():
+    return os.environ.get("SGLANG_TORCH_PROFILER_DIR", "/tmp")
+def _create_torch_profiler_filename(
+    profile_filename_prefix, batch_size, input_len, output_len, stage
+):
+    output_dir = _get_torch_profiler_output_dir()
+    filename = f"{profile_filename_prefix}_batch{batch_size}_input{input_len}_output{output_len}_{stage}.trace.json.gz"
+    return os.path.join(output_dir, filename)
+def _save_profile_trace_results(profiler, filename):
+    parent_dir = os.path.dirname(os.path.abspath(filename))
+    os.makedirs(parent_dir, exist_ok=True)
+    profiler.export_chrome_trace(filename)
+    print(
+        profiler.key_averages(group_by_input_shape=True).table(
+            sort_by="self_cpu_time_total"
+        )
+    )
+def correctness_test(
+    server_args,
+    port_args,
+    bench_args,
+    gpu_id,
+    tp_rank,
+):
+    # Configure the logger
+    configure_logger(server_args, prefix=f" TP{tp_rank}")
+    rank_print = print if tp_rank == 0 else lambda *args, **kwargs: None
+    # Load the model
+    model_runner, tokenizer = load_model(server_args, port_args, gpu_id, tp_rank)
+    # Prepare inputs
+    custom_prompts = _read_prompts_from_file(bench_args.prompt_filename, rank_print)
+    input_ids, reqs = prepare_inputs_for_correctness_test(
+        bench_args, tokenizer, custom_prompts
+    )
+    rank_print(f"\n{input_ids=}\n")
+    if bench_args.cut_len > 0:
+        # Prefill
+        next_token_ids, next_token_logits, batch = extend(reqs, model_runner)
+        rank_print(f"prefill logits (first half): {next_token_logits} \n")
+    # Prepare extend inputs
+    reqs = prepare_extend_inputs_for_correctness_test(
+        bench_args, input_ids, reqs, model_runner
+    )
+    # Extend (prefill w/ KV cache)
+    next_token_ids, next_token_logits, batch = extend(reqs, model_runner)
+    rank_print(f"prefill logits (final): {next_token_logits} \n")
+    # Decode
+    output_ids = [input_ids[i] + [next_token_ids[i]] for i in range(len(input_ids))]
+    for _ in range(bench_args.output_len[0] - 1):
+        next_token_ids, _ = decode(next_token_ids, batch, model_runner)
+        next_token_ids_list = next_token_ids.tolist()
+        for i in range(len(reqs)):
+            output_ids[i].append(next_token_ids_list[i])
+    # Print output texts
+    for i in range(len(reqs)):
+        rank_print(f"========== Prompt {i} ==========")
+        rank_print(tokenizer.decode(output_ids[i]), "\n")
+def synchronize(device):
+    torch.get_device_module(device).synchronize()
+def latency_test_run_once(
+    run_name,
+    model_runner,
+    rank_print,
+    reqs,
+    batch_size,
+    input_len,
+    output_len,
+    device,
+    log_decode_step,
+    profile,
+    profile_record_shapes,
+    profile_activities,
+    profile_filename_prefix,
+    profile_stage,
+    tp_rank,
+    profile_start_step=None,
+    profile_steps=None,
+):
+    max_batch_size = model_runner.max_total_num_tokens // (input_len + output_len)
+    if batch_size > max_batch_size:
+        rank_print(
+            f"skipping ({batch_size}, {input_len}, {output_len}) due to max batch size limit"
+        )
+        return
+    model_runner.req_to_token_pool.clear()
+    model_runner.token_to_kv_pool_allocator.clear()
+    measurement_results = {
+        "run_name": run_name,
+        "batch_size": batch_size,
+        "input_len": input_len,
+        "output_len": output_len,
+    }
+    tot_latency = 0
+    profiler = None
+    enable_profile_prefill = profile and profile_stage in ["all", "prefill"]
+    if enable_profile_prefill:
+        profiler = start_profile(
+            profile_activities,
+            profile_record_shapes=profile_record_shapes,
+            rank_print=rank_print,
+        )
+    synchronize(device)
+    tic = time.perf_counter()
+    next_token_ids, _, batch = extend(reqs, model_runner)
+    synchronize(device)
+    prefill_latency = time.perf_counter() - tic
+    if enable_profile_prefill:
+        trace_filename = _create_torch_profiler_filename(
+            profile_filename_prefix, batch_size, input_len, output_len, "prefill"
+        )
+        stop_profile(
+            profiler,
+            profile_activities,
+            rank_print=rank_print,
+            save_trace=True,
+            trace_filename=trace_filename,
+            stage="prefill",
+        )
+    tot_latency += prefill_latency
+    throughput = input_len * batch_size / prefill_latency
+    rank_print(
+        f"Prefill. latency: {prefill_latency:6.5f} s, throughput: {throughput:9.2f} token/s"
+    )
+    measurement_results["prefill_latency"] = prefill_latency
+    measurement_results["prefill_throughput"] = throughput
+    decode_latencies = []
+    # Determine profiling start step and end step
+    profile_start = (
+        profile_start_step if profile_start_step is not None else (output_len // 2)
+    )
+    profile_end = profile_start + (profile_steps if profile_steps is not None else 1)
+    enable_profile_decode = profile and profile_stage in ["all", "decode"]
+    profiler = None
+    for i in range(output_len - 1):
+        synchronize(device)
+        # Start profiler at the specified step
+        if enable_profile_decode and i == profile_start:
+            profiler = start_profile(
+                profile_activities,
+                profile_record_shapes=profile_record_shapes,
+                rank_print=rank_print,
+            )
+        tic = time.perf_counter()
+        next_token_ids, _ = decode(next_token_ids, batch, model_runner)
+        synchronize(device)
+        latency = time.perf_counter() - tic
+        # Stop profiler after the specified number of steps
+        if enable_profile_decode and profiler is not None and i >= profile_end - 1:
+            trace_filename = _create_torch_profiler_filename(
+                profile_filename_prefix, batch_size, input_len, output_len, "decode"
+            )
+            stop_profile(
+                profiler,
+                profile_activities,
+                rank_print=rank_print,
+                save_trace=True,
+                trace_filename=trace_filename,
+                stage="decode",
+            )
+            profiler = None
+        tot_latency += latency
+        throughput = batch_size / latency
+        decode_latencies.append(latency)
+        if i < 5 or (log_decode_step > 0 and i % log_decode_step == 0):
+            rank_print(
+                f"Decode {i}. Batch size: {batch_size}, latency: {latency:6.5f} s, throughput: {throughput:9.2f} token/s"
+            )
+    # Record decode timing from 2nd output
+    if output_len > 1:
+        med_decode_latency = np.median(decode_latencies)
+        med_decode_throughput = batch_size / med_decode_latency
+        rank_print(
+            f"Decode.  median latency: {med_decode_latency:6.5f} s, median throughput: {med_decode_throughput:9.2f} token/s"
+        )
+        measurement_results["median_decode_latency"] = med_decode_latency
+        measurement_results["median_decode_throughput"] = med_decode_throughput
+    throughput = (input_len + output_len) * batch_size / tot_latency
+    rank_print(
+        f"Total. latency: {tot_latency:6.3f} s, throughput: {throughput:9.2f} token/s"
+    )
+    measurement_results["total_latency"] = tot_latency
+    measurement_results["overall_throughput"] = throughput
+    return measurement_results
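The arithmetic behind the reported numbers in `latency_test_run_once` can be sketched in isolation. `summarize` is a hypothetical helper, not an sglang function, and the latencies below are made-up illustrative values. It assumes, as in the loop above, that prefill produces the first output token, so there are `output_len - 1` decode latencies.

```python
import statistics


def summarize(batch_size, input_len, output_len, prefill_latency, decode_latencies):
    """Hypothetical helper mirroring the throughput arithmetic of the benchmark."""
    total_latency = prefill_latency + sum(decode_latencies)
    summary = {
        # Prefill processes every input token of every request in one pass.
        "prefill_throughput": input_len * batch_size / prefill_latency,
        "total_latency": total_latency,
        # Overall throughput counts both input and output tokens.
        "overall_throughput": (input_len + output_len) * batch_size / total_latency,
    }
    if decode_latencies:
        median = statistics.median(decode_latencies)
        summary["median_decode_latency"] = median
        # Each decode step emits one token per request in the batch.
        summary["median_decode_throughput"] = batch_size / median
    return summary


result = summarize(
    batch_size=4,
    input_len=1024,
    output_len=4,
    prefill_latency=0.5,
    decode_latencies=[0.25, 0.125, 0.125],
)
print(result)
```

Using the median rather than the mean keeps a single slow decode step (here the first one) from skewing the reported decode latency.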
+def latency_test(
+    server_args,
+    port_args,
+    bench_args,
+    gpu_id,
+    tp_rank,
+):
+    initialize_moe_config(server_args)
+    initialize_fp8_gemm_config(server_args)
+    initialize_fp4_gemm_config(server_args)
+    # Set CPU affinity
+    if get_bool_env_var("SGLANG_SET_CPU_AFFINITY"):
+        set_gpu_proc_affinity(
+            server_args.pp_size, server_args.tp_size, server_args.nnodes, tp_rank
+        )
+    # Configure the logger
+    configure_logger(server_args, prefix=f" TP{tp_rank}")
+    rank_print = print if tp_rank == 0 else lambda *args, **kwargs: None
+    # Load the model
+    model_runner, tokenizer = load_model(server_args, port_args, gpu_id, tp_rank)
+    # Prepare inputs for warm up
+    reqs = prepare_synthetic_inputs_for_latency_test(
+        bench_args.batch_size[0], bench_args.input_len[0]
+    )
+    # Warm up
+    rank_print("Warmup ...")
+    latency_test_run_once(
+        bench_args.run_name,
+        model_runner,
+        rank_print,
+        reqs,
+        bench_args.batch_size[0],
+        bench_args.input_len[0],
+        min(32, bench_args.output_len[0]),  # shorter decoding to speed up the warmup
+        server_args.device,
+        log_decode_step=0,
+        profile=False,
+        profile_record_shapes=False,
+        profile_activities=("CPU", "GPU"),
+        profile_filename_prefix="",
+        profile_stage="all",
+        tp_rank=tp_rank,
+        profile_start_step=None,
+        profile_steps=None,
+    )
+    rank_print("Benchmark ...")
+    custom_inputs = _read_prompts_from_file(bench_args.prompt_filename, rank_print)
+    custom_inputs = [tokenizer.encode(p.strip()) for p in custom_inputs]
+    custom_input_len = len(custom_inputs)
+    # Run the sweep
+    result_list = []
+    for bs, il, ol in itertools.product(
+        bench_args.batch_size, bench_args.input_len, bench_args.output_len
+    ):
+        bs_aligned_inputs = []
+        if custom_inputs:
+            if custom_input_len == bs:
+                bs_aligned_inputs = custom_inputs
+            elif custom_input_len > bs:
+                rank_print(
+                    f"Custom input size ({custom_input_len}) is larger than batch_size ({bs}). "
+                    f"Using the first {bs} prompts."
+                )
+                bs_aligned_inputs = copy.deepcopy(custom_inputs[:bs])
+            else:
+                rank_print(
+                    f"Custom input size ({custom_input_len}) is smaller than batch_size ({bs}). "
+                    f"Pad to the desired batch_size with the last prompt."
+                )
+                bs_aligned_inputs = copy.deepcopy(custom_inputs)
+                bs_aligned_inputs.extend(
+                    [bs_aligned_inputs[-1]] * (bs - custom_input_len)
+                )
+        reqs = prepare_synthetic_inputs_for_latency_test(bs, il, bs_aligned_inputs)
+        ret = latency_test_run_once(
+            bench_args.run_name,
+            model_runner,
+            rank_print,
+            reqs,
+            bs,
+            il,
+            ol,
+            server_args.device,
+            bench_args.log_decode_step,
+            bench_args.profile if tp_rank == 0 else None,
+            bench_args.profile_record_shapes if tp_rank == 0 else None,
+            bench_args.profile_activities,
+            bench_args.profile_filename_prefix,
+            bench_args.profile_stage,
+            tp_rank,
+            bench_args.profile_start_step,
+            bench_args.profile_steps,
+        )
+        if ret is not None:
+            result_list.append(ret)
+    # Write results in jsonlines format on rank 0.
+    if tp_rank == 0 and bench_args.result_filename:
+        with open(bench_args.result_filename, "a") as fout:
+            for result in result_list:
+                fout.write(json.dumps(result) + "\n")
+    if server_args.tp_size > 1:
+        destroy_distributed_environment()
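The prompt alignment performed inside the sweep loop can be sketched as a standalone function. `align_to_batch_size` is a hypothetical name introduced only for illustration; it truncates when there are more custom prompts than the batch size and pads with the last prompt when there are fewer.

```python
import copy


def align_to_batch_size(prompts, batch_size):
    """Hypothetical helper: truncate or pad a prompt list to exactly batch_size items."""
    if not prompts:
        return []  # the caller then falls back to synthetic inputs
    # Take at most batch_size prompts; deep-copy so the source list is untouched.
    aligned = copy.deepcopy(prompts[:batch_size])
    # Pad by repeating the last prompt when too few were supplied.
    aligned.extend([aligned[-1]] * (batch_size - len(aligned)))
    return aligned


print(align_to_batch_size([[1], [2], [3]], 2))
print(align_to_batch_size([[1], [2]], 4))
```

Note that the padded entries refer to the same list object as the last prompt, which is harmless as long as the token lists are only read.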
+def main(server_args, bench_args):
+    server_args.cuda_graph_max_bs = max(bench_args.batch_size)
+    _set_envs_and_config(server_args)
+    if server_args.model_path:
+        if bench_args.correctness_test:
+            work_func = correctness_test
+        else:
+            work_func = latency_test
+    else:
+        raise ValueError(
+            "Provide --model-path for running the tests or "
+            "provide --result-filename for plotting the results"
+        )
+    port_args = PortArgs.init_new(server_args)
+    if server_args.tp_size == 1:
+        work_func(server_args, port_args, bench_args, 0, 0)
+    else:
+        workers = []
+        for tp_rank in range(server_args.tp_size):
+            with maybe_reindex_device_id(tp_rank) as gpu_id:
+                proc = multiprocessing.Process(
+                    target=work_func,
+                    args=(
+                        server_args,
+                        port_args,
+                        bench_args,
+                        gpu_id,
+                        tp_rank,
+                    ),
+                )
+                proc.start()
+                workers.append(proc)
+        for proc in workers:
+            proc.join()
+        proc.terminate()
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    ServerArgs.add_cli_args(parser)
+    BenchArgs.add_cli_args(parser)
+    args = parser.parse_args()
+    server_args = ServerArgs.from_cli_args(args)
+    bench_args = BenchArgs.from_cli_args(args)
+    logging.basicConfig(
+        level=getattr(logging, server_args.log_level.upper()),
+        format="%(message)s",
+    )
+    try:
+        main(server_args, bench_args)
+    finally:
+        if server_args.tp_size != 1:
+            kill_process_tree(os.getpid(), include_parent=False)

sglang/python/sglang/bench_one_batch_server.py ADDED

	@@ -0,0 +1,49 @@

+"""
+Benchmark the latency of running a single batch with a server.
+This script launches a server and uses the HTTP interface.
+It accepts server arguments (the same as launch_server.py) and benchmark arguments (e.g., batch size, input lengths).
+Usage:
+python3 -m sglang.bench_one_batch_server --model meta-llama/Meta-Llama-3.1-8B --batch-size 1 16 64 --input-len 1024 --output-len 8
+python3 -m sglang.bench_one_batch_server --model None --base-url http://localhost:30000 --batch-size 16 --input-len 1024 --output-len 8
+python3 -m sglang.bench_one_batch_server --model None --base-url http://localhost:30000 --batch-size 16 --input-len 1024 --output-len 8 --show-report --profile --profile-by-stage
+python3 -m sglang.bench_one_batch_server --model None --base-url http://localhost:30000 --batch-size 16 --input-len 1024 --output-len 8 --output-path results.json --profile
+"""
+import argparse
+from sglang.srt.server_args import ServerArgs
+from sglang.test.bench_one_batch_server_internal import (
+    BenchArgs,
+    run_benchmark_internal,
+)
+from sglang.test.nightly_bench_utils import save_results_as_pydantic_models
+def run_benchmark(server_args: ServerArgs, bench_args: BenchArgs):
+    results, server_info = run_benchmark_internal(server_args, bench_args)
+    # Save results as pydantic models in the JSON format
+    if bench_args.pydantic_result_filename:
+        save_results_as_pydantic_models(
+            results,
+            pydantic_result_filename=bench_args.pydantic_result_filename,
+            model_path=server_args.model_path,
+            server_args=bench_args.server_args_for_metrics,
+        )
+    return results, server_info
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    ServerArgs.add_cli_args(parser)
+    BenchArgs.add_cli_args(parser)
+    args = parser.parse_args()
+    server_args = ServerArgs.from_cli_args(args)
+    bench_args = BenchArgs.from_cli_args(args)
+    run_benchmark(server_args, bench_args)

sglang/python/sglang/bench_serving.py ADDED

	@@ -0,0 +1,2238 @@

+# Adapted from https://github.com/vllm-project/vllm/blob/6366efc67b0aedd2c1721c14385370e50b297fb3/benchmarks/backend_request_func.py
+# Adapted from https://github.com/vllm-project/vllm/blob/6366efc67b0aedd2c1721c14385370e50b297fb3/benchmarks/benchmark_serving.py
+"""
+Benchmark online serving with dynamic requests.
+Usage:
+python3 -m sglang.bench_serving --backend sglang --num-prompts 10
+python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 1024 --random-range-ratio 0.5
+"""
+import argparse
+import asyncio
+import copy
+import importlib.util
+import json
+import os
+import random
+import shutil
+import sys
+import time
+import traceback
+import uuid
+import warnings
+from argparse import ArgumentParser
+from copy import deepcopy
+from dataclasses import dataclass, field, replace
+from datetime import datetime
+from pathlib import Path
+from typing import Any, AsyncGenerator, Callable, Dict, List, Optional, Tuple, Union
+import aiohttp
+import numpy as np
+import requests
+from tqdm.asyncio import tqdm
+from transformers import AutoTokenizer, PreTrainedTokenizerBase
+from sglang.benchmark.datasets import DatasetRow, get_dataset
+from sglang.benchmark.datasets.mooncake import get_mooncake_request_over_time
+from sglang.benchmark.utils import (
+    get_tokenizer,
+    parse_custom_headers,
+    remove_prefix,
+    set_ulimit,
+)
+_ROUTING_KEY_HEADER = "X-SMG-Routing-Key"
+TERM_PLOTLIB_AVAILABLE = (importlib.util.find_spec("termplotlib") is not None) and (
+    shutil.which("gnuplot") is not None
+)
+global args
+# don't want to import sglang package here
+def _get_bool_env_var(name: str, default: str = "false") -> bool:
+    value = os.getenv(name, default)
+    return value.lower() in ("true", "1")
+def _create_bench_client_session():
+    # Under heavy load, the read buffer can fill up before the aiohttp reader consumes
+    # the content, so we increase read_bufsize from 64K to 10M.
+    # Timeout and buffer size are named constants for clarity and maintainability.
+    BENCH_AIOHTTP_TIMEOUT_SECONDS = 6 * 60 * 60  # 6 hours
+    BENCH_AIOHTTP_READ_BUFSIZE_BYTES = 10 * 1024**2  # 10 MB
+    aiohttp_timeout = aiohttp.ClientTimeout(total=BENCH_AIOHTTP_TIMEOUT_SECONDS)
+    return aiohttp.ClientSession(
+        timeout=aiohttp_timeout, read_bufsize=BENCH_AIOHTTP_READ_BUFSIZE_BYTES
+    )
+@dataclass
+class RequestFuncInput:
+    prompt: Union[str, List[str], List[Dict[str, str]]]
+    api_url: str
+    prompt_len: int
+    output_len: int
+    model: str
+    lora_name: str
+    image_data: Optional[List[str]]
+    extra_request_body: Dict[str, Any]
+    timestamp: Optional[float] = None
+    routing_key: Optional[str] = None
+@dataclass
+class RequestFuncOutput:
+    generated_text: str = ""
+    success: bool = False
+    latency: float = 0.0
+    ttft: float = 0.0  # Time to first token
+    itl: List[float] = field(default_factory=list)  # List of inter-token latencies
+    text_chunks: List[str] = field(default_factory=list)
+    prompt_len: int = 0
+    error: str = ""
+    output_len: int = 0
+    start_time: float = 0.0
+    @staticmethod
+    def init_new(request_func_input: RequestFuncInput):
+        output = RequestFuncOutput()
+        output.prompt_len = request_func_input.prompt_len
+        return output
+def get_auth_headers() -> Dict[str, str]:
+    openai_api_key = os.environ.get("OPENAI_API_KEY")
+    if openai_api_key:
+        return {"Authorization": f"Bearer {openai_api_key}"}
+    else:
+        api_key = os.environ.get("API_KEY")
+        if api_key:
+            return {"Authorization": f"{api_key}"}
+        return {}
+def get_request_headers() -> Dict[str, str]:
+    headers = get_auth_headers()
+    if h := getattr(args, "header", None):
+        headers.update(parse_custom_headers(h))
+    return headers
+def wait_for_endpoint(url: str, timeout_sec: int = 60) -> bool:
+    """Wait for the server to become ready by polling the given URL."""
+    print(f"Waiting up to {timeout_sec}s for {url} to become ready...")
+    start_time = time.perf_counter()
+    headers = get_auth_headers()
+    while True:
+        try:
+            response = requests.get(url, headers=headers, timeout=5)
+            if response.status_code == 200:
+                elapsed = time.perf_counter() - start_time
+                print(f"Server ready in {elapsed:.1f}s.")
+                return True
+        except requests.exceptions.RequestException:
+            pass
+        elapsed = time.perf_counter() - start_time
+        if elapsed >= timeout_sec:
+            print(f"Server did not become ready within {timeout_sec}s timeout.")
+            return False
+        time.sleep(1)
+# TensorRT-LLM does not support ignore_eos
+# https://github.com/triton-inference-server/tensorrtllm_backend/issues/505
+async def async_request_trt_llm(
+    request_func_input: RequestFuncInput,
+    pbar: Optional[tqdm] = None,
+) -> RequestFuncOutput:
+    api_url = request_func_input.api_url
+    assert api_url.endswith("generate_stream")
+    async with _create_bench_client_session() as session:
+        payload = {
+            "accumulate_tokens": True,
+            "text_input": request_func_input.prompt,
+            "temperature": 0.000001,
+            "top_p": 1.0,
+            "max_tokens": request_func_input.output_len,
+            "stream": True,
+            "min_length": request_func_input.output_len,
+            "end_id": 1048576,
+            **request_func_input.extra_request_body,
+        }
+        if args.disable_ignore_eos:
+            del payload["min_length"]
+            del payload["end_id"]
+        output = RequestFuncOutput.init_new(request_func_input)
+        ttft = 0.0
+        st = time.perf_counter()
+        most_recent_timestamp = st
+        try:
+            async with session.post(url=api_url, json=payload) as response:
+                if response.status == 200:
+                    async for chunk_bytes in response.content:
+                        chunk_bytes = chunk_bytes.strip()
+                        if not chunk_bytes:
+                            continue
+                        chunk = remove_prefix(chunk_bytes.decode("utf-8"), "data:")
+                        data = json.loads(chunk)
+                        output.generated_text += data["text_output"]
+                        timestamp = time.perf_counter()
+                        # First token
+                        if ttft == 0.0:
+                            ttft = timestamp - st
+                            output.ttft = ttft
+                        # Decoding phase
+                        else:
+                            output.itl.append(timestamp - most_recent_timestamp)
+                        most_recent_timestamp = timestamp
+                    output.latency = most_recent_timestamp - st
+                    output.success = True
+                    output.output_len = request_func_input.output_len
+                else:
+                    output.error = (
+                        (response.reason or "") + ": " + (await response.text())
+                    )
+                    output.success = False
+        except Exception:
+            output.success = False
+            exc_info = sys.exc_info()
+            output.error = "".join(traceback.format_exception(*exc_info))
+        if pbar:
+            pbar.update(1)
+        return output
+# set ignore_eos True by default
+async def async_request_openai_completions(
+    request_func_input: RequestFuncInput,
+    pbar: Optional[tqdm] = None,
+) -> RequestFuncOutput:
+    api_url = request_func_input.api_url
+    assert api_url.endswith(
+        "completions"
+    ), "OpenAI Completions API URL must end with 'completions'."
+    prompt = request_func_input.prompt
+    async with _create_bench_client_session() as session:
+        # Build payload with defaults that can be overridden by extra_request_body
+        payload = {
+            "model": request_func_input.model,
+            "prompt": prompt,
+            "best_of": 1,
+            "max_tokens": request_func_input.output_len,
+            "stream": not args.disable_stream,
+        }
+        # Add temperature default only if not specified in extra_request_body
+        if "temperature" not in request_func_input.extra_request_body:
+            payload["temperature"] = 0.0
+        # Add ignore_eos default only if not specified in extra_request_body
+        if "ignore_eos" not in request_func_input.extra_request_body:
+            payload["ignore_eos"] = not args.disable_ignore_eos
+        # Merge in extra parameters - these will override defaults if present
+        payload.update(request_func_input.extra_request_body)
+        # hack to accommodate different LoRA conventions between SGLang and vLLM.
+        if request_func_input.lora_name:
+            payload["model"] = request_func_input.lora_name
+            payload["lora_path"] = request_func_input.lora_name
+        if request_func_input.image_data:
+            payload.update({"image_data": request_func_input.image_data})
+        headers = get_request_headers()
+        if request_func_input.routing_key:
+            headers[_ROUTING_KEY_HEADER] = request_func_input.routing_key
+        output = RequestFuncOutput.init_new(request_func_input)
+        generated_text = ""
+        output_len = request_func_input.output_len
+        ttft = 0.0
+        st = time.perf_counter()
+        output.start_time = st
+        most_recent_timestamp = st
+        try:
+            async with session.post(
+                url=api_url, json=payload, headers=headers
+            ) as response:
+                if response.status == 200:
+                    async for chunk_bytes in response.content:
+                        chunk_bytes = chunk_bytes.strip()
+                        if not chunk_bytes:
+                            continue
+                        chunk = remove_prefix(chunk_bytes.decode("utf-8"), "data: ")
+                        latency = time.perf_counter() - st
+                        if chunk == "[DONE]":
+                            pass
+                        else:
+                            data = json.loads(chunk)
+                            # NOTE: Some completion API might have a last
+                            # usage summary response without a token so we
+                            # want to check a token was generated
+                            if data["choices"][0]["text"]:
+                                timestamp = time.perf_counter()
+                                # First token
+                                if ttft == 0.0:
+                                    ttft = timestamp - st
+                                    output.ttft = ttft
+                                # Decoding phase
+                                else:
+                                    output.text_chunks.append(
+                                        data["choices"][0]["text"]
+                                    )
+                                    output.itl.append(timestamp - most_recent_timestamp)
+                                most_recent_timestamp = timestamp
+                                generated_text += data["choices"][0]["text"]
+                                output_len = (data.get("usage") or {}).get(
+                                    "completion_tokens", output_len
+                                )
+                    output.generated_text = generated_text
+                    output.success = True
+                    output.latency = latency
+                    output.output_len = output_len
+                else:
+                    output.error = (
+                        (response.reason or "") + ": " + (await response.text())
+                    )
+                    output.success = False
+        except Exception:
+            output.success = False
+            exc_info = sys.exc_info()
+            output.error = "".join(traceback.format_exception(*exc_info))
+    if pbar:
+        pbar.update(1)
+    return output
+async def async_request_openai_chat_completions(
+    request_func_input: RequestFuncInput,
+    pbar: Optional[tqdm] = None,
+) -> RequestFuncOutput:
+    """Makes a request to the OpenAI Chat Completions API.
+    Handles both streaming and non-streaming responses, including support
+    for image data in messages. Calculates and returns various performance
+    metrics.
+    Args:
+        request_func_input: Input parameters for the request.
+        pbar: Optional tqdm progress bar to update.
+    Returns:
+        RequestFuncOutput: Output of the request, including generated text,
+                           latency, TTFT, ITL, and success status.
+    """
+    api_url = request_func_input.api_url
+    assert api_url.endswith(
+        "chat/completions"
+    ), "OpenAI Chat Completions API URL must end with 'chat/completions'."
+    # TODO: move this into a shared helper once the `pbar` logic is refactored
+    if getattr(args, "print_requests", False):
+        rid = str(uuid.uuid4())
+        input_partial = deepcopy(request_func_input)
+        input_partial.prompt = "..."
+        request_start_time = time.time()
+        print(
+            f'rid={rid} time={request_start_time} message="request start" request_func_input="{str(input_partial)}"'
+        )
+    if isinstance(request_func_input.prompt, list):
+        messages = request_func_input.prompt
+    elif request_func_input.image_data:
+        # Build multi-image content: a list of image_url entries followed by the text
+        content_items = [
+            {
+                "type": "image_url",
+                "image_url": {"url": img_url},
+            }
+            for img_url in request_func_input.image_data
+        ]
+        content_items.append({"type": "text", "text": request_func_input.prompt})
+        messages = [
+            {
+                "role": "user",
+                "content": content_items,
+            },
+        ]
+    else:
+        messages = [{"role": "user", "content": request_func_input.prompt}]
+    async with _create_bench_client_session() as session:
+        # Build payload with defaults that can be overridden by extra_request_body
+        payload = {
+            "model": request_func_input.model,
+            "messages": messages,
+            "max_completion_tokens": request_func_input.output_len,
+            "stream": not args.disable_stream,
+        }
+        # Add temperature default only if not specified in extra_request_body
+        if "temperature" not in request_func_input.extra_request_body:
+            payload["temperature"] = 0.0
+        # Add ignore_eos default only if not specified in extra_request_body
+        # The default is `not args.disable_ignore_eos`, i.e. True unless ignoring EOS is disabled
+        if "ignore_eos" not in request_func_input.extra_request_body:
+            payload["ignore_eos"] = not args.disable_ignore_eos
+        # Merge in extra parameters (tools, temperature, top_p, etc.)
+        # These will override defaults if present
+        payload.update(request_func_input.extra_request_body)
+        # hack to accommodate different LoRA conventions between SGLang and vLLM.
+        if request_func_input.lora_name:
+            payload["model"] = request_func_input.lora_name
+            payload["lora_path"] = request_func_input.lora_name
+        headers = get_request_headers()
+        if request_func_input.routing_key:
+            headers[_ROUTING_KEY_HEADER] = request_func_input.routing_key
+        output = RequestFuncOutput.init_new(request_func_input)
+        generated_text = ""
+        output_len = request_func_input.output_len
+        ttft = 0.0
+        st = time.perf_counter()
+        output.start_time = st
+        most_recent_timestamp = st
+        try:
+            async with session.post(
+                url=api_url, json=payload, headers=headers
+            ) as response:
+                if response.status == 200:
+                    if args.disable_stream:
+                        # Non-streaming response
+                        response_json = await response.json()
+                        output.generated_text = response_json["choices"][0]["message"][
+                            "content"
+                        ]
+                        output.success = True
+                        output.latency = time.perf_counter() - st
+                        output.ttft = (
+                            output.latency
+                        )  # For non-streaming, TTFT = total latency
+                        output.output_len = (response_json.get("usage") or {}).get(
+                            "completion_tokens", output_len
+                        )
+                    else:
+                        # Streaming response
+                        async for chunk_bytes in response.content:
+                            chunk_bytes = chunk_bytes.strip()
+                            if not chunk_bytes:
+                                continue
+                            chunk = remove_prefix(chunk_bytes.decode("utf-8"), "data: ")
+                            latency = time.perf_counter() - st
+                            if chunk == "[DONE]":
+                                pass
+                            else:
+                                data = json.loads(chunk)
+                                # Check if this chunk contains content
+                                # `choices` may be missing or empty (e.g. a usage-only chunk)
+                                choices = data.get("choices") or [{}]
+                                delta = choices[0].get("delta") or {}
+                                content = delta.get("content", "")
+                                if content:
+                                    timestamp = time.perf_counter()
+                                    # First token
+                                    if ttft == 0.0:
+                                        ttft = timestamp - st
+                                        output.ttft = ttft
+                                    # Decoding phase
+                                    else:
+                                        output.text_chunks.append(content)
+                                        output.itl.append(
+                                            timestamp - most_recent_timestamp
+                                        )
+                                    most_recent_timestamp = timestamp
+                                    generated_text += content
+                                # Check for usage info in final chunk
+                                output_len = (data.get("usage") or {}).get(
+                                    "completion_tokens", output_len
+                                )
+                        output.generated_text = generated_text
+                        output.success = True
+                        output.latency = latency
+                        output.output_len = output_len
+                else:
+                    output.error = (
+                        (response.reason or "") + ": " + (await response.text())
+                    )
+                    output.success = False
+        except Exception:
+            output.success = False
+            exc_info = sys.exc_info()
+            output.error = "".join(traceback.format_exception(*exc_info))
+    # TODO: move this into a shared helper once the `pbar` logic is refactored
+    if getattr(args, "print_requests", False):
+        curr_t = time.time()
+        output_partial = deepcopy(output)
+        output_partial.generated_text = "..."
+        print(
+            f'rid={rid} time={curr_t} time_delta={curr_t - request_start_time} message="request end" output="{str(output_partial)}"'
+        )
+    if pbar:
+        pbar.update(1)
+    return output
+async def async_request_truss(
+    request_func_input: RequestFuncInput,
+    pbar: Optional[tqdm] = None,
+) -> RequestFuncOutput:
+    api_url = request_func_input.api_url
+    prompt = request_func_input.prompt
+    async with _create_bench_client_session() as session:
+        payload = {
+            "model": request_func_input.model,
+            "prompt": prompt,
+            "temperature": 0.0,
+            "best_of": 1,
+            "max_tokens": request_func_input.output_len,
+            "stream": not args.disable_stream,
+            "ignore_eos": not args.disable_ignore_eos,
+            **request_func_input.extra_request_body,
+        }
+        headers = get_request_headers()
+        output = RequestFuncOutput.init_new(request_func_input)
+        generated_text = ""
+        ttft = 0.0
+        st = time.perf_counter()
+        most_recent_timestamp = st
+        try:
+            async with session.post(
+                url=api_url, json=payload, headers=headers
+            ) as response:
+                if response.status == 200:
+                    async for chunk_bytes in response.content:
+                        chunk_bytes = chunk_bytes.strip()
+                        if not chunk_bytes:
+                            continue
+                        chunk = remove_prefix(chunk_bytes.decode("utf-8"), "data: ")
+                        latency = time.perf_counter() - st
+                        if chunk == "[DONE]":
+                            pass
+                        else:
+                            data = json.loads(chunk)
+                            # NOTE: Some completion API might have a last
+                            # usage summary response without a token so we
+                            # want to check a token was generated
+                            if data["choices"][0]["text"]:
+                                timestamp = time.perf_counter()
+                                # First token
+                                if ttft == 0.0:
+                                    ttft = timestamp - st
+                                    output.ttft = ttft
+                                # Decoding phase
+                                else:
+                                    output.itl.append(timestamp - most_recent_timestamp)
+                                most_recent_timestamp = timestamp
+                                generated_text += data["choices"][0]["text"]
+                    output.generated_text = generated_text
+                    output.success = True
+                    output.latency = latency
+                    output.output_len = request_func_input.output_len
+                else:
+                    output.error = (
+                        (response.reason or "") + ": " + (await response.text())
+                    )
+                    output.success = False
+        except Exception:
+            output.success = False
+            exc_info = sys.exc_info()
+            output.error = "".join(traceback.format_exception(*exc_info))
+    if pbar:
+        pbar.update(1)
+    return output
+async def async_request_sglang_generate(
+    request_func_input: RequestFuncInput,
+    pbar: Optional[tqdm] = None,
+) -> RequestFuncOutput:
+    api_url = request_func_input.api_url
+    prompt = request_func_input.prompt
+    async with _create_bench_client_session() as session:
+        payload = {
+            ("text" if isinstance(prompt, str) else "input_ids"): prompt,
+            "sampling_params": {
+                "temperature": 0.0,
+                "max_new_tokens": request_func_input.output_len,
+                "ignore_eos": not args.disable_ignore_eos,
+            },
+            "stream": not args.disable_stream,
+            "lora_path": request_func_input.lora_name,
+            "return_logprob": args.return_logprob,
+            "return_routed_experts": args.return_routed_experts,
+            "logprob_start_len": -1,
+            **request_func_input.extra_request_body,
+        }
+        # Add image data if available (list of image urls/base64)
+        if request_func_input.image_data:
+            payload["image_data"] = request_func_input.image_data
+        headers = get_request_headers()
+        if request_func_input.routing_key:
+            headers[_ROUTING_KEY_HEADER] = request_func_input.routing_key
+        output = RequestFuncOutput.init_new(request_func_input)
+        generated_text = ""
+        output_len = request_func_input.output_len
+        ttft = 0.0
+        st = time.perf_counter()
+        output.start_time = st
+        most_recent_timestamp = st
+        last_output_len = 0
+        try:
+            async with session.post(
+                url=api_url, json=payload, headers=headers
+            ) as response:
+                if response.status == 200:
+                    async for chunk_bytes in response.content:
+                        chunk_bytes = chunk_bytes.strip()
+                        if not chunk_bytes:
+                            continue
+                        chunk = remove_prefix(chunk_bytes.decode("utf-8"), "data: ")
+                        latency = time.perf_counter() - st
+                        if chunk == "[DONE]":
+                            pass
+                        else:
+                            data = json.loads(chunk)
+                            # NOTE: Some completion API might have a last
+                            # usage summary response without a token so we
+                            # want to check a token was generated
+                            if "text" in data and data["text"]:
+                                timestamp = time.perf_counter()
+                                generated_text = data["text"]
+                                output_len = data["meta_info"]["completion_tokens"]
+                                # First token
+                                if ttft == 0.0:
+                                    ttft = timestamp - st
+                                    output.ttft = ttft
+                                # Decoding phase
+                                else:
+                                    num_new_tokens = output_len - last_output_len
+                                    if num_new_tokens == 0:
+                                        continue
+                                    chunk_gap = timestamp - most_recent_timestamp
+                                    adjust_itl = chunk_gap / num_new_tokens
+                                    output.itl.extend([adjust_itl] * num_new_tokens)
+                                most_recent_timestamp = timestamp
+                                last_output_len = output_len
+                    output.generated_text = generated_text
+                    output.success = True
+                    output.latency = latency
+                    output.output_len = output_len
+                else:
+                    output.error = (
+                        (response.reason or "") + ": " + (await response.text())
+                    )
+                    output.success = False
+        except Exception:
+            output.success = False
+            exc_info = sys.exc_info()
+            output.error = "".join(traceback.format_exception(*exc_info))
+            print(f"{output.error=}")
+    if pbar:
+        pbar.update(1)
+    return output
+async def async_request_gserver(
+    request_func_input: RequestFuncInput,
+    pbar: Optional[tqdm] = None,
+) -> RequestFuncOutput:
+    raise NotImplementedError()
+async def async_request_profile(api_url: str) -> RequestFuncOutput:
+    async with _create_bench_client_session() as session:
+        output = RequestFuncOutput()
+        try:
+            if api_url.endswith("/start_profile"):
+                num_steps = getattr(args, "profile_num_steps", None)
+                profile_by_stage = getattr(args, "profile_by_stage", None)
+                if profile_by_stage and num_steps is None:
+                    num_steps = 5
+                output_dir = getattr(args, "profile_output_dir", None)
+                if output_dir is None:
+                    output_dir = os.getenv("SGLANG_TORCH_PROFILER_DIR", "/tmp")
+                output_dir = Path(os.path.abspath(os.path.normpath(output_dir))) / str(
+                    time.time()
+                )
+                output_dir.mkdir(exist_ok=True, parents=True)
+                output_dir = str(output_dir)
+                body = {
+                    "activities": getattr(args, "profile_activities", []),
+                    "num_steps": num_steps,
+                    "profile_by_stage": profile_by_stage,
+                    "profile_stages": getattr(args, "profile_stages", None),
+                    "output_dir": output_dir,
+                    "profile_prefix": getattr(args, "profile_prefix", None),
+                }
+            else:
+                # stop_profile doesn't need any parameters
+                body = {}
+            print(f"async_request_profile {api_url=} {body=}")
+            # Add optional profiling parameters if provided
+            if (
+                hasattr(args, "profile_start_step")
+                and args.profile_start_step is not None
+            ):
+                body["start_step"] = str(args.profile_start_step)
+            if hasattr(args, "profile_steps") and args.profile_steps is not None:
+                body["num_steps"] = str(args.profile_steps)
+            async with session.post(url=api_url, json=body) as response:
+                if response.status == 200:
+                    output.success = True
+                else:
+                    output.error = (
+                        (response.reason or "") + ": " + (await response.text())
+                    )
+                    output.success = False
+        except Exception:
+            output.success = False
+            exc_info = sys.exc_info()
+            output.error = "".join(traceback.format_exception(*exc_info))
+    return output
+def _build_profile_urls(
+    profile_prefill_url: Optional[List[str]],
+    profile_decode_url: Optional[List[str]],
+) -> List[Tuple[str, str]]:
+    """Build profile URLs list from prefill/decode URL arguments.
+    Returns:
+        List of (worker_type, url) tuples. e.g., [("Prefill-0", "http://..."), ("Decode-0", "http://...")]
+    """
+    profile_urls = []
+    if profile_prefill_url:
+        for idx, url in enumerate(profile_prefill_url):
+            profile_urls.append((f"Prefill-{idx}", url))
+    if profile_decode_url:
+        for idx, url in enumerate(profile_decode_url):
+            profile_urls.append((f"Decode-{idx}", url))
+    return profile_urls
+async def _call_profile_pd(profile_urls: List[Tuple[str, str]], mode: str) -> None:
+    """Call profile endpoint (start/stop) on PD separated workers.
+    Args:
+        profile_urls: List of (worker_type, url) tuples
+        mode: "start" or "stop"
+    """
+    endpoint = "/start_profile" if mode == "start" else "/stop_profile"
+    action = "Starting" if mode == "start" else "Stopping"
+    action_past = "started" if mode == "start" else "stopped"
+    print(f"{action} profiler...")
+    for worker_type, url in profile_urls:
+        profile_output = await async_request_profile(api_url=url + endpoint)
+        if profile_output.success:
+            print(f"Profiler {action_past} for {worker_type} worker at {url}")
+        else:
+            print(
+                f"Failed to {mode} profiler for {worker_type} worker at {url}: {profile_output.error}"
+            )
+ASYNC_REQUEST_FUNCS = {
+    "sglang": async_request_sglang_generate,
+    "sglang-native": async_request_sglang_generate,
+    "sglang-oai": async_request_openai_completions,
+    "sglang-oai-chat": async_request_openai_chat_completions,
+    "vllm": async_request_openai_completions,
+    "vllm-chat": async_request_openai_chat_completions,
+    "lmdeploy": async_request_openai_completions,
+    "lmdeploy-chat": async_request_openai_chat_completions,
+    "trt": async_request_trt_llm,
+    "gserver": async_request_gserver,
+    "truss": async_request_truss,
+}
+@dataclass
+class BenchmarkMetrics:
+    completed: int
+    total_input: int
+    total_input_text: int
+    total_input_vision: int
+    total_output: int
+    total_output_retokenized: int
+    request_throughput: float
+    input_throughput: float
+    output_throughput: float
+    output_throughput_retokenized: float
+    total_throughput: float
+    total_throughput_retokenized: float
+    mean_ttft_ms: float
+    median_ttft_ms: float
+    std_ttft_ms: float
+    p99_ttft_ms: float
+    mean_tpot_ms: float
+    median_tpot_ms: float
+    std_tpot_ms: float
+    p99_tpot_ms: float
+    mean_itl_ms: float
+    median_itl_ms: float
+    std_itl_ms: float
+    p95_itl_ms: float
+    p99_itl_ms: float
+    max_itl_ms: float
+    mean_e2e_latency_ms: float
+    median_e2e_latency_ms: float
+    std_e2e_latency_ms: float
+    p90_e2e_latency_ms: float
+    p99_e2e_latency_ms: float
+    concurrency: float
+    max_output_tokens_per_s: float = 0.0
+    max_concurrent_requests: int = 0
+async def get_request(
+    input_requests: List[DatasetRow],
+    request_rate: float,
+    use_trace_timestamps: bool = False,
+    slowdown_factor: float = 1.0,
+) -> AsyncGenerator[DatasetRow, None]:
+    if use_trace_timestamps:
+        print(
+            f"Using trace timestamps for request generation with slowdown factor {slowdown_factor}."
+        )
+        # Sort requests by timestamp for correct replay
+        input_requests.sort(key=lambda r: r.timestamp)
+        start_time = time.perf_counter()
+        trace_start_time_ms = input_requests[0].timestamp if input_requests else 0
+        for request in input_requests:
+            trace_time_s = (request.timestamp - trace_start_time_ms) / 1000.0
+            target_arrival_time = start_time + (trace_time_s * slowdown_factor)
+            sleep_duration = target_arrival_time - time.perf_counter()
+            if sleep_duration > 0:
+                await asyncio.sleep(sleep_duration)
+            yield request
+    else:
+        input_requests_iter = iter(input_requests)
+        for request in input_requests_iter:
+            yield request
+            if request_rate == float("inf"):
+                # If the request rate is infinity, then we don't need to wait.
+                continue
+            # Sample the request interval from the exponential distribution.
+            interval = np.random.exponential(1.0 / request_rate)
+            # The next request will be sent after the interval.
+            await asyncio.sleep(interval)
+def calculate_metrics(
+    input_requests: Optional[List[DatasetRow]],
+    outputs: List[RequestFuncOutput],
+    dur_s: float,
+    tokenizer: PreTrainedTokenizerBase,
+    backend: str,
+    accept_length: Optional[float] = None,
+    plot_throughput: bool = False,
+) -> Tuple[BenchmarkMetrics, List[int]]:
+    output_lens: List[int] = []
+    retokenized_output_lens: List[int] = []
+    total_input = 0
+    total_input_text = 0
+    total_input_vision = 0
+    completed = 0
+    itls: List[float] = []
+    tpots: List[float] = []
+    ttfts: List[float] = []
+    e2e_latencies: List[float] = []
+    retokenized_itls: List[float] = []
+    use_retokenized_itl = (
+        accept_length is not None
+        and accept_length > 0
+        and backend in ("sglang-oai", "sglang-oai-chat")
+    )
+    for i in range(len(outputs)):
+        if outputs[i].success:
+            output_len = outputs[i].output_len
+            output_lens.append(output_len)
+            retokenized_output_len = len(
+                tokenizer.encode(outputs[i].generated_text, add_special_tokens=False)
+            )
+            retokenized_output_lens.append(retokenized_output_len)
+            if input_requests is not None:
+                total_input += input_requests[i].prompt_len
+                total_input_text += input_requests[i].text_prompt_len
+                total_input_vision += input_requests[i].vision_prompt_len
+            if output_len > 1:
+                tpots.append((outputs[i].latency - outputs[i].ttft) / (output_len - 1))
+            if use_retokenized_itl:
+                for k, itl in enumerate(outputs[i].itl):
+                    num_tokens = len(
+                        tokenizer.encode(
+                            outputs[i].text_chunks[k], add_special_tokens=False
+                        )
+                    )
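+                    # Skip chunks that retokenize to zero tokens to avoid dividing by zero.
+                    if num_tokens == 0:
+                        continue
+                    adjusted_itl = itl / num_tokens
+                    retokenized_itls.extend([adjusted_itl] * num_tokens)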
+            else:
+                itls += outputs[i].itl
+            ttfts.append(outputs[i].ttft)
+            e2e_latencies.append(outputs[i].latency)
+            completed += 1
+        else:
+            output_lens.append(0)
+            retokenized_output_lens.append(0)
+    if completed == 0:
+        warnings.warn(
+            "All requests failed. This is likely due to a misconfiguration "
+            "on the benchmark arguments.",
+            stacklevel=2,
+        )
+    max_output_tokens_per_s = 0.0
+    max_concurrent_requests = 0
+    successful_outputs = [output for output in outputs if output.success]
+    if successful_outputs:
+        min_start_time = min(output.start_time for output in successful_outputs)
+        max_end_time = max(
+            output.start_time + output.latency for output in successful_outputs
+        )
+        duration_seconds = int(np.ceil(max_end_time - min_start_time)) + 1
+        tokens_per_second = np.zeros(duration_seconds)
+        concurrent_requests_per_second = np.zeros(duration_seconds)
+        for output in outputs:
+            if not output.success:
+                continue
+            token_times = [output.start_time + output.ttft]
+            current_time = token_times[0]
+            for itl_value in output.itl:
+                current_time += itl_value
+                token_times.append(current_time)
+            for token_time in token_times:
+                second_bucket = int(token_time - min_start_time)
+                if 0 <= second_bucket < duration_seconds:
+                    tokens_per_second[second_bucket] += 1
+            request_start_second = int(output.start_time - min_start_time)
+            request_end_second = int(
+                (output.start_time + output.latency) - min_start_time
+            )
+            for second in range(
+                request_start_second, min(request_end_second + 1, duration_seconds)
+            ):
+                concurrent_requests_per_second[second] += 1
+        if len(tokens_per_second) > 0:
+            max_output_tokens_per_s = float(np.max(tokens_per_second))
+            max_concurrent_requests = int(np.max(concurrent_requests_per_second))
+        if plot_throughput:
+            if TERM_PLOTLIB_AVAILABLE:
+                import termplotlib as tpl
+                fig = tpl.figure()
+                fig.plot(
+                    np.arange(len(tokens_per_second)),
+                    tokens_per_second,
+                    title="Output tokens per second",
+                    xlabel="Time (s)",
+                )
+                fig.plot(
+                    np.arange(len(concurrent_requests_per_second)),
+                    concurrent_requests_per_second,
+                    title="Concurrent requests per second",
+                    xlabel="Time (s)",
+                )
+                fig.show()
+            else:
+                print("tip: install termplotlib and gnuplot to plot the metrics")
+    itls = retokenized_itls if use_retokenized_itl else itls
+    metrics = BenchmarkMetrics(
+        completed=completed,
+        total_input=total_input,
+        total_input_text=total_input_text,
+        total_input_vision=total_input_vision,
+        total_output=sum(output_lens),
+        total_output_retokenized=sum(retokenized_output_lens),
+        request_throughput=completed / dur_s,
+        input_throughput=total_input / dur_s,
+        output_throughput=sum(output_lens) / dur_s,
+        output_throughput_retokenized=sum(retokenized_output_lens) / dur_s,
+        total_throughput=(total_input + sum(output_lens)) / dur_s,
+        total_throughput_retokenized=(total_input + sum(retokenized_output_lens))
+        / dur_s,
+        mean_ttft_ms=np.mean(ttfts or 0)
+        * 1000,  # ttfts is empty if streaming is not supported by backend
+        median_ttft_ms=np.median(ttfts or 0) * 1000,
+        std_ttft_ms=np.std(ttfts or 0) * 1000,
+        p99_ttft_ms=np.percentile(ttfts or 0, 99) * 1000,
+        mean_tpot_ms=np.mean(tpots or 0) * 1000,
+        median_tpot_ms=np.median(tpots or 0) * 1000,
+        std_tpot_ms=np.std(tpots or 0) * 1000,
+        p99_tpot_ms=np.percentile(tpots or 0, 99) * 1000,
+        mean_itl_ms=np.mean(itls or 0) * 1000,
+        median_itl_ms=np.median(itls or 0) * 1000,
+        std_itl_ms=np.std(itls or 0) * 1000,
+        p95_itl_ms=np.percentile(itls or 0, 95) * 1000,
+        p99_itl_ms=np.percentile(itls or 0, 99) * 1000,
+        max_itl_ms=np.max(itls or 0) * 1000,
+        mean_e2e_latency_ms=np.mean(e2e_latencies or 0) * 1000,
+        median_e2e_latency_ms=np.median(e2e_latencies or 0) * 1000,
+        std_e2e_latency_ms=np.std(e2e_latencies or 0) * 1000,
+        p90_e2e_latency_ms=np.percentile(e2e_latencies or 0, 90) * 1000,
+        p99_e2e_latency_ms=np.percentile(e2e_latencies or 0, 99) * 1000,
+        concurrency=np.sum(e2e_latencies) / dur_s,
+        max_output_tokens_per_s=max_output_tokens_per_s,
+        max_concurrent_requests=max_concurrent_requests,
+    )
+    return metrics, output_lens
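Review note: the peak-throughput figure in `calculate_metrics` comes from rebuilding each token's timestamp (`start_time + ttft`, then adding every inter-token latency) and counting tokens per one-second bucket. A minimal standalone sketch of that bucketing follows. `peak_tokens_per_second` and the tuple layout are hypothetical, and the window ends at the last token time rather than at `start_time + latency` as in the function above.

```python
import numpy as np


def peak_tokens_per_second(requests):
    """requests: list of (start_time, ttft, itls) tuples, all in seconds.

    Hypothetical helper illustrating the per-second bucketing only.
    """
    if not requests:
        return 0.0
    min_start = min(start for start, _, _ in requests)
    # Simplification: end of the window is the last token time.
    max_end = max(start + ttft + sum(itls) for start, ttft, itls in requests)
    num_buckets = int(np.ceil(max_end - min_start)) + 1
    buckets = np.zeros(num_buckets)
    for start, ttft, itls in requests:
        token_time = start + ttft  # first token
        buckets[int(token_time - min_start)] += 1
        for gap in itls:  # every later token
            token_time += gap
            buckets[int(token_time - min_start)] += 1
    return float(buckets.max())


sample = [
    (0.0, 0.5, [0.1, 0.1, 1.0]),  # tokens at 0.5, 0.6, 0.7, 1.7
    (0.2, 0.4, [0.1]),            # tokens at 0.6, 0.7
]
print(peak_tokens_per_second(sample))
```

In this sample, five tokens land in the first second and one in the second, so the peak is 5 tokens per second.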
+MULTI_TURN_BACKENDS = {"sglang-oai-chat", "vllm-chat", "lmdeploy-chat"}
+def wrap_multi_turn_request_func(request_func: Callable, backend: str) -> Callable:
+    assert (
+        backend in MULTI_TURN_BACKENDS
+    ), f"Multi-turn only supports chat backends: {MULTI_TURN_BACKENDS}, got {backend}"
+    async def f(
+        request_func_input: RequestFuncInput,
+        pbar: Optional[tqdm] = None,
+    ) -> List[RequestFuncOutput]:
+        prompts: List[str] = request_func_input.prompt
+        prev_messages: List[Dict[str, str]] = []
+        outputs = []
+        for round_index in range(len(prompts)):
+            prev_messages.append({"role": "user", "content": prompts[round_index]})
+            inner_input = replace(
+                copy.deepcopy(request_func_input), prompt=copy.deepcopy(prev_messages)
+            )
+            output = await request_func(
+                inner_input, pbar=pbar if round_index == len(prompts) - 1 else None
+            )
+            outputs.append(output)
+            prev_messages.append(
+                {"role": "assistant", "content": output.generated_text}
+            )
+        return outputs
+    return f
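Review note: the wrapper above replays a conversation turn by turn, appending each user prompt and then the model's reply to a growing message list, so every request carries the full history. The sketch below shows the same accumulation with a fake backend. `run_multi_turn` and `fake_send` are hypothetical names, and it copies the history with `list(...)` where the wrapper uses `copy.deepcopy`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Message = Dict[str, str]


async def run_multi_turn(
    prompts: List[str],
    send: Callable[[List[Message]], Awaitable[str]],
) -> List[str]:
    """Send each prompt together with all earlier turns; return the replies."""
    history: List[Message] = []
    replies: List[str] = []
    for prompt in prompts:
        history.append({"role": "user", "content": prompt})
        # Pass a copy so the backend cannot mutate the running history.
        reply = await send(list(history))
        replies.append(reply)
        history.append({"role": "assistant", "content": reply})
    return replies


async def fake_send(messages: List[Message]) -> str:
    # Hypothetical stand-in for a chat endpoint: reports the history length.
    return f"reply-{len(messages)}"


print(asyncio.run(run_multi_turn(["a", "b", "c"], fake_send)))
```

The history length seen by the backend grows by two per round (one user message plus one assistant message), which is why later turns cost more prompt tokens.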
+async def benchmark(
+    backend: str,
+    api_url: str,
+    base_url: str,
+    model_id: str,
+    tokenizer: PreTrainedTokenizerBase,
+    input_requests: List[DatasetRow],
+    request_rate: float,
+    max_concurrency: Optional[int],
+    disable_tqdm: bool,
+    lora_names: List[str],
+    lora_request_distribution: Optional[str],
+    lora_zipf_alpha: Optional[float],
+    extra_request_body: Dict[str, Any],
+    profile: bool,
+    pd_separated: bool = False,
+    flush_cache: bool = False,
+    warmup_requests: int = 1,
+    use_trace_timestamps: bool = False,
+    mooncake_slowdown_factor=1.0,
+    mooncake_num_rounds=1,
+    profile_prefill_url: Optional[List[str]] = None,
+    profile_decode_url: Optional[List[str]] = None,
+):
+    if backend in ASYNC_REQUEST_FUNCS:
+        request_func = ASYNC_REQUEST_FUNCS[backend]
+    else:
+        raise ValueError(f"Unknown backend: {backend}")
+    # Check for multi-turn: prompt is a list of strings (not OpenAI messages dicts)
+    # Multi-turn format: ["turn1", "turn2", ...] - list of strings
+    # OpenAI format: [{"role": "user", "content": "..."}, ...] - list of dicts
+    first_prompt = input_requests[0].prompt
+    is_multi_turn = (
+        isinstance(first_prompt, list)
+        and len(first_prompt) > 0
+        and isinstance(first_prompt[0], str)
+    )
+    if is_multi_turn:
+        request_func = wrap_multi_turn_request_func(request_func, backend=backend)
+    # Limit concurrency
+    # From https://github.com/vllm-project/vllm/pull/9390
+    semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None
+    async def limited_request_func(request_func_input, pbar):
+        if semaphore is None:
+            return await request_func(request_func_input=request_func_input, pbar=pbar)
+        async with semaphore:
+            return await request_func(request_func_input=request_func_input, pbar=pbar)
+    # Warmup
+    print(f"Starting warmup with {warmup_requests} sequences...")
+    # Handle the data structure difference for the warmup request
+    if args.dataset_name == "mooncake":
+        # For mooncake, input_requests is a list of dicts.
+        # We need to build a temporary DatasetRow for the warmup phase.
+        warmup_record = input_requests[0]
+        # Build prompt from hash_ids, just like in the async generator
+        hash_ids = warmup_record.get("hash_ids", [])
+        prompt_text = ""
+        for hash_id in hash_ids:
+            prompt_text += f"{hash_id}" + " ".join(["hi"] * 512)
+        prompt_text += "Can you tell me a detailed story in 1000 words?"
+        output_len = warmup_record.get("output_length", 32)
+        prompt_len = len(tokenizer.encode(prompt_text))
+        # Create a temporary DatasetRow object for warmup
+        test_request = DatasetRow(
+            prompt=prompt_text,
+            prompt_len=prompt_len,
+            output_len=output_len,
+            image_data=None,  # Mooncake doesn't have image data
+        )
+    else:
+        # For all other datasets, input_requests is a list of DatasetRow objects
+        test_request = input_requests[0]
+    if lora_names is not None and len(lora_names) != 0:
+        lora_name = lora_names[0]
+    else:
+        lora_name = None
+    # Create the test input once
+    test_input = RequestFuncInput(
+        model=model_id,
+        prompt=test_request.prompt,
+        api_url=api_url,
+        prompt_len=test_request.prompt_len,
+        output_len=min(test_request.output_len, 32),
+        lora_name=lora_name,
+        image_data=test_request.image_data,
+        extra_request_body=extra_request_body,
+    )
+    # Run warmup requests
+    warmup_tasks = []
+    for _ in range(warmup_requests):
+        warmup_tasks.append(
+            asyncio.create_task(request_func(request_func_input=test_input))
+        )
+    warmup_outputs = await asyncio.gather(*warmup_tasks)
+    if is_multi_turn:
+        warmup_outputs = [x for output in warmup_outputs for x in output]
+    # Check if at least one warmup request succeeded
+    if warmup_requests > 0 and not any(output.success for output in warmup_outputs):
+        raise ValueError(
+            "Warmup failed - Please make sure benchmark arguments "
+            f"are correctly specified. Error: {warmup_outputs[0].error}"
+        )
+    else:
+        print(
+            f"Warmup completed with {warmup_requests} sequences. Starting main benchmark run..."
+        )
+    # Flush cache
+    if ("sglang" in backend and _get_bool_env_var("SGLANG_IS_IN_CI")) or flush_cache:
+        requests.post(base_url + "/flush_cache", headers=get_auth_headers())
+    time.sleep(1.0)
+    # Build profile URLs for PD separated mode (do this once at the beginning)
+    pd_profile_urls = []
+    if profile and pd_separated:
+        pd_profile_urls = _build_profile_urls(profile_prefill_url, profile_decode_url)
+        if not pd_profile_urls:
+            print(
+                "Warning: PD separated mode requires --profile-prefill-url or --profile-decode-url"
+            )
+            print("Skipping profiler start. Please specify worker URLs for profiling.")
+    # Start profiler
+    if profile:
+        if pd_separated:
+            if pd_profile_urls:
+                await _call_profile_pd(pd_profile_urls, "start")
+        else:
+            print("Starting profiler...")
+            profile_output = await async_request_profile(
+                api_url=base_url + "/start_profile"
+            )
+            if profile_output.success:
+                print("Profiler started")
+    # Run all requests
+    benchmark_start_time = time.perf_counter()
+    tasks: List[asyncio.Task] = []
+    pbar_total = len(input_requests)
+    if (
+        backend == "sglang" and args.dataset_name == "mooncake"
+    ):  # Assuming mooncake is mainly for sglang or similar backends
+        print("Using time-based Mooncake request scheduler, ignoring --request-rate.")
+        request_generator = get_mooncake_request_over_time(
+            input_requests, tokenizer, mooncake_slowdown_factor, mooncake_num_rounds
+        )
+        print(
+            f"Starting Mooncake trace replay. Sessions: {len(input_requests)}, Rounds per session: {mooncake_num_rounds}. Slowdown factor: {mooncake_slowdown_factor}"
+        )
+        pbar_total *= mooncake_num_rounds
+    else:
+        request_generator = get_request(input_requests, request_rate)
+    # Prepare LoRA request distribution parameters
+    if lora_request_distribution == "distinct":
+        lora_idx = 0
+    elif lora_request_distribution == "skewed":
+        weights = np.array([lora_zipf_alpha**-i for i in range(len(lora_names))])
+        lora_probs = weights / np.sum(weights)
+    else:
+        lora_idx = None
+        lora_probs = None
+    pbar = None if disable_tqdm else tqdm(total=pbar_total)
+    async for request in request_generator:
+        if lora_names is not None and len(lora_names) != 0:
+            if lora_request_distribution == "uniform":
+                lora_name = random.choice(lora_names)
+            elif lora_request_distribution == "distinct":
+                lora_name = lora_names[lora_idx]
+                lora_idx = (lora_idx + 1) % len(lora_names)
+            else:
+                assert (
+                    lora_request_distribution == "skewed"
+                ), f"Unexpected lora_request_distribution: {lora_request_distribution}. Expected 'skewed'."
+                lora_name = np.random.choice(lora_names, p=lora_probs)
+        else:
+            lora_name = None
+        # Merge global extra_request_body with per-request extras
+        # Per-request parameters take precedence over global ones
+        merged_extra_body = {**extra_request_body, **request.extra_request_body}
+        request_func_input = RequestFuncInput(
+            model=model_id,
+            prompt=request.prompt,
+            api_url=api_url,
+            prompt_len=request.prompt_len,
+            output_len=request.output_len,
+            lora_name=lora_name,
+            image_data=request.image_data,
+            extra_request_body=merged_extra_body,
+            timestamp=request.timestamp,
+            routing_key=request.routing_key,
+        )
+        tasks.append(
+            asyncio.create_task(
+                limited_request_func(request_func_input=request_func_input, pbar=pbar)
+            )
+        )
+    outputs: List[RequestFuncOutput] = await asyncio.gather(*tasks)
+    if is_multi_turn:
+        outputs = [x for output in outputs for x in output]
+    # Stop profiler (only if profile_steps was not provided, as it auto-stops)
+    if profile and not (
+        hasattr(args, "profile_steps") and args.profile_steps is not None
+    ):
+        if pd_separated:
+            if pd_profile_urls:
+                await _call_profile_pd(pd_profile_urls, "stop")
+        else:
+            if getattr(args, "profile_num_steps", None) is None:
+                print("Stopping profiler...")
+                profile_output = await async_request_profile(
+                    api_url=base_url + "/stop_profile"
+                )
+                if profile_output.success:
+                    print("Profiler stopped")
+    if pbar is not None:
+        pbar.close()
+    if "sglang" in backend:
+        server_info = requests.get(
+            base_url + "/get_server_info", headers=get_auth_headers()
+        )
+        if server_info.status_code == 200:
+            server_info_json = server_info.json()
+            if "decode" in server_info_json:
+                server_info_json = server_info_json["decode"][0]
+            if (
+                "internal_states" in server_info_json
+                and server_info_json["internal_states"]
+            ):
+                accept_length = server_info_json["internal_states"][0].get(
+                    "avg_spec_accept_length", None
+                )
+            else:
+                accept_length = None
+        else:
+            accept_length = None
+    else:
+        accept_length = None
+    # Compute metrics and print results
+    benchmark_duration = time.perf_counter() - benchmark_start_time
+    metrics, output_lens = calculate_metrics(
+        input_requests=None if is_multi_turn else input_requests,
+        outputs=outputs,
+        dur_s=benchmark_duration,
+        tokenizer=tokenizer,
+        backend=backend,
+        accept_length=accept_length,
+        plot_throughput=args.plot_throughput,
+    )
+    print("\n{s:{c}^{n}}".format(s=" Serving Benchmark Result ", n=50, c="="))
+    print("{:<40} {:<10}".format("Backend:", backend))
+    print(
+        "{:<40} {:<10}".format(
+            "Traffic request rate:", "trace" if use_trace_timestamps else request_rate
+        )
+    )
+    print(
+        "{:<40} {:<10}".format(
+            "Max request concurrency:",
+            max_concurrency if max_concurrency else "not set",
+        )
+    )
+    print("{:<40} {:<10}".format("Successful requests:", metrics.completed))
+    print("{:<40} {:<10.2f}".format("Benchmark duration (s):", benchmark_duration))
+    print("{:<40} {:<10}".format("Total input tokens:", metrics.total_input))
+    print("{:<40} {:<10}".format("Total input text tokens:", metrics.total_input_text))
+    if args.dataset_name in ["image", "mmmu"]:
+        print(
+            "{:<40} {:<10}".format(
+                "Total input vision tokens:", metrics.total_input_vision
+            )
+        )
+    print("{:<40} {:<10}".format("Total generated tokens:", metrics.total_output))
+    print(
+        "{:<40} {:<10}".format(
+            "Total generated tokens (retokenized):", metrics.total_output_retokenized
+        )
+    )
+    print(
+        "{:<40} {:<10.2f}".format(
+            "Request throughput (req/s):", metrics.request_throughput
+        )
+    )
+    print(
+        "{:<40} {:<10.2f}".format(
+            "Input token throughput (tok/s):", metrics.input_throughput
+        )
+    )
+    print(
+        "{:<40} {:<10.2f}".format(
+            "Output token throughput (tok/s):", metrics.output_throughput
+        )
+    )
+    print(
+        "{:<40} {:<10.2f}".format(
+            "Peak output token throughput (tok/s):", metrics.max_output_tokens_per_s
+        )
+    )
+    print(
+        "{:<40} {:<10}".format(
+            "Peak concurrent requests:", metrics.max_concurrent_requests
+        )
+    )
+    print(
+        "{:<40} {:<10.2f}".format(
+            "Total token throughput (tok/s):", metrics.total_throughput
+        )
+    )
+    print("{:<40} {:<10.2f}".format("Concurrency:", metrics.concurrency))
+    if accept_length:
+        print("{:<40} {:<10.2f}".format("Accept length:", accept_length))
+    print("{s:{c}^{n}}".format(s="End-to-End Latency", n=50, c="-"))
+    print(
+        "{:<40} {:<10.2f}".format("Mean E2E Latency (ms):", metrics.mean_e2e_latency_ms)
+    )
+    print(
+        "{:<40} {:<10.2f}".format(
+            "Median E2E Latency (ms):", metrics.median_e2e_latency_ms
+        )
+    )
+    print(
+        "{:<40} {:<10.2f}".format("P90 E2E Latency (ms):", metrics.p90_e2e_latency_ms)
+    )
+    print(
+        "{:<40} {:<10.2f}".format("P99 E2E Latency (ms):", metrics.p99_e2e_latency_ms)
+    )
+    print("{s:{c}^{n}}".format(s="Time to First Token", n=50, c="-"))
+    print("{:<40} {:<10.2f}".format("Mean TTFT (ms):", metrics.mean_ttft_ms))
+    print("{:<40} {:<10.2f}".format("Median TTFT (ms):", metrics.median_ttft_ms))
+    print("{:<40} {:<10.2f}".format("P99 TTFT (ms):", metrics.p99_ttft_ms))
+    print(
+        "{s:{c}^{n}}".format(s="Time per Output Token (excl. 1st token)", n=50, c="-")
+    )
+    print("{:<40} {:<10.2f}".format("Mean TPOT (ms):", metrics.mean_tpot_ms))
+    print("{:<40} {:<10.2f}".format("Median TPOT (ms):", metrics.median_tpot_ms))
+    print("{:<40} {:<10.2f}".format("P99 TPOT (ms):", metrics.p99_tpot_ms))
+    print("{s:{c}^{n}}".format(s="Inter-Token Latency", n=50, c="-"))
+    print("{:<40} {:<10.2f}".format("Mean ITL (ms):", metrics.mean_itl_ms))
+    print("{:<40} {:<10.2f}".format("Median ITL (ms):", metrics.median_itl_ms))
+    print("{:<40} {:<10.2f}".format("P95 ITL (ms):", metrics.p95_itl_ms))
+    print("{:<40} {:<10.2f}".format("P99 ITL (ms):", metrics.p99_itl_ms))
+    print("{:<40} {:<10.2f}".format("Max ITL (ms):", metrics.max_itl_ms))
+    print("=" * 50)
+    resp = requests.get(base_url + "/get_server_info", headers=get_auth_headers())
+    server_info = resp.json() if resp.status_code == 200 else None
+    if (
+        metrics.median_ttft_ms is not None
+        and metrics.mean_itl_ms is not None
+        and metrics.output_throughput is not None
+    ):
+        result = {
+            # Arguments
+            "tag": getattr(args, "tag", None),
+            "backend": args.backend,
+            "dataset_name": args.dataset_name,
+            "request_rate": "trace" if use_trace_timestamps else request_rate,
+            "max_concurrency": max_concurrency,
+            "sharegpt_output_len": args.sharegpt_output_len,
+            "random_input_len": args.random_input_len,
+            "random_output_len": args.random_output_len,
+            "random_range_ratio": args.random_range_ratio,
+            # Information
+            "server_info": server_info,
+            # Results
+            "duration": benchmark_duration,
+            "completed": metrics.completed,
+            "total_input_tokens": metrics.total_input,
+            "total_input_text_tokens": metrics.total_input_text,
+            "total_input_vision_tokens": metrics.total_input_vision,
+            "total_output_tokens": metrics.total_output,
+            "total_output_tokens_retokenized": metrics.total_output_retokenized,
+            "request_throughput": metrics.request_throughput,
+            "input_throughput": metrics.input_throughput,
+            "output_throughput": metrics.output_throughput,
+            "total_throughput": metrics.total_throughput,
+            "mean_e2e_latency_ms": metrics.mean_e2e_latency_ms,
+            "median_e2e_latency_ms": metrics.median_e2e_latency_ms,
+            "std_e2e_latency_ms": metrics.std_e2e_latency_ms,
+            "p90_e2e_latency_ms": metrics.p90_e2e_latency_ms,
+            "p99_e2e_latency_ms": metrics.p99_e2e_latency_ms,
+            "mean_ttft_ms": metrics.mean_ttft_ms,
+            "median_ttft_ms": metrics.median_ttft_ms,
+            "std_ttft_ms": metrics.std_ttft_ms,
+            "p99_ttft_ms": metrics.p99_ttft_ms,
+            "mean_tpot_ms": metrics.mean_tpot_ms,
+            "median_tpot_ms": metrics.median_tpot_ms,
+            "std_tpot_ms": metrics.std_tpot_ms,
+            "p99_tpot_ms": metrics.p99_tpot_ms,
+            "mean_itl_ms": metrics.mean_itl_ms,
+            "median_itl_ms": metrics.median_itl_ms,
+            "std_itl_ms": metrics.std_itl_ms,
+            "p95_itl_ms": metrics.p95_itl_ms,
+            "p99_itl_ms": metrics.p99_itl_ms,
+            "concurrency": metrics.concurrency,
+            "accept_length": accept_length,
+            "max_output_tokens_per_s": metrics.max_output_tokens_per_s,
+            "max_concurrent_requests": metrics.max_concurrent_requests,
+        }
+    else:
+        print(f"Error running benchmark for request rate: {request_rate}")
+        print("-" * 30)
+        # Fall back to an empty summary so the file dump and return below still work.
+        result = {}
+    # Determine output file name
+    if args.output_file:
+        output_file_name = args.output_file
+    else:
+        now = datetime.now().strftime("%m%d")
+        if args.dataset_name == "image":
+            output_file_name = (
+                f"{args.backend}_{now}_{args.num_prompts}_{args.random_input_len}_"
+                f"{args.random_output_len}_{args.image_count}imgs_"
+                f"{args.image_resolution}.jsonl"
+            )
+        elif args.dataset_name.startswith("random"):
+            output_file_name = f"{args.backend}_{now}_{args.num_prompts}_{args.random_input_len}_{args.random_output_len}.jsonl"
+        else:
+            output_file_name = (
+                f"{args.backend}_{now}_{args.num_prompts}_{args.dataset_name}.jsonl"
+            )
+    result_details = {
+        "input_lens": [output.prompt_len for output in outputs],
+        "output_lens": output_lens,
+        "ttfts": [output.ttft for output in outputs],
+        "itls": [output.itl for output in outputs],
+        "generated_texts": [output.generated_text for output in outputs],
+        "errors": [output.error for output in outputs],
+    }
+    # Append results to a JSONL file
+    with open(output_file_name, "a") as file:
+        if args.output_details:
+            result_for_dump = result | result_details
+        else:
+            result_for_dump = result
+        file.write(json.dumps(result_for_dump) + "\n")
+    return result | result_details
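Review note: `benchmark` caps in-flight requests by wrapping each call in an `asyncio.Semaphore` when `max_concurrency` is set, and runs unbounded otherwise. The sketch below isolates that pattern and records the peak number of simultaneous calls. `run_limited` and `fake_request` are hypothetical, and the sleep stands in for a network round trip.

```python
import asyncio
from typing import Optional


async def run_limited(num_tasks: int, max_concurrency: Optional[int]) -> int:
    """Run num_tasks fake requests and return the peak number in flight."""
    semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # placeholder for the HTTP call
        in_flight -= 1

    async def limited() -> None:
        if semaphore is None:
            return await fake_request()
        async with semaphore:
            return await fake_request()

    tasks = [asyncio.create_task(limited()) for _ in range(num_tasks)]
    await asyncio.gather(*tasks)
    return peak


print(asyncio.run(run_limited(5, 2)))
print(asyncio.run(run_limited(5, None)))
```

All tasks are still created up front, so the semaphore bounds how many requests are active at once, not how many tasks exist.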
+def check_chat_template(model_path):
+    try:
+        tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+        return "chat_template" in tokenizer.init_kwargs
+    except Exception as e:
+        print(f"Failed to load tokenizer config with error={e}")
+        return False
+def set_global_args(args_: argparse.Namespace):
+    """Set the global args."""
+    global args
+    args = args_
+def run_benchmark(args_: argparse.Namespace):
+    global args
+    args = args_
+    # Set default value for max_concurrency if not present
+    if not hasattr(args, "max_concurrency"):
+        args.max_concurrency = None
+    # Set default value for warmup_requests if not present
+    if not hasattr(args, "warmup_requests"):
+        args.warmup_requests = 1
+    if not hasattr(args, "output_details"):
+        args.output_details = False
+    if not hasattr(args, "tokenize_prompt"):
+        args.tokenize_prompt = False
+    if not hasattr(args, "plot_throughput"):
+        args.plot_throughput = False
+    if not hasattr(args, "use_trace_timestamps"):
+        args.use_trace_timestamps = False
+    if not hasattr(args, "mooncake_slowdown_factor"):
+        args.mooncake_slowdown_factor = 1.0
+    if not hasattr(args, "mooncake_num_rounds"):
+        args.mooncake_num_rounds = 1
+    if not hasattr(args, "served_model_name"):
+        args.served_model_name = None
+    if getattr(args, "print_requests", False):
+        assert args.backend == "sglang-oai-chat"  # only this backend is supported for now
+    print(f"benchmark_args={args}")
+    # Set global environments
+    set_ulimit()
+    random.seed(args.seed)
+    np.random.seed(args.seed)
+    extra_request_body = {}
+    if args.extra_request_body:
+        extra_request_body = json.loads(args.extra_request_body)
+    if args.tokenize_prompt:
+        assert (
+            args.backend == "sglang"
+        ), "`--tokenize-prompt` only compatible with `--backend sglang` currently"
+    # Set url
+    if args.port is None:
+        args.port = {
+            "sglang": 30000,
+            "sglang-native": 30000,
+            "sglang-oai": 30000,
+            "lmdeploy": 23333,
+            "vllm": 8000,
+            "trt": 8000,
+            "gserver": 9988,
+            "truss": 8080,
+        }.get(args.backend, 30000)
+    model_url = (
+        f"{args.base_url}/v1/models"
+        if args.base_url
+        else f"http://{args.host}:{args.port}/v1/models"
+    )
+    if args.backend in ["sglang", "sglang-native"]:
+        api_url = (
+            f"{args.base_url}/generate"
+            if args.base_url
+            else f"http://{args.host}:{args.port}/generate"
+        )
+    elif args.backend in ["sglang-oai", "vllm", "lmdeploy"]:
+        api_url = (
+            f"{args.base_url}/v1/completions"
+            if args.base_url
+            else f"http://{args.host}:{args.port}/v1/completions"
+        )
+    elif args.backend in ["sglang-oai-chat", "vllm-chat", "lmdeploy-chat"]:
+        api_url = (
+            f"{args.base_url}/v1/chat/completions"
+            if args.base_url
+            else f"http://{args.host}:{args.port}/v1/chat/completions"
+        )
+    elif args.backend == "trt":
+        api_url = (
+            f"{args.base_url}/v2/models/ensemble/generate_stream"
+            if args.base_url
+            else f"http://{args.host}:{args.port}/v2/models/ensemble/generate_stream"
+        )
+        if args.model is None:
+            print("Please provide a model using `--model` when using `trt` backend.")
+            sys.exit(1)
+    elif args.backend == "gserver":
+        api_url = args.base_url if args.base_url else f"{args.host}:{args.port}"
+        args.model = args.model or "default"
+    elif args.backend == "truss":
+        api_url = (
+            f"{args.base_url}/v1/models/model:predict"
+            if args.base_url
+            else f"http://{args.host}:{args.port}/v1/models/model:predict"
+        )
+    base_url = (
+        f"http://{args.host}:{args.port}" if args.base_url is None else args.base_url
+    )
+    # Wait for server to be ready
+    if args.ready_check_timeout_sec > 0:
+        health_url = model_url if args.backend not in ("trt", "gserver") else base_url
+        if not wait_for_endpoint(health_url, args.ready_check_timeout_sec):
+            print(f"Server at {health_url} is not ready. Exiting.")
+            sys.exit(1)
+    # Get model name
+    if args.model is None:
+        if args.backend == "truss":
+            print(
+                "Please provide a model with `--model` when using the truss backend, e.g. --model meta-llama/Llama-3.1-8B-Instruct"
+            )
+            sys.exit(1)
+        try:
+            response = requests.get(model_url, headers=get_auth_headers())
+            model_list = response.json().get("data", [])
+            args.model = model_list[0]["id"] if model_list else None
+        except Exception as e:
+            print(f"Failed to fetch model from {model_url}. Error: {e}")
+            print(
+                "Please specify the correct host and port using `--host` and `--port`."
+            )
+            sys.exit(1)
+    if args.model is None:
+        print("No model specified or found. Please provide a model using `--model`.")
+        sys.exit(1)
+    if not check_chat_template(args.model):
+        print(
+            "\nWARNING: It is recommended to use a `Chat` or `Instruct` model for benchmarking.\n"
+            "If the model generates gibberish, the tokenizer may count the output tokens incorrectly.\n"
+        )
+    if args.dataset_name in ["image", "mmmu"]:
+        args.apply_chat_template = True
+        assert (
+            not args.tokenize_prompt
+        ), "`--tokenize-prompt` not compatible with image dataset"
+    if args.lora_request_distribution in ["distinct", "skewed"]:
+        assert (
+            args.lora_name is not None and len(args.lora_name) > 1
+        ), "More than 1 LoRA adapter must be specified via --lora-name to use 'distinct' or 'skewed' request distribution."
+    assert (
+        args.lora_zipf_alpha > 1
+    ), f"Got invalid value for --lora-zipf-alpha of {args.lora_zipf_alpha}. It must be greater than 1."
+    print(f"{args}\n")
+    # Read dataset
+    backend = args.backend
+    model_id = args.served_model_name or args.model
+    tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model
+    tokenizer = get_tokenizer(tokenizer_id)
+    input_requests = get_dataset(args, tokenizer, model_id)
+    # compatible with SimpleNamespace
+    if not hasattr(args, "flush_cache"):
+        args.flush_cache = False
+    # Prepare LoRA arguments
+    lora_request_distribution = (
+        args.lora_request_distribution if args.lora_name is not None else None
+    )
+    lora_zipf_alpha = (
+        args.lora_zipf_alpha
+        if args.lora_name is not None and args.lora_request_distribution == "skewed"
+        else None
+    )
+    return asyncio.run(
+        benchmark(
+            backend=backend,
+            api_url=api_url,
+            base_url=base_url,
+            model_id=model_id,
+            tokenizer=tokenizer,
+            input_requests=input_requests,
+            request_rate=args.request_rate,
+            max_concurrency=args.max_concurrency,
+            disable_tqdm=args.disable_tqdm,
+            lora_names=args.lora_name,
+            lora_request_distribution=lora_request_distribution,
+            lora_zipf_alpha=lora_zipf_alpha,
+            extra_request_body=extra_request_body,
+            profile=args.profile,
+            pd_separated=args.pd_separated,
+            flush_cache=args.flush_cache,
+            warmup_requests=args.warmup_requests,
+            use_trace_timestamps=args.use_trace_timestamps,
+            mooncake_slowdown_factor=args.mooncake_slowdown_factor,
+            mooncake_num_rounds=args.mooncake_num_rounds,
+            profile_prefill_url=getattr(args, "profile_prefill_url", None),
+            profile_decode_url=getattr(args, "profile_decode_url", None),
+        )
+    )
+class LoRAPathAction(argparse.Action):
+    def __call__(self, parser, namespace, values, option_string=None):
+        setattr(namespace, self.dest, [])
+        for lora_name in values:
+            getattr(namespace, self.dest).append(lora_name)
+if __name__ == "__main__":
+    parser = ArgumentParser(description="Benchmark the online serving throughput.")
+    parser.add_argument(
+        "--backend",
+        type=str,
+        choices=list(ASYNC_REQUEST_FUNCS.keys()),
+        default="sglang",
+        help="Backend to benchmark; choose the one matching the LLM inference engine. Default: sglang.",
+    )
+    parser.add_argument(
+        "--base-url",
+        type=str,
+        default=None,
+        help="Server or API base url if not using http host and port.",
+    )
+    parser.add_argument(
+        "--host", type=str, default="0.0.0.0", help="Default host is 0.0.0.0."
+    )
+    parser.add_argument(
+        "--port",
+        type=int,
+        help="If not set, the default port of the selected LLM inference engine is used.",
+    )
+    parser.add_argument(
+        "--ready-check-timeout-sec",
+        type=int,
+        default=60,
+        help="Maximum time in seconds to wait for the server to be ready before benchmarking. Set to 0 to skip. Default: 60.",
+    )
+    parser.add_argument(
+        "--dataset-name",
+        type=str,
+        default="sharegpt",
+        choices=[
+            "sharegpt",
+            "custom",
+            "openai",
+            "random",
+            "random-ids",
+            "generated-shared-prefix",
+            "mmmu",
+            "image",
+            "mooncake",
+        ],
+        help="Name of the dataset to benchmark on.",
+    )
+    parser.add_argument(
+        "--dataset-path", type=str, default="", help="Path to the dataset."
+    )
+    parser.add_argument(
+        "--model",
+        type=str,
+        help="Name or path of the model. If not set, the first model listed by the server's /v1/models endpoint is used.",
+    )
+    parser.add_argument(
+        "--served-model-name",
+        type=str,
+        help="The name of the model as served by the serving service. If not set, this defaults to the value of --model.",
+    )
+    parser.add_argument(
+        "--tokenizer",
+        type=str,
+        help="Name or path of the tokenizer. If not set, defaults to the value of --model.",
+    )
+    parser.add_argument(
+        "--num-prompts",
+        type=int,
+        default=1000,
+        help="Number of prompts to process. Default is 1000.",
+    )
+    parser.add_argument(
+        "--sharegpt-output-len",
+        type=int,
+        default=None,
+        help="Output length for each request. Overrides the output length from the ShareGPT dataset.",
+    )
+    parser.add_argument(
+        "--sharegpt-context-len",
+        type=int,
+        default=None,
+        help="The context length of the model for the ShareGPT dataset. Requests longer than the context length will be dropped.",
+    )
+    parser.add_argument(
+        "--random-input-len",
+        type=int,
+        default=1024,
+        help="Number of input tokens per request, used only for random and image dataset.",
+    )
+    parser.add_argument(
+        "--random-output-len",
+        default=1024,
+        type=int,
+        help="Number of output tokens per request, used only for random and image dataset.",
+    )
+    parser.add_argument(
+        "--random-range-ratio",
+        type=float,
+        default=0.0,
+        help="Range of sampled ratio of input/output length, "
+        "used only for random and image dataset.",
+    )
+    # image dataset args
+    parser.add_argument(
+        "--image-count",
+        type=int,
+        default=1,
+        help="Number of images per request (only available with the image dataset)",
+    )
+    parser.add_argument(
+        "--image-resolution",
+        type=str,
+        default="1080p",
+        help=(
+            "Resolution of images for image dataset. "
+            "Supports presets 4k/1080p/720p/360p or custom 'heightxwidth' (e.g., 1080x1920)."
+        ),
+    )
+    parser.add_argument(
+        "--random-image-count",
+        action="store_true",
+        help="Enable Random Image Count",
+    )
+    parser.add_argument(
+        "--image-format",
+        type=str,
+        default="jpeg",
+        help=("Format of images for image dataset. " "Supports jpeg and png."),
+    )
+    parser.add_argument(
+        "--image-content",
+        type=str,
+        default="random",
+        help=("Content for images for image dataset. " "Supports random and blank."),
+    )
+    parser.add_argument(
+        "--request-rate",
+        type=float,
+        default=float("inf"),
+        help="Number of requests per second. If this is inf, then all the requests are sent at time 0. "
+        "Otherwise, we use Poisson process to synthesize the request arrival times. Default is inf.",
+    )
+    parser.add_argument(
+        "--use-trace-timestamps",
+        action="store_true",
+        help="Use timestamps from the trace file for request scheduling. Only valid for 'mooncake' dataset.",
+    )
+    parser.add_argument(
+        "--max-concurrency",
+        type=int,
+        default=None,
+        help="Maximum number of concurrent requests. This can be used "
+        "to help simulate an environment where a higher level component "
+        "is enforcing a maximum number of concurrent requests. While the "
+        "--request-rate argument controls the rate at which requests are "
+        "initiated, this argument will control how many are actually allowed "
+        "to execute at a time. This means that when used in combination, the "
+        "actual request rate may be lower than specified with --request-rate, "
+        "if the server is not processing requests fast enough to keep up.",
+    )
+    parser.add_argument("--output-file", type=str, help="Output JSONL file name.")
+    parser.add_argument(
+        "--output-details", action="store_true", help="Output details of benchmarking."
+    )
+    parser.add_argument(
+        "--print-requests",
+        action="store_true",
+        help="Print requests immediately during benchmarking. Useful for spotting issues quickly.",
+    )
+    parser.add_argument(
+        "--disable-tqdm",
+        action="store_true",
+        help="Specify to disable tqdm progress bar.",
+    )
+    parser.add_argument(
+        "--disable-stream",
+        action="store_true",
+        help="Disable streaming mode.",
+    )
+    parser.add_argument(
+        "--return-logprob",
+        action="store_true",
+        help="Return logprob.",
+    )
+    parser.add_argument(
+        "--return-routed-experts",
+        action="store_true",
+        help="Return routed experts.",
+    )
+    parser.add_argument("--seed", type=int, default=1, help="The random seed.")
+    parser.add_argument(
+        "--disable-ignore-eos",
+        action="store_true",
+        help="Disable ignoring EOS.",
+    )
+    parser.add_argument(
+        "--extra-request-body",
+        metavar='{"key1": "value1", "key2": "value2"}',
+        type=str,
+        help="Append the given JSON object to the request payload. You can use this to specify "
+        "additional generation parameters such as sampling params.",
+    )
+    parser.add_argument(
+        "--apply-chat-template",
+        action="store_true",
+        help="Apply chat template",
+    )
+    parser.add_argument(
+        "--profile",
+        action="store_true",
+        help="Use Torch Profiler. The endpoint must be launched with "
+        "SGLANG_TORCH_PROFILER_DIR to enable profiler.",
+    )
+    parser.add_argument(
+        "--plot-throughput",
+        action="store_true",
+        help="Plot throughput and concurrent requests over time. Requires termplotlib and gnuplot.",
+    )
+    # TODO unify all these
+    parser.add_argument(
+        "--profile-activities",
+        type=str,
+        nargs="+",
+        default=["CPU", "GPU"],
+        choices=["CPU", "GPU", "CUDA_PROFILER", "XPU"],
+        help="Profiler activities to capture: CPU, GPU, XPU, CUDA_PROFILER.",
+    )
+    parser.add_argument(
+        "--profile-start-step",
+        type=int,
+        default=None,
+        help="Start profiling after this many forward steps. Useful for warmup.",
+    )
+    parser.add_argument(
+        "--profile-steps",
+        type=int,
+        default=None,
+        help="Number of steps to profile. If specified, profiling stops automatically after this many steps.",
+    )
+    parser.add_argument("--profile-num-steps", type=int, default=None)
+    parser.add_argument("--profile-by-stage", action="store_true", default=False)
+    parser.add_argument("--profile-stages", nargs="+", default=None)
+    parser.add_argument(
+        "--profile-output-dir",
+        type=str,
+        default=None,
+        help="Output directory for profile traces.",
+    )
+    parser.add_argument(
+        "--profile-prefix",
+        type=str,
+        default=None,
+        help="Prefix for profile trace filenames.",
+    )
+    parser.add_argument(
+        "--lora-name",
+        type=str,
+        nargs="*",
+        default=None,
+        action=LoRAPathAction,
+        help="The names of LoRA adapters. You can provide a list of names in the format {name} {name} {name}...",
+    )
+    parser.add_argument(
+        "--lora-request-distribution",
+        type=str,
+        default="uniform",
+        choices=[
+            "uniform",
+            "distinct",
+            "skewed",
+        ],
+        help="Distribution from which to sample the LoRA adapters specified in --lora-name. Borrowed from the Punica paper. "
+        "'distinct' distribution means selecting a new LoRA adapter for every request. "
+        "'skewed' distribution follows the Zipf distribution, where the number of requests "
+        "to model i specified in --lora-name is α times the number of requests for model i+1, "
+        "where α > 1.",
+    )
+    parser.add_argument(
+        "--lora-zipf-alpha",
+        type=float,
+        default=1.5,
+        help="The parameter to use for the Zipf distribution when --lora-request-distribution='skewed'.",
+    )
+    parser.add_argument(
+        "--prompt-suffix",
+        type=str,
+        default="",
+        help="Suffix applied to the end of all user prompts, followed by assistant prompt suffix.",
+    )
+    parser.add_argument(
+        "--pd-separated",
+        action="store_true",
+        help="Benchmark PD disaggregation server",
+    )
+    # Create a mutually exclusive group for profiling URLs
+    # In PD separated mode, prefill and decode workers must be profiled separately
+    profile_url_group = parser.add_mutually_exclusive_group()
+    profile_url_group.add_argument(
+        "--profile-prefill-url",
+        type=str,
+        nargs="*",
+        default=None,
+        help="URL(s) of the prefill worker(s) for profiling in PD separated mode. "
+        "Can specify multiple URLs: --profile-prefill-url http://localhost:30000 http://localhost:30001. "
+        "NOTE: Cannot be used together with --profile-decode-url. "
+        "In PD separated mode, prefill and decode workers must be profiled separately.",
+    )
+    profile_url_group.add_argument(
+        "--profile-decode-url",
+        type=str,
+        nargs="*",
+        default=None,
+        help="URL(s) of the decode worker(s) for profiling in PD separated mode. "
+        "Can specify multiple URLs: --profile-decode-url http://localhost:30010 http://localhost:30011. "
+        "NOTE: Cannot be used together with --profile-prefill-url. "
+        "In PD separated mode, prefill and decode workers must be profiled separately.",
+    )
+    parser.add_argument(
+        "--flush-cache",
+        action="store_true",
+        help="Flush the cache before running the benchmark",
+    )
+    parser.add_argument(
+        "--warmup-requests",
+        type=int,
+        default=1,
+        help="Number of warmup requests to run before the benchmark",
+    )
+    parser.add_argument(
+        "--tokenize-prompt",
+        action="store_true",
+        help="Use integer ids instead of string for inputs. Useful to control prompt lengths accurately",
+    )
+    group = parser.add_argument_group("generated-shared-prefix dataset arguments")
+    group.add_argument(
+        "--gsp-num-groups",
+        type=int,
+        default=64,
+        help="Number of system prompt groups for generated-shared-prefix dataset",
+    )
+    group.add_argument(
+        "--gsp-prompts-per-group",
+        type=int,
+        default=16,
+        help="Number of prompts per system prompt group for generated-shared-prefix dataset",
+    )
+    group.add_argument(
+        "--gsp-system-prompt-len",
+        type=int,
+        default=2048,
+        help="Target length in tokens for system prompts in generated-shared-prefix dataset",
+    )
+    group.add_argument(
+        "--gsp-question-len",
+        type=int,
+        default=128,
+        help="Target length in tokens for questions in generated-shared-prefix dataset",
+    )
+    group.add_argument(
+        "--gsp-output-len",
+        type=int,
+        default=256,
+        help="Target length in tokens for outputs in generated-shared-prefix dataset",
+    )
+    group.add_argument(
+        "--gsp-range-ratio",
+        type=float,
+        # WARN: The default 1.0 is for backward compatibility, and is different from the default 0.0 for random dataset
+        default=1.0,
+        help="Range of sampled ratio of input/output length, used only for gsp dataset.",
+    )
+    group.add_argument(
+        "--gsp-fast-prepare",
+        action="store_true",
+        help="Speed up dataset preparation by skipping statistics computation. This makes some output statistics inaccurate but is suitable for pressure tests.",
+    )
+    group.add_argument(
+        "--gsp-send-routing-key",
+        action="store_true",
+        help="Send routing key in requests via X-SMG-Routing-Key header. Requests with the same prefix share the same routing key.",
+    )
+    group.add_argument(
+        "--gsp-num-turns",
+        type=int,
+        default=1,
+        help="Number of turns for multi-turn conversations. If > 1, each prompt becomes a list of questions sharing the same system prefix.",
+    )
+    group.add_argument(
+        "--gsp-ordered",
+        action="store_true",
+        help="Keep requests in order without shuffling. By default, requests are shuffled randomly.",
+    )
+    mooncake_group = parser.add_argument_group("mooncake dataset arguments")
+    mooncake_group.add_argument(
+        "--mooncake-slowdown-factor",
+        type=float,
+        default=1.0,
+        help="Slowdown factor for replaying the mooncake trace. "
+        "A value of 2.0 means the replay is twice as slow. "
+        "NOTE: --request-rate is IGNORED in mooncake mode.",
+    )
+    mooncake_group.add_argument(
+        "--mooncake-num-rounds",
+        type=int,
+        default=1,
+        help="Number of conversation rounds for each session in the mooncake dataset. "
+        "A value > 1 will enable true multi-turn session benchmarking.",
+    )
+    mooncake_group.add_argument(
+        "--mooncake-workload",
+        type=str,
+        default="conversation",
+        choices=[
+            "mooncake",
+            "conversation",
+            "synthetic",
+            "toolagent",
+        ],
+        help="Underlying workload for the mooncake dataset.",
+    )
+    parser.add_argument(
+        "--tag", type=str, default=None, help="The tag to be dumped to output."
+    )
+    parser.add_argument(
+        "--header",
+        type=str,
+        nargs="+",
+        default=None,
+        help="Custom HTTP headers in Key=Value format. Example: --header MyHeader=MY_VALUE MyAnotherHeader=myanothervalue",
+    )
+    args = parser.parse_args()
+    run_benchmark(args)
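
Note on `--request-rate`: the help text above says arrival times are synthesized with a Poisson process unless the rate is inf, in which case every request is sent at time 0. A minimal standalone sketch of that idea follows. `synthesize_arrival_times` is a hypothetical helper written for illustration only; it is not a function from bench_serving.py, and the actual benchmark loop may schedule requests differently.

```python
import random
from typing import List


def synthesize_arrival_times(
    num_requests: int, request_rate: float, seed: int = 1
) -> List[float]:
    """Return send times in seconds from benchmark start (hypothetical helper).

    A Poisson process with rate `request_rate` has independent, exponentially
    distributed gaps between arrivals, with mean 1 / request_rate.
    """
    if request_rate <= 0:
        raise ValueError("request_rate must be positive")
    if request_rate == float("inf"):
        # Infinite rate: every request is sent at time 0.
        return [0.0] * num_requests

    rng = random.Random(seed)  # local RNG so the schedule is reproducible
    times: List[float] = []
    now = 0.0
    for _ in range(num_requests):
        times.append(now)
        # Draw the gap to the next request.
        now += rng.expovariate(request_rate)
    return times


if __name__ == "__main__":
    # Roughly 10 requests per second on average.
    print(synthesize_arrival_times(5, request_rate=10.0))
```

When `--max-concurrency` is also set, the help text notes that the achieved rate may fall below the requested one; this sketch models only the intended send times, not that throttling.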

sglang/python/sglang/benchmark/__init__.py ADDED

File without changes

sglang/python/sglang/benchmark/__pycache__/__init__.cpython-311.pyc ADDED

Binary file (169 Bytes).

sglang/python/sglang/benchmark/__pycache__/utils.cpython-311.pyc ADDED

Binary file (7.65 kB).

sglang/python/sglang/benchmark/datasets/__init__.py ADDED

	@@ -0,0 +1,47 @@

+from typing import Dict, Type
+from sglang.benchmark.datasets.common import BaseDataset, DatasetRow
+from sglang.benchmark.datasets.custom import CustomDataset
+from sglang.benchmark.datasets.generated_shared_prefix import (
+    GeneratedSharedPrefixDataset,
+)
+from sglang.benchmark.datasets.image import ImageDataset
+from sglang.benchmark.datasets.mmmu import MMMUDataset
+from sglang.benchmark.datasets.mooncake import MooncakeDataset
+from sglang.benchmark.datasets.openai_dataset import OpenAIDataset
+from sglang.benchmark.datasets.random import RandomDataset
+from sglang.benchmark.datasets.sharegpt import ShareGPTDataset
+DATASET_MAPPING: Dict[str, Type[BaseDataset]] = {
+    "sharegpt": ShareGPTDataset,
+    "custom": CustomDataset,
+    "openai": OpenAIDataset,
+    # TODO: "random" vs "random-ids" should be a flag (e.g. --random-source=sharegpt|integers),
+    # not two separate dataset names sharing the same class.
+    "random": RandomDataset,
+    "random-ids": RandomDataset,
+    "generated-shared-prefix": GeneratedSharedPrefixDataset,
+    "mmmu": MMMUDataset,
+    "image": ImageDataset,
+    "mooncake": MooncakeDataset,
+}
+def get_dataset(args, tokenizer, model_id=None):
+    dataset_name = args.dataset_name
+    if dataset_name.startswith("random") and dataset_name not in DATASET_MAPPING:
+        dataset_name = "random-ids"
+    if dataset_name not in DATASET_MAPPING:
+        raise ValueError(f"Unknown dataset: {args.dataset_name}")
+    dataset_cls = DATASET_MAPPING[dataset_name]
+    dataset = dataset_cls.from_args(args)
+    return dataset.load(tokenizer=tokenizer, model_id=model_id)
+__all__ = [
+    "DATASET_MAPPING",
+    "DatasetRow",
+    "get_dataset",
+]
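
Note on `get_dataset`: the function above resolves a dataset name to a class through `DATASET_MAPPING`, builds an instance with `from_args`, and calls `load`. Any unregistered name that starts with "random" falls back to "random-ids" before the unknown-name check. The sketch below reproduces that dispatch with toy stand-ins. `ToyShareGPT`, `ToyRandom`, and `REGISTRY` are hypothetical names used only for illustration, and the real dataset classes take more inputs than shown here.

```python
from types import SimpleNamespace
from typing import Dict, List, Type


class BaseDataset:
    """Toy base class mirroring the from_args / load interface used above."""

    @classmethod
    def from_args(cls, args) -> "BaseDataset":
        return cls()

    def load(self, tokenizer=None, model_id=None) -> List[str]:
        raise NotImplementedError


class ToyShareGPT(BaseDataset):  # hypothetical stand-in
    def load(self, tokenizer=None, model_id=None) -> List[str]:
        return ["sharegpt-row"]


class ToyRandom(BaseDataset):  # hypothetical stand-in
    def load(self, tokenizer=None, model_id=None) -> List[str]:
        return ["random-row"]


# Two names may share one class, as "random" and "random-ids" do above.
REGISTRY: Dict[str, Type[BaseDataset]] = {
    "sharegpt": ToyShareGPT,
    "random": ToyRandom,
    "random-ids": ToyRandom,
}


def get_dataset(args, tokenizer=None, model_id=None) -> List[str]:
    name = args.dataset_name
    # Unregistered "random*" names fall back to "random-ids".
    if name.startswith("random") and name not in REGISTRY:
        name = "random-ids"
    if name not in REGISTRY:
        raise ValueError(f"Unknown dataset: {args.dataset_name}")
    return REGISTRY[name].from_args(args).load(tokenizer=tokenizer, model_id=model_id)


if __name__ == "__main__":
    print(get_dataset(SimpleNamespace(dataset_name="sharegpt")))
    print(get_dataset(SimpleNamespace(dataset_name="random-custom")))
```

In bench_serving.py the argparse `choices` list already restricts `--dataset-name` to registered names, so the fallback branch matters mainly for callers that build `args` themselves, for example with a `SimpleNamespace`.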

sglang/python/sglang/benchmark/datasets/__pycache__/__init__.cpython-311.pyc ADDED

Binary file (2.1 kB).

sglang/python/sglang/benchmark/datasets/__pycache__/common.cpython-311.pyc ADDED

Binary file (5.24 kB).

sglang/python/sglang/benchmark/datasets/__pycache__/custom.cpython-311.pyc ADDED

Binary file (6.7 kB).

sglang/python/sglang/benchmark/datasets/__pycache__/generated_shared_prefix.cpython-311.pyc ADDED

Binary file (11.9 kB).

sglang/python/sglang/benchmark/datasets/__pycache__/image.cpython-311.pyc ADDED

Binary file (12.6 kB).

sglang/python/sglang/benchmark/datasets/__pycache__/mmmu.cpython-311.pyc ADDED

Binary file (5.98 kB).