# Evaluating New Models with SGLang
This document provides commands for evaluating models' accuracy and performance. Before open-sourcing a new model, we strongly suggest running these commands to verify that the scores match your internal benchmark results.
**For cross verification, please submit commands for installation, server launching, and benchmark running with all the scores and hardware requirements when open-sourcing your models.**
[Reference: MiniMax M2](https://github.com/sgl-project/sglang/pull/12129)
## Accuracy
### LLMs
SGLang provides built-in scripts to evaluate common benchmarks.
**MMLU**
```bash
python -m sglang.test.run_eval \
    --eval-name mmlu \
    --port 30000 \
    --num-examples 1000 \
    --max-tokens 8192
```
**GSM8K**
```bash
python -m sglang.test.few_shot_gsm8k \
    --host 127.0.0.1 \
    --port 30000 \
    --num-questions 200 \
    --num-shots 5
```
**HellaSwag**
```bash
python benchmark/hellaswag/bench_sglang.py \
    --host 127.0.0.1 \
    --port 30000 \
    --num-questions 200 \
    --num-shots 20
```
**GPQA**
```bash
python -m sglang.test.run_eval \
    --eval-name gpqa \
    --port 30000 \
    --num-examples 198 \
    --max-tokens 120000 \
    --repeat 8
```
```{tip}
For reasoning models, add `--thinking-mode <mode>` (e.g., `qwen3`, `deepseek-v3`). You may skip it if the model has forced thinking enabled.
```
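For example, a GPQA run against a Qwen3-style reasoning model might combine the command above with the tip's flag (a sketch; the mode value depends on your model):

```bash
python -m sglang.test.run_eval \
    --eval-name gpqa \
    --port 30000 \
    --num-examples 198 \
    --max-tokens 120000 \
    --repeat 8 \
    --thinking-mode qwen3
```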
**HumanEval**
```bash
pip install human_eval
python -m sglang.test.run_eval \
    --eval-name humaneval \
    --num-examples 10 \
    --port 30000
```
### VLMs
**MMMU**
```bash
python benchmark/mmmu/bench_sglang.py \
    --port 30000 \
    --concurrency 64
```
```{tip}
You can set max tokens by passing `--extra-request-body '{"max_tokens": 4096}'`.
```
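For example, combining the flag with the command above caps generation at 4096 tokens per request:

```bash
python benchmark/mmmu/bench_sglang.py \
    --port 30000 \
    --concurrency 64 \
    --extra-request-body '{"max_tokens": 4096}'
```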
For models capable of processing video, we recommend extending the evaluation to include `VideoMME`, `MVBench`, and other relevant benchmarks.
## Performance
Performance benchmarks measure **Latency** (Time To First Token, TTFT) and **Throughput** (tokens/second).
### LLMs
**Latency-Sensitive Benchmark**
This simulates a low-concurrency scenario (e.g., a single user) to measure latency.
```bash
python -m sglang.bench_serving \
    --backend sglang \
    --host 0.0.0.0 \
    --port 30000 \
    --dataset-name random \
    --num-prompts 10 \
    --max-concurrency 1
```
**Throughput-Sensitive Benchmark**
This simulates a high-traffic scenario to measure maximum system throughput.
```bash
python -m sglang.bench_serving \
    --backend sglang \
    --host 0.0.0.0 \
    --port 30000 \
    --dataset-name random \
    --num-prompts 1000 \
    --max-concurrency 100
```
**Single Batch Performance**
You can also benchmark the performance of processing a single batch offline.
```bash
python -m sglang.bench_one_batch_server \
    --model <model-path> \
    --batch-size 8 \
    --input-len 1024 \
    --output-len 1024
```
You can also run more granular `bench_serving` benchmarks by varying the load (a sweep sketch follows the list):
- **Low Concurrency**: `--num-prompts 10 --max-concurrency 1`
- **Medium Concurrency**: `--num-prompts 80 --max-concurrency 16`
- **High Concurrency**: `--num-prompts 500 --max-concurrency 100`
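A minimal sketch of such a sweep, assuming an SGLang server is already running on port 30000; the prompt counts and concurrency levels mirror the list above:

```bash
# Sweep bench_serving across the three load levels listed above.
for cfg in "10 1" "80 16" "500 100"; do
    set -- $cfg  # $1 = num prompts, $2 = max concurrency
    python -m sglang.bench_serving \
        --backend sglang \
        --host 0.0.0.0 \
        --port 30000 \
        --dataset-name random \
        --num-prompts "$1" \
        --max-concurrency "$2"
done
```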
## Reporting Results
For each evaluation, please report:
1. **Metric score**: accuracy (%) for LLMs and VLMs; latency (ms) and throughput (tok/s) for LLMs.
2. **Environment settings**: GPU type/count, SGLang commit hash.
3. **Launch configuration**: Model path, TP size, and any special flags.
4. **Evaluation parameters**: Number of shots, examples, max tokens.
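An illustrative skeleton for one benchmark entry (all bracketed values are placeholders to fill in; the evaluation parameters shown match the GSM8K command above):

```markdown
### GSM8K
- Score: <accuracy %>
- Environment: <GPU type and count>, SGLang commit <hash>
- Launch: <model path>, TP=<size>, <special flags>
- Eval parameters: 5-shot, 200 questions, default max tokens
```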