## Launch GLM-4.5 / GLM-4.6 / GLM-4.7 with SGLang

To serve GLM-4.5 / GLM-4.6 FP8 models on 8xH100/H200 GPUs:

```bash
python3 -m sglang.launch_server --model zai-org/GLM-4.6-FP8 --tp 8
```
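
Once the server is running, a quick request confirms that it responds. The following is a minimal sketch using only the Python standard library. It assumes the server listens on `127.0.0.1:30000` (the address used in the sample request later in this document) and exposes the OpenAI-compatible `/v1/chat/completions` route. The helper names `chat_payload`, `post_json`, and `main` are illustrative, not part of SGLang.

```python
import json
import urllib.request

# Assumed address; adjust if the server was started on another host or port.
BASE_URL = "http://127.0.0.1:30000"


def chat_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build a minimal OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def post_json(url: str, body: dict, timeout: float = 60.0) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def main() -> None:
    """Send one short prompt and print the reply (requires a running server)."""
    body = chat_payload("zai-org/GLM-4.6-FP8", "Say hello in one sentence.")
    reply = post_json(f"{BASE_URL}/v1/chat/completions", body)
    print(reply["choices"][0]["message"]["content"])
```

Call `main()` after the server has finished loading the model. Nothing is sent over the network until then.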

### EAGLE Speculative Decoding

**Description**: SGLang supports [EAGLE speculative decoding](https://docs.sglang.io/advanced_features/speculative_decoding.html#EAGLE-Decoding) for GLM-4.5 / GLM-4.6 models.

**Usage**:
Add the arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens` to enable this feature. For example:

```bash
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.6-FP8 \
  --tp-size 8 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.9 \
  --served-model-name glm-4.6-fp8 \
  --enable-custom-logit-processor
```
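
When the same launch is scripted for several model versions, it can help to assemble the argument list in code. The sketch below is one possible way to do that, not an SGLang utility. `build_launch_command` and `TOOL_CALL_PARSERS` are hypothetical names, the flags are the ones shown in the command above, and the parser mapping follows this document's note on `glm45` and `glm47`. Keeping `--reasoning-parser glm45` for every family is an assumption carried over from the example.

```python
import shlex

# Tool-call parser per model family, as stated in this document's note.
TOOL_CALL_PARSERS = {"GLM-4.5": "glm45", "GLM-4.6": "glm45", "GLM-4.7": "glm47"}


def build_launch_command(
    model_path: str,
    family: str,
    tp_size: int = 8,
    num_steps: int = 3,
    eagle_topk: int = 1,
    num_draft_tokens: int = 4,
) -> list[str]:
    """Return the argv list for an EAGLE-enabled launch (defaults mirror the example)."""
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp-size", str(tp_size),
        "--tool-call-parser", TOOL_CALL_PARSERS[family],
        "--reasoning-parser", "glm45",  # assumption: same value as the example
        "--speculative-algorithm", "EAGLE",
        "--speculative-num-steps", str(num_steps),
        "--speculative-eagle-topk", str(eagle_topk),
        "--speculative-num-draft-tokens", str(num_draft_tokens),
    ]


command = build_launch_command("zai-org/GLM-4.6-FP8", "GLM-4.6")
print(shlex.join(command))  # shell-quoted form, ready to paste into a terminal
```

The optional flags from the example (`--mem-fraction-static`, `--served-model-name`, `--enable-custom-logit-processor`) are left out for brevity and can be appended to the returned list.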

```{tip}
To enable the experimental overlap scheduler for EAGLE speculative decoding, set the environment variable `SGLANG_ENABLE_SPEC_V2=1`. This can improve performance by enabling overlap scheduling between the draft and verification stages.
```
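
If the server is started from a Python script rather than a shell, the variable can be passed through the child process environment. This is a small sketch under that assumption. `spec_v2_env` and `launch` are hypothetical helper names; only the variable name and value come from the tip above.

```python
import os
import subprocess


def spec_v2_env(base_env=None) -> dict:
    """Return a copy of the environment with SGLANG_ENABLE_SPEC_V2 set to "1"."""
    env = dict(os.environ if base_env is None else base_env)
    env["SGLANG_ENABLE_SPEC_V2"] = "1"
    return env


def launch(command: list) -> subprocess.Popen:
    """Start the server process with the overlap scheduler variable set."""
    return subprocess.Popen(command, env=spec_v2_env())
```

Here `command` is the argument list for `sglang.launch_server`. Copying the environment keeps the parent process unchanged, so the experimental setting applies only to the launched server.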

### Thinking Budget for GLM-4.5 / GLM-4.6

**Note**: For GLM-4.7, set `--tool-call-parser` to `glm47`; for GLM-4.5 and GLM-4.6, set it to `glm45`.

In SGLang, we can implement a thinking budget with `CustomLogitProcessor`.
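
To make the idea concrete, here is a toy illustration of how a logit processor can enforce a thinking budget: once the number of tokens generated after the thinking-start token reaches the budget, every logit except the thinking-end token is masked out. This is a conceptual sketch over plain Python lists with made-up token IDs. It is not SGLang's implementation, which may differ in detail (for example, by operating on batched tensors).

```python
import math


def apply_thinking_budget(logits, token_ids, think_start_id, think_end_id, budget):
    """Force the thinking-end token once `budget` thinking tokens were produced.

    logits:    next-token scores, one per vocabulary entry
    token_ids: all tokens so far (prompt plus generated output)
    """
    if think_start_id not in token_ids:
        return logits  # the model is not in a thinking block

    # Position of the most recent thinking-start token.
    start = len(token_ids) - 1 - token_ids[::-1].index(think_start_id)
    thinking = token_ids[start + 1:]

    if think_end_id in thinking or len(thinking) < budget:
        return logits  # thinking already closed, or budget not yet used up

    # Budget exhausted: leave only the thinking-end token selectable.
    forced = [-math.inf] * len(logits)
    forced[think_end_id] = 0.0
    return forced
```

With a five-token vocabulary where ID 3 opens and ID 4 closes the thinking block, `apply_thinking_budget([0.1] * 5, [0, 3, 1, 2], 3, 4, budget=2)` masks everything except index 4, so the next sampled token must end the thinking block.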

Launch the server with the `--enable-custom-logit-processor` flag enabled.

Sample request:

```python
import openai
from rich.pretty import pprint
from sglang.srt.sampling.custom_logit_processor import Glm4MoeThinkingBudgetLogitProcessor

# Point the OpenAI client at the local SGLang server.
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="*")
response = client.chat.completions.create(
    # If the server was started with --served-model-name, use that name here.
    model="zai-org/GLM-4.6",
    messages=[
        {
            "role": "user",
            "content": "Question: Is Paris the Capital of France?",
        }
    ],
    max_tokens=1024,
    extra_body={
        # Serialized processor that the server applies during sampling.
        "custom_logit_processor": Glm4MoeThinkingBudgetLogitProcessor().to_str(),
        "custom_params": {
            # Maximum number of thinking tokens for this request.
            "thinking_budget": 512,
        },
    },
)
pprint(response)
```
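
When several requests use different budgets, the `extra_body` dictionary can be built by a small helper, and the reply can be split into its reasoning and answer parts. The sketch below uses hypothetical helper names (`thinking_budget_extra_body`, `split_reply`). It assumes the server runs with a reasoning parser so that the message carries a `reasoning_content` field; if that field is absent, the helper returns `None` for it. A stand-in response object is used so the snippet runs without a server.

```python
from types import SimpleNamespace


def thinking_budget_extra_body(processor_str: str, thinking_budget: int) -> dict:
    """Build the extra_body payload from a serialized processor and a budget."""
    return {
        "custom_logit_processor": processor_str,
        "custom_params": {"thinking_budget": thinking_budget},
    }


def split_reply(response):
    """Return (reasoning, answer) from a chat completion response."""
    message = response.choices[0].message
    # reasoning_content is assumed to be present only with a reasoning parser.
    return getattr(message, "reasoning_content", None), message.content


# Stand-in object that mimics the shape of a chat completion response.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(reasoning_content="Paris is the capital.", content="Yes.")
        )
    ]
)
reasoning, answer = split_reply(fake_response)
print(reasoning, "|", answer)
```

In a real request, `processor_str` would be the value of `Glm4MoeThinkingBudgetLogitProcessor().to_str()`, and `response` would be the object returned by `client.chat.completions.create(...)`.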