Instructions to use QuantTrio/GLM-5-AWQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use QuantTrio/GLM-5-AWQ with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="QuantTrio/GLM-5-AWQ")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("QuantTrio/GLM-5-AWQ")
model = AutoModelForCausalLM.from_pretrained("QuantTrio/GLM-5-AWQ")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use QuantTrio/GLM-5-AWQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "QuantTrio/GLM-5-AWQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/QuantTrio/GLM-5-AWQ

SGLang

How to use QuantTrio/GLM-5-AWQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "QuantTrio/GLM-5-AWQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "QuantTrio/GLM-5-AWQ" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use QuantTrio/GLM-5-AWQ with Docker Model Runner:
```
docker model run hf.co/QuantTrio/GLM-5-AWQ
```

GLM-5-AWQ vLLM Deployment Guide

by CharlesChen2023 - opened Mar 1


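This document records in detail how the GLM-5-AWQ model was deployed on vLLM, including the problems encountered and their solutions.

Model information

Model name: GLM-5-AWQ
Architecture: GlmMoeDsaForCausalLM (MoE, 744B parameters)
Quantization: AWQ (Activation-aware Weight Quantization)
Model size: 392GB (141 shards)
Number of experts: 256
Base model: zai-org/GLM-5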

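Hardware environment

| Component | Specification |
| --- | --- |
| GPU | NVIDIA H20 x8 (SM 9.0, 141GB each) |
| CPU | Multi-core processor |
| Memory | 512GB+ recommended (loading the model needs a lot of system memory) |
| Storage | 500GB+ SSD |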
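
As a rough sanity check on these numbers, the sketch below splits the 392GB of quantized weights evenly across the 8 GPUs and compares the result with the per-GPU budget implied by `--gpu-memory-utilization 0.95`, the value used in the launch commands in this guide. It assumes perfectly even sharding and ignores activations, communication buffers, and GB versus GiB differences, so it is only an estimate. The function names are illustrative and not part of vLLM.

```python
def weights_per_gpu_gb(model_size_gb: float, num_gpus: int) -> float:
    """Weights held by each GPU, assuming an even split across devices."""
    return model_size_gb / num_gpus


def memory_budget_gb(gpu_memory_gb: float, utilization: float) -> float:
    """Memory vLLM may claim on one GPU for a given utilization fraction."""
    return gpu_memory_gb * utilization


# Figures from this guide: 392 GB of weights, 8 GPUs with 141 GB each, utilization 0.95.
weights = weights_per_gpu_gb(392, 8)
budget = memory_budget_gb(141, 0.95)
headroom = budget - weights  # left for KV cache, activations and runtime overhead

print(f"weights per GPU: {weights:.2f} GB")
print(f"budget per GPU:  {budget:.2f} GB")
print(f"headroom:        {headroom:.2f} GB")
```

With these inputs the weights take roughly 49GB per GPU, which leaves on the order of 85GB per GPU for the KV cache and everything else.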

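Software environment

| Component | Version | Notes |
| --- | --- | --- |
| OS | Ubuntu 22.04 | |
| NVIDIA Driver | 570.158.01 | Supports CUDA 12.8 |
| CUDA Toolkit | 12.8 | Must be installed |
| cuda-compat-12-9 | 575.57.08 | Critical: resolves the PTX compatibility problem |
| Python | 3.10 | |
| PyTorch | 2.10.0+cu128 | Bundled with vLLM nightly |
| vLLM | 0.16.1rc1.dev117 | Nightly build |
| transformers | 5.3.0.dev0 | Required by GLM-5 |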

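Problems encountered and solutions

Problem 1: CUDA PTX version incompatibility

Error message:

torch.AcceleratorError: CUDA error: the provided PTX was compiled with an unsupported toolchain.
Search for `cudaErrorUnsupportedPtxVersion`

Root cause:

- The vLLM nightly wheel is built with CUDA 12.9.
- The installed NVIDIA driver (570.x) supports CUDA 12.8 at most.
- The driver does not recognize the newer PTX version.

Solution:

Install the NVIDIA CUDA Forward Compatibility package:

# Ubuntu
sudo apt-get update
sudo apt-get install cuda-compat-12-9

# Set the environment variable (critical!)
export LD_LIBRARY_PATH=/usr/local/cuda-12.9/compat:$LD_LIBRARY_PATH

How it works: the cuda-compat package ships newer user-mode driver libraries, which let an older installed driver run code built with a newer CUDA release.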
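
The decision behind this fix can be sketched in a few lines of Python: compare the CUDA version the wheel was built with against the highest version the driver supports, and prepend the compat directory to `LD_LIBRARY_PATH` when the build is newer. The dynamic loader reads `LD_LIBRARY_PATH` at process start, so the variable has to be in place before vLLM launches. The helper names below are hypothetical, and the versions are the ones quoted in this guide.

```python
import os


def parse_cuda_version(version: str) -> tuple[int, int]:
    """Turn a string such as '12.9' into a comparable (major, minor) tuple."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def needs_forward_compat(driver_max_cuda: str, build_cuda: str) -> bool:
    """True when the wheel was built with a newer CUDA than the driver supports."""
    return parse_cuda_version(build_cuda) > parse_cuda_version(driver_max_cuda)


def with_compat_dir(env: dict[str, str], compat_dir: str) -> dict[str, str]:
    """Return a copy of env with compat_dir at the front of LD_LIBRARY_PATH."""
    updated = dict(env)
    entries = [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]
    if compat_dir in entries:
        entries.remove(compat_dir)
    updated["LD_LIBRARY_PATH"] = ":".join([compat_dir, *entries])
    return updated


# Values from this guide: driver 570.x supports up to CUDA 12.8, the wheel uses 12.9.
if needs_forward_compat(driver_max_cuda="12.8", build_cuda="12.9"):
    env = with_compat_dir(dict(os.environ), "/usr/local/cuda-12.9/compat")
    print(env["LD_LIBRARY_PATH"])
```

The resulting `env` mapping could then be passed to `subprocess.run(..., env=env)` when starting the server from Python.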

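Problem 2: GlmMoeDsaForCausalLM architecture not supported

Error message:

ValueError: Model architectures ['GlmMoeDsaForCausalLM'] are not supported for now.

Cause: stable vLLM releases (such as 0.6.5) do not support GLM-5's MoE architecture.

Solution: use a vLLM nightly build (>=0.16.1rc1).

Problem 3: Worker process initialization failure

Error message:

Exception: WorkerProc initialization failed due to an exception in a background process.

Cause: CUDA Graph capture consumes a large amount of GPU memory on a large MoE model, which crashes initialization.

Solution: add the --enforce-eager flag to disable CUDA Graphs.

Problem 4: Distributed communication timeout

Cause: the model is huge (392GB), so the default timeout is too short.

Solution: extend the timeout through an environment variable:

export VLLM_DIST_TIMEOUT=1800  # 30 minutes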

Full installation steps

Step 1: Install CUDA Toolkit 12.8

# Ubuntu
sudo apt-get update
sudo apt-get install cuda-toolkit-12-8

# Create the symlink (-n replaces an existing /usr/local/cuda link instead of descending into it)
sudo ln -sfn /usr/local/cuda-12.8 /usr/local/cuda

Step 2: Install cuda-compat-12-9 (critical!)

sudo apt-get install cuda-compat-12-9

Step 3: Create the Python environment

conda create -n vllm-glm python=3.10 -y
conda activate vllm-glm

Step 4: Install vLLM nightly

pip install -U vllm --pre \
    --index-url https://pypi.org/simple \
    --extra-index-url https://wheels.vllm.ai/nightly

Step 5: Install the transformers development version

GLM-5 requires transformers >= 5.3.0:

pip install git+https://github.com/huggingface/transformers.git

Step 6: Install flashinfer-cubin (optional, performance optimization)

pip install flashinfer-cubin==0.6.4
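
After installation it can be useful to confirm that the environment meets the minimum versions named in this guide (vLLM 0.16.1rc1 or newer, transformers 5.3.0 or newer). The sketch below is one way to do that with the standard library. It compares only the numeric release part, so a development build such as 5.3.0.dev0 counts as 5.3.0, which matches the versions listed as working above. `MINIMUMS`, `release_tuple`, and `check_versions` are hypothetical names.

```python
import re
from importlib import metadata

# Minimum versions taken from this guide; only the numeric release part is compared.
MINIMUMS = {"vllm": (0, 16, 1), "transformers": (5, 3, 0)}


def release_tuple(version: str) -> tuple[int, ...]:
    """Extract the leading numeric release, e.g. '0.16.1rc1.dev117' -> (0, 16, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def check_versions(installed: dict[str, str]) -> list[str]:
    """Return a human-readable problem for every package below its minimum."""
    problems = []
    for name, minimum in MINIMUMS.items():
        found = installed.get(name)
        if found is None:
            problems.append(f"{name} is not installed")
        elif release_tuple(found) < minimum:
            wanted = ".".join(map(str, minimum))
            problems.append(f"{name} {found} is older than {wanted}")
    return problems


def installed_versions() -> dict[str, str]:
    """Look up installed versions of the packages listed in MINIMUMS."""
    found = {}
    for name in MINIMUMS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # reported later as "not installed"
    return found


for problem in check_versions(installed_versions()) or ["all minimum versions satisfied"]:
    print(problem)
```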

Launch commands

Basic launch command

# Set environment variables
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export LD_LIBRARY_PATH=/usr/local/cuda-12.9/compat:$LD_LIBRARY_PATH
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

# Start vLLM
vllm serve /path/to/GLM-5-AWQ \
    --served-model-name glm-5 \
    --swap-space 16 \
    --max-num-seqs 32 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --enforce-eager \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

Full launch script

#!/bin/bash
# GLM-5-AWQ vLLM launch script

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Critical: use cuda-compat-12-9 to support the newer PTX version
export LD_LIBRARY_PATH=/usr/local/cuda-12.9/compat:$LD_LIBRARY_PATH

# vLLM environment variables
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

vllm serve /path/to/GLM-5-AWQ \
    --served-model-name glm-5 \
    --swap-space 16 \
    --max-num-seqs 32 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --enforce-eager \
    --enable-auto-tool-choice \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

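Parameter reference

| Parameter | Description |
| --- | --- |
| `--tensor-parallel-size 8` | Tensor parallelism across 8 GPUs |
| `--enable-expert-parallel` | Enable expert parallelism (needed for this MoE model) |
| `--enforce-eager` | Disable CUDA Graphs to avoid running out of GPU memory |
| `--gpu-memory-utilization 0.95` | Fraction of GPU memory vLLM may use |
| `--max-model-len 32768` | Maximum sequence length |
| `--swap-space 16` | CPU swap space per GPU (GiB) |
| `--tool-call-parser glm47` | Tool-call parser |
| `--reasoning-parser glm45` | Reasoning parser |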
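
The same flags can also be assembled programmatically, which helps when the settings are reused across scripts. The sketch below is one possible helper. `build_serve_command` and its `options` mapping are hypothetical names, and only flags that already appear in this guide are used.

```python
import shlex


def build_serve_command(model_path: str, options: dict[str, object]) -> list[str]:
    """Build an argv list for `vllm serve` from a mapping of flag names to values.

    A value of True emits a bare flag, False or None skips the flag, and
    anything else is passed as the flag's argument.
    """
    argv = ["vllm", "serve", model_path]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)
        elif value is False or value is None:
            continue
        else:
            argv.extend([flag, str(value)])
    return argv


# Flags and values copied from the launch script in this guide.
options = {
    "served_model_name": "glm-5",
    "tensor_parallel_size": 8,
    "enable_expert_parallel": True,
    "enforce_eager": True,
    "gpu_memory_utilization": 0.95,
    "max_model_len": 32768,
    "swap_space": 16,
    "tool_call_parser": "glm47",
    "reasoning_parser": "glm45",
}
command = build_serve_command("/path/to/GLM-5-AWQ", options)
print(shlex.join(command))
```

The returned list can be handed to `subprocess.run(command)` directly, which avoids shell quoting problems.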

Verifying the deployment

Check service status

curl http://localhost:8000/health
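
Because a model of this size takes several minutes to start, a single health check right after launch will usually fail. The sketch below polls until the endpoint answers or a deadline passes. It assumes `/health` returns HTTP 200 once the engine is up; the function names and the default interval are illustrative choices, not values from vLLM.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True when a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    deadline_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe until it returns True or deadline_s seconds have elapsed."""
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)


# Example call (commented out so the snippet does not block when pasted):
# ready = wait_until_ready(lambda: http_ok("http://localhost:8000/health"), deadline_s=600)
# print("server ready" if ready else "server did not become ready in time")
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without waiting in real time.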

List models

curl http://localhost:8000/v1/models

Test a chat completion

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5",
    "messages": [{"role": "user", "content": "Hello, please introduce yourself"}],
    "max_tokens": 500
  }'

Python test code

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.chat.completions.create(
    model="glm-5",
    messages=[
        {"role": "user", "content": "Please introduce yourself"}
    ],
    max_tokens=500
)

print(response.choices[0].message.content)

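Performance reference

| Metric | Value |
| --- | --- |
| Model load time | ~3 minutes (392GB, 8x H20) |
| Engine initialization | ~50 seconds |
| First inference latency | ~2 seconds |
| Total service startup time | ~5 minutes |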

Troubleshooting

Problem: OOM (Out of Memory)

Solution:

- Lower --gpu-memory-utilization
- Reduce --max-model-len
- Reduce --max-num-seqs
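
One way to apply these three adjustments systematically is to generate a short list of progressively more conservative settings and retry with each in turn. The sketch below does that. The step sizes (0.05 less utilization, half the sequence length, half the batch size per step) and the lower bounds are arbitrary assumptions for illustration, not recommendations from vLLM or from this guide.

```python
def fallback_settings(utilization: float, max_model_len: int, max_num_seqs: int,
                      steps: int = 3) -> list[dict[str, float]]:
    """Return progressively more conservative settings to retry after an OOM."""
    candidates = []
    for _ in range(steps):
        # Assumed step sizes: minus 0.05 utilization, half the length, half the batch.
        utilization = round(max(utilization - 0.05, 0.5), 2)
        max_model_len = max(max_model_len // 2, 1024)
        max_num_seqs = max(max_num_seqs // 2, 1)
        candidates.append({
            "gpu_memory_utilization": utilization,
            "max_model_len": max_model_len,
            "max_num_seqs": max_num_seqs,
        })
    return candidates


# Starting point: the values used in this guide's launch command.
for candidate in fallback_settings(0.95, 32768, 32):
    print(candidate)
```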

Problem: NCCL communication errors

Solution:

- Check the NVLink connection status: nvidia-smi nvlink --status
- Set NCCL_DEBUG=INFO to see detailed logs
- Try setting NCCL_P2P_DISABLE=1 to disable P2P
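
When the server is launched from Python, the two NCCL variables above can be added to a copy of the environment rather than exported globally, so the diagnostics apply to a single run. A minimal sketch, with `nccl_debug_env` as a hypothetical helper name:

```python
import os


def nccl_debug_env(base_env: dict[str, str], disable_p2p: bool = False) -> dict[str, str]:
    """Return a copy of base_env with NCCL diagnostics switched on."""
    env = dict(base_env)
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"
    return env


env = nccl_debug_env(dict(os.environ), disable_p2p=True)
print({key: value for key, value in env.items() if key.startswith("NCCL_")})
```

The mapping can then be passed as `env=env` to `subprocess.run` together with the usual `vllm serve` arguments.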

Problem: model loading hangs

Solution:

- Check that system memory is sufficient (free -h)
- Check disk I/O (iostat -x 1)
- Increase VLLM_DIST_TIMEOUT

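Version compatibility matrix

| vLLM version | CUDA version | Minimum driver version | Notes |
| --- | --- | --- | --- |
| 0.16.1rc1 | 12.9 | 575.x | nightly, requires cuda-compat |
| 0.16.0 | 12.8 | 570.x | stable |
| 0.6.x | 12.4 | 550.x | does not support GLM-5 MoE |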
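
The matrix can be read as a lookup: find the row for the vLLM version, then compare the installed driver's major version with the row's minimum. The sketch below encodes exactly the three rows above. The prefix matching and the function name are assumptions made for illustration, and versions not listed in the matrix are reported as unknown rather than guessed.

```python
# Rows transcribed from the compatibility matrix in this guide:
# vLLM version prefix -> (CUDA version of the build, minimum driver major version).
COMPATIBILITY = {
    "0.16.1rc1": ("12.9", 575),
    "0.16.0": ("12.8", 570),
    "0.6.": ("12.4", 550),
}


def driver_status(vllm_version: str, driver_version: str) -> str:
    """Classify a vLLM/driver pair using the matrix rows above."""
    driver_major = int(driver_version.split(".")[0])
    for prefix, (cuda, min_driver) in COMPATIBILITY.items():
        if vllm_version.startswith(prefix):
            if driver_major >= min_driver:
                return "ok"
            return f"needs cuda-compat for CUDA {cuda}"
    return "unknown"


# The combination used in this guide: nightly vLLM on a 570.x driver.
print(driver_status("0.16.1rc1.dev117", "570.158.01"))
```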

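Changelog

2026-03-01: Initial version, GLM-5-AWQ deployed successfully
- Resolved the cuda-compat PTX compatibility problem
- Resolved the Worker process initialization failure
- Verified API availability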

Final script

#!/bin/bash
# GLM-5-AWQ vLLM launch script
# Model: /global_data/mllm/clw/GLM-5-AWQ
# vllm: 0.16.1rc1.dev117 (nightly)
# CUDA: 12.8
# Reference: https://huggingface.co/QuantTrio/GLM-5-AWQ

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Critical: use cuda-compat-12-9 to support the newer PTX version
export LD_LIBRARY_PATH=/usr/local/cuda-12.9/compat:$LD_LIBRARY_PATH

# vllm environment variables (officially recommended)
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

# Work around the distributed communication timeout
export VLLM_DIST_TIMEOUT=1800

source /root/miniconda3/etc/profile.d/conda.sh && conda activate vllm-glm-fp4 && vllm serve \
    /global_data/mllm/clw/GLM-5-AWQ \
    --served-model-name glm-5 \
    --swap-space 16 \
    --max-num-seqs 32 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --enforce-eager \
    --enable-auto-tool-choice \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

tclf90

QuantTrio org, Mar 2 (edited Mar 2)

Please use CUDA 12.8 or 13.0.

oliverjohnwilson

Mar 7

Do you have any guidance on how to get this model served on sm121?

Hardware

4x NVIDIA DGX Spark Founders Edition (GB10 Grace Blackwell)
1 GPU per node, total 4 GPUs across 4 nodes
Interconnect: ConnectX-7 RoCE, switched fabric at 200G, MTU 4200 (RoCE active_mtu 4096)

Software versions (container)

vLLM: 0.17.0rc1.dev125+gc188749bc.d20260306
transformers: 5.3.0
flashinfer-python: 0.6.5
Base container: nvcr.io/nvidia/pytorch:26.01-py3

Serving command:

# vLLM
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

vllm serve "$MODEL" \
    --served-model-name "QuantTrio/GLM-5-AWQ" \
    --distributed-executor-backend ray \
    --enforce-eager \
    --swap-space 16 \
    --max-num-seqs 32 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

Error:

ValueError: No valid attention backend found for CUDA

AttentionSelectorConfig:
  head_size: 576
  dtype: torch.bfloat16
  kv_cache_dtype: auto
  block_size: None
  use_mla: true
  has_sink: false
  use_sparse: true
  use_mm_prefix: false
  use_per_head_quant_scales: false
  attn_type: AttentionType.DECODER

Backends tried (and why each failed):

  1) FLASH_ATTN_MLA
     - sparse not supported
     - compute capability not supported
     - FlashAttention MLA not supported on this device

  2) FLASHMLA
     - sparse not supported
     - compute capability not supported
     - vllm._flashmla_C is not available
       (likely not compiled due to insufficient NVCC version
        or target GPU arch not included in compilation)

  3) FLASHINFER_MLA
     - sparse not supported
     - compute capability not supported
     - FlashInfer MLA kernel constraint violated:
       requires qk_nope_head_dim == 128, but got 192

  4) TRITON_MLA
     - sparse not supported

  5) FLASHMLA_SPARSE
     - compute capability not supported

Cluster note:
  - repeated 3x across cluster
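
Reading the listing above, FLASHMLA_SPARSE is the only backend that is not rejected for sparse attention; it fails solely on compute capability. The sketch below shows one way to extract that from the log text. It assumes the input contains only the numbered backend listing in the format shown above, and the parser and its function name are illustrative, not part of vLLM. Only the last two entries are reproduced here to keep the example short.

```python
import re


def parse_backend_failures(log: str) -> dict[str, list[str]]:
    """Map each backend named in a 'Backends tried' listing to its failure reasons."""
    failures: dict[str, list[str]] = {}
    current = None
    for raw in log.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"\d+\)\s+(\w+)", line)
        if header:
            current = header.group(1)
            failures[current] = []
        elif current is not None and line.startswith("- "):
            failures[current].append(line[2:])
        elif current is not None and line and failures[current]:
            # Wrapped text continues the previous reason.
            failures[current][-1] += " " + line
    return failures


log = """
  4) TRITON_MLA
     - sparse not supported

  5) FLASHMLA_SPARSE
     - compute capability not supported
"""
failures = parse_backend_failures(log)
sparse_capable = [name for name, reasons in failures.items()
                  if "sparse not supported" not in reasons]
print(sparse_capable)
```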
