Instructions for using QuantTrio/GLM-4.5V-AWQ with libraries and local apps.

Libraries

How to use QuantTrio/GLM-4.5V-AWQ with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="QuantTrio/GLM-4.5V-AWQ")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("QuantTrio/GLM-4.5V-AWQ")
model = AutoModelForMultimodalLM.from_pretrained("QuantTrio/GLM-4.5V-AWQ")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use QuantTrio/GLM-4.5V-AWQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "QuantTrio/GLM-4.5V-AWQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-4.5V-AWQ",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
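
The same request can also be built from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of vLLM or SGLang, and the payload simply mirrors the curl body above.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, image_url):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the parsed JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send_chat_request(build_chat_request("http://localhost:8000", "QuantTrio/GLM-4.5V-AWQ", "Describe this image in one sentence.", image_url))` should return the parsed JSON reply. For the SGLang examples, the base URL would be `http://localhost:30000` instead.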

Use Docker

docker model run hf.co/QuantTrio/GLM-4.5V-AWQ

SGLang

How to use QuantTrio/GLM-4.5V-AWQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "QuantTrio/GLM-4.5V-AWQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-4.5V-AWQ",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "QuantTrio/GLM-4.5V-AWQ" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-4.5V-AWQ",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use QuantTrio/GLM-4.5V-AWQ with Docker Model Runner:
```
docker model run hf.co/QuantTrio/GLM-4.5V-AWQ
```

RuntimeError: operator _C::marlin_qqq_gemm does not exist

by sunnykaibai - opened Aug 23, 2025

Discussion

sunnykaibai

Aug 23, 2025

I followed the guide to set up the environment:

Patched vLLM (see: https://github.com/vllm-project/vllm/pull/22716)

git clone -b glm-45 https://github.com/zRzRzRzRzRzRzR/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install .

Install preview build of Transformers with GLM-4.5V support

pip install transformers-v4.55.0-GLM-4.5V-preview

But I still get this error:

INFO 08-23 17:18:09 [__init__.py:241] Automatically detected platform cuda.
Traceback (most recent call last):
  File "/mnt/workspace/zichen.shx/infer/infer_v6_cmd_glm.py", line 25, in <module>
    from vllm import LLM, SamplingParams
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/vllm/__init__.py", line 64, in __getattr__
    module = import_module(module_name, package)
  File "/opt/conda/envs/glmfp4/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 24, in <module>
    from vllm.engine.llm_engine import LLMEngine
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 28, in <module>
    from vllm.engine.output_processor.util import create_output_by_sequence_group
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/vllm/engine/output_processor/util.py", line 8, in <module>
    from vllm.model_executor.layers.sampler import SamplerOutput
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 16, in <module>
    from vllm.model_executor.layers.utils import apply_penalties
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/vllm/model_executor/layers/utils.py", line 8, in <module>
    from vllm import _custom_ops as ops
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/vllm/_custom_ops.py", line 472, in <module>
    def _marlin_qqq_gemm_fake(a: torch.Tensor, b_q_weight: torch.Tensor,
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/torch/library.py", line 1023, in register
    use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/torch/library.py", line 214, in _register_fake
    handle = entry.fake_impl.register(func_to_register, source)
  File "/opt/conda/envs/glmfp4/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator _C::marlin_qqq_gemm does not exist

JunHowie

QuantTrio org Aug 25, 2025

After installing vLLM, simply downgrade Transformers to the preview build.
The commands you need are:

pip install -U vllm
pip install transformers-v4.55.0-GLM-4.5V-preview

Check dependencies with:

pip list
pip show vllm transformers
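
The same check can be done from inside the Python environment that runs the model. This is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_versions` is hypothetical, and querying the preview build under its own distribution name is an assumption, since it may not be registered as plain `transformers`.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumption: the preview build may be registered under its own distribution
# name rather than "transformers", so both names are queried.
PACKAGES = ["vllm", "torch", "transformers", "transformers-v4.55.0-GLM-4.5V-preview"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Running it with the interpreter from the affected environment (here, the `glmfp4` conda env) avoids reading versions from a different `pip` than the one Python actually imports from.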

It is also recommended to clear the pip cache:

pip cache purge

To launch this model with vLLM, use:

vllm serve \
    QuantTrio/GLM-4.5V-AWQ \
    --served-model-name GLM-4.5V-AWQ \
    --enable-expert-parallel \
    --tensor-parallel-size 4   # replace 4 with the actual number of GPUs
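
If the GPU count varies between machines, the command can be assembled programmatically. This is only a sketch: `build_serve_command` is a hypothetical helper, and it uses exactly the flags from the command above and nothing else.

```python
import shlex


def build_serve_command(num_gpus, model="QuantTrio/GLM-4.5V-AWQ", served_name="GLM-4.5V-AWQ"):
    """Return the `vllm serve` argument list shown above for a given GPU count."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "vllm", "serve", model,
        "--served-model-name", served_name,
        "--enable-expert-parallel",
        "--tensor-parallel-size", str(num_gpus),
    ]


# Print the command for a 4-GPU machine; pass the list to subprocess.run to launch it.
print(shlex.join(build_serve_command(4)))
```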

tclf90

QuantTrio org Aug 25, 2025

Quoting sunnykaibai:

I followed the guide to set up the environment:

Patched vLLM (see: https://github.com/vllm-project/vllm/pull/22716)

git clone -b glm-45 https://github.com/zRzRzRzRzRzRzR/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install .

Install preview build of Transformers with GLM-4.5V support

pip install transformers-v4.55.0-GLM-4.5V-preview

But I still get this error:

RuntimeError: operator _C::marlin_qqq_gemm does not exist

Thank you for your feedback. This error occurs frequently with recent nightly builds.
As of this post, installing the official vLLM release is enough:

pip install vllm==0.10.1.1
pip install transformers-v4.55.0-GLM-4.5V-preview

I have updated the readme file accordingly.

sunnykaibai

Aug 25, 2025

thank you for your answer, it does work!
