Instructions to use QuantTrio/GLM-5-AWQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use QuantTrio/GLM-5-AWQ with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="QuantTrio/GLM-5-AWQ")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("QuantTrio/GLM-5-AWQ")
model = AutoModelForCausalLM.from_pretrained("QuantTrio/GLM-5-AWQ")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use QuantTrio/GLM-5-AWQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "QuantTrio/GLM-5-AWQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/QuantTrio/GLM-5-AWQ

SGLang

How to use QuantTrio/GLM-5-AWQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "QuantTrio/GLM-5-AWQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "QuantTrio/GLM-5-AWQ" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-5-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use QuantTrio/GLM-5-AWQ with Docker Model Runner:
```
docker model run hf.co/QuantTrio/GLM-5-AWQ
```

[Request] Great work! Do you have plans to also create GLM-5.1-AWQ?

by ag1988 - opened Apr 7

Discussion

ag1988

Apr 7

GLM-5.1 has been released https://huggingface.co/zai-org/GLM-5.1. Are you planning on also creating a AWQ version of this?

tclf90

QuantTrio org Apr 8

downloading it 🥹

ag1988

Apr 9

Hey, sorry for the naive question but what caliberation data did you use - the default data is from pileeval as shown in the AutoAWQ library. Given that this model has a chat template did you any chat data (e.g. smoltalk) or did you just use the default AutoAWQ settings. This info would be immensely helpful.

Default caliberation data used in AutoAWQ: https://github.com/casper-hansen/AutoAWQ/blob/88e4c76b20755db275574e6a03c83c84ba3bece5/awq/models/base.py#L150

JunHowie

QuantTrio org Apr 10

Hey, sorry for the naive question but what caliberation data did you use - the default data is from pileeval as shown in the AutoAWQ library. Given that this model has a chat template did you any chat data (e.g. smoltalk) or did you just use the default AutoAWQ settings. This info would be immensely helpful.

Default caliberation data used in AutoAWQ: https://github.com/casper-hansen/AutoAWQ/blob/88e4c76b20755db275574e6a03c83c84ba3bece5/awq/models/base.py#L150

Reference readme, using data-free quantization (no calibration dataset required).

ag1988

Apr 10

Thanks for the clarification 👍

BM-TNG

Apr 13

I was hoping to find a QuantTrio GLM-5.1-AWQ quantization as we were quite happy with your GLM-5-AWQ variant.
Today I realized that someone else was quicker: https://huggingface.co/cyankiwi/GLM-5.1-AWQ-4bit
Would be nice to know if you plan with similar settings or a different configuration.

ag1988

Apr 13

This is also why I've been waiting for you guys to release 5.1 version as I am happy with your glm-5 version. Yesterday, I started to put together some code to do it myself based on llm-compressor as I didnt hear back from you guys but later thought maybe its better to wait as you have more experience.

JunHowie

QuantTrio org Apr 17

Soon to be released

ag1988

Apr 22

Sounds great - looking forward to the release.

JunHowie

QuantTrio org Apr 22

https://huggingface.co/QuantTrio/GLM-5.1-AWQ

ag1988

Apr 22

Really appreciate this - thank you so much!!

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment