Instructions to use QuantTrio/GLM-4.5-Air-AWQ-FP16Mix with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use QuantTrio/GLM-4.5-Air-AWQ-FP16Mix with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="QuantTrio/GLM-4.5-Air-AWQ-FP16Mix")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("QuantTrio/GLM-4.5-Air-AWQ-FP16Mix")
model = AutoModelForCausalLM.from_pretrained("QuantTrio/GLM-4.5-Air-AWQ-FP16Mix")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use QuantTrio/GLM-4.5-Air-AWQ-FP16Mix with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "QuantTrio/GLM-4.5-Air-AWQ-FP16Mix"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-4.5-Air-AWQ-FP16Mix",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
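
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes the vLLM server started above is listening on `http://localhost:8000`, and the helper names `build_chat_request` and `chat` are hypothetical, chosen here for illustration.

```python
import json
import urllib.request


def build_chat_request(prompt, model="QuantTrio/GLM-4.5-Air-AWQ-FP16Mix"):
    """Build the JSON body for an OpenAI-compatible chat completion call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt, base_url="http://localhost:8000/v1", timeout=60):
    """POST the prompt to a running server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content
    return payload["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` returns the model's reply as a string. Passing a different `base_url` (for example port 30000) should work for any other OpenAI-compatible endpoint.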

Use Docker

docker model run hf.co/QuantTrio/GLM-4.5-Air-AWQ-FP16Mix

SGLang

How to use QuantTrio/GLM-4.5-Air-AWQ-FP16Mix with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "QuantTrio/GLM-4.5-Air-AWQ-FP16Mix" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-4.5-Air-AWQ-FP16Mix",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "QuantTrio/GLM-4.5-Air-AWQ-FP16Mix" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/GLM-4.5-Air-AWQ-FP16Mix",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use QuantTrio/GLM-4.5-Air-AWQ-FP16Mix with Docker Model Runner:
```
docker model run hf.co/QuantTrio/GLM-4.5-Air-AWQ-FP16Mix
```

request for fp4 quants

by hareram241 - opened Aug 29, 2025

Discussion

hareram241

Aug 29, 2025

Hi, is it possible to get FP4 quants for this model? Thanks.

hareram241

Aug 29, 2025

import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from modelopt.torch.utils.dataset_utils import get_dataset_dataloader
from transformers import AutoTokenizer

# Model classes imported from local modules
from configuration_glm4_moe import Glm4MoeConfig
from modeling_glm4_moe import Glm4MoeForCausalLM

model_dir = "downloaded_models/GLM-4.5-Air"

model_config = Glm4MoeConfig.from_pretrained(model_dir)
model = Glm4MoeForCausalLM.from_pretrained(
    model_dir,
    config=model_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

print(model.hf_device_map)

# Select the quantization config, here NVFP4
quant_config = mtq.NVFP4_DEFAULT_CFG
tokenizer = AutoTokenizer.from_pretrained(model_dir)
batch_size = 1
num_samples = 64

calib_dataset = get_dataset_dataloader(
    dataset_name="cnn_dailymail",
    tokenizer=tokenizer,
    batch_size=batch_size,
    num_samples=num_samples,
)

def forward_loop(model):
    # Run the calibration samples through the model
    for data in calib_dataset:
        model(data["input_ids"])

# PTQ with in-place replacement to quantized modules
model = mtq.quantize(model, quant_config, forward_loop)

mtq.print_quant_summary(model)

export_dir = "downloaded_models/GLM-4.5-Air-nvfp4-1"
with torch.inference_mode():
    export_hf_checkpoint(
        model,  # The quantized model.
        export_dir=export_dir,  # The directory where the exported files will be stored.
    )
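
For readers unfamiliar with what an FP4 quant does to the weights, here is a rough illustration of block-scaled 4-bit rounding. It is a simplified sketch, not the ModelOpt implementation: it assumes an E2M1 value grid, a block size of 16, and an ordinary Python float as the per-block scale, whereas real NVFP4 kernels store the scales in a lower-precision format. The function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit)
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Fake-quantize one block: scale into the E2M1 range, round, scale back."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / E2M1_VALUES[-1]  # maps the largest magnitude to 6.0
    out = []
    for x in block:
        # Pick the representable magnitude closest to the scaled value
        nearest = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        out.append(math.copysign(nearest * scale, x))
    return out


def fake_quantize_fp4(values, block_size=16):
    """Apply quantize_block to consecutive blocks (block size is an assumption)."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_block(values[start:start + block_size]))
    return out
```

Each block keeps its largest magnitude exactly and snaps every other value to one of eight levels, which is why a calibration pass like `forward_loop` above is used to choose scales that fit the data.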
