Pending GPU & vLLM validation

#1
by nwzjk - opened

I'm working with a server equipped with 8×48 GB GPUs, and I'd like to know whether it can successfully load the model weights and run inference.

Varjosoft Oy org

The BF16 model needs ~1,510 GB of VRAM, so 8×48 GB (384 GB) isn't enough. Options:

  • Unsloth 2-bit GGUF (236 GB): fits, runs via llama.cpp. See unsloth/GLM-5.1-GGUF
  • TQ3 checkpoint (309 GB): doesn't quite fit in 384 GB either. Available at varjosoft/GLM-5.1-Open-TQ3, but needs ~400+ GB VRAM
  • You'd need 8×80 GB (A100/H100) or 4×141 GB (H200) for the TQ3 version

Note: the TQ3 checkpoint was created using code validated on GLM-4.7-Flash (same architecture), but we haven't run inference on GLM-5.1 itself yet; we're waiting on multi-GPU availability for testing. The checkpoint and model card will be updated with results once tested.
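For anyone sanity-checking these numbers: the sizes above are consistent with simple bits-per-parameter arithmetic. A minimal sketch, assuming a ~755B parameter count (back-derived from the 1,510 GB BF16 figure, not an official spec):

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB for a given precision."""
    return n_params * bits_per_param / 8 / 1e9

N = 755e9  # assumed parameter count so that BF16 comes out to ~1,510 GB

bf16_gb = weight_gb(N, 16)    # 1510.0 GB -> far above 8x48 GB = 384 GB
q2_gb   = weight_gb(N, 2.5)   # ~236 GB (2-bit GGUF, with per-block overhead)
tq3_gb  = weight_gb(N, 3.3)   # ~311 GB, close to the 309 GB TQ3 checkpoint
```

Note that these cover weights only; a serving engine also needs room for KV cache and runtime overhead on top.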

@varjoranta can you elaborate on why you think 400 GB of VRAM is needed?

Varjosoft Oy org

The 400+ GB estimate was conservative and assumed standard vLLM loading, which materializes the full BF16 model in VRAM during initialization (309 GB + KV cache + CUDA overhead).

We're actively working on a --quantization turboquant flag that uses meta-device initialization: the model allocates zero GPU memory at init, then loads and compresses weights one layer at a time.
With this approach, the compressed model should fit in ~75-80 GB total VRAM, which would work on 2×H200 (282 GB) or potentially 8×48 GB (384 GB) with TP=8.
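The layer-at-a-time idea can be illustrated with a toy sketch in plain Python. This is not the vLLM integration (which would use torch meta tensors, and the flag above is unreleased); it just shows why peak memory drops to roughly one full-precision layer plus the already-compressed layers:

```python
def stream_quantize(layer_weights, quantize):
    """Load one layer at a time, compress it, and free the full-precision
    copy before touching the next layer. Returns the compressed layers
    and the peak number of "resident" elements, as a memory proxy."""
    compressed = []
    resident = 0  # elements currently held in memory
    peak = 0
    for full in layer_weights:      # e.g. streamed from disk, one layer
        resident += len(full)       # full-precision layer materialized
        peak = max(peak, resident)
        small = quantize(full)      # compress this layer
        compressed.append(small)
        resident -= len(full)       # drop the full-precision copy
        resident += len(small)      # keep only the compressed copy
    return compressed, peak

# Stand-in "quantizer" keeping 1/4 of the elements (like 4-bit vs 16-bit)
layers = [list(range(100)) for _ in range(4)]
comp, peak = stream_quantize(layers, lambda w: w[::4])
# peak is 175 elements vs 400 for loading all four layers at once
```

Loading everything first and quantizing afterwards would need memory for all layers at full precision at once, which is exactly the ~400 GB scenario described above.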

This is still being validated on GPU; we'll update the model card with confirmed results and serving instructions once tested. Stay tuned.
