dual 3090 inference
I'm getting about 12 t/s of inference without Flash Speculative Decoding and only 1 t/s with it enabled, following the installation instructions on the Unsloth page (no fp8 KV cache). Is that expected?
Did you use the same commands as in the guide? Can you try:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
--served-model-name unsloth/GLM-4.7-Flash \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--dtype bfloat16 \
--seed 3407 \
--port 8000
or try 2 GPUs with tensor parallelism via:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0,1' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
--served-model-name unsloth/GLM-4.7-Flash \
--tensor-parallel-size 2 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--dtype bfloat16 \
--seed 3407 \
--port 8000
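To compare the two launches, it may help to actually time a completion against the OpenAI-compatible endpoint rather than eyeballing the console. A minimal sketch (the base URL and served model name follow the serve commands above; `bench` and `tokens_per_second` are illustrative helpers, not part of vLLM):

```python
import json
import time
import urllib.request

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens/second; guards against a zero duration."""
    return n_tokens / elapsed_s if elapsed_s > 0 else 0.0

def bench(prompt: str, base_url: str = "http://localhost:8000/v1") -> float:
    """Time one non-streaming completion and report decode throughput,
    using the completion_tokens count from the response's usage block."""
    body = json.dumps({
        "model": "unsloth/GLM-4.7-Flash",
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": 0.0,
    }).encode()
    req = urllib.request.Request(
        base_url + "/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(data["usage"]["completion_tokens"], elapsed)
```

Calling `bench` once with a short prompt and once with a ~40k-token prompt should show whether the slowdown is in decode or somewhere else. Note this times the whole request, so prefill is included in the measurement.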
Exactly the same, but with --gpu-memory-utilization .9 --max-num-seqs 1 --max-model-len 80000 added.
I noticed this only happens at around 40k context. A fresh prompt generates around 70 t/s. It looks like generation slows down drastically as the context fills?
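For what it's worth, some per-token slowdown with context is expected: each decoded token re-reads the whole KV cache, so attention cost grows roughly linearly with context length rather than exponentially. A rough sketch of that scaling, using placeholder dimensions (`n_layers` and `kv_dim` are illustrative values, not the real GLM-4.7-Flash config, which uses a compressed MLA latent per token per layer):

```python
def kv_cache_bytes(context_len: int, n_layers: int = 47,
                   kv_dim: int = 576, dtype_bytes: int = 2) -> int:
    """Rough KV-cache footprint for one sequence: bytes that must be
    read from VRAM for every decoded token. Dimensions are placeholders."""
    return context_len * n_layers * kv_dim * dtype_bytes

# Each decode step reads the entire cache, so the per-token memory
# traffic at 40k context is ~40x that at 1k context.
for ctx in (1_000, 10_000, 40_000):
    mb = kv_cache_bytes(ctx) / 1e6
    print(f"{ctx:>6} tokens -> ~{mb:.0f} MB read per decoded token")
```

A 70 t/s to 12 t/s drop is steeper than this linear model alone would predict, though, which is why a backend issue (the Triton MLA path in the logs below) still seems plausible.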
Here are more attention details:
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:25 [gpu_model_runner.py:4021] Starting to load model unsloth/GLM-4.7-Flash-FP8-Dynamic...
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [cuda.py:364] Using TRITON_MLA attention backend out of potential backends: ('TRITON_MLA',)
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [mla_attention.py:1399] Using FlashAttention prefill for MLA
(Worker_TP0_EP0 pid=37627) WARNING 01-26 16:57:26 [compressed_tensors.py:766] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(Worker_TP1_EP1 pid=37628) WARNING 01-26 16:57:26 [compressed_tensors.py:766] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [layer.py:475] [EP Rank 0/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 32/64. Experts local to global index map: 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7, 8->8, 9->9, 10->10, 11->11, 12->12, 13->13, 14->14, 15->15, 16->16, 17->17, 18->18, 19->19, 20->20, 21->21, 22->22, 23->23, 24->24, 25->25, 26->26, 27->27, 28->28, 29->29, 30->30, 31->31.
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [unquantized.py:82] FlashInfer CUTLASS MoE is available for EP but not enabled, consider setting VLLM_USE_FLASHINFER_MOE_FP16=1 to enable it.
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [unquantized.py:103] Using TRITON backend for Unquantized MoE
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [fp8.py:329] Using MARLIN Fp8 MoE backend out of potential backends: ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'BATCHED_DEEPGEMM', 'VLLM_CUTLASS', 'BATCHED_VLLM_CUTLASS', 'TRITON', 'BATCHED_TRITON', 'MARLIN'].
(Worker_TP1_EP1 pid=37628) INFO 01-26 16:57:26 [layer.py:475] [EP Rank 1/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 32/64. Experts local to global index map: 0->32, 1->33, 2->34, 3->35, 4->36, 5->37, 6->38, 7->39, 8->40, 9->41, 10->42, 11->43, 12->44, 13->45, 14->46, 15->47, 16->48, 17->49, 18->50, 19->51, 20->52, 21->53, 22->54, 23->55, 24->56, 25->57, 26->58, 27->59, 28->60, 29->61, 30->62, 31->63.
Oh, can you try setting VLLM_USE_FLASHINFER_MOE_FP16=1 maybe? Hmm, interesting; it might be that vLLM hasn't optimized GLM Flash that much yet?
Just tried it. Same result. I think you are right. In the meantime, I think I'll just use llama.cpp.