dual 3090 inference
Following the installation instructions on the Unsloth page (no FP8 KV cache), I'm getting about 12 t/s without Flash Speculative Decoding and about 1 t/s with it. Is that expected?
Did you use the same commands as in the guide? Can you try:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
--served-model-name unsloth/GLM-4.7-Flash \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--dtype bfloat16 \
--seed 3407 \
--port 8000
or try 2 GPUs via:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0,1' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
--served-model-name unsloth/GLM-4.7-Flash \
--tensor-parallel-size 2 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--dtype bfloat16 \
--seed 3407 \
--port 8000
Exactly the same, but I added --gpu-memory-utilization .9 --max-num-seqs 1 --max-model-len 80000.
I noticed this only happens at around 40k tokens of context. A fresh prompt generates at around 70 t/s. It looks like generation slows down sharply as the context fills?
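One way to check how throughput changes with context length is to time requests of increasing prompt size against the running server. The sketch below is only an illustration: it assumes the server from the commands above is listening on port 8000 under the served name unsloth/GLM-4.7-Flash, and that the response carries the usual OpenAI-style `usage` field. The filler prompt, the word counts, and the helper names are made up for the example. The elapsed time includes prefill, so at long context the figure understates pure decode speed.

```python
import json
import time
import urllib.request

# Assumptions taken from the serve commands in this thread.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "unsloth/GLM-4.7-Flash"  # matches --served-model-name


def build_payload(prompt_words, max_tokens=128):
    """Build a chat request whose prompt holds `prompt_words` filler words."""
    filler = " ".join(["lorem"] * prompt_words)
    content = filler + "\n\nSummarize the text above in one sentence."
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def tokens_per_second(completion_tokens, elapsed_seconds):
    """End-to-end generation rate; includes prefill time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure(prompt_words):
    """Send one request and return (prompt_tokens, tokens per second)."""
    data = json.dumps(build_payload(prompt_words)).encode("utf-8")
    request = urllib.request.Request(
        BASE_URL, data=data, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    usage = body["usage"]
    return usage["prompt_tokens"], tokens_per_second(
        usage["completion_tokens"], elapsed
    )


if __name__ == "__main__":
    # Illustrative sizes; adjust so the reported prompt token counts
    # bracket the context length where the slowdown appears.
    for words in (100, 5_000, 20_000):
        prompt_tokens, rate = measure(words)
        print(f"{prompt_tokens} prompt tokens: {rate:.1f} t/s")
```

Comparing the printed rates across prompt sizes shows whether the drop is gradual or appears suddenly past a certain context length.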
Here are more attention details:
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:25 [gpu_model_runner.py:4021] Starting to load model unsloth/GLM-4.7-Flash-FP8-Dynamic...
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [cuda.py:364] Using TRITON_MLA attention backend out of potential backends: ('TRITON_MLA',)
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [mla_attention.py:1399] Using FlashAttention prefill for MLA
(Worker_TP0_EP0 pid=37627) WARNING 01-26 16:57:26 [compressed_tensors.py:766] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(Worker_TP1_EP1 pid=37628) WARNING 01-26 16:57:26 [compressed_tensors.py:766] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [layer.py:475] [EP Rank 0/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 32/64. Experts local to global index map: 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7, 8->8, 9->9, 10->10, 11->11, 12->12, 13->13, 14->14, 15->15, 16->16, 17->17, 18->18, 19->19, 20->20, 21->21, 22->22, 23->23, 24->24, 25->25, 26->26, 27->27, 28->28, 29->29, 30->30, 31->31.
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [unquantized.py:82] FlashInfer CUTLASS MoE is available for EP but not enabled, consider setting VLLM_USE_FLASHINFER_MOE_FP16=1 to enable it.
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [unquantized.py:103] Using TRITON backend for Unquantized MoE
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [fp8.py:329] Using MARLIN Fp8 MoE backend out of potential backends: ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'BATCHED_DEEPGEMM', 'VLLM_CUTLASS', 'BATCHED_VLLM_CUTLASS', 'TRITON', 'BATCHED_TRITON', 'MARLIN'].
(Worker_TP1_EP1 pid=37628) INFO 01-26 16:57:26 [layer.py:475] [EP Rank 1/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 32/64. Experts local to global index map: 0->32, 1->33, 2->34, 3->35, 4->36, 5->37, 6->38, 7->39, 8->40, 9->41, 10->42, 11->43, 12->44, 13->45, 14->46, 15->47, 16->48, 17->49, 18->50, 19->51, 20->52, 21->53, 22->54, 23->55, 24->56, 25->57, 26->58, 27->59, 28->60, 29->61, 30->62, 31->63.
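For readers unfamiliar with the "linear" expert placement reported in the log, the mapping it prints can be reproduced with a few lines. This is a sketch that only mirrors the numbers shown above (64 experts over 2 ranks, 32 per rank); the function name is hypothetical and it is not taken from vLLM's code, which may handle uneven splits differently.

```python
def linear_expert_map(num_experts, ep_size, ep_rank):
    """Map local expert index -> global expert index for one EP rank.

    Assumes the experts divide evenly across ranks, as in the log
    (64 experts, 2 ranks).
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank out of range")
    local_count = num_experts // ep_size
    start = ep_rank * local_count  # each rank owns one contiguous block
    return {local: start + local for local in range(local_count)}


if __name__ == "__main__":
    for rank in range(2):
        mapping = linear_expert_map(64, 2, rank)
        print(f"EP rank {rank}: 0->{mapping[0]}, 31->{mapping[31]}")
```

With expert parallelism each GPU holds half of the experts, so a token routed to an expert on the other rank has to cross the GPU interconnect.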
Oh, can you try setting VLLM_USE_FLASHINFER_MOE_FP16=1? Interesting, it might be that vLLM hasn't optimized GLM Flash much yet.
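If it helps to keep the settings in one place, the environment variables and flags mentioned in this thread can be passed from a small launcher script. This is a minimal sketch, assuming `vllm` is on the PATH; the helper name `build_env` is hypothetical, and every flag and variable is copied from the posts above. Note that the log hint for this variable comes from the unquantized MoE path (unquantized.py), while the FP8 experts report the MARLIN backend, so how much it affects this model is uncertain.

```python
import os
import subprocess

# Flags copied from the 2-GPU command earlier in the thread.
SERVE_CMD = [
    "vllm", "serve", "unsloth/GLM-4.7-Flash-FP8-Dynamic",
    "--served-model-name", "unsloth/GLM-4.7-Flash",
    "--tensor-parallel-size", "2",
    "--tool-call-parser", "glm47",
    "--reasoning-parser", "glm45",
    "--enable-auto-tool-choice",
    "--dtype", "bfloat16",
    "--seed", "3407",
    "--port", "8000",
]


def build_env(base=None):
    """Return a copy of the environment with the thread's settings applied."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:False"
    env["CUDA_VISIBLE_DEVICES"] = "0,1"
    # Suggested in the vLLM log output for the unquantized MoE path.
    env["VLLM_USE_FLASHINFER_MOE_FP16"] = "1"
    return env


if __name__ == "__main__":
    # Blocks until the server exits; raises if vllm returns an error.
    subprocess.run(SERVE_CMD, env=build_env(), check=True)
```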
Just tried it. Same result. I think you are right. In the meantime, I think I'll just use llama.cpp.