dual 3090 inference
I'm getting about 12 t/s of inference without Flash Speculative Decoding and only 1 t/s with it enabled, following the installation instructions on the Unsloth page (no fp8 KV cache). Is that expected?
Did you use the same commands as in the guide? Can you try:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
--served-model-name unsloth/GLM-4.7-Flash \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--dtype bfloat16 \
--seed 3407 \
--port 8000
or try 2 GPUs with tensor parallelism via:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0,1' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
--served-model-name unsloth/GLM-4.7-Flash \
--tensor-parallel-size 2 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--dtype bfloat16 \
--seed 3407 \
--port 8000
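To compare the two launches, it may help to actually time a completion against the OpenAI-compatible endpoint rather than eyeballing the console. A minimal sketch (the base URL and served model name follow the serve commands above; `bench` and `tokens_per_second` are illustrative helpers, not part of vLLM):

```python
import json
import time
import urllib.request

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens/second; guards against a zero duration."""
    return n_tokens / elapsed_s if elapsed_s > 0 else 0.0

def bench(prompt: str, base_url: str = "http://localhost:8000/v1") -> float:
    """Time one non-streaming completion and report decode throughput,
    using the completion_tokens count from the response's usage block."""
    body = json.dumps({
        "model": "unsloth/GLM-4.7-Flash",
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": 0.0,
    }).encode()
    req = urllib.request.Request(
        base_url + "/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(data["usage"]["completion_tokens"], elapsed)
```

Calling `bench` once with a short prompt and once with a ~40k-token prompt should show whether the slowdown is in decode or somewhere else. Note this times the whole request, so prefill is included in the measurement.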
Exactly the same, but with --gpu-memory-utilization .9 --max-num-seqs 1 --max-model-len 80000 added.
I noticed this only happens at around 40k context. A fresh prompt generates around 70 t/s. It looks like generation slows down drastically as the context fills?
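For what it's worth, some per-token slowdown with context is expected: each decoded token re-reads the whole KV cache, so attention cost grows roughly linearly with context length rather than exponentially. A rough sketch of that scaling, using placeholder dimensions (`n_layers` and `kv_dim` are illustrative values, not the real GLM-4.7-Flash config, which uses a compressed MLA latent per token per layer):

```python
def kv_cache_bytes(context_len: int, n_layers: int = 47,
                   kv_dim: int = 576, dtype_bytes: int = 2) -> int:
    """Rough KV-cache footprint for one sequence: bytes that must be
    read from VRAM for every decoded token. Dimensions are placeholders."""
    return context_len * n_layers * kv_dim * dtype_bytes

# Each decode step reads the entire cache, so the per-token memory
# traffic at 40k context is ~40x that at 1k context.
for ctx in (1_000, 10_000, 40_000):
    mb = kv_cache_bytes(ctx) / 1e6
    print(f"{ctx:>6} tokens -> ~{mb:.0f} MB read per decoded token")
```

A 70 t/s to 12 t/s drop is steeper than this linear model alone would predict, though, which is why a backend issue (the Triton MLA path in the logs below) still seems plausible.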
Here are more attention details:
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:25 [gpu_model_runner.py:4021] Starting to load model unsloth/GLM-4.7-Flash-FP8-Dynamic...
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [cuda.py:364] Using TRITON_MLA attention backend out of potential backends: ('TRITON_MLA',)
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [mla_attention.py:1399] Using FlashAttention prefill for MLA
(Worker_TP0_EP0 pid=37627) WARNING 01-26 16:57:26 [compressed_tensors.py:766] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(Worker_TP1_EP1 pid=37628) WARNING 01-26 16:57:26 [compressed_tensors.py:766] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [layer.py:475] [EP Rank 0/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 32/64. Experts local to global index map: 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7, 8->8, 9->9, 10->10, 11->11, 12->12, 13->13, 14->14, 15->15, 16->16, 17->17, 18->18, 19->19, 20->20, 21->21, 22->22, 23->23, 24->24, 25->25, 26->26, 27->27, 28->28, 29->29, 30->30, 31->31.
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [unquantized.py:82] FlashInfer CUTLASS MoE is available for EP but not enabled, consider setting VLLM_USE_FLASHINFER_MOE_FP16=1 to enable it.
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [unquantized.py:103] Using TRITON backend for Unquantized MoE
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [fp8.py:329] Using MARLIN Fp8 MoE backend out of potential backends: ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'BATCHED_DEEPGEMM', 'VLLM_CUTLASS', 'BATCHED_VLLM_CUTLASS', 'TRITON', 'BATCHED_TRITON', 'MARLIN'].
(Worker_TP1_EP1 pid=37628) INFO 01-26 16:57:26 [layer.py:475] [EP Rank 1/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 32/64. Experts local to global index map: 0->32, 1->33, 2->34, 3->35, 4->36, 5->37, 6->38, 7->39, 8->40, 9->41, 10->42, 11->43, 12->44, 13->45, 14->46, 15->47, 16->48, 17->49, 18->50, 19->51, 20->52, 21->53, 22->54, 23->55, 24->56, 25->57, 26->58, 27->59, 28->60, 29->61, 30->62, 31->63.
Oh, can you try setting VLLM_USE_FLASHINFER_MOE_FP16=1 maybe? Hmm, interesting; it might be that vLLM hasn't optimized GLM Flash that much yet?
Just tried it. Same result. I think you are right. In the meantime, I think I'll just use llama.cpp.