dual 3090 inference

#1
by evetsagg - opened

Following the installation instructions on the Unsloth page (no fp8 KV cache), I'm getting about 12 t/s of inference without Flash Speculative Decoding and 1 t/s with it. Is that expected?

Unsloth AI org

Did you use the same commands as in the guide? Can you try 1 GPU:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
    --served-model-name unsloth/GLM-4.7-Flash \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --dtype bfloat16 \
    --seed 3407 \
    --port 8000

or try 2 GPUs (tensor parallel) via:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0,1' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
    --served-model-name unsloth/GLM-4.7-Flash \
    --tensor-parallel-size 2 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --dtype bfloat16 \
    --seed 3407 \
    --port 8000

Exactly the same, but I added --gpu-memory-utilization 0.9 --max-num-seqs 1 --max-model-len 80000.

I noticed this only happens at around 40k context. A fresh prompt generates around 70 t/s. Looks like generation slows down exponentially as the context fills?
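One way to check the slowdown-with-context claim is a quick throughput probe against the running server. A minimal sketch, assuming the vLLM OpenAI-compatible endpoint from the launch commands above is up on port 8000; the helper names (`tokens_per_second`, `bench`) are hypothetical, and the payload follows the standard /v1/completions schema that vLLM implements:

```python
# Sketch: measure decode throughput (t/s) for prompts of growing length
# against a local vLLM OpenAI-compatible server on port 8000.
import json
import time
import urllib.request


def tokens_per_second(completion_tokens, elapsed_s):
    """Decode throughput in tokens per second."""
    return completion_tokens / elapsed_s


def bench(prompt, url="http://localhost:8000/v1/completions"):
    """Send one completion request and return the measured t/s."""
    payload = {
        "model": "unsloth/GLM-4.7-Flash",
        "prompt": prompt,
        "max_tokens": 128,
        "temperature": 0,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    # vLLM reports generated-token counts in the OpenAI-style usage block.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)


# Example (requires the server to be running):
#   for words in (100, 5000, 20000):
#       print(words, "words:", bench("hello " * words), "t/s")
```

If throughput at ~40k context collapses while short prompts stay fast, that points at the decode path (attention/MoE kernels) rather than the server setup.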

Here are more details on the attention and MoE backends from the startup logs:

(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:25 [gpu_model_runner.py:4021] Starting to load model unsloth/GLM-4.7-Flash-FP8-Dynamic...
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [cuda.py:364] Using TRITON_MLA attention backend out of potential backends: ('TRITON_MLA',)
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [mla_attention.py:1399] Using FlashAttention prefill for MLA
(Worker_TP0_EP0 pid=37627) WARNING 01-26 16:57:26 [compressed_tensors.py:766] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(Worker_TP1_EP1 pid=37628) WARNING 01-26 16:57:26 [compressed_tensors.py:766] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [layer.py:475] [EP Rank 0/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 32/64. Experts local to global index map: 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7, 8->8, 9->9, 10->10, 11->11, 12->12, 13->13, 14->14, 15->15, 16->16, 17->17, 18->18, 19->19, 20->20, 21->21, 22->22, 23->23, 24->24, 25->25, 26->26, 27->27, 28->28, 29->29, 30->30, 31->31.
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [unquantized.py:82] FlashInfer CUTLASS MoE is available for EP but not enabled, consider setting VLLM_USE_FLASHINFER_MOE_FP16=1 to enable it.
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [unquantized.py:103] Using TRITON backend for Unquantized MoE
(Worker_TP0_EP0 pid=37627) INFO 01-26 16:57:26 [fp8.py:329] Using MARLIN Fp8 MoE backend out of potential backends: ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'BATCHED_DEEPGEMM', 'VLLM_CUTLASS', 'BATCHED_VLLM_CUTLASS', 'TRITON', 'BATCHED_TRITON', 'MARLIN'].
(Worker_TP1_EP1 pid=37628) INFO 01-26 16:57:26 [layer.py:475] [EP Rank 1/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 32/64. Experts local to global index map: 0->32, 1->33, 2->34, 3->35, 4->36, 5->37, 6->38, 7->39, 8->40, 9->41, 10->42, 11->43, 12->44, 13->45, 14->46, 15->47, 16->48, 17->49, 18->50, 19->51, 20->52, 21->53, 22->54, 23->55, 24->56, 25->57, 26->58, 27->59, 28->60, 29->61, 30->62, 31->63.
Unsloth AI org

Oh, can you try setting VLLM_USE_FLASHINFER_MOE_FP16=1 maybe? Hmm, interesting; it might be that vLLM hasn't optimized GLM Flash that much yet.
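A minimal sketch of that suggestion, assuming the same 2-GPU launch as above (the log line itself pointed at this variable, so only the extra export is new here):

```shell
# Enable the FlashInfer CUTLASS MoE path hinted at in the startup log,
# then relaunch with the same 2-GPU command as before.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
export VLLM_USE_FLASHINFER_MOE_FP16=1
CUDA_VISIBLE_DEVICES='0,1' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
    --served-model-name unsloth/GLM-4.7-Flash \
    --tensor-parallel-size 2 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --dtype bfloat16 \
    --seed 3407 \
    --port 8000
```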


Just tried it. Same result. I think you're right. In the meantime, I think I'll just use llama.cpp.
