Using this speculator with Red Hat AI's quantized model

by nebula0248

Does this speculator work with RedHatAI/Qwen3-32B-FP8-dynamic? I used this speculator with the 16-bit quantized Qwen3-32B, and it worked well with an acceptance length of 2–3. However, it fails with RedHatAI/Qwen3-32B-FP8-dynamic: the main model isn’t accepting the speculator’s predicted tokens. Is this expected behavior? I assumed that even with quantization, the token prediction distribution should be very close to the unquantized model.

Hi @nebula0248 ,

Thank you for reporting this. We have conducted a performance audit to investigate the reported drop in acceptance length when using the RedHatAI/Qwen3-32B-speculator.eagle3 with the RedHatAI/Qwen3-32B-FP8-dynamic verifier.

Using vLLM main (commit 6ca4f400d), our internal benchmarks show that the quantized verifier actually maintains—and in some cases slightly exceeds—the performance of the BF16 base model.

Evaluation Results

We ran a side-by-side comparison between the base configuration and the FP8-dynamic verifier using a standard evaluation suite.

| Metric | Base Model (Qwen/Qwen3-32B) | Quantized Verifier (RedHatAI/...-FP8-dynamic) |
|---|---|---|
| Avg. Drafted Tokens | 18,119.62 | 22,581.43 |
| Weighted Acceptance Rates | [0.705, 0.476, 0.313] | [0.714, 0.492, 0.333] |
| Conditional Acceptance Rates | [0.705, 0.675, 0.657] | [0.714, 0.689, 0.678] |
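As a quick sanity check on how the two metrics relate, the weighted rate at draft position k appears to be the cumulative product of the per-position conditional rates (an assumption about how these metrics are defined, but it reproduces the reported numbers):

```python
# Assumed relationship: weighted_rate[k] = product of conditional rates
# for positions 0..k. Using the base-model column from the table above.
conditional = [0.705, 0.675, 0.657]

weighted = []
running = 1.0
for rate in conditional:
    running *= rate
    weighted.append(round(running, 3))

print(weighted)  # → [0.705, 0.476, 0.313], matching the reported column
```

In other words, the conditional rates stay nearly flat across draft positions, which is why the speculator keeps accepting deeper tokens rather than collapsing after position 0.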

Technical Analysis

Our data indicates that the RedHatAI/Qwen3-32B-speculator.eagle3 is highly robust to the FP8-dynamic quantization of the target verifier. Since we are seeing stable acceptance rates in our environment (2x H100), the performance drop you observed might be related to specific serving parameters, memory pressure, or hardware-specific kernels.

To help us narrow this down, could you please provide:

  1. Your GPU hardware (e.g., A100, H100).
  2. The exact vllm serve command you are using.
  3. Your vLLM version or specific commit.
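For reference, an invocation along these lines is what we would expect (a sketch only — the exact flag names and JSON keys depend on your vLLM version, and the tensor-parallel size here is just an example for 2 GPUs):

```shell
# Hypothetical serving command; verify flags against your vLLM version.
vllm serve RedHatAI/Qwen3-32B-FP8-dynamic \
  --tensor-parallel-size 2 \
  --speculative-config '{
    "model": "RedHatAI/Qwen3-32B-speculator.eagle3",
    "method": "eagle3",
    "num_speculative_tokens": 3
  }'
```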

We are closing this for now based on our verification results, but please feel free to share your logs and re-open this if you continue to see regressions!
