Using this speculator with Red Hat AI's quantized model
Does this speculator work with RedHatAI/Qwen3-32B-FP8-dynamic? I used this speculator with the 16-bit Qwen3-32B, and it worked well, with an acceptance length of 2–3. However, it fails with RedHatAI/Qwen3-32B-FP8-dynamic: the main model isn’t accepting the speculator’s predicted tokens. Is this expected behavior? I assumed that even with quantization, the token prediction distribution should be very close to that of the unquantized model.
Hi @nebula0248,
Thank you for reporting this. We have conducted a performance audit to investigate the reported drop in acceptance length when using the RedHatAI/Qwen3-32B-speculator.eagle3 with the RedHatAI/Qwen3-32B-FP8-dynamic verifier.
Using vLLM main (commit 6ca4f400d), our internal benchmarks show that the quantized verifier matches, and in some cases slightly exceeds, the acceptance rates of the BF16 base model.
Evaluation Results
We ran a side-by-side comparison between the base configuration and the FP8-dynamic verifier using a standard evaluation suite.
| Metric | Base Model (Qwen/Qwen3-32B) | Quantized Verifier (RedHatAI/...-FP8-dynamic) |
|---|---|---|
| Avg. Drafted Tokens | 18,119.62 | 22,581.43 |
| Weighted Acceptance Rates | [0.705, 0.476, 0.313] | [0.714, 0.492, 0.333] |
| Conditional Acceptance Rates | [0.705, 0.675, 0.657] | [0.714, 0.689, 0.678] |
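As a rough guide to reading the table above, the two rate rows appear to be related: the weighted rate at draft position k is approximately the product of the conditional rates up to k. The sketch below checks that and derives a mean acceptance length. It assumes acceptance length is defined as one verifier token plus the expected number of accepted draft tokens; that definition and the function names are illustrative assumptions, not taken from vLLM.

```python
from itertools import accumulate
from operator import mul

# Conditional acceptance rates per draft position, copied from the table above.
conditional = {
    "base": [0.705, 0.675, 0.657],
    "fp8_dynamic": [0.714, 0.689, 0.678],
}


def weighted_from_conditional(rates):
    """Cumulative product: chance that draft positions 1..k are all accepted."""
    return list(accumulate(rates, mul))


def mean_acceptance_length(weighted):
    """Assumed definition: 1 verifier token + expected accepted draft tokens."""
    return 1.0 + sum(weighted)


for name, rates in conditional.items():
    weighted = weighted_from_conditional(rates)
    print(name, [round(w, 3) for w in weighted],
          round(mean_acceptance_length(weighted), 2))
```

Under that assumed definition, both configurations come out at roughly 2.5 tokens per step, which is in line with the acceptance length of 2–3 reported for the 16-bit model. The recomputed weighted rates match the table to within rounding.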
Technical Analysis
Our data indicates that the RedHatAI/Qwen3-32B-speculator.eagle3 is highly robust to the FP8-dynamic quantization of the target verifier. Since we are seeing stable acceptance rates in our environment (2x H100), the performance drop you observed might be related to specific serving parameters, memory pressure, or hardware-specific kernels.
To help us narrow this down, could you please provide:
- Your GPU hardware (e.g., A100, H100, etc.).
- The exact `vllm serve` command you are using.
- Your vLLM version or specific commit.
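If it helps, the hardware and version details requested above could be gathered with a small script along these lines. This is only a sketch: it assumes `nvidia-smi` is on the PATH and that vLLM was installed as the `vllm` Python package, and the helper names (`package_version`, `gpu_names`, `collect_report`) are made up for illustration. It does not capture the serve command, which still needs to be pasted by hand.

```python
import platform
import shutil
import subprocess
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a placeholder string."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def gpu_names():
    """List GPU name and memory per device via nvidia-smi, if it is available."""
    if shutil.which("nvidia-smi") is None:
        return ["nvidia-smi not found"]
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.SubprocessError) as exc:
        return [f"nvidia-smi failed: {exc}"]
    if result.returncode != 0:
        return [f"nvidia-smi exited with code {result.returncode}"]
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return lines or ["no GPUs reported"]


def collect_report():
    """Bundle the details that are useful to include in a bug report."""
    return {
        "python": platform.python_version(),
        "vllm": package_version("vllm"),
        "torch": package_version("torch"),
        "gpus": gpu_names(),
    }


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"{key}: {value}")
```

A source build from a specific commit may report a development version string here, so including the commit hash separately is still worthwhile.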
We are closing this for now based on our verification results, but please feel free to share your logs and re-open this if you continue to see regressions!