Could you also provide an FP8-block version of the gemma-4-31B-it draft model?

#5
by RayHuang1991 - opened
Red Hat AI org

Hi @RayHuang1991, it looks like this follows a similar naming structure to the original model. If you'd like to try it out yourself, you can mimic the creation script. No specialized hardware is required.

Thank you! With the new vLLM release, I can now use the Gemma 4 31B assistant as the MTP draft model. However, there is an issue with the combination RedHatAI/gemma-4-31B-it-FP8-block + Gemma-4-31B-it-assistant: even though the rejection rate of speculative decoding is low, the output for the same task is very different from that of gemma-4-31B-it-FP8-block alone.
The MTP draft model shares some layers with the main model, so I think this might cause discrepancies when the main model is slightly different from the draft model.
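
To make "very different" measurable, one option is to compare the token sequences from the two runs and find where they first split. The snippet below is only a minimal sketch: the helper names `first_divergence` and `shared_prefix_ratio` are hypothetical, and it assumes both runs used the same prompt with greedy decoding (temperature 0), so that sampling randomness is not the cause of any difference.

```python
from itertools import zip_longest

# Sentinel used to pad the shorter sequence, so a length mismatch counts as a difference.
_MISSING = object()


def first_divergence(baseline, speculative):
    """Return the index of the first differing position, or None if the sequences are identical."""
    pairs = zip_longest(baseline, speculative, fillvalue=_MISSING)
    for index, (base_token, spec_token) in enumerate(pairs):
        if base_token != spec_token:
            return index
    return None


def shared_prefix_ratio(baseline, speculative):
    """Return the length of the common prefix divided by the length of the longer sequence."""
    index = first_divergence(baseline, speculative)
    if index is None:
        return 1.0
    return index / max(len(baseline), len(speculative))


if __name__ == "__main__":
    # Hypothetical token IDs standing in for the two runs' outputs.
    baseline_tokens = [101, 7, 42, 13, 99]
    speculative_tokens = [101, 7, 42, 58, 99]
    print(first_divergence(baseline_tokens, speculative_tokens))
    print(shared_prefix_ratio(baseline_tokens, speculative_tokens))
```

An early split on many prompts would be consistent with the draft model affecting the accepted tokens, while a late split after a long shared prefix could also come from ordinary numerical differences between the two execution paths.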
