Is there a specific context length where performance begins to degrade?
I recently reviewed the OWL paper (https://arxiv.org/html/2510.07535v1) and understand that speculative decoding throughput (in tok/s) tends to plummet once the input context length exceeds 2K tokens. I am also aware that the EAGLE-3 models from RedHatAI are trained on UltraChat-200K and ShareGPT, following the methodology of the original EAGLE-3 paper.
However, in my own testing with prompt lengths between 2K and 8K tokens, I observed something different: although tok/s did drop, the speculative setup consistently maintained higher throughput than the base model decoding on its own; at no point did the base model outperform the speculator.
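For intuition, this observation is consistent with the standard analytical model of speculative-decoding speedup (Leviathan et al., 2023): with per-token acceptance rate alpha, draft length k, and a draft-to-target cost ratio c, the speedup stays above 1x even at fairly low acceptance rates as long as drafting is cheap. The sketch below is purely illustrative; the alpha, k, and c values are assumptions, not measurements from my runs.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification pass, assuming an
    i.i.d. per-token acceptance probability alpha and k drafted tokens:
    (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, c: float) -> float:
    """Wall-clock speedup over plain decoding, where c is the cost of one
    draft-model step relative to one target-model forward pass."""
    return expected_tokens_per_step(alpha, k) / (k * c + 1)

# Hypothetical acceptance rates: a drop in alpha (as might happen at
# longer contexts) reduces the speedup, but it can remain above 1x.
for alpha in (0.8, 0.6, 0.4):
    print(f"alpha={alpha}: speedup ~ {speedup(alpha, k=5, c=0.05):.2f}x")
```

Under this model, a longer context that lowers the acceptance rate would reduce tok/s without necessarily pushing the speculator below base-only throughput, which matches what I saw.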
Given this, I am curious whether any specific techniques were used when training the RedHatAI speculator models to achieve this robustness at longer contexts.