Is there a specific context length where performance begins to degrade?
I recently reviewed the OWL paper (https://arxiv.org/html/2510.07535v1) and understand that speculative decoding throughput (in tok/s) tends to plummet once the input context length exceeds 2K tokens. I am also aware that the EAGLE-3 models from RedHatAI are trained on UltraChat-200K and ShareGPT, following the methodology of the original EAGLE-3 paper.
However, in my own testing with prompt lengths between 2K and 8K tokens, I observed something different: although tok/s did drop, the speculative setup consistently maintained higher throughput than the base model decoding on its own; at no point did the base model outperform the speculator.
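For intuition, this observation is consistent with the standard analytical model of speculative-decoding speedup (Leviathan et al., 2023): with per-token acceptance rate alpha, draft length k, and a draft-to-target cost ratio c, the speedup stays above 1x even at fairly low acceptance rates as long as drafting is cheap. The sketch below is purely illustrative; the alpha, k, and c values are assumptions, not measurements from my runs.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification pass, assuming an
    i.i.d. per-token acceptance probability alpha and k drafted tokens:
    (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, c: float) -> float:
    """Wall-clock speedup over plain decoding, where c is the cost of one
    draft-model step relative to one target-model forward pass."""
    return expected_tokens_per_step(alpha, k) / (k * c + 1)

# Hypothetical acceptance rates: a drop in alpha (as might happen at
# longer contexts) reduces the speedup, but it can remain above 1x.
for alpha in (0.8, 0.6, 0.4):
    print(f"alpha={alpha}: speedup ~ {speedup(alpha, k=5, c=0.05):.2f}x")
```

Under this model, a longer context that lowers the acceptance rate would reduce tok/s without necessarily pushing the speculator below base-only throughput, which matches what I saw.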
Given this, I am curious whether any specific techniques were used when training the RedHatAI speculator models to achieve this robustness at longer contexts.