This was a really enjoyable read! I just have one clarifying question:
Based on the execution timeline you gave, it seems that the carry-over from batch 0 is not applied to batch 1 but is instead delayed until batch 2. Did I read this correctly? And if so, is this a common practice for inference engines?