I really appreciate the effort you put into explaining this so well. Just one doubt: what exactly is being cached?
- The QK^T dot-product results and the value vectors of the already generated tokens, or
- just the key vectors and the value vectors of the already generated tokens?
Also, is this done for each transformer block in an LLM?
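To make the question concrete, here's a minimal sketch of what I currently think happens, with a separate (K, V) cache per transformer block and the attention scores recomputed each step. All names, shapes, and the toy projections are my own assumptions, not from your post:

```python
import numpy as np

# Minimal sketch of per-layer KV caching during autoregressive decoding.
# Assumption: one cache per transformer block, holding only K and V vectors;
# the QK^T scores are NOT stored and are recomputed at every step.

d_model, n_layers = 8, 2
rng = np.random.default_rng(0)
# Hypothetical per-layer projection matrices (single head, no output proj).
Wq = [rng.standard_normal((d_model, d_model)) for _ in range(n_layers)]
Wk = [rng.standard_normal((d_model, d_model)) for _ in range(n_layers)]
Wv = [rng.standard_normal((d_model, d_model)) for _ in range(n_layers)]

# One (K, V) cache per layer.
cache = [{"K": [], "V": []} for _ in range(n_layers)]

def decode_step(x):
    """Process one new token embedding x through all layers."""
    for layer in range(n_layers):
        q = x @ Wq[layer]                        # query for the new token only
        cache[layer]["K"].append(x @ Wk[layer])  # append this token's key
        cache[layer]["V"].append(x @ Wv[layer])  # append this token's value
        K = np.stack(cache[layer]["K"])          # (seq_len, d_model)
        V = np.stack(cache[layer]["V"])
        scores = q @ K.T / np.sqrt(d_model)      # QK^T recomputed each step
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                 # softmax over cached keys
        x = weights @ V                          # attention output feeds next layer
    return x

for _ in range(3):
    decode_step(rng.standard_normal(d_model))

print([len(c["K"]) for c in cache])  # each layer's cache grew by one per token: [3, 3]
```

Is this roughly the right mental model, i.e., only K and V are kept per block?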