About model.generate and KV cache
#6
by quaternior - opened
Hello, thanks for your impressive work. I really enjoy this project and model.
I have some questions about this model. The example code uses model.generate for convenient reproduction, but it does not seem to use a KV cache during the decode-like phase. Also, for efficient deployment, I saw that there are options such as dInfer (an efficient inference framework) and SGLang (an efficient serving framework). Is that right?
Again, thanks for your amazing work!