About model.generate and KV cache

#6
by quaternior - opened

Hello, thanks for your impressive work. I really enjoy this project and the model.

I have a few questions about the model. The example code uses model.generate, which is convenient for reproduction, but it doesn't seem to use a KV cache during the decode-like phase. Also, for efficient deployment, I saw that there are options such as dInfer (an efficient inference framework) and SGLang (an efficient serving framework). Is my understanding correct on both points?

Again, thanks for your amazing work!
