Support for video input and "4x temporal compression" with vLLM / SGLang
Thank you for releasing Kimi-K2.5! It is a fantastic model and a truly impressive piece of work.
I have a question regarding the model's video understanding capabilities. The technical report mentions that Kimi-K2.5 supports a 4x temporal compression mechanism for video processing, but the current documentation states that "Chat with video content is an experimental feature and is only supported in our official API for now."

With a vLLM/SGLang deployment, the only way to process a video today is to manually extract frames and feed them to the model as a list of independent "multi-image" inputs. Since no temporal compression is applied on the serving side, token consumption is massive, inference is extremely slow, and the KV cache comes under heavy pressure.
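For context, here is a minimal sketch of the multi-image workaround I am using. It assumes frames have already been decoded to JPEG bytes (e.g. via ffmpeg or OpenCV) and builds an OpenAI-compatible chat payload for the vLLM/SGLang server; the helper name and sampling setup are my own, not part of any official API. The point is that every sampled frame is sent as a full, independent image, so vision-token cost grows linearly with frame count:

```python
import base64

def build_multi_image_message(frame_jpegs, prompt):
    """Build an OpenAI-compatible chat message that feeds sampled video
    frames as independent images (the current workaround). Each frame
    consumes its full vision-token budget, since no temporal
    compression happens server-side."""
    content = [
        {
            "type": "image_url",
            "image_url": {
                "url": "data:image/jpeg;base64,"
                + base64.b64encode(jpeg).decode("ascii")
            },
        }
        for jpeg in frame_jpegs
    ]
    # The text prompt follows the frames, as in typical multi-image chat.
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]

# Dummy bytes stand in for real JPEG-encoded frames here.
frames = [b"frame-%d" % i for i in range(8)]
messages = build_multi_image_message(frames, "Describe what happens in the video.")
```

Even at a modest sampling rate (say 1 fps on a few minutes of video), this quickly reaches hundreds of images per request, which is where the KV-cache pressure comes from.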
Are there any plans or an estimated timeline for bringing video input support (and the corresponding video processor logic) to open-source inference frameworks like vLLM and SGLang?