| --- |
| license: apache-2.0 |
| pipeline_tag: video-text-to-text |
| --- |
| |
| # JoyAI-VL-Interaction |
|
|
| **The first open, vision-driven real-time interaction model β it watches a live video stream and decides on its own when to speak, stay silent, or delegate.** |
|
|
| [π Paper](https://arxiv.org/abs/2606.14777) Β· [π Project Page & Demos](https://joyai-vl-video-future-academy-jd.github.io/JoyAI-VL-Interaction/) Β· [π» GitHub](https://github.com/jd-opensource/JoyAI-VL-Interaction) Β· [π€ Paper Page](https://huggingface.co/papers/2606.14777) |
|
|
| --- |
| ## Overview |
|
|
| Most large models today are **turn-based**: they answer only when you ask. But many moments in the real world don't wait for a question β a fire starts on a security feed, someone falls, a product flashes by in a livestream. Once missed, the moment is gone. |
|
|
| **JoyAI-VL-Interaction** is built for exactly these moments. It is an **8B-scale, vision-first interaction model** that continuously watches a live video stream and, **every second, decides on its own** to take one of three actions: |
|
|
| - **Speak** β respond when something is worth saying |
| - **Stay silent** β keep watching when nothing warrants a response (a first-class, trained action) |
| - **Delegate** β hand a hard subtask to a background model/agent, keep watching, and weave the result back in when it returns |
|
|
| The decision of *when to act* is **learned inside the model** (from second-by-second time-aligned data + RL), not bolted on by an external turn-detector or polling loop. Vision is the first-class driver; speech (ASR/TTS) is treated as pluggable I/O. |
|
|
| To our knowledge, this is the **first open, vision-driven interaction model** released together with its training recipe, data, and a complete deployable system. |
|
|
| --- |
| ## vLLM Usage |
|
|
| [vLLM-Omni](https://github.com/vllm-project/vllm-omni) provides **day-0 support** for JoyAI-VL-Interaction! The model is a standard Qwen3-VL VLM served by a plain `vllm serve`; vLLM-Omni adds the real-time interaction layer on top β the per-second **speak / silence / delegate** orchestration, 3-tier summary memory, and pluggable ASR / TTS / delegation. For installation and full setup, see the [vLLM-Omni recipe](https://github.com/vllm-project/vllm-omni/blob/main/recipes/JD/JoyAI-VL-Interaction.md). |
|
|
| ### Online Serving |
|
|
| ```bash |
| # git clone https://github.com/vllm-project/vllm-omni.git |
| |
| # 1. Serve the model (plain `vllm serve`, NOT --omni β it is vanilla Qwen3-VL) |
| vllm serve jdopensource/JoyAI-VL-Interaction-Preview \ |
| --served-model-name JoyAI-VL-Interaction-Preview --port 8061 \ |
| --max-model-len 131072 --enable-prefix-caching --limit-mm-per-prompt '{"image":256,"video":1}' |
| |
| # 2. Start the interaction orchestrator (OpenAI-compatible, :8070) |
| python -m vllm_omni.experimental.fullduplex.joyvl.serving.server --port 8070 \ |
| --main-backend-url http://127.0.0.1:8061/v1 --main-model JoyAI-VL-Interaction-Preview |
| ``` |
|
|
| For the full browser demo β live webcam / RTSP input, voice (ASR/TTS), and the per-tick decision stream β run JD's official WebUI (`services/webui`) in front of the orchestrator; see the [vLLM-Omni recipe](https://github.com/vllm-project/vllm-omni/blob/main/recipes/JD/JoyAI-VL-Interaction.md) for the steps. |