--- license: other license_name: ltx-2-community-license-agreement license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE pipeline_tag: text-to-video tags: - text-to-video - video-generation - audio-video-generation - long-video - multi-shot - dmd library_name: ltx-video ---

JoyAI-Echo generated video gallery

JoyAI-Echo

🎬 Pushing the Frontier of Long Video Generation

Standalone, inference-only release for minute-level multi-shot audio-video generation with a distilled DMD generator, paired cross-modal memory, and story-level consistency.

For academic research and non-commercial use only.

📄 Paper | 🌐 Project Page | 🚀 Quickstart | 📊 Results | 📝 Citation

## Abstract Long video generation still suffers from error accumulation, weak temporal coherence, and prohibitive latency, limiting its applicability to interactive scenarios. We present **JoyAI-Echo**, a framework that breaks these barriers through four key advances. Central to its performance, a cross-modal audio-visual memory bank preserves character appearance and voice timbre consistently over five-minute videos, while a post-training pipeline combines memory-based reinforcement learning with distribution matching distillation for a **7.5× speedup** to substantially boost visual quality and alignment. Empowered by these two components, **JoyAI-Echo** decisively outperforms *HappyOyster* (directing mode) on long-form generation and even surpasses the short-video specialist *Wan 2.6* on human-centric tasks. Beyond raw generation quality, an interactive agent enables real-time user editing through conversational instructions, and a lightweight super-resolution module maintains high definition under streaming latency, further elevating the overall experience and delivering instantly editable, conversation-speed video creation. For the first time, **JoyAI-Echo** simultaneously achieves long-range cross-modal consistency, real-time inference for minute-long video, conversational interactivity, and high-resolution output — without compromise, inaugurating a new era of interactive video generation. Codes and weights will be open-sourced. ## Highlights - 🎞️ **Minute-level multi-shot stories**: generate a sequence of coherent shots from one prompt JSON. - ⚡ **DMD-distilled few-step inference**: ~7.5x faster than the original pipeline. - 🔊 **Joint audio-video generation**: one pipeline produces synchronized video and audio. - 🧠 **Paired cross-modal memory bank**: conditions each new shot on prior visual identity and voice context for story-level consistency. ## Current Release Scope JoyAI-Echo currently focuses on **text-to-video (T2V)** and **multi-shot long-video generation with paired audio-video memory**. The memory used in our official pipeline is built from generated T2V shots. Please note that **image-to-video (I2V)** is **not supported in the current release**. We are actively working on I2V support and plan to release it in a future version. ## Demo Gallery Explore long-form and short-form JoyAI-Echo cases on the [Project Page](https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/). 🍿 ## Results ### Reported Scale | Item | Value | | --- | ---: | | 🎬 Long-form coherent story length | **5 min** | | ⚡ Generation speedup over the original multi-step pipeline | **7.5x** | | 📚 Benchmark stories | **100** | | 🎞️ Generated evaluation shots | **3,000** | | 🕒 Frames per shot | **241 @ 25 fps** | ### Human Evaluation GSB user study on long- and short-video generation. The numbers denote the percentage of user preferences. | Aspect
(Long Video) | JoyAI-Echo | Tie | HappyOyster
(Directing) | | --- | ---: | ---: | ---: | | Visual aesthetics | **63.6%** | 8.8% | 27.6% | | Audio quality | **81.7%** | 6.5% | 11.8% | | Prompt following | **80.6%** | 13.5% | 5.9% | | IP consistency | **59.4%** | 12.9% | 27.7% | | Aspect
(Short Video) | JoyAI-Echo | Tie | Wan 2.6 | | --- | ---: | ---: | ---: | | Visual aesthetics | **58.8%** | 14.7% | 26.5% | | Audio quality | 32.3% | 30.9% | 36.8% | | Prompt following | 33.8% | 36.8% | 29.4% | ## Quickstart ### 1. Clone Get the Repo at first! ```bash git clone https://github.com/jd-opensource/JoyAI-Echo.git cd JoyAI-Echo ``` ### 2. Create the environment The reference environment is **Python 3.11 + PyTorch 2.8 + CUDA 12.8**. With conda: ```bash conda env create -f environment.yml conda activate echo-long ``` With `uv`: ```bash uv venv --python 3.11 .venv source .venv/bin/activate uv pip install --extra-index-url https://download.pytorch.org/whl/cu128 -r requirements.txt ``` [`ffmpeg`](https://ffmpeg.org/download.html) must be available on `PATH` for shot concatenation. The conda recipe includes it. If you use `uv`, install it with your system package manager: ```bash sudo apt install ffmpeg # macOS: brew install ffmpeg ``` ### 3. Download checkpoint Download the JoyAI-Echo release checkpoint and Gemma text encoder: | File | Description | Size | Link | | --- | --- | --- | --- | | `echo-longvideo-release.safetensors` | Full model (transformer + VAE + vocoder) | ~46 GB |[`JoyAI-Echo`](https://huggingface.co/jdopensource/JoyAI-Echo) | | `gemma-3-12b/` | Instruction-tuned model (text encoder) | ~24 GB | [`gemma-3-12b-it`](https://huggingface.co/google/gemma-3-12b-it) | Place them under `checkpoints/`: ```text checkpoints/ +-- echo-longvideo-release.safetensors `-- gemma-3-12b/ ``` ### 4. Write a story prompt Create a JSON file under `prompts/`. Each string is one complete shot description. A single prompt creates a single shot. Multiple prompts create a multi-shot story conditioned through the paired audio-video memory bank. ### 5. Run inference ```bash python inference.py ``` This loads the model once and processes all prompt files under `prompts/`. > 💡 **Note**: The inference pipeline is optimized to run on lower-VRAM > GPUs. Peak GPU usage is around **46–50 GB**, at the cost of slightly > longer per-shot inference time. Outputs are written to: ```text inference_result/outputs//inference_/ ``` ## Configuration All inference parameters are managed in `configs/inference.yaml`. The file is organized into sections: | Section | Contents | | --- | --- | | `paths` | Checkpoint path, prompts directory, output root | | `video` | Resolution, frame count, FPS, seed | | `denoising` | Step list and sigma schedule | | `memory` | Memory bank size, save mode, LoRA settings | | `audio_memory` | Audio window, mel-spectrogram params | | `inference` | Device, dtype, grad scale | ### Override via CLI Any YAML parameter can be overridden from the command line: ```bash python inference.py --seed 42 --num-frames 121 --video-height 480 --video-width 832 ``` Use a custom config file: ```bash python inference.py --config configs/my_experiment.yaml ``` The Python entrypoint exposes the full configuration surface: ```bash python inference.py --help ``` ## Hardware Peak GPU usage is around **46–50 GB** for the default **25 fps x 241 frames x 1280 x 736** setting, so a single H100/A100-class (80 GB) or 48 GB GPU is sufficient. For smaller GPUs, reduce resolution/frames: ```bash python inference.py --num-frames 121 --video-height 480 --video-width 832 ``` ## TODO List - [x] Release inference code - [x] Release model checkpoints - [x] Add prompt examples - [ ] Release Director Agent ## Links - Project page: [`https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/`](https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/) - Repository: [`https://github.com/jd-opensource/JoyAI-Echo`](https://github.com/jd-opensource/JoyAI-Echo) - huggingface: [`https://huggingface.co/jdopensource/JoyAI-Echo`](https://huggingface.co/jdopensource/JoyAI-Echo) ## Acknowledgements We gratefully acknowledge the open-source projects this work builds upon — in particular [LTX2.3](https://huggingface.co/Lightricks/LTX-2.3) for the base video generator and [Gemma](https://huggingface.co/google/gemma-3-12b-it) for the text encoder. Thanks to the broader research community whose contributions made this release possible. ## Citation If JoyAI-Echo helps your research or products, please cite: ```bibtex @techreport{echo2026longvideo, title = {JoyAI-Echo: Pushing the Frontier of Long Video Generation}, author = {{Echo Team @ Joy Future Academy, JD}}, institution = {Joy Future Academy, JD}, year = {2026}, month = {May} } ``` ## License This project is based on LTX-2 by Lightricks Ltd. Portions of the original LTX-2 codebase have been modified by JD.com for academic and research purposes only. This project is not intended for commercial use. For commercial use of LTX-2 or its derivatives, please contact Lightricks Ltd. All original copyright, license, patent, trademark, and attribution notices from LTX-2 are retained. This project remains subject to the LTX-2 Community License Agreement.