---
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
pipeline_tag: text-to-video
tags:
- text-to-video
- video-generation
- audio-video-generation
- long-video
- multi-shot
- dmd
library_name: ltx-video
---
# JoyAI-Echo
🎬 Pushing the Frontier of Long Video Generation
Standalone, inference-only release for minute-level multi-shot audio-video generation with a distilled DMD generator, paired cross-modal memory, and story-level consistency.
For academic research and non-commercial use only.
## Abstract
Long video generation still suffers from error accumulation, weak temporal coherence, and prohibitive latency, limiting its applicability to interactive scenarios. We present **JoyAI-Echo**, a framework that breaks these barriers through four key advances.
Central to its performance, a cross-modal audio-visual memory bank keeps character appearance and voice timbre consistent across five-minute videos, while a post-training pipeline combines memory-based reinforcement learning with distribution matching distillation (DMD), yielding a **7.5× speedup** and substantially better visual quality and alignment.
Empowered by these two components, **JoyAI-Echo** decisively outperforms *HappyOyster* (directing mode) on long-form generation and even surpasses the short-video specialist *Wan 2.6* on human-centric tasks.
Beyond raw generation quality, an interactive agent enables real-time user editing through conversational instructions, and a lightweight super-resolution module maintains high definition under streaming latency, further elevating the overall experience and delivering instantly editable, conversation-speed video creation.
For the first time, **JoyAI-Echo** simultaneously achieves long-range cross-modal consistency, real-time inference for minute-long video, conversational interactivity, and high-resolution output — without compromise, inaugurating a new era of interactive video generation.
Code and weights will be open-sourced.
## Highlights
- 🎞️ **Minute-level multi-shot stories**: generate a sequence of coherent shots from one prompt JSON.
- ⚡ **DMD-distilled few-step inference**: ~7.5x faster than the original pipeline.
- 🔊 **Joint audio-video generation**: one pipeline produces synchronized video and audio.
- 🧠 **Paired cross-modal memory bank**: conditions each new shot on prior visual identity and voice context for story-level consistency.
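To make the memory-bank idea more concrete, here is a toy Python sketch of a bounded store that keeps visual and audio features paired per shot. It is an illustration only and not JoyAI-Echo's implementation: `ShotMemory`, `PairedMemoryBank`, the FIFO eviction policy, and the float tuples standing in for features are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ShotMemory:
    """Paired features kept from one generated shot (hypothetical structure)."""
    shot_index: int
    visual: tuple[float, ...]  # stand-in for visual identity features
    audio: tuple[float, ...]   # stand-in for voice/timbre features


class PairedMemoryBank:
    """Bounded FIFO of paired audio-visual entries (illustrative only)."""

    def __init__(self, max_entries: int) -> None:
        self._entries: deque[ShotMemory] = deque(maxlen=max_entries)

    def add(self, entry: ShotMemory) -> None:
        # The oldest entry is dropped automatically once the bank is full.
        self._entries.append(entry)

    def context(self) -> list[ShotMemory]:
        # Visual and audio features always travel together in one entry.
        return list(self._entries)


bank = PairedMemoryBank(max_entries=2)
for i in range(3):
    # A real system would generate shot i conditioned on bank.context() here.
    bank.add(ShotMemory(i, visual=(float(i),), audio=(float(i) * 0.5,)))

print([m.shot_index for m in bank.context()])  # [1, 2]
```

Storing both modalities in a single entry means a shot can never be conditioned on a face without the matching voice, which is the property the "paired" wording suggests.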
## Current Release Scope
JoyAI-Echo currently focuses on **text-to-video (T2V)** and **multi-shot long-video generation with paired audio-video memory**. The memory used in our official pipeline is built from generated T2V shots.
Please note that **image-to-video (I2V)** is **not supported in the current release**.
We are actively working on I2V support and plan to release it in a future version.
## Demo Gallery
Explore long-form and short-form JoyAI-Echo cases on the [Project Page](https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/). 🍿
## Results
### Reported Scale
| Item | Value |
| --- | ---: |
| 🎬 Long-form coherent story length | **5 min** |
| ⚡ Generation speedup over the original multi-step pipeline | **7.5x** |
| 📚 Benchmark stories | **100** |
| 🎞️ Generated evaluation shots | **3,000** |
| 🕒 Frames per shot | **241 @ 25 fps** |
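The figures in this table can be cross-checked with a few lines of arithmetic. The sketch below assumes the 3,000 evaluation shots are split evenly across the 100 benchmark stories, which the table does not state explicitly.

```python
FRAMES_PER_SHOT = 241
FPS = 25
STORIES = 100
SHOTS = 3_000

shot_seconds = FRAMES_PER_SHOT / FPS                      # duration of one shot
shots_per_story = SHOTS // STORIES                        # assumes an even split
story_seconds = shots_per_story * FRAMES_PER_SHOT / FPS   # one full story
story_minutes = story_seconds / 60

print(f"{shot_seconds:.2f} s per shot")        # 9.64 s per shot
print(f"{shots_per_story} shots per story")    # 30 shots per story
print(f"{story_minutes:.2f} min per story")    # 4.82 min per story
```

Under that assumption, a story of 30 shots runs about 4.8 minutes, which is consistent with the reported 5-minute story length.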
### Human Evaluation
GSB (Good/Same/Bad) user study on long- and short-video generation. Each number is the percentage of user votes for that outcome.
| Aspect (Long Video) | JoyAI-Echo | Tie | HappyOyster (Directing) |
| --- | ---: | ---: | ---: |
| Visual aesthetics | **63.6%** | 8.8% | 27.6% |
| Audio quality | **81.7%** | 6.5% | 11.8% |
| Prompt following | **80.6%** | 13.5% | 5.9% |
| IP consistency | **59.4%** | 12.9% | 27.7% |

| Aspect (Short Video) | JoyAI-Echo | Tie | Wan 2.6 |
| --- | ---: | ---: | ---: |
| Visual aesthetics | **58.8%** | 14.7% | 26.5% |
| Audio quality | 32.3% | 30.9% | 36.8% |
| Prompt following | 33.8% | 36.8% | 29.4% |
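One way to read GSB numbers at a glance is the net preference, i.e. the win percentage minus the loss percentage. This is a derived summary and not a metric reported by the authors; the sketch below computes it from the values in the tables above and also checks that each row sums to 100%.

```python
# (JoyAI-Echo %, tie %, baseline %) copied from the tables above.
LONG_VS_HAPPYOYSTER = {
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "IP consistency": (59.4, 12.9, 27.7),
}
SHORT_VS_WAN = {
    "Visual aesthetics": (58.8, 14.7, 26.5),
    "Audio quality": (32.3, 30.9, 36.8),
    "Prompt following": (33.8, 36.8, 29.4),
}


def net_preference(rows: dict) -> dict:
    """Return {aspect: win % minus loss %}, rounded to one decimal."""
    margins = {}
    for aspect, (win, tie, loss) in rows.items():
        # Each row should account for all votes.
        assert abs(win + tie + loss - 100.0) < 0.05, aspect
        margins[aspect] = round(win - loss, 1)
    return margins


for name, rows in [("long", LONG_VS_HAPPYOYSTER), ("short", SHORT_VS_WAN)]:
    print(name, net_preference(rows))
```

By this measure every long-video aspect favors JoyAI-Echo by more than 30 points, while on short videos the audio-quality row comes out at -4.5 points, slightly in Wan 2.6's favor.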
## Quickstart
### 1. Clone
First, clone the repository:
```bash
git clone https://github.com/jd-opensource/JoyAI-Echo.git
cd JoyAI-Echo
```
### 2. Create the environment
The reference environment is **Python 3.11 + PyTorch 2.8 + CUDA 12.8**.
With conda:
```bash
conda env create -f environment.yml
conda activate echo-long
```
With `uv`:
```bash
uv venv --python 3.11 .venv
source .venv/bin/activate
uv pip install --extra-index-url https://download.pytorch.org/whl/cu128 -r requirements.txt
```
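After either install path, a quick sanity check can confirm that the interpreter and PyTorch build match the reference environment. The helper below is a minimal sketch (the name `env_report` is hypothetical and not part of the repository); it degrades gracefully if PyTorch is not installed yet.

```python
import sys


def env_report() -> dict:
    """Collect the versions that matter for the reference environment."""
    report = {"python": f"{sys.version_info.major}.{sys.version_info.minor}"}
    try:
        import torch  # only importable once the environment is installed
    except ImportError:
        report["torch"] = None
        return report
    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda          # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    return report


print(env_report())
```

On the reference setup you would expect Python 3.11, a 2.8.x PyTorch version, and a CUDA 12.8 build.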
[`ffmpeg`](https://ffmpeg.org/download.html) must be available on `PATH` for shot concatenation. The conda recipe includes it. If you use `uv`, install it with your system package manager:
```bash
# Debian/Ubuntu:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg
```
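For readers curious what shot concatenation with `ffmpeg` typically looks like, here is a sketch that uses ffmpeg's concat demuxer. This is an assumption about the general approach and not necessarily the command the repository runs; `concat_plan` and the file names are hypothetical.

```python
import shutil
from pathlib import Path


def concat_plan(shots: list[str], list_file: str, output: str) -> tuple[str, list[str]]:
    """Return the concat list-file text and the ffmpeg command to run."""
    # The concat demuxer reads one "file '<path>'" line per input clip.
    listing = "".join(f"file '{Path(s).as_posix()}'\n" for s in shots)
    command = [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",  # accept arbitrary paths in the list file
        "-i", list_file,
        "-c", "copy",                   # stream copy; clips must share codecs
        output,
    ]
    return listing, command


if shutil.which("ffmpeg") is None:
    print("ffmpeg not found on PATH")

listing, command = concat_plan(["shot_000.mp4", "shot_001.mp4"], "shots.txt", "story.mp4")
print(listing, end="")
print(" ".join(command))
```

To actually run it, write `listing` to `shots.txt` and pass `command` to `subprocess.run(command, check=True)`. Paths containing single quotes would need escaping, which this sketch does not handle.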
### 3. Download checkpoints
Download the JoyAI-Echo release checkpoint and the Gemma text encoder:
| File | Description | Size | Link |
| --- | --- | --- | --- |
| `echo-longvideo-release.safetensors` | Full model (transformer + VAE + vocoder) | ~46 GB | [`JoyAI-Echo`](https://huggingface.co/jdopensource/JoyAI-Echo) |
| `gemma-3-12b/` | Instruction-tuned model (text encoder) | ~24 GB | [`gemma-3-12b-it`](https://huggingface.co/google/gemma-3-12b-it) |
Place them under `checkpoints/`:
```text
checkpoints/
+-- echo-longvideo-release.safetensors
`-- gemma-3-12b/
```
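Because the downloads total roughly 70 GB, it can be worth verifying the layout before starting a run. The following is a small pre-flight sketch; `missing_checkpoints` is a hypothetical helper, and the entry names are taken from the layout above.

```python
from pathlib import Path

# Entry names taken from the directory layout above.
EXPECTED = ("echo-longvideo-release.safetensors", "gemma-3-12b")


def missing_checkpoints(root: str = "checkpoints") -> list[str]:
    """Return the expected entries that are absent under `root`."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).exists()]


missing = missing_checkpoints()
print("all checkpoints present" if not missing else f"missing: {missing}")
```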
### 4. Write a story prompt
Create a JSON file under `prompts/`.
Each string in the file is one complete shot description: a single string produces a single shot, while multiple strings produce a multi-shot story whose shots are conditioned through the paired audio-video memory bank.
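As a rough illustration, a story file could be produced as follows. The sketch assumes the file is a plain JSON array of strings, which is an inference from the description above; check the prompt examples shipped with the repository for the exact schema. The helper name `story_json` and the example shot texts are hypothetical.

```python
import json


def story_json(shots: list[str]) -> str:
    """Serialize shot descriptions as a JSON array of strings (assumed schema)."""
    cleaned = [s.strip() for s in shots]
    if not cleaned or any(not s for s in cleaned):
        raise ValueError("every shot needs a non-empty description")
    # ensure_ascii=False keeps non-English prompts readable in the file.
    return json.dumps(cleaned, ensure_ascii=False, indent=2)


text = story_json([
    "A lighthouse keeper climbs the spiral stairs at dusk and says, 'Storm is coming.'",
    "The same keeper lights the lamp while rain begins to hit the windows.",
])
print(text)
```

The result can then be saved with, for example, `Path("prompts/my_story.json").write_text(text, encoding="utf-8")`, where the file name is arbitrary.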
### 5. Run inference
```bash
python inference.py
```
This loads the model once and processes all prompt files under `prompts/`.
> 💡 **Note**: The inference pipeline is optimized to run on lower-VRAM
> GPUs. Peak GPU usage is around **46–50 GB**, at the cost of slightly
> longer per-shot inference time.
Outputs are written to:
```text
inference_result/outputs//inference_/
```
## Configuration
All inference parameters are managed in `configs/inference.yaml`. The file is organized into sections:
| Section | Contents |
| --- | --- |
| `paths` | Checkpoint path, prompts directory, output root |
| `video` | Resolution, frame count, FPS, seed |
| `denoising` | Step list and sigma schedule |
| `memory` | Memory bank size, save mode, LoRA settings |
| `audio_memory` | Audio window, mel-spectrogram params |
| `inference` | Device, dtype, grad scale |
### Override via CLI
Any YAML parameter can be overridden from the command line:
```bash
python inference.py --seed 42 --num-frames 121 --video-height 480 --video-width 832
```
Use a custom config file:
```bash
python inference.py --config configs/my_experiment.yaml
```
The Python entrypoint exposes the full configuration surface:
```bash
python inference.py --help
```
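The override behavior described in this section follows a common pattern: load defaults from the config, then let any flag that was explicitly passed win. The sketch below illustrates that pattern with `argparse`. It is not the repository's code; `resolve` is a hypothetical name, the flat `DEFAULTS` dict stands in for the YAML file, and the seed default of 0 is made up (the frame count and resolution mirror the default setting quoted in this README).

```python
import argparse

# Hypothetical stand-in for values loaded from configs/inference.yaml.
DEFAULTS = {"seed": 0, "num_frames": 241, "video_height": 736, "video_width": 1280}


def resolve(argv: list[str], defaults: dict = DEFAULTS) -> dict:
    """Merge CLI flags over config defaults; unset flags leave defaults alone."""
    parser = argparse.ArgumentParser()
    for key in defaults:
        # default=None lets us tell "not passed" apart from an explicit value.
        parser.add_argument(f"--{key.replace('_', '-')}", type=int, default=None)
    args = vars(parser.parse_args(argv))
    overrides = {k: v for k, v in args.items() if v is not None}
    return {**defaults, **overrides}


print(resolve(["--seed", "42", "--num-frames", "121"]))
# {'seed': 42, 'num_frames': 121, 'video_height': 736, 'video_width': 1280}
```

Using `None` as the parser default is the key design choice: it keeps a flag that was never passed from silently overwriting a value set in the YAML file.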
## Hardware
Peak GPU usage is around **46–50 GB** for the default setting of **241 frames at 25 fps, 1280 x 736**, so a single H100/A100-class (80 GB) GPU is sufficient. A 48 GB GPU sits at the lower edge of that range and may need the reduced settings below.
For smaller GPUs, reduce resolution/frames:
```bash
python inference.py --num-frames 121 --video-height 480 --video-width 832
```
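To get a feel for how much the reduced setting shrinks the workload, the sketch below compares pixel-frame volume (frames x height x width) between the two settings in this README. Memory does not necessarily scale linearly with this number, so treat it as a rough proxy. The `check_setting` constraint (frame count of the form 8k + 1, sides divisible by 32) is an assumption based on the usual LTX-family convention; both settings shown here happen to satisfy it.

```python
def check_setting(frames: int, height: int, width: int) -> bool:
    """Assumed LTX-style constraint: frames = 8k + 1, sides divisible by 32."""
    return frames % 8 == 1 and height % 32 == 0 and width % 32 == 0


def voxels(frames: int, height: int, width: int) -> int:
    """Pixel-frame volume, a rough proxy for activation size."""
    return frames * height * width


default = (241, 736, 1280)  # frames, height, width of the default setting
reduced = (121, 480, 832)   # the reduced setting from the command above

print(check_setting(*default), check_setting(*reduced))  # True True
ratio = voxels(*default) / voxels(*reduced)
print(f"default is about {ratio:.1f}x the pixel-frame volume of the reduced setting")
```

The reduced setting processes roughly 4.7 times fewer pixel-frames per shot, at the cost of a shorter (4.84 s) and lower-resolution clip.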
## TODO List
- [x] Release inference code
- [x] Release model checkpoints
- [x] Add prompt examples
- [ ] Release Director Agent
## Links
- Project page: [`https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/`](https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/)
- Repository: [`https://github.com/jd-opensource/JoyAI-Echo`](https://github.com/jd-opensource/JoyAI-Echo)
- Hugging Face: [`https://huggingface.co/jdopensource/JoyAI-Echo`](https://huggingface.co/jdopensource/JoyAI-Echo)
## Acknowledgements
We gratefully acknowledge the open-source projects this work builds upon, in particular [LTX-2.3](https://huggingface.co/Lightricks/LTX-2.3) for the base video generator and [Gemma](https://huggingface.co/google/gemma-3-12b-it) for the text encoder. Thanks also to the broader research community whose contributions made this release possible.
## Citation
If JoyAI-Echo helps your research or products, please cite:
```bibtex
@techreport{echo2026longvideo,
title = {JoyAI-Echo: Pushing the Frontier of Long Video Generation},
author = {{Echo Team @ Joy Future Academy, JD}},
institution = {Joy Future Academy, JD},
year = {2026},
month = {May}
}
```
## License
This project is based on LTX-2 by Lightricks Ltd.
Portions of the original LTX-2 codebase have been modified by JD.com for academic and research purposes only.
This project is not intended for commercial use. For commercial use of LTX-2 or its derivatives, please contact Lightricks Ltd.
All original copyright, license, patent, trademark, and attribution notices from LTX-2 are retained.
This project remains subject to the LTX-2 Community License Agreement.