---
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
pipeline_tag: text-to-video
tags:
- text-to-video
- video-generation
- audio-video-generation
- long-video
- multi-shot
- dmd
library_name: ltx-video
---

<p align="center">
<img src="assets/image.png" alt="JoyAI-Echo generated video gallery" width="100%">
</p>

<div align="center">
<h1>JoyAI-Echo</h1>
<p><strong>Pushing the Frontier of Long Video Generation</strong></p>
<p>Official model weights for <strong>minute-level multi-shot audio-video generation</strong> with a distilled DMD generator, paired cross-modal memory, and story-level consistency.</p>
<p><strong>For academic research and non-commercial use only.</strong></p>
<p>
<a href="https://github.com/jd-opensource/JoyAI-Echo/blob/main/joyai-echo%20tech%20report.pdf"><b>Paper</b></a> |
<a href="https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/"><b>Project Page</b></a> |
<a href="https://github.com/jd-opensource/JoyAI-Echo"><b>Inference Code</b></a> |
<a href="#model-details"><b>Model</b></a> |
<a href="#usage"><b>Usage</b></a> |
<a href="#results"><b>Results</b></a> |
<a href="#citation"><b>Citation</b></a>
</p>
<p>
<img src="https://img.shields.io/badge/Task-Text--to--Video-blue?style=flat-square" alt="Text-to-Video">
<img src="https://img.shields.io/badge/Modality-Audio%2BVideo-purple?style=flat-square" alt="Audio + Video">
<img src="https://img.shields.io/badge/Long%20Video-5%20min-d61f2c?style=flat-square" alt="5 minute long video">
<img src="https://img.shields.io/badge/Release-Model%20Weights-black?style=flat-square" alt="Model Weights">
</p>
</div>

## Model Summary

**JoyAI-Echo** is a long-form, multi-shot, audio-video generation framework that breaks the barriers of error accumulation, weak temporal coherence, and prohibitive latency in long video generation. A cross-modal audio-visual memory bank preserves character appearance and voice timbre consistently over **five-minute** videos, while a post-training pipeline combining memory-based reinforcement learning with distribution matching distillation (DMD) delivers a **7.5× inference speedup** without sacrificing quality.

JoyAI-Echo decisively outperforms *HappyOyster* (directing mode) on long-form generation and even surpasses the short-video specialist *Wan 2.6* on human-centric tasks.

This repository hosts the **released checkpoint**. Inference code is released separately; see the [Usage](#usage) section.

## Model Details

- **Developed by:** Echo Team @ Joy Future Academy, JD
- **Model type:** Text-to-(Audio+Video) diffusion transformer, DMD 8-step
- **Modality:** Text → synchronized video + audio
- **Backbone:** Built on top of [LTX-Video](https://github.com/Lightricks/LTX-Video)
- **Text encoder:** [`google/gemma-3-12b-it`](https://huggingface.co/google/gemma-3-12b-it) (downloaded separately)
- **Resolution / length (by default):** 1280 × 736, 241 frames @ 25 fps per shot
- **Max story length:** up to 5 minutes (multi-shot)
- **License:** LTX-2 Community License Agreement

## Highlights

- **Minute-level multi-shot stories**: generate a sequence of coherent shots from one prompt JSON.
- **DMD-distilled few-step inference**: ~7.5× faster than the original pipeline.
- **Joint audio-video generation**: one pipeline produces synchronized video and audio.
- **Paired cross-modal memory bank**: conditions each new shot on prior visual identity and voice context for story-level consistency.

## Demo Gallery

Explore long-form and short-form JoyAI-Echo cases on the [Project Page](https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/).

## Usage

Inference is run with the standalone **JoyAI-Echo** inference repository.

### 1. Download the checkpoint

```bash
huggingface-cli download jdopensource/JoyAI-Echo \
  --local-dir checkpoints
```

Also download the Gemma text encoder:

```bash
huggingface-cli download google/gemma-3-12b-it \
  --local-dir checkpoints/gemma-3-12b
```

Expected layout:

```text
checkpoints/
├── echo-longvideo-release.safetensors
└── gemma-3-12b/
```
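
Before moving on, it can help to confirm that both downloads landed where the expected layout says they should. The following is a minimal sketch and is not part of the JoyAI-Echo repository: the helper name `check_checkpoint_layout` is hypothetical, and the file and directory names are simply copied from the layout above.

```python
from pathlib import Path

# Names copied from the expected layout in this section.
EXPECTED_FILE = "echo-longvideo-release.safetensors"
EXPECTED_DIR = "gemma-3-12b"


def check_checkpoint_layout(root: str = "checkpoints") -> list[str]:
    """Return a list of problems found under `root` (empty if the layout matches)."""
    base = Path(root)
    problems = []
    if not (base / EXPECTED_FILE).is_file():
        problems.append(f"missing file: {base / EXPECTED_FILE}")
    gemma_dir = base / EXPECTED_DIR
    if not gemma_dir.is_dir():
        problems.append(f"missing directory: {gemma_dir}")
    elif not any(gemma_dir.iterdir()):
        problems.append(f"empty directory: {gemma_dir}")
    return problems


if __name__ == "__main__":
    issues = check_checkpoint_layout()
    print("layout OK" if not issues else "\n".join(issues))
```

The check only looks at names and presence; it does not validate file contents or sizes.
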
### 2. Get the inference code

```bash
git clone https://github.com/jd-opensource/JoyAI-Echo.git
cd JoyAI-Echo
```

Environment: **Python 3.11 + PyTorch 2.8 + CUDA 12.8** (see the inference repo's `environment.yml` / `requirements.txt`).

### 3. Write a story prompt

**Enhance your prompt first.** We provide two prompt enhancers, which are system prompts that expand a short story or idea into well-formed shot prompts: **`prompts/long_story_writer_system_prompt.md`** for long, multi-shot video, and **`prompts/short_story_writer_system_prompt.md`** for single-shot short video. We **strongly recommend** running your input through the matching enhancer before inference; un-enhanced prompts tend to produce noticeably weaker results.

Create a JSON file under `prompts/`. Each file is a single object with a `prompts` list, where **every string is one complete shot**. A single string produces one shot; multiple strings produce a multi-shot story, with each new shot conditioned on the previous ones through the paired audio-video memory bank.

Inside each string, write these parts in order:

| Part | What to describe |
| --- | --- |
| **Roles & Subjects** | Describe the appearance of all visible people, including age, build, hair, face, wardrobe, and speaking voice timbre when applicable. |
| **Action & Dialogue** | What the subject does and says. |
| **Style** | The overall visual and emotional aesthetic, e.g. realistic motorsport film language, cool daylight, restrained cinematic tension. |
| **Camera Movement** | The shot type and framing or movement, e.g. a stable close-up on the face, or a medium shot from the waist up. |
| **Background** | The setting and scene details behind the subject. |
| **Sound Effects & BGM** | The sounds in the scene and the background music, e.g. room tone, wind, footsteps and fabric, with a soft low music bed under the dialogue or no background music. |
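
The file format described above can be sketched as follows. This is a minimal illustration that assumes only what this section states: a single JSON object with a `prompts` list of strings, one string per shot. The helpers `build_shot` and `write_story`, the part keys, and the sample shot text are all hypothetical, and joining the six parts with single spaces is an assumption rather than a documented requirement.

```python
import json
from pathlib import Path

# Part order as listed in the table above (key names are illustrative).
PART_ORDER = [
    "roles_subjects", "action_dialogue", "style",
    "camera_movement", "background", "sound_bgm",
]


def build_shot(parts: dict[str, str]) -> str:
    """Join the six parts, in table order, into one shot string."""
    missing = [name for name in PART_ORDER if not parts.get(name)]
    if missing:
        raise ValueError(f"missing parts: {missing}")
    # Assumption: parts are separated by a single space.
    return " ".join(parts[name].strip() for name in PART_ORDER)


def write_story(shots: list[str], path: str) -> None:
    """Write {"prompts": [...]} where every string is one complete shot."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    payload = json.dumps({"prompts": shots}, indent=2, ensure_ascii=False)
    out.write_text(payload, encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical sample shot; replace with enhanced prompt text.
    shot = build_shot({
        "roles_subjects": "A race driver in her thirties with short dark hair and a white racing suit; calm, low voice.",
        "action_dialogue": 'She lowers her visor and says, "One more lap."',
        "style": "Realistic motorsport film language, cool daylight, restrained cinematic tension.",
        "camera_movement": "A stable close-up on the face.",
        "background": "A pit lane with crew members working behind her.",
        "sound_bgm": "Room tone, wind, footsteps and fabric, with no background music.",
    })
    print(json.dumps({"prompts": [shot]}, indent=2, ensure_ascii=False))
```

Calling `write_story([shot_1, shot_2], "prompts/my_story.json")` would then produce a two-shot story file; the file name here is only an example.
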
### 4. Run

```bash
python inference.py
```

Outputs land in `inference_result/outputs/<prompt-name>/inference_<timestamp>/`.
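
When a prompt has been run several times, the newest run directory can be located with a small helper like the one below. This is only a sketch: `latest_run_dir` is a hypothetical name, and because the exact `<timestamp>` format is not documented here, it orders runs by file-system modification time rather than by parsing the directory name.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(prompt_name: str,
                   root: str = "inference_result/outputs") -> Optional[Path]:
    """Return the most recently modified inference_* directory, or None."""
    candidates = (Path(root) / prompt_name).glob("inference_*")
    runs = [p for p in candidates if p.is_dir()]
    # Sort by modification time instead of parsing the timestamp in the name.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)
```
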
## Hardware

Peak GPU memory is **~46–50 GB** at the default 1280 × 736 × 241 frame setting, so a single H100/A100 (80 GB) or 48 GB GPU is sufficient. For smaller GPUs, lower the resolution or frame count:

```bash
python inference.py --num-frames 121 --video-height 480 --video-width 832
```
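
As a rough guide to how much the reduced setting shrinks the workload, the sketch below compares the number of pixels per shot (width × height × frames) for the two settings mentioned in this section. It is arithmetic only: peak memory also includes fixed costs such as model weights, so the ratio should not be read as a memory prediction. The helper name `pixel_volume` is illustrative.

```python
def pixel_volume(width: int, height: int, num_frames: int) -> int:
    """Total pixels in one shot: width x height x frames."""
    return width * height * num_frames


default = pixel_volume(1280, 736, 241)  # default setting
reduced = pixel_volume(832, 480, 121)   # reduced setting from the command above

print(f"default: {default:,} pixels")
print(f"reduced: {reduced:,} pixels")
print(f"reduced / default = {reduced / default:.2f}")
```

The reduced setting processes roughly one fifth of the pixels of the default.
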
## Results

### Reported Scale

| Item | Value |
| --- | ---: |
| Long-form coherent story length | **5 min** |
| Generation speedup over the original multi-step pipeline | **7.5×** |
| Benchmark stories | **100** |
| Generated evaluation shots | **3,000** |
| Frames per shot | **241 @ 25 fps** |
### Human Evaluation

GSB (Good/Same/Bad) user study on long- and short-video generation. The numbers denote the percentage of user preferences.

| Aspect (Long Video) | JoyAI-Echo | Tie | HappyOyster (Directing) |
| --- | ---: | ---: | ---: |
| Visual aesthetics | **63.6%** | 8.8% | 27.6% |
| Audio quality | **81.7%** | 6.5% | 11.8% |
| Prompt following | **80.6%** | 13.5% | 5.9% |
| IP consistency | **59.4%** | 12.9% | 27.7% |

| Aspect (Short Video) | JoyAI-Echo | Tie | Wan 2.6 |
| --- | ---: | ---: | ---: |
| Visual aesthetics | **58.8%** | 14.7% | 26.5% |
| Audio quality | 32.3% | 30.9% | 36.8% |
| Prompt following | 33.8% | 36.8% | 29.4% |
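
One common way to condense a GSB table is the net preference, meaning the win share minus the loss share. The sketch below computes it from the percentages listed above. Net preference is not a metric reported in this card; it is shown here only as an illustrative summary, and the helper name `net_preference` is hypothetical.

```python
# (JoyAI-Echo %, tie %, baseline %), copied from the two tables above.
LONG_VS_HAPPYOYSTER = {
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "IP consistency": (59.4, 12.9, 27.7),
}
SHORT_VS_WAN = {
    "Visual aesthetics": (58.8, 14.7, 26.5),
    "Audio quality": (32.3, 30.9, 36.8),
    "Prompt following": (33.8, 36.8, 29.4),
}


def net_preference(win: float, tie: float, loss: float) -> float:
    """Win share minus loss share, in percentage points."""
    # Each row should account for all votes.
    assert abs(win + tie + loss - 100.0) < 0.05
    return round(win - loss, 1)


for title, table in [("Long video", LONG_VS_HAPPYOYSTER),
                     ("Short video", SHORT_VS_WAN)]:
    print(title)
    for aspect, row in table.items():
        print(f"  {aspect:<18} {net_preference(*row):+.1f}")
```

A positive value means JoyAI-Echo was preferred more often than the baseline on that aspect; a negative value means the baseline was preferred more often.
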
## Links

- Project page: [`https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/`](https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/)
- Inference code: [`https://github.com/jd-opensource/JoyAI-Echo`](https://github.com/jd-opensource/JoyAI-Echo)
- HuggingFace: [`https://huggingface.co/jdopensource/JoyAI-Echo`](https://huggingface.co/jdopensource/JoyAI-Echo)

## Acknowledgements

We gratefully acknowledge the open-source projects this work builds upon, in particular [LTX-2.3](https://huggingface.co/Lightricks/LTX-2.3) for the base video generator and [Gemma](https://huggingface.co/google/gemma-3-12b-it) for the text encoder. Thanks to the broader research community whose contributions made this release possible.

## Citation

If JoyAI-Echo helps your research or products, please cite:

```bibtex
@techreport{echo2026JoyEcho,
  title       = {JoyAI-Echo: Pushing the Frontier of Long Video Generation},
  author      = {{Echo Team @ Joy Future Academy, JD}},
  institution = {Joy Future Academy, JD},
  year        = {2026},
  month       = {May}
}
```

## License

This project is based on LTX-2 by Lightricks Ltd.

Portions of the original LTX-2 codebase have been modified by JD.com for academic and research purposes only.

This project is not intended for commercial use. For commercial use of LTX-2 or its derivatives, please contact Lightricks Ltd.

All original copyright, license, patent, trademark, and attribution notices from LTX-2 are retained.

This project remains subject to the LTX-2 Community License Agreement.