---
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
pipeline_tag: text-to-video
tags:
- text-to-video
- video-generation
- audio-video-generation
- long-video
- multi-shot
- dmd
library_name: ltx-video
---
<p align="center">
<img src="assets/image.png" alt="JoyAI-Echo generated video gallery" width="100%">
</p>
<div align="center">
<h1>JoyAI-Echo</h1>
<p><strong>🎬 Pushing the Frontier of Long Video Generation</strong></p>
<p>Official model weights for <strong>minute-level multi-shot audio-video generation</strong> with a distilled DMD generator, paired cross-modal memory, and story-level consistency.</p>
<p><strong>For academic research and non-commercial use only.</strong></p>
<p>
<a href="https://github.com/jd-opensource/JoyAI-Echo/blob/main/joyai-echo%20tech%20report.pdf"><b>📄 Paper</b></a> |
<a href="https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/"><b>🌐 Project Page</b></a> |
<a href="https://github.com/jd-opensource/JoyAI-Echo"><b>💻 Inference Code</b></a> |
<a href="#model-details"><b>🧬 Model</b></a> |
<a href="#usage"><b>🚀 Usage</b></a> |
<a href="#results"><b>📊 Results</b></a> |
<a href="#citation"><b>📝 Citation</b></a>
</p>
<p>
<img src="https://img.shields.io/badge/Task-Text--to--Video-blue?style=flat-square" alt="Text-to-Video">
<img src="https://img.shields.io/badge/Modality-Audio%2BVideo-purple?style=flat-square" alt="Audio + Video">
<img src="https://img.shields.io/badge/Long%20Video-5%20min-d61f2c?style=flat-square" alt="5 minute long video">
<img src="https://img.shields.io/badge/Release-Model%20Weights-black?style=flat-square" alt="Model Weights">
</p>
</div>
## Model Summary
**JoyAI-Echo** is a long-form, multi-shot, audio-video generation framework that breaks the barriers of error accumulation, weak temporal coherence, and prohibitive latency in long video generation. A cross-modal audio-visual memory bank preserves character appearance and voice timbre consistently over **five-minute** videos, while a post-training pipeline combining memory-based reinforcement learning with distribution matching distillation (DMD) delivers a **7.5× inference speedup** without sacrificing quality.
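In the human evaluation reported under [Results](#results), JoyAI-Echo is preferred over *HappyOyster* (directing mode) on all four long-video aspects, and over the short-video specialist *Wan 2.6* on visual aesthetics and prompt following, while *Wan 2.6* is slightly preferred on audio quality.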
This repository hosts the **released checkpoint**. Inference code is released separately; see the [Usage](#usage) section.
## Model Details
- **Developed by:** Echo Team @ Joy Future Academy, JD
- **Model type:** Text-to-(Audio+Video) diffusion transformer, DMD 8-step
- **Modality:** Text → synchronized video + audio
- **Backbone:** Built on top of [LTX-Video](https://github.com/Lightricks/LTX-Video)
- **Text encoder:** [`google/gemma-3-12b-it`](https://huggingface.co/google/gemma-3-12b-it) (downloaded separately)
- **Resolution / length (by default):** 1280 × 736, 241 frames @ 25 fps per shot
- **Max story length:** up to 5 minutes (multi-shot)
- **License:** LTX-2 Community License Agreement
## Highlights
- 🎞️ **Minute-level multi-shot stories**: generate a sequence of coherent shots from one prompt JSON.
- ⚡ **DMD-distilled few-step inference**: ~7.5× faster than the original pipeline.
- 🔊 **Joint audio-video generation**: one pipeline produces synchronized video and audio.
- 🧠 **Paired cross-modal memory bank**: conditions each new shot on prior visual identity and voice context for story-level consistency.
## Demo Gallery
Explore long-form and short-form JoyAI-Echo cases on the [Project Page](https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/). 🍿
## Usage
Inference is run with the standalone **JoyAI-Echo** inference repository.
### 1. Download the checkpoint
```bash
huggingface-cli download jdopensource/JoyAI-Echo \
  --local-dir checkpoints
```
Also download the Gemma text encoder:
```bash
huggingface-cli download google/gemma-3-12b-it \
  --local-dir checkpoints/gemma-3-12b
```
Expected layout:
```text
checkpoints/
├── echo-longvideo-release.safetensors
└── gemma-3-12b/
```
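Before running inference, it can help to confirm that both downloads landed where the layout above expects them. The following is a minimal sketch, not part of the inference repository; the helper name `missing_checkpoint_items` is hypothetical, and only the two entries shown in the layout are checked.

```python
from pathlib import Path

# Names copied from the expected layout; nothing else is assumed about the files.
CHECKPOINT_FILE = "echo-longvideo-release.safetensors"
TEXT_ENCODER_DIR = "gemma-3-12b"


def missing_checkpoint_items(root):
    """Return the expected entries that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    missing = []
    if not (root / CHECKPOINT_FILE).is_file():
        missing.append(CHECKPOINT_FILE)
    if not (root / TEXT_ENCODER_DIR).is_dir():
        missing.append(TEXT_ENCODER_DIR + "/")
    return missing
```

Calling `missing_checkpoint_items("checkpoints")` returns an empty list when both entries are present, and the names of the missing entries otherwise.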
### 2. Get the inference code
```bash
git clone https://github.com/jd-opensource/JoyAI-Echo.git
cd JoyAI-Echo
```
Environment: **Python 3.11 + PyTorch 2.8 + CUDA 12.8** (see the inference repo's `environment.yml` / `requirements.txt`).
### 3. Write a story prompt
**Enhance your prompt first.** We provide prompt enhancers (system prompts that expand a short story or idea into well-formed shot prompts): **`prompts/long_story_writer_system_prompt.md`** for long, multi-shot video, and **`prompts/short_story_writer_system_prompt.md`** for single-shot short video. We **strongly recommend** running your input through the matching enhancer before inference; un-enhanced prompts tend to produce noticeably weaker results.
Create a JSON file under `prompts/`. Each file is a single object with a `prompts` list, where **every string is one complete shot**. A single string produces one shot; multiple strings produce a multi-shot story, with each new shot conditioned on the previous ones through the paired audio-video memory bank.
Inside each string, write these parts in order:
| Part | What to describe |
| --- | --- |
| **Roles & Subjects** | Describe the appearance of all visible people, including age, build, hair, face, wardrobe, and speaking voice timbre when applicable. |
| **Action & Dialogue** | What the subject does and says. |
| **Style** | The overall visual and emotional aesthetic (e.g. realistic motorsport film language, cool daylight, restrained cinematic tension). |
| **Camera Movement** | The shot type and framing or movement (e.g. a stable close-up on the face, or a medium shot from the waist up). |
| **Background** | The setting and scene details behind the subject. |
| **Sound Effects & BGM** | The sounds in the scene and the background music (e.g. room tone, wind, footsteps and fabric, with a soft low music bed under the dialogue or no background music). |
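The file format described above (one JSON object with a `prompts` list, one string per shot) can be sketched in Python as follows. This is an illustration under stated assumptions: the helper `write_story_prompt`, the file name `my_story.json`, and the shot text are all invented for the example, and real shot strings should come from the matching prompt enhancer.

```python
import json
from pathlib import Path


def write_story_prompt(path, shots):
    """Write shot strings to a JSON file shaped as {"prompts": [...]} (hypothetical helper)."""
    shots = [s.strip() for s in shots]
    if not shots or any(not s for s in shots):
        raise ValueError("every shot must be a non-empty string")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"prompts": shots}
    path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


# Placeholder shots, written in the part order from the table:
# roles, action and dialogue, style, camera, background, sound.
EXAMPLE_SHOTS = [
    "A woman in her thirties with short black hair and a grey wool coat, "
    "speaking in a calm, low voice. She opens a notebook and says: "
    "\"We start at dawn.\" Realistic film look, cool daylight. "
    "Stable close-up on her face. A quiet train platform behind her. "
    "Room tone and distant footsteps, no background music.",
    "The same woman in the grey wool coat, same calm, low voice. "
    "She closes the notebook, looks up and says: \"The train is late.\" "
    "Realistic film look, cool daylight. Medium shot from the waist up. "
    "The same train platform, now with a few passengers. "
    "Wind and fabric rustle, with a soft low music bed under the dialogue.",
]
```

For example, `write_story_prompt("prompts/my_story.json", EXAMPLE_SHOTS)` would produce a two-shot story file. Repeating the character description in every shot is a choice made for this example, not a documented requirement.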
### 4. Run
```bash
python inference.py
```
Outputs land in `inference_result/outputs/<prompt-name>/inference_<timestamp>/`.
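Since each run writes to a new `inference_<timestamp>` directory, a small helper can locate the most recent one for a given prompt. The sketch below makes two assumptions: the timestamp format is not documented here, so runs are ordered by directory modification time rather than by name, and `latest_run_dir` is a hypothetical helper that does not exist in the inference repository.

```python
from pathlib import Path


def latest_run_dir(prompt_name, outputs_root="inference_result/outputs"):
    """Return the most recently modified inference_* directory, or None if there is none."""
    base = Path(outputs_root) / prompt_name
    if not base.is_dir():
        return None
    runs = [p for p in base.glob("inference_*") if p.is_dir()]
    # Order by modification time because the timestamp format is not specified.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)
```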
## Hardware
Peak GPU memory is **~46–50 GB** at the default 1280 × 736 × 241 frame setting, so a single H100/A100 (80 GB) or 48 GB GPU is sufficient. For smaller GPUs, lower the resolution or frame count:
```bash
python inference.py --num-frames 121 --video-height 480 --video-width 832
```
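To get a rough sense of how much smaller the reduced setting is, the two documented configurations can be compared by total pixel volume. This is only a sketch: peak memory also includes model weights and the text encoder, so it will not shrink in proportion to this ratio. The pattern check reflects an observation, not a documented rule of JoyAI-Echo: both settings use dimensions divisible by 32 and a frame count of the form 8k + 1, which matches the constraint stated for LTX-Video.

```python
def frame_volume(width, height, num_frames):
    """Total pixel count across all frames of one shot."""
    return width * height * num_frames


def fits_observed_pattern(width, height, num_frames):
    """True if both dimensions are multiples of 32 and the frame count is 8k + 1.

    Assumption: this mirrors the two settings shown in this README; the
    inference code may or may not enforce it.
    """
    return width % 32 == 0 and height % 32 == 0 and num_frames % 8 == 1


DEFAULT_SETTING = (1280, 736, 241)  # width, height, frames
REDUCED_SETTING = (832, 480, 121)

ratio = frame_volume(*REDUCED_SETTING) / frame_volume(*DEFAULT_SETTING)
print(f"reduced / default pixel volume: {ratio:.3f}")
```

The reduced setting carries a little over a fifth of the default pixel volume.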
## Results
### Reported Scale
| Item | Value |
| --- | ---: |
| 🎬 Long-form coherent story length | **5 min** |
| ⚡ Generation speedup over the original multi-step pipeline | **7.5×** |
| 📚 Benchmark stories | **100** |
| 🎞️ Generated evaluation shots | **3,000** |
| 🕒 Frames per shot | **241 @ 25 fps** |
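The figures in this table are consistent with one another, as a short calculation shows. The sketch assumes that the 3,000 evaluation shots are spread evenly over the 100 stories and that shots play back to back with no overlap or transitions; neither assumption is stated in the table.

```python
FRAMES_PER_SHOT = 241
FPS = 25
BENCHMARK_STORIES = 100
EVALUATION_SHOTS = 3000

shot_seconds = FRAMES_PER_SHOT / FPS                      # duration of one shot
shots_per_story = EVALUATION_SHOTS / BENCHMARK_STORIES    # assumes an even split
story_minutes = shots_per_story * shot_seconds / 60       # assumes back-to-back shots

print(f"{shot_seconds:.2f} s per shot")
print(f"{shots_per_story:.0f} shots per story")
print(f"{story_minutes:.2f} min per story")
```

Under these assumptions each shot lasts 9.64 s and a 30-shot story runs about 4.8 minutes, in line with the reported 5-minute story length.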
### Human Evaluation
GSB (Good/Same/Bad) user study on long- and short-video generation. Each number is the percentage of user preferences for that option.
| Aspect (Long Video) | JoyAI-Echo | Tie | HappyOyster (Directing) |
| --- | ---: | ---: | ---: |
| Visual aesthetics | **63.6%** | 8.8% | 27.6% |
| Audio quality | **81.7%** | 6.5% | 11.8% |
| Prompt following | **80.6%** | 13.5% | 5.9% |
| IP consistency | **59.4%** | 12.9% | 27.7% |

| Aspect (Short Video) | JoyAI-Echo | Tie | Wan 2.6 |
| --- | ---: | ---: | ---: |
| Visual aesthetics | **58.8%** | 14.7% | 26.5% |
| Audio quality | 32.3% | 30.9% | 36.8% |
| Prompt following | 33.8% | 36.8% | 29.4% |
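One way to read the GSB tables is to drop the ties and ask how often JoyAI-Echo wins among decided votes. The sketch below computes that from the percentages above. The metric and the helper name `win_rate_excluding_ties` are added for illustration and are not reported by the authors.

```python
# (JoyAI-Echo %, tie %, baseline %) copied from the two tables above.
LONG_VS_HAPPYOYSTER = {
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "IP consistency": (59.4, 12.9, 27.7),
}
SHORT_VS_WAN = {
    "Visual aesthetics": (58.8, 14.7, 26.5),
    "Audio quality": (32.3, 30.9, 36.8),
    "Prompt following": (33.8, 36.8, 29.4),
}


def win_rate_excluding_ties(ours, baseline):
    """Share of decided (non-tie) votes that favour JoyAI-Echo, in percent."""
    return 100.0 * ours / (ours + baseline)


for title, table in (("Long video", LONG_VS_HAPPYOYSTER), ("Short video", SHORT_VS_WAN)):
    for aspect, (ours, tie, baseline) in table.items():
        rate = win_rate_excluding_ties(ours, baseline)
        print(f"{title:11s} | {aspect:17s} | {rate:5.1f}% of decided votes")
```

A value above 50% means JoyAI-Echo was chosen more often than the baseline once ties are set aside. By this measure only short-video audio quality falls below 50%.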
## Links
- Project page: [`https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/`](https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/)
- Inference code: [`https://github.com/jd-opensource/JoyAI-Echo`](https://github.com/jd-opensource/JoyAI-Echo)
- HuggingFace: [`https://huggingface.co/jdopensource/JoyAI-Echo`](https://huggingface.co/jdopensource/JoyAI-Echo)
## Acknowledgements
We gratefully acknowledge the open-source projects this work builds upon, in particular [LTX2.3](https://huggingface.co/Lightricks/LTX-2.3) for the base video generator and [Gemma](https://huggingface.co/google/gemma-3-12b-it) for the text encoder. Thanks to the broader research community whose contributions made this release possible.
## Citation
If JoyAI-Echo helps your research or products, please cite:
```bibtex
@techreport{echo2026JoyEcho,
title = {JoyAI-Echo: Pushing the Frontier of Long Video Generation},
author = {{Echo Team @ Joy Future Academy, JD}},
institution = {Joy Future Academy, JD},
year = {2026},
month = {May}
}
```
## License
This project is based on LTX-2 by Lightricks Ltd.
Portions of the original LTX-2 codebase have been modified by JD.com for academic and research purposes only.
This project is not intended for commercial use. For commercial use of LTX-2 or its derivatives, please contact Lightricks Ltd.
All original copyright, license, patent, trademark, and attribution notices from LTX-2 are retained.
This project remains subject to the LTX-2 Community License Agreement.
