	---
	license: other
	license_name: ltx-2-community-license-agreement
	license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
	pipeline_tag: text-to-video
	tags:
	- text-to-video
	- video-generation
	- audio-video-generation
	- long-video
	- multi-shot
	- dmd
	library_name: ltx-video
	---

	<p align="center">
	<img src="assets/image.png" alt="JoyAI-Echo generated video gallery" width="100%">
	</p>

	<div align="center">

	<h1>JoyAI-Echo</h1>

	<p><strong>🎬 Pushing the Frontier of Long Video Generation</strong></p>

	<p>Official model weights for <strong>minute-level multi-shot audio-video generation</strong> with a distilled DMD generator, paired cross-modal memory, and story-level consistency.</p>

	<p><strong>For academic research and non-commercial use only.</strong></p>

	<p>
	<a href="https://github.com/jd-opensource/JoyAI-Echo/blob/main/joyai-echo%20tech%20report.pdf"><b>📄 Paper</b></a> |
	<a href="https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/"><b>🌐 Project Page</b></a> |
	<a href="https://github.com/jd-opensource/JoyAI-Echo"><b>💻 Inference Code</b></a> |
	<a href="#model-details"><b>🧬 Model</b></a> |
	<a href="#usage"><b>🚀 Usage</b></a> |
	<a href="#results"><b>📊 Results</b></a> |
	<a href="#citation"><b>📝 Citation</b></a>
	</p>

	<p>
	<img src="https://img.shields.io/badge/Task-Text--to--Video-blue?style=flat-square" alt="Text-to-Video">
	<img src="https://img.shields.io/badge/Modality-Audio%2BVideo-purple?style=flat-square" alt="Audio + Video">
	<img src="https://img.shields.io/badge/Long%20Video-5%20min-d61f2c?style=flat-square" alt="5 minute long video">
	<img src="https://img.shields.io/badge/Release-Model%20Weights-black?style=flat-square" alt="Model Weights">
	</p>

	</div>

	## Model Summary

	JoyAI-Echo is a long-form, multi-shot, audio-video generation framework that breaks the barriers of error accumulation, weak temporal coherence, and prohibitive latency in long video generation. A cross-modal audio-visual memory bank preserves character appearance and voice timbre consistently over five-minute videos, while a post-training pipeline combining memory-based reinforcement learning with distribution matching distillation (DMD) delivers a 7.5× inference speedup without sacrificing quality.

	In our human evaluation, JoyAI-Echo is preferred over HappyOyster (directing mode) on every long-video aspect, and over the short-video specialist Wan 2.6 on visual aesthetics and prompt following in human-centric tasks; see [Results](#results).

	This repository hosts the released checkpoint. Inference code is released separately — see the [Usage](#usage) section.

	## Model Details

	- Developed by: Echo Team @ Joy Future Academy, JD
	- Model type: Text-to-(Audio+Video) diffusion transformer, DMD 8-step
	- Modality: Text → synchronized video + audio
	- Backbone: Built on top of [LTX-Video](https://github.com/Lightricks/LTX-Video)
	- Text encoder: [`google/gemma-3-12b-it`](https://huggingface.co/google/gemma-3-12b-it) (downloaded separately)
	- Resolution / length (by default): 1280 × 736, 241 frames @ 25 fps per shot
	- Max story length: up to 5 minutes (multi-shot)
	- License: LTX-2 Community License Agreement

	## Highlights

	- 🎞️ Minute-level multi-shot stories: generate a sequence of coherent shots from one prompt JSON.
	- ⚡ DMD-distilled few-step inference: ~7.5× faster than the original pipeline.
	- 🔊 Joint audio-video generation: one pipeline produces synchronized video and audio.
	- 🧠 Paired cross-modal memory bank: conditions each new shot on prior visual identity and voice context for story-level consistency.

	## Demo Gallery

	Explore long-form and short-form JoyAI-Echo cases on the [Project Page](https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/). 🍿

	## Usage

	Inference is run with the standalone JoyAI-Echo inference repository.

	### 1. Download the checkpoint

	```bash
	huggingface-cli download jdopensource/JoyAI-Echo \
	--local-dir checkpoints
	```

	Also download the Gemma text encoder:

	```bash
	huggingface-cli download google/gemma-3-12b-it \
	--local-dir checkpoints/gemma-3-12b
	```

	Expected layout:

	```text
	checkpoints/
	├── echo-longvideo-release.safetensors
	└── gemma-3-12b/
	```
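A quick way to confirm the two downloads landed where the layout above expects them is a small check like the following. This is a minimal sketch, not part of the inference repo: the function name `missing_checkpoint_entries` is hypothetical, and it only tests that the two entries exist, not that their contents are complete.

```python
from pathlib import Path

# Entries taken from the expected layout above.
WEIGHTS_FILE = "echo-longvideo-release.safetensors"
TEXT_ENCODER_DIR = "gemma-3-12b"


def missing_checkpoint_entries(root="checkpoints"):
    """Return the names of expected entries that are absent under `root`."""
    root = Path(root)
    missing = []
    if not (root / WEIGHTS_FILE).is_file():
        missing.append(WEIGHTS_FILE)
    if not (root / TEXT_ENCODER_DIR).is_dir():
        missing.append(TEXT_ENCODER_DIR)
    return missing


if __name__ == "__main__":
    missing = missing_checkpoint_entries()
    print("layout OK" if not missing else f"missing: {missing}")
```

Run it from the directory that contains `checkpoints/`; an empty result means both entries are present.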

	### 2. Get the inference code

	```bash
	git clone https://github.com/jd-opensource/JoyAI-Echo.git
	cd JoyAI-Echo
	```

	Environment: Python 3.11 + PyTorch 2.8 + CUDA 12.8 (see the inference repo's `environment.yml` / `requirements.txt`).

	### 3. Write a story prompt

	Enhance your prompt first. We provide prompt enhancers — system prompts that expand a short story or idea into well-formed shot prompts: `prompts/long_story_writer_system_prompt.md` for long, multi-shot video, and `prompts/short_story_writer_system_prompt.md` for single-shot short video. We strongly recommend running your input through the matching enhancer before inference; un-enhanced prompts tend to produce noticeably weaker results.

	Create a JSON file under `prompts/`. Each file is a single object with a `prompts` list, where every string is one complete shot. A single string produces one shot; multiple strings produce a multi-shot story, with each new shot conditioned on the previous ones through the paired audio-video memory bank.

	Inside each string, write these parts in order:

	| Part | What to describe |
	| --- | --- |
	| Roles & Subjects | The appearance of all visible people, including age, build, hair, face, wardrobe, and speaking voice timbre when applicable. |
	| Action & Dialogue | What the subject does and says. |
	| Style | The overall visual and emotional aesthetic, e.g. realistic motorsport film language, cool daylight, restrained cinematic tension. |
	| Camera Movement | The shot type and framing or movement, e.g. a stable close-up on the face, or a medium shot from the waist up. |
	| Background | The setting and scene details behind the subject. |
	| Sound Effects & BGM | The sounds in the scene and the background music, e.g. room tone, wind, footsteps and fabric, with a soft low music bed under the dialogue or no background music. |
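The file format described above (one object with a `prompts` list, one string per shot, parts written in table order) can be sketched as follows. The helper names, the example shot text, and the choice to join parts with a single space are all assumptions made for illustration; the inference repo may expect additional fields or different phrasing, so treat its own example prompts as the reference.

```python
import json

# Part order copied from the table above.
PART_ORDER = [
    "Roles & Subjects",
    "Action & Dialogue",
    "Style",
    "Camera Movement",
    "Background",
    "Sound Effects & BGM",
]

# Illustrative shot only; the wording is a placeholder, not an official example.
EXAMPLE_SHOT = {
    "Roles & Subjects": "A woman in her thirties with short black hair, a slim "
                        "build, and a grey racing jacket; her voice is low and calm.",
    "Action & Dialogue": 'She fastens her helmet strap and says: "One more lap."',
    "Style": "Realistic motorsport film language, cool daylight, "
             "restrained cinematic tension.",
    "Camera Movement": "A stable close-up on the face.",
    "Background": "A pit lane with parked cars and crew members out of focus.",
    "Sound Effects & BGM": "Room tone, wind, footsteps and fabric, "
                           "no background music.",
}


def build_shot(parts):
    """Join the six parts of one shot into a single string, in table order."""
    missing = [name for name in PART_ORDER if not parts.get(name)]
    if missing:
        raise ValueError(f"shot is missing parts: {missing}")
    # Joining with a space is an assumption; the README only fixes the order.
    return " ".join(parts[name].strip() for name in PART_ORDER)


def build_story(shots):
    """Wrap shot strings in the single-object format: {"prompts": [...]}."""
    return {"prompts": [build_shot(shot) for shot in shots]}


if __name__ == "__main__":
    story = build_story([EXAMPLE_SHOT])  # one string, so one shot
    print(json.dumps(story, ensure_ascii=False, indent=2))
```

Saving the printed JSON under `prompts/` (for example as `prompts/my_story.json`, a hypothetical file name) gives a single-shot story; passing several shot dictionaries to `build_story` gives a multi-shot one.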

	### 4. Run

	```bash
	python inference.py
	```

	Outputs land in `inference_result/outputs/<prompt-name>/inference_<timestamp>/`.
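To pick up the most recent run for a given prompt, a helper along these lines may be convenient. It is a sketch under stated assumptions: the function name `latest_run_dir` and the prompt name `my_story` are hypothetical, and runs are ordered by directory modification time because the format of `<timestamp>` is not documented here.

```python
from pathlib import Path


def latest_run_dir(prompt_name, root="inference_result/outputs"):
    """Return the newest inference_* directory for a prompt, or None."""
    base = Path(root) / prompt_name
    if not base.is_dir():
        return None
    runs = [p for p in base.glob("inference_*") if p.is_dir()]
    # Sort by modification time rather than parsing the timestamp suffix,
    # since its exact format is an unknown here.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    print(latest_run_dir("my_story"))  # hypothetical prompt name
```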

	## Hardware

	Peak GPU memory is ~46–50 GB at the default setting (1280 × 736, 241 frames). A single 80 GB GPU (H100/A100) is sufficient; a 48 GB GPU can also work, although the upper end of that range leaves it no headroom. For smaller GPUs, lower the resolution or frame count:

	```bash
	python inference.py --num-frames 121 --video-height 480 --video-width 832
	```
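To get a feel for how much smaller the reduced setting is, the sketch below compares raw pixel volume (width × height × frames) between the two documented settings. It is only a rough proxy: peak memory need not scale linearly with pixel volume, and no memory figure for the reduced setting is given here. The pattern check reflects what both documented settings happen to share (dimensions that are multiples of 32 and frame counts of the form 8k + 1); treating that as a requirement of `inference.py` is an assumption.

```python
# The two settings documented in this section.
DEFAULT = {"width": 1280, "height": 736, "frames": 241}
REDUCED = {"width": 832, "height": 480, "frames": 121}


def pixel_volume(setting):
    """Total number of pixels across all frames of one shot."""
    return setting["width"] * setting["height"] * setting["frames"]


def relative_volume(setting, baseline=DEFAULT):
    """Pixel volume of `setting` as a fraction of the baseline."""
    return pixel_volume(setting) / pixel_volume(baseline)


def follows_observed_pattern(setting):
    """True if width and height are multiples of 32 and frames is 8k + 1.

    This is the pattern shared by both documented settings, not a
    confirmed constraint of the inference code.
    """
    return (
        setting["width"] % 32 == 0
        and setting["height"] % 32 == 0
        and setting["frames"] % 8 == 1
    )


if __name__ == "__main__":
    print(f"reduced / default pixel volume: {relative_volume(REDUCED):.1%}")
    print("pattern holds:", follows_observed_pattern(REDUCED))
```

The reduced setting carries roughly a fifth of the default pixel volume, which is why it is the suggested starting point for smaller GPUs.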

	## Results

	### Reported Scale

	| Item | Value |
	| --- | ---: |
	| 🎬 Long-form coherent story length | 5 min |
	| ⚡ Generation speedup over the original multi-step pipeline | 7.5× |
	| 📚 Benchmark stories | 100 |
	| 🎞️ Generated evaluation shots | 3,000 |
	| 🕒 Frames per shot | 241 @ 25 fps |
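The figures in this table are consistent with one another, which a few lines of arithmetic make visible. The sketch assumes that shots play back to back with no overlap and that the 3,000 evaluation shots are spread evenly over the 100 benchmark stories; neither is stated explicitly, so the derived values are estimates.

```python
# Values copied from the table above.
FRAMES_PER_SHOT = 241
FPS = 25
BENCHMARK_STORIES = 100
EVALUATION_SHOTS = 3000


def shot_seconds(frames=FRAMES_PER_SHOT, fps=FPS):
    """Duration of one shot in seconds."""
    return frames / fps


def shots_per_story(shots=EVALUATION_SHOTS, stories=BENCHMARK_STORIES):
    """Average shots per story, assuming an even split across stories."""
    return shots / stories


def story_minutes(num_shots):
    """Story length in minutes, assuming shots are simply concatenated."""
    return num_shots * shot_seconds() / 60


if __name__ == "__main__":
    n = shots_per_story()
    print(f"{shot_seconds():.2f} s per shot")
    print(f"{n:.0f} shots per story on average")
    print(f"{story_minutes(n):.2f} min per story")
```

Under these assumptions each shot lasts 9.64 s, a story averages 30 shots, and 30 shots come to about 4.8 minutes, in line with the 5-minute story length reported above.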

	### Human Evaluation

	GSB (Good/Same/Bad) user study on long- and short-video generation. Each number is the percentage of user votes that preferred JoyAI-Echo, judged the pair a tie, or preferred the baseline.

	| Aspect (Long Video) | JoyAI-Echo | Tie | HappyOyster (Directing) |
	| --- | ---: | ---: | ---: |
	| Visual aesthetics | 63.6% | 8.8% | 27.6% |
	| Audio quality | 81.7% | 6.5% | 11.8% |
	| Prompt following | 80.6% | 13.5% | 5.9% |
	| IP consistency | 59.4% | 12.9% | 27.7% |

	| Aspect (Short Video) | JoyAI-Echo | Tie | Wan 2.6 |
	| --- | ---: | ---: | ---: |
	| Visual aesthetics | 58.8% | 14.7% | 26.5% |
	| Audio quality | 32.3% | 30.9% | 36.8% |
	| Prompt following | 33.8% | 36.8% | 29.4% |
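One way to read the two tables side by side is to reduce each row to a single number. The sketch below computes two summaries that are not reported in this README and are added here only as a reading aid: net preference (JoyAI-Echo share minus baseline share, in percentage points) and win rate with ties excluded. The function names are hypothetical; the data is copied from the tables above.

```python
# (JoyAI-Echo %, Tie %, Baseline %) per aspect, copied from the tables above.
LONG_VIDEO = {  # baseline: HappyOyster (Directing)
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "IP consistency": (59.4, 12.9, 27.7),
}
SHORT_VIDEO = {  # baseline: Wan 2.6
    "Visual aesthetics": (58.8, 14.7, 26.5),
    "Audio quality": (32.3, 30.9, 36.8),
    "Prompt following": (33.8, 36.8, 29.4),
}


def net_preference(good, same, bad):
    """Percentage points by which JoyAI-Echo leads (negative: it trails)."""
    return good - bad


def win_rate_excluding_ties(good, same, bad):
    """Share of decisive votes (non-ties) that went to JoyAI-Echo."""
    return good / (good + bad)


if __name__ == "__main__":
    for title, table in (("Long video", LONG_VIDEO), ("Short video", SHORT_VIDEO)):
        print(title)
        for aspect, row in table.items():
            print(f"  {aspect}: net {net_preference(*row):+.1f} pp, "
                  f"win rate {win_rate_excluding_ties(*row):.1%}")
```

Read this way, JoyAI-Echo leads on every long-video aspect and on short-video visual aesthetics and prompt following, while short-video audio quality is the one row where the baseline is ahead (by 4.5 points).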

	## Links

	- Project page: [`https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/`](https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/)
	- Inference code: [`https://github.com/jd-opensource/JoyAI-Echo`](https://github.com/jd-opensource/JoyAI-Echo)
	- HuggingFace: [`https://huggingface.co/jdopensource/JoyAI-Echo`](https://huggingface.co/jdopensource/JoyAI-Echo)

	## Acknowledgements

	We gratefully acknowledge the open-source projects this work builds upon, in particular [LTX-2.3](https://huggingface.co/Lightricks/LTX-2.3) for the base video generator and [Gemma](https://huggingface.co/google/gemma-3-12b-it) for the text encoder. Thanks to the broader research community whose contributions made this release possible.

	## Citation

	If JoyAI-Echo helps your research or products, please cite:

	```bibtex
	@techreport{echo2026JoyEcho,
	title = {JoyAI-Echo: Pushing the Frontier of Long Video Generation},
	author = {{Echo Team @ Joy Future Academy, JD}},
	institution = {Joy Future Academy, JD},
	year = {2026},
	month = {May}
	}
	```

	## License

	This project is based on LTX-2 by Lightricks Ltd.

	Portions of the original LTX-2 codebase have been modified by JD.com for academic and research purposes only.
	This project is not intended for commercial use. For commercial use of LTX-2 or its derivatives, please contact Lightricks Ltd.

	All original copyright, license, patent, trademark, and attribution notices from LTX-2 are retained.
	This project remains subject to the LTX-2 Community License Agreement.
