	---
	license: other
	license_name: ltx-2-community-license-agreement
	license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
	pipeline_tag: text-to-video
	tags:
	- text-to-video
	- video-generation
	- audio-video-generation
	- long-video
	- multi-shot
	- dmd
	library_name: ltx-video
	---
	<p align="center">
	<img src="assets/image.png" alt="JoyAI-Echo generated video gallery" width="100%">
	</p>

	<div align="center">

	<h1>JoyAI-Echo</h1>

	<p><strong>🎬 Pushing the Frontier of Long Video Generation</strong></p>

	<p>Standalone, inference-only release for <strong>minute-level multi-shot audio-video generation</strong> with a distilled DMD generator, paired cross-modal memory, and story-level consistency.</p>

	<p><strong>For academic research and non-commercial use only.</strong></p>

	<p>
	<a href="https://github.com/jd-opensource/JoyAI-Echo/blob/main/joyai-echo%20tech%20report.pdf"><b>📄 Paper</b></a> |
	<a href="https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/"><b>🌐 Project Page</b></a> |
	<a href="#quickstart"><b>🚀 Quickstart</b></a> |
	<a href="#results"><b>📊 Results</b></a> |
	<a href="#citation"><b>📝 Citation</b></a>
	</p>

	<p>
	<img src="https://img.shields.io/badge/Python-3.11-3776AB?style=flat-square&logo=python&logoColor=white" alt="Python 3.11">
	<img src="https://img.shields.io/badge/PyTorch-2.8-EE4C2C?style=flat-square&logo=pytorch&logoColor=white" alt="PyTorch 2.8">
	<img src="https://img.shields.io/badge/CUDA-12.8-76B900?style=flat-square&logo=nvidia&logoColor=white" alt="CUDA 12.8">
	<img src="https://img.shields.io/badge/Release-Inference--Only-black?style=flat-square" alt="Inference">
	<img src="https://img.shields.io/badge/Long%20Video-5%20min-d61f2c?style=flat-square" alt="5 minute long video">
	</p>

	</div>

	## Abstract

	Long video generation still suffers from error accumulation, weak temporal coherence, and prohibitive latency, limiting its applicability to interactive scenarios. We present JoyAI-Echo, a framework that breaks these barriers through four key advances.
	Central to its performance, a cross-modal audio-visual memory bank preserves character appearance and voice timbre consistently over five-minute videos. A post-training pipeline then combines memory-based reinforcement learning with distribution matching distillation (DMD), yielding a 7.5× speedup and substantially better visual quality and alignment.
	Empowered by these two components, JoyAI-Echo decisively outperforms HappyOyster (directing mode) on long-form generation and even surpasses the short-video specialist Wan 2.6 on human-centric tasks.
	Beyond raw generation quality, an interactive agent enables real-time user editing through conversational instructions, and a lightweight super-resolution module maintains high definition under streaming latency, further elevating the overall experience and delivering instantly editable, conversation-speed video creation.
	For the first time, JoyAI-Echo simultaneously achieves long-range cross-modal consistency, real-time inference for minute-long video, conversational interactivity, and high-resolution output — without compromise, inaugurating a new era of interactive video generation.
	Code and weights will be open-sourced.

	## Highlights

	- 🎞️ Minute-level multi-shot stories: generate a sequence of coherent shots from one prompt JSON.
	- ⚡ DMD-distilled few-step inference: ~7.5x faster than the original pipeline.
	- 🔊 Joint audio-video generation: one pipeline produces synchronized video and audio.
	- 🧠 Paired cross-modal memory bank: conditions each new shot on prior visual identity and voice context for story-level consistency.

	## Current Release Scope

	JoyAI-Echo currently focuses on text-to-video (T2V) and multi-shot long-video generation with paired audio-video memory. The memory used in our official pipeline is built from generated T2V shots.

	Please note that image-to-video (I2V) is not supported in the current release.

	We are actively working on I2V support and plan to release it in a future version.

	## Demo Gallery

	Explore long-form and short-form JoyAI-Echo cases on the [Project Page](https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/). 🍿

	## Results

	### Reported Scale

	| Item | Value |
	| --- | ---: |
	| 🎬 Long-form coherent story length | 5 min |
	| ⚡ Generation speedup over the original multi-step pipeline | 7.5x |
	| 📚 Benchmark stories | 100 |
	| 🎞️ Generated evaluation shots | 3,000 |
	| 🕒 Frames per shot | 241 @ 25 fps |
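The figures in this table can be cross-checked with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not part of the release: it assumes the 3,000 evaluation shots are split evenly across the 100 benchmark stories and that shots are concatenated without overlap or transitions, neither of which is stated above.

```python
# Back-of-the-envelope check of the reported scale figures.
FRAMES_PER_SHOT = 241
FPS = 25
BENCHMARK_STORIES = 100
EVALUATION_SHOTS = 3_000


def shot_seconds(frames: int = FRAMES_PER_SHOT, fps: int = FPS) -> float:
    """Duration of one shot in seconds."""
    return frames / fps


def story_seconds(shots: int, frames: int = FRAMES_PER_SHOT, fps: int = FPS) -> float:
    """Duration of a story, assuming shots are concatenated with no overlap."""
    return shots * frames / fps


# Assumption: shots are distributed evenly over the benchmark stories.
shots_per_story = EVALUATION_SHOTS // BENCHMARK_STORIES

print(f"one shot:  {shot_seconds():.2f} s")
print(f"one story: {shots_per_story} shots, "
      f"{story_seconds(shots_per_story) / 60:.2f} min")
```

Under these assumptions each shot lasts a little under 10 seconds and a 30-shot story runs just under five minutes, which is consistent with the 5 min story length reported above.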

	### Human Evaluation

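	Results of a GSB (Good/Same/Bad) user study on long- and short-video generation. Each row gives the percentage of user votes that preferred JoyAI-Echo, judged the two systems a tie, or preferred the baseline.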

	| Aspect<br>(Long Video) | JoyAI-Echo | Tie | HappyOyster<br>(Directing) |
	| --- | ---: | ---: | ---: |
	| Visual aesthetics | 63.6% | 8.8% | 27.6% |
	| Audio quality | 81.7% | 6.5% | 11.8% |
	| Prompt following | 80.6% | 13.5% | 5.9% |
	| IP consistency | 59.4% | 12.9% | 27.7% |

	| Aspect<br>(Short Video) | JoyAI-Echo | Tie | Wan 2.6 |
	| --- | ---: | ---: | ---: |
	| Visual aesthetics | 58.8% | 14.7% | 26.5% |
	| Audio quality | 32.3% | 30.9% | 36.8% |
	| Prompt following | 33.8% | 36.8% | 29.4% |
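One way to read the two tables at a glance is the net preference per aspect: the share of votes for JoyAI-Echo minus the share for the baseline. The sketch below only re-uses the percentages printed above. The `net_preference` helper is an illustrative summary introduced here, not a metric defined by the authors.

```python
# GSB percentages copied from the tables above: (JoyAI-Echo, tie, baseline).
LONG_VIDEO = {
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "IP consistency": (59.4, 12.9, 27.7),
}
SHORT_VIDEO = {
    "Visual aesthetics": (58.8, 14.7, 26.5),
    "Audio quality": (32.3, 30.9, 36.8),
    "Prompt following": (33.8, 36.8, 29.4),
}


def net_preference(good: float, same: float, bad: float) -> float:
    """Good minus bad, in percentage points; positive favours JoyAI-Echo."""
    return round(good - bad, 1)


for title, table in (("Long video", LONG_VIDEO), ("Short video", SHORT_VIDEO)):
    print(title)
    for aspect, row in table.items():
        # Each row should account for all votes (up to rounding).
        assert abs(sum(row) - 100.0) < 0.05
        print(f"  {aspect:<18} {net_preference(*row):+.1f} pp")
```

A positive value means more votes went to JoyAI-Echo than to the baseline for that aspect, and a negative value means the opposite. Ties are ignored by this summary, so it should be read alongside the tie column.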


	## Quickstart

	### 1. Clone

	First, clone the repository:

	```bash
	git clone https://github.com/jd-opensource/JoyAI-Echo.git
	cd JoyAI-Echo
	```

	### 2. Create the environment

	The reference environment is Python 3.11 + PyTorch 2.8 + CUDA 12.8.

	With conda:

	```bash
	conda env create -f environment.yml
	conda activate echo-long
	```

	With `uv`:

	```bash
	uv venv --python 3.11 .venv
	source .venv/bin/activate
	uv pip install --extra-index-url https://download.pytorch.org/whl/cu128 -r requirements.txt
	```

	[`ffmpeg`](https://ffmpeg.org/download.html) must be available on `PATH` for shot concatenation. The conda recipe includes it. If you use `uv`, install it with your system package manager:

	```bash
	# Debian/Ubuntu:
	sudo apt install ffmpeg
	# macOS:
	brew install ffmpeg
	```

	### 3. Download checkpoint

	Download the JoyAI-Echo release checkpoint and Gemma text encoder:

	| File | Description | Size | Link |
	| --- | --- | --- | --- |
	| `echo-longvideo-release.safetensors` | Full model (transformer + VAE + vocoder) | ~46 GB | [`JoyAI-Echo`](https://huggingface.co/jdopensource/JoyAI-Echo) |
	| `gemma-3-12b/` | Instruction-tuned model (text encoder) | ~24 GB | [`gemma-3-12b-it`](https://huggingface.co/google/gemma-3-12b-it) |

	Place them under `checkpoints/`:

	```text
	checkpoints/
	+-- echo-longvideo-release.safetensors
	`-- gemma-3-12b/
	```
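Before starting a long run it can help to confirm that both items are where the layout above expects them. The helper below is a small sketch, not part of the release: `missing_checkpoints` is a name introduced here, and it only checks that the file and directory exist, not that their contents are complete.

```python
from pathlib import Path

# Names taken from the layout above.
EXPECTED_FILE = "echo-longvideo-release.safetensors"
EXPECTED_DIR = "gemma-3-12b"


def missing_checkpoints(root: str = "checkpoints") -> list[str]:
    """Return the expected checkpoint entries that are absent under `root`."""
    base = Path(root)
    missing = []
    if not (base / EXPECTED_FILE).is_file():
        missing.append(EXPECTED_FILE)
    if not (base / EXPECTED_DIR).is_dir():
        missing.append(EXPECTED_DIR + "/")
    return missing


if __name__ == "__main__":
    problems = missing_checkpoints()
    if problems:
        print("Missing under checkpoints/:", ", ".join(problems))
    else:
        print("Checkpoint layout looks complete.")
```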

	### 4. Write a story prompt

	Create a JSON file under `prompts/`.

	Each string is one complete shot description. A single prompt creates a single shot. Multiple prompts create a multi-shot story conditioned through the paired audio-video memory bank.
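As an illustration, the sketch below writes a three-shot story file. It assumes the prompt file is a top-level JSON array of strings, which is how the description above reads but is not confirmed here, so check the example prompts shipped with the repository for the exact schema. The `write_story` helper, the file name, and the shot texts are all hypothetical.

```python
import json
from pathlib import Path

# Illustrative shot descriptions (hypothetical, not from the release).
EXAMPLE_SHOTS = [
    "A lighthouse keeper climbs a spiral staircase at dusk, waves crashing outside.",
    "The same keeper lights the lamp and says quietly, 'Storm is coming.'",
    "Wide shot of the lighthouse beam sweeping across a dark, stormy sea.",
]


def write_story(name: str, shots: list[str], prompts_dir: str = "prompts") -> Path:
    """Write `shots` as a JSON array of strings to <prompts_dir>/<name>.json.

    Assumption: one string per shot, in story order.
    """
    if not shots or not all(isinstance(s, str) and s.strip() for s in shots):
        raise ValueError("shots must be a non-empty list of non-empty strings")
    path = Path(prompts_dir) / f"{name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(shots, indent=2, ensure_ascii=False) + "\n",
                    encoding="utf-8")
    return path
```

Calling `write_story("lighthouse", EXAMPLE_SHOTS)` would create `prompts/lighthouse.json`. Passing a one-element list would describe a single shot.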

	### 5. Run inference

	```bash
	python inference.py
	```

	This loads the model once and processes all prompt files under `prompts/`.

	> 💡 Note: The inference pipeline is optimized to run on lower-VRAM
	> GPUs: it keeps peak GPU memory usage at around 46–50 GB, at the cost
	> of slightly longer per-shot inference time.

	Outputs are written to:

	```text
	inference_result/outputs/<prompt-name>/inference_<timestamp>/
	```

	## Configuration

	All inference parameters are managed in `configs/inference.yaml`. The file is organized into sections:

	| Section | Contents |
	| --- | --- |
	| `paths` | Checkpoint path, prompts directory, output root |
	| `video` | Resolution, frame count, FPS, seed |
	| `denoising` | Step list and sigma schedule |
	| `memory` | Memory bank size, save mode, LoRA settings |
	| `audio_memory` | Audio window, mel-spectrogram params |
	| `inference` | Device, dtype, grad scale |

	### Override via CLI

	Any YAML parameter can be overridden from the command line:

	```bash
	python inference.py --seed 42 --num-frames 121 --video-height 480 --video-width 832
	```

	Use a custom config file:

	```bash
	python inference.py --config configs/my_experiment.yaml
	```

	The Python entrypoint exposes the full configuration surface:

	```bash
	python inference.py --help
	```

	## Hardware

	Peak GPU memory usage is around 46–50 GB at the default setting (241 frames at 25 fps, 1280 × 736), so a single 80 GB H100/A100-class GPU is sufficient. A 48 GB GPU can also work, though it sits within the reported peak range and leaves little headroom.

	For smaller GPUs, reduce resolution/frames:

	```bash
	python inference.py --num-frames 121 --video-height 480 --video-width 832
	```
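To get a feel for how much smaller the reduced setting is, the sketch below compares the number of pixels generated per shot at the two settings. This is only a rough proxy: peak memory also includes the model weights and other fixed costs, so it will not shrink in the same proportion, and no memory figure for the reduced setting is reported here.

```python
def pixel_frames(width: int, height: int, frames: int) -> int:
    """Total pixels generated for one shot (width x height x frames)."""
    return width * height * frames


DEFAULT = pixel_frames(1280, 736, 241)   # default setting from the text above
REDUCED = pixel_frames(832, 480, 121)    # reduced setting from the command above
ratio = REDUCED / DEFAULT

print(f"default: {DEFAULT:,} pixels per shot")
print(f"reduced: {REDUCED:,} pixels per shot ({ratio:.0%} of default)")
print(f"shot length: {241 / 25:.2f} s -> {121 / 25:.2f} s at 25 fps")
```

The reduced setting generates roughly a fifth of the pixels and halves the shot duration, so it trades both resolution and shot length for a smaller footprint.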

	## TODO List

	- [x] Release inference code
	- [x] Release model checkpoints
	- [x] Add prompt examples
	- [ ] Release Director Agent

	## Links

	- Project page: [`https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/`](https://echo-team-joy-future-academy-jd.github.io/Echo-LongVideo-Page/)
	- Repository: [`https://github.com/jd-opensource/JoyAI-Echo`](https://github.com/jd-opensource/JoyAI-Echo)
	- Hugging Face: [`https://huggingface.co/jdopensource/JoyAI-Echo`](https://huggingface.co/jdopensource/JoyAI-Echo)

	## Acknowledgements

	We gratefully acknowledge the open-source projects this work builds upon, in particular [LTX-2.3](https://huggingface.co/Lightricks/LTX-2.3) for the base video generator and [Gemma](https://huggingface.co/google/gemma-3-12b-it) for the text encoder. Thanks also to the broader research community whose contributions made this release possible.

	## Citation

	If JoyAI-Echo helps your research or products, please cite:

	```bibtex
	@techreport{echo2026longvideo,
	title = {JoyAI-Echo: Pushing the Frontier of Long Video Generation},
	author = {{Echo Team @ Joy Future Academy, JD}},
	institution = {Joy Future Academy, JD},
	year = {2026},
	month = {May}
	}
	```

	## License

	This project is based on LTX-2 by Lightricks Ltd.

	Portions of the original LTX-2 codebase have been modified by JD.com for academic and research purposes only.
	This project is not intended for commercial use. For commercial use of LTX-2 or its derivatives, please contact Lightricks Ltd.

	All original copyright, license, patent, trademark, and attribution notices from LTX-2 are retained.
	This project remains subject to the LTX-2 Community License Agreement.