| --- |
| license: apache-2.0 |
| tags: |
| - video |
| - video generation |
| base_model: |
| - Wan-AI/Wan2.1-I2V-14B-480P |
| pipeline_tags: |
| - image-to-video |
| library_name: diffusers |
| pipeline_tag: image-to-video |
| --- |
| <div align="center"> |
|
|
| <img src="./assets/logo.png" alt="LiveAct Logo" width="30%"> |
|
|
| # SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory |
|
|
| [Dingcheng Zhen*<sup>✉</sup>](https://scholar.google.com/citations?user=jSLx3CcAAAAJ) · [Xu Zheng*](https://scholar.google.com/citations?user=Ii1c51QAAAAJ) · [Ruixin Zhang*](https://openreview.net/profile?id=~Ruixin_Zhang5) · [Zhiqi Jiang*](https://openreview.net/profile?id=~Zhiqi_Jiang3) |
|
|
| [Yichao Yan]() · [Ming Tao]() · [Shunshun Yin]() |
|
|
| </div> |
|
|
| **SoulX-LiveAct** presents a novel framework that enables **lifelike, multimodal-controlled, high-fidelity** human animation video generation for real-time streaming interactions. |
|
|
| (I) We identify diffusion-step-aligned neighbor latents as a key inductive bias for autoregressive (AR) diffusion, which yields **Neighbor Forcing**, a principled and theoretically grounded scheme for step-consistent AR video generation. |
|
|
| (II) We introduce **ConvKV Memory**, a lightweight plug-in compression mechanism that enables constant-memory hour-scale video generation with negligible overhead. |
|
|
| (III) We develop an optimized real-time system that achieves **20 FPS using only two H100/H200 GPUs** at 720×416 or 512×512 resolution, using end-to-end adaptive FP8 precision, sequence parallelism, and operator fusion. |
|
|
|
|
| <div align="center"> |
| <a href='http://arxiv.org/abs/2603.11746'><img src='https://img.shields.io/badge/Technical-Report-red'></a> |
| <a href='https://soul-ailab.github.io/soulx-liveact/'><img src='https://img.shields.io/badge/Project-Page-green'></a> |
| <a href='https://github.com/Soul-AILab/SoulX-LiveAct'><img src='https://img.shields.io/badge/Github-Home-blue'></a> |
| <a href='https://huggingface.co/Soul-AILab/LiveAct'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow'></a> |
| </div> |
|
|
|
|
| ## 🔥🔥🔥 News |
|
|
| * 📢 Mar 18, 2026: We now support consumer GPUs (e.g., RTX 4090, RTX 5090) with FP8 KV cache and CPU model offloading. In our tests, the 18B model (14B Wan2.1 + 4B audio module) achieves a throughput of 6 FPS on a single RTX 5090. |
| * 👋 Mar 16, 2026: We release the inference code and model weights of SoulX-LiveAct. |
|
|
|
|
| ## 🎥 Demo |
|
|
| [//]: # (**Note:** Due to GitHub limitations, the videos are heavily compressed. Please refer to the [demo page](https://demopagedemo.github.io/LiveAct/) for the original results.) |
|
|
| ### 👫 Podcast |
| <table> |
| <tr> |
| <td><video controls playsinline width="666" src="https://github.com/user-attachments/assets/7d50441c-2a90-48c7-a557-c375936f2b65"></video></td> |
| </tr> |
| </table> |
| |
|
|
| ### 🎤 Music & Talk Show |
| <table> |
| <tr> |
| <td><video controls playsinline width="360" src="https://github.com/user-attachments/assets/9fd4fbcf-3e76-48ca-a8e0-2a46da18da5c"></video></td> |
| <td><video controls playsinline width="360" src="https://github.com/user-attachments/assets/9ac3ad4b-db6a-470b-9f4f-6ab9d1c8d998"></video></td> |
| </tr> |
| </table> |
| |
| ### 📱 FaceTime |
| <table> |
| <tr> |
| <td><video controls playsinline width="360" src="https://github.com/user-attachments/assets/143bb565-078a-48ba-8daa-f2fb56616189"></video></td> |
| <td><video controls playsinline width="360" src="https://github.com/user-attachments/assets/5619381e-bd8c-4aac-a1d6-2a1fdfe9d673"></video></td> |
| </tr> |
| </table> |
| |
|
|
| ## 📑 Open-source Plan |
|
|
| - [x] Release inference code and checkpoints |
| - [x] GUI demo support |
| - [x] End-to-end adaptive FP8 precision |
| - [x] Support model offloading for consumer GPUs (e.g., RTX 4090, RTX 5090) to reduce memory usage |
| - [ ] Support FP4 precision for B-series GPUs (e.g., RTX 5090, B100, B200) |
| - [ ] Release training code |
|
|
| ## ▶️ Quick Start |
|
|
| ### 🛠️ Dependencies and Installation |
|
|
| #### Step 1: Install Basic Dependencies |
|
|
| ```bash |
| conda create -n liveact python=3.10 |
| conda activate liveact |
| pip install -r requirements.txt |
| conda install conda-forge::sox -y |
| ``` |
|
|
| #### Step 2: Install SageAttention |
| The FP8 attention kernel requires SageAttention: |
| * Install SageAttention: |
| ```bash |
| git clone https://github.com/thu-ml/SageAttention.git |
| cd SageAttention |
| git checkout v2.2.0 |
| python setup.py install |
| ``` |
|
|
| * (Optional) Install the modified version of SageAttention: |
| To enable QKV operator fusion in SageAttention, install the modified version with the following commands: |
|
|
| ```bash |
| git clone https://github.com/ZhiqiJiang/SageAttentionFusion.git |
| cd SageAttentionFusion |
| python setup.py install |
| ``` |
|
|
| #### Step 3: Install vLLM |
| The FP8 GEMM kernel requires vLLM: |
| ```bash |
| pip install vllm==0.11.0 |
| ``` |
|
|
| #### Step 4: Install LightVAE |
|
|
| ```bash |
| git clone https://github.com/ModelTC/LightX2V |
| cd LightX2V |
| python setup_vae.py install |
| ``` |
|
|
|
|
| ### 🤗 Download Checkpoints |
|
|
| ### Model Cards |
| | ModelName | Download | |
| |-----------------------|--------------------------------------------------------------------------------| |
| | SoulX-LiveAct | [🤗 Huggingface](https://huggingface.co/Soul-AILab/LiveAct) | |
| | chinese-wav2vec2-base | [🤗 Huggingface](https://huggingface.co/TencentGameMate/chinese-wav2vec2-base) | |
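
One way to fetch both repositories in the table is the `huggingface_hub` Python package. The sketch below is an illustration, not the project's official download script. It assumes `huggingface_hub` is installed. The `checkpoints` root folder and the helper names `local_dirs` and `download_checkpoints` are hypothetical.

```python
from pathlib import Path

# Repository IDs taken from the table above.
REPO_IDS = [
    "Soul-AILab/LiveAct",
    "TencentGameMate/chinese-wav2vec2-base",
]


def local_dirs(root="checkpoints"):
    """Map each repo ID to a local folder named after the repo (assumed layout)."""
    return {repo_id: Path(root) / repo_id.split("/")[-1] for repo_id in REPO_IDS}


def download_checkpoints(root="checkpoints"):
    """Download every repo into its local folder and return the folder paths."""
    # Imported lazily so the path helper above works without the package.
    from huggingface_hub import snapshot_download

    paths = local_dirs(root)
    for repo_id, target in paths.items():
        snapshot_download(repo_id=repo_id, local_dir=str(target))
    return paths
```

Under these assumptions, calling `download_checkpoints()` would place the weights in `checkpoints/LiveAct` and `checkpoints/chinese-wav2vec2-base`, which can then be passed as `--ckpt_dir` and `--wav2vec_dir`.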
|
|
|
|
| ### 🔑 Inference |
|
|
| #### Usage of LiveAct |
|
|
| #### 1. Run real-time streaming inference on two H100/H200 GPUs |
|
|
| ```bash |
| USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0,1 \ |
| torchrun --nproc_per_node=2 --master_port=$(shuf -n 1 -i 10000-65535) \ |
| generate.py \ |
| --size 416*720 \ |
| --ckpt_dir MODEL_PATH \ |
| --wav2vec_dir chinese-wav2vec2-base \ |
| --fps 20 \ |
| --dura_print \ |
| --input_json examples/example.json \ |
| --steam_audio |
| ``` |
|
|
| #### 2. Run with the best performance settings |
|
|
| ```bash |
| USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0,1 \ |
| torchrun --nproc_per_node=2 --master_port=$(shuf -n 1 -i 10000-65535) \ |
| generate.py \ |
| --size 480*832 \ |
| --ckpt_dir MODEL_PATH \ |
| --wav2vec_dir chinese-wav2vec2-base \ |
| --fps 24 \ |
| --input_json examples/example.json |
| ``` |
|
|
| #### 3. Run with action or emotion editing |
|
|
| ```bash |
| USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0,1 \ |
| torchrun --nproc_per_node=2 --master_port=$(shuf -n 1 -i 10000-65535) \ |
| generate.py \ |
| --size 512*512 \ |
| --ckpt_dir MODEL_PATH \ |
| --wav2vec_dir chinese-wav2vec2-base \ |
| --fps 24 \ |
| --input_json examples/example_edit.json |
| ``` |
|
|
| #### 4. Run on RTX 4090/RTX 5090 GPUs |
| **Note:** FP8 KV cache may slightly affect generation quality. |
| ```bash |
| USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0 \ |
| python generate.py \ |
| --size 416*720 \ |
| --ckpt_dir MODEL_PATH \ |
| --wav2vec_dir chinese-wav2vec2-base \ |
| --fps 24 \ |
| --input_json examples/example.json \ |
| --fp8_kv_cache \ |
| --block_offload \ |
| --t5_cpu |
| ``` |
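
For a sense of what to expect from this configuration: the News section reports roughly 6 FPS of throughput on a single RTX 5090, while the command above writes video at `--fps 24`. The arithmetic below is a rough sketch that assumes steady throughput and ignores warm-up. The function names are illustrative and not part of the repository.

```python
def realtime_factor(throughput_fps, target_fps):
    """Generated frames per second divided by the playback frame rate.

    A value of 1.0 or more means generation keeps up with playback.
    """
    return throughput_fps / target_fps


def generation_seconds(clip_seconds, throughput_fps, target_fps):
    """Wall-clock seconds needed to generate a clip of the given playback length."""
    return clip_seconds * target_fps / throughput_fps


# Single RTX 5090 figure from the News section, against a 24 fps target.
print(realtime_factor(6, 24))         # 0.25
print(generation_seconds(10, 6, 24))  # 40.0
# Two H100/H200 GPUs at the reported 20 FPS, against a 20 fps target.
print(realtime_factor(20, 20))        # 1.0
```

By this estimate, a 10-second clip at 24 fps would take about 40 seconds to generate on the consumer setup, whereas the two-GPU H100/H200 setup at `--fps 20` keeps pace with playback.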
|
|
| #### 5. Run on a single GPU for evaluation |
|
|
| ```bash |
| USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0 \ |
| python generate.py \ |
| --size 480*832 \ |
| --ckpt_dir MODEL_PATH \ |
| --wav2vec_dir chinese-wav2vec2-base \ |
| --fps 24 \ |
| --input_json examples/example.json \ |
| --audio_cfg 1.7 \ |
| --t5_cpu |
| ``` |
|
|
|
|
| ### Command Line Arguments |
|
|
| | Argument | Type | Required | Default | Description | |
| |-------------------|-------|----------|---------|-----------------------------------------------------------------------------------------------| |
| | `--size` | str | Yes | - | The width and height of the generated video, given as two integers joined by `*` (e.g., `416*720`). | |
| | `--t5_cpu` | bool | No | false | Whether to place the T5 model on the CPU. | |
| | `--offload_cache` | bool | No | - | Whether to place the KV cache on the CPU. | |
| | `--fps` | int | Yes | - | The target FPS of the generated video. | |
| | `--audio_cfg` | float | No | 1.0 | Classifier-free guidance scale for audio control. | |
| | `--dura_print` | bool | No | false | Whether to print the duration of every block. | |
| | `--input_json` | str | Yes | - | Path to the condition JSON file used to generate the video. | |
| | `--seed` | int | No | 42 | The seed to use for generating the image or video. | |
| | `--steam_audio` | bool | No | false | Whether to run inference with streaming audio. | |
| | `--mean_memory` | bool | No | false | Whether to use the mean memory strategy during inference for further performance improvement. | |
| | `--fp8_kv_cache` | bool | No | false | Whether to store the KV cache in FP8 and dequantize it to BF16 on use. FP8 KV cache may slightly affect generation quality. | |
| | `--block_offload` | bool | No | false | Whether to offload WanModel blocks to the CPU between block forwards. | |
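
To give a feel for why `--fp8_kv_cache` can slightly affect quality, the sketch below rounds values to an 8-bit float grid in plain Python. It assumes the E4M3 layout (4 exponent bits, 3 mantissa bits, maximum 448, as in PyTorch's `torch.float8_e4m3fn`). The README does not state which FP8 format or scaling the implementation uses, and `fp8_e4m3_round` is a hypothetical helper, not the repository's code. Every E4M3 value is exactly representable in BF16, so under this assumption precision is lost when a value is stored, not when it is dequantized.

```python
import math

E4M3_MAX = 448.0     # largest finite E4M3 value
MIN_NORMAL_EXP = -5  # math.frexp exponent of the smallest normal value, 2**-6


def fp8_e4m3_round(x):
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # a == m * 2**e with 0.5 <= m < 1; subnormals reuse the smallest normal exponent.
    _, e = math.frexp(a)
    e = max(e, MIN_NORMAL_EXP)
    step = 2.0 ** (e - 4)  # 3 mantissa bits give 8 grid points per power of two
    return sign * min(round(a / step) * step, E4M3_MAX)


for value in (0.3, 1.07, -2.6, 500.0):
    stored = fp8_e4m3_round(value)
    print(value, "->", stored)
```

In this model, 0.3 is stored as 0.3125 (about 4% relative error) and 500.0 saturates to 448.0. The relative rounding error for normal values stays below roughly 6%, which is the kind of perturbation the quality note refers to.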
|
|
|
|
| ### 💻 GUI demo |
| Run SoulX-LiveAct inference on the GUI demo and evaluate real-time performance. |
|
|
| <div> |
| <video controls playsInline src="https://github.com/user-attachments/assets/7150345d-693f-4250-af07-e94daa6ef6ed" width="50%"></video> |
| </div> |
|
|
| **Note:** The first few blocks during the initial run require warm-up. Normal performance will be observed from the second run onward. |
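
When measuring throughput yourself, it can help to exclude the warm-up blocks mentioned in the note. The helper below is a minimal sketch. It assumes you have collected per-block durations in seconds (for example from the `--dura_print` output, whose exact format is not documented here) and that you know how many frames each block produces. `frames_per_block`, `warmup_blocks`, and the sample durations are all hypothetical.

```python
def steady_state_fps(block_seconds, frames_per_block, warmup_blocks=2):
    """Average FPS over the blocks that remain after skipping the warm-up ones."""
    steady = block_seconds[warmup_blocks:]
    if not steady:
        raise ValueError("no blocks left after skipping warm-up")
    return frames_per_block * len(steady) / sum(steady)


# Made-up durations: two slow warm-up blocks followed by steady ones.
durations = [2.0, 1.0, 0.25, 0.25, 0.25, 0.25]
print(steady_state_fps(durations, frames_per_block=4))                   # 16.0
print(steady_state_fps(durations, frames_per_block=4, warmup_blocks=0))  # 6.0
```

Including the warm-up blocks drags the average from 16 FPS down to 6 FPS in this made-up example, which is why first-run numbers can understate steady-state performance.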
|
|
| #### 1. Run real-time streaming inference on two H100/H200 GPUs |
|
|
| ```bash |
| USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0,1 \ |
| torchrun --nproc_per_node=2 --master_port=$(shuf -n 1 -i 10000-65535) \ |
| demo.py \ |
| --ckpt_dir MODEL_PATH \ |
| --wav2vec_dir chinese-wav2vec2-base \ |
| --size 416*720 \ |
| --video_save_path ./generated_videos |
| ``` |
|
|
| #### 2. Run on RTX 4090/RTX 5090 GPUs |
| ```bash |
| USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0 \ |
| torchrun --nproc_per_node=1 --master_port=$(shuf -n 1 -i 10000-65535) \ |
| demo.py \ |
| --ckpt_dir MODEL_PATH \ |
| --wav2vec_dir chinese-wav2vec2-base \ |
| --size 416*720 \ |
| --fp8_kv_cache \ |
| --block_offload \ |
| --t5_cpu \ |
| --video_save_path ./generated_videos |
| ``` |
|
|
| ## 📚 Citation |
|
|
| ```bibtex |
| @misc{zhen2026soulxliveacthourscalerealtimehuman, |
| title={SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory}, |
| author={Dingcheng Zhen and Xu Zheng and Ruixin Zhang and Zhiqi Jiang and Yichao Yan and Ming Tao and Shunshun Yin}, |
| year={2026}, |
| eprint={2603.11746}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CV}, |
| url={https://arxiv.org/abs/2603.11746}, |
| } |
| ``` |
| ## 📮 Contact Us |
| If you would like to leave a message about our work, feel free to email dingchengzhen@soulapp.cn. |
|
|
| You’re welcome to join our WeChat group or Soul group for technical discussions. |
| <p align="center"> |
| <span style="display: inline-block; margin-right: 10px;"> |
| <img src="assets/QRCode_WX.png" width="200" alt="WeChat Group QR Code"/> |
| </span> |
| <span style="display: inline-block;"> |
| <img src="assets/QRCode_Soul.png" width="300" alt="Soul Group QR Code"/> |
| </span> |
| </p> |