xmuhtt
/

File size: 11,352 Bytes
f91952b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
---
license: apache-2.0
tags:
- video
- video genration
base_model:
- Wan-AI/Wan2.1-I2V-14B-480P
pipeline_tags:
- image-to-video
library_name: diffusers
pipeline_tag: image-to-video
---
<div align="center">

<img src="./assets/logo.png" alt="LiveAct Logo" width="30%">

# SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory

[Dingcheng Zhen*<sup>&#9993;</sup>](https://scholar.google.com/citations?user=jSLx3CcAAAAJ) · [Xu Zheng*](https://scholar.google.com/citations?user=Ii1c51QAAAAJ) · [Ruixin Zhang*](https://openreview.net/profile?id=~Ruixin_Zhang5) · [Zhiqi Jiang*](https://openreview.net/profile?id=~Zhiqi_Jiang3)

[Yichao Yan]() · [Ming Tao]() · [Shunshun Yin]()

</div>

**SoulX-LiveAct** presents a novel framework that enables **lifelike, multimodal-controlled, high-fidelity** human animation video generation for real-time streaming interactions.

(I) We identify diffusion-step-aligned neighbor latents as a key inductive bias for AR diffusion, providing a principled and theoretically grounded **Neighbor Forcing** for step-consistent AR video generation.

(II) We introduce **ConvKV Memory**, a lightweight plug-in compression mechanism that enables constant-memory hour-scale video generation with negligible overhead.

(III) We develop an optimized real-time system that achieves **20 FPS using only two H100/H200 GPUs** with end-end adaptive FP8 precision, sequence parallelism, and operator fusion at 720×416 or 512×512 resolution.


<div align="center">
  <a href='http://arxiv.org/abs/2603.11746'><img src='https://img.shields.io/badge/Technical-Report-red'></a>
  <a href='https://soul-ailab.github.io/soulx-liveact/'><img src='https://img.shields.io/badge/Project-Page-green'></a>
  <a href='https://github.com/Soul-AILab/SoulX-LiveAct'><img src='https://img.shields.io/badge/Github-Home-blue'></a>
  <a href='https://huggingface.co/Soul-AILab/LiveAct'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow'></a>
</div>


## 🔥🔥🔥 News

* 📢 Mar 18, 2026: We now support consumer GPUs (e.g., RTX 4090, RTX 5090) with FP8 KV cache and CPU model offloading. In our tests, the 18B model (14B Wan2.1 + 4B audio module) achieves a throughput of 6 FPS on a single RTX 5090.
* 👋 Mar 16, 2026: We release the inference code and model weights of SoulX-LiveAct.


## 🎥 Demo

[//]: # (**Note:** Due to GitHub limitations, the videos are heavily compressed. Please refer to the [demo page]&#40;https://demopagedemo.github.io/LiveAct/&#41; for the original results.)

### 👫 Podcast
<table>
  <tr>
    <td><video controls playsinline width="666" src="https://github.com/user-attachments/assets/7d50441c-2a90-48c7-a557-c375936f2b65"></video></td>
  </tr>
</table>


### 🎤 Music & Talk Show
<table>
  <tr>
    <td><video controls playsinline width="360" src="https://github.com/user-attachments/assets/9fd4fbcf-3e76-48ca-a8e0-2a46da18da5c"></video></td>
    <td><video controls playsinline width="360" src="https://github.com/user-attachments/assets/9ac3ad4b-db6a-470b-9f4f-6ab9d1c8d998"></video></td>
  </tr>
</table>

### 📱 FaceTime
<table>
  <tr>
    <td><video controls playsinline width="360" src="https://github.com/user-attachments/assets/143bb565-078a-48ba-8daa-f2fb56616189"></video></td>
    <td><video controls playsinline width="360" src="https://github.com/user-attachments/assets/5619381e-bd8c-4aac-a1d6-2a1fdfe9d673"></video></td>
  </tr>
</table>


## 📑 Open-source Plan

  - [x] Release inference code and checkpoints
  - [x] GUI demo Support
  - [x] End-end adaptive FP8 precision
  - [x] Support model offloading for consumer GPUs (e.g., RTX 4090, RTX 5090) to reduce memory usage
  - [ ] Support FP4 precision for B-series GPUs (e.g., RTX 5090, B100, B200)
  - [ ] Release training code

## ▶️ Quick Start

### 🛠️ Dependencies and Installation

#### Step 1: Install Basic Dependencies

```bash
conda create -n liveact python=3.10
conda activate liveact
pip install -r requirements.txt
conda install conda-forge::sox -y
```

#### Step 2: Install SageAttention
To enable fp8 attention kernel, you need to install SageAttention:
* Install SageAttention:
  ```bash
  git clone https://github.com/thu-ml/SageAttention.git
  cd SageAttention
  git checkout v2.2.0
  python setup.py install
  ```

* (Optional) Install the modified version of SageAttention: 
  To enable SageAttention for QKV's operator fusion, you need to install it by the following command:

  ```bash
  git clone https://github.com/ZhiqiJiang/SageAttentionFusion.git
  cd SageAttentionFusion
  python setup.py install
  ```

#### Step 3: Install vllm:
  To enable fp8 gemm kernel, you need to install vllm:
  ```bash
  pip install vllm==0.11.0
  ```

#### Step 4 Install LightVAE::

  ```bash
  git clone https://github.com/ModelTC/LightX2V
  cd LightX2V
  python setup_vae.py install
  ```


### 🤗 Download Checkpoints

### Model Cards
| ModelName             | Download                                                                       |
|-----------------------|--------------------------------------------------------------------------------| 
| SoulX-LiveAct         | [🤗 Huggingface](https://huggingface.co/Soul-AILab/LiveAct)                    |
| chinese-wav2vec2-base | [🤗 Huggingface](https://huggingface.co/TencentGameMate/chinese-wav2vec2-base) |


### 🔑 Inference

#### Usage of LiveAct

#### 1. Run real-time streaming inference on two H100/H200 GPUs

```bash
USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0,1 \
torchrun --nproc_per_node=2 --master_port=$(shuf -n 1 -i 10000-65535)  \
    generate.py \
    --size 416*720 \
    --ckpt_dir MODEL_PATH \
    --wav2vec_dir chinese-wav2vec2-base \
    --fps 20 \
    --dura_print \
    --input_json examples/example.json \
    --steam_audio
```

#### 2. Run with the best performance settings

```bash
USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0,1 \
torchrun --nproc_per_node=2 --master_port=$(shuf -n 1 -i 10000-65535)  \
    generate.py \
    --size 480*832 \
    --ckpt_dir MODEL_PATH \
    --wav2vec_dir chinese-wav2vec2-base \
    --fps 24 \
    --input_json examples/example.json
```

#### 3. Run with action or emotion editing

```bash
USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0,1 \
torchrun --nproc_per_node=2 --master_port=$(shuf -n 1 -i 10000-65535)  \
    generate.py \
    --size 512*512 \
    --ckpt_dir MODEL_PATH \
    --wav2vec_dir chinese-wav2vec2-base \
    --fps 24 \
    --input_json examples/example_edit.json
```

#### 4. Run on RTX 4090/RTX 5090 GPUs
**Note:** FP8 KV cache may slightly affect generation quality.
```bash
USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0 \
python generate.py \
    --size 416*720 \
    --ckpt_dir MODEL_PATH \
    --wav2vec_dir chinese-wav2vec2-base \
    --fps 24 \
    --input_json examples/example.json \
    --fp8_kv_cache \
    --block_offload \
    --t5_cpu
```

#### 5. Run with single GPU for Eval

```bash
USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0 \
python generate.py \
    --size 480*832 \
    --ckpt_dir MODEL_PATH \
    --wav2vec_dir chinese-wav2vec2-base \
    --fps 24 \
    --input_json examples/example.json \
    --audio_cfg 1.7 \
    --t5_cpu
```


### Command Line Arguments

| Argument          | Type  | Required | Default | Description                                                                                   |
|-------------------|-------|----------|---------|-----------------------------------------------------------------------------------------------|
| `--size`          | str   | Yes      | -       | The width and height of the generated video.                                                  |
| `--t5_cpu`        | bool  | No       | false   | Whether to place T5 model on CPU.                                                             |
| `--offload_cache` | bool  | No       | -       | Whether to place kv cache on CPU.                                                             |
| `--fps`           | int   | Yes      | -       | The target fps  of the generated video.                                                       |
| `--audio_cfg`     | float | No       | 1.0     | Classifier free guidance scale for audio control.                                             |
| `--dura_print`    | bool  | No       | no      | Whether print duration for every block.                                                       |
| `--input_json`    | str   | Yes      | _       | The condition json file path to generate the video.                                           |
| `--seed`          | int   | No       | 42      | The seed to use for generating the image or video.                                            |
| `--steam_audio`   | bool  | No       | false   | Whether inference with steaming audio.                                                        |
| `--mean_memory`   | bool  | No       | false   | Whether to use the mean memory strategy during inference for further performance improvement. |
| `--fp8_kv_cache`   | bool  | No       | false   | Whether to store kv cache in FP8 and dequantize to BF16 on use. FP8 KV cache may slightly affect generation quality.|
| `--block_offload`   | bool  | No       | false   | Whether to offload WanModel blocks to CPU between block forwards.|


### 💻 GUI demo
Run SoulX-LiveAct inference on the GUI demo and evaluate real-time performance.

<div>
  <video controls playsInline src="https://github.com/user-attachments/assets/7150345d-693f-4250-af07-e94daa6ef6ed" width="50%"></video>
</div>

**Note:** The first few blocks during the initial run require warm-up. Normal performance will be observed from the second run onward.

#### 1. Run real-time streaming inference on two H100/H200 GPUs

```bash
USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0,1 \
torchrun --nproc_per_node=2 --master_port=$(shuf -n 1 -i 10000-65535) \
  demo.py \
  --ckpt_dir MODEL_PATH \
  --wav2vec_dir chinese-wav2vec2-base \
  --size 416*720 \
  --video_save_path ./generated_videos
```

#### 2. Run on RTX 4090/RTX 5090 GPUs
```bash
USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0 \
torchrun --nproc_per_node=1 --master_port=$(shuf -n 1 -i 10000-65535) \
  demo.py \
  --ckpt_dir MODEL_PATH \
  --wav2vec_dir chinese-wav2vec2-base \
  --size 416*720 \
  --fp8_kv_cache \
  --block_offload \
  --t5_cpu \
  --video_save_path ./generated_videos
```

## 📚 Citation

```bibtex
@misc{zhen2026soulxliveacthourscalerealtimehuman,
      title={SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory}, 
      author={Dingcheng Zhen and Xu Zheng and Ruixin Zhang and Zhiqi Jiang and Yichao Yan and Ming Tao and Shunshun Yin},
      year={2026},
      eprint={2603.11746},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.11746}, 
}
```
## 📮 Contact Us
If you are interested in leaving a message to our work, feel free to email dingchengzhen@soulapp.cn.

You’re welcome to join our WeChat group or Soul group for technical discussions.
<p align="center">
  <span style="display: inline-block; margin-right: 10px;">
    <img src="assets/QRCode_WX.png" width="200" alt="WeChat Group QR Code"/>
  </span>
  <span style="display: inline-block;">
    <img src="assets/QRCode_Soul.png" width="300" alt="WeChat QR Code"/>
  </span>
</p>