Upload folder using huggingface_hub

Browse files

Files changed (12) hide show

.gitattributes +7 -0
.ipynb_checkpoints/README-checkpoint.md +198 -0
README.md +198 -0
assets.zip +3 -0
assets/.DS_Store +0 -0
assets/1.mp4 +3 -0
assets/2.mp4 +3 -0
assets/demo.mp4 +3 -0
assets/logo.png +3 -0
assets/podcast.mp4 +3 -0
assets/teaser1.mp4 +3 -0
assets/teaser2.mp4 +3 -0

.gitattributes CHANGED Viewed

@@ -35,3 +35,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 google/umt5-xxl/tokenizer.json filter=lfs diff=lfs merge=lfs -text
 xlm-roberta-large/tokenizer.json filter=lfs diff=lfs merge=lfs -text

 *tfevents* filter=lfs diff=lfs merge=lfs -text
 google/umt5-xxl/tokenizer.json filter=lfs diff=lfs merge=lfs -text
 xlm-roberta-large/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+assets/1.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/2.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/demo.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/logo.png filter=lfs diff=lfs merge=lfs -text
+assets/podcast.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/teaser1.mp4 filter=lfs diff=lfs merge=lfs -text
+assets/teaser2.mp4 filter=lfs diff=lfs merge=lfs -text

.ipynb_checkpoints/README-checkpoint.md ADDED Viewed

	@@ -0,0 +1,198 @@

+<div align="center">
+<img src="./assets/logo.png" alt="LiveAct Logo" width="30%">
+# SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory
+[Dingcheng Zhen*<sup>&#9993;</sup>](https://scholar.google.com/citations?user=jSLx3CcAAAAJ) · [Xu Zheng*](https://scholar.google.com/citations?user=Ii1c51QAAAAJ) · [Ruixin Zhang*](https://openreview.net/profile?id=~Ruixin_Zhang5) · [Zhiqi Jiang*](https://openreview.net/profile?id=~Zhiqi_Jiang3)
+[Yichao Yan]() · [Ming Tao]() · [Shunshun Yin]()
+</div>
+**LiveAct** presents a novel framework that enables **lifelike, multimodal-controlled, high-fidelity** human animation video generation for real-time streaming interactions.
+(I) We identify diffusion-step-aligned neighbor latents as a key inductive bias for AR diffusion, providing a principled and theoretically grounded **Neighbor Forcing** for step-consistent AR video generation.
+(II) We introduce **ConvKV Memory**, a lightweight plug-in compression mechanism that enables constant-memory hour-scale video generation with negligible overhead.
+(III) We develop an optimized real-time system that achieves **20 FPS using only two H100/H200 GPUs** with end-end adaptive FP8 precision, sequence parallelism, and communication-computation parallelism at 720×416 or 512×512 resolution.
+<div align="center">
+  <a href='http://arxiv.org/abs/2603.11746'><img src='https://img.shields.io/badge/Technical-Report-red'></a>
+  <a href='https://demopagedemo.github.io/LiveAct/'><img src='https://img.shields.io/badge/Project-Page-green'></a>
+  <a href='https://github.com/Soul-AILab/SoulX-LiveAct'><img src='https://img.shields.io/badge/Github-Home-blue'></a>
+  <a href='https://huggingface.co/Soul-AILab/LiveAct'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow'></a>
+</div>
+## 🔥🔥🔥 News
+* 👋 Mar 16, 2026: We release the inference code and model weights of LiveAct.
+## 🎥 Demo
+### 👫 Podcast
+<div>
+  <video controls playsInline src="./assets/podcast.mp4" width="40%"></video>
+</div>
+### 🎤 Music & Talk Show
+<table>
+  <tr>
+    <td><video controls playsinline width="360" src="./assets/teaser1.mp4"></video></td>
+    <td><video controls playsinline width="360" src="./assets/teaser2.mp4"></video></td>
+  </tr>
+</table>
+### 📱 FaceTime
+<table>
+  <tr>
+    <td><video controls playsinline width="360" src="./assets/1.mp4"></video></td>
+    <td><video controls playsinline width="360" src="./assets/2.mp4"></video></td>
+  </tr>
+</table>
+## 📑 Open-source Plan
+  - [x] Release inference code and checkpoints
+  - [x] GUI demo Support
+  - [x] End-end adaptive FP8 precision
+  - [ ] Support FP4 precision for B-series GPUs (e.g., RTX 5090, B100, B200)
+  - [ ] Release training code
+## ▶️ Quick Start
+### 🛠️ Dependencies and Installation
+#### Step 1: Install Basic Dependencies
+```bash
+conda create -n liveact python=3.10
+conda activate liveact
+pip install -r requirements.txt
+conda install conda-forge::sox -y
+```
+#### Step 2: Install SageAttention
+To enable fp8 attention kernel, you need to install SageAttention:
+* Install SageAttention:
+  ```bash
+  pip install sageattention==2.2.0 --no-build-isolation
+  ```
+* (Optional) Install the modified version of SageAttention:
+  To enable SageAttention for QKV communication–computation parallelism, you need to install it by the following command:
+  ```bash
+  git clone https://github.com/ZhiqiJiang/SageAttentionFusion.git
+  cd SageAttentionFusion
+  python setup.py install
+  ```
+#### Step 3: Install vllm:
+  To enable fp8 gemm kernel, you need to install vllm:
+  ```bash
+  pip install vllm==0.11.0
+  ```
+### 🤗 Download Checkpoints
+### Model Cards
+| ModelName             | Download                                                    |
+|-----------------------|-------------------------------------------------------------|
+| LiveAct               | [🤗 Huggingface](https://huggingface.co/Soul-AILab/LiveAct) |
+| chinese-wav2vec2-base |      🤗 [Huggingface](https://huggingface.co/TencentGameMate/chinese-wav2vec2-base)          |
+### 🔑 Inference
+#### Usage of LiveAct
+#### 1. Run real-time streaming inference on two H100/H200 GPUs
+```bash
+USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0,1 \
+torchrun --nproc_per_node=2 --master_port=$(shuf -n 1 -i 10000-65535)  \
+    generate.py \
+    --size 416*720 \
+    --ckpt_dir MODEL_PATH \
+    --wav2vec_dir chinese-wav2vec2-base \
+    --fps 20 \
+    --dura_print \
+    --input_json examples/example.json \
+    --steam_audio
+```
+#### 2. Run with single GPU for Eval
+```bash
+USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=7 \
+python generate.py \
+    --size 480*832 \
+    --ckpt_dir MODEL_PATH \
+    --wav2vec_dir chinese-wav2vec2-base \
+    --fps 24 \
+    --input_json examples/example.json \
+    --audio_cfg 1.7 \
+    --t5_cpu
+```
+### Command Line Arguments
+| Argument          | Type  | Required | Default | Description                                                                                   |
+|-------------------|-------|----------|---------|-----------------------------------------------------------------------------------------------|
+| `--size`          | str   | Yes      | -       | The width and height of the generated video.                                                  |
+| `--t5_cpu`        | bool  | No       | false   | Whether to place T5 model on CPU.                                                             |
+| `--offload_cache` | bool  | No       | -       | Whether to place kv cache on CPU.                                                             |
+| `--fps`           | int   | Yes      | -       | The target fps  of the generated video.                                                       |
+| `--audio_cfg`     | float | No       | 1.0     | Classifier free guidance scale for audio control.                                             |
+| `--dura_print`    | bool  | No       | no      | Whether print duration for every block.                                                       |
+| `--input_json`    | str   | Yes      | _       | The condition json file path to generate the video.                                           |
+| `--seed`          | int   | No       | 42      | The seed to use for generating the image or video.                                            |
+| `--steam_audio`   | bool  | No       | false   | Whether inference with steaming audio.                                                        |
+| `--mean_memory`   | bool  | No       | false   | Whether to use the mean memory strategy during inference for further performance improvement. |
+### 💻 GUI demo
+Run LiveAct inference on the GUI demo and evaluate real-time performance.
+<div>
+  <video controls playsInline src="./assets/demo.mp4" width="50%"></video>
+</div>
+**Note:** The first few blocks during the initial run require warm-up. Normal performance will be observed from the second run onward.
+```bash
+USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0,1 \
+torchrun --nproc_per_node=2 --master_port=$(shuf -n 1 -i 10000-65535) \
+  demo.py \
+  --ckpt_dir MODEL_PATH \
+  --wav2vec_dir chinese-wav2vec2-base \
+  --size 416*720 \
+  --video_save_path ./generated_videos
+```
+## 📚 Citation
+```bibtex
+@misc{zhen2026soulxliveacthourscalerealtimehuman,
+      title={SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory},
+      author={Dingcheng Zhen and Xu Zheng and Ruixin Zhang and Zhiqi Jiang and Yichao Yan and Ming Tao and Shunshun Yin},
+      year={2026},
+      eprint={2603.11746},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2603.11746},
+}
+```
+## 🙏 Acknowledgements
+We would like to thank the contributors to the [Transformers](https://github.com/huggingface/transformers), [Diffusers](https://github.com/huggingface/diffusers) , [HuggingFace](https://huggingface.co/) and [Qwen-VL](https://github.com/QwenLM/Qwen-VL), for their open research and exploration.

README.md ADDED Viewed

	@@ -0,0 +1,198 @@

+<div align="center">
+<img src="./assets/logo.png" alt="LiveAct Logo" width="30%">
+# SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory
+[Dingcheng Zhen*<sup>&#9993;</sup>](https://scholar.google.com/citations?user=jSLx3CcAAAAJ) · [Xu Zheng*](https://scholar.google.com/citations?user=Ii1c51QAAAAJ) · [Ruixin Zhang*](https://openreview.net/profile?id=~Ruixin_Zhang5) · [Zhiqi Jiang*](https://openreview.net/profile?id=~Zhiqi_Jiang3)
+[Yichao Yan]() · [Ming Tao]() · [Shunshun Yin]()
+</div>
+**LiveAct** presents a novel framework that enables **lifelike, multimodal-controlled, high-fidelity** human animation video generation for real-time streaming interactions.
+(I) We identify diffusion-step-aligned neighbor latents as a key inductive bias for AR diffusion, providing a principled and theoretically grounded **Neighbor Forcing** for step-consistent AR video generation.
+(II) We introduce **ConvKV Memory**, a lightweight plug-in compression mechanism that enables constant-memory hour-scale video generation with negligible overhead.
+(III) We develop an optimized real-time system that achieves **20 FPS using only two H100/H200 GPUs** with end-end adaptive FP8 precision, sequence parallelism, and communication-computation parallelism at 720×416 or 512×512 resolution.
+<div align="center">
+  <a href='http://arxiv.org/abs/2603.11746'><img src='https://img.shields.io/badge/Technical-Report-red'></a>
+  <a href='https://demopagedemo.github.io/LiveAct/'><img src='https://img.shields.io/badge/Project-Page-green'></a>
+  <a href='https://github.com/Soul-AILab/SoulX-LiveAct'><img src='https://img.shields.io/badge/Github-Home-blue'></a>
+  <a href='https://huggingface.co/Soul-AILab/LiveAct'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow'></a>
+</div>
+## 🔥🔥🔥 News
+* 👋 Mar 16, 2026: We release the inference code and model weights of LiveAct.
+## 🎥 Demo
+### 👫 Podcast
+<div>
+  <video controls playsInline src="./assets/podcast.mp4" width="40%"></video>
+</div>
+### 🎤 Music & Talk Show
+<table>
+  <tr>
+    <td><video controls playsinline width="360" src="./assets/teaser1.mp4"></video></td>
+    <td><video controls playsinline width="360" src="./assets/teaser2.mp4"></video></td>
+  </tr>
+</table>
+### 📱 FaceTime
+<table>
+  <tr>
+    <td><video controls playsinline width="360" src="./assets/1.mp4"></video></td>
+    <td><video controls playsinline width="360" src="./assets/2.mp4"></video></td>
+  </tr>
+</table>
+## 📑 Open-source Plan
+  - [x] Release inference code and checkpoints
+  - [x] GUI demo Support
+  - [x] End-end adaptive FP8 precision
+  - [ ] Support FP4 precision for B-series GPUs (e.g., RTX 5090, B100, B200)
+  - [ ] Release training code
+## ▶️ Quick Start
+### 🛠️ Dependencies and Installation
+#### Step 1: Install Basic Dependencies
+```bash
+conda create -n liveact python=3.10
+conda activate liveact
+pip install -r requirements.txt
+conda install conda-forge::sox -y
+```
+#### Step 2: Install SageAttention
+To enable fp8 attention kernel, you need to install SageAttention:
+* Install SageAttention:
+  ```bash
+  pip install sageattention==2.2.0 --no-build-isolation
+  ```
+* (Optional) Install the modified version of SageAttention:
+  To enable SageAttention for QKV communication–computation parallelism, you need to install it by the following command:
+  ```bash
+  git clone https://github.com/ZhiqiJiang/SageAttentionFusion.git
+  cd SageAttentionFusion
+  python setup.py install
+  ```
+#### Step 3: Install vllm:
+  To enable fp8 gemm kernel, you need to install vllm:
+  ```bash
+  pip install vllm==0.11.0
+  ```
+### 🤗 Download Checkpoints
+### Model Cards
+| ModelName             | Download                                                    |
+|-----------------------|-------------------------------------------------------------|
+| LiveAct               | [🤗 Huggingface](https://huggingface.co/Soul-AILab/LiveAct) |
+| chinese-wav2vec2-base |      🤗 [Huggingface](https://huggingface.co/TencentGameMate/chinese-wav2vec2-base)          |
+### 🔑 Inference
+#### Usage of LiveAct
+#### 1. Run real-time streaming inference on two H100/H200 GPUs
+```bash
+USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0,1 \
+torchrun --nproc_per_node=2 --master_port=$(shuf -n 1 -i 10000-65535)  \
+    generate.py \
+    --size 416*720 \
+    --ckpt_dir MODEL_PATH \
+    --wav2vec_dir chinese-wav2vec2-base \
+    --fps 20 \
+    --dura_print \
+    --input_json examples/example.json \
+    --steam_audio
+```
+#### 2. Run with single GPU for Eval
+```bash
+USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=7 \
+python generate.py \
+    --size 480*832 \
+    --ckpt_dir MODEL_PATH \
+    --wav2vec_dir chinese-wav2vec2-base \
+    --fps 24 \
+    --input_json examples/example.json \
+    --audio_cfg 1.7 \
+    --t5_cpu
+```
+### Command Line Arguments
+| Argument          | Type  | Required | Default | Description                                                                                   |
+|-------------------|-------|----------|---------|-----------------------------------------------------------------------------------------------|
+| `--size`          | str   | Yes      | -       | The width and height of the generated video.                                                  |
+| `--t5_cpu`        | bool  | No       | false   | Whether to place T5 model on CPU.                                                             |
+| `--offload_cache` | bool  | No       | -       | Whether to place kv cache on CPU.                                                             |
+| `--fps`           | int   | Yes      | -       | The target fps  of the generated video.                                                       |
+| `--audio_cfg`     | float | No       | 1.0     | Classifier free guidance scale for audio control.                                             |
+| `--dura_print`    | bool  | No       | no      | Whether print duration for every block.                                                       |
+| `--input_json`    | str   | Yes      | _       | The condition json file path to generate the video.                                           |
+| `--seed`          | int   | No       | 42      | The seed to use for generating the image or video.                                            |
+| `--steam_audio`   | bool  | No       | false   | Whether inference with steaming audio.                                                        |
+| `--mean_memory`   | bool  | No       | false   | Whether to use the mean memory strategy during inference for further performance improvement. |
+### 💻 GUI demo
+Run LiveAct inference on the GUI demo and evaluate real-time performance.
+<div>
+  <video controls playsInline src="./assets/demo.mp4" width="50%"></video>
+</div>
+**Note:** The first few blocks during the initial run require warm-up. Normal performance will be observed from the second run onward.
+```bash
+USE_CHANNELS_LAST_3D=1 CUDA_VISIBLE_DEVICES=0,1 \
+torchrun --nproc_per_node=2 --master_port=$(shuf -n 1 -i 10000-65535) \
+  demo.py \
+  --ckpt_dir MODEL_PATH \
+  --wav2vec_dir chinese-wav2vec2-base \
+  --size 416*720 \
+  --video_save_path ./generated_videos
+```
+## 📚 Citation
+```bibtex
+@misc{zhen2026soulxliveacthourscalerealtimehuman,
+      title={SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory},
+      author={Dingcheng Zhen and Xu Zheng and Ruixin Zhang and Zhiqi Jiang and Yichao Yan and Ming Tao and Shunshun Yin},
+      year={2026},
+      eprint={2603.11746},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2603.11746},
+}
+```
+## 🙏 Acknowledgements
+We would like to thank the contributors to the [Transformers](https://github.com/huggingface/transformers), [Diffusers](https://github.com/huggingface/diffusers) , [HuggingFace](https://huggingface.co/) and [Qwen-VL](https://github.com/QwenLM/Qwen-VL), for their open research and exploration.

assets.zip ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:03b17b75edb202cf88e7ea6e0de7b7ad46de14bd8c8b3f6ca0a7a38c48983c9d
+size 252691650

assets/.DS_Store ADDED Viewed

Binary file (6.15 kB). View file

assets/1.mp4 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:90afc7f5b2c9e9d853722e64eed159729b7287b140830fef3d5a5fca10dda9c8
+size 32768594

assets/2.mp4 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:92ec70a0fc071f42cfe1438aebd04c9ce392fee55edc47edaa1523f867ee1778
+size 45440570

assets/demo.mp4 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bc3ea4b884e9838631ee449df39521cc71b414c5f63b966e0429520341c4535b
+size 73248443

assets/logo.png ADDED Viewed

Git LFS Details

SHA256: ca0311196eab34915718a0a98aa9d3add29d896e51ef0ee3821c7631f09f4df3
Pointer size: 132 Bytes
Size of remote file: 1.4 MB

assets/podcast.mp4 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:25003d333c1e4c13359e527da2c06cce1c62bdf3318e91ab7d118e55c73c2e6e
+size 58516823

assets/teaser1.mp4 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:68de74fc9bea923ea28d663a201924df6dda85f921c88e19e14e061267f8393c
+size 16793577

assets/teaser2.mp4 ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d49627bc089f5b8bd8bea8fdd214c9ed0bd06a29f76e69983e498a1ae7ae449d
+size 25039362