---
pipeline_tag: text-to-video
license: apache-2.0
language:
  - en
base_model:
  - Wan-AI/Wan2.1-T2V-1.3B
---

# ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

ShotStream is a causal multi-shot architecture for interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, it achieves sub-second latency and 16 FPS on a single NVIDIA GPU.

Project Page | Paper | Code

## Introduction

Multi-shot video generation is crucial for long narrative storytelling. ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. It preserves visual coherence through a dual-cache memory mechanism and mitigates error accumulation using a two-stage self-forcing distillation strategy (Distribution Matching Distillation).
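To make the streaming formulation concrete, here is a minimal, hypothetical Python sketch of next-shot generation with a dual-cache memory. The class and function names (`DualCache`, `generate_next_shot`, `stream_story`) are illustrative assumptions, not the actual ShotStream API, and a string stands in for the model's denoising of a video shot:

```python
from collections import deque

def generate_next_shot(prompt, context):
    # Stand-in for the few-step diffusion model; a real implementation
    # would condition on the cached shots in `context`.
    return f"shot({prompt})"

class DualCache:
    """Hypothetical dual-cache memory: a short-term cache of recent
    shots plus a long-term anchor (e.g. the first shot) that preserves
    global visual coherence across the narrative."""
    def __init__(self, short_len=2):
        self.short = deque(maxlen=short_len)  # most recent shots
        self.anchor = None                    # long-term reference

    def update(self, shot):
        if self.anchor is None:
            self.anchor = shot
        self.short.append(shot)

    def context(self):
        return ([self.anchor] if self.anchor else []) + list(self.short)

def stream_story(prompts):
    # Streaming prompts arrive one at a time; each shot is generated
    # conditioned on the dual-cache context, then added to it.
    cache = DualCache()
    shots = []
    for prompt in prompts:
        shot = generate_next_shot(prompt, cache.context())
        cache.update(shot)
        shots.append(shot)
    return shots
```

The point of the sketch is the control flow: generation is autoregressive over shots, and the bounded short-term cache keeps per-shot cost constant while the anchor limits drift.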

## Usage

Training and inference code, as well as the models, are all released. For the full implementation and training details, please refer to the official GitHub repository.

### 1. Environment Setup

```bash
git clone https://github.com/KlingAIResearch/ShotStream.git
cd ShotStream
# Set up the environment using the provided script
bash tools/setup/env.sh
```

### 2. Download Checkpoints

```bash
# Download the Wan2.1-T2V-1.3B and ShotStream checkpoints
bash tools/setup/download_ckpt.sh
```

### 3. Run Inference

To generate a long multi-shot video autoregressively with 4-step sampling:

```bash
bash tools/inference/causal_fewsteps.sh
```
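For intuition on what "4-step" sampling means here, the toy sketch below runs a fixed four-iteration denoising loop from pure noise. The linear update and zero target are placeholders for a distilled diffusion step, not the ShotStream sampler; all names are illustrative assumptions:

```python
import numpy as np

def denoise_step(x, t):
    # Stand-in for one distilled denoising step: move x a fraction
    # of the way toward a (toy) clean target of zeros.
    target = np.zeros_like(x)
    return x + (target - x) / t

def four_step_sample(shape, steps=4, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # start from pure noise
    for t in range(steps, 0, -1):   # a fixed budget of 4 steps
        x = denoise_step(x, t)
    return x
```

The practical point is that distillation fixes the sampling budget at a handful of steps, which is what makes streaming, on-the-fly shot generation feasible.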

## Citation

If you find our work helpful, please cite our paper:

```bibtex
@article{luo2026shotstream,
  title={ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling},
  author={Luo, Yawen and Shi, Xiaoyu and Zhuang, Junhao and Chen, Yutian and Liu, Quande and Wang, Xintao and Wan, Pengfei and Xue, Tianfan},
  journal={arXiv preprint arXiv:2603.25746},
  year={2026}
}
```