---
pipeline_tag: text-to-video
---

# ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

ShotStream is a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. It achieves sub-second latency and 16 FPS on a single NVIDIA GPU by reformulating the task as next-shot generation conditioned on historical context.

Project Page | Paper | Code

## Introduction

Multi-shot video generation is crucial for long narrative storytelling. ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. It preserves visual coherence through a dual-cache memory mechanism and mitigates error accumulation using a two-stage distillation strategy (Distribution Matching Distillation).
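To make the causal next-shot formulation concrete, here is a minimal, purely illustrative sketch of the generation loop. All names here (`DualCache`, `generate_shot`, the string stand-ins for model outputs) are hypothetical and are not part of the actual ShotStream API; the point is only the control flow: each shot is produced conditioned on a growing dual cache of visual and semantic history, which is what lets streaming prompts steer an ongoing narrative.

```python
from dataclasses import dataclass, field

@dataclass
class DualCache:
    """Illustrative dual-cache memory: one track for visual (appearance)
    history, one for semantic (prompt/narrative) history."""
    visual: list = field(default_factory=list)
    semantic: list = field(default_factory=list)

def generate_shot(prompt: str, cache: DualCache) -> str:
    # Stand-in for the few-step diffusion model: in the real system this
    # would denoise the next shot conditioned on the cached history.
    shot = f"shot({prompt} | history={len(cache.visual)})"
    cache.visual.append(f"features:{shot}")   # cache appearance features
    cache.semantic.append(prompt)             # cache narrative context
    return shot

def stream_story(prompts):
    # Autoregressive next-shot generation: prompts can arrive one at a
    # time (streaming), and each shot sees all previously cached shots.
    cache = DualCache()
    return [generate_shot(p, cache) for p in prompts]

shots = stream_story(["a knight rides out", "the knight meets a dragon"])
print(shots)
```

Note how the second shot is generated with `history=1`: the cache, not the prompt alone, carries cross-shot coherence.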

## Usage

For the full implementation and training details, please refer to the official GitHub repository.

### 1. Environment Setup

```bash
git clone https://github.com/KlingAIResearch/ShotStream.git
cd ShotStream
# Set up the environment using the provided script
bash tools/setup/env.sh
```

### 2. Download Checkpoints

```bash
# Download the checkpoints of Wan-T2V-1.3B and ShotStream
bash tools/setup/download_ckpt.sh
```

### 3. Run Inference

To perform autoregressive long multi-shot video generation with 4-step denoising:

```bash
bash tools/inference/causal_fewsteps.sh
```

## Citation

If you find our work helpful, please cite our paper:

```bibtex
@article{luo2026shotstream,
  title={ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling},
  author={Luo, Yawen and Shi, Xiaoyu and Zhuang, Junhao and Chen, Yutian and Liu, Quande and Wang, Xintao and Wan, Pengfei and Xue, Tianfan},
  journal={arXiv preprint arXiv:2603.25746},
  year={2026}
}
```