arxiv:2603.14331

AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising

Published on Mar 15
Authors: Liyuan Cui, Wentao Hu, Wenyuan Zhang, Zesong Yang, Fan Shi, Xiaoqiang Liu

Abstract

Real-time talking avatar generation requires low latency and minute-level temporal stability. Autoregressive (AR) forcing enables streaming inference but suffers from exposure bias, which causes errors to accumulate and become irreversible over long rollouts. In contrast, full-sequence diffusion transformers mitigate drift but remain computationally prohibitive for real-time long-form synthesis. We present AvatarForcing, a one-step streaming diffusion framework that denoises a fixed local-future window with heterogeneous noise levels and emits one clean block per step under constant per-step cost. To stabilize unbounded streams, the method introduces dual-anchor temporal forcing: a style anchor that re-indexes RoPE to maintain a fixed relative position with respect to the active window and applies anchor-audio zero-padding, and a temporal anchor that reuses recently emitted clean blocks to ensure smooth transitions. Real-time one-step inference is enabled by two-stage streaming distillation with offline ODE backfill and distribution matching. Experiments on standard benchmarks and a new 400-video long-form benchmark show strong visual quality and lip synchronization at 34 ms per frame using a 1.3B-parameter student model for real-time streaming. Our page is available at: https://cuiliyuan121.github.io/AvatarForcing/
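The windowed schedule described above can be sketched as a toy simulation: each step, one forward pass denoises a fixed window of W local-future blocks held at heterogeneous noise levels, emits the front block (now clean), and appends a fresh fully-noised block at the back, so per-step cost is constant. Everything below (the deque layout, the linear stand-in "denoiser", all variable names) is our illustrative assumption, not the paper's model:

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
W = 4          # local-future window size, in blocks
D = 8          # toy latent dimension per block
T = 6          # number of clean blocks to emit

# Heterogeneous noise levels across the window: position 0 (about to be
# emitted) sits at the lowest level, position W-1 at the highest.
levels = np.arange(1, W + 1)                      # [1, 2, 3, 4]

def denoise_one_level(block, level):
    # Stand-in for one network pass that removes exactly one noise level;
    # here it just shrinks the block toward the "clean" signal (zero).
    return block * (level - 1) / level

# Initialize the window with noise matched to each position's level.
window = deque(rng.normal(0.0, lvl, D) for lvl in levels)

emitted = []
for _ in range(T):
    # One forward pass denoises every block in the window by one level.
    window = deque(denoise_one_level(b, lvl) for b, lvl in zip(window, levels))
    emitted.append(window.popleft())               # front block is now clean
    window.append(rng.normal(0.0, levels[-1], D))  # refill with fresh noise
```

With the toy denoiser, every emitted block reaches the clean signal exactly, and the window length (hence per-step compute) never changes as the stream grows.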

Community

AvatarForcing

Real-time Long-Form Talking Avatars

🚨 Problem

Current streaming avatar generation faces two major issues:

  • Long-horizon drift: identity, color, and motion gradually become unstable over time
  • Efficiency bottleneck: stronger diffusion-based generation is often too slow for real-time streaming

💡 Our Strategy

AvatarForcing addresses these issues with three key ideas:

  • Sliding-window one-step denoising
    Jointly denoise a fixed local-future window instead of generating strictly frame by frame

  • Dual-anchor stabilization
    Use a style anchor for identity consistency and a temporal anchor for smooth motion continuity

  • Two-stage distillation
    Enable practical one-step streaming inference with low latency
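The dual-anchor idea can be made concrete with a small position-indexing sketch. The layout and all names (`OFFSET`, `position_ids`) are our assumptions for illustration, not the paper's code: the style anchor's RoPE index is re-assigned every step so it keeps a fixed relative position to the sliding window, while the temporal anchor reuses the absolute positions of the most recently emitted clean blocks:

```python
W = 4        # active window, in blocks
A = 2        # temporal anchor: number of reused clean blocks
OFFSET = 8   # fixed relative gap between style anchor and window start

def position_ids(step):
    """RoPE indices for [style anchor | temporal anchor | active window]."""
    win_start = step                                # window slides one block per step
    window = list(range(win_start, win_start + W))
    # Temporal anchor: recently emitted clean blocks keep their positions.
    temporal = list(range(max(0, win_start - A), win_start))
    # Style anchor: re-indexed each step to a constant offset from the window.
    style = [win_start - OFFSET]
    return style + temporal + window

print(position_ids(0))   # [-8, 0, 1, 2, 3]
print(position_ids(5))   # [-3, 3, 4, 5, 6, 7, 8]
```

Because RoPE attention depends only on relative positions, pinning the style anchor at a constant offset means the identity reference is attended to the same way no matter how far the stream has advanced.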


✨ What It Brings

  • Real-time generation
  • Better long-form stability
  • Stronger identity preservation
  • Improved audio-visual synchronization

🔗 Links

Project Page • Paper




📌 Citation

@article{cui2026avatarforcing,
  title={AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising},
  author={Liyuan Cui and Wentao Hu and Wenyuan Zhang and Zesong Yang and Fan Shi and Xiaoqiang Liu},
  journal={arXiv preprint},
  year={2026}
}

