OmniForcing: Unleashing Real-time Joint Audio-Visual Generation
Abstract
OmniForcing distills a dual-stream bidirectional diffusion model into a streaming autoregressive generator while addressing training instability and synchronization issues through asymmetric alignment and specialized token mechanisms.
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher. Project Page: https://omniforcing.com
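The Asymmetric Block-Causal Alignment described above can be pictured as an attention mask: a global prefix visible to every token, full bidirectional attention inside each paired video/audio block, and causal attention across blocks. The sketch below is an illustrative reconstruction under assumed token layout and block sizes, not the authors' implementation; `block_causal_mask` and its parameters are hypothetical names.

```python
import numpy as np

def block_causal_mask(n_blocks, video_per_block, audio_per_block, prefix_len):
    """Return a boolean attention mask (True = query may attend to key).

    Assumed token order: [global prefix | V0 A0 | V1 A1 | ...], where block t
    pairs video_per_block video tokens with audio_per_block audio tokens
    (asymmetric density: audio_per_block << video_per_block).
    """
    block = video_per_block + audio_per_block
    n = prefix_len + n_blocks * block
    # Block index of each token; the prefix gets index -1 so it counts as
    # "earlier" than every block and is therefore visible to all tokens.
    idx = np.full(n, -1)
    for t in range(n_blocks):
        start = prefix_len + t * block
        idx[start:start + block] = t
    # A query attends to a key iff the key's block is not in the future:
    # bidirectional inside a block, causal across blocks.
    return idx[None, :] <= idx[:, None]
```

With this layout, video and audio tokens of the same time block see each other bidirectionally (enabling synchronization), while no token can peek at future blocks, which is what makes streaming generation possible.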
Community
Hi everyone! Joint audio-visual generation models like LTX-2 and Veo 3 can produce stunning synchronized video and audio from text, but they require minutes of offline processing (e.g., 197s for a 5-second clip) due to bidirectional full-sequence attention — real-time or interactive use is simply out of reach. We present OmniForcing, the first framework to enable real-time streaming for general text-to-audio-visual (T2AV) generation, by distilling a heavy bidirectional dual-stream model into a causal autoregressive engine. OmniForcing achieves ~25 FPS on a single GPU with a first-chunk latency of only ~0.7s — a ~35× speedup — while preserving both visual and acoustic fidelity on par with the teacher across nearly all dimensions on JavisBench. Unlike prior streaming works that are limited to video-only, OmniForcing jointly streams synchronized audio and video, opening the door to truly interactive multi-modal generation. Project page with playable demos: https://omniforcing.com — code and weights coming in two weeks, https://github.com/OmniForcing/OmniForcing!
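The "modality-independent rolling KV-cache" mentioned above is easiest to see with a toy sketch: each stream keeps its own fixed-capacity window, so the few audio entries are never evicted by the far more numerous video entries. This is a minimal illustration under assumed capacities and interface, not the released code.

```python
from collections import deque

class RollingKVCache:
    """Fixed-capacity key/value window; oldest entries roll off on append."""

    def __init__(self, capacity):
        self.k = deque(maxlen=capacity)
        self.v = deque(maxlen=capacity)

    def append(self, key, value):
        self.k.append(key)
        self.v.append(value)

# One independent cache per modality (capacities here are made up):
caches = {
    "video": RollingKVCache(capacity=1024),  # dense video tokens
    "audio": RollingKVCache(capacity=64),    # much sparser audio tokens
}
```

Keeping the windows separate means the attention context for each modality stays bounded (constant memory and per-step compute), which is what sustains a steady frame rate during long rollouts.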
The most interesting bit for me is the Audio Sink Token with Identity RoPE, which seems to be the lever that keeps gradients sane when you shift from dense video tokens to sparse audio tokens. It's clever because it effectively widens the attention denominator for the early audio steps without inflating compute, a neat way to handle token sparsity in streaming cross-modal generation. I'd love to see an ablation on how critical the sink tokens are compared to a simple scaling or masking trick: does removing them break stability under long rollouts? BTW, the arxivlens walkthrough helped me parse the method details, especially the parts about block-causal alignment and rolling KV-caches (https://arxivlens.com/PaperView/Details/omniforcing-unleashing-real-time-joint-audio-visual-generation-8595-7ffcf122). If this approach holds up across longer sessions and different content styles, it could be a practical path to true real-time multimodal AI.
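The "widens the attention denominator" intuition in the comment above can be made concrete: a learned sink key/value is added alongside the real audio keys, and because it is excluded from RoPE (equivalently, rotated by the identity), its logit is position-independent. When only a handful of real audio keys exist, the sink absorbs probability mass and bounds the softmax weights on real tokens. This is a toy sketch of that intuition, with made-up names and shapes, not the paper's code.

```python
import numpy as np

def attend_with_sink(q, keys, values, sink_k, sink_v):
    """Single-query attention with one learned sink key/value prepended.

    q: (d,)  keys, values: (n, d)  sink_k, sink_v: (d,)
    The sink logit q @ sink_k gets no positional rotation, so it contributes
    a position-independent term to the softmax denominator.
    """
    logits = np.concatenate([[q @ sink_k], keys @ q])
    w = np.exp(logits - logits.max())
    w = w / w.sum()                 # the sink widens this denominator
    out = w[0] * sink_v + w[1:] @ values
    return out, w
```

With only one real audio key present, the real key's attention weight stays strictly below 1, which is the damping effect the commenter is pointing at; without the sink, a lone key would always receive weight 1 regardless of its logit.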
