arxiv:2602.19163

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

Published on Feb 22 · Submitted by KAI LIU on Feb 26

Abstract

JavisDiT++ presents a unified framework for joint audio-video generation using modality-specific mixture-of-experts, temporal-aligned RoPE, and audio-video direct preference optimization to achieve high-quality, synchronized multimedia synthesis.

AI-generated summary

AIGC has rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task that produces synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced commercial models such as Veo3, existing open-source methods still suffer from limitations in generation quality, temporal synchrony, and alignment with human preferences. To bridge this gap, this paper presents JavisDiT++, a concise yet powerful framework for the unified modeling and optimization of JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables effective cross-modal interaction while enhancing single-modal generation quality. Second, we propose a temporal-aligned RoPE (TA-RoPE) strategy to achieve explicit, frame-level synchronization between audio and video tokens. Third, we develop an audio-video direct preference optimization (AV-DPO) method to align model outputs with human preferences across the quality, consistency, and synchrony dimensions. Built upon Wan2.1-1.3B-T2V, our model achieves state-of-the-art performance with only around 1M public training entries, significantly outperforming prior approaches in both qualitative and quantitative evaluations. Comprehensive ablation studies validate the effectiveness of the proposed modules. All code, models, and datasets are released at https://JavisVerse.github.io/JavisDiT2-page.
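For the AV-DPO component, the underlying direct preference optimization objective can be sketched in its standard pairwise form. This is a generic DPO loss, not the paper's implementation; the log-likelihood values and the `dpo_loss` helper below are hypothetical, and the actual AV-DPO method may weight the quality, consistency, and synchrony dimensions differently.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO: push the policy to prefer the chosen (winning) sample
    # over the rejected one, measured relative to a frozen reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Hypothetical log-likelihoods for a preferred vs. rejected audio-video pair.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
```

Minimizing this loss increases the policy's relative likelihood of human-preferred generations, which is how preference data can improve quality and synchrony without an explicit reward model.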

Community


JavisDiT++ is a concise yet powerful DiT model that generates high-quality, synchronized sounding videos from textual conditions. Built upon the lightweight Wan2.1-1.3B-T2V backbone, JavisDiT++ addresses the key bottlenecks of joint audio-video generation (JAVG) from a unified perspective of modeling and optimization:
(1) We model JAVG via joint self-attention to enable dense inter-modal interaction, with a modality-specific MoE (MS-MoE) design to refine intra-modal representations.
(2) We propose a temporally aligned rotary position encoding (TA-RoPE) scheme to ensure explicit, fine-grained audio-video token synchronization.
(3) We devise an AV-DPO technique that consistently improves audio-video quality and synchronization by aligning generation with human preferences.
Project Page: https://javisverse.github.io/JavisDiT2-page/
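The TA-RoPE idea above can be illustrated with a small sketch: audio and video token streams run at different rates, so instead of integer token indices, rotary angles are computed from timestamps on a shared timeline, letting tokens from either modality that occur at the same instant receive identical positional encodings. The token rates, dimensions, and helper names here are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def temporal_positions(num_tokens, tokens_per_second):
    # Map each token index to a timestamp (seconds) on a shared timeline.
    return np.arange(num_tokens) / tokens_per_second

def rope_angles(positions, dim, base=10000.0):
    # Standard RoPE frequencies, evaluated at (possibly fractional)
    # temporal positions instead of integer token indices.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)  # shape: (num_tokens, dim // 2)

# Hypothetical rates: 8 video tokens/s (latent fps) vs. 25 audio tokens/s.
video_t = temporal_positions(16, 8.0)   # 2 seconds of video tokens
audio_t = temporal_positions(50, 25.0)  # 2 seconds of audio tokens

video_ang = rope_angles(video_t, dim=64)
audio_ang = rope_angles(audio_t, dim=64)

# Tokens at the same timestamp get identical rotary angles, so attention
# treats them as temporally co-located despite different stream rates.
assert np.allclose(video_ang[8], audio_ang[25])  # both at t = 1.0 s
```

Frame-level synchronization then falls out of the attention mechanism itself, since cross-modal token pairs at matching timestamps share the same rotary phase.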


Models citing this paper 2

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 1