arxiv:2603.15614

Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion

Published on Mar 16 · Submitted by Zhenghong Zhou on Mar 17
Abstract

Tri-Prompting presents a unified framework for video diffusion that enables joint control of scene composition, multi-view subject consistency, and motion, achieving superior performance in identity preservation and 3D consistency.

AI-generated summary

Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video generation. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To balance controllability and visual realism, we further propose an inference-time ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into arbitrary scenes and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.
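The abstract does not give the exact form of the inference-time ControlNet scale schedule; below is a minimal sketch of one plausible variant, assuming a cosine decay of the conditioning scale across denoising steps (the function name, default scales, and decay shape are all illustrative assumptions, not the paper's method):

```python
import math

def controlnet_scale_schedule(step: int, num_steps: int,
                              start_scale: float = 1.0,
                              end_scale: float = 0.2) -> float:
    """Cosine decay of the ControlNet conditioning scale over denoising steps.

    Early (noisy) steps follow the control signal closely, favoring
    controllability; later steps relax toward the base model, favoring
    visual realism.
    """
    t = step / max(num_steps - 1, 1)          # normalized progress in [0, 1]
    cos_w = 0.5 * (1.0 + math.cos(math.pi * t))  # decays smoothly 1 -> 0
    return end_scale + (start_scale - end_scale) * cos_w

# Example: per-step scales over a 5-step denoising run
scales = [controlnet_scale_schedule(s, 5) for s in range(5)]
```

The intuition is simply that a fixed, high conditioning scale can leave control artifacts in the final frames, so decaying it trades late-step control strength for realism.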

Community

🎬 Tri-Prompting: Scene (where), Subject (who), and Motion (how), unified at last!

Current video diffusion models often struggle with fine-grained, joint control. We introduce Tri-Prompting, a unified framework that enables simultaneous control over scene composition, multi-view subject consistency, and motion.

Key Highlights:
🔹 Unified Control: Jointly manages scene, subject, and motion in one model.
🔹 Dual-Conditioning & Multi-View Subject Consistency: Separates foreground/background motion cues while preserving identity across views.
🔹 3D-Aware Applications & Strong Results: Enables multi-view subject insertion and manipulation, with competitive performance against DaS and Phantom.
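The dual-conditioning highlight above can be sketched as assembling two separate condition streams before the motion module: a sparse map of 3D tracking points for the background scene, and a downsampled RGB image for the foreground subject. The shapes, names, and rasterization scheme below are illustrative assumptions, not the paper's actual interface:

```python
import numpy as np

def build_dual_conditions(track_points_3d, subject_rgb, out_hw=(32, 32)):
    """Package background and foreground cues as separate condition maps.

    track_points_3d: (N, 3) array of (x, y, z) with x, y normalized to [0, 1],
                     representing 3D tracking points for the background scene.
    subject_rgb:     (H, W, 3) float image of the foreground subject.
    Returns a dict with one entry per stream, as a dual-condition motion
    module might consume them.
    """
    h, w = out_hw

    # Background stream: rasterize tracked points into a sparse xyz map.
    bg = np.zeros((h, w, 3), dtype=np.float32)
    for x, y, z in track_points_3d:
        i = min(int(y * h), h - 1)
        j = min(int(x * w), w - 1)
        bg[i, j] = (x, y, z)

    # Foreground stream: naive strided downsampling of the subject RGB image.
    H, W, _ = subject_rgb.shape
    fg = subject_rgb[:: max(H // h, 1), :: max(W // w, 1)][:h, :w]

    return {"background": bg, "foreground": fg.astype(np.float32)}
```

Keeping the two streams separate is what lets the module drive background motion from geometry while preserving subject identity from appearance cues.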

🔗 Demos: https://zhouzhenghong-gt.github.io/Tri-Prompting-Page/
🔗 Paper: https://arxiv.org/abs/2603.15614

