arxiv:2512.15340

Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

Published on Dec 17

Submitted by Junjie Chen on Dec 18


Authors:

Junjie Chen ,

Abstract

TIMAR, a causal framework for 3D conversational head generation, models dialogue as interleaved audio-visual contexts and predicts continuous 3D head dynamics, improving coherence and expressive variability.

AI-generated summary

Human conversation involves continuous exchanges of speech and nonverbal cues such as head nods, gaze shifts, and facial expressions that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, which hinders temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that capture both coordination and expressive variability. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set and achieves similar gains on out-of-distribution data. The source code will be released in the GitHub repository at https://github.com/CoderChen01/towards-seamleass-interaction.
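To make "turn-level causal attention" concrete, here is a minimal sketch of one way such a mask could be built. It is an illustration under stated assumptions, not the authors' implementation: the function name `turn_causal_mask`, the token counts per turn, and the rule "attend freely within a turn, and only backwards across turns" are assumptions read from the abstract's wording. This page does not specify the actual masking scheme.

```python
from typing import List


def turn_causal_mask(turn_lengths: List[int]) -> List[List[bool]]:
    """Build a block-causal attention mask over a flattened token sequence.

    Hypothetical helper: mask[i][j] is True when token i may attend to
    token j. Tokens in the same turn see each other (within-turn fusion);
    tokens in later turns see all earlier turns (accumulated history);
    no token sees a future turn (causality at turn granularity).
    """
    # Label every token with the index of the turn it belongs to.
    turn_ids = [t for t, n in enumerate(turn_lengths) for _ in range(n)]
    # Token i may attend to token j iff j's turn is not later than i's turn.
    return [[tj <= ti for tj in turn_ids] for ti in turn_ids]


if __name__ == "__main__":
    # Two turns: the first holds 2 tokens, the second holds 3 (assumed sizes).
    for row in turn_causal_mask([2, 3]):
        print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch the first turn's two tokens attend only to each other, while the second turn's three tokens attend to all five. A token-level causal mask would instead be strictly lower-triangular, which would block the within-turn fusion the abstract describes.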


Community

coderchen01 (Paper author, Paper submitter) · 7 days ago · edited 7 days ago

Human conversation is a continuous exchange of speech and nonverbal cues, including head nods, gaze shifts, and subtle expressions. Most existing approaches, however, treat talking-head and listening-head generation as separate problems, or rely on non-causal full-sequence modeling that is unsuitable for real-time interaction.

We propose a causal, turn-level framework for interactive 3D conversational head generation. Our method models dialogue as a sequence of causally linked turns, where each turn accumulates multimodal context from both participants to produce coherent, responsive, and humanlike 3D head dynamics.

avahal · 7 days ago

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/towards-seamless-interaction-causal-turn-level-modeling-of-interactive-3d-conversational-head-dynamics-8512-6dd480bb

Executive Summary
Detailed Breakdown
Practical Applications
