arXiv:2604.07786

Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

Published on Apr 9 · Submitted by Chanhyuk Choi on Apr 13

Abstract

AI-generated summary

A novel cross-modal emotion transfer approach generates expressive talking face videos by modeling emotion semantic vectors between speech and visual feature spaces, achieving superior emotion accuracy compared to existing methods.

Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals, and can even benefit from expressive text-to-speech (TTS) synthesis, but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between the speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods while generating expressive talking face videos, even for unseen extended emotions. Code, checkpoints, and a demo are available at https://chanhyeok-choi.github.io/C-MET/
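
To make the mechanism concrete, here is a minimal PyTorch sketch of the idea the abstract describes: an emotion semantic vector is the difference between an emotional embedding and a neutral one, computed in both the speech and visual spaces and then aligned across modalities. The encoder stubs, shapes, and cosine alignment loss below are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def emotion_semantic_vector(encoder, emotional_x, neutral_x):
    # Emotion semantic vector: the difference between the embedding of an
    # emotional sample and a neutral sample with the same content.
    return encoder(emotional_x) - encoder(neutral_x)

# Hypothetical stand-ins for the paper's components (names are ours): a
# large-scale pretrained audio encoder and a disentangled facial expression
# encoder, both reduced to toy linear layers here.
audio_enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.LazyLinear(256))
face_enc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.LazyLinear(256))

emo_audio, neu_audio = torch.randn(1, 16000), torch.randn(1, 16000)
emo_face, neu_face = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)

v_audio = emotion_semantic_vector(audio_enc, emo_audio, neu_audio)
v_face = emotion_semantic_vector(face_enc, emo_face, neu_face)

# A plausible alignment objective: pull the two modalities' emotion vectors
# together so a speech-derived vector can later steer the face decoder.
align_loss = 1.0 - F.cosine_similarity(v_audio, v_face).mean()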

Community

Paper author and submitter

Talking face generation is impressive, but making the face express the desired emotion is still an open problem. Label-based methods are too coarse, audio-based methods entangle emotion with speech content, and image-based methods need hard-to-get reference photos.

C-MET solves this by learning cross-modal emotion semantic vectors that bridge the speech and visual feature spaces, so the model can transfer emotions from speech to face without needing any reference image, even for extended emotions like sarcasm that do not appear in the training data.
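
As a rough illustration of the pipeline this comment describes, here is a minimal PyTorch sketch of inference-time transfer. Every module name and shape below (audio_enc, face_enc, face_dec) and the use of a neutral reference clip are hypothetical stand-ins, not the paper's released interface; the point is only that the emotion vector is read off the driving speech alone, so no reference image is needed.

import torch

dim = 256
audio_enc = torch.nn.LazyLinear(dim)          # pretrained audio encoder (stub)
face_enc = torch.nn.LazyLinear(dim)           # disentangled expression encoder
face_dec = torch.nn.LazyLinear(3 * 64 * 64)   # toy frame decoder

emotional_speech = torch.randn(1, 16000)      # e.g., a sarcastic utterance
neutral_speech = torch.randn(1, 16000)        # flat delivery, same content
source_face = torch.randn(1, 3 * 64 * 64)     # identity frame to animate

# The emotion vector is the difference of two audio embeddings, which
# (ideally) cancels linguistic content and keeps the emotional residue.
v_emotion = audio_enc(emotional_speech) - audio_enc(neutral_speech)

z = face_enc(source_face)                     # expression code of the source
frame = face_dec(z + v_emotion)               # emotion-edited frame, with no
                                              # reference image required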

On MEAD and CREMA-D benchmarks, C-MET improves emotion accuracy by 14% over state-of-the-art methods.


Get this paper in your agent:

hf papers read 2604.07786
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0

Cite arxiv.org/abs/2604.07786 in a model, dataset, or Space README.md, or add the paper to a collection, to link it from this page.