---
license: apache-2.0
---
# Perception Encoder Audio-Visual (PE-AV)

PE-AV is a state-of-the-art multimodal model that embeds audio, video, audio-video, and text into a joint embedding space, enabling powerful cross-modal retrieval and understanding across these modalities.

## Model Description

PE-AV is trained using contrastive learning to align audio, video, and text representations in a shared embedding space. The model can encode:
- **Audio only**: Extract audio embeddings from audio waveforms
- **Video only**: Extract visual embeddings from video frames
- **Audio-Video**: Extract joint audio-visual embeddings
- **Text**: Extract text embeddings optimized for different modality pairs

## Model Variants

We release six model checkpoints with varying sizes and capabilities:

| Model | Avg. retrieval | Video frames used |
|-------|---------------|-------------------|
| [`pe-av-small-16-frame`](https://huggingface.co/facebook/pe-av-small-16-frame) | 45.2 | 16 frames |
| [`pe-av-base-16-frame`](https://huggingface.co/facebook/pe-av-base-16-frame) | 47.0 | 16 frames |
| [`pe-av-large-16-frame`](https://huggingface.co/facebook/pe-av-large-16-frame) | 48.2 | 16 frames |
| [`pe-av-small`](https://huggingface.co/facebook/pe-av-small) | 48.1 | all frames |
| [`pe-av-base`](https://huggingface.co/facebook/pe-av-base) | 50.2 | all frames |
| [`pe-av-large`](https://huggingface.co/facebook/pe-av-large) | 51.6 | all frames |

The `-16-frame` variants sample exactly 16 evenly spaced frames from each video, while the unsuffixed variants use all frames and support variable-length videos.
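
For intuition, evenly spaced sampling simply picks 16 frame indices at uniform intervals across the clip. A minimal sketch of the idea (for illustration only; this is not the model's internal sampler):

```python
import torch

def sample_frame_indices(num_video_frames: int, num_samples: int = 16) -> torch.Tensor:
    # Spread `num_samples` indices uniformly over [0, num_video_frames - 1]
    return torch.linspace(0, num_video_frames - 1, num_samples).long()

# A 300-frame clip yields tensor([  0,  19,  39, ..., 279, 299])
print(sample_frame_indices(300))
```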

## Quick Start

The model is available in both the [`transformers`](https://github.com/huggingface/transformers/tree/main) and [`perception_models`](https://github.com/facebookresearch/perception_models/tree/main) libraries.

## `perception_models` Usage

```python
import torch
from core.audio_visual_encoder import PEAudioVisual, PEAudioVisualTransform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and transform
model = PEAudioVisual.from_config("pe-av-large", pretrained=True).to(device)
transform = PEAudioVisualTransform.from_config("pe-av-large")

video_files = ["video1.mp4", "video2.mp4"]
descriptions = ["description1", "description2"]
audio_files = ["audio1.wav", "audio2.wav"]

# Process inputs and get embeddings
inputs = transform(videos=video_files, text=descriptions, audio=audio_files).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs)

# Access different embeddings
audio_embeds = outputs.audio_embeds  # Audio-only embeddings
visual_embeds = outputs.visual_embeds  # Video-only embeddings
audio_visual_embeds = outputs.audio_visual_embeds  # Joint audio-visual embeddings
audio_text_embeds = outputs.audio_text_embeds  # Text embeddings aligned to audio
visual_text_embeds = outputs.visual_text_embeds  # Text embeddings aligned to video
audio_visual_text_embeds = outputs.audio_visual_text_embeds  # Text embeddings aligned to audio-visual
audio_plus_text_embeds = outputs.audio_plus_text_embeds  # Joint audio and text embeddings
visual_plus_text_embeds = outputs.visual_plus_text_embeds  # Joint video and text embeddings

# Compute the dot product to get their similarities
audio_visual_similarity = audio_embeds @ visual_embeds.T
# When computing similarity against text embeddings, use the
# appropriate text embedding based on the other modality
audio_text_similarity = audio_embeds @ audio_text_embeds.T
video_text_similarity = visual_embeds @ visual_text_embeds.T
```
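
These similarity matrices can be used directly for retrieval, e.g. picking the best-matching video for each text query. A small sketch reusing the variables above:

```python
# Text-to-video retrieval: row i scores description i against every video
scores = visual_text_embeds @ visual_embeds.T
best = scores.argmax(dim=-1)
for query, video_idx in zip(descriptions, best.tolist()):
    print(f"{query!r} -> {video_files[video_idx]}")
```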

Note that you can omit any of the modalities and still use the same `forward` method; the corresponding embeddings in `outputs` will be `None`. For example:

```python
inputs = transform(videos=video_files, text=descriptions).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs)

audio_embeds = outputs.audio_embeds  # None
visual_embeds = outputs.visual_embeds  # available
audio_visual_embeds = outputs.audio_visual_embeds  # None
audio_visual_text_embeds = outputs.audio_visual_text_embeds  # None
audio_text_embeds = outputs.audio_text_embeds  # None
visual_text_embeds = outputs.visual_text_embeds  # available
audio_plus_text_embeds = outputs.audio_plus_text_embeds  # None
visual_plus_text_embeds = outputs.visual_plus_text_embeds  # available
```

We also provide methods for directly encoding an individual modality:

```python
def encode_video_text(self, input_ids, attention_mask=None)
def encode_audio_text(self, input_ids, attention_mask=None)
def encode_audio_video_text(self, input_ids, attention_mask=None)
def encode_audio(self, input_values, padding_mask=None, input_features=None)
def encode_video(self, pixel_values_videos, padding_mask_videos=None, pe_features=None)
def encode_audio_video(
    self,
    input_values,
    pixel_values_videos,
    padding_mask=None,
    padding_mask_videos=None,
    pe_features=None,  # optionally re-use pre-computed PE features
    input_features=None,  # optionally re-use pre-computed audio codec features
)
def encode_audio_plus_text(
    self,
    input_ids,
    input_values,
    attention_mask=None,
    padding_mask=None,
    input_features=None  # optionally re-use pre-computed audio codec features
)
def encode_video_plus_text(
    self,
    input_ids,
    pixel_values_videos,
    attention_mask=None,
    padding_mask_videos=None,
    pe_features=None,  # optionally re-use pre-computed PE features
)
```
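
For example, video-only embeddings can be computed by running the transform on videos alone and passing the resulting tensors to `encode_video`. A minimal sketch, assuming the transform emits tensors under the `pixel_values_videos` and `padding_mask_videos` keys named in the signatures above:

```python
video_inputs = transform(videos=video_files).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    # Key names are assumed to match the encode_video signature above
    visual_embeds = model.encode_video(
        video_inputs["pixel_values_videos"],
        padding_mask_videos=video_inputs.get("padding_mask_videos"),
    )
```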

## `transformers` Usage

```python
from transformers import PeAudioVideoModel, PeAudioVideoProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PeAudioVideoModel.from_pretrained("facebook/pe-av-large")
processor = PeAudioVideoProcessor.from_pretrained("facebook/pe-av-large")

model = model.to(device)

video_files = ["video1.mp4", "video2.mp4"]
descriptions = ["description1", "description2"]
audio_files = ["audio1.wav", "audio2.wav"]

# Process inputs and get embeddings
inputs = processor(
    videos=video_files, text=descriptions, audio=audio_files, return_tensors="pt", padding=True
)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs.to(device), return_loss=True)

audio_embeds = outputs.audio_embeds  # Audio-only embeddings
video_embeds = outputs.video_embeds  # Video-only embeddings
audio_video_embeds = outputs.audio_video_embeds  # Joint audio-video embeddings
text_audio_video_embeds = outputs.audio_video_text_embeds  # Text embeddings aligned to audio-video
text_audio_embeds = outputs.text_audio_embeds  # Text embeddings aligned to audio
text_video_embeds = outputs.text_video_embeds  # Text embeddings aligned to video
audio_plus_text_embeds = outputs.audio_plus_text_embeds  # Joint audio and text embeddings
video_plus_text_embeds = outputs.video_plus_text_embeds  # Joint video and text embeddings

# For classification, you can use the logits_* fields of the output
audio_text_preds = outputs.logits_audio_text.sigmoid()

# The overall loss is also available in the output (requires passing return_loss=True)
loss = outputs.loss
```

We also provide methods for directly encoding an individual modality:

```python
def get_text_audio_embeds(self, input_ids, attention_mask=None)

def get_text_video_embeds(self, input_ids, attention_mask=None)

def get_text_audio_video_embeds(self, input_ids, attention_mask=None)

def get_audio_embeds(self, input_values, padding_mask=None)

def get_video_embeds(self, pixel_values_videos, padding_mask_videos=None)

def get_audio_video_embeds(
    self,
    input_values: torch.Tensor,
    pixel_values_videos: torch.Tensor,
    padding_mask: Optional[torch.Tensor] = None,
    padding_mask_videos: Optional[torch.Tensor] = None,
    return_audio_embeds: bool = False,
    return_video_embeds: bool = False,
)

def get_audio_plus_text_embeds(
    self,
    input_ids: torch.Tensor,
    input_values: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    padding_mask: Optional[torch.Tensor] = None,
)

def get_video_plus_text_embeds(
    self,
    input_ids: torch.Tensor,
    pixel_values_videos: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    padding_mask_videos: Optional[torch.Tensor] = None,
)
```
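
For example, audio-only embeddings can be computed by running the processor on audio alone. A minimal sketch, assuming the processor emits tensors under the `input_values` and `padding_mask` keys named in the signatures above:

```python
audio_inputs = processor(audio=audio_files, return_tensors="pt", padding=True).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    # Key names are assumed to match the get_audio_embeds signature above
    audio_embeds = model.get_audio_embeds(
        audio_inputs["input_values"],
        padding_mask=audio_inputs.get("padding_mask"),
    )
```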

## Citation

```bibtex
@misc{vyas2025pushingfrontieraudiovisualperception,
      title={Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning},
      author={Apoorv Vyas and Heng-Jui Chang and Cheng-Fu Yang and Po-Yao Huang and Luya Gao and Julius Richter and Sanyuan Chen and Matt Le and Piotr Dollár and Christoph Feichtenhofer and Ann Lee and Wei-Ning Hsu},
      year={2025},
      eprint={2512.19687},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2512.19687},
}
```

## License

This model is released under the Apache 2.0 license.