lematt1991 committed
Commit d4c04ed · verified · 1 Parent(s): 7a902bb

Upload README.md with huggingface_hub

Files changed (1): README.md (+166 -54)

---
license: apache-2.0
---
# Perception Encoder Audio-Visual (PE-AV)

PE-AV is a family of state-of-the-art multimodal encoders that embed audio, video, audio-video, and text into a joint embedding space, enabling powerful cross-modal retrieval and understanding across all three modalities.

## Model Description

PE-AV is built on top of the [PE image/video encoder](https://arxiv.org/abs/2504.13181) and trained with scaled contrastive learning to align audio, video, and text representations in a shared embedding space. The model can encode:
- **Audio only**: Extract audio embeddings from audio waveforms
- **Video only**: Extract visual embeddings from video frames
- **Audio-Video**: Extract joint audio-visual embeddings
- **Text**: Extract text embeddings optimized for each modality pair

## Model Variants

We release six model checkpoints with varying sizes and capabilities:

| Model | Layers | Params | Avg. retrieval | Video frames used |
|-------|--------|--------|----------------|-------------------|
| [`pe-av-small-16-frame`](https://huggingface.co/facebook/pe-av-small-16-frame) | 12 | 209M | 45.2 | 16 frames |
| [`pe-av-base-16-frame`](https://huggingface.co/facebook/pe-av-base-16-frame) | 16 | 396M | 47.0 | 16 frames |
| [`pe-av-large-16-frame`](https://huggingface.co/facebook/pe-av-large-16-frame) | 28 | 1.597B | 48.2 | 16 frames |
| [`pe-av-small`](https://huggingface.co/facebook/pe-av-small) | 12 | 209M | 48.1 | all frames |
| [`pe-av-base`](https://huggingface.co/facebook/pe-av-base) | 16 | 396M | 50.2 | all frames |
| [`pe-av-large`](https://huggingface.co/facebook/pe-av-large) | 28 | 1.597B | 51.6 | all frames |

The `-16-frame` variants sample exactly 16 evenly spaced frames from each video; the other variants encode all frames and support variable-length videos.
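
As an illustration, evenly spaced sampling amounts to picking 16 uniformly spaced frame indices. The snippet below is a minimal sketch with a hypothetical helper (the released transforms/processors handle this internally, so you normally do not need to do it yourself):

```python
import torch

def sample_frame_indices(num_frames_in_video: int, num_samples: int = 16) -> torch.Tensor:
    # Illustrative only: return `num_samples` evenly spaced frame indices
    # for a video containing `num_frames_in_video` frames.
    return torch.linspace(0, num_frames_in_video - 1, steps=num_samples).round().long()

# e.g. for a 300-frame video -> tensor([  0,  20,  40, ..., 279, 299])
print(sample_frame_indices(300))
```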
 
 
## Quick Start

The model is available in both the [`transformers`](https://github.com/huggingface/transformers/tree/main) and [`perception_models`](https://github.com/facebookresearch/perception_models/tree/main) libraries.

## `perception_models` Usage

```python
import torch
from core.audio_visual_encoder import PEAudioVisual, PEAudioVisualTransform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and transform
model = PEAudioVisual.from_config("pe-av-large", pretrained=True).to(device)
transform = PEAudioVisualTransform.from_config("pe-av-large")

video_files = ["video1.mp4", "video2.mp4"]
descriptions = ["description1", "description2"]
audio_files = ["audio1.wav", "audio2.wav"]

# Process inputs and get embeddings
inputs = transform(videos=video_files, text=descriptions, audio=audio_files).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs)

# Access different embeddings
audio_embeds = outputs.audio_embeds  # Audio-only embeddings
visual_embeds = outputs.visual_embeds  # Video-only embeddings
audio_visual_embeds = outputs.audio_visual_embeds  # Joint audio-visual embeddings
audio_text_embeds = outputs.audio_text_embeds  # Text embeddings aligned to audio
visual_text_embeds = outputs.visual_text_embeds  # Text embeddings aligned to video
audio_visual_text_embeds = outputs.audio_visual_text_embeds  # Text embeddings aligned to audio-visual
audio_plus_text_embeds = outputs.audio_plus_text_embeds  # Joint audio and text embedding
visual_plus_text_embeds = outputs.visual_plus_text_embeds  # Joint video and text embedding

# Compute the dot product to get their similarities
audio_visual_similarity = audio_embeds @ visual_embeds.T
# When computing similarity against text embeddings, use the
# appropriate text embedding based on the other modality
audio_text_similarity = audio_embeds @ audio_text_embeds.T
video_text_similarity = visual_embeds @ visual_text_embeds.T
```
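
For text-to-video retrieval, the similarity matrix can simply be ranked. The following is a minimal sketch that reuses the variables from the block above:

```python
# `video_text_similarity` has shape (num_videos, num_texts):
# for each text description, pick the best-matching video.
best_video_per_text = video_text_similarity.argmax(dim=0)
for text_idx, desc in enumerate(descriptions):
    video_idx = best_video_per_text[text_idx].item()
    print(f"{desc!r} -> {video_files[video_idx]}")
```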

Note that you can omit any of the modalities and use the same `forward` method; the corresponding embeddings in the output will be `None`. For example:

```python
inputs = transform(videos=video_files, text=descriptions).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs)

audio_embeds = outputs.audio_embeds                          # None
visual_embeds = outputs.visual_embeds                        # available
audio_visual_embeds = outputs.audio_visual_embeds            # None
audio_visual_text_embeds = outputs.audio_visual_text_embeds  # None
audio_text_embeds = outputs.audio_text_embeds                # None
visual_text_embeds = outputs.visual_text_embeds              # available
audio_plus_text_embeds = outputs.audio_plus_text_embeds      # None
visual_plus_text_embeds = outputs.visual_plus_text_embeds    # available
```

We also provide methods for directly encoding an individual modality:

```python
def encode_video_text(self, input_ids, attention_mask=None)
def encode_audio_text(self, input_ids, attention_mask=None)
def encode_audio_video_text(self, input_ids, attention_mask=None)
def encode_audio(self, input_values, padding_mask=None, input_features=None)
def encode_video(self, pixel_values_videos, padding_mask_videos=None, pe_features=None)
def encode_audio_video(
    self,
    input_values,
    pixel_values_videos,
    padding_mask=None,
    padding_mask_videos=None,
    pe_features=None,     # Optionally re-use pre-computed PE features
    input_features=None,  # Optionally re-use pre-computed audio codec features
)
def encode_audio_plus_text(
    self,
    input_ids,
    input_values,
    attention_mask=None,
    padding_mask=None,
    input_features=None,  # Optionally re-use pre-computed audio codec features
)
def encode_video_plus_text(
    self,
    input_ids,
    pixel_values_videos,
    attention_mask=None,
    padding_mask_videos=None,
    pe_features=None,     # Optionally re-use pre-computed PE features
)
```
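
For instance, to embed audio on its own, something like the following should work. This is a sketch that assumes the transform accepts a single modality and returns the keyword arguments expected by `encode_audio` (e.g. `input_values` and `padding_mask`):

```python
# Assumption: transform(audio=...) yields the inputs expected by encode_audio.
audio_inputs = transform(audio=audio_files).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    audio_embeds = model.encode_audio(**audio_inputs)
```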

## `transformers` Usage
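
If your installed `transformers` release does not yet include the PE audio-video classes, install from source:

```bash
pip install git+https://github.com/huggingface/transformers
```

For more information, check the documentation [here](https://huggingface.co/docs/transformers/main/en/model_doc/pe_audio_video).
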
```python
from transformers import PeAudioVideoModel, PeAudioVideoProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PeAudioVideoModel.from_pretrained("facebook/pe-av-large")
processor = PeAudioVideoProcessor.from_pretrained("facebook/pe-av-large")

model = model.to(device)

video_files = ["video1.mp4", "video2.mp4"]
descriptions = ["description1", "description2"]
audio_files = ["audio1.wav", "audio2.wav"]

# Process inputs and get embeddings
inputs = processor(
    videos=video_files, text=descriptions, audio=audio_files, return_tensors="pt", padding=True
)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs.to(device), return_loss=True)

audio_embeds = outputs.audio_embeds  # Audio-only embeddings
video_embeds = outputs.video_embeds  # Video-only embeddings
audio_video_embeds = outputs.audio_video_embeds  # Joint audio-video embeddings
text_audio_video_embeds = outputs.audio_video_text_embeds  # Text embeddings aligned to audio-video
text_audio_embeds = outputs.text_audio_embeds  # Text embeddings aligned to audio
text_video_embeds = outputs.text_video_embeds  # Text embeddings aligned to video
audio_plus_text_embeds = outputs.audio_plus_text_embeds  # Joint audio and text embedding
video_plus_text_embeds = outputs.video_plus_text_embeds  # Joint video and text embedding

# For classification, you can use the logits_* fields of the output
audio_text_preds = outputs.logits_audio_text.sigmoid()

# The overall loss is also available in the output (requires passing return_loss=True)
loss = outputs.loss
```
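
As in the `perception_models` example above, similarities are dot products between embeddings; pair each modality embedding with the text embedding aligned to that modality:

```python
# Use the text embedding that matches the other modality.
audio_text_similarity = audio_embeds @ text_audio_embeds.T
video_text_similarity = video_embeds @ text_video_embeds.T
```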

We also provide methods for directly encoding an individual modality:

```python
def get_text_audio_embeds(self, input_ids, attention_mask=None)

def get_text_video_embeds(self, input_ids, attention_mask=None)

def get_text_audio_video_embeds(self, input_ids, attention_mask=None)

def get_audio_embeds(self, input_values, padding_mask=None)

def get_video_embeds(self, pixel_values_videos, padding_mask_videos=None)

def get_audio_video_embeds(
    self,
    input_values: torch.Tensor,
    pixel_values_videos: torch.Tensor,
    padding_mask: Optional[torch.Tensor] = None,
    padding_mask_videos: Optional[torch.Tensor] = None,
    return_audio_embeds: bool = False,
    return_video_embeds: bool = False,
)

def get_audio_plus_text_embeds(
    self,
    input_ids: torch.Tensor,
    input_values: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    padding_mask: Optional[torch.Tensor] = None,
)

def get_video_plus_text_embeds(
    self,
    input_ids: torch.Tensor,
    pixel_values_videos: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    padding_mask_videos: Optional[torch.Tensor] = None,
)
```
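
You can also load just the sub-model of interest to avoid loading the full model when only one modality is needed, e.g. audio-only retrieval with `PeAudioModel` (use `PeVideoModel` and `PeVideoProcessor` analogously for video):

```python
from transformers import PeAudioModel, PeAudioProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PeAudioModel.from_pretrained("facebook/pe-av-large").to(device)
processor = PeAudioProcessor.from_pretrained("facebook/pe-av-large")

descriptions = ["description1", "description2"]
audio_files = ["audio1.wav", "audio2.wav"]

# Process inputs and get audio/text embeddings only
inputs = processor(text=descriptions, audio=audio_files, return_tensors="pt", padding=True)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs.to(device))

audio_embeds = outputs.audio_embeds            # Audio-only embeddings
text_audio_embeds = outputs.text_audio_embeds  # Text embeddings aligned to audio
```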

## Citation

```bibtex
@article{pe-av2025,
  title={PEAV: An Audiovisual Perception Encoder via Large-Scale Multimodal Correspondence Learning},
  author={Apoorv Vyas and Heng-Jui Chang and Cheng-Fu Yang and Po-Yao Huang and Luya Gao and Julius Richter and Sanyuan Chen and Matt Le and Piotr Dollár and Christoph Feichtenhofer and Ann Lee and Wei-Ning Hsu},
  url={arxiv link coming soon},
  year={2025}
}
```

## License

This model is released under the Apache 2.0 license.