|
|
--- |
|
|
license: apache-2.0 |
|
|
--- |
|
|
# Perception Encoder Audio-Visual (PE-AV) |
|
|
|
|
|
PE-AV is a state-of-the-art multimodal model that embeds audio, video, audio-video, and text into a joint embedding space, enabling cross-modal retrieval and understanding across all of these modalities.
|
|
|
|
|
## Model Description |
|
|
|
|
|
PE-AV is trained using contrastive learning to align audio, video, and text representations in a shared embedding space. The model can encode: |
|
|
- **Audio only**: Extract audio embeddings from audio waveforms |
|
|
- **Video only**: Extract visual embeddings from video frames |
|
|
- **Audio-Video**: Extract joint audio-visual embeddings |
|
|
- **Text**: Extract text embeddings optimized for different modality pairs |
|
|
|
|
|
## Model Variants |
|
|
|
|
|
We release 6 model checkpoints with varying sizes and capabilities: |
|
|
|
|
|
| Model | Avg. Retrieval | Video Frames Used |
|
|
|-------|---------------|-------------------| |
|
|
| [`pe-av-small-16-frame`](https://huggingface.co/facebook/pe-av-small-16-frame) | 45.2 | 16 frames | |
|
|
| [`pe-av-base-16-frame`](https://huggingface.co/facebook/pe-av-base-16-frame) | 47.0 | 16 frames | |
|
|
| [`pe-av-large-16-frame`](https://huggingface.co/facebook/pe-av-large-16-frame) | 48.2 | 16 frames | |
|
|
| [`pe-av-small`](https://huggingface.co/facebook/pe-av-small) | 48.1 | all frames | |
|
|
| [`pe-av-base`](https://huggingface.co/facebook/pe-av-base) | 50.2 | all frames | |
|
|
| [`pe-av-large`](https://huggingface.co/facebook/pe-av-large) | 51.6 | all frames | |
|
|
|
|
|
The `-16-frame` variants sample exactly 16 evenly spaced frames from each video, while the other variants use all frames and support variable-length videos.
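For intuition, evenly spaced sampling amounts to picking 16 indices at a uniform stride across the video. The sketch below illustrates the idea; it is not the exact sampling code used by the released transforms.

```python
import torch

def sample_frame_indices(num_video_frames: int, num_samples: int = 16) -> torch.Tensor:
    """Pick `num_samples` evenly spaced frame indices from a video.

    Illustrative only; the released transforms may differ in details
    such as rounding or temporal jitter.
    """
    return torch.linspace(0, num_video_frames - 1, num_samples).long()

# A 300-frame video yields an index roughly every 20 frames:
# tensor([  0,  19,  39,  59, ..., 299])
print(sample_frame_indices(300))
```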
|
|
|
|
|
## Quick Start |
|
|
|
|
|
The model is available in both the [`transformers`](https://github.com/huggingface/transformers/tree/main) and [`perception_models`](https://github.com/facebookresearch/perception_models/tree/main) libraries.
|
|
|
|
|
## `perception_models` Usage |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from core.audio_visual_encoder import PEAudioVisual, PEAudioVisualTransform |
|
|
|
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
|
|
# Load model and transform |
|
|
model = PEAudioVisual.from_config("pe-av-large", pretrained=True).to(device) |
|
|
transform = PEAudioVisualTransform.from_config("pe-av-large") |
|
|
|
|
|
video_files = ["video1.mp4", "video2.mp4"] |
|
|
descriptions = ["description1", "description2"] |
|
|
audio_files = ["audio1.wav", "audio2.wav"] |
|
|
|
|
|
# Process inputs and get embeddings |
|
|
inputs = transform(videos=video_files, text=descriptions, audio=audio_files).to(device) |
|
|
|
|
|
with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16): |
|
|
outputs = model(**inputs) |
|
|
|
|
|
# Access different embeddings |
|
|
audio_embeds = outputs.audio_embeds # Audio-only embeddings |
|
|
visual_embeds = outputs.visual_embeds # Video-only embeddings |
|
|
audio_visual_embeds = outputs.audio_visual_embeds # Joint audio-visual embeddings |
|
|
audio_text_embeds = outputs.audio_text_embeds # Text embeddings aligned to audio |
|
|
visual_text_embeds = outputs.visual_text_embeds # Text embeddings aligned to video |
|
|
audio_visual_text_embeds = outputs.audio_visual_text_embeds # Text embeddings aligned to audio-visual |
|
|
audio_plus_text_embeds = outputs.audio_plus_text_embeds # Joint audio and text embedding |
|
|
visual_plus_text_embeds = outputs.visual_plus_text_embeds # Joint video and text embedding |
|
|
|
|
|
# Compute the dot product to get their similarities |
|
|
audio_visual_similarity = audio_embeds @ visual_embeds.T |
|
|
# When computing similarity against text embeddings, use the |
|
|
# appropriate text embedding based on the other modality |
|
|
audio_text_similarity = audio_embeds @ audio_text_embeds.T |
|
|
video_text_similarity = visual_embeds @ visual_text_embeds.T |
|
|
``` |
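As a quick usage example, the similarity matrices above can be turned into retrieval predictions by ranking along one axis. The sketch below continues the snippet above; it normalizes the embeddings first, since whether the returned embeddings are already unit-length is an assumption on our part.

```python
import torch.nn.functional as F

# Continuing from the snippet above: text-to-video retrieval.
t = F.normalize(visual_text_embeds, dim=-1)
v = F.normalize(visual_embeds, dim=-1)

# For each description, the index of its best-matching video.
text_to_video = (t @ v.T).argmax(dim=-1)
print(text_to_video)  # e.g. tensor([0, 1]) if each description matches its own video
```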
|
|
|
|
|
Note that you can omit any of the modalities and still use the same `forward` method; the corresponding embeddings in `outputs` will be `None`. For example:
|
|
|
|
|
```python |
|
|
inputs = transform(videos=video_files, text=descriptions).to(device) |
|
|
|
|
|
with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16): |
|
|
outputs = model(**inputs) |
|
|
|
|
|
audio_embeds = outputs.audio_embeds # None |
|
|
visual_embeds = outputs.visual_embeds # available |
|
|
audio_visual_embeds = outputs.audio_visual_embeds # None |
|
|
audio_visual_text_embeds = outputs.audio_visual_text_embeds # None |
|
|
audio_text_embeds = outputs.audio_text_embeds # None |
|
|
visual_text_embeds = outputs.visual_text_embeds # available |
|
|
audio_plus_text_embeds = outputs.audio_plus_text_embeds # None |
|
|
visual_plus_text_embeds = outputs.visual_plus_text_embeds # available
|
|
``` |
|
|
|
|
|
We also provide methods for directly encoding individual modalities and modality pairs (a usage sketch follows the listing):
|
|
|
|
|
```python |
|
|
def encode_video_text(self, input_ids, attention_mask=None) |
|
|
def encode_audio_text(self, input_ids, attention_mask=None) |
|
|
def encode_audio_video_text(self, input_ids, attention_mask=None) |
|
|
def encode_audio(self, input_values, padding_mask=None, input_features=None) |
|
|
def encode_video(self, pixel_values_videos, padding_mask_videos=None, pe_features=None) |
|
|
def encode_audio_video( |
|
|
self, |
|
|
input_values, |
|
|
pixel_values_videos, |
|
|
padding_mask=None, |
|
|
padding_mask_videos=None, |
|
|
pe_features=None, # optionally re-use pre-computed PE features |
|
|
input_features=None, # Optionally re-use pre-computed audio codec features |
|
|
) |
|
|
def encode_audio_plus_text( |
|
|
self, |
|
|
input_ids, |
|
|
input_values, |
|
|
attention_mask=None, |
|
|
padding_mask=None, |
|
|
input_features=None # Optionally re-use pre-computed audio codec features |
|
|
) |
|
|
def encode_video_plus_text( |
|
|
self, |
|
|
input_ids, |
|
|
pixel_values_videos, |
|
|
attention_mask=None, |
|
|
padding_mask_videos=None, |
|
|
pe_features=None, # optionally re-use pre-computed PE features |
|
|
) |
|
|
``` |
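For example, a video-only pass might look like the sketch below. We assume here that calling the transform with only `videos` yields exactly the keyword arguments that `encode_video` expects (e.g. `pixel_values_videos`); check the library if the keys differ.

```python
# Minimal sketch: encode videos on their own, reusing the model and
# transform loaded in the Quick Start above.
inputs = transform(videos=video_files).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    # Assumes the video-only transform output matches encode_video's signature.
    visual_embeds = model.encode_video(**inputs)
```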
|
|
|
|
|
## `transformers` Usage |
|
|
|
|
|
```python |
|
|
from transformers import PeAudioVideoModel, PeAudioVideoProcessor |
|
|
import torch |
|
|
|
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
model = PeAudioVideoModel.from_pretrained("facebook/pe-av-large") |
|
|
processor = PeAudioVideoProcessor.from_pretrained("facebook/pe-av-large") |
|
|
|
|
|
model = model.to(device) |
|
|
|
|
|
video_files = ["video1.mp4", "video2.mp4"] |
|
|
descriptions = ["description1", "description2"] |
|
|
audio_files = ["audio1.wav", "audio2.wav"] |
|
|
|
|
|
# Process inputs and get embeddings |
|
|
inputs = processor( |
|
|
videos=video_files, text=descriptions, audio=audio_files, return_tensors="pt", padding=True |
|
|
) |
|
|
|
|
|
with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16): |
|
|
outputs = model(**inputs.to(device), return_loss=True) |
|
|
|
|
|
audio_embeds = outputs.audio_embeds # Audio-only embeddings |
|
|
video_embeds = outputs.video_embeds # Video-only embeddings |
|
|
audio_video_embeds = outputs.audio_video_embeds # Joint audio-video embeddings |
|
|
text_audio_video_embeds = outputs.audio_video_text_embeds # Text embeddings aligned to audio-video |
|
|
text_audio_embeds = outputs.text_audio_embeds # Text embeddings aligned to audio |
|
|
text_video_embeds = outputs.text_video_embeds # Text embeddings aligned to video |
|
|
audio_plus_text_embeds = outputs.audio_plus_text_embeds # Joint audio and text embedding |
|
|
video_plus_text_embeds = outputs.video_plus_text_embeds # Joint video and text embedding |
|
|
|
|
|
# For classification, you can use the logits_* fields of the output |
|
|
audio_text_preds = outputs.logits_audio_text.sigmoid() |
|
|
|
|
|
# The overall loss is also available in the output (requires passing return_loss=True) |
|
|
loss = outputs.loss |
|
|
|
|
|
``` |
|
|
|
|
|
We also provide methods for directly encoding individual modalities and modality pairs (a usage sketch follows the listing):
|
|
|
|
|
```python |
|
|
def get_text_audio_embeds(self, input_ids, attention_mask=None) |
|
|
|
|
|
def get_text_video_embeds(self, input_ids, attention_mask=None) |
|
|
|
|
|
def get_text_audio_video_embeds(self, input_ids, attention_mask=None) |
|
|
|
|
|
def get_audio_embeds(self, input_values, padding_mask=None) |
|
|
|
|
|
def get_video_embeds(self, pixel_values_videos, padding_mask_videos=None) |
|
|
|
|
|
def get_audio_video_embeds( |
|
|
self, |
|
|
input_values: torch.Tensor, |
|
|
pixel_values_videos: torch.Tensor, |
|
|
padding_mask: Optional[torch.Tensor] = None, |
|
|
padding_mask_videos: Optional[torch.Tensor] = None, |
|
|
return_audio_embeds: bool = False, |
|
|
return_video_embeds: bool = False, |
|
|
) |
|
|
|
|
|
def get_audio_plus_text_embeds( |
|
|
self, |
|
|
input_ids: torch.Tensor, |
|
|
input_values: torch.Tensor, |
|
|
attention_mask: Optional[torch.Tensor] = None, |
|
|
padding_mask: Optional[torch.Tensor] = None, |
|
|
) |
|
|
|
|
|
def get_video_plus_text_embeds( |
|
|
self, |
|
|
input_ids: torch.Tensor, |
|
|
pixel_values_videos: torch.Tensor, |
|
|
attention_mask: Optional[torch.Tensor] = None, |
|
|
padding_mask_videos: Optional[torch.Tensor] = None, |
|
|
) |
|
|
|
|
|
``` |
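A video-only pass with these helpers might look like the sketch below; we assume the processor's video-only output keys (e.g. `pixel_values_videos`) line up with the `get_video_embeds` signature.

```python
# Minimal sketch: encode videos only, reusing the model and processor
# loaded in the transformers Quick Start above.
inputs = processor(videos=video_files, return_tensors="pt", padding=True)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    # Assumes the processor's video-only keys match get_video_embeds.
    video_embeds = model.get_video_embeds(**inputs.to(device))
```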
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{vyas2025pushingfrontieraudiovisualperception, |
|
|
title={Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning}, |
|
|
author={Apoorv Vyas and Heng-Jui Chang and Cheng-Fu Yang and Po-Yao Huang and Luya Gao and Julius Richter and Sanyuan Chen and Matt Le and Piotr Dollár and Christoph Feichtenhofer and Ann Lee and Wei-Ning Hsu}, |
|
|
year={2025}, |
|
|
eprint={2512.19687}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.SD}, |
|
|
url={https://arxiv.org/abs/2512.19687}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the Apache 2.0 license. |
|
|
|