---
license: apache-2.0
---

# Perception Encoder Audio-Visual (PE-AV)

PE-AV is a state-of-the-art multimodal model that embeds audio, video, audio-video, and text into a joint embedding space, enabling powerful cross-modal retrieval and understanding across these modalities.
## Model Description

PE-AV is trained using contrastive learning to align audio, video, and text representations in a shared embedding space. The model can encode:

- **Audio only**: Extract audio embeddings from audio waveforms
- **Video only**: Extract visual embeddings from video frames
- **Audio-Video**: Extract joint audio-visual embeddings
- **Text**: Extract text embeddings optimized for different modality pairs
## Model Variants

We release 6 model checkpoints with varying sizes and capabilities:

| Model | Avg Retrieval | Video Frames Used |
|-------|---------------|-------------------|
| [`pe-av-small-16-frame`](https://huggingface.co/facebook/pe-av-small-16-frame) | 45.2 | 16 frames |
| [`pe-av-base-16-frame`](https://huggingface.co/facebook/pe-av-base-16-frame) | 47.0 | 16 frames |
| [`pe-av-large-16-frame`](https://huggingface.co/facebook/pe-av-large-16-frame) | 48.2 | 16 frames |
| [`pe-av-small`](https://huggingface.co/facebook/pe-av-small) | 48.1 | all frames |
| [`pe-av-base`](https://huggingface.co/facebook/pe-av-base) | 50.2 | all frames |
| [`pe-av-large`](https://huggingface.co/facebook/pe-av-large) | 51.6 | all frames |

The `-16-frame` variants sample exactly 16 evenly spaced frames from each video, while the variants without the suffix use all frames and support variable-length videos.
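
To make the sampling scheme concrete, the sketch below picks 16 evenly spaced frame indices from a video. This is only an illustration of the idea; the released transforms handle frame sampling internally, and the function name here is made up.

```python
import torch

def evenly_spaced_frame_indices(total_frames: int, num_frames: int = 16) -> torch.Tensor:
    # 16 indices spread evenly over [0, total_frames - 1]
    return torch.linspace(0, total_frames - 1, num_frames).round().long()

# For a 300-frame video this yields 16 indices from 0 through 299.
frame_indices = evenly_spaced_frame_indices(300)
```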
## Quick Start

The model is available in both the [`transformers`](https://github.com/huggingface/transformers/tree/main) and [`perception_models`](https://github.com/facebookresearch/perception_models/tree/main) libraries.
## `perception_models` Usage

```python
import torch
from core.audio_visual_encoder import PEAudioVisual, PEAudioVisualTransform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and transform
model = PEAudioVisual.from_config("pe-av-large", pretrained=True).to(device)
transform = PEAudioVisualTransform.from_config("pe-av-large")

video_files = ["video1.mp4", "video2.mp4"]
descriptions = ["description1", "description2"]
audio_files = ["audio1.wav", "audio2.wav"]

# Process inputs and get embeddings
inputs = transform(videos=video_files, text=descriptions, audio=audio_files).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs)

# Access different embeddings
audio_embeds = outputs.audio_embeds  # Audio-only embeddings
visual_embeds = outputs.visual_embeds  # Video-only embeddings
audio_visual_embeds = outputs.audio_visual_embeds  # Joint audio-visual embeddings
audio_text_embeds = outputs.audio_text_embeds  # Text embeddings aligned to audio
visual_text_embeds = outputs.visual_text_embeds  # Text embeddings aligned to video
audio_visual_text_embeds = outputs.audio_visual_text_embeds  # Text embeddings aligned to audio-visual
audio_plus_text_embeds = outputs.audio_plus_text_embeds  # Joint audio and text embeddings
visual_plus_text_embeds = outputs.visual_plus_text_embeds  # Joint video and text embeddings

# Compute dot products to get similarities
audio_visual_similarity = audio_embeds @ visual_embeds.T
# When computing similarity against text embeddings, use the
# text embedding aligned to the other modality
audio_text_similarity = audio_embeds @ audio_text_embeds.T
video_text_similarity = visual_embeds @ visual_text_embeds.T
```
Note that you can omit any of the modalities and use the same `forward` method; the corresponding embeddings in `outputs` will be `None`. For example:

```python
inputs = transform(videos=video_files, text=descriptions).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs)

audio_embeds = outputs.audio_embeds  # None
visual_embeds = outputs.visual_embeds  # available
audio_visual_embeds = outputs.audio_visual_embeds  # None
audio_visual_text_embeds = outputs.audio_visual_text_embeds  # None
audio_text_embeds = outputs.audio_text_embeds  # None
visual_text_embeds = outputs.visual_text_embeds  # available
audio_plus_text_embeds = outputs.audio_plus_text_embeds  # None
visual_plus_text_embeds = outputs.visual_plus_text_embeds  # available
```
We also provide methods for directly encoding an individual modality:

```python
def encode_video_text(self, input_ids, attention_mask=None)
def encode_audio_text(self, input_ids, attention_mask=None)
def encode_audio_video_text(self, input_ids, attention_mask=None)
def encode_audio(self, input_values, padding_mask=None, input_features=None)
def encode_video(self, pixel_values_videos, padding_mask_videos=None, pe_features=None)
def encode_audio_video(
    self,
    input_values,
    pixel_values_videos,
    padding_mask=None,
    padding_mask_videos=None,
    pe_features=None,  # optionally re-use pre-computed PE features
    input_features=None,  # optionally re-use pre-computed audio codec features
)
def encode_audio_plus_text(
    self,
    input_ids,
    input_values,
    attention_mask=None,
    padding_mask=None,
    input_features=None,  # optionally re-use pre-computed audio codec features
)
def encode_video_plus_text(
    self,
    input_ids,
    pixel_values_videos,
    attention_mask=None,
    padding_mask_videos=None,
    pe_features=None,  # optionally re-use pre-computed PE features
)
```
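For instance, the per-modality encoders can be called directly on the tensors produced by the transform. The snippet below is a minimal sketch that assumes the transform returns key names matching the signatures above (e.g. `pixel_values_videos`, `input_ids`); check the returned keys in your environment.

```python
inputs = transform(videos=video_files, text=descriptions).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    # Encode each modality on its own (key names assumed to match the signatures above)
    visual_embeds = model.encode_video(inputs["pixel_values_videos"])
    visual_text_embeds = model.encode_video_text(inputs["input_ids"])

video_text_similarity = visual_embeds @ visual_text_embeds.T
```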
## `transformers` Usage
```python
import torch
from transformers import PeAudioVideoModel, PeAudioVideoProcessor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PeAudioVideoModel.from_pretrained("facebook/pe-av-large").to(device)
processor = PeAudioVideoProcessor.from_pretrained("facebook/pe-av-large")

video_files = ["video1.mp4", "video2.mp4"]
descriptions = ["description1", "description2"]
audio_files = ["audio1.wav", "audio2.wav"]

# Process inputs and get embeddings
inputs = processor(
    videos=video_files, text=descriptions, audio=audio_files, return_tensors="pt", padding=True
)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs.to(device), return_loss=True)

audio_embeds = outputs.audio_embeds  # Audio-only embeddings
video_embeds = outputs.video_embeds  # Video-only embeddings
audio_video_embeds = outputs.audio_video_embeds  # Joint audio-video embeddings
text_audio_video_embeds = outputs.audio_video_text_embeds  # Text embeddings aligned to audio-video
text_audio_embeds = outputs.text_audio_embeds  # Text embeddings aligned to audio
text_video_embeds = outputs.text_video_embeds  # Text embeddings aligned to video
audio_plus_text_embeds = outputs.audio_plus_text_embeds  # Joint audio and text embeddings
video_plus_text_embeds = outputs.video_plus_text_embeds  # Joint video and text embeddings

# For classification, you can use the logits_* fields of the output
audio_text_preds = outputs.logits_audio_text.sigmoid()

# The overall loss is also available in the output (requires passing return_loss=True)
loss = outputs.loss
```
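As in the `perception_models` example, similarities are computed with a dot product, pairing each embedding with the text embedding aligned to that modality. A short sketch using the output fields shown above:

```python
# Text-to-video retrieval: rank videos for each description
video_text_similarity = text_video_embeds @ video_embeds.T
best_video_per_text = video_text_similarity.argmax(dim=-1)

# Likewise, pair audio embeddings with the audio-aligned text embeddings
audio_text_similarity = text_audio_embeds @ audio_embeds.T
```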
We also provide methods for directly encoding an individual modality:
```python
def get_text_audio_embeds(self, input_ids, attention_mask=None)

def get_text_video_embeds(self, input_ids, attention_mask=None)

def get_text_audio_video_embeds(self, input_ids, attention_mask=None)

def get_audio_embeds(self, input_values, padding_mask=None)

def get_video_embeds(self, pixel_values_videos, padding_mask_videos=None)

def get_audio_video_embeds(
    self,
    input_values: torch.Tensor,
    pixel_values_videos: torch.Tensor,
    padding_mask: Optional[torch.Tensor] = None,
    padding_mask_videos: Optional[torch.Tensor] = None,
    return_audio_embeds: bool = False,
    return_video_embeds: bool = False,
)

def get_audio_plus_text_embeds(
    self,
    input_ids: torch.Tensor,
    input_values: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    padding_mask: Optional[torch.Tensor] = None,
)

def get_video_plus_text_embeds(
    self,
    input_ids: torch.Tensor,
    pixel_values_videos: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    padding_mask_videos: Optional[torch.Tensor] = None,
)
```
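For example, text and video can be embedded in separate passes for retrieval. The sketch below assumes the processor accepts text-only and video-only calls and returns key names matching the signatures above; verify the returned keys before relying on them.

```python
# Encode queries and videos separately (key names assumed to match the signatures above)
text_inputs = processor(text=descriptions, return_tensors="pt", padding=True).to(device)
video_inputs = processor(videos=video_files, return_tensors="pt").to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    text_video_embeds = model.get_text_video_embeds(**text_inputs)
    video_embeds = model.get_video_embeds(**video_inputs)

video_text_similarity = text_video_embeds @ video_embeds.T
```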
## Citation

```bibtex
@misc{vyas2025pushingfrontieraudiovisualperception,
  title={Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning},
  author={Apoorv Vyas and Heng-Jui Chang and Cheng-Fu Yang and Po-Yao Huang and Luya Gao and Julius Richter and Sanyuan Chen and Matt Le and Piotr Dollár and Christoph Feichtenhofer and Ann Lee and Wei-Ning Hsu},
  year={2025},
  eprint={2512.19687},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2512.19687},
}
```
## License

This model is released under the Apache 2.0 license.