---
license: apache-2.0
---

# Perception Encoder Audio-Visual (PE-AV)

PE-AV is a state-of-the-art multimodal model that embeds audio, video, audio-video, and text into a joint embedding space, enabling powerful cross-modal retrieval and understanding across these modalities.
## Model Description

PE-AV is trained using contrastive learning to align audio, video, and text representations in a shared embedding space. The model can encode:

- **Audio only**: Extract audio embeddings from audio waveforms
- **Video only**: Extract visual embeddings from video frames
- **Audio-Video**: Extract joint audio-visual embeddings
- **Text**: Extract text embeddings optimized for different modality pairs
## Model Variants

We release 6 model checkpoints with varying sizes and capabilities:

| Model | Avg Retrieval | Video Frames Used |
|-------|---------------|-------------------|
| [`pe-av-small-16-frame`](https://huggingface.co/facebook/pe-av-small-16-frame) | 45.2 | 16 frames |
| [`pe-av-base-16-frame`](https://huggingface.co/facebook/pe-av-base-16-frame) | 47.0 | 16 frames |
| [`pe-av-large-16-frame`](https://huggingface.co/facebook/pe-av-large-16-frame) | 48.2 | 16 frames |
| [`pe-av-small`](https://huggingface.co/facebook/pe-av-small) | 48.1 | all frames |
| [`pe-av-base`](https://huggingface.co/facebook/pe-av-base) | 50.2 | all frames |
| [`pe-av-large`](https://huggingface.co/facebook/pe-av-large) | 51.6 | all frames |

The `-16-frame` variants sample exactly 16 evenly spaced frames from each video, while the variants without the suffix use all frames and support variable-length videos.
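
To make the sampling scheme concrete, the sketch below picks 16 evenly spaced frame indices from a video. This is only an illustration of the idea; the released transforms handle frame sampling internally, and the function name here is made up.

```python
import torch

def evenly_spaced_frame_indices(total_frames: int, num_frames: int = 16) -> torch.Tensor:
    # 16 indices spread evenly over [0, total_frames - 1]
    return torch.linspace(0, total_frames - 1, num_frames).round().long()

# For a 300-frame video this yields 16 indices from 0 through 299.
frame_indices = evenly_spaced_frame_indices(300)
```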
## Quick Start

The model is available in both the [`transformers`](https://github.com/huggingface/transformers/tree/main) and [`perception_models`](https://github.com/facebookresearch/perception_models/tree/main) libraries.
## `perception_models` Usage

```python
import torch
from core.audio_visual_encoder import PEAudioVisual, PEAudioVisualTransform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and transform
model = PEAudioVisual.from_config("pe-av-large", pretrained=True).to(device)
transform = PEAudioVisualTransform.from_config("pe-av-large")

video_files = ["video1.mp4", "video2.mp4"]
descriptions = ["description1", "description2"]
audio_files = ["audio1.wav", "audio2.wav"]

# Process inputs and get embeddings
inputs = transform(videos=video_files, text=descriptions, audio=audio_files).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs)

# Access different embeddings
audio_embeds = outputs.audio_embeds  # Audio-only embeddings
visual_embeds = outputs.visual_embeds  # Video-only embeddings
audio_visual_embeds = outputs.audio_visual_embeds  # Joint audio-visual embeddings
audio_text_embeds = outputs.audio_text_embeds  # Text embeddings aligned to audio
visual_text_embeds = outputs.visual_text_embeds  # Text embeddings aligned to video
audio_visual_text_embeds = outputs.audio_visual_text_embeds  # Text embeddings aligned to audio-visual
audio_plus_text_embeds = outputs.audio_plus_text_embeds  # Joint audio and text embeddings
visual_plus_text_embeds = outputs.visual_plus_text_embeds  # Joint video and text embeddings

# Compute dot products to get similarities
audio_visual_similarity = audio_embeds @ visual_embeds.T
# When computing similarity against text embeddings, use the
# text embedding aligned to the other modality
audio_text_similarity = audio_embeds @ audio_text_embeds.T
video_text_similarity = visual_embeds @ visual_text_embeds.T
```
Note that you can omit any of the modalities and use the same `forward` method; the corresponding embeddings in `outputs` will be `None`. For example:

```python
inputs = transform(videos=video_files, text=descriptions).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs)

audio_embeds = outputs.audio_embeds  # None
visual_embeds = outputs.visual_embeds  # available
audio_visual_embeds = outputs.audio_visual_embeds  # None
audio_visual_text_embeds = outputs.audio_visual_text_embeds  # None
audio_text_embeds = outputs.audio_text_embeds  # None
visual_text_embeds = outputs.visual_text_embeds  # available
audio_plus_text_embeds = outputs.audio_plus_text_embeds  # None
visual_plus_text_embeds = outputs.visual_plus_text_embeds  # available
```
We also provide methods for directly encoding an individual modality:

```python
def encode_video_text(self, input_ids, attention_mask=None)
def encode_audio_text(self, input_ids, attention_mask=None)
def encode_audio_video_text(self, input_ids, attention_mask=None)
def encode_audio(self, input_values, padding_mask=None, input_features=None)
def encode_video(self, pixel_values_videos, padding_mask_videos=None, pe_features=None)
def encode_audio_video(
    self,
    input_values,
    pixel_values_videos,
    padding_mask=None,
    padding_mask_videos=None,
    pe_features=None,  # optionally re-use pre-computed PE features
    input_features=None,  # optionally re-use pre-computed audio codec features
)
def encode_audio_plus_text(
    self,
    input_ids,
    input_values,
    attention_mask=None,
    padding_mask=None,
    input_features=None,  # optionally re-use pre-computed audio codec features
)
def encode_video_plus_text(
    self,
    input_ids,
    pixel_values_videos,
    attention_mask=None,
    padding_mask_videos=None,
    pe_features=None,  # optionally re-use pre-computed PE features
)
```
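For instance, the per-modality encoders can be called directly on the tensors produced by the transform. The snippet below is a minimal sketch that assumes the transform returns key names matching the signatures above (e.g. `pixel_values_videos`, `input_ids`); check the returned keys in your environment.

```python
inputs = transform(videos=video_files, text=descriptions).to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    # Encode each modality on its own (key names assumed to match the signatures above)
    visual_embeds = model.encode_video(inputs["pixel_values_videos"])
    visual_text_embeds = model.encode_video_text(inputs["input_ids"])

video_text_similarity = visual_embeds @ visual_text_embeds.T
```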
## `transformers` Usage
```python
import torch
from transformers import PeAudioVideoModel, PeAudioVideoProcessor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PeAudioVideoModel.from_pretrained("facebook/pe-av-large").to(device)
processor = PeAudioVideoProcessor.from_pretrained("facebook/pe-av-large")

video_files = ["video1.mp4", "video2.mp4"]
descriptions = ["description1", "description2"]
audio_files = ["audio1.wav", "audio2.wav"]

# Process inputs and get embeddings
inputs = processor(
    videos=video_files, text=descriptions, audio=audio_files, return_tensors="pt", padding=True
)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    outputs = model(**inputs.to(device), return_loss=True)

audio_embeds = outputs.audio_embeds  # Audio-only embeddings
video_embeds = outputs.video_embeds  # Video-only embeddings
audio_video_embeds = outputs.audio_video_embeds  # Joint audio-video embeddings
text_audio_video_embeds = outputs.audio_video_text_embeds  # Text embeddings aligned to audio-video
text_audio_embeds = outputs.text_audio_embeds  # Text embeddings aligned to audio
text_video_embeds = outputs.text_video_embeds  # Text embeddings aligned to video
audio_plus_text_embeds = outputs.audio_plus_text_embeds  # Joint audio and text embeddings
video_plus_text_embeds = outputs.video_plus_text_embeds  # Joint video and text embeddings

# For classification, you can use the logits_* fields of the output
audio_text_preds = outputs.logits_audio_text.sigmoid()

# The overall loss is also available in the output (requires passing return_loss=True)
loss = outputs.loss
```
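As in the `perception_models` example, similarities are computed with a dot product, pairing each embedding with the text embedding aligned to that modality. A short sketch using the output fields shown above:

```python
# Text-to-video retrieval: rank videos for each description
video_text_similarity = text_video_embeds @ video_embeds.T
best_video_per_text = video_text_similarity.argmax(dim=-1)

# Likewise, pair audio embeddings with the audio-aligned text embeddings
audio_text_similarity = text_audio_embeds @ audio_embeds.T
```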
We also provide methods for directly encoding an individual modality:
```python
def get_text_audio_embeds(self, input_ids, attention_mask=None)

def get_text_video_embeds(self, input_ids, attention_mask=None)

def get_text_audio_video_embeds(self, input_ids, attention_mask=None)

def get_audio_embeds(self, input_values, padding_mask=None)

def get_video_embeds(self, pixel_values_videos, padding_mask_videos=None)

def get_audio_video_embeds(
    self,
    input_values: torch.Tensor,
    pixel_values_videos: torch.Tensor,
    padding_mask: Optional[torch.Tensor] = None,
    padding_mask_videos: Optional[torch.Tensor] = None,
    return_audio_embeds: bool = False,
    return_video_embeds: bool = False,
)

def get_audio_plus_text_embeds(
    self,
    input_ids: torch.Tensor,
    input_values: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    padding_mask: Optional[torch.Tensor] = None,
)

def get_video_plus_text_embeds(
    self,
    input_ids: torch.Tensor,
    pixel_values_videos: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    padding_mask_videos: Optional[torch.Tensor] = None,
)
```
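For example, text and video can be embedded in separate passes for retrieval. The sketch below assumes the processor accepts text-only and video-only calls and returns key names matching the signatures above; verify the returned keys before relying on them.

```python
# Encode queries and videos separately (key names assumed to match the signatures above)
text_inputs = processor(text=descriptions, return_tensors="pt", padding=True).to(device)
video_inputs = processor(videos=video_files, return_tensors="pt").to(device)

with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
    text_video_embeds = model.get_text_video_embeds(**text_inputs)
    video_embeds = model.get_video_embeds(**video_inputs)

video_text_similarity = text_video_embeds @ video_embeds.T
```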
## Citation

```bibtex
@misc{vyas2025pushingfrontieraudiovisualperception,
  title={Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning},
  author={Apoorv Vyas and Heng-Jui Chang and Cheng-Fu Yang and Po-Yao Huang and Luya Gao and Julius Richter and Sanyuan Chen and Matt Le and Piotr Dollár and Christoph Feichtenhofer and Ann Lee and Wei-Ning Hsu},
  year={2025},
  eprint={2512.19687},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2512.19687},
}
```
## License

This model is released under the Apache 2.0 license.