---
license: apache-2.0
---
# Perception Encoder Audio Frame (PE-A-Frame)

PE-A-Frame is a state-of-the-art audio-text embedding model. For text, the model produces a single embedding. For audio, it produces a sequence of embeddings (one for every 40 ms of audio). These embeddings can then be used for audio event localization. For convenience, the model also outputs temporal spans (start and end timestamps) indicating when an event, given as a freeform text description, occurs in the audio.

## Model Description

PE-A-Frame uses contrastive learning to align frame-level audio representations with text descriptions. The model can identify the precise time ranges in which described audio events occur.

## Model Variants

We release model checkpoints in three sizes:

| Model | Parameters |
|-------|------------|
| [`pe-a-frame-small`](https://huggingface.co/facebook/pe-a-frame-small) | 450M |
| [`pe-a-frame-base`](https://huggingface.co/facebook/pe-a-frame-base) | 560M |
| [`pe-a-frame-large`](https://huggingface.co/facebook/pe-a-frame-large) | 1.4B |

## Quick Start

### Basic Usage: Audio Event Localization

```python
import torch
from core.audio_visual_encoder import PEAudioFrame, PEAudioFrameTransform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and transform
model = PEAudioFrame.from_config("pe-a-frame-large", pretrained=True).to(device)
transform = PEAudioFrameTransform.from_config("pe-a-frame-large")

# Define audio file and event descriptions
audio_file = "office_conversation.wav"
descriptions = ["a person talking", "keyboard typing", "phone ringing"]

# Process inputs
inputs = transform(audio=[audio_file], text=descriptions).to(device)

# Run inference
with torch.inference_mode():
    outputs = model(**inputs, return_spans=True)

# Print detected time spans for each event
for description, spans in zip(descriptions, outputs.spans):
    if spans:
        span_str = ", ".join([f"({start:.2f}s, {end:.2f}s)" for start, end in spans])
        print(f'"{description}": [{span_str}]')
    else:
        print(f'"{description}": No events detected')
```

**Example Output:**
```
"a person talking": [(2.34s, 5.67s), (8.90s, 12.45s)]
"keyboard typing": [(1.20s, 3.40s), (6.78s, 9.12s)]
"phone ringing": No events detected
```
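
Once spans are detected, a common follow-up is cutting the matching segments out of the source file. A minimal sketch, assuming `torchaudio` is available and continuing from the variables above (`audio_file`, and `outputs.spans` holding `(start, end)` tuples in seconds):

```python
import torchaudio

# Load the raw waveform once; sample_rate is the file's native rate
waveform, sample_rate = torchaudio.load(audio_file)

# Export each detected segment for the first description ("a person talking")
for i, (start, end) in enumerate(outputs.spans[0]):
    clip = waveform[:, int(start * sample_rate):int(end * sample_rate)]
    torchaudio.save(f"talking_{i}.wav", clip, sample_rate)
```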

### Batch Processing Multiple Audio Files

```python
import torch
from core.audio_visual_encoder import PEAudioFrame, PEAudioFrameTransform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PEAudioFrame.from_config("pe-a-frame-large", pretrained=True).to(device)
transform = PEAudioFrameTransform.from_config("pe-a-frame-large")

# Process multiple audio files with different descriptions
audio_files = ["meeting.wav", "street.wav", "kitchen.wav"]
descriptions = [
    "people discussing in a meeting",
    "cars passing by",
    "water running from a faucet",
]

inputs = transform(audio=audio_files, text=descriptions).to(device)

with torch.inference_mode():
    outputs = model(**inputs, return_spans=True)

# Each audio-text pair gets its own span predictions
for audio, description, spans in zip(audio_files, descriptions, outputs.spans):
    if spans:
        span_str = ", ".join([f"({start:.2f}s, {end:.2f}s)" for start, end in spans])
        print(f'"{description}": [{span_str}] in {audio}')
    else:
        print(f'"{description}": No events detected in {audio}')
```

### Adjusting Detection Threshold

The `threshold` parameter controls sensitivity for event detection. Lower values detect more events (higher recall), while higher values are more selective (higher precision):

```python
# High sensitivity - detect more events (may include false positives)
outputs_sensitive = model(**inputs, threshold=0.2)

# High selectivity - keep only confident detections (0.8 is an illustrative value)
outputs_selective = model(**inputs, threshold=0.8)
```
103
+
104
+ ### Extracting Embeddings Without Spans
105
+
106
+ If you only need embeddings without temporal localization:
107
+
```python
inputs = transform(audio=[audio_file], text=descriptions).to(device)

with torch.inference_mode():
    outputs = model(**inputs, return_spans=False)

# Access embeddings
audio_embeds = outputs.audio_embeds  # Shape: [batch_size, num_frames, embed_dim]
text_embeds = outputs.text_embeds    # Shape: [batch_size, embed_dim]

# Compute similarity between audio frames and text
# audio_embeds is frame-level, so you can see which frames match the description
similarities = torch.einsum("btd,bd->bt", audio_embeds, text_embeds)
# similarities shape: [batch_size, num_frames]
```
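
From these frame-level similarities you can also derive rough spans yourself. Below is a minimal sketch, not necessarily how the model's built-in span decoding works: it assumes 40 ms per frame (as stated above) and uses an illustrative sigmoid to map raw dot products to [0, 1] before thresholding.

```python
import torch

FRAME_SEC = 0.04  # 40 ms per audio frame, per the model description

def frames_to_spans(scores: torch.Tensor, threshold: float = 0.5):
    """Merge runs of consecutive above-threshold frames into (start, end) spans in seconds."""
    active = (scores > threshold).tolist()
    spans, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i                                          # a span opens
        elif not on and start is not None:
            spans.append((start * FRAME_SEC, i * FRAME_SEC))   # the span closes
            start = None
    if start is not None:                                      # span runs to the end
        spans.append((start * FRAME_SEC, len(active) * FRAME_SEC))
    return spans

# Illustrative score normalization; pick the mapping that fits your use case
probs = torch.sigmoid(similarities)
for description, frame_probs in zip(descriptions, probs):
    print(description, frames_to_spans(frame_probs))
```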

## Citation

```bibtex
@article{pe-av2025,
  title={PEAV: An Audiovisual Perception Encoder via Large-Scale Multimodal Correspondence Learning},
  author={Apoorv Vyas and Heng-Jui Chang and Cheng-Fu Yang and Po-Yao Huang and Luya Gao and Julius Richter and Sanyuan Chen and Matt Le and Piotr Dollár and Christoph Feichtenhofer and Ann Lee and Wei-Ning Hsu},
  url={arxiv link coming soon},
  year={2025}
}
```

## License

This model is released under the Apache 2.0 license.