---
license: mit
library_name: transformers
tags:
- audio grounding
- audio-text retrieval
- sound-event-detection
- multimodal
- clap
pipeline_tag: feature-extraction
---
# FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining
[Paper](https://arxiv.org/abs/2604.01155)
[Model](https://huggingface.co/AndreasXi/FineLAP)
[Dataset](https://huggingface.co/datasets/AndreasXi/FineLAP-100k)
FineLAP is a strong contrastively pre-trained audio-language model that excels in both clip-level and frame-level audio understanding tasks.
You can use the script below to extract frame- and clip-level features or calculate similarity:
```python
import torch
from transformers import AutoModel
audio_path = ['resources/1.wav', 'resources/2.wav'] # (B,)
caption = ["A woman speaks, dishes clanking, food frying, and music plays", 'A power tool is heard with male speech.'] # (B,)
phrases = ['Speech', 'Dog', 'Cat', 'Frying', 'Dishes', 'Music', 'Vacuum', 'Type', 'Power tool'] # (N,)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("AndreasXi/FineLAP", trust_remote_code=True).to(device)
model.eval()
with torch.no_grad():
    global_text_embeds = model.get_global_text_embeds(caption)  # (B, d)
    print(global_text_embeds.shape)

    global_audio_embeds = model.get_global_audio_embeds(audio_path)  # (B, d)
    print(global_audio_embeds.shape)

    dense_audio_embeds = model.get_dense_audio_embeds(audio_path)  # (B, T, d)
    print(dense_audio_embeds.shape)

    clip_scores = model.get_clip_level_score(audio_path, caption)  # (B, B)
    print(clip_scores.shape)

    frame_scores = model.get_frame_level_score(audio_path, phrases)  # (B, N, T)
    print(frame_scores.shape)

    # (Optional) Plot frame-level similarity; only supports a single audio file
    model.plot_frame_level_score(audio_path[1], phrases, output_path="output/output_plot.png")
```
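
Beyond the helper methods, the tensors returned by the script above can be post-processed directly. The sketch below continues from that script and is illustrative only; it rests on two assumptions not stated in this card: that the global embeddings are meaningful under cosine similarity, and that an argmax over the time axis of the frame-level scores is a reasonable proxy for where a phrase is most active.

```python
import torch.nn.functional as F

# Audio-to-caption retrieval from the global embeddings (assumption: cosine
# similarity is the intended metric; re-normalizing is harmless either way).
audio_emb = F.normalize(global_audio_embeds, dim=-1)  # (B, d)
text_emb = F.normalize(global_text_embeds, dim=-1)    # (B, d)
similarity = audio_emb @ text_emb.T                   # (B, B) audio-to-caption scores
best_caption = similarity.argmax(dim=-1)              # best caption index per clip
print(best_caption)

# Rough localization: the frame index where each phrase scores highest per clip.
# Converting indices to seconds would need the model's frame rate, which is not
# documented here, so we keep raw frame indices.
peak_frames = frame_scores.argmax(dim=-1)             # (B, N)
print(peak_frames)
```

Note that `get_clip_level_score` likely already returns a (B, B) similarity matrix of this kind, so the manual route is mainly useful when you cache embeddings and score them later.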