---
license: mit
library_name: transformers
tags:
- audio-grounding
- audio-text-retrieval
- sound-event-detection
- multimodal
- clap
pipeline_tag: feature-extraction
---


# FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining

  [![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2604.01155)
  [![Hugging Face Model](https://img.shields.io/badge/Model-HuggingFace-yellow?logo=huggingface)](https://huggingface.co/AndreasXi/FineLAP)
  [![Hugging Face Dataset](https://img.shields.io/badge/Dataset-HuggingFace-blue?logo=huggingface)](https://huggingface.co/datasets/AndreasXi/FineLAP-100k)

FineLAP is a contrastively pre-trained audio-language model that excels at both clip-level and frame-level audio understanding tasks.

Use the script below to extract clip- and frame-level features or to compute audio-text similarity scores:

```python
import torch
from transformers import AutoModel

audio_path = ["resources/1.wav", "resources/2.wav"]  # B audio files
caption = ["A woman speaks, dishes clanking, food frying, and music plays",
           "A power tool is heard with male speech."]  # B captions, paired with the audio files
phrases = ["Speech", "Dog", "Cat", "Frying", "Dishes", "Music", "Vacuum", "Type", "Power tool"]  # N query phrases

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# trust_remote_code is required because the FineLAP model class is defined in the checkpoint repo
model = AutoModel.from_pretrained("AndreasXi/FineLAP", trust_remote_code=True).to(device)
model.eval()

with torch.no_grad():
    global_text_embeds = model.get_global_text_embeds(caption)  # (B, d)
    print(global_text_embeds.shape)

    global_audio_embeds = model.get_global_audio_embeds(audio_path)   # (B, d)
    print(global_audio_embeds.shape)

    dense_audio_embeds = model.get_dense_audio_embeds(audio_path)  # (B, T, d)
    print(dense_audio_embeds.shape)

    clip_scores = model.get_clip_level_score(audio_path, caption)  # (B, B)
    print(clip_scores.shape)

    frame_scores = model.get_frame_level_score(audio_path, phrases)  # (B, N, T)
    print(frame_scores.shape)
    
    # (Optional) Plot the frame-level similarity map; only supports a single audio file
    model.plot_frame_level_score(audio_path[1], phrases, output_path="output/output_plot.png")
```
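
Since `get_clip_level_score` returns a `(B, B)` similarity matrix, clip-level audio-text retrieval reduces to an argmax over its rows. A minimal sketch reusing the variables from the script above (illustrative post-processing, not part of the FineLAP API):

```python
# Row i of clip_scores compares audio i against every caption in the batch.
best = clip_scores.argmax(dim=1)  # (B,) index of the best-matching caption per audio
for i, j in enumerate(best.tolist()):
    print(f"{audio_path[i]} -> {caption[j]!r}")
```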
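The frame-level scores can likewise drive simple text-queried sound-event detection: threshold each phrase's `(T,)` score track and merge consecutive active frames into segments. A minimal sketch, assuming a hypothetical frame hop of `FRAME_HOP_SEC` seconds and an illustrative 0.5 threshold (neither value is specified by FineLAP; check the model configuration for the actual frame rate and score range):

```python
FRAME_HOP_SEC = 0.02  # hypothetical seconds per frame (assumption, not from FineLAP)

def scores_to_events(scores, phrases, threshold=0.5):
    """Convert an (N, T) score matrix for one clip into (phrase, start_s, end_s) segments."""
    events = []
    for n, phrase in enumerate(phrases):
        active = (scores[n] > threshold).tolist()  # T booleans, one per frame
        t, T = 0, len(active)
        while t < T:
            if active[t]:
                start = t
                # Extend the segment over the contiguous run of active frames.
                while t < T and active[t]:
                    t += 1
                events.append((phrase, start * FRAME_HOP_SEC, t * FRAME_HOP_SEC))
            else:
                t += 1
    return events

for phrase, start, end in scores_to_events(frame_scores[1].cpu(), phrases):
    print(f"{phrase}: {start:.2f}s - {end:.2f}s")
```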