Add metadata and improve model card

#1
by nielsr HF Staff - opened
Files changed (1) hide show
  1. README.md +20 -14
README.md CHANGED
@@ -1,11 +1,19 @@
1
- # AuroLA
2
- This repo contains the checkpoint of the following paper:
 
 
 
3
 
4
- **Scaling Audio-Text Retrieval with Multimodal Large Language Model**
 
 
 
 
 
5
 
6
  ## Quick Start
7
 
8
- Try the model to extract audio and text features.
9
 
10
  ```python
11
  import torch
@@ -44,8 +52,8 @@ def get_embed_feature(hidden_states, input_ids, embed_index):
44
  embed_features = hidden_states[torch.arange(len(embed_indices)), embed_indices - 1]
45
  return embed_features
46
 
47
- # 1) Load model + processor (same style as Qwen2.5-Omni)
48
- model_path = "Jazzcharles/AuroLA-3B" # or your HF repo id
49
 
50
  device = "cuda" if torch.cuda.is_available() else "cpu"
51
  dtype = torch.bfloat16 if device == "cuda" else torch.float32
@@ -62,16 +70,13 @@ emb_token_ids = add_embed_token(tokenizer, model)
62
 
63
 
64
  # 2) Prepare retrieval inputs
65
- # audio paths and text queries can be any same-batch lists
66
  audio_files = [
67
- "/mnt/data/AudioCaps/audio/--0w1YA1Hm4_30.wav",
68
- "/mnt/data/AudioCaps/audio/-AheI8Epim4_30.wav",
69
- "/mnt/data/AudioCaps/audio/-BUWGM7qeUM_10.wav",
70
  ]
71
  text_queries = [
72
  "A vehicle driving as a man and woman are talking and laughing",
73
  "Muffled sounds followed by metal being hit",
74
- "Wind is blowing and heavy rain is falling and splashing",
75
  ]
76
 
77
  # Build audio-side messages
@@ -97,7 +102,8 @@ text_messages = [
97
  [
98
  {
99
  "role": "user",
100
- "content": [{"type": "text", "text": f"{t}\nSummarize above sentence in one word:"}],
 
101
  },
102
  {
103
  "role": "assistant",
@@ -119,10 +125,10 @@ with torch.inference_mode():
119
  text_out = model(**text_inputs, output_hidden_states=True, return_dict=True, use_audio_in_video=False)
120
  text_feat = get_embed_feature(text_out.hidden_states[-1], text_inputs['input_ids'], emb_token_ids)
121
 
122
- # 5) Similarity + top-k retrieval
123
  audio_feat = F.normalize(audio_feat, dim=-1)
124
  text_feat = F.normalize(text_feat, dim=-1)
125
- score = text_feat @ audio_feat.T # [N_text, N_audio]
126
  print(score.shape, score)
127
  ```
128
 
 
1
+ ---
2
+ license: apache-2.0
3
+ library_name: transformers
4
+ pipeline_tag: feature-extraction
5
+ ---
6
 
7
+ # AuroLA: Scaling Audio-Text Retrieval with Multimodal Large Language Models
8
+
9
+ AuroLA is a novel contrastive language-audio pre-training framework that re-purposes Multimodal Large Language Models (MLLMs) as a unified backbone for retrieval. It treats retrieval as a summarization task, using specific token hidden states as embeddings to align audio and text.
10
+
11
+ - **Paper:** [Scaling Audio-Text Retrieval with Multimodal Large Language Models](https://huggingface.co/papers/2602.18010)
12
+ - **Code:** [GitHub - Jazzcharles/AuroLA](https://github.com/Jazzcharles/AuroLA)
13
 
14
  ## Quick Start
15
 
16
+ Try the model to extract audio and text features. This requires the `qwen_omni_utils.py` script from the [official repository](https://github.com/Jazzcharles/AuroLA).
17
 
18
  ```python
19
  import torch
 
52
  embed_features = hidden_states[torch.arange(len(embed_indices)), embed_indices - 1]
53
  return embed_features
54
 
55
+ # 1) Load model + processor
56
+ model_path = "Jazzcharles/AuroLA-3B"
57
 
58
  device = "cuda" if torch.cuda.is_available() else "cpu"
59
  dtype = torch.bfloat16 if device == "cuda" else torch.float32
 
70
 
71
 
72
  # 2) Prepare retrieval inputs
 
73
  audio_files = [
74
+ "/path/to/audio1.wav",
75
+ "/path/to/audio2.wav",
 
76
  ]
77
  text_queries = [
78
  "A vehicle driving as a man and woman are talking and laughing",
79
  "Muffled sounds followed by metal being hit",
 
80
  ]
81
 
82
  # Build audio-side messages
 
102
  [
103
  {
104
  "role": "user",
105
+ "content": [{"type": "text", "text": f"{t}\nSummarize above sentence in one word:"}],
107
  },
108
  {
109
  "role": "assistant",
 
125
  text_out = model(**text_inputs, output_hidden_states=True, return_dict=True, use_audio_in_video=False)
126
  text_feat = get_embed_feature(text_out.hidden_states[-1], text_inputs['input_ids'], emb_token_ids)
127
 
128
+ # 5) Similarity
129
  audio_feat = F.normalize(audio_feat, dim=-1)
130
  text_feat = F.normalize(text_feat, dim=-1)
131
+ score = text_feat @ audio_feat.T  # [N_text, N_audio]
132
  print(score.shape, score)
133
  ```
134