---
language: en
license: apache-2.0
tags:
- audio
- speech
- transcription
- librispeech
- llama-3
datasets:
- librispeech_asr
---
|
|
|
|
|
# Audio-LLaMA: LoRA Adapter for Audio Understanding |
|
|
|
|
|
|
|
|
|
|
|
|
|
## Model Details

- **Base Model**: meta-llama/Llama-3.2-3B-Instruct
- **Audio Model**: openai/whisper-large-v3-turbo
- **LoRA Rank**: 32
- **Task**: Audio transcription on the LibriSpeech dataset
- **Training Framework**: PEFT (Parameter-Efficient Fine-Tuning)
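For context on what rank 32 means in practice: LoRA replaces each frozen weight update with a low-rank product `B @ A`, so a matrix of shape `(d_out, d_in)` gains only `r * (d_in + d_out)` trainable parameters. A minimal sketch, assuming Llama-3.2-3B's hidden size of 3072 (the actual target modules depend on the adapter's training config):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA factorizes the weight update as B @ A,
    # with A of shape (r, d_in) and B of shape (d_out, r)
    return r * d_in + d_out * r

# Example: one 3072x3072 attention projection in Llama-3.2-3B
full = 3072 * 3072                           # 9,437,184 frozen parameters
lora = lora_param_count(3072, 3072, r=32)    # 196,608 trainable parameters
print(f"LoRA adds {lora:,} params vs {full:,} full ({lora / full:.2%})")
```

At roughly 2% of each adapted matrix, this is why the adapter download is small relative to the base model.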
|
|
|
|
|
## Usage |
|
|
|
|
|
This repository contains a PEFT (LoRA) adapter; it must be loaded on top of the base Llama model to work:
|
|
|
|
|
```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the LoRA configuration to find the base model
config = PeftConfig.from_pretrained("cdreetz/audio-llama")

# Load the base model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Attach the LoRA adapter
model = PeftModel.from_pretrained(model, "cdreetz/audio-llama")

# Run inference (text-only; actual transcription additionally requires
# Whisper-processed audio features -- see Limitations below)
prompt = "Transcribe this audio:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
|
|
|
|
|
## Training |
|
|
|
|
|
This model was fine-tuned using LoRA on audio transcription tasks. It starts with a Llama 3 base model and uses Whisper-processed audio features for audio understanding. |
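The card does not spell out how the Whisper features reach the Llama model. A common pattern in audio-LLM adapters (an assumption here, not confirmed by this repo; see its code for the actual mechanism) is to project the audio encoder's hidden states into the LLM's embedding space with a learned linear layer and prepend them to the text embeddings. A shape-only sketch, using Whisper large-v3-turbo's 1280-dim encoder states and Llama-3.2-3B's 3072-dim hidden size, with illustrative random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
audio_states = rng.standard_normal((1500, 1280))    # one 30 s Whisper encoder output
W_proj = rng.standard_normal((1280, 3072)) * 0.02   # hypothetical learned projector

# Project audio states into the LLM embedding space
audio_embeds = audio_states @ W_proj                # shape (1500, 3072)
print(audio_embeds.shape)
```

The projected sequence can then be concatenated with the prompt's token embeddings before the forward pass, which is why plain text prompting alone cannot transcribe audio.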
|
|
|
|
|
## Limitations |
|
|
|
|
|
This model requires special code for audio processing with Whisper before passing to the Llama model. See the [Audio-LLaMA repository](https://github.com/cdreetz/audio-llama) for full usage instructions. |
|
|
|