---
license: mit
tags:
  - audio-llm
  - whisper
  - llama
  - speech
  - audio
  - multimodal
pipeline_tag: text-generation
---

# AudioLLM

AudioLLM is a multimodal model that pairs Whisper's audio encoder with LLaMA's text generation, producing a model that can take audio input and respond with text.

## Model Details

- **Model Type:** AudioLLM
- **Base LLM:** meta-llama/Llama-3.2-3B-Instruct
- **Audio Encoder:** openai/whisper-large-v3-turbo
- **License:** MIT

## Usage

This model integrates with the standard Hugging Face Pipeline API:

```python
import torch
from transformers import pipeline

# Load the pipeline
audio_llm = pipeline(
    "text-generation",
    model="cdreetz/audio-llama-hf",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# Process an audio file
result = audio_llm("path/to/audio.wav")
print(result[0]["generated_text"])

# Process audio with a custom prompt
result = audio_llm(("path/to/audio.wav", "Describe the music in this audio:"))
print(result[0]["generated_text"])

# Text-only generation
result = audio_llm("Write a poem about sound:")
print(result[0]["generated_text"])
```

See `example.py` for more advanced usage examples.

## Limitations

- Maximum audio length is limited to 30 seconds
- Audio quality may affect performance
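Because of the 30-second limit, longer recordings should be shortened before being passed to the pipeline. The sketch below shows one minimal way to truncate a waveform; the 16 kHz sample rate is an assumption based on Whisper's usual input convention, and `trim_to_limit` is a hypothetical helper, not part of this model's API.

```python
import numpy as np

SAMPLE_RATE = 16000              # assumed: Whisper encoders typically expect 16 kHz mono audio
MAX_SECONDS = 30                 # the model's maximum supported clip length
MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS

def trim_to_limit(waveform: np.ndarray) -> np.ndarray:
    """Truncate a mono waveform to the 30-second limit (no-op if already shorter)."""
    return waveform[:MAX_SAMPLES]

# 45 seconds of silence as a stand-in for real audio
audio = np.zeros(SAMPLE_RATE * 45, dtype=np.float32)
print(trim_to_limit(audio).shape)  # (480000,)
```

For real files you would load the waveform with a library such as `librosa` or `soundfile` first; an alternative to truncation is splitting long audio into 30-second chunks and running the pipeline on each.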

## Credits

This model combines OpenAI's Whisper for audio understanding with Meta's LLaMA for text generation.