Music Whisper (Whisper Small Fine-tune)

This is a fine-tuned version of OpenAI Whisper Small specifically trained for Music Captioning.

Model Details

Base Model: OpenAI Whisper Small
Task: Music Captioning (generating textual descriptions of audio tracks)
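A minimal sketch of how captions might be generated with the Hugging Face `transformers` library. It assumes the repository `laion/music-whisper` ships the processor and tokenizer files alongside the weights (if it does not, loading the processor from `openai/whisper-small` may work instead), and that `librosa` is installed for audio decoding. The names `split_into_windows` and `caption_file` are hypothetical helpers introduced for illustration, and the `max_new_tokens` value is a guess. Whisper models take 16 kHz audio in windows of at most 30 seconds, so longer tracks are captioned window by window here; whether the fine-tune was trained on clips of that length is not stated in this card.

```python
from typing import List, Tuple

MODEL_ID = "laion/music-whisper"  # repo id taken from this card; assumed loadable as-is
SAMPLE_RATE = 16_000              # Whisper models expect 16 kHz mono audio
WINDOW_SECONDS = 30               # Whisper's encoder sees at most 30 s per pass


def split_into_windows(
    num_samples: int,
    sample_rate: int = SAMPLE_RATE,
    window_seconds: int = WINDOW_SECONDS,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices that cover a clip in fixed-size windows."""
    step = sample_rate * window_seconds
    return [(start, min(start + step, num_samples)) for start in range(0, num_samples, step)]


def caption_file(path: str, max_new_tokens: int = 400) -> List[str]:
    """Caption an audio file, one caption per 30 s window (hypothetical helper)."""
    # Heavy dependencies are imported lazily so the helper above works without them.
    import librosa
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(MODEL_ID)
    model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)

    # Decode to mono and resample to the rate the feature extractor expects.
    audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)

    captions = []
    for start, end in split_into_windows(len(audio)):
        inputs = processor(audio[start:end], sampling_rate=SAMPLE_RATE, return_tensors="pt")
        token_ids = model.generate(inputs.input_features, max_new_tokens=max_new_tokens)
        captions.append(processor.batch_decode(token_ids, skip_special_tokens=True)[0].strip())
    return captions
```

Calling `caption_file("track.wav")` (the file name is a placeholder) would return a list with one caption string per window.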

Reference Samples

This repository includes an HTML file with embedded reference samples. Download it and open it in a browser to hear each audio clip alongside its generated caption.

Example Predictions

Below are examples of captions generated by this model (Ground Truth/Prediction comparison):

Example 1

"The listener hears a track that immediately establishes itself with a distorted guitar riff, setting a raw and edgy tone. A driving drum beat soon joins, characterized by a prominent snare drum and a heavy kick drum, solidifying the rhythmic foundation. The tempo is approximately 140 beats per minute, contributing to the track's energetic feel. A male vocalist then begins to sing, employing a slightly strained vocal delivery. The vocal profile is that of an adult male, with a timbre and quality described as slightly raspy and aggressive. There are no non-lyrical vocal sounds present. The genre of the music is identified as either Alternative Rock or Hard Rock. The overall mood of the piece is energetic, rebellious, and carries a slight sense of angst. The combination of distorted guitars, the driving drum beat, and the strained vocal delivery are all characteristic elements of these genres. The production quality is relatively clean, yet it retains a raw edge, which enhances the overall intensity of the track. This track would be well-suited for a high-energy action scene, a rock concert, or a rebellious youth anthem."

Example 2

"The listener hears a piece of music primarily characterized by a slow tempo and a melancholic melody. The primary instrumentation consists of a piano, which carries the main melodic line, and a string section that provides a subtle harmonic foundation. The piano's timbre is described as warm and slightly muffled, contributing to the overall sonic texture. The string section is played with a gentle touch, adding a layer of emotional depth to the piece. The tempo is slow, contributing to the overall mood. The genre of the music is identified as either Classical or Film Score. The mood is predominantly melancholic, reflective, and peaceful. The combination of the piano's melody, the string section's presence, and the string section's delicate sound creates a sense of intimacy and vulnerability. This musical composition would be well-suited for a dramatic scene in a film, a quiet moment of reflection, or as background music for a relaxing activity."

Model size: 0.2B params
Tensor type: F32 (Safetensors)

Model tree for laion/music-whisper

Base model: openai/whisper-small
