# LAION Sound-Effect Captioning Whisper

A Whisper-Small-sized audio-captioning model that writes rich, natural-language descriptions of general-purpose sound effects, ambiences, vocal bursts, and music snippets. Given ≤ 30 s of audio, it produces a paragraph-length caption describing the content, timbre, and likely source of the sound.

This is the final stage of a multi-step training lineage that stacks emotional speech captioning, generative-audio pre-training, and sound-effect fine-tuning on top of OpenAI's Whisper-Small.
## TL;DR

- Architecture: `WhisperForConditionalGeneration` · Whisper-Small · 241.7 M params
- Input: mono audio, 16 kHz, up to 30 s
- Output: free-form English caption describing the sound
- Best validation loss (held-out val set, ~500 samples across 5 source datasets): 1.431 — down from 1.494 after stage 2, 1.689 after stage 1, and from ~2.0 before any sound-effect training
- Training compute: 8 × H100 · DDP via `torchrun` · ~9 h wall-clock end-to-end across three fine-tuning stages
## Model genealogy

This checkpoint is the end of a multi-step lineage; every stage starts from the previous one.
- OpenAI Whisper-Small – ASR pre-training on 680 k hours of labelled speech.
- `laion/BUD-E-Whisper` – LAION's emotion-aware speech-captioning fine-tune. BUD-E-Whisper was trained on LAION's Got Talent plus ~5 k hours of public vlogs, with emotion scores generated by Gemini Flash 2.0 (40 emotion dimensions + 15 auxiliary dimensions such as age, arousal, valence, harshness and vocal-burst cues), templated into captions and then paraphrased for semantic richness. The resulting model can caption how a voice sounds, not just what it says.
- `laion/captioning-whisper-proof_of_concept` (`checkpoint-35000`) – a continued pre-training run on a broad mix of non-speech audio with Gemini-generated captions:
  - Re-captioned AudioSet (`mitermix/audioset-with-grounded-captions`)
  - Re-captioned Freesound (`laion/freesound-commercially-permissive-subset-with-captions`)
  - TangoFlux-generated sound events – synthetic sound effects produced with TangoFlux and re-captioned
  - LAION AI-music snippets – machine-generated music clips with Gemini-style captions

  This stage teaches the decoder the "sound-effect paragraph" writing style and expands its vocabulary from speech events to environmental, mechanical, musical and abstract audio events.
- This checkpoint – Sound-Effect Captioning Whisper – three additional fine-tuning stages on a fresh local mix of five public sound-effect / vocal-burst datasets (details below). Stage 1 = 1 epoch @ `5e-6` linear, Stage 2 = 3 epochs @ `1e-5` cosine (warmup 2 %), Stage 3 = 5 more epochs @ `5e-6` cosine (warmup 5 %). Each stage starts from the best checkpoint of the previous one, and the validation split is held fixed across all three, so val-loss numbers are directly comparable.
## Training data (this model)
All datasets were downloaded, normalised to {mp3, caption} pairs, and pooled
into a single flat directory. Per-dataset counts after filtering:
| Prefix | HuggingFace dataset | Kind | # pairs |
|---|---|---|---|
| `ad` | `mitermix/audioset-with-grounded-captions` | Re-captioned AudioSet (grounded captions) | ~575 k |
| `fs` | `laion/freesound-commercially-permissive-subset-with-captions` | Re-captioned Freesound (commercially permissive) | ~265 k |
| `gn` | `laion/generated-sound-events` | TangoFlux-generated sound events, Gemini-captioned | ~39 k |
| `iw` | `laion/in-the-wild-sound-events` | In-the-wild sound events | ~28 k |
| `vb` | `laion/synthetic_vocal_bursts` + other vocal-burst subsets | Synthetic / recorded vocal bursts | ~9 k |
| **Total** | | | **~916 k** |
From each prefix, 100 random samples were held out as a stable validation split (~500 total), leaving ~915.6 k for training.
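The per-prefix holdout described above can be sketched as follows. This is a minimal illustration, not the actual preparation script; `split_holdout` and the synthetic file list are hypothetical stand-ins for the pooled `{mp3, caption}` directory:

```python
import random
from collections import defaultdict

def split_holdout(pairs, per_prefix=100, seed=0):
    """Hold out `per_prefix` random samples per dataset prefix as a fixed
    validation split; everything else becomes training data."""
    by_prefix = defaultdict(list)
    for prefix, path in pairs:
        by_prefix[prefix].append(path)
    rng = random.Random(seed)          # fixed seed -> split is stable across stages
    train, val = [], []
    for prefix in sorted(by_prefix):   # deterministic prefix order
        paths = sorted(by_prefix[prefix])
        rng.shuffle(paths)
        val += paths[:per_prefix]
        train += paths[per_prefix:]
    return train, val

# Synthetic stand-in: 5 prefixes x 300 files each
pairs = [(p, f"{p}_{i:06d}.mp3") for p in ("ad", "fs", "gn", "iw", "vb")
         for i in range(300)]
train, val = split_holdout(pairs)
print(len(train), len(val))  # 1000 500
```

Because the seed and iteration order are fixed, re-running the split for each stage yields the same ~500-sample validation set, which is what makes the val-loss numbers comparable across stages.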
## Training setup

| Knob | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|
| Init from | `captioning-whisper-proof_of_concept` `checkpoint-35000` | stage-1 final-best | stage-2 final-best |
| Epochs | 1 | 3 | 5 |
| Peak LR | 5e-6 | 1e-5 | 5e-6 |
| LR schedule | linear | cosine | cosine |
| Warmup ratio | 3 % | 2 % | 5 % |
| Per-device batch | 10 | 10 | 10 |
| GPUs | 8 | 8 | 8 |
| Effective batch | 80 | 80 | 80 |
| Precision | fp16 | fp16 | fp16 |
| Eval / save cadence | ~250 k samples | ~500 k samples | ~500 k samples |
| Total training samples | ~915 k | ~2.75 M | ~4.58 M |
| Best val loss | 1.6894 @ step 9 375 | 1.4939 @ step 31 250 | 1.4314 @ step 56 250 |
| Wall clock | ~35 min | ~3 h | ~4 h 47 min |
The stage-3 loss descended monotonically at every eval step and had still not plateaued by the end of epoch 5, so a further continuation run would likely reduce it further.
Stage-3 eval curve (lower is better):
| step | samples seen | val loss |
|---|---|---|
| 6 250 | 500 000 | 1.4898 |
| 12 500 | 1 000 000 | 1.4742 |
| 18 750 | 1 500 000 | 1.4620 |
| 25 000 | 2 000 000 | 1.4512 |
| 31 250 | 2 500 000 | 1.4423 |
| 37 500 | 3 000 000 | 1.4374 |
| 43 750 | 3 500 000 | 1.4336 |
| 50 000 | 4 000 000 | 1.4317 |
| 56 250 | 4 500 000 | 1.4314 |
For reference, the earlier stage-2 run's eval curve on the same held-out set:
| step | samples seen | val loss |
|---|---|---|
| 6 250 | 500 000 | 1.5959 |
| 12 500 | 1 000 000 | 1.5464 |
| 18 750 | 1 500 000 | 1.5144 |
| 25 000 | 2 000 000 | 1.4993 |
| 31 250 | 2 500 000 | 1.4940 |
The final checkpoint uploaded here (`model.safetensors`) is the stage-3 best weights from step 56 250, restored via `load_best_model_at_end=True` at the end of stage-3 training.
## Usage

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

REPO = "laion/sound-effect-captioning-whisper"
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained(REPO).eval().to("cuda")
model.generation_config.forced_decoder_ids = None

# Load any audio ≤ 30 s
wav, sr = torchaudio.load("your_sound.mp3")
if wav.shape[0] > 1:  # stereo → mono
    wav = wav.mean(dim=0, keepdim=True)
if sr != 16_000:      # resample → 16 kHz
    wav = torchaudio.functional.resample(wav, sr, 16_000)
arr = wav.squeeze(0).numpy()[: 16_000 * 30]

feats = processor.feature_extractor(
    arr, sampling_rate=16_000, return_tensors="pt"
).input_features.to("cuda")
with torch.no_grad():
    ids = model.generate(feats, max_new_tokens=400, num_beams=1, do_sample=False)
print(processor.batch_decode(ids, skip_special_tokens=True)[0].strip())
```
Notes:

- Use the base `openai/whisper-small` processor: this repo only ships `tokenizer.json` + `preprocessor_config.json`, which are the Whisper-Small defaults – the processor class needs the bundled `chat_template.json` / special-tokens files that live on the base repo.
- `forced_decoder_ids = None` is required; otherwise Whisper's transcription language prompt clamps the decoder into ASR mode and suppresses captioning.
- Greedy decoding works well; `num_beams=4` trades a bit of repetition for a small quality bump but is not necessary.
## Example predictions

All four clips below are drawn from the fixed held-out validation split (these exact files were never seen during training). They were captioned with this released checkpoint (stage-3 final-best, step 56 250, val loss 1.4314).
### Example 1 — Vocal burst · groan
Reference: This file features a low, rumbling vocalization, possibly a groan. There's a sense of strain in the sound, and it's drawn out, not abrupt. The recording appears to be of moderate quality, and no other sound events are discernible. The vocalization does not seem to be musical or speech-related.
Prediction: The audio features a low, guttural vocalization, possibly a groan or a moan. The sound is drawn out and has a somewhat strained quality, suggesting discomfort or effort. The recording is clear, with no noticeable background noise. The vocalization is the primary focus of the audio. The speaker sounds like an adult male.
### Example 2 — Sound effect · sonar-style beeps
Reference: Repetitive, high-pitched electronic beeps with a reverberant quality, resembling sonar or a digital alert. This sound could indicate a warning, a scanning process (like sonar), or a futuristic interface sound in a sci-fi context.
Prediction: The audio features a distinct, high-pitched, and sustained electronic tone that rapidly sweeps upwards in frequency, creating a sweeping effect. This is followed by a series of shorter, distinct electronic beeps or pulses, which then transition into a continuous, high-pitched, and somewhat distorted electronic tone. This sounds like a synthesized sound effect, possibly used in electronic music, video games, or as a sound design element to create a sense of tension, transition, or an otherworldly atmosphere.
### Example 3 — Music · upbeat electronic
Reference: The audio features a high-energy, upbeat piece of music. The melody is carried by a synthesizer, with a driving beat provided by electronic drums. A heavy bassline is prominent. There are sections with and without a strong beat. This recording is a song likely meant for dancing, especially as there is a synth melody.
Prediction: The audio features a fast-paced, energetic electronic dance track. A prominent, repetitive synth melody is layered over a driving beat. The overall sound is bright and upbeat. This is a clip from an electronic dance music track, likely intended for dancing or club environments. The hint confirms the presence of music.
### Example 4 — Machinery · mechanical whine (TangoFlux-generated)
Reference: The audio features a continuous, high-pitched whirring sound, which is then joined by a distinct, rhythmic clanking or grinding noise. This clanking/grinding sound is accompanied by a squealing sound that rises in pitch and then quickly descends. The sounds are consistent and repetitive. This sound is indicative of heavy machinery in operation, possibly a large engine or industrial equipment, given the combination of continuous whirring and repetitive clanking/grinding sounds, along with the squealing of moving parts.
Prediction: The audio features a continuous, high-pitched mechanical whine, characteristic of a jet engine. This sound is accompanied by a distinct, rhythmic clanking or grinding noise, suggesting the movement of mechanical parts. The overall sound is loud and sustained. This soundscape is consistent with the operation of a large industrial fan or a powerful engine, possibly from a jet engine or a large vehicle. The combination of whine, grinding, and clanking suggests the movement of heavy machinery or a large mechanical system.
## Intended use
- Automatic tagging / describing of sound-effect libraries and stock-audio collections.
- Audio-understanding component in agents that need to talk about what they hear.
- Research baseline for open-vocabulary audio captioning on top of Whisper.
## Limitations
- The model has no speech-to-text output: the base ASR behaviour of Whisper is intentionally suppressed, and transcribing spoken content is out of scope.
- Trained on ≤ 30 s clips; longer inputs are truncated. For long recordings, chunk them yourself and run the model per chunk.
- Captioning style is strongly biased toward the Gemini-style "the audio features …" phrasing inherited from the training data.
- Some mechanical / music / everyday-sound confusions remain; for difficult clips the model tends to fall back to generic genre templates ("electronic dance track", "jet engine", etc.).
- Training data is predominantly English and primarily captures the tastes of the AudioSet + Freesound + LAION-curated distributions.
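Since inputs beyond 30 s are truncated, long recordings need to be chunked before captioning. A minimal sketch of the chunking step (`chunk_audio` is a hypothetical helper; shown with NumPy, but the same slicing works on a 1-D torch tensor before feature extraction):

```python
import numpy as np

def chunk_audio(samples, sr=16_000, chunk_s=30):
    """Split a 1-D 16 kHz waveform into consecutive chunks of at most
    `chunk_s` seconds, for captioning one chunk at a time."""
    n = sr * chunk_s
    return [samples[i:i + n] for i in range(0, len(samples), n)]

# 65 s of silence at 16 kHz -> three chunks: 30 s, 30 s, 5 s
chunks = chunk_audio(np.zeros(16_000 * 65))
print([len(c) / 16_000 for c in chunks])  # [30.0, 30.0, 5.0]
```

Each chunk can then be fed through the feature extractor and `model.generate` exactly as in the Usage snippet, yielding one caption per 30 s window.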
## Acknowledgements
- OpenAI Whisper for the base model.
- LAION for curating the BUD-E-Whisper emotion dataset, the captioning-whisper proof-of-concept, the freesound / vocal-burst / in-the-wild / generated sound-event corpora, and the AI-music caption snippets used in the intermediate pre-training stage.
- TangoFlux for generating the synthetic sound-event data that fed the intermediate pre-training step.
- Gemini Flash 2.0 for the underlying caption generation and paraphrasing across multiple datasets.
- mitermix/audioset-with-grounded-captions for the grounded AudioSet captions.
## License
Released under Apache-2.0, following the upstream Whisper license.
The audio clips in `examples/` are samples from the five training datasets and retain their respective upstream licences (see the dataset pages linked above).