LAION Sound-Effect Captioning Whisper

A Whisper-Small-sized audio-captioning model that writes rich, natural-language descriptions of general-purpose sound effects, ambiences, vocal bursts, and music snippets. Given ≤ 30 s of audio it produces a paragraph-length caption describing the content, timbre, and likely source of the sound.

This is the final stage of a multi-step training lineage that stacks emotional speech captioning, generative-audio pre-training, and sound-effect fine-tuning on top of OpenAI's Whisper-Small.

TL;DR

  • Architecture: WhisperForConditionalGeneration · Whisper-Small · 241.7 M params
  • Input: mono audio, 16 kHz, up to 30 s
  • Output: free-form English caption describing the sound
  • Best validation loss (held-out val set, ~500 samples across 5 source datasets): 1.431 — down from 1.494 after stage 2, 1.689 after stage 1, and from ~2.0 before any sound-effect training
  • Training compute: 8 × H100 · DDP via torchrun · ~9 h wall-clock end-to-end across three fine-tuning stages

Model genealogy

This checkpoint is the end of a multi-step lineage; every stage starts from the previous one.

  1. OpenAI Whisper-Small – ASR pre-training on 680 k hours of labelled speech.
  2. laion/BUD-E-Whisper – LAION's emotion-aware speech captioning fine-tune. BUD-E-Whisper was trained on LAION's Got Talent plus ~5 k hours of public vlogs, with emotion scores generated by Gemini Flash 2.0 (40 emotion dimensions + 15 auxiliary dimensions such as age, arousal, valence, harshness and vocal-burst cues), templated into captions and then paraphrased for semantic richness. The resulting model can caption how a voice sounds, not just what it says.
  3. laion/captioning-whisper-proof_of_concept (checkpoint-35000) – a continued pre-training run on a broad mix of non-speech audio with Gemini-generated captions:
    • Re-captioned AudioSet (mitermix/audioset-with-grounded-captions)
    • Re-captioned Freesound (laion/freesound-commercially-permissive-subset-with-captions)
    • TangoFlux-generated sound events – synthetic sound effects produced with TangoFlux and re-captioned
    • LAION AI-music snippets – machine-generated music clips with Gemini-style captions

  This stage teaches the decoder the "sound-effect paragraph" writing style and expands its vocabulary from speech events to environmental, mechanical, musical, and abstract audio events.
  4. This checkpoint – Sound-Effect Captioning Whisper – three additional fine-tuning stages on a fresh local mix of five public sound-effect / vocal-burst datasets (details below). Stage 1 = 1 epoch @ 5e-6 linear (warmup 3 %), Stage 2 = 3 epochs @ 1e-5 cosine (warmup 2 %), Stage 3 = 5 more epochs @ 5e-6 cosine (warmup 5 %). Each stage starts from the best checkpoint of the previous one, and the validation split is held fixed across all three stages, so val-loss numbers are directly comparable.
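The per-stage schedules are standard linear / cosine decay with warmup. As a minimal sketch (this follows the usual Hugging Face cosine-with-warmup formulation; the step counts are back-of-the-envelope from the dataset size and effective batch, not values logged by the actual run):

```python
import math

def cosine_lr_with_warmup(step, total_steps, peak_lr, warmup_ratio):
    """LR at a given optimizer step: linear warmup, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Stage 2: 3 epochs over ~915.6 k samples at effective batch 80 ≈ 34 335 steps
total = 3 * 915_600 // 80
print(cosine_lr_with_warmup(int(total * 0.02), total, 1e-5, 0.02))  # peak LR at end of warmup
```

Stage 3 is the same shape with peak 5e-6 and a 5 % warmup; stage 1 uses a plain linear decay instead of the cosine branch.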

Training data (this model)

All datasets were downloaded, normalised to {mp3, caption} pairs, and pooled into a single flat directory. Per-dataset counts after filtering:

| Prefix | HuggingFace dataset | Kind | # pairs |
|---|---|---|---|
| ad | mitermix/audioset-with-grounded-captions | Re-captioned AudioSet (grounded captions) | ~575 k |
| fs | laion/freesound-commercially-permissive-subset-with-captions | Re-captioned Freesound (commercially permissive) | ~265 k |
| gn | laion/generated-sound-events | TangoFlux-generated sound events, Gemini-captioned | ~39 k |
| iw | laion/in-the-wild-sound-events | In-the-wild sound events | ~28 k |
| vb | laion/synthetic_vocal_bursts + other vocal-burst subsets | Synthetic / recorded vocal bursts | ~9 k |
| Total | | | ~916 k |

From each prefix, 100 random samples were held out as a stable validation split (~500 total), leaving ~915.6 k for training.
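A per-prefix holdout like this is simple to reproduce; the sketch below is illustrative (the seed, directory layout, and selection logic are assumptions, not the actual preparation script):

```python
import random
from collections import defaultdict
from pathlib import Path

def split_holdout(pair_dir, per_prefix=100, seed=0):
    """Group {mp3, caption} pairs by their dataset prefix and hold out
    `per_prefix` random files from each group as a fixed validation set."""
    by_prefix = defaultdict(list)
    for mp3 in sorted(Path(pair_dir).glob("*.mp3")):
        by_prefix[mp3.name[:2]].append(mp3)  # ad / fs / gn / iw / vb
    rng = random.Random(seed)
    train, val = [], []
    for prefix, files in sorted(by_prefix.items()):
        held = set(rng.sample(files, min(per_prefix, len(files))))
        val.extend(held)
        train.extend(f for f in files if f not in held)
    return train, val
```

Fixing the seed keeps the validation split identical across the three fine-tuning stages, which is what makes the val-loss numbers below comparable.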

Training setup

| Knob | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|
| Init from | captioning-whisper-proof_of_concept checkpoint-35000 | stage-1 final-best | stage-2 final-best |
| Epochs | 1 | 3 | 5 |
| Peak LR | 5e-6 | 1e-5 | 5e-6 |
| LR schedule | linear | cosine | cosine |
| Warmup ratio | 3 % | 2 % | 5 % |
| Per-device batch | 10 | 10 | 10 |
| GPUs | 8 | 8 | 8 |
| Effective batch | 80 | 80 | 80 |
| Precision | fp16 | fp16 | fp16 |
| Eval / save cadence | ~250 k samples | ~500 k samples | ~500 k samples |
| Total training samples | ~915 k | ~2.75 M | ~4.58 M |
| Best val loss | 1.6894 @ step 9 375 | 1.4939 @ step 31 250 | 1.4314 @ step 56 250 |
| Wall clock | ~35 min | ~3 h | ~4 h 47 min |
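The step numbers and cadences above are consistent with the effective batch size; a quick sanity check (pure arithmetic, no training code):

```python
# Effective batch under DDP = per-device batch × number of GPUs
per_device, n_gpus = 10, 8
effective_batch = per_device * n_gpus          # 80

# Stages 2/3 evaluate every ~500 k samples → every 6 250 optimizer steps
eval_every_steps = 500_000 // effective_batch  # 6 250

# Stage-3 best checkpoint: step 56 250 is the 9th eval, ~4.5 M samples seen
best_step = 56_250
assert best_step % eval_every_steps == 0
assert best_step * effective_batch == 4_500_000
```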

The stage-3 validation loss descended monotonically at every eval step and still had not plateaued at the end of epoch 5, so a further continuation run would likely shave off a little more.

Stage-3 eval curve (lower is better):

| Step | Samples seen | Val loss |
|---|---|---|
| 6 250 | 500 000 | 1.4898 |
| 12 500 | 1 000 000 | 1.4742 |
| 18 750 | 1 500 000 | 1.4620 |
| 25 000 | 2 000 000 | 1.4512 |
| 31 250 | 2 500 000 | 1.4423 |
| 37 500 | 3 000 000 | 1.4374 |
| 43 750 | 3 500 000 | 1.4336 |
| 50 000 | 4 000 000 | 1.4317 |
| 56 250 | 4 500 000 | 1.4314 |
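The descent in the curve above can be checked directly from the tabulated values: loss drops at every eval, and the per-eval improvement shrinks but is still nonzero at the final eval.

```python
# (step, val loss) pairs from the stage-3 eval curve
stage3 = [
    (6_250, 1.4898), (12_500, 1.4742), (18_750, 1.4620),
    (25_000, 1.4512), (31_250, 1.4423), (37_500, 1.4374),
    (43_750, 1.4336), (50_000, 1.4317), (56_250, 1.4314),
]
losses = [loss for _, loss in stage3]

# Loss decreases at every single eval step ...
assert all(b < a for a, b in zip(losses, losses[1:]))

# ... and the last improvement is small but still nonzero
deltas = [round(a - b, 4) for a, b in zip(losses, losses[1:])]
print(deltas)
```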

For reference, the earlier stage-2 run's eval curve on the same held-out set:

| Step | Samples seen | Val loss |
|---|---|---|
| 6 250 | 500 000 | 1.5959 |
| 12 500 | 1 000 000 | 1.5464 |
| 18 750 | 1 500 000 | 1.5144 |
| 25 000 | 2 000 000 | 1.4993 |
| 31 250 | 2 500 000 | 1.4940 |

The final checkpoint uploaded here (model.safetensors) is the stage-3 best weights from step 56 250, restored via load_best_model_at_end=True at the end of stage-3 training.

Usage

```python
import torch, torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

REPO = "laion/sound-effect-captioning-whisper"
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained(REPO).eval().to("cuda")
model.generation_config.forced_decoder_ids = None

# Load any audio ≤ 30 s
wav, sr = torchaudio.load("your_sound.mp3")
if wav.shape[0] > 1:                 # downmix to mono
    wav = wav.mean(dim=0, keepdim=True)
if sr != 16_000:                     # resample to 16 kHz
    wav = torchaudio.functional.resample(wav, sr, 16_000)
arr = wav.squeeze(0).numpy()[: 16_000 * 30]

feats = processor.feature_extractor(arr, sampling_rate=16_000,
                                    return_tensors="pt").input_features.to("cuda")
with torch.no_grad():
    ids = model.generate(feats, max_new_tokens=400, num_beams=1, do_sample=False)
print(processor.batch_decode(ids, skip_special_tokens=True)[0].strip())
```

Notes:

  • Use the base openai/whisper-small processor: this repo ships only tokenizer.json and preprocessor_config.json (which are the Whisper-Small defaults), while the WhisperProcessor class also needs the special-tokens / template files (e.g. chat_template.json) that are bundled with the base repo.
  • Setting forced_decoder_ids = None is required; otherwise Whisper's transcription language prompt clamps the decoder into ASR mode and suppresses captioning.
  • Greedy decoding works well; beam search (num_beams=4) can give a small quality bump at the cost of occasional repetition, but it is not necessary.

Example predictions

All four clips below are drawn from the fixed held-out validation split (these exact files were never seen during training). They were captioned with this released checkpoint (stage-3 final-best, step 56 250, val loss 1.4314).

Example 1 — Vocal burst · groan

Reference: This file features a low, rumbling vocalization, possibly a groan. There's a sense of strain in the sound, and it's drawn out, not abrupt. The recording appears to be of moderate quality, and no other sound events are discernible. The vocalization does not seem to be musical or speech-related.

Prediction: The audio features a low, guttural vocalization, possibly a groan or a moan. The sound is drawn out and has a somewhat strained quality, suggesting discomfort or effort. The recording is clear, with no noticeable background noise. The vocalization is the primary focus of the audio. The speaker sounds like an adult male.

Example 2 — Sound effect · sonar-style beeps

Reference: Repetitive, high-pitched electronic beeps with a reverberant quality, resembling sonar or a digital alert. This sound could indicate a warning, a scanning process (like sonar), or a futuristic interface sound in a sci-fi context.

Prediction: The audio features a distinct, high-pitched, and sustained electronic tone that rapidly sweeps upwards in frequency, creating a sweeping effect. This is followed by a series of shorter, distinct electronic beeps or pulses, which then transition into a continuous, high-pitched, and somewhat distorted electronic tone. This sounds like a synthesized sound effect, possibly used in electronic music, video games, or as a sound design element to create a sense of tension, transition, or an otherworldly atmosphere.

Example 3 — Music · upbeat electronic

Reference: The audio features a high-energy, upbeat piece of music. The melody is carried by a synthesizer, with a driving beat provided by electronic drums. A heavy bassline is prominent. There are sections with and without a strong beat. This recording is a song likely meant for dancing, especially as there is a synth melody.

Prediction: The audio features a fast-paced, energetic electronic dance track. A prominent, repetitive synth melody is layered over a driving beat. The overall sound is bright and upbeat. This is a clip from an electronic dance music track, likely intended for dancing or club environments. The hint confirms the presence of music.

Example 4 — Machinery · mechanical whine (TangoFlux-generated)

Reference: The audio features a continuous, high-pitched whirring sound, which is then joined by a distinct, rhythmic clanking or grinding noise. This clanking/grinding sound is accompanied by a squealing sound that rises in pitch and then quickly descends. The sounds are consistent and repetitive. This sound is indicative of heavy machinery in operation, possibly a large engine or industrial equipment, given the combination of continuous whirring and repetitive clanking/grinding sounds, along with the squealing of moving parts.

Prediction: The audio features a continuous, high-pitched mechanical whine, characteristic of a jet engine. This sound is accompanied by a distinct, rhythmic clanking or grinding noise, suggesting the movement of mechanical parts. The overall sound is loud and sustained. This soundscape is consistent with the operation of a large industrial fan or a powerful engine, possibly from a jet engine or a large vehicle. The combination of whine, grinding, and clanking suggests the movement of heavy machinery or a large mechanical system.

Intended use

  • Automatic tagging / describing of sound-effect libraries and stock-audio collections.
  • Audio-understanding component in agents that need to talk about what they hear.
  • Research baseline for open-vocabulary audio captioning on top of Whisper.

Limitations

  • The model has no speech-to-text output: the base ASR behaviour of Whisper is intentionally suppressed, and transcribing spoken content is out of scope.
  • Trained on ≤ 30 s clips; longer inputs are truncated. For long recordings, chunk them yourself and run the model per chunk.
  • Captioning style is strongly biased toward the Gemini-style "the audio features …" phrasing inherited from the training data.
  • Some mechanical / music / everyday-sound confusions remain; for difficult clips the model tends to fall back to generic genre templates ("electronic dance track", "jet engine", etc.).
  • Training data is predominantly English and primarily captures the tastes of the AudioSet + Freesound + LAION-curated distributions.
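For recordings longer than 30 s, the per-chunk approach mentioned above is straightforward; a minimal sketch (the caption_clip callable here is a hypothetical stand-in for whatever wraps model.generate from the Usage section):

```python
import numpy as np

SR = 16_000
CHUNK = 30 * SR  # 30 s at 16 kHz

def chunk_audio(wav: np.ndarray, chunk_len: int = CHUNK):
    """Split a mono 16 kHz waveform into consecutive ≤ 30 s windows."""
    return [wav[i:i + chunk_len] for i in range(0, len(wav), chunk_len)]

def caption_long(wav: np.ndarray, caption_clip):
    """Caption each window independently; `caption_clip` takes a single
    ≤ 30 s numpy array and returns its caption string."""
    return [caption_clip(chunk) for chunk in chunk_audio(wav)]
```

The chunks are captioned independently, so events spanning a chunk boundary may be described twice; overlapping windows are a simple mitigation if that matters for your use case.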

Acknowledgements

  • OpenAI Whisper for the base model.
  • LAION for curating the BUD-E-Whisper emotion dataset, the captioning-whisper proof-of-concept, the freesound / vocal-burst / in-the-wild / generated sound-event corpora, and the AI-music caption snippets used in the intermediate pre-training stage.
  • TangoFlux for generating the synthetic sound-event data that fed the intermediate pre-training step.
  • Gemini Flash 2.0 for the underlying caption generation and paraphrasing across multiple datasets.
  • mitermix/audioset-with-grounded-captions for the grounded AudioSet captions.

License

Released under Apache-2.0, following the upstream Whisper license. The audio clips in examples/ are samples from the five training datasets and retain their respective upstream licenses (see the dataset pages linked above).
