Upgrade to stage-3 checkpoint (val loss 1.4314, down from 1.4940)
Replaces the stage-2 weights with stage-3 weights. Stage 3 is 5 more epochs of cosine fine-tuning at peak LR 5e-6 (half of stage 2's 1e-5) with warmup_ratio 0.05, starting from the stage-2 final-best checkpoint. Validation loss on the identical 500-sample held-out split improved monotonically at every eval step, from 1.4898 after 500k samples to 1.4314 after 4.5M samples (final best). The README is updated with the new training-setup table, the stage-3 eval curve, and refreshed example predictions generated by the newly uploaded checkpoint. Example 2 was replaced (test tone -> sonar-style beeps) to better showcase stage-3 captioning quality.
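The stage-3 learning-rate schedule described above can be sketched in a few lines. This is a minimal illustration, assuming the usual "linear warmup, then half-cosine decay to zero" shape; the function name and the step count are hypothetical (the step count is derived from the README's ~4.58 M total training samples at an effective batch of 80), and the actual training script is not part of this commit.

```python
import math

PEAK_LR = 5e-6        # stage-3 peak LR, from the commit message
WARMUP_RATIO = 0.05   # stage-3 warmup_ratio, from the commit message
# Assumption: ~4.58 M training samples / effective batch 80 (approximate).
TOTAL_STEPS = 4_580_000 // 80


def cosine_lr_with_warmup(step, total_steps=TOTAL_STEPS,
                          peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Hypothetical helper: linear warmup followed by half-cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 1_000, 2_863, 30_000, TOTAL_STEPS):
        print(s, cosine_lr_with_warmup(s))
```

Under these assumptions the rate reaches its peak after roughly the first 5 % of steps and then decays smoothly, which is consistent with the small late-epoch gains in the stage-3 eval curve.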
- .gitattributes +1 -0
- README.md +84 -53
- examples/example_02_sonar_beeps.mp3 +3 -0
- examples/example_02_test_tone.mp3 +0 -0
- model.safetensors +1 -1
.gitattributes
CHANGED
|
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
examples/example_01_vocal_burst_groan.mp3 filter=lfs diff=lfs merge=lfs -text
|
| 37 |
examples/example_03_upbeat_electronic_music.mp3 filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
examples/example_01_vocal_burst_groan.mp3 filter=lfs diff=lfs merge=lfs -text
|
| 37 |
examples/example_03_upbeat_electronic_music.mp3 filter=lfs diff=lfs merge=lfs -text
|
| 38 |
+
examples/example_02_sonar_beeps.mp3 filter=lfs diff=lfs merge=lfs -text
|
README.md
CHANGED
|
@@ -30,13 +30,14 @@ on top of OpenAI's Whisper-Small.
|
|
| 30 |
- **Input**: mono audio, 16 kHz, up to 30 s
|
| 31 |
- **Output**: free-form English caption describing the sound
|
| 32 |
- **Best validation loss** (held-out val set, ~500 samples across 5 source datasets):
|
| 33 |
-
**1.
|
| 34 |
-
training
|
| 35 |
-
- **Training compute**: 8 × H100 · DDP via `torchrun` · ~
|
|
|
|
| 36 |
|
| 37 |
## Model genealogy
|
| 38 |
|
| 39 |
-
This checkpoint is the end of a
|
| 40 |
previous one.
|
| 41 |
|
| 42 |
1. **OpenAI Whisper-Small** – ASR pre-training on 680 k hours of labelled speech.
|
|
@@ -60,10 +61,13 @@ previous one.
|
|
| 60 |
This stage teaches the decoder the "sound-effect paragraph" writing style
|
| 61 |
and expands its vocabulary from speech events to environmental, mechanical,
|
| 62 |
musical and abstract audio events.
|
| 63 |
-
4. **This checkpoint – Sound-Effect Captioning Whisper** –
|
| 64 |
fine-tuning stages on a fresh local mix of *five* public sound-effect /
|
| 65 |
vocal-burst datasets (details below). Stage 1 = 1 epoch @ `5e-6` linear,
|
| 66 |
-
Stage 2 = 3 epochs @ `1e-5` cosine
|
| 67 |
|
| 68 |
## Training data (this model)
|
| 69 |
|
|
@@ -84,23 +88,41 @@ split (~500 total), leaving ~915.6 k for training.
|
|
| 84 |
|
| 85 |
## Training setup
|
| 86 |
|
| 87 |
-
| Knob | Stage 1 | Stage 2 |
|
| 88 |
-
|------|---------|---------|
|
| 89 |
-
| Init from |
|
| 90 |
-
| Epochs | 1 | 3 |
|
| 91 |
-
| Peak LR | 5e-6 | 1e-5 |
|
| 92 |
-
| LR schedule | linear | cosine |
|
| 93 |
-
| Warmup ratio | 3 % | 2 % |
|
| 94 |
-
| Per-device batch | 10 | 10 |
|
| 95 |
-
| GPUs | 8 | 8 |
|
| 96 |
-
| Effective batch | 80 | 80 |
|
| 97 |
-
| Precision | fp16 | fp16 |
|
| 98 |
-
| Eval / save cadence | ~250 k samples | ~500 k samples |
|
| 99 |
-
| Total training samples | ~915 k | ~2.75 M |
|
| 100 |
-
| Best val loss | **1.6894** @ step 9 375 | **1.4939** @ step 31 250 |
|
| 101 |
-
| Wall clock | ~35 min | ~3 h |
|
| 102 |
-
|
| 103 |
-
|
| 104 |
|
| 105 |
| step | samples seen | val loss |
|
| 106 |
|-----:|-------------:|---------:|
|
|
@@ -108,11 +130,11 @@ Eval curve for stage 2 (lower is better):
|
|
| 108 |
| 12 500 | 1 000 000 | 1.5464 |
|
| 109 |
| 18 750 | 1 500 000 | 1.5144 |
|
| 110 |
| 25 000 | 2 000 000 | 1.4993 |
|
| 111 |
-
| 31 250 | 2 500 000 |
|
| 112 |
|
| 113 |
-
The final checkpoint uploaded here (`model.safetensors`) is the
|
| 114 |
-
from step
|
| 115 |
-
stage-
|
| 116 |
|
| 117 |
## Usage
|
| 118 |
|
|
@@ -155,7 +177,7 @@ Notes:
|
|
| 155 |
|
| 156 |
All four clips below are drawn from the fixed **held-out validation split**
|
| 157 |
(these exact files were never seen during training). They were captioned with
|
| 158 |
-
this released checkpoint.
|
| 159 |
|
| 160 |
### Example 1 — Vocal burst · groan
|
| 161 |
|
|
@@ -170,25 +192,30 @@ this released checkpoint.
|
|
| 170 |
> speech-related.
|
| 171 |
>
|
| 172 |
> **Prediction**: The audio features a low, guttural vocalization, possibly a
|
| 173 |
-
> groan or a moan
|
| 174 |
-
>
|
| 175 |
-
>
|
| 176 |
-
>
|
| 177 |
|
| 178 |
-
### Example 2 — Sound effect ·
|
| 179 |
|
| 180 |
<audio controls preload="none"
|
| 181 |
-
src="https://huggingface.co/laion/sound-effect-captioning-whisper/resolve/main/examples/
|
| 182 |
</audio>
|
| 183 |
|
| 184 |
-
> *Reference*:
|
| 185 |
-
>
|
| 186 |
>
|
| 187 |
-
> **Prediction**: The audio features a
|
| 188 |
-
> electronic tone
|
| 189 |
-
>
|
| 190 |
-
> electronic
|
| 191 |
-
>
|
| 192 |
|
| 193 |
### Example 3 — Music · upbeat electronic
|
| 194 |
|
|
@@ -204,10 +231,9 @@ this released checkpoint.
|
|
| 204 |
>
|
| 205 |
> **Prediction**: The audio features a fast-paced, energetic electronic dance
|
| 206 |
> track. A prominent, repetitive synth melody is layered over a driving
|
| 207 |
-
> beat. The overall sound is bright and
|
| 208 |
> electronic dance music track, likely intended for dancing or club
|
| 209 |
-
> environments. The
|
| 210 |
-
> genre, confirming the hint of music.
|
| 211 |
|
| 212 |
### Example 4 — Machinery · mechanical whine (TangoFlux-generated)
|
| 213 |
|
|
@@ -217,16 +243,21 @@ this released checkpoint.
|
|
| 217 |
|
| 218 |
> *Reference*: The audio features a continuous, high-pitched whirring sound,
|
| 219 |
> which is then joined by a distinct, rhythmic clanking or grinding noise.
|
| 220 |
-
>
|
| 221 |
-
>
|
| 222 |
>
|
| 223 |
-
> **Prediction**: The audio features a
|
| 224 |
-
>
|
| 225 |
-
>
|
| 226 |
-
>
|
| 227 |
-
> is
|
| 228 |
-
> engine,
|
| 229 |
-
>
|
|
|
|
| 230 |
|
| 231 |
## Intended use
|
| 232 |
|
|
|
|
| 30 |
- **Input**: mono audio, 16 kHz, up to 30 s
|
| 31 |
- **Output**: free-form English caption describing the sound
|
| 32 |
- **Best validation loss** (held-out val set, ~500 samples across 5 source datasets):
|
| 33 |
+
**1.431** — down from 1.494 after stage 2, 1.689 after stage 1, and from ~2.0
|
| 34 |
+
before any sound-effect training
|
| 35 |
+
- **Training compute**: 8 × H100 · DDP via `torchrun` · ~9 h wall-clock end-to-end
|
| 36 |
+
across three fine-tuning stages
|
| 37 |
|
| 38 |
## Model genealogy
|
| 39 |
|
| 40 |
+
This checkpoint is the end of a multi-step lineage; every stage starts from the
|
| 41 |
previous one.
|
| 42 |
|
| 43 |
1. **OpenAI Whisper-Small** – ASR pre-training on 680 k hours of labelled speech.
|
|
|
|
| 61 |
This stage teaches the decoder the "sound-effect paragraph" writing style
|
| 62 |
and expands its vocabulary from speech events to environmental, mechanical,
|
| 63 |
musical and abstract audio events.
|
| 64 |
+
4. **This checkpoint – Sound-Effect Captioning Whisper** – three additional
|
| 65 |
fine-tuning stages on a fresh local mix of *five* public sound-effect /
|
| 66 |
vocal-burst datasets (details below). Stage 1 = 1 epoch @ `5e-6` linear,
|
| 67 |
+
Stage 2 = 3 epochs @ `1e-5` cosine (warmup 2 %), Stage 3 = 5 more epochs @
|
| 68 |
+
`5e-6` cosine (warmup 5 %). Each stage starts from the best checkpoint of
|
| 69 |
+
the previous one, and the validation split is held fixed across all three
|
| 70 |
+
so val-loss numbers are directly comparable.
|
| 71 |
|
| 72 |
## Training data (this model)
|
| 73 |
|
|
|
|
| 88 |
|
| 89 |
## Training setup
|
| 90 |
|
| 91 |
+
| Knob | Stage 1 | Stage 2 | Stage 3 |
|
| 92 |
+
|------|---------|---------|---------|
|
| 93 |
+
| Init from | `captioning-whisper-proof_of_concept` `checkpoint-35000` | stage-1 `final-best` | stage-2 `final-best` |
|
| 94 |
+
| Epochs | 1 | 3 | 5 |
|
| 95 |
+
| Peak LR | 5e-6 | 1e-5 | 5e-6 |
|
| 96 |
+
| LR schedule | linear | cosine | cosine |
|
| 97 |
+
| Warmup ratio | 3 % | 2 % | 5 % |
|
| 98 |
+
| Per-device batch | 10 | 10 | 10 |
|
| 99 |
+
| GPUs | 8 | 8 | 8 |
|
| 100 |
+
| Effective batch | 80 | 80 | 80 |
|
| 101 |
+
| Precision | fp16 | fp16 | fp16 |
|
| 102 |
+
| Eval / save cadence | ~250 k samples | ~500 k samples | ~500 k samples |
|
| 103 |
+
| Total training samples | ~915 k | ~2.75 M | ~4.58 M |
|
| 104 |
+
| Best val loss | **1.6894** @ step 9 375 | **1.4939** @ step 31 250 | **1.4314** @ step 56 250 |
|
| 105 |
+
| Wall clock | ~35 min | ~3 h | ~4 h 47 min |
|
| 106 |
+
|
| 107 |
+
The stage-3 loss descended monotonically at every eval step — it still had not
|
| 108 |
+
plateaued at the end of epoch 5, so a further continuation run could likely
|
| 109 |
+
shave off more.
|
| 110 |
+
|
| 111 |
+
Stage-3 eval curve (lower is better):
|
| 112 |
+
|
| 113 |
+
| step | samples seen | val loss |
|
| 114 |
+
|-----:|-------------:|---------:|
|
| 115 |
+
| 6 250 | 500 000 | 1.4898 |
|
| 116 |
+
| 12 500 | 1 000 000 | 1.4742 |
|
| 117 |
+
| 18 750 | 1 500 000 | 1.4620 |
|
| 118 |
+
| 25 000 | 2 000 000 | 1.4512 |
|
| 119 |
+
| 31 250 | 2 500 000 | 1.4423 |
|
| 120 |
+
| 37 500 | 3 000 000 | 1.4374 |
|
| 121 |
+
| 43 750 | 3 500 000 | 1.4336 |
|
| 122 |
+
| 50 000 | 4 000 000 | 1.4317 |
|
| 123 |
+
| 56 250 | 4 500 000 | **1.4314** |
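A quick sanity check of the stage-3 curve, written as a small sketch: it confirms that "samples seen" equals step × effective batch (10 per device × 8 GPUs = 80, per the training-setup table) and that the validation loss falls at every eval step. The variable names are illustrative; the numbers are copied from the table.

```python
EFFECTIVE_BATCH = 10 * 8  # per-device batch x GPUs, from the training-setup table

# (step, samples seen, val loss), copied from the stage-3 eval table.
stage3_curve = [
    (6_250, 500_000, 1.4898),
    (12_500, 1_000_000, 1.4742),
    (18_750, 1_500_000, 1.4620),
    (25_000, 2_000_000, 1.4512),
    (31_250, 2_500_000, 1.4423),
    (37_500, 3_000_000, 1.4374),
    (43_750, 3_500_000, 1.4336),
    (50_000, 4_000_000, 1.4317),
    (56_250, 4_500_000, 1.4314),
]

# Each row's sample count should be step * effective batch.
samples_consistent = all(
    step * EFFECTIVE_BATCH == samples for step, samples, _ in stage3_curve
)

# The loss should strictly decrease from one eval to the next.
losses = [loss for _, _, loss in stage3_curve]
is_monotonic = all(later < earlier for earlier, later in zip(losses, losses[1:]))

best_step, _, best_loss = min(stage3_curve, key=lambda row: row[2])
total_drop = round(losses[0] - losses[-1], 4)

if __name__ == "__main__":
    print(samples_consistent, is_monotonic, best_step, best_loss, total_drop)
```

The per-eval improvement shrinks from about 0.016 early on to 0.0003 between the last two evals, so the curve is still descending but flattening.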
|
| 124 |
+
|
| 125 |
+
For reference, the earlier stage-2 run's eval curve on the *same* held-out set:
|
| 126 |
|
| 127 |
| step | samples seen | val loss |
|
| 128 |
|-----:|-------------:|---------:|
|
|
|
|
| 130 |
| 12 500 | 1 000 000 | 1.5464 |
|
| 131 |
| 18 750 | 1 500 000 | 1.5144 |
|
| 132 |
| 25 000 | 2 000 000 | 1.4993 |
|
| 133 |
+
| 31 250 | 2 500 000 | 1.4940 |
|
| 134 |
|
| 135 |
+
The final checkpoint uploaded here (`model.safetensors`) is the stage-3 best
|
| 136 |
+
weights from step 56 250, restored via `load_best_model_at_end=True` at the end
|
| 137 |
+
of stage-3 training.
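For orientation, the stage-3 column of the training-setup table could map onto Hugging Face `TrainingArguments` keyword names roughly as follows. This is only a sketch: the training script is not part of this commit, so the use of `Trainer`, the `metric_for_best_model` choice, and the 6 250-step cadence (derived from ~500 k samples at effective batch 80) are assumptions. The values are kept in a plain dict so the snippet runs without `transformers` installed.

```python
# Hypothetical stage-3 settings expressed with TrainingArguments keyword names.
stage3_kwargs = {
    "num_train_epochs": 5,
    "learning_rate": 5e-6,                # peak LR
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
    "per_device_train_batch_size": 10,    # x 8 GPUs = effective batch 80
    "fp16": True,
    "eval_steps": 500_000 // 80,          # assumption: ~500 k samples per eval
    "save_steps": 500_000 // 80,          # saves aligned with evals
    "load_best_model_at_end": True,       # stated in the README
    "metric_for_best_model": "eval_loss", # assumption
    "greater_is_better": False,           # lower loss is better
}

NUM_GPUS = 8
effective_batch = stage3_kwargs["per_device_train_batch_size"] * NUM_GPUS

if __name__ == "__main__":
    print(effective_batch, stage3_kwargs["eval_steps"])
```

With `load_best_model_at_end=True`, the saved weights come from the best evaluated checkpoint rather than the last optimizer step, which here is step 56 250.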
|
| 138 |
|
| 139 |
## Usage
|
| 140 |
|
|
|
|
| 177 |
|
| 178 |
All four clips below are drawn from the fixed **held-out validation split**
|
| 179 |
(these exact files were never seen during training). They were captioned with
|
| 180 |
+
this released checkpoint (stage-3 `final-best`, step 56 250, val loss 1.4314).
|
| 181 |
|
| 182 |
### Example 1 — Vocal burst · groan
|
| 183 |
|
|
|
|
| 192 |
> speech-related.
|
| 193 |
>
|
| 194 |
> **Prediction**: The audio features a low, guttural vocalization, possibly a
|
| 195 |
+
> groan or a moan. The sound is drawn out and has a somewhat strained
|
| 196 |
+
> quality, suggesting discomfort or effort. The recording is clear, with no
|
| 197 |
+
> noticeable background noise. The vocalization is the primary focus of the
|
| 198 |
+
> audio. The speaker sounds like an adult male.
|
| 199 |
|
| 200 |
+
### Example 2 — Sound effect · sonar-style beeps
|
| 201 |
|
| 202 |
<audio controls preload="none"
|
| 203 |
+
src="https://huggingface.co/laion/sound-effect-captioning-whisper/resolve/main/examples/example_02_sonar_beeps.mp3">
|
| 204 |
</audio>
|
| 205 |
|
| 206 |
+
> *Reference*: Repetitive, high-pitched electronic beeps with a reverberant
|
| 207 |
+
> quality, resembling sonar or a digital alert. This sound could indicate a
|
| 208 |
+
> warning, a scanning process (like sonar), or a futuristic interface sound
|
| 209 |
+
> in a sci-fi context.
|
| 210 |
>
|
| 211 |
+
> **Prediction**: The audio features a distinct, high-pitched, and sustained
|
| 212 |
+
> electronic tone that rapidly sweeps upwards in frequency, creating a
|
| 213 |
+
> sweeping effect. This is followed by a series of shorter, distinct
|
| 214 |
+
> electronic beeps or pulses, which then transition into a continuous,
|
| 215 |
+
> high-pitched, and somewhat distorted electronic tone. This sounds like a
|
| 216 |
+
> synthesized sound effect, possibly used in electronic music, video games,
|
| 217 |
+
> or as a sound design element to create a sense of tension, transition, or
|
| 218 |
+
> an otherworldly atmosphere.
|
| 219 |
|
| 220 |
### Example 3 — Music · upbeat electronic
|
| 221 |
|
|
|
|
| 231 |
>
|
| 232 |
> **Prediction**: The audio features a fast-paced, energetic electronic dance
|
| 233 |
> track. A prominent, repetitive synth melody is layered over a driving
|
| 234 |
+
> beat. The overall sound is bright and upbeat. This is a clip from an
|
| 235 |
> electronic dance music track, likely intended for dancing or club
|
| 236 |
+
> environments. The hint confirms the presence of music.
|
|
|
|
| 237 |
|
| 238 |
### Example 4 — Machinery · mechanical whine (TangoFlux-generated)
|
| 239 |
|
|
|
|
| 243 |
|
| 244 |
> *Reference*: The audio features a continuous, high-pitched whirring sound,
|
| 245 |
> which is then joined by a distinct, rhythmic clanking or grinding noise.
|
| 246 |
+
> This clanking/grinding sound is accompanied by a squealing sound that
|
| 247 |
+
> rises in pitch and then quickly descends. The sounds are consistent and
|
| 248 |
+
> repetitive. This sound is indicative of heavy machinery in operation,
|
| 249 |
+
> possibly a large engine or industrial equipment, given the combination of
|
| 250 |
+
> continuous whirring and repetitive clanking/grinding sounds, along with
|
| 251 |
+
> the squealing of moving parts.
|
| 252 |
>
|
| 253 |
+
> **Prediction**: The audio features a continuous, high-pitched mechanical
|
| 254 |
+
> whine, characteristic of a jet engine. This sound is accompanied by a
|
| 255 |
+
> distinct, rhythmic clanking or grinding noise, suggesting the movement of
|
| 256 |
+
> mechanical parts. The overall sound is loud and sustained. This soundscape
|
| 257 |
+
> is consistent with the operation of a large industrial fan or a powerful
|
| 258 |
+
> engine, possibly from a jet engine or a large vehicle. The combination of
|
| 259 |
+
> whine, grinding, and clanking suggests the movement of heavy machinery or
|
| 260 |
+
> a large mechanical system.
|
| 261 |
|
| 262 |
## Intended use
|
| 263 |
|
examples/example_02_sonar_beeps.mp3
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:8513897fb20d90e148a77216e1a4bb97b3e8ac3b1e7452bb770716e6c97e01a2
|
| 3 |
+
size 139581
|
examples/example_02_test_tone.mp3
DELETED
|
Binary file (2.21 kB)
|
|
|
model.safetensors
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 966995080
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:9e15eab5eda6c51c8bc6fb60208a0b4ab3e74b25cdf200a4ccb773b093921da5
|
| 3 |
size 966995080
|
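The `model.safetensors` entry above is a Git LFS pointer (version, `oid sha256:…`, size), not the weights themselves. A minimal sketch of how a downloaded file could be checked against that pointer follows; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local file path is whatever the reader chooses.

```python
import hashlib
from pathlib import Path

# New pointer contents for model.safetensors, copied from the diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:9e15eab5eda6c51c8bc6fb60208a0b4ab3e74b25cdf200a4ccb773b093921da5
size 966995080
"""


def parse_lfs_pointer(text):
    """Split 'key value' lines of an LFS pointer into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file's size and hash agree with the pointer."""
    file_path = Path(path)
    if file_path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with file_path.open("rb") as handle:
        # Hash in chunks so a ~1 GB checkpoint is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
```

A call such as `matches_pointer("model.safetensors", pointer)` would then confirm that a local copy is the stage-3 checkpoint; note that the file size is unchanged from the stage-2 pointer, so only the hash distinguishes the two.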