ChristophSchuhmann committed on
Commit bb36f9c · verified · 1 Parent(s): 8440d36

Upgrade to stage-3 checkpoint (val loss 1.4314, down from 1.4940)

Replaces stage-2 weights with stage-3 weights. Stage 3 is 5 more epochs of cosine fine-tuning at peak LR 5e-6 (half of stage-2) with warmup_ratio 0.05, starting from the stage-2 final-best checkpoint. Evaluation on the identical 500-sample held-out split improved monotonically at every eval step, from 1.4898 after 500k samples to 1.4314 after 4.5M samples (final best). README updated with the new training-setup table, stage-3 eval curve, and refreshed example predictions generated by the newly uploaded checkpoint. Example 2 was replaced (test tone -> sonar-style beeps) to better showcase stage-3 captioning quality.
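
The stage-3 schedule described above can be sketched as a linear warmup followed by a half-cosine decay, the same shape as Hugging Face's `get_cosine_schedule_with_warmup`. This is a minimal illustration, not the training script itself: the function names are hypothetical, the total step count is an assumption derived from the README figures (~915.6 k training samples × 5 epochs at effective batch 80), and rounding the warmup length up with `ceil` is also an assumption.

```python
import math

PEAK_LR = 5e-6          # stage-3 peak learning rate (from the commit message)
WARMUP_RATIO = 0.05     # stage-3 warmup_ratio (from the commit message)
EFFECTIVE_BATCH = 80    # 8 GPUs x per-device batch 10 (from the README table)
TRAIN_SAMPLES = 915_600  # approximate training-set size (README: ~915.6 k)
EPOCHS = 5


def total_steps(train_samples=TRAIN_SAMPLES, epochs=EPOCHS,
                effective_batch=EFFECTIVE_BATCH):
    """Approximate optimizer steps for the whole stage (assumed, not logged)."""
    return (train_samples * epochs) // effective_batch


def cosine_lr(step, total, peak=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at `step`: linear warmup to `peak`, then half-cosine to 0."""
    warmup = math.ceil(total * warmup_ratio)  # assumed rounding rule
    if step < warmup:
        return peak * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


def samples_seen(step, effective_batch=EFFECTIVE_BATCH):
    """Map an optimizer step to the number of training samples consumed."""
    return step * effective_batch
```

Under these assumptions, `samples_seen(6_250)` gives 500 000 and `samples_seen(56_250)` gives 4 500 000, matching the first and last eval points quoted in the commit message, and the learning rate at step 56 250 is already close to zero, which is consistent with the small final improvement.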

.gitattributes CHANGED
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  examples/example_01_vocal_burst_groan.mp3 filter=lfs diff=lfs merge=lfs -text
  examples/example_03_upbeat_electronic_music.mp3 filter=lfs diff=lfs merge=lfs -text
+ examples/example_02_sonar_beeps.mp3 filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -30,13 +30,14 @@ on top of OpenAI's Whisper-Small.
30
  - **Input**: mono audio, 16 kHz, up to 30 s
31
  - **Output**: free-form English caption describing the sound
32
  - **Best validation loss** (held-out val set, ~500 samples across 5 source datasets):
33
- **1.494** — down from 1.689 after stage 1 and from ~2.0 before any sound-effect
34
- training
35
- - **Training compute**: 8 × H100 · DDP via `torchrun` · ~4 h wall-clock end-to-end
 
36
 
37
  ## Model genealogy
38
 
39
- This checkpoint is the end of a four-step lineage; every stage starts from the
40
  previous one.
41
 
42
  1. **OpenAI Whisper-Small** – ASR pre-training on 680 k hours of labelled speech.
@@ -60,10 +61,13 @@ previous one.
60
  This stage teaches the decoder the "sound-effect paragraph" writing style
61
  and expands its vocabulary from speech events to environmental, mechanical,
62
  musical and abstract audio events.
63
- 4. **This checkpoint – Sound-Effect Captioning Whisper** – two additional
64
  fine-tuning stages on a fresh local mix of *five* public sound-effect /
65
  vocal-burst datasets (details below). Stage 1 = 1 epoch @ `5e-6` linear,
66
- Stage 2 = 3 epochs @ `1e-5` cosine.
 
 
 
67
 
68
  ## Training data (this model)
69
 
@@ -84,23 +88,41 @@ split (~500 total), leaving ~915.6 k for training.
84
 
85
  ## Training setup
86
 
87
- | Knob | Stage 1 | Stage 2 |
88
- |------|---------|---------|
89
- | Init from | stage-3 (`checkpoint-35000` of `captioning-whisper-proof_of_concept`) | stage-1 `final-best` |
90
- | Epochs | 1 | 3 |
91
- | Peak LR | 5e-6 | 1e-5 |
92
- | LR schedule | linear | cosine |
93
- | Warmup ratio | 3 % | 2 % |
94
- | Per-device batch | 10 | 10 |
95
- | GPUs | 8 | 8 |
96
- | Effective batch | 80 | 80 |
97
- | Precision | fp16 | fp16 |
98
- | Eval / save cadence | ~250 k samples | ~500 k samples |
99
- | Total training samples | ~915 k | ~2.75 M |
100
- | Best val loss | **1.6894** @ step 9 375 | **1.4939** @ step 31 250 |
101
- | Wall clock | ~35 min | ~3 h |
102
-
103
- Eval curve for stage 2 (lower is better):
 
104
 
105
  | step | samples seen | val loss |
106
  |-----:|-------------:|---------:|
@@ -108,11 +130,11 @@ Eval curve for stage 2 (lower is better):
108
  | 12 500 | 1 000 000 | 1.5464 |
109
  | 18 750 | 1 500 000 | 1.5144 |
110
  | 25 000 | 2 000 000 | 1.4993 |
111
- | 31 250 | 2 500 000 | **1.4940** |
112
 
113
- The final checkpoint uploaded here (`model.safetensors`) is the best weights
114
- from step 31 250, restored via `load_best_model_at_end=True` at the end of
115
- stage-2 training.
116
 
117
  ## Usage
118
 
@@ -155,7 +177,7 @@ Notes:
155
 
156
  All four clips below are drawn from the fixed **held-out validation split**
157
  (these exact files were never seen during training). They were captioned with
158
- this released checkpoint.
159
 
160
  ### Example 1 — Vocal burst · groan
161
 
@@ -170,25 +192,30 @@ this released checkpoint.
170
  > speech-related.
171
  >
172
  > **Prediction**: The audio features a low, guttural vocalization, possibly a
173
- > groan or a moan, with a somewhat strained quality. The sound is drawn out
174
- > and seems to convey a sense of discomfort or perhaps fatigue. The recording
175
- > is clear, with no noticeable background noise or distortion. The
176
- > vocalization is the primary focus of the audio.
177
 
178
- ### Example 2 — Sound effect · test tone
179
 
180
  <audio controls preload="none"
181
- src="https://huggingface.co/laion/sound-effect-captioning-whisper/resolve/main/examples/example_02_test_tone.mp3">
182
  </audio>
183
 
184
- > *Reference*: Test tone. The sound is a steady, high-frequency test tone,
185
- > likely used for calibration or signaling.
 
 
186
  >
187
- > **Prediction**: The audio features a single, distinct, high-pitched
188
- > electronic tone. The tone is pure and sustained, with a clear, pure
189
- > quality. This is likely a synthesized sound, possibly a test tone, a simple
190
- > electronic signal, or a sound effect used to indicate a specific event or
191
- > transition.
 
 
 
192
 
193
  ### Example 3 — Music · upbeat electronic
194
 
@@ -204,10 +231,9 @@ this released checkpoint.
204
  >
205
  > **Prediction**: The audio features a fast-paced, energetic electronic dance
206
  > track. A prominent, repetitive synth melody is layered over a driving
207
- > beat. The overall sound is bright and danceable. This is a clip from an
208
  > electronic dance music track, likely intended for dancing or club
209
- > environments. The fast tempo and repetitive melody are typical of the
210
- > genre, confirming the hint of music.
211
 
212
  ### Example 4 — Machinery · mechanical whine (TangoFlux-generated)
213
 
@@ -217,16 +243,21 @@ this released checkpoint.
217
 
218
  > *Reference*: The audio features a continuous, high-pitched whirring sound,
219
  > which is then joined by a distinct, rhythmic clanking or grinding noise.
220
- > […] This sound is indicative of heavy machinery in operation, possibly a
221
- > large engine or industrial equipment.
 
 
 
 
222
  >
223
- > **Prediction**: The audio features a loud, high-pitched, and continuous
224
- > mechanical whine, characteristic of a jet engine. This sound is accompanied
225
- > by a distinct, lower-frequency rumbling noise, suggesting the operation of
226
- > a powerful engine. The whine is prominent and consistent, while the rumble
227
- > is steady and deep. This soundscape is indicative of a large aircraft
228
- > engine, likely a jet engine, operating at high power, possibly during
229
- > takeoff or during a test run.
 
230
 
231
  ## Intended use
232
 
 
30
  - **Input**: mono audio, 16 kHz, up to 30 s
31
  - **Output**: free-form English caption describing the sound
32
  - **Best validation loss** (held-out val set, ~500 samples across 5 source datasets):
33
+ **1.431** — down from 1.494 after stage 2, 1.689 after stage 1, and from ~2.0
34
+ before any sound-effect training
35
+ - **Training compute**: 8 × H100 · DDP via `torchrun` · ~9 h wall-clock end-to-end
36
+ across three fine-tuning stages
37
 
38
  ## Model genealogy
39
 
40
+ This checkpoint is the end of a multi-step lineage; every stage starts from the
41
  previous one.
42
 
43
  1. **OpenAI Whisper-Small** – ASR pre-training on 680 k hours of labelled speech.
 
61
  This stage teaches the decoder the "sound-effect paragraph" writing style
62
  and expands its vocabulary from speech events to environmental, mechanical,
63
  musical and abstract audio events.
64
+ 4. **This checkpoint – Sound-Effect Captioning Whisper** – three additional
65
  fine-tuning stages on a fresh local mix of *five* public sound-effect /
66
  vocal-burst datasets (details below). Stage 1 = 1 epoch @ `5e-6` linear,
67
+ Stage 2 = 3 epochs @ `1e-5` cosine (warmup 2 %), Stage 3 = 5 more epochs @
68
+ `5e-6` cosine (warmup 5 %). Each stage starts from the best checkpoint of
69
+ the previous one, and the validation split is held fixed across all three
70
+ so val-loss numbers are directly comparable.
71
 
72
  ## Training data (this model)
73
 
 
88
 
89
  ## Training setup
90
 
91
+ | Knob | Stage 1 | Stage 2 | Stage 3 |
92
+ |------|---------|---------|---------|
93
+ | Init from | `captioning-whisper-proof_of_concept` `checkpoint-35000` | stage-1 `final-best` | stage-2 `final-best` |
94
+ | Epochs | 1 | 3 | 5 |
95
+ | Peak LR | 5e-6 | 1e-5 | 5e-6 |
96
+ | LR schedule | linear | cosine | cosine |
97
+ | Warmup ratio | 3 % | 2 % | 5 % |
98
+ | Per-device batch | 10 | 10 | 10 |
99
+ | GPUs | 8 | 8 | 8 |
100
+ | Effective batch | 80 | 80 | 80 |
101
+ | Precision | fp16 | fp16 | fp16 |
102
+ | Eval / save cadence | ~250 k samples | ~500 k samples | ~500 k samples |
103
+ | Total training samples | ~915 k | ~2.75 M | ~4.58 M |
104
+ | Best val loss | **1.6894** @ step 9 375 | **1.4939** @ step 31 250 | **1.4314** @ step 56 250 |
105
+ | Wall clock | ~35 min | ~3 h | ~4 h 47 min |
106
+
107
+ The stage-3 loss descended monotonically at every eval step — it still had not
108
+ plateaued at the end of epoch 5, so a further continuation run could likely
109
+ shave off more.
110
+
111
+ Stage-3 eval curve (lower is better):
112
+
113
+ | step | samples seen | val loss |
114
+ |-----:|-------------:|---------:|
115
+ | 6 250 | 500 000 | 1.4898 |
116
+ | 12 500 | 1 000 000 | 1.4742 |
117
+ | 18 750 | 1 500 000 | 1.4620 |
118
+ | 25 000 | 2 000 000 | 1.4512 |
119
+ | 31 250 | 2 500 000 | 1.4423 |
120
+ | 37 500 | 3 000 000 | 1.4374 |
121
+ | 43 750 | 3 500 000 | 1.4336 |
122
+ | 50 000 | 4 000 000 | 1.4317 |
123
+ | 56 250 | 4 500 000 | **1.4314** |
124
+
125
+ For reference, the earlier stage-2 run's eval curve on the *same* held-out set:
126
 
127
  | step | samples seen | val loss |
128
  |-----:|-------------:|---------:|
 
130
  | 12 500 | 1 000 000 | 1.5464 |
131
  | 18 750 | 1 500 000 | 1.5144 |
132
  | 25 000 | 2 000 000 | 1.4993 |
133
+ | 31 250 | 2 500 000 | 1.4940 |
134
 
135
+ The final checkpoint uploaded here (`model.safetensors`) is the stage-3 best
136
+ weights from step 56 250, restored via `load_best_model_at_end=True` at the end
137
+ of stage-3 training.
138
 
139
  ## Usage
140
 
 
177
 
178
  All four clips below are drawn from the fixed **held-out validation split**
179
  (these exact files were never seen during training). They were captioned with
180
+ this released checkpoint (stage-3 `final-best`, step 56 250, val loss 1.4314).
181
 
182
  ### Example 1 — Vocal burst · groan
183
 
 
192
  > speech-related.
193
  >
194
  > **Prediction**: The audio features a low, guttural vocalization, possibly a
195
+ > groan or a moan. The sound is drawn out and has a somewhat strained
196
+ > quality, suggesting discomfort or effort. The recording is clear, with no
197
+ > noticeable background noise. The vocalization is the primary focus of the
198
+ > audio. The speaker sounds like an adult male.
199
 
200
+ ### Example 2 — Sound effect · sonar-style beeps
201
 
202
  <audio controls preload="none"
203
+ src="https://huggingface.co/laion/sound-effect-captioning-whisper/resolve/main/examples/example_02_sonar_beeps.mp3">
204
  </audio>
205
 
206
+ > *Reference*: Repetitive, high-pitched electronic beeps with a reverberant
207
+ > quality, resembling sonar or a digital alert. This sound could indicate a
208
+ > warning, a scanning process (like sonar), or a futuristic interface sound
209
+ > in a sci-fi context.
210
  >
211
+ > **Prediction**: The audio features a distinct, high-pitched, and sustained
212
+ > electronic tone that rapidly sweeps upwards in frequency, creating a
213
+ > sweeping effect. This is followed by a series of shorter, distinct
214
+ > electronic beeps or pulses, which then transition into a continuous,
215
+ > high-pitched, and somewhat distorted electronic tone. This sounds like a
216
+ > synthesized sound effect, possibly used in electronic music, video games,
217
+ > or as a sound design element to create a sense of tension, transition, or
218
+ > an otherworldly atmosphere.
219
 
220
  ### Example 3 — Music · upbeat electronic
221
 
 
231
  >
232
  > **Prediction**: The audio features a fast-paced, energetic electronic dance
233
  > track. A prominent, repetitive synth melody is layered over a driving
234
+ > beat. The overall sound is bright and upbeat. This is a clip from an
235
  > electronic dance music track, likely intended for dancing or club
236
+ > environments. The hint confirms the presence of music.
 
237
 
238
  ### Example 4 — Machinery · mechanical whine (TangoFlux-generated)
239
 
 
243
 
244
  > *Reference*: The audio features a continuous, high-pitched whirring sound,
245
  > which is then joined by a distinct, rhythmic clanking or grinding noise.
246
+ > This clanking/grinding sound is accompanied by a squealing sound that
247
+ > rises in pitch and then quickly descends. The sounds are consistent and
248
+ > repetitive. This sound is indicative of heavy machinery in operation,
249
+ > possibly a large engine or industrial equipment, given the combination of
250
+ > continuous whirring and repetitive clanking/grinding sounds, along with
251
+ > the squealing of moving parts.
252
  >
253
+ > **Prediction**: The audio features a continuous, high-pitched mechanical
254
+ > whine, characteristic of a jet engine. This sound is accompanied by a
255
+ > distinct, rhythmic clanking or grinding noise, suggesting the movement of
256
+ > mechanical parts. The overall sound is loud and sustained. This soundscape
257
+ > is consistent with the operation of a large industrial fan or a powerful
258
+ > engine, possibly from a jet engine or a large vehicle. The combination of
259
+ > whine, grinding, and clanking suggests the movement of heavy machinery or
260
+ > a large mechanical system.
261
 
262
  ## Intended use
263
 
examples/example_02_sonar_beeps.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8513897fb20d90e148a77216e1a4bb97b3e8ac3b1e7452bb770716e6c97e01a2
+ size 139581
examples/example_02_test_tone.mp3 DELETED
Binary file (2.21 kB)
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8e498260695c46e85e6c7a3ccfb859dcd9b009c2bee534e1068071c0af69d6f5
+ oid sha256:9e15eab5eda6c51c8bc6fb60208a0b4ab3e74b25cdf200a4ccb773b093921da5
  size 966995080
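
A Git LFS pointer records the SHA-256 of the file contents as its `oid` plus the size in bytes, so a downloaded `model.safetensors` can be checked against the new pointer above. The following is a minimal sketch; the helper names and the local file path are hypothetical, and only the `oid` and `size` values come from this diff.

```python
import hashlib
from pathlib import Path

# Values copied from the new model.safetensors LFS pointer in this commit.
EXPECTED_OID = "9e15eab5eda6c51c8bc6fb60208a0b4ab3e74b25cdf200a4ccb773b093921da5"
EXPECTED_SIZE = 966_995_080


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so a ~1 GB checkpoint is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_lfs_pointer(path, expected_oid=EXPECTED_OID,
                        expected_size=EXPECTED_SIZE):
    """Return True if the file's size and SHA-256 both match the pointer."""
    p = Path(path)
    if p.stat().st_size != expected_size:
        return False  # cheap size check first, skips hashing on mismatch
    return sha256_of_file(p) == expected_oid
```

Calling `matches_lfs_pointer("model.safetensors")` on a local copy (path assumed) should return `True` for the stage-3 weights and `False` for the stage-2 file, since both have the same size but different hashes.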