Upgrade to stage-3 checkpoint (val loss 1.4314, down from 1.4940)

Replaces the stage-2 weights with stage-3 weights. Stage 3 is 5 more epochs of cosine fine-tuning at peak LR 5e-6 (half of stage 2's 1e-5) with warmup_ratio 0.05, starting from the stage-2 final-best checkpoint.

Validation loss on the identical 500-sample held-out split improved monotonically at every eval step, from 1.4898 after 500k samples to 1.4314 after 4.5M samples (final best).

The README is updated with the new training-setup table, the stage-3 eval curve, and refreshed example predictions generated by the newly uploaded checkpoint. Example 2 was replaced (test tone -> sonar-style beeps) to better showcase stage-3 captioning quality.
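The figures quoted in this description can be sanity-checked with a short sketch. It assumes the common "linear warmup, then cosine decay to zero" schedule shape (the PR only states "cosine" with `warmup_ratio` 0.05), takes the effective batch of 80 and the stage-3 eval points from the README tables in this diff, and uses illustrative helper names (`cosine_lr`, `is_strictly_decreasing`) that do not come from the training code.

```python
import math

# Values quoted in this PR; the helper names below are illustrative only.
PEAK_LR = 5e-6             # stage-3 peak learning rate
WARMUP_RATIO = 0.05        # stage-3 warmup ratio
EFFECTIVE_BATCH = 80       # 10 per device x 8 GPUs
TOTAL_SAMPLES = 4_580_000  # "~4.58 M" total stage-3 training samples (approximate)


def cosine_lr(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at `step`: linear warmup, then cosine decay to zero.

    ASSUMPTION: the decay-to-zero shape is not stated in the PR.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


def is_strictly_decreasing(values):
    """True if every value is lower than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


# Stage-3 eval curve from the README table: (optimizer step, val loss).
STAGE3_CURVE = [
    (6_250, 1.4898), (12_500, 1.4742), (18_750, 1.4620),
    (25_000, 1.4512), (31_250, 1.4423), (37_500, 1.4374),
    (43_750, 1.4336), (50_000, 1.4317), (56_250, 1.4314),
]

if __name__ == "__main__":
    total_steps = TOTAL_SAMPLES // EFFECTIVE_BATCH
    print("approx. total steps:", total_steps)
    for step, loss in STAGE3_CURVE:
        samples = step * EFFECTIVE_BATCH
        print(f"step {step:>6}  samples {samples:>9,}  "
              f"lr {cosine_lr(step, total_steps):.2e}  val loss {loss:.4f}")
    losses = [loss for _, loss in STAGE3_CURVE]
    print("monotonic:", is_strictly_decreasing(losses))
```

Multiplying each step by the effective batch reproduces the "samples seen" column of the stage-3 table, and the strict-decrease check corresponds to the "improved monotonically" claim.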

Files changed (5)

.gitattributes +1 -0
README.md +84 -53
examples/example_02_sonar_beeps.mp3 +3 -0
examples/example_02_test_tone.mp3 +0 -0
model.safetensors +1 -1

.gitattributes CHANGED

@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 examples/example_01_vocal_burst_groan.mp3 filter=lfs diff=lfs merge=lfs -text
 examples/example_03_upbeat_electronic_music.mp3 filter=lfs diff=lfs merge=lfs -text
+examples/example_02_sonar_beeps.mp3 filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -30,13 +30,14 @@ on top of OpenAI's Whisper-Small.
 - **Input**: mono audio, 16 kHz, up to 30 s
 - **Output**: free-form English caption describing the sound
 - **Best validation loss** (held-out val set, ~500 samples across 5 source datasets):
-  **1.494** — down from 1.689 after stage 1 and from ~2.0 before any sound-effect
-  training
-- **Training compute**: 8 × H100 · DDP via `torchrun` · ~4 h wall-clock end-to-end
+  **1.431** — down from 1.494 after stage 2, 1.689 after stage 1, and from ~2.0
+  before any sound-effect training
+- **Training compute**: 8 × H100 · DDP via `torchrun` · ~9 h wall-clock end-to-end
+  across three fine-tuning stages
 ## Model genealogy
-This checkpoint is the end of a four-step lineage; every stage starts from the
+This checkpoint is the end of a multi-step lineage; every stage starts from the
 previous one.
 1. **OpenAI Whisper-Small** – ASR pre-training on 680 k hours of labelled speech.
@@ -60,10 +61,13 @@ previous one.
    This stage teaches the decoder the "sound-effect paragraph" writing style
    and expands its vocabulary from speech events to environmental, mechanical,
    musical and abstract audio events.
-4. **This checkpoint – Sound-Effect Captioning Whisper** – two additional
+4. **This checkpoint – Sound-Effect Captioning Whisper** – three additional
    fine-tuning stages on a fresh local mix of *five* public sound-effect /
    vocal-burst datasets (details below). Stage 1 = 1 epoch @ `5e-6` linear,
-   Stage 2 = 3 epochs @ `1e-5` cosine.
+   Stage 2 = 3 epochs @ `1e-5` cosine (warmup 2 %), Stage 3 = 5 more epochs @
+   `5e-6` cosine (warmup 5 %). Each stage starts from the best checkpoint of
+   the previous one, and the validation split is held fixed across all three
+   so val-loss numbers are directly comparable.
 ## Training data (this model)
@@ -84,23 +88,41 @@ split (~500 total), leaving ~915.6 k for training.
 ## Training setup
-| Knob | Stage 1 | Stage 2 |
-|------|---------|---------|
-| Init from                       | stage-3 (`checkpoint-35000` of `captioning-whisper-proof_of_concept`) | stage-1 `final-best` |
-| Epochs                          | 1                                    | 3                                   |
-| Peak LR                         | 5e-6                                 | 1e-5                                |
-| LR schedule                     | linear                               | cosine                              |
-| Warmup ratio                    | 3 %                                  | 2 %                                 |
-| Per-device batch                | 10                                   | 10                                  |
-| GPUs                            | 8                                    | 8                                   |
-| Effective batch                 | 80                                   | 80                                  |
-| Precision                       | fp16                                 | fp16                                |
-| Eval / save cadence             | ~250 k samples                       | ~500 k samples                      |
-| Total training samples          | ~915 k                               | ~2.75 M                             |
-| Best val loss                   | **1.6894** @ step 9 375              | **1.4939** @ step 31 250            |
-| Wall clock                      | ~35 min                              | ~3 h                                |
-Eval curve for stage 2 (lower is better):
+| Knob | Stage 1 | Stage 2 | Stage 3 |
+|------|---------|---------|---------|
+| Init from                       | `captioning-whisper-proof_of_concept` `checkpoint-35000` | stage-1 `final-best` | stage-2 `final-best` |
+| Epochs                          | 1                                    | 3                                   | 5                                   |
+| Peak LR                         | 5e-6                                 | 1e-5                                | 5e-6                                |
+| LR schedule                     | linear                               | cosine                              | cosine                              |
+| Warmup ratio                    | 3 %                                  | 2 %                                 | 5 %                                 |
+| Per-device batch                | 10                                   | 10                                  | 10                                  |
+| GPUs                            | 8                                    | 8                                   | 8                                   |
+| Effective batch                 | 80                                   | 80                                  | 80                                  |
+| Precision                       | fp16                                 | fp16                                | fp16                                |
+| Eval / save cadence             | ~250 k samples                       | ~500 k samples                      | ~500 k samples                      |
+| Total training samples          | ~915 k                               | ~2.75 M                             | ~4.58 M                             |
+| Best val loss                   | **1.6894** @ step 9 375              | **1.4939** @ step 31 250            | **1.4314** @ step 56 250            |
+| Wall clock                      | ~35 min                              | ~3 h                                | ~4 h 47 min                         |
+The stage-3 loss descended monotonically at every eval step — it still had not
+plateaued at the end of epoch 5, so a further continuation run could likely
+shave off more.
+Stage-3 eval curve (lower is better):
+| step | samples seen | val loss |
+|-----:|-------------:|---------:|
+|  6 250 |   500 000 | 1.4898 |
+| 12 500 | 1 000 000 | 1.4742 |
+| 18 750 | 1 500 000 | 1.4620 |
+| 25 000 | 2 000 000 | 1.4512 |
+| 31 250 | 2 500 000 | 1.4423 |
+| 37 500 | 3 000 000 | 1.4374 |
+| 43 750 | 3 500 000 | 1.4336 |
+| 50 000 | 4 000 000 | 1.4317 |
+| 56 250 | 4 500 000 | **1.4314** |
+For reference, the earlier stage-2 run's eval curve on the *same* held-out set:
 | step | samples seen | val loss |
 |-----:|-------------:|---------:|
@@ -108,11 +130,11 @@ Eval curve for stage 2 (lower is better):
 | 12 500 | 1 000 000 | 1.5464 |
 | 18 750 | 1 500 000 | 1.5144 |
 | 25 000 | 2 000 000 | 1.4993 |
-| 31 250 | 2 500 000 | **1.4940** |
-The final checkpoint uploaded here (`model.safetensors`) is the best weights
-from step 31 250, restored via `load_best_model_at_end=True` at the end of
-stage-2 training.
+| 31 250 | 2 500 000 | 1.4940 |
+The final checkpoint uploaded here (`model.safetensors`) is the stage-3 best
+weights from step 56 250, restored via `load_best_model_at_end=True` at the end
+of stage-3 training.
 ## Usage
@@ -155,7 +177,7 @@ Notes:
 All four clips below are drawn from the fixed **held-out validation split**
 (these exact files were never seen during training). They were captioned with
-this released checkpoint.
+this released checkpoint (stage-3 `final-best`, step 56 250, val loss 1.4314).
 ### Example 1 — Vocal burst · groan
@@ -170,25 +192,30 @@ this released checkpoint.
 > speech-related.
 >
 > **Prediction**: The audio features a low, guttural vocalization, possibly a
-> groan or a moan, with a somewhat strained quality. The sound is drawn out
-> and seems to convey a sense of discomfort or perhaps fatigue. The recording
-> is clear, with no noticeable background noise or distortion. The
-> vocalization is the primary focus of the audio.
-### Example 2 — Sound effect · test tone
+> groan or a moan. The sound is drawn out and has a somewhat strained
+> quality, suggesting discomfort or effort. The recording is clear, with no
+> noticeable background noise. The vocalization is the primary focus of the
+> audio. The speaker sounds like an adult male.
+### Example 2 — Sound effect · sonar-style beeps
 <audio controls preload="none"
-       src="https://huggingface.co/laion/sound-effect-captioning-whisper/resolve/main/examples/example_02_test_tone.mp3">
+       src="https://huggingface.co/laion/sound-effect-captioning-whisper/resolve/main/examples/example_02_sonar_beeps.mp3">
 </audio>
-> *Reference*: Test tone. The sound is a steady, high-frequency test tone,
-> likely used for calibration or signaling.
+> *Reference*: Repetitive, high-pitched electronic beeps with a reverberant
+> quality, resembling sonar or a digital alert. This sound could indicate a
+> warning, a scanning process (like sonar), or a futuristic interface sound
+> in a sci-fi context.
 >
-> **Prediction**: The audio features a single, distinct, high-pitched
-> electronic tone. The tone is pure and sustained, with a clear, pure
-> quality. This is likely a synthesized sound, possibly a test tone, a simple
-> electronic signal, or a sound effect used to indicate a specific event or
-> transition.
+> **Prediction**: The audio features a distinct, high-pitched, and sustained
+> electronic tone that rapidly sweeps upwards in frequency, creating a
+> sweeping effect. This is followed by a series of shorter, distinct
+> electronic beeps or pulses, which then transition into a continuous,
+> high-pitched, and somewhat distorted electronic tone. This sounds like a
+> synthesized sound effect, possibly used in electronic music, video games,
+> or as a sound design element to create a sense of tension, transition, or
+> an otherworldly atmosphere.
 ### Example 3 — Music · upbeat electronic
@@ -204,10 +231,9 @@ this released checkpoint.
 >
 > **Prediction**: The audio features a fast-paced, energetic electronic dance
 > track. A prominent, repetitive synth melody is layered over a driving
-> beat. The overall sound is bright and danceable. This is a clip from an
+> beat. The overall sound is bright and upbeat. This is a clip from an
 > electronic dance music track, likely intended for dancing or club
-> environments. The fast tempo and repetitive melody are typical of the
-> genre, confirming the hint of music.
+> environments. The hint confirms the presence of music.
 ### Example 4 — Machinery · mechanical whine (TangoFlux-generated)
@@ -217,16 +243,21 @@ this released checkpoint.
 > *Reference*: The audio features a continuous, high-pitched whirring sound,
 > which is then joined by a distinct, rhythmic clanking or grinding noise.
-> […] This sound is indicative of heavy machinery in operation, possibly a
-> large engine or industrial equipment.
+> This clanking/grinding sound is accompanied by a squealing sound that
+> rises in pitch and then quickly descends. The sounds are consistent and
+> repetitive. This sound is indicative of heavy machinery in operation,
+> possibly a large engine or industrial equipment, given the combination of
+> continuous whirring and repetitive clanking/grinding sounds, along with
+> the squealing of moving parts.
 >
-> **Prediction**: The audio features a loud, high-pitched, and continuous
-> mechanical whine, characteristic of a jet engine. This sound is accompanied
-> by a distinct, lower-frequency rumbling noise, suggesting the operation of
-> a powerful engine. The whine is prominent and consistent, while the rumble
-> is steady and deep. This soundscape is indicative of a large aircraft
-> engine, likely a jet engine, operating at high power, possibly during
-> takeoff or during a test run.
+> **Prediction**: The audio features a continuous, high-pitched mechanical
+> whine, characteristic of a jet engine. This sound is accompanied by a
+> distinct, rhythmic clanking or grinding noise, suggesting the movement of
+> mechanical parts. The overall sound is loud and sustained. This soundscape
+> is consistent with the operation of a large industrial fan or a powerful
+> engine, possibly from a jet engine or a large vehicle. The combination of
+> whine, grinding, and clanking suggests the movement of heavy machinery or
+> a large mechanical system.
 ## Intended use

examples/example_02_sonar_beeps.mp3 ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8513897fb20d90e148a77216e1a4bb97b3e8ac3b1e7452bb770716e6c97e01a2
+size 139581
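
A Git LFS pointer like the one above records only the SHA-256 digest and byte size of the real file, so a downloaded copy can be checked against it. The following is a minimal sketch of that check; the function names are hypothetical, and the demo hashes an in-memory stand-in rather than the actual MP3.

```python
import hashlib
import io

# Pointer contents copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:8513897fb20d90e148a77216e1a4bb97b3e8ac3b1e7452bb770716e6c97e01a2
size 139581
"""


def parse_lfs_pointer(text):
    """Split a pointer file into its hash algorithm, hex digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def digest_stream(stream, algo="sha256", chunk_size=1 << 20):
    """Hash a binary stream in chunks; returns (hex digest, byte count)."""
    hasher = hashlib.new(algo)
    size = 0
    while chunk := stream.read(chunk_size):
        hasher.update(chunk)
        size += len(chunk)
    return hasher.hexdigest(), size


def matches_pointer(stream, pointer):
    """True if the stream's digest and size equal the pointer's."""
    digest, size = digest_stream(stream, pointer["algo"])
    return digest == pointer["digest"] and size == pointer["size"]


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER_TEXT)
    print(pointer["algo"], pointer["size"])
    # Stand-in bytes, so this is expected to report a mismatch.
    print(matches_pointer(io.BytesIO(b"not the real file"), pointer))
```

For a real download, the stream would come from `open(path, "rb")`. Reading in chunks keeps memory use flat, which matters more for `model.safetensors` (966995080 bytes) than for the example clip.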

examples/example_02_test_tone.mp3 DELETED

Binary file (2.21 kB)

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8e498260695c46e85e6c7a3ccfb859dcd9b009c2bee534e1068071c0af69d6f5
+oid sha256:9e15eab5eda6c51c8bc6fb60208a0b4ab3e74b25cdf200a4ccb773b093921da5
 size 966995080