Commit fe6fbcc by ChristophSchuhmann · verified · 1 parent: f434b31

Re-caption sfx samples with updated sound-effect-captioning-whisper

Files changed (1)
  1. README.md +18 -18
README.md CHANGED
@@ -275,7 +275,7 @@ _Source dataset: [`mitermix/audioset-with-grounded-captions`](https://huggingfac
275
 
276
  **`laion/sound-effect-captioning-whisper` — sound caption:**
277
 
278
- > The audio features a distinct, high-pitched, and somewhat distorted sound. It sounds like a squeaky toy or a similar mechanical sound. The sound is short and repetitive. The audio is a recording of a toy, as indicated by the hint. The squeaky sound suggests the toy is being manipulated, possibly with a squeaky toy or a similar device.
279
 
280
  ---
281
 
@@ -339,7 +339,7 @@ _Source dataset: [`mitermix/audioset-with-grounded-captions`](https://huggingfac
339
 
340
  | # | Label | Confidence |
341
  |---|---|---|
342
- | 1 | `Speech` | 78.0% |
343
  | 2 | `Inside, small room` | 19.2% |
344
  | 3 | `Squish` | 9.1% |
345
 
@@ -347,7 +347,7 @@ _Source dataset: [`mitermix/audioset-with-grounded-captions`](https://huggingfac
347
 
348
  **`laion/sound-effect-captioning-whisper` — sound caption:**
349
 
350
- > The audio contains speech. A male voice is speaking, but the content is unintelligible. The speech is somewhat muffled and difficult to understand. The audio contains speech, as indicated by the hint. The muffled quality suggests the speaker is either far away, or the recording quality is poor.
351
 
352
  ---
353
 
@@ -417,7 +417,7 @@ _Source dataset: [`laion/captioned-ai-music-snippets`](https://huggingface.co/da
417
 
418
  **`laion/sound-effect-captioning-whisper` — sound caption:**
419
 
420
- > A male voice, perceived as adult, speaks in a clear, measured tone. The speech is articulate and articulate, with a slightly formal timbre. The pace is moderate, and the pitch is in the mid-range. The recording is clean, with minimal background noise. This is a recording of a male speaker delivering a formal speech or narration, possibly in a professional or educational setting. The clear articulation and measured pace suggest a prepared statement or a prepared statement.
421
 
422
  ---
423
 
@@ -487,7 +487,7 @@ _Source dataset: [`TTS-AGI/majestrino-unified-detailed-captions-temporal`](https
487
 
488
  **`laion/sound-effect-captioning-whisper` — sound caption:**
489
 
490
- > A female voice, perceived as adult, is speaking in a casual, conversational tone. The speech is clear and easily understandable, with a moderate pace and a neutral emotional state. The audio contains a snippet of a conversation or monologue. The speaker's neutral tone suggests a non-emotional or informative context.
491
 
492
  ---
493
 
@@ -559,7 +559,7 @@ _Source dataset: [`TTS-AGI/majestrino-unified-detailed-captions-temporal`](https
559
 
560
  **`laion/sound-effect-captioning-whisper` — sound caption:**
561
 
562
- > The audio contains speech from a female speaker. The speech is clear and understandable, with a moderate pace and a neutral tone. The audio contains speech, as indicated by the hint. The speaker is likely delivering information or engaging in a conversation.
563
 
564
  ---
565
 
@@ -629,7 +629,7 @@ _Source dataset: [`TTS-AGI/majestrino-unified-detailed-captions-temporal`](https
629
 
630
  **`laion/sound-effect-captioning-whisper` — sound caption:**
631
 
632
- > The audio contains speech from a male speaker. The speaker is talking at a normal pace and volume. The recording quality is clear, with no noticeable background noise. The audio is a recording of a male speaker, as indicated by the hint. The neutral tone suggests a factual or informative context.
633
 
634
  ---
635
 
@@ -673,7 +673,7 @@ _Source dataset: [`laion/freesound-commercially-permissive-subset-with-captions`
673
 
674
  **`laion/sound-effect-captioning-whisper` — sound caption:**
675
 
676
- > The audio captures the sounds of a large vehicle, likely a bus or truck, including engine noise, air brakes, and a distinct whoosh sound, indicating movement. The soundscape suggests a large commercial vehicle, such as a bus or truck, in motion, possibly passing by or idling. The prominent air brake sound is a clear indicator of its operation.
677
 
678
  ---
679
 
@@ -695,7 +695,7 @@ _Source dataset: [`laion/freesound-commercially-permissive-subset-with-captions`
695
 
696
  **`laion/sound-effect-captioning-whisper` — sound caption:**
697
 
698
- > The audio features a continuous, high-pitched, and somewhat distorted electronic sound, resembling a siren or a high-frequency alarm. It has a rhythmic, almost pulsating quality, with a distinct, repetitive pattern. The sound is loud and has a somewhat abrasive texture. This sound is characteristic of an alarm or warning signal, possibly from a fire alarm, a security system, or a warning siren. The repetitive and somewhat distorted nature suggests a warning or alert.
699
 
700
  ---
701
 
@@ -717,7 +717,7 @@ _Source dataset: [`laion/freesound-commercially-permissive-subset-with-captions`
717
 
718
  **`laion/sound-effect-captioning-whisper` — sound caption:**
719
 
720
- > The audio features a single, clear, and resonant bell sound. The sound is bright and metallic, with a distinct attack and a long, decaying sustain. This is the sound of a bell being struck, likely a small handbell or a chime. It could be used as a notification sound, a signal, or a simple percussive element in music.
721
 
722
  ---
723
 
@@ -731,7 +731,7 @@ _Source dataset: [`laion/freesound-commercially-permissive-subset-with-captions`
731
 
732
  | # | Label | Confidence |
733
  |---|---|---|
734
- | 1 | `Door` | 12.8% |
735
  | 2 | `Silence` | 7.2% |
736
  | 3 | `Thunk` | 5.1% |
737
 
@@ -739,7 +739,7 @@ _Source dataset: [`laion/freesound-commercially-permissive-subset-with-captions`
739
 
740
  **`laion/sound-effect-captioning-whisper` — sound caption:**
741
 
742
- > , A sharp, percussive sound, possibly a gunshot or a gunshot, with a distinct metallic resonance.
743
 
744
  ---
745
 
@@ -761,7 +761,7 @@ _Source dataset: [`laion/freesound-commercially-permissive-subset-with-captions`
761
 
762
  **`laion/sound-effect-captioning-whisper` — sound caption:**
763
 
764
- > The audio features a distinct, high-pitched squeaking sound, characteristic of a squeaky wheel, followed by a series of softer, repetitive squeaks. These sounds are consistent and rhythmic, suggesting a mechanical source. The sounds are clear and appear to be close to the recording device. The sounds strongly suggest the operation of a mechanical device, specifically a squeaky wheel, possibly a cart, a trolley, or a similar piece of machinery. The repetitive nature of the squeaks indicates continuous motion.
765
 
766
  ---
767
 
@@ -783,7 +783,7 @@ _Source dataset: [`laion/freesound-commercially-permissive-subset-with-captions`
783
 
784
  **`laion/sound-effect-captioning-whisper` — sound caption:**
785
 
786
- > The audio features the distinct sound of a squeaky toy, characterized by a high-pitched, repetitive squeak. The squeak is short and sharp, occurring in quick succession. This sound is indicative of a toy being squeezed, likely a rubber or plastic toy, due to the squeaking noise. It suggests a playful or exploratory action.
787
 
788
  ---
789
 
@@ -805,7 +805,7 @@ _Source dataset: [`laion/freesound-commercially-permissive-subset-with-captions`
805
 
806
  **`laion/sound-effect-captioning-whisper` — sound caption:**
807
 
808
- > Ambient soundscape featuring a continuous, high-pitched whirring or buzzing, characteristic of a drone. The sound is consistent and suggests the operation of a drone or a similar flying insect. The continuous whirring indicates the drone is in flight, and the subtle variations in pitch and intensity suggest the insect is moving closer to and further away from the recording device.
809
 
810
  ---
811
 
@@ -827,7 +827,7 @@ _Source dataset: [`laion/freesound-commercially-permissive-subset-with-captions`
827
 
828
  **`laion/sound-effect-captioning-whisper` — sound caption:**
829
 
830
- > The audio features a variety of bird vocalizations, including chirps, calls, and possibly some high-pitched calls. The sounds are varied in their timbre and rhythm, suggesting multiple birds or a group of birds. This is a recording of birds in their natural habitat, likely a forest, garden, or park. The variety of calls and chirps suggests a diverse bird population, possibly communicating with each other or calling for each other.
831
 
832
  ---
833
 
@@ -849,7 +849,7 @@ _Source dataset: [`laion/freesound-commercially-permissive-subset-with-captions`
849
 
850
  **`laion/sound-effect-captioning-whisper` — sound caption:**
851
 
852
- > The audio features a single, distinct, high-pitched electronic beep. The sound is brief and clear, with a consistent tone and no discernible modulation. This is a simple electronic alert or notification sound, commonly used as an indicator or a simple signal.
853
 
854
  ---
855
 
@@ -871,7 +871,7 @@ _Source dataset: [`laion/freesound-commercially-permissive-subset-with-captions`
871
 
872
  **`laion/sound-effect-captioning-whisper` — sound caption:**
873
 
874
- > A low, continuous hum with a distinct, rhythmic pulsing or throbbing sound. The sound suggests the operation of a large, powerful machine or vehicle, possibly a train or heavy industrial equipment, characterized by a rhythmic pulsing and a deep, resonant hum.
875
 
876
  ---
877
 
 
275
 
276
  **`laion/sound-effect-captioning-whisper` — sound caption:**
277
 
278
+ > The audio features a distinct, high-pitched squeaking sound. The squeak is short and sharp, with a slightly metallic quality. The sound is isolated, with no other discernible background noise. The audio is a recording of a squeaky toy being manipulated. The hint confirms the presence of a squeaky toy.
279
 
280
  ---
281
 
 
339
 
340
  | # | Label | Confidence |
341
  |---|---|---|
342
+ | 1 | `Speech` | 78.1% |
343
  | 2 | `Inside, small room` | 19.2% |
344
  | 3 | `Squish` | 9.1% |
345
 
 
347
 
348
  **`laion/sound-effect-captioning-whisper` — sound caption:**
349
 
350
+ > The audio contains speech. A male voice is speaking, but the words are unintelligible. The speech is somewhat muffled and difficult to understand. The audio contains speech, as indicated by the hint. The muffled quality suggests the speaker is either far away, speaking through a barrier, or the recording was made at a distance.
351
 
352
  ---
353
 
 
417
 
418
  **`laion/sound-effect-captioning-whisper` — sound caption:**
419
 
420
+ > A male voice, perceived as adult, speaks in a clear, measured tone, delivering a narrative or monologue. The speech is articulate and articulate, with a slightly formal timbre. The pace is moderate, and the pitch is in the mid-range. The audio quality is clean, with minimal background noise. This is a recording of a spoken word performance, likely a narration, a monologue, or a formal address. The clear articulation and measured pace suggest a professional or educational context, possibly for an audiobook, documentary, or a documentary.
421
 
422
  ---
423
 
 
487
 
488
  **`laion/sound-effect-captioning-whisper` — sound caption:**
489
 
490
+ > A female voice speaks in a frustrated and exasperated tone. The speaker is expressing negative feelings, using a harsh and somewhat sarcastic tone. The audio quality is clear.
491
 
492
  ---
493
 
 
559
 
560
  **`laion/sound-effect-captioning-whisper` — sound caption:**
561
 
562
+ > The audio contains speech from a female speaker. The speech is clear and understandable, with a moderate pace and a neutral tone. The audio quality is good, with no noticeable background noise. The audio contains speech, as indicated by the hint. The speaker's voice is clear and understandable.
563
 
564
  ---
565
 
 
629
 
630
  **`laion/sound-effect-captioning-whisper` — sound caption:**
631
 
632
+ > The audio contains a male voice speaking. The speech is clear and articulate, with a moderate pace and a neutral tone. The recording quality is good, with minimal background noise. The audio is a recording of a male speaker, likely delivering information or engaging in a conversation. The hint confirms the presence of speech.
633
 
634
  ---
635
 
 
673
 
674
  **`laion/sound-effect-captioning-whisper` — sound caption:**
675
 
676
+ > The audio captures the sounds of a large vehicle, likely a bus or truck, including engine noise, air brakes, and the distinct hiss of air brakes. The soundscape suggests an urban or industrial environment, possibly a bus stop or a large commercial vehicle, with the characteristic sounds of its air brakes and the hiss of air brakes.
677
 
678
  ---
679
 
 
695
 
696
  **`laion/sound-effect-captioning-whisper` — sound caption:**
697
 
698
+ > The audio features a continuous, high-pitched whirring sound, characteristic of a vacuum cleaner. The sound is consistent and sustained, indicating the operation of a motorized device. There are no other distinct sounds present. This is the sound of a vacuum cleaner in operation. The continuous nature of the sound suggests it is running steadily, likely for cleaning purposes.
699
 
700
  ---
701
 
 
717
 
718
  **`laion/sound-effect-captioning-whisper` — sound caption:**
719
 
720
+ > The audio features a high-pitched, sustained electronic tone that gradually fades out. The sound is pure and consistent in its frequency and amplitude, without any discernible modulation or additional elements. This sound is characteristic of a digital alert, a test tone, or a simple electronic signal. It could be used as a simple notification, a system sound, or a component of a larger electronic device.
721
 
722
  ---
723
 
 
731
 
732
  | # | Label | Confidence |
733
  |---|---|---|
734
+ | 1 | `Door` | 12.7% |
735
  | 2 | `Silence` | 7.2% |
736
  | 3 | `Thunk` | 5.1% |
737
 
 
739
 
740
  **`laion/sound-effect-captioning-whisper` — sound caption:**
741
 
742
+ > A whoosh sound followed by a metallic clang. This sound suggests a rapid movement of air or an object, immediately followed by a metallic impact, possibly from a projectile hitting metal or a heavy object falling onto a metal surface.
743
 
744
  ---
745
 
 
761
 
762
  **`laion/sound-effect-captioning-whisper` — sound caption:**
763
 
764
+ > The audio begins with a distinct mechanical whirring sound, followed by a series of rapid, high-pitched clicks or clacks, and then a final, softer mechanical thud. This sequence repeats multiple times. The sounds suggest the operation of a mechanical device, possibly a printer or a similar office machine, where internal components are moving, engaging, and then settling into place.
765
 
766
  ---
767
 
 
783
 
784
  **`laion/sound-effect-captioning-whisper` — sound caption:**
785
 
786
+ > The audio features the distinct sound of a squeaky wheel, accompanied by the rustling of fabric. The squeaky wheel sound is prominent, suggesting movement over a surface. The rustling could be from clothing or paper, and the squeaking might be from a door or a chair.
787
 
788
  ---
789
 
 
805
 
806
  **`laion/sound-effect-captioning-whisper` — sound caption:**
807
 
808
+ > The audio captures the distinct sound of a large vehicle, likely a truck, in operation, characterized by its engine noise and the sound of air brakes. The sound suggests the presence of heavy machinery or a large vehicle, possibly in an industrial or transportation context, indicating movement or a busy environment.
809
 
810
  ---
811
 
 
827
 
828
  **`laion/sound-effect-captioning-whisper` — sound caption:**
829
 
830
+ > The audio features a variety of bird vocalizations, including chirps, calls, and possibly some squawks. The sounds are varied in pitch and rhythm, suggesting multiple birds or a single bird. This is a recording of birds in their natural environment, likely a garden, park, or forest, where birds are actively communicating. The variety and variety of calls suggest a diverse bird population.
831
 
832
  ---
833
 
 
849
 
850
  **`laion/sound-effect-captioning-whisper` — sound caption:**
851
 
852
+ > The audio features a single, distinct, high-pitched electronic beep. The beep is short and sharp, with a clear, electronic timbre. This sound is characteristic of an electronic alert or notification, possibly from a digital device, a timer, or a simple electronic gadget.
853
 
854
  ---
855
 
 
871
 
872
  **`laion/sound-effect-captioning-whisper` — sound caption:**
873
 
874
+ > A vehicle passing by, with engine noise and tire sounds, and a distinct whoosh. The audio captures the sound of a vehicle, likely a car or truck, passing by. The engine noise is prominent, indicating it is moving at a moderate speed. The sound includes the distinct whoosh of air as it passes, and the Doppler effect as it moves past the listener.
875
 
876
  ---
877