Audio Classification
Transformers
English
audio
audio-captioning
audio-tagging
audioset
whisper
speech-captioning
music-captioning
sound-effect-captioning
laion
ast
audio-spectrogram-transformer
Instructions to use laion/whisper-captioning-ensemble with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use laion/whisper-captioning-ensemble with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("audio-classification", model="laion/whisper-captioning-ensemble")

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("laion/whisper-captioning-ensemble", dtype="auto")
- Notebooks
- Google Colab
- Kaggle
Refresh sample sidecars after Speech>=0.80 routing threshold
- samples/audioset/audioset__qS6y9dA1GX4_185736.json +1 -1
- samples/audioset/audioset__rQmOOSlJ74g_195927.json +2 -3
- samples/audioset/audioset__xYVzy_dh20A_184856.json +2 -3
- samples/freesound/audio_206738_390405.json +1 -1
- samples/freesound/audio_360573_395201.json +1 -1
- samples/freesound/audio_389654_399191.json +1 -1
- samples/freesound/audio_41515_397745.json +1 -1
- samples/majestrino/majestrino__03001175.json +1 -1
- samples/majestrino/majestrino__03001817.json +2 -3
- samples/majestrino/majestrino__03002030.json +2 -3
- samples/majestrino/majestrino__03002116.json +2 -3
- samples/majestrino/majestrino__03004481.json +1 -1
- samples/music/music__suno_audio_196211_4_1844520.json +2 -3
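The commit title mentions a `Speech>=0.80` routing threshold, and the sidecars in this diff carry `audioset_top3` entries (each with a `label` and a `confidence`) alongside a `route` field. A minimal sketch of how such a threshold could be applied is shown below. Only the 0.80 value and the field names come from this page; the function name `choose_route`, the `"sfx"` fallback, and the decision to look only at the top-3 list are assumptions made for illustration, and any other routes the real pipeline may have are not modelled.

```python
from typing import Dict, List

# 0.80 is taken from the commit title; everything else in this sketch is assumed.
SPEECH_THRESHOLD = 0.80


def choose_route(audioset_top3: List[Dict], threshold: float = SPEECH_THRESHOLD) -> str:
    """Return "speech" if the tagger's Speech confidence meets the threshold, else "sfx".

    Hypothetical helper: the real pipeline may use more labels or more routes.
    """
    # Take the highest confidence among entries labelled "Speech" (0.0 if none).
    speech_conf = max(
        (float(entry["confidence"]) for entry in audioset_top3 if entry.get("label") == "Speech"),
        default=0.0,
    )
    return "speech" if speech_conf >= threshold else "sfx"


# A clip tagged mostly as "Insect" falls through to the assumed "sfx" fallback.
print(choose_route([{"label": "Insect", "confidence": 0.882324}]))  # prints sfx
```

Under these assumptions, a clip that contains some speech but scores below 0.80 on the `Speech` label would no longer be sent to the speech captioner, which matches the pattern of several samples in this commit being rerouted to `"sfx"`.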
samples/audioset/audioset__qS6y9dA1GX4_185736.json
CHANGED
@@ -18,7 +18,7 @@
   "route": "speech",
   "annotations": {
     "voice_tags": "Suitable for Work, natural speaking, fluent, casual speaking style, modal voice, neutral airflow, normal loudness, slightly dynamic, precise articulation, normal speaking",
-    "bud_e_speech_caption": "A male speaker delivers a highly engaging and professional performance in standard American English. The recording boasts exceptional clarity with minimal background noise, creating a studio-quality listening experience. The speaker, likely a young adult male in his 20s or 30s, exhibits a resonant, slightly rough baritone timbre with a near-neutral-slightly-bright quality and a mild breathiness. The voice is chest-mixed, with a near-neutral-heavy weight and a slight wobble, suggesting a natural, healthy vocal production with mild wear. Articulation is precise and dynamic, contributing to a fluent and natural speaking style.\n\nInitially, the speaker conveys strong elation and moderate hope, with a hint of triumph, delivered at a moderate tempo with a mid-to-high pitch range. The tone is clear and resonant, reflecting a spontaneous and natural delivery. As the recording progresses, the emotional landscape shifts to include a strong sense of hope and enthusiasm, coupled with moderate elation and a slight feeling of triumph. The tempo
+    "bud_e_speech_caption": "A male speaker delivers a highly engaging and professional performance in standard American English. The recording boasts exceptional clarity with minimal background noise, creating a studio-quality listening experience. The speaker, likely a young adult male in his 20s or 30s, exhibits a resonant, slightly rough baritone timbre with a near-neutral-slightly-bright quality and a mild breathiness. The voice is chest-mixed, with a near-neutral-heavy weight and a slight wobble, suggesting a natural, healthy vocal production with mild wear. Articulation is precise and dynamic, contributing to a fluent and natural speaking style.\n\nInitially, the speaker conveys strong elation and moderate hope, with a hint of triumph, delivered at a moderate tempo with a mid-to-high pitch range. The tone is clear and resonant, reflecting a spontaneous and natural delivery. As the recording progresses, the emotional landscape shifts to include a strong sense of hope and enthusiasm, coupled with moderate elation and a slight feeling of triumph. The tempo increases to a fast pace, and the pitch range becomes higher and more dynamic. The overall delivery remains natural and spontaneous, showcasing a confident and expressive vocal performance. The speaker's voice is consistently"
   },
   "error": null
 }
samples/audioset/audioset__rQmOOSlJ74g_195927.json
CHANGED
@@ -15,10 +15,9 @@
       "confidence": 0.128052
     }
   ],
-  "route": "
+  "route": "sfx",
   "annotations": {
-    "
-    "bud_e_speech_caption": "A young adult female, likely in her 20s, delivers a highly expressive and engaging speech in English with a neutral American accent. The recording boasts exceptional quality, captured in a silent, studio-like environment, ensuring pristine audio clarity. The voice exhibits a bright, clear, and moderately bright timbre, leaning towards a female soprano register. The delivery is characterized by a fast tempo and a wide, dynamic pitch range, showcasing a natural and spontaneous speaking style. Initially, the speaker conveys strong amusement and moderate elation, with a subtle hint of teasing, expressed through a soft, breathy, and slightly nasal vocal quality. The voice is generally relaxed and stable, with a light vocal weight and healthy resonance, exhibiting a slight breathiness. As the speech progresses, the emotional intensity remains high, with the amusement and elation persisting, now accompanied by a slight hint of pleasure. The speaking style remains natural and spontaneous, with a consistent high pitch and dynamic range. Articulation is precise, and the airflow is neutral, contributing to a smooth and controlled delivery. The overall impression is one of high enjoyment and professionalism, stemming from the expressive vocal performance and excellent audio quality. The voice is consistently clear, head-mixed,"
+    "sound_effect_caption": "The audio features a distinct, high-pitched, and somewhat distorted sound. It sounds like a squeaky toy or a similar mechanical sound. The sound is short and repetitive. The audio is a recording of a toy, as indicated by the hint. The squeaky sound suggests the toy is being manipulated, possibly with a squeaky toy or a similar device."
   },
   "error": null
 }
samples/audioset/audioset__xYVzy_dh20A_184856.json
CHANGED
@@ -15,10 +15,9 @@
       "confidence": 0.091248
     }
   ],
-  "route": "
+  "route": "sfx",
   "annotations": {
-    "
-    "bud_e_speech_caption": "auditory experience is initially characterized by a low-quality recording featuring a male voice, likely in his 20s, speaking English with a neutral American accent. The initial segment presents a neutral-valence, calm demeanor, though the recording quality is poor, marked by significant background hiss and a sustained low-frequency hum, suggesting a non-studio environment. The voice exhibits a very low pitch, a monotonous and slow tempo, and a consistent, moderate loudness. Articulation is clear, but the overall delivery lacks naturalness, resulting in low enjoyment and professionalism.\n\nThe audio then transitions to a second half featuring a slightly low-pitched, soft voice, belonging to an adult male, likely in his 20s to 40s. The emotional state shifts to a mix of moderate sadness and slight disappointment, conveyed through a slow and deliberate delivery. The timbre is a male baritone, slightly soft and neutral, with a dark neutral quality, a touch of breathiness and nasality, and a relaxed vocal production. A slight roughness and chest-mixed resonance are present, with near-neutral heavy vocal weight and mild wear, yet the voice remains mostly natural and stable. The recording quality is decent, with a"
+    "sound_effect_caption": "The audio contains speech. A male voice is speaking, but the content is unintelligible. The speech is somewhat muffled and difficult to understand. The audio contains speech, as indicated by the hint. The muffled quality suggests the speaker is either far away, or the recording quality is poor."
   },
   "error": null
 }
samples/freesound/audio_206738_390405.json
CHANGED
@@ -17,7 +17,7 @@
   ],
   "route": "sfx",
   "annotations": {
-    "sound_effect_caption": "The audio features a continuous, high-pitched, and somewhat distorted electronic sound, resembling a siren or a high-frequency alarm. It has a rhythmic, almost pulsating quality, with a distinct, repetitive pattern. The sound is loud and
+    "sound_effect_caption": "The audio features a continuous, high-pitched, and somewhat distorted electronic sound, resembling a siren or a high-frequency alarm. It has a rhythmic, almost pulsating quality, with a distinct, repetitive pattern. The sound is loud and has a somewhat abrasive texture. This sound is characteristic of an alarm or warning signal, possibly from a fire alarm, a security system, or a warning siren. The repetitive and somewhat distorted nature suggests a warning or alert."
   },
   "error": null
 }
samples/freesound/audio_360573_395201.json
CHANGED
@@ -17,7 +17,7 @@
   ],
   "route": "sfx",
   "annotations": {
-    "sound_effect_caption": "
+    "sound_effect_caption": "Ambient soundscape featuring a continuous, high-pitched whirring or buzzing, characteristic of a drone. The sound is consistent and suggests the operation of a drone or a similar flying insect. The continuous whirring indicates the drone is in flight, and the subtle variations in pitch and intensity suggest the insect is moving closer to and further away from the recording device."
   },
   "error": null
 }
samples/freesound/audio_389654_399191.json
CHANGED
@@ -4,7 +4,7 @@
   "audioset_top3": [
     {
       "label": "Insect",
-      "confidence": 0.
+      "confidence": 0.882324
     },
     {
       "label": "Cricket",
samples/freesound/audio_41515_397745.json
CHANGED
@@ -4,7 +4,7 @@
   "audioset_top3": [
     {
       "label": "Vehicle",
-      "confidence": 0.
+      "confidence": 0.628418
     },
     {
       "label": "Field recording",
samples/majestrino/majestrino__03001175.json
CHANGED
@@ -18,7 +18,7 @@
   "route": "speech",
   "annotations": {
     "voice_tags": "Suitable for Work, natural speaking, fluent, conversational style, modal voice, neutral airflow, normal loudness, slightly dynamic, precise articulation, normal speaking",
-    "bud_e_speech_caption": "adult male voice delivers a narration with a slightly pensive and melancholic tone, exhibiting a moderate sense of contemplation and a subtle hint of disappointment. The voice possesses a male baritone timbre, characterized by a slightly soft and neutral quality with a near-neutral-slightly-bright
+    "bud_e_speech_caption": "adult male voice delivers a narration with a slightly pensive and melancholic tone, exhibiting a moderate sense of contemplation and a subtle hint of disappointment. The voice possesses a male baritone timbre, characterized by a slightly soft and neutral quality with a near-neutral-slightly-bright overall tonality. A subtle breathiness and a slight nasality are present, contributing to a relaxed vocal production. The voice exhibits a slight roughness and a chest-mixed resonance, with a near-neutral heavy vocal weight and mild wear, yet remains mostly natural and stable. The speaker's delivery is generally calm and measured, with a neutral pitch and volume, and a slightly monotonous intonation, though not entirely devoid of subtle dynamic shifts. Articulation is precise, indicative of a narration style. The airflow is neutral, and the voice is generally stable. The recording quality is excellent, with no discernible background noise, ensuring a clear and natural sound. The overall enjoyment is rated as medium, reflecting the clear audio but subdued emotional expression. The professionalism is high, attributed to the excellent recording quality and controlled vocal performance. The speaker conveys a sense of vulnerability and a slightly confident demeanor, with a neutral"
   },
   "error": null
 }
samples/majestrino/majestrino__03001817.json
CHANGED
@@ -15,10 +15,9 @@
       "confidence": 0.238403
     }
   ],
-  "route": "
+  "route": "sfx",
   "annotations": {
-    "
-    "bud_e_speech_caption": "A young adult male, likely in his 20s, delivers a speech characterized by a blend of impatience, irritability, moderate disappointment, and a hint of anger. The voice exhibits a slightly nasal timbre, a mid-range pitch, and a moderate tempo, with clear and precise articulation. The delivery is natural and spontaneous, suggesting a casual, unscripted utterance in English with a standard American accent. The recording quality is high, captured in a quiet environment with minimal background noise, indicating a good microphone setup. The overall enjoyment is medium, stemming from the clear audio but somewhat negative emotional tone. Professionalism is low, reflecting the informal nature of the speech.\n\nThe speaker's emotional state shifts throughout the recording. Initially, the tone is subdued and neutral, conveying mild impatience and irritability, with a touch of disappointment. This evolves into a more pronounced mix of impatience, irritability, and a hint of anger. The voice maintains a clear, slightly nasal timbre, with a mid-range pitch and moderate tempo. The speech remains natural and spontaneous, with a slight room echo present. The voice is described as a slightly soft young adult male voice with medium-pitched delivery, exhibiting a near-ne"
+    "sound_effect_caption": "A female voice, perceived as adult, is speaking in a casual, conversational tone. The speech is clear and easily understandable, with a moderate pace and a neutral emotional state. The audio contains a snippet of a conversation or monologue. The speaker's neutral tone suggests a non-emotional or informative context."
   },
   "error": null
 }
samples/majestrino/majestrino__03002030.json
CHANGED
@@ -15,10 +15,9 @@
       "confidence": 0.140991
     }
   ],
-  "route": "
+  "route": "sfx",
   "annotations": {
-    "
-    "bud_e_speech_caption": "A medium-quality recording features a female speaker, likely a young adult, delivering a casual and slightly resigned monologue in English with a neutral American accent. The overall tone is subdued, exhibiting a blend of mild disappointment, moderate impatience, and underlying irritability, occasionally tinged with a hint of contempt. The speaker's voice possesses a soft, slightly breathy timbre, with a near-neutral-slightly-bright quality and a subtle nasality. The pitch is mid-range, and the tempo is slow and deliberate, contributing to a sense of thoughtfulness. Articulation is mostly clear, though with a slight imprecision, and the airflow is neutral. The voice exhibits a balanced head-chest resonance, a neutral weight, and a healthy, clear quality, sounding perfectly natural and stable. \n\nThe initial portion of the audio reveals a hesitant delivery, with a moderate degree of doubt and a slight sense of confusion, though the overall valence remains neutral. The speaker's tone is calm and soft, with a relatively monotonous pitch range. As the recording progresses, the speaker's delivery becomes slightly more hesitant, yet maintains a genuine quality. A moderate degree of doubt and a slight sense of confusion persist, but"
+    "sound_effect_caption": "The audio contains speech from a female speaker. The speech is clear and understandable, with a moderate pace and a neutral tone. The audio contains speech, as indicated by the hint. The speaker is likely delivering information or engaging in a conversation."
   },
   "error": null
 }
samples/majestrino/majestrino__03002116.json
CHANGED
@@ -15,10 +15,9 @@
       "confidence": 0.329834
     }
   ],
-  "route": "
+  "route": "sfx",
   "annotations": {
-    "
-    "bud_e_speech_caption": "A male voice, likely in his late 20s to 40s, delivers a narration with a generally neutral and controlled tone. The voice possesses a male baritone timbre, characterized by a slightly soft and relaxed quality, a near-neutral brightness, and a subtle breathiness and nasality. There's a hint of roughness to the clarity, with a chest-mixed resonance and a near-neutral vocal weight, suggesting mild vocal wear but overall natural stability. The airflow is neutral, and the articulation is precise, contributing to a narration style delivery. The speaking style is fluent and natural, with a moderate tempo and a slightly dynamic delivery. \n\nInitially, the voice conveys a sense of contemplation and moderate interest, with a neutral valence and a slightly calm arousal level. The speaker exhibits a neutral submissive-dominant dynamic, a moderate degree of vulnerability, and a slightly confident demeanor. The pitch and volume remain neutral, with a slightly monotone quality, and the overall tone is neither warm nor cold. The recording quality is high, with no background noise, resulting in a clear and natural sound. \n\nThe second half of the audio shifts to a slightly low-p"
+    "sound_effect_caption": "The audio contains speech from a male speaker. The speaker is talking at a normal pace and volume. The recording quality is clear, with no noticeable background noise. The audio is a recording of a male speaker, as indicated by the hint. The neutral tone suggests a factual or informative context."
   },
   "error": null
 }
samples/majestrino/majestrino__03004481.json
CHANGED
@@ -18,7 +18,7 @@
   "route": "speech",
   "annotations": {
     "voice_tags": "Suitable for Work, natural-SFW, fluent, casual speaking style, slack voice, neutral airflow, quiet, flat intonation, neutral articulation, sighing delivery",
-    "bud_e_speech_caption": "A female voice, likely in her
+    "bud_e_speech_caption": "A female voice, likely in her 30s or 40s, delivers a melancholic and reflective monologue in English with a neutral American accent. The recording boasts high quality, captured in a quiet, studio-like environment with no discernible background noise, contributing to a clear and professional sound. The speaker's tone is consistently soft and breathy, conveying a sense of sadness, longing, and a hint of bitterness. Initially, the delivery is slow and deliberate, with a low pitch and a slightly shaky timbre, suggesting vulnerability and emotional weight. A noticeable sigh punctuates the speech, emphasizing the speaker's distress. The articulation is neutral, and the airflow is quiet, contributing to a sense of subdued emotion. The voice possesses a female mezzo-soprano quality, moderately bright and clear, with a slight nasality and a touch of tension. The resonance is primarily head-mixed, lending a light vocal weight. The overall impression is one of naturalness and stability, though with a subtle wobble. As the monologue progresses, the emotional intensity remains consistent, with the speaker maintaining a slow, deliberate pace and a low pitch. The voice retains its breathy quality, further emphasizing the melancholic mood"
   },
   "error": null
 }
samples/music/music__suno_audio_196211_4_1844520.json
CHANGED
@@ -15,10 +15,9 @@
       "confidence": 0.074341
     }
   ],
-  "route": "
+  "route": "sfx",
   "annotations": {
-    "
-    "bud_e_speech_caption": "This recording features a young adult male, likely in his 20s, delivering a highly energetic and expressive speech in English with a standard American accent. The overall tone is overwhelmingly positive, characterized by strong elation, moderate hope, and a hint of triumph, evolving into amusement and contentment. The speaker exhibits a dynamic and fast-paced delivery, with a wide pitch range and precise articulation, suggesting a professional voice-over style rather than casual conversation. The voice possesses a male baritone timbre, slightly harsh yet moderately bright and clear, with a subtle breathiness and mild nasality. There's a noticeable tension and pressed quality to the voice, accompanied by a slight roughness and chest-mixed resonance. The vocal weight is moderately heavy, with a hint of wear, but the voice remains perfectly natural and stable.\n\nThe recording quality is moderate, likely captured via a consumer-grade microphone in a room with some audible echo, indicating a live event or outdoor setting. A second, quieter speaker is briefly audible at the end. The speech is delivered with a very loud volume and a fast tempo, occasionally exhibiting vocal bursts. The speaker's style is highly expressive, with a wide dynamic range and a clear, resonant timbre."
+    "sound_effect_caption": "A male voice, perceived as adult, speaks in a clear, measured tone. The speech is articulate and articulate, with a slightly formal timbre. The pace is moderate, and the pitch is in the mid-range. The recording is clean, with minimal background noise. This is a recording of a male speaker delivering a formal speech or narration, possibly in a professional or educational setting. The clear articulation and measured pace suggest a prepared statement or a prepared statement."
   },
   "error": null
 }
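The rerouted sidecars in this commit swap their annotation keys along with the `route` value: `"speech"` entries carry `voice_tags` and `bud_e_speech_caption`, while `"sfx"` entries carry only `sound_effect_caption`. A small consistency check along those lines might look like the sketch below. The key sets are inferred from the hunks on this page only; the function name `check_sidecar`, the treatment of leftover keys as "stale", and the handling of unlisted routes are assumptions, not the repository's own tooling.

```python
import json
from typing import List

# Annotation keys observed per route in this commit's diffs (assumed to be complete).
EXPECTED_KEYS = {
    "speech": {"voice_tags", "bud_e_speech_caption"},
    "sfx": {"sound_effect_caption"},
}


def check_sidecar(sidecar: dict) -> List[str]:
    """Return a list of problems; an empty list means the sidecar looks consistent."""
    problems: List[str] = []
    route = sidecar.get("route")
    annotations = sidecar.get("annotations") or {}
    expected = EXPECTED_KEYS.get(route)

    if expected is None:
        # Routes not seen in this diff are flagged rather than guessed at.
        problems.append(f"unrecognised route: {route!r}")
    else:
        present = set(annotations)
        for key in sorted(expected - present):
            problems.append(f"missing annotation for route {route!r}: {key}")
        for key in sorted(present - expected):
            problems.append(f"stale annotation for route {route!r}: {key}")

    # Every annotation is expected to be a non-empty string.
    for key, value in annotations.items():
        if not isinstance(value, str) or not value.strip():
            problems.append(f"empty annotation: {key}")

    if sidecar.get("error") is not None:
        problems.append(f"error field is set: {sidecar['error']!r}")
    return problems


example = json.loads(
    '{"route": "sfx", "annotations": {"sound_effect_caption": "A squeaky toy."}, "error": null}'
)
print(check_sidecar(example))  # prints []
```

Running a check like this over `samples/` after a threshold change would surface sidecars whose `route` was updated while a speech caption was left behind, or the reverse.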