Bud-E Wake Word Models
Wake word detection models for the phrases "Hey Buddy", "Stop Buddy", and "Go Buddy", trained using the livekit-wakeword toolkit. These models are designed for the Bud-E voice assistant project.
Models
"Hey Buddy" - English
| Model | Size | AUT | FPPH | Recall@0.5 | Optimal Recall | Optimal Threshold | ONNX Size |
|---|---|---|---|---|---|---|---|
| en_tiny | 16d, 1 block | 0.0087 | 0.00 | 39.6% | 66.1% @ 0.32 | 0.32 | 119 KB |
| en_small | 32d, 1 block | 0.0067 | 0.00 | 61.8% | 73.6% @ 0.36 | 0.36 | 163 KB |
| en_medium | 128d, 2 blocks | 0.0062 | 0.54 | 86.8% | 79.6% @ 0.76 | 0.76 | 933 KB |
| en_medium_v2 | 128d, 2 blocks | 0.0332 | 0.00 | 26.8% | 93.4% @ 0.01 | 0.01 | 933 KB |
| en_large | 256d, 3 blocks | 0.0038 | 1.03 | 92.3% | 83.7% @ 0.88 | 0.88 | 3.8 MB |
| en_large_v2 | 256d, 3 blocks | 0.0172 | 0.92 | 63.8% | 53.6% @ 0.80 | 0.80 | 3.8 MB |
| en_large_v3 | 256d, 3 blocks | 0.0214 | 0.44 | 53.9% | 88.8% @ 0.02 | 0.02 | 3.8 MB |
en_medium_v2 was trained with enhanced time-stretch augmentation (0.4x–2.0x speech rate) and pitch shifting to improve robustness to varying speaking speeds. When the original en_medium model is evaluated on speed-varied test data, its recall drops to 42.4%; the v2 model handles these variations much better. The tradeoff is a worse overall AUT: the score distribution is more compressed, requiring a lower detection threshold.
en_large_v2, stop_buddy_en_large_v2, and go_buddy_en_large_v2 were trained with cross-wake-word adversarial negatives: each model uses the other two wake words' positive clips as additional negative training data (e.g., "stop buddy" and "go buddy" clips are negatives for the "hey buddy" model). This helps the models discriminate between the three wake words. Additional background noise from the freesound_audio_sm dataset was mixed in during augmentation. The stop_buddy_en_large_v2 model shows a major improvement: AUT dropped from 0.0272 to 0.0117, with the optimal threshold moving from 0.02 to 0.93 (much better score separation).
en_large_v3, stop_buddy_en_large_v3, and go_buddy_en_large_v3 build on v2 by adding confusable-phrase adversarial negatives: 20,000 TTS-generated clips of phonetically similar phrases per model (e.g., "hey daddy", "hey baby", "hey bunny" for "hey buddy"; "stop daddy", "shop buddy" for "stop buddy"; "go daddy", "no buddy" for "go buddy"). These were augmented with 3 rounds of time-stretch, pitch-shift, EQ, RIR, and background noise, then combined with the v2 training data. In the tables, FPPH at threshold 0.5 drops relative to v2 for all three models (0.92 to 0.44, 4.68 to 3.39, and 5.22 to 3.71), indicating fewer false activations on similar-sounding phrases. AUT, however, is higher (worse) than v2 in all three cases; relative to the original large models it improves for "Stop Buddy" and "Go Buddy" but not for "Hey Buddy".
"Hey Buddy" - German
| Model | Size | AUT | FPPH | Recall@0.5 | Optimal Recall | Optimal Threshold | ONNX Size |
|---|---|---|---|---|---|---|---|
| de_tiny | 16d, 1 block | 0.0111 | 0.00 | 25.0% | 63.4% @ 0.24 | 0.24 | 119 KB |
| de_small | 32d, 1 block | 0.0097 | 0.00 | 47.6% | 67.7% @ 0.30 | 0.30 | 163 KB |
| de_medium | 128d, 2 blocks | 0.0060 | 1.11 | 82.6% | 74.5% @ 0.79 | 0.79 | 933 KB |
| de_large | 256d, 3 blocks | 0.0066 | 2.55 | 89.2% | 51.3% @ 0.95 | 0.95 | 3.8 MB |
"Stop Buddy" - English
| Model | Size | AUT | FPPH | Recall@0.5 | Optimal Recall | Optimal Threshold | ONNX Size |
|---|---|---|---|---|---|---|---|
| stop_buddy_en_medium | 128d, 2 blocks | 0.0339 | 1.13 | 49.3% | 90.4% @ 0.01 | 0.01 | 933 KB |
| stop_buddy_en_large | 256d, 3 blocks | 0.0272 | 7.85 | 64.8% | 87.4% @ 0.02 | 0.02 | 3.8 MB |
| stop_buddy_en_large_v2 | 256d, 3 blocks | 0.0117 | 4.68 | 81.0% | 58.7% @ 0.93 | 0.93 | 3.8 MB |
| stop_buddy_en_large_v3 | 256d, 3 blocks | 0.0232 | 3.39 | 60.0% | 89.5% @ 0.02 | 0.02 | 3.8 MB |
"Go Buddy" - English
| Model | Size | AUT | FPPH | Recall@0.5 | Optimal Recall | Optimal Threshold | ONNX Size |
|---|---|---|---|---|---|---|---|
| go_buddy_en_medium | 128d, 2 blocks | 0.0385 | 0.62 | 45.7% | 87.8% @ 0.01 | 0.01 | 933 KB |
| go_buddy_en_large | 256d, 3 blocks | 0.0344 | 5.69 | 62.1% | 88.5% @ 0.01 | 0.01 | 3.8 MB |
| go_buddy_en_large_v2 | 256d, 3 blocks | 0.0153 | 5.22 | 81.5% | 92.8% @ 0.02 | 0.02 | 3.8 MB |
| go_buddy_en_large_v3 | 256d, 3 blocks | 0.0230 | 3.71 | 63.2% | 89.3% @ 0.02 | 0.02 | 3.8 MB |
Understanding the Metrics
Every wake word detector faces a fundamental tradeoff: if you make it more sensitive (catch more real wake words), it will also trigger more often on things that aren't the wake word. The metrics below capture different aspects of this tradeoff.
AUT (Area Under the DET Curve): Lower is better. Range: 0 to 1.
This is the single best number to compare models. The DET (Detection Error Tradeoff) curve plots the miss rate against the false alarm rate at every possible threshold. AUT is the area under that curve. A perfect model that never misses and never false-triggers would score 0. In practice, a score below 0.01 is excellent, 0.01–0.04 is decent, and above 0.05 starts to be problematic. Think of it as: "across all possible sensitivity settings, how often does this model make mistakes?"
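As a rough illustration of that definition, the NumPy sketch below sweeps thresholds, builds the miss-rate vs. false-alarm-rate curve, and integrates it with the trapezoidal rule. The function name `area_under_det` and the threshold grid are assumptions made for this example; the evaluation code in livekit-wakeword may compute AUT differently.

```python
import numpy as np

def area_under_det(pos_scores, neg_scores, num_thresholds=1001):
    """Approximate the area under the miss-rate vs. false-alarm-rate curve."""
    pos = np.asarray(pos_scores, dtype=np.float64)
    neg = np.asarray(neg_scores, dtype=np.float64)
    # Sweep thresholds from high to low so the false alarm rate never decreases.
    thresholds = np.linspace(1.0, 0.0, num_thresholds)
    # Miss rate: fraction of true wake words scored below the threshold.
    miss = np.array([(pos < t).mean() for t in thresholds])
    # False alarm rate: fraction of negatives scored at or above the threshold.
    false_alarm = np.array([(neg >= t).mean() for t in thresholds])
    # Trapezoidal integration of miss rate over false alarm rate.
    widths = np.diff(false_alarm)
    heights = (miss[1:] + miss[:-1]) / 2.0
    return float(np.sum(widths * heights))

# Toy scores (made up): perfectly separated classes give an area of 0.
print(area_under_det([0.9, 0.95], [0.1, 0.05]))
```

A model whose positives all score below its negatives would come out at 1, the worst case on this scale.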
FPPH (False Positives Per Hour): Lower is better. Measured at threshold=0.5.
This answers: "if I leave this running on normal audio (speech, music, silence, etc.), how many times per hour will it falsely think it heard the wake word?" For a usable voice assistant, you want this below 1. A value of 0.00 means zero false triggers at the default threshold in the test set. Note: the test set is finite (~20 hours), so 0.00 means "none observed" rather than "mathematically impossible."
Recall@0.5: Higher is better. Measured at threshold=0.5.
This answers: "when someone actually says the wake word, what percentage of the time does the model detect it?" A recall of 86.8% means the model catches about 87 out of 100 real wake words at the default threshold. The remaining 13% are missed. Higher is better, but pushing recall too high usually increases false positives too.
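Both fixed-threshold metrics reduce to simple counts. The sketch below is a simplified illustration with hypothetical helper names: it treats every scored negative window as a separate potential false trigger, whereas a real evaluator may merge consecutive triggers into a single event.

```python
import numpy as np

def recall_at(pos_scores, threshold=0.5):
    """Fraction of true wake-word clips scored at or above the threshold."""
    pos = np.asarray(pos_scores, dtype=np.float64)
    return float((pos >= threshold).mean())

def false_positives_per_hour(neg_scores, hours, threshold=0.5):
    """False triggers per hour of negative audio (each window counted once)."""
    neg = np.asarray(neg_scores, dtype=np.float64)
    return float((neg >= threshold).sum() / hours)

# Toy scores (made up): 2 of 4 wake words detected, 2 false triggers in 2 hours.
print(recall_at([0.9, 0.7, 0.4, 0.2]))
print(false_positives_per_hour([0.1, 0.6, 0.05, 0.55], hours=2.0))
```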
Optimal Recall: Higher is better. Measured at the optimal threshold.
This is the recall at the "optimal threshold": the threshold that the evaluation found gives the best recall while keeping false positives below a target rate (0.1 FPPH). This shows the model's best achievable performance in a practical deployment setting.
Optimal Threshold: Closer to 0.5 is generally better.
The detection threshold where the model performs best (highest recall while staying below the FPPH target). A high optimal threshold (like 0.76 or 0.88) means the model produces well-separated scores (genuine wake words score high, everything else scores low), which is ideal. A very low optimal threshold (like 0.01) means the model's scores aren't well-separated: it needs to accept almost everything to catch the wake words, which also lets through many false positives.
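One way to picture the threshold search is the sketch below. It relies on the fact that both recall and FPPH can only fall as the threshold rises, so the lowest threshold that meets the FPPH target is also the one with the highest recall. The name `pick_threshold`, the 0.01 grid step, and the per-window FPPH count are assumptions for illustration, not the toolkit's actual procedure.

```python
import numpy as np

def pick_threshold(pos_scores, neg_scores, hours, target_fpph=0.1, num_thresholds=101):
    """Return (threshold, recall) for the lowest threshold with FPPH below target."""
    pos = np.asarray(pos_scores, dtype=np.float64)
    neg = np.asarray(neg_scores, dtype=np.float64)
    for t in np.linspace(0.0, 1.0, num_thresholds):
        fpph = (neg >= t).sum() / hours
        if fpph < target_fpph:
            # First threshold under the target: recall cannot be higher later.
            return float(t), float((pos >= t).mean())
    return None  # no threshold on the grid meets the target

# Toy data (made up): one hour of negatives, the loudest scoring 0.605.
print(pick_threshold([0.9, 0.8, 0.3], [0.605, 0.2, 0.1], hours=1.0))
```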
How to read the tables: a practical example
Take en_medium (AUT=0.0062, FPPH=0.54, Recall@0.5=86.8%, Optimal=79.6% @ 0.76):
- At the default threshold of 0.5, it catches 86.8% of wake words but triggers ~0.5 false alarms per hour
- At the optimal threshold of 0.76, it catches 79.6% of wake words with fewer than 0.1 false alarms per hour
- The AUT of 0.0062 confirms this is a high-quality model overall
Compare with en_tiny (AUT=0.0087, FPPH=0.00, Recall@0.5=39.6%, Optimal=66.1% @ 0.32):
- At threshold 0.5, it catches only 39.6% of wake words, which is too conservative
- Lowering the threshold to 0.32 improves recall to 66.1% with acceptable false positives
- The higher AUT (0.0087 vs 0.0062) confirms it's a weaker model overall, as expected given its smaller size
Architecture
All models use the conv_attention classifier architecture from livekit-wakeword:
- Input: Pre-extracted speech embeddings of shape (batch, 16, 96) from the frozen Google speech_embedding model
- Architecture: Conv1D layers + Multi-head Attention + Mean Pooling + Linear head + Sigmoid
- Output: Confidence score in [0, 1]
The full inference pipeline is:
- Audio (16 kHz mono) → Mel spectrogram (ONNX frontend) → Speech embeddings (N, 96) (ONNX encoder) → Pad/truncate to (16, 96) → Classifier (this model) → Score [0, 1]
The mel spectrogram and speech embedding ONNX models are bundled with the livekit-wakeword package in resources/.
Usage
With livekit-wakeword (Recommended)
pip install livekit-wakeword
from livekit.wakeword import WakeWordDetector
import numpy as np
# Load model
detector = WakeWordDetector.from_pretrained("laion/bud-e_wakeword-models_livekit-wakeword", model_name="en_large")
# Process audio (16 kHz, mono, float32)
audio = np.random.randn(32000).astype(np.float32) # 2 seconds
score = detector.detect(audio)
print(f"Wake word confidence: {score:.3f}")
# Use optimal threshold from training
if score > 0.88: # optimal_threshold for en_large
    print("Wake word detected!")
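Because the optimal threshold differs so much between models, it can help to keep the values from the tables next to the model names instead of hard-coding one number. The lookup below is a small convenience sketch; `OPTIMAL_THRESHOLDS` and `is_wake_word` are hypothetical names and not part of the livekit-wakeword API.

```python
# Optimal thresholds copied from the "Hey Buddy" tables in this card.
OPTIMAL_THRESHOLDS = {
    "en_tiny": 0.32,
    "en_small": 0.36,
    "en_medium": 0.76,
    "en_medium_v2": 0.01,
    "en_large": 0.88,
    "en_large_v2": 0.80,
    "en_large_v3": 0.02,
    "de_tiny": 0.24,
    "de_small": 0.30,
    "de_medium": 0.79,
    "de_large": 0.95,
}

def is_wake_word(score, model_name, thresholds=OPTIMAL_THRESHOLDS):
    """Compare a confidence score against the model's tabulated threshold."""
    return score > thresholds[model_name]

print(is_wake_word(0.90, "en_large"))   # above 0.88
print(is_wake_word(0.50, "en_medium"))  # below 0.76
```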
Direct ONNX Inference
import onnxruntime as ort
import numpy as np
# Load classifier
session = ort.InferenceSession("en_large/hey_buddy_en_large.onnx")
# Input: pre-extracted speech embeddings (batch, 16, 96)
embeddings = np.random.randn(1, 16, 96).astype(np.float32)
# Run inference
score = session.run(["score"], {"embeddings": embeddings})[0]
print(f"Score: {score[0, 0]:.4f}")
Full Pipeline (Manual)
For custom integration without the livekit-wakeword package:
import onnxruntime as ort
import numpy as np
import librosa
# Load pipeline models (from livekit-wakeword resources/)
mel_session = ort.InferenceSession("melspectrogram.onnx")
embed_session = ort.InferenceSession("speech_embedding.onnx")
classifier_session = ort.InferenceSession("en_large/hey_buddy_en_large.onnx")
# 1. Load audio
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)
# 2. Compute mel spectrogram
mel = mel_session.run(None, {"audio": audio.reshape(1, -1)})[0]
# 3. Extract embeddings
embeddings = embed_session.run(None, {"mel": mel})[0] # (N, 96)
# 4. Pad/truncate to (16, 96)
if embeddings.shape[0] >= 16:
    embeddings = embeddings[-16:]
else:
    pad = np.zeros((16 - embeddings.shape[0], 96), dtype=np.float32)
    embeddings = np.concatenate([pad, embeddings])
# 5. Run classifier
score = classifier_session.run(
    ["score"],
    {"embeddings": embeddings.reshape(1, 16, 96)}
)[0][0, 0]
print(f"Wake word score: {score:.4f}")
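The manual pipeline above scores only the last 16 embedding frames. To scan a longer recording, one option is to slide a 16-frame window over the embedding sequence and score each position. The sketch below shows only the windowing logic, using the same left zero-padding as step 4; `sliding_windows`, `scan`, and the `score_fn` callable are hypothetical, and the windowing that livekit-wakeword itself uses for streaming may differ.

```python
import numpy as np

def sliding_windows(embeddings, window=16, hop=1):
    """Yield (frame_index, window) pairs; windows are zero-padded on the left."""
    emb = np.asarray(embeddings, dtype=np.float32)
    n, dim = emb.shape
    pad = np.zeros((window - 1, dim), dtype=np.float32)
    padded = np.concatenate([pad, emb])
    for end in range(0, n, hop):
        # Rows end..end+window-1 of `padded` are frames end-window+1..end of `emb`.
        yield end, padded[end:end + window]

def scan(embeddings, score_fn, threshold, hop=1):
    """Return the frame indices whose window scores above the threshold."""
    hits = []
    for end, win in sliding_windows(embeddings, hop=hop):
        # score_fn receives a batch of shape (1, 16, 96) and returns a float.
        if score_fn(win[np.newaxis]) > threshold:
            hits.append(end)
    return hits

# Stand-in scorer for demonstration: reads one value from the newest frame.
demo = np.zeros((20, 96), dtype=np.float32)
demo[7, 0] = 1.0
print(scan(demo, lambda batch: float(batch[0, -1, 0]), threshold=0.5))
```

With the ONNX classifier from the example above, `score_fn` could wrap `classifier_session.run(["score"], {"embeddings": batch})[0][0, 0]`.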
Training Data
Models were trained on synthetic speech data generated using up to 3 TTS backends for maximum diversity:
"Hey Buddy" - English
- Piper VITS (en_US-lessac-medium): 4,000 positive train / 800 test; 904 speaker voices with SLERP blending
- VoxCPM2: 3,000 positive train / 600 test; 29 voice design prompts x 4 CFG values x 3 timestep configs
- ChatterboxTTS: 2,000 positive train / 400 test; 8 reference voices with varying exaggeration/temperature
- Adversarial negatives: 4,000 train / 800 test; phonetically similar phrases ("hey body", "hey bunny", "hey baby", etc.)
- Background noise: 1,000 train / 200 test; from MUSAN
- General negatives: ACAV100M ~2000 hrs pre-extracted speech features
"Hey Buddy" - German
- VoxCPM2: 3,000 positive train / 600 test
- ChatterboxTTS: 2,000 positive train / 400 test
- Adversarial negatives: 3,000 train / 600 test
- Background noise: 1,000 train / 200 test
- General negatives: ACAV100M ~2000 hrs
"Stop Buddy" / "Go Buddy" - English
- Piper VITS (en_US-lessac-medium): 4,000 positive train / 800 test
- VoxCPM2: 3,000 positive train / 600 test; 29 voice design prompts x 4 CFG values x 3 timestep configs
- ChatterboxTTS: 2,000 positive train / 400 test; reference voice cloning
- Adversarial negatives: ~4,000 train / 800 test; custom confusable phrases per wake word (e.g., "stop body", "go bunny", "go buddy" vs "stop buddy", etc.)
- Background noise: 1,000 train / 200 test; from MUSAN
- General negatives: ACAV100M ~2000 hrs pre-extracted speech features
Augmentation
"Hey Buddy" v1 and German models use 3 rounds of compounding augmentation:
- 7-band parametric EQ (25% probability)
- Tanh distortion (25% probability)
- Room impulse response convolution (50% probability, MIT RIRs)
- Background noise mixing (SNR 5-15 dB)
"Hey Buddy" v2, "Stop Buddy", and "Go Buddy" use 5 rounds of enhanced compounding augmentation:
- 7-band parametric EQ (25% probability)
- Tanh distortion (25% probability)
- Time stretch (35% probability, 0.5x–1.8x rate)
- Pitch shift (20% probability, ±2 semitones)
- Low-pass filter (20% probability, 4–7.5 kHz cutoff)
- Gaussian noise (20% probability)
- 8 kHz downsample round-trip (15% probability), which simulates low-quality audio
- Room impulse response convolution (50% probability, MIT RIRs)
- Background noise mixing (SNR 5-15 dB)
Additionally, "Hey Buddy" v2 includes explicit time-stretched variants in the training set: 25% of positive clips slowed to 0.4x–0.95x, 25% sped up to 1.05x–2.0x, 10% pitch-shifted ±3 semitones, and 10% with combined speed+pitch changes.
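To make "compounding" concrete, the sketch below applies two of the listed steps (tanh distortion at 25% probability and background noise mixing at a random 5-15 dB SNR) repeatedly over several rounds, so effects from earlier rounds feed into later ones. It is a minimal illustration under assumptions: the distortion gain, the choice to mix noise in every round, and all function names are invented here, and the remaining steps (EQ, RIR, time stretch, and so on) are omitted.

```python
import numpy as np

def tanh_distortion(audio, gain=4.0):
    """Soft-clip the waveform; the gain value is an assumption."""
    return np.tanh(gain * audio) / np.tanh(gain)

def mix_at_snr(audio, noise, snr_db):
    """Scale the noise so the signal-to-noise ratio equals snr_db, then add it."""
    noise = np.resize(noise, audio.shape)  # repeat or trim noise to match length
    sig_power = np.mean(audio ** 2)
    noise_power = max(np.mean(noise ** 2), 1e-12)  # guard against silent noise
    scale = np.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return audio + scale * noise

def augment(audio, noise, rng, rounds=3):
    """Apply the augmentation chain `rounds` times, each on the previous output."""
    out = audio
    for _ in range(rounds):
        if rng.random() < 0.25:
            out = tanh_distortion(out)
        out = mix_at_snr(out, noise, rng.uniform(5.0, 15.0))
    return out

rng = np.random.default_rng(0)
clip = np.sin(np.linspace(0.0, 100.0, 16000))   # stand-in for a speech clip
background = rng.standard_normal(4000)          # stand-in for a noise recording
print(augment(clip, background, rng).shape)
```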
Training
- 3-phase adaptive training with focal loss (gamma=2.0)
- Embedding mixup regularization (alpha=0.2)
- Label smoothing (epsilon=0.05)
- Cosine warmup + decay learning rate schedule
- Negative class weight ramp from 1 to 3000
- Checkpoint averaging over best validation checkpoints
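The loss and regularization terms in this list can be written compactly. The NumPy sketch below uses common textbook forms of binary focal loss, label smoothing, mixup, and a warmup-plus-cosine schedule with the hyperparameters listed above. The exact formulations in livekit-wakeword are not specified in this card, so details such as smoothing toward 0.5 and a linear warmup are assumptions.

```python
import numpy as np

def smooth_labels(y, epsilon=0.05):
    """Move hard 0/1 targets toward 0.5 by epsilon (one common binary form)."""
    return y * (1.0 - epsilon) + 0.5 * epsilon

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss; with gamma=0 it reduces to plain cross-entropy."""
    p = np.clip(p, eps, 1.0 - eps)
    pos_term = y * (1.0 - p) ** gamma * np.log(p)
    neg_term = (1.0 - y) * p ** gamma * np.log(1.0 - p)
    return float(-(pos_term + neg_term).mean())

def mixup(x, y, rng, alpha=0.2):
    """Blend each embedding and label with a randomly chosen batch partner."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1.0 - lam) * x[perm], lam * y + (1.0 - lam) * y[perm]

def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + np.cos(np.pi * progress))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 96))            # batch of embedding windows
y = smooth_labels(np.array([1.0, 0.0, 1.0, 0.0]))
x_mix, y_mix = mixup(x, y, rng)
print(x_mix.shape, y_mix.shape)
print(focal_loss(np.array([0.8, 0.3]), np.array([1.0, 0.0])))
```

The focal term down-weights examples the model already classifies confidently, which fits a setting where easy negatives vastly outnumber positives.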
File Structure
├── README.md
├── configs/ # Training YAML configs
│   ├── hey_buddy_en_base.yaml
│   ├── hey_buddy_de_base.yaml
│   ├── hey_buddy_{en,de}_{tiny,small,medium,large}.yaml
│   ├── hey_buddy_en_medium_v2.yaml
│   ├── stop_buddy_en_medium.yaml
│   ├── stop_buddy_en_large.yaml
│   ├── go_buddy_en_medium.yaml
│   └── go_buddy_en_large.yaml
├── en_tiny/
│   ├── hey_buddy_en_tiny.onnx # ONNX model (119 KB)
│   ├── hey_buddy_en_tiny.pt # PyTorch state dict
│   ├── hey_buddy_en_tiny_eval.json # Evaluation metrics
│   ├── hey_buddy_en_tiny_det.png # DET curve plot
│   └── hey_buddy_en_tiny_metrics.json
├── en_small/
├── en_medium/
├── en_medium_v2/ # Enhanced time-stretch augmentation
├── en_large/
├── en_large_v2/ # Cross-adversarial training
├── en_large_v3/ # + confusable-phrase adversarial
├── de_tiny/
├── de_small/
├── de_medium/
├── de_large/
├── stop_buddy_en_medium/ # "Stop Buddy" wake word (medium)
├── stop_buddy_en_large/ # "Stop Buddy" wake word (large)
├── stop_buddy_en_large_v2/ # "Stop Buddy" with cross-adversarial negatives
├── stop_buddy_en_large_v3/ # + confusable-phrase adversarial
├── go_buddy_en_medium/ # "Go Buddy" wake word (medium)
├── go_buddy_en_large/ # "Go Buddy" wake word (large)
├── go_buddy_en_large_v2/ # "Go Buddy" with cross-adversarial negatives
└── go_buddy_en_large_v3/ # + confusable-phrase adversarial
Recommended Models
- For production/edge: en_medium or de_medium, the best balance of recall vs. false positive rate
- For quality-first: en_large or de_large, with the highest recall but higher FPPH
- For resource-constrained: en_small or de_small, with zero FPPH and moderate recall
- For speed-robust detection: en_medium_v2, which handles fast and slow speech better than v1
- For voice commands: stop_buddy_en_large and go_buddy_en_large, companion wake words for Bud-E control (or medium variants for lower latency)
License
Apache 2.0
Citation
@misc{bude-wakeword-2026,
title={Bud-E Wake Word Models},
author={LAION},
year={2026},
url={https://huggingface.co/laion/bud-e_wakeword-models_livekit-wakeword}
}
Acknowledgments
- livekit-wakeword toolkit by LiveKit
- VoxCPM2 TTS by OpenBMB
- ChatterboxTTS by Resemble AI
- Piper TTS
- ACAV100M speech features