Bud-E Wake Word Models

Wake word detection models for the phrases "Hey Buddy", "Stop Buddy", and "Go Buddy", trained using the livekit-wakeword toolkit. These models are designed for the Bud-E voice assistant project.

Models

"Hey Buddy" — English

Model	Size	AUT	FPPH	Recall@0.5	Optimal Recall	Optimal Threshold	ONNX Size
`en_tiny`	16d, 1 block	0.0087	0.00	39.6%	66.1% @ 0.32	0.32	119 KB
`en_small`	32d, 1 block	0.0067	0.00	61.8%	73.6% @ 0.36	0.36	163 KB
`en_medium`	128d, 2 blocks	0.0062	0.54	86.8%	79.6% @ 0.76	0.76	933 KB
`en_medium_v2`	128d, 2 blocks	0.0332	0.00	26.8%	93.4% @ 0.01	0.01	933 KB
`en_large`	256d, 3 blocks	0.0038	1.03	92.3%	83.7% @ 0.88	0.88	3.8 MB
`en_large_v2`	256d, 3 blocks	0.0172	0.92	63.8%	53.6% @ 0.80	0.80	3.8 MB
`en_large_v3`	256d, 3 blocks	0.0214	0.44	53.9%	88.8% @ 0.02	0.02	3.8 MB

en_medium_v2 was trained with enhanced time-stretch augmentation (0.4x–2.0x speech rate) and pitch shifting to improve robustness to varying speaking speeds. When the original en_medium model is evaluated on speed-varied test data, its recall drops to 42.4% — the v2 model handles these variations much better. The tradeoff is a worse overall AUT; the score distribution is more compressed, requiring a lower detection threshold.

en_large_v2, stop_buddy_en_large_v2, and go_buddy_en_large_v2 were trained with cross-wake-word adversarial negatives: each model uses the other two wake words' positive clips as additional negative training data (e.g., "stop buddy" and "go buddy" clips are negatives for the "hey buddy" model). This helps the models discriminate between the three wake words. Additional background noise from the freesound_audio_sm dataset was mixed in during augmentation. The stop_buddy_en_large_v2 model shows a major improvement over stop_buddy_en_large: AUT dropped from 0.0272 to 0.0117, and the optimal threshold moved from 0.02 to 0.93 (much better score separation).
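
The cross-wake-word idea can be sketched in a few lines. This is only an illustration of the data layout, not the toolkit's actual data loader; the function name and the clip file names are hypothetical.

```python
def cross_wake_word_negatives(positive_clips):
    """For each wake word, collect the other wake words' positives as negatives."""
    negatives = {}
    for target in positive_clips:
        negatives[target] = [
            clip
            for other, clips in positive_clips.items()
            if other != target
            for clip in clips
        ]
    return negatives


# Hypothetical clip names, one list of positives per wake word.
positives = {
    "hey_buddy": ["hey_0001.wav", "hey_0002.wav"],
    "stop_buddy": ["stop_0001.wav"],
    "go_buddy": ["go_0001.wav"],
}
extra_negatives = cross_wake_word_negatives(positives)
print(extra_negatives["hey_buddy"])  # ['stop_0001.wav', 'go_0001.wav']
```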

en_large_v3, stop_buddy_en_large_v3, and go_buddy_en_large_v3 build on v2 by adding confusable-phrase adversarial negatives: 20,000 TTS-generated clips of phonetically similar phrases per model (e.g., "hey daddy", "hey baby", "hey bunny" for "hey buddy"; "stop daddy", "shop buddy" for "stop buddy"; "go daddy", "no buddy" for "go buddy"). These were augmented with 3 rounds of time-stretch, pitch-shift, EQ, RIR, and background noise, then combined with the v2 training data. Compared with v2, FPPH at the default threshold drops for all three models (0.92 to 0.44, 4.68 to 3.39, and 5.22 to 3.71), which means fewer false activations on similar-sounding phrases. AUT does not improve over v2, however: it rises from 0.0172 to 0.0214, from 0.0117 to 0.0232, and from 0.0153 to 0.0230. The v3 "Stop Buddy" and "Go Buddy" models still have a lower AUT than their v1 large counterparts (0.0272 and 0.0344).

"Hey Buddy" — German

Model	Size	AUT	FPPH	Recall@0.5	Optimal Recall	Optimal Threshold	ONNX Size
`de_tiny`	16d, 1 block	0.0111	0.00	25.0%	63.4% @ 0.24	0.24	119 KB
`de_small`	32d, 1 block	0.0097	0.00	47.6%	67.7% @ 0.30	0.30	163 KB
`de_medium`	128d, 2 blocks	0.0060	1.11	82.6%	74.5% @ 0.79	0.79	933 KB
`de_large`	256d, 3 blocks	0.0066	2.55	89.2%	51.3% @ 0.95	0.95	3.8 MB

"Stop Buddy" — English

Model	Size	AUT	FPPH	Recall@0.5	Optimal Recall	Optimal Threshold	ONNX Size
`stop_buddy_en_medium`	128d, 2 blocks	0.0339	1.13	49.3%	90.4% @ 0.01	0.01	933 KB
`stop_buddy_en_large`	256d, 3 blocks	0.0272	7.85	64.8%	87.4% @ 0.02	0.02	3.8 MB
`stop_buddy_en_large_v2`	256d, 3 blocks	0.0117	4.68	81.0%	58.7% @ 0.93	0.93	3.8 MB
`stop_buddy_en_large_v3`	256d, 3 blocks	0.0232	3.39	60.0%	89.5% @ 0.02	0.02	3.8 MB

"Go Buddy" — English

Model	Size	AUT	FPPH	Recall@0.5	Optimal Recall	Optimal Threshold	ONNX Size
`go_buddy_en_medium`	128d, 2 blocks	0.0385	0.62	45.7%	87.8% @ 0.01	0.01	933 KB
`go_buddy_en_large`	256d, 3 blocks	0.0344	5.69	62.1%	88.5% @ 0.01	0.01	3.8 MB
`go_buddy_en_large_v2`	256d, 3 blocks	0.0153	5.22	81.5%	92.8% @ 0.02	0.02	3.8 MB
`go_buddy_en_large_v3`	256d, 3 blocks	0.0230	3.71	63.2%	89.3% @ 0.02	0.02	3.8 MB

Understanding the Metrics

Every wake word detector faces a fundamental tradeoff: if you make it more sensitive (catch more real wake words), it will also trigger more often on things that aren't the wake word. The metrics below capture different aspects of this tradeoff.

AUT (Area Under the DET Curve) — Lower is better. Range: 0 to 1.

This is the single best number to compare models. The DET (Detection Error Tradeoff) curve plots the miss rate against the false alarm rate at every possible threshold. AUT is the area under that curve. A perfect model that never misses and never false-triggers would score 0. In practice, a score below 0.01 is excellent, 0.01–0.04 is decent, and above 0.05 starts to be problematic. Think of it as: "across all possible sensitivity settings, how often does this model make mistakes?"
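
As a rough illustration of the definition above, the following sketch sweeps a threshold grid, collects (false alarm rate, miss rate) points, and integrates them with the trapezoidal rule. The exact computation used by livekit-wakeword (threshold grid, axis scaling, whether the false alarm axis is a rate or FPPH) is not documented here, so treat the function names and details as assumptions.

```python
def det_points(positive_scores, negative_scores, steps=100):
    """(false_alarm_rate, miss_rate) pairs, swept from high to low threshold."""
    points = []
    for i in range(steps, -1, -1):
        t = i / steps
        miss_rate = sum(s <= t for s in positive_scores) / len(positive_scores)
        false_alarm_rate = sum(s > t for s in negative_scores) / len(negative_scores)
        points.append((false_alarm_rate, miss_rate))
    return points


def area_under_det(points):
    """Trapezoidal area under miss rate as a function of false alarm rate."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy scores: one negative (0.5) outscores one positive (0.2).
aut = area_under_det(det_points([0.9, 0.2], [0.5, 0.1]))
print(f"AUT: {aut:.4f}")  # AUT: 0.2500
```

With perfectly separated scores the same code returns 0, which matches the "perfect model" case described above.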

FPPH (False Positives Per Hour) — Lower is better. Measured at threshold=0.5.

This answers: "if I leave this running on normal audio (speech, music, silence, etc.), how many times per hour will it falsely think it heard the wake word?" For a usable voice assistant, you want this below 1. A value of 0.00 means zero false triggers at the default threshold in the test set. Note: the test set is finite (~20 hours), so 0.00 means "none observed" rather than "mathematically impossible."
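
The arithmetic behind FPPH is simple enough to show directly. This sketch counts every score above the threshold as one false trigger; how the evaluation groups consecutive high-scoring windows into a single trigger is not stated here, so that detail is an assumption.

```python
def false_positives_per_hour(negative_scores, audio_hours, threshold=0.5):
    """Count scores above the threshold on wake-word-free audio, per hour."""
    false_positives = sum(score > threshold for score in negative_scores)
    return false_positives / audio_hours


# Hypothetical scores from 4 hours of audio that contains no wake word.
window_scores = [0.02, 0.10, 0.61, 0.05, 0.48, 0.93, 0.01]
fpph = false_positives_per_hour(window_scores, audio_hours=4.0)
print(f"FPPH at 0.5: {fpph:.2f}")  # FPPH at 0.5: 0.50
```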

Recall@0.5 — Higher is better. Measured at threshold=0.5.

This answers: "when someone actually says the wake word, what percentage of the time does the model detect it?" A recall of 86.8% means the model catches about 87 out of 100 real wake words at the default threshold. The remaining 13% are missed. Higher is better, but pushing recall too high usually increases false positives too.

Optimal Recall — Higher is better. Measured at the optimal threshold.

This is the recall at the "optimal threshold" — the threshold that the evaluation found gives the best recall while keeping false positives below a target rate (0.1 FPPH). This shows the model's best achievable performance in a practical deployment setting.
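
A minimal sketch of how such a threshold could be chosen: scan a threshold grid, discard thresholds whose FPPH is not below the target, and keep the one with the highest recall. The grid resolution and the tie-breaking rule (lowest qualifying threshold wins) are assumptions, not the evaluation script's documented behavior.

```python
def pick_threshold(positive_scores, negative_scores, audio_hours,
                   target_fpph=0.1, steps=100):
    """Return (threshold, recall) with the best recall while FPPH < target."""
    best = None
    for i in range(steps + 1):
        t = i / steps
        fpph = sum(s > t for s in negative_scores) / audio_hours
        if fpph >= target_fpph:
            continue  # too many false triggers at this threshold
        recall = sum(s > t for s in positive_scores) / len(positive_scores)
        if best is None or recall > best[1]:
            best = (t, recall)
    return best


# Toy data: with only 5 hours of audio, a single false trigger is 0.2 FPPH,
# so the threshold must rise above the 0.7 negative, which costs recall.
positives = [0.9, 0.8, 0.6, 0.3]
negatives = [0.7, 0.2, 0.1, 0.05]
print(pick_threshold(positives, negatives, audio_hours=5.0))  # (0.7, 0.5)
```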

Optimal Threshold — Closer to 0.5 is generally better.

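The detection threshold where the model performs best (highest recall while staying below the FPPH target). A moderate-to-high optimal threshold (like 0.76 or 0.88) means the model produces well-separated scores: genuine wake words score high and everything else scores low, which is ideal. A very low optimal threshold (like 0.01) means the scores are not well separated: genuine wake words often score close to zero, so the model has to fire on very weak evidence to catch them. A very high optimal threshold (like 0.95 for de_large) points to the opposite problem: some non-wake-word audio also scores high, so the threshold has to be pushed up to meet the FPPH target and recall suffers (51.3% for de_large).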

How to read the tables — a practical example:

Take en_medium (AUT=0.0062, FPPH=0.54, Recall@0.5=86.8%, Optimal=79.6% @ 0.76):

At the default threshold of 0.5, it catches 86.8% of wake words but triggers ~0.5 false alarms per hour
At the optimal threshold of 0.76, it catches 79.6% of wake words with fewer than 0.1 false alarms per hour
The AUT of 0.0062 confirms this is a high-quality model overall

Compare with en_tiny (AUT=0.0087, FPPH=0.00, Recall@0.5=39.6%, Optimal=66.1% @ 0.32):

At threshold 0.5, it catches only 39.6% of wake words — too conservative
Lowering the threshold to 0.32 improves recall to 66.1% with acceptable false positives
The higher AUT (0.0087 vs 0.0062) confirms it's a weaker model overall — expected given its smaller size

Architecture

All models use the conv_attention classifier architecture from livekit-wakeword:

Input: Pre-extracted speech embeddings of shape (batch, 16, 96) from the frozen Google speech_embedding model
Architecture: Conv1D layers + Multi-head Attention + Mean Pooling + Linear head + Sigmoid
Output: Confidence score in [0, 1]

The full inference pipeline is:

Audio (16 kHz mono) → Mel spectrogram (ONNX frontend) → Speech embeddings (N, 96) (ONNX encoder) → Pad/truncate to (16, 96) → Classifier (this model) → Score [0, 1]

The mel spectrogram and speech embedding ONNX models are bundled with the livekit-wakeword package in resources/.

Usage

With livekit-wakeword (Recommended)

pip install livekit-wakeword

from livekit.wakeword import WakeWordDetector
import numpy as np

# Load model
detector = WakeWordDetector.from_pretrained("laion/bud-e_wakeword-models_livekit-wakeword", model_name="en_large")

# Process audio (16 kHz, mono, float32).
# Random noise stands in for a real 2-second recording here.
audio = np.random.randn(32000).astype(np.float32)
score = detector.detect(audio)
print(f"Wake word confidence: {score:.3f}")

# Use optimal threshold from training
if score > 0.88:  # optimal_threshold for en_large
    print("Wake word detected!")
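
Since the optimal threshold differs a lot between models, it can help to keep the values from the tables above in one place. The helper below is a hypothetical convenience, not part of livekit-wakeword; only the threshold values come from this README.

```python
# Optimal thresholds copied from the model tables above.
OPTIMAL_THRESHOLDS = {
    "en_tiny": 0.32, "en_small": 0.36, "en_medium": 0.76, "en_medium_v2": 0.01,
    "en_large": 0.88, "en_large_v2": 0.80, "en_large_v3": 0.02,
    "de_tiny": 0.24, "de_small": 0.30, "de_medium": 0.79, "de_large": 0.95,
    "stop_buddy_en_medium": 0.01, "stop_buddy_en_large": 0.02,
    "stop_buddy_en_large_v2": 0.93, "stop_buddy_en_large_v3": 0.02,
    "go_buddy_en_medium": 0.01, "go_buddy_en_large": 0.01,
    "go_buddy_en_large_v2": 0.02, "go_buddy_en_large_v3": 0.02,
}


def is_wake_word(score, model_name, default_threshold=0.5):
    """Compare a score against the model's optimal threshold (0.5 if unknown)."""
    return score > OPTIMAL_THRESHOLDS.get(model_name, default_threshold)


print(is_wake_word(0.80, "en_medium"))  # True  (0.80 > 0.76)
print(is_wake_word(0.80, "en_large"))   # False (0.80 <= 0.88)
```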

Direct ONNX Inference

import onnxruntime as ort
import numpy as np

# Load classifier
session = ort.InferenceSession("en_large/hey_buddy_en_large.onnx")

# Input: pre-extracted speech embeddings (batch, 16, 96).
# Random values stand in for real embeddings here.
embeddings = np.random.randn(1, 16, 96).astype(np.float32)

# Run inference
score = session.run(["score"], {"embeddings": embeddings})[0]
print(f"Score: {score[0, 0]:.4f}")

Full Pipeline (Manual)

For custom integration without the livekit-wakeword package:

import onnxruntime as ort
import numpy as np
import librosa

# Load pipeline models (from livekit-wakeword resources/)
mel_session = ort.InferenceSession("melspectrogram.onnx")
embed_session = ort.InferenceSession("speech_embedding.onnx")
classifier_session = ort.InferenceSession("en_large/hey_buddy_en_large.onnx")

# 1. Load audio
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)

# 2. Compute mel spectrogram
mel = mel_session.run(None, {"audio": audio.reshape(1, -1)})[0]

# 3. Extract embeddings
embeddings = embed_session.run(None, {"mel": mel})[0]  # (N, 96)

# 4. Pad/truncate to (16, 96)
if embeddings.shape[0] >= 16:
    embeddings = embeddings[-16:]
else:
    pad = np.zeros((16 - embeddings.shape[0], 96), dtype=np.float32)
    embeddings = np.concatenate([pad, embeddings])

# 5. Run classifier
score = classifier_session.run(
    ["score"],
    {"embeddings": embeddings.reshape(1, 16, 96)}
)[0][0, 0]

print(f"Wake word score: {score:.4f}")
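
The manual pipeline above scores only the last 16 embedding frames of a clip. For longer audio, one option is to slide the same (16, 96) window over the embedding sequence and score each position. The sketch below generalizes step 4 under that assumption; the function name and the `hop` parameter are hypothetical, and livekit-wakeword may handle streaming differently.

```python
import numpy as np


def embedding_windows(embeddings, window=16, hop=1):
    """Yield (window, dim) slices ending at each step, zero-padded on the left."""
    dim = embeddings.shape[1]
    for end in range(1, embeddings.shape[0] + 1, hop):
        chunk = embeddings[max(0, end - window):end]
        if chunk.shape[0] < window:
            pad = np.zeros((window - chunk.shape[0], dim), dtype=embeddings.dtype)
            chunk = np.concatenate([pad, chunk])
        yield chunk


# Stand-in for the (N, 96) output of the embedding model.
fake_embeddings = np.arange(20 * 96, dtype=np.float32).reshape(20, 96)
windows = list(embedding_windows(fake_embeddings))
print(len(windows), windows[0].shape)  # 20 (16, 96)
```

Each yielded window can be reshaped to (1, 16, 96) and passed to the classifier session exactly as in step 5.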

Training Data

Models were trained on synthetic speech data generated using up to 3 TTS backends for maximum diversity:

"Hey Buddy" — English

Piper VITS (en_US-lessac-medium): 4,000 positive train / 800 test — 904 speaker voices with SLERP blending
VoxCPM2: 3,000 positive train / 600 test — 29 voice design prompts x 4 CFG values x 3 timestep configs
ChatterboxTTS: 2,000 positive train / 400 test — 8 reference voices with varying exaggeration/temperature
Adversarial negatives: 4,000 train / 800 test — phonetically similar phrases ("hey body", "hey bunny", "hey baby", etc.)
Background noise: 1,000 train / 200 test — from MUSAN
General negatives: ACAV100M ~2000 hrs pre-extracted speech features

"Hey Buddy" — German

VoxCPM2: 3,000 positive train / 600 test
ChatterboxTTS: 2,000 positive train / 400 test
Adversarial negatives: 3,000 train / 600 test
Background noise: 1,000 train / 200 test
General negatives: ACAV100M ~2000 hrs

"Stop Buddy" / "Go Buddy" — English

Piper VITS (en_US-lessac-medium): 4,000 positive train / 800 test
VoxCPM2: 3,000 positive train / 600 test — 29 voice design prompts x 4 CFG values x 3 timestep configs
ChatterboxTTS: 2,000 positive train / 400 test — reference voice cloning
Adversarial negatives: ~4,000 train / 800 test; custom confusable phrases per wake word (e.g., "stop body" for "stop buddy", "go bunny" for "go buddy", and "go buddy" as a negative for "stop buddy")
Background noise: 1,000 train / 200 test — from MUSAN
General negatives: ACAV100M ~2000 hrs pre-extracted speech features

Augmentation

"Hey Buddy" v1 and German models — 3 rounds of compounding augmentation:

7-band parametric EQ (25% probability)
Tanh distortion (25% probability)
Room impulse response convolution (50% probability, MIT RIRs)
Background noise mixing (SNR 5-15 dB)
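
Mixing noise at a target SNR comes down to scaling the noise so that the ratio of speech power to noise power matches the requested value in decibels. A minimal sketch, assuming the SNR is drawn uniformly from the 5-15 dB range (the sampling distribution is not stated above) and using random arrays as stand-ins for real clips:

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mix has the requested SNR in dB."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose scale so that speech_power / (scale^2 * noise_power) == 10^(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise


rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for a 1-second clip at 16 kHz
noise = rng.standard_normal(16000)    # stand-in for a background noise clip
snr_db = rng.uniform(5.0, 15.0)       # assumed uniform over the listed range
mixed = mix_at_snr(speech, noise, snr_db)
```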

"Hey Buddy" v2, "Stop Buddy", "Go Buddy" — 5 rounds of enhanced compounding augmentation:

7-band parametric EQ (25% probability)
Tanh distortion (25% probability)
Time stretch (35% probability, 0.5x–1.8x rate)
Pitch shift (20% probability, ±2 semitones)
Low-pass filter (20% probability, 4–7.5 kHz cutoff)
Gaussian noise (20% probability)
8 kHz downsample round-trip (15% probability) — simulates low-quality audio
Room impulse response convolution (50% probability, MIT RIRs)
Background noise mixing (SNR 5-15 dB)
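
One way to read the list above is as a chain in which each step fires independently with its listed probability. The sketch below only samples which steps would run in each round; it does not process audio. The step names are hypothetical labels, background noise is assumed to always apply because no probability is listed for it, and whether "compounding" means each round is applied to the previous round's output is not specified here.

```python
import random

# (step name, probability) pairs taken from the enhanced augmentation list.
ENHANCED_CHAIN = [
    ("parametric_eq", 0.25),
    ("tanh_distortion", 0.25),
    ("time_stretch", 0.35),
    ("pitch_shift", 0.20),
    ("low_pass", 0.20),
    ("gaussian_noise", 0.20),
    ("downsample_8k_roundtrip", 0.15),
    ("room_impulse_response", 0.50),
    ("background_noise", 1.00),  # assumption: always applied
]


def sample_round(rng, chain=ENHANCED_CHAIN):
    """Pick the steps that fire in one augmentation round, in chain order."""
    return [name for name, probability in chain if rng.random() < probability]


def sample_plan(rng, rounds=5):
    """Sample an independent step selection for each round."""
    return [sample_round(rng) for _ in range(rounds)]


plan = sample_plan(random.Random(0))
for index, steps in enumerate(plan, start=1):
    print(f"round {index}: {steps}")
```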

Additionally, "Hey Buddy" v2 includes explicit time-stretched variants in the training set: 25% of positive clips slowed to 0.4x–0.95x, 25% sped up to 1.05x–2.0x, 10% pitch-shifted ±3 semitones, and 10% with combined speed+pitch changes.
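
The percentages above add up to 70%, which suggests the remaining 30% of positive clips are left at their original speed and pitch. The sketch below samples a variant per clip under that reading. Uniform sampling within each range, and the rate range used for the combined speed+pitch case, are assumptions.

```python
import random


def sample_variant(rng):
    """Return the speed rate and pitch shift (in semitones) for one clip."""
    u = rng.random()
    if u < 0.25:   # 25%: slowed down
        return {"rate": rng.uniform(0.4, 0.95), "semitones": 0.0}
    if u < 0.50:   # 25%: sped up
        return {"rate": rng.uniform(1.05, 2.0), "semitones": 0.0}
    if u < 0.60:   # 10%: pitch shift only
        return {"rate": 1.0, "semitones": rng.uniform(-3.0, 3.0)}
    if u < 0.70:   # 10%: combined (assumed to reuse the ranges above)
        slow = rng.random() < 0.5
        rate = rng.uniform(0.4, 0.95) if slow else rng.uniform(1.05, 2.0)
        return {"rate": rate, "semitones": rng.uniform(-3.0, 3.0)}
    return {"rate": 1.0, "semitones": 0.0}   # remaining 30%: unchanged


rng = random.Random(0)
variants = [sample_variant(rng) for _ in range(10000)]
unchanged = sum(v["rate"] == 1.0 and v["semitones"] == 0.0 for v in variants)
print(f"unchanged fraction: {unchanged / len(variants):.2f}")
```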

Training

3-phase adaptive training with focal loss (gamma=2.0)
Embedding mixup regularization (alpha=0.2)
Label smoothing (epsilon=0.05)
Cosine warmup + decay learning rate schedule
Negative class weight ramp from 1 to 3000
Checkpoint averaging over best validation checkpoints
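
Three of the ingredients above can be written down compactly. The sketch below is an illustration under stated assumptions rather than the toolkit's training code: the label smoothing formula, the linear warmup shape, the linear shape of the negative weight ramp, and the `peak_lr` value are all assumptions; only gamma=2.0, epsilon=0.05, and the 1 to 3000 weight range come from the list above.

```python
import math


def smooth_label(y, epsilon=0.05):
    """Binary label smoothing: pull hard 0/1 targets slightly toward 0.5."""
    return y * (1.0 - epsilon) + 0.5 * epsilon


def focal_loss(p, y, gamma=2.0, neg_weight=1.0, eps=1e-7):
    """Binary focal loss for predicted probability p and (soft) target y."""
    p = min(max(p, eps), 1.0 - eps)
    pos_term = -y * (1.0 - p) ** gamma * math.log(p)
    neg_term = -(1.0 - y) * p ** gamma * math.log(1.0 - p)
    return pos_term + neg_weight * neg_term


def cosine_warmup_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def negative_weight(step, total_steps, start=1.0, end=3000.0):
    """Ramp the negative class weight from start to end (assumed linear)."""
    fraction = min(1.0, step / max(1, total_steps - 1))
    return start + (end - start) * fraction


# A confident false alarm late in training is penalized far more than early on.
early = focal_loss(0.9, smooth_label(0.0), neg_weight=negative_weight(0, 1000))
late = focal_loss(0.9, smooth_label(0.0), neg_weight=negative_weight(999, 1000))
print(f"early: {early:.3f}, late: {late:.3f}")
```

The focal term down-weights examples the model already classifies well, while the growing negative weight shifts the emphasis toward suppressing false alarms as training progresses.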

File Structure

├── README.md
├── configs/                           # Training YAML configs
│   ├── hey_buddy_en_base.yaml
│   ├── hey_buddy_de_base.yaml
│   ├── hey_buddy_{en,de}_{tiny,small,medium,large}.yaml
│   ├── hey_buddy_en_medium_v2.yaml
│   ├── stop_buddy_en_medium.yaml
│   ├── stop_buddy_en_large.yaml
│   ├── go_buddy_en_medium.yaml
│   └── go_buddy_en_large.yaml
├── en_tiny/
│   ├── hey_buddy_en_tiny.onnx         # ONNX model (119 KB)
│   ├── hey_buddy_en_tiny.pt           # PyTorch state dict
│   ├── hey_buddy_en_tiny_eval.json    # Evaluation metrics
│   ├── hey_buddy_en_tiny_det.png      # DET curve plot
│   └── hey_buddy_en_tiny_metrics.json
├── en_small/
├── en_medium/
├── en_medium_v2/                      # Enhanced time-stretch augmentation
├── en_large/
├── en_large_v2/                       # Cross-adversarial training
├── en_large_v3/                       # + confusable-phrase adversarial
├── de_tiny/
├── de_small/
├── de_medium/
├── de_large/
├── stop_buddy_en_medium/              # "Stop Buddy" wake word (medium)
├── stop_buddy_en_large/               # "Stop Buddy" wake word (large)
├── stop_buddy_en_large_v2/            # "Stop Buddy" with cross-adversarial negatives
├── stop_buddy_en_large_v3/            # + confusable-phrase adversarial
├── go_buddy_en_medium/                # "Go Buddy" wake word (medium)
├── go_buddy_en_large/                 # "Go Buddy" wake word (large)
├── go_buddy_en_large_v2/              # "Go Buddy" with cross-adversarial negatives
└── go_buddy_en_large_v3/              # + confusable-phrase adversarial

Recommended Models

For production/edge: en_medium or de_medium — best balance of recall vs false positive rate
For quality-first: en_large or de_large — highest recall but higher FPPH
For resource-constrained: en_small or de_small — zero FPPH, moderate recall
For speed-robust detection: en_medium_v2 — handles fast and slow speech better than v1
For voice commands: stop_buddy_en_large and go_buddy_en_large — companion wake words for Bud-E control (or medium variants for lower latency)

License

Apache 2.0

Citation

@misc{bude-wakeword-2026,
  title={Bud-E Wake Word Models},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/bud-e_wakeword-models_livekit-wakeword}
}

Acknowledgments

livekit-wakeword toolkit by LiveKit
VoxCPM2 TTS by OpenBMB
ChatterboxTTS by Resemble AI
Piper TTS
ACAV100M speech features
