Smart Turn v2 is an open‑source semantic Voice Activity Detection (VAD) model that tells you whether a speaker has finished their turn by analysing the raw waveform, not the transcript.
Compared with v1 it is:
| Use‑case | Why this model helps |
|---|---|
| Voice agents / chatbots | Wait to reply until the user has actually finished speaking. |
| Real‑time transcription + TTS | Avoid “double‑talk” by triggering TTS only when the user turn ends. |
| Call‑centre assist & analytics | Accurate segmentation for diarisation and sentiment pipelines. |
| Any project needing semantic VAD | Detects incomplete thoughts, filler words (“um …”, “えーと …”) and intonation cues ignored by classic energy‑based VAD. |
The model outputs a single probability; values ≥ 0.5 indicate the speaker has completed their utterance.
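
The decision rule above can be sketched in a few lines. This is only an illustration: the helper name `is_turn_complete` and its `threshold` parameter are hypothetical and not part of any published API, and the probabilities in the loop are made-up values rather than real model outputs.

```python
def is_turn_complete(probability: float, threshold: float = 0.5) -> bool:
    """Return True when the probability meets the completion threshold.

    Hypothetical helper: it only applies the ">= 0.5 means the turn is
    complete" rule described in the text.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"expected a probability in [0, 1], got {probability}")
    return probability >= threshold


# Made-up example probabilities, not real model outputs.
for p in (0.12, 0.5, 0.93):
    state = "turn complete" if is_turn_complete(p) else "still speaking"
    print(f"{p:.2f} -> {state}")
```

Keeping the threshold as a parameter would let an application trade faster replies against more interruptions, although 0.5 is the only value the text specifies.
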
The model is built on a wav2vec2 encoder. The wav2vec2 + linear configuration outperformed LSTM and deeper transformer variants during ablation studies.

| Source | Type | Languages |
|---|---|---|
| human_5_all | Human‑recorded | EN |
| human_convcollector_1 | Human‑recorded | EN |
| rime_2 | Synthetic (Rime) | EN |
| orpheus_midfiller_1 | Synthetic (Orpheus) | EN |
| orpheus_grammar_1 | Synthetic (Orpheus) | EN |
| orpheus_endfiller_1 | Synthetic (Orpheus) | EN |
| chirp3_1 | Synthetic (Google Chirp3 TTS) | 14 langs |
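
To make the "wav2vec2 + linear" idea more concrete, here is a toy sketch of how a single linear layer could turn encoder features into one probability. It is not the released implementation. Mean pooling over frames, the sigmoid, and all names and numbers (`mean_pool`, `linear_head`, the weights) are assumptions chosen for illustration, and the "frames" stand in for real encoder output.

```python
import math
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors into one utterance-level vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def linear_head(pooled: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Apply one linear layer, then a sigmoid, giving a value in (0, 1)."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy stand-in for encoder output: 3 frames of 4 features (made-up numbers).
frames = [
    [0.2, -0.1, 0.4, 0.0],
    [0.3, 0.1, 0.5, -0.2],
    [0.1, 0.0, 0.6, 0.2],
]
weights = [1.0, -0.5, 2.0, 0.3]  # hypothetical weights, not trained values
bias = -0.5

probability = linear_head(mean_pool(frames), weights, bias)
print(f"completion probability: {probability:.3f}")
```
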
All audio/text pairs are released on the pipecat‑ai/datasets hub.
| Lang | Acc % | Lang | Acc % |
|---|---|---|---|
| EN | 94.3 | IT | 94.4 |
| FR | 95.5 | KO | 95.5 |
| ES | 92.1 | PT | 95.5 |
| DE | 95.8 | TR | 96.8 |
| NL | 96.7 | PL | 94.6 |
| RU | 93.0 | HI | 91.2 |
| ZH | 87.2 | – | – |
Human English benchmark (human_5_all): 99 % accuracy.
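
As a quick way to summarise the per-language table, the snippet below computes the unweighted mean of the listed accuracies and picks out the weakest and strongest languages. The figures are copied from the table. The unweighted mean is only a convenience summary, since the per-language test-set sizes are not given here.

```python
# Per-language accuracies (%) copied from the table above.
accuracy_by_language = {
    "EN": 94.3, "IT": 94.4, "FR": 95.5, "KO": 95.5,
    "ES": 92.1, "PT": 95.5, "DE": 95.8, "TR": 96.8,
    "NL": 96.7, "PL": 94.6, "RU": 93.0, "HI": 91.2,
    "ZH": 87.2,
}

# Unweighted mean: every language counts equally, regardless of test-set size.
mean_accuracy = sum(accuracy_by_language.values()) / len(accuracy_by_language)
lowest = min(accuracy_by_language, key=accuracy_by_language.get)
highest = max(accuracy_by_language, key=accuracy_by_language.get)

print(f"mean: {mean_accuracy:.2f} %")
print(f"lowest: {lowest} ({accuracy_by_language[lowest]} %)")
print(f"highest: {highest} ({accuracy_by_language[highest]} %)")
```
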
| Device | Inference time |
|---|---|
| NVIDIA L40S | 12 ms |
| NVIDIA A100 | 19 ms |
| NVIDIA T4 (AWS g4dn.xlarge) | 75 ms |
| 16‑core x86_64 CPU (Modal) | 410 ms |
Please see the blog post and GitHub repo for more information on using the model, either standalone or with Pipecat.