---
license: apache-2.0
language:
- zh
- en
widget:
- text: TurnSense three-class speech turn detection demo
output:
url: image/PR_new.mp4
---
<div align="center">
<img src="./image/Baiji_Team.png" alt="Baiji Team Logo" width="1000" height="500"/>
<br/>
# TurnSense
### 🎯 Lightweight · Accurate · Three-Class — Redefining Speech Turn Detection
<br/>
<center><strong>47M parameters | CPU latency ~55 ms | F1 up to 96.35% | Invalid-utterance filtering</strong></center>
<br/>
[GitHub](https://github.com/Bairong-Xdynamics/TurnSense)
[Hugging Face](https://huggingface.co/brgroup/TurnSense)
[License](./LICENSE)
</div>
<br/>
**Language**: **English** | [中文](./README_zh.md)
<br/>
> **⭐ If TurnSense is useful to you, please give us a Star!** It helps us keep improving the model and documentation.
<br/>
## 📖 Table of Contents
- [Why TurnSense](#-why-turnsense)
- [Overview](#-overview)
- [Key Features](#-key-features)
- [Model Size Comparison](#-model-size-comparison)
- [Benchmark Results](#-benchmark-results)
- [Quick Start](#-quick-start)
- [Evaluation Guide](#-evaluation-guide)
- [Citation](#-citation)
- [Contact & Community](#-contact--community)
- [License](#-license)
<br/>
---
<br/>
## 🏆 Why TurnSense
<div align="center">
| Dimension | TurnSense Performance |
| :---: | :---: |
| 🎯 **Accuracy** | F1 **96.35%** on easyturn_real_test_ZH (complete class), the highest among the compared models |
| ⚡ **Inference Latency** | CPU p50 ≈ **54.65ms** — real-time interaction ready |
| 📦 **Model Size** | Only **47M** parameters, INT8 version only **~50MB** |
| 🧠 **Classification** | First open-source model natively supporting **complete / incomplete / invalid** three-class detection |
| 🚫 **Invalid Filtering** | Invalid utterance F1 reaches **94.34%**, effectively suppressing noise-triggered responses |
| 🤗 **Open-Source Friendly** | FP32 / INT8 ONNX provided, ready to use out of the box |
</div>
<br/>
---
<br/>
## 📌 Overview
**TurnSense** is a **three-class semantic detection model** designed for human-machine voice interaction, focused on solving a critical problem in dialogue systems:
> **During a user's speech, should the system respond immediately, or continue waiting?**
Traditional approaches typically rely on a simple binary classification — "finished or not." **TurnSense goes further** by simultaneously modeling semantic completeness and invalid input detection, enabling more natural turn-taking in complex real-world scenarios and **significantly reducing false interruptions, premature responses, and noise-triggered activations**.
<div align="center">
<img src="./image/TurnSense.svg" alt="TurnSense Three-Class Illustration" width="820"/>
</div>
<div align="center">
<video src="https://huggingface.co/brgroup/TurnSense/resolve/main/image/PR_new.mp4"
width="820"
height="400"
controls
muted
loop
autoplay
playsinline>
</video>
</div>
TurnSense classifies user input into three semantic states:
| State | Description | Example |
| :---: | :--- | :--- |
| ✅ **Complete** | The user has expressed a complete intent; the system can respond | `"Check tomorrow's weather in Shanghai for me."` |
| ⏳ **Incomplete** | The user's expression is unfinished — truncated, paused, or trailing off | `"I'd like to ask about that order from yesterday..."` |
| 🔇 **Invalid** | The input does not constitute meaningful speech and should not trigger a response | `"...(continuous noise / non-verbal vocalization)"` |
These three labels enable the system to determine not only **"should I respond?"** but also **"is it worth responding to?"** — significantly improving interaction naturalness and system stability in voice assistants, real-time calls, intelligent customer service, and more.
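One way a dialogue loop might act on the three labels is sketched below. The `TurnAction` enum, the `decide` function, and the `max_wait_ms` timeout are illustrative assumptions and not part of TurnSense; only the label strings come from the table above.

```python
from enum import Enum


class TurnAction(Enum):
    RESPOND = "respond"  # hand the utterance to the downstream responder
    WAIT = "wait"        # keep listening for the rest of the utterance
    IGNORE = "ignore"    # drop the input without responding


def decide(label: str, silence_ms: int, max_wait_ms: int = 2000) -> TurnAction:
    """Map a TurnSense label to a turn-taking action (hypothetical policy)."""
    if label == "complete":
        return TurnAction.RESPOND
    if label == "incomplete":
        # Assumed fallback: respond anyway once the user has been silent too long.
        return TurnAction.RESPOND if silence_ms >= max_wait_ms else TurnAction.WAIT
    if label == "invalid":
        return TurnAction.IGNORE
    raise ValueError(f"unknown label: {label!r}")
```

The timeout on `incomplete` is a design choice for this sketch: without it, a user who trails off and never finishes would leave the system waiting indefinitely.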
<br/>
---
<br/>
## ✨ Key Features
### 🧠 Semantic-Level Three-Class Detection
Simultaneously models `complete / incomplete / invalid` states — closer to real conversational behavior than traditional binary classification, and currently the **only open-source solution with native invalid utterance detection**.
### ⚡ Ultra-Lightweight, Ultra-Fast Inference
Only **47M** parameters (INT8 version ~50MB). CPU inference latency: p50 ≈ **54.65ms**, p90 ≈ **58.00ms** — meets the strict requirements of real-time interaction **without a GPU**.
### 🎯 Leading Accuracy
Achieves **F1 96.35%** (complete) and **F1 96.32%** (incomplete) on easyturn_real_test_ZH (300 samples), and **F1 92.30%** (complete) and **F1 91.62%** (incomplete) on semantic_test_ZH (2000 samples). These are the highest F1 scores among the compared models on both test sets.
### 🚫 Invalid Input Filtering
On the NonverbalVocalization test set, invalid utterance precision reaches **100%** with recall of **90.37%** (F1 = 94.34%), effectively suppressing false triggers from non-verbal sounds and noise.
### ⚖️ More Robust Turn Decisions
Balances precision and recall in semantically ambiguous, pause-heavy, or colloquial scenarios, reducing both premature responses and missed responses.
### 📊 Reproducible Evaluation Framework
Ships with a complete evaluation pipeline and scripts, supporting unified metric comparison and performance regression analysis for full reproducibility.
### 🤗 Open-Source Friendly, Plug-and-Play
Standardized repository structure with FP32 / INT8 ONNX models — from installation to inference in just a few minutes.
<br/>
---
<br/>
## 📐 Model Size Comparison
<div align="center">
| Model | Parameters | Three-Class | Link |
| :--- | :---: | :---: | :--- |
| TEN-Turn | **7B** | ❌ | [TEN-framework/TEN_Turn_Detection](https://huggingface.co/TEN-framework/TEN_Turn_Detection) |
| Easy-Turn | 850M | ❌ | [ASLP-lab/Easy-Turn](https://huggingface.co/ASLP-lab/Easy-Turn) |
| NAMO-Turn-Detector (ZH) | 66M | ❌ | [videosdk-live/Namo-Turn-Detector-v1-Multilingual](https://huggingface.co/videosdk-live/Namo-Turn-Detector-v1-Multilingual) |
| **⭐ TurnSense** | **47M** | **✅** | [**brgroup/TurnSense**](https://huggingface.co/brgroup/TurnSense) |
| Smart-Turn-v3 | 8M | ❌ | [pipecat-ai/smart-turn-v3](https://huggingface.co/pipecat-ai/smart-turn-v3) |
| FireRedChat-turn-detector | -- | ❌ | [FireRedTeam/FireRedChat-turn-detector](https://huggingface.co/FireRedTeam/FireRedChat-turn-detector) |
</div>
> 💡 At **47M** parameters, TurnSense is the only model in this comparison with three-class support. Among the models with a listed size, only Smart-Turn-v3 (8M) is smaller.
<br/>
---
<br/>
## 📊 Benchmark Results
> All results below are based on open-source Chinese evaluation sets. Latencies are in milliseconds. Values marked `(GPU)` were measured on a GPU; all others were measured on **CPU**.
<br/>
### 📋 easyturn_real_test_ZH (300 samples)
> Data source: Real data samples from [Easy-Turn-Testset](https://huggingface.co/datasets/ASLP-lab/Easy-Turn-Testset)
| Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** | p50 Latency (ms) | p90 Latency (ms) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Easy-Turn | 97.26% | 94.67% | 95.95% | 94.81% | 97.33% | 96.05% | 183.87 (GPU) | 300.37 (GPU) |
| Smart-Turn-v3 | 64.97% | 76.67% | 70.34% | 71.54% | 58.67% | 64.47% | 36.84 | 39.10 |
| TEN-Turn | **99.25%** | 88.00% | 93.29% | 89.22% | **99.33%** | 94.01% | 17.66 (GPU) | 19.41 (GPU) |
| FireRedChat | 70.65% | 94.67% | 80.91% | 91.92% | 60.67% | 73.09% | 98.30 | 99.42 |
| NAMO-Turn | 81.53% | 85.33% | 83.39% | 84.62% | 80.67% | 82.59% | 3.60 | 83.44 |
| **⭐ TurnSense** | 96.03% | **96.67%** | **🏆 96.35%** | **96.64%** | 96.00% | **🏆 96.32%** | 54.65 | 58.00 |
> **🔍 Key Finding:** TurnSense achieves the **highest F1** on both complete and incomplete classes, and is the only model with CPU p50 < 60ms while maintaining F1 > 96%.
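The F1 columns are the harmonic mean of the precision and recall columns. As a quick check, the sketch below recomputes the TurnSense F1 values from the P and R values in the table above; `f1_score` is a helper defined here for illustration, not a function from the repository.

```python
def f1_score(precision_pct: float, recall_pct: float) -> float:
    """Harmonic mean of precision and recall, both given in percent."""
    if precision_pct + recall_pct == 0:
        return 0.0
    return 2 * precision_pct * recall_pct / (precision_pct + recall_pct)


# TurnSense row of the easyturn_real_test_ZH table
print(round(f1_score(96.03, 96.67), 2))  # complete class   -> 96.35
print(round(f1_score(96.64, 96.00), 2))  # incomplete class -> 96.32
```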
<br/>
### 📋 semantic_test_ZH (2000 samples)
> Data source: Chinese test split from [KE-Team/SemanticVAD-Dataset](https://huggingface.co/datasets/KE-Team/SemanticVAD-Dataset)
| Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** | p50 Latency (ms) | p90 Latency (ms) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Easy-Turn | 78.14% | 98.30% | 87.07% | 97.64% | 70.30% | 81.74% | 183.87 (GPU) | 300.37 (GPU) |
| Smart-Turn-v3 | 59.25% | 88.10% | 70.85% | 76.80% | 39.40% | 52.08% | 36.84 | 39.10 |
| TEN-Turn | 85.25% | **99.60%** | 91.87% | **99.52%** | 82.70% | 90.33% | 17.66 (GPU) | 19.41 (GPU) |
| FireRedChat | 66.76% | 99.40% | 79.87% | 98.83% | 50.50% | 66.84% | 98.30 | 99.42 |
| NAMO-Turn | 71.48% | 86.70% | 78.36% | 83.10% | 65.40% | 73.20% | 3.60 | 83.44 |
| **⭐ TurnSense** | **88.96%** | 95.90% | **🏆 92.30%** | 95.55% | **88.00%** | **🏆 91.62%** | 54.65 | 58.00 |
> **🔍 Key Finding:** On the larger 2000-sample test set, TurnSense still maintains the best F1, demonstrating strong generalization capability.
<br/>
### 📋 NonverbalVocalization_invalid (728 samples)
> Data source: OpenSLR [Deeply Nonverbal Vocalization Dataset (SLR99)](https://openslr.elda.org/99/)
| Model | P (invalid) | R (invalid) | **F1 (invalid)** |
| :--- | :---: | :---: | :---: |
| **⭐ TurnSense** | **100.00%** | **90.37%** | **🏆 94.34%** |
> **🔍 Key Finding:** TurnSense is currently the only model that supports invalid utterance detection. A precision of **100%** means zero false positives — effectively preventing noise from triggering system responses.
<br/>
---
<br/>
## 🚀 Quick Start
### 1. Installation
```bash
git clone https://github.com/Bairong-Xdynamics/TurnSense.git
cd TurnSense
pip install -U numpy onnxruntime torch librosa soundfile pandas scikit-learn huggingface_hub
```
### 2. Model Weights
TurnSense model weights are available on Hugging Face: [brgroup/TurnSense](https://huggingface.co/brgroup/TurnSense)
| Version | Size | Use Case |
| :--- | :--- | :--- |
| FP32 | ~191 MB | Accuracy-first |
| INT8 | ~50 MB | Deployment-first (recommended) |
**Download Options:**
**Option 1: Auto-download (Recommended)**
The inference script includes built-in Hugging Face download logic. The model will be automatically fetched and cached on first run.
**Option 2: Git LFS**
```bash
git lfs install
git clone https://huggingface.co/brgroup/TurnSense
```
**Option 3: Hugging Face Hub**
```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="brgroup/TurnSense")
```
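After the download, the ONNX files can be located and opened with ONNX Runtime. The sketch below assumes the repository stores its models as `*.onnx` files; the file names are not stated here, so they are discovered rather than hard-coded. `find_onnx_files` and `describe_models` are illustrative helper names, not part of the repository.

```python
from pathlib import Path


def find_onnx_files(root: str) -> list[Path]:
    """Return all .onnx files under root, sorted for stable output."""
    return sorted(Path(root).rglob("*.onnx"))


def describe_models(repo_id: str = "brgroup/TurnSense") -> None:
    """Download the repo snapshot and print each model's input signature."""
    # Imported lazily so find_onnx_files works without these packages.
    import onnxruntime as ort
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id=repo_id)
    for model_path in find_onnx_files(local_dir):
        session = ort.InferenceSession(
            str(model_path), providers=["CPUExecutionProvider"]
        )
        for inp in session.get_inputs():
            print(model_path.name, inp.name, inp.shape, inp.type)


# describe_models()  # uncomment to run; requires network access
```

Printing the input names and shapes is a cheap way to see what a model expects before wiring it into a pipeline.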
### 3. Inference
```bash
python infer.py
```
Example output:
```
Loading model from brgroup/TurnSense...
Running inference on: "我想问一下那个订单就是昨天..."
Results:
Input: "我想问一下那个订单就是昨天..."
TurnSense Detection Result: "incomplete"
```
<br/>
---
<br/>
## 🧪 Evaluation Guide
### 1) Evaluation Pipeline
1. Load the `.jsonl` test dataset (line-by-line JSONL)
2. Warm up each model (default `warmup_iters=20`)
3. Run per-sample inference, collecting classification and performance metrics
4. Automatically generate summary and detail files
Output files include:
| File | Description |
| :--- | :--- |
| `report.md` | Summary evaluation report |
| `results.json` | Structured evaluation results |
| `config.json` | Evaluation configuration |
| `per_sample__*.jsonl` | Per-sample prediction details |
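Steps 2 and 3 of the pipeline (warm-up, then timed per-sample inference) can be sketched as follows. This is not the repository's `benchmark.py`: `measure_latency`, `percentile`, and the `predict` callable are hypothetical, and only the `warmup_iters=20` default is taken from the text above.

```python
import math
import time
from typing import Any, Callable, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latency(
    predict: Callable[[Any], Any],
    samples: Sequence[Any],
    warmup_iters: int = 20,
) -> dict:
    """Warm up the model, then time each sample and report p50/p90 in ms."""
    if not samples:
        raise ValueError("samples must not be empty")
    # Warm-up calls are not timed; they absorb one-off initialization cost.
    for i in range(warmup_iters):
        predict(samples[i % len(samples)])
    latencies_ms = []
    for sample in samples:
        start = time.perf_counter()
        predict(sample)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p90_ms": percentile(latencies_ms, 90),
    }
```

Reporting p90 alongside p50 shows tail latency, which matters more than the median for real-time interaction.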
### 2) Data Format (JSONL)
Each line is a JSON object containing at least the following fields:
| Field | Description |
| :--- | :--- |
| `audio_path` | Path to the audio file |
| `text` | Text content |
| `label` | Label (`complete` / `incomplete` / `invalid`) |
Example:
```jsonl
{"audio_path":"/001.wav","text":"帮我查一下明天上海天气","label":"complete"}
{"audio_path":"/002.wav","text":"我想问一下那个订单就是昨天...","label":"incomplete"}
{"audio_path":"/003.wav","text":"啊…嗯…(持续噪声)","label":"invalid"}
```
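A dataset in this format can be loaded and checked before evaluation with a few lines of standard-library Python. The sketch below is an assumption about how such validation might look, not the loader used by the benchmark script; `parse_jsonl`, `REQUIRED_FIELDS`, and `VALID_LABELS` are names introduced here.

```python
import json
from typing import Iterable, Iterator

REQUIRED_FIELDS = ("audio_path", "text", "label")
VALID_LABELS = {"complete", "incomplete", "invalid"}


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield validated records from JSONL lines, skipping blank lines."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        if record["label"] not in VALID_LABELS:
            raise ValueError(f"line {line_no}: unknown label {record['label']!r}")
        yield record
```

Because an open text file iterates line by line, a file handle opened with `encoding="utf-8"` can be passed directly as `lines`.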
### 3) Run Evaluation
```bash
python TurnSense/Turn_benchmark/benchmark.py
```
<br/>
---
<br/>
## 📚 Citation
If you use TurnSense in your research or product, please cite:
```bibtex
@misc{turnsense2026,
author = {Baiji Team},
title = {TurnSense: A Three-Class Semantic Detection Model for Complete, Incomplete, and Invalid Utterances},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/brgroup/TurnSense}},
}
```
<br/>
## ❓ Contact & Community
If you have questions or suggestions, feel free to reach out:
| Channel | Contact |
| :--- | :--- |
| 📧 Email | [huan.shen@brgroup.com](mailto:huan.shen@brgroup.com) · [yingao.wang@brgroup.com](mailto:yingao.wang@brgroup.com) · [wei.zou@brgroup.com](mailto:wei.zou@brgroup.com) |
| 💬 WeChat | h2538406363 |
| 👥 WeChat Group | Scan the QR code to join the group<br><img src="image/wechat.jpg" alt="WeChat group QR code" width="220" /> |
| 🐛 Issues | [GitHub Issues](https://github.com/Bairong-Xdynamics/TurnSense/issues) |
| 🔀 PR | [Pull Requests](https://github.com/Bairong-Xdynamics/TurnSense/pulls) |
<br/>
## 📄 License
This project is released under the **Apache License 2.0** with certain additional conditions. See [LICENSE](./LICENSE) for details.
<br/>
---
<div align="center">
**Built with ❤️ by [Baiji Team](https://github.com/Bairong-Xdynamics)**
</div>