Commit beb084c by YingaoWang-casia (verified) · 1 parent: 285c996

Update README.md

  1. README.md +372 -3
README.md CHANGED
@@ -1,3 +1,372 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ <div align="center">
2
+
3
+ <img src="./image/Baiji_Team.png" alt="Baiji Team Logo" width="400" height="200"/>
4
+
5
+ <br/>
6
+
7
+ # TurnSense
8
+
9
+ ### 🎯 Lightweight · Accurate · Three-Class — Redefining Speech Turn Detection
10
+
11
+ <br/>
12
+
13
+ ```
14
+ 47M Parameters | CPU Latency ~55ms | F1 up to 96.35% | Invalid Utterance Filtering
15
+ ```
16
+
17
+ <br/>
18
+
19
+ [![GitHub](https://img.shields.io/badge/GitHub-Baiji--Team/TurnSense-181717?style=for-the-badge&logo=github)](https://github.com/Baiji-Team/TurnSense)
20
+ [![Hugging Face](https://img.shields.io/badge/🤗_Hugging_Face-Baiji--Team-yellow?style=for-the-badge)](https://huggingface.co/Baiji-Team/TurnSense)
21
+ [![License](https://img.shields.io/badge/License-Apache_2.0-blue?style=for-the-badge)](./LICENSE)
22
+ [![PRs Welcome](https://img.shields.io/badge/PRs-Welcome-brightgreen?style=for-the-badge)](https://github.com/Baiji-Team/TurnSense)
23
+
24
+ </div>
25
+
26
+ <br/>
27
+
28
+ **Language**: **English** | [中文](./README_zh.md)
29
+
30
+ <br/>
31
+
32
+ > **⭐ If TurnSense is useful to you, please give us a Star!** It helps us keep improving the model and documentation.
33
+
34
+ <br/>
35
+
36
+ ## 📖 Table of Contents
37
+
38
+ - [Why TurnSense](#-why-turnsense)
39
+ - [Overview](#-overview)
40
+ - [Key Features](#-key-features)
41
+ - [Model Size Comparison](#-model-size-comparison)
42
+ - [Benchmark Results](#-benchmark-results)
43
+ - [Quick Start](#-quick-start)
44
+ - [Evaluation Guide](#-evaluation-guide)
45
+ - [Citation](#-citation)
46
+ - [Contact & Community](#-contact--community)
47
+ - [License](#-license)
48
+
49
+ <br/>
50
+
51
+ ---
52
+
53
+ <br/>
54
+
55
+ ## 🏆 Why TurnSense
56
+
57
+ <div align="center">
58
+
59
+ | Dimension | TurnSense Performance |
60
+ | :---: | :---: |
61
+ | 🎯 **Accuracy** | F1 **96.35%** (easyturn_real_test_ZH) — best in class |
62
+ | ⚡ **Inference Latency** | CPU p50 ≈ **54.65ms** — real-time interaction ready |
63
+ | 📦 **Model Size** | Only **47M** parameters, INT8 version only **~50MB** |
64
+ | 🧠 **Classification** | First open-source model natively supporting **complete / incomplete / invalid** three-class detection |
65
+ | 🚫 **Invalid Filtering** | Invalid utterance F1 reaches **94.34%**, effectively suppressing noise-triggered responses |
66
+ | 🤗 **Open-Source Friendly** | FP32 / INT8 ONNX provided, ready to use out of the box |
67
+
68
+ </div>
69
+
70
+ <br/>
71
+
72
+ ---
73
+
74
+ <br/>
75
+
76
+ ## 📌 Overview
77
+
78
+ **TurnSense** is a **three-class semantic detection model** designed for human-machine voice interaction, focused on solving a critical problem in dialogue systems:
79
+
80
+ > **During a user's speech, should the system respond immediately, or continue waiting?**
81
+
82
+ Traditional approaches typically rely on a simple binary classification — "finished or not." **TurnSense goes further** by simultaneously modeling semantic completeness and invalid input detection, enabling more natural turn-taking in complex real-world scenarios and **significantly reducing false interruptions, premature responses, and noise-triggered activations**.
83
+
84
+ <div align="center">
85
+ <img src="./image/TurnSense.png" alt="TurnSense Three-Class Illustration" width="820"/>
86
+ </div>
87
+
88
+ <br/>
89
+
90
+ TurnSense classifies user input into three semantic states:
91
+
92
+ | State | Description | Example |
93
+ | :---: | :--- | :--- |
94
+ | ✅ **Complete** | The user has expressed a complete intent; the system can respond | `"Check tomorrow's weather in Shanghai for me."` |
95
+ | ⏳ **Incomplete** | The user's expression is unfinished — truncated, paused, or trailing off | `"I'd like to ask about that order from yesterday..."` |
96
+ | 🔇 **Invalid** | The input does not constitute meaningful speech and should not trigger a response | `"...(continuous noise / non-verbal vocalization)"` |
97
+
98
+ These three labels enable the system to determine not only **"should I respond?"** but also **"is it worth responding to?"** — significantly improving interaction naturalness and system stability in voice assistants, real-time calls, intelligent customer service, and more.
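One way a dialogue system might act on these three labels is sketched below. The `TurnAction` enum and the `decide_action` function are hypothetical names chosen for illustration and are not part of the TurnSense code; only the three label strings come from this README.

```python
from enum import Enum


class TurnAction(Enum):
    RESPOND = "respond"  # user finished: hand the turn to the system
    WAIT = "wait"        # user is still speaking: keep listening
    IGNORE = "ignore"    # noise or non-verbal sound: drop the input


# Hypothetical mapping from TurnSense label strings to system actions.
_LABEL_TO_ACTION = {
    "complete": TurnAction.RESPOND,
    "incomplete": TurnAction.WAIT,
    "invalid": TurnAction.IGNORE,
}


def decide_action(label: str) -> TurnAction:
    """Map a predicted label to an action; unknown labels fall back to WAIT."""
    return _LABEL_TO_ACTION.get(label.strip().lower(), TurnAction.WAIT)


for label in ("complete", "incomplete", "invalid"):
    print(label, "->", decide_action(label).value)
```

Falling back to `WAIT` for an unrecognized label is a conservative design choice in this sketch: the system keeps listening rather than responding to something it could not classify.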
99
+
100
+ <br/>
101
+
102
+ ---
103
+
104
+ <br/>
105
+
106
+ ## ✨ Key Features
107
+
108
+ ### 🧠 Semantic-Level Three-Class Detection
109
+
110
+ Simultaneously models `complete / incomplete / invalid` states — closer to real conversational behavior than traditional binary classification, and currently the **only open-source solution with native invalid utterance detection**.
111
+
112
+ ### ⚡ Ultra-Lightweight, Ultra-Fast Inference
113
+
114
+ Only **47M** parameters (INT8 version ~50MB). CPU inference latency: p50 ≈ **54.65ms**, p90 ≈ **58.00ms** — meets the strict requirements of real-time interaction **without a GPU**.
115
+
116
+ ### 🎯 Leading Accuracy
117
+
118
+ Achieves **F1 96.35%** (complete) and **F1 96.32%** (incomplete) on easyturn_real_test_ZH (300 samples), and **F1 92.30%** (complete) and **F1 91.62%** (incomplete) on semantic_test_ZH (2000 samples) — best or runner-up among all comparable models.
119
+
120
+ ### 🚫 Invalid Input Filtering
121
+
122
+ On the NonverbalVocalization test set, invalid utterance precision reaches **100%** with recall of **90.37%** (F1 = 94.34%), effectively suppressing false triggers from non-verbal sounds and noise.
123
+
124
+ ### ⚖️ More Robust Turn Decisions
125
+
126
+ Balances precision and recall in semantically ambiguous, pause-heavy, or colloquial scenarios, reducing both premature responses and missed responses.
127
+
128
+ ### 📊 Reproducible Evaluation Framework
129
+
130
+ Ships with a complete evaluation pipeline and scripts, supporting unified metric comparison and performance regression analysis for full reproducibility.
131
+
132
+ ### 🤗 Open-Source Friendly, Plug-and-Play
133
+
134
+ Standardized repository structure with FP32 / INT8 ONNX models — from installation to inference in just a few minutes.
135
+
136
+ <br/>
137
+
138
+ ---
139
+
140
+ <br/>
141
+
142
+ ## 📐 Model Size Comparison
143
+
144
+ <div align="center">
145
+
146
+ | Model | Parameters | Three-Class | Link |
147
+ | :--- | :---: | :---: | :--- |
148
+ | TEN-Turn | **7B** | ❌ | [TEN-framework/TEN_Turn_Detection](https://huggingface.co/TEN-framework/TEN_Turn_Detection) |
149
+ | Easy-Turn | 850M | ❌ | [ASLP-lab/Easy-Turn](https://huggingface.co/ASLP-lab/Easy-Turn) |
150
+ | NAMO-Turn-Detector (ZH) | 66M | ❌ | [videosdk-live/Namo-Turn-Detector-v1-Multilingual](https://huggingface.co/videosdk-live/Namo-Turn-Detector-v1-Multilingual) |
151
+ | **⭐ TurnSense** | **47M** | **✅** | [**Baiji-Team/TurnSense**](https://huggingface.co/Baiji-Team/TurnSense) |
152
+ | Smart-Turn-v3 | 8M | ❌ | [pipecat-ai/smart-turn-v3](https://huggingface.co/pipecat-ai/smart-turn-v3) |
153
+ | FireRedChat-turn-detector | -- | ❌ | [FireRedTeam/FireRedChat-turn-detector](https://huggingface.co/FireRedTeam/FireRedChat-turn-detector) |
154
+
155
+ </div>
156
+
157
+ > 💡 With only **47M** parameters, TurnSense is the only model in this table with three-class support, and the second smallest among those with a listed parameter count.
158
+
159
+ <br/>
160
+
161
+ ---
162
+
163
+ <br/>
164
+
165
+ ## 📊 Benchmark Results
166
+
167
+ > All results below are based on open-source Chinese evaluation sets. Latencies are reported in milliseconds: values marked `(GPU)` were measured on a GPU, and all others on **CPU**.
168
+
169
+ <br/>
170
+
171
+ ### 📋 easyturn_real_test_ZH (300 samples)
172
+
173
+ > Data source: Real data samples from [Easy-Turn-Testset](https://huggingface.co/datasets/ASLP-lab/Easy-Turn-Testset)
174
+
175
+ | Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** | p50 Latency (ms) | p90 Latency (ms) |
176
+ | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
177
+ | Easy-Turn | 97.26% | 94.67% | 95.95% | 94.81% | 97.33% | 96.05% | 183.87 (GPU) | 300.37 (GPU) |
178
+ | Smart-Turn-v3 | 64.97% | 76.67% | 70.34% | 71.54% | 58.67% | 64.47% | 36.84 | 39.10 |
179
+ | TEN-Turn | **99.25%** | 88.00% | 93.29% | 89.22% | **99.33%** | 94.01% | 17.66 (GPU) | 19.41 (GPU) |
180
+ | FireRedChat | 70.65% | 94.67% | 80.91% | 91.92% | 60.67% | 73.09% | 98.30 | 99.42 |
181
+ | NAMO-Turn | 81.53% | 85.33% | 83.39% | 84.62% | 80.67% | 82.59% | 3.60 | 83.44 |
182
+ | **⭐ TurnSense** | 96.03% | **96.67%** | **🏆 96.35%** | **96.64%** | 96.00% | **🏆 96.32%** | 54.65 | 58.00 |
183
+
184
+ > **🔍 Key Finding:** TurnSense achieves the **highest F1** on both complete and incomplete classes, and is the only model with CPU p50 < 60ms while maintaining F1 > 96%.
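The F1 columns are the harmonic mean of the precision and recall columns. As a minimal check, the sketch below recomputes the two TurnSense F1 values from the P and R figures in the table above; the function name `f1_score` is just an illustrative helper, not code from the repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# TurnSense row from the easyturn_real_test_ZH table.
complete_f1 = f1_score(0.9603, 0.9667)
incomplete_f1 = f1_score(0.9664, 0.9600)

print(f"F1 (complete):   {complete_f1:.2%}")
print(f"F1 (incomplete): {incomplete_f1:.2%}")
```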
185
+
186
+ <br/>
187
+
188
+ ### 📋 semantic_test_ZH (2000 samples)
189
+
190
+ > Data source: Chinese test split from [KE-Team/SemanticVAD-Dataset](https://huggingface.co/datasets/KE-Team/SemanticVAD-Dataset)
191
+
192
+ | Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** | p50 Latency (ms) | p90 Latency (ms) |
193
+ | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
194
+ | Easy-Turn | 78.14% | 98.30% | 87.07% | 97.64% | 70.30% | 81.74% | 183.87 (GPU) | 300.37 (GPU) |
195
+ | Smart-Turn-v3 | 59.25% | 88.10% | 70.85% | 76.80% | 39.40% | 52.08% | 36.84 | 39.10 |
196
+ | TEN-Turn | 85.25% | **99.60%** | 91.87% | **99.52%** | 82.70% | 90.33% | 17.66 (GPU) | 19.41 (GPU) |
197
+ | FireRedChat | 66.76% | 99.40% | 79.87% | 98.83% | 50.50% | 66.84% | 98.30 | 99.42 |
198
+ | NAMO-Turn | 71.48% | 86.70% | 78.36% | 83.10% | 65.40% | 73.20% | 3.60 | 83.44 |
199
+ | **⭐ TurnSense** | **88.96%** | 95.90% | **🏆 92.30%** | 95.55% | **88.00%** | **🏆 91.62%** | 54.65 | 58.00 |
200
+
201
+ > **🔍 Key Finding:** On the larger 2000-sample test set, TurnSense still maintains the best F1, demonstrating strong generalization capability.
202
+
203
+ <br/>
204
+
205
+ ### 📋 NonverbalVocalization_invalid (728 samples)
206
+
207
+ > Data source: OpenSLR [Deeply Nonverbal Vocalization Dataset (SLR99)](https://openslr.elda.org/99/)
208
+
209
+ | Model | P (invalid) | R (invalid) | **F1 (invalid)** |
210
+ | :--- | :---: | :---: | :---: |
211
+ | **⭐ TurnSense** | **100.00%** | **90.37%** | **🏆 94.34%** |
212
+
213
+ > **🔍 Key Finding:** Among the models compared here, only TurnSense supports invalid utterance detection. A precision of **100%** means zero false positives on this test set, and a recall of **90.37%** means most non-verbal inputs are caught before they can trigger a system response.
214
+
215
+ <br/>
216
+
217
+ ---
218
+
219
+ <br/>
220
+
221
+ ## 🚀 Quick Start
222
+
223
+ ### 1. Installation
224
+
225
+ ```bash
226
+ git clone https://github.com/Baiji-Team/TurnSense.git
227
+ cd TurnSense
228
+
229
+ pip install -U numpy onnxruntime torch librosa soundfile pandas scikit-learn huggingface_hub
230
+ ```
231
+
232
+ ### 2. Model Weights
233
+
234
+ TurnSense model weights are available on Hugging Face: [Baiji-Team/TurnSense](https://huggingface.co/Baiji-Team/TurnSense)
235
+
236
+ | Version | Size | Use Case |
237
+ | :--- | :--- | :--- |
238
+ | FP32 | ~191 MB | Accuracy-first |
239
+ | INT8 | ~50 MB | Deployment-first (recommended) |
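The two file sizes are roughly what the parameter count predicts. The back-of-envelope sketch below assumes every parameter is stored at the stated width and ignores graph metadata and any layers left unquantized, which is why the actual files are slightly larger; `estimate_size_mb` is an illustrative helper, not part of the repository.

```python
PARAMS = 47_000_000  # parameter count stated in this README

# Bytes per stored weight for each precision (weights only).
BYTES_PER_WEIGHT = {"FP32": 4, "INT8": 1}


def estimate_size_mb(params: int, bytes_per_weight: int) -> float:
    """Raw weight storage in decimal megabytes."""
    return params * bytes_per_weight / 1_000_000


for precision, width in BYTES_PER_WEIGHT.items():
    print(f"{precision}: ~{estimate_size_mb(PARAMS, width):.0f} MB of raw weights")
```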
240
+
241
+ **Download Options:**
242
+
243
+ **Option 1: Auto-download (Recommended)**
244
+ The inference script includes built-in Hugging Face download logic. The model will be automatically fetched and cached on first run.
245
+
246
+ **Option 2: Git LFS**
247
+
248
+ ```bash
249
+ git lfs install
250
+ git clone https://huggingface.co/Baiji-Team/TurnSense
251
+ ```
252
+
253
+ **Option 3: Hugging Face Hub**
254
+
255
+ ```python
256
+ from huggingface_hub import snapshot_download
257
+ snapshot_download(repo_id="Baiji-Team/TurnSense")
258
+ ```
259
+
260
+ ### 3. Inference
261
+
262
+ ```bash
263
+ python infer.py
264
+ ```
265
+
266
+ Example output:
267
+
268
+ ```
269
+ Loading model from Baiji-Team/TurnSense...
270
+ Running inference on: "我想问一下那个订单就是昨天..."
271
+
272
+ Results:
273
+ Input: "我想问一下那个订单就是昨天..."
274
+ TurnSense Detection Result: "incomplete"
275
+ ```
276
+
277
+ <br/>
278
+
279
+ ---
280
+
281
+ <br/>
282
+
283
+ ## 🧪 Evaluation Guide
284
+
285
+ ### 1) Evaluation Pipeline
286
+
287
+ 1. Load the `.jsonl` test dataset (line-by-line JSONL)
288
+ 2. Warm up each model (default `warmup_iters=20`)
289
+ 3. Run per-sample inference, collecting classification and performance metrics
290
+ 4. Automatically generate summary and detail files
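The warm-up, timing, and summary steps above can be sketched as follows. This is a simplified illustration rather than the repository's `benchmark.py`: `evaluate`, `percentile`, and `toy_predict` are hypothetical names, the stand-in predictor takes text and uses a trivial rule, and the in-memory `samples` list stands in for the loaded JSONL dataset.

```python
import math
import time
from typing import Callable


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(samples: list[dict], predict: Callable[[str], str],
             warmup_iters: int = 20) -> dict:
    if not samples:
        raise ValueError("no samples to evaluate")
    # Warm-up: run the model a few times so one-time costs do not skew latency.
    for _ in range(warmup_iters):
        predict(samples[0]["text"])
    latencies_ms, correct = [], 0
    # Per-sample inference, recording wall-clock latency and correctness.
    for sample in samples:
        start = time.perf_counter()
        predicted = predict(sample["text"])
        latencies_ms.append((time.perf_counter() - start) * 1000)
        correct += int(predicted == sample["label"])
    # Summary metrics.
    return {
        "accuracy": correct / len(samples),
        "p50_ms": percentile(latencies_ms, 50),
        "p90_ms": percentile(latencies_ms, 90),
    }


def toy_predict(text: str) -> str:
    """Stand-in for the real model: trailing ellipsis means 'incomplete'."""
    return "incomplete" if text.endswith("...") else "complete"


samples = [
    {"text": "帮我查一下明天上海天气", "label": "complete"},
    {"text": "我想问一下那个订单就是昨天...", "label": "incomplete"},
]
summary = evaluate(samples, toy_predict)
print(summary)
```

Replacing `toy_predict` with a call into the actual model would keep the timing and summary logic unchanged.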
291
+
292
+ Output files include:
293
+
294
+ | File | Description |
295
+ | :--- | :--- |
296
+ | `report.md` | Summary evaluation report |
297
+ | `results.json` | Structured evaluation results |
298
+ | `config.json` | Evaluation configuration |
299
+ | `per_sample__*.jsonl` | Per-sample prediction details |
300
+
301
+ ### 2) Data Format (JSONL)
302
+
303
+ Each line is a JSON object containing at least the following fields:
304
+
305
+ | Field | Description |
306
+ | :--- | :--- |
307
+ | `audio_path` | Path to the audio file |
308
+ | `text` | Text content |
309
+ | `label` | Label (`complete` / `incomplete` / `invalid`) |
310
+
311
+ Example:
312
+
313
+ ```jsonl
314
+ {"audio_path":"/001.wav","text":"帮我查一下明天上海天气","label":"complete"}
315
+ {"audio_path":"/002.wav","text":"我想问一下那个订单就是昨天...","label":"incomplete"}
316
+ {"audio_path":"/003.wav","text":"啊…嗯…(持续噪声)","label":"invalid"}
317
+ ```
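Before running an evaluation, it can help to check that every record has the three required fields and a known label. The sketch below shows one way to do that; `validate_record` and `load_dataset` are hypothetical helpers, not functions shipped with TurnSense.

```python
import json

REQUIRED_FIELDS = ("audio_path", "text", "label")
VALID_LABELS = {"complete", "incomplete", "invalid"}


def validate_record(line: str, line_no: int) -> dict:
    """Parse one JSONL line and check required fields and the label value."""
    record = json.loads(line)
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"line {line_no}: missing fields {missing}")
    if record["label"] not in VALID_LABELS:
        raise ValueError(f"line {line_no}: unknown label {record['label']!r}")
    return record


def load_dataset(path: str) -> list[dict]:
    """Read a JSONL file, skipping blank lines and validating each record."""
    with open(path, encoding="utf-8") as handle:
        return [validate_record(line, number)
                for number, line in enumerate(handle, start=1)
                if line.strip()]


# Validate a single in-memory line taken from the example above.
example = '{"audio_path":"/001.wav","text":"帮我查一下明天上海天气","label":"complete"}'
print(validate_record(example, 1)["label"])
```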
318
+
319
+ ### 3) Run Evaluation
320
+
321
+ ```bash
322
+ python TurnSense/Turn_benchmark/benchmark.py
323
+ ```
324
+
325
+ <br/>
326
+
327
+ ---
328
+
329
+ <br/>
330
+
331
+ ## 📚 Citation
332
+
333
+ If you use TurnSense in your research or product, please cite:
334
+
335
+ ```bibtex
336
+ @misc{turnsense2026,
337
+ author = {Baiji Team},
338
+ title = {TurnSense: A Three-Class Semantic Detection Model for Complete, Incomplete, and Invalid Utterances},
339
+ year = {2026},
340
+ publisher = {Hugging Face},
341
+ howpublished = {\url{https://huggingface.co/Baiji-Team/TurnSense}},
342
+ }
343
+ ```
344
+
345
+ <br/>
346
+
347
+ ## ❓ Contact & Community
348
+
349
+ If you have questions or suggestions, feel free to reach out:
350
+
351
+ | Channel | Contact |
352
+ | :--- | :--- |
353
+ | 📧 Email | huan.shen@brgroup.com · yingao.wang@brgroup.com · wei.zou@brgroup.com |
354
+ | 💬 WeChat | h2538406363 |
355
+ | 🐛 Issues | [GitHub Issues](https://github.com/Baiji-Team/TurnSense/issues) |
356
+ | 🔀 PR | [Pull Requests](https://github.com/Baiji-Team/TurnSense/pulls) |
357
+
358
+ <br/>
359
+
360
+ ## 📄 License
361
+
362
+ This project is released under the **Apache License 2.0** with certain additional conditions. See [LICENSE](./LICENSE) for details.
363
+
364
+ <br/>
365
+
366
+ ---
367
+
368
+ <div align="center">
369
+
370
+ **Built with ❤️ by [Baiji Team](https://github.com/Baiji-Team)**
371
+
372
+ </div>