YingaoWang-casia committed on
Commit
fb4bf47
·
verified ·
1 Parent(s): 8aa0919

Update README.md

  1. README.md +118 -79
README.md CHANGED
@@ -4,7 +4,7 @@ language:
4
  - zh
5
  - en
6
  widget:
7
- - text: TurnSense three-class voice turn discrimination demo
8
  output:
9
  url: image/PR_new.mp4
10
  ---
@@ -21,8 +21,7 @@ widget:
21
 
22
  <br/>
23
 
24
- <center><strong>47M parameters | CPU latency ~55ms | F1 as high as 96.35% | Invalid semantic filtering</strong></center>
25
-
26
 
27
  <br/>
28
 
@@ -39,21 +38,22 @@ widget:
39
 
40
  <br/>
41
 
42
- > **⭐ If TurnSense is useful to you, please give us a Star!** It helps us keep improving the model and documentation.
43
 
44
  <br/>
45
 
46
  ## 📖 Table of Contents
47
 
 
48
  - [Why TurnSense](#-why-turnsense)
49
- - [Overview](#-overview)
50
- - [Key Features](#-key-features)
51
  - [Model Size Comparison](#-model-size-comparison)
52
  - [Benchmark Results](#-benchmark-results)
53
  - [Quick Start](#-quick-start)
54
  - [Evaluation Guide](#-evaluation-guide)
55
  - [Citation](#-citation)
56
- - [Contact & Community](#-contact--community)
57
  - [License](#-license)
58
 
59
  <br/>
@@ -62,18 +62,28 @@ widget:
62
 
63
  <br/>
64
 
 
 
 
 
 
 
 
 
 
 
65
  ## 🏆 Why TurnSense
66
 
67
  <div align="center">
68
 
69
  | Dimension | TurnSense Performance |
70
  | :---: | :---: |
71
- | 🎯 **Accuracy** | F1 **96.35%** (easyturn_real_test_ZH) — best in class |
72
- | ⚡ **Inference Latency** | CPU p50 ≈ **54.65ms** — real-time interaction ready |
73
- | 📦 **Model Size** | Only **47M** parameters, INT8 version only **~50MB** |
74
- | 🧠 **Classification** | First open-source model natively supporting **complete / incomplete / invalid** three-class detection |
75
- | 🚫 **Invalid Filtering** | Invalid utterance F1 reaches **94.34%**, effectively suppressing noise-triggered responses |
76
- | 🤗 **Open-Source Friendly** | FP32 / INT8 ONNX provided, ready to use out of the box |
77
 
78
  </div>
79
 
@@ -83,19 +93,18 @@ widget:
83
 
84
  <br/>
85
 
86
- ## 📌 Overview
87
 
88
- **TurnSense** is a **three-class semantic detection model** designed for human-machine voice interaction, focused on solving a critical problem in dialogue systems:
89
 
90
- > **During a user's speech, should the system respond immediately, or continue waiting?**
91
 
92
- Traditional approaches typically rely on a simple binary classification — "finished or not." **TurnSense goes further** by simultaneously modeling semantic completeness and invalid input detection, enabling more natural turn-taking in complex real-world scenarios and **significantly reducing false interruptions, premature responses, and noise-triggered activations**.
93
 
94
  <div align="center">
95
- <img src="./image/TurnSense.svg" alt="TurnSense Three-Class Illustration" width="820"/>
96
  </div>
97
 
98
-
99
  <div align="center">
100
  <video src="https://huggingface.co/brgroup/TurnSense/resolve/main/image/PR_new.mp4"
101
  width="820"
@@ -108,16 +117,15 @@ Traditional approaches typically rely on a simple binary classification — "fin
108
  </video>
109
  </div>
110
 
111
-
112
  TurnSense classifies user input into three semantic states:
113
 
114
- | State | Description | Example |
115
  | :---: | :--- | :--- |
116
- | ✅ **Complete** | The user has expressed a complete intent; the system can respond | `"Check tomorrow's weather in Shanghai for me."` |
117
- | ⏳ **Incomplete** | The user's expression is unfinished: truncated, paused, or trailing off | `"I'd like to ask about that order from yesterday..."` |
118
- | 🔇 **Invalid** | The input does not constitute meaningful speech and should not trigger a response | `"...(continuous noise / non-verbal vocalization)"` |
119
 
120
- These three labels enable the system to determine not only **"should I respond?"** but also **"is it worth responding to?"**, significantly improving interaction naturalness and system stability in voice assistants, real-time calls, intelligent customer service, and more.
121
 
122
  <br/>
123
 
@@ -125,35 +133,35 @@ These three labels enable the system to determine not only **"should I respond?"
125
 
126
  <br/>
127
 
128
- ## ✨ Key Features
129
 
130
  ### 🧠 Semantic-Level Three-Class Detection
131
 
132
- Simultaneously models `complete / incomplete / invalid` states, which is closer to real conversational behavior than traditional binary classification, and is currently the **only open-source solution with native invalid utterance detection**.
133
 
134
- ### ⚡ Ultra-Lightweight, Ultra-Fast Inference
135
 
136
- Only **47M** parameters (INT8 version ~50MB). CPU inference latency: p50 **54.65ms**, p90 **58.00ms**, which meets the strict requirements of real-time interaction **without a GPU**.
137
 
138
- ### 🎯 Leading Accuracy
139
 
140
- Achieves **F1 96.35%** (complete) and **F1 96.32%** (incomplete) on easyturn_real_test_ZH (300 samples), and **F1 92.30%** (complete) and **F1 91.62%** (incomplete) on semantic_test_ZH (2000 samples) — best or runner-up among all comparable models.
141
 
142
  ### 🚫 Invalid Input Filtering
143
 
144
- On the NonverbalVocalization test set, invalid utterance precision reaches **100%** with recall of **90.37%** (F1 = 94.34%), effectively suppressing false triggers from non-verbal sounds and noise.
145
 
146
- ### ⚖️ More Robust Turn Decisions
147
 
148
- Balances precision and recall in semantically ambiguous, pause-heavy, or colloquial scenarios, reducing both premature responses and missed responses.
149
 
150
- ### 📊 Reproducible Evaluation Framework
151
 
152
- Ships with a complete evaluation pipeline and scripts, supporting unified metric comparison and performance regression analysis for full reproducibility.
153
 
154
- ### 🤗 Open-Source Friendly, Plug-and-Play
155
 
156
- Standardized repository structure with FP32 / INT8 ONNX models: from installation to inference in just a few minutes.
157
 
158
  <br/>
159
 
@@ -176,7 +184,7 @@ Standardized repository structure with FP32 / INT8 ONNX models — from installa
176
 
177
  </div>
178
 
179
- > 💡 With only **47M** parameters, TurnSense achieves three-class capability: the best balance between accuracy and model size.
180
 
181
  <br/>
182
 
@@ -186,13 +194,13 @@ Standardized repository structure with FP32 / INT8 ONNX models — from installa
186
 
187
  ## 📊 Benchmark Results
188
 
189
- > All results below are based on open-source Chinese evaluation sets. Latency marked with `(GPU)` indicates GPU environment; otherwise, latency was measured on **CPU**.
190
 
191
  <br/>
192
 
193
- ### 📋 easyturn_real_test_ZH (300 samples)
194
 
195
- > Data source: Real data samples from [Easy-Turn-Testset](https://huggingface.co/datasets/ASLP-lab/Easy-Turn-Testset)
196
 
197
  | Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** | p50 Latency | p90 Latency |
198
  | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
@@ -203,13 +211,13 @@ Standardized repository structure with FP32 / INT8 ONNX models — from installa
203
  | NAMO-Turn | 81.53% | 85.33% | 83.39% | 84.62% | 80.67% | 82.59% | 3.60 | 83.44 |
204
  | **⭐ TurnSense** | 96.03% | **96.67%** | **🏆 96.35%** | **96.64%** | 96.00% | **🏆 96.32%** | 54.65 | 58.00 |
205
 
206
- > **🔍 Key Finding:** TurnSense achieves the **highest F1** on both complete and incomplete classes, and is the only model with CPU p50 < 60ms while maintaining F1 > 96%.
207
 
208
  <br/>
209
 
210
- ### 📋 semantic_test_ZH (2000 samples)
211
 
212
- > Data source: Chinese test split from [KE-Team/SemanticVAD-Dataset](https://huggingface.co/datasets/KE-Team/SemanticVAD-Dataset)
213
 
214
  | Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** | p50 Latency | p90 Latency |
215
  | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
@@ -220,19 +228,47 @@ Standardized repository structure with FP32 / INT8 ONNX models — from installa
220
  | NAMO-Turn | 71.48% | 86.70% | 78.36% | 83.10% | 65.40% | 73.20% | 3.60 | 83.44 |
221
  | **⭐ TurnSense** | **88.96%** | 95.90% | **🏆 92.30%** | 95.55% | **88.00%** | **🏆 91.62%** | 54.65 | 58.00 |
222
 
223
- > **🔍 Key Finding:** On the larger 2000-sample test set, TurnSense still maintains the best F1, demonstrating strong generalization capability.
224
 
225
  <br/>
226
 
227
- ### 📋 NonverbalVocalization_invalid (728 samples)
 
 
 
 
228
 
229
- > Data source: OpenSLR [Deeply Nonverbal Vocalization Dataset (SLR99)](https://openslr.elda.org/99/)
230
 
231
- | Model | P (invalid) | R (invalid) | **F1 (invalid)** |
232
- | :--- | :---: | :---: | :---: |
233
- | **⭐ TurnSense** | **100.00%** | **90.37%** | **🏆 94.34%** |
 
 
 
 
234
 
235
- > **🔍 Key Finding:** TurnSense is currently the only model that supports invalid utterance detection. A precision of **100%** means zero false positives — effectively preventing noise from triggering system responses.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
236
 
237
  <br/>
238
 
@@ -251,19 +287,20 @@ cd TurnSense
251
  pip install -U numpy onnxruntime torch librosa soundfile pandas scikit-learn huggingface_hub
252
  ```
253
 
254
- ### 2. Model Weights
255
 
256
  TurnSense model weights are available on Hugging Face: [Baiji-Team/TurnSense](https://huggingface.co/brgroup/TurnSense)
257
 
258
  | Version | Size | Use Case |
259
  | :--- | :--- | :--- |
260
- | FP32 | ~191 MB | Accuracy-first |
261
- | INT8 | ~50 MB | Deployment-first (recommended) |
 
 
262
 
263
- **Download Options:**
264
 
265
- **Option 1: Auto-download (Recommended)**
266
- The inference script includes built-in Hugging Face download logic. The model will be automatically fetched and cached on first run.
267
 
268
  **Option 2: Git LFS**
269
 
@@ -287,12 +324,12 @@ python infer.py
287
 
288
  Example output:
289
 
290
- ```
291
  Loading model from brgroup/TurnSense...
292
- Running inference on: "我想问一下那个订单就是昨天..."
293
 
294
  Results:
295
- Input: "我想问一下那个订单就是昨天..."
296
  TurnSense Detection Result: "incomplete"
297
  ```
298
 
@@ -304,12 +341,12 @@ Results:
304
 
305
  ## 🧪 Evaluation Guide
306
 
307
- ### 1) Evaluation Pipeline
308
 
309
- 1. Load the `.jsonl` test dataset (line-by-line JSONL)
310
- 2. Warm up each model (default `warmup_iters=20`)
311
- 3. Run per-sample inference, collecting classification and performance metrics
312
- 4. Automatically generate summary and detail files
313
 
314
  Output files include:
315
 
@@ -318,27 +355,27 @@ Output files include:
318
  | `report.md` | Summary evaluation report |
319
  | `results.json` | Structured evaluation results |
320
  | `config.json` | Evaluation configuration |
321
- | `per_sample__*.jsonl` | Per-sample prediction details |
322
 
323
- ### 2) Data Format (JSONL)
324
 
325
- Each line is a JSON object containing at least the following fields:
326
 
327
  | Field | Description |
328
  | :--- | :--- |
329
  | `audio_path` | Path to the audio file |
330
  | `text` | Text content |
331
- | `label` | Label (`complete` / `incomplete` / `invalid`) |
332
 
333
  Example:
334
 
335
  ```jsonl
336
- {"audio_path":"/001.wav","text":"帮我查一下明天上海天气","label":"complete"}
337
- {"audio_path":"/002.wav","text":"我想问一下那个订单就是昨天...","label":"incomplete"}
338
- {"audio_path":"/003.wav","text":"啊…嗯…(持续噪声)","label":"invalid"}
339
  ```
340
 
341
- ### 3) Run Evaluation
342
 
343
  ```bash
344
  python TurnSense/Turn_benchmark/benchmark.py
@@ -366,13 +403,15 @@ If you use TurnSense in your research or product, please cite:
366
 
367
  <br/>
368
 
369
- ## ❓ Contact & Community
 
 
370
 
371
- If you have questions or suggestions, feel free to reach out:
372
 
373
  | Channel | Contact |
374
  | :--- | :--- |
375
- | 📧 Email | [huan.shen@brgroup.com](mailto:huan.shen@brgroup.com) · [yingao.wang@brgroup.com](mailto:yingao.wang@brgroup.com) · [wei.zou@brgroup.com](mailto:wei.zou@brgroup.com) |
376
  | 💬 WeChat | h2538406363 |
377
  | 👥 WeChat Group | Scan the QR code to join the group<br><img src="image/wechat.jpg" alt="WeChat group QR code" width="220" /> |
378
  | 🐛 Issues | [GitHub Issues](https://github.com/Bairong-Xdynamics/TurnSense/issues) |
@@ -382,7 +421,7 @@ If you have questions or suggestions, feel free to reach out:
382
 
383
  ## 📄 License
384
 
385
- This project is released under the **Apache License 2.0** with certain additional conditions. See [LICENSE](./LICENSE) for details.
386
 
387
  <br/>
388
 
@@ -392,4 +431,4 @@ This project is released under the **Apache License 2.0** with certain additiona
392
 
393
  **Built with ❤️ by [Baiji Team](https://github.com/Bairong-Xdynamics)**
394
 
395
- </div>
 
4
  - zh
5
  - en
6
  widget:
7
+ - text: TurnSense three-class speech turn detection demo
8
  output:
9
  url: image/PR_new.mp4
10
  ---
 
21
 
22
  <br/>
23
 
24
+ <center><strong>47M Parameters | CPU Latency ~55ms | F1 up to 96.35% | Invalid Utterance Filtering</strong></center>
 
25
 
26
  <br/>
27
 
 
38
 
39
  <br/>
40
 
41
+ > **⭐ If TurnSense is useful to you, please give us a Star!** This helps us continue improving the model and documentation.
42
 
43
  <br/>
44
 
45
  ## 📖 Table of Contents
46
 
47
+ - [News](#-news)
48
  - [Why TurnSense](#-why-turnsense)
49
+ - [Introduction](#-introduction)
50
+ - [Core Features](#-core-features)
51
  - [Model Size Comparison](#-model-size-comparison)
52
  - [Benchmark Results](#-benchmark-results)
53
  - [Quick Start](#-quick-start)
54
  - [Evaluation Guide](#-evaluation-guide)
55
  - [Citation](#-citation)
56
+ - [Questions and Contact](#-questions-and-contact)
57
  - [License](#-license)
58
 
59
  <br/>
 
62
 
63
  <br/>
64
 
65
+ ## 📰 News
66
+
67
+ - **2026.05.22**: Released **TurnSense 1.1**, an English-enhanced version focused on improving `complete / incomplete` semantic completeness detection in English scenarios. It is suitable for Chinese-English mixed dialogue scenarios. The model is available on Hugging Face: [brgroup/TurnSense](https://huggingface.co/brgroup/TurnSense).
68
+
69
+ <br/>
70
+
71
+ ---
72
+
73
+ <br/>
74
+
75
  ## 🏆 Why TurnSense
76
 
77
  <div align="center">
78
 
79
  | Dimension | TurnSense Performance |
80
  | :---: | :---: |
81
+ | 🎯 **Accuracy** | F1 **96.35%** on `easyturn_real_test_ZH` — best among comparable models |
82
+ | ⚡ **Inference Latency** | CPU p50 ≈ **54.65ms** — suitable for real-time interaction |
83
+ | 📦 **Model Size** | Only **47M** parameters, with an INT8 version of about **50MB** |
84
+ | 🧠 **Classification Ability** | The first open-source model to natively support **complete / incomplete / invalid** three-class detection |
85
+ | 🚫 **Invalid Filtering** | Invalid utterance F1 reaches **94.34%**, effectively reducing noise-triggered false activations |
86
+ | 🤗 **Open-Source Friendly** | Provides FP32 / INT8 ONNX models, ready to use out of the box |
87
 
88
  </div>
89
 
 
93
 
94
  <br/>
95
 
96
+ ## 📌 Introduction
97
 
98
+ **TurnSense** is a **three-class semantic turn detection model** designed for human-machine speech interaction. It focuses on a core problem in conversational systems:
99
 
100
+ > **Should the system respond immediately while the user is speaking, or should it keep waiting?**
101
 
102
+ Traditional approaches usually perform only binary "end-of-turn" detection. **TurnSense goes further** by jointly modeling semantic completeness and invalid input detection. This helps systems achieve more natural turn-taking in complex real-world scenarios and significantly reduces premature interruption, overlapping speech, and invalid triggers.
103
 
104
  <div align="center">
105
+ <img src="./image/TurnSense.svg" alt="TurnSense three-class diagram" width="820"/>
106
  </div>
107
 
 
108
  <div align="center">
109
  <video src="https://huggingface.co/brgroup/TurnSense/resolve/main/image/PR_new.mp4"
110
  width="820"
 
117
  </video>
118
  </div>
119
 
 
120
  TurnSense classifies user input into three semantic states:
121
 
122
+ | State | Meaning | Example |
123
  | :---: | :--- | :--- |
124
+ | ✅ **Complete** | The user's expression forms a complete intent, and the system can respond | `"Please check tomorrow's weather in Shanghai."` |
125
+ | ⏳ **Incomplete** | The user's expression is not finished and may continue after a pause or truncation | `"I want to ask about that order from yesterday..."` |
126
+ | 🔇 **Invalid** | The input does not form valid semantic content and should not trigger a response | `"...(continuous noise / nonverbal vocalization)"` |
127
 
128
+ These three labels allow the system to determine not only **"whether it should take the turn"**, but also **"whether the input is worth responding to"**. This improves interaction naturalness and system stability in voice assistants, real-time calls, intelligent customer service, and other speech interaction scenarios.
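
As a rough illustration of how the three labels could drive a dialogue loop, the sketch below maps each state to an action. The `TurnAction` enum and the `decide_action` helper are hypothetical names introduced here and are not part of the TurnSense repository; only the label strings `complete`, `incomplete`, and `invalid` come from this README.

```python
from enum import Enum


class TurnAction(Enum):
    RESPOND = "respond"        # take the turn and answer
    KEEP_LISTENING = "wait"    # the user is likely to continue
    IGNORE = "ignore"          # noise or non-verbal sound: stay silent


# Hypothetical mapping from a predicted label to a dialogue action.
_LABEL_TO_ACTION = {
    "complete": TurnAction.RESPOND,
    "incomplete": TurnAction.KEEP_LISTENING,
    "invalid": TurnAction.IGNORE,
}


def decide_action(label: str) -> TurnAction:
    """Return the action for a predicted label; unknown labels raise ValueError."""
    try:
        return _LABEL_TO_ACTION[label.strip().lower()]
    except KeyError:
        raise ValueError(f"unexpected label: {label!r}") from None
```

A binary end-of-turn detector only distinguishes the first two branches; the third branch is what lets a system stay silent on noise instead of waiting or answering.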
129
 
130
  <br/>
131
 
 
133
 
134
  <br/>
135
 
136
+ ## ✨ Core Features
137
 
138
  ### 🧠 Semantic-Level Three-Class Detection
139
 
140
+ TurnSense jointly models `complete / incomplete / invalid` states. Compared with traditional binary turn detection, this is closer to real conversational behavior. It is also the only open-source solution that natively supports invalid semantic detection.
141
 
142
+ ### ⚡ Extremely Lightweight and Fast
143
 
144
+ TurnSense has only **47M** parameters. The INT8 version is about **50MB**. In CPU environments, it achieves p50 latency of about **54.65ms** and p90 latency of about **58.00ms**, enabling real-time interaction without requiring a GPU.
145
 
146
+ ### 🎯 Strong Accuracy
147
 
148
+ On `easyturn_real_test_ZH` with 300 samples, TurnSense achieves **F1 96.35%** for `complete` and **F1 96.32%** for `incomplete`. On `semantic_test_ZH` with 2000 samples, it achieves **F1 92.30%** for `complete` and **F1 91.62%** for `incomplete`, reaching best or second-best performance among comparable models.
149
 
150
  ### 🚫 Invalid Input Filtering
151
 
152
+ On the NonverbalVocalization dataset, invalid utterance detection reaches **100% precision**, **90.37% recall**, and **94.34% F1**, effectively suppressing false activations caused by nonverbal vocalizations and noise.
153
 
154
+ ### ⚖️ More Robust Turn-Taking Decisions
155
 
156
+ TurnSense balances precision and recall in semantically ambiguous, paused, or colloquial speech scenarios, reducing premature responses and missed responses.
157
 
158
+ ### 📊 Reproducible Evaluation Pipeline
159
 
160
+ The project includes a complete evaluation workflow and scripts, supporting unified metric comparison and performance regression analysis to ensure reproducibility.
161
 
162
+ ### 🤗 Open-Source Friendly and Ready to Use
163
 
164
+ TurnSense provides a standardized repository structure and FP32 / INT8 ONNX models. Installation and inference can be completed within minutes.
165
 
166
  <br/>
167
 
 
184
 
185
  </div>
186
 
187
+ > 💡 With only **47M** parameters, TurnSense provides native three-class detection and achieves a strong balance between accuracy and model size.
188
 
189
  <br/>
190
 
 
194
 
195
  ## 📊 Benchmark Results
196
 
197
+ > The following results cover Chinese, English, and invalid-utterance test sets. Chinese results mainly demonstrate the capability of the initial TurnSense version, while English results show the enhanced performance of TurnSense 1.1.
198
 
199
  <br/>
200
 
201
+ ### 📋 easyturn_real_test_ZH (300 samples)
202
 
203
+ > Data source: real samples from [Easy-Turn-Testset](https://huggingface.co/datasets/ASLP-lab/Easy-Turn-Testset)
204
 
205
  | Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** | p50 Latency | p90 Latency |
206
  | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
 
211
  | NAMO-Turn | 81.53% | 85.33% | 83.39% | 84.62% | 80.67% | 82.59% | 3.60 | 83.44 |
212
  | **⭐ TurnSense** | 96.03% | **96.67%** | **🏆 96.35%** | **96.64%** | 96.00% | **🏆 96.32%** | 54.65 | 58.00 |
213
 
214
+ > **🔍 Key finding:** TurnSense achieves the highest F1 for both `complete` and `incomplete`, and is the only model that reaches F1 > 96% with CPU p50 latency below 60ms.
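
The F1 columns follow from the standard harmonic-mean definition, F1 = 2PR / (P + R). As a quick consistency check, the sketch below recomputes F1 for the TurnSense row of this table from its reported precision and recall. The helper name `f1_score` is ours, chosen for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# TurnSense on easyturn_real_test_ZH, P and R copied from the table.
f1_complete = f1_score(0.9603, 0.9667)
f1_incomplete = f1_score(0.9664, 0.9600)
print(f"{f1_complete:.2%}  {f1_incomplete:.2%}")  # prints "96.35%  96.32%"
```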
215
 
216
  <br/>
217
 
218
+ ### 📋 semantic_test_ZH (2000 samples)
219
 
220
+ > Data source: Chinese test set from [KE-Team/SemanticVAD-Dataset](https://huggingface.co/datasets/KE-Team/SemanticVAD-Dataset)
221
 
222
  | Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** | p50 Latency | p90 Latency |
223
  | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
 
228
  | NAMO-Turn | 71.48% | 86.70% | 78.36% | 83.10% | 65.40% | 73.20% | 3.60 | 83.44 |
229
  | **⭐ TurnSense** | **88.96%** | 95.90% | **🏆 92.30%** | 95.55% | **88.00%** | **🏆 91.62%** | 54.65 | 58.00 |
230
 
231
+ > **🔍 Key finding:** On the larger 2000-sample test set, TurnSense continues to maintain the best F1 performance, demonstrating strong generalization.
232
 
233
  <br/>
234
 
235
+ ### 📋 TurnSense 1.1 English Enhancement Results
236
+
237
+ > Model download: [Hugging Face - brgroup/TurnSense](https://huggingface.co/brgroup/TurnSense)
238
+
239
+ > TurnSense 1.1 focuses on improving semantic completeness detection in English scenarios. The following results show its `complete / incomplete` performance on English test sets.
240
 
241
+ #### ten_test_EN
242
 
243
+ | Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** |
+ | :--- | :---: | :---: | :---: | :---: | :---: | :---: |
+ | Smart-Turn-v3 | 70.66% | 72.46% | 71.55% | 65.05% | 63.02% | 64.02% |
+ | TEN-Turn | **98.61%** | 90.25% | **94.25%** | 89.15% | **98.44%** | **93.56%** |
+ | FireRedChat | 76.41% | **97.46%** | 85.66% | **95.28%** | 63.02% | 75.86% |
+ | NAMO-Turn | <u>92.65%</u> | 26.69% | 41.45% | 51.94% | <u>97.40%</u> | 67.75% |
+ | **⭐ TurnSense 1.1 int8** | 83.01% | 91.10% | 86.87% | 87.57% | 77.08% | <u>81.99%</u> |
250
 
251
+ #### semantic_test_EN
252
+
253
+ | Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** |
+ | :--- | :---: | :---: | :---: | :---: | :---: | :---: |
+ | Smart-Turn-v3 | 68.18% | 75.00% | 71.43% | 72.22% | 65.00% | 68.42% |
+ | TEN-Turn | **97.98%** | 97.00% | **97.49%** | **97.03%** | **98.00%** | **97.51%** |
+ | FireRedChat | 72.06% | **98.00%** | 83.05% | 96.88% | 62.00% | 75.61% |
+ | NAMO-Turn | <u>93.55%</u> | 87.00% | <u>90.16%</u> | 87.85% | <u>94.00%</u> | <u>90.82%</u> |
+ | **⭐ TurnSense 1.1 int8** | 74.60% | 94.00% | 83.19% | <u>91.89%</u> | 68.00% | 78.16% |
260
+
261
+ <br/>
262
+
263
+ ### 📋 NonverbalVocalization_invalid (728 samples)
+
+ > Data source: OpenSLR [Deeply Nonverbal Vocalization Dataset (SLR99)](https://openslr.elda.org/99/)
+
+ | Model | R (invalid) |
+ | :--- | :---: |
+ | **⭐ TurnSense** | **90.37%** |
270
+
271
+ > **🔍 Key finding:** TurnSense supports invalid semantic detection and can effectively reduce system responses triggered by nonverbal vocalizations or noise.
272
 
273
  <br/>
274
 
 
287
  pip install -U numpy onnxruntime torch librosa soundfile pandas scikit-learn huggingface_hub
288
  ```
289
 
290
+ ### 2. Download Model Weights
291
 
292
  TurnSense model weights are available on Hugging Face: [Baiji-Team/TurnSense](https://huggingface.co/brgroup/TurnSense)
293
 
294
  | Version | Size | Use Case |
295
  | :--- | :--- | :--- |
296
+ | FP32 | ~191 MB | Accuracy-first scenarios |
297
+ | INT8 | ~50 MB | Deployment-first scenarios, recommended |
298
+
299
+ **Download options:**
300
 
301
+ **Option 1: Automatic download, recommended**
302
 
303
+ The inference script includes Hugging Face download logic and will automatically download and cache the model during the first run.
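
For readers who prefer to fetch the weights explicitly, the sketch below uses `snapshot_download` from `huggingface_hub` (already in the install command above) to pull only ONNX files into the local cache. This is not the logic inside `infer.py`, which is not shown here; the helper name `download_onnx_files` and the assumption that the weights match `*.onnx` are ours.

```python
from pathlib import Path

REPO_ID = "brgroup/TurnSense"  # repository named in this README


def download_onnx_files(repo_id: str = REPO_ID) -> list[Path]:
    """Download the repo's *.onnx files into the local HF cache and list them."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # Assumption: the FP32 / INT8 weights are stored as *.onnx files.
    local_dir = snapshot_download(repo_id=repo_id, allow_patterns=["*.onnx"])
    return sorted(Path(local_dir).rglob("*.onnx"))
```

Calling `download_onnx_files()` requires network access on the first run; later calls reuse the cached snapshot.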
 
304
 
305
  **Option 2: Git LFS**
306
 
 
324
 
325
  Example output:
326
 
327
+ ```text
328
  Loading model from brgroup/TurnSense...
329
+ Running inference on: "I want to ask about that order from yesterday..."
330
 
331
  Results:
332
+ Input: "I want to ask about that order from yesterday..."
333
  TurnSense Detection Result: "incomplete"
334
  ```
335
 
 
341
 
342
  ## 🧪 Evaluation Guide
343
 
344
+ ### 1. Evaluation Pipeline
345
 
346
+ 1. Read test datasets in `.jsonl` format.
347
+ 2. Warm up each model first. The default value is `warmup_iters=20`.
348
+ 3. Run inference sample by sample and collect classification and performance metrics.
349
+ 4. Automatically export summary reports and detailed result files.
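
The four steps above can be sketched in a few lines of standard-library Python. This is a simplified illustration and not the code in `benchmark.py`: `run_eval`, `percentile`, and the `predict` callable are hypothetical, and plain accuracy stands in for the per-class precision, recall, and F1 that the real reports contain.

```python
import json
import math
import time
from typing import Callable, Iterable


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def run_eval(lines: Iterable[str], predict: Callable[[dict], str],
             warmup_iters: int = 20) -> dict:
    """Warm up, then time `predict` on every sample and summarize."""
    samples = [json.loads(line) for line in lines if line.strip()]
    if not samples:
        raise ValueError("no samples to evaluate")
    for i in range(warmup_iters):  # warm-up calls are not timed
        predict(samples[i % len(samples)])
    latencies_ms, correct = [], 0
    for sample in samples:
        start = time.perf_counter()
        predicted = predict(sample)
        latencies_ms.append((time.perf_counter() - start) * 1000)
        correct += predicted == sample["label"]
    return {
        "n": len(samples),
        "accuracy": correct / len(samples),
        "p50_ms": percentile(latencies_ms, 50),
        "p90_ms": percentile(latencies_ms, 90),
    }
```

Keeping the warm-up calls outside the timed loop is what makes p50 and p90 reflect steady-state latency rather than first-call initialization cost.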
350
 
351
  Output files include:
352
 
 
355
  | `report.md` | Summary evaluation report |
356
  | `results.json` | Structured evaluation results |
357
  | `config.json` | Evaluation configuration |
358
+ | `per_sample__*.jsonl` | Per-sample prediction results |
359
 
360
+ ### 2. Data Format Requirements (JSONL)
361
 
362
+ Each line should be a JSON object containing at least the following fields:
363
 
364
  | Field | Description |
365
  | :--- | :--- |
366
  | `audio_path` | Path to the audio file |
367
  | `text` | Text content |
368
+ | `label` | Label: `complete` / `incomplete` / `invalid` |
369
 
370
  Example:
371
 
372
  ```jsonl
373
+ {"audio_path":"/001.wav","text":"Please check tomorrow's weather in Shanghai.","label":"complete"}
374
+ {"audio_path":"/002.wav","text":"I want to ask about that order from yesterday...","label":"incomplete"}
375
+ {"audio_path":"/003.wav","text":"uh... hmm... continuous noise","label":"invalid"}
376
  ```
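
Before running a benchmark it can help to check that every line carries the three required fields and a known label. The validator below is a minimal sketch based only on the field table above; `validate_jsonl_line`, `REQUIRED_FIELDS`, and `ALLOWED_LABELS` are names introduced for this example, and the actual benchmark script may accept or require more.

```python
import json

REQUIRED_FIELDS = ("audio_path", "text", "label")
ALLOWED_LABELS = {"complete", "incomplete", "invalid"}


def validate_jsonl_line(line: str) -> dict:
    """Parse one JSONL line and check the fields this README lists."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if record["label"] not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {record['label']!r}")
    return record
```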
377
 
378
+ ### 3. Run Evaluation
379
 
380
  ```bash
381
  python TurnSense/Turn_benchmark/benchmark.py
 
403
 
404
  <br/>
405
 
406
+ <br/>
407
+
408
+ ## ❓ Questions and Contact
409
 
410
+ If you have questions or suggestions, feel free to contact us through the following channels:
411
 
412
  | Channel | Contact |
413
  | :--- | :--- |
414
+ | 📧 Email | [huan.shen@brgroup.com](mailto:huan.shen@brgroup.com) · [yingao.wang@brgroup.com](mailto:yingao.wang@brgroup.com) · [wei.zou@brgroup.com](mailto:wei.zou@brgroup.com) |
415
  | 💬 WeChat | h2538406363 |
416
  | 👥 WeChat Group | Scan the QR code to join the group<br><img src="image/wechat.jpg" alt="WeChat group QR code" width="220" /> |
417
  | 🐛 Issues | [GitHub Issues](https://github.com/Bairong-Xdynamics/TurnSense/issues) |
 
421
 
422
  ## 📄 License
423
 
424
+ This project is released under the **Apache License 2.0** with additional specific restrictions. See [LICENSE](./LICENSE) for details.
425
 
426
  <br/>
427
 
 
431
 
432
  **Built with ❤️ by [Baiji Team](https://github.com/Bairong-Xdynamics)**
433
 
434
+ </div>