Update README.md
README.md
CHANGED
|
@@ -4,7 +4,7 @@ language:
|
|
| 4 |
- zh
|
| 5 |
- en
|
| 6 |
widget:
|
| 7 |
-
- text: TurnSense
|
| 8 |
output:
|
| 9 |
url: image/PR_new.mp4
|
| 10 |
---
|
|
@@ -21,8 +21,7 @@ widget:
|
|
| 21 |
|
| 22 |
<br/>
|
| 23 |
|
| 24 |
-
<center><strong>47M
|
| 25 |
-
|
| 26 |
|
| 27 |
<br/>
|
| 28 |
|
|
@@ -39,21 +38,22 @@ widget:
|
|
| 39 |
|
| 40 |
<br/>
|
| 41 |
|
| 42 |
-
> **⭐ If TurnSense is useful to you, please give us a Star!**
|
| 43 |
|
| 44 |
<br/>
|
| 45 |
|
| 46 |
## 📖 Table of Contents
|
| 47 |
|
| 48 |
- [Why TurnSense](#-why-turnsense)
|
| 49 |
-
- [
|
| 50 |
-
- [
|
| 51 |
- [Model Size Comparison](#-model-size-comparison)
|
| 52 |
- [Benchmark Results](#-benchmark-results)
|
| 53 |
- [Quick Start](#-quick-start)
|
| 54 |
- [Evaluation Guide](#-evaluation-guide)
|
| 55 |
- [Citation](#-citation)
|
| 56 |
-
- [
|
| 57 |
- [License](#-license)
|
| 58 |
|
| 59 |
<br/>
|
|
@@ -62,18 +62,28 @@ widget:
|
|
| 62 |
|
| 63 |
<br/>
|
| 64 |
|
| 65 |
## 🏆 Why TurnSense
|
| 66 |
|
| 67 |
<div align="center">
|
| 68 |
|
| 69 |
| Dimension | TurnSense Performance |
|
| 70 |
| :---: | :---: |
|
| 71 |
-
| 🎯 **Accuracy** | F1 **96.35%**
|
| 72 |
-
| ⚡ **Inference Latency** | CPU p50 ≈ **54.65ms** — real-time interaction
|
| 73 |
-
| 📦 **Model Size** | Only **47M** parameters, INT8 version
|
| 74 |
-
| 🧠 **Classification** |
|
| 75 |
-
| 🚫 **Invalid Filtering** | Invalid utterance F1 reaches **94.34%**, effectively
|
| 76 |
-
| 🤗 **Open-Source Friendly** | FP32 / INT8 ONNX
|
| 77 |
|
| 78 |
</div>
|
| 79 |
|
|
@@ -83,19 +93,18 @@ widget:
|
|
| 83 |
|
| 84 |
<br/>
|
| 85 |
|
| 86 |
-
## 📌
|
| 87 |
|
| 88 |
-
**TurnSense** is a **three-class semantic detection model** designed for human-machine
|
| 89 |
|
| 90 |
-
> **
|
| 91 |
|
| 92 |
-
Traditional approaches
|
| 93 |
|
| 94 |
<div align="center">
|
| 95 |
-
<img src="./image/TurnSense.svg" alt="TurnSense
|
| 96 |
</div>
|
| 97 |
|
| 98 |
-
|
| 99 |
<div align="center">
|
| 100 |
<video src="https://huggingface.co/brgroup/TurnSense/resolve/main/image/PR_new.mp4"
|
| 101 |
width="820"
|
|
@@ -108,16 +117,15 @@ Traditional approaches typically rely on a simple binary classification — "fin
|
|
| 108 |
</video>
|
| 109 |
</div>
|
| 110 |
|
| 111 |
-
|
| 112 |
TurnSense classifies user input into three semantic states:
|
| 113 |
|
| 114 |
-
| State |
|
| 115 |
| :---: | :--- | :--- |
|
| 116 |
-
| ✅ **Complete** | The user
|
| 117 |
-
| ⏳ **Incomplete** | The user's expression is
|
| 118 |
-
| 🔇 **Invalid** | The input does not
|
| 119 |
|
| 120 |
-
These three labels
|
| 121 |
|
| 122 |
<br/>
|
| 123 |
|
|
@@ -125,35 +133,35 @@ These three labels enable the system to determine not only **"should I respond?"
|
|
| 125 |
|
| 126 |
<br/>
|
| 127 |
|
| 128 |
-
## ✨
|
| 129 |
|
| 130 |
### 🧠 Semantic-Level Three-Class Detection
|
| 131 |
|
| 132 |
-
|
| 133 |
|
| 134 |
-
### ⚡
|
| 135 |
|
| 136 |
-
|
| 137 |
|
| 138 |
-
### 🎯
|
| 139 |
|
| 140 |
-
|
| 141 |
|
| 142 |
### 🚫 Invalid Input Filtering
|
| 143 |
|
| 144 |
-
On the NonverbalVocalization
|
| 145 |
|
| 146 |
-
### ⚖️ More Robust Turn Decisions
|
| 147 |
|
| 148 |
-
|
| 149 |
|
| 150 |
-
### 📊 Reproducible Evaluation
|
| 151 |
|
| 152 |
-
|
| 153 |
|
| 154 |
-
### 🤗 Open-Source Friendly
|
| 155 |
|
| 156 |
-
|
| 157 |
|
| 158 |
<br/>
|
| 159 |
|
|
@@ -176,7 +184,7 @@ Standardized repository structure with FP32 / INT8 ONNX models — from installa
|
|
| 176 |
|
| 177 |
</div>
|
| 178 |
|
| 179 |
-
> 💡 With only **47M** parameters, TurnSense
|
| 180 |
|
| 181 |
<br/>
|
| 182 |
|
|
@@ -186,13 +194,13 @@ Standardized repository structure with FP32 / INT8 ONNX models — from installa
|
|
| 186 |
|
| 187 |
## 📊 Benchmark Results
|
| 188 |
|
| 189 |
-
>
|
| 190 |
|
| 191 |
<br/>
|
| 192 |
|
| 193 |
-
### 📋 easyturn_real_test_ZH
|
| 194 |
|
| 195 |
-
> Data source:
|
| 196 |
|
| 197 |
| Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** | p50 Latency (ms) | p90 Latency (ms) |
|
| 198 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|
|
@@ -203,13 +211,13 @@ Standardized repository structure with FP32 / INT8 ONNX models — from installa
|
|
| 203 |
| NAMO-Turn | 81.53% | 85.33% | 83.39% | 84.62% | 80.67% | 82.59% | 3.60 | 83.44 |
|
| 204 |
| **⭐ TurnSense** | 96.03% | **96.67%** | **🏆 96.35%** | **96.64%** | 96.00% | **🏆 96.32%** | 54.65 | 58.00 |
|
| 205 |
|
| 206 |
-
> **🔍 Key
|
| 207 |
|
| 208 |
<br/>
|
| 209 |
|
| 210 |
-
### 📋 semantic_test_ZH
|
| 211 |
|
| 212 |
-
> Data source: Chinese test
|
| 213 |
|
| 214 |
| Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** | p50 Latency (ms) | p90 Latency (ms) |
|
| 215 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|
|
@@ -220,19 +228,47 @@ Standardized repository structure with FP32 / INT8 ONNX models — from installa
|
|
| 220 |
| NAMO-Turn | 71.48% | 86.70% | 78.36% | 83.10% | 65.40% | 73.20% | 3.60 | 83.44 |
|
| 221 |
| **⭐ TurnSense** | **88.96%** | 95.90% | **🏆 92.30%** | 95.55% | **88.00%** | **🏆 91.62%** | 54.65 | 58.00 |
|
| 222 |
|
| 223 |
-
> **🔍 Key
|
| 224 |
|
| 225 |
<br/>
|
| 226 |
|
| 227 |
-
### 📋
|
| 228 |
|
| 229 |
-
|
| 230 |
|
| 231 |
-
| Model | P (
|
| 232 |
-
| :--- | :---: | :---: | :---: |
|
| 233 |
-
|
| 234 |
|
| 235 |
-
|
| 236 |
|
| 237 |
<br/>
|
| 238 |
|
|
@@ -251,19 +287,20 @@ cd TurnSense
|
|
| 251 |
pip install -U numpy onnxruntime torch librosa soundfile pandas scikit-learn huggingface_hub
|
| 252 |
```
|
| 253 |
|
| 254 |
-
### 2. Model Weights
|
| 255 |
|
| 256 |
TurnSense model weights are available on Hugging Face: [Baiji-Team/TurnSense](https://huggingface.co/brgroup/TurnSense)
|
| 257 |
|
| 258 |
| Version | Size | Use Case |
|
| 259 |
| :--- | :--- | :--- |
|
| 260 |
-
| FP32 | ~191 MB | Accuracy-first |
|
| 261 |
-
| INT8 | ~50 MB | Deployment-first
|
| 262 |
|
| 263 |
-
**
|
| 264 |
|
| 265 |
-
|
| 266 |
-
The inference script includes built-in Hugging Face download logic. The model will be automatically fetched and cached on first run.
|
| 267 |
|
| 268 |
**Option 2: Git LFS**
|
| 269 |
|
|
@@ -287,12 +324,12 @@ python infer.py
|
|
| 287 |
|
| 288 |
Example output:
|
| 289 |
|
| 290 |
-
```
|
| 291 |
Loading model from brgroup/TurnSense...
|
| 292 |
-
Running inference on: "
|
| 293 |
|
| 294 |
Results:
|
| 295 |
-
Input: "
|
| 296 |
TurnSense Detection Result: "incomplete"
|
| 297 |
```
|
| 298 |
|
|
@@ -304,12 +341,12 @@ Results:
|
|
| 304 |
|
| 305 |
## 🧪 Evaluation Guide
|
| 306 |
|
| 307 |
-
### 1
|
| 308 |
|
| 309 |
-
1.
|
| 310 |
-
2. Warm up each model
|
| 311 |
-
3. Run
|
| 312 |
-
4. Automatically
|
| 313 |
|
| 314 |
Output files include:
|
| 315 |
|
|
@@ -318,27 +355,27 @@ Output files include:
|
|
| 318 |
| `report.md` | Summary evaluation report |
|
| 319 |
| `results.json` | Structured evaluation results |
|
| 320 |
| `config.json` | Evaluation configuration |
|
| 321 |
-
| `per_sample__*.jsonl` | Per-sample prediction
|
| 322 |
|
| 323 |
-
### 2
|
| 324 |
|
| 325 |
-
Each line
|
| 326 |
|
| 327 |
| Field | Description |
|
| 328 |
| :--- | :--- |
|
| 329 |
| `audio_path` | Path to the audio file |
|
| 330 |
| `text` | Text content |
|
| 331 |
-
| `label` | Label
|
| 332 |
|
| 333 |
Example:
|
| 334 |
|
| 335 |
```jsonl
|
| 336 |
-
{"audio_path":"/001.wav","text":"
|
| 337 |
-
{"audio_path":"/002.wav","text":"
|
| 338 |
-
{"audio_path":"/003.wav","text":"
|
| 339 |
```
|
| 340 |
|
| 341 |
-
### 3
|
| 342 |
|
| 343 |
```bash
|
| 344 |
python TurnSense/Turn_benchmark/benchmark.py
|
|
@@ -366,13 +403,15 @@ If you use TurnSense in your research or product, please cite:
|
|
| 366 |
|
| 367 |
<br/>
|
| 368 |
|
| 369 |
-
|
| 370 |
|
| 371 |
-
If you have questions or suggestions, feel free to
|
| 372 |
|
| 373 |
| Channel | Contact |
|
| 374 |
| :--- | :--- |
|
| 375 |
-
| 📧 Email | [huan.shen@brgroup.com](mailto:huan.shen@brgroup.com)
|
| 376 |
| 💬 WeChat | h2538406363 |
|
| 377 |
| 👥 WeChat Group | Scan the QR code to join the group<br><img src="image/wechat.jpg" alt="WeChat group QR code" width="220" /> |
|
| 378 |
| 🐛 Issues | [GitHub Issues](https://github.com/Bairong-Xdynamics/TurnSense/issues) |
|
|
@@ -382,7 +421,7 @@ If you have questions or suggestions, feel free to reach out:
|
|
| 382 |
|
| 383 |
## 📄 License
|
| 384 |
|
| 385 |
-
This project is released under the **Apache License 2.0** with
|
| 386 |
|
| 387 |
<br/>
|
| 388 |
|
|
@@ -392,4 +431,4 @@ This project is released under the **Apache License 2.0** with certain additiona
|
|
| 392 |
|
| 393 |
**Built with ❤️ by [Baiji Team](https://github.com/Bairong-Xdynamics)**
|
| 394 |
|
| 395 |
-
</div>
|
|
|
|
| 4 |
- zh
|
| 5 |
- en
|
| 6 |
widget:
|
| 7 |
+
- text: TurnSense three-class speech turn detection demo
|
| 8 |
output:
|
| 9 |
url: image/PR_new.mp4
|
| 10 |
---
|
|
|
|
| 21 |
|
| 22 |
<br/>
|
| 23 |
|
| 24 |
+
<center><strong>47M Parameters | CPU Latency ~55ms | F1 up to 96.35% | Invalid Utterance Filtering</strong></center>
|
|
|
|
| 25 |
|
| 26 |
<br/>
|
| 27 |
|
|
|
|
| 38 |
|
| 39 |
<br/>
|
| 40 |
|
| 41 |
+
> **⭐ If TurnSense is useful to you, please give us a Star!** This helps us continue improving the model and documentation.
|
| 42 |
|
| 43 |
<br/>
|
| 44 |
|
| 45 |
## 📖 Table of Contents
|
| 46 |
|
| 47 |
+
- [News](#-news)
|
| 48 |
- [Why TurnSense](#-why-turnsense)
|
| 49 |
+
- [Introduction](#-introduction)
|
| 50 |
+
- [Core Features](#-core-features)
|
| 51 |
- [Model Size Comparison](#-model-size-comparison)
|
| 52 |
- [Benchmark Results](#-benchmark-results)
|
| 53 |
- [Quick Start](#-quick-start)
|
| 54 |
- [Evaluation Guide](#-evaluation-guide)
|
| 55 |
- [Citation](#-citation)
|
| 56 |
+
- [Questions and Contact](#-questions-and-contact)
|
| 57 |
- [License](#-license)
|
| 58 |
|
| 59 |
<br/>
|
|
|
|
| 62 |
|
| 63 |
<br/>
|
| 64 |
|
| 65 |
+
## 📰 News
|
| 66 |
+
|
| 67 |
+
- **2026.05.22**: Released **TurnSense 1.1**, an English-enhanced version that improves `complete / incomplete` semantic completeness detection for English input and is suited to mixed Chinese-English dialogue. The model is available on Hugging Face: [brgroup/TurnSense](https://huggingface.co/brgroup/TurnSense).
|
| 68 |
+
|
| 69 |
+
<br/>
|
| 70 |
+
|
| 71 |
+
---
|
| 72 |
+
|
| 73 |
+
<br/>
|
| 74 |
+
|
| 75 |
## 🏆 Why TurnSense
|
| 76 |
|
| 77 |
<div align="center">
|
| 78 |
|
| 79 |
| Dimension | TurnSense Performance |
|
| 80 |
| :---: | :---: |
|
| 81 |
+
| 🎯 **Accuracy** | F1 **96.35%** on `easyturn_real_test_ZH` — best among comparable models |
|
| 82 |
+
| ⚡ **Inference Latency** | CPU p50 ≈ **54.65ms** — suitable for real-time interaction |
|
| 83 |
+
| 📦 **Model Size** | Only **47M** parameters, with an INT8 version of about **50MB** |
|
| 84 |
+
| 🧠 **Classification Ability** | The first open-source model to natively support **complete / incomplete / invalid** three-class detection |
|
| 85 |
+
| 🚫 **Invalid Filtering** | Invalid utterance F1 reaches **94.34%**, effectively reducing noise-triggered false activations |
|
| 86 |
+
| 🤗 **Open-Source Friendly** | Provides FP32 / INT8 ONNX models, ready to use out of the box |
|
| 87 |
|
| 88 |
</div>
|
| 89 |
|
|
|
|
| 93 |
|
| 94 |
<br/>
|
| 95 |
|
| 96 |
+
## 📌 Introduction
|
| 97 |
|
| 98 |
+
**TurnSense** is a **three-class semantic turn detection model** designed for human-machine speech interaction. It focuses on a core problem in conversational systems:
|
| 99 |
|
| 100 |
+
> **Should the system respond immediately while the user is speaking, or should it keep waiting?**
|
| 101 |
|
| 102 |
+
Traditional approaches usually perform only binary "end-of-turn" detection. **TurnSense goes further** by jointly modeling semantic completeness and invalid input detection. This helps systems achieve more natural turn-taking in complex real-world scenarios and significantly reduces premature interruption, overlapping speech, and invalid triggers.
|
| 103 |
|
| 104 |
<div align="center">
|
| 105 |
+
<img src="./image/TurnSense.svg" alt="TurnSense three-class diagram" width="820"/>
|
| 106 |
</div>
|
| 107 |
|
|
|
|
| 108 |
<div align="center">
|
| 109 |
<video src="https://huggingface.co/brgroup/TurnSense/resolve/main/image/PR_new.mp4"
|
| 110 |
width="820"
|
|
|
|
| 117 |
</video>
|
| 118 |
</div>
|
| 119 |
|
|
|
|
| 120 |
TurnSense classifies user input into three semantic states:
|
| 121 |
|
| 122 |
+
| State | Meaning | Example |
|
| 123 |
| :---: | :--- | :--- |
|
| 124 |
+
| ✅ **Complete** | The user's expression forms a complete intent, and the system can respond | `"Please check tomorrow's weather in Shanghai."` |
|
| 125 |
+
| ⏳ **Incomplete** | The user's expression is not finished and may continue after a pause or truncation | `"I want to ask about that order from yesterday..."` |
|
| 126 |
+
| 🔇 **Invalid** | The input does not form valid semantic content and should not trigger a response | `"...(continuous noise / nonverbal vocalization)"` |
|
| 127 |
|
| 128 |
+
These three labels allow the system to determine not only **"whether it should take the turn"**, but also **"whether the input is worth responding to"**. This improves interaction naturalness and system stability in voice assistants, real-time calls, intelligent customer service, and other speech interaction scenarios.
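As a rough illustration of how a dialogue system might act on the three labels, the sketch below maps each label to a turn-taking action. The `TurnAction` enum and `decide_action` helper are hypothetical names introduced here for illustration; they are not part of the TurnSense repository, and the mapping simply follows the "Meaning" column of the table above.

```python
from enum import Enum


class TurnAction(Enum):
    """Hypothetical actions a dialogue system could take (illustrative only)."""
    RESPOND = "respond"  # take the turn and answer the user
    WAIT = "wait"        # keep listening, the user may continue
    IGNORE = "ignore"    # drop the input, no response is triggered


# Mapping derived from the label meanings in the table above.
LABEL_TO_ACTION = {
    "complete": TurnAction.RESPOND,
    "incomplete": TurnAction.WAIT,
    "invalid": TurnAction.IGNORE,
}


def decide_action(label: str) -> TurnAction:
    """Translate a predicted label into a turn-taking action."""
    try:
        return LABEL_TO_ACTION[label]
    except KeyError:
        raise ValueError(f"unknown label: {label!r}") from None


if __name__ == "__main__":
    for label in ("complete", "incomplete", "invalid"):
        print(label, "->", decide_action(label).value)
```

A real system would likely add timeouts on top of this, for example responding anyway if an `incomplete` state persists through a long silence; that policy is outside what the README describes.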
|
| 129 |
|
| 130 |
<br/>
|
| 131 |
|
|
|
|
| 133 |
|
| 134 |
<br/>
|
| 135 |
|
| 136 |
+
## ✨ Core Features
|
| 137 |
|
| 138 |
### 🧠 Semantic-Level Three-Class Detection
|
| 139 |
|
| 140 |
+
TurnSense jointly models `complete / incomplete / invalid` states. Compared with traditional binary turn detection, this is closer to real conversational behavior. It is also the only open-source solution that natively supports invalid semantic detection.
|
| 141 |
|
| 142 |
+
### ⚡ Extremely Lightweight and Fast
|
| 143 |
|
| 144 |
+
TurnSense has only **47M** parameters. The INT8 version is about **50MB**. In CPU environments, it achieves p50 latency of about **54.65ms** and p90 latency of about **58.00ms**, enabling real-time interaction without requiring a GPU.
|
| 145 |
|
| 146 |
+
### 🎯 Strong Accuracy
|
| 147 |
|
| 148 |
+
On `easyturn_real_test_ZH` with 300 samples, TurnSense achieves **F1 96.35%** for `complete` and **F1 96.32%** for `incomplete`. On `semantic_test_ZH` with 2000 samples, it achieves **F1 92.30%** for `complete` and **F1 91.62%** for `incomplete`, reaching best or second-best performance among comparable models.
|
| 149 |
|
| 150 |
### 🚫 Invalid Input Filtering
|
| 151 |
|
| 152 |
+
On the NonverbalVocalization dataset, invalid utterance detection reaches **100% precision**, **90.37% recall**, and **94.34% F1**, effectively suppressing false activations caused by nonverbal vocalizations and noise.
|
| 153 |
|
| 154 |
+
### ⚖️ More Robust Turn-Taking Decisions
|
| 155 |
|
| 156 |
+
TurnSense balances precision and recall in semantically ambiguous, paused, or colloquial speech scenarios, reducing premature responses and missed responses.
|
| 157 |
|
| 158 |
+
### 📊 Reproducible Evaluation Pipeline
|
| 159 |
|
| 160 |
+
The project includes a complete evaluation workflow and scripts, supporting unified metric comparison and performance regression analysis to ensure reproducibility.
|
| 161 |
|
| 162 |
+
### 🤗 Open-Source Friendly and Ready to Use
|
| 163 |
|
| 164 |
+
TurnSense provides a standardized repository structure and FP32 / INT8 ONNX models. Installation and inference can be completed within minutes.
|
| 165 |
|
| 166 |
<br/>
|
| 167 |
|
|
|
|
| 184 |
|
| 185 |
</div>
|
| 186 |
|
| 187 |
+
> 💡 With only **47M** parameters, TurnSense provides native three-class detection and achieves a strong balance between accuracy and model size.
|
| 188 |
|
| 189 |
<br/>
|
| 190 |
|
|
|
|
| 194 |
|
| 195 |
## 📊 Benchmark Results
|
| 196 |
|
| 197 |
+
> The following results cover Chinese, English, and invalid-utterance test sets. Chinese results mainly demonstrate the capability of the initial TurnSense version, while English results show the enhanced performance of TurnSense 1.1.
|
| 198 |
|
| 199 |
<br/>
|
| 200 |
|
| 201 |
+
### 📋 easyturn_real_test_ZH (300 samples)
|
| 202 |
|
| 203 |
+
> Data source: real samples from [Easy-Turn-Testset](https://huggingface.co/datasets/ASLP-lab/Easy-Turn-Testset)
|
| 204 |
|
| 205 |
| Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** | p50 Latency (ms) | p90 Latency (ms) |
|
| 206 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|
|
|
|
| 211 |
| NAMO-Turn | 81.53% | 85.33% | 83.39% | 84.62% | 80.67% | 82.59% | 3.60 | 83.44 |
|
| 212 |
| **⭐ TurnSense** | 96.03% | **96.67%** | **🏆 96.35%** | **96.64%** | 96.00% | **🏆 96.32%** | 54.65 | 58.00 |
|
| 213 |
|
| 214 |
+
> **🔍 Key finding:** TurnSense achieves the highest F1 for both `complete` and `incomplete`, and is the only model that reaches F1 > 96% with CPU p50 latency below 60ms.
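
The F1 columns are the harmonic mean of the precision and recall columns. The short sketch below recomputes the TurnSense row from the P and R values in the table; `f1_score` is a helper written for this example, not a function from the TurnSense codebase.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# TurnSense row of the easyturn_real_test_ZH table.
f1_complete = f1_score(0.9603, 0.9667)    # P (complete), R (complete)
f1_incomplete = f1_score(0.9664, 0.9600)  # P (incomplete), R (incomplete)

print(f"F1 (complete)   = {f1_complete:.2%}")
print(f"F1 (incomplete) = {f1_incomplete:.2%}")
```

Rounded to two decimals, these reproduce the 96.35% and 96.32% reported in the table.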
|
| 215 |
|
| 216 |
<br/>
|
| 217 |
|
| 218 |
+
### 📋 semantic_test_ZH (2000 samples)
|
| 219 |
|
| 220 |
+
> Data source: Chinese test set from [KE-Team/SemanticVAD-Dataset](https://huggingface.co/datasets/KE-Team/SemanticVAD-Dataset)
|
| 221 |
|
| 222 |
| Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** | p50 Latency (ms) | p90 Latency (ms) |
|
| 223 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|
|
|
|
| 228 |
| NAMO-Turn | 71.48% | 86.70% | 78.36% | 83.10% | 65.40% | 73.20% | 3.60 | 83.44 |
|
| 229 |
| **⭐ TurnSense** | **88.96%** | 95.90% | **🏆 92.30%** | 95.55% | **88.00%** | **🏆 91.62%** | 54.65 | 58.00 |
|
| 230 |
|
| 231 |
+
> **🔍 Key finding:** On the larger 2000-sample test set, TurnSense again achieves the highest F1 for both `complete` and `incomplete`, which suggests it generalizes well beyond the smaller test set.
|
| 232 |
|
| 233 |
<br/>
|
| 234 |
|
| 235 |
+
### 📋 TurnSense 1.1 English Enhancement Results
|
| 236 |
+
|
| 237 |
+
> Model download: [Hugging Face - brgroup/TurnSense](https://huggingface.co/brgroup/TurnSense)
|
| 238 |
+
|
| 239 |
+
> TurnSense 1.1 focuses on improving semantic completeness detection in English scenarios. The following results show its `complete / incomplete` performance on English test sets.
|
| 240 |
|
| 241 |
+
#### ten_test_EN
|
| 242 |
|
| 243 |
+
| Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** |
|
| 244 |
+
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
|
| 245 |
+
| Smart-Turn-v3 | 70.66% | 72.46% | 71.55% | 65.05% | 63.02% | 64.02% |
|
| 246 |
+
| TEN-Turn | **98.61%** | 90.25% | **94.25%** | 89.15% | **98.44%** | **93.56%** |
|
| 247 |
+
| FireRedChat | 76.41% | **97.46%** | 85.66% | **95.28%** | 63.02% | 75.86% |
|
| 248 |
+
| NAMO-Turn | <u>92.65%</u> | 26.69% | 41.45% | 51.94% | <u>97.40%</u> | 67.75% |
|
| 249 |
+
| **⭐ TurnSense 1.1 int8** | 83.01% | 91.10% | 86.87% | 87.57% | 77.08% | <u>81.99%</u> |
|
| 250 |
|
| 251 |
+
#### semantic_test_EN
|
| 252 |
+
|
| 253 |
+
| Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** |
|
| 254 |
+
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
|
| 255 |
+
| Smart-Turn-v3 | 68.18% | 75.00% | 71.43% | 72.22% | 65.00% | 68.42% |
|
| 256 |
+
| TEN-Turn | **97.98%** | 97.00% | **97.49%** | **97.03%** | **98.00%** | **97.51%** |
|
| 257 |
+
| FireRedChat | 72.06% | **98.00%** | 83.05% | 96.88% | 62.00% | 75.61% |
|
| 258 |
+
| NAMO-Turn | <u>93.55%</u> | 87.00% | <u>90.16%</u> | 87.85% | <u>94.00%</u> | <u>90.82%</u> |
|
| 259 |
+
| **⭐ TurnSense 1.1 int8** | 74.60% | 94.00% | 83.19% | <u>91.89%</u> | 68.00% | 78.16% |
|
| 260 |
+
|
| 261 |
+
<br/>
|
| 262 |
+
|
| 263 |
+
### 📋 NonverbalVocalization_invalid (728 samples)
|
| 264 |
+
|
| 265 |
+
> Data source: OpenSLR [Deeply Nonverbal Vocalization Dataset (SLR99)](https://openslr.elda.org/99/)
|
| 266 |
+
|
| 267 |
+
| Model | R (invalid) |
|
| 268 |
+
| :--- | :---: |
|
| 269 |
+
| **⭐ TurnSense** | **90.37%** |
|
| 270 |
+
|
| 271 |
+
> **🔍 Key finding:** TurnSense supports invalid semantic detection and can effectively reduce system responses triggered by nonverbal vocalizations or noise.
|
| 272 |
|
| 273 |
<br/>
|
| 274 |
|
|
|
|
| 287 |
pip install -U numpy onnxruntime torch librosa soundfile pandas scikit-learn huggingface_hub
|
| 288 |
```
|
| 289 |
|
| 290 |
+
### 2. Download Model Weights
|
| 291 |
|
| 292 |
TurnSense model weights are available on Hugging Face: [Baiji-Team/TurnSense](https://huggingface.co/brgroup/TurnSense)
|
| 293 |
|
| 294 |
| Version | Size | Use Case |
|
| 295 |
| :--- | :--- | :--- |
|
| 296 |
+
| FP32 | ~191 MB | Accuracy-first scenarios |
|
| 297 |
+
| INT8 | ~50 MB | Deployment-first scenarios (recommended) |
|
| 298 |
+
|
| 299 |
+
**Download options:**
|
| 300 |
|
| 301 |
+
**Option 1: Automatic download (recommended)**
|
| 302 |
|
| 303 |
+
The inference script includes Hugging Face download logic and will automatically download and cache the model during the first run.
|
|
|
|
| 304 |
|
| 305 |
**Option 2: Git LFS**
|
| 306 |
|
|
|
|
| 324 |
|
| 325 |
Example output:
|
| 326 |
|
| 327 |
+
```text
|
| 328 |
Loading model from brgroup/TurnSense...
|
| 329 |
+
Running inference on: "I want to ask about that order from yesterday..."
|
| 330 |
|
| 331 |
Results:
|
| 332 |
+
Input: "I want to ask about that order from yesterday..."
|
| 333 |
TurnSense Detection Result: "incomplete"
|
| 334 |
```
|
| 335 |
|
|
|
|
| 341 |
|
| 342 |
## 🧪 Evaluation Guide
|
| 343 |
|
| 344 |
+
### 1. Evaluation Pipeline
|
| 345 |
|
| 346 |
+
1. Read test datasets in `.jsonl` format.
|
| 347 |
+
2. Warm up each model before timing starts (default: `warmup_iters=20`).
|
| 348 |
+
3. Run inference sample by sample and collect classification and performance metrics.
|
| 349 |
+
4. Automatically export summary reports and detailed result files.
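
The warm-up and per-sample timing steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `benchmark.py`: `measure_latency_ms` and `percentile` are hypothetical helpers, and `predict` stands in for whatever callable runs a single model inference. Only the `warmup_iters=20` default is taken from the README.

```python
import math
import time
from typing import Callable, Dict, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def measure_latency_ms(
    predict: Callable[[object], object],
    samples: Sequence[object],
    warmup_iters: int = 20,
) -> Dict[str, float]:
    """Warm up the model, then time one inference per sample (milliseconds)."""
    if not samples:
        raise ValueError("samples must not be empty")
    # Warm-up runs are executed but not timed.
    for i in range(warmup_iters):
        predict(samples[i % len(samples)])
    latencies = []
    for sample in samples:
        start = time.perf_counter()
        predict(sample)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return {"p50": percentile(latencies, 50), "p90": percentile(latencies, 90)}


if __name__ == "__main__":
    # Dummy predictor so the sketch runs without a model.
    stats = measure_latency_ms(lambda s: len(str(s)), ["a", "bb", "ccc"])
    print(stats)
```

Excluding the warm-up runs keeps one-time costs such as session initialization and cache fills out of the reported p50 and p90 values.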
|
| 350 |
|
| 351 |
Output files include:
|
| 352 |
|
|
|
|
| 355 |
| `report.md` | Summary evaluation report |
|
| 356 |
| `results.json` | Structured evaluation results |
|
| 357 |
| `config.json` | Evaluation configuration |
|
| 358 |
+
| `per_sample__*.jsonl` | Per-sample prediction results |
|
| 359 |
|
| 360 |
+
### 2. Data Format Requirements (JSONL)
|
| 361 |
|
| 362 |
+
Each line should be a JSON object containing at least the following fields:
|
| 363 |
|
| 364 |
| Field | Description |
|
| 365 |
| :--- | :--- |
|
| 366 |
| `audio_path` | Path to the audio file |
|
| 367 |
| `text` | Text content |
|
| 368 |
+
| `label` | Label: `complete` / `incomplete` / `invalid` |
|
| 369 |
|
| 370 |
Example:
|
| 371 |
|
| 372 |
```jsonl
|
| 373 |
+
{"audio_path":"/001.wav","text":"Please check tomorrow's weather in Shanghai.","label":"complete"}
|
| 374 |
+
{"audio_path":"/002.wav","text":"I want to ask about that order from yesterday...","label":"incomplete"}
|
| 375 |
+
{"audio_path":"/003.wav","text":"uh... hmm... continuous noise","label":"invalid"}
|
| 376 |
```
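Before running the benchmark it can help to check that every line of a dataset file matches this format. The sketch below is one way to do that with the standard library; `validate_jsonl_line` is a hypothetical helper written for this example, and the benchmark script itself may perform different or additional checks.

```python
import json

REQUIRED_FIELDS = ("audio_path", "text", "label")
VALID_LABELS = {"complete", "incomplete", "invalid"}


def validate_jsonl_line(line: str) -> dict:
    """Parse one JSONL line and check the required fields and label value."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if record["label"] not in VALID_LABELS:
        raise ValueError(f"invalid label: {record['label']!r}")
    return record


if __name__ == "__main__":
    sample = (
        '{"audio_path":"/002.wav",'
        '"text":"I want to ask about that order from yesterday...",'
        '"label":"incomplete"}'
    )
    print(validate_jsonl_line(sample)["label"])
```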
|
| 377 |
|
| 378 |
+
### 3. Run Evaluation
|
| 379 |
|
| 380 |
```bash
|
| 381 |
python TurnSense/Turn_benchmark/benchmark.py
|
|
|
|
| 403 |
|
| 404 |
<br/>
|
| 405 |
|
| 406 |
+
<br/>
|
| 407 |
+
|
| 408 |
+
## ❓ Questions and Contact
|
| 409 |
|
| 410 |
+
If you have questions or suggestions, feel free to contact us through the following channels:
|
| 411 |
|
| 412 |
| Channel | Contact |
|
| 413 |
| :--- | :--- |
|
| 414 |
+
| 📧 Email | [huan.shen@brgroup.com](mailto:huan.shen@brgroup.com) ・ [yingao.wang@brgroup.com](mailto:yingao.wang@brgroup.com) ・ [wei.zou@brgroup.com](mailto:wei.zou@brgroup.com) |
|
| 415 |
| 💬 WeChat | h2538406363 |
|
| 416 |
| 👥 WeChat Group | Scan the QR code to join the group<br><img src="image/wechat.jpg" alt="WeChat group QR code" width="220" /> |
|
| 417 |
| 🐛 Issues | [GitHub Issues](https://github.com/Bairong-Xdynamics/TurnSense/issues) |
|
|
|
|
| 421 |
|
| 422 |
## 📄 License
|
| 423 |
|
| 424 |
+
This project is released under the **Apache License 2.0** with additional specific restrictions. See [LICENSE](./LICENSE) for details.
|
| 425 |
|
| 426 |
<br/>
|
| 427 |
|
|
|
|
| 431 |
|
| 432 |
**Built with ❤️ by [Baiji Team](https://github.com/Bairong-Xdynamics)**
|
| 433 |
|
| 434 |
+
</div>
|