---
license: apache-2.0
---
# TurnSense
### ๐ฏ Lightweight ยท Accurate ยท Three-Class โ Redefining Speech Turn Detection
47M ๅๆฐ ๏ฝ CPU ๅปถ่ฟ ~55ms ๏ฝ F1 ้ซ่พพ 96.35% ๏ฝ ๆ ๆ่ฏญไน่ฟๆปค
[](https://github.com/Bairong-Xdynamics/TurnSense)
[](https://huggingface.co/brgroup/TurnSense)
[](./LICENSE)
[](https://github.com/Bairong-Xdynamics/TurnSense)
**Language**: **English** | [ไธญๆ](./README_zh.md)
> **โญ If TurnSense is useful to you, please give us a Star!** It helps us keep improving the model and documentation.
## ๐ Table of Contents
- [Why TurnSense](#-why-turnsense)
- [Overview](#-overview)
- [Key Features](#-key-features)
- [Model Size Comparison](#-model-size-comparison)
- [Benchmark Results](#-benchmark-results)
- [Quick Start](#-quick-start)
- [Evaluation Guide](#-evaluation-guide)
- [Citation](#-citation)
- [Contact & Community](#-contact--community)
- [License](#-license)
---
## ๐ Why TurnSense
| Dimension | TurnSense Performance |
| :---: | :---: |
| ๐ฏ **Accuracy** | F1 **96.35%** (easyturn_real_test_ZH) โ best in class |
| โก **Inference Latency** | CPU p50 โ **54.65ms** โ real-time interaction ready |
| ๐ฆ **Model Size** | Only **47M** parameters, INT8 version only **~50MB** |
| ๐ง **Classification** | First open-source model natively supporting **complete / incomplete / invalid** three-class detection |
| ๐ซ **Invalid Filtering** | Invalid utterance F1 reaches **94.34%**, effectively suppressing noise-triggered responses |
| ๐ค **Open-Source Friendly** | FP32 / INT8 ONNX provided, ready to use out of the box |
---
## ๐ Overview
**TurnSense** is a **three-class semantic detection model** designed for human-machine voice interaction, focused on solving a critical problem in dialogue systems:
> **During a user's speech, should the system respond immediately, or continue waiting?**
Traditional approaches typically rely on a simple binary classification โ "finished or not." **TurnSense goes further** by simultaneously modeling semantic completeness and invalid input detection, enabling more natural turn-taking in complex real-world scenarios and **significantly reducing false interruptions, premature responses, and noise-triggered activations**.
TurnSense classifies user input into three semantic states:
| State | Description | Example |
| :---: | :--- | :--- |
| โ
**Complete** | The user has expressed a complete intent; the system can respond | `"Check tomorrow's weather in Shanghai for me."` |
| โณ **Incomplete** | The user's expression is unfinished โ truncated, paused, or trailing off | `"I'd like to ask about that order from yesterday..."` |
| ๐ **Invalid** | The input does not constitute meaningful speech and should not trigger a response | `"...(continuous noise / non-verbal vocalization)"` |
These three labels enable the system to determine not only **"should I respond?"** but also **"is it worth responding to?"** โ significantly improving interaction naturalness and system stability in voice assistants, real-time calls, intelligent customer service, and more.
---
## โจ Key Features
### ๐ง Semantic-Level Three-Class Detection
Simultaneously models `complete / incomplete / invalid` states โ closer to real conversational behavior than traditional binary classification, and currently the **only open-source solution with native invalid utterance detection**.
### โก Ultra-Lightweight, Ultra-Fast Inference
Only **47M** parameters (INT8 version ~50MB). CPU inference latency: p50 โ **54.65ms**, p90 โ **58.00ms** โ meets the strict requirements of real-time interaction **without a GPU**.
### ๐ฏ Leading Accuracy
Achieves **F1 96.35%** (complete) and **F1 96.32%** (incomplete) on easyturn_real_test_ZH (300 samples), and **F1 92.30%** (complete) and **F1 91.62%** (incomplete) on semantic_test_ZH (2000 samples) โ best or runner-up among all comparable models.
### ๐ซ Invalid Input Filtering
On the NonverbalVocalization test set, invalid utterance precision reaches **100%** with recall of **90.37%** (F1 = 94.34%), effectively suppressing false triggers from non-verbal sounds and noise.
### โ๏ธ More Robust Turn Decisions
Balances precision and recall in semantically ambiguous, pause-heavy, or colloquial scenarios, reducing both premature responses and missed responses.
### ๐ Reproducible Evaluation Framework
Ships with a complete evaluation pipeline and scripts, supporting unified metric comparison and performance regression analysis for full reproducibility.
### ๐ค Open-Source Friendly, Plug-and-Play
Standardized repository structure with FP32 / INT8 ONNX models โ from installation to inference in just a few minutes.
---
## ๐ Model Size Comparison
| Model | Parameters | Three-Class | Link |
| :--- | :---: | :---: | :--- |
| TEN-Turn | **7B** | โ | [TEN-framework/TEN_Turn_Detection](https://huggingface.co/TEN-framework/TEN_Turn_Detection) |
| Easy-Turn | 850M | โ | [ASLP-lab/Easy-Turn](https://huggingface.co/ASLP-lab/Easy-Turn) |
| NAMO-Turn-Detector (ZH) | 66M | โ | [videosdk-live/Namo-Turn-Detector-v1-Multilingual](https://huggingface.co/videosdk-live/Namo-Turn-Detector-v1-Multilingual) |
| **โญ TurnSense** | **47M** | **โ
** | [**Baiji-Team/TurnSense**](https://huggingface.co/brgroup/TurnSense) |
| Smart-Turn-v3 | 8M | โ | [pipecat-ai/smart-turn-v3](https://huggingface.co/pipecat-ai/smart-turn-v3) |
| FireRedChat-turn-detector | -- | โ | [FireRedTeam/FireRedChat-turn-detector](https://huggingface.co/FireRedTeam/FireRedChat-turn-detector) |
> ๐ก With only **47M** parameters, TurnSense achieves three-class capability โ the best balance between accuracy and model size.
---
## ๐ Benchmark Results
> All results below are based on open-source Chinese evaluation sets. Latency marked with `(GPU)` indicates GPU environment; otherwise, latency was measured on **CPU**.
### ๐ easyturn_real_test_ZH (300 samples)
> Data source: Real data samples from [Easy-Turn-Testset](https://huggingface.co/datasets/ASLP-lab/Easy-Turn-Testset)
| Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** | p50 Latency | p90 Latency |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Easy-Turn | 97.26% | 94.67% | 95.95% | 94.81% | 97.33% | 96.05% | 183.87 (GPU) | 300.37 (GPU) |
| Smart-Turn-v3 | 64.97% | 76.67% | 70.34% | 71.54% | 58.67% | 64.47% | 36.84 | 39.10 |
| TEN-Turn | **99.25%** | 88.00% | 93.29% | 89.22% | **99.33%** | 94.01% | 17.66 (GPU) | 19.41 (GPU) |
| FireRedChat | 70.65% | 94.67% | 80.91% | 91.92% | 60.67% | 73.09% | 98.30 | 99.42 |
| NAMO-Turn | 81.53% | 85.33% | 83.39% | 84.62% | 80.67% | 82.59% | 3.60 | 83.44 |
| **โญ TurnSense** | 96.03% | **96.67%** | **๐ 96.35%** | **96.64%** | 96.00% | **๐ 96.32%** | 54.65 | 58.00 |
> **๐ Key Finding:** TurnSense achieves the **highest F1** on both complete and incomplete classes, and is the only model with CPU p50 < 60ms while maintaining F1 > 96%.
### ๐ semantic_test_ZH (2000 samples)
> Data source: Chinese test split from [KE-Team/SemanticVAD-Dataset](https://huggingface.co/datasets/KE-Team/SemanticVAD-Dataset)
| Model | P (complete) | R (complete) | **F1 (complete)** | P (incomplete) | R (incomplete) | **F1 (incomplete)** | p50 Latency | p90 Latency |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Easy-Turn | 78.14% | 98.30% | 87.07% | 97.64% | 70.30% | 81.74% | 183.87 (GPU) | 300.37 (GPU) |
| Smart-Turn-v3 | 59.25% | 88.10% | 70.85% | 76.80% | 39.40% | 52.08% | 36.84 | 39.10 |
| TEN-Turn | 85.25% | **99.60%** | 91.87% | **99.52%** | 82.70% | 90.33% | 17.66 (GPU) | 19.41 (GPU) |
| FireRedChat | 66.76% | 99.40% | 79.87% | 98.83% | 50.50% | 66.84% | 98.30 | 99.42 |
| NAMO-Turn | 71.48% | 86.70% | 78.36% | 83.10% | 65.40% | 73.20% | 3.60 | 83.44 |
| **โญ TurnSense** | **88.96%** | 95.90% | **๐ 92.30%** | 95.55% | **88.00%** | **๐ 91.62%** | 54.65 | 58.00 |
> **๐ Key Finding:** On the larger 2000-sample test set, TurnSense still maintains the best F1, demonstrating strong generalization capability.
### ๐ NonverbalVocalization_invalid (728 samples)
> Data source: OpenSLR [Deeply Nonverbal Vocalization Dataset (SLR99)](https://openslr.elda.org/99/)
| Model | P (invalid) | R (invalid) | **F1 (invalid)** |
| :--- | :---: | :---: | :---: |
| **โญ TurnSense** | **100.00%** | **90.37%** | **๐ 94.34%** |
> **๐ Key Finding:** TurnSense is currently the only model that supports invalid utterance detection. A precision of **100%** means zero false positives โ effectively preventing noise from triggering system responses.
---
## ๐ Quick Start
### 1. Installation
```bash
git clone https://github.com/Bairong-Xdynamics/TurnSense.git
cd TurnSense
pip install -U numpy onnxruntime torch librosa soundfile pandas scikit-learn huggingface_hub
```
### 2. Model Weights
TurnSense model weights are available on Hugging Face: [Baiji-Team/TurnSense](https://huggingface.co/brgroup/TurnSense)
| Version | Size | Use Case |
| :--- | :--- | :--- |
| FP32 | ~191 MB | Accuracy-first |
| INT8 | ~50 MB | Deployment-first (recommended) |
**Download Options:**
**Option 1: Auto-download (Recommended)**
The inference script includes built-in Hugging Face download logic. The model will be automatically fetched and cached on first run.
**Option 2: Git LFS**
```bash
git lfs install
git clone https://huggingface.co/brgroup/TurnSense
```
**Option 3: Hugging Face Hub**
```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="brgroup/TurnSense")
```
### 3. Inference
```bash
python infer.py
```
Example output:
```
Loading model from brgroup/TurnSense...
Running inference on: "ๆๆณ้ฎไธไธ้ฃไธช่ฎขๅๅฐฑๆฏๆจๅคฉ..."
Results:
Input: "ๆๆณ้ฎไธไธ้ฃไธช่ฎขๅๅฐฑๆฏๆจๅคฉ..."
TurnSense Detection Result: "incomplete"
```
---
## ๐งช Evaluation Guide
### 1) Evaluation Pipeline
1. Load the `.jsonl` test dataset (line-by-line JSONL)
2. Warm up each model (default `warmup_iters=20`)
3. Run per-sample inference, collecting classification and performance metrics
4. Automatically generate summary and detail files
Output files include:
| File | Description |
| :--- | :--- |
| `report.md` | Summary evaluation report |
| `results.json` | Structured evaluation results |
| `config.json` | Evaluation configuration |
| `per_sample__*.jsonl` | Per-sample prediction details |
### 2) Data Format (JSONL)
Each line is a JSON object containing at least the following fields:
| Field | Description |
| :--- | :--- |
| `audio_path` | Path to the audio file |
| `text` | Text content |
| `label` | Label (`complete` / `incomplete` / `invalid`) |
Example:
```jsonl
{"audio_path":"/001.wav","text":"ๅธฎๆๆฅไธไธๆๅคฉไธๆตทๅคฉๆฐ","label":"complete"}
{"audio_path":"/002.wav","text":"ๆๆณ้ฎไธไธ้ฃไธช่ฎขๅๅฐฑๆฏๆจๅคฉ...","label":"incomplete"}
{"audio_path":"/003.wav","text":"ๅโฆๅฏโฆ๏ผๆ็ปญๅชๅฃฐ๏ผ","label":"invalid"}
```
### 3) Run Evaluation
```bash
python TurnSense/Turn_benchmark/benchmark.py
```
---
## ๐ Citation
If you use TurnSense in your research or product, please cite:
```bibtex
@misc{turnsense2026,
author = {Baiji Team},
title = {TurnSense: A Three-Class Semantic Detection Model for Complete, Incomplete, and Invalid Utterances},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/brgroup/TurnSense}},
}
```
## โ Contact & Community
If you have questions or suggestions, feel free to reach out:
| Channel | Contact |
| :--- | :--- |
| ๐ง Email | [huan.shen@brgroup.com](mailto:huan.shen@brgroup.com) ยท [yingao.wang@brgroup.com](mailto:yingao.wang@brgroup.com) ยท [wei.zou@brgroup.com](mailto:wei.zou@brgroup.com) |
| ๐ฌ WeChat | h2538406363 |
| ๐ฅ WeChat Group | Scan the QR code to join the group
|
| ๐ Issues | [GitHub Issues](https://github.com/Bairong-Xdynamics/TurnSense/issues) |
| ๐ PR | [Pull Requests](https://github.com/Bairong-Xdynamics/TurnSense/pulls) |
## ๐ License
This project is released under the **Apache License 2.0** with certain additional conditions. See [LICENSE](./LICENSE) for details.
---
**Built with โค๏ธ by [Baiji Team](https://github.com/Bairong-Xdynamics)**