---
language:
- en
- es
- fr
- de
- it
- pt
- nl
- zh
- ja
- ko
- id
- tr
- ru
- hi
license: other
license_name: livekit-model-license
license_link: LICENSE
library_name: transformers
pipeline_tag: text-classification
base_model: Qwen/Qwen2.5-0.5B-Instruct
tags:
- voice-ai
- turn-detection
- end-of-utterance
- end-of-turn
- conversational-ai
- livekit
- onnx
- quantized
- knowledge-distillation
---

# LiveKit Turn Detector

An open-weights language model for contextually aware end-of-utterance (EOU) detection in voice AI applications. The model predicts whether a user has finished speaking based on the semantic content of their transcribed speech, providing a critical complement to voice activity detection (VAD) systems.

> **📖 For installation, usage examples, and integration guides, see the [LiveKit documentation](https://docs.livekit.io/agents/logic/turns/turn-detector/).**

## Table of Contents

- [Overview](#overview)
- [Model Variants](#model-variants)
- [How It Works](#how-it-works)
- [Architecture and Training](#architecture-and-training)
- [Supported Languages](#supported-languages)
- [Benchmarks](#benchmarks)
- [Usage](#usage)
- [Deployment Requirements](#deployment-requirements)
- [Limitations](#limitations)
- [License](#license)
- [Resources](#resources)

## Overview

Traditional voice agents rely on voice activity detection (VAD) to determine when a user has finished speaking. VAD works by detecting the presence or absence of speech in an audio signal and applying a silence timer. While effective for detecting pauses, VAD lacks language understanding and frequently produces false positives, treating a mid-thought pause as the end of a turn. For example, a user who says *"I need to think about that for a moment..."* and then pauses will be interrupted by a VAD-only system, even though they clearly intend to continue.

This model adds semantic understanding to the turn detection process. It analyzes the transcribed text of a conversation in real time and predicts the probability that the user has completed their turn. When integrated into a voice pipeline alongside VAD, it substantially reduces unwanted interruptions while maintaining responsiveness.

The model is particularly effective in scenarios involving structured data input, such as dictating addresses, phone numbers, email addresses, and credit card numbers, where natural pauses between segments do not indicate completion.

## Model Variants

| Variant | Revision | Base Model | Size on Disk | Inference Latency | RAM |
|---|---|---|---|---|---|
| **Multilingual** (recommended) | `v0.4.1-intl` | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | ~396 MB | ~50–160 ms | <500 MB |
| **English-only** (deprecated) | `v1.2.2-en` | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | ~200 MB | ~10 ms | <200 MB |

Both variants are distributed as INT8-quantized ONNX models (`model_q8.onnx`) optimized for CPU inference.

> **⚠️ The English-only model (`EnglishModel`) is deprecated.** Use the **multilingual model (`MultilingualModel`)** for all new projects, including English-only applications. The multilingual model provides better accuracy across all languages, including English, thanks to knowledge distillation from a larger teacher model and an expanded training dataset. The English-only variant will not receive further updates.

## How It Works

The model operates on transcribed text from a speech-to-text (STT) system, not on raw audio.

1. **Input**: The recent conversation history (up to **6 turns**, truncated to **128 tokens**) is formatted using the [Qwen chat template](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) with `<|im_start|>` / `<|im_end|>` delimiters. The final user message is left *without* the closing `<|im_end|>` token.

2. **Prediction**: The model predicts the probability that the `<|im_end|>` token appears next. A **high probability** indicates the user has likely finished their utterance; a **low probability** indicates they are likely to continue.

3. **Thresholding**: Per-language thresholds (stored in `languages.json`) convert the raw probability into a binary decision. These thresholds are tuned to balance responsiveness and accuracy for each supported language.

4. **Integration with VAD**: In the LiveKit Agents framework, the model works alongside the [Silero VAD](https://docs.livekit.io/agents/logic/turns/vad/) plugin. VAD handles speech presence detection and interruption triggering, while this model provides the semantic signal for when to commit a turn.
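
The formatting and thresholding steps can be sketched in plain Python. The helper names and threshold values below are illustrative only (the plugin's real thresholds live in `languages.json`), and token-level truncation to 128 tokens is omitted for brevity:

```python
# Illustrative per-language EOU thresholds; not the plugin's actual values.
EOU_THRESHOLDS = {"en": 0.50, "fr": 0.60}

def format_context(turns, max_turns=6):
    """Render the most recent turns with Qwen-style delimiters,
    leaving the final message open (no trailing <|im_end|>)."""
    recent = turns[-max_turns:]
    parts = []
    for i, (role, text) in enumerate(recent):
        parts.append(f"<|im_start|>{role}\n{text}")
        if i < len(recent) - 1:
            parts.append("<|im_end|>\n")
    return "".join(parts)

def is_end_of_utterance(eou_probability, language):
    """Binary decision: compare the model's <|im_end|> probability
    against the per-language threshold."""
    return eou_probability >= EOU_THRESHOLDS.get(language, 0.5)

prompt = format_context([
    ("assistant", "How can I help?"),
    ("user", "my email is john at"),
])
print(prompt.endswith("<|im_end|>"))    # False: last turn left open
print(is_end_of_utterance(0.12, "en"))  # False: user likely continuing
```

The open-ended final message is the key trick: the model is simply asked how likely the turn-closing token is at that position.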

### Text Preprocessing

The **multilingual** variant applies the following normalization before inference:

- NFKC Unicode normalization
- Lowercasing
- Punctuation removal (preserving apostrophes and hyphens)
- Whitespace collapsing

The **English-only** variant passes raw transcribed text through without normalization.
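
A minimal sketch of this normalization in Python, following the four steps listed above (the plugin's exact rules may differ in edge cases):

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Approximate the multilingual variant's preprocessing:
    NFKC normalization, lowercasing, punctuation removal
    (keeping apostrophes and hyphens), whitespace collapsing."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # Drop punctuation (P*) and symbol (S*) characters,
    # except apostrophes and hyphens.
    text = "".join(
        ch for ch in text
        if ch in "'-" or not unicodedata.category(ch).startswith(("P", "S"))
    )
    return " ".join(text.split())

print(normalize("Hello,   World!  It's  e-mail time…"))
# hello world it's e-mail time
```

Note that NFKC also folds compatibility characters (for example, the ellipsis `…` becomes `...`) before the punctuation filter runs.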

## Architecture and Training

### Base Model

Both variants are fine-tuned from [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct), selected for its strong performance on this task while enabling low-latency CPU inference.

### Knowledge Distillation

A **Qwen2.5-7B-Instruct** model was first fine-tuned as a teacher on end-of-turn prediction. Its knowledge was then distilled into the 0.5B student model. The distilled model approaches teacher-level accuracy while maintaining the efficiency of the smaller architecture, converging after approximately 1,500 training steps.
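
The distillation objective can be illustrated with a small, self-contained sketch: the student is trained to match the teacher's temperature-softened next-token distribution via KL divergence. The temperature and the tiny logit vectors here are made-up illustrative values, not the actual training configuration:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    as in standard knowledge distillation."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

t = [2.0, 0.5, -1.0]
# A student that matches the teacher exactly has zero loss.
print(distillation_loss(t, t))                    # 0.0
print(distillation_loss(t, [0.1, 0.1, 0.1]) > 0)  # True
```

Minimizing this loss pulls the student's distribution over candidate next tokens (including `<|im_end|>`) toward the teacher's.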
 
 
 
 
 
 
 
 
 
 

### Training Data

The training dataset is a mix of:

- **Real call center transcripts** covering diverse conversational patterns
- **Synthetic dialogues** emphasizing structured data input: addresses, email addresses, phone numbers, and credit card numbers
- **Multi-format STT outputs** to handle provider variation (e.g., "forty two" vs. "42"), ensuring consistent predictions across different STT engines without runtime overhead

Although structured data enhancements were added only to the English training set, the performance improvements generalized across languages due to the multilingual knowledge encoded in the Qwen2.5 base model.

### Quantization

The trained model is exported to ONNX format and quantized to INT8 (`model_q8.onnx`), enabling efficient CPU-only inference with ONNX Runtime.

## Supported Languages

The multilingual model supports 14 languages. The model relies on the STT provider to report the detected language, which is then used to select the appropriate per-language threshold:

English, Spanish, French, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Indonesian, Turkish, Russian, Hindi

## Benchmarks

### Detection Accuracy (Multilingual Variant)

- **True positive**: the model correctly identifies that the user has finished speaking.
- **True negative**: the model correctly identifies that the user will continue speaking.

| Language | True Positive Rate | True Negative Rate |
|---|---|---|
| Hindi | 99.4% | 96.3% |
| Korean | 99.3% | 94.5% |
| French | 99.3% | 88.9% |
| Indonesian | 99.3% | 89.4% |
| Japanese | 99.3% | 88.8% |
| Dutch | 99.3% | 88.1% |
| Russian | 99.3% | 88.0% |
| German | 99.3% | 87.8% |
| Portuguese | 99.4% | 87.4% |
| Turkish | 99.3% | 87.3% |
| English | 99.3% | 87.0% |
| Chinese | 99.3% | 86.6% |
| Spanish | 99.3% | 86.0% |
| Italian | 99.3% | 85.1% |
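
These rates are standard confusion-matrix quantities. A minimal sketch of how they are computed from labeled evaluation examples (the counts below are hypothetical, chosen only to reproduce the English row's shape):

```python
def rates(tp, fn, tn, fp):
    """True positive rate (recall on 'turn finished') and
    true negative rate (recall on 'turn continues')."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return tpr, tnr

# Hypothetical counts: 993 of 1000 finished turns detected,
# 870 of 1000 continuing turns correctly held open.
tpr, tnr = rates(tp=993, fn=7, tn=870, fp=130)
print(f"TPR {tpr:.1%}, TNR {tnr:.1%}")  # TPR 99.3%, TNR 87.0%
```

A false positive here means a premature interruption, so the true negative rate is the column that most directly reflects conversational quality.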

### Improvement Over Prior Version

The multilingual v0.4.1 release achieved a **39.23% relative improvement** in handling structured inputs (emails, addresses, phone numbers, credit card numbers) compared to the prior version, reducing premature interruptions during data collection scenarios.

## Usage

The model is designed for use as a turn detection plugin within the [LiveKit Agents](https://github.com/livekit/agents) framework.

For complete installation instructions, code examples (Python and Node.js), and configuration options, see the **[LiveKit turn detector plugin documentation](https://docs.livekit.io/agents/logic/turns/turn-detector/)**.

For broader context on how turn detection fits into the voice pipeline (including VAD configuration, interruption handling, and manual turn control), see the **[Turns overview](https://docs.livekit.io/agents/logic/turns/)**.

## Deployment Requirements

- **Runtime**: CPU-only (no GPU required). Uses [ONNX Runtime](https://onnxruntime.ai/) with the `CPUExecutionProvider`.
- **RAM**: <500 MB for the multilingual model.
- **Instance type**: Use compute-optimized instances (e.g., AWS c6i, c7i). Avoid burstable instances (e.g., AWS t3, t4g) to prevent inference timeouts from CPU credit exhaustion.
- **LiveKit Cloud**: The model is deployed globally on LiveKit Cloud. Agents running there automatically use the optimized remote inference service with no local resource requirements.

## Limitations

- **Text-only input**: The model operates on STT-transcribed text and cannot incorporate prosodic cues such as pauses, intonation, or emphasis. Future versions may integrate multimodal audio features.
- **STT dependency**: Prediction quality depends on the accuracy and output format of the upstream STT provider. Mismatches between training and deployment STT formats may degrade performance.
- **Context window**: Limited to 128 tokens across a maximum of 6 conversation turns.
- **Language coverage**: Currently supports 14 languages. Performance on unsupported languages is undefined.
- **Realtime model compatibility**: Cannot be used with audio-native realtime models (e.g., the OpenAI Realtime API) without adding a separate STT service, which incurs additional cost and latency.

## License

This model is released under the [LiveKit Model License](LICENSE).

## Resources

- **[Documentation](https://docs.livekit.io/agents/logic/turns/turn-detector/)**: Full plugin documentation, installation, and integration guide.
- **[Turns Overview](https://docs.livekit.io/agents/logic/turns/)**: How turn detection fits into the LiveKit Agents voice pipeline.
- **[Blog: Improved End-of-Turn Model](https://blog.livekit.io/improved-end-of-turn-model-cuts-voice-ai-interruptions-39/)**: Technical deep dive on the multilingual distillation approach and benchmarks.
- **[Blog: Using a Transformer for Turn Detection](https://blog.livekit.io/using-a-transformer-to-improve-end-of-turn-detection/)**: Original blog post introducing the concept and architecture.
- **[Video: LiveKit Turn Detector](https://youtu.be/OZG0oZKctgw)**: Overview video demonstrating the plugin.
- **[GitHub: Plugin Source](https://github.com/livekit/agents/tree/main/livekit-plugins/livekit-plugins-turn-detector)**: Source code for the `livekit-plugins-turn-detector` package.
- **[PyPI](https://pypi.org/project/livekit-plugins-turn-detector/)** | **[npm](https://www.npmjs.com/package/@livekit/agents-plugin-livekit)**: Package registries.