luht committed 7a79bff (verified; parent: e6c7fc4): Update README.md (+139 -3)
---
license: mit
---

# Spoken-Dialogue-Turn-Detection

[![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=Spoken%20Dialogue%20Turn-Detection%20🤠&text2=💖%20Detect%20User's%20End-of-Query&width=800&height=300)](https://github.com/Akshay090/svg-banners)

[![Star](https://img.shields.io/github/stars/wbb921/speech-turn-detection?style=social)](https://github.com/wbb921/speech-turn-detection/stargazers)
[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-yellow)](https://huggingface.co/luht/speech-turn-detection/tree/main)
![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)
[![Discussions](https://img.shields.io/github/discussions/wbb921/speech-turn-detection)](https://github.com/wbb921/speech-turn-detection/discussions)

Spoken dialogue turn detection is the task of distinguishing a short pause from the actual end of a user's query.

Traditional approaches rely on Voice Activity Detection (VAD) with a fixed delay, which often misinterprets short pauses as endpoints, leading to delayed responses or premature cut-offs.

This repository provides an implementation of spoken dialogue turn detection that takes speech directly as input, instead of text, and outputs turn-taking patterns along with speaker turns.

## Installation

```bash
conda create -n turn-detection python=3.10
conda activate turn-detection
apt-get install libsndfile1
git clone https://github.com/wbb921/spoken-dialogue-turn-detection.git
cd spoken-dialogue-turn-detection
pip install -r requirements.txt
```

## Checkpoints

The model is trained on SpokenWOZ (249 h) and Fisher (1,960 h).

The checkpoints can be downloaded from:

https://huggingface.co/luht/speech-turn-detection/blob/main/model_spokenwoz.pt

https://huggingface.co/luht/speech-turn-detection/blob/main/model_fisher_spokenwoz.pt

Once downloaded, place the `.pt` file under the `ckpt/` directory.

## Model Inputs/Outputs

### Inputs

Inputs should be stereo audio with a 24 kHz sampling rate; some samples can be found in the `data/` directory.
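As a quick sanity check before inference, the expected input format (stereo, 24 kHz, 16-bit PCM) can be verified with the standard-library `wave` module. The `is_valid_input` helper below is an illustrative addition, not part of the repository:

```python
import os
import tempfile
import wave

def is_valid_input(path: str) -> bool:
    """Check that a WAV file matches the model's expected input format:
    stereo (2 channels), 24 kHz sampling rate, 16-bit samples."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 2
                and w.getframerate() == 24000
                and w.getsampwidth() == 2)

# write a tiny silent stereo clip just to demonstrate the check
tmp = os.path.join(tempfile.gettempdir(), "demo_stereo_24k.wav")
with wave.open(tmp, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)
    w.setframerate(24000)
    w.writeframes(b"\x00\x00\x00\x00" * 240)  # 240 frames = 10 ms of silence

print(is_valid_input(tmp))  # True
```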

### Outputs

The model outputs several turn-taking patterns: IPU (0), Listen (1), Gap (2), Pause (3), Overlap (4). Gap refers to mutual silence with a speaker change before and after; Pause refers to mutual silence without a speaker change.

The endpoint (speaker turn point) can be seen as the timestamp where IPU (0) turns into Gap (2).

The state transitions are printed to the screen:
```bash
## Channel 0 State Transitions ##
0.00s -> 2.88s ( 2.88s) | State: Gap
2.88s -> 3.28s ( 0.40s) | State: Speak
3.28s -> 4.08s ( 0.80s) | State: Gap
......

## Channel 1 State Transitions ##
0.00s -> 2.88s ( 2.88s) | State: Gap
2.88s -> 3.28s ( 0.40s) | State: Listen
3.28s -> 4.08s ( 0.80s) | State: Gap
```

The model also returns a numpy array of shape (2, T) that stores the turn-taking patterns defined above.

The model outputs at a frequency of 12.5 Hz (one frame every 80 ms).
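Given the (2, T) pattern array and the 12.5 Hz frame rate, the endpoint definition above (the frame where IPU turns into Gap) can be sketched as a small post-processing step. `find_endpoints` is a hypothetical helper for illustration, not part of the repository:

```python
import numpy as np

FRAME_SEC = 0.08  # model output rate: 12.5 Hz, i.e. 80 ms per frame
IPU, LISTEN, GAP, PAUSE, OVERLAP = 0, 1, 2, 3, 4

def find_endpoints(states: np.ndarray, channel: int) -> list[float]:
    """Return timestamps (s) where a channel transitions IPU -> Gap."""
    ch = states[channel]
    # frame i is an endpoint if frame i-1 is IPU and frame i is Gap
    idx = np.where((ch[:-1] == IPU) & (ch[1:] == GAP))[0] + 1
    return [round(i * FRAME_SEC, 2) for i in idx]

# toy example: speaker 0 talks for 5 frames, then mutual silence (Gap)
demo = np.array([[0, 0, 0, 0, 0, 2, 2, 2],
                 [1, 1, 1, 1, 1, 2, 2, 2]])
print(find_endpoints(demo, channel=0))  # -> [0.4]
```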

## Usage

The model is fully causal, so it can be used in either an offline or a streaming manner.

Offline inference:
```bash
python infer.py --audio_path "./data/MUL0001.wav" --checkpoint_path "./ckpt/model_spokenwoz.pt" --output_dir "./inference_results"
```

Streaming inference:
```bash
python infer_streaming.py --audio_path "./data/MUL0001.wav" --checkpoint_path "./ckpt/model_spokenwoz.pt" --output_dir "./inference_results"
```

The turn-taking states are printed to the screen, and the numpy array storing the turn-taking patterns is saved in `./inference_results` under the same name as the input audio, e.g. `MUL0001.npy`.
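The saved array can be collapsed back into the human-readable transition listing. The `segments` helper below is an illustrative sketch (it assumes the state-name mapping shown in the printed output above, with IPU displayed as "Speak"):

```python
import numpy as np

FRAME_SEC = 0.08  # 12.5 Hz output rate
STATE_NAMES = {0: "Speak", 1: "Listen", 2: "Gap", 3: "Pause", 4: "Overlap"}

def segments(arr: np.ndarray) -> list[tuple[float, float, str]]:
    """Collapse one channel's per-frame states into (start_s, end_s, name) runs."""
    out, start = [], 0
    for t in range(1, len(arr) + 1):
        if t == len(arr) or arr[t] != arr[start]:
            out.append((round(start * FRAME_SEC, 2),
                        round(t * FRAME_SEC, 2),
                        STATE_NAMES[int(arr[start])]))
            start = t
    return out

# states = np.load("./inference_results/MUL0001.npy")  # shape (2, T)
demo = np.array([2] * 36 + [0] * 5 + [2] * 10)  # mirrors the printed example
for s, e, name in segments(demo):
    print(f"{s:.2f}s -> {e:.2f}s | State: {name}")
```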

## Train

### Data Preparation

Two things have to be prepared for training:

1. Training audio files (24 kHz, 16-bit, stereo), placed under `/path/to/your/audio_dir`:
```bash
audio_1.wav
audio_2.wav
audio_3.wav
...
```
2. Turn-taking pattern labels: numpy arrays with the same names as the training audio files, placed under `/path/to/your/label_dir`:
```bash
audio_1.npy
audio_2.npy
audio_3.npy
...
```

The labels' time frequency is 12.5 Hz (one frame every 80 ms), so the shape of each numpy array should be (2, T), where T = audio_duration / 80 ms.
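A minimal sketch of building such a (2, T) label array from per-channel utterance timestamps is shown below. It only assigns IPU, Listen, and Gap; the Pause (3) and Overlap (4) cases, and VAD refinement, are handled by the real `data_utils` scripts. `make_labels` and its utterance-tuple format are assumptions for illustration:

```python
import numpy as np

FRAME_SEC = 0.08  # 12.5 Hz label rate
IPU, LISTEN, GAP = 0, 1, 2

def make_labels(utts, duration_s):
    """utts: list of (channel, start_s, end_s) utterance spans.
    Returns a (2, T) int label array, T = duration / 80 ms."""
    T = int(round(duration_s / FRAME_SEC))
    speaking = np.zeros((2, T), dtype=bool)
    for ch, s, e in utts:
        speaking[ch, int(round(s / FRAME_SEC)):int(round(e / FRAME_SEC))] = True
    labels = np.full((2, T), GAP, dtype=np.int64)
    labels[speaking] = IPU
    # a channel is "Listen" when it is silent while the other channel speaks
    labels[0][~speaking[0] & speaking[1]] = LISTEN
    labels[1][~speaking[1] & speaking[0]] = LISTEN
    return labels

# 4 s clip: channel 0 speaks 0.0-1.6 s, channel 1 speaks 2.0-3.2 s
labels = make_labels([(0, 0.0, 1.6), (1, 2.0, 3.2)], duration_s=4.0)
print(labels.shape)  # (2, 50)
```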

In the `data_utils` directory, you can find scripts for preparing turn-taking pattern labels from SpokenWOZ dataset annotations:

1. Using silero_vad to refine the utterance timestamps.
2. Generating the turn-taking labels.

### Start Training

After data preparation, use the following command to start training:

```bash
python train.py --audio_dir /path/to/your/audio_dir --label_dir /path/to/your/label_dir --batch_size 32 --exp_name test
```

## Results

The model achieves an ep-cutoff rate of 4.72% on the SpokenWOZ test set.

| Method                     | ep-50 (ms) | ep-90 (ms) | ep-cutoff (%) |
|----------------------------|------------|------------|---------------|
| Silero_vad (200ms latency) | 240        | 320        | 35.86         |
| Silero_vad (500ms latency) | 560        | 640        | 23.11         |
| The proposed model         | 80         | 400        | 4.72          |