---
license: mit
---

# Spoken-Dialogue-Turn-Detection

[SVG Banners](https://github.com/Akshay090/svg-banners)
[GitHub stars](https://github.com/wbb921/speech-turn-detection/stargazers)
[Model checkpoints](https://huggingface.co/luht/speech-turn-detection/tree/main)
[GitHub Discussions](https://github.com/wbb921/speech-turn-detection/discussions)

Spoken dialogue turn detection is the task of distinguishing a short pause within a user's query from the actual end of it.

Traditional approaches rely on Voice Activity Detection (VAD) with a fixed delay, which often misinterprets short pauses as endpoints, leading to delayed responses or premature cut-offs.

This repository provides an implementation of spoken dialogue turn detection that takes speech directly as input, rather than text, and outputs turn-taking patterns along with speaker turns.

## Installation

```bash
conda create -n turn-detection python=3.10
conda activate turn-detection
apt-get install libsndfile1
git clone https://github.com/wbb921/spoken-dialogue-turn-detection.git
cd spoken-dialogue-turn-detection
pip install -r requirements.txt
```

## Checkpoints

The model is trained on SpokenWOZ (249 h) and Fisher (1960 h).

The checkpoints can be downloaded from:

- https://huggingface.co/luht/speech-turn-detection/blob/main/model_spokenwoz.pt
- https://huggingface.co/luht/speech-turn-detection/blob/main/model_fisher_spokenwoz.pt

Once downloaded, place the `.pt` file under the `ckpt/` directory.

## Model Inputs/Outputs

### Inputs

Inputs should be stereo audio with a 24 kHz sampling rate; some samples can be found in the `data/` directory.
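Since the model expects 24 kHz, 16-bit stereo input, audio in other formats has to be converted first. Below is a minimal sketch of one way to do that with `scipy`; the helper name `to_model_format` and the use of channel duplication for mono input are illustrative assumptions, not part of this repository.

```python
# Hypothetical helper: convert a wav file to the model's expected input
# format (24 kHz, 16-bit, stereo). Names here are illustrative only.
from math import gcd

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

def to_model_format(path_in, path_out, target_sr=24000):
    sr, audio = wavfile.read(path_in)
    audio = audio.astype(np.float32)
    if audio.ndim == 1:                       # mono -> stereo by duplication
        audio = np.stack([audio, audio], axis=1)
    if sr != target_sr:                       # rational-factor resampling
        g = gcd(target_sr, sr)
        audio = resample_poly(audio, target_sr // g, sr // g, axis=0)
    audio = np.clip(audio, -32768, 32767).astype(np.int16)
    wavfile.write(path_out, target_sr, audio)
    return audio.shape

# Example: synthesize a 1 s, 16 kHz mono tone and convert it
sr_in = 16000
t = np.arange(sr_in) / sr_in
tone = (0.3 * 32767 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)
wavfile.write("tone_16k.wav", sr_in, tone)
shape = to_model_format("tone_16k.wav", "tone_24k_stereo.wav")
print(shape)  # (24000, 2)
```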

### Outputs

The model outputs one of several turn-taking patterns per frame: IPU (0), Listen (1), Gap (2), Pause (3), Overlap (4). Gap refers to mutual silence with a speaker change before and after; Pause refers to mutual silence without a speaker change.

The endpoint (speaker-turn point) can be read off as the timestamp where IPU (0) turns into Gap (2).

The outputs are printed on the screen:

```bash
## Channel 0 State Transitions ##
0.00s -> 2.88s ( 2.88s) | State: Gap
2.88s -> 3.28s ( 0.40s) | State: Speak
3.28s -> 4.08s ( 0.80s) | State: Gap
......

## Channel 1 State Transitions ##
0.00s -> 2.88s ( 2.88s) | State: Gap
2.88s -> 3.28s ( 0.40s) | State: Listen
3.28s -> 4.08s ( 0.80s) | State: Gap
```

and also returned as a numpy array of shape (2, T) that stores the turn-taking patterns defined above.

The model outputs at a frequency of 12.5 Hz (one frame every 80 ms).
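Given the state codes and the 80 ms frame hop above, endpoint timestamps can be recovered from a pattern array by finding IPU-to-Gap transitions. A small sketch (the toy array below stands in for a real saved output such as `inference_results/MUL0001.npy`, loaded with `np.load`):

```python
# Sketch: recover endpoint (speaker-turn) timestamps from a (2, T)
# turn-taking pattern array, using the state codes defined above.
import numpy as np

IPU, LISTEN, GAP, PAUSE, OVERLAP = 0, 1, 2, 3, 4
FRAME_S = 0.08                       # 12.5 Hz -> 80 ms per frame

def endpoints(states, frame_s=FRAME_S):
    """Timestamps (s) where IPU (0) turns into Gap (2) on one channel."""
    states = np.asarray(states)
    idx = np.where((states[:-1] == IPU) & (states[1:] == GAP))[0] + 1
    return np.round(idx * frame_s, 2).tolist()

# Toy sequence: channel 0 speaks, pauses mid-query, speaks, then a Gap
patterns = np.array([
    [IPU, IPU, PAUSE, IPU, IPU, GAP, GAP],                # channel 0
    [LISTEN, LISTEN, PAUSE, LISTEN, LISTEN, GAP, GAP],    # channel 1
])
print(endpoints(patterns[0]))  # [0.4]
```

Note that the Pause at frame 2 is correctly not reported as an endpoint; only the IPU-to-Gap transition at frame 5 (0.4 s) is.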

## Usage

The model is fully causal, so it can be used in either an offline or a streaming manner.

Offline inference:

```bash
python infer.py --audio_path "./data/MUL0001.wav" --checkpoint_path "./ckpt/model_spokenwoz.pt" --output_dir "./inference_results"
```

Streaming inference:

```bash
python infer_streaming.py --audio_path "./data/MUL0001.wav" --checkpoint_path "./ckpt/model_spokenwoz.pt" --output_dir "./inference_results"
```

The turn-taking states are printed on the screen, while the numpy array storing the turn-taking patterns is saved in ./inference_results under the same name as the input audio, e.g. "MUL0001.npy".

## Train

### Data Preparation

Two things have to be prepared for training:

1. Training audio files (24 kHz, 16-bit, stereo), placed under /path/to/your/audio_dir:
```bash
audio_1.wav
audio_2.wav
audio_3.wav
...
```
2. Turn-taking pattern labels: numpy arrays with the same names as the training audio files, placed under /path/to/your/label_dir:
```bash
audio_1.npy
audio_2.npy
audio_3.npy
...
```

The labels' frame rate is 12.5 Hz (80 ms per frame), and each numpy array should have shape (2, T), where T = audio_duration / 80 ms.

In the `data_utils` directory, you can find scripts for preparing turn-taking pattern labels from the SpokenWOZ dataset annotations:

1. Using silero_vad to refine the utterance timestamps.
2. Generating the turn-taking labels.
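To make the label format concrete, here is a hedged sketch of how a (2, T) label array could be built from per-channel utterance timestamps, following the state definitions given earlier. The function `make_labels` and its edge handling (leading/trailing silence labelled Gap, overlap frames deciding the surrounding speaker) are my assumptions; the actual scripts in `data_utils` may differ in details.

```python
# Sketch of label construction from per-channel speech intervals.
import numpy as np

IPU, LISTEN, GAP, PAUSE, OVERLAP = 0, 1, 2, 3, 4
FRAME_S = 0.08                      # 12.5 Hz label rate

def make_labels(utts, duration_s):
    """utts: two lists of (start_s, end_s) speech intervals, one per
    channel. Returns an int64 label array of shape (2, T)."""
    T = int(round(duration_s / FRAME_S))
    speak = np.zeros((2, T), dtype=bool)
    for ch, intervals in enumerate(utts):
        for s, e in intervals:
            speak[ch, int(round(s / FRAME_S)):int(round(e / FRAME_S))] = True
    labels = np.full((2, T), GAP, dtype=np.int64)
    only0 = speak[0] & ~speak[1]
    only1 = speak[1] & ~speak[0]
    labels[0, only0] = IPU
    labels[1, only0] = LISTEN
    labels[1, only1] = IPU
    labels[0, only1] = LISTEN
    labels[:, speak[0] & speak[1]] = OVERLAP
    # Mutual silence: Pause if the same channel speaks before and after,
    # Gap otherwise (leading/trailing silence stays Gap here).
    silent = ~(speak[0] | speak[1])
    t = 0
    while t < T:
        if not silent[t]:
            t += 1
            continue
        start = t
        while t < T and silent[t]:
            t += 1
        if start > 0 and t < T:
            prev_ch = 0 if speak[0, start - 1] else 1
            next_ch = 0 if speak[0, t] else 1
            if prev_ch == next_ch:
                labels[:, start:t] = PAUSE
    return labels

# Toy example: channel 0 says two utterances, then channel 1 answers
labels = make_labels(
    [[(0.0, 0.4), (0.56, 0.8)],     # channel 0
     [(0.96, 1.6)]],                # channel 1
    duration_s=1.6,
)
print(labels.shape)                 # (2, 20)
```

In this toy case the silence inside channel 0's query becomes Pause, while the silence before channel 1's answer becomes Gap, matching the definitions in the Outputs section.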

### Start Training

After data preparation, use the following command to start training:

```bash
python train.py --audio_dir /path/to/your/audio_dir --label_dir /path/to/your/label_dir --batch_size 32 --exp_name test
```

## Results

The model achieves an ep-cutoff rate of 4.72% on the SpokenWOZ test set.

| Method                       | ep-50 (ms) | ep-90 (ms) | ep-cutoff (%) |
|------------------------------|------------|------------|---------------|
| Silero_vad (200 ms latency)  | 240        | 320        | 35.86         |
| Silero_vad (500 ms latency)  | 560        | 640        | 23.11         |
| The proposed model           | 80         | 400        | 4.72          |