---
license: cc-by-sa-3.0
language:
- en
---
## Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means

[[paper](https://arxiv.org/abs/2601.19781)] [[demo](https://ondatk68.github.io/onda-demo/projects/phonological-tokenizer/)]

![arch](./arch.png)

Phonological Tokenizer is a single-codebook speech tokenizer that encodes both linguistic and prosodic information. Its tokens have properties intermediate between those of phonetic tokens and acoustic tokens.

The tokenizer is obtained by fine-tuning phonetic tokens derived from an SSL model (WavLM-large) with differentiable k-means, using ASR and speech reconstruction as multi-task objectives. This repository releases the fine-tuned SSL model and the cluster centroids, along with simple inference code.
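
As a rough illustration of how cluster centroids turn frame-level features into single-codebook tokens, the sketch below assigns each frame to its nearest centroid and also shows a softmax relaxation of that assignment, which is one common way to make k-means differentiable. It is not the released `inference.py`, and the formulation used in the paper may differ. The function names, the `temperature` parameter, and the toy 2-D features are all hypothetical.

```python
import math

def squared_distance(x, c):
    """Squared Euclidean distance between a frame and a centroid."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def hard_assign(frames, centroids):
    """Return the index of the nearest centroid for every frame (the token IDs)."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(f, centroids[k]))
        for f in frames
    ]

def soft_assign(frame, centroids, temperature=1.0):
    """Softmax over negative squared distances (assumed relaxation, for illustration).

    Lower temperatures push the weights toward the hard nearest-centroid choice.
    """
    logits = [-squared_distance(frame, c) / temperature for c in centroids]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data: three 2-D centroids and three frames (real SSL features are much larger).
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3]]

tokens = hard_assign(frames, centroids)
weights = soft_assign(frames[0], centroids, temperature=0.5)
print(tokens)
print(weights)
```

In this sketch the hard assignment is what a tokenizer would emit at inference time, while the soft weights are the kind of quantity through which gradients could flow back to the encoder and centroids during fine-tuning.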

For more details, please refer to [our paper](https://arxiv.org/abs/2601.19781).

### Usage
```
git clone https://huggingface.co/Sony/Phonological-Tokenizer
cd Phonological-Tokenizer

pip install -r requirements.txt

python inference.py [audio file path]
```

### License
This model is licensed under CC BY-SA 3.0. See the [LICENSE file](./LICENSE) for details.

### Citation
```
@inproceedings{onda2026phonological,
  title={Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means},
  author={Onda, Kentaro and Futami, Hayato and Kashiwagi, Yosuke and Tsunoo, Emiru and Watanabe, Shinji},
  booktitle={ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={17817-17821},
  year={2026},
  organization={IEEE},
  doi={10.1109/ICASSP55912.2026.11464405}
}
```

### Reference
- Original SSL model: [WavLM-large](https://huggingface.co/microsoft/wavlm-large) (CC BY-SA 3.0)
- Training data:
  - [VCTK](https://datashare.ed.ac.uk/handle/10283/3443) (CC BY 4.0)
  - [LibriSpeech](https://www.openslr.org/12) (CC BY 4.0; used a 30h random subset of train-clean-100 for centroid initialization)


### Contact
ondakentaro[at]gavo.t.u-tokyo.ac.jp; hayato.Futami[at]sony.com; Yosuke.Kashiwagi[at]sony.com