---
license: cc-by-sa-3.0
language:
- en
---
## Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means

[[paper](https://arxiv.org/abs/2601.19781)] [[demo](https://ondatk68.github.io/onda-demo/projects/phonological-tokenizer/)]

Phonological Tokenizer is a single-codebook speech tokenizer that encodes both linguistic and prosodic information. Its tokens have properties intermediate between those of phonetic tokens and acoustic tokens.

The tokenizer is obtained by fine-tuning phonetic tokens derived from an SSL model (WavLM-large) with differentiable k-means, trained in a multi-task manner on ASR and speech reconstruction. This repository releases the fine-tuned SSL model and the cluster centroids, along with simple inference code.
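
As a rough illustration of how cluster centroids turn frame-level features into discrete tokens, the sketch below shows plain nearest-centroid assignment, the generic k-means tokenization step. It is not taken from this repository's `inference.py`: the function names, the toy 2-D features, and the three toy centroids are all hypothetical, and the differentiable relaxation used during fine-tuning is described in the paper rather than reproduced here.

```python
# Illustrative sketch only: hard nearest-centroid assignment, the generic
# k-means tokenization step. All names and toy values are hypothetical and
# are not taken from this repository's inference code.
from typing import List, Sequence


def squared_distance(x: Sequence[float], c: Sequence[float]) -> float:
    """Squared Euclidean distance between a frame feature and a centroid."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def assign_tokens(features: Sequence[Sequence[float]],
                  centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame feature to the index of its nearest centroid."""
    tokens = []
    for frame in features:
        distances = [squared_distance(frame, c) for c in centroids]
        tokens.append(distances.index(min(distances)))
    return tokens


# Toy 2-D example: three centroids, four frames.
toy_centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
toy_features = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [0.7, 0.4]]
print(assign_tokens(toy_features, toy_centroids))  # [0, 1, 2, 1]
```

In the real setting the features would presumably be SSL-model frame representations and the centroids the released ones; the toy values above only show the shape of the computation, one token index per frame.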

For more details, please refer to [our paper](https://arxiv.org/abs/2601.19781).

### Usage
```
git clone https://huggingface.co/Sony/Phonological-Tokenizer
cd Phonological-Tokenizer

pip install -r requirements.txt

python inference.py [audio file path]
```

### License
This model is licensed under CC BY-SA 3.0. See the [LICENSE file](./LICENSE) for details.

### Citation
```
@inproceedings{onda2026phonological,
  title={Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means},
  author={Onda, Kentaro and Futami, Hayato and Kashiwagi, Yosuke and Tsunoo, Emiru and Watanabe, Shinji},
  booktitle={ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={17817-17821},
  year={2026},
  organization={IEEE},
  doi={10.1109/ICASSP55912.2026.11464405}
}
```

### Reference
- Original SSL model: [WavLM-large](https://huggingface.co/microsoft/wavlm-large) (CC BY-SA 3.0)
- Training data:
  - [VCTK](https://datashare.ed.ac.uk/handle/10283/3443) (CC BY 4.0)
  - [LibriSpeech](https://www.openslr.org/12) (CC BY 4.0; used a 30h random subset of train-clean-100 for centroid initialization)

### Contact
ondakentaro[at]gavo.t.u-tokyo.ac.jp; hayato.Futami[at]sony.com; Yosuke.Kashiwagi[at]sony.com