	---
	license: cc-by-sa-3.0
	language:
	- en
	---
	## Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means

	[[paper](https://arxiv.org/abs/2601.19781)] [[demo](https://ondatk68.github.io/onda-demo/projects/phonological-tokenizer/)]

	![arch](./arch.png)

	Phonological Tokenizer is a single-codebook speech tokenizer that encodes both linguistic and prosodic information. Its tokens have properties intermediate between those of phonetic tokens and acoustic tokens.

	The tokenizer is obtained by fine-tuning the phonetic tokens derived from an SSL model (WavLM-large) with differentiable k-means, in a multi-task setup that combines ASR and speech reconstruction. This repository releases the fine-tuned SSL model and the cluster centroids, along with simple inference code.

	For more details, please refer to [our paper](https://arxiv.org/abs/2601.19781).
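
As a rough illustration of what a single-codebook tokenizer does at inference time, the sketch below maps frame-level features to token IDs by nearest-centroid lookup. It is not the code in `inference.py`: the function name `assign_tokens`, the toy arrays, and the use of squared Euclidean distance are all assumptions made for illustration, and real features would come from the fine-tuned SSL model rather than hand-written values.

```python
import numpy as np


def assign_tokens(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each frame feature to the index of its nearest centroid.

    features:  (T, D) array of frame-level features (hypothetical input).
    centroids: (K, D) array of cluster centroids (hypothetical input).
    Returns a (T,) integer array of token IDs in [0, K).
    """
    # Squared Euclidean distance, expanded as ||x||^2 - 2 x.c + ||c||^2.
    # The result has shape (T, K): one distance per frame and centroid.
    dists = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    # The token ID of a frame is the index of its closest centroid.
    return dists.argmin(axis=1)


# Toy data: 3 centroids and 4 frames in a 2-dimensional feature space.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
features = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 1.9], [0.2, 0.1]])

tokens = assign_tokens(features, centroids)
print(tokens)  # [0 1 2 0]
```

Because there is only one codebook, each frame is represented by exactly one integer, so an utterance becomes a single sequence of token IDs.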

	### Usage
	```
	git clone https://huggingface.co/Sony/Phonological-Tokenizer
	cd Phonological-Tokenizer

	pip install -r requirements.txt

	python inference.py [audio file path]
	```

	### License
	This model is licensed under CC BY-SA 3.0. See the [LICENSE file](./LICENSE) for details.

	### Citation
	```
	@inproceedings{onda2026phonological,
	title={Phonological Tokenizer: Prosody-Aware Phonetic Token via Multi-Objective Fine-Tuning with Differentiable K-Means},
	author={Onda, Kentaro and Futami, Hayato and Kashiwagi, Yosuke and Tsunoo, Emiru and Watanabe, Shinji},
	booktitle={ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
	pages={17817-17821},
	year={2026},
	organization={IEEE},
	doi={10.1109/ICASSP55912.2026.11464405}
	}
	```

	### Reference
	- Original SSL model: [WavLM-large](https://huggingface.co/microsoft/wavlm-large) (CC BY-SA 3.0)
	- Training data:
	  - [VCTK](https://datashare.ed.ac.uk/handle/10283/3443) (CC BY 4.0)
	  - [LibriSpeech](https://www.openslr.org/12) (CC BY 4.0; used a 30h random subset of train-clean-100 for centroid initialization)


	### Contact
	ondakentaro[at]gavo.t.u-tokyo.ac.jp; hayato.Futami[at]sony.com; Yosuke.Kashiwagi[at]sony.com