	---
	license: mit
	---

	# UTMOS PyTorch

	## About

	This is an unofficial `fairseq`-free implementation of the UTMOS MOS Prediction system proposed in [UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022](https://arxiv.org/abs/2204.02152).

	The [original implementation](https://github.com/sarulab-speech/UTMOS22) is based on [fairseq](https://github.com/facebookresearch/fairseq). However, `fairseq` is difficult to install with recent Python, PyTorch, and dependency versions, which makes UTMOS hard to use in modern environments. A [recent study from ICASSP 2026](https://arxiv.org/abs/2509.24457) highlights the high correlation of UTMOS with subjective listening scores for neural codecs. Modern neural audio codec and TTS research therefore benefits from an easy-to-install UTMOS implementation.

	We provide a `fairseq`-free implementation written in `PyTorch` that matches the [original system](https://github.com/sarulab-speech/UTMOS22) using converted weights and re-written modules.

	We also provide a `TorchScript` variant that can be loaded with only PyTorch, without installing this package.

	The PyTorch and TorchScript versions are validated against the original implementation and produce matching scores.

	Note: As in the original version, we recommend running UTMOS with batch size 1 to avoid metric shifts caused by padding.

	See [GitHub repository](https://github.com/Blinorot/utmos-pytorch) for source.

	## Usage

	You can install the repo as a package:

	```bash
	pip install utmos-pytorch
	```

	Or from source:

	```bash
	git clone https://github.com/Blinorot/UTMOS-PyTorch.git
	cd UTMOS-PyTorch
	pip install -e .
	```

	The code requires:

	| Package         | Version  |
	| --------------- | -------- |
	| Python          | >=3.9    |
	| PyTorch         | >=2.2.0  |
	| HuggingFace Hub | >=0.20   |

	The TorchScript checkpoint was scripted with `PyTorch 2.5.1`. Loading it with older
	PyTorch versions is not guaranteed to work; `PyTorch >=2.5.1` is recommended for the
	TorchScript variant.
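
To guard against loading the TorchScript checkpoint on an older PyTorch, the recommendation above can be turned into a small version check. The sketch below is illustrative only: `parse_release` and `version_at_least` are hypothetical helper names, not part of this package, and the parsing assumes version strings shaped like `2.5.1`, `2.5.1+cu121`, or `2.6.0.dev20241001`.

```python
import re


def parse_release(version: str) -> tuple:
    """Extract (major, minor, patch) from a version string such as '2.5.1+cu121'."""
    public = version.split("+", 1)[0]  # drop the local build tag, e.g. '+cu121'
    parts = []
    for piece in public.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep leading digits, so '0a0' -> 0
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad short versions such as '2.5'
        parts.append(0)
    return tuple(parts)


def version_at_least(version: str, minimum: tuple = (2, 5, 1)) -> bool:
    """Return True if `version` is at least `minimum` (default: 2.5.1)."""
    return parse_release(version) >= minimum
```

With PyTorch installed, the check would be called as `version_at_least(torch.__version__)` before `torch.jit.load`.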

	Then, you can run the model as follows:

	```python
	import torchaudio
	from utmos_pytorch import UTMOSScoreTorch

	device = "cpu"  # set to "cuda" to run on GPU
	utmos = UTMOSScoreTorch(device=device) # already in eval mode

	# load an audio file, e.g. using torchaudio
	audio_path = ... # path to an audio file
	wav, sr = torchaudio.load(audio_path)

	# convert to MONO 16 kHz
	TARGET_SR = 16000
	if wav.shape[0] != 1:
	    wav = wav[0:1]  # keep only the first channel
	if sr != TARGET_SR:
	    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)

	# put on device
	wav = wav.to(device)

	# calculate the score
	# accepts T, 1xT, Bx1xT
	utmos_score = utmos.score(wav) # tensor of shape (batch_size,)
	```

	You can replace `UTMOSScoreTorch` with `UTMOSScoreScripted` to use the `TorchScript` variant instead. On first use, the package downloads converted UTMOS weights from [Hugging Face Hub](https://huggingface.co/Blinorot/UTMOS-PyTorch) and caches them locally using the Hugging Face cache.

	For `TorchScript`, you can skip installing the package and load the model directly:

	```python
	import torch
	import torchaudio
	import wget

	# download scripted checkpoint, e.g. using wget
	checkpoint_url = "https://huggingface.co/Blinorot/UTMOS-PyTorch/resolve/main/utmos_scripted.pt"
	checkpoint_path = ... # path to saved checkpoint
	wget.download(checkpoint_url, checkpoint_path)

	# load directly with torch.jit
	device = "cpu"  # set to "cuda" to run on GPU
	utmos = torch.jit.load(checkpoint_path, map_location=device)
	utmos.eval()

	# load an audio file, e.g. using torchaudio
	audio_path = ... # path to an audio file
	wav, sr = torchaudio.load(audio_path)

	# convert to MONO 16 kHz
	TARGET_SR = 16000
	if wav.shape[0] != 1:
	    wav = wav[0:1]  # keep only the first channel
	if sr != TARGET_SR:
	    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)

	# put on device
	wav = wav.to(device)

	# calculate the score
	# accepts T, 1xT, Bx1xT
	with torch.no_grad():
	    utmos_score = utmos.score(wav) # tensor of shape (batch_size,)
	```
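
The example above relies on the third-party `wget` package for the download. If an extra dependency is undesirable, a standard-library alternative could look like the sketch below. `download_if_missing` is a hypothetical helper name, and the destination path is whatever you choose; only the checkpoint URL is taken from this README.

```python
import urllib.request
from pathlib import Path

# URL of the scripted checkpoint, as given in this README
CHECKPOINT_URL = "https://huggingface.co/Blinorot/UTMOS-PyTorch/resolve/main/utmos_scripted.pt"


def download_if_missing(url: str, destination: str) -> Path:
    """Download `url` to `destination` unless the file already exists."""
    path = Path(destination)
    if path.exists():
        return path  # reuse the local copy instead of downloading again
    path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(path))
    return path
```

The returned path can then be passed to `torch.jit.load` exactly as `checkpoint_path` is in the example above.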

	### Notes

	The model expects audio sampled at 16 kHz.

	Accepted tensor shapes:

	| Shape       | Meaning                                     |
	| ----------- | ------------------------------------------- |
	| `(T,)`      | single mono waveform                        |
	| `(1, T)`    | single mono waveform with channel dimension |
	| `(B, 1, T)` | batch of mono waveforms                     |
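
According to the table, `utmos.score` already handles all three layouts, so no reshaping is required. The sketch below only illustrates how the layouts relate to each other and how input could be validated early. `to_batched_mono` is a hypothetical helper, not the package's internal code.

```python
import torch


def to_batched_mono(wav: torch.Tensor) -> torch.Tensor:
    """Bring a mono waveform of shape (T,), (1, T) or (B, 1, T) to (B, 1, T)."""
    if wav.dim() == 1:                          # (T,) -> (1, 1, T)
        return wav[None, None, :]
    if wav.dim() == 2 and wav.shape[0] == 1:    # (1, T) -> (1, 1, T)
        return wav[None]
    if wav.dim() == 3 and wav.shape[1] == 1:    # already (B, 1, T)
        return wav
    raise ValueError(f"expected (T,), (1, T) or (B, 1, T), got {tuple(wav.shape)}")
```

A stereo tensor of shape `(2, T)` is rejected by this check, which matches the requirement that audio is converted to mono before scoring.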

	The input should be a floating point PyTorch tensor. Stereo audio should be converted to mono before scoring. `utmos.score(wav)` returns a tensor of shape `(batch_size,)`, where each value is a predicted MOS score. Higher is better. Batch size 1 is recommended to avoid padding-related score shifts.
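
When several recordings need scores, one way to follow the batch-size-1 recommendation is to score them in a loop, so that waveforms of different lengths are never padded together. The sketch below assumes a hypothetical helper `score_one_by_one`; `score_fn` stands in for `utmos.score`, and each waveform is assumed to be already prepared (mono, 16 kHz, on the right device).

```python
from typing import Callable, Iterable, List


def score_one_by_one(score_fn: Callable, waveforms: Iterable) -> List[float]:
    """Score each waveform separately (batch size 1) and collect plain floats."""
    scores = []
    for wav in waveforms:
        result = score_fn(wav)  # for a single waveform: one score, shape (1,)
        if len(result) != 1:
            raise ValueError(f"expected exactly one score, got {len(result)}")
        scores.append(float(result[0]))
    return scores
```

With the package loaded, this would be called as `score_one_by_one(utmos.score, list_of_wavs)`. The price is one forward pass per file, in exchange for scores that do not depend on how files are grouped.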

	API classes:

	| Class                | Description                                     |
	| -------------------- | ----------------------------------------------- |
	| `UTMOSScoreTorch`    | PyTorch implementation using converted weights. |
	| `UTMOSScoreScripted` | Wrapper around the TorchScript checkpoint.      |


	## Citation

	If you use this package, please cite the original UTMOS paper:

	```bibtex
	@inproceedings{saeki22c_interspeech,
	  title     = {{UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022}},
	  author    = {Takaaki Saeki and Detai Xin and Wataru Nakata and Tomoki Koriyama and Shinnosuke Takamichi and Hiroshi Saruwatari},
	  year      = {2022},
	  booktitle = {{Interspeech 2022}},
	  pages     = {4521--4525},
	  doi       = {10.21437/Interspeech.2022-439},
	  issn      = {2958-1796},
	}
	```