---
license: mit
---

# UTMOS PyTorch

## About

This is an unofficial `fairseq`-free implementation of the UTMOS MOS prediction system proposed in [UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022](https://arxiv.org/abs/2204.02152).

The [original implementation](https://github.com/sarulab-speech/UTMOS22) is based on [fairseq](https://github.com/facebookresearch/fairseq). However, `fairseq` is difficult to install with recent Python, PyTorch, and dependency versions, which makes UTMOS hard to use in modern environments. A [recent study from ICASSP 2026](https://arxiv.org/abs/2509.24457) highlights the high correlation of UTMOS with subjective listening scores for neural codecs. Modern neural audio codec and TTS research therefore benefits from an easy-to-install UTMOS implementation.

We provide a `fairseq`-free implementation written in `PyTorch` that matches the [original system](https://github.com/sarulab-speech/UTMOS22) using converted weights and re-written modules. We also provide a `TorchScript` variant that can be loaded with only PyTorch, without installing this package.

The PyTorch and TorchScript versions are validated against the original implementation and produce matching scores.

**Note**: As in the original version, we recommend running UTMOS with batch size 1 to avoid metric shifts caused by padding.

See the [GitHub repository](https://github.com/Blinorot/utmos-pytorch) for the source code.

## Usage

You can install the repo as a package:

```bash
pip install utmos-pytorch
```

Or from source:

```bash
git clone https://github.com/Blinorot/UTMOS-PyTorch.git
cd UTMOS-PyTorch
pip install -e .
```

The code requires:

| Package         | Version |
| --------------- | ------- |
| Python          | >=3.9   |
| PyTorch         | >=2.2.0 |
| HuggingFace Hub | >=0.20  |

The TorchScript checkpoint was scripted with `PyTorch 2.5.1`. Loading it with older PyTorch versions is not guaranteed to work; `PyTorch >=2.5.1` is recommended for the TorchScript variant.

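
If you are unsure whether an existing environment meets the minimum versions listed above, a quick check along the following lines may help. This is only a sketch and not part of the package: the helper names (`parse_version`, `at_least`, `check_environment`) are hypothetical, and the distribution names `torch` and `huggingface-hub` are assumptions about how the dependencies are registered with `pip`.

```python
import sys
from importlib import metadata

# Minimum versions from the requirements table.
# The distribution names used as keys are assumptions.
REQUIREMENTS = {"torch": (2, 2, 0), "huggingface-hub": (0, 20)}


def parse_version(text):
    """Turn a string such as '2.5.1+cu121' into (2, 5, 1).

    Local build tags and pre-release suffixes are ignored.
    """
    parts = []
    for chunk in text.split("+")[0].split("."):
        digits = ""
        for ch in chunk:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def at_least(installed, minimum):
    """Compare two version tuples after zero-padding them to equal length."""
    size = max(len(installed), len(minimum))
    pad = lambda v: tuple(v) + (0,) * (size - len(v))
    return pad(installed) >= pad(minimum)


def check_environment():
    """Return a list of human-readable problems (empty if all is fine)."""
    problems = []
    if not at_least(tuple(sys.version_info[:3]), (3, 9)):
        problems.append("Python >=3.9 is required")
    for name, minimum in REQUIREMENTS.items():
        wanted = ".".join(map(str, minimum))
        try:
            installed = parse_version(metadata.version(name))
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (need >={wanted})")
            continue
        if not at_least(installed, minimum):
            problems.append(f"{name} is older than {wanted}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

An empty result from `check_environment()` means the listed minimums are met; it does not check the separate `PyTorch >=2.5.1` recommendation for the TorchScript variant.
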
Then, you can run the model as follows:

```python
import torchaudio
from utmos_pytorch import UTMOSScoreTorch

device = "cpu"  # set to "cuda" to use on GPU
utmos = UTMOSScoreTorch(device=device)  # already in eval mode

# load an audio file, e.g. using torchaudio
audio_path = ...  # path to an audio file
wav, sr = torchaudio.load(audio_path)

# convert to MONO 16 kHz
TARGET_SR = 16000
if wav.shape[0] != 1:
    wav = wav[0:1]  # keep only the first channel
if sr != TARGET_SR:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)

# put on device
wav = wav.to(device)

# calculate the score
# accepts T, 1xT, Bx1xT
utmos_score = utmos.score(wav)  # tensor of shape (batch_size,)
```

You can replace `UTMOSScoreTorch` with `UTMOSScoreScripted` to use the `TorchScript` variant instead.

On first use, the package downloads the converted UTMOS weights from [Hugging Face Hub](https://huggingface.co/Blinorot/UTMOS-PyTorch) and stores them in the local Hugging Face cache.

For `TorchScript`, you can skip installing the package and use the model directly:

```python
import torch
import torchaudio
import wget

# download scripted checkpoint, e.g. using wget
checkpoint_url = "https://huggingface.co/Blinorot/UTMOS-PyTorch/resolve/main/utmos_scripted.pt"
checkpoint_path = ...  # path to saved checkpoint
wget.download(checkpoint_url, checkpoint_path)

# load directly with torch.jit
device = "cpu"  # set to "cuda" to use on GPU
utmos = torch.jit.load(checkpoint_path, map_location=device)
utmos.eval()

# load an audio file, e.g. using torchaudio
audio_path = ...  # path to an audio file
wav, sr = torchaudio.load(audio_path)

# convert to MONO 16 kHz
TARGET_SR = 16000
if wav.shape[0] != 1:
    wav = wav[0:1]  # keep only the first channel
if sr != TARGET_SR:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)

# put on device
wav = wav.to(device)

# calculate the score
# accepts T, 1xT, Bx1xT
with torch.no_grad():
    utmos_score = utmos.score(wav)  # tensor of shape (batch_size,)
```

### Notes

The model expects audio sampled at **16 kHz**.

Accepted tensor shapes:

| Shape       | Meaning                                     |
| ----------- | ------------------------------------------- |
| `(T,)`      | single mono waveform                        |
| `(1, T)`    | single mono waveform with channel dimension |
| `(B, 1, T)` | batch of mono waveforms                     |

The input should be a floating-point PyTorch tensor. Stereo audio should be converted to mono before scoring.

`utmos.score(wav)` returns a tensor of shape `(batch_size,)`, where each value is a predicted MOS score. Higher is better.

**Batch size 1 is recommended to avoid padding-related score shifts.**

API classes:

| Class                | Description                                     |
| -------------------- | ----------------------------------------------- |
| `UTMOSScoreTorch`    | PyTorch implementation using converted weights. |
| `UTMOSScoreScripted` | Wrapper around the TorchScript checkpoint.      |

## Citation

If you use this package, please cite the original UTMOS paper:

```bibtex
@inproceedings{saeki22c_interspeech,
  title     = {{UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022}},
  author    = {Takaaki Saeki and Detai Xin and Wataru Nakata and Tomoki Koriyama and Shinnosuke Takamichi and Hiroshi Saruwatari},
  year      = {2022},
  booktitle = {{Interspeech 2022}},
  pages     = {4521--4525},
  doi       = {10.21437/Interspeech.2022-439},
  issn      = {2958-1796},
}
```