---
license: mit
---

# UTMOS PyTorch

## About

This is an unofficial, `fairseq`-free implementation of the UTMOS MOS prediction system proposed in [UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022](https://arxiv.org/abs/2204.02152).

The [original implementation](https://github.com/sarulab-speech/UTMOS22) is based on [fairseq](https://github.com/facebookresearch/fairseq). However, `fairseq` is difficult to install with recent versions of Python, PyTorch, and their dependencies, which makes UTMOS hard to use in modern environments. A [recent study from ICASSP 2026](https://arxiv.org/abs/2509.24457) highlights the high correlation of UTMOS with subjective listening scores for neural codecs, so modern neural audio codec and TTS research benefits from an easy-to-install UTMOS implementation.

We provide a `fairseq`-free implementation written in `PyTorch` that matches the [original system](https://github.com/sarulab-speech/UTMOS22) using converted weights and rewritten modules.

We also provide a `TorchScript` variant that can be loaded with only PyTorch, without installing this package.

The PyTorch and TorchScript versions are validated against the original implementation and produce matching scores.

**Note**: As in the original version, we recommend running UTMOS with batch size 1 to avoid metric shifts caused by padding.

See the [GitHub repository](https://github.com/Blinorot/utmos-pytorch) for the source code.

## Usage

You can install the package with `pip`:

```bash
pip install utmos-pytorch
```

Or install it from source:

```bash
git clone https://github.com/Blinorot/UTMOS-PyTorch.git
cd UTMOS-PyTorch
pip install -e .
```

The code requires:

| Package          | Version |
| ---------------- | ------- |
| Python           | >=3.9   |
| PyTorch          | >=2.2.0 |
| Hugging Face Hub | >=0.20  |

The TorchScript checkpoint was scripted with `PyTorch 2.5.1`. Loading it with older PyTorch versions is not guaranteed to work; `PyTorch >=2.5.1` is recommended for the TorchScript variant.
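
Since the TorchScript recommendation depends on the installed PyTorch version, a quick check before loading the checkpoint can save a confusing failure later. The sketch below is only an illustration: `parse_release` and `meets_minimum` are hypothetical helper names that are not part of this package, and the parsing deliberately ignores local and pre-release suffixes such as `+cu121`. In practice you would pass `torch.__version__` as the installed version.

```python
import re


def parse_release(version: str) -> tuple:
    """Return the leading numeric release of a version string as a tuple of ints.

    Suffixes such as "+cu121" or ".dev20240101" are ignored (a simplification),
    so "2.5.1+cu121" becomes (2, 5, 1).
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = "2.5.1") -> bool:
    """Check whether `installed` is at least `minimum` (default: 2.5.1)."""
    a, b = parse_release(installed), parse_release(minimum)
    # Pad the shorter tuple with zeros so that "2.5" compares equal to "2.5.0".
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Typical use (requires PyTorch, so it is left commented out here):
# import torch
# if not meets_minimum(torch.__version__):
#     print("PyTorch is older than 2.5.1; the TorchScript checkpoint may not load.")
```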

Then, you can run the model as follows:

```python
import torchaudio
from utmos_pytorch import UTMOSScoreTorch

device = "cpu"  # set to "cuda" to run on GPU
utmos = UTMOSScoreTorch(device=device)  # already in eval mode

# load an audio file, e.g. using torchaudio
audio_path = ...  # path to an audio file
wav, sr = torchaudio.load(audio_path)

# convert to mono 16 kHz
TARGET_SR = 16000
if wav.shape[0] != 1:
    wav = wav[0:1]  # keep only the first channel
if sr != TARGET_SR:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)

# put on device
wav = wav.to(device)

# calculate the score
# accepts T, 1xT, Bx1xT
utmos_score = utmos.score(wav)  # tensor of shape (batch_size,)
```
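
To follow the batch-size-1 recommendation from the About section when scoring several files, one option is to call the scorer once per waveform instead of padding them into a batch. The sketch below is a hedged illustration: `score_one_at_a_time` and `average_score` are hypothetical helpers, not part of this package, and `score_fn` stands in for `utmos.score`. It assumes each call returns a single value that `float()` can convert, which holds for a one-element tensor.

```python
from statistics import mean
from typing import Any, Callable, Dict, Iterable, Tuple


def score_one_at_a_time(
    score_fn: Callable[[Any], Any],
    items: Iterable[Tuple[str, Any]],
) -> Dict[str, float]:
    """Score each (name, waveform) pair in its own call, so no padding is needed."""
    scores = {}
    for name, wav in items:
        # One waveform per call corresponds to batch size 1.
        scores[name] = float(score_fn(wav))
    return scores


def average_score(scores: Dict[str, float]) -> float:
    """Average the per-file scores into a single system-level number."""
    return mean(scores.values())


# Typical use with the objects from the example above (commented out here
# because it needs the package and real audio):
# scores = score_one_at_a_time(utmos.score, [("sample_a", wav_a), ("sample_b", wav_b)])
# print(average_score(scores))
```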

You can replace `UTMOSScoreTorch` with `UTMOSScoreScripted` to use the `TorchScript` variant instead. On first use, the package downloads the converted UTMOS weights from the [Hugging Face Hub](https://huggingface.co/Blinorot/UTMOS-PyTorch) and stores them in the local Hugging Face cache.

For `TorchScript`, you can skip installing the package and load the model directly:

```python
import torch
import torchaudio
import wget

# download scripted checkpoint, e.g. using wget
checkpoint_url = "https://huggingface.co/Blinorot/UTMOS-PyTorch/resolve/main/utmos_scripted.pt"
checkpoint_path = ...  # path to saved checkpoint
wget.download(checkpoint_url, checkpoint_path)

# load directly with torch.jit
device = "cpu"  # set to "cuda" to run on GPU
utmos = torch.jit.load(checkpoint_path, map_location=device)
utmos.eval()

# load an audio file, e.g. using torchaudio
audio_path = ...  # path to an audio file
wav, sr = torchaudio.load(audio_path)

# convert to mono 16 kHz
TARGET_SR = 16000
if wav.shape[0] != 1:
    wav = wav[0:1]  # keep only the first channel
if sr != TARGET_SR:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)

# put on device
wav = wav.to(device)

# calculate the score
# accepts T, 1xT, Bx1xT
with torch.no_grad():
    utmos_score = utmos.score(wav)  # tensor of shape (batch_size,)
```
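
The example above uses the third-party `wget` package for the download. If you would rather avoid that extra dependency, the same step can be sketched with the standard library only. This is an assumption-laden illustration, not the package's own download logic: `hf_resolve_url` and `download_once` are hypothetical helpers, and `checkpoints/utmos_scripted.pt` is a placeholder destination.

```python
import urllib.request
from pathlib import Path


def hf_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for a file in a Hugging Face Hub repository."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_once(url: str, destination: str) -> Path:
    """Download `url` to `destination` unless the file is already there."""
    path = Path(destination)
    if path.exists():
        return path
    path.parent.mkdir(parents=True, exist_ok=True)
    # Download under a temporary name first, so an interrupted transfer
    # is not mistaken for a complete checkpoint on the next run.
    partial = path.with_name(path.name + ".part")
    urllib.request.urlretrieve(url, str(partial))
    partial.replace(path)
    return path


checkpoint_url = hf_resolve_url("Blinorot/UTMOS-PyTorch", "utmos_scripted.pt")

# Placeholder destination; commented out because it needs network access:
# checkpoint_path = download_once(checkpoint_url, "checkpoints/utmos_scripted.pt")
# utmos = torch.jit.load(str(checkpoint_path), map_location="cpu")
```

Skipping the download when the file already exists mirrors the caching behaviour described for the packaged classes, but here the cache location is whatever path you choose.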

### Notes

The model expects audio sampled at **16 kHz**.

Accepted tensor shapes:

| Shape       | Meaning                                     |
| ----------- | ------------------------------------------- |
| `(T,)`      | single mono waveform                        |
| `(1, T)`    | single mono waveform with channel dimension |
| `(B, 1, T)` | batch of mono waveforms                     |

The input should be a floating-point PyTorch tensor. Stereo audio should be converted to mono before scoring. `utmos.score(wav)` returns a tensor of shape `(batch_size,)`, where each value is a predicted MOS score; higher is better. **Batch size 1 is recommended to avoid padding-related score shifts.**
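
The shape rules above can be checked before calling the scorer, which gives a clearer error for stereo or otherwise malformed input. The sketch below is a hypothetical helper (`canonical_utmos_shape` is not part of this package) that works on a plain shape tuple such as `tuple(wav.shape)`. Mapping every accepted shape to `(B, 1, T)` is this example's reading of the table, not a statement about how the package reshapes input internally.

```python
from typing import Sequence, Tuple


def canonical_utmos_shape(shape: Sequence[int]) -> Tuple[int, int, int]:
    """Map an accepted input shape to (batch, 1, samples), or raise ValueError."""
    dims = tuple(int(d) for d in shape)
    if len(dims) == 1:  # (T,): single mono waveform
        batch, channels, length = 1, 1, dims[0]
    elif len(dims) == 2:  # (1, T): single waveform with a channel dimension
        batch, channels, length = 1, dims[0], dims[1]
    elif len(dims) == 3:  # (B, 1, T): batch of mono waveforms
        batch, channels, length = dims
    else:
        raise ValueError(f"expected 1, 2 or 3 dimensions, got shape {dims}")
    if channels != 1:
        raise ValueError(f"expected mono audio (1 channel), got {channels} channels")
    if batch < 1 or length < 1:
        raise ValueError(f"empty input is not supported, got shape {dims}")
    return batch, 1, length


# Typical use before scoring (commented out because it needs a real tensor):
# batch, _, samples = canonical_utmos_shape(tuple(wav.shape))
# if batch > 1:
#     print("Batch size 1 is recommended to avoid padding-related score shifts.")
```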

API classes:

| Class                | Description                                     |
| -------------------- | ----------------------------------------------- |
| `UTMOSScoreTorch`    | PyTorch implementation using converted weights. |
| `UTMOSScoreScripted` | Wrapper around the TorchScript checkpoint.      |

## Citation

If you use this package, please cite the original UTMOS paper:

```bibtex
@inproceedings{saeki22c_interspeech,
  title     = {{UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022}},
  author    = {Takaaki Saeki and Detai Xin and Wataru Nakata and Tomoki Koriyama and Shinnosuke Takamichi and Hiroshi Saruwatari},
  year      = {2022},
  booktitle = {{Interspeech 2022}},
  pages     = {4521--4525},
  doi       = {10.21437/Interspeech.2022-439},
  issn      = {2958-1796},
}
```