---
license: mit
---

# UTMOS PyTorch

## About

This is an unofficial `fairseq`-free implementation of the UTMOS MOS Prediction system proposed in [UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022](https://arxiv.org/abs/2204.02152).

The [original implementation](https://github.com/sarulab-speech/UTMOS22) is based on [fairseq](https://github.com/facebookresearch/fairseq). However, `fairseq` is difficult to install with recent versions of Python, PyTorch, and their dependencies, which makes UTMOS hard to use in modern environments. A [recent ICASSP 2026 study](https://arxiv.org/abs/2509.24457) highlights the high correlation of UTMOS with subjective listening scores for neural codecs, so modern neural audio codec and TTS research benefits from an easy-to-install UTMOS implementation.

We provide a `fairseq`-free implementation written in `PyTorch` that matches the [original system](https://github.com/sarulab-speech/UTMOS22) using converted weights and re-written modules.

We also provide a `TorchScript` variant that can be loaded with only PyTorch, without installing this package.

The PyTorch and TorchScript versions are validated against the original implementation and produce matching scores.

**Note**: As in the original version, we recommend running UTMOS with batch size 1 to avoid metric shifts caused by padding.

See the [GitHub repository](https://github.com/Blinorot/utmos-pytorch) for the source code.

## Usage

You can install the repo as a package:

```bash
pip install utmos-pytorch
```

Or from source:

```bash
git clone https://github.com/Blinorot/UTMOS-PyTorch.git
cd UTMOS-PyTorch
pip install -e .
```

The code requires:

| Package         | Version |
| --------------- | ------- |
| Python          | >=3.9   |
| PyTorch         | >=2.2.0 |
| HuggingFace Hub | >=0.20  |

The TorchScript checkpoint was scripted with `PyTorch 2.5.1`. Loading it with older
PyTorch versions is not guaranteed to work, so `PyTorch >=2.5.1` is recommended for the
TorchScript variant.
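
If it helps to fail early, the version recommendation for the TorchScript variant can be checked at runtime. The following is a minimal sketch and not part of the package: `parse_version` and `meets_min_version` are hypothetical helpers, and they compare only the leading `major.minor.patch` numbers of a version string such as `torch.__version__`, ignoring local suffixes like `+cu121`.

```python
import re

# Version recommended in this README for the TorchScript variant.
MIN_TORCHSCRIPT_VERSION = (2, 5, 1)


def parse_version(version: str) -> tuple:
    """Extract (major, minor, patch) from a string such as '2.5.1+cu121'."""
    public = str(version).split("+")[0]        # drop the local build suffix
    numbers = re.findall(r"\d+", public)[:3]   # keep the leading numeric parts
    numbers += ["0"] * (3 - len(numbers))      # pad e.g. '2.5' to '2.5.0'
    return tuple(int(n) for n in numbers)


def meets_min_version(version: str, minimum: tuple = MIN_TORCHSCRIPT_VERSION) -> bool:
    """Return True if `version` is at least `minimum` (hypothetical helper)."""
    return parse_version(version) >= minimum
```

With PyTorch installed, `meets_min_version(torch.__version__)` could be used to decide whether to try `UTMOSScoreScripted` or fall back to `UTMOSScoreTorch`.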

Then, you can run the model as follows:

```python
import torchaudio
from utmos_pytorch import UTMOSScoreTorch

device = "cpu" # set to "cuda" to run on a GPU
utmos = UTMOSScoreTorch(device=device) # already in eval mode

# load an audio file, e.g. using torchaudio
audio_path = ... # path to an audio file
wav, sr = torchaudio.load(audio_path)

# convert to mono (keeps only the first channel) and resample to 16 kHz
TARGET_SR = 16000
if wav.shape[0] != 1:
    wav = wav[0:1]
if sr != TARGET_SR:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)

# put on device
wav = wav.to(device)

# calculate the score
# accepts T, 1xT, Bx1xT
utmos_score = utmos.score(wav) # tensor of shape (batch_size,)
```

You can replace `UTMOSScoreTorch` with `UTMOSScoreScripted` to use the `TorchScript` variant instead. On first use, the package downloads converted UTMOS weights from [Hugging Face Hub](https://huggingface.co/Blinorot/UTMOS-PyTorch) and caches them locally using the Hugging Face cache.

For `TorchScript`, you can skip installing the package and load the model directly:

```python
import torch
import torchaudio
import wget # third-party package: pip install wget

# download scripted checkpoint, e.g. using wget
checkpoint_url = "https://huggingface.co/Blinorot/UTMOS-PyTorch/resolve/main/utmos_scripted.pt"
checkpoint_path = ... # path to saved checkpoint
wget.download(checkpoint_url, checkpoint_path)

# load directly with torch.jit
device = "cpu" # set to "cuda" to run on a GPU
utmos = torch.jit.load(checkpoint_path, map_location=device)
utmos.eval()

# load an audio file, e.g. using torchaudio
audio_path = ... # path to an audio file
wav, sr = torchaudio.load(audio_path)

# convert to mono (keeps only the first channel) and resample to 16 kHz
TARGET_SR = 16000
if wav.shape[0] != 1:
    wav = wav[0:1]
if sr != TARGET_SR:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)

# put on device
wav = wav.to(device)

# calculate the score
# accepts T, 1xT, Bx1xT
with torch.no_grad():
    utmos_score = utmos.score(wav) # tensor of shape (batch_size,)
```
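
The snippet above relies on the third-party `wget` package for the download. If you would rather depend on nothing beyond PyTorch, the same file can be fetched with the standard library. This is a rough sketch: `checkpoint_url` and `fetch_checkpoint` are hypothetical helpers, and the repository name, file name, and `main` revision are taken from the checkpoint URL shown above.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Taken from the checkpoint URL used in this README.
REPO_ID = "Blinorot/UTMOS-PyTorch"
FILENAME = "utmos_scripted.pt"


def checkpoint_url(repo_id: str = REPO_ID, filename: str = FILENAME,
                   revision: str = "main") -> str:
    """Build the direct-download URL for a file hosted on Hugging Face Hub."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def fetch_checkpoint(dest_dir: str) -> Path:
    """Download the checkpoint once and reuse the local copy afterwards."""
    dest = Path(dest_dir) / FILENAME
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(checkpoint_url(), str(dest))  # network access happens here
    return dest
```

The returned path can then be passed to `torch.jit.load` exactly as `checkpoint_path` is in the example above.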

### Notes

The model expects audio sampled at **16 kHz**.

Accepted tensor shapes:

| Shape       | Meaning                                     |
| ----------- | ------------------------------------------- |
| `(T,)`      | single mono waveform                        |
| `(1, T)`    | single mono waveform with channel dimension |
| `(B, 1, T)` | batch of mono waveforms                     |
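
To catch shape mistakes before calling the model, the table above can be turned into a small check. The sketch below is an illustration only: `implied_batch_size` is a hypothetical helper that mirrors the table and does not necessarily match the package's own validation. It works on any shape-like sequence, so `wav.shape` can be passed directly.

```python
def implied_batch_size(shape) -> int:
    """Return the batch size implied by a waveform shape, or raise ValueError."""
    dims = tuple(shape)
    if len(dims) == 1:    # (T,): single mono waveform
        return 1
    if len(dims) == 2:    # (1, T): single mono waveform with channel dimension
        if dims[0] != 1:
            raise ValueError(
                f"expected (1, T), got {dims}; convert multi-channel audio to mono first"
            )
        return 1
    if len(dims) == 3:    # (B, 1, T): batch of mono waveforms
        if dims[1] != 1:
            raise ValueError(f"expected (B, 1, T), got {dims}; channel dimension must be 1")
        return dims[0]
    raise ValueError(f"unsupported shape {dims}")
```

For example, `implied_batch_size(wav.shape)` gives the length you should expect from the output of `utmos.score(wav)`.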

The input should be a floating-point PyTorch tensor. Convert stereo audio to mono before scoring. `utmos.score(wav)` returns a tensor of shape `(batch_size,)` in which each value is a predicted MOS; higher is better. **Batch size 1 is recommended to avoid padding-related score shifts.**
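
One way to follow the batch size 1 recommendation when evaluating many utterances is to score them in a loop and aggregate afterwards. The sketch below assumes only that the scoring function returns a single value per call (for instance a one-element tensor, which `float()` accepts); `score_one_by_one` and `mean_score` are hypothetical helpers, not part of the package.

```python
from typing import Callable, Iterable, List


def score_one_by_one(score_fn: Callable, waveforms: Iterable) -> List[float]:
    """Score each waveform in its own call, so no padding is ever needed."""
    scores = []
    for wav in waveforms:
        result = score_fn(wav)        # one utterance per call (batch size 1)
        scores.append(float(result))  # float() also accepts a one-element tensor
    return scores


def mean_score(scores: List[float]) -> float:
    """Average per-utterance scores into a single system-level number."""
    if not scores:
        raise ValueError("no scores to average")
    return sum(scores) / len(scores)
```

With the model loaded as in the examples above, this could be called as `score_one_by_one(utmos.score, [wav_a, wav_b])`, where each waveform has shape `(T,)` or `(1, T)`. For the TorchScript model loaded with `torch.jit.load`, wrap the call in `torch.no_grad()` as shown earlier.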

API classes:

| Class                | Description                                     |
| -------------------- | ----------------------------------------------- |
| `UTMOSScoreTorch`    | PyTorch implementation using converted weights. |
| `UTMOSScoreScripted` | Wrapper around the TorchScript checkpoint.      |


## Citation

If you use this package, please cite the original UTMOS paper:

```bibtex
@inproceedings{saeki22c_interspeech,
  title     = {{UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022}},
  author    = {Takaaki Saeki and Detai Xin and Wataru Nakata and Tomoki Koriyama and Shinnosuke Takamichi and Hiroshi Saruwatari},
  year      = {2022},
  booktitle = {{Interspeech 2022}},
  pages     = {4521--4525},
  doi       = {10.21437/Interspeech.2022-439},
  issn      = {2958-1796},
}
```