Blinorot committed on
Commit b1fea21 · verified · 1 Parent(s): 91a1fa1

Create README.md

  1. README.md +152 -0
README.md ADDED
@@ -0,0 +1,152 @@
+ ---
+ license: mit
+ ---
+
+ ## About
+
+ This is an unofficial `fairseq`-free implementation of the UTMOS MOS prediction system proposed in [UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022](https://arxiv.org/abs/2204.02152).
+
+ The [original implementation](https://github.com/sarulab-speech/UTMOS22) is based on [fairseq](https://github.com/facebookresearch/fairseq). However, `fairseq` is difficult to install with recent Python, PyTorch, and dependency versions, which makes UTMOS hard to use in modern environments. A [recent study from ICASSP 2026](https://arxiv.org/abs/2509.24457) highlights the high correlation of UTMOS with subjective listening scores for neural codecs. Therefore, modern neural audio codec and TTS research benefits from an easy-to-install UTMOS implementation.
+
+ We provide a `fairseq`-free implementation written in `PyTorch` that matches the [original system](https://github.com/sarulab-speech/UTMOS22) using converted weights and rewritten modules.
+
+ We also provide a `TorchScript` variant that can be loaded with only PyTorch, without installing this package.
+
+ The PyTorch and TorchScript versions are validated against the original implementation and produce matching scores.
+
+ **Note**: As in the original version, we recommend running UTMOS with batch size 1 to avoid metric shifts caused by padding.
+
+ See the [GitHub repository](https://github.com/Blinorot/utmos-pytorch) for the source code.
+
+ ## Usage
+
+ You can install the package with `pip`:
+
+ ```bash
+ pip install utmos-pytorch
+ ```
+
+ Or from source:
+
+ ```bash
+ git clone https://github.com/Blinorot/UTMOS-PyTorch.git
+ cd UTMOS-PyTorch
+ pip install -e .
+ ```
+
+ The code requires:
+
+ | Package         | Version |
+ | --------------- | ------- |
+ | Python          | >=3.9   |
+ | PyTorch         | >=2.2.0 |
+ | HuggingFace Hub | >=0.20  |
+
+ The TorchScript checkpoint was scripted with `PyTorch 2.5.1`. Loading it with older
+ PyTorch versions is not guaranteed to work; `PyTorch >=2.5.1` is recommended for the
+ TorchScript variant.
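The version floors above can be checked at runtime before loading a checkpoint. The following is a minimal sketch and not part of `utmos-pytorch`: the helper names `parse_version` and `meets_minimum` are hypothetical, and the parser only reads the leading numeric components, so local suffixes such as `+cu121` are ignored.

```python
import re


def parse_version(version: str) -> tuple:
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Check installed >= minimum, padding the shorter tuple with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Floors taken from the requirements above: 2.2.0 in general,
# 2.5.1 recommended for the TorchScript variant.
TORCH_MIN = "2.2.0"
TORCHSCRIPT_MIN = "2.5.1"

if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print("PyTorch variant OK:", meets_minimum(torch.__version__, TORCH_MIN))
        print("TorchScript variant OK:", meets_minimum(torch.__version__, TORCHSCRIPT_MIN))
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where `"2.10.0"` sorts before `"2.5.1"` lexicographically.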
+
+ Then, you can run the model as follows:
+
+ ```python
+ import torchaudio
+ from utmos_pytorch import UTMOSScoreTorch
+
+ device = "cpu"  # set to "cuda" to run on GPU
+ utmos = UTMOSScoreTorch(device=device)  # already in eval mode
+
+ # load an audio file, e.g. using torchaudio
+ audio_path = ...  # path to an audio file
+ wav, sr = torchaudio.load(audio_path)
+
+ # convert to mono 16 kHz (keeps only the first channel)
+ TARGET_SR = 16000
+ if wav.shape[0] != 1:
+     wav = wav[0:1]
+ if sr != TARGET_SR:
+     wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)
+
+ # put on device
+ wav = wav.to(device)
+
+ # calculate the score
+ # accepts T, 1xT, Bx1xT
+ utmos_score = utmos.score(wav)  # tensor of shape (batch_size,)
+ ```
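The example above reduces a multi-channel file to mono by keeping only its first channel. A common alternative is to average the channels instead. The sketch below is an assumption rather than something this package prescribes: `downmix_to_mono` is a hypothetical helper, and scores for an averaged signal may differ from scores for the first channel alone.

```python
import torch


def downmix_to_mono(wav: torch.Tensor) -> torch.Tensor:
    """Average a (channels, T) waveform over channels, returning shape (1, T)."""
    if wav.dim() != 2:
        raise ValueError(f"expected a (channels, T) tensor, got shape {tuple(wav.shape)}")
    return wav.mean(dim=0, keepdim=True)


# Synthetic stereo signal standing in for the output of torchaudio.load
stereo = torch.stack([torch.ones(16000), torch.zeros(16000)])
mono = downmix_to_mono(stereo)
print(mono.shape)  # torch.Size([1, 16000])
```

The result keeps the channel dimension, so it can be resampled and scored exactly like `wav[0:1]` in the example above.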
+
+ You can replace `UTMOSScoreTorch` with `UTMOSScoreScripted` to use the `TorchScript` variant instead. On first use, the package downloads the converted UTMOS weights from [Hugging Face Hub](https://huggingface.co/Blinorot/UTMOS-PyTorch) and stores them in the local Hugging Face cache.
+
+ For `TorchScript`, you can skip installing the package and load the model directly:
+
+ ```python
+ import torch
+ import torchaudio
+ import wget  # third-party package: pip install wget
+
+ # download scripted checkpoint, e.g. using wget
+ checkpoint_url = "https://huggingface.co/Blinorot/UTMOS-PyTorch/resolve/main/utmos_scripted.pt"
+ checkpoint_path = ...  # path to saved checkpoint
+ wget.download(checkpoint_url, checkpoint_path)
+
+ # load directly with torch.jit
+ device = "cpu"  # set to "cuda" to run on GPU
+ utmos = torch.jit.load(checkpoint_path, map_location=device)
+ utmos.eval()
+
+ # load an audio file, e.g. using torchaudio
+ audio_path = ...  # path to an audio file
+ wav, sr = torchaudio.load(audio_path)
+
+ # convert to mono 16 kHz (keeps only the first channel)
+ TARGET_SR = 16000
+ if wav.shape[0] != 1:
+     wav = wav[0:1]
+ if sr != TARGET_SR:
+     wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=TARGET_SR)
+
+ # put on device
+ wav = wav.to(device)
+
+ # calculate the score
+ # accepts T, 1xT, Bx1xT
+ with torch.no_grad():
+     utmos_score = utmos.score(wav)  # tensor of shape (batch_size,)
+ ```
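If `huggingface_hub` is already installed, the checkpoint can also be fetched through its cache instead of `wget`, so repeated runs reuse the same file. This is a sketch under assumptions: `load_scripted_utmos` is a hypothetical helper name, and the repository id and filename are read off the checkpoint URL in the example above. Unlike the `wget` route, it adds `huggingface_hub` as a dependency.

```python
REPO_ID = "Blinorot/UTMOS-PyTorch"  # taken from the checkpoint URL above
FILENAME = "utmos_scripted.pt"      # taken from the checkpoint URL above


def load_scripted_utmos(device: str = "cpu"):
    """Download the scripted checkpoint (or reuse the cached copy) and load it."""
    # Imported lazily so that defining this helper needs neither library.
    import torch
    from huggingface_hub import hf_hub_download

    # hf_hub_download returns the local path of the cached file.
    checkpoint_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
    model = torch.jit.load(checkpoint_path, map_location=device)
    model.eval()
    return model
```

Calling `utmos = load_scripted_utmos("cpu")` would then give an object that is used the same way as `utmos` in the example above.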
+
+ ### Notes
+
+ The model expects audio sampled at **16 kHz**.
+
+ Accepted tensor shapes:
+
+ | Shape       | Meaning                                     |
+ | ----------- | ------------------------------------------- |
+ | `(T,)`      | single mono waveform                        |
+ | `(1, T)`    | single mono waveform with channel dimension |
+ | `(B, 1, T)` | batch of mono waveforms                     |
+
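The table above can be turned into a small pre-flight check that rejects unexpected layouts (for example, a stereo `(2, T)` tensor) before they reach the model. This is only a sketch: `expected_batch_size` is a hypothetical helper that mirrors the table, and how the package itself handles other shapes is not stated here.

```python
def expected_batch_size(shape) -> int:
    """Return the number of scores expected for a waveform of the given shape.

    Accepts the three layouts from the table: (T,), (1, T) and (B, 1, T).
    """
    shape = tuple(shape)
    if len(shape) == 1:
        return 1  # (T,): single mono waveform
    if len(shape) == 2 and shape[0] == 1:
        return 1  # (1, T): single mono waveform with channel dimension
    if len(shape) == 3 and shape[1] == 1:
        return shape[0]  # (B, 1, T): batch of mono waveforms
    raise ValueError(f"unsupported waveform shape: {shape}")


print(expected_batch_size((16000,)))       # 1
print(expected_batch_size((1, 16000)))     # 1
print(expected_batch_size((4, 1, 16000)))  # 4
```

With a PyTorch tensor, the helper can be called as `expected_batch_size(wav.shape)`.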
+ The input should be a floating-point PyTorch tensor. Stereo audio should be converted to mono before scoring. `utmos.score(wav)` returns a tensor of shape `(batch_size,)`, where each value is a predicted MOS score. Higher is better. **Batch size 1 is recommended to avoid padding-related score shifts.**
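One way to follow the batch size 1 recommendation when scoring several files is to call the scorer once per waveform and collect the results. A minimal sketch, assuming each call returns a single value (for example, a tensor of shape `(1,)`); `score_one_by_one` is a hypothetical helper, not part of the package API.

```python
from typing import Callable, Iterable, List


def score_one_by_one(score_fn: Callable, waveforms: Iterable) -> List[float]:
    """Score each waveform separately so that no padding is ever needed."""
    scores = []
    for wav in waveforms:
        result = score_fn(wav)  # one waveform per call, i.e. batch size 1
        # float() works for Python numbers and for single-element tensors
        scores.append(float(result))
    return scores
```

For example, `scores = score_one_by_one(utmos.score, wavs)`, where `wavs` is a list of mono 16 kHz tensors prepared as in the usage examples. Wrapping the call in `torch.no_grad()` mirrors the TorchScript example.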
+
+ API classes:
+
+ | Class                | Description                                     |
+ | -------------------- | ----------------------------------------------- |
+ | `UTMOSScoreTorch`    | PyTorch implementation using converted weights. |
+ | `UTMOSScoreScripted` | Wrapper around the TorchScript checkpoint.      |
+
+
+ ## Citation
+
+ If you use this package, please cite the original UTMOS paper:
+
+ ```bibtex
+ @inproceedings{saeki22c_interspeech,
+   title = {{UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022}},
+   author = {Takaaki Saeki and Detai Xin and Wataru Nakata and Tomoki Koriyama and Shinnosuke Takamichi and Hiroshi Saruwatari},
+   year = {2022},
+   booktitle = {{Interspeech 2022}},
+   pages = {4521--4525},
+   doi = {10.21437/Interspeech.2022-439},
+   issn = {2958-1796},
+ }
+ ```