---
license: apache-2.0
library_name: pytorch
tags:
- text-to-speech
- speech-synthesis
- discrete-speech-synthesis
- neural-codec-language-model
- spoof-detection
- hierarchical-decoding
- pytorch
---

# MSpoofTTS Discriminator Checkpoints

This repository provides the discriminator checkpoints used in **MSpoofTTS: Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection**.

Paper: [Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection](https://arxiv.org/abs/2603.05373)

Demo: https://danny-nus.github.io/MSpoofTTS.github.io/

This repository **hosts checkpoints only**. The discriminator architecture definitions are not included here, so use these checkpoints together with the official MSpoofTTS codebase.

## Checkpoints

| File | Model Type | Segment Length | Scale |
|---|---|---:|---:|
| `checkpoints/segment_len50.ckpt` | SegmentTokenDiscriminator | 50 | - |
| `checkpoints/segment_len25.ckpt` | SegmentTokenDiscriminator | 25 | - |
| `checkpoints/segment_len10.ckpt` | SegmentTokenDiscriminator | 10 | - |
| `checkpoints/strided_seg50_scale10.ckpt` | StridedSegmentTokenDiscriminator | 50 | 10 |
| `checkpoints/strided_seg50_scale25.ckpt` | StridedSegmentTokenDiscriminator | 50 | 25 |

## Model Configuration

All discriminators use the following base configuration:

```python
vocab_size = 65536
d_model = 256
nhead = 8
num_layers = 4
dim_feedforward = 1024
dropout = 0.1
```

The segment-level discriminators use `segment_len` values of 10, 25, and 50.

The strided discriminators use `segment_len=50` with scales 10 and 25.
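
When handling all five discriminators in a loop, it can help to collect the table and the base configuration into one structure. The following is a minimal sketch: `DiscriminatorConfig` and `VARIANTS` are hypothetical names introduced here for illustration, and the constructor signatures in the official MSpoofTTS codebase may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DiscriminatorConfig:
    """Hypothetical container; field names mirror the values listed in this card."""

    model_type: str
    segment_len: int
    scale: Optional[int] = None  # only set for the strided discriminators
    vocab_size: int = 65536
    d_model: int = 256
    nhead: int = 8
    num_layers: int = 4
    dim_feedforward: int = 1024
    dropout: float = 0.1


# One entry per checkpoint file in the Checkpoints table.
VARIANTS = {
    "checkpoints/segment_len50.ckpt": DiscriminatorConfig("SegmentTokenDiscriminator", 50),
    "checkpoints/segment_len25.ckpt": DiscriminatorConfig("SegmentTokenDiscriminator", 25),
    "checkpoints/segment_len10.ckpt": DiscriminatorConfig("SegmentTokenDiscriminator", 10),
    "checkpoints/strided_seg50_scale10.ckpt": DiscriminatorConfig(
        "StridedSegmentTokenDiscriminator", 50, scale=10
    ),
    "checkpoints/strided_seg50_scale25.ckpt": DiscriminatorConfig(
        "StridedSegmentTokenDiscriminator", 50, scale=25
    ),
}

for path, cfg in VARIANTS.items():
    print(path, cfg.model_type, cfg.segment_len, cfg.scale)
```

Keeping the shared hyperparameters as defaults means each entry only states what differs between checkpoints.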

## Usage

Install the Hugging Face Hub package:

```bash
pip install -U huggingface_hub
```

Download a checkpoint:

```python
from huggingface_hub import hf_hub_download

repo_id = "Chanson-0803/MSpoofTTS"

ckpt_path = hf_hub_download(
    repo_id=repo_id,
    filename="checkpoints/segment_len50.ckpt",
    repo_type="model",
)

print(ckpt_path)
```

Then load the checkpoint using the corresponding discriminator class from the MSpoofTTS codebase:

```python
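import torch

# Import the discriminator class from the official MSpoofTTS codebase.
# The module path below is a placeholder.
# from your_mspoof_code import SegmentTokenDiscriminator

# Instantiate the model before loading weights. The constructor arguments are
# defined by the official codebase and are not documented in this repository.
# model = SegmentTokenDiscriminator(...)

# `ckpt_path` comes from the download step above.
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()  # disable dropout for inference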
```
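
The loading snippet above reads the weights from the `model_state_dict` key. If a checkpoint is loaded with a different layout, a bare `KeyError` is not very informative, so a small guard can help. The sketch below is an assumption-level helper: `extract_state_dict` is a hypothetical name, and the example dictionary is a made-up stand-in for the object that `torch.load` returns.

```python
from typing import Any, Mapping


def extract_state_dict(
    checkpoint: Mapping[str, Any], key: str = "model_state_dict"
) -> Mapping[str, Any]:
    """Return the weights stored under `key`, or fail with a readable message."""
    if key not in checkpoint:
        available = ", ".join(sorted(map(str, checkpoint.keys())))
        raise KeyError(f"'{key}' not found in checkpoint; available keys: {available}")
    return checkpoint[key]


# Made-up stand-in for illustration; a real checkpoint comes from torch.load.
example = {"model_state_dict": {"layer.weight": [0.0]}, "epoch": 3}
weights = extract_state_dict(example)
print(list(weights))
```

With a real checkpoint, the call would be `model.load_state_dict(extract_state_dict(state))`.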

For hierarchical decoding, use the following checkpoint files:

```python
checkpoint_files = {
    "segment_len50": "checkpoints/segment_len50.ckpt",
    "segment_len25": "checkpoints/segment_len25.ckpt",
    "segment_len10": "checkpoints/segment_len10.ckpt",
    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
}
```
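
To fetch all five files in one pass, the mapping above can be combined with `hf_hub_download`. This is a minimal sketch, not part of the official codebase: `download_all` and its `downloader` parameter are hypothetical names, and the parameter exists only so the helper can be exercised without network access.

```python
from typing import Callable, Dict, Optional

REPO_ID = "Chanson-0803/MSpoofTTS"

CHECKPOINT_FILES = {
    "segment_len50": "checkpoints/segment_len50.ckpt",
    "segment_len25": "checkpoints/segment_len25.ckpt",
    "segment_len10": "checkpoints/segment_len10.ckpt",
    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
}


def download_all(
    files: Dict[str, str],
    repo_id: str = REPO_ID,
    downloader: Optional[Callable[..., str]] = None,
) -> Dict[str, str]:
    """Download every checkpoint and return a name -> local path mapping."""
    if downloader is None:
        # Imported lazily so the helper can be tested with a fake downloader.
        from huggingface_hub import hf_hub_download

        downloader = hf_hub_download
    return {
        name: downloader(repo_id=repo_id, filename=filename, repo_type="model")
        for name, filename in files.items()
    }


# Uncomment to perform the actual downloads:
# local_paths = download_all(CHECKPOINT_FILES)
```

Files that are already in the local Hugging Face cache are not downloaded again, so repeated calls are cheap.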

## Intended Use

These checkpoints are intended for research on discrete speech synthesis, neural codec language models, inference-time decoding guidance, spoof detection for generated speech tokens, and hierarchical multi-resolution decoding.

## Limitations

These checkpoints are designed for the speech-token vocabulary and discriminator architectures used in MSpoofTTS. They may not be directly compatible with other codec tokenizers, vocabulary layouts, or speech language models without adaptation.
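
A quick sanity check before scoring tokens from another tokenizer is to confirm that every token ID fits the vocabulary size listed above. The sketch below assumes token IDs are zero-based indices into a vocabulary of `vocab_size` entries; `out_of_range_tokens` is a hypothetical helper, and passing this check does not show that the vocabulary layout matches the one used in MSpoofTTS.

```python
from typing import Iterable, List

VOCAB_SIZE = 65536  # value from the Model Configuration section


def out_of_range_tokens(tokens: Iterable[int], vocab_size: int = VOCAB_SIZE) -> List[int]:
    """Return the positions of token IDs that fall outside [0, vocab_size)."""
    return [i for i, t in enumerate(tokens) if not 0 <= t < vocab_size]


bad = out_of_range_tokens([0, 1024, 65535, 65536, -1])
print(bad)  # [3, 4]
```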

## Citation

```bibtex
@article{zhao2026hierarchical,
  title={Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection},
  author={Zhao, Junchuan and Vu, Minh Duc and Wang, Ye},
  journal={arXiv preprint arXiv:2603.05373},
  year={2026}
}
```