---
license: apache-2.0
library_name: pytorch
tags:
- text-to-speech
- speech-synthesis
- discrete-speech-synthesis
- neural-codec-language-model
- spoof-detection
- hierarchical-decoding
- pytorch
---

# MSpoofTTS Discriminator Checkpoints

This repository provides the discriminator checkpoints used in **MSpoofTTS: Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection**.

Paper: [Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection](https://arxiv.org/abs/2603.05373)

Demo: https://danny-nus.github.io/MSpoofTTS.github.io/

This repository is a **checkpoint hosting repository** only: the discriminator architecture definitions are not included here. Use these checkpoints together with the official MSpoofTTS codebase.

## Checkpoints

| File | Model Type | Segment Length | Scale |
|---|---|---:|---:|
| `checkpoints/segment_len50.ckpt` | SegmentTokenDiscriminator | 50 | - |
| `checkpoints/segment_len25.ckpt` | SegmentTokenDiscriminator | 25 | - |
| `checkpoints/segment_len10.ckpt` | SegmentTokenDiscriminator | 10 | - |
| `checkpoints/strided_seg50_scale10.ckpt` | StridedSegmentTokenDiscriminator | 50 | 10 |
| `checkpoints/strided_seg50_scale25.ckpt` | StridedSegmentTokenDiscriminator | 50 | 25 |

## Model Configuration

All discriminators use the following base configuration:

```python
vocab_size = 65536
d_model = 256
nhead = 8
num_layers = 4
dim_feedforward = 1024
dropout = 0.1
```

The segment-level discriminators use `segment_len` values of 10, 25, and 50.

The strided discriminators use `segment_len=50` with scales 10 and 25.
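
The checkpoint table and the base configuration can be gathered into a single lookup structure. The following is a minimal sketch and is not part of the MSpoofTTS codebase: `DiscriminatorSpec`, `SPECS`, and `full_config` are hypothetical names introduced here for illustration, and the keyword names accepted by the real discriminator constructors (in particular `segment_len` and `scale`) are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Base hyperparameters shared by all discriminators (copied from this card).
BASE_CONFIG = {
    "vocab_size": 65536,
    "d_model": 256,
    "nhead": 8,
    "num_layers": 4,
    "dim_feedforward": 1024,
    "dropout": 0.1,
}


@dataclass(frozen=True)
class DiscriminatorSpec:
    """Hypothetical record describing one row of the checkpoint table."""
    filename: str
    model_type: str
    segment_len: int
    scale: Optional[int] = None  # only set for the strided discriminators


SPECS: Dict[str, DiscriminatorSpec] = {
    "segment_len50": DiscriminatorSpec(
        "checkpoints/segment_len50.ckpt", "SegmentTokenDiscriminator", 50),
    "segment_len25": DiscriminatorSpec(
        "checkpoints/segment_len25.ckpt", "SegmentTokenDiscriminator", 25),
    "segment_len10": DiscriminatorSpec(
        "checkpoints/segment_len10.ckpt", "SegmentTokenDiscriminator", 10),
    "strided_seg50_scale10": DiscriminatorSpec(
        "checkpoints/strided_seg50_scale10.ckpt",
        "StridedSegmentTokenDiscriminator", 50, 10),
    "strided_seg50_scale25": DiscriminatorSpec(
        "checkpoints/strided_seg50_scale25.ckpt",
        "StridedSegmentTokenDiscriminator", 50, 25),
}


def full_config(name: str) -> dict:
    """Merge the base configuration with the per-checkpoint settings."""
    spec = SPECS[name]
    config = dict(BASE_CONFIG, segment_len=spec.segment_len)
    if spec.scale is not None:
        config["scale"] = spec.scale
    return config
```

For example, `full_config("strided_seg50_scale25")` returns the six base hyperparameters plus `segment_len=50` and `scale=25`. Whether these keys can be passed directly to the discriminator classes depends on the official codebase.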

## Usage

Install the Hugging Face Hub package:

```bash
pip install -U huggingface_hub
```

Download a checkpoint:

```python
from huggingface_hub import hf_hub_download

repo_id = "Chanson-0803/MSpoofTTS"

ckpt_path = hf_hub_download(
    repo_id=repo_id,
    filename="checkpoints/segment_len50.ckpt",
    repo_type="model",
)

print(ckpt_path)
```

Then load the checkpoint using the corresponding discriminator class from the MSpoofTTS codebase:

```python
import torch

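# Import the discriminator class from the official MSpoofTTS codebase
# (the module name below is a placeholder).
# from your_mspoof_code import SegmentTokenDiscriminator

# Build the model before loading weights. The constructor arguments are
# defined by the MSpoofTTS codebase and should match "Model Configuration".
# model = SegmentTokenDiscriminator(...)

state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()  # disable dropout for inference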
```
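
The loading step above expects the checkpoint to be a dictionary with a `model_state_dict` entry. If a file does not have that layout, the resulting `KeyError` is not very informative. The helper below is a small sketch of a clearer check; `get_model_state_dict` is a hypothetical name and not a function from the MSpoofTTS codebase.

```python
from typing import Any, Mapping


def get_model_state_dict(
    checkpoint: Mapping[str, Any],
    key: str = "model_state_dict",
) -> Mapping[str, Any]:
    """Return the weights stored under `key`, or fail with the keys that exist."""
    if key not in checkpoint:
        available = ", ".join(sorted(str(k) for k in checkpoint.keys()))
        raise KeyError(f"Expected key '{key}' in checkpoint; found: {available}")
    return checkpoint[key]
```

With this helper, the loading call becomes `model.load_state_dict(get_model_state_dict(state))`.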

For hierarchical decoding, use the following checkpoint files:

```python
checkpoint_files = {
    "segment_len50": "checkpoints/segment_len50.ckpt",
    "segment_len25": "checkpoints/segment_len25.ckpt",
    "segment_len10": "checkpoints/segment_len10.ckpt",
    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
}
```
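
Downloading all five files can be done in one loop. The sketch below reuses the `hf_hub_download` call shown earlier with the same arguments; the `download_checkpoints` function and its `download_fn` parameter are hypothetical additions for illustration (the parameter exists only so the loop can be exercised without network access).

```python
from typing import Callable, Dict, Optional

REPO_ID = "Chanson-0803/MSpoofTTS"

CHECKPOINT_FILES = {
    "segment_len50": "checkpoints/segment_len50.ckpt",
    "segment_len25": "checkpoints/segment_len25.ckpt",
    "segment_len10": "checkpoints/segment_len10.ckpt",
    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
}


def download_checkpoints(
    repo_id: str = REPO_ID,
    files: Optional[Dict[str, str]] = None,
    download_fn: Optional[Callable[..., str]] = None,
) -> Dict[str, str]:
    """Download every checkpoint and return a name -> local path mapping."""
    if files is None:
        files = CHECKPOINT_FILES
    if download_fn is None:
        # Imported lazily so the module can be loaded without the package.
        from huggingface_hub import hf_hub_download
        download_fn = hf_hub_download
    return {
        name: download_fn(repo_id=repo_id, filename=filename, repo_type="model")
        for name, filename in files.items()
    }
```

Calling `download_checkpoints()` with no arguments fetches all files into the local Hugging Face cache and returns their paths, keyed by the names used above.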

## Intended Use

These checkpoints are intended for research on discrete speech synthesis, neural codec language models, inference-time decoding guidance, spoof detection for generated speech tokens, and hierarchical multi-resolution decoding.

## Limitations

These checkpoints are designed for the speech-token vocabulary and discriminator architectures used in MSpoofTTS. They may not be directly compatible with other codec tokenizers, vocabulary layouts, or speech language models without adaptation.
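
One inexpensive compatibility check is to confirm that token IDs fall inside the discriminators' vocabulary range before scoring them. The sketch below uses the `vocab_size` from the base configuration; `find_out_of_range_tokens` is a hypothetical helper, not part of the MSpoofTTS codebase. It catches only out-of-range IDs and cannot detect a different codebook layout that happens to use the same ID range.

```python
from typing import Iterable, List

VOCAB_SIZE = 65536  # vocab_size from the base configuration in this card


def find_out_of_range_tokens(
    token_ids: Iterable[int],
    vocab_size: int = VOCAB_SIZE,
) -> List[int]:
    """Return the positions whose token ID lies outside [0, vocab_size)."""
    return [
        position
        for position, token in enumerate(token_ids)
        if not 0 <= token < vocab_size
    ]
```

An empty result means every ID is a valid embedding index for these checkpoints; a non-empty result lists the offending positions.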

## Citation

```bibtex
@article{zhao2026hierarchical,
  title={Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection},
  author={Zhao, Junchuan and Vu, Minh Duc and Wang, Ye},
  journal={arXiv preprint arXiv:2603.05373},
  year={2026}
}
```