---
license: apache-2.0
library_name: pytorch
tags:
- text-to-speech
- speech-synthesis
- discrete-speech-synthesis
- neural-codec-language-model
- spoof-detection
- hierarchical-decoding
- pytorch
---
# MSpoofTTS Discriminator Checkpoints
This repository provides the discriminator checkpoints used in **MSpoofTTS: Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection**.
Paper: [Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection](https://arxiv.org/abs/2603.05373)
Demo: https://danny-nus.github.io/MSpoofTTS.github.io/
This repository only hosts **checkpoints**; the discriminator architecture definitions are not included here. Use these checkpoints together with the official MSpoofTTS codebase.
## Checkpoints
| File | Model Type | Segment Length | Scale |
|---|---|---:|---:|
| `checkpoints/segment_len50.ckpt` | SegmentTokenDiscriminator | 50 | - |
| `checkpoints/segment_len25.ckpt` | SegmentTokenDiscriminator | 25 | - |
| `checkpoints/segment_len10.ckpt` | SegmentTokenDiscriminator | 10 | - |
| `checkpoints/strided_seg50_scale10.ckpt` | StridedSegmentTokenDiscriminator | 50 | 10 |
| `checkpoints/strided_seg50_scale25.ckpt` | StridedSegmentTokenDiscriminator | 50 | 25 |
## Model Configuration
All discriminators use the following base configuration:
```python
vocab_size = 65536
d_model = 256
nhead = 8
num_layers = 4
dim_feedforward = 1024
dropout = 0.1
```
The segment-level discriminators use `segment_len` values of 10, 25, and 50.
The strided discriminators use `segment_len=50` with scales 10 and 25.
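For bookkeeping, the configuration above can be collected into a small structure. The following is a minimal sketch, not code from the MSpoofTTS codebase: the `DiscriminatorConfig` dataclass, its `segment_len` and `scale` field names, and the `VARIANTS` list are hypothetical names chosen for illustration, and the official discriminator constructors may take different arguments. The `head_dim` check assumes a Transformer-style model, which the parameter names suggest but this card does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DiscriminatorConfig:
    """Hypothetical container for the hyperparameters listed in this card."""

    segment_len: int
    scale: Optional[int] = None  # None for the non-strided discriminators
    vocab_size: int = 65536
    d_model: int = 256
    nhead: int = 8
    num_layers: int = 4
    dim_feedforward: int = 1024
    dropout: float = 0.1

    @property
    def head_dim(self) -> int:
        # Assumption: a Transformer-style model, where d_model is split
        # evenly across attention heads.
        if self.d_model % self.nhead != 0:
            raise ValueError("d_model must be divisible by nhead")
        return self.d_model // self.nhead


# The five variants hosted in this repository.
VARIANTS = [
    DiscriminatorConfig(segment_len=50),
    DiscriminatorConfig(segment_len=25),
    DiscriminatorConfig(segment_len=10),
    DiscriminatorConfig(segment_len=50, scale=10),
    DiscriminatorConfig(segment_len=50, scale=25),
]

for cfg in VARIANTS:
    kind = "strided" if cfg.scale is not None else "segment"
    print(kind, cfg.segment_len, cfg.scale, cfg.head_dim)
```

Keeping the shared values as defaults means each variant only spells out what differs, namely `segment_len` and, for the strided models, `scale`.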
## Usage
Install the Hugging Face Hub package:
```bash
pip install -U huggingface_hub
```
Download a checkpoint:
```python
from huggingface_hub import hf_hub_download

repo_id = "Chanson-0803/MSpoofTTS"
ckpt_path = hf_hub_download(
    repo_id=repo_id,
    filename="checkpoints/segment_len50.ckpt",
    repo_type="model",
)
print(ckpt_path)
```
Then load the checkpoint using the corresponding discriminator class from the MSpoofTTS codebase:
```python
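import torch

# Import the discriminator class from the official MSpoofTTS codebase.
# The module name below is a placeholder for wherever that code lives.
# from your_mspoof_code import SegmentTokenDiscriminator

# Instantiate the class before loading weights. The constructor arguments
# are defined by the official codebase and are not reproduced here.
# model = SegmentTokenDiscriminator(...)

state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()  # evaluation mode: disables dropout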
```
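If `load_state_dict` reports missing or unexpected keys, it can help to look at what the checkpoint actually contains. The sketch below is illustrative only: `extract_state_dict` and `summarize` are hypothetical helper names, and the fallback to a bare state dict is an assumption for robustness, since this card only documents the `"model_state_dict"` layout.

```python
from collections.abc import Mapping
from typing import Any, List


def extract_state_dict(checkpoint: Any, key: str = "model_state_dict") -> Mapping:
    """Return the weights mapping from a loaded checkpoint object.

    Uses the documented layout (weights stored under `key`). If `key` is
    absent, the object itself is returned, on the assumption that it is
    already a bare state dict.
    """
    if not isinstance(checkpoint, Mapping):
        raise TypeError(f"expected a mapping, got {type(checkpoint).__name__}")
    if key in checkpoint:
        return checkpoint[key]
    return checkpoint


def summarize(state_dict: Mapping, limit: int = 5) -> List[str]:
    """Return the first few parameter names for a quick visual check."""
    return list(state_dict)[:limit]
```

With a checkpoint loaded as `state = torch.load(ckpt_path, map_location="cpu")`, calling `summarize(extract_state_dict(state))` lists a few parameter names that can be compared against `model.state_dict().keys()`.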
For hierarchical decoding, use the following checkpoint files:
```python
checkpoint_files = {
    "segment_len50": "checkpoints/segment_len50.ckpt",
    "segment_len25": "checkpoints/segment_len25.ckpt",
    "segment_len10": "checkpoints/segment_len10.ckpt",
    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
}
```
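Since hierarchical decoding uses all five files, it may be convenient to fetch them in one pass. The following is a minimal sketch under stated assumptions: `download_all` and `hub_fetch` are hypothetical helper names, not part of the MSpoofTTS codebase. The download function is passed in as an argument so the mapping logic can be exercised without network access.

```python
from typing import Callable, Dict

REPO_ID = "Chanson-0803/MSpoofTTS"

CHECKPOINT_FILES = {
    "segment_len50": "checkpoints/segment_len50.ckpt",
    "segment_len25": "checkpoints/segment_len25.ckpt",
    "segment_len10": "checkpoints/segment_len10.ckpt",
    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
}


def download_all(
    files: Dict[str, str],
    fetch: Callable[[str], str],
) -> Dict[str, str]:
    """Map each checkpoint name to the local path returned by `fetch`."""
    return {name: fetch(filename) for name, filename in files.items()}


def hub_fetch(filename: str) -> str:
    """Download one file from the Hub and return its local cache path."""
    # Imported lazily so the module can be loaded without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=filename, repo_type="model")
```

Calling `download_all(CHECKPOINT_FILES, hub_fetch)` returns a dictionary from checkpoint name to local path; files already in the local Hub cache are not downloaded again.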
## Intended Use
These checkpoints are intended for research on discrete speech synthesis, neural codec language models, inference-time decoding guidance, spoof detection for generated speech tokens, and hierarchical multi-resolution decoding.
## Limitations
These checkpoints are designed for the speech-token vocabulary and discriminator architectures used in MSpoofTTS. They may not be directly compatible with other codec tokenizers, vocabulary layouts, or speech language models without adaptation.
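One inexpensive sanity check before scoring tokens from a different tokenizer is to confirm that every token ID fits the discriminator vocabulary. The sketch below assumes token IDs are plain integers indexed from 0 and uses the `vocab_size` of 65536 from the base configuration; `out_of_range_tokens` and `check_tokens` are hypothetical helper names. Passing this check is necessary but not sufficient: IDs in range can still come from an incompatible codec or vocabulary layout.

```python
from typing import List, Sequence

VOCAB_SIZE = 65536  # from the base configuration in this card


def out_of_range_tokens(
    tokens: Sequence[int], vocab_size: int = VOCAB_SIZE
) -> List[int]:
    """Return the positions of token IDs outside [0, vocab_size)."""
    return [i for i, t in enumerate(tokens) if not 0 <= t < vocab_size]


def check_tokens(tokens: Sequence[int], vocab_size: int = VOCAB_SIZE) -> None:
    """Raise ValueError if any token ID would fall outside the vocabulary."""
    bad = out_of_range_tokens(tokens, vocab_size)
    if bad:
        raise ValueError(
            f"{len(bad)} token(s) outside [0, {vocab_size}); "
            f"first offending position: {bad[0]}"
        )
```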
## Citation
```bibtex
@article{zhao2026hierarchical,
  title={Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection},
  author={Zhao, Junchuan and Vu, Minh Duc and Wang, Ye},
  journal={arXiv preprint arXiv:2603.05373},
  year={2026}
}
```