	---
	license: apache-2.0
	library_name: pytorch
	tags:
	- text-to-speech
	- speech-synthesis
	- discrete-speech-synthesis
	- neural-codec-language-model
	- spoof-detection
	- hierarchical-decoding
	- pytorch
	---

	# MSpoofTTS Discriminator Checkpoints

	This repository provides the discriminator checkpoints used in MSpoofTTS: Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection.

	Paper: [Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection](https://arxiv.org/abs/2603.05373)

	Demo: https://danny-nus.github.io/MSpoofTTS.github.io/

	This repository hosts checkpoints only; the discriminator architecture definitions are not included. Use these checkpoints together with the official MSpoofTTS codebase.

	## Checkpoints

	| File | Model Type | Segment Length | Scale |
	|---|---|---:|---:|
	| `checkpoints/segment_len50.ckpt` | SegmentTokenDiscriminator | 50 | - |
	| `checkpoints/segment_len25.ckpt` | SegmentTokenDiscriminator | 25 | - |
	| `checkpoints/segment_len10.ckpt` | SegmentTokenDiscriminator | 10 | - |
	| `checkpoints/strided_seg50_scale10.ckpt` | StridedSegmentTokenDiscriminator | 50 | 10 |
	| `checkpoints/strided_seg50_scale25.ckpt` | StridedSegmentTokenDiscriminator | 50 | 25 |

	## Model Configuration

	All discriminators use the following base configuration:

	```python
	vocab_size = 65536
	d_model = 256
	nhead = 8
	num_layers = 4
	dim_feedforward = 1024
	dropout = 0.1
	```

	The segment-level discriminators use `segment_len` values of 10, 25, and 50.

	The strided discriminators use `segment_len=50` with scales 10 and 25.
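
To keep the per-checkpoint settings in one place, the table and base configuration above can be collected into a small registry. This is a minimal sketch: `DiscriminatorSpec`, `SPECS`, and `model_kwargs` are hypothetical names that are not part of the MSpoofTTS codebase, and the keyword names returned by `model_kwargs` are an assumption that may not match the real constructor signatures.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DiscriminatorSpec:
    """Hypothetical container for one checkpoint's settings."""

    filename: str
    model_type: str
    segment_len: int
    scale: Optional[int] = None  # set only for strided discriminators
    # Base configuration shared by all discriminators (from this model card).
    vocab_size: int = 65536
    d_model: int = 256
    nhead: int = 8
    num_layers: int = 4
    dim_feedforward: int = 1024
    dropout: float = 0.1


SEGMENT = "SegmentTokenDiscriminator"
STRIDED = "StridedSegmentTokenDiscriminator"

SPECS = {
    "segment_len50": DiscriminatorSpec("checkpoints/segment_len50.ckpt", SEGMENT, 50),
    "segment_len25": DiscriminatorSpec("checkpoints/segment_len25.ckpt", SEGMENT, 25),
    "segment_len10": DiscriminatorSpec("checkpoints/segment_len10.ckpt", SEGMENT, 10),
    "strided_seg50_scale10": DiscriminatorSpec(
        "checkpoints/strided_seg50_scale10.ckpt", STRIDED, 50, scale=10
    ),
    "strided_seg50_scale25": DiscriminatorSpec(
        "checkpoints/strided_seg50_scale25.ckpt", STRIDED, 50, scale=25
    ),
}


def model_kwargs(spec: DiscriminatorSpec) -> dict:
    """Return the numeric settings as a dict, without bookkeeping fields.

    The key names are assumed; adapt them to the actual constructor.
    """
    kwargs = asdict(spec)
    kwargs.pop("filename")
    kwargs.pop("model_type")
    if kwargs["scale"] is None:
        kwargs.pop("scale")
    return kwargs
```

With this in place, `model_kwargs(SPECS["segment_len25"])` yields the shared base settings plus `segment_len=25`, and the strided entries additionally carry `scale`.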

	## Usage

	Install the Hugging Face Hub package:

	```bash
	pip install -U huggingface_hub
	```

	Download a checkpoint:

	```python
	from huggingface_hub import hf_hub_download

	repo_id = "Chanson-0803/MSpoofTTS"

	ckpt_path = hf_hub_download(
	    repo_id=repo_id,
	    filename="checkpoints/segment_len50.ckpt",
	    repo_type="model",
	)

	print(ckpt_path)
	```

	Then load the checkpoint using the corresponding discriminator class from the MSpoofTTS codebase:

	```python
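	import torch

	# Import the discriminator class from the official MSpoofTTS codebase.
	# The module name below is a placeholder.
	# from your_mspoof_code import SegmentTokenDiscriminator

	# Build the model before loading weights. The constructor arguments are an
	# assumption that mirrors the base configuration above; check the codebase
	# for the actual signature.
	# model = SegmentTokenDiscriminator(..., segment_len=50)

	state = torch.load(ckpt_path, map_location="cpu")
	model.load_state_dict(state["model_state_dict"])
	model.eval()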
	```
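
Before calling `load_state_dict`, it can help to confirm that the loaded object has the expected layout. The sketch below assumes only what the snippet above shows, namely that the checkpoint is a dict with a `model_state_dict` entry; `extract_state_dict` and `summarize` are hypothetical helpers, not functions from the MSpoofTTS codebase.

```python
def extract_state_dict(checkpoint, key="model_state_dict"):
    """Return the weights stored under `key`, with a readable error otherwise."""
    if not isinstance(checkpoint, dict):
        raise TypeError(f"expected a dict checkpoint, got {type(checkpoint).__name__}")
    if key not in checkpoint:
        available = sorted(str(k) for k in checkpoint)
        raise KeyError(f"{key!r} not found; available keys: {available}")
    return checkpoint[key]


def summarize(state_dict, limit=5):
    """Return (name, shape) pairs for the first `limit` entries."""
    rows = []
    for name, value in list(state_dict.items())[:limit]:
        # Tensors expose `.shape`; anything else is reported with shape None.
        shape = tuple(value.shape) if hasattr(value, "shape") else None
        rows.append((name, shape))
    return rows
```

For example, `summarize(extract_state_dict(state))` lists the first few parameter names and shapes, which can be compared against the instantiated model's `state_dict()` before loading.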

	For hierarchical decoding, use the following checkpoint files:

	```python
	checkpoint_files = {
	    "segment_len50": "checkpoints/segment_len50.ckpt",
	    "segment_len25": "checkpoints/segment_len25.ckpt",
	    "segment_len10": "checkpoints/segment_len10.ckpt",
	    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
	    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
	}
	```
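
The checkpoint files listed above can be fetched in one pass by calling `hf_hub_download` once per entry. The following is a minimal sketch; `download_checkpoints` and its `fetch` parameter are hypothetical and exist only so the function can be exercised without network access.

```python
from typing import Callable, Dict, Optional

REPO_ID = "Chanson-0803/MSpoofTTS"

CHECKPOINT_FILES = {
    "segment_len50": "checkpoints/segment_len50.ckpt",
    "segment_len25": "checkpoints/segment_len25.ckpt",
    "segment_len10": "checkpoints/segment_len10.ckpt",
    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
}


def download_checkpoints(
    files: Dict[str, str] = CHECKPOINT_FILES,
    repo_id: str = REPO_ID,
    fetch: Optional[Callable[..., str]] = None,
) -> Dict[str, str]:
    """Download every file in `files` and return a name -> local path mapping."""
    if fetch is None:
        # Imported lazily so the module loads even without huggingface_hub.
        from huggingface_hub import hf_hub_download

        fetch = hf_hub_download
    return {
        name: fetch(repo_id=repo_id, filename=filename, repo_type="model")
        for name, filename in files.items()
    }
```

Calling `download_checkpoints()` with no arguments returns local cache paths keyed by the same names as the dictionary above.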

	## Intended Use

	These checkpoints are intended for research on discrete speech synthesis, neural codec language models, inference-time decoding guidance, spoof detection for generated speech tokens, and hierarchical multi-resolution decoding.

	## Limitations

	These checkpoints are designed for the speech-token vocabulary and discriminator architectures used in MSpoofTTS. They may not be directly compatible with other codec tokenizers, vocabulary layouts, or speech language models without adaptation.
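
One cheap sanity check when trying tokens from another pipeline is to confirm that every ID falls inside the discriminators' `vocab_size` of 65536. The helper below is a hypothetical sketch; passing it shows only that the IDs are in range, not that the token layout matches the one used in MSpoofTTS.

```python
from typing import Iterable, List

VOCAB_SIZE = 65536  # from the base configuration in this model card


def out_of_vocab_positions(
    token_ids: Iterable[int], vocab_size: int = VOCAB_SIZE
) -> List[int]:
    """Return the positions whose token ID lies outside [0, vocab_size)."""
    return [i for i, tok in enumerate(token_ids) if not 0 <= tok < vocab_size]
```

An empty result means all IDs can be looked up in the embedding table; any returned position points at a token that would cause an indexing error.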

	## Citation

	```bibtex
	@article{zhao2026hierarchical,
	  title={Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection},
	  author={Zhao, Junchuan and Vu, Minh Duc and Wang, Ye},
	  journal={arXiv preprint arXiv:2603.05373},
	  year={2026}
	}
	```