---
license: apache-2.0
library_name: pytorch
tags:
- text-to-speech
- speech-synthesis
- discrete-speech-synthesis
- neural-codec-language-model
- spoof-detection
- hierarchical-decoding
- pytorch
---

# MSpoofTTS Discriminator Checkpoints

This repository provides the discriminator checkpoints used in **MSpoofTTS: Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection**.

Paper: [Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection](https://arxiv.org/abs/2603.05373)

Demo: https://danny-nus.github.io/MSpoofTTS.github.io/

This repository is intended as a **checkpoint hosting repository**. The discriminator architecture definitions are not included here. Please use these checkpoints together with the official MSpoofTTS codebase.

## Checkpoints

| File | Model Type | Segment Length | Scale |
|---|---|---:|---:|
| `checkpoints/segment_len50.ckpt` | SegmentTokenDiscriminator | 50 | - |
| `checkpoints/segment_len25.ckpt` | SegmentTokenDiscriminator | 25 | - |
| `checkpoints/segment_len10.ckpt` | SegmentTokenDiscriminator | 10 | - |
| `checkpoints/strided_seg50_scale10.ckpt` | StridedSegmentTokenDiscriminator | 50 | 10 |
| `checkpoints/strided_seg50_scale25.ckpt` | StridedSegmentTokenDiscriminator | 50 | 25 |

## Model Configuration

All discriminators use the following base configuration:

```python
vocab_size = 65536
d_model = 256
nhead = 8
num_layers = 4
dim_feedforward = 1024
dropout = 0.1
```

The segment-level discriminators use `segment_len` values of 10, 25, and 50.

The strided discriminators use `segment_len=50` with scales 10 and 25.
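
The base configuration and the per-checkpoint settings above can be combined into a single lookup, which may be convenient when several discriminators are instantiated together. The following is a minimal sketch: the names `BASE_CONFIG`, `CHECKPOINT_CONFIGS`, and `config_for`, and the key names `segment_len` and `scale`, are illustrative assumptions and may not match the constructor arguments in the MSpoofTTS codebase.

```python
# Base hyperparameters shared by all discriminators (values from this model card).
BASE_CONFIG = {
    "vocab_size": 65536,
    "d_model": 256,
    "nhead": 8,
    "num_layers": 4,
    "dim_feedforward": 1024,
    "dropout": 0.1,
}

# Per-checkpoint settings taken from the Checkpoints table.
# "scale" is None for the non-strided, segment-level discriminators.
CHECKPOINT_CONFIGS = {
    "checkpoints/segment_len50.ckpt": {"segment_len": 50, "scale": None},
    "checkpoints/segment_len25.ckpt": {"segment_len": 25, "scale": None},
    "checkpoints/segment_len10.ckpt": {"segment_len": 10, "scale": None},
    "checkpoints/strided_seg50_scale10.ckpt": {"segment_len": 50, "scale": 10},
    "checkpoints/strided_seg50_scale25.ckpt": {"segment_len": 50, "scale": 25},
}


def config_for(filename):
    """Return the merged configuration for one checkpoint file (hypothetical helper)."""
    return {**BASE_CONFIG, **CHECKPOINT_CONFIGS[filename]}
```

For example, `config_for("checkpoints/strided_seg50_scale25.ckpt")` yields the base configuration plus `segment_len=50` and `scale=25`.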

## Usage

Install the Hugging Face Hub package:

```bash
pip install -U huggingface_hub
```

Download a checkpoint:

```python
from huggingface_hub import hf_hub_download

repo_id = "Chanson-0803/MSpoofTTS"

ckpt_path = hf_hub_download(
    repo_id=repo_id,
    filename="checkpoints/segment_len50.ckpt",
    repo_type="model",
)

print(ckpt_path)
```
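
Then instantiate the corresponding discriminator class from the MSpoofTTS codebase and load the checkpoint into it:

```python
import torch

# Import the class from the official MSpoofTTS codebase.
# `your_mspoof_code` is a placeholder; replace it with the actual module path.
from your_mspoof_code import SegmentTokenDiscriminator

# Build the model with the base configuration from "Model Configuration".
# The keyword names are assumed to mirror that configuration; check the
# class signature in the codebase before relying on them.
model = SegmentTokenDiscriminator(
    vocab_size=65536, d_model=256, nhead=8, num_layers=4,
    dim_feedforward=1024, dropout=0.1, segment_len=50,
)

state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()  # disable dropout for inference
```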
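
If `load_state_dict` fails, it can help to first confirm that the loaded object has the layout used above, with the weights stored under `model_state_dict`. The helper below is a small sketch for that check. `extract_state_dict` is a hypothetical name, and `model_state_dict` is the only checkpoint key this model card documents, so no other keys are assumed.

```python
def extract_state_dict(checkpoint):
    """Return the weights stored under "model_state_dict".

    Raises KeyError with the keys that were actually found, so a layout
    mismatch is visible before load_state_dict is called.
    """
    if not isinstance(checkpoint, dict):
        raise KeyError(
            f"expected a dict checkpoint, got {type(checkpoint).__name__}"
        )
    if "model_state_dict" not in checkpoint:
        raise KeyError(
            f"expected a 'model_state_dict' entry, found keys: {list(checkpoint)}"
        )
    return checkpoint["model_state_dict"]
```

With this helper, the loading step would read `model.load_state_dict(extract_state_dict(state))`.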

For hierarchical decoding, use the following checkpoint files:

```python
checkpoint_files = {
    "segment_len50": "checkpoints/segment_len50.ckpt",
    "segment_len25": "checkpoints/segment_len25.ckpt",
    "segment_len10": "checkpoints/segment_len10.ckpt",
    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
}
```
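
All five files can be fetched in one pass by looping over that dictionary with `hf_hub_download`. The sketch below assumes the same `repo_id` as in the download example. `download_checkpoints` and its `download_fn` parameter are hypothetical names introduced here, with `download_fn` only there so the helper can be tried without network access.

```python
def download_checkpoints(checkpoint_files, repo_id="Chanson-0803/MSpoofTTS",
                         download_fn=None):
    """Download every listed checkpoint and return {name: local_path}."""
    if download_fn is None:
        # Imported lazily so a stub downloader can be passed in tests.
        from huggingface_hub import hf_hub_download
        download_fn = hf_hub_download
    return {
        name: download_fn(repo_id=repo_id, filename=filename, repo_type="model")
        for name, filename in checkpoint_files.items()
    }
```

Calling `download_checkpoints(checkpoint_files)` returns a dictionary with the same keys as `checkpoint_files`, each mapped to the local path of the downloaded checkpoint.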

## Intended Use

These checkpoints are intended for research on discrete speech synthesis, neural codec language models, inference-time decoding guidance, spoof detection for generated speech tokens, and hierarchical multi-resolution decoding.

## Limitations

These checkpoints are designed for the speech-token vocabulary and discriminator architectures used in MSpoofTTS. They may not be directly compatible with other codec tokenizers, vocabulary layouts, or speech language models without adaptation.
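
One cheap sanity check before scoring tokens from another system is to confirm that every token ID fits the discriminators' vocabulary of 65536 entries. The sketch below does only that. `check_token_ids` is a hypothetical helper, and passing the check is necessary but not sufficient: tokens from a different codec can fall inside the range and still be incompatible.

```python
VOCAB_SIZE = 65536  # vocab_size from the base configuration in this model card


def check_token_ids(token_ids, vocab_size=VOCAB_SIZE):
    """Raise ValueError if any token ID falls outside [0, vocab_size)."""
    bad = [t for t in token_ids if not 0 <= t < vocab_size]
    if bad:
        raise ValueError(
            f"{len(bad)} token ID(s) outside [0, {vocab_size}), e.g. {bad[0]}"
        )
    return True
```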

## Citation

```bibtex
@article{zhao2026hierarchical,
  title={Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection},
  author={Zhao, Junchuan and Vu, Minh Duc and Wang, Ye},
  journal={arXiv preprint arXiv:2603.05373},
  year={2026}
}
```