---
license: apache-2.0
library_name: pytorch
tags:
- text-to-speech
- speech-synthesis
- discrete-speech-synthesis
- neural-codec-language-model
- spoof-detection
- hierarchical-decoding
- pytorch
---

# MSpoofTTS Discriminator Checkpoints

This repository provides the discriminator checkpoints used in **MSpoofTTS: Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection**.

Paper: [Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection](https://arxiv.org/abs/2603.05373)

Demo: https://danny-nus.github.io/MSpoofTTS.github.io/

This repository is intended as a **checkpoint hosting repository**. The discriminator architecture definitions are not included here. Please use these checkpoints together with the official MSpoofTTS codebase.

## Checkpoints

| File | Model Type | Segment Length | Scale |
|---|---|---:|---:|
| `checkpoints/segment_len50.ckpt` | SegmentTokenDiscriminator | 50 | - |
| `checkpoints/segment_len25.ckpt` | SegmentTokenDiscriminator | 25 | - |
| `checkpoints/segment_len10.ckpt` | SegmentTokenDiscriminator | 10 | - |
| `checkpoints/strided_seg50_scale10.ckpt` | StridedSegmentTokenDiscriminator | 50 | 10 |
| `checkpoints/strided_seg50_scale25.ckpt` | StridedSegmentTokenDiscriminator | 50 | 25 |

## Model Configuration

All discriminators use the following base configuration:

```python
vocab_size = 65536
d_model = 256
nhead = 8
num_layers = 4
dim_feedforward = 1024
dropout = 0.1
```

The segment-level discriminators use `segment_len` values of 10, 25, and 50. The strided discriminators use `segment_len=50` with scales 10 and 25.

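
The checkpoint table and base configuration above can be mirrored in code so that each file is matched to its discriminator class and settings. The following is a minimal sketch and is not part of the official MSpoofTTS codebase: `BASE_CONFIG`, `CheckpointSpec`, and `parse_checkpoint` are hypothetical names, and the filename patterns are inferred only from the five files listed above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Base hyperparameters copied from the configuration listed above.
BASE_CONFIG = {
    "vocab_size": 65536,
    "d_model": 256,
    "nhead": 8,
    "num_layers": 4,
    "dim_feedforward": 1024,
    "dropout": 0.1,
}


@dataclass(frozen=True)
class CheckpointSpec:
    """Hypothetical record describing one checkpoint file."""

    filename: str
    model_type: str
    segment_len: int
    scale: Optional[int]  # None for the non-strided discriminators


# Filename patterns inferred from the checkpoint table (an assumption).
_SEGMENT = re.compile(r"^checkpoints/segment_len(\d+)\.ckpt$")
_STRIDED = re.compile(r"^checkpoints/strided_seg(\d+)_scale(\d+)\.ckpt$")


def parse_checkpoint(filename: str) -> CheckpointSpec:
    """Map a checkpoint filename to its model type, segment length, and scale."""
    match = _SEGMENT.match(filename)
    if match:
        return CheckpointSpec(
            filename, "SegmentTokenDiscriminator", int(match.group(1)), None
        )
    match = _STRIDED.match(filename)
    if match:
        return CheckpointSpec(
            filename,
            "StridedSegmentTokenDiscriminator",
            int(match.group(1)),
            int(match.group(2)),
        )
    raise ValueError(f"Unrecognized checkpoint filename: {filename}")


if __name__ == "__main__":
    spec = parse_checkpoint("checkpoints/strided_seg50_scale25.ckpt")
    print(spec.model_type, spec.segment_len, spec.scale)
```

How `segment_len` and `scale` are passed to the discriminator constructors is defined by the official codebase and is not assumed here.
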
## Usage

Install the Hugging Face Hub package:

```bash
pip install -U huggingface_hub
```

Download a checkpoint:

```python
from huggingface_hub import hf_hub_download

repo_id = "Chanson-0803/MSpoofTTS"
ckpt_path = hf_hub_download(
    repo_id=repo_id,
    filename="checkpoints/segment_len50.ckpt",
    repo_type="model",
)
print(ckpt_path)
```

Then instantiate the corresponding discriminator class from the MSpoofTTS codebase and load the checkpoint into it:

```python
import torch

# Import this from the official MSpoofTTS codebase.
# from your_mspoof_code import SegmentTokenDiscriminator

# Create `model` before loading the weights. The constructor arguments are
# defined by the MSpoofTTS codebase; use the base configuration listed above.
# model = SegmentTokenDiscriminator(...)

# Load the checkpoint on CPU and restore the discriminator weights.
state = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state["model_state_dict"])
model.eval()  # inference mode: disables dropout
```

For hierarchical decoding, use the following checkpoint files:

```python
checkpoint_files = {
    "segment_len50": "checkpoints/segment_len50.ckpt",
    "segment_len25": "checkpoints/segment_len25.ckpt",
    "segment_len10": "checkpoints/segment_len10.ckpt",
    "strided_seg50_scale10": "checkpoints/strided_seg50_scale10.ckpt",
    "strided_seg50_scale25": "checkpoints/strided_seg50_scale25.ckpt",
}
```

## Intended Use

These checkpoints are intended for research on:

- discrete speech synthesis
- neural codec language models
- inference-time decoding guidance
- spoof detection for generated speech tokens
- hierarchical multi-resolution decoding

## Limitations

These checkpoints are designed for the speech-token vocabulary and discriminator architectures used in MSpoofTTS. They may not be directly compatible with other codec tokenizers, vocabulary layouts, or speech language models without adaptation.

## Citation

```bibtex
@article{zhao2026hierarchical,
  title={Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection},
  author={Zhao, Junchuan and Vu, Minh Duc and Wang, Ye},
  journal={arXiv preprint arXiv:2603.05373},
  year={2026}
}
```