MOSS-TTS-GPTQ
This is a GPTQ 4-bit quantized version of MOSS-TTS (MossTTSDelay 8B) by the OpenMOSS Team, quantized using GPTQModel. The quantization targets the LLM backbone only; the non-backbone weights (emb_ext, lm_heads, language_model.norm) are preserved in their original precision and merged back into a single self-contained repo.
This repo is intended to be used together with the float16 audio tokenizer: 🤗 blazingbhavneek/MOSS-Audio-Tokenizer-FP16.
VRAM usage: ~9 GB (vs. ~23 GB for the original bf16 model + tokenizer), making this accessible on a single consumer GPU.
The quantization scripts, along with a `requirements.txt` for a known-working environment, are available in the fork used to produce this model.
Model Description
MOSS-TTS is a production-grade TTS foundation model focused on zero-shot voice cloning, ultra-long stable speech generation (up to 1 hour), token-level duration control, phoneme-level pronunciation control (Pinyin / IPA), and multilingual & code-switched synthesis. It is built on a clean autoregressive discrete-token recipe with a large-scale Transformer backbone (MossTTSDelay architecture).
Supported Languages
20 languages are supported:
| Language | Code | Language | Code | Language | Code |
|---|---|---|---|---|---|
| 🇨🇳 Chinese | zh | 🇺🇸 English | en | 🇩🇪 German | de |
| 🇪🇸 Spanish | es | 🇫🇷 French | fr | 🇯🇵 Japanese | ja |
| 🇮🇹 Italian | it | 🇮🇱 Hebrew | he | 🇰🇷 Korean | ko |
| 🇷🇺 Russian | ru | 🇮🇷 Persian (Farsi) | fa | 🇸🇦 Arabic | ar |
| 🇵🇱 Polish | pl | 🇵🇹 Portuguese | pt | 🇨🇿 Czech | cs |
| 🇩🇰 Danish | da | 🇸🇪 Swedish | sv | 🇭🇺 Hungarian | hu |
| 🇬🇷 Greek | el | 🇹🇷 Turkish | tr | | |
Quantization Details
| Property | Value |
|---|---|
| Base model | MossTTSDelay-8B |
| Quantization method | GPTQ |
| Bits | 4 |
| Group size | 128 |
| Calibration dataset | wikitext2 |
| Quantization library | GPTQModel |
| VRAM (this model + FP16 tokenizer) | ~9 GB |
| VRAM (original bf16 model + tokenizer) | ~23 GB |
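To build intuition for the 4-bit, group-size-128 scheme in the table, here is a minimal round-to-nearest sketch in plain NumPy. This is only a toy illustration: real GPTQ additionally uses second-order (Hessian-based) calibration on a dataset such as wikitext2 to pick roundings that minimize each layer's output error, which this sketch omits.

```python
import numpy as np

def quantize_groupwise(w: np.ndarray, bits: int = 4, group_size: int = 128):
    """Round-to-nearest group-wise quantization of a flat weight vector.

    Each group of `group_size` weights shares one scale; each weight is
    stored as an integer code in [0, 2**bits - 1] (asymmetric, zero-point
    at the group minimum). Assumes len(w) is divisible by group_size.
    """
    qmax = 2 ** bits - 1
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    scale = (groups.max(axis=1, keepdims=True) - lo) / qmax
    q = np.round((groups - lo) / scale).astype(np.uint8)  # 4-bit codes
    dequant = (q * scale + lo).reshape(-1)                # reconstruction
    return q, dequant

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, w_hat = quantize_groupwise(w)
print(q.max() <= 15, float(np.abs(w - w_hat).max()))  # codes fit in 4 bits; small error
```

Storing 4-bit codes plus one scale per 128 weights is what shrinks the backbone to roughly a quarter of its bf16 footprint.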
Evaluation — Quality Degradation
Benchmarks on seed-tts-eval. Lower WER is better.
| Model | EN WER (%) ↓ |
|---|---|
| MossTTSDelay-8B (original fp32/bf16) | 1.79 |
| MossTTSDelay-8B (this GPTQ 4-bit) | 2.585 |
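The size of the degradation can be computed directly from the table (a quick sanity check on the numbers above, not part of the benchmark itself):

```python
orig_wer, gptq_wer = 1.79, 2.585  # EN WER (%) from the table above

abs_increase = gptq_wer - orig_wer               # change in percentage points
rel_increase = (gptq_wer - orig_wer) / orig_wer  # relative change

# ≈ +0.80 pp absolute, ~44% relative
print(f"+{abs_increase:.3f} pp absolute, +{rel_increase:.1%} relative")
```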
Installation
Install GPTQModel in addition to the standard MOSS-TTS dependencies — it is required to load the quantized backbone at inference time:

```shell
pip install gptqmodel
```

Then install MOSS-TTS from the fork that added GPTQ support:

```shell
git clone -b feature/gptq-quant-model-support https://github.com/blazingbhavneek/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e .
```
Note: When loading this model you will see warnings about unexpected keys (e.g. `model.layers.X.mlp.gate_proj.qweight`). These are normal: they come from the GPTQModel library's quantized layer format and do not affect inference.
Inference
```python
from pathlib import Path

import torch
import torchaudio

from moss_tts_delay.modeling_moss_tts import MossTTSDelayModel
from moss_tts_delay.processing_moss_tts import MossTTSDelayProcessor

# Disable problematic SDP backends
torch.backends.cuda.enable_cudnn_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)

MERGED_PATH = "blazingbhavneek/MOSS-TTS-GPTQ"
AUDIO_TOK_PATH = "blazingbhavneek/MOSS-Audio-Tokenizer-FP16"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = MossTTSDelayModel.from_pretrained(
    MERGED_PATH,
    gptq_device=device,
    trust_remote_code=True,
).eval()

processor = MossTTSDelayProcessor.from_pretrained(
    MERGED_PATH,
    codec_path=AUDIO_TOK_PATH,
    trust_remote_code=True,
)

# --- Example texts ---
text_en = "We stand on the threshold of the AI era. Artificial intelligence is no longer just a concept in laboratories, but is entering every industry, every creative endeavor, and every decision."
text_zh = "亲爱的你,你好呀。今天,我想用最认真、最温柔的声音,对你说一些重要的话。"
text_pinyin = "nin2 hao3,qing3 wen4 nin2 lai2 zi4 na3 zuo4 cheng2 shi4?"
text_ipa = "/həloʊ, meɪ aɪ æsk wɪtʃ sɪti juː ɑːr frʌm?/"

# Reference audio for voice cloning (URLs or local paths)
ref_audio_zh = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_zh.wav"
ref_audio_en = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_en.m4a"

conversations = [
    # Direct TTS (no reference)
    [processor.build_user_message(text=text_zh)],
    [processor.build_user_message(text=text_en)],
    # Pronunciation control
    [processor.build_user_message(text=text_pinyin)],
    [processor.build_user_message(text=text_ipa)],
    # Voice cloning
    [processor.build_user_message(text=text_zh, reference=[ref_audio_zh])],
    [processor.build_user_message(text=text_en, reference=[ref_audio_en])],
    # Duration control (1 s ≈ 12.5 tokens)
    [processor.build_user_message(text=text_en, tokens=325)],
    [processor.build_user_message(text=text_en, tokens=600)],
]

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)

with torch.no_grad():
    for idx, conversation in enumerate(conversations):
        batch = processor([conversation], mode="generation")
        outputs = model.generate(
            input_ids=batch["input_ids"].to(device),
            attention_mask=batch["attention_mask"].to(device),
            max_new_tokens=4096,
        )
        for message in processor.decode([(sl, ids.long()) for sl, ids in outputs]):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{idx}.wav"
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)
            print(f"Saved {out_path}")
```
Input reference
UserMessage fields
| Field | Type | Required | Description |
|---|---|---|---|
| `text` | `str` | ✅ | Text to synthesize. Supports all 20 languages, raw text, Pinyin, IPA, or any mix. |
| `reference` | `List[str]` | ❌ | Reference audio path(s) or URL(s) for zero-shot voice cloning. One audio expected. |
| `tokens` | `int` | ❌ | Target audio token count for duration control. 1 second ≈ 12.5 tokens. |
Generation hyperparameters
| Parameter | Default | Description |
|---|---|---|
| `max_new_tokens` | — | Total audio tokens to generate. Use the 1 s ≈ 12.5 tokens rule. |
| `audio_temperature` | 1.7 | Higher = more variation; lower = more stable prosody. |
| `audio_top_p` | 0.8 | Nucleus sampling cutoff. |
| `audio_top_k` | 25 | Top-k sampling. |
| `audio_repetition_penalty` | 1.0 | Values > 1.0 discourage repeating patterns. |
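The 1 s ≈ 12.5 tokens rule used for both `tokens` and `max_new_tokens` is easy to wrap in a small helper. These convenience functions are hypothetical (not part of the MOSS-TTS API):

```python
TOKENS_PER_SECOND = 12.5  # audio token rate stated in the MOSS-TTS docs

def seconds_to_tokens(seconds: float) -> int:
    """Token budget for a target duration (1 s ≈ 12.5 tokens)."""
    return round(seconds * TOKENS_PER_SECOND)

def tokens_to_seconds(tokens: int) -> float:
    """Approximate playback duration of a given token count."""
    return tokens / TOKENS_PER_SECOND

# The duration-control examples above: 325 tokens ≈ 26 s, 600 tokens ≈ 48 s.
print(seconds_to_tokens(26), tokens_to_seconds(600))  # 325 48.0
```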
License
Apache 2.0, consistent with the original MOSS-TTS release.
Base model: OpenMOSS-Team/MOSS-TTS