MOSS-TTS-GPTQ

This is a GPTQ 4-bit quantized version of MOSS-TTS (MossTTSDelay 8B) by the OpenMOSS Team, quantized using GPTQModel. The quantization targets the LLM backbone only; the non-backbone weights (emb_ext, lm_heads, language_model.norm) are preserved in their original precision and merged back into a single self-contained repo.

This repo is intended to be used together with the float16 audio tokenizer: 🤗 blazingbhavneek/MOSS-Audio-Tokenizer-FP16.

VRAM usage: ~9 GB (vs. ~23 GB for the original bf16 model + tokenizer), making this accessible on a single consumer GPU.

The quantization scripts and a working environment requirements.txt are available in the fork used to produce this model.

Original model resources:


Model Description

MOSS-TTS is a production-grade TTS foundation model focused on zero-shot voice cloning, ultra-long stable speech generation (up to 1 hour), token-level duration control, phoneme-level pronunciation control (Pinyin / IPA), and multilingual & code-switched synthesis. It is built on a clean autoregressive discrete-token recipe with a large-scale Transformer backbone (MossTTSDelay architecture).

Supported Languages

20 languages are supported:

| Language | Code | Language | Code | Language | Code |
|---|---|---|---|---|---|
| Chinese | zh 🇨🇳 | English | en 🇺🇸 | German | de |
| Spanish | es 🇪🇸 | French | fr 🇫🇷 | Japanese | ja |
| Italian | it 🇮🇹 | Hebrew | he 🇮🇱 | Korean | ko |
| Russian | ru 🇷🇺 | Persian (Farsi) | fa 🇮🇷 | Arabic | ar |
| Polish | pl 🇵🇱 | Portuguese | pt 🇵🇹 | Czech | cs |
| Danish | da 🇩🇰 | Swedish | sv 🇸🇪 | Hungarian | hu |
| Greek | el 🇬🇷 | Turkish | tr 🇹🇷 | | |
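For programmatic language checks, the supported set from the table above can be kept as a plain mapping. The dictionary and helper below are illustrative, not part of the MOSS-TTS API:

```python
# ISO 639-1 codes for the 20 supported languages, as listed in the table above.
SUPPORTED_LANGUAGES = {
    "zh": "Chinese", "en": "English", "de": "German",
    "es": "Spanish", "fr": "French", "ja": "Japanese",
    "it": "Italian", "he": "Hebrew", "ko": "Korean",
    "ru": "Russian", "fa": "Persian (Farsi)", "ar": "Arabic",
    "pl": "Polish", "pt": "Portuguese", "cs": "Czech",
    "da": "Danish", "sv": "Swedish", "hu": "Hungarian",
    "el": "Greek", "tr": "Turkish",
}

def is_supported(code: str) -> bool:
    """Return True if the language code appears in the supported set."""
    return code.lower() in SUPPORTED_LANGUAGES

print(len(SUPPORTED_LANGUAGES))  # → 20
```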

Quantization Details

| Property | Value |
|---|---|
| Base model | MossTTSDelay-8B |
| Quantization method | GPTQ |
| Bits | 4 |
| Group size | 128 |
| Calibration dataset | wikitext2 |
| Quantization library | GPTQModel |
| VRAM (this model + FP16 tokenizer) | ~9 GB |
| VRAM (original bf16 model + tokenizer) | ~23 GB |

Evaluation — Quality Degradation

Benchmarks on seed-tts-eval. Lower WER is better.

| Model | EN WER (%) ↓ |
|---|---|
| MossTTSDelay-8B (original fp32/bf16) | 1.79 |
| MossTTSDelay-8B (this GPTQ 4-bit) | 2.585 |
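The absolute gap is small (+0.795 WER points), which works out to a relative increase that can be checked directly from the numbers above:

```python
# Relative EN WER increase of the 4-bit model over the original,
# computed from the benchmark table above.
wer_original = 1.79   # MossTTSDelay-8B (original fp32/bf16)
wer_gptq = 2.585      # this GPTQ 4-bit model

relative_increase = (wer_gptq - wer_original) / wer_original * 100
print(f"Relative EN WER increase: {relative_increase:.1f}%")  # → 44.4%
```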

Installation

Install GPTQModel in addition to the standard MOSS-TTS dependencies — it is required to load the quantized backbone at inference time:

```shell
pip install gptqmodel
```

Then install MOSS-TTS from the fork that added GPTQ support:

```shell
git clone -b feature/gptq-quant-model-support https://github.com/blazingbhavneek/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e .
```

Note: When loading this model you will see warnings about unexpected keys (e.g. model.layers.X.mlp.gate_proj.qweight). These are normal — they come from the GPTQModel library's quantized layer format and do not affect inference.


Inference

```python
from pathlib import Path
import torch
import torchaudio
from moss_tts_delay.modeling_moss_tts import MossTTSDelayModel
from moss_tts_delay.processing_moss_tts import MossTTSDelayProcessor

# Disable problematic SDP backends
torch.backends.cuda.enable_cudnn_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)

MERGED_PATH = "blazingbhavneek/MOSS-TTS-GPTQ"
AUDIO_TOK_PATH = "blazingbhavneek/MOSS-Audio-Tokenizer-FP16"

device = "cuda" if torch.cuda.is_available() else "cpu"

model = MossTTSDelayModel.from_pretrained(
    MERGED_PATH,
    gptq_device=device,
    trust_remote_code=True,
).eval()

processor = MossTTSDelayProcessor.from_pretrained(
    MERGED_PATH,
    codec_path=AUDIO_TOK_PATH,
    trust_remote_code=True,
)

# --- Example texts ---
text_en = "We stand on the threshold of the AI era. Artificial intelligence is no longer just a concept in laboratories, but is entering every industry, every creative endeavor, and every decision."
text_zh = "亲爱的你,你好呀。今天,我想用最认真、最温柔的声音,对你说一些重要的话。"
text_pinyin = "nin2 hao3,qing3 wen4 nin2 lai2 zi4 na3 zuo4 cheng2 shi4?"
text_ipa = "/həloʊ, meɪ aɪ æsk wɪtʃ sɪti juː ɑːr frʌm?/"

# Reference audio for voice cloning (URLs or local paths)
ref_audio_zh = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_zh.wav"
ref_audio_en = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_en.m4a"

conversations = [
    # Direct TTS (no reference)
    [processor.build_user_message(text=text_zh)],
    [processor.build_user_message(text=text_en)],
    # Pronunciation control
    [processor.build_user_message(text=text_pinyin)],
    [processor.build_user_message(text=text_ipa)],
    # Voice cloning
    [processor.build_user_message(text=text_zh, reference=[ref_audio_zh])],
    [processor.build_user_message(text=text_en, reference=[ref_audio_en])],
    # Duration control (1s ≈ 12.5 tokens)
    [processor.build_user_message(text=text_en, tokens=325)],
    [processor.build_user_message(text=text_en, tokens=600)],
]

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)

with torch.no_grad():
    for idx, conversation in enumerate(conversations):
        batch = processor([conversation], mode="generation")
        outputs = model.generate(
            input_ids=batch["input_ids"].to(device),
            attention_mask=batch["attention_mask"].to(device),
            max_new_tokens=4096,
        )
        for message in processor.decode([(sl, ids.long()) for sl, ids in outputs]):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{idx}.wav"
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)
            print(f"Saved {out_path}")
```

Input Reference

UserMessage fields

| Field | Type | Required | Description |
|---|---|---|---|
| `text` | `str` | Yes | Text to synthesize. Supports all 20 languages, raw text, Pinyin, IPA, or any mix. |
| `reference` | `List[str]` | No | Reference audio path(s) or URL(s) for zero-shot voice cloning. One audio expected. |
| `tokens` | `int` | No | Target audio token count for duration control. 1 second ≈ 12.5 tokens. |
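The `tokens` field follows the 1 s ≈ 12.5 tokens rule used throughout this card. A small helper (the function name is mine, not part of the MOSS-TTS API) turns a target duration into a token budget:

```python
TOKENS_PER_SECOND = 12.5  # audio token rate stated in this card

def duration_to_tokens(seconds: float) -> int:
    """Approximate token budget for a target duration (1 s ≈ 12.5 tokens)."""
    return round(seconds * TOKENS_PER_SECOND)

# The duration-control examples in the inference script:
print(duration_to_tokens(26))  # → 325 (tokens=325 ≈ 26 s of audio)
print(duration_to_tokens(48))  # → 600 (tokens=600 ≈ 48 s of audio)
```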

Generation hyperparameters

| Parameter | Default | Description |
|---|---|---|
| `max_new_tokens` | | Total audio tokens to generate. Use the 1 s ≈ 12.5 tokens rule. |
| `audio_temperature` | 1.7 | Higher = more variation; lower = more stable prosody. |
| `audio_top_p` | 0.8 | Nucleus sampling cutoff. |
| `audio_top_k` | 25 | Top-K sampling. |
| `audio_repetition_penalty` | 1.0 | Values > 1.0 discourage repeating patterns. |
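Assuming these parameters are passed to `model.generate` as keyword arguments (the table's parameter names suggest this, but the fork's API is the authority), a request for roughly 40 seconds of audio could be configured as:

```python
# Generation settings for a ~40 s clip. Parameter names and defaults are taken
# from the table above; passing them as plain kwargs to model.generate is an
# assumption based on the fork's API, not something verified here.
target_seconds = 40
generation_kwargs = {
    "max_new_tokens": round(target_seconds * 12.5),  # 1 s ≈ 12.5 tokens
    "audio_temperature": 1.7,          # higher = more variation
    "audio_top_p": 0.8,                # nucleus sampling cutoff
    "audio_top_k": 25,                 # top-K sampling
    "audio_repetition_penalty": 1.0,   # > 1.0 discourages repeats
}
print(generation_kwargs["max_new_tokens"])  # → 500
```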

License

Apache 2.0, consistent with the original MOSS-TTS release.
