MOSS-TTS-GPTQ
This is a GPTQ 4-bit quantized version of MOSS-TTS (MossTTSDelay 8B) by the OpenMOSS Team, quantized using GPTQModel. The quantization targets the LLM backbone only; the non-backbone weights (emb_ext, lm_heads, language_model.norm) are preserved in their original precision and merged back into a single self-contained repo.
This repo is intended to be used together with the float16 audio tokenizer: 🤗 blazingbhavneek/MOSS-Audio-Tokenizer-FP16.
VRAM usage: ~9 GB (vs. ~23 GB for the original bf16 model + tokenizer), making this accessible on a single consumer GPU.
The quantization scripts, along with a `requirements.txt` for a known-working environment, are available in the fork used to produce this model.
Model Description
MOSS-TTS is a production-grade TTS foundation model focused on zero-shot voice cloning, ultra-long stable speech generation (up to 1 hour), token-level duration control, phoneme-level pronunciation control (Pinyin / IPA), and multilingual & code-switched synthesis. It is built on a clean autoregressive discrete-token recipe with a large-scale Transformer backbone (MossTTSDelay architecture).
Supported Languages
20 languages are supported:
| Language | Code | Language | Code | Language | Code |
|---|---|---|---|---|---|
| 🇨🇳 Chinese | zh | 🇺🇸 English | en | 🇩🇪 German | de |
| 🇪🇸 Spanish | es | 🇫🇷 French | fr | 🇯🇵 Japanese | ja |
| 🇮🇹 Italian | it | 🇮🇱 Hebrew | he | 🇰🇷 Korean | ko |
| 🇷🇺 Russian | ru | 🇮🇷 Persian (Farsi) | fa | 🇸🇦 Arabic | ar |
| 🇵🇱 Polish | pl | 🇵🇹 Portuguese | pt | 🇨🇿 Czech | cs |
| 🇩🇰 Danish | da | 🇸🇪 Swedish | sv | 🇭🇺 Hungarian | hu |
| 🇬🇷 Greek | el | 🇹🇷 Turkish | tr | | |
Quantization Details
| Property | Value |
|---|---|
| Base model | MossTTSDelay-8B |
| Quantization method | GPTQ |
| Bits | 4 |
| Group size | 128 |
| Calibration dataset | wikitext2 |
| Quantization library | GPTQModel |
| VRAM (this model + FP16 tokenizer) | ~9 GB |
| VRAM (original bf16 model + tokenizer) | ~23 GB |
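To build intuition for the 4-bit, group-size-128 scheme in the table, here is a minimal round-to-nearest sketch in plain NumPy. This is only a toy illustration: real GPTQ additionally uses second-order (Hessian-based) calibration on a dataset such as wikitext2 to pick roundings that minimize each layer's output error, which this sketch omits.

```python
import numpy as np

def quantize_groupwise(w: np.ndarray, bits: int = 4, group_size: int = 128):
    """Round-to-nearest group-wise quantization of a flat weight vector.

    Each group of `group_size` weights shares one scale; each weight is
    stored as an integer code in [0, 2**bits - 1] (asymmetric, zero-point
    at the group minimum). Assumes len(w) is divisible by group_size.
    """
    qmax = 2 ** bits - 1
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    scale = (groups.max(axis=1, keepdims=True) - lo) / qmax
    q = np.round((groups - lo) / scale).astype(np.uint8)  # 4-bit codes
    dequant = (q * scale + lo).reshape(-1)                # reconstruction
    return q, dequant

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, w_hat = quantize_groupwise(w)
print(q.max() <= 15, float(np.abs(w - w_hat).max()))  # codes fit in 4 bits; small error
```

Storing 4-bit codes plus one scale per 128 weights is what shrinks the backbone to roughly a quarter of its bf16 footprint.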
Evaluation — Quality Degradation
Benchmarks on seed-tts-eval. Lower WER is better.
| Model | EN WER (%) ↓ |
|---|---|
| MossTTSDelay-8B (original fp32/bf16) | 1.79 |
| MossTTSDelay-8B (this GPTQ 4-bit) | 2.585 |
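The size of the degradation can be computed directly from the table (a quick sanity check on the numbers above, not part of the benchmark itself):

```python
orig_wer, gptq_wer = 1.79, 2.585  # EN WER (%) from the table above

abs_increase = gptq_wer - orig_wer               # change in percentage points
rel_increase = (gptq_wer - orig_wer) / orig_wer  # relative change

# ≈ +0.80 pp absolute, ~44% relative
print(f"+{abs_increase:.3f} pp absolute, +{rel_increase:.1%} relative")
```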
Installation
Install GPTQModel in addition to the standard MOSS-TTS dependencies — it is required to load the quantized backbone at inference time:

```shell
pip install gptqmodel
```

Then install MOSS-TTS from the fork that added GPTQ support:

```shell
git clone -b feature/gptq-quant-model-support https://github.com/blazingbhavneek/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e .
```
Note: When loading this model you will see warnings about unexpected keys (e.g. `model.layers.X.mlp.gate_proj.qweight`). These are normal: they come from the GPTQModel library's quantized layer format and do not affect inference.
Inference
```python
from pathlib import Path

import torch
import torchaudio

from moss_tts_delay.modeling_moss_tts import MossTTSDelayModel
from moss_tts_delay.processing_moss_tts import MossTTSDelayProcessor

# Disable problematic SDP backends
torch.backends.cuda.enable_cudnn_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)

MERGED_PATH = "blazingbhavneek/MOSS-TTS-GPTQ"
AUDIO_TOK_PATH = "blazingbhavneek/MOSS-Audio-Tokenizer-FP16"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = MossTTSDelayModel.from_pretrained(
    MERGED_PATH,
    gptq_device=device,
    trust_remote_code=True,
).eval()

processor = MossTTSDelayProcessor.from_pretrained(
    MERGED_PATH,
    codec_path=AUDIO_TOK_PATH,
    trust_remote_code=True,
)

# --- Example texts ---
text_en = "We stand on the threshold of the AI era. Artificial intelligence is no longer just a concept in laboratories, but is entering every industry, every creative endeavor, and every decision."
text_zh = "亲爱的你,你好呀。今天,我想用最认真、最温柔的声音,对你说一些重要的话。"
text_pinyin = "nin2 hao3,qing3 wen4 nin2 lai2 zi4 na3 zuo4 cheng2 shi4?"
text_ipa = "/həloʊ, meɪ aɪ æsk wɪtʃ sɪti juː ɑːr frʌm?/"

# Reference audio for voice cloning (URLs or local paths)
ref_audio_zh = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_zh.wav"
ref_audio_en = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_en.m4a"

conversations = [
    # Direct TTS (no reference)
    [processor.build_user_message(text=text_zh)],
    [processor.build_user_message(text=text_en)],
    # Pronunciation control
    [processor.build_user_message(text=text_pinyin)],
    [processor.build_user_message(text=text_ipa)],
    # Voice cloning
    [processor.build_user_message(text=text_zh, reference=[ref_audio_zh])],
    [processor.build_user_message(text=text_en, reference=[ref_audio_en])],
    # Duration control (1 s ≈ 12.5 tokens)
    [processor.build_user_message(text=text_en, tokens=325)],
    [processor.build_user_message(text=text_en, tokens=600)],
]

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)

with torch.no_grad():
    for idx, conversation in enumerate(conversations):
        batch = processor([conversation], mode="generation")
        outputs = model.generate(
            input_ids=batch["input_ids"].to(device),
            attention_mask=batch["attention_mask"].to(device),
            max_new_tokens=4096,
        )
        for message in processor.decode([(sl, ids.long()) for sl, ids in outputs]):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{idx}.wav"
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)
            print(f"Saved {out_path}")
```
Input reference
UserMessage fields
| Field | Type | Required | Description |
|---|---|---|---|
| `text` | `str` | ✅ | Text to synthesize. Supports all 20 languages, raw text, Pinyin, IPA, or any mix. |
| `reference` | `List[str]` | ❌ | Reference audio path(s) or URL(s) for zero-shot voice cloning. One audio expected. |
| `tokens` | `int` | ❌ | Target audio token count for duration control. 1 second ≈ 12.5 tokens. |
Generation hyperparameters
| Parameter | Default | Description |
|---|---|---|
| `max_new_tokens` | — | Total audio tokens to generate. Use the 1 s ≈ 12.5 tokens rule. |
| `audio_temperature` | 1.7 | Higher = more variation; lower = more stable prosody. |
| `audio_top_p` | 0.8 | Nucleus sampling cutoff. |
| `audio_top_k` | 25 | Top-k sampling. |
| `audio_repetition_penalty` | 1.0 | Values > 1.0 discourage repeating patterns. |
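The 1 s ≈ 12.5 tokens rule used for both `tokens` and `max_new_tokens` is easy to wrap in a small helper. These convenience functions are hypothetical (not part of the MOSS-TTS API):

```python
TOKENS_PER_SECOND = 12.5  # audio token rate stated in the MOSS-TTS docs

def seconds_to_tokens(seconds: float) -> int:
    """Token budget for a target duration (1 s ≈ 12.5 tokens)."""
    return round(seconds * TOKENS_PER_SECOND)

def tokens_to_seconds(tokens: int) -> float:
    """Approximate playback duration of a given token count."""
    return tokens / TOKENS_PER_SECOND

# The duration-control examples above: 325 tokens ≈ 26 s, 600 tokens ≈ 48 s.
print(seconds_to_tokens(26), tokens_to_seconds(600))  # 325 48.0
```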
License
Apache 2.0, consistent with the original MOSS-TTS release.
Base model: OpenMOSS-Team/MOSS-TTS