# MOSS-TTS-Delay: llama.cpp Inference Backend

This package provides a torch-free (torch-optional) end-to-end TTS inference pipeline for MOSS-TTS-Delay using:

- llama.cpp for the Qwen3 backbone (GGUF format, GPU/CPU)
- NumPy for embeddings, LM heads, the delay state machine, and sampling
- ONNX Runtime or TensorRT for the audio tokenizer

When PyTorch is available, the LM heads can optionally be GPU-accelerated (~30x faster).
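The torch-free LM-head path boils down to one matrix multiply per codebook head. A minimal NumPy sketch with toy shapes; the function name and the `(H, V)` weight layout here are illustrative assumptions, not the package's exact API:

```python
import numpy as np

def numpy_lm_heads(hidden_state: np.ndarray, head_weights: list) -> np.ndarray:
    """Project a backbone hidden state through per-codebook LM heads.

    hidden_state: (H,) float32 vector from the backbone.
    head_weights: list of (H, V) float32 matrices, one per head.
    Returns logits of shape (num_heads, V).
    """
    return np.stack([hidden_state @ w for w in head_weights])

# Toy shapes: 4 heads, hidden size 8, vocab 16
rng = np.random.default_rng(0)
h = rng.standard_normal(8).astype(np.float32)
ws = [rng.standard_normal((8, 16)).astype(np.float32) for _ in range(4)]
logits = numpy_lm_heads(h, ws)
print(logits.shape)  # (4, 16)
```

With PyTorch installed, the same projections can run as batched `nn.Linear` layers on the GPU, which is where the ~30x speedup comes from.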
## Prerequisites

- llama.cpp, compiled from source with shared library support
- Python >= 3.10
## Installation

Minimal (torch-free, ONNX audio):

```bash
pip install -e ".[llama-cpp-onnx]"
```

With TensorRT audio (maximum performance):

```bash
pip install -e ".[llama-cpp-trt]"
```

With PyTorch LM-head acceleration:

```bash
pip install -e ".[llama-cpp-trt,llama-cpp-torch]"
```
## Weight Preparation

To convert weights from the original MOSS-TTS model yourself (instead of downloading pre-quantized ones), see the conversion guide.
### Step 1: Download the pre-quantized TTS backbone & weights

We provide a pre-quantized GGUF backbone, embedding tables, and LM head matrices on HuggingFace:

```bash
# Download pre-built GGUF + embeddings + lm_heads
huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF --local-dir weights/MOSS-TTS-GGUF
```

This gives you:

- `weights/MOSS-TTS-GGUF/MOSS_TTS_Q4_K_M.gguf`: Q4_K_M quantized backbone
- `weights/MOSS-TTS-GGUF/embeddings/`: 33 embedding `.npy` files
- `weights/MOSS-TTS-GGUF/lm_heads/`: 33 LM head `.npy` files
- `weights/MOSS-TTS-GGUF/tokenizer/`: BPE tokenizer files
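For illustration, here is one plausible shape of the embedding lookup over the 33 per-channel tables, using toy dimensions. This is a sketch only: the assumption that per-channel embeddings are summed is mine, and the package's `EmbeddingLookup` may combine them differently:

```python
import numpy as np

def lookup_embeddings(input_ids: np.ndarray, tables: list) -> np.ndarray:
    """Sum per-channel embeddings for each generation step.

    input_ids: (S, C) int array, one token id per channel per step.
    tables: list of C (vocab, H) embedding matrices (e.g. loaded from .npy).
    Returns (S, H) float32 embeddings.
    """
    S, C = input_ids.shape
    H = tables[0].shape[1]
    out = np.zeros((S, H), dtype=np.float32)
    for c in range(C):
        out += tables[c][input_ids[:, c]]  # gather then accumulate
    return out

# Toy example: 3 channels, vocab 10, hidden size 6, 5 steps
rng = np.random.default_rng(1)
tables = [rng.standard_normal((10, 6)).astype(np.float32) for _ in range(3)]
ids = rng.integers(0, 10, size=(5, 3))
emb = lookup_embeddings(ids, tables)
print(emb.shape)  # (5, 6)
```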
### Step 2: Download the ONNX audio tokenizer

We provide ONNX models for the audio tokenizer. TensorRT engines are not provided because they are tied to specific GPU architectures and TensorRT versions.

```bash
# Download ONNX encoder & decoder
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX --local-dir weights/MOSS-Audio-Tokenizer-ONNX
```
### Step 3: Build the C bridge

```bash
# Clone and build llama.cpp (if not already done)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j
cd ..

# Build the C bridge shared library
cd moss_tts_delay/llama_cpp
bash build_bridge.sh /path/to/llama.cpp
```
### Step 4 (Optional): Build TensorRT engines

Note: Only needed if you want to use `audio_backend: trt` for maximum audio tokenizer performance. Most users should use the ONNX backend.

```bash
bash moss_audio_tokenizer/trt/build_engine.sh \
    weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx \
    weights/MOSS-Audio-Tokenizer-ONNX/decoder.onnx \
    weights/MOSS-Audio-Tokenizer-TRT
```

⚠️ `maxShapes` determines the maximum audio length your engine can handle. The default builds support up to 40 seconds of audio. If you need longer audio, edit `MAX_AUDIO_SECONDS` in `build_engine.sh` before building. See the detailed shape → duration table in the script's comments.
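The duration-to-shape relationship behind that warning is simple arithmetic. A sketch, assuming the 24 kHz output rate used elsewhere in this README; the frames-per-second value is a placeholder assumption, since the real frame rate comes from the audio tokenizer and `build_engine.sh`:

```python
# Illustrative only: 24000 Hz matches the output rate used in this README;
# FRAMES_PER_SECOND is a hypothetical tokenizer frame rate, not the real one.
SAMPLE_RATE = 24000
FRAMES_PER_SECOND = 12.5

def max_engine_shapes(max_audio_seconds: float) -> tuple:
    """Return (max_samples, max_frames) implied by MAX_AUDIO_SECONDS."""
    max_samples = int(max_audio_seconds * SAMPLE_RATE)
    max_frames = int(max_audio_seconds * FRAMES_PER_SECOND)
    return max_samples, max_frames

print(max_engine_shapes(40.0))  # (960000, 500)
```

Doubling `MAX_AUDIO_SECONDS` doubles both maximum shapes, which in turn grows the engine's workspace requirements at build time.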
## Usage

### CLI

```bash
# Basic generation
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello, world!" \
    --output output.wav

# With reference audio (voice cloning)
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --reference ref.wav \
    --output output.wav

# Force NumPy LM heads (torch-free)
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --heads-backend numpy

# With profiling
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --profile
```
### Python API

```python
from moss_tts_delay.llama_cpp import LlamaCppPipeline, PipelineConfig

config = PipelineConfig.from_yaml("configs/llama_cpp/default.yaml")

with LlamaCppPipeline(config) as pipeline:
    waveform = pipeline.generate(
        text="Hello, world!",
        reference_audio="ref.wav",  # optional
        language="en",
    )

import soundfile as sf
sf.write("output.wav", waveform, 24000)
```
### Batch Evaluation

```bash
python scripts/batch_eval_llama_cpp.py \
    --config configs/llama_cpp/default.yaml \
    --benchmark-dir /path/to/eval/tts \
    --result-dir results/llama_cpp_run \
    --suite seed-tts
```
## Benchmark

Quantization quality was evaluated on the Seed-TTS-eval zero-shot benchmark. The baseline is the original HuggingFace model; GGUF variants use the llama.cpp backend with the TensorRT audio tokenizer.

| Quantization | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |
|---|---|---|---|---|
| Baseline (HuggingFace) | 1.79 | 71.46 | 1.32 | 77.05 |
| Q8_0 | 3.21 | 68.61 | 1.56 | 76.03 |
| Q6_K | 3.11 | 68.77 | 1.44 | 76.06 |
| Q5_K_M | 2.95 | 68.55 | 1.50 | 75.96 |
| Q4_K_M | 2.83 | 68.15 | 1.58 | 75.71 |
## Configuration

### Config Files

| Config | Audio Backend | Use Case |
|---|---|---|
| `configs/llama_cpp/default.yaml` | ONNX | Recommended starting point |
| `configs/llama_cpp/trt.yaml` | TensorRT | Maximum throughput |
| `configs/llama_cpp/cpu-only.yaml` | ONNX (CPU) | No GPU required |
### Key Options

| Option | Values | Description |
|---|---|---|
| `heads_backend` | `auto` / `numpy` / `torch` | LM heads computation backend. `auto` uses torch if available |
| `audio_backend` | `onnx` / `trt` / `torch` | Audio tokenizer backend |
| `n_gpu_layers` | `-1` / `0` / N | GPU offload layers. `-1` = all, `0` = CPU only |
| `n_ctx` | int | Context window size (prompt + generation) |
| `max_new_tokens` | int | Maximum generation steps |
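As a sketch of how a `heads_backend: auto` setting typically resolves (try to import torch, fall back to NumPy); the function name is hypothetical, and the real selection logic inside the pipeline may differ:

```python
def resolve_heads_backend(requested: str = "auto") -> str:
    """Resolve 'auto' to 'torch' when PyTorch is importable, else 'numpy'.

    Explicit 'numpy' or 'torch' requests are passed through unchanged.
    """
    if requested != "auto":
        return requested
    try:
        import torch  # noqa: F401  (only checking availability)
        return "torch"
    except ImportError:
        return "numpy"

print(resolve_heads_backend("numpy"))  # numpy
```

This pattern keeps the default fast when PyTorch is installed while guaranteeing the torch-free path still works.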
## Architecture

```
Input text
    │
    ▼
Tokenizer (Rust BPE, tokenizers library)
    │
    ▼
build_generation_prompt() → input_ids (S, 33)
    │
    ▼
EmbeddingLookup (NumPy .npy) → embeddings (S, H)
    │
    ▼
LlamaCppBackbone (GGUF, C bridge) → hidden_state (H,)
    │
    ├── [heads_backend=torch] TorchLMHeads (nn.Linear, GPU)
    │       └── audio_logits (32, 1025)
    │
    └── [heads_backend=numpy] NumpyLMHeads (CPU matmul)
            └── audio_logits (32, 1025)
    │
    ▼
delay_step() + sampling (NumPy) → next_ids (33,)
    │
    ▼  (loop until EOS)
    │
Audio codes → AudioTokenizer (ONNX/TRT/Torch) → waveform
```
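The `delay_step()` box above implements a delay pattern: codebook `c` is staggered by `c` steps so all 32 codebooks can be advanced one position per generation step. A minimal sketch of the staggering and its inverse, assuming a one-step delay per codebook and a hypothetical PAD id; the package's state machine handles more (BOS/EOS, streaming) than this:

```python
import numpy as np

PAD = 1024  # hypothetical padding id outside the 0..1023 code range

def apply_delay(codes: np.ndarray) -> np.ndarray:
    """Stagger codebooks: codebook c is shifted right by c steps.

    codes: (C, T) array of audio codes.
    Returns a (C, T + C - 1) array padded with PAD.
    """
    C, T = codes.shape
    out = np.full((C, T + C - 1), PAD, dtype=codes.dtype)
    for c in range(C):
        out[c, c:c + T] = codes[c]
    return out

def undo_delay(delayed: np.ndarray, T: int) -> np.ndarray:
    """Inverse of apply_delay: realign codebooks before decoding."""
    C = delayed.shape[0]
    return np.stack([delayed[c, c:c + T] for c in range(C)])

codes = np.arange(8).reshape(2, 4)
assert np.array_equal(undo_delay(apply_delay(codes), 4), codes)
```

At generation time the same staggering means step t samples codebook 0 for frame t, codebook 1 for frame t-1, and so on, which is why each sampled `next_ids` vector spans all 33 channels at once.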
## File Structure

```
moss_tts_delay/llama_cpp/
├── __init__.py          # Package entry, exports LlamaCppPipeline
├── __main__.py          # python -m moss_tts_delay.llama_cpp
├── _constants.py        # Token IDs (from config.json, torch-free)
├── pipeline.py          # LlamaCppPipeline (main entry)
├── backbone.py          # LlamaCppBackbone (C bridge wrapper)
├── backbone_bridge.c    # C bridge source
├── build_bridge.sh      # Build script
├── embedding.py         # EmbeddingLookup (NumPy)
├── lm_heads.py          # NumpyLMHeads + TorchLMHeads
├── delay_state.py       # Delay state machine (NumPy)
├── sampling.py          # top-k/p sampling (NumPy)
├── processor.py         # Tokenizer + prompt builder
├── README.md            # This file
├── README_zh.md         # Chinese documentation
└── conversion/
    ├── extract_weights.py  # Weight extraction script
    ├── README.md           # Conversion guide (English)
    └── README_zh.md        # Conversion guide (Chinese)
```
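The top-k/top-p sampling referenced above (`sampling.py`) follows the standard recipe; a NumPy sketch whose defaults and edge-case handling are assumptions, not the package's exact implementation:

```python
import numpy as np

def sample_top_k_top_p(logits, k=50, p=0.95, temperature=1.0, rng=None):
    """Sample a token id with top-k then top-p (nucleus) filtering."""
    rng = rng or np.random.default_rng()
    logits = logits / temperature
    # Top-k: mask everything below the k-th largest logit
    if 0 < k < logits.size:
        kth = np.partition(logits, -k)[-k]
        logits = np.where(logits < kth, -np.inf, logits)
    # Softmax over the surviving logits
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    # Top-p: keep the smallest prefix of tokens with cumulative prob >= p
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    mask /= mask.sum()
    return int(rng.choice(probs.size, p=mask))

logits = np.array([2.0, 1.0, 0.1, -1.0])
tok = sample_top_k_top_p(logits, k=2, p=0.9, rng=np.random.default_rng(0))
# tok is one of the two highest-logit tokens (index 0 or 1)
```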