
MOSS-TTS-Delay: llama.cpp Inference Backend

English | ็ฎ€ไฝ“ไธญๆ–‡

This package provides a torch-free (or torch-optional) end-to-end TTS inference pipeline for MOSS-TTS-Delay using:

  • llama.cpp for the Qwen3 backbone (GGUF format, GPU/CPU)
  • NumPy for embeddings, LM heads, delay state machine, and sampling
  • ONNX Runtime or TensorRT for the audio tokenizer

When PyTorch is available, LM heads can optionally be GPU-accelerated (~30x faster).
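
As a rough sketch of what the NumPy fallback computes, each LM head is a plain matrix product of the backbone's final hidden state against a per-channel projection. This is illustrative only, not the package's actual lm_heads.py; weight loading, dtypes, and bias terms are omitted:

```python
import numpy as np

class NumpyLMHeadsSketch:
    """Torch-free LM heads: one (hidden_dim, vocab_size) projection per
    audio channel.  A minimal sketch under assumed shapes, not the real
    implementation."""

    def __init__(self, weights):
        # weights: list of (hidden_dim, vocab_size) arrays, one per head
        self.weights = weights

    def __call__(self, hidden):
        # hidden: (hidden_dim,) -> logits: (num_heads, vocab_size)
        return np.stack([hidden @ w for w in self.weights])
```

The torch path does the same projections as batched nn.Linear calls on the GPU, which is where the ~30x speedup comes from.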

Prerequisites

  1. llama.cpp – compiled from source with shared library support
  2. Python >= 3.10

Installation

Minimal (torch-free, ONNX audio)

pip install -e ".[llama-cpp-onnx]"

With TensorRT audio (max performance)

pip install -e ".[llama-cpp-trt]"

With PyTorch LM heads acceleration

pip install -e ".[llama-cpp-trt,llama-cpp-torch]"
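
After installing one of the extras, a quick way to see which optional backends actually resolved is to probe for their modules. The module names below are assumptions about what each extra pulls in:

```python
import importlib.util

def backend_available(module_name: str) -> bool:
    """Return True if the module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None

# probe the optional dependencies of the extras above
for mod in ("onnxruntime", "tensorrt", "torch"):
    print(f"{mod}: {'available' if backend_available(mod) else 'missing'}")
```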

Weight Preparation

To convert weights from the original MOSS-TTS model yourself (instead of downloading pre-quantized ones), see the conversion guide.

Step 1: Download pre-quantized TTS backbone & weights

We provide pre-quantized GGUF backbone, embedding tables, and LM head matrices on HuggingFace:

# Download pre-built GGUF + embeddings + lm_heads
huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF --local-dir weights/MOSS-TTS-GGUF

This gives you:

  • weights/MOSS-TTS-GGUF/MOSS_TTS_Q4_K_M.gguf – Q4_K_M quantized backbone
  • weights/MOSS-TTS-GGUF/embeddings/ – 33 embedding .npy files
  • weights/MOSS-TTS-GGUF/lm_heads/ – 33 LM head .npy files
  • weights/MOSS-TTS-GGUF/tokenizer/ – BPE tokenizer files
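
A quick sanity check after downloading is to confirm that the embeddings and lm_heads directories each hold the expected 33 .npy files. The helper below is a sketch; exact file names and array shapes are not specified here:

```python
import numpy as np
from pathlib import Path

def check_npy_dir(path, expected_count=33):
    """List the .npy weight files in a directory and verify the count."""
    files = sorted(Path(path).glob("*.npy"))
    if len(files) != expected_count:
        raise RuntimeError(
            f"{path}: expected {expected_count} .npy files, found {len(files)}"
        )
    for f in files:
        arr = np.load(f, mmap_mode="r")  # memory-map to avoid a full read
        print(f"{f.name}: shape={arr.shape} dtype={arr.dtype}")
    return files

# check_npy_dir("weights/MOSS-TTS-GGUF/embeddings")
# check_npy_dir("weights/MOSS-TTS-GGUF/lm_heads")
```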

Step 2: Download ONNX audio tokenizer

We provide ONNX models for the audio tokenizer. TensorRT engines are not provided because they are tied to specific GPU architectures and TensorRT versions.

# Download ONNX encoder & decoder
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX --local-dir weights/MOSS-Audio-Tokenizer-ONNX

Step 3: Build the C bridge

# Clone and build llama.cpp (if not already done)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j
cd ..

# Build the C bridge shared library
cd moss_tts_delay/llama_cpp
bash build_bridge.sh /path/to/llama.cpp
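
The resulting shared library is consumed from Python via ctypes. A minimal loader sketch follows; the library file names are assumptions, so use whatever build_bridge.sh actually produces on your platform:

```python
import ctypes
from pathlib import Path

def load_bridge(lib_dir):
    """Load the C bridge shared library, trying common platform suffixes.

    The base name 'moss_bridge' is a placeholder, not the real artifact
    name from build_bridge.sh.
    """
    for name in ("libmoss_bridge.so", "libmoss_bridge.dylib", "moss_bridge.dll"):
        candidate = Path(lib_dir) / name
        if candidate.exists():
            return ctypes.CDLL(str(candidate))
    raise FileNotFoundError(f"no bridge library found under {lib_dir}")
```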

Step 4 (Optional): Build TensorRT engines

Note: Only needed if you want to use audio_backend: trt for maximum audio tokenizer performance. Most users should use the ONNX backend.

bash moss_audio_tokenizer/trt/build_engine.sh \
    weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx \
    weights/MOSS-Audio-Tokenizer-ONNX/decoder.onnx \
    weights/MOSS-Audio-Tokenizer-TRT

โš ๏ธ maxShapes determines the maximum audio length your engine can handle. The default builds support up to 40 seconds of audio. If you need longer audio, edit MAX_AUDIO_SECONDS in build_engine.sh before building. See the detailed shape โ†” duration table in the script's comments.

Usage

CLI

# Basic generation
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello, world!" \
    --output output.wav

# With reference audio (voice cloning)
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --reference ref.wav \
    --output output.wav

# Force numpy LM heads (torch-free)
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --heads-backend numpy

# With profiling
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --profile

Python API

from moss_tts_delay.llama_cpp import LlamaCppPipeline, PipelineConfig

config = PipelineConfig.from_yaml("configs/llama_cpp/default.yaml")

with LlamaCppPipeline(config) as pipeline:
    waveform = pipeline.generate(
        text="Hello, world!",
        reference_audio="ref.wav",  # optional
        language="en",
    )

import soundfile as sf
sf.write("output.wav", waveform, 24000)

Batch Evaluation

python scripts/batch_eval_llama_cpp.py \
    --config configs/llama_cpp/default.yaml \
    --benchmark-dir /path/to/eval/tts \
    --result-dir results/llama_cpp_run \
    --suite seed-tts

Benchmark

Quantization quality evaluated on Seed-TTS-eval zero-shot benchmark. Baseline is the original HuggingFace model; GGUF variants use the llama.cpp backend with TensorRT audio tokenizer.

Quantization             EN WER (%) ↓   EN SIM (%) ↑   ZH CER (%) ↓   ZH SIM (%) ↑
Baseline (HuggingFace)   1.79           71.46          1.32           77.05
Q8_0                     3.21           68.61          1.56           76.03
Q6_K                     3.11           68.77          1.44           76.06
Q5_K_M                   2.95           68.55          1.50           75.96
Q4_K_M                   2.83           68.15          1.58           75.71

Configuration

Config Files

Config                            Audio Backend   Use Case
configs/llama_cpp/default.yaml    ONNX            Recommended starting point
configs/llama_cpp/trt.yaml        TensorRT        Maximum throughput
configs/llama_cpp/cpu-only.yaml   ONNX (CPU)      No GPU required

Key Options

Option           Values                 Description
heads_backend    auto / numpy / torch   LM heads computation backend; auto uses torch if available
audio_backend    onnx / trt / torch     Audio tokenizer backend
n_gpu_layers     -1 / 0 / N             GPU offload layers; -1 = all, 0 = CPU only
n_ctx            int                    Context window size (prompt + generation)
max_new_tokens   int                    Maximum generation steps
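
Putting the key options together, a config might look like the sketch below. The layout and the concrete values are illustrative assumptions; compare against configs/llama_cpp/default.yaml for the real schema:

```yaml
# illustrative only; check configs/llama_cpp/default.yaml for the real schema
heads_backend: auto      # torch if installed, else numpy
audio_backend: onnx      # onnx / trt / torch
n_gpu_layers: -1         # offload all backbone layers to GPU
n_ctx: 8192              # prompt + generation budget (example value)
max_new_tokens: 2048     # example value
```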

Architecture

Input text
  │
  ▼
Tokenizer (Rust BPE, tokenizers library)
  │
  ▼
build_generation_prompt() → input_ids (S, 33)
  │
  ▼
EmbeddingLookup (NumPy .npy) → embeddings (S, H)
  │
  ▼
LlamaCppBackbone (GGUF, C bridge) → hidden_state (H,)
  │
  ├─ [heads_backend=torch] TorchLMHeads (nn.Linear, GPU)
  │                          └─ audio_logits (32, 1025)
  │
  └─ [heads_backend=numpy] NumpyLMHeads (CPU matmul)
                             └─ audio_logits (32, 1025)
  │
  ▼
delay_step() + sampling (NumPy) → next_ids (33,)
  │
  ▼ (loop until EOS)
  │
Audio codes → AudioTokenizer (ONNX/TRT/Torch) → waveform
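
The delay pattern in the loop above staggers the audio channels so that channel c lags channel 0 by c steps. A toy sketch of the idea follows; the offsets, pad token id, and per-channel handling are assumptions, and the real delay_state.py additionally tracks prompt and EOS phases:

```python
import numpy as np

PAD_ID = 1024            # assumed pad code (vocab 1025 -> ids 0..1024)
NUM_AUDIO_CHANNELS = 32

def delayed_next_ids(step, sampled):
    """Apply a per-channel delay: channel c emits real codes only once
    step >= c; earlier positions stay padded."""
    next_ids = np.full(NUM_AUDIO_CHANNELS, PAD_ID, dtype=np.int64)
    for c in range(NUM_AUDIO_CHANNELS):
        if step >= c:
            next_ids[c] = sampled[c]
    return next_ids
```

The staggering lets a strictly autoregressive backbone condition each channel on the already-emitted codes of lower channels at the same audio frame.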

File Structure

moss_tts_delay/llama_cpp/
├── __init__.py          # Package entry, exports LlamaCppPipeline
├── __main__.py          # python -m moss_tts_delay.llama_cpp
├── _constants.py        # Token IDs (from config.json, torch-free)
├── pipeline.py          # LlamaCppPipeline (main entry)
├── backbone.py          # LlamaCppBackbone (C bridge wrapper)
├── backbone_bridge.c    # C bridge source
├── build_bridge.sh      # Build script
├── embedding.py         # EmbeddingLookup (NumPy)
├── lm_heads.py          # NumpyLMHeads + TorchLMHeads
├── delay_state.py       # Delay state machine (NumPy)
├── sampling.py          # top-k/p sampling (NumPy)
├── processor.py         # Tokenizer + prompt builder
├── README.md            # This file
├── README_zh.md         # Chinese documentation
└── conversion/
    ├── extract_weights.py  # Weight extraction script
    ├── README.md           # Conversion guide (English)
    └── README_zh.md        # Conversion guide (Chinese)