
MOSS-TTS-Delay: llama.cpp Inference Backend

English | ็ฎ€ไฝ“ไธญๆ–‡

This package provides a torch-free (or torch-optional) end-to-end TTS inference pipeline for MOSS-TTS-Delay using:

  • llama.cpp for the Qwen3 backbone (GGUF format, GPU/CPU)
  • NumPy for embeddings, LM heads, delay state machine, and sampling
  • ONNX Runtime or TensorRT for the audio tokenizer

When PyTorch is available, LM heads can optionally be GPU-accelerated (~30x faster).
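
As a rough sketch of what the NumPy fallback computes, each LM head is a plain matrix product of the backbone's final hidden state against a per-channel projection. This is illustrative only, not the package's actual lm_heads.py; weight loading, dtypes, and bias terms are omitted:

```python
import numpy as np

class NumpyLMHeadsSketch:
    """Torch-free LM heads: one (hidden_dim, vocab_size) projection per
    audio channel.  A minimal sketch under assumed shapes, not the real
    implementation."""

    def __init__(self, weights):
        # weights: list of (hidden_dim, vocab_size) arrays, one per head
        self.weights = weights

    def __call__(self, hidden):
        # hidden: (hidden_dim,) -> logits: (num_heads, vocab_size)
        return np.stack([hidden @ w for w in self.weights])
```

The torch path does the same projections as batched nn.Linear calls on the GPU, which is where the ~30x speedup comes from.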

Prerequisites

  1. llama.cpp – compiled from source with shared library support
  2. Python >= 3.10

Installation

Minimal (torch-free, ONNX audio)

pip install -e ".[llama-cpp-onnx]"

With TensorRT audio (max performance)

pip install -e ".[llama-cpp-trt]"

With PyTorch LM heads acceleration

pip install -e ".[llama-cpp-trt,llama-cpp-torch]"
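
After installing one of the extras, a quick way to see which optional backends actually resolved is to probe for their modules. The module names below are assumptions about what each extra pulls in:

```python
import importlib.util

def backend_available(module_name: str) -> bool:
    """Return True if the module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None

# probe the optional dependencies of the extras above
for mod in ("onnxruntime", "tensorrt", "torch"):
    print(f"{mod}: {'available' if backend_available(mod) else 'missing'}")
```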

Weight Preparation

To convert weights from the original MOSS-TTS model yourself (instead of downloading pre-quantized ones), see the conversion guide.

Step 1: Download pre-quantized TTS backbone & weights

We provide pre-quantized GGUF backbone, embedding tables, and LM head matrices on HuggingFace:

# Download pre-built GGUF + embeddings + lm_heads
huggingface-cli download OpenMOSS-Team/MOSS-TTS-GGUF --local-dir weights/MOSS-TTS-GGUF

This gives you:

  • weights/MOSS-TTS-GGUF/MOSS_TTS_Q4_K_M.gguf – Q4_K_M quantized backbone
  • weights/MOSS-TTS-GGUF/embeddings/ – 33 embedding .npy files
  • weights/MOSS-TTS-GGUF/lm_heads/ – 33 LM head .npy files
  • weights/MOSS-TTS-GGUF/tokenizer/ – BPE tokenizer files
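
A quick sanity check after downloading is to confirm that the embeddings and lm_heads directories each hold the expected 33 .npy files. The helper below is a sketch; exact file names and array shapes are not specified here:

```python
import numpy as np
from pathlib import Path

def check_npy_dir(path, expected_count=33):
    """List the .npy weight files in a directory and verify the count."""
    files = sorted(Path(path).glob("*.npy"))
    if len(files) != expected_count:
        raise RuntimeError(
            f"{path}: expected {expected_count} .npy files, found {len(files)}"
        )
    for f in files:
        arr = np.load(f, mmap_mode="r")  # memory-map to avoid a full read
        print(f"{f.name}: shape={arr.shape} dtype={arr.dtype}")
    return files

# check_npy_dir("weights/MOSS-TTS-GGUF/embeddings")
# check_npy_dir("weights/MOSS-TTS-GGUF/lm_heads")
```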

Step 2: Download ONNX audio tokenizer

We provide ONNX models for the audio tokenizer. TensorRT engines are not provided because they are tied to specific GPU architectures and TensorRT versions.

# Download ONNX encoder & decoder
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Tokenizer-ONNX --local-dir weights/MOSS-Audio-Tokenizer-ONNX

Step 3: Build the C bridge

# Clone and build llama.cpp (if not already done)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j
cd ..

# Build the C bridge shared library
cd moss_tts_delay/llama_cpp
bash build_bridge.sh /path/to/llama.cpp
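
The resulting shared library is consumed from Python via ctypes. A minimal loader sketch follows; the library file names are assumptions, so use whatever build_bridge.sh actually produces on your platform:

```python
import ctypes
from pathlib import Path

def load_bridge(lib_dir):
    """Load the C bridge shared library, trying common platform suffixes.

    The base name 'moss_bridge' is a placeholder, not the real artifact
    name from build_bridge.sh.
    """
    for name in ("libmoss_bridge.so", "libmoss_bridge.dylib", "moss_bridge.dll"):
        candidate = Path(lib_dir) / name
        if candidate.exists():
            return ctypes.CDLL(str(candidate))
    raise FileNotFoundError(f"no bridge library found under {lib_dir}")
```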

Step 4 (Optional): Build TensorRT engines

Note: Only needed if you want to use audio_backend: trt for maximum audio tokenizer performance. Most users should use the ONNX backend.

bash moss_audio_tokenizer/trt/build_engine.sh \
    weights/MOSS-Audio-Tokenizer-ONNX/encoder.onnx \
    weights/MOSS-Audio-Tokenizer-ONNX/decoder.onnx \
    weights/MOSS-Audio-Tokenizer-TRT

โš ๏ธ maxShapes determines the maximum audio length your engine can handle. The default builds support up to 40 seconds of audio. If you need longer audio, edit MAX_AUDIO_SECONDS in build_engine.sh before building. See the detailed shape โ†” duration table in the script's comments.

Usage

CLI

# Basic generation
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello, world!" \
    --output output.wav

# With reference audio (voice cloning)
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --reference ref.wav \
    --output output.wav

# Force numpy LM heads (torch-free)
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --heads-backend numpy

# With profiling
python -m moss_tts_delay.llama_cpp \
    --config configs/llama_cpp/default.yaml \
    --text "Hello!" \
    --profile

Python API

from moss_tts_delay.llama_cpp import LlamaCppPipeline, PipelineConfig

config = PipelineConfig.from_yaml("configs/llama_cpp/default.yaml")

with LlamaCppPipeline(config) as pipeline:
    waveform = pipeline.generate(
        text="Hello, world!",
        reference_audio="ref.wav",  # optional
        language="en",
    )

import soundfile as sf
sf.write("output.wav", waveform, 24000)

Batch Evaluation

python scripts/batch_eval_llama_cpp.py \
    --config configs/llama_cpp/default.yaml \
    --benchmark-dir /path/to/eval/tts \
    --result-dir results/llama_cpp_run \
    --suite seed-tts

Benchmark

Quantization quality evaluated on Seed-TTS-eval zero-shot benchmark. Baseline is the original HuggingFace model; GGUF variants use the llama.cpp backend with TensorRT audio tokenizer.

Quantization             EN WER (%) ↓   EN SIM (%) ↑   ZH CER (%) ↓   ZH SIM (%) ↑
Baseline (HuggingFace)   1.79           71.46          1.32           77.05
Q8_0                     3.21           68.61          1.56           76.03
Q6_K                     3.11           68.77          1.44           76.06
Q5_K_M                   2.95           68.55          1.50           75.96
Q4_K_M                   2.83           68.15          1.58           75.71

Configuration

Config Files

Config                            Audio Backend   Use Case
configs/llama_cpp/default.yaml    ONNX            Recommended starting point
configs/llama_cpp/trt.yaml        TensorRT        Maximum throughput
configs/llama_cpp/cpu-only.yaml   ONNX (CPU)      No GPU required

Key Options

Option           Values                 Description
heads_backend    auto / numpy / torch   LM heads computation backend; auto uses torch if available
audio_backend    onnx / trt / torch     Audio tokenizer backend
n_gpu_layers     -1 / 0 / N             GPU offload layers; -1 = all, 0 = CPU only
n_ctx            int                    Context window size (prompt + generation)
max_new_tokens   int                    Maximum generation steps
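
Putting the key options together, a config might look like the sketch below. The layout and the concrete values are illustrative assumptions; compare against configs/llama_cpp/default.yaml for the real schema:

```yaml
# illustrative only; check configs/llama_cpp/default.yaml for the real schema
heads_backend: auto      # torch if installed, else numpy
audio_backend: onnx      # onnx / trt / torch
n_gpu_layers: -1         # offload all backbone layers to GPU
n_ctx: 8192              # prompt + generation budget (example value)
max_new_tokens: 2048     # example value
```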

Architecture

Input text
  │
  ▼
Tokenizer (Rust BPE, tokenizers library)
  │
  ▼
build_generation_prompt() → input_ids (S, 33)
  │
  ▼
EmbeddingLookup (NumPy .npy) → embeddings (S, H)
  │
  ▼
LlamaCppBackbone (GGUF, C bridge) → hidden_state (H,)
  │
  ├─ [heads_backend=torch] TorchLMHeads (nn.Linear, GPU)
  │                          └─ audio_logits (32, 1025)
  │
  └─ [heads_backend=numpy] NumpyLMHeads (CPU matmul)
                             └─ audio_logits (32, 1025)
  │
  ▼
delay_step() + sampling (NumPy) → next_ids (33,)
  │
  ▼ (loop until EOS)
  │
Audio codes → AudioTokenizer (ONNX/TRT/Torch) → waveform
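
The delay pattern in the loop above staggers the audio channels so that channel c lags channel 0 by c steps. A toy sketch of the idea follows; the offsets, pad token id, and per-channel handling are assumptions, and the real delay_state.py additionally tracks prompt and EOS phases:

```python
import numpy as np

PAD_ID = 1024            # assumed pad code (vocab 1025 -> ids 0..1024)
NUM_AUDIO_CHANNELS = 32

def delayed_next_ids(step, sampled):
    """Apply a per-channel delay: channel c emits real codes only once
    step >= c; earlier positions stay padded."""
    next_ids = np.full(NUM_AUDIO_CHANNELS, PAD_ID, dtype=np.int64)
    for c in range(NUM_AUDIO_CHANNELS):
        if step >= c:
            next_ids[c] = sampled[c]
    return next_ids
```

The staggering lets a strictly autoregressive backbone condition each channel on the already-emitted codes of lower channels at the same audio frame.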

File Structure

moss_tts_delay/llama_cpp/
├── __init__.py          # Package entry, exports LlamaCppPipeline
├── __main__.py          # python -m moss_tts_delay.llama_cpp
├── _constants.py        # Token IDs (from config.json, torch-free)
├── pipeline.py          # LlamaCppPipeline (main entry)
├── backbone.py          # LlamaCppBackbone (C bridge wrapper)
├── backbone_bridge.c    # C bridge source
├── build_bridge.sh      # Build script
├── embedding.py         # EmbeddingLookup (NumPy)
├── lm_heads.py          # NumpyLMHeads + TorchLMHeads
├── delay_state.py       # Delay state machine (NumPy)
├── sampling.py          # top-k/p sampling (NumPy)
├── processor.py         # Tokenizer + prompt builder
├── README.md            # This file
├── README_zh.md         # Chinese documentation
└── conversion/
    ├── extract_weights.py  # Weight extraction script
    ├── README.md           # Conversion guide (English)
    └── README_zh.md        # Conversion guide (Chinese)