MOSS-TTS-Local-Transformer-ONNX

Overview

MOSS-TTS-Local-Transformer-ONNX is an ONNX-optimized export of the MOSS-TTS Local architecture. It is designed for high-fidelity zero-shot voice cloning, multilingual speech synthesis, and cross-platform inference with ONNX Runtime.

This bundle targets the MOSS-TTS Local 1.7B model family and is packaged as a production-oriented ONNX layout for CPU and GPU-backed runtimes. The current export keeps the main acoustic stack in float32 and uses a weight-quantized decoder graph to reduce size while preserving voice cloning quality.

Model Description

MOSS-TTS is the flagship TTS family from MOSI.AI and the OpenMOSS team. The Local Transformer variant is the compact 1.7B architecture in that family, built around a global latent backbone plus a lightweight local autoregressive decoder over RVQ audio tokens.

Compared with the larger Delay-pattern model, the Local variant is smaller, easier to evaluate, and strong on objective benchmarks, while still supporting:

  • Zero-shot voice cloning from short reference audio
  • Multilingual synthesis and code-switching
  • Long-form speech generation
  • Production-oriented speech quality and speaker consistency

This ONNX export focuses on the MOSS-TTS Local single-speaker synthesis path.

Key Features

  • Zero-Shot Voice Cloning: Clone speaker timbre from short reference audio without speaker-specific fine-tuning
  • 20-Language Coverage: Supports multilingual and code-switched synthesis across the released MOSS-TTS language set
  • Local Transformer Architecture: Compact 1.7B model with a global backbone plus a local token block generator
  • Cross-Platform Inference: CPU, NVIDIA CUDA, and AMD ROCm execution are supported through ONNX Runtime
  • Compatibility-Oriented ONNX Bundle: Mixed export with float32 core graphs and a quantized decoder graph

Architecture

The ONNX export uses the MOSS-TTS Local architecture:

  • Embeddings: 33 channels per step, with 1 text stream and 32 audio RVQ codebooks
  • Backbone: 28-layer Qwen-style transformer with hidden size 2048
  • Local Transformer: 1536-dim local autoregressive transformer for block-wise token generation
  • Audio Tokenizer: RVQ-32 encoder/decoder pair operating at 24 kHz audio

Supported Languages

Chinese, English, German, Spanish, French, Japanese, Italian, Hebrew, Korean, Russian, Persian (Farsi), Arabic, Polish, Portuguese, Czech, Danish, Swedish, Hungarian, Greek, Turkish

Installation

Use any ONNX Runtime-compatible inference stack that supports multi-file ONNX models with external data.

This export was validated with:

  • ONNX Runtime 1.22.x
  • CPU inference
  • AMD ROCm execution provider
  • NVIDIA CUDA execution provider (supported through the runtime integration)
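When loading the graphs yourself, execution providers are selected by passing a priority list to the session. The helper below is a hypothetical sketch (the function name and flags are not part of this bundle); the provider identifiers themselves are the standard ONNX Runtime names, and the result is intended for the `providers` argument of `onnxruntime.InferenceSession`:

```python
def provider_list(use_cuda=False, use_rocm=False, device_id=0):
    """Build an ONNX Runtime execution-provider priority list.

    Hypothetical helper: mirrors the CPU/CUDA/ROCm choices above.
    CPU is always kept last as the fallback provider.
    """
    providers = []
    if use_cuda:
        providers.append(("CUDAExecutionProvider", {"device_id": device_id}))
    if use_rocm:
        providers.append(("ROCMExecutionProvider", {"device_id": device_id}))
    providers.append("CPUExecutionProvider")
    return providers

print(provider_list(use_rocm=True, device_id=0))
```

The same list shape works for every graph in the bundle, since all eight components are loaded as ordinary ONNX Runtime sessions.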

Quick Start

This bundle is designed to be consumed directly by an ONNX Runtime application. One supported reference runtime is onnx-server:

onnx-server tts-moss \
    --text "Hello, this is a voice cloning test." \
    --ref reference.wav \
    --output output.wav

For AMD ROCm:

onnx-server tts-moss \
    --text "Hello, this is a voice cloning test." \
    --ref reference.wav \
    --output output.wav \
    --rocm \
    --rocm-device-id 0

Command Line Usage

onnx-server tts-moss \
    --text "Your text here" \
    --ref reference.wav \
    --output output.wav

Parameters

  • --text: Text to synthesize (required)
  • --ref: Reference audio for voice cloning (optional; required when cloning a voice)
  • --output: Output WAV file path
  • --max-seconds: Maximum output duration in seconds
  • --text-temp: Text-token sampling temperature
  • --text-top-k: Text-token top-k sampling
  • --audio-temp: Audio-token sampling temperature
  • --audio-top-k: Audio-token top-k sampling
  • --audio-top-p: Audio-token top-p sampling
  • --audio-rep-penalty: Audio repetition penalty
  • --seed: Random seed for deterministic sampling
  • --cuda: Enable NVIDIA CUDA execution provider
  • --device-id: CUDA device ID
  • --rocm: Enable AMD ROCm execution provider
  • --rocm-device-id: ROCm device ID
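The sampling flags above compose in the usual order: repetition penalty, temperature, top-k, then top-p. The following is a hypothetical reference implementation of those knobs (the actual runtime's sampler may differ in detail), shown for a single logits vector:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0,
                 rep_penalty=1.0, history=(), seed=None):
    """Sample one token id from raw logits, applying the CLI-style knobs."""
    rng = np.random.default_rng(seed)
    logits = logits.astype(np.float64).copy()
    # --audio-rep-penalty: dampen logits of tokens already emitted
    for t in set(history):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    # --audio-temp / --text-temp: sharpen or flatten the distribution
    logits /= max(temperature, 1e-6)
    # --audio-top-k / --text-top-k: keep only the k highest logits
    if top_k > 0:
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # --audio-top-p: nucleus sampling over the sorted distribution
    if top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[cum - probs[order] < top_p]
        nucleus = np.zeros_like(probs)
        nucleus[keep] = probs[keep]
        probs = nucleus / nucleus.sum()
    return int(rng.choice(len(probs), p=probs))

print(sample_token(np.array([0.0, 0.0, 10.0]), top_k=1))  # 2
```

With --seed fixed, the whole chain is deterministic, which is why the flag enables reproducible synthesis.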

Model Components

The ONNX export includes 8 runtime components:

  • embeddings.onnx: multi-stream token embedding lookup; input (B, T, 33) token IDs, output (B, T, 2048) global embeddings
  • backbone_prefill.onnx: prompt prefill for the global backbone; input embeddings, mask, positions; output hidden states, KV cache
  • backbone_decode.onnx: single-step autoregressive backbone decode; input step embedding, mask, position, KV cache; output hidden state, updated KV cache
  • global_to_local.onnx: global-to-local projection; input global hidden and previous audio tokens; output local latent and projected audio features
  • token_to_local.onnx: per-channel local token projection; input 33-channel token block; output local token embeddings
  • local_trm.onnx: local autoregressive token block model; input (1, 33, 1536) local sequence; output text logits and audio logits
  • encoder.onnx: reference audio tokenizer; input waveform; output RVQ-32 audio codes
  • decoder.onnx: audio waveform decoder; input RVQ-32 audio codes; output 24 kHz mono waveform
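These components chain into a prefill-then-decode loop. The sketch below uses hypothetical shape-only stand-ins for the graphs (a real pipeline would run each through onnxruntime.InferenceSession; token_to_local, encoder, and decoder are omitted) to illustrate the intended dataflow:

```python
import numpy as np

# Shape-only stand-ins for the ONNX graphs listed above.
HIDDEN, LOCAL_DIM, CHANNELS = 2048, 1536, 33

def embeddings(tokens):            # (B, T, 33) ids -> (B, T, 2048)
    return np.zeros((*tokens.shape[:2], HIDDEN), dtype=np.float32)

def backbone_prefill(embs):        # prompt pass -> last hidden + KV cache
    return embs[:, -1], {"kv": embs}

def backbone_decode(hidden, kv):   # one autoregressive backbone step
    return hidden, kv

def global_to_local(hidden):       # project into the local transformer
    return np.zeros((1, CHANNELS, LOCAL_DIM), dtype=np.float32)

def local_trm(local_seq):          # emit one 33-channel token block
    return np.zeros((1, CHANNELS), dtype=np.int64)

def decode_loop(prompt_ids, steps):
    hidden, kv = backbone_prefill(embeddings(prompt_ids))
    blocks = []
    for _ in range(steps):
        hidden, kv = backbone_decode(hidden, kv)
        blocks.append(local_trm(global_to_local(hidden)))
    return np.concatenate(blocks, axis=0)  # (steps, 33) token blocks

blocks = decode_loop(np.zeros((1, 4, CHANNELS), dtype=np.int64), steps=5)
print(blocks.shape)  # (5, 33)
```

The accumulated token blocks are what decoder.onnx finally turns into the 24 kHz waveform.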

Performance

  • CPU: supported; the full inference path runs on CPU
  • NVIDIA GPU: supported; backbone and local stack run on CUDA-enabled ONNX Runtime builds
  • AMD GPU: supported; validated with ROCm for the backbone and local stack

Voice Cloning Quality

Upstream MOSS-TTS Local benchmark results reported on Seed-TTS-eval:

  • English WER: 1.85
  • English SIM: 73.42
  • Chinese CER: 1.20
  • Chinese SIM: 78.82

These values come from the original MOSS-TTS Local model release and are included as upstream reference metrics. Actual quality depends on runtime, decoding settings, and reference audio quality.

Technical Details

Model Specs

  • Architecture: MOSS-TTS Local Transformer
  • Global Hidden Size: 2048
  • Local Hidden Size: 1536
  • Backbone Layers: 28
  • KV Heads: 8
  • Head Dim: 128
  • RVQ Codebooks: 32
  • Text Vocabulary: 155648
  • Audio Vocabulary: 1024
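As a worked example of what these specs imply for memory, the per-token KV cache of the backbone follows directly from the layer count, KV heads, and head dim, assuming float32 storage as in the core graphs:

```python
# Backbone KV cache per cached token, assuming float32 (4 bytes):
# layers x (K and V) x KV heads x head dim.
layers, kv_heads, head_dim, bytes_per_val = 28, 8, 128, 4
per_token = layers * 2 * kv_heads * head_dim * bytes_per_val
print(per_token)  # 229376 bytes per token

# A 4096-token context therefore caches:
print(per_token * 4096 // 2**20)  # 896 MiB
```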

Audio Specs

  • Sample Rate: 24000 Hz
  • Channels: Mono
  • Codec Rate: 12.5 audio frames per second
  • Encoder Stride: 1920 samples per audio frame
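These numbers are mutually consistent: the codec rate is just the sample rate divided by the encoder stride. A quick check, plus the decode-step count that rate implies for a given target duration:

```python
import math

# From the specs above: 24 kHz audio, 1920-sample encoder stride.
sample_rate, stride = 24000, 1920
frames_per_second = sample_rate / stride
print(frames_per_second)  # 12.5

# One 33-channel token block is emitted per frame, so a 10-second
# utterance needs this many decode steps:
steps_10s = math.ceil(10 * frames_per_second)
print(steps_10s)  # 125
```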

ONNX Specifications

  • Opset Version: 18
  • Model Type: moss_tts_local_onnx
  • External Data: Large graphs use .onnx.data / .data sidecar files
  • Runtime Layout: Mixed export with float32 core graphs and an int8 weight-quantized decoder.onnx
  • Bundle Size: Approximately 25 GB on disk

Limitations

  • This bundle covers the MOSS-TTS Local single-speaker synthesis path, not the dialogue, realtime, voice-design, or sound-effect family members
  • Voice cloning quality depends strongly on clean reference audio
  • Long-form quality and stability depend on decoding settings and target duration
  • GPU acceleration depends on ONNX Runtime execution provider support and compatible hardware

Ethical Considerations

This model should be used responsibly. Be aware that:

  • Voice cloning raises ethical concerns around consent and impersonation
  • Generated content should not be used to deceive, defraud, or harm others
  • Reference audio should only be used with proper authorization

License

Apache 2.0

Citation

If you use this ONNX export, please cite the original MOSS-TTS work and model release from MOSI.AI / OpenMOSS Team.

Original Model

This ONNX export is based on OpenMOSS-Team/MOSS-TTS-Local-Transformer.
