MOSS-TTS-Local-Transformer-ONNX

Overview

MOSS-TTS-Local-Transformer-ONNX is an ONNX-optimized export of the MOSS-TTS Local architecture. It is designed for high-fidelity zero-shot voice cloning, multilingual speech synthesis, and cross-platform inference with ONNX Runtime.

This bundle targets the MOSS-TTS Local 1.7B model family and is packaged as a production-oriented ONNX layout for CPU and GPU-backed runtimes. The current export keeps the main acoustic stack in float32 and uses a weight-quantized decoder graph to reduce size while preserving voice cloning quality.

Model Description

MOSS-TTS is the flagship TTS family from MOSI.AI and the OpenMOSS team. The Local Transformer variant is the compact 1.7B architecture in that family, built around a global latent backbone plus a lightweight local autoregressive decoder over RVQ audio tokens.

Compared with the larger Delay-pattern model, the Local variant is smaller, easier to evaluate, and strong on objective benchmarks, while still supporting:

  • Zero-shot voice cloning from short reference audio
  • Multilingual synthesis and code-switching
  • Long-form speech generation
  • Production-oriented speech quality and speaker consistency

This ONNX export focuses on the MOSS-TTS Local single-speaker synthesis path.

Key Features

  • Zero-Shot Voice Cloning: Clone speaker timbre from short reference audio without speaker-specific fine-tuning
  • 20-Language Coverage: Supports multilingual and code-switched synthesis across the released MOSS-TTS language set
  • Local Transformer Architecture: Compact 1.7B model with a global backbone plus a local token block generator
  • Cross-Platform Inference: CPU, NVIDIA CUDA, and AMD ROCm execution are supported through ONNX Runtime
  • Compatibility-Oriented ONNX Bundle: Mixed export with float32 core graphs and a quantized decoder graph

Architecture

The ONNX export uses the MOSS-TTS Local architecture:

  • Embeddings: 33 channels per step, with 1 text stream and 32 audio RVQ codebooks
  • Backbone: 28-layer Qwen-style transformer with hidden size 2048
  • Local Transformer: 1536-dim local autoregressive transformer for block-wise token generation
  • Audio Tokenizer: RVQ-32 encoder/decoder pair operating at 24 kHz audio

Supported Languages

Chinese, English, German, Spanish, French, Japanese, Italian, Hebrew, Korean, Russian, Persian (Farsi), Arabic, Polish, Portuguese, Czech, Danish, Swedish, Hungarian, Greek, Turkish

Installation

Use any ONNX Runtime-compatible inference stack that supports multi-file ONNX models with external data.

This export was validated with:

  • ONNX Runtime 1.22.x
  • CPU inference
  • AMD ROCm execution provider
  • NVIDIA CUDA execution provider (supported through the runtime integration)
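When loading the graphs yourself, execution providers are selected by passing a priority list to the session. The helper below is a hypothetical sketch (the function name and flags are not part of this bundle); the provider identifiers themselves are the standard ONNX Runtime names, and the result is intended for the `providers` argument of `onnxruntime.InferenceSession`:

```python
def provider_list(use_cuda=False, use_rocm=False, device_id=0):
    """Build an ONNX Runtime execution-provider priority list.

    Hypothetical helper: mirrors the CPU/CUDA/ROCm choices above.
    CPU is always kept last as the fallback provider.
    """
    providers = []
    if use_cuda:
        providers.append(("CUDAExecutionProvider", {"device_id": device_id}))
    if use_rocm:
        providers.append(("ROCMExecutionProvider", {"device_id": device_id}))
    providers.append("CPUExecutionProvider")
    return providers

print(provider_list(use_rocm=True, device_id=0))
```

The same list shape works for every graph in the bundle, since all eight components are loaded as ordinary ONNX Runtime sessions.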

Quick Start

This bundle is designed to be consumed directly by an ONNX Runtime application. One supported reference runtime is onnx-server:

onnx-server tts-moss \
    --text "Hello, this is a voice cloning test." \
    --ref reference.wav \
    --output output.wav

For AMD ROCm:

onnx-server tts-moss \
    --text "Hello, this is a voice cloning test." \
    --ref reference.wav \
    --output output.wav \
    --rocm \
    --rocm-device-id 0

Command Line Usage

onnx-server tts-moss \
    --text "Your text here" \
    --ref reference.wav \
    --output output.wav

Parameters

  • --text: Text to synthesize (required)
  • --ref: Reference audio for voice cloning (optional; required when cloning a voice)
  • --output: Output WAV file path
  • --max-seconds: Maximum output duration in seconds
  • --text-temp: Text-token sampling temperature
  • --text-top-k: Text-token top-k sampling
  • --audio-temp: Audio-token sampling temperature
  • --audio-top-k: Audio-token top-k sampling
  • --audio-top-p: Audio-token top-p sampling
  • --audio-rep-penalty: Audio repetition penalty
  • --seed: Random seed for deterministic sampling
  • --cuda: Enable NVIDIA CUDA execution provider
  • --device-id: CUDA device ID
  • --rocm: Enable AMD ROCm execution provider
  • --rocm-device-id: ROCm device ID
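The sampling flags above compose in the usual order: repetition penalty, temperature, top-k, then top-p. The following is a hypothetical reference implementation of those knobs (the actual runtime's sampler may differ in detail), shown for a single logits vector:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0,
                 rep_penalty=1.0, history=(), seed=None):
    """Sample one token id from raw logits, applying the CLI-style knobs."""
    rng = np.random.default_rng(seed)
    logits = logits.astype(np.float64).copy()
    # --audio-rep-penalty: dampen logits of tokens already emitted
    for t in set(history):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    # --audio-temp / --text-temp: sharpen or flatten the distribution
    logits /= max(temperature, 1e-6)
    # --audio-top-k / --text-top-k: keep only the k highest logits
    if top_k > 0:
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # --audio-top-p: nucleus sampling over the sorted distribution
    if top_p < 1.0:
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[cum - probs[order] < top_p]
        nucleus = np.zeros_like(probs)
        nucleus[keep] = probs[keep]
        probs = nucleus / nucleus.sum()
    return int(rng.choice(len(probs), p=probs))

print(sample_token(np.array([0.0, 0.0, 10.0]), top_k=1))  # 2
```

With --seed fixed, the whole chain is deterministic, which is why the flag enables reproducible synthesis.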

Model Components

The ONNX export includes 8 runtime components:

  • embeddings.onnx: multi-stream token embedding lookup; input (B, T, 33) token IDs, output (B, T, 2048) global embeddings
  • backbone_prefill.onnx: prompt prefill for the global backbone; input embeddings, mask, positions; output hidden states, KV cache
  • backbone_decode.onnx: single-step autoregressive backbone decode; input step embedding, mask, position, KV cache; output hidden state, updated KV cache
  • global_to_local.onnx: global-to-local projection; input global hidden and previous audio tokens; output local latent and projected audio features
  • token_to_local.onnx: per-channel local token projection; input 33-channel token block; output local token embeddings
  • local_trm.onnx: local autoregressive token block model; input (1, 33, 1536) local sequence; output text logits and audio logits
  • encoder.onnx: reference audio tokenizer; input waveform; output RVQ-32 audio codes
  • decoder.onnx: audio waveform decoder; input RVQ-32 audio codes; output 24 kHz mono waveform
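These components chain into a prefill-then-decode loop. The sketch below uses hypothetical shape-only stand-ins for the graphs (a real pipeline would run each through onnxruntime.InferenceSession; token_to_local, encoder, and decoder are omitted) to illustrate the intended dataflow:

```python
import numpy as np

# Shape-only stand-ins for the ONNX graphs listed above.
HIDDEN, LOCAL_DIM, CHANNELS = 2048, 1536, 33

def embeddings(tokens):            # (B, T, 33) ids -> (B, T, 2048)
    return np.zeros((*tokens.shape[:2], HIDDEN), dtype=np.float32)

def backbone_prefill(embs):        # prompt pass -> last hidden + KV cache
    return embs[:, -1], {"kv": embs}

def backbone_decode(hidden, kv):   # one autoregressive backbone step
    return hidden, kv

def global_to_local(hidden):       # project into the local transformer
    return np.zeros((1, CHANNELS, LOCAL_DIM), dtype=np.float32)

def local_trm(local_seq):          # emit one 33-channel token block
    return np.zeros((1, CHANNELS), dtype=np.int64)

def decode_loop(prompt_ids, steps):
    hidden, kv = backbone_prefill(embeddings(prompt_ids))
    blocks = []
    for _ in range(steps):
        hidden, kv = backbone_decode(hidden, kv)
        blocks.append(local_trm(global_to_local(hidden)))
    return np.concatenate(blocks, axis=0)  # (steps, 33) token blocks

blocks = decode_loop(np.zeros((1, 4, CHANNELS), dtype=np.int64), steps=5)
print(blocks.shape)  # (5, 33)
```

The accumulated token blocks are what decoder.onnx finally turns into the 24 kHz waveform.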

Performance

  • CPU: supported; the full inference path runs on CPU
  • NVIDIA GPU: supported; backbone and local stack run on CUDA-enabled ONNX Runtime builds
  • AMD GPU: supported; validated with ROCm for the backbone and local stack

Voice Cloning Quality

Upstream MOSS-TTS Local benchmark results reported on Seed-TTS-eval:

  • English WER: 1.85
  • English SIM: 73.42
  • Chinese CER: 1.20
  • Chinese SIM: 78.82

These values come from the original MOSS-TTS Local model release and are included as upstream reference metrics. Actual quality depends on runtime, decoding settings, and reference audio quality.

Technical Details

Model Specs

  • Architecture: MOSS-TTS Local Transformer
  • Global Hidden Size: 2048
  • Local Hidden Size: 1536
  • Backbone Layers: 28
  • KV Heads: 8
  • Head Dim: 128
  • RVQ Codebooks: 32
  • Text Vocabulary: 155648
  • Audio Vocabulary: 1024
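As a worked example of what these specs imply for memory, the per-token KV cache of the backbone follows directly from the layer count, KV heads, and head dim, assuming float32 storage as in the core graphs:

```python
# Backbone KV cache per cached token, assuming float32 (4 bytes):
# layers x (K and V) x KV heads x head dim.
layers, kv_heads, head_dim, bytes_per_val = 28, 8, 128, 4
per_token = layers * 2 * kv_heads * head_dim * bytes_per_val
print(per_token)  # 229376 bytes per token

# A 4096-token context therefore caches:
print(per_token * 4096 // 2**20)  # 896 MiB
```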

Audio Specs

  • Sample Rate: 24000 Hz
  • Channels: Mono
  • Codec Rate: 12.5 audio frames per second
  • Encoder Stride: 1920 samples per audio frame
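These numbers are mutually consistent: the codec rate is just the sample rate divided by the encoder stride. A quick check, plus the decode-step count that rate implies for a given target duration:

```python
import math

# From the specs above: 24 kHz audio, 1920-sample encoder stride.
sample_rate, stride = 24000, 1920
frames_per_second = sample_rate / stride
print(frames_per_second)  # 12.5

# One 33-channel token block is emitted per frame, so a 10-second
# utterance needs this many decode steps:
steps_10s = math.ceil(10 * frames_per_second)
print(steps_10s)  # 125
```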

ONNX Specifications

  • Opset Version: 18
  • Model Type: moss_tts_local_onnx
  • External Data: Large graphs use .onnx.data / .data sidecar files
  • Runtime Layout: Mixed export with float32 core graphs and an int8 weight-quantized decoder.onnx
  • Bundle Size: Approximately 25 GB on disk

Limitations

  • This bundle covers the MOSS-TTS Local single-speaker synthesis path, not the dialogue, realtime, voice-design, or sound-effect family members
  • Voice cloning quality depends strongly on clean reference audio
  • Long-form quality and stability depend on decoding settings and target duration
  • GPU acceleration depends on ONNX Runtime execution provider support and compatible hardware

Ethical Considerations

This model should be used responsibly. Be aware that:

  • Voice cloning raises ethical concerns around consent and impersonation
  • Generated content should not be used to deceive, defraud, or harm others
  • Reference audio should only be used with proper authorization

License

Apache 2.0

Citation

If you use this ONNX export, please cite the original MOSS-TTS work and model release from MOSI.AI / OpenMOSS Team.

Original Model

This ONNX export is based on OpenMOSS-Team/MOSS-TTS-Local-Transformer.
