Li-Ruixiao's picture
modify readme (#2)
66562e3
|
raw
history blame
8.75 kB
metadata
license: apache-2.0
library_name: transformers
tags:
  - audio
  - audio-tokenizer
  - neural-codec
  - moss-tts-family
  - MOSS Audio Tokenizer
  - speech-tokenizer
  - trust-remote-code

MossAudioTokenizer

MOSS Audio Tokenizer is a unified audio tokenizer designed to achieve both high-fidelity reconstruction and semantically rich representations across speech, sound, and music. Built on the Cat (Causal Audio Tokenizer with Transformer) architecture, the model scales to 1.6 billion parameters and was trained on 3 million hours of audio, surpassing previous open-source tokenizers in reconstruction quality across all bitrates. It processes 24 kHz audio at a low 12.5 Hz frame rate, with all components—including the encoder, quantizer, decoder, decoder-only LLM, and discriminator—optimized jointly in an end-to-end manner. Featuring a 32-layer residual vector quantizer (RVQ) with variable-bitrate support, it provides a scalable, native foundation for the next generation of autoregressive audio foundation models.

This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers transformers.models.moss_audio_tokenizer module. It is intended to be uploaded to a Hugging Face Hub model repository and loaded with trust_remote_code=True when needed.



Architecture of MossAudioTokenizer


Usage

Quickstart

import torch
from transformers import AutoModel
import torchaudio

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

wav, sr = torchaudio.load('demo/demo_gt.wav')
if sr != model.sampling_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
wav = wav.unsqueeze(0)
enc = model.encode(wav, return_dict=True)
print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
dec = model.decode(enc.audio_codes, return_dict=True)
print(f"dec.audio.shape: {dec.audio.shape}")
wav = dec.audio.squeeze(0)
torchaudio.save("demo/demo_rec.wav", wav, sample_rate=model.sampling_rate)

# Decode using only the first 8 layers of the RVQ
dec_rvq8 = model.decode(enc.audio_codes[:8], return_dict=True)
wav_rvq8 = dec_rvq8.audio.squeeze(0)
torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)

Streaming

MossAudioTokenizerModel.encode and MossAudioTokenizerModel.decode support simple streaming via a chunk_duration argument.

  • chunk_duration is expressed in seconds.
  • It must be <= MossAudioTokenizerConfig.causal_transformer_context_duration.
  • chunk_duration * MossAudioTokenizerConfig.sampling_rate must be divisible by MossAudioTokenizerConfig.downsample_rate.
  • Streaming chunking only supports batch_size=1.
import torch
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
audio = torch.randn(1, 1, 3200)  # dummy waveform

# 0.08s @ 24kHz = 1920 samples, divisible by downsample_rate=1920
enc = model.encode(audio, return_dict=True, chunk_duration=0.08)
dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)

Repository layout

  • configuration_moss_audio_tokenizer.py
  • modeling_moss_audio_tokenizer.py
  • __init__.py
  • config.json
  • model weights

Evaluation Metrics

The table below compares the reconstruction quality of open-source audio tokenizers with MossAudioTokenizer on speech and audio/music data.

  • Speech metrics are evaluated on LibriSpeech test-clean (English) and AISHELL-2 (Chinese), reported as EN/ZH.
  • Audio metrics are evaluated on the AudioSet evaluation subset, while music metrics are evaluated on MUSDB, reported as audio/music.
  • STFT-Dist. denotes the STFT distance.
  • Higher is better for speech metrics, while lower is better for audio/music metrics (Mel-Loss, STFT-Dist.).
  • Nq denotes the number of quantizers.
Model bps Frame rate Nq Speech: SIM ↑ (EN/ZH) Speech: STOI ↑ (EN/ZH) Speech: PESQ-NB ↑ (EN/ZH) Speech: PESQ-WB ↑ (EN/ZH) Audio/Music: Mel-Loss ↓ Audio/Music: STFT-Dist. ↓
XCodec2.0 800 50 1 0.82 / 0.74 0.92 / 0.86 3.04 / 2.46 2.43 / 1.96 -- / -- -- / --
MiMo Audio Tokenizer 850 25 4 0.80 / 0.74 0.91 / 0.87 2.94 / 2.62 2.39 / 2.14 0.82 / 0.81 2.33 / 2.23
Higgs Audio Tokenizer 1000 25 4 0.77 / 0.68 0.83 / 0.82 3.03 / 2.61 2.48 / 2.14 0.83 / 0.80 2.20 / 2.05
SpeechTokenizer 1000 50 2 0.36 / 0.25 0.77 / 0.68 1.59 / 1.38 1.25 / 1.17 -- / -- -- / --
XY-Tokenizer 1000 12.5 8 0.85 / 0.79 0.92 / 0.87 3.10 / 2.63 2.50 / 2.12 -- / -- -- / --
BigCodec 1040 80 1 0.84 / 0.69 0.93 / 0.88 3.27 / 2.55 2.68 / 2.06 -- / -- -- / --
Mimi 1100 12.5 8 0.74 / 0.59 0.91 / 0.85 2.80 / 2.24 2.25 / 1.78 1.24 / 1.19 2.62 / 2.49
MOSS Audio Tokenizer (Ours) 750 12.5 6 0.82 / 0.75 0.93 / 0.89 3.14 / 2.73 2.60 / 2.22 0.86 / 0.85 2.21 / 2.10
MOSS Audio Tokenizer (Ours) 1000 12.5 8 0.88 / 0.81 0.94 / 0.91 3.38 / 2.96 2.87 / 2.43 0.82 / 0.80 2.16 / 2.04
DAC 1500 75 2 0.48 / 0.41 0.83 / 0.79 1.87 / 1.67 1.48 / 1.37 -- / -- -- / --
Encodec 1500 75 2 0.60 / 0.45 0.85 / 0.81 1.94 / 1.80 1.56 / 1.48 1.12 / 1.04 2.60 / 2.42
Higgs Audio Tokenizer 2000 25 8 0.90 / 0.83 0.85 / 0.85 3.59 / 3.22 3.11 / 2.73 0.74 / 0.70 2.07 / 1.92
SpeechTokenizer 2000 50 4 0.66 / 0.50 0.88 / 0.80 2.38 / 1.79 1.92 / 1.49 -- / -- -- / --
Qwen3 TTS Tokenizer 2200 12.5 16 0.95 / 0.88 0.96 / 0.93 3.66 / 3.10 3.19 / 2.62 -- / -- -- / --
MiMo Audio Tokenizer 2250 25 12 0.89 / 0.83 0.95 / 0.92 3.57 / 3.25 3.05 / 2.71 0.70 / 0.68 2.21 / 2.10
Mimi 2475 12.5 18 0.89 / 0.76 0.94 / 0.91 3.49 / 2.90 2.97 / 2.35 1.10 / 1.06 2.45 / 2.32
MOSS Audio Tokenizer (Ours) 1500 12.5 12 0.92 / 0.86 0.95 / 0.93 3.64 / 3.27 3.20 / 2.74 0.77 / 0.74 2.08 / 1.96
MOSS Audio Tokenizer (Ours) 2000 12.5 16 0.95 / 0.89 0.96 / 0.94 3.78 / 3.46 3.41 / 2.96 0.73 / 0.70 2.03 / 1.90
DAC 3000 75 4 0.74 / 0.67 0.90 / 0.88 2.76 / 2.47 2.31 / 2.07 0.86 / 0.83 2.23 / 2.10
MiMo Audio Tokenizer 3650 25 20 0.91 / 0.85 0.95 / 0.93 3.73 / 3.44 3.25 / 2.89 0.66 / 0.65 2.17 / 2.06
SpeechTokenizer 4000 50 8 0.85 / 0.69 0.92 / 0.85 3.05 / 2.20 2.60 / 1.87 -- / -- -- / --
Mimi 4400 12.5 32 0.94 / 0.83 0.96 / 0.94 3.80 / 3.31 3.43 / 2.78 1.02 / 0.98 2.34 / 2.21
Encodec 4500 75 6 0.86 / 0.75 0.92 / 0.91 2.91 / 2.63 2.46 / 2.15 0.91 / 0.84 2.33 / 2.17
DAC 6000 75 8 0.89 / 0.84 0.95 / 0.94 3.75 / 3.57 3.41 / 3.20 0.65 / 0.63 1.97 / 1.87
MOSS Audio Tokenizer (Ours) 3000 12.5 24 0.96 / 0.92 0.97 / 0.96 3.90 / 3.64 3.61 / 3.20 0.69 / 0.66 1.98 / 1.84
MOSS Audio Tokenizer (Ours) 4000 12.5 32 0.97 / 0.93 0.97 / 0.96 3.95 / 3.71 3.69 / 3.30 0.68 / 0.64 1.96 / 1.82

LibriSpeech Speech Metrics (MOSS Audio Tokenizer vs. Open-source Tokenizers)

The plots below compare our MOSS Audio Tokenizer model with other open-source speech tokenizers on the LibriSpeech dataset, evaluated with SIM, STOI, PESQ-NB, and PESQ-WB (higher is better). We control the bps of the same model by adjusting the number of RVQ codebooks used during inference.

SIM
STOI
PESQ-NB
PESQ-WB

Citation

If you use this code or result in your paper, please cite our work as: