---
license: apache-2.0
library_name: transformers
tags:
- audio
- audio-tokenizer
- neural-codec
- moss-tts-family
- MOSS Audio Tokenizer
- speech-tokenizer
- trust-remote-code
---

# MossAudioTokenizer

This is the code for MOSS-Audio-Tokenizer presented in [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://arxiv.org/abs/2602.10934).

**MOSS-Audio-Tokenizer** is a unified discrete audio tokenizer based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture. Scaled to 1.6 billion parameters, it functions as a unified discrete interface, delivering both lossless-quality reconstruction and high-level semantic alignment.

**Key Features:**

* **Extreme Compression & Variable Bitrate**: Compresses 48 kHz stereo audio to a remarkably low frame rate of 12.5 Hz. With a 32-layer Residual LFQ quantizer stack, it supports high-fidelity reconstruction across a wide range of bitrates.
* **Pure Transformer Architecture**: A "CNN-free", homogeneous architecture built entirely from causal Transformer blocks. With 1.6B combined parameters (encoder + decoder), it ensures strong scalability and supports low-latency streaming inference.
* **Large-Scale General Audio Training**: Trained on 3 million hours of diverse audio data, the model excels at encoding and reconstructing all audio domains, including speech, sound effects, and music.
* **Unified Semantic-Acoustic Representation**: While achieving state-of-the-art reconstruction quality, Cat produces semantically rich discrete tokens, making them ideal for downstream tasks such as speech understanding (ASR) and generation (TTS).
* **Fully Trained From Scratch**: Cat does not rely on any pretrained encoders (such as HuBERT or Whisper) or on distillation from teacher models. All representations are learned autonomously from raw data.
* **End-to-End Joint Optimization**: All components (encoder, quantizer, decoder, discriminator, and a decoder-only LLM for semantic alignment) are optimized jointly in a single unified training pipeline.

**Summary:** By combining a simple, scalable architecture with massive-scale data, the Cat architecture overcomes the bottlenecks of traditional audio tokenizers. It provides a robust, high-fidelity, and semantically grounded interface for the next generation of native audio foundation models.

This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers `transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository and loaded with `trust_remote_code=True` when needed.

## Usage

### Quickstart

```python
import torchaudio
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

# Load the waveform and resample to the model's sampling rate.
wav, sr = torchaudio.load("demo/demo_gt.wav")
if sr != model.sampling_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)

# Match the expected channel count: duplicate mono, truncate extra channels.
if wav.shape[0] == 1:
    wav = wav.repeat(model.config.number_channels, 1)
else:
    wav = wav[: model.config.number_channels]
wav = wav.unsqueeze(0)  # add a batch dimension: (1, C, T)

enc = model.encode(wav, return_dict=True)
print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")

dec = model.decode(enc.audio_codes, return_dict=True)
print(f"dec.audio.shape: {dec.audio.shape}")

wav = dec.audio.squeeze(0)
torchaudio.save("demo/demo_rec.wav", wav, sample_rate=model.sampling_rate)

# Decode using only the first 8 residual quantizer layers.
dec_rvq8 = model.decode(enc.audio_codes[:8], return_dict=True)
wav_rvq8 = dec_rvq8.audio.squeeze(0)
torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
```

### Attention Backend And Compute Dtype

`config.attention_implementation` controls whether transformer layers prefer `sdpa` or `flash_attention_2`.
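Since `flash_attention_2` depends on an optional package that may not be installed, a small fallback helper can be convenient. This is an illustrative sketch only: `pick_attention_backend` is a hypothetical name, and it assumes `set_attention_implementation` raises when the requested backend is unavailable, which should be verified against the actual model.

```python
def pick_attention_backend(model, preferred="flash_attention_2", fallback="sdpa"):
    """Try the preferred attention backend; fall back if setting it fails.

    Assumes set_attention_implementation raises when the backend is
    unavailable (an assumption -- verify against the real model).
    """
    try:
        model.set_attention_implementation(preferred)
        return preferred
    except Exception:
        model.set_attention_implementation(fallback)
        return fallback


# Tiny stub standing in for the real model so the helper can be exercised
# without downloading weights; this stub only accepts "sdpa".
class FakeModel:
    def set_attention_implementation(self, impl):
        if impl != "sdpa":
            raise RuntimeError(f"{impl} unavailable")


print(pick_attention_backend(FakeModel()))  # sdpa
```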
`config.compute_dtype` controls the non-quantizer autocast dtype and supports `fp32`, `bf16`, and `fp16`.

```python
model.set_attention_implementation("flash_attention_2")
model.set_compute_dtype("fp16")
```

The quantizer always runs in fp32.

### Streaming

`MossAudioTokenizerModel.encode`, `decode`, `batch_encode`, and `batch_decode` all support streaming through a `chunk_duration` argument.

- `chunk_duration` is expressed in seconds.
- `chunk_duration * MossAudioTokenizerConfig.sampling_rate` must be divisible by `MossAudioTokenizerConfig.downsample_rate`.
- Streaming batch inference is supported.
- The public waveform interface expects stereo inputs shaped `(2, T)` or batched stereo inputs shaped `(B, 2, T)`.

```python
import torch
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

# Dummy stereo waveform: 6.0 s @ 48 kHz = 288000 samples, divisible by downsample_rate=3840.
audio = torch.randn(2, 48000 * 6)

enc = model.encode(audio.unsqueeze(0), return_dict=True, chunk_duration=0.08)
dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)

batch_enc = model.batch_encode([audio, audio[:, : 48000 * 3]], chunk_duration=0.08)
codes_list = [
    batch_enc.audio_codes[:, i, : batch_enc.audio_codes_lengths[i]]
    for i in range(batch_enc.audio_codes.shape[1])
]
batch_dec = model.batch_decode(codes_list, chunk_duration=0.08)
```

#### Continuous Batch Streaming Decode

For decoder-side continuous batching, prefer `batch_decode(..., streaming=True, ...)`.

- The first streaming call may pass `max_batch_size=...`. If it is omitted, the first batch size reserves the fixed-slot decoder budget for that public stream.
- Same-size calls continue the existing logical rows in order.
- If a later call is larger, the new rows are admitted by tail append.
- `finalize_indices` means "decode these rows one last time, then evict them".
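The row lifecycle described above can be illustrated with a pure-Python sketch. `StreamRows` is a hypothetical toy for exposition only: it models the logical bookkeeping (slot budget, tail append, finalize-then-evict), not the model's actual decoder state.

```python
class StreamRows:
    """Toy bookkeeping for decoder-side logical rows (illustration only)."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.rows = []        # logical row ids, oldest first
        self._next_id = 0

    def step(self, batch_size, finalize_indices=()):
        # Rows beyond the current survivors are admitted by tail append.
        while len(self.rows) < batch_size:
            self.rows.append(self._next_id)
            self._next_id += 1
        if len(self.rows) > self.max_batch_size:
            raise ValueError("exceeds the reserved fixed-slot budget")
        decoded = list(self.rows)  # every row decodes on this call
        # Finalized rows are evicted against the pre-call logical order.
        for i in sorted(finalize_indices, reverse=True):
            del self.rows[i]
        return decoded


stream = StreamRows(max_batch_size=3)
print(stream.step(2))                        # [0, 1]    rows A, B
print(stream.step(3))                        # [0, 1, 2] C tail-appended
print(stream.step(3, finalize_indices=[0]))  # [0, 1, 2] A decodes, then is evicted
print(stream.step(2))                        # [1, 2]    survivors B, C continue
```

Note how row 0 still decodes in the call that finalizes it, and only the survivors remain for the following call.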
  The indices are interpreted against the pre-call logical order.
- After a finalize call returns, the next streaming call may use the smaller survivor batch.
- `reset_stream=True` discards the hidden public streaming state and starts a fresh stream.

Milestone 1 boundaries:

- decode-only continuous batching
- one active streaming decode state per model instance
- fixed-slot decoder reservation from `max_batch_size`
- no encode-side continuous batching
- no physical compaction of surviving decode slots
- no multi-session concurrency on one model instance

```python
import torch
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

num_quantizers = model.config.quantizer_kwargs["num_quantizers"]
codes_a0 = torch.randint(0, 8, (num_quantizers, 2))
codes_b0 = torch.randint(0, 8, (num_quantizers, 3))
codes_a1 = torch.randint(0, 8, (num_quantizers, 2))
codes_b1 = torch.randint(0, 8, (num_quantizers, 2))
codes_c0 = torch.randint(0, 8, (num_quantizers, 1))
codes_a2 = torch.randint(0, 8, (num_quantizers, 1))
codes_b2 = torch.randint(0, 8, (num_quantizers, 2))
codes_c1 = torch.randint(0, 8, (num_quantizers, 2))
codes_b3 = torch.randint(0, 8, (num_quantizers, 1))
codes_c2 = torch.randint(0, 8, (num_quantizers, 1))

# First call reserves 3 fixed decoder slots for A and B.
out_ab0 = model.batch_decode(
    [codes_a0, codes_b0],
    streaming=True,
    max_batch_size=3,
    reset_stream=True,
)

# Same logical rows continue in-order; C is a tail append.
out_abc1 = model.batch_decode(
    [codes_a1, codes_b1, codes_c0],
    streaming=True,
)

# Finalize A against the pre-call logical order. A still decodes in this call,
# then is evicted immediately afterward.
out_abc2 = model.batch_decode(
    [codes_a2, codes_b2, codes_c1],
    streaming=True,
    finalize_indices=[0],
)

# The next call can shrink to the surviving logical rows only.
out_bc3 = model.batch_decode(
    [codes_b3, codes_c2],
    streaming=True,
)
```

## Repository layout

- `configuration_moss_audio_tokenizer.py`
- `modeling_moss_audio_tokenizer.py`
- `__init__.py`
- `config.json`
- model weights

## Citation

If you use this code or its results in your paper, please cite our work as:

```tex
```