---
license: apache-2.0
library_name: transformers
tags:
  - audio
  - audio-tokenizer
  - neural-codec
  - moss-tts-family
  - MOSS Audio Tokenizer Nano
  - speech-tokenizer
  - trust-remote-code
---

# MOSS-Audio-Tokenizer-Nano

This repository contains the Hugging Face remote-code implementation and weights for **MOSS-Audio-Tokenizer-Nano**, the lightweight audio tokenizer used by **MOSS-TTS-Nano**.

MOSS-Audio-Tokenizer-Nano is a compact discrete audio tokenizer based on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture from [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://arxiv.org/abs/2602.10934). The checkpoint in this repository has **21,969,664 parameters** (approximately **22M**), making it much smaller than the full-size MOSS-Audio-Tokenizer while preserving the 48 kHz stereo tokenizer interface used by the MOSS-TTS family.

## Key Features

- **Small model size**: approximately **22M parameters**, including about 10.45M encoder parameters, 10.45M decoder parameters, and 1.07M quantizer parameters.
- **Native high-resolution audio**: supports **48 kHz** input and output with **2-channel stereo** audio, helping reduce compression loss and improve listening quality.
- **Low-frame-rate discrete codes**: compresses 48 kHz stereo audio into a **12.5 Hz** token stream with a downsample rate of 3,840 samples per frame (48,000 / 3,840 = 12.5).
- **Variable bitrate reconstruction**: uses a residual quantizer stack with **16 codebooks** and 1,024 entries per codebook. Each codebook contributes about **0.125 kbps**, for an inference range from **0.125 kbps to 2 kbps**.
- **Transformer-based tokenizer**: uses causal Transformer blocks and supports low-latency streaming encode/decode.
- **MOSS-TTS family interface**: designed as the audio tokenizer backbone for MOSS-TTS-Nano and compatible MOSS-TTS-family workflows.
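
The frame rate and bitrate figures in the list above follow from simple arithmetic. The sketch below reproduces them, taking the downsample rate to be 3,840 samples per frame (the value that yields 12.5 Hz at 48 kHz). The constant and function names are illustrative and are not part of the model's API.

```python
import math

SAMPLING_RATE = 48_000   # Hz, from the feature list
DOWNSAMPLE_RATE = 3_840  # samples per frame (assumed; gives 12.5 Hz at 48 kHz)
CODEBOOK_SIZE = 1_024    # entries per codebook
NUM_CODEBOOKS = 16       # depth of the residual quantizer stack


def frame_rate_hz(sampling_rate=SAMPLING_RATE, downsample_rate=DOWNSAMPLE_RATE):
    """Number of token frames produced per second of audio."""
    return sampling_rate / downsample_rate


def bitrate_bps(num_codebooks, codebook_size=CODEBOOK_SIZE):
    """Bits per second when only the first `num_codebooks` codebooks are kept."""
    bits_per_code = math.log2(codebook_size)  # 10 bits for 1,024 entries
    return num_codebooks * bits_per_code * frame_rate_hz()


# Bitrates for the codebook counts that appear in the evaluation table.
for n in (1, 6, 8, 12, NUM_CODEBOOKS):
    print(f"{n:2d} codebooks: {bitrate_bps(n):7.1f} bps")
```

Each codebook adds 10 bits per frame at 12.5 frames per second, which is where the 0.125 kbps step and the 2 kbps ceiling come from.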

**Summary:**
By combining a compact causal Transformer tokenizer with native 48 kHz stereo modeling, MOSS-Audio-Tokenizer-Nano reduces the deployment cost of the MOSS audio tokenizer interface while keeping high-fidelity reconstruction for speech, general audio, and music. It provides a lightweight, low-frame-rate, and streaming-friendly discrete audio representation for MOSS-TTS-Nano and other real-time speech generation workflows.

This repository contains a lightweight remote-code implementation that mirrors the current Hugging Face Transformers `transformers.models.moss_audio_tokenizer` module. Load it with `trust_remote_code=True` when needed.

## Evaluation Metrics

The table below compares the reconstruction quality of MOSS-Audio-Tokenizer-Nano with open-source audio tokenizers with **no more than 120M parameters** on speech, audio, and music data. MOSS-Audio-Tokenizer-Nano keeps one of the smallest model sizes in the comparison while supporting **48 kHz stereo** reconstruction.

- Speech metrics are evaluated on LibriSpeech test-clean (English) and AISHELL-2 (Chinese), reported as EN/ZH.
- Audio metrics are evaluated on the AudioSet evaluation subset, while music metrics are evaluated on MUSDB, reported as audio/music.
- STFT-Dist. denotes the STFT distance.
- Higher is better for speech metrics, while lower is better for audio/music metrics (Mel-Loss, STFT-Dist.).
- Ch. denotes the number of input/output channels supported by the audio tokenizer: `ch=1` means mono audio, and `ch=2` means stereo audio.
- Nvq denotes the number of quantizers.

| Model | Params (M) | Sample rate | Ch. | bps | Nvq | Speech: SIM ↑ (EN/ZH) | Speech: STOI ↑ (EN/ZH) | Speech: PESQ-NB ↑ (EN/ZH) | Speech: PESQ-WB ↑ (EN/ZH) | Audio/Music: Mel-Loss ↓ | Audio/Music: STFT-Dist. ↓ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Mimi VAE** | 28 | 24k | 1 | -- | -- | 0.75 / 0.54 | 0.91 / 0.83 | 2.92 / 2.20 | 2.30 / 1.73 | 1.35 / 1.31 | 2.70 / 2.59 |
| **DAC** | 77 | 44.1k | 1 | 861 | 1 | 0.30 / 0.20 | 0.76 / 0.68 | 1.55 / 1.36 | 1.24 / 1.15 | 1.25 / 1.18 | 2.71 / 2.54 |
| **SpeechTokenizer** | 120 | 16k | 1 | 1000 | 2 | 0.36 / 0.25 | 0.77 / 0.68 | 1.59 / 1.38 | 1.25 / 1.17 | -- / -- | -- / -- |
| **Mimi** | 96 | 24k | 1 | 1100 | 8 | 0.74 / 0.59 | 0.91 / 0.85 | 2.80 / 2.24 | 2.25 / 1.78 | 1.24 / 1.19 | 2.62 / 2.49 |
| **MOSS-Audio-Tokenizer-Nano** | 22 | 48k | 2 | 750 | 6 | 0.64 / 0.61 | 0.90 / 0.85 | 2.65 / 2.28 | 2.11 / 1.87 | 1.04 / 1.01 | 2.42 / 2.27 |
| **MOSS-Audio-Tokenizer-Nano** | 22 | 48k | 2 | 1000 | 8 | **0.75 / 0.69** | **0.92 / 0.87** | **2.92 / 2.48** | **2.36 / 2.04** | **1.00 / 0.97** | **2.37 / 2.22** |
| **EnCodec** | 19 | 48k | 2 | 1500 | 1 | 0.35 / 0.30 | 0.76 / 0.75 | 1.54 / 1.60 | 1.25 / 1.32 | 1.25 / 1.05 | 2.73 / 2.30 |
| **SpeechTokenizer** | 120 | 16k | 1 | 1500 | 3 | 0.52 / 0.38 | 0.84 / 0.75 | 2.00 / 1.60 | 1.57 / 1.33 | -- / -- | -- / -- |
| **Mimi** | 96 | 24k | 1 | 1512.5 | 11 | 0.82 / 0.67 | 0.92 / 0.88 | 3.10 / 2.50 | 2.54 / 2.00 | 1.19 / 1.14 | 2.55 / 2.42 |
| **DAC** | 77 | 44.1k | 1 | 1723 | 2 | 0.57 / 0.47 | 0.86 / 0.80 | 2.21 / 1.85 | 1.74 / 1.49 | 1.03 / 0.99 | 2.43 / 2.26 |
| **SpeechTokenizer** | 120 | 16k | 1 | 2000 | 4 | 0.66 / 0.50 | 0.88 / 0.80 | 2.38 / 1.79 | 1.92 / 1.49 | -- / -- | -- / -- |
| **Mimi** | 96 | 24k | 1 | 2062.5 | 15 | 0.87 / 0.73 | 0.94 / 0.90 | 3.36 / 2.76 | 2.81 / 2.22 | 1.14 / 1.09 | 2.49 / 2.36 |
| **MOSS-Audio-Tokenizer-Nano** | 22 | 48k | 2 | 1500 | 12 | 0.84 / 0.77 | 0.94 / 0.90 | 3.25 / 2.77 | 2.71 / 2.31 | 0.95 / 0.91 | 2.31 / 2.14 |
| **MOSS-Audio-Tokenizer-Nano** | 22 | 48k | 2 | 2000 | 16 | **0.88 / 0.81** | **0.95 / 0.91** | **3.40 / 2.93** | **2.89 / 2.47** | **0.93 / 0.89** | **2.28 / 2.11** |

## Usage

### Quickstart

```python
import torchaudio
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

wav, sr = torchaudio.load("demo/demo_gt.wav")
if sr != model.sampling_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)

# The public waveform interface expects stereo audio.
if wav.shape[0] == 1:
    wav = wav.repeat(model.config.number_channels, 1)
else:
    wav = wav[: model.config.number_channels]

wav = wav.unsqueeze(0)
enc = model.encode(wav, return_dict=True)
print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")

dec = model.decode(enc.audio_codes, return_dict=True)
print(f"dec.audio.shape: {dec.audio.shape}")

wav = dec.audio.squeeze(0)
torchaudio.save("demo/demo_rec.wav", wav, sample_rate=model.sampling_rate)

# Decode with the first 8 codebooks, roughly 1 kbps.
dec_rvq8 = model.decode(enc.audio_codes[:8], return_dict=True)
wav_rvq8 = dec_rvq8.audio.squeeze(0)
torchaudio.save("demo/demo_rec_rvq8.wav", wav_rvq8, sample_rate=model.sampling_rate)
```

### Attention Backend And Compute Dtype

`config.attention_implementation` controls whether Transformer layers prefer `sdpa` or `flash_attention_2`.
`config.compute_dtype` controls the non-quantizer autocast dtype and supports `fp32`, `bf16`, and `fp16`.

```python
model.set_attention_implementation("flash_attention_2")
model.set_compute_dtype("fp16")
```

The quantizer always runs in fp32.

### Streaming

`MossAudioTokenizerModel.encode`, `decode`, `batch_encode`, and `batch_decode` all support streaming through a `chunk_duration` argument.

- `chunk_duration` is expressed in seconds.
- `chunk_duration * MossAudioTokenizerConfig.sampling_rate` must be divisible by `MossAudioTokenizerConfig.downsample_rate`.
- Streaming batch inference is supported.
- The public waveform interface expects stereo inputs shaped `(2, T)` or batched stereo inputs shaped `(B, 2, T)`.

```python
import torch
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
audio = torch.randn(2, 48000 * 6)  # dummy stereo waveform

# chunk_duration=0.08 s @ 48 kHz = 3840 samples, exactly one frame at downsample_rate=3840
# (the 6.0 s input is 288000 samples = 75 frames)
enc = model.encode(audio.unsqueeze(0), return_dict=True, chunk_duration=0.08)
dec = model.decode(enc.audio_codes, return_dict=True, chunk_duration=0.08)

batch_enc = model.batch_encode([audio, audio[:, : 48000 * 3]], chunk_duration=0.08)
codes_list = [
    batch_enc.audio_codes[:, i, : batch_enc.audio_codes_lengths[i]]
    for i in range(batch_enc.audio_codes.shape[1])
]
batch_dec = model.batch_decode(codes_list, chunk_duration=0.08)
```
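
The divisibility rule for `chunk_duration` can be checked before calling the model. The helper below is a minimal sketch of such a check; `chunk_frames` is a hypothetical name, not a function provided by this repository, and the default values are the 48 kHz sampling rate and 3,840-sample downsample rate used in the example above. In real code these would presumably be read from the model config.

```python
def chunk_frames(chunk_duration, sampling_rate=48_000, downsample_rate=3_840, tol=1e-6):
    """Return the number of whole frames in one chunk, or raise ValueError.

    Hypothetical helper: it only mirrors the documented rule that
    chunk_duration * sampling_rate must be divisible by downsample_rate.
    """
    samples = chunk_duration * sampling_rate
    nearest = round(samples)  # absorb floating-point noise such as 7680.000000000001
    if nearest <= 0 or abs(samples - nearest) > tol:
        raise ValueError(f"{chunk_duration} s is not a positive whole number of samples")
    if nearest % downsample_rate != 0:
        raise ValueError(
            f"{nearest} samples per chunk is not a multiple of {downsample_rate}"
        )
    return nearest // downsample_rate


print(chunk_frames(0.08))  # one frame per chunk
print(chunk_frames(0.16))  # two frames per chunk
```

With these defaults, valid chunk durations are multiples of 0.08 s. A value such as 0.1 s (4,800 samples) is rejected.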

#### Continuous Batch Streaming Decode

For decoder-side continuous batching, prefer `batch_decode(..., streaming=True, ...)`.

- The first streaming call may pass `max_batch_size=...`. If it is omitted, the batch size of the first call sets the number of fixed decoder slots reserved for that stream.
- Same-size calls continue the existing logical rows in order.
- If a later call is larger, the new rows are admitted by tail append.
- `finalize_indices` means "decode these rows one last time, then evict them". The indices are interpreted against the pre-call logical order.
- After a finalize call returns, the next streaming call may use the smaller survivor batch.
- `reset_stream=True` discards the hidden public streaming state and starts a fresh stream.

Current scope (Milestone 1):

- decode-only continuous batching
- one active streaming decode state per model instance
- fixed-slot decoder reservation from `max_batch_size`
- no encode-side continuous batching
- no physical compaction of surviving decode slots
- no multi-session concurrency on one model instance

```python
import torch
from transformers import AutoModel

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
num_quantizers = model.config.quantizer_kwargs["num_quantizers"]
codebook_size = model.config.quantizer_kwargs["codebook_size"]

codes_a0 = torch.randint(0, codebook_size, (num_quantizers, 2))
codes_b0 = torch.randint(0, codebook_size, (num_quantizers, 3))
codes_a1 = torch.randint(0, codebook_size, (num_quantizers, 2))
codes_b1 = torch.randint(0, codebook_size, (num_quantizers, 2))
codes_c0 = torch.randint(0, codebook_size, (num_quantizers, 1))
codes_a2 = torch.randint(0, codebook_size, (num_quantizers, 1))
codes_b2 = torch.randint(0, codebook_size, (num_quantizers, 2))
codes_c1 = torch.randint(0, codebook_size, (num_quantizers, 2))
codes_b3 = torch.randint(0, codebook_size, (num_quantizers, 1))
codes_c2 = torch.randint(0, codebook_size, (num_quantizers, 1))

# First call reserves 3 fixed decoder slots; A and B occupy the first two.
out_ab0 = model.batch_decode(
    [codes_a0, codes_b0],
    streaming=True,
    max_batch_size=3,
    reset_stream=True,
)

# Same logical rows continue in order; C is a tail append.
out_abc1 = model.batch_decode(
    [codes_a1, codes_b1, codes_c0],
    streaming=True,
)

# Finalize A against the pre-call logical order. A still decodes in this call,
# then is evicted immediately afterward.
out_abc2 = model.batch_decode(
    [codes_a2, codes_b2, codes_c1],
    streaming=True,
    finalize_indices=[0],
)

# The next call can shrink to the surviving logical rows only.
out_bc3 = model.batch_decode(
    [codes_b3, codes_c2],
    streaming=True,
)
```
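
The row bookkeeping in the example above can be traced without loading the model. The following is a toy sketch of the logical-row rules only (continue in order, tail append, finalize then evict). It is not the library's implementation, `advance` is a hypothetical name, and it ignores slot reservation via `max_batch_size`. It interprets `finalize_indices` against the rows passed in the call, which matches the pre-call order for rows that already existed.

```python
def advance(rows, batch, finalize_indices=()):
    """Apply one streaming call to the logical row order.

    rows:  labels of the rows alive before the call.
    batch: labels passed in this call (existing rows first, new rows at the tail).
    Returns (decoded, survivors): the rows decoded in this call and the rows
    that remain alive afterwards.
    """
    rows, batch = list(rows), list(batch)
    if batch[: len(rows)] != rows:
        raise ValueError("existing rows must continue in their current order")
    out_of_range = [i for i in finalize_indices if not 0 <= i < len(batch)]
    if out_of_range:
        raise ValueError(f"finalize_indices out of range: {out_of_range}")
    evict = set(finalize_indices)
    # Finalized rows are still decoded in this call and are dropped afterwards.
    survivors = [label for i, label in enumerate(batch) if i not in evict]
    return batch, survivors


# Replay the four calls from the example with labels instead of code tensors.
calls = [
    (["A", "B"], []),
    (["A", "B", "C"], []),   # C is a tail append
    (["A", "B", "C"], [0]),  # A decodes one last time, then is evicted
    (["B", "C"], []),        # smaller survivor batch
]
rows, history = [], []
for batch, finalize in calls:
    decoded, rows = advance(rows, batch, finalize)
    history.append((decoded, rows))
    print(f"decoded={decoded} survivors={rows}")
```

In this toy model, a call that drops a row without finalizing it first raises an error, which reflects the rule that rows leave the stream only through `finalize_indices` or `reset_stream=True`.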

## Repository Layout

- `configuration_moss_audio_tokenizer.py`
- `modeling_moss_audio_tokenizer.py`
- `__init__.py`
- `config.json`
- model weights

## Citation

If you use this model or code in your work, please cite:

```bibtex
@misc{gong2026mossttstechnicalreport,
  title={MOSS-TTS Technical Report},
  author={Yitian Gong and Botian Jiang and Yiwei Zhao and Yucheng Yuan and Kuangwei Chen and Yaozhou Jiang and Cheng Chang and Dong Hong and Mingshu Chen and Ruixiao Li and Yiyang Zhang and Yang Gao and Hanfu Chen and Ke Chen and Songlin Wang and Xiaogui Yang and Yuqian Zhang and Kexin Huang and ZhengYuan Lin and Kang Yu and Ziqi Chen and Jin Wang and Zhaoye Fei and Qinyuan Cheng and Shimin Li and Xipeng Qiu},
  year={2026},
  eprint={2603.18090},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2603.18090}
}
```

```bibtex
@misc{gong2026mossaudiotokenizerscalingaudiotokenizers,
  title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models},
  author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
  year={2026},
  eprint={2602.10934},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2602.10934}
}
```