Li-Ruixiao fdugyt committed on
Commit 66562e3 · 1 Parent(s): 4a52ecf

modify readme (#2)


- modify readme (f9c0141315f2c2fe1fb6442a19af4a13f49eda15)


Co-authored-by: yitian gong <fdugyt@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +4 -28
README.md CHANGED
@@ -13,8 +13,7 @@ tags:
 
 # MossAudioTokenizer
 
-MossAudioTokenizer is a Transformer-based neural audio tokenizer model jointly optimizing the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction of general audio, audio tokenization and synthesis.
-Both the encoder and decoder of MossAudioTokenizer contain approximately 0.8 billion parameters each, totaling about 1.6 billion. MossAudioTokenizer operates at 12.5 Hz, uses a 32-layer residual vector quantizer (RVQ), and supports variable-codebook decoding.
+MOSS Audio Tokenizer is a unified audio tokenizer designed to achieve both high-fidelity reconstruction and semantically rich representations across speech, sound, and music. Built on the Cat (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture, the model scales to 1.6 billion parameters and was trained on 3 million hours of audio, surpassing previous open-source tokenizers in reconstruction quality across all bitrates. It processes 24 kHz audio at a low 12.5 Hz frame rate, with all components—including the encoder, quantizer, decoder, decoder-only LLM, and discriminator—optimized jointly in an end-to-end manner. Featuring a 32-layer residual vector quantizer (RVQ) with variable-bitrate support, it provides a scalable, native foundation for the next generation of autoregressive audio foundation models.
 
 This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers
 `transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository
@@ -29,31 +28,8 @@ and loaded with `trust_remote_code=True` when needed.
 
 ## Usage
 
-### Installation
-
-```bash
-cd MOSS-Audio-Tokenizer
-pip install -r requirements.txt
-```
-
 ### Quickstart
 
-```python
-import torch
-from transformers import AutoModel
-
-repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
-model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()
-
-audio = torch.randn(1, 1, 3200)  # dummy waveform
-enc = model.encode(audio, return_dict=True)
-print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")
-dec = model.decode(enc.audio_codes, return_dict=True)
-print(f"dec.audio.shape: {dec.audio.shape}")
-```
-
-### Quickstart (Waveform I/O)
-
 ```python
 import torch
 from transformers import AutoModel
@@ -118,10 +94,10 @@ The table below compares the reconstruction quality of open-source audio tokeniz
 - Audio metrics are evaluated on the AudioSet evaluation subset, while music metrics are evaluated on MUSDB, reported as audio/music.
 - STFT-Dist. denotes the STFT distance.
 - Higher is better for speech metrics, while lower is better for audio/music metrics (Mel-Loss, STFT-Dist.).
-- $\boldsymbol{N}_{\mathrm{VQ}}$ denotes the number of quantizers.
+- Nq denotes the number of quantizers.
 
-| Model | bps | Frame rate | $\boldsymbol{N}_{\mathrm{VQ}}$ | Speech: SIM ↑ (EN/ZH) | Speech: STOI ↑ (EN/ZH) | Speech: PESQ-NB ↑ (EN/ZH) | Speech: PESQ-WB ↑ (EN/ZH) | Audio/Music: Mel-Loss ↓ | Audio/Music: STFT-Dist. ↓ |
-| --- | ---: | ---: | ---: | --- | --- | --- | --- | --- | --- |
+| Model | bps | Frame rate | Nq | Speech: SIM ↑ (EN/ZH) | Speech: STOI ↑ (EN/ZH) | Speech: PESQ-NB ↑ (EN/ZH) | Speech: PESQ-WB ↑ (EN/ZH) | Audio/Music: Mel-Loss ↓ | Audio/Music: STFT-Dist. ↓ |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
 | **XCodec2.0** | 800 | 50 | 1 | 0.82 / 0.74 | 0.92 / 0.86 | 3.04 / 2.46 | 2.43 / 1.96 | -- / -- | -- / -- |
 | **MiMo Audio Tokenizer** | 850 | 25 | 4 | 0.80 / 0.74 | 0.91 / 0.87 | 2.94 / 2.62 | 2.39 / 2.14 | **0.82** / 0.81 | 2.33 / 2.23 |
 | **Higgs Audio Tokenizer** | 1000 | 25 | 4 | 0.77 / 0.68 | 0.83 / 0.82 | 3.03 / 2.61 | 2.48 / 2.14 | 0.83 / **0.80** | 2.20 / 2.05 |
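A note for readers cross-checking the comparison table: the bps column follows from frame rate, Nq, and bits per code as bps = frame_rate × Nq × log2(codebook size). A quick sketch below; the codebook sizes used are assumptions chosen to reproduce the listed figures, not values taken from the respective model cards.

```python
import math

def bitrate_bps(frame_rate_hz, n_quantizers, codebook_size):
    """Token bitrate of an RVQ codec: frames/s x quantizers x bits per code."""
    return frame_rate_hz * n_quantizers * math.log2(codebook_size)

# Hypothetical codebook sizes: a single 65536-entry codebook reproduces
# XCodec2.0's 800 bps; four 1024-entry codebooks reproduce Higgs Audio
# Tokenizer's 1000 bps.
print(bitrate_bps(50, 1, 65536))  # 800.0
print(bitrate_bps(25, 4, 1024))   # 1000.0
```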
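The updated description mentions a 32-layer RVQ with variable-bitrate support. The mechanism behind variable-codebook decoding can be sketched in plain NumPy: each quantizer encodes the residual left by the previous layers, so decoding any prefix of the code streams yields a coarser but still valid reconstruction. This is a generic illustration of residual vector quantization, not the model's actual quantizer.

```python
import numpy as np

def rvq_encode(frames, codebooks):
    """Residual VQ: layer i quantizes the residual left by layers 0..i-1."""
    residual = frames.copy()
    codes = []
    for cb in codebooks:
        # nearest codebook entry to each frame's current residual
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        residual -= cb[idx]
    return np.stack(codes)  # (n_layers, n_frames)

def rvq_decode(codes, codebooks, n_layers=None):
    """Variable-codebook decoding: sum entries of the first n_layers only."""
    n_layers = len(codes) if n_layers is None else n_layers
    return sum(codebooks[i][codes[i]] for i in range(n_layers))

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 8))
# entry 0 of each codebook is the zero vector, so a layer can leave the
# residual unchanged; extra layers then never increase reconstruction error
codebooks = [np.vstack([np.zeros((1, 8)), rng.normal(size=(63, 8))])
             for _ in range(4)]
codes = rvq_encode(frames, codebooks)
coarse = rvq_decode(codes, codebooks, n_layers=1)
fine = rvq_decode(codes, codebooks)
print(((frames - fine) ** 2).mean() <= ((frames - coarse) ** 2).mean())  # True
```

Truncating to the first k code streams is what lets one encoding pass serve multiple bitrates, which is the property the table's per-bitrate comparisons exercise.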