fdugyt committed on
Commit 10cda39 · verified · 1 Parent(s): b99163d

modify README

Files changed (1): README.md +13 -1

README.md CHANGED
@@ -13,7 +13,19 @@ tags:
 
 # MossAudioTokenizer
 
- MOSS Audio Tokenizer is a unified audio tokenizer designed to achieve both high-fidelity reconstruction and semantically rich representations across speech, sound, and music. Built on the Cat (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture, the model scales to 1.6 billion parameters and was trained on 3 million hours of audio, surpassing previous open-source tokenizers in reconstruction quality across all bitrates. It processes 24 kHz audio at a low 12.5 Hz frame rate, with all components—including the encoder, quantizer, decoder, decoder-only LLM, and discriminator—optimized jointly in an end-to-end manner. Featuring a 32-layer residual vector quantizer (RVQ) with variable-bitrate support, it provides a scalable, native foundation for the next generation of autoregressive audio foundation models.
+ **MOSSAudioTokenizer** is a unified discrete audio tokenizer built on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture. Scaling to 1.6 billion parameters, it serves as a unified discrete interface that delivers both high-fidelity reconstruction and high-level semantic alignment.
+
+ **Key Features:**
+
+ * **Extreme Compression & Variable Bitrate**: Compresses 24 kHz raw audio to a remarkably low frame rate of 12.5 Hz. Using a 32-layer residual vector quantizer (RVQ), it supports high-fidelity reconstruction across a wide range of bitrates, from 0.125 kbps to 4 kbps.
+ * **Pure Transformer Architecture**: A CNN-free, homogeneous architecture built entirely from causal Transformer blocks. With 1.6B combined parameters (encoder + decoder), it scales well and supports low-latency streaming inference.
+ * **Large-Scale General Audio Training**: Trained on 3 million hours of diverse audio data, the model encodes and reconstructs all audio domains, including speech, sound effects, and music.
+ * **Unified Semantic-Acoustic Representation**: While achieving state-of-the-art reconstruction quality, Cat produces semantically rich discrete tokens, making them well suited to downstream tasks such as speech understanding (ASR) and generation (TTS).
+ * **Fully Trained From Scratch**: Cat does not rely on any pretrained encoders (such as HuBERT or Whisper) or on distillation from teacher models; all representations are learned from raw data.
+ * **End-to-End Joint Optimization**: All components, including the encoder, quantizer, decoder, discriminator, and a decoder-only LLM for semantic alignment, are optimized jointly in a single training pipeline.
+
+ **Summary:**
+ By combining a simple, scalable architecture with massive-scale data, Cat overcomes the bottlenecks of traditional audio tokenizers, providing a robust, high-fidelity, and semantically grounded interface for the next generation of native audio foundation models.
 
 This repository contains a lightweight remote-code implementation that mirrors the current 🤗 Transformers
 `transformers.models.moss_audio_tokenizer` module. It is intended to be uploaded to a Hugging Face Hub model repository
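
The bitrate range in the diff above follows directly from the stated frame rate and RVQ depth. A minimal sketch of that arithmetic, assuming each RVQ layer uses a 1024-entry codebook (10 bits per code; this size is not stated in the README but is consistent with the quoted 0.125 kbps to 4 kbps range at 12.5 Hz):

```python
# Token-rate and bitrate arithmetic implied by the README's numbers.
# ASSUMPTION: 1024-entry codebooks per RVQ layer (10 bits per code);
# the README states only the frame rate, layer count, and bitrate range.

FRAME_RATE_HZ = 12.5   # code frames per second of audio (from the README)
BITS_PER_CODE = 10     # assumed: log2(1024)
MAX_RVQ_LAYERS = 32    # README: 32-layer residual vector quantizer

def bitrate_kbps(num_layers: int) -> float:
    """Bitrate when keeping only the first `num_layers` RVQ codebooks."""
    if not 1 <= num_layers <= MAX_RVQ_LAYERS:
        raise ValueError("num_layers must be in [1, 32]")
    return FRAME_RATE_HZ * BITS_PER_CODE * num_layers / 1000.0

def compression_ratio(num_layers: int, sample_rate_hz: int = 24_000,
                      sample_bits: int = 16) -> float:
    """Ratio of raw PCM bitrate (24 kHz, 16-bit) to the tokenized bitrate."""
    raw_kbps = sample_rate_hz * sample_bits / 1000.0
    return raw_kbps / bitrate_kbps(num_layers)

print(bitrate_kbps(1))        # 0.125 kbps with a single codebook
print(bitrate_kbps(32))       # 4.0 kbps at full RVQ depth
print(compression_ratio(32))  # 96x vs. raw 24 kHz / 16-bit PCM
```

Dropping trailing RVQ layers at inference time is the usual way such tokenizers trade fidelity for bitrate, which is what the "variable bitrate" bullet refers to.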