| | --- |
| | library_name: transformers |
| | pipeline_tag: audio-to-audio |
| | tags: |
| | - audio-classification |
| | - signal-processing |
| | license: apache-2.0 |
| | --- |
| | |
| | # DashengTokenizer |
| |
|
| | <div align="center"> |
| |
|
| |
|
| | <a href="https://arxiv.org/abs/2602.23765"><img src="https://img.shields.io/badge/arXiv-2602.23765-b31b1b" alt="arXiv"></a>
| | <a href="https://huggingface.co/mispeech/dashengtokenizer"><img src="https://img.shields.io/badge/HuggingFace-ffcc66" alt="Hugging Face"></a>
| | <a href="https://arxiv.org/abs/2602.23765"><img src="https://img.shields.io/badge/license-Apache-13333b" alt="License: Apache 2.0"></a>
| | <a href="https://huggingface.co/mispeech/dashengtokenizer/colab"> |
| | <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"> |
| | </a> |
| |
|
| | </div> |
| |
|
| | DashengTokenizer is a high-performance continuous audio tokenizer designed for audio understanding and generation tasks.
| | Unlike previous work, our framework trains only a **single linear layer** to enable audio generation with semantically strong encoders.
| |
|
| | Achievements: |
| |
|
| | * **State-of-the-Art** Audio Understanding: DashengTokenizer consistently outperforms most previous self-supervised and supervised audio encoders. |
| | * **High-Fidelity** Signal Reconstruction: Maintains exceptional signal integrity, ensuring that audio remains crisp and accurate after processing. |
| | * Accelerated **Audio Generation** Training: Achieves optimal performance significantly faster than standard VAE models, reducing training time and costs. |
| | * Superior **Speech Enhancement**: Provides a more robust encoding foundation for isolating and clarifying speech in noisy environments. |
| |
|
| |
|
| |  |
| |
|
| | ## Usage |
| |
|
| | ### Installation |
| |
|
| | ```bash |
| | uv pip install transformers torch torchaudio einops |
| | ``` |
| |
|
| | ### Basic Usage |
| |
|
| | ```python |
| | import torch
| | import torchaudio
| | from transformers import AutoModel
| | 
| | # Run on the GPU when one is available, otherwise fall back to the CPU
| | device = "cuda" if torch.cuda.is_available() else "cpu"
| | 
| | # Load the model
| | model = AutoModel.from_pretrained("mispeech/dashengtokenizer", trust_remote_code=True)
| | model = model.to(device).eval()
| | 
| | # Load audio file (only 16 kHz is supported, so resample anything else)
| | audio, sr = torchaudio.load("path/to/audio.wav")
| | if sr != 16000:
| |     audio = torchaudio.functional.resample(audio, sr, 16000)
| |     sr = 16000
| | audio = audio.to(device)
| | 
| | # Optional: create an attention mask for variable-length inputs
| | # attention_mask = torch.ones(audio.shape[0], audio.shape[1], device=device)  # all ones: use the full audio
| | # attention_mask[0, 8000:] = 0  # example: mask everything after 0.5 s (sample 8000) in the first sample
| | 
| | # Method 1: end-to-end processing (encode + decode)
| | with torch.no_grad(), torch.autocast(device_type=device, enabled=(device == "cuda")):
| |     outputs = model(audio)  # optionally pass attention_mask=attention_mask
| | reconstructed_audio = outputs["audio"]
| | embeddings = outputs["embeddings"]
| | 
| | # Method 2: separate encoding and decoding
| | with torch.no_grad(), torch.autocast(device_type=device, enabled=(device == "cuda")):
| |     # Encode audio to embeddings
| |     embeddings = model.encode(audio)  # optionally pass attention_mask=attention_mask
| |     # Decode embeddings back to audio
| |     reconstructed_audio = model.decode(embeddings)
| | 
| | # Save reconstructed audio (autocast may return half precision, so cast to float32 on the CPU)
| | torchaudio.save("reconstructed_audio.wav", reconstructed_audio.float().cpu(), sr)
| | ``` |
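
The optional attention mask above matters most when several clips of different lengths go into one batch. The following is a minimal sketch of how such a batch could be built with plain PyTorch. The helper name `pad_batch` is hypothetical, and the mask convention (1 for real samples, 0 for padding, shape `(batch, samples)`) is an assumption inferred from the commented example, not something the model card states.

```python
import torch

def pad_batch(waveforms):
    """Zero-pad 1-D waveforms to a common length and build a matching mask.

    Assumed convention (not confirmed by the model card):
    1 marks a real sample, 0 marks padding.
    """
    lengths = torch.tensor([w.shape[-1] for w in waveforms])
    batch = torch.zeros(len(waveforms), int(lengths.max()))
    for i, w in enumerate(waveforms):
        batch[i, : w.shape[-1]] = w
    # A position is valid when its index is smaller than the clip length.
    positions = torch.arange(batch.shape[1])[None, :]
    attention_mask = (positions < lengths[:, None]).float()
    return batch, attention_mask

# Two synthetic 16 kHz clips: 1.0 s and 0.5 s of noise.
clips = [torch.randn(16000), torch.randn(8000)]
audio, attention_mask = pad_batch(clips)
print(audio.shape, attention_mask.sum(dim=1))
```

The resulting tensors could then be passed as `model.encode(audio, attention_mask=attention_mask)`, following the keyword shown in the comments of the example above.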
| |
|
| |
|
| | ## Use Cases |
| |
|
| | ### 1. Audio Encoding |
| | ```python |
| | embeddings = model.encode(audio) |
| | reconstructed = model.decode(embeddings) |
| | ``` |
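
To get a rough sense of how close a round trip stays to the input, one option is a waveform-level signal-to-noise ratio. This is only a sanity-check sketch: `snr_db` is a hypothetical helper, it is not the metric behind the reported results, and waveform SNR can look poor for neural decoders even when the audio sounds fine. The tensors below are stand-ins so the snippet runs without the model.

```python
import torch

def snr_db(reference, estimate, eps=1e-8):
    """Signal-to-noise ratio in dB between a reference and an estimate."""
    # Reconstructions may differ slightly in length, so compare the overlap only.
    n = min(reference.shape[-1], estimate.shape[-1])
    reference = reference[..., :n].float()
    estimate = estimate[..., :n].float()
    noise = reference - estimate
    ratio = reference.pow(2).sum(dim=-1) / (noise.pow(2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

# Stand-in signals; with the real model, compare `audio` against
# `model.decode(model.encode(audio))` instead.
reference = torch.sin(torch.linspace(0, 100, 16000)).unsqueeze(0)
estimate = reference + 0.01 * torch.randn_like(reference)
print(snr_db(reference, estimate))
```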
| |
|
| | ### 2. Feature Extraction |
| | ```python |
| | # Extract rich audio features for downstream tasks |
| | features = model.encode(audio) |
| | # Use features for classification, clustering, etc. |
| | ``` |
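
For clip-level tasks such as classification, frame-level features are usually pooled into one vector per clip before a small classifier is trained on top. The sketch below shows that pattern with a random stand-in tensor. The `(batch, frames, dim)` layout, the sizes, and the 10-class task are all assumptions for illustration, so check `features.shape` on the real model before reusing it.

```python
import torch
from torch import nn

# Stand-in for `model.encode(audio)`. The (batch, frames, dim) layout and the
# sizes below are assumptions, not values taken from the model card.
batch, frames, dim, num_classes = 4, 25, 64, 10
features = torch.randn(batch, frames, dim)

# Average over the time axis to get one fixed-size vector per clip.
clip_embeddings = features.mean(dim=1)

# A linear probe on top of the frozen features (hypothetical 10-class task).
probe = nn.Linear(dim, num_classes)
logits = probe(clip_embeddings)
print(logits.shape)  # torch.Size([4, 10])
```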
| |
|
| |
|
| | ## Limitations |
| |
|
| | - Only 16 kHz mono audio is supported; resample and downmix other inputs before encoding
| |
|
| | ## Results |
| |
|
| |  |
| |  |
| |
|
| | ## Citation |
| |
|
| | If you use DashengTokenizer in your research, please cite: |
| |
|
| | ```bibtex |
| | @misc{dinkel_dashengtokenizer_2026,
| |   title={DashengTokenizer: One layer is enough for unified audio understanding and generation},
| |   author={MiLM Plus, Xiaomi},
| |   year={2026},
| |   url={https://huggingface.co/mispeech/dashengtokenizer}
| | }
| | ``` |
| |
|
| | ## License |
| |
|
| | Apache 2.0 License |
| |
|
| |
|