Heinrich Dinkel committed
Commit ba200b5 · Parent(s): b0489bd
Added README

Files changed:
- README.md (+97 −3)
- figures/audio_generation_results.pdf (added)
- figures/audio_understanding_results.pdf (added)
---
library_name: transformers
pipeline_tag:
- audio-to-audio
- audio-classification
license: apache-2.0
---

# DashengTokenizer

*(Figures: audio understanding and audio generation benchmark results; see `figures/audio_understanding_results.pdf` and `figures/audio_generation_results.pdf`.)*

DashengTokenizer is a high-performance neural audio tokenizer designed for audio understanding and generation tasks.

## Usage

### Installation

```bash
uv pip install transformers torch torchaudio einops
```

### Basic Usage

```python
import torch
import torchaudio
from transformers import AutoModel

# Load the model
model = AutoModel.from_pretrained("mispeech/dashengtokenizer", trust_remote_code=True)
model.eval()

# Load an audio file (only 16 kHz audio is supported)
audio, sr = torchaudio.load("path/to/audio.wav")

# Optional: create an attention mask for variable-length inputs
# attention_mask = torch.ones(audio.shape[0], audio.shape[1])  # all ones for full audio
# attention_mask[0, 8000:] = 0  # example: mask the second half of the first sample

# Method 1: end-to-end processing (encode + decode)
with torch.no_grad():
    outputs = model(audio)  # optionally pass attention_mask=attention_mask
    reconstructed_audio = outputs["audio"]
    embeddings = outputs["embeddings"]

# Method 2: separate encoding and decoding
with torch.no_grad():
    # Encode audio to embeddings
    embeddings = model.encode(audio)  # optionally pass attention_mask=attention_mask

    # Decode embeddings back to audio
    reconstructed_audio = model.decode(embeddings)

# Save the reconstructed audio
torchaudio.save("reconstructed_audio.wav", reconstructed_audio, sr)
```
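The attention-mask comments above can be made concrete with a small padding helper. This is a minimal sketch under the assumption that the model accepts a `(batch, samples)` tensor plus a same-shaped mask with 1 for real samples and 0 for padding; `pad_with_mask` is a hypothetical helper, not part of the model's API.

```python
import torch

# Hypothetical helper: pad variable-length 16 kHz clips to a common length
# and build the matching attention mask (1 = real samples, 0 = padding).
def pad_with_mask(waveforms):
    lengths = [w.shape[-1] for w in waveforms]
    max_len = max(lengths)
    batch = torch.zeros(len(waveforms), max_len)
    mask = torch.zeros(len(waveforms), max_len)
    for i, (w, n) in enumerate(zip(waveforms, lengths)):
        batch[i, :n] = w
        mask[i, :n] = 1.0
    return batch, mask

clips = [torch.randn(16000), torch.randn(8000)]  # 1.0 s and 0.5 s at 16 kHz
batch, attention_mask = pad_with_mask(clips)
# The pair would then be passed together, e.g.:
# embeddings = model.encode(batch, attention_mask=attention_mask)
```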

## Use Cases

### 1. Audio Encoding

```python
embeddings = model.encode(audio)
reconstructed = model.decode(embeddings)
```
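One simple way to sanity-check such an encode/decode round trip is to measure the signal-to-noise ratio between the original and the reconstruction. `snr_db` is a hypothetical helper (not part of the model's API), and the tensors are assumed to be sample-aligned with the same shape; the small-noise tensor below stands in for actual model output.

```python
import torch

# Hypothetical helper: signal-to-noise ratio in dB between a reference
# waveform and its reconstruction (higher = closer reconstruction).
def snr_db(reference, estimate):
    noise = reference - estimate
    return 10 * torch.log10(reference.pow(2).sum() / noise.pow(2).sum())

audio = torch.randn(1, 16000)
reconstructed = audio + 0.01 * torch.randn(1, 16000)  # stand-in for model output
print(f"SNR: {snr_db(audio, reconstructed).item():.1f} dB")
```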

### 2. Feature Extraction

```python
# Extract rich audio features for downstream tasks
features = model.encode(audio)
# Use the features for classification, clustering, etc.
```

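As a sketch of the classification use mentioned above: mean-pool frame-level features into one vector per clip and feed a linear head. The `(batch, frames, dim)` layout and `dim=512` are assumptions for illustration — check the actual shape of `model.encode`'s output; the random tensor stands in for real features.

```python
import torch

# Assumed embedding size and label count (illustrative only)
dim, num_classes = 512, 10
head = torch.nn.Linear(dim, num_classes)

features = torch.randn(2, 50, dim)   # stand-in for model.encode(audio)
clip_vec = features.mean(dim=1)      # (2, dim) clip-level representation
logits = head(clip_vec)              # (2, num_classes) class scores
```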
## Limitations

- Optimized for 16 kHz mono audio

## Citation

If you use DashengTokenizer in your research, please cite:

```bibtex
@misc{dinkel_dashengtokenizer_2026,
  title={DashengTokenizer: One layer is enough for unified audio understanding and generation},
  author={MiLM Plus, Xiaomi},
  year={2026},
  url={https://huggingface.co/mispeech/dashengtokenizer}
}
```

## License

Apache 2.0 License

figures/audio_generation_results.pdf — ADDED (binary file, 16.6 kB)
figures/audio_understanding_results.pdf — ADDED (binary file, 20.9 kB)