Heinrich Dinkel committed on
Commit ba200b5 · 1 Parent(s): b0489bd

Added README

README.md CHANGED
---
library_name: transformers
license: apache-2.0
pipeline_tag: audio-to-audio
tags:
- audio-classification
---

# DashengTokenizer

![Audio Generation Results](./figures/audio_generation_results.pdf)
![Audio Understanding Results](./figures/audio_understanding_results.pdf)

DashengTokenizer is a high-performance neural audio tokenizer designed for audio understanding and generation tasks.

## Usage

### Installation

```bash
uv pip install transformers torch torchaudio einops
```

### Basic Usage

```python
import torch
import torchaudio
from transformers import AutoModel

# Load the model
model = AutoModel.from_pretrained("mispeech/dashengtokenizer", trust_remote_code=True)
model.eval()

# Load the audio file; the model only supports 16 kHz input,
# so resample anything else first
audio, sr = torchaudio.load("path/to/audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=16000)
    sr = 16000

# Optional: create an attention mask for variable-length inputs
# attention_mask = torch.ones(audio.shape[0], audio.shape[1])  # all ones = no padding
# attention_mask[0, 8000:] = 0  # example: mask everything after the first 8000 samples

# Method 1: end-to-end processing (encode + decode)
with torch.no_grad():
    outputs = model(audio)  # optionally pass attention_mask=attention_mask
    reconstructed_audio = outputs["audio"]
    embeddings = outputs["embeddings"]

# Method 2: separate encoding and decoding
with torch.no_grad():
    # Encode audio to embeddings
    embeddings = model.encode(audio)  # optionally pass attention_mask=attention_mask
    # Decode embeddings back to audio
    reconstructed_audio = model.decode(embeddings)

# Save the reconstructed audio
torchaudio.save("reconstructed_audio.wav", reconstructed_audio, sr)
```
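When batching clips of different lengths, the attention mask distinguishes real samples from padding. The snippet below sketches one common padding scheme using plain tensors (dummy data; the exact mask semantics beyond "1 = real sample, 0 = padding" are an assumption based on the comments above):

```python
import torch

# Two clips of different lengths (dummy data), zero-padded into one batch
clip_a = torch.randn(16000)  # 1.0 s at 16 kHz
clip_b = torch.randn(8000)   # 0.5 s at 16 kHz

max_len = max(clip_a.shape[0], clip_b.shape[0])
batch = torch.zeros(2, max_len)
attention_mask = torch.zeros(2, max_len)
for i, clip in enumerate([clip_a, clip_b]):
    batch[i, : clip.shape[0]] = clip          # copy the real samples
    attention_mask[i, : clip.shape[0]] = 1    # mark them as non-padding

print(batch.shape)  # torch.Size([2, 16000])
# Then: model(batch, attention_mask=attention_mask)
```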

## Use Cases

### 1. Audio Encoding
```python
embeddings = model.encode(audio)
reconstructed = model.decode(embeddings)
```

### 2. Feature Extraction
```python
# Extract rich audio features for downstream tasks
features = model.encode(audio)
# Use features for classification, clustering, etc.
```
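A typical downstream use is a lightweight linear probe on pooled features. The sketch below stands in random tensors for the encoder output; the `(batch, time, dim)` shape and the dimensions are illustrative assumptions, not the model's documented output format:

```python
import torch

# Hypothetical stand-in for model.encode output: (batch, time, dim)
batch, time, dim, num_classes = 4, 250, 512, 10
features = torch.randn(batch, time, dim)

# Mean-pool over the time axis to get one vector per clip
pooled = features.mean(dim=1)  # (batch, dim)

# A linear probe is a common lightweight classifier on frozen features
probe = torch.nn.Linear(dim, num_classes)
logits = probe(pooled)               # (batch, num_classes)
predictions = logits.argmax(dim=-1)  # (batch,)
print(predictions.shape)  # torch.Size([4])
```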

## Limitations

- Optimized for 16 kHz mono audio


## Citation

If you use DashengTokenizer in your research, please cite:

```bibtex
@misc{dinkel_dashengtokenizer_2026,
  title={DashengTokenizer: One layer is enough for unified audio understanding and generation},
  author={{MiLM Plus, Xiaomi}},
  year={2026},
  url={https://huggingface.co/mispeech/dashengtokenizer}
}
```

## License

Apache 2.0
figures/audio_generation_results.pdf ADDED
Binary file (16.6 kB)

figures/audio_understanding_results.pdf ADDED
Binary file (20.9 kB)