---
library_name: transformers
pipeline_tag: audio-to-audio
tags:
  - audio-classification
  - signal-processing
license: apache-2.0
---

# DashengTokenizer

DashengTokenizer is a high-performance continuous audio tokenizer designed for audio understanding and generation tasks. In contrast to previous work, our framework trains only a single linear layer to enable audio generation on top of semantically strong encoders.

## Framework

## Usage

### Installation

```bash
uv pip install transformers torch torchaudio einops
```

### Basic Usage

```python
import torch
import torchaudio
from transformers import AutoModel

# Load the model
model = AutoModel.from_pretrained("mispeech/dashengtokenizer", trust_remote_code=True)
model.eval()

# Load the audio file; the model only supports 16 kHz input
audio, sr = torchaudio.load("path/to/audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)
    sr = 16000

# Optional: create an attention mask for variable-length inputs
# attention_mask = torch.ones(audio.shape[0], audio.shape[1])  # all ones = full audio
# attention_mask[0, 8000:] = 0  # example: mask the second half of the first sample

# Method 1: end-to-end processing (encode + decode)
with torch.no_grad():
    outputs = model(audio)  # optionally pass attention_mask=attention_mask
    reconstructed_audio = outputs["audio"]
    embeddings = outputs["embeddings"]

# Method 2: separate encoding and decoding
with torch.no_grad():
    # Encode audio to embeddings
    embeddings = model.encode(audio)  # optionally pass attention_mask=attention_mask

    # Decode embeddings back to audio
    reconstructed_audio = model.decode(embeddings)

# Save the reconstructed audio
torchaudio.save("reconstructed_audio.wav", reconstructed_audio, sr)
```
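The attention-mask comments above cover batching clips of different lengths. As a minimal sketch (using random tensors as stand-ins for real waveforms; the final `model(...)` call is left commented out since it requires the loaded model), padding and mask construction look like this:

```python
import torch

# Two clips of different lengths at 16 kHz, standing in for real audio
clip_a = torch.randn(16000)  # 1.0 s
clip_b = torch.randn(8000)   # 0.5 s
max_len = max(clip_a.numel(), clip_b.numel())

# Zero-pad both clips to the same length
batch = torch.zeros(2, max_len)
batch[0, :clip_a.numel()] = clip_a
batch[1, :clip_b.numel()] = clip_b

# Mask: 1 marks real audio, 0 marks padding
attention_mask = torch.zeros(2, max_len)
attention_mask[0, :clip_a.numel()] = 1
attention_mask[1, :clip_b.numel()] = 1

# outputs = model(batch, attention_mask=attention_mask)
```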

## Use Cases

### 1. Audio Encoding

```python
embeddings = model.encode(audio)
reconstructed = model.decode(embeddings)
```
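To judge how faithful the round trip is, you can compare the reconstruction against the original. A minimal sketch, using a stand-in tensor in place of the real model output (the small added noise is purely illustrative):

```python
import torch

# Stand-in tensors: in practice, `reconstructed` comes from model.decode(...)
audio = torch.randn(1, 16000)
reconstructed = audio + 0.01 * torch.randn_like(audio)

# Signal-to-noise ratio of the reconstruction, in dB
noise = audio - reconstructed
snr_db = 10 * torch.log10(audio.pow(2).mean() / noise.pow(2).mean())
print(f"Reconstruction SNR: {snr_db:.1f} dB")
```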

### 2. Feature Extraction

```python
# Extract rich audio features for downstream tasks
features = model.encode(audio)
# Use features for classification, clustering, etc.
```
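One common pattern is a linear probe on top of the frozen encoder. The sketch below assumes the embeddings have shape `(batch, time, dim)` and uses random tensors as a stand-in for `model.encode(audio)`; the probe head and dimensions are hypothetical, not part of the released model:

```python
import torch

batch, time, dim, num_classes = 2, 100, 768, 10

# Stand-in for model.encode(audio); assumed shape (batch, time, dim)
embeddings = torch.randn(batch, time, dim)

# Mean-pool over the time axis to get one vector per clip
clip_features = embeddings.mean(dim=1)  # (batch, dim)

# Hypothetical linear classification head on top of frozen features
probe = torch.nn.Linear(dim, num_classes)
logits = probe(clip_features)
print(logits.shape)  # torch.Size([2, 10])
```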

## Limitations

- Optimized for 16 kHz mono audio

## Results

### Audio Generation Results

### Audio Understanding Results

## Citation

If you use DashengTokenizer in your research, please cite:

```bibtex
@misc{dinkel_dashengtokenizer_2026,
  title={DashengTokenizer: One layer is enough for unified audio understanding and generation},
  author={MiLM Plus, Xiaomi},
  year={2026},
  url={https://huggingface.co/mispeech/dashengtokenizer}
}
```

## License

Apache 2.0 License