|
|
--- |
|
|
language: |
|
|
- en |
|
|
library_name: transformers |
|
|
pipeline_tag: feature-extraction |
|
|
tags: |
|
|
- audio |
|
|
- speech |
|
|
- tokenizer |
|
|
- quantizer |
|
|
- cochlear |
|
|
- custom_code |
|
|
license: apache-2.0 |
|
|
pretty_name: WavCoch (8192-code speech tokenizer) |
|
|
--- |
|
|
|
|
|
# WavCochV8192 — 8,192-code speech tokenizer (cochlear tokens) |
|
|
|
|
|
**WavCochV8192** is a biologically inspired, learned **audio quantizer** that maps a raw waveform to **discrete "cochlear tokens".** It serves as the tokenizer for the AuriStream autoregressive speech/language model (e.g., [TuKoResearch/AuriStream1B_librilight_ckpt500k](https://huggingface.co/TuKoResearch/AuriStream1B_librilight_ckpt500k)). The model is trained on LibriSpeech960: it encodes audio into a time–frequency representation ([Cochleagram; Feather et al., 2023 Nat Neuro](https://github.com/jenellefeather/chcochleagram)) and reads out **8,192-way discrete codes** through a low-bit latent bottleneck (lookup-free quantization, LFQ). These tokens can be fed to a transformer LM for **representation learning** and **next-token prediction** (speech continuation).
|
|
|
|
|
> **API at a glance** |
|
|
> - **Input:** mono waveform at 16 kHz (`torch.float32` tensor), shape **(B, 1, T)**
|
|
> - **Output:** token IDs, shape **(B, L)**, returned as a dictionary under key **`"input_ids"`**
|
|
> - Implemented as a `transformers` custom model — load with `trust_remote_code=True`. |
|
|
|
|
|
--- |
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
pip install -U torch torchaudio transformers |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Quickstart — Quantize a waveform into cochlear tokens |
|
|
|
|
|
```python |
|
|
import torch, torchaudio |
|
|
from transformers import AutoModel |
|
|
|
|
|
device = "cuda" if torch.cuda.is_available() else "cpu" |
|
|
|
|
|
# Load the quantizer |
|
|
quantizer = AutoModel.from_pretrained( |
|
|
"TuKoResearch/WavCochV8192", trust_remote_code=True |
|
|
).to(device).eval() |
|
|
|
|
|
# Load & prep audio (mono, 16 kHz) |
|
|
wav, sr = torchaudio.load("sample.wav") |
|
|
if wav.size(0) > 1: # stereo -> mono |
|
|
wav = wav.mean(dim=0, keepdim=True) |
|
|
if sr != 16_000: |
|
|
wav = torchaudio.transforms.Resample(sr, 16_000)(wav) |
|
|
sr = 16_000 |
|
|
|
|
|
# Forward pass — returns a dict with "input_ids" = (B, L) |
|
|
with torch.no_grad(): |
|
|
out = quantizer(wav.unsqueeze(0).to(device)) # (1, 1, T) -> dict |
|
|
token_ids = out["input_ids"] # LongTensor (1, L) |
|
|
|
|
|
print("Token IDs shape:", token_ids.shape) |
|
|
``` |
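Because the quantizer accepts batched input of shape **(B, 1, T)**, several clips can be tokenized in a single forward pass after right-padding them to a common length. A minimal sketch of the padding step (the `pad_batch` helper is illustrative, not part of the model API; note that tokens produced for the padded tail correspond to silence and may need to be trimmed by duration):

```python
import torch

def pad_batch(waves):
    """Right-pad a list of (1, T_i) mono waveforms to a (B, 1, T_max) tensor."""
    t_max = max(w.size(-1) for w in waves)
    return torch.stack(
        [torch.nn.functional.pad(w, (0, t_max - w.size(-1))) for w in waves]
    )

# Two dummy 16 kHz clips of different lengths (1 s and 1.5 s)
waves = [torch.randn(1, 16_000), torch.randn(1, 24_000)]
batch = pad_batch(waves)
print(batch.shape)  # torch.Size([2, 1, 24000])
```

The resulting `batch` can be passed to the quantizer exactly as in the Quickstart, yielding `"input_ids"` of shape **(2, L)**.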
|
|
|
|
|
--- |
|
|
|
|
|
## Intended uses & limitations |
|
|
- **Uses:** tokenization for speech LM training; compact storage/streaming of speech as discrete IDs in a representation loosely inspired by the human cochlea.
|
|
- **Limitations:** trained only on spoken English, so it may not transfer well to other languages or to non-speech audio.
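On the storage point: since every code lies in `[0, 8192)`, each token fits in 16 bits, so sequences can be stored at a quarter of the size of the default `int64` IDs. A hedged sketch with NumPy (the token values and file name here are illustrative):

```python
import numpy as np

# e.g. token_ids = out["input_ids"].cpu().numpy() from the Quickstart
token_ids = np.array([[5, 8191, 42, 1023]], dtype=np.int64)

packed = token_ids.astype(np.uint16)  # lossless: all codes < 2**16
np.save("tokens.npy", packed)

restored = np.load("tokens.npy").astype(np.int64)
assert (restored == token_ids).all()
```

`uint16` halves storage again relative to `int32` and round-trips exactly, since 8,192 codes fit comfortably below the 65,536 limit.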
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this tokenizer please cite: |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{tuckute2025cochleartokens, |
|
|
title = {Representing Speech Through Autoregressive Prediction of Cochlear Tokens}, |
|
|
author = {Greta Tuckute and Klemen Kotar and Evelina Fedorenko and Daniel Yamins}, |
|
|
booktitle = {Interspeech 2025}, |
|
|
year = {2025}, |
|
|
pages = {2180--2184}, |
|
|
doi = {10.21437/Interspeech.2025-2044}, |
|
|
issn = {2958-1796} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Related |
|
|
- **AuriStream LM:** https://huggingface.co/TuKoResearch/AuriStream1B_librilight_ckpt500k |
|
|
- **Org:** https://huggingface.co/TuKoResearch |
|
|
|