---
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
tags:
- audio
- speech
- tokenizer
- quantizer
- cochlear
- custom_code
license: apache-2.0 # ← adjust if different
pretty_name: WavCoch (8192-code speech tokenizer)
---
# WavCochV8192 — 8,192-code speech tokenizer (cochlear tokens)
**WavCochV8192** is a biologically inspired, learned **audio quantizer** that maps a raw waveform to discrete **"cochlear tokens."** It serves as the tokenizer for the AuriStream autoregressive speech/language model (e.g., [TuKoResearch/AuriStream1B_librilight_ckpt500k](https://huggingface.co/TuKoResearch/AuriStream1B_librilight_ckpt500k)). The model is trained on LibriSpeech (960 h): it encodes audio into a time–frequency representation (a [cochleagram; Feather et al., 2023, Nat. Neurosci.](https://github.com/jenellefeather/chcochleagram)) and reads out **8,192-way discrete codes** through a low-bit latent bottleneck (lookup-free quantization, LFQ). These tokens can be fed to a transformer LM for **representation learning** and **next-token prediction** (speech continuation).
> **API at a glance**
> - **Input:** mono waveform at 16 kHz (PyTorch `float32` tensor), shape **(B, 1, T)**
> - **Output:** token IDs, shape **(B, L)**, returned as a dictionary under the key **`"input_ids"`**
> - Implemented as a `transformers` custom model — load with `trust_remote_code=True`.
---
## Installation
```bash
pip install -U torch torchaudio transformers
```
---
## Quickstart — Quantize a waveform into cochlear tokens
```python
import torch
import torchaudio
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the quantizer
quantizer = AutoModel.from_pretrained(
    "TuKoResearch/WavCochV8192", trust_remote_code=True
).to(device).eval()

# Load & prep audio (mono, 16 kHz)
wav, sr = torchaudio.load("sample.wav")
if wav.size(0) > 1:  # stereo -> mono
    wav = wav.mean(dim=0, keepdim=True)
if sr != 16_000:
    wav = torchaudio.transforms.Resample(sr, 16_000)(wav)
    sr = 16_000

# Forward pass — returns a dict with "input_ids" of shape (B, L)
with torch.no_grad():
    out = quantizer(wav.unsqueeze(0).to(device))  # (1, 1, T) -> dict
token_ids = out["input_ids"]  # LongTensor of shape (1, L)

print("Token IDs shape:", token_ids.shape)
```
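Since the quantizer takes a batched **(B, 1, T)** input, variable-length clips must be padded to a common length before a single forward pass. A minimal sketch using zero right-padding (the random stand-in waveforms are illustrative only; zero-padding is encoded like silence, so tokens from padded regions may need to be trimmed downstream):

```python
import torch
import torch.nn.functional as F

# Stand-in mono waveforms of different lengths, shape (1, T) each
clips = [torch.randn(1, 16000), torch.randn(1, 12000)]

# Right-pad every clip with zeros to the longest length, then stack
T_max = max(c.size(-1) for c in clips)
batch = torch.stack([F.pad(c, (0, T_max - c.size(-1))) for c in clips])

print(batch.shape)  # (2, 1, 16000), ready for a single quantizer call
```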
---
## Intended uses & limitations
- **Uses:** tokenization for speech LM training; compact storage/streaming of speech as discrete IDs, loosely inspired by human auditory processing.
- **Limitations:** trained only on spoken English; performance may degrade on other languages and on non-speech audio.
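For the compact-storage use case above: since IDs lie in 0–8191, each token fits in a 16-bit integer (2 bytes per token instead of the 8 bytes of a `LongTensor` element). A minimal sketch (the file name and the random stand-in tensor are illustrative, not part of the model's API):

```python
import numpy as np
import torch

# Stand-in for the quantizer output: (B, L) LongTensor of IDs in [0, 8192)
token_ids = torch.randint(0, 8192, (1, 500))

# Pack to int16 and save — 4x smaller than storing int64 IDs
packed = token_ids.squeeze(0).to(torch.int16).numpy()
np.save("sample_tokens.npy", packed)

# Reload and restore the (B, L) LongTensor layout a downstream LM expects
restored = torch.from_numpy(np.load("sample_tokens.npy")).long().unsqueeze(0)
assert torch.equal(restored, token_ids)
```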
---
## Citation
If you use this tokenizer please cite:
```bibtex
@inproceedings{tuckute2025cochleartokens,
title = {Representing Speech Through Autoregressive Prediction of Cochlear Tokens},
author = {Greta Tuckute and Klemen Kotar and Evelina Fedorenko and Daniel Yamins},
booktitle = {Interspeech 2025},
year = {2025},
pages = {2180--2184},
doi = {10.21437/Interspeech.2025-2044},
issn = {2958-1796}
}
```
---
## Related
- **AuriStream LM:** https://huggingface.co/TuKoResearch/AuriStream1B_librilight_ckpt500k
- **Org:** https://huggingface.co/TuKoResearch