---
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
tags:
- audio
- speech
- tokenizer
- quantizer
- cochlear
- custom_code
license: apache-2.0
pretty_name: WavCoch (8192-code speech tokenizer)
---

# WavCochV8192 — 8,192-code speech tokenizer (cochlear tokens)

**WavCochV8192** is a biologically inspired, learned **audio quantizer** that maps a raw waveform to discrete **"cochlear tokens."** It serves as the tokenizer for the AuriStream autoregressive speech/language model (e.g., [TuKoResearch/AuriStream1B_librilight_ckpt500k](https://huggingface.co/TuKoResearch/AuriStream1B_librilight_ckpt500k)). Trained on LibriSpeech960, the model first encodes audio into a time–frequency representation ([cochleagram; Feather et al., 2023, Nat. Neurosci.](https://github.com/jenellefeather/chcochleagram)) and then reads out **8,192-way discrete codes** through a low-bit latent bottleneck (lookup-free quantization, LFQ). These tokens can be fed to a transformer LM for **representation learning** and **next-token prediction** (speech continuation).

> **API at a glance**
> - **Input:** mono waveform at 16 kHz (PyTorch `float32` tensor), shape **(B, 1, T)**  
> - **Output:** token IDs, shape **(B, L)**, returned in a dictionary under the key **`"input_ids"`**  
> - Implemented as a `transformers` custom model; load with `trust_remote_code=True`.
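
The "lookup-free" in LFQ means that each latent dimension is simply binarized and the resulting bit pattern is read as an integer code, so a 13-dimensional binary latent yields 2^13 = 8,192 distinct codes. The sketch below illustrates that counting argument only; the function name, bit order, and sign convention are illustrative assumptions, not this model's actual implementation:

```python
import torch

def lfq_codes(latents: torch.Tensor) -> torch.Tensor:
    """Map continuous latents (..., 13) to integer codes in [0, 8192).

    Each dimension is binarized by sign, then the 13 bits are read
    as a base-2 integer (bit order here is an arbitrary choice).
    """
    bits = (latents > 0).long()                       # sign -> {0, 1} bits
    weights = 2 ** torch.arange(bits.shape[-1])       # 1, 2, 4, ..., 4096
    return (bits * weights).sum(dim=-1)

# Example: a batch of 2 sequences of 5 latent frames, 13 dims each
codes = lfq_codes(torch.randn(2, 5, 13))
print(codes.shape)  # torch.Size([2, 5]), values in [0, 8192)
```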

---

## Installation

```bash
pip install -U torch torchaudio transformers
```

---

## Quickstart — Quantize a waveform into cochlear tokens

```python
import torch, torchaudio
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the quantizer
quantizer = AutoModel.from_pretrained(
    "TuKoResearch/WavCochV8192", trust_remote_code=True
).to(device).eval()

# Load & prep audio (mono, 16 kHz)
wav, sr = torchaudio.load("sample.wav")
if wav.size(0) > 1:  # stereo -> mono
    wav = wav.mean(dim=0, keepdim=True)
if sr != 16_000:
    wav = torchaudio.transforms.Resample(sr, 16_000)(wav)
    sr = 16_000

# Forward pass — returns a dict with "input_ids" = (B, L)
with torch.no_grad():
    out = quantizer(wav.unsqueeze(0).to(device))   # (1, 1, T) -> dict
    token_ids = out["input_ids"]                   # LongTensor (1, L)

print("Token IDs shape:", token_ids.shape)
```
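
To tokenize several clips in one forward pass, the waveforms must share a length before they can be stacked into a **(B, 1, T)** batch. A simple approach is right-padding with zeros; `batch_waveforms` below is an illustrative helper, not part of this repo, and whether trailing silence alters the last few tokens is not documented here, so consider trimming outputs accordingly:

```python
import torch
import torch.nn.functional as F

def batch_waveforms(wavs: list[torch.Tensor]) -> torch.Tensor:
    """Right-pad mono waveforms, each shaped (1, T_i), into one (B, 1, T_max) batch."""
    t_max = max(w.shape[-1] for w in wavs)
    return torch.stack([F.pad(w, (0, t_max - w.shape[-1])) for w in wavs])

# Example: two clips of different lengths -> one padded batch
batch = batch_waveforms([torch.randn(1, 16_000), torch.randn(1, 24_000)])
print(batch.shape)  # torch.Size([2, 1, 24000])
```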

---

## Intended uses & limitations
- **Uses:** tokenization for speech-LM training; compact storage/streaming of speech as discrete IDs, loosely inspired by the human auditory periphery.  
- **Limitations:** trained only on spoken English (LibriSpeech), so it may generalize poorly to other languages and to non-speech audio.

---

## Citation

If you use this tokenizer, please cite:

```bibtex
@inproceedings{tuckute2025cochleartokens,
  title     = {Representing Speech Through Autoregressive Prediction of Cochlear Tokens},
  author    = {Greta Tuckute and Klemen Kotar and Evelina Fedorenko and Daniel Yamins},
  booktitle = {Interspeech 2025},
  year      = {2025},
  pages     = {2180--2184},
  doi       = {10.21437/Interspeech.2025-2044},
  issn      = {2958-1796}
}
```

---

## Related
- **AuriStream LM:** https://huggingface.co/TuKoResearch/AuriStream1B_librilight_ckpt500k
- **Org:** https://huggingface.co/TuKoResearch