WavCochV8192 — 8,192-code speech tokenizer (cochlear tokens)

WavCochV8192 is a biologically inspired, learned audio quantizer that maps a raw waveform to discrete "cochlear tokens". It serves as the tokenizer for the AuriStream autoregressive speech/language model (e.g., TuKoResearch/AuriStream1B_librilight_ckpt500k). The model is trained on LibriSpeech960. It encodes audio into a time–frequency representation (a cochleagram; Feather et al., 2023, Nat Neuro) and reads out discrete codes from an 8,192-entry vocabulary through a low-bit latent bottleneck based on lookup-free quantization (LFQ). The resulting tokens can be fed to a transformer LM for representation learning and next-token prediction (speech continuation).
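To make the "low-bit bottleneck" idea concrete, here is a minimal sketch of how LFQ-style quantization can turn a small latent vector into one integer code: each dimension is binarized by its sign and the bits are read as a binary number. The function name `lfq_code`, the bit ordering, and the assumption of 13 latent dimensions (chosen only because 2**13 = 8,192) are illustrative and not taken from the WavCochV8192 implementation.

```python
def lfq_code(latent):
    """Map a real-valued latent vector to an integer code by sign (LFQ-style).

    Hypothetical helper: the real model's bit ordering and threshold may differ.
    """
    code = 0
    for i, value in enumerate(latent):
        bit = 1 if value > 0 else 0   # binarize each dimension by its sign
        code |= bit << i              # dimension i contributes 2**i
    return code


NUM_BITS = 13                 # assumption: 13 binary dimensions -> 8,192 codes
vocab_size = 2 ** NUM_BITS
print(vocab_size)                    # 8192
print(lfq_code([0.3, -1.2, 0.8]))    # bits (1, 0, 1) -> 1 + 4 = 5
```

Because the code is just the sign pattern of the latent, no codebook lookup is needed, which is where the "lookup-free" name comes from.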

API at a glance

Input: mono waveform at 16 kHz as a float32 PyTorch tensor of shape (B, 1, T)

Output: token IDs of shape (B, L), returned in a dictionary under the key "input_ids"

Implemented as a transformers custom model — load with trust_remote_code=True.
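For batches with clips of different lengths, the (B, 1, T) input shape means the waveforms have to be brought to a common length first. The sketch below right-pads with zeros; `pad_batch` is a hypothetical helper, and this card does not say how the quantizer treats padded regions, so tokens that fall in the padding should be treated as invalid and trimmed by the caller.

```python
import torch
import torch.nn.functional as F


def pad_batch(waves):
    """Right-pad 1-D mono waveforms with zeros and stack them to (B, 1, T_max).

    Returns the batch and the original lengths (in samples) so that tokens
    produced from padding can be discarded afterwards.
    """
    lengths = torch.tensor([w.shape[-1] for w in waves])
    t_max = int(lengths.max())
    padded = [F.pad(w.float(), (0, t_max - w.shape[-1])) for w in waves]
    batch = torch.stack(padded).unsqueeze(1)   # (B, T_max) -> (B, 1, T_max)
    return batch, lengths


# Two synthetic 16 kHz clips of 1.0 s and 1.5 s
clips = [torch.randn(16_000), torch.randn(24_000)]
batch, lengths = pad_batch(clips)
print(batch.shape)  # torch.Size([2, 1, 24000])
```

The returned `batch` can be passed to the quantizer in place of a single waveform; how many output tokens correspond to each clip's valid samples depends on the model's token rate, which is not stated here.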

Installation

pip install -U torch torchaudio transformers

Quickstart — Quantize a waveform into cochlear tokens

import torch, torchaudio
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the quantizer
quantizer = AutoModel.from_pretrained(
    "TuKoResearch/WavCochV8192", trust_remote_code=True
).to(device).eval()

# Load & prep audio (mono, 16 kHz)
wav, sr = torchaudio.load("sample.wav")
if wav.size(0) > 1:  # stereo -> mono
    wav = wav.mean(dim=0, keepdim=True)
if sr != 16_000:
    wav = torchaudio.transforms.Resample(sr, 16_000)(wav)
    sr = 16_000

# Forward pass — returns a dict with "input_ids" = (B, L)
with torch.no_grad():
    out = quantizer(wav.unsqueeze(0).to(device))   # (1, 1, T) -> dict
    token_ids = out["input_ids"]                   # LongTensor (1, L)

print("Token IDs shape:", token_ids.shape)
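The card does not state how many tokens are produced per second of audio, but the rate can be estimated from the input length T and the output length L. A minimal sketch follows; `tokens_per_second` is a hypothetical helper, and the numbers in the example call are placeholders rather than measurements from this model.

```python
def tokens_per_second(num_samples, num_tokens, sample_rate=16_000):
    """Estimate the token rate from waveform length and token count."""
    duration_s = num_samples / sample_rate
    return num_tokens / duration_s


# Placeholder values for illustration only (not WavCochV8192's actual rate)
print(tokens_per_second(num_samples=32_000, num_tokens=100))  # 50.0
```

With the quickstart variables, the call would be `tokens_per_second(wav.shape[-1], token_ids.shape[-1])`.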

Intended uses & limitations

Uses: tokenizing speech for LM training; compact storage or streaming of speech as discrete IDs. The tokenization scheme is loosely inspired by human biology.
Limitations: trained only on spoken English, so it may perform worse on other languages and on non-speech sounds.
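For the compact-storage use case, note that IDs below 8,192 fit in a 16-bit integer, so the int64 output can be downcast before saving. The following is a sketch under that assumption; `pack_ids` and `unpack_ids` are hypothetical helpers, not part of the model's API.

```python
import io

import torch

VOCAB_SIZE = 8192  # from the model name; every ID is assumed to lie in [0, 8192)


def pack_ids(token_ids):
    """Downcast token IDs to int16 (2 bytes per token instead of 8)."""
    if token_ids.numel() and (token_ids.min() < 0 or token_ids.max() >= VOCAB_SIZE):
        raise ValueError("token IDs fall outside the expected vocabulary range")
    return token_ids.to(torch.int16).cpu()


def unpack_ids(packed):
    """Restore int64 IDs, the dtype typically expected by embedding layers."""
    return packed.to(torch.long)


ids = torch.randint(0, VOCAB_SIZE, (1, 500))   # stand-in for out["input_ids"]
buffer = io.BytesIO()                          # a file path works the same way
torch.save(pack_ids(ids), buffer)
buffer.seek(0)
restored = unpack_ids(torch.load(buffer))
print(torch.equal(ids, restored))  # True
```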

Citation

If you use this tokenizer, please cite:

@inproceedings{tuckute2025cochleartokens,
  title     = {Representing Speech Through Autoregressive Prediction of Cochlear Tokens},
  author    = {Greta Tuckute and Klemen Kotar and Evelina Fedorenko and Daniel Yamins},
  booktitle = {Interspeech 2025},
  year      = {2025},
  pages     = {2180--2184},
  doi       = {10.21437/Interspeech.2025-2044},
  issn      = {2958-1796}
}

AuriStream LM: https://huggingface.co/TuKoResearch/AuriStream1B_librilight_ckpt500k
Org: https://huggingface.co/TuKoResearch

Model size: 11.1M params (Safetensors; tensor types I64 and F32)