Access note: this repository is publicly accessible, but you must agree to share your contact information and accept the conditions on the Hugging Face Hub before you can access its files and content.

WavCochV8192 — 8,192-code speech tokenizer (cochlear tokens)

WavCochV8192 is a biologically inspired, learned audio quantizer that maps a raw waveform to discrete "cochlear tokens". It serves as the tokenizer for the AuriStream autoregressive speech/language model (e.g., TuKoResearch/AuriStream1B_librilight_ckpt500k). The model is trained on LibriSpeech960. It encodes audio into a time–frequency representation (a cochleagram; Feather et al., 2023, Nat Neuro) and reads out 8,192-way discrete codes through a low-bit latent bottleneck (LFQ). The resulting tokens can be fed to a transformer LM for representation learning and next-token prediction (speech continuation).
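To make the "low-bit bottleneck" idea concrete, here is a minimal sketch of how an LFQ-style readout (LFQ is commonly expanded as lookup-free quantization) can turn a latent vector into a token ID. This is not the WavCochV8192 implementation. It assumes that the 8,192 codes come from 13 binary latent dimensions (2**13 = 8192) and that dimension i maps to bit i; both are illustrative assumptions, and the function name is hypothetical.

```python
# Illustrative LFQ-style readout; NOT the actual WavCochV8192 code.
# Assumption: 8,192 codes = 13 binary latent dimensions (2**13 == 8192).
NUM_BITS = 13


def lfq_token_id(latent):
    """Threshold each latent dimension at 0 and pack the signs into an integer.

    Assumption: dimension i contributes bit i (least significant bit first).
    """
    if len(latent) != NUM_BITS:
        raise ValueError(f"expected {NUM_BITS} latent dimensions, got {len(latent)}")
    token_id = 0
    for i, value in enumerate(latent):
        if value > 0:  # positive -> bit set, non-positive -> bit clear
            token_id |= 1 << i
    return token_id


print(lfq_token_id([-1.0] * NUM_BITS))  # 0
print(lfq_token_id([1.0] * NUM_BITS))   # 8191
```

Because the code index is computed directly from the signs, no codebook lookup is needed, which is where the "lookup-free" name comes from.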

API at a glance

  • Input: mono waveform at 16 kHz (PyTorch float32 tensor), shape (B, 1, T)
  • Output: token IDs, shape (B, L), returned as a dictionary under the key "input_ids"
  • Implemented as a transformers custom model; load with trust_remote_code=True.
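The input shape above can be produced from several clips of different lengths with a small helper. The sketch below is an assumption-laden illustration: make_batch is a hypothetical name, and right-padding with zeros is our choice, not something the model card specifies. Padded regions will also be tokenized, so trailing tokens of shorter clips may need to be discarded.

```python
import torch
import torch.nn.functional as F


def make_batch(waveforms):
    """Stack 1-D mono waveforms into a zero-padded float32 batch of shape (B, 1, T).

    Hypothetical helper; zero right-padding is an assumption, not documented behavior.
    """
    if not waveforms:
        raise ValueError("need at least one waveform")
    for w in waveforms:
        if w.dim() != 1:
            raise ValueError("each waveform must be a 1-D (mono) tensor")
    max_len = max(w.numel() for w in waveforms)
    padded = [
        F.pad(w.to(torch.float32), (0, max_len - w.numel()))  # pad on the right
        for w in waveforms
    ]
    return torch.stack(padded).unsqueeze(1)  # (B, T) -> (B, 1, T)


# Two clips at 16 kHz: 1.0 s and 0.75 s of placeholder noise.
batch = make_batch([torch.randn(16_000), torch.randn(12_000)])
print(tuple(batch.shape), batch.dtype)
```

The resulting tensor can be moved to the target device and passed to the quantizer in place of a single unsqueezed waveform.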

Installation

pip install -U torch torchaudio transformers

Quickstart — Quantize a waveform into cochlear tokens

import torch, torchaudio
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the quantizer
quantizer = AutoModel.from_pretrained(
    "TuKoResearch/WavCochV8192", trust_remote_code=True
).to(device).eval()

# Load & prep audio (mono, 16 kHz)
wav, sr = torchaudio.load("sample.wav")
if wav.size(0) > 1:  # stereo -> mono
    wav = wav.mean(dim=0, keepdim=True)
if sr != 16_000:
    wav = torchaudio.transforms.Resample(sr, 16_000)(wav)
    sr = 16_000

# Forward pass — returns a dict with "input_ids" = (B, L)
with torch.no_grad():
    out = quantizer(wav.unsqueeze(0).to(device))   # (1, 1, T) -> dict
    token_ids = out["input_ids"]                   # LongTensor (1, L)

print("Token IDs shape:", token_ids.shape)

Intended uses & limitations

  • Uses: tokenization for speech LM training; compact storage or streaming of speech as discrete IDs. The tokenizer's design is loosely inspired by human biology.
  • Limitations: trained only on spoken English, so it may not perform as well on other languages or non-speech sounds.
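As a rough illustration of the compact-storage use case, token IDs below 8,192 fit in an unsigned 16-bit integer, so a sequence can be serialized at 2 bytes per token. The sketch uses only the Python standard library; the function names and the little-endian layout are assumptions made for this example, not a format defined by the model.

```python
import struct

VOCAB_SIZE = 8192  # number of codes stated in the model card


def pack_tokens(token_ids):
    """Serialize token IDs as little-endian unsigned 16-bit integers (hypothetical format)."""
    if any(not 0 <= t < VOCAB_SIZE for t in token_ids):
        raise ValueError("token ID outside [0, 8192)")
    return struct.pack(f"<{len(token_ids)}H", *token_ids)


def unpack_tokens(data):
    """Inverse of pack_tokens: bytes back to a list of token IDs."""
    if len(data) % 2:
        raise ValueError("byte length must be even")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


blob = pack_tokens([0, 1, 8191])
print(len(blob))            # 6
print(unpack_tokens(blob))  # [0, 1, 8191]
```

With the Quickstart output, a single utterance could be passed in as token_ids[0].tolist().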

Citation

If you use this tokenizer, please cite:

@inproceedings{tuckute2025cochleartokens,
  title     = {Representing Speech Through Autoregressive Prediction of Cochlear Tokens},
  author    = {Greta Tuckute and Klemen Kotar and Evelina Fedorenko and Daniel Yamins},
  booktitle = {Interspeech 2025},
  year      = {2025},
  pages     = {2180--2184},
  doi       = {10.21437/Interspeech.2025-2044},
  issn      = {2958-1796}
}

Model size: 11.1M params (Safetensors; tensor types: I64, F32)