---
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
tags:
- audio
- speech
- tokenizer
- quantizer
- cochlear
- custom_code
license: apache-2.0
pretty_name: WavCoch (8192-code speech tokenizer)
---

# WavCochV8192 — 8,192-code speech tokenizer (cochlear tokens)

**WavCochV8192** is a biologically inspired, learned **audio quantizer** that maps a raw waveform to discrete **"cochlear tokens."** It serves as the tokenizer for the AuriStream autoregressive speech/language model (e.g., [TuKoResearch/AuriStream1B_librilight_ckpt500k](https://huggingface.co/TuKoResearch/AuriStream1B_librilight_ckpt500k)).

The model is trained on LibriSpeech960. It encodes audio into a time–frequency representation ([cochleagram; Feather et al., 2023, Nat. Neurosci.](https://github.com/jenellefeather/chcochleagram)) and reads out **8,192-way discrete codes** through a low-bit latent bottleneck (LFQ). These tokens can be fed to a transformer LM for **representation learning** and **next-token prediction** (speech continuation).

> **API at a glance**
> - **Input:** mono waveform at 16 kHz (PyTorch `float32` tensor), shape **(B, 1, T)**
> - **Output:** token IDs, shape **(B, L)**, returned as a dictionary under the key **`"input_ids"`**
> - Implemented as a `transformers` custom model — load with `trust_remote_code=True`.
---

## Installation

```bash
pip install -U torch torchaudio transformers
```

---

## Quickstart — Quantize a waveform into cochlear tokens

```python
import torch
import torchaudio
from transformers import AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the quantizer
quantizer = AutoModel.from_pretrained(
    "TuKoResearch/WavCochV8192", trust_remote_code=True
).to(device).eval()

# Load & prep audio (mono, 16 kHz)
wav, sr = torchaudio.load("sample.wav")
if wav.size(0) > 1:  # stereo -> mono
    wav = wav.mean(dim=0, keepdim=True)
if sr != 16_000:
    wav = torchaudio.transforms.Resample(sr, 16_000)(wav)
    sr = 16_000

# Forward pass — returns a dict with "input_ids" of shape (B, L)
with torch.no_grad():
    out = quantizer(wav.unsqueeze(0).to(device))  # (1, 1, T) -> dict
token_ids = out["input_ids"]                      # LongTensor (1, L)
print("Token IDs shape:", token_ids.shape)
```

---

## Intended uses & limitations

- **Uses:** tokenization for speech LM training; compact storage/streaming of speech as discrete IDs, loosely inspired by human auditory processing.
- **Limitations:** trained only on spoken English, so performance may degrade on other languages and on non-speech audio.

---

## Citation

If you use this tokenizer, please cite:

```bibtex
@inproceedings{tuckute2025cochleartokens,
  title     = {Representing Speech Through Autoregressive Prediction of Cochlear Tokens},
  author    = {Greta Tuckute and Klemen Kotar and Evelina Fedorenko and Daniel Yamins},
  booktitle = {Interspeech 2025},
  year      = {2025},
  pages     = {2180--2184},
  doi       = {10.21437/Interspeech.2025-2044},
  issn      = {2958-1796}
}
```

---

## Related

- **AuriStream LM:** https://huggingface.co/TuKoResearch/AuriStream1B_librilight_ckpt500k
- **Org:** https://huggingface.co/TuKoResearch
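---

## Downstream sketch — consuming the token IDs

Because the quantizer returns ordinary `LongTensor` IDs in `[0, 8192)`, a downstream speech LM can consume them like any text-token sequence. The sketch below is purely illustrative (the hidden size and sequence length are made up, not AuriStream's actual configuration); it shows IDs passing through a standard embedding table:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 8192  # matches the tokenizer's 8,192-code codebook
D_MODEL = 512      # illustrative hidden size only

embed = nn.Embedding(VOCAB_SIZE, D_MODEL)

# Stand-in for quantizer output: (B, L) token IDs.
token_ids = torch.randint(0, VOCAB_SIZE, (1, 250), dtype=torch.long)

hidden = embed(token_ids)  # (1, 250, 512), ready for a transformer stack
print(hidden.shape)
```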