|
|
--- |
|
|
language: |
|
|
- en |
|
|
library_name: transformers |
|
|
pipeline_tag: feature-extraction |
|
|
tags: |
|
|
- audio |
|
|
- speech |
|
|
- tokenizer |
|
|
- quantizer |
|
|
- cochlear |
|
|
- custom_code |
|
|
license: apache-2.0 |
|
|
pretty_name: WavCoch (8192-code speech tokenizer) |
|
|
--- |
|
|
|
|
|
# WavCochV8192 — 8,192-code speech tokenizer (cochlear tokens) |
|
|
|
|
|
**WavCochV8192** is a biologically inspired, learned **audio quantizer** that maps a raw waveform to **discrete "cochlear tokens".** It serves as the tokenizer for the AuriStream autoregressive speech/language model (e.g., [TuKoResearch/AuriStream1B_librilight_ckpt500k](https://huggingface.co/TuKoResearch/AuriStream1B_librilight_ckpt500k)). The model is trained on LibriSpeech960: it encodes audio into a time–frequency representation ([Cochleagram; Feather et al., 2023 Nat Neuro](https://github.com/jenellefeather/chcochleagram)) and reads out **8,192-way discrete codes** through a low-bit latent bottleneck (lookup-free quantization, LFQ). These tokens can be fed to a transformer LM for **representation learning** and **next-token prediction** (speech continuation).
|
|
|
|
|
> **API at a glance** |
|
|
> - **Input:** mono waveform at 16 kHz (`torch.float32` tensor), shape **(B, 1, T)**
|
|
> - **Output:** token IDs, shape **(B, L)**, returned as a dictionary under key **`"input_ids"`**
|
|
> - Implemented as a `transformers` custom model — load with `trust_remote_code=True`. |
|
|
|
|
|
--- |
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
pip install -U torch torchaudio transformers |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Quickstart — Quantize a waveform into cochlear tokens |
|
|
|
|
|
```python |
|
|
import torch, torchaudio |
|
|
from transformers import AutoModel |
|
|
|
|
|
device = "cuda" if torch.cuda.is_available() else "cpu" |
|
|
|
|
|
# Load the quantizer |
|
|
quantizer = AutoModel.from_pretrained( |
|
|
"TuKoResearch/WavCochV8192", trust_remote_code=True |
|
|
).to(device).eval() |
|
|
|
|
|
# Load & prep audio (mono, 16 kHz) |
|
|
wav, sr = torchaudio.load("sample.wav") |
|
|
if wav.size(0) > 1: # stereo -> mono |
|
|
wav = wav.mean(dim=0, keepdim=True) |
|
|
if sr != 16_000: |
|
|
wav = torchaudio.transforms.Resample(sr, 16_000)(wav) |
|
|
sr = 16_000 |
|
|
|
|
|
# Forward pass — returns a dict with "input_ids" = (B, L) |
|
|
with torch.no_grad(): |
|
|
out = quantizer(wav.unsqueeze(0).to(device)) # (1, 1, T) -> dict |
|
|
token_ids = out["input_ids"] # LongTensor (1, L) |
|
|
|
|
|
print("Token IDs shape:", token_ids.shape) |
|
|
``` |
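Because the quantizer accepts batched input of shape **(B, 1, T)**, several clips can be tokenized in a single forward pass after right-padding them to a common length. A minimal sketch of the padding step (the `pad_batch` helper is illustrative, not part of the model API; note that tokens produced for the padded tail correspond to silence and may need to be trimmed by duration):

```python
import torch

def pad_batch(waves):
    """Right-pad a list of (1, T_i) mono waveforms to a (B, 1, T_max) tensor."""
    t_max = max(w.size(-1) for w in waves)
    return torch.stack(
        [torch.nn.functional.pad(w, (0, t_max - w.size(-1))) for w in waves]
    )

# Two dummy 16 kHz clips of different lengths (1 s and 1.5 s)
waves = [torch.randn(1, 16_000), torch.randn(1, 24_000)]
batch = pad_batch(waves)
print(batch.shape)  # torch.Size([2, 1, 24000])
```

The resulting `batch` can be passed to the quantizer exactly as in the Quickstart, yielding `"input_ids"` of shape **(2, L)**.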
|
|
|
|
|
--- |
|
|
|
|
|
## Intended uses & limitations |
|
|
- **Uses:** tokenization for speech LM training; compact storage/streaming of speech as discrete IDs in a representation loosely inspired by the human cochlea.
|
|
- **Limitations:** trained only on spoken English, so it may not transfer well to other languages or to non-speech audio.
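On the storage point: since every code lies in `[0, 8192)`, each token fits in 16 bits, so sequences can be stored at a quarter of the size of the default `int64` IDs. A hedged sketch with NumPy (the token values and file name here are illustrative):

```python
import numpy as np

# e.g. token_ids = out["input_ids"].cpu().numpy() from the Quickstart
token_ids = np.array([[5, 8191, 42, 1023]], dtype=np.int64)

packed = token_ids.astype(np.uint16)  # lossless: all codes < 2**16
np.save("tokens.npy", packed)

restored = np.load("tokens.npy").astype(np.int64)
assert (restored == token_ids).all()
```

`uint16` halves storage again relative to `int32` and round-trips exactly, since 8,192 codes fit comfortably below the 65,536 limit.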
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this tokenizer please cite: |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{tuckute2025cochleartokens, |
|
|
title = {Representing Speech Through Autoregressive Prediction of Cochlear Tokens}, |
|
|
author = {Greta Tuckute and Klemen Kotar and Evelina Fedorenko and Daniel Yamins}, |
|
|
booktitle = {Interspeech 2025}, |
|
|
year = {2025}, |
|
|
pages = {2180--2184}, |
|
|
doi = {10.21437/Interspeech.2025-2044}, |
|
|
issn = {2958-1796} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Related |
|
|
- **AuriStream LM:** https://huggingface.co/TuKoResearch/AuriStream1B_librilight_ckpt500k |
|
|
- **Org:** https://huggingface.co/TuKoResearch |
|
|
|