---
license: mit
base_model: microsoft/wavlm-large
tags:
- audio
- speech
- wavlm
- ctc
- phone-recognition
- arpabet
---
# HuPER Recognizer (ARPAbet phone recognition)
A CTC phone recognizer fine-tuned from **WavLM-Large** that maps **16 kHz** speech audio to an **ARPAbet** phone sequence.
See the HuPER paper for details: **arXiv:2602.01634**.
## Quickstart
```bash
pip install -U transformers torchaudio
```
```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, WavLMForCTC
repo_id = "huper29/huper_recognizer"
processor = Wav2Vec2Processor.from_pretrained(repo_id)
model = WavLMForCTC.from_pretrained(repo_id)
model.eval()
# Load audio, downmix to mono, and resample to the model's 16 kHz rate
waveform, sr = torchaudio.load("sample.wav")
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
if sr != 16000:
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: take the argmax path, drop blanks, merge repeats
pred_ids = torch.argmax(logits, dim=-1)[0].tolist()
blank_id = processor.tokenizer.pad_token_id
phone_tokens = []
prev = None
for token_id in pred_ids:
    if token_id != blank_id and token_id != prev:
        token = model.config.id2label.get(token_id, processor.tokenizer.convert_ids_to_tokens(token_id))
        if token not in {"<PAD>", "<UNK>", "<BOS>", "<EOS>", "|"}:
            phone_tokens.append(token)
    prev = token_id
print(" ".join(phone_tokens))
```
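The decoding loop above is a greedy CTC collapse: merge consecutive repeated IDs, then drop blank frames. As a minimal standalone sketch (the helper name is illustrative, not part of this repo), the same logic can be isolated and tested without loading the model:

```python
def ctc_collapse(pred_ids, blank_id):
    """Greedy CTC collapse: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for token_id in pred_ids:
        if token_id != blank_id and token_id != prev:
            out.append(token_id)
        prev = token_id  # updated every frame so repeats separated by a blank survive
    return out

# With blank_id = 0, the path [5, 5, 0, 1, 0, 7, 9] collapses to [5, 1, 7, 9]
print(ctc_collapse([5, 5, 0, 1, 0, 7, 9], blank_id=0))  # → [5, 1, 7, 9]
```

Note that `prev` is updated on every frame, including blanks; this is what allows a genuine repeated phone (e.g. two identical phones separated by a blank) to be kept rather than merged.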
## Citation
```bibtex
@article{guo2026huper,
title = {HuPER: A Human-Inspired Framework for Phonetic Perception},
author = {Guo, Chenxu and Lian, Jiachen and Liu, Yisi and Huang, Baihe and Narayanan, Shriyaa and Cho, Cheol Jun and Anumanchipalli, Gopala},
journal = {arXiv preprint arXiv:2602.01634},
year = {2026}
}
```