PhonoQ 2.0 Multilingual
Framewise phonological feature recognition for multilingual speech.
This model returns phonological probabilities for manner, vowel height, vowel backness, place, and voicing, plus a hard conditional 22-feature representation per frame. It is a modernized successor to the original PhonoQ system: https://github.com/TAriasVergara/PhonoQ
Usage
pip install torch transformers soundfile safetensors
import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_id = "abnerh/phonoq-2.0-multilingual"

# Load the audio and downmix multi-channel recordings to mono.
audio, sr = sf.read("example.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)

processor = AutoFeatureExtractor.from_pretrained(model_id)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

# The model ships custom code, so trust_remote_code=True is required.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

with torch.no_grad():
    out = model(**inputs)

print(out.features.shape)              # [1, T, 22]
print(out.manner_probabilities.shape)  # [1, T, 9]
print(out.vowel_height.shape)          # [1, T, 3]
print(out.vowel_backness.shape)        # [1, T, 3]
print(out.place_probabilities.shape)   # [1, T, 5]
print(out.voice_probabilities.shape)   # [1, T, 2]
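Wav2Vec2-style feature extractors typically raise an error when the `sampling_rate` passed in differs from the rate they were configured with, so a file recorded at, say, 44.1 kHz needs resampling before the call above. The following is a minimal sketch, not part of the PhonoQ code. It assumes SciPy is installed (it is not in the install line above) and that the expected rate is 16 kHz, which is common for wav2vec2 models but not stated on this card; in practice, read the value from `processor.sampling_rate`. `to_model_rate` is a hypothetical helper name.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def to_model_rate(audio: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """Resample mono audio from sr to target_sr (hypothetical helper).

    target_sr=16000 is an assumption; prefer processor.sampling_rate.
    """
    if sr == target_sr:
        return audio.astype(np.float32)
    # Reduce the rate ratio to the smallest integer up/down factors.
    g = gcd(sr, target_sr)
    resampled = resample_poly(audio, target_sr // g, sr // g)
    return resampled.astype(np.float32)


# One second of a 440 Hz tone at 44.1 kHz, resampled to 16 kHz.
tone = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)
resampled = to_model_rate(tone, 44100)
print(resampled.shape)
```

After resampling, pass the new rate to the processor, for example `processor(resampled, sampling_rate=16000, return_tensors="pt", padding=True)`.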
Outputs
- features: hard conditional 22-dimensional features, [B, T, 22]
- manner_probabilities: [B, T, 9]
- vowel_height: [B, T, 3]
- vowel_backness: [B, T, 3]
- place_probabilities: [B, T, 5]
- voice_probabilities: [B, T, 2]
- attention_mask: valid encoder frames, [B, T]
- feature_names: names for the 22 feature dimensions
Feature order:
silence, stop, nasal, rhotic, fricative, affricate, approximant, lateral, vowel,
high, mid, low, front, central, back,
labial, alveolar, velar, palatal, postalveolar,
voiceless, voiced
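The feature order above implies fixed group boundaries inside the 22-dimensional vector (9 manner + 3 height + 3 backness + 5 place + 2 voicing). The sketch below shows one way to turn a single frame into named features per group. It is an illustration only: the group keys and the `active_features` helper are hypothetical, the example frame is hand-built rather than model output, and treating any value greater than zero as "active" is an assumption about how the hard features are encoded.

```python
# Feature names in the order listed on this card.
FEATURE_NAMES = [
    "silence", "stop", "nasal", "rhotic", "fricative", "affricate",
    "approximant", "lateral", "vowel",
    "high", "mid", "low",
    "front", "central", "back",
    "labial", "alveolar", "velar", "palatal", "postalveolar",
    "voiceless", "voiced",
]

# Hypothetical group keys; boundaries follow the listed order.
GROUP_SLICES = {
    "manner": slice(0, 9),
    "vowel_height": slice(9, 12),
    "vowel_backness": slice(12, 15),
    "place": slice(15, 20),
    "voicing": slice(20, 22),
}


def active_features(frame):
    """Return, per group, the names of features with a value above zero."""
    result = {}
    for group, sl in GROUP_SLICES.items():
        names = FEATURE_NAMES[sl]
        values = frame[sl]
        result[group] = [n for n, v in zip(names, values) if v > 0]
    return result


# Hand-built example frame (not model output): a voiced high front vowel.
frame = [0.0] * 22
for name in ("vowel", "high", "front", "voiced"):
    frame[FEATURE_NAMES.index(name)] = 1.0

print(active_features(frame))
```

With real model output, a frame could be obtained as `out.features[0, t].tolist()` for some frame index `t`.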
Viewing Probabilities
The following snippet prints the most likely manner class, with its probability, for each frame between the first and last non-silence frames.
manner_labels = [
    "silence", "stop", "nasal", "rhotic", "fricative", "affricate",
    "approximant", "lateral", "vowel",
]

manner = out.manner_probabilities[0]
mask = out.attention_mask[0].bool()
manner = manner[mask]

best_manner = manner.argmax(dim=-1)
non_silence = (best_manner != 0).nonzero(as_tuple=True)[0]

if len(non_silence) == 0:
    print("No non-silence frames found.")
else:
    start = int(non_silence[0])
    end = int(non_silence[-1]) + 1
    print(f"Non-silence frame range: {start}-{end - 1}")
    print()
    for frame_idx in range(start, end):
        probs = manner[frame_idx]
        best = int(probs.argmax())
        print(f"{frame_idx:03d} {manner_labels[best]:10s} {float(probs[best]):.3f}")
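Per-frame output gets long quickly, so it can help to collapse consecutive frames with the same top label into segments. The sketch below does this with plain run-length grouping. It is an illustration, not part of the PhonoQ code: `frames_to_segments` is a hypothetical helper, and the label sequence is hand-written rather than produced by the model.

```python
from itertools import groupby


def frames_to_segments(labels):
    """Collapse a per-frame label sequence into (label, start, end) runs.

    start is inclusive and end is exclusive, both in frame indices.
    """
    segments = []
    position = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append((label, position, position + length))
        position += length
    return segments


# Hand-written labels for illustration; with the model they could be built as
# [manner_labels[i] for i in best_manner.tolist()].
labels = ["silence", "silence", "stop", "vowel", "vowel", "vowel", "nasal", "silence"]
for label, start, end in frames_to_segments(labels):
    print(f"{start:03d}-{end - 1:03d} {label}")
```

Converting frame indices to seconds requires the encoder's frame stride, which this card does not state. A 20 ms stride is common for wav2vec2 encoders, but verify it against the model configuration before relying on it.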
CLI
This repository includes best.ckpt for PhonoQ CLI compatibility:
pip install git+https://github.com/abnerLing/PhonoQ-2.0.git
phonoq predict example.wav \
--model abnerh/phonoq-2.0-multilingual \
--outdir outputs \
--pretty
Notes
This model uses custom Transformers code and must be loaded with trust_remote_code=True.
The multilingual checkpoint is intended for use across English, Spanish, German, and Czech speech.