PhonoQ 2.0 Multilingual

Framewise phonological feature recognition for multilingual speech.

This model returns phonological probabilities for manner, vowel height, vowel backness, place, and voicing, plus a hard conditional 22-feature representation per frame. It is a modernized successor to the original PhonoQ system: https://github.com/TAriasVergara/PhonoQ

Usage

pip install torch transformers soundfile safetensors

import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_id = "abnerh/phonoq-2.0-multilingual"
audio, sr = sf.read("example.wav")
# Downmix multi-channel audio to mono by averaging the channels.
if audio.ndim > 1:
    audio = audio.mean(axis=1)

processor = AutoFeatureExtractor.from_pretrained(model_id)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt", padding=True)

model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

with torch.no_grad():
    out = model(**inputs)

print(out.features.shape)              # [1, T, 22]
print(out.manner_probabilities.shape)  # [1, T, 9]
print(out.vowel_height.shape)          # [1, T, 3]
print(out.vowel_backness.shape)        # [1, T, 3]
print(out.place_probabilities.shape)   # [1, T, 5]
print(out.voice_probabilities.shape)   # [1, T, 2]
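
The usage code passes the file's native sampling rate straight to the feature extractor. Many speech feature extractors in Transformers expect a fixed rate and raise an error on a mismatch; whether this one does, and which rate it expects, is not stated here. The sketch below assumes a 16 kHz target (TARGET_SR is an assumption, and resample_linear is a hypothetical helper, not part of this model's code). It uses plain linear interpolation, which is enough to illustrate the step but can alias when downsampling; a polyphase or band-limited resampler is preferable in practice.

```python
import numpy as np

TARGET_SR = 16000  # assumption: check the feature extractor's expected rate


def resample_linear(audio, orig_sr, target_sr):
    """Resample a mono signal by linear interpolation (illustrative only)."""
    audio = np.asarray(audio, dtype=np.float32)
    if orig_sr == target_sr or audio.size == 0:
        return audio
    n_out = int(round(audio.shape[0] * target_sr / orig_sr))
    # Positions of the output samples expressed on the input sample grid.
    positions = np.arange(n_out) * (orig_sr / target_sr)
    source_grid = np.arange(audio.shape[0])
    return np.interp(positions, source_grid, audio).astype(np.float32)


# One second of silence recorded at 44.1 kHz becomes 16000 samples.
one_second = np.zeros(44100, dtype=np.float32)
resampled = resample_linear(one_second, 44100, TARGET_SR)
print(resampled.shape)  # (16000,)
```

In the usage code, this would be applied after the mono downmix, with the new rate passed as sampling_rate to the processor.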

Outputs

  • features: hard conditional 22-dimensional features, [B, T, 22]
  • manner_probabilities: [B, T, 9]
  • vowel_height: [B, T, 3]
  • vowel_backness: [B, T, 3]
  • place_probabilities: [B, T, 5]
  • voice_probabilities: [B, T, 2]
  • attention_mask: valid encoder frames, [B, T]
  • feature_names: names for the 22 feature dimensions

Feature order:

silence, stop, nasal, rhotic, fricative, affricate, approximant, lateral, vowel,
high, mid, low, front, central, back,
labial, alveolar, velar, palatal, postalveolar,
voiceless, voiced
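
The feature order above lines up with the documented head sizes (9, 3, 3, 5, 2), so a 22-dimensional frame can be split into groups by slicing. The following is a minimal sketch under that assumption. The group names, GROUP_SLICES, and active_features are illustrative and not part of the model's API, and the 0.5 threshold assumes that an active hard feature is encoded as 1.

```python
# Feature names in the order listed above.
FEATURE_NAMES = [
    "silence", "stop", "nasal", "rhotic", "fricative", "affricate",
    "approximant", "lateral", "vowel",
    "high", "mid", "low",
    "front", "central", "back",
    "labial", "alveolar", "velar", "palatal", "postalveolar",
    "voiceless", "voiced",
]

# Group boundaries inferred from the documented head sizes (9, 3, 3, 5, 2).
# The group names are illustrative labels, not model attributes.
GROUP_SLICES = {
    "manner": slice(0, 9),
    "vowel_height": slice(9, 12),
    "vowel_backness": slice(12, 15),
    "place": slice(15, 20),
    "voicing": slice(20, 22),
}


def active_features(frame, threshold=0.5):
    """Return {group: [feature names]} for entries of one frame above threshold."""
    if len(frame) != len(FEATURE_NAMES):
        raise ValueError(f"expected {len(FEATURE_NAMES)} values, got {len(frame)}")
    result = {}
    for group, group_slice in GROUP_SLICES.items():
        names = FEATURE_NAMES[group_slice]
        values = frame[group_slice]
        result[group] = [name for name, value in zip(names, values) if value > threshold]
    return result


# Hypothetical frame: a voiced alveolar nasal.
example = [0.0] * 22
for name in ("nasal", "alveolar", "voiced"):
    example[FEATURE_NAMES.index(name)] = 1.0

print(active_features(example))
```

With real model output, a single frame could be passed as active_features(out.features[0, t].tolist()) for some frame index t.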

Viewing Probabilities

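The following snippet prints the most probable manner class and its probability for every frame between the first and last non-silence frames. Frame indices refer to valid (unpadded) encoder frames.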

manner_labels = [
    "silence", "stop", "nasal", "rhotic", "fricative", "affricate",
    "approximant", "lateral", "vowel",
]

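# Take the first item in the batch and keep only valid (unpadded) frames.
manner = out.manner_probabilities[0]
mask = out.attention_mask[0].bool()
manner = manner[mask]

# Most probable manner class per frame; index 0 is "silence".
best_manner = manner.argmax(dim=-1)
non_silence = (best_manner != 0).nonzero(as_tuple=True)[0]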

if len(non_silence) == 0:
    print("No non-silence frames found.")
else:
    start = int(non_silence[0])
    end = int(non_silence[-1]) + 1

    print(f"Non-silence frame range: {start}-{end - 1}")
    print()

    for frame_idx in range(start, end):
        probs = manner[frame_idx]
        best = int(probs.argmax())
        print(f"{frame_idx:03d}  {manner_labels[best]:10s}  {float(probs[best]):.3f}")
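
Per-frame output is often easier to read once consecutive frames with the same label are merged into segments. The sketch below shows one way to do that with run-length grouping. frames_to_segments is a hypothetical helper, and frame_shift_s is an optional, assumed frame duration in seconds: the encoder's frame rate is not stated here, so times are only as accurate as the value supplied.

```python
from itertools import groupby


def frames_to_segments(labels, frame_shift_s=None):
    """Collapse per-frame labels into (label, start, end) runs; end is exclusive.

    Without frame_shift_s, start and end are frame indices. With it, they are
    multiplied by the assumed frame duration to give times in seconds.
    """
    segments = []
    index = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        start, end = index, index + length
        if frame_shift_s is not None:
            segments.append((label, start * frame_shift_s, end * frame_shift_s))
        else:
            segments.append((label, start, end))
        index = end
    return segments


# Hypothetical label sequence standing in for the argmax of manner probabilities.
frame_labels = [
    "silence", "silence", "stop", "vowel", "vowel", "vowel", "nasal", "silence",
]

for label, start, end in frames_to_segments(frame_labels):
    print(f"{label:10s} frames {start}-{end - 1}")
```

With the variables from the snippet above, the input could be built as [manner_labels[i] for i in best_manner.tolist()].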

CLI

This repository includes best.ckpt for PhonoQ CLI compatibility:

pip install git+https://github.com/abnerLing/PhonoQ-2.0.git
phonoq predict example.wav \
  --model abnerh/phonoq-2.0-multilingual \
  --outdir outputs \
  --pretty

Notes

This model uses custom Transformers code and must be loaded with trust_remote_code=True.

The multilingual checkpoint is intended for use across English, Spanish, German, and Czech speech.

Model size: 0.3B params (Safetensors, tensor type F32).