WavLM Large: English Accent Classification

This model is a fine-tuned version of microsoft/wavlm-large that classifies English accents from audio. Only the classification head was trained; the WavLM backbone weights were kept frozen.

Model Details

  • Base model: microsoft/wavlm-large
  • Task: Audio Classification
  • Accents: 8 classes
  • Training: Classification head only (backbone frozen)
  • Framework: HuggingFace Transformers
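
The card does not include the training script, so the following is only a minimal sketch of what "classification head only (backbone frozen)" usually means in PyTorch: gradients are disabled for the backbone parameters and only the head stays trainable. The helper name `freeze_backbone`, the attribute name `wavlm`, and the toy `TinyClassifier` stand-in are assumptions made for illustration; they are not taken from the author's code. Transformers' WavLM sequence-classification class also offers a `freeze_base_model()` method for the same purpose, though whether the author used it is not stated.

```python
import torch.nn as nn


def freeze_backbone(model: nn.Module, backbone_attr: str = "wavlm") -> int:
    """Disable gradients for the backbone and return the number of
    parameters that remain trainable.

    `backbone_attr` is an assumption: the name of the attribute that
    holds the pretrained encoder on the classification model.
    """
    backbone = getattr(model, backbone_attr)
    for param in backbone.parameters():
        param.requires_grad = False
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


class TinyClassifier(nn.Module):
    """Toy stand-in (hypothetical) so the sketch runs without downloading weights."""

    def __init__(self) -> None:
        super().__init__()
        self.wavlm = nn.Linear(16, 8)      # plays the role of the frozen backbone
        self.classifier = nn.Linear(8, 8)  # head with 8 accent classes


toy = TinyClassifier()
trainable = freeze_backbone(toy)
print(trainable)  # 72: the head's 8*8 weights plus 8 biases
```

An optimizer built from `filter(lambda p: p.requires_grad, model.parameters())` would then update only the head.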

Supported Accents

ID Label
0 american
1 australian
2 british
3 canadian
4 indian
5 irish
6 jamaican
7 scottish

How to Use

Basic Inference

import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Load model and feature extractor
model_id = "shuyuncci/wavlm-accent"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModelForAudioClassification.from_pretrained(model_id)
model.eval()

# Load audio file (the model expects 16 kHz mono; other inputs are converted below)
waveform, sample_rate = torchaudio.load("your_audio.wav")

# Resample if needed
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Convert to mono if stereo
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Prepare input
inputs = feature_extractor(
    waveform.squeeze().numpy(),
    sampling_rate=16000,
    return_tensors="pt"
)

# Run inference
with torch.no_grad():
    logits = model(**inputs).logits

# Get predicted label
predicted_id = logits.argmax(dim=-1).item()
predicted_label = model.config.id2label[predicted_id]
print(f"Predicted accent: {predicted_label}")
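
The snippet above reports only the single most likely accent. If a ranked list with scores is more useful, the logits can be passed through a softmax and the top entries kept. The sketch below assumes logits of shape `(1, 8)` as returned for one clip; the helper name `top_k_accents`, the hard-coded `ID2LABEL` copy of the table above, and the example logits are illustrative, not model output. In practice `model.config.id2label` would be passed in instead.

```python
import torch

# Copy of the label table from this card; normally use model.config.id2label.
ID2LABEL = {
    0: "american", 1: "australian", 2: "british", 3: "canadian",
    4: "indian", 5: "irish", 6: "jamaican", 7: "scottish",
}


def top_k_accents(logits: torch.Tensor, id2label: dict, k: int = 3):
    """Return the k most likely (label, probability) pairs, best first."""
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    scores, ids = probs.topk(k)
    return [(id2label[i], s) for i, s in zip(ids.tolist(), scores.tolist())]


# Made-up logits for a single clip, shape (1, 8).
example_logits = torch.tensor([[0.1, 0.0, 2.0, 0.3, 1.0, 0.2, -1.0, 0.5]])
ranked = top_k_accents(example_logits, ID2LABEL)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Softmax scores from a classifier are not calibrated probabilities, so they are best read as a relative ranking.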

Important Notes

  • Input format: 16kHz mono WAV audio
  • Audio length: The model was trained on clips up to 20 seconds. For longer audio, consider splitting into chunks and aggregating predictions.
  • Language: This model is designed for English speech only. Performance on non-English audio is not guaranteed.
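
For audio longer than 20 seconds, one possible way to do the chunk-and-aggregate step mentioned above is to cut the waveform into fixed-length pieces, classify each piece, and average the per-chunk probabilities weighted by chunk length. This is a sketch under assumptions, not the author's method: the function names are hypothetical, `predict_probs` stands for any callable that maps a 1-D array of samples to a probability vector (for example, a wrapper around the inference code in "Basic Inference"), and length-weighted averaging is just one reasonable aggregation rule.

```python
import numpy as np


def split_into_chunks(samples, sample_rate=16000, chunk_seconds=20.0):
    """Cut a 1-D waveform into consecutive chunks of at most chunk_seconds."""
    chunk_len = int(sample_rate * chunk_seconds)
    return [samples[start:start + chunk_len]
            for start in range(0, len(samples), chunk_len)]


def aggregate_predictions(chunk_probs, chunk_lengths):
    """Length-weighted average of per-chunk probability vectors."""
    probs = np.asarray(chunk_probs, dtype=np.float64)
    weights = np.asarray(chunk_lengths, dtype=np.float64)
    return (probs * weights[:, None]).sum(axis=0) / weights.sum()


def classify_long_audio(samples, predict_probs, sample_rate=16000,
                        chunk_seconds=20.0):
    """Classify each chunk with predict_probs and combine the results."""
    chunks = split_into_chunks(samples, sample_rate, chunk_seconds)
    probs = [predict_probs(chunk) for chunk in chunks]
    mean_probs = aggregate_predictions(probs, [len(c) for c in chunks])
    return int(mean_probs.argmax()), mean_probs


# Demo with 45 s of silence and a fake two-class predictor (illustration only).
audio = np.zeros(45 * 16000, dtype=np.float32)
demo_chunks = split_into_chunks(audio)
fake_outputs = iter([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
best_id, mean_probs = classify_long_audio(audio, lambda chunk: next(fake_outputs))
print(len(demo_chunks), best_id)  # 3 chunks (20 s, 20 s, 5 s); class 1 wins
```

Weighting by length keeps a short trailing chunk from counting as much as a full 20-second one; very short final chunks could also simply be dropped.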