WavLM Large — English Accent Classification

This model is a fine-tuned version of microsoft/wavlm-large for classifying English accents from audio. Only the classification head was trained; the WavLM backbone weights are frozen.

Model Details

Base model: microsoft/wavlm-large
Task: Audio Classification
Accents: 8 classes
Training: Classification head only (backbone frozen)
Framework: HuggingFace Transformers

Supported Accents

ID	Label
0	american
1	australian
2	british
3	canadian
4	indian
5	irish
6	jamaican
7	scottish

How to Use

Basic Inference

import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Load model and feature extractor
model_id = "shuyuncci/wavlm-accent"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModelForAudioClassification.from_pretrained(model_id)
model.eval()

# Load audio file (any sample rate or channel count; converted to 16 kHz mono below)
waveform, sample_rate = torchaudio.load("your_audio.wav")

# Resample if needed
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Convert to mono if stereo
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Prepare input
inputs = feature_extractor(
    waveform.squeeze().numpy(),
    sampling_rate=16000,
    return_tensors="pt"
)

# Run inference
with torch.no_grad():
    logits = model(**inputs).logits

# Get predicted label
predicted_id = logits.argmax(dim=-1).item()
predicted_label = model.config.id2label[predicted_id]
print(f"Predicted accent: {predicted_label}")
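Beyond the single top label, it can be useful to see how the probability mass is spread across the accents. The following is a minimal sketch, not part of the original card: `top_k_accents` is a hypothetical helper that applies a softmax to the raw logits and returns the `k` most likely labels. It uses only the standard library, so the logits from the example above would first be converted with `logits.squeeze(0).tolist()`.

```python
import math


def top_k_accents(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs.

    logits: flat list of floats, one per class.
    id2label: mapping from class index to label, e.g. model.config.id2label.
    Hypothetical helper for illustration; not part of the model repository.
    """
    # Numerically stable softmax: subtract the largest logit before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank class indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Label mapping copied from the "Supported Accents" table.
ID2LABEL = {
    0: "american", 1: "australian", 2: "british", 3: "canadian",
    4: "indian", 5: "irish", 6: "jamaican", 7: "scottish",
}

# Made-up logits, for illustration only.
example_logits = [0.1, -0.3, 2.0, 0.0, -1.0, 0.5, -2.0, 1.0]
for label, prob in top_k_accents(example_logits, ID2LABEL, k=3):
    print(f"{label}: {prob:.3f}")
```

With the model loaded as above, the call would look like `top_k_accents(logits.squeeze(0).tolist(), model.config.id2label)`.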

Important Notes

Input format: 16 kHz mono audio. The example above resamples and downmixes files that are not already in this format.
Audio length: The model was trained on clips up to 20 seconds. For longer audio, consider splitting into chunks and aggregating predictions.
Language: This model is designed for English speech only. Performance on non-English audio is not guaranteed.
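The chunk-and-aggregate suggestion for long audio could be sketched as follows. This is one possible approach under stated assumptions, not the author's method: `split_into_chunks` and `aggregate_predictions` are hypothetical helpers, the chunks do not overlap, a trailing chunk shorter than one second is dropped, and predictions are combined by averaging per-chunk softmax probabilities. The code uses only the standard library and works on any 1-D sequence that supports `len()` and slicing (a list or a 1-D NumPy array).

```python
import math

SAMPLE_RATE = 16000   # the model expects 16 kHz input
MAX_SECONDS = 20      # training clips were up to 20 seconds long


def split_into_chunks(samples, chunk_len=SAMPLE_RATE * MAX_SECONDS, min_len=SAMPLE_RATE):
    """Split a 1-D sequence of samples into consecutive, non-overlapping chunks.

    A trailing chunk shorter than min_len samples is dropped (an assumption),
    unless it is the only chunk.
    """
    chunks = [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]
    if len(chunks) > 1 and len(chunks[-1]) < min_len:
        chunks.pop()
    return chunks


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_predictions(chunk_logits):
    """Average per-chunk probabilities and return (best_class_id, mean_probs)."""
    probs = [softmax(logits) for logits in chunk_logits]
    num_classes = len(probs[0])
    mean_probs = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    best_id = max(range(num_classes), key=mean_probs.__getitem__)
    return best_id, mean_probs
```

In practice, each chunk would be passed through the feature extractor and model as in the basic inference example, the resulting `logits.squeeze(0).tolist()` collected into a list, and that list handed to `aggregate_predictions`. Averaging probabilities is only one choice; majority voting over per-chunk labels is a simple alternative.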

Model size: 0.3B params (Safetensors)
Tensor type: F32