---
library_name: transformers
tags:
- accent
license: cc-by-sa-4.0
language:
- en
metrics:
- f1
base_model:
- facebook/mms-lid-256
pipeline_tag: audio-classification
---
|
|
|
|
|
# Model Card for mms-accent-predict
|
|
|
|
|
Classifies voice input into 11 English accents.
|
|
|
|
|
## Model Details |
|
|
|
|
|
This model is a fine-tuned version of facebook/mms-lid-256 on the [Speech Accent Archive dataset](https://accent.gmu.edu/).
|
|
|
|
|
It classifies voice into 11 English accents:

- "0": "African"
- "1": "Australian"
- "2": "British"
- "3": "EastAsian"
- "4": "EasternEuropean"
- "5": "LatinAmerican"
- "6": "MiddleEastern"
- "7": "NorthAmerican"
- "8": "SouthAsian"
- "9": "SouthEastAsian"
- "10": "WesternEuropean"
|
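If the uploaded checkpoint stores this mapping in its config (an assumption; the list above is authoritative), it can also be read programmatically:

```python
from transformers import AutoConfig

# Assumes the checkpoint's config carries the id2label mapping listed above
config = AutoConfig.from_pretrained("vkao8264/mms-accent-predict")
print(config.id2label)  # {0: "African", 1: "Australian", ...}
```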
## Uses |
|
|
|
|
|
|
|
|
|
|
Because every training sample is a recording of the same elicitation paragraph, the input audio should contain the following passage for best prediction results:
|
|
|
|
|
> Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station. |
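Inputs are padded or truncated to 15 seconds in the preprocessing shown below, so a quick metadata check (an illustrative helper, not part of the model) can confirm how much of a clip will actually be used:

```python
import torchaudio

# Illustrative check: warn if a clip exceeds the 15 s preprocessing limit used below
info = torchaudio.info("audio_input.mp3")
duration_s = info.num_frames / info.sample_rate
if duration_s > 15:
    print(f"Clip is {duration_s:.1f} s; only the first 15 s will be used.")
```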
|
|
|
|
|
|
|
|
### Direct Use |
|
|
|
|
|
|
|
|
|
|
You can load the model with the Transformers library using the ID `vkao8264/mms-accent-predict`:
|
|
|
|
|
```python
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import torchaudio
import torch

sample_rate = 16000    # MMS is based on Wav2Vec2, which expects 16 kHz audio
max_audio_length = 15  # seconds; inputs are padded or truncated to this length

def load_and_preprocess_audio(path):
    waveform, sr = torchaudio.load(path)

    # Resample to 16 kHz because MMS is based on Wav2Vec2
    if sr != sample_rate:
        waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=sample_rate)(waveform)

    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Remove the channel dimension to get a 1-D waveform
    waveform = waveform.squeeze(0)

    inputs = feature_extractor(
        waveform,
        sampling_rate=sample_rate,
        return_tensors="pt",
        padding="max_length",
        max_length=sample_rate * max_audio_length,
        truncation=True,
    )
    return inputs.input_values

id_to_class = {
    0: "African",
    1: "Australian",
    2: "British",
    3: "EastAsian",
    4: "EasternEuropean",
    5: "LatinAmerican",
    6: "MiddleEastern",
    7: "NorthAmerican",
    8: "SouthAsian",
    9: "SouthEastAsian",
    10: "WesternEuropean",
}

model = AutoModelForAudioClassification.from_pretrained("vkao8264/mms-accent-predict")
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/mms-lid-256")
model.eval()

sample = "audio_input.mp3"
inputs = load_and_preprocess_audio(sample)

# Run inference without tracking gradients
with torch.no_grad():
    predictions = model(inputs)
pred_label = predictions.logits.argmax(dim=-1).item()

print(id_to_class[pred_label])
```
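Alternatively, the high-level `pipeline` API should also work with this checkpoint, since it is tagged for audio classification. This is a sketch; note that the pipeline applies its own preprocessing rather than the fixed 15-second padding/truncation above:

```python
from transformers import pipeline

# Minimal sketch using the high-level API; preprocessing may differ slightly
# from the manual example above (no fixed 15 s padding/truncation).
classifier = pipeline("audio-classification", model="vkao8264/mms-accent-predict")
print(classifier("audio_input.mp3"))  # [{"label": ..., "score": ...}, ...]
```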
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
|
|
|
|
|
The training data consists of about 2,000 unique audio samples from the Speech Accent Archive, downloaded from [Kaggle](https://www.kaggle.com/datasets/rtatman/speech-accent-archive/data).
The data was further split into training and validation sets of 1,698 and 425 samples, respectively.
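The exact splitting procedure is not published here; a minimal sketch of an equivalent stratified split (hypothetical file names, fixed seed) could look like:

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the archive's clips and their accent labels
files = [f"recordings/sample_{i}.mp3" for i in range(2123)]
labels = [i % 11 for i in range(2123)]  # 11 accent classes

# test_size=425 reproduces the 1698 / 425 split reported above
train_files, val_files, y_train, y_val = train_test_split(
    files, labels, test_size=425, stratify=labels, random_state=42
)
print(len(train_files), len(val_files))  # 1698 425
```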
|
|
|
|
|
## Evaluation |
|
|
|
|
|
|
|
|
|
|
F1 score on the validation set: 0.86 (see the confusion matrix below).
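The averaging method is not stated; the following sketch assumes macro averaging over the 11 classes, with hypothetical label arrays:

```python
from sklearn.metrics import f1_score

# Hypothetical true and predicted class ids for a few validation samples
y_true = [7, 2, 10, 7, 3]
y_pred = [7, 2, 2, 7, 3]
print(f1_score(y_true, y_pred, average="macro"))
```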
|
|
|
|
|
 |
|
|
|