---
library_name: transformers
tags:
- accent
license: cc-by-sa-4.0
language:
- en
metrics:
- f1
base_model:
- facebook/mms-lid-256
pipeline_tag: audio-classification
---
# Model Card for vkao8264/mms-accent-predict
Classifies voice input into 11 English accents.
## Model Details
This model is a fine-tuned version of facebook/mms-lid-256 on the [Speech Accent Archive dataset](https://accent.gmu.edu/).
It classifies voice input into 11 English accents:\
"0": "African"\
"1": "Australian"\
"2": "British"\
"3": "EastAsian"\
"4": "EasternEuropean"\
"5": "LatinAmerican"\
"6": "MiddleEastern"\
"7": "NorthAmerican"\
"8": "SouthAsian"\
"9": "SouthEastAsian"\
"10": "WesternEuropean"
## Uses
Because of the constraints of the training dataset, the input audio should contain the following passage for best prediction results:
> Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.
### Direct Use
You can load the model from the Hugging Face Hub using the ID `vkao8264/mms-accent-predict` with the Transformers library:
```python
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import torchaudio
import torch

# Mapping from predicted class IDs to accent labels
id_to_class = {
    0: "African",
    1: "Australian",
    2: "British",
    3: "EastAsian",
    4: "EasternEuropean",
    5: "LatinAmerican",
    6: "MiddleEastern",
    7: "NorthAmerican",
    8: "SouthAsian",
    9: "SouthEastAsian",
    10: "WesternEuropean",
}

sample_rate = 16000    # MMS expects 16 kHz input
max_audio_length = 15  # maximum clip length in seconds

model = AutoModelForAudioClassification.from_pretrained("vkao8264/mms-accent-predict")
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/mms-lid-256")

def load_and_preprocess_audio(path):
    waveform, sr = torchaudio.load(path)
    # Resample to 16 kHz because MMS is based on Wav2Vec2
    if sr != sample_rate:
        waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=sample_rate)(waveform)
    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    # Remove the channel dimension so the extractor sees a 1-D signal
    waveform = waveform.squeeze(0)
    inputs = feature_extractor(
        waveform,
        sampling_rate=sample_rate,
        return_tensors="pt",
        padding="max_length",
        max_length=sample_rate * max_audio_length,
        truncation=True,
    )
    return inputs.input_values

sample = "audio_input.mp3"
inputs = load_and_preprocess_audio(sample)
with torch.no_grad():
    predictions = model(inputs)
pred_label = torch.argmax(predictions.logits).item()
print(id_to_class[pred_label])
```
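If you want per-class confidence scores rather than just the top label, one option is to apply a softmax over the logits. A minimal sketch building on the variables from the snippet above:

```python
import torch.nn.functional as F

# Convert logits to probabilities and rank all 11 accents by score
probs = F.softmax(predictions.logits, dim=-1).squeeze(0)
for idx, p in sorted(enumerate(probs.tolist()), key=lambda x: -x[1]):
    print(f"{id_to_class[idx]}: {p:.3f}")
```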
## Training Details
### Training Data
The training data consists of about 2,000 unique audio samples from the Speech Accent Archive, downloaded from [Kaggle](https://www.kaggle.com/datasets/rtatman/speech-accent-archive/data).
The data was then split into a training set of 1,698 samples and a validation set of 425 samples.
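For reference, a split of roughly that size corresponds to an 80/20 partition, which can be reproduced with, for example, the `datasets` library's `train_test_split`. This is an illustrative sketch only; the actual loading path, column layout, and seed used for training are not documented here:

```python
from datasets import load_dataset, Audio

# Illustrative: load the downloaded audio files as a Dataset (directory name is a placeholder)
dataset = load_dataset("audiofolder", data_dir="speech-accent-archive")["train"]
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# An 80/20 split gives approximately the 1,698 / 425 sizes reported above
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))
```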
## Evaluation
F1 score on the validation set: 0.86
![Evaluation results](mms_eval.png)
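For comparison against your own runs, a validation F1 can be computed with, e.g., scikit-learn's `f1_score`, given predicted and ground-truth class IDs for the validation set. The `y_true`/`y_pred` arrays below are placeholders, and the averaging method behind the reported 0.86 is not documented, so the weighted average here is an assumption:

```python
from sklearn.metrics import f1_score

# Placeholder ground-truth and predicted class IDs for a validation set
y_true = [7, 2, 10, 8]
y_pred = [7, 2, 5, 8]

# Weighted averaging accounts for class imbalance across the 11 accents
# (assumption: the reported score may use a different averaging method)
print(f1_score(y_true, y_pred, average="weighted"))
```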