---
library_name: transformers
tags:
- accent
license: cc-by-sa-4.0
language:
- en
metrics:
- f1
base_model:
- facebook/mms-lid-256
pipeline_tag: audio-classification
---
# Model Card for vkao8264/mms-accent-predict
Classifies voice input into 11 English accents.
## Model Details
This model is a fine-tuned version of facebook/mms-lid-256 on the [Speech Accent Archive dataset](https://accent.gmu.edu/).
It classifies voice input into 11 English accents:\
"0": "African"\
"1": "Australian"\
"2": "British"\
"3": "EastAsian"\
"4": "EasternEuropean"\
"5": "LatinAmerican"\
"6": "MiddleEastern"\
"7": "NorthAmerican"\
"8": "SouthAsian"\
"9": "SouthEastAsian"\
"10": "WesternEuropean"
## Uses
Because of the constraints of the training dataset, the input audio should contain the following passage for best prediction results:
> Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.
### Direct Use
You can load the model from the Hugging Face Hub using the ID `vkao8264/mms-accent-predict` with the Transformers library:
```python
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import torchaudio
import torch

# Mapping from predicted class IDs to accent labels
id_to_class = {
    0: "African",
    1: "Australian",
    2: "British",
    3: "EastAsian",
    4: "EasternEuropean",
    5: "LatinAmerican",
    6: "MiddleEastern",
    7: "NorthAmerican",
    8: "SouthAsian",
    9: "SouthEastAsian",
    10: "WesternEuropean",
}

sample_rate = 16000    # MMS expects 16 kHz input
max_audio_length = 15  # maximum clip length in seconds

model = AutoModelForAudioClassification.from_pretrained("vkao8264/mms-accent-predict")
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/mms-lid-256")

def load_and_preprocess_audio(path):
    waveform, sr = torchaudio.load(path)
    # Resample to 16 kHz because MMS is based on Wav2Vec2
    if sr != sample_rate:
        waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=sample_rate)(waveform)
    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    # Remove the channel dimension so the extractor sees a 1-D signal
    waveform = waveform.squeeze(0)
    inputs = feature_extractor(
        waveform,
        sampling_rate=sample_rate,
        return_tensors="pt",
        padding="max_length",
        max_length=sample_rate * max_audio_length,
        truncation=True,
    )
    return inputs.input_values

sample = "audio_input.mp3"
inputs = load_and_preprocess_audio(sample)
with torch.no_grad():
    predictions = model(inputs)
pred_label = torch.argmax(predictions.logits).item()
print(id_to_class[pred_label])
```
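If you want per-class confidence scores rather than just the top label, one option is to apply a softmax over the logits. A minimal sketch building on the variables from the snippet above:

```python
import torch.nn.functional as F

# Convert logits to probabilities and rank all 11 accents by score
probs = F.softmax(predictions.logits, dim=-1).squeeze(0)
for idx, p in sorted(enumerate(probs.tolist()), key=lambda x: -x[1]):
    print(f"{id_to_class[idx]}: {p:.3f}")
```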
## Training Details
### Training Data
The training data consists of about 2,000 unique audio samples from the Speech Accent Archive, downloaded from [Kaggle](https://www.kaggle.com/datasets/rtatman/speech-accent-archive/data).
The data was then split into a training set of 1,698 samples and a validation set of 425 samples.
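For reference, a split of roughly that size corresponds to an 80/20 partition, which can be reproduced with, for example, the `datasets` library's `train_test_split`. This is an illustrative sketch only; the actual loading path, column layout, and seed used for training are not documented here:

```python
from datasets import load_dataset, Audio

# Illustrative: load the downloaded audio files as a Dataset (directory name is a placeholder)
dataset = load_dataset("audiofolder", data_dir="speech-accent-archive")["train"]
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# An 80/20 split gives approximately the 1,698 / 425 sizes reported above
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))
```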
## Evaluation
F1 score on the validation set: 0.86
![Evaluation results](mms_eval.png)
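For comparison against your own runs, a validation F1 can be computed with, e.g., scikit-learn's `f1_score`, given predicted and ground-truth class IDs for the validation set. The `y_true`/`y_pred` arrays below are placeholders, and the averaging method behind the reported 0.86 is not documented, so the weighted average here is an assumption:

```python
from sklearn.metrics import f1_score

# Placeholder ground-truth and predicted class IDs for a validation set
y_true = [7, 2, 10, 8]
y_pred = [7, 2, 5, 8]

# Weighted averaging accounts for class imbalance across the 11 accents
# (assumption: the reported score may use a different averaging method)
print(f1_score(y_true, y_pred, average="weighted"))
```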