codeceejay
/

HIYACCENT_Wav2Vec2

@@ -1,6 +1,62 @@
 HIYACCENT: An Improved Nigerian-Accented Speech Recognition System Based on Contrastive Learning
-The global objective of this research was to develop a more robust model for the Nigerian English Speakers whose English pronunciations are heavily affected by their mother tongue. The developed model is then compared to the performance of the state of the art models. The project was motivated by the poor performance of existing models on Nigerian Accented English (NAE) Speakers.
-The Wav2Vec-HIYACCENT model was proposed which introduced a new layer to the Novel Facebook Wav2vec to capture the disparity between the baseline model and NAE. A CTC loss was also inserted on top of the model which adds flexibility to the speech-text alignment. This resulted in over 20% improvement in the performance for NAE.

 HIYACCENT: An Improved Nigerian-Accented Speech Recognition System Based on Contrastive Learning
+The goal of this research was to develop a more robust speech recognition model for Nigerian English speakers, whose English pronunciation is heavily affected by their mother tongue. To this end, the Wav2Vec-HIYACCENT model was proposed. It adds a new layer to Facebook's wav2vec to capture the disparity between the baseline model and Nigerian-accented English (NAE) speech. A CTC loss was also added on top of the model, which gives flexibility in the speech-text alignment. This resulted in an improvement of over 20% in performance on NAE.
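The alignment flexibility that CTC provides comes from letting many frame-level label sequences map to the same transcript: consecutive repeated labels are merged, and a special blank symbol is then dropped. The following is a minimal sketch of that collapse rule in plain Python, not the model's actual decoding code. The function name `ctc_greedy_collapse`, the toy vocabulary, and the choice of id 0 for the blank are assumptions made for illustration.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame label ids the way greedy CTC decoding does.

    blank_id=0 is an assumption for this sketch; check the tokenizer of the
    model you use for the real blank/pad id.
    """
    # Step 1: merge runs of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_ids)]
    # Step 2: drop the blank symbol.
    return [label for label in merged if label != blank_id]


# Hypothetical toy vocabulary: id 0 is the blank, ids 1 and 2 are letters.
vocab = {0: "<blank>", 1: "H", 2: "I"}

# Nine audio frames, each already reduced to its most likely label id.
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0]
decoded = "".join(vocab[i] for i in ctc_greedy_collapse(frames))
print(decoded)

# A blank between two identical labels keeps them as separate characters.
print(ctc_greedy_collapse([1, 1, 0, 1]))
```

Because the blank separates genuine repeats, a doubled letter survives the merge step only when at least one blank frame sits between its two occurrences.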
+This model is facebook/wav2vec2-large fine-tuned on English using the UISpeech Corpus. When using this model, make sure that your speech input is sampled at 16 kHz.
+The script used for training can be found here: https://github.com/amceejay/HIYACCENT-NE-Speech-Recognition-System
+Usage
+The model can be used directly (without a language model) as follows.
+Using the ASRecognition library:
+from asrecognition import ASREngine
+asr = ASREngine("en", model_path="codeceejay/HIYACCENT_Wav2Vec2")
+audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
+transcriptions = asr.transcribe(audio_paths)
+Writing your own inference script:
+import torch
+import librosa
+from datasets import load_dataset
+from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+LANG_ID = "en"
+MODEL_ID = "codeceejay/HIYACCENT_Wav2Vec2"
+SAMPLES = 10
+# You can use common_voice or timit. Nigerian-accented speech can also be found here: https://openslr.org/70/
+test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
+processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
+model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
+# Preprocessing the datasets.
+# We need to read the audio files as arrays
+def speech_file_to_array_fn(batch):
+    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
+    batch["speech"] = speech_array
+    batch["sentence"] = batch["sentence"].upper()
+    return batch
+test_dataset = test_dataset.map(speech_file_to_array_fn)
+inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+with torch.no_grad():
+    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+predicted_ids = torch.argmax(logits, dim=-1)
+predicted_sentences = processor.batch_decode(predicted_ids)
+for i, predicted_sentence in enumerate(predicted_sentences):
+    print("-" * 100)
+    print("Reference:", test_dataset[i]["sentence"])
+    print("Prediction:", predicted_sentence)
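The loop above prints each reference next to its prediction. To turn such pairs into a single score, one common choice in speech recognition is the word error rate (WER): the word-level edit distance divided by the number of reference words. The source does not state which metric its reported improvement refers to, so the sketch below is only an illustration, and `word_error_rate` is a hypothetical helper name rather than part of any library used here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)


# One substituted word out of three reference words.
print(word_error_rate("THE CAT SAT", "THE BAT SAT"))
```

The comparison is case-sensitive, which is why the inference script upper-cases the reference sentences before they are compared with the predictions.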