---
# library_name:
license: apache-2.0
language:
- en
base_model:
- StyleTTS2
pipeline_tag: text-to-speech
---
# Nigerian-Accented Text-to-Speech Model
![image/png](https://huggingface.co/)

## Table of Contents

1. [Model Summary](#model-summary)
2. [Model Description](#model-description)
3. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
   - [Recommendations](#recommendations)
4. [Speech Samples](#speech-samples)
5. [Training](#training)
6. [Future Improvements](#future-improvements)
7. [Citation](#citation)
8. [Credits & References](#credits--references)
 
## Model Summary

This text-to-speech (TTS) model (v1) synthesizes Nigerian-accented English, offering high-quality, natural-sounding speech for applications such as narration and voice cloning.

<video controls width="600">
<source src="https://huggingface.co/saheedniyi/YarnGPT/resolve/main/audio/YearnGPT.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>

#### How to use in Colab
The model can generate audio on its own, but it works best when prompted with a reference voice. There are 11 voices supported by default (6 male and 5 female), including:
- ben
- oge

```python
# Install system packages and Python dependencies (Colab)
!sudo apt-get update -y
!apt-get install build-essential -y
!pip install torch tensorboard transformers accelerate SoundFile torchaudio librosa phonemizer
!pip install einops einops-exts tqdm typing typing-extensions munch pydub pyyaml nltk matplotlib
!pip install git+https://github.com/resemble-ai/monotonic_align.git
!pip install hf_transfer -qU
!sudo apt-get install -y espeak-ng
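
# NOTE (assumption): the imports used further down (models, utils, text_utils,
# Modules.diffusion, Utils.PLBERT) come from the StyleTTS2 codebase this model is
# based on. If you are not already running inside a checkout of that repo, clone it
# and work from its root, with the fine-tuned checkpoint(s) and config.yml placed
# under Models/:
!git clone https://github.com/yl4579/StyleTTS2.git
%cd StyleTTS2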
 
#________________________
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

#________________________
import os

model_folder = 'Models/'

# Always pick the checkpoint from the last trained epoch
files = [f for f in os.listdir(model_folder) if f.endswith('.pth')]
sorted_files = sorted(files, key=lambda x: int(x.split('_')[-1].split('.')[0]))
print(sorted_files[-1])

#________________________
# Make runs reproducible
import torch
torch.manual_seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

import random
random.seed(0)

import numpy as np
np.random.seed(0)

# load packages
import time
import yaml
from munch import Munch
from torch import nn
import torch.nn.functional as F
import torchaudio
import librosa
from nltk.tokenize import word_tokenize

# StyleTTS2 repo modules
from models import *
from utils import *
from text_utils import TextCleaner
text_cleaner = TextCleaner()

%matplotlib inline


#________________________
# Mel-spectrogram front end used for the style reference audio
to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4, 4

def length_to_mask(lengths):
    mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
    mask = torch.gt(mask + 1, lengths.unsqueeze(1))
    return mask

def preprocess(wave):
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor

def compute_style(path):
    # Extract acoustic and prosodic style vectors from a reference recording
    wave, sr = librosa.load(path, sr=24000)
    audio, index = librosa.effects.trim(wave, top_db=30)
    if sr != 24000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
    mel_tensor = preprocess(audio).to(device)

    with torch.no_grad():
        ref_s = model.style_encoder(mel_tensor.unsqueeze(1))
        ref_p = model.predictor_encoder(mel_tensor.unsqueeze(1))

    return torch.cat([ref_s, ref_p], dim=1)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load phonemizer
import phonemizer
global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True)

config = yaml.safe_load(open(f"{model_folder}config.yml"))

# load pretrained ASR model (text aligner)
ASR_config = config.get('ASR_config', False)
ASR_path = config.get('ASR_path', False)
text_aligner = load_ASR_models(ASR_path, ASR_config)

# load pretrained F0 (pitch) model
F0_path = config.get('F0_path', False)
pitch_extractor = load_F0_models(F0_path)

# load PL-BERT model
from Utils.PLBERT.util import load_plbert
BERT_path = config.get('PLBERT_dir', False)
plbert = load_plbert(BERT_path)

# Build the model and put every sub-module in eval mode on the target device
model_params = recursive_munch(config['model_params'])
model = build_model(model_params, text_aligner, pitch_extractor, plbert)
_ = [model[key].eval() for key in model]
_ = [model[key].to(device) for key in model]


#________________________
# Load the fine-tuned checkpoint selected earlier
params_whole = torch.load(f"{model_folder}" + sorted_files[-1], map_location='cpu')
params = params_whole['net']


#________________________
for key in model:
    if key in params:
        print('%s loaded' % key)
        try:
            model[key].load_state_dict(params[key])
        except:
            # Strip the `module.` prefix left over from DataParallel training
            from collections import OrderedDict
            state_dict = params[key]
            new_state_dict = OrderedDict()
            for k, v in state_dict.items():
                name = k[7:]  # remove `module.`
                new_state_dict[name] = v
            # load params
            model[key].load_state_dict(new_state_dict, strict=False)
_ = [model[key].eval() for key in model]


#________________________
from Modules.diffusion.sampler import DiffusionSampler, ADPM2Sampler, KarrasSchedule

# Diffusion sampler used to predict the style vector from the text (BERT) embedding
sampler = DiffusionSampler(
    model.diffusion.diffusion,
    sampler=ADPM2Sampler(),
    sigma_schedule=KarrasSchedule(sigma_min=0.0001, sigma_max=3.0, rho=9.0),  # empirical parameters
    clamp=False
)

#________________________
def inference(text, ref_s, alpha=0.3, beta=0.7, diffusion_steps=5, embedding_scale=1):
    # Phonemize and tokenize the input text
    text = text.strip()
    ps = global_phonemizer.phonemize([text])
    ps = word_tokenize(ps[0])
    ps = ' '.join(ps)
    tokens = text_cleaner(ps)
    tokens.insert(0, 0)
    tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)

    with torch.no_grad():
        input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
        text_mask = length_to_mask(input_lengths).to(device)

        t_en = model.text_encoder(tokens, input_lengths, text_mask)
        bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
        d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

        # Sample a style vector conditioned on the text embedding and the reference style
        s_pred = sampler(noise=torch.randn((1, 256)).unsqueeze(1).to(device),
                         embedding=bert_dur,
                         embedding_scale=embedding_scale,
                         features=ref_s,  # reference from the same speaker as the embedding
                         num_steps=diffusion_steps).squeeze(1)

        s = s_pred[:, 128:]
        ref = s_pred[:, :128]

        # Blend predicted style with the reference style (alpha: timbre, beta: prosody)
        ref = alpha * ref + (1 - alpha) * ref_s[:, :128]
        s = beta * s + (1 - beta) * ref_s[:, 128:]

        d = model.predictor.text_encoder(d_en,
                                         s, input_lengths, text_mask)

        # Predict per-token durations
        x, _ = model.predictor.lstm(d)
        duration = model.predictor.duration_proj(x)

        duration = torch.sigmoid(duration).sum(axis=-1)
        pred_dur = torch.round(duration.squeeze()).clamp(min=1)

        # Build the monotonic text-to-frame alignment from the predicted durations
        pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
        c_frame = 0
        for i in range(pred_aln_trg.size(0)):
            pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
            c_frame += int(pred_dur[i].data)

        # encode prosody
        en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
        if model_params.decoder.type == "hifigan":
            asr_new = torch.zeros_like(en)
            asr_new[:, :, 0] = en[:, :, 0]
            asr_new[:, :, 1:] = en[:, :, 0:-1]
            en = asr_new

        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)

        asr = (t_en @ pred_aln_trg.unsqueeze(0).to(device))
        if model_params.decoder.type == "hifigan":
            asr_new = torch.zeros_like(asr)
            asr_new[:, :, 0] = asr[:, :, 0]
            asr_new[:, :, 1:] = asr[:, :, 0:-1]
            asr = asr_new

        # Decode the waveform from aligned text features, pitch, energy, and style
        out = model.decoder(asr,
                            F0_pred, N_pred, ref.squeeze().unsqueeze(0))

    return out.squeeze().cpu().numpy()[..., :-50]  # trim a stray pulse at the end of the generated audio

#________________________
# Text to synthesize
text = "We are happy to invite you to join us on a journey to the future."

#________________________
# Reference recordings that define the voices; you can point these at your own audio samples
reference_dicts = {}
reference_dicts['oge'] = "ref_audios/things_fall_apart_1.wav"
reference_dicts['ben'] = "ref_audios/feels_good_to_be_odd_1.wav"


#________________________
import IPython.display as ipd

for k, path in reference_dicts.items():
    ref_s = compute_style(path)

    start = time.time()
    wav = inference(text, ref_s, alpha=0.3, beta=0.9, diffusion_steps=10, embedding_scale=2)
    rtf = (time.time() - start) / (len(wav) / 24000)  # real-time factor at 24 kHz
    print(f"RTF = {rtf:5f}")
    print(k + ' Synthesized:')
    display(ipd.Audio(wav, rate=24000, normalize=False))
    print('Reference:')
    display(ipd.Audio(path, rate=24000, normalize=False))

```
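
To keep the generated audio rather than only playing it in the notebook, a minimal follow-on snippet (using the `soundfile` package from the install step above and the `wav` array produced by the loop) could be:

```python
import soundfile as sf

# Write the most recently synthesized waveform (float array at 24 kHz) to disk
sf.write("synthesized.wav", wav, 24000)
```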
 
## Model Description

- **Developed by:** [Saheedniyi](https://linkedin.com/in/azeez-saheed)
- **Model type:** Text-to-Speech
- **Language(s) (NLP):** English → Nigerian-accented English
- **Finetuned from:** [HuggingFaceTB/SmolLM2-360M](https://huggingface.co/HuggingFaceTB/SmolLM2-360M)
- **Repository:** [YarnGPT GitHub repository](https://github.com/saheedniyi02/yarngpt)
- **Paper:** In progress
- **Demo:**
  1. [Prompt YarnGPT notebook](https://colab.research.google.com/drive/11zMUrfBiLa1gEflAKp8lliSOTNQ-X_nU?usp=sharing)
  2. [Simple news reader](https://colab.research.google.com/drive/1SsXV08kly1TUJVM_NFpKqQWOZ1gUZpGe?usp=sharing)
 
#### Uses

Generate Nigerian-accented English speech for experimental purposes.

#### Out-of-Scope Use

The model is not suitable for generating speech in languages other than English or in accents other than Nigerian-accented English.

## Bias, Risks, and Limitations

The model may not capture the full diversity of Nigerian accents and could exhibit biases present in the training dataset. In addition, much of the text the model was trained on was automatically generated, which could affect performance.

#### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Feedback and diverse training data contributions are encouraged.

## Speech Samples

<div style="margin-top: 20px;">
  <table style="width: 100%; border-collapse: collapse;">
    <thead>
      <tr>
        <th style="border: 1px solid #ddd; padding: 8px; text-align: left; width: 40%;">Input</th>
        <th style="border: 1px solid #ddd; padding: 8px; text-align: left; width: 40%;">Audio</th>
        <th style="border: 1px solid #ddd; padding: 8px; text-align: left; width: 10%;">Notes</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td style="border: 1px solid #ddd; padding: 8px;">Hello world! I am Saheed Azeez and I am excited to announce the release of his project, I have been gathering data and learning how to build Audio-based models over the last two months, but thanks to God, I have been able to come up with something</td>
        <td style="border: 1px solid #ddd; padding: 8px;">
          <audio controls style="width: 100%;">
            <source src="https://huggingface.co/saheedniyi/YarnGPT/resolve/main/audio/Sample_1.wav" type="audio/wav">
            Your browser does not support the audio element.
          </audio>
        </td>
        <td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1), voice: idera</td>
      </tr>
      <tr>
        <td style="border: 1px solid #ddd; padding: 8px;">Wizkid, Davido, Burna Boy perform at same event in Lagos. This event has sparked many reactions across social media, with fans and critics alike praising the artistes' performances and the rare opportunity to see the three music giants on the same stage.</td>
        <td style="border: 1px solid #ddd; padding: 8px;">
          <audio controls style="width: 100%;">
            <source src="https://huggingface.co/saheedniyi/YarnGPT/resolve/main/audio/Sample_2.wav" type="audio/wav">
            Your browser does not support the audio element.
          </audio>
        </td>
        <td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1), voice: jude</td>
      </tr>
      <tr>
        <td style="border: 1px solid #ddd; padding: 8px;">Since Nigeria became a republic in 1963, 14 individuals have served as head of state of Nigeria under different titles. The incumbent president Bola Tinubu is the nation's 16th head of state.</td>
        <td style="border: 1px solid #ddd; padding: 8px;">
          <audio controls style="width: 100%;">
            <source src="https://huggingface.co/saheedniyi/YarnGPT/resolve/main/audio/Sample_3.wav" type="audio/wav">
            Your browser does not support the audio element.
          </audio>
        </td>
        <td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1), voice: zainab; the model struggled to pronounce "in 1963"</td>
      </tr>
      <tr>
        <td style="border: 1px solid #ddd; padding: 8px;">I visited the President, who has shown great concern for the security of Plateau State, especially considering that just a year ago, our state was in mourning. The President’s commitment to addressing these challenges has been steadfast.</td>
        <td style="border: 1px solid #ddd; padding: 8px;">
          <audio controls style="width: 100%;">
            <source src="https://huggingface.co/saheedniyi/YarnGPT/resolve/main/audio/Sample_4.wav" type="audio/wav">
            Your browser does not support the audio element.
          </audio>
        </td>
        <td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1), voice: emma</td>
      </tr>
      <tr>
        <td style="border: 1px solid #ddd; padding: 8px;">Scientists have discovered a new planet that may be capable of supporting life!</td>
        <td style="border: 1px solid #ddd; padding: 8px;">
          <audio controls style="width: 100%;">
            <source src="https://huggingface.co/saheedniyi/YarnGPT/resolve/main/audio/Sample_5.wav" type="audio/wav">
            Your browser does not support the audio element.
          </audio>
        </td>
        <td style="border: 1px solid #ddd; padding: 8px;">(temperature=0.1, repetition_penalty=1.1)</td>
      </tr>
    </tbody>
  </table>
</div>

## Training

#### Data
Trained on a dataset of publicly available Nigerian movies, podcasts (using the subtitle-audio pairs), and open-source Nigerian-related audio data on Hugging Face.

#### Preprocessing

Audio files were preprocessed and resampled to 24 kHz, then tokenized using [wavtokenizer](https://huggingface.co/novateur/WavTokenizer).
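
For illustration, the resampling step alone might look like the sketch below (the file names are hypothetical, and the wavtokenizer step itself follows that model's own usage guide rather than this snippet):

```python
import librosa
import soundfile as sf

# Load a raw training clip at its native rate, then resample to the 24 kHz used here
wave, sr = librosa.load("example_clip.wav", sr=None)
wave_24k = librosa.resample(wave, orig_sr=sr, target_sr=24000)
sf.write("example_clip_24k.wav", wave_24k, 24000)
```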

#### Training Hyperparameters
- **Number of epochs:** 5
- **Batch size:** 2
- **Scheduler:** linear warmup for the first 4 epochs, then linear decay to zero over the last epoch
- **Optimizer:** AdamW (betas=(0.9, 0.95), weight_decay=0.01)
- **Learning rate:** 1e-3
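
A rough PyTorch sketch of this optimizer and schedule, with illustrative step counts and a stand-in parameter list since the actual training loop is not part of this card:

```python
import torch

# Illustrative sizes only; the real values depend on the dataset and loader
steps_per_epoch = 1000
num_epochs = 5
warmup_steps = 4 * steps_per_epoch          # linear warmup over the first 4 epochs
total_steps = num_epochs * steps_per_epoch  # linear decay to zero over the last epoch

# `trainable_params` stands in for the parameters the training script actually optimizes
trainable_params = [torch.nn.Parameter(torch.zeros(1))]

optimizer = torch.optim.AdamW(trainable_params, lr=1e-3,
                              betas=(0.9, 0.95), weight_decay=0.01)

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```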

#### Hardware
- **GPUs:** 1× A100 (Google Colab, ~50 hours)

#### Software
- **Training framework:** PyTorch

## Future Improvements
- Scale up the model size and the amount of human-annotated/reviewed training data
- Wrap the model in an API endpoint
- Add support for local Nigerian languages
- Voice cloning
- Potential expansion into speech-to-speech assistant models

## Citation
#### BibTeX:
```bibtex
@misc{yarngpt2025,
  author = {Saheed Azeez},
  title = {YarnGPT: Nigerian-Accented English Text-to-Speech Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SaheedAzeez/yarngpt}
}
```
#### APA:
```text
Saheed Azeez. (2025). YarnGPT: Nigerian-Accented English Text-to-Speech Model. Hugging Face. Available at: https://huggingface.co/saheedniyi/YarnGPT
```
## Credits & References
- [OuteAI/OuteTTS-0.2-500M](https://huggingface.co/OuteAI/OuteTTS-0.2-500M/)