Pis-py committed (verified) · Commit 9de369a · 1 Parent(s): 4a1d896

Update README.md

Files changed (1): README.md (+172 −171)
---
license: mit
base_model:
- AYI-TEKK/tts-v2
tags:
- generated_from_trainer
model-index:
- name: speecht5_tts-wolof-v0.2
  results: []
language:
- wo
- fr
pipeline_tag: text-to-speech
---
+ ---
15
+
16
+ # **speecht5_tts-wolof-v0.2**
17
+
18
+ This model is a fine-tuned version of [speecht5_tts-wolof](https://huggingface.co/bilalfaye/speecht5_tts-wolof) that enhances Text-to-Speech (TTS) synthesis for both **Wolof and French**. It is based on Microsoft's [SpeechT5](https://huggingface.co/microsoft/speecht5_tts) and incorporates a **custom tokenizer** and additional fine-tuning to improve performance across these two languages.
19
+
20
+ ## **Model Description**
21
+
22
+ This model builds upon the `SpeechT5` architecture, which unifies speech recognition and synthesis. The fine-tuning process introduced modifications to the original Wolof model, enabling it to **generate natural speech in both Wolof and French**. The model maintains the same general structure but **learns a more robust alignment** between textual inputs and speech synthesis, improving pronunciation and fluency in both languages.
23
+
24
+ ---
25
+
26
+ ## **Installation Instructions for Users**
27
+
28
+ To install the necessary dependencies, run the following command:
29
+
30
+ ```bash
31
+ pip install transformers datasets torch
32
+ ```
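After installing, a quick sanity check is to confirm that each dependency is importable; this is a minimal sketch using only the standard library (package names taken from the command above):

```python
import importlib.util

# Map each dependency from the install command to whether it is importable.
status = {
    package: importlib.util.find_spec(package) is not None
    for package in ("transformers", "datasets", "torch")
}
for package, found in status.items():
    print(f"{package}: {'ok' if found else 'MISSING'}")
```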

## **Model Loading and Speech Generation Code**

```python
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio, display

def load_speech_model(checkpoint="AYI-TEKK/tts-v2", vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """Load the SpeechT5 model, processor, and vocoder for text-to-speech."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)

    return processor, model, vocoder, device

# Load the model
processor, model, vocoder, device = load_speech_model()

# Load speaker embeddings (pretrained x-vectors from the CMU Arctic dataset)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

def generate_speech_from_text(text, speaker_embedding=speaker_embedding, processor=processor, model=model, vocoder=vocoder):
    """Generate speech from input text using SpeechT5 and the HiFi-GAN vocoder."""
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True, max_length=model.config.max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        num_beams=7,
        temperature=0.6,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
    )

    speech = speech.detach().cpu().numpy()
    display(Audio(speech, rate=16000))

# Example usage: French
text = "Bonjour, bienvenue dans le modèle de synthèse vocale Wolof et Français."
generate_speech_from_text(text)

# Example usage: Wolof
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
```
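The example above plays audio inline via IPython; in a plain script you can write the 16 kHz waveform to disk instead. A minimal standard-library sketch, using a synthetic 440 Hz tone as a stand-in for the `speech` array the model returns:

```python
import math
import struct
import wave

def save_waveform(samples, path, sample_rate=16000):
    """Write float samples in [-1, 1] to a 16-bit mono PCM WAV file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono, matching SpeechT5 output
        wav_file.setsampwidth(2)      # 16-bit PCM
        wav_file.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
        )
        wav_file.writeframes(frames)

# Synthetic stand-in for the `speech` array: 0.5 s of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(8000)]
save_waveform(tone, "output.wav")

# Read the file back to confirm the header matches what was written.
with wave.open("output.wav", "rb") as check:
    n_frames, rate = check.getnframes(), check.getframerate()
print(n_frames, rate)  # 8000 16000
```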

---

## **Intended Uses & Limitations**

### **Intended Uses**
- **Multilingual TTS:** Converts **Wolof and French** text into natural-sounding speech.
- **Voice Assistants & Speech Interfaces:** Can be used for **audio-based applications** supporting both languages.
- **Linguistic Research:** Facilitates speech synthesis research in low-resource languages.

### **Limitations**
- **Data Dependency:** The quality of synthesized speech depends on the dataset used for fine-tuning.
- **Pronunciation Variations:** Some complex or uncommon words may be mispronounced.
- **Limited Speaker Variety:** The model was trained on a single speaker embedding and may not generalize well to different voice profiles.

---

## **Training and Evaluation Data**

The model was fine-tuned on an extended dataset containing text in both **Wolof and French**, ensuring improved synthesis capabilities across these two languages.

---

## **Training Procedure**

### **Training Hyperparameters**

| Hyperparameter | Value |
|----------------------------|---------|
| Learning Rate | 1e-05 |
| Training Batch Size | 8 |
| Evaluation Batch Size | 2 |
| Gradient Accumulation Steps| 8 |
| Total Train Batch Size | 64 |
| Optimizer | Adam (β1=0.9, β2=0.999, ϵ=1e-08) |
| Learning Rate Scheduler | Linear |
| Warmup Steps | 500 |
| Training Steps | 25,500 |
| Mixed Precision Training | AMP (Automatic Mixed Precision) |
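The "Total Train Batch Size" row follows from the others: per-device batch size × gradient accumulation steps × number of devices (the single-device count is an assumption, implied by the table):

```python
per_device_batch_size = 8       # "Training Batch Size" row
gradient_accumulation_steps = 8  # "Gradient Accumulation Steps" row
num_devices = 1                  # assumption: single-GPU training

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 64, matching the "Total Train Batch Size" row
```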

### **Training Results**

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-------:|:-----:|:---------------:|
| 0.5372 | 0.9995 | 954 | 0.4398 |
| 0.4646 | 2.0 | 1909 | 0.4214 |
| 0.4505 | 2.9995 | 2863 | 0.4163 |
| 0.4443 | 4.0 | 3818 | 0.4109 |
| 0.4403 | 4.9995 | 4772 | 0.4080 |
| 0.4368 | 6.0 | 5727 | 0.4057 |
| 0.4343 | 6.9995 | 6681 | 0.4034 |
| 0.4315 | 8.0 | 7636 | 0.4018 |
| 0.4311 | 8.9995 | 8590 | 0.4015 |
| 0.4273 | 10.0 | 9545 | 0.4017 |
| 0.4282 | 10.9995 | 10499 | 0.3990 |
| 0.4249 | 12.0 | 11454 | 0.3986 |
| 0.4242 | 12.9995 | 12408 | 0.3973 |
| 0.4225 | 14.0 | 13363 | 0.3966 |
| 0.4217 | 14.9995 | 14317 | 0.3951 |
| 0.4208 | 16.0 | 15272 | 0.3950 |
| 0.4200 | 16.9995 | 16226 | 0.3950 |
| 0.4202 | 18.0 | 17181 | 0.3952 |
| 0.4200 | 18.9995 | 18135 | 0.3943 |
| 0.4183 | 20.0 | 19090 | 0.3962 |
| 0.4175 | 20.9995 | 20044 | 0.3937 |
| 0.4161 | 22.0 | 20999 | 0.3940 |
| 0.4193 | 22.9995 | 21953 | 0.3932 |
| 0.4177 | 24.0 | 22908 | 0.3939 |
| 0.4166 | 24.9995 | 23862 | 0.3936 |
| 0.4156 | 26.0 | 24817 | 0.3938 |
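The validation loss plateaus in the final third of training; an illustrative way to confirm this is to transcribe the step/validation-loss pairs from the table and pick the minimum:

```python
# Validation loss by step, transcribed from the table above.
val_loss = {
    954: 0.4398, 1909: 0.4214, 2863: 0.4163, 3818: 0.4109, 4772: 0.4080,
    5727: 0.4057, 6681: 0.4034, 7636: 0.4018, 8590: 0.4015, 9545: 0.4017,
    10499: 0.3990, 11454: 0.3986, 12408: 0.3973, 13363: 0.3966, 14317: 0.3951,
    15272: 0.3950, 16226: 0.3950, 17181: 0.3952, 18135: 0.3943, 19090: 0.3962,
    20044: 0.3937, 20999: 0.3940, 21953: 0.3932, 22908: 0.3939, 23862: 0.3936,
    24817: 0.3938,
}

# Step with the lowest validation loss.
best_step = min(val_loss, key=val_loss.get)
print(best_step, val_loss[best_step])  # 21953 0.3932
```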

---

## **Framework Versions**

- **Transformers**: 4.41.2
- **PyTorch**: 2.4.0+cu121
- **Datasets**: 3.2.0
- **Tokenizers**: 0.19.1

---

## **Author**

- **Bilal FAYE**

This model contributes to **enhancing TTS accessibility** for Wolof and French, making it a valuable resource for multilingual voice applications. 🚀