---
license: mit
base_model:
- AYI-TEKK/tts-v2
tags:
- generated_from_trainer
model-index:
- name: speecht5_tts-wolof-v0.2
  results: []
language:
- wo
- fr
pipeline_tag: text-to-speech
---

# **speecht5_tts-wolof-v0.2**

This model is a fine-tuned version of [speecht5_tts-wolof](https://huggingface.co/bilalfaye/speecht5_tts-wolof) that enhances text-to-speech (TTS) synthesis for both **Wolof and French**. It is based on Microsoft's [SpeechT5](https://huggingface.co/microsoft/speecht5_tts) and incorporates a **custom tokenizer** and additional fine-tuning to improve performance across the two languages.

## **Model Description**

This model builds upon the `SpeechT5` architecture, which unifies speech recognition and synthesis. Fine-tuning introduced modifications to the original Wolof model, enabling it to **generate natural speech in both Wolof and French**. The model keeps the same general structure but **learns a more robust alignment** between text inputs and synthesized speech, improving pronunciation and fluency in both languages.

---

## **Installation Instructions for Users**

To install the necessary dependencies, run the following command:

```bash
pip install transformers datasets torch
```

## **Model Loading and Speech Generation Code**

```python
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio, display


def load_speech_model(checkpoint="AYI-TEKK/tts-v2", vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """Load the SpeechT5 model, processor, and vocoder for text-to-speech."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)

    return processor, model, vocoder, device


# Load the model
processor, model, vocoder, device = load_speech_model()

# Load speaker embeddings (pretrained x-vectors from the CMU Arctic dataset)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)


def generate_speech_from_text(text, speaker_embedding=speaker_embedding, processor=processor, model=model, vocoder=vocoder):
    """Generate speech from input text using SpeechT5 and the HiFi-GAN vocoder."""
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True,
                       max_length=model.config.max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        num_beams=7,
        temperature=0.6,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
    )

    speech = speech.detach().cpu().numpy()
    display(Audio(speech, rate=16000))


# Example usage: French
text = "Bonjour, bienvenue dans le modèle de synthèse vocale Wolof et Français."
generate_speech_from_text(text)

# Example usage: Wolof
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
```
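Outside a notebook, `IPython.display.Audio` has no effect; a minimal alternative sketch writes the generated waveform to a 16-bit PCM WAV file using only the standard-library `wave` module. The `save_waveform` helper and the placeholder tone below are illustrative, not part of the model's API — in practice, pass the NumPy array produced by the generation code above.

```python
import wave

import numpy as np


def save_waveform(speech, path="output.wav", sample_rate=16000):
    """Write a mono float waveform in [-1, 1] to a 16-bit PCM WAV file."""
    pcm = np.clip(speech, -1.0, 1.0)
    pcm = (pcm * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)       # mono
        f.setsampwidth(2)       # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())


# Placeholder: a 1-second 440 Hz tone at the model's 16 kHz sample rate;
# replace with the array returned by the TTS pipeline.
t = np.linspace(0, 1, 16000, endpoint=False)
save_waveform(0.1 * np.sin(2 * np.pi * 440 * t), "tone.wav")
```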

---

## **Intended Uses & Limitations**

### **Intended Uses**
- **Multilingual TTS:** Converts **Wolof and French** text into natural-sounding speech.
- **Voice Assistants & Speech Interfaces:** Can be used for **audio-based applications** supporting both languages.
- **Linguistic Research:** Facilitates speech synthesis research in low-resource languages.

### **Limitations**
- **Data Dependency:** The quality of synthesized speech depends on the dataset used for fine-tuning.
- **Pronunciation Variations:** Some complex or uncommon words may be mispronounced.
- **Limited Speaker Variety:** The model was trained on a single speaker embedding and may not generalize well to different voice profiles.

---

## **Training and Evaluation Data**

The model was fine-tuned on an extended dataset containing text in both **Wolof and French**, ensuring improved synthesis capabilities across these two languages.

---

## **Training Procedure**

### **Training Hyperparameters**

| Hyperparameter | Value |
|----------------------------|---------|
| Learning Rate | 1e-05 |
| Training Batch Size | 8 |
| Evaluation Batch Size | 2 |
| Gradient Accumulation Steps| 8 |
| Total Train Batch Size | 64 |
| Optimizer | Adam (β1=0.9, β2=0.999, ϵ=1e-08) |
| Learning Rate Scheduler | Linear |
| Warmup Steps | 500 |
| Training Steps | 25,500 |
| Mixed Precision Training | AMP (Automatic Mixed Precision) |

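As a sanity check on the table, the total train batch size of 64 is the product of the per-device batch size and the gradient accumulation steps (assuming a single training device, which the numbers imply):

```python
per_device_train_batch_size = 8
gradient_accumulation_steps = 8
num_devices = 1  # assumption: implied by the total of 64

total_train_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(total_train_batch_size)  # 64
```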
### **Training Results**

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-------:|:-----:|:---------------:|
| 0.5372 | 0.9995 | 954 | 0.4398 |
| 0.4646 | 2.0 | 1909 | 0.4214 |
| 0.4505 | 2.9995 | 2863 | 0.4163 |
| 0.4443 | 4.0 | 3818 | 0.4109 |
| 0.4403 | 4.9995 | 4772 | 0.4080 |
| 0.4368 | 6.0 | 5727 | 0.4057 |
| 0.4343 | 6.9995 | 6681 | 0.4034 |
| 0.4315 | 8.0 | 7636 | 0.4018 |
| 0.4311 | 8.9995 | 8590 | 0.4015 |
| 0.4273 | 10.0 | 9545 | 0.4017 |
| 0.4282 | 10.9995 | 10499 | 0.3990 |
| 0.4249 | 12.0 | 11454 | 0.3986 |
| 0.4242 | 12.9995 | 12408 | 0.3973 |
| 0.4225 | 14.0 | 13363 | 0.3966 |
| 0.4217 | 14.9995 | 14317 | 0.3951 |
| 0.4208 | 16.0 | 15272 | 0.3950 |
| 0.4200 | 16.9995 | 16226 | 0.3950 |
| 0.4202 | 18.0 | 17181 | 0.3952 |
| 0.4200 | 18.9995 | 18135 | 0.3943 |
| 0.4183 | 20.0 | 19090 | 0.3962 |
| 0.4175 | 20.9995 | 20044 | 0.3937 |
| 0.4161 | 22.0 | 20999 | 0.3940 |
| 0.4193 | 22.9995 | 21953 | 0.3932 |
| 0.4177 | 24.0 | 22908 | 0.3939 |
| 0.4166 | 24.9995 | 23862 | 0.3936 |
| 0.4156 | 26.0 | 24817 | 0.3938 |

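The validation loss plateaus after roughly epoch 20; a quick scan over the (epoch, validation loss) pairs transcribed from the table above picks out the best checkpoint:

```python
# (epoch, validation loss) pairs transcribed from the training results table
val_loss = {
    0.9995: 0.4398, 2.0: 0.4214, 2.9995: 0.4163, 4.0: 0.4109,
    4.9995: 0.4080, 6.0: 0.4057, 6.9995: 0.4034, 8.0: 0.4018,
    8.9995: 0.4015, 10.0: 0.4017, 10.9995: 0.3990, 12.0: 0.3986,
    12.9995: 0.3973, 14.0: 0.3966, 14.9995: 0.3951, 16.0: 0.3950,
    16.9995: 0.3950, 18.0: 0.3952, 18.9995: 0.3943, 20.0: 0.3962,
    20.9995: 0.3937, 22.0: 0.3940, 22.9995: 0.3932, 24.0: 0.3939,
    24.9995: 0.3936, 26.0: 0.3938,
}
best_epoch = min(val_loss, key=val_loss.get)
print(best_epoch, val_loss[best_epoch])  # 22.9995 0.3932
```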
---

## **Framework Versions**

- **Transformers**: 4.41.2
- **PyTorch**: 2.4.0+cu121
- **Datasets**: 3.2.0
- **Tokenizers**: 0.19.1

---

## **Author**

- **Bilal FAYE**

This model contributes to **enhancing TTS accessibility** for Wolof and French, making it a valuable resource for multilingual voice applications. 🚀