Pis-py committed (verified) · Commit 9de369a · 1 Parent(s): 4a1d896

Update README.md

Files changed (1): README.md (+172 −171)
---
license: mit
base_model:
- AYI-TEKK/tts-v2
tags:
- generated_from_trainer
model-index:
- name: speecht5_tts-wolof-v0.2
  results: []
language:
- wo
- fr
pipeline_tag: text-to-speech
---
+ ---
15
+
16
+ # **speecht5_tts-wolof-v0.2**
17
+
18
+ This model is a fine-tuned version of [speecht5_tts-wolof](https://huggingface.co/bilalfaye/speecht5_tts-wolof) that enhances Text-to-Speech (TTS) synthesis for both **Wolof and French**. It is based on Microsoft's [SpeechT5](https://huggingface.co/microsoft/speecht5_tts) and incorporates a **custom tokenizer** and additional fine-tuning to improve performance across these two languages.
19
+
20
+ ## **Model Description**
21
+
22
+ This model builds upon the `SpeechT5` architecture, which unifies speech recognition and synthesis. The fine-tuning process introduced modifications to the original Wolof model, enabling it to **generate natural speech in both Wolof and French**. The model maintains the same general structure but **learns a more robust alignment** between textual inputs and speech synthesis, improving pronunciation and fluency in both languages.
23
+
24
+ ---
25
+
26
+ ## **Installation Instructions for Users**
27
+
28
+ To install the necessary dependencies, run the following command:
29
+
30
+ ```bash
31
+ pip install transformers datasets torch
32
+ ```
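After installing, a quick sanity check is to confirm that each dependency is importable; this is a minimal sketch using only the standard library (package names taken from the command above):

```python
import importlib.util

# Map each dependency from the install command to whether it is importable.
status = {
    package: importlib.util.find_spec(package) is not None
    for package in ("transformers", "datasets", "torch")
}
for package, found in status.items():
    print(f"{package}: {'ok' if found else 'MISSING'}")
```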

## **Model Loading and Speech Generation Code**

```python
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from datasets import load_dataset
from IPython.display import Audio, display

def load_speech_model(checkpoint="AYI-TEKK/tts-v2", vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """Load the SpeechT5 model, processor, and vocoder for text-to-speech."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)

    return processor, model, vocoder, device

# Load the model
processor, model, vocoder, device = load_speech_model()

# Load speaker embeddings (pretrained x-vectors from the CMU Arctic dataset)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

def generate_speech_from_text(text, speaker_embedding=speaker_embedding, processor=processor, model=model, vocoder=vocoder):
    """Generate speech from input text using SpeechT5 and the HiFi-GAN vocoder."""
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True, max_length=model.config.max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),
        vocoder=vocoder,
        num_beams=7,
        temperature=0.6,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
    )

    speech = speech.detach().cpu().numpy()
    display(Audio(speech, rate=16000))

# Example usage: French
text = "Bonjour, bienvenue dans le modèle de synthèse vocale Wolof et Français."
generate_speech_from_text(text)

# Example usage: Wolof
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
```
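The example above plays audio inline via IPython; in a plain script you can write the 16 kHz waveform to disk instead. A minimal standard-library sketch, using a synthetic 440 Hz tone as a stand-in for the `speech` array the model returns:

```python
import math
import struct
import wave

def save_waveform(samples, path, sample_rate=16000):
    """Write float samples in [-1, 1] to a 16-bit mono PCM WAV file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono, matching SpeechT5 output
        wav_file.setsampwidth(2)      # 16-bit PCM
        wav_file.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
        )
        wav_file.writeframes(frames)

# Synthetic stand-in for the `speech` array: 0.5 s of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(8000)]
save_waveform(tone, "output.wav")

# Read the file back to confirm the header matches what was written.
with wave.open("output.wav", "rb") as check:
    n_frames, rate = check.getnframes(), check.getframerate()
print(n_frames, rate)  # 8000 16000
```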

---

## **Intended Uses & Limitations**

### **Intended Uses**
- **Multilingual TTS:** Converts **Wolof and French** text into natural-sounding speech.
- **Voice Assistants & Speech Interfaces:** Can be used for **audio-based applications** supporting both languages.
- **Linguistic Research:** Facilitates speech synthesis research in low-resource languages.

### **Limitations**
- **Data Dependency:** The quality of synthesized speech depends on the dataset used for fine-tuning.
- **Pronunciation Variations:** Some complex or uncommon words may be mispronounced.
- **Limited Speaker Variety:** The model was trained on a single speaker embedding and may not generalize well to different voice profiles.

---

## **Training and Evaluation Data**

The model was fine-tuned on an extended dataset containing text in both **Wolof and French**, ensuring improved synthesis capabilities across these two languages.

---

## **Training Procedure**

### **Training Hyperparameters**

| Hyperparameter | Value |
|----------------------------|---------|
| Learning Rate | 1e-05 |
| Training Batch Size | 8 |
| Evaluation Batch Size | 2 |
| Gradient Accumulation Steps| 8 |
| Total Train Batch Size | 64 |
| Optimizer | Adam (β1=0.9, β2=0.999, ϵ=1e-08) |
| Learning Rate Scheduler | Linear |
| Warmup Steps | 500 |
| Training Steps | 25,500 |
| Mixed Precision Training | AMP (Automatic Mixed Precision) |
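The "Total Train Batch Size" row follows from the others: per-device batch size × gradient accumulation steps × number of devices (the single-device count is an assumption, implied by the table):

```python
per_device_batch_size = 8       # "Training Batch Size" row
gradient_accumulation_steps = 8  # "Gradient Accumulation Steps" row
num_devices = 1                  # assumption: single-GPU training

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 64, matching the "Total Train Batch Size" row
```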

### **Training Results**

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-------:|:-----:|:---------------:|
| 0.5372 | 0.9995 | 954 | 0.4398 |
| 0.4646 | 2.0 | 1909 | 0.4214 |
| 0.4505 | 2.9995 | 2863 | 0.4163 |
| 0.4443 | 4.0 | 3818 | 0.4109 |
| 0.4403 | 4.9995 | 4772 | 0.4080 |
| 0.4368 | 6.0 | 5727 | 0.4057 |
| 0.4343 | 6.9995 | 6681 | 0.4034 |
| 0.4315 | 8.0 | 7636 | 0.4018 |
| 0.4311 | 8.9995 | 8590 | 0.4015 |
| 0.4273 | 10.0 | 9545 | 0.4017 |
| 0.4282 | 10.9995 | 10499 | 0.3990 |
| 0.4249 | 12.0 | 11454 | 0.3986 |
| 0.4242 | 12.9995 | 12408 | 0.3973 |
| 0.4225 | 14.0 | 13363 | 0.3966 |
| 0.4217 | 14.9995 | 14317 | 0.3951 |
| 0.4208 | 16.0 | 15272 | 0.3950 |
| 0.4200 | 16.9995 | 16226 | 0.3950 |
| 0.4202 | 18.0 | 17181 | 0.3952 |
| 0.4200 | 18.9995 | 18135 | 0.3943 |
| 0.4183 | 20.0 | 19090 | 0.3962 |
| 0.4175 | 20.9995 | 20044 | 0.3937 |
| 0.4161 | 22.0 | 20999 | 0.3940 |
| 0.4193 | 22.9995 | 21953 | 0.3932 |
| 0.4177 | 24.0 | 22908 | 0.3939 |
| 0.4166 | 24.9995 | 23862 | 0.3936 |
| 0.4156 | 26.0 | 24817 | 0.3938 |
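The validation loss plateaus in the final third of training; an illustrative way to confirm this is to transcribe the step/validation-loss pairs from the table and pick the minimum:

```python
# Validation loss by step, transcribed from the table above.
val_loss = {
    954: 0.4398, 1909: 0.4214, 2863: 0.4163, 3818: 0.4109, 4772: 0.4080,
    5727: 0.4057, 6681: 0.4034, 7636: 0.4018, 8590: 0.4015, 9545: 0.4017,
    10499: 0.3990, 11454: 0.3986, 12408: 0.3973, 13363: 0.3966, 14317: 0.3951,
    15272: 0.3950, 16226: 0.3950, 17181: 0.3952, 18135: 0.3943, 19090: 0.3962,
    20044: 0.3937, 20999: 0.3940, 21953: 0.3932, 22908: 0.3939, 23862: 0.3936,
    24817: 0.3938,
}

# Step with the lowest validation loss.
best_step = min(val_loss, key=val_loss.get)
print(best_step, val_loss[best_step])  # 21953 0.3932
```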

---

## **Framework Versions**

- **Transformers**: 4.41.2
- **PyTorch**: 2.4.0+cu121
- **Datasets**: 3.2.0
- **Tokenizers**: 0.19.1

---

## **Author**

- **Bilal FAYE**

This model contributes to **enhancing TTS accessibility** for Wolof and French, making it a valuable resource for multilingual voice applications. 🚀