Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -10,23 +10,76 @@ tags:
|
|
| 10 |
- speech-synthesis
|
| 11 |
---
|
| 12 |
|
| 13 |
-
# BayanSynthTTS
|
| 14 |
|
| 15 |
-
|
| 16 |
-
|
|
|
|
| 17 |
|
| 18 |
**GitHub:** [Ramendan/BayanSynthTTS](https://github.com/Ramendan/BayanSynthTTS)
|
| 19 |
|
| 20 |
---
|
| 21 |
|
| 22 |
## Audio Demos
|
| 23 |
|
| 24 |
-
|
| 25 |
|
| 26 |
> مَرْحَبًا، أَنَا بَيَانْسِينْث، نِظَامٌ لِتَوْلِيدِ الْكَلَامِ الْعَرَبِيِّ.
|
| 27 |
>
|
| 28 |
> *Hello, I am BayanSynth, a system for generating Arabic speech.*
|
| 29 |
|
| 30 |
<audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/01_basic.wav"></audio>
|
| 31 |
|
| 32 |
---
|
|
@@ -37,88 +90,317 @@ Trained on ~4 h of diacritized Arabic speech.
|
|
| 37 |
>
|
| 38 |
> *The Arabic language is a treasure of culture and heritage.*
|
| 39 |
|
| 40 |
<audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/02_prediacritized.wav"></audio>
|
| 41 |
|
| 42 |
---
|
| 43 |
|
| 44 |
-
### 3.
|
| 45 |
|
| 46 |
> الذكاء الاصطناعي هو أحد أبرز التطورات التكنولوجية في عصرنا الحديث. يعتمد على تحليل كميات ضخمة من البيانات لاستخلاص أنماط معقدة. ومن أبرز تطبيقاته نظم التعرف على الصوت وترجمة اللغات وتوليد النصوص.
|
| 47 |
>
|
| 48 |
> *Artificial intelligence is one of the most prominent technological advances of our era. It relies on analyzing massive amounts of data to extract complex patterns. Among its most notable applications: speech recognition, language translation, and text generation.*
|
| 49 |
|
| 50 |
<audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/04_long_text.wav"></audio>
|
| 51 |
|
| 52 |
---
|
| 53 |
|
| 54 |
-
###
|
| 55 |
|
| 56 |
> الْجَوْدَةُ الْعَالِيَةُ لِتِقْنِيَّاتِ الذَّكَاءِ الاصْطِنَاعِيِّ تُسَاهِمُ فِي بِنَاءِ مُسْتَقْبَلٍ بَاهِرٍ لِلْأَجْيَالِ.
|
| 57 |
>
|
| 58 |
> *The high quality of AI technologies contributes to building a brilliant future for generations to come.*
|
| 59 |
|
| 60 |
-
|
| 61 |
|
| 62 |
---
|
| 63 |
|
| 64 |
-
###
|
|
|
|
|
|
|
| 65 |
|
| 66 |
> إِنَّ نِظَامَ بَيَانْسِينْث يَهْدِفُ إِلَى تَقْدِيمِ تَجْرِبَةٍ صَوْتِيَّةٍ فَرِيدَةٍ، تَجْمَعُ بَيْنَ دِقَّةِ النُّطْقِ وَجَمَالِ الْأَدَاءِ.
|
| 67 |
>
|
| 68 |
> *BayanSynth aims to deliver a unique voice experience that combines precise pronunciation with beauty of delivery.*
|
| 69 |
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
|
|
|
|
|
|
| 75 |
|
| 76 |
-
|
| 77 |
|
| 78 |
-
<audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/11_flow_s2.wav"></audio>
|
| 79 |
|
| 80 |
---
|
| 81 |
|
| 82 |
-
###
|
|
|
|
|
|
|
| 83 |
|
| 84 |
> عَلِمَ الْعَالِمُ أَنَّ الْعَلَمَ يَعْلُو بِالْعِلْمِ، فَاسْتَعْلَمَ عَنْ عُلُومِ الْأَوَّلِينَ.
|
| 85 |
>
|
| 86 |
> *The scholar knew that the flag rises with knowledge, so he inquired about the sciences of the ancients.*
|
| 87 |
|
| 88 |
<audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/09_challenge.wav"></audio>
|
| 89 |
|
| 90 |
---
|
| 91 |
|
| 92 |
-
###
|
|
|
|
|
|
|
| 93 |
|
| 94 |
> مَرْحَباً بِكُمْ. هَذَا مِثَالٌ عَلَى اسْتِخْدَامِ التَّوْجِيهِ لِضَبْطِ أُسْلُوبِ الصَّوْتِ.
|
| 95 |
>
|
| 96 |
> *Welcome. This is an example of using an instruct prompt to control voice style.*
|
| 97 |
|
| 98 |
<audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/12_instruct.wav"></audio>
|
| 99 |
|
| 100 |
---
|
| 101 |
|
| 102 |
-
##
|
| 103 |
|
| 104 |
| File | Description |
|
| 105 |
|------|-------------|
|
| 106 |
| `epoch_28_whole.pt` | LoRA weights (LLM, 629 keys); main checkpoint |
|
| 107 |
| `samples/*.wav` | Pre-generated audio demos |
|
| 108 |
|
| 109 |
-
|
| 110 |
|
| 111 |
```bash
|
| 112 |
-
|
| 113 |
```
|
| 114 |
|
| 115 |
```python
|
| 116 |
from bayansynthtts import BayanSynthTTS
|
| 117 |
-
|
| 118 |
tts = BayanSynthTTS()
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
```
|
| 10 |
- speech-synthesis
|
| 11 |
---
|
| 12 |
|
| 13 |
+
# BayanSynthTTS
|
| 14 |
|
| 15 |
+
**Arabic Text-to-Speech powered by CosyVoice3 with LoRA fine-tuning.**
|
| 16 |
+
|
| 17 |
+
> Text in. Speech out. Inference only: no training or preprocessing required.
|
| 18 |
|
| 19 |
**GitHub:** [Ramendan/BayanSynthTTS](https://github.com/Ramendan/BayanSynthTTS)
|
| 20 |
|
| 21 |
+
## Features
|
| 22 |
+
|
| 23 |
+
| Feature | Details |
|
| 24 |
+
|---------|---------|
|
| 25 |
+
| Arabic TTS | Natural-sounding Modern Standard Arabic |
|
| 26 |
+
| Auto-Tashkeel | Automatic diacritization via mishkal (always on by default) |
|
| 27 |
+
| Voice Cloning | Clone any voice from a 5-15 s clip (WAV/MP3/OGG/M4A/FLAC) |
|
| 28 |
+
| Example voices | Two reference voices (`default.wav` and `muffled-talking.wav`) are included; add your own to `voices/` |
|
| 29 |
+
| Speed control | Slow down or speed up synthesis (0.5-2.0×) |
|
| 30 |
+
| LoRA Swapping | Change checkpoints via `conf/models.yaml`, with no code edits |
|
| 31 |
+
| Streaming | Chunk-by-chunk audio generation |
|
| 32 |
+
| Gradio UI | Simple web interface included |
|
| 33 |
+
| CLI | One-liner inference from terminal |
|
| 34 |
+
| Multilingual base | CosyVoice3 supports many languages; Arabic LoRA ships by default |
|
| 35 |
+
|
| 36 |
+
---
|
| 37 |
+
|
| 38 |
+
> **Multilingual note:** the underlying CosyVoice3 base model is trained for zero-shot
|
| 39 |
+
> synthesis across a wide range of languages. BayanSynthTTS currently defaults to an
|
| 40 |
+
> Arabic-conditioned LoRA checkpoint and delivers the best results in Modern Standard
|
| 41 |
+
> Arabic. You are free to plug in other LoRA files (not provided here) for additional
|
| 42 |
+
> languages, though quality may vary.
|
| 43 |
+
|
| 44 |
---
|
| 45 |
|
| 46 |
## Audio Demos
|
| 47 |
|
| 48 |
+
All samples were generated with this library. No post-processing applied.
|
| 49 |
+
|
| 50 |
+
| # | Description | Duration |
|
| 51 |
+
|---|-------------|----------|
|
| 52 |
+
| 1 | Basic synthesis, pre-diacritized | ~5 s |
|
| 53 |
+
| 2 | Pre-diacritized text, mishkal off | ~4 s |
|
| 54 |
+
| 3 | Voice cloning from muffled reference | ~10 s |
|
| 55 |
+
| 4 | Longer passage, AI topic, 3 sentences | ~17 s |
|
| 56 |
+
| 5 | Slow speed (0.80x) | ~10 s |
|
| 57 |
+
| 6 | Fast speed (1.20x) | ~5 s |
|
| 58 |
+
| 7 | Phonetics test: halqiyyat, tanwin, shaddah | ~7 s |
|
| 59 |
+
| 8 | Flow and rhythm, connected speech | ~9 s |
|
| 60 |
+
| 9 | Challenge: identical root, different diacritics | ~5 s |
|
| 61 |
+
| 10 | Phonetics, alternate seed (seed=17) | ~9 s |
|
| 62 |
+
| 11 | Flow, alternate seed (seed=99) | ~10 s |
|
| 63 |
+
| 12 | Instruct prompt: warm newsreader style | ~8 s |
|
| 64 |
+
|
| 65 |
+
---
|
| 66 |
+
|
| 67 |
+
### 1. Basic synthesis
|
| 68 |
|
| 69 |
> مَرْحَبًا، أَنَا بَيَانْسِينْث، نِظَامٌ لِتَوْلِيدِ الْكَلَامِ الْعَرَبِيِّ.
|
| 70 |
>
|
| 71 |
> *Hello, I am BayanSynth, a system for generating Arabic speech.*
|
| 72 |
|
| 73 |
+
```python
|
| 74 |
+
from bayansynthtts import BayanSynthTTS
|
| 75 |
+
tts = BayanSynthTTS()
|
| 76 |
+
audio = tts.synthesize(
|
| 77 |
+
"مَرْحَبًا، أَنَا بَيَانْسِينْث، نِظَامٌ لِتَوْلِيدِ الْكَلَامِ الْعَرَبِيِّ.",
|
| 78 |
+
auto_tashkeel=False,
|
| 79 |
+
)
|
| 80 |
+
tts.save_wav(audio, "output.wav")
|
| 81 |
+
```
|
| 82 |
+
|
| 83 |
<audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/01_basic.wav"></audio>
|
| 84 |
|
| 85 |
---
|
|
|
|
| 90 |
>
|
| 91 |
> *The Arabic language is a treasure of culture and heritage.*
|
| 92 |
|
| 93 |
+
```python
|
| 94 |
+
audio = tts.synthesize(
|
| 95 |
+
"إِنَّ اللُّغَةَ الْعَرَبِيَّةَ كَنْزٌ مِنَ الثَّقَافَةِ وَالتُّرَاثِ.",
|
| 96 |
+
auto_tashkeel=False,
|
| 97 |
+
)
|
| 98 |
+
```
|
| 99 |
+
|
| 100 |
<audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/02_prediacritized.wav"></audio>
|
| 101 |
|
| 102 |
---
|
| 103 |
|
| 104 |
+
### 3. Voice cloning
|
| 105 |
+
|
| 106 |
+
> هَذَا الصَّوْتُ مُسْتَنْسَخٌ مِنْ مَقْطَعٍ صَوْتِيٍّ قَصِيرٍ. يُمْكِنُكَ اسْتِخْدَامُ أَيِّ مَقْطَعٍ بِمُدَّةِ خَمْسِ إِلَى خَمْسَ عَشْرَةَ ثَانِيَةً.
|
| 107 |
+
>
|
| 108 |
+
> *This voice is cloned from a short audio clip. You can use any clip between five and fifteen seconds.*
|
| 109 |
+
|
| 110 |
+
```python
|
| 111 |
+
audio = tts.synthesize(
|
| 112 |
+
"هَذَا الصَّوْتُ مُسْتَنْسَخٌ مِنْ مَقْطَعٍ صَوْتِيٍّ قَصِيرٍ. "
|
| 113 |
+
"يُمْكِنُكَ اسْتِخْدَامُ أَيِّ مَقْطَعٍ بِمُدَّةِ خَمْسِ إِلَى خَمْسَ عَشْرَةَ ثَانِيَةً.",
|
| 114 |
+
ref_audio="voices/muffled_trim.wav",
|
| 115 |
+
auto_tashkeel=False,
|
| 116 |
+
)
|
| 117 |
+
```
|
| 118 |
+
|
| 119 |
+
**Reference clip:** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/ref_voice_muffled.wav"></audio>
|
| 120 |
+
|
| 121 |
+
**Result:** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/03_voice_cloning.wav"></audio>
|
| 122 |
+
|
| 123 |
+
---
|
| 124 |
+
|
| 125 |
+
### 4. Longer passage (auto-tashkeel, speed 0.88)
|
| 126 |
|
| 127 |
> الذكاء الاصطناعي هو أحد أبرز التطورات التكنولوجية في عصرنا الحديث. يعتمد على تحليل كميات ضخمة من البيانات لاستخلاص أنماط معقدة. ومن أبرز تطبيقاته نظم التعرف على الصوت وترجمة اللغات وتوليد النصوص.
|
| 128 |
>
|
| 129 |
> *Artificial intelligence is one of the most prominent technological advances of our era. It relies on analyzing massive amounts of data to extract complex patterns. Among its most notable applications: speech recognition, language translation, and text generation.*
|
| 130 |
|
| 131 |
+
```python
|
| 132 |
+
audio = tts.synthesize(
|
| 133 |
+
"الذكاء الاصطناعي هو أحد أبرز التطورات التكنولوجية في عصرنا الحديث. "
|
| 134 |
+
"يعتمد على تحليل كميات ضخمة من البيانات لاستخلاص أنماط معقدة. "
|
| 135 |
+
"ومن أبرز تطبيقاته نظم التعرف على الصوت وترجمة اللغات وتوليد النصوص.",
|
| 136 |
+
auto_tashkeel=True,
|
| 137 |
+
speed=0.88,
|
| 138 |
+
)
|
| 139 |
+
```
|
| 140 |
+
|
| 141 |
<audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/04_long_text.wav"></audio>
|
| 142 |
|
| 143 |
---
|
| 144 |
|
| 145 |
+
### 5. Speed control
|
| 146 |
+
|
| 147 |
+
> مَرْحَباً بِكُمْ فِي بَيَانْسِينْثْ. هَذَا تَوْلِيدٌ بِسُرْعَةٍ مُخَفَّضَةٍ لِلتَّوْضِيحِ.
|
| 148 |
+
>
|
| 149 |
+
> *Welcome to BayanSynth. This is synthesis at reduced speed for demonstration.*
|
| 150 |
+
|
| 151 |
+
```python
|
| 152 |
+
TEXT = "مَرْحَباً بِكُمْ فِي بَيَانْسِينْثْ. هَذَا تَوْلِيدٌ بِسُرْعَةٍ مُخَفَّضَةٍ لِلتَّوْضِيحِ."
|
| 153 |
+
audio = tts.synthesize(TEXT, speed=0.80, auto_tashkeel=False)
|
| 154 |
+
```
|
| 155 |
+
|
| 156 |
+
**Slow (0.80×):** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/05_slow_speed.wav"></audio>
|
| 157 |
+
|
| 158 |
+
**Fast (1.20×):** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/06_fast_speed.wav"></audio>
|
| 159 |
+
|
| 160 |
+
---
|
| 161 |
+
|
| 162 |
+
### 6. Phonetics test: halqiyyat, tanwin, shaddah
|
| 163 |
+
|
| 164 |
+
Designed to exercise pharyngeal/velar consonants, gemination, and nunation at once:
|
| 165 |
|
| 166 |
> الْجَوْدَةُ الْعَالِيَةُ لِتِقْنِيَّاتِ الذَّكَاءِ الاصْطِنَاعِيِّ تُسَاهِمُ فِي بِنَاءِ مُسْتَقْبَلٍ بَاهِرٍ لِلْأَجْيَالِ.
|
| 167 |
>
|
| 168 |
> *The high quality of AI technologies contributes to building a brilliant future for generations to come.*
|
| 169 |
|
| 170 |
+
```python
|
| 171 |
+
audio = tts.synthesize(
|
| 172 |
+
"الْجَوْدَةُ الْعَالِيَةُ لِتِقْنِيَّاتِ الذَّكَاءِ الاصْطِنَاعِيِّ "
|
| 173 |
+
"تُسَاهِمُ فِي بِنَاءِ مُسْتَقْبَلٍ بَاهِرٍ لِلْأَجْيَالِ.",
|
| 174 |
+
auto_tashkeel=False,
|
| 175 |
+
)
|
| 176 |
+
```
|
| 177 |
+
|
| 178 |
+
**seed=42:** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/07_phonetics.wav"></audio>
|
| 179 |
+
|
| 180 |
+
**seed=17 (different prosody):** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/10_phonetics_s2.wav"></audio>
|
| 181 |
|
| 182 |
---
|
| 183 |
|
| 184 |
+
### 7. Flow & rhythm test: connected speech
|
| 185 |
+
|
| 186 |
+
Tests natural sandhi, liaison, and intonation across a multi-clause sentence:
|
| 187 |
|
| 188 |
> إِنَّ نِظَامَ بَيَانْسِينْث يَهْدِفُ إِلَى تَقْدِيمِ تَجْرِبَةٍ صَوْتِيَّةٍ فَرِيدَةٍ، تَجْمَعُ بَيْنَ دِقَّةِ النُّطْقِ وَجَمَالِ الْأَدَاءِ.
|
| 189 |
>
|
| 190 |
> *BayanSynth aims to deliver a unique voice experience that combines precise pronunciation with beauty of delivery.*
|
| 191 |
|
| 192 |
+
```python
|
| 193 |
+
audio = tts.synthesize(
|
| 194 |
+
"إِنَّ نِظَامَ بَيَانْسِينْث يَهْدِفُ إِلَى تَقْدِيمِ تَجْرِبَةٍ صَوْتِيَّةٍ فَرِيدَةٍ، "
|
| 195 |
+
"تَجْمَعُ بَيْنَ دِقَّةِ النُّطْقِ وَجَمَالِ الْأَدَاءِ.",
|
| 196 |
+
auto_tashkeel=False,
|
| 197 |
+
)
|
| 198 |
+
```
|
| 199 |
|
| 200 |
+
**seed=42:** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/08_flow.wav"></audio>
|
| 201 |
|
| 202 |
+
**seed=99 (different prosody):** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/11_flow_s2.wav"></audio>
|
| 203 |
|
| 204 |
---
|
| 205 |
|
| 206 |
+
### 8. Challenge: tashkeel disambiguation
|
| 207 |
+
|
| 208 |
+
All five ع-rooted words differ **only** by their diacritics; correct rendering proves the model reads harakat accurately:
|
| 209 |
|
| 210 |
> عَلِمَ الْعَالِمُ أَنَّ الْعَلَمَ يَعْلُو بِالْعِلْمِ، فَاسْتَعْلَمَ عَنْ عُلُومِ الْأَوَّلِينَ.
|
| 211 |
>
|
| 212 |
> *The scholar knew that the flag rises with knowledge, so he inquired about the sciences of the ancients.*
|
| 213 |
|
| 214 |
+
```python
|
| 215 |
+
audio = tts.synthesize(
|
| 216 |
+
"عَلِمَ الْعَالِمُ أَنَّ الْعَلَمَ يَعْلُو بِالْعِلْمِ، "
|
| 217 |
+
"فَاسْتَعْلَمَ عَنْ عُلُومِ الْأَوَّلِينَ.",
|
| 218 |
+
auto_tashkeel=False,
|
| 219 |
+
)
|
| 220 |
+
```
|
| 221 |
+
|
| 222 |
<audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/09_challenge.wav"></audio>
|
| 223 |
|
| 224 |
---
|
| 225 |
|
| 226 |
+
### 9. Instruct prompt: warm newsreader style
|
| 227 |
+
|
| 228 |
+
Pass a free-text style directive alongside the synthesis text to steer the speaker's tone, register, or delivery:
|
| 229 |
|
| 230 |
> مَرْحَباً بِكُمْ. هَذَا مِثَالٌ عَلَى اسْتِخْدَامِ التَّوْجِيهِ لِضَبْطِ أُسْلُوبِ الصَّوْتِ.
|
| 231 |
>
|
| 232 |
> *Welcome. This is an example of using an instruct prompt to control voice style.*
|
| 233 |
|
| 234 |
+
```python
|
| 235 |
+
audio = tts.synthesize(
|
| 236 |
+
"مَرْحَباً بِكُمْ. هَذَا مِثَالٌ عَلَى اسْتِخْدَامِ التَّوْجِيهِ لِضَبْطِ أُسْلُوبِ الصَّوْتِ.",
|
| 237 |
+
instruct="Speak in a warm, clear newsreader style with careful diction.",
|
| 238 |
+
auto_tashkeel=False,
|
| 239 |
+
seed=42,
|
| 240 |
+
)
|
| 241 |
+
```
|
| 242 |
+
|
| 243 |
<audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/12_instruct.wav"></audio>
|
| 244 |
|
| 245 |
---
|
| 246 |
|
| 247 |
+
## Quick Start
|
| 248 |
+
|
| 249 |
+
### 1. Clone and install
|
| 250 |
+
|
| 251 |
+
```bash
|
| 252 |
+
git clone https://github.com/Ramendan/BayanSynthTTS
|
| 253 |
+
cd BayanSynthTTS
|
| 254 |
+
python -m venv .venv
|
| 255 |
+
.venv\Scripts\activate # Windows
|
| 256 |
+
# source .venv/bin/activate # Linux / macOS
|
| 257 |
+
pip install -r requirements.txt
|
| 258 |
+
pip install -e . # installs bayansynthtts + bundled packages into the venv
|
| 259 |
+
```
|
| 260 |
+
|
| 261 |
+
> The CosyVoice3 inference engine and Matcha-TTS decoder are **bundled directly in this repo**. No external private repos required.
|
| 262 |
+
>
|
| 263 |
+
> **Example voices:** two reference clips (`default.wav` and `muffled-talking.wav`) live in `voices/`. Drop additional 5-15 s recordings there and they automatically appear in the CLI/UI dropdown.
|
| 264 |
+
|
| 265 |
+
### 2. Download models
|
| 266 |
+
|
| 267 |
+
```bash
|
| 268 |
+
python scripts/setup_models.py
|
| 269 |
+
```
|
| 270 |
+
|
| 271 |
+
This downloads everything automatically:
|
| 272 |
+
- CosyVoice3 base weights (~2 GB) from Hugging Face → `pretrained_models/CosyVoice3/`
|
| 273 |
+
- Arabic LoRA checkpoint from Hugging Face → `checkpoints/llm/epoch_28_whole.pt`
|
| 274 |
+
- Verifies the checkpoint SHA-256
|
| 275 |
+
|
| 276 |
+
> On Windows you can also double-click `scripts\setup_models.bat`.
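
The checksum step listed above can be reproduced by hand. The following is a minimal sketch using only the standard library; it is not the code in `scripts/setup_models.py`, and the function names `sha256_of_file` and `checkpoint_matches` are hypothetical. The expected digest must come from the project itself and is not reproduced here.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large checkpoints never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat regardless of checkpoint size, which matters for multi-gigabyte weight files.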
|
| 277 |
+
|
| 278 |
+
### 3. Run
|
| 279 |
+
|
| 280 |
+
**Web UI:**
|
| 281 |
+
```bash
|
| 282 |
+
scripts\run_ui.bat # Windows GUI launcher
|
| 283 |
+
python bayansynthtts/app.py # Cross-platform (run from inside BayanSynthTTS/)
|
| 284 |
+
```
|
| 285 |
+
|
| 286 |
+
---
|
| 287 |
+
|
| 288 |
+
## Files in this repo
|
| 289 |
|
| 290 |
| File | Description |
|
| 291 |
|------|-------------|
|
| 292 |
| `epoch_28_whole.pt` | LoRA weights (LLM, 629 keys); main checkpoint |
|
| 293 |
| `samples/*.wav` | Pre-generated audio demos |
|
| 294 |
|
| 295 |
+
---
|
| 296 |
+
|
| 297 |
+
## Swapping the LoRA Checkpoint
|
| 298 |
+
|
| 299 |
+
### Via `conf/models.yaml` (recommended, no code changes)
|
| 300 |
+
|
| 301 |
+
```yaml
|
| 302 |
+
llm_lora:
|
| 303 |
+
  enabled: true
|
| 304 |
+
  checkpoint: "checkpoints/llm/my_new_epoch.pt"  # ← change this line only
|
| 305 |
+
```
|
| 306 |
+
|
| 307 |
+
### Via Python constructor (for A/B testing at runtime)
|
| 308 |
+
|
| 309 |
+
```python
|
| 310 |
+
tts = BayanSynthTTS(llm_checkpoint="checkpoints/llm/epoch_40.pt")
|
| 311 |
+
```
|
| 312 |
+
|
| 313 |
+
### Via CLI flag
|
| 314 |
|
| 315 |
```bash
|
| 316 |
+
bayansynthtts "مَرْحَباً" --llm checkpoints/llm/epoch_40.pt
|
| 317 |
```
|
| 318 |
|
| 319 |
+
---
|
| 320 |
+
|
| 321 |
+
## Adding Your Own Voices
|
| 322 |
+
|
| 323 |
+
Drop any 5-15 second Arabic clip into `voices/`. Supported formats: WAV, MP3, FLAC, OGG, M4A. Non-WAV files are auto-converted at runtime.
|
| 324 |
+
|
| 325 |
```python
|
| 326 |
from bayansynthtts import BayanSynthTTS
|
|
|
|
| 327 |
tts = BayanSynthTTS()
|
| 328 |
+
print(tts.list_voices()) # e.g. ['default.wav', 'muffled-talking.wav', 'my_voice.wav']
|
| 329 |
+
```
|
| 330 |
+
|
| 331 |
+
```bash
|
| 332 |
+
bayansynthtts "مرحبا" --voice voices/my_voice.wav
|
| 333 |
+
```
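
To preview which files in `voices/` would qualify under the format list above, a small standard-library sketch may help. It is an illustration only, not the library's `list_voices()` implementation; `find_voice_files` and `SUPPORTED_EXTENSIONS` are hypothetical names, and the extension set is simply copied from the supported formats stated in this section.

```python
from pathlib import Path

# Extensions copied from the supported-format list in this section.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}


def find_voice_files(voices_dir: str = "voices") -> list[str]:
    """Return sorted names of files in voices_dir with a supported extension."""
    root = Path(voices_dir)
    if not root.is_dir():
        return []  # A missing folder simply yields no voices.
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_file() and entry.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

The check is case-insensitive, so `Speaker.WAV` and `speaker.wav` are treated alike; clip length (5-15 s) is not validated here.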
|
| 334 |
+
|
| 335 |
+
---
|
| 336 |
+
|
| 337 |
+
## CLI Reference
|
| 338 |
+
|
| 339 |
+
```bash
|
| 340 |
+
bayansynthtts "مَرْحَباً بِكُمْ"  # basic synthesis → output.wav
|
| 341 |
+
bayansynthtts "مَرْحَباً" -o hello.wav  # custom output path
|
| 342 |
+
bayansynthtts "مَرْحَباً" --voice voices/speaker2.wav  # use specific voice
|
| 343 |
+
bayansynthtts "مَرْحَباً" --llm checkpoints/llm/new.pt  # override LLM LoRA
|
| 344 |
+
bayansynthtts "مَرْحَباً" --speed 0.85  # slower speech
|
| 345 |
+
bayansynthtts "مَرْحَباً" --no-tashkeel  # skip auto-diacritize
|
| 346 |
+
bayansynthtts "مَرْحَباً" --seed 123  # reproducible output
|
| 347 |
+
bayansynthtts --help
|
| 348 |
+
```
|
| 349 |
+
|
| 350 |
+
---
|
| 351 |
+
|
| 352 |
+
## API Reference
|
| 353 |
+
|
| 354 |
+
### `BayanSynthTTS`
|
| 355 |
+
|
| 356 |
+
| Argument | Type | Default | Description |
|
| 357 |
+
|----------|------|---------|-------------|
|
| 358 |
+
| `model_dir` | `str` | from YAML | CosyVoice3 weights directory |
|
| 359 |
+
| `llm_checkpoint` | `str` | from YAML | LLM LoRA `.pt` path |
|
| 360 |
+
| `ref_audio` | `str` | from YAML | Default reference voice path |
|
| 361 |
+
| `instruct` | `str` | from YAML | Instruct prompt text |
|
| 362 |
+
| `config_path` | `str` | `conf/models.yaml` | Custom config file path |
|
| 363 |
+
|
| 364 |
+
### `synthesize(text, *, ...)`
|
| 365 |
+
|
| 366 |
+
| Argument | Type | Default | Description |
|
| 367 |
+
|----------|------|---------|-------------|
|
| 368 |
+
| `text` | `str` | required | Arabic text (plain or diacritized) |
|
| 369 |
+
| `ref_audio` | `str` | default voice | Voice clone source (any format) |
|
| 370 |
+
| `instruct` | `str` | from config | Instruct prompt override |
|
| 371 |
+
| `speed` | `float` | `1.0` | Speed multiplier (0.5-2.0) |
|
| 372 |
+
| `stream` | `bool` | `False` | Yield chunks vs return full array |
|
| 373 |
+
| `seed` | `int` | `None` | Random seed for reproducibility |
|
| 374 |
+
| `auto_tashkeel` | `bool` | `True` | Auto-diacritize input text |
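
Since `speed` is documented with a 0.5-2.0 range, callers may want to validate user input before passing it to `synthesize`. The helper below is a hypothetical sketch, not part of the library; how the library itself treats out-of-range values is not stated here.

```python
def validate_speed(speed: float, low: float = 0.5, high: float = 2.0) -> float:
    """Return speed as a float, raising ValueError if it lies outside [low, high]."""
    value = float(speed)
    # The chained comparison is False for NaN, so NaN is rejected as well.
    if not (low <= value <= high):
        raise ValueError(f"speed must be between {low} and {high}, got {value}")
    return value
```

Raising instead of silently clamping makes a mistyped value (for example `20` instead of `2.0`) visible to the caller.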
|
| 375 |
+
|
| 376 |
+
### Tashkeel utilities
|
| 377 |
+
|
| 378 |
+
```python
|
| 379 |
+
from bayansynthtts import auto_diacritize, has_harakat, strip_harakat, list_available_backends
|
| 380 |
+
|
| 381 |
+
auto_diacritize("مرحبا بكم")  # → "مَرْحَباً بِكُمْ"
|
| 382 |
+
has_harakat("مَرْحَباً")  # → True
|
| 383 |
+
strip_harakat("مَرْحَباً")  # → "مرحبا"
|
| 384 |
+
list_available_backends()  # → ['mishkal'] (or ['tashkeel', 'mishkal'])
|
| 385 |
```
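
For readers unfamiliar with how harakat are encoded, the idea behind detecting and stripping them can be sketched with a Unicode range check. This is an assumption-laden illustration, not the library's implementation: the `_sketch` function names are hypothetical, and only the eight common marks U+064B to U+0652 are covered.

```python
import re

# U+064B..U+0652: fathatan, dammatan, kasratan, fatha, damma, kasra,
# shadda, sukun. Rarer marks (e.g. dagger alef) are ignored in this sketch.
_HARAKAT = re.compile("[\u064B-\u0652]")


def has_harakat_sketch(text: str) -> bool:
    """Return True if the text contains at least one common diacritic."""
    return _HARAKAT.search(text) is not None


def strip_harakat_sketch(text: str) -> str:
    """Remove the common diacritics, leaving base letters untouched."""
    return _HARAKAT.sub("", text)
```

Because diacritics are separate combining code points, removing them never alters the base letters.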
|
| 386 |
+
|
| 387 |
+
---
|
| 388 |
+
|
| 389 |
+
## Troubleshooting
|
| 390 |
+
|
| 391 |
+
| Problem | Solution |
|
| 392 |
+
|---------|---------|
|
| 393 |
+
| `No module named 'cosyvoice'` | Run `pip install -e .` from inside `BayanSynthTTS/` |
|
| 394 |
+
| `No LLM checkpoint found` | Run `python scripts/setup_models.py` |
|
| 395 |
+
| `mishkal not found` | `pip install mishkal` |
|
| 396 |
+
| No audio generated | Check console for the specific mode that failed; verify `voices/default.wav` exists |
|
| 397 |
+
| MP3/M4A upload fails | Install ffmpeg: `winget install ffmpeg` (Windows) or `sudo apt install ffmpeg` (Linux) |
|
| 398 |
+
|
| 399 |
+
---
|
| 400 |
+
|
| 401 |
+
## License
|
| 402 |
+
|
| 403 |
+
Apache 2.0.
|
| 404 |
+
|
| 405 |
+
The underlying CosyVoice3 model is subject to its own license.
|
| 406 |
+
LoRA checkpoints trained on Common Voice Arabic data are released under CC-BY 4.0.
|