Ramendan committed
Commit d94d802 · verified · 1 Parent(s): 8e0ef58

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +308 -26
README.md CHANGED
@@ -10,23 +10,76 @@ tags:
10
  - speech-synthesis
11
  ---
12
 
13
- # BayanSynthTTS: Arabic TTS Checkpoints
14
 
15
- Fine-tuned LoRA weights for **CosyVoice 3** (Arabic).
16
- Trained on ~4 h of diacritized Arabic speech.
 
17
 
18
  **GitHub:** [Ramendan/BayanSynthTTS](https://github.com/Ramendan/BayanSynthTTS)
19
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
20
  ---
21
 
22
  ## Audio Demos
23
 
24
- ### 1. Basic synthesis (pre-diacritized)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
25
 
26
  > مَرْحَبًا، أَنَا بَيَانْسِينْث، نِظَامٌ لِتَوْلِيدِ الْكَلَامِ الْعَرَبِيِّ.
27
  >
28
  > *Hello, I am BayanSynth, a system for generating Arabic speech.*
29
 
 
 
 
 
 
 
 
 
 
 
30
  <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/01_basic.wav"></audio>
31
 
32
  ---
@@ -37,88 +90,317 @@ Trained on ~4 h of diacritized Arabic speech.
37
  >
38
  > *The Arabic language is a treasure of culture and heritage.*
39
 
 
 
 
 
 
 
 
40
  <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/02_prediacritized.wav"></audio>
41
 
42
  ---
43
 
44
- ### 3. Longer passage (auto-tashkeel, speed 0.88)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
45
 
46
  > الذكاء الاصطناعي هو أحد أبرز التطورات التكنولوجية في عصرنا الحديث. يعتمد على تحليل كميات ضخمة من البيانات لاستخلاص أنماط معقدة. ومن أبرز تطبيقاته نظم التعرف على الصوت وترجمة اللغات وتوليد النصوص.
47
  >
48
  > *Artificial intelligence is one of the most prominent technological advances of our era. It relies on analyzing massive amounts of data to extract complex patterns. Among its most notable applications: speech recognition, language translation, and text generation.*
49
 
 
 
 
 
 
 
 
 
 
 
50
  <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/04_long_text.wav"></audio>
51
 
52
  ---
53
 
54
- ### 4. Phonetics test (seed=42)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
55
 
56
  > الْجَوْدَةُ الْعَالِيَةُ لِتَقْنِيَّاتِ الذَّكَاءِ الاصْطِنَاعِيِّ تُسَاهِمُ فِي بِنَاءِ مُسْتَقْبَلٍ بَاهِرٍ لِلْأَجْيَالِ.
57
  >
58
  > *The high quality of AI technologies contributes to building a brilliant future for generations to come.*
59
 
60
- <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/10_phonetics_s2.wav"></audio>
 
 
 
 
 
 
 
 
 
 
61
 
62
  ---
63
 
64
- ### 5. Flow & rhythm (seed=42)
 
 
65
 
66
  > إِنَّ نِظَامَ بَيَانِسِينْث يَهْدِفُ إِلَى تَقْدِيمِ تَجْرِبَةٍ صَوْتِيَّةٍ فَرِيدَةٍ، تَجْمَعُ بَيْنَ دِقَّةِ النُّطْقِ وَجَمَالِ الْأَدَاءِ.
67
  >
68
  > *BayanSynth aims to deliver a unique voice experience that combines precise pronunciation with beauty of delivery.*
69
 
70
- <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/08_flow.wav"></audio>
71
-
72
- ---
73
-
74
- ### 6. Flow, alternate seed (seed=99)
 
 
75
 
76
- Same text, different prosody:
77
 
78
- <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/11_flow_s2.wav"></audio>
79
 
80
  ---
81
 
82
- ### 7. Challenge: tashkeel disambiguation
 
 
83
 
84
  > عَلِمَ الْعَالِمُ أَنَّ الْعَلَمَ يَعْلُو بِالْعِلْمِ، فَاسْتَعْلَمَ عَنْ عُلُومِ الْأَوَّلِينَ.
85
  >
86
  > *The scholar knew that the flag rises with knowledge, so he inquired about the sciences of the ancients.*
87
 
 
 
 
 
 
 
 
 
88
  <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/09_challenge.wav"></audio>
89
 
90
  ---
91
 
92
- ### 8. Instruct prompt: warm newsreader style
 
 
93
 
94
  > مَرْحَباً بِكُمْ. هَذَا مِثَالٌ عَلَى اسْتِخْدَامِ التَّوْجِيهِ لِضَبْطِ أُسْلُوبِ الصَّوْتِ.
95
  >
96
  > *Welcome. This is an example of using an instruct prompt to control voice style.*
97
 
 
 
 
 
 
 
 
 
 
98
  <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/12_instruct.wav"></audio>
99
 
100
  ---
101
 
102
- ## Files
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
103
 
104
  | File | Description |
105
  |------|-------------|
106
  | `epoch_28_whole.pt` | LoRA weights (LLM, 629 keys); main checkpoint |
107
  | `samples/*.wav` | Pre-generated audio demos |
108
 
109
- ## Usage
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
110
 
111
  ```bash
112
- pip install bayansynthtts
113
  ```
114
 
 
 
 
 
 
 
115
  ```python
116
  from bayansynthtts import BayanSynthTTS
117
-
118
  tts = BayanSynthTTS()
119
- audio = tts.synthesize(
120
- "مَرْحَبًا، أَنَا بَيَانْسِينْث، نِظَامٌ لِتَوْلِيدِ الْكَلَامِ الْعَرَبِيِّ.",
121
- auto_tashkeel=False,
122
- )
123
- tts.save_wav(audio, "output.wav")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
124
  ```
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
10
  - speech-synthesis
11
  ---
12
 
13
+ # BayanSynthTTS
14
 
15
+ **Arabic Text-to-Speech powered by CosyVoice3 with LoRA fine-tuning.**
16
+
17
+ > Text in. Speech out. Inference only: no training or preprocessing required.
18
 
19
  **GitHub:** [Ramendan/BayanSynthTTS](https://github.com/Ramendan/BayanSynthTTS)
20
 
21
+ ## Features
22
+
23
+ | Feature | Details |
24
+ |---------|---------|
25
+ | Arabic TTS | Natural-sounding Modern Standard Arabic |
26
+ | Auto-Tashkeel | Automatic diacritization via mishkal (on by default; pass `auto_tashkeel=False` to skip it) |
27
+ | Voice Cloning | Clone any voice from a 5-15 s clip (WAV/MP3/OGG/M4A/FLAC) |
28
+ | Example voices | Two reference voices (`default.wav` and `muffled-talking.wav`) are included; add your own to `voices/` |
29
+ | Speed control | Slow down or speed up synthesis (0.5–2.0×) |
30
+ | LoRA Swapping | Change checkpoints via `conf/models.yaml`; no code edits needed |
31
+ | Streaming | Chunk-by-chunk audio generation |
32
+ | Gradio UI | Simple web interface included |
33
+ | CLI | One-liner inference from terminal |
34
+ | Multilingual base | CosyVoice3 supports many languages; Arabic LoRA ships by default |
35
+
36
+ ---
37
+
38
+ > **Multilingual note:** the underlying CosyVoice3 base model is trained for zero-shot
39
+ > synthesis across a wide range of languages. BayanSynthTTS currently defaults to an
40
+ > Arabic-conditioned LoRA checkpoint and delivers the best results in Modern Standard
41
+ > Arabic. You are free to plug in other LoRA files (not provided here) for additional
42
+ > languages, though quality may vary.
43
+
44
  ---
45
 
46
  ## Audio Demos
47
 
48
+ All samples were generated with this library. No post-processing applied.
49
+
50
+ | # | Description | Duration |
51
+ |---|-------------|----------|
52
+ | 1 | Basic synthesis, pre-diacritized | ~5 s |
53
+ | 2 | Pre-diacritized text, mishkal off | ~4 s |
54
+ | 3 | Voice cloning from muffled reference | ~10 s |
55
+ | 4 | Longer passage, AI topic, 3 sentences | ~17 s |
56
+ | 5 | Slow speed (0.80x) | ~10 s |
57
+ | 6 | Fast speed (1.20x) | ~5 s |
58
+ | 7 | Phonetics test: halqiyyat, tanwin, shaddah | ~7 s |
59
+ | 8 | Flow and rhythm, connected speech | ~9 s |
60
+ | 9 | Challenge: identical root, different diacritics | ~5 s |
61
+ | 10 | Phonetics, alternate seed (seed=17) | ~9 s |
62
+ | 11 | Flow, alternate seed (seed=99) | ~10 s |
63
+ | 12 | Instruct prompt: warm newsreader style | ~8 s |
64
+
65
+ ---
66
+
67
+ ### 1. Basic synthesis
68
 
69
  > مَرْحَبًا، أَنَا بَيَانْسِينْث، نِظَامٌ لِتَوْلِيدِ الْكَلَامِ الْعَرَبِيِّ.
70
  >
71
  > *Hello, I am BayanSynth, a system for generating Arabic speech.*
72
 
73
+ ```python
74
+ from bayansynthtts import BayanSynthTTS
75
+ tts = BayanSynthTTS()
76
+ audio = tts.synthesize(
77
+ "مَرْحَبًا، أَنَا بَيَانْسِينْث، نِظَامٌ لِتَوْلِيدِ الْكَلَامِ الْعَرَبِيِّ.",
78
+ auto_tashkeel=False,
79
+ )
80
+ tts.save_wav(audio, "output.wav")
81
+ ```
82
+
83
  <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/01_basic.wav"></audio>
84
 
85
  ---
 
90
  >
91
  > *The Arabic language is a treasure of culture and heritage.*
92
 
93
+ ```python
94
+ audio = tts.synthesize(
95
+ "إِنَّ اللُّغَةَ الْعَرَبِيَّةَ كَنْزٌ مِنَ الثَّقَافَةِ وَالتُّرَاثِ.",
96
+ auto_tashkeel=False,
97
+ )
98
+ ```
99
+
100
  <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/02_prediacritized.wav"></audio>
101
 
102
  ---
103
 
104
+ ### 3. Voice cloning
105
+
106
+ > هَذَا الصَّوْتُ مُسْتَنْسَخٌ مِنْ مَقْطَعٍ صَوْتِيٍّ قَصِيرٍ. يُمْكِنُكَ اسْتِخْدَامُ أَيِّ مَقْطَعٍ بِمُدَّةِ خَمْسِ إِلَى خَمْسَ عَشَرَةَ ثَانِيَةً.
107
+ >
108
+ > *This voice is cloned from a short audio clip. You can use any clip between five and fifteen seconds.*
109
+
110
+ ```python
111
+ audio = tts.synthesize(
112
+ "هَذَا الصَّوْتُ مُسْتَنْسَخٌ مِنْ مَقْطَعٍ صَوْتِيٍّ قَصِيرٍ. "
113
+ "يُمْكِنُكَ اسْتِخْدَامُ أَيِّ مَقْطَعٍ بِمُدَّةِ خَمْسِ إِلَى خَمْسَ عَشَرَةَ ثَانِيَةً.",
114
+ ref_audio="voices/muffled_trim.wav",
115
+ auto_tashkeel=False,
116
+ )
117
+ ```
118
+
119
+ **Reference clip:** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/ref_voice_muffled.wav"></audio>
120
+
121
+ **Result:** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/03_voice_cloning.wav"></audio>
122
+
123
+ ---
124
+
125
+ ### 4. Longer passage (auto-tashkeel, speed 0.88)
126
 
127
  > الذكاء الاصطناعي هو أحد أبرز التطورات التكنولوجية في عصرنا الحديث. يعتمد على تحليل كميات ضخمة من البيانات لاستخلاص أنماط معقدة. ومن أبرز تطبيقاته نظم التعرف على الصوت وترجمة اللغات وتوليد النصوص.
128
  >
129
  > *Artificial intelligence is one of the most prominent technological advances of our era. It relies on analyzing massive amounts of data to extract complex patterns. Among its most notable applications: speech recognition, language translation, and text generation.*
130
 
131
+ ```python
132
+ audio = tts.synthesize(
133
+ "الذكاء الاصطناعي هو أحد أبرز التطورات التكنولوجية في عصرنا الحديث. "
134
+ "يعتمد على تحليل كميات ضخمة من البيانات لاستخلاص أنماط معقدة. "
135
+ "ومن أبرز تطبيقاته نظم التعرف على الصوت وترجمة اللغات وتوليد النصوص.",
136
+ auto_tashkeel=True,
137
+ speed=0.88,
138
+ )
139
+ ```
140
+
141
  <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/04_long_text.wav"></audio>
142
 
143
  ---
144
 
145
+ ### 5. Speed control
146
+
147
+ > مَرْحَباً بِكُمْ فِي بَيَانْسِينْثِ. هَذَا تَوْلِيدٌ بِسُرْعَةٍ مُخَفَّضَةٍ لِلتَّوْضِيحِ.
148
+ >
149
+ > *Welcome to BayanSynth. This is synthesis at reduced speed for demonstration.*
150
+
151
+ ```python
152
+ TEXT = "مَرْحَباً بِكُمْ فِي بَيَانْسِينْثِ. هَذَا تَوْلِيدٌ بِسُرْعَةٍ مُخَفَّضَةٍ لِلتَّوْضِيحِ."
153
+ audio = tts.synthesize(TEXT, speed=0.80, auto_tashkeel=False)
154
+ ```
155
+
156
+ **Slow (0.80×):** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/05_slow_speed.wav"></audio>
157
+
158
+ **Fast (1.20×):** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/06_fast_speed.wav"></audio>
159
+
160
+ ---
161
+
162
+ ### 6. Phonetics test: halqiyyat, tanwin, shaddah
163
+
164
+ Designed to exercise pharyngeal/velar consonants, gemination, and nunation at once:
165
 
166
  > الْجَوْدَةُ الْعَالِيَةُ لِتَقْنِيَّاتِ الذَّكَاءِ الاصْطِنَاعِيِّ تُسَاهِمُ فِي بِنَاءِ مُسْتَقْبَلٍ بَاهِرٍ لِلْأَجْيَالِ.
167
  >
168
  > *The high quality of AI technologies contributes to building a brilliant future for generations to come.*
169
 
170
+ ```python
171
+ audio = tts.synthesize(
172
+ "الْجَوْدَةُ الْعَالِيَةُ لِتَقْنِيَّاتِ الذَّكَاءِ الاصْطِنَاعِيِّ "
173
+ "تُسَاهِمُ فِي بِنَاءِ مُسْتَقْبَلٍ بَاهِرٍ لِلْأَجْيَالِ.",
174
+ auto_tashkeel=False,
175
+ )
176
+ ```
177
+
178
+ **seed=42:** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/07_phonetics.wav"></audio>
179
+
180
+ **seed=17 (different prosody):** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/10_phonetics_s2.wav"></audio>
181
 
182
  ---
183
 
184
+ ### 7. Flow & rhythm test: connected speech
185
+
186
+ Tests natural sandhi, liaison, and intonation across a multi-clause sentence:
187
 
188
  > إِنَّ نِظَامَ بَيَانِسِينْث يَهْدِفُ إِلَى تَقْدِيمِ تَجْرِبَةٍ صَوْتِيَّةٍ فَرِيدَةٍ، تَجْمَعُ بَيْنَ دِقَّةِ النُّطْقِ وَجَمَالِ الْأَدَاءِ.
189
  >
190
  > *BayanSynth aims to deliver a unique voice experience that combines precise pronunciation with beauty of delivery.*
191
 
192
+ ```python
193
+ audio = tts.synthesize(
194
+ "إِنَّ نِظَامَ بَيَانِسِينْث يَهْدِفُ إِلَى تَقْدِيمِ تَجْرِبَةٍ صَوْتِيَّةٍ فَرِيدَةٍ، "
195
+ "تَجْمَعُ بَيْنَ دِقَّةِ النُّطْقِ وَجَمَالِ الْأَدَاءِ.",
196
+ auto_tashkeel=False,
197
+ )
198
+ ```
199
 
200
+ **seed=42:** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/08_flow.wav"></audio>
201
 
202
+ **seed=99 (different prosody):** <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/11_flow_s2.wav"></audio>
203
 
204
  ---
205
 
206
+ ### 8. Challenge: tashkeel disambiguation
207
+
208
+ All five ع-rooted words differ **only** by their diacritics; correct rendering indicates that the model reads harakat accurately:
209
 
210
  > عَلِمَ الْعَالِمُ أَنَّ الْعَلَمَ يَعْلُو بِالْعِلْمِ، فَاسْتَعْلَمَ عَنْ عُلُومِ الْأَوَّلِينَ.
211
  >
212
  > *The scholar knew that the flag rises with knowledge, so he inquired about the sciences of the ancients.*
213
 
214
+ ```python
215
+ audio = tts.synthesize(
216
+ "عَلِمَ الْعَالِمُ أَنَّ الْعَلَمَ يَعْلُو بِالْعِلْمِ، "
217
+ "فَاسْتَعْلَمَ عَنْ عُلُومِ الْأَوَّلِينَ.",
218
+ auto_tashkeel=False,
219
+ )
220
+ ```
221
+
222
  <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/09_challenge.wav"></audio>
223
 
224
  ---
225
 
226
+ ### 9. Instruct prompt: warm newsreader style
227
+
228
+ Pass a free-text style directive alongside the synthesis text to steer the speaker's tone, register, or delivery:
229
 
230
  > مَرْحَباً بِكُمْ. هَذَا مِثَالٌ عَلَى اسْتِخْدَامِ التَّوْجِيهِ لِضَبْطِ أُسْلُوبِ الصَّوْتِ.
231
  >
232
  > *Welcome. This is an example of using an instruct prompt to control voice style.*
233
 
234
+ ```python
235
+ audio = tts.synthesize(
236
+ "مَرْحَباً بِكُمْ. هَذَا مِثَالٌ عَلَى اسْتِخْدَامِ التَّوْجِيهِ لِضَبْطِ أُسْلُوبِ الصَّوْتِ.",
237
+ instruct="Speak in a warm, clear newsreader style with careful diction.",
238
+ auto_tashkeel=False,
239
+ seed=42,
240
+ )
241
+ ```
242
+
243
  <audio controls src="https://huggingface.co/Ramendan/BayanSynthTTS-checkpoints/resolve/main/samples/12_instruct.wav"></audio>
244
 
245
  ---
246
 
247
+ ## Quick Start
248
+
249
+ ### 1. Clone and install
250
+
251
+ ```bash
252
+ git clone https://github.com/Ramendan/BayanSynthTTS
253
+ cd BayanSynthTTS
254
+ python -m venv .venv
255
+ .venv\Scripts\activate # Windows
256
+ # source .venv/bin/activate # Linux / macOS
257
+ pip install -r requirements.txt
258
+ pip install -e . # installs bayansynthtts + bundled packages into the venv
259
+ ```
260
+
261
+ > The CosyVoice3 inference engine and Matcha-TTS decoder are **bundled directly in this repo**. No external private repos required.
262
+ >
263
+ > **Example voices:** two reference clips (`default.wav` and `muffled-talking.wav`) live in `voices/`. Drop additional 5-15 s recordings there and they automatically appear in the CLI/UI dropdown.
264
+
265
+ ### 2. Download models
266
+
267
+ ```bash
268
+ python scripts/setup_models.py
269
+ ```
270
+
271
+ This downloads everything automatically:
272
+ - CosyVoice3 base weights (~2 GB) from Hugging Face → `pretrained_models/CosyVoice3/`
273
+ - Arabic LoRA checkpoint from Hugging Face → `checkpoints/llm/epoch_28_whole.pt`
274
+ - Verifies the checkpoint SHA-256
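
The integrity check in the last bullet can also be reproduced by hand. The following is a minimal sketch using only the Python standard library; the helper name `sha256_of` is illustrative and is not part of `bayansynthtts`. This README does not list the expected digest, so compare the printed value against the one shown for the file on its Hugging Face page.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read in chunks so a multi-gigabyte checkpoint never sits fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    ckpt = Path("checkpoints/llm/epoch_28_whole.pt")
    if ckpt.exists():
        print(sha256_of(ckpt))
    else:
        print(f"{ckpt} not found; run scripts/setup_models.py first")
```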
275
+
276
+ > On Windows you can also double-click `scripts\setup_models.bat`.
277
+
278
+ ### 3. Run
279
+
280
+ **Web UI:**
281
+ ```bash
282
+ scripts\run_ui.bat # Windows GUI launcher
283
+ python bayansynthtts/app.py # Cross-platform (run from inside BayanSynthTTS/)
284
+ ```
285
+
286
+ ---
287
+
288
+ ## Files in this repo
289
 
290
  | File | Description |
291
  |------|-------------|
292
  | `epoch_28_whole.pt` | LoRA weights (LLM, 629 keys); main checkpoint |
293
  | `samples/*.wav` | Pre-generated audio demos |
294
 
295
+ ---
296
+
297
+ ## Swapping the LoRA Checkpoint
298
+
299
+ ### Via `conf/models.yaml` (recommended, no code changes)
300
+
301
+ ```yaml
302
+ llm_lora:
303
+   enabled: true
304
+   checkpoint: "checkpoints/llm/my_new_epoch.pt" # ← change this line only
305
+ ```
306
+
307
+ ### Via Python constructor (for A/B testing at runtime)
308
+
309
+ ```python
310
+ tts = BayanSynthTTS(llm_checkpoint="checkpoints/llm/epoch_40.pt")
311
+ ```
312
+
313
+ ### Via CLI flag
314
 
315
  ```bash
316
+ bayansynthtts "مَرْحَباً" --llm checkpoints/llm/epoch_40.pt
317
  ```
318
 
319
+ ---
320
+
321
+ ## Adding Your Own Voices
322
+
323
+ Drop any 5-15 second Arabic clip into `voices/`. Supported formats: WAV, MP3, FLAC, OGG, M4A. Non-WAV files are auto-converted at runtime.
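
Before adding a clip, it can help to confirm that it falls inside the recommended 5-15 s window. The sketch below uses only the standard-library `wave` module, so it handles uncompressed PCM WAV files only. The function names are illustrative and are not part of `bayansynthtts`.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference(path, min_s=5.0, max_s=15.0):
    """True if the clip length falls inside the recommended 5-15 s window."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

For example, `is_good_reference("voices/my_voice.wav")` returns `False` for a two-second clip, which is a hint to record a longer sample before using it for cloning.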
324
+
325
  ```python
326
  from bayansynthtts import BayanSynthTTS
 
327
  tts = BayanSynthTTS()
328
+ print(tts.list_voices()) # e.g. ['default.wav', 'muffled-talking.wav', 'my_voice.wav']
329
+ ```
330
+
331
+ ```bash
332
+ bayansynthtts "مرحبا" --voice voices/my_voice.wav
333
+ ```
334
+
335
+ ---
336
+
337
+ ## CLI Reference
338
+
339
+ ```bash
340
+ bayansynthtts "مَرْحَباً بِكُمْ" # basic synthesis → output.wav
341
+ bayansynthtts "مَرْحَباً" -o hello.wav # custom output path
342
+ bayansynthtts "مَرْحَباً" --voice voices/speaker2.wav # use specific voice
343
+ bayansynthtts "مَرْحَباً" --llm checkpoints/llm/new.pt # override LLM LoRA
344
+ bayansynthtts "مَرْحَباً" --speed 0.85 # slower speech
345
+ bayansynthtts "مَرْحَباً" --no-tashkeel # skip auto-diacritize
346
+ bayansynthtts "مَرْحَباً" --seed 123 # reproducible output
347
+ bayansynthtts --help
348
+ ```
349
+
350
+ ---
351
+
352
+ ## API Reference
353
+
354
+ ### `BayanSynthTTS`
355
+
356
+ | Argument | Type | Default | Description |
357
+ |----------|------|---------|-------------|
358
+ | `model_dir` | `str` | from YAML | CosyVoice3 weights directory |
359
+ | `llm_checkpoint` | `str` | from YAML | LLM LoRA `.pt` path |
360
+ | `ref_audio` | `str` | from YAML | Default reference voice path |
361
+ | `instruct` | `str` | from YAML | Instruct prompt text |
362
+ | `config_path` | `str` | `conf/models.yaml` | Custom config file path |
363
+
364
+ ### `synthesize(text, *, ...)`
365
+
366
+ | Argument | Type | Default | Description |
367
+ |----------|------|---------|-------------|
368
+ | `text` | `str` | required | Arabic text (plain or diacritized) |
369
+ | `ref_audio` | `str` | default voice | Voice clone source (any format) |
370
+ | `instruct` | `str` | from config | Instruct prompt override |
371
+ | `speed` | `float` | `1.0` | Speed multiplier (0.5-2.0) |
372
+ | `stream` | `bool` | `False` | Yield chunks vs return full array |
373
+ | `seed` | `int` | `None` | Random seed for reproducibility |
374
+ | `auto_tashkeel` | `bool` | `True` | Auto-diacritize input text |
375
+
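
The `stream` flag has no example in this README, so here is one possible way to consume it. This sketch assumes that `synthesize(..., stream=True)` yields one-dimensional chunks of float samples in the range -1.0 to 1.0, and that the output rate is 24 kHz. Both are assumptions to check against the library. The helper `write_stream_to_wav` is hypothetical and uses only the standard library.

```python
import struct
import wave


def write_stream_to_wav(chunks, path, sample_rate=24000):
    """Write an iterable of float sample chunks to a 16-bit mono WAV file.

    Returns the total number of samples written.
    """
    total = 0
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit PCM
        wav.setframerate(sample_rate)  # assumed rate, see lead-in
        for chunk in chunks:
            # Clip each sample to [-1, 1], then scale to signed 16-bit integers.
            ints = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in chunk]
            wav.writeframes(struct.pack(f"<{len(ints)}h", *ints))
            total += len(ints)
    return total


# Hypothetical use with the library (not executed here):
#   tts = BayanSynthTTS()
#   write_stream_to_wav(tts.synthesize(text, stream=True), "streamed.wav")
```

Writing each chunk as it arrives keeps memory use flat for long passages, at the cost of not being able to normalise the whole signal afterwards.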
376
+ ### Tashkeel utilities
377
+
378
+ ```python
379
+ from bayansynthtts import auto_diacritize, has_harakat, strip_harakat, list_available_backends
380
+
381
+ auto_diacritize("مرحبا بكم") # → "مَرْحَباً بِكُمْ"
382
+ has_harakat("مَرْحَباً") # → True
383
+ strip_harakat("مَرْحَباً") # → "مرحبا"
384
+ list_available_backends() # → ['mishkal'] (or ['tashkeel', 'mishkal'])
385
  ```
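
To make the behaviour of `has_harakat` and `strip_harakat` concrete, here is a rough standard-library sketch of what such helpers could do. It is not the library's implementation. It treats the Unicode range U+064B to U+0652 (tanwin, fatha, damma, kasra, shadda, sukun) as the set of diacritics, and the `_sketch` names are illustrative.

```python
import re

# Arabic combining marks U+064B..U+0652: fathatan, dammatan, kasratan,
# fatha, damma, kasra, shadda, sukun.
_HARAKAT = re.compile("[\u064B-\u0652]")


def has_harakat_sketch(text):
    """True if the text contains at least one diacritic mark."""
    return _HARAKAT.search(text) is not None


def strip_harakat_sketch(text):
    """Remove all diacritic marks, leaving the bare letters."""
    return _HARAKAT.sub("", text)
```

A check like `has_harakat_sketch` is one plausible way to decide whether `auto_tashkeel` needs to run on a given input at all.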
386
+
387
+ ---
388
+
389
+ ## Troubleshooting
390
+
391
+ | Problem | Solution |
392
+ |---------|---------|
393
+ | `No module named 'cosyvoice'` | Run `pip install -e .` from inside `BayanSynthTTS/` |
394
+ | `No LLM checkpoint found` | Run `python scripts/setup_models.py` |
395
+ | `mishkal not found` | `pip install mishkal` |
396
+ | No audio generated | Check console for the specific mode that failed; verify `voices/default.wav` exists |
397
+ | MP3/M4A upload fails | Install ffmpeg: `winget install ffmpeg` (Windows) or `sudo apt install ffmpeg` (Linux) |
398
+
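
For the ffmpeg row above, a quick way to see whether the executable is visible to Python is a `PATH` lookup. This is a small standard-library sketch; the function name is illustrative and is not part of `bayansynthtts`.

```python
import shutil


def ffmpeg_available():
    """True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found")
    else:
        print("ffmpeg missing: MP3/M4A inputs may fail to convert")
```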
399
+ ---
400
+
401
+ ## License
402
+
403
+ Apache 2.0.
404
+
405
+ The underlying CosyVoice3 model is subject to its own license.
406
+ LoRA checkpoints trained on Common Voice Arabic data are released under CC-BY 4.0.