Commit b358825 by ibrahimabdelaal · 1 parent: 1bd4aa0

Add F5-TTS Gradio Space with voice cloning

Files changed (7)
  1. .gitattributes +3 -32
  2. .gitignore +8 -0
  3. README.md +71 -6
  4. app.py +203 -0
  5. packages.txt +2 -0
  6. reference.wav +3 -0
  7. requirements.txt +5 -0
.gitattributes CHANGED
@@ -1,35 +1,6 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.wav filter=lfs diff=lfs merge=lfs -text
+ *.mp3 filter=lfs diff=lfs merge=lfs -text
  *.pt filter=lfs diff=lfs merge=lfs -text
  *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
  *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,8 @@
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ flagged/
+ gradio_queue.db
README.md CHANGED
@@ -1,12 +1,77 @@
  ---
- title: Arabic F5 TTS
- emoji: 🌖
- colorFrom: yellow
- colorTo: yellow
+ title: Arabic F5-TTS
+ emoji: 🎙️
+ colorFrom: green
+ colorTo: blue
  sdk: gradio
- sdk_version: 5.49.1
+ sdk_version: 4.44.0
  app_file: app.py
  pinned: false
+ license: mit
+ models:
+ - IbrahimSalah/Arabic-F5-TTS-v2
+ tags:
+ - text-to-speech
+ - tts
+ - arabic
+ - voice-cloning
+ - f5-tts
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🎙️ Arabic Text-to-Speech (F5-TTS Model)
+
+ High-quality Arabic text-to-speech synthesis using the F5-TTS model with voice cloning capabilities.
+
+ ## 🌟 Features
+
+ - **Voice Cloning**: Upload a reference audio to clone the voice style
+ - **Diacritized Text Support**: Uses fully diacritized Arabic text (تشكيل) for accurate pronunciation
+ - **High Quality**: Natural-sounding speech with controllable parameters
+ - **Fast Generation**: Efficient inference with NFE steps control
+ - **Speed Control**: Adjust speech speed from 0.5x to 2.0x
+
+ ## 🚀 Quick Start
+
+ 1. **Enter diacritized Arabic text** (with تشكيل)
+ 2. **Use the default reference audio** or upload your own (WAV format, 5-30 seconds)
+ 3. **Provide the diacritized transcript** of your reference audio
+ 4. **Adjust settings** (optional) - NFE steps, CFG strength, speed
+ 5. **Click "Generate Speech"**
+
+ ## ⚠️ Important: Diacritized Text Required
+
+ This model requires **fully diacritized Arabic text (تشكيل)** for both:
+ - Input text to synthesize
+ - Reference audio transcript
+
+ ### How to Add Diacritics:
+
+ **Option 1: Use AI (Recommended)**
+ - Ask ChatGPT, Claude, or Gemini: "أضف التشكيل الكامل للنص التالي: [your text]"
+
+ **Option 2: Online Tools**
+ - [Mishkal Tashkeel](https://tahadz.com/mishkal)
+ - [Harakat.ai](https://harakat.ai)
+
+ ## 🎯 Model Information
+
+ - **Model ID**: `IbrahimSalah/Arabic-F5-TTS-v2`
+ - **Language**: Modern Standard Arabic (MSA) and dialects
+ - **Sample Rate**: 24kHz
+ - **Architecture**: Flow Matching based TTS (F5-TTS)
+
+ ## 🔧 Advanced Settings
+
+ - **NFE Steps**: Number of function evaluations (16-64, default: 32) - Higher = better quality but slower
+ - **CFG Strength**: Classifier-free guidance strength (0-3, default: 1.8) - Controls adherence to prompt
+ - **Speed**: Playback speed (0.5-2.0, default: 1.0)
+
+ ## 🔗 Related Resources
+
+ - **Model Card**: [IbrahimSalah/Arabic-F5-TTS-v2](https://huggingface.co/IbrahimSalah/Arabic-F5-TTS-v2)
+ - **Spark TTS Arabic**: [IbrahimSalah/Arabic-TTS-Spark](https://huggingface.co/IbrahimSalah/Arabic-TTS-Spark)
+ - **Report Issues**: [Discussions](https://huggingface.co/IbrahimSalah/Arabic-F5-TTS-v2/discussions)
+
+ ## 📄 License
+
+ MIT License
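
Review note on README.md: the README says both the input text and the reference transcript must be fully diacritized, but nothing in this commit checks that. Below is a minimal sketch of one way to estimate it, assuming that "diacritized" can be approximated by counting tashkeel marks (U+064B to U+0652) per Arabic letter. The function names and the 0.5 threshold are hypothetical and are not part of the commit or of F5-TTS.

```python
# Hypothetical helper, not part of the commit: estimates how heavily a
# string is diacritized by counting tashkeel marks per Arabic letter.

TASHKEEL_START = 0x064B  # ARABIC FATHATAN
TASHKEEL_END = 0x0652    # ARABIC SUKUN


def is_tashkeel(ch: str) -> bool:
    """True for the eight basic harakat/tanwin/shadda/sukun marks."""
    return TASHKEEL_START <= ord(ch) <= TASHKEEL_END


def is_arabic_letter(ch: str) -> bool:
    """True for base Arabic letters (hamza..ghain and feh..yeh)."""
    cp = ord(ch)
    return 0x0621 <= cp <= 0x063A or 0x0641 <= cp <= 0x064A


def tashkeel_ratio(text: str) -> float:
    """Marks per Arabic letter; 0.0 when the text has no Arabic letters."""
    letters = sum(1 for ch in text if is_arabic_letter(ch))
    marks = sum(1 for ch in text if is_tashkeel(ch))
    return marks / letters if letters else 0.0


def looks_diacritized(text: str, threshold: float = 0.5) -> bool:
    """Heuristic only: the threshold is an assumption, tune it on real data."""
    return tashkeel_ratio(text) >= threshold
```

A check like this could return a warning in the status box rather than blocking generation, since a ratio cannot tell whether the diacritics are correct.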
app.py ADDED
@@ -0,0 +1,203 @@
+ import gradio as gr
+ import torch
+ import torchaudio
+ import spaces
+ import os
+ import tempfile
+ from pathlib import Path
+ from huggingface_hub import hf_hub_download
+
+ # Import F5-TTS
+ from f5_tts.infer.utils_infer import infer_process, load_model, load_vocoder
+ from f5_tts.model import DiT, UNetT
+
+ # Global cache for models
+ model_cache = {}
+
+ def load_f5_model():
+     """Load F5-TTS model (cached)."""
+     if "model" not in model_cache:
+         print("Loading F5-TTS model...")
+
+         # Download model files
+         vocab_file = hf_hub_download(repo_id="IbrahimSalah/Arabic-F5-TTS-v2", filename="vocab.txt")
+         ckpt_file = hf_hub_download(repo_id="IbrahimSalah/Arabic-F5-TTS-v2", filename="model_547500_8_18.pt")
+         config_file = hf_hub_download(repo_id="IbrahimSalah/Arabic-F5-TTS-v2", filename="F5TTS_Base_8_18.yaml")
+
+         device = "cuda" if torch.cuda.is_available() else "cpu"
+
+         # Load model
+         model, vocab_char_map, vocab_size = load_model(
+             model_cls=DiT,
+             model_cfg=config_file,
+             ckpt_path=ckpt_file,
+             vocab_file=vocab_file,
+             device=device
+         )
+
+         model_cache["model"] = model
+         model_cache["vocab_char_map"] = vocab_char_map
+         model_cache["vocab_size"] = vocab_size
+         model_cache["device"] = device
+         print("Model loaded successfully!")
+
+     return model_cache["model"], model_cache["vocab_char_map"], model_cache["vocab_size"], model_cache["device"]
+
+
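
Review note on `load_f5_model`: the function caches by hand in the module-level `model_cache` dict. As an illustration of the same load-once idea, here is a minimal sketch using `functools.lru_cache` from the standard library. The loader is a stand-in that only builds a dict; `expensive_load`, `get_model`, and `LOAD_CALLS` are hypothetical names, and nothing here calls F5-TTS.

```python
from functools import lru_cache

LOAD_CALLS = 0  # counts how often the expensive loader actually runs


def expensive_load(name: str) -> dict:
    """Stand-in for downloading files and building a model (hypothetical)."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    return {"name": name}


@lru_cache(maxsize=1)
def get_model(name: str = "demo-model") -> dict:
    """Return the cached object, running the loader on the first call only."""
    return expensive_load(name)


first = get_model()
second = get_model()
```

Either approach gives one load per process. The explicit dict in the commit makes it easier to store several related objects (model, vocabulary map, device) under one cache, which may be why it was chosen.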
+ @spaces.GPU(duration=120)
+ def generate_speech(
+     text: str,
+     reference_audio,
+     reference_transcript: str,
+     nfe_step: int = 32,
+     cfg_strength: float = 1.8,
+     speed: float = 1.0,
+     progress=gr.Progress()
+ ):
+     """Generate speech using F5-TTS."""
+     try:
+         # Load model
+         progress(0.1, desc="Loading model...")
+         model, vocab_char_map, vocab_size, device = load_f5_model()
+
+         # Validate inputs
+         if not text.strip():
+             return None, "❌ Please enter text to synthesize."
+
+         if reference_audio is None:
+             return None, "❌ Please upload a reference audio file."
+
+         if not reference_transcript.strip():
+             return None, "❌ Please enter the reference transcript."
+
+         # Generate audio
+         progress(0.3, desc="Generating audio...")
+
+         # Create temporary output file
+         with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_file:
+             output_path = tmp_file.name
+
+         # Run inference
+         audio, sample_rate, _ = infer_process(
+             ref_audio=reference_audio,
+             ref_text=reference_transcript,
+             gen_text=text,
+             model_obj=model,
+             vocoder=None,
+             mel_spec_type="vocos",
+             show_info=print,
+             progress=progress,
+             target_rms=0.1,
+             cross_fade_duration=0.15,
+             nfe_step=nfe_step,
+             cfg_strength=cfg_strength,
+             sway_sampling_coef=-1.0,
+             speed=speed,
+             fix_duration=None,
+             device=device,
+             vocab_char_map=vocab_char_map,
+         )
+
+         # Save audio
+         progress(0.9, desc="Saving audio...")
+         torchaudio.save(output_path, audio, sample_rate)
+
+         duration = audio.shape[-1] / sample_rate
+         status = f"✅ Generated {duration:.2f}s audio"
+
+         progress(1.0, desc="Complete!")
+         return output_path, status
+
+     except Exception as e:
+         import traceback
+         error_msg = f"❌ Error: {str(e)}\n{traceback.format_exc()}"
+         print(error_msg)
+         return None, error_msg
+
+
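
Review note on `generate_speech`: the three input checks run after `load_f5_model()`, so an empty request still pays for the model load on a cold start. One possible arrangement, sketched below, pulls the checks into a pure function that can run first. `validate_inputs` is a hypothetical name; the messages mirror the ones in the diff, minus the emoji prefix.

```python
from typing import Optional


def validate_inputs(
    text: str,
    reference_audio: Optional[str],
    reference_transcript: str,
) -> Optional[str]:
    """Return an error message for the first failed check, or None if all pass."""
    if not text or not text.strip():
        return "Please enter text to synthesize."
    if reference_audio is None:
        return "Please upload a reference audio file."
    if not reference_transcript or not reference_transcript.strip():
        return "Please enter the reference transcript."
    return None
```

The caller would then start with `error = validate_inputs(...)` and return `(None, error)` when it is not `None`, before touching the model. The extra `not text` guard also covers a `None` value, which `text.strip()` alone would turn into an exception.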
+ # Default examples
+ DEFAULT_REFERENCE_TEXT = "لَا يَمُرُّ يَوْمٌ إِلَّا وَأَسْتَقْبِلُ عِدَّةَ رَسَائِلَ، تَتَضَمَّنُ أَسْئِلَةً مُلِحَّةْ."
+ DEFAULT_TEXT = "تُسَاهِمُ التِّقْنِيَّاتُ الْحَدِيثَةُ فِي تَسْهِيلِ حَيَاةِ الْإِنْسَانِ، وَذَلِكَ مِنْ خِلَالِ تَطْوِيرِ أَنْظِمَةٍ ذَكِيَّةٍ تَعْتَمِدُ عَلَى الذَّكَاءِ الِاصْطِنَاعِيِّ."
+ DEFAULT_REFERENCE_AUDIO = "reference.wav"
+
+ # Create Gradio interface
+ with gr.Blocks(title="Arabic F5-TTS", theme=gr.themes.Soft()) as demo:
+     gr.Markdown("""
+     # 🎙️ Arabic Text-to-Speech | F5-TTS Model
+
+     High-quality Arabic TTS with voice cloning. **Diacritized text (تشكيل) required.**
+
+     **Model:** [IbrahimSalah/Arabic-F5-TTS-v2](https://huggingface.co/IbrahimSalah/Arabic-F5-TTS-v2)
+     """)
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             text_input = gr.Textbox(
+                 label="📝 Text to Synthesize (Arabic with Tashkeel)",
+                 placeholder="أَدْخِلْ نَصًّا عَرَبِيًّا مُشَكَّلًا هُنَا...",
+                 lines=6,
+                 value=DEFAULT_TEXT
+             )
+
+             with gr.Row():
+                 with gr.Column():
+                     gr.Markdown("**🎵 Reference Audio**")
+                     reference_audio = gr.Audio(
+                         label="",
+                         type="filepath",
+                         value=DEFAULT_REFERENCE_AUDIO
+                     )
+
+                 with gr.Column():
+                     reference_transcript = gr.Textbox(
+                         label="📄 Reference Transcript (with Tashkeel)",
+                         placeholder="النص المقابل للصوت المرجعي...",
+                         lines=4,
+                         value=DEFAULT_REFERENCE_TEXT
+                     )
+
+             with gr.Accordion("⚙️ Advanced Settings", open=False):
+                 with gr.Row():
+                     nfe_step = gr.Slider(16, 64, value=32, step=1, label="NFE Steps")
+                     cfg_strength = gr.Slider(0.0, 3.0, value=1.8, step=0.1, label="CFG Strength")
+                 with gr.Row():
+                     speed = gr.Slider(0.5, 2.0, value=1.0, step=0.1, label="Speed")
+
+             generate_btn = gr.Button("🎤 Generate Speech", variant="primary", size="lg")
+
+         with gr.Column(scale=1):
+             output_audio = gr.Audio(label="🔊 Generated Speech", type="filepath")
+             status_text = gr.Textbox(label="Status", interactive=False, lines=2)
+
+             gr.Markdown("""
+             ### ℹ️ Requirements
+             - **Diacritized text is required** (تشكيل)
+             - Reference audio: 5-30 seconds, clear speech
+             - Use AI (ChatGPT/Claude) or [online tools](https://tahadz.com/mishkal) to add diacritics
+
+             ### 🔗 Resources
+             - [Model Card](https://huggingface.co/IbrahimSalah/Arabic-F5-TTS-v2)
+             - [Spark TTS](https://huggingface.co/IbrahimSalah/Arabic-TTS-Spark)
+             - [Report Issues](https://huggingface.co/IbrahimSalah/Arabic-F5-TTS-v2/discussions)
+             """)
+
+     # Examples
+     with gr.Accordion("📚 Examples", open=False):
+         gr.Examples(
+             examples=[
+                 [DEFAULT_TEXT, DEFAULT_REFERENCE_AUDIO, DEFAULT_REFERENCE_TEXT, 32, 1.8, 1.0],
+                 ["السَّلَامُ عَلَيْكُمْ وَرَحْمَةُ اللَّهِ وَبَرَكَاتُهُ، كَيْفَ حَالُكَ الْيَوْمَ؟", DEFAULT_REFERENCE_AUDIO, DEFAULT_REFERENCE_TEXT, 32, 1.8, 1.0],
+                 ["الذَّكَاءُ الِاصْطِنَاعِيُّ يُغَيِّرُ الْعَالَمَ بِسُرْعَةٍ كَبِيرَةٍ.", DEFAULT_REFERENCE_AUDIO, DEFAULT_REFERENCE_TEXT, 32, 1.8, 1.0]
+             ],
+             inputs=[text_input, reference_audio, reference_transcript, nfe_step, cfg_strength, speed]
+         )
+
+     generate_btn.click(
+         fn=generate_speech,
+         inputs=[text_input, reference_audio, reference_transcript, nfe_step, cfg_strength, speed],
+         outputs=[output_audio, status_text]
+     )
+
+ if __name__ == "__main__":
+     demo.queue(max_size=20)
+     demo.launch()
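
Review note on app.py: the sliders define the ranges documented in the README (NFE steps 16 to 64, CFG strength 0.0 to 3.0, speed 0.5 to 2.0), but `generate_speech` itself accepts any values. If the function is ever called from somewhere other than the sliders, for example directly from Python, a small normalization step could enforce the same bounds. The sketch below is one way to do that; `Settings`, `clamp`, and `normalize_settings` are hypothetical names, and the ranges and defaults are copied from the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    nfe_step: int
    cfg_strength: float
    speed: float


def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))


def normalize_settings(nfe_step=32, cfg_strength=1.8, speed=1.0) -> Settings:
    """Coerce the three sliders' values into the ranges used by the UI."""
    return Settings(
        nfe_step=int(clamp(round(nfe_step), 16, 64)),
        cfg_strength=float(clamp(cfg_strength, 0.0, 3.0)),
        speed=float(clamp(speed, 0.5, 2.0)),
    )
```

Clamping silently is a design choice; raising a `ValueError` for out-of-range input would be the stricter alternative.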
packages.txt ADDED
@@ -0,0 +1,2 @@
+ libsndfile1
+ ffmpeg
reference.wav ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d6db1e038c67df75cdde9ad1e43ba05f660eebc9346a30617d9b2f3892a5b201
+ size 1058478
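
Review note on reference.wav: what the diff shows is a Git LFS pointer file, not the audio itself. Each line is a key, a space, and a value. The sketch below parses those three lines with the standard library only; `parse_lfs_pointer` is a hypothetical helper, and the pointer text is copied from the diff above.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse 'key value' lines of a Git LFS pointer into a small dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:d6db1e038c67df75cdde9ad1e43ba05f660eebc9346a30617d9b2f3892a5b201
size 1058478
"""

info = parse_lfs_pointer(POINTER)
```

Here `info["size"]` is 1058478, so the actual reference clip is roughly 1 MB.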
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ gradio==4.44.0
+ torch>=2.0.0
+ torchaudio>=2.0.0
+ spaces
+ git+https://github.com/ibrahimabdelaal/F5-TTS-Arabic.git
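
Review note on requirements.txt: this commit sets `sdk_version: 4.44.0` in the README front matter and pins `gradio==4.44.0` here, so the two currently agree. A small check like the sketch below could keep them in sync in future commits. Both function names are hypothetical, and the regular expressions only handle the simple `name==version` and `sdk_version: value` forms that appear in this diff.

```python
import re
from typing import Optional


def pinned_version(requirements: str, package: str) -> Optional[str]:
    """Return the version from a 'package==X' line, or None if not pinned."""
    pattern = re.compile(rf"^{re.escape(package)}==(\S+)\s*$", re.MULTILINE)
    match = pattern.search(requirements)
    return match.group(1) if match else None


def sdk_version(readme: str) -> Optional[str]:
    """Return the value of the 'sdk_version:' line in the front matter."""
    match = re.search(r"^sdk_version:\s*(\S+)\s*$", readme, re.MULTILINE)
    return match.group(1) if match else None


REQUIREMENTS = "gradio==4.44.0\ntorch>=2.0.0\ntorchaudio>=2.0.0\nspaces\n"
README_HEAD = "---\ntitle: Arabic F5-TTS\nsdk: gradio\nsdk_version: 4.44.0\n---\n"

versions_match = pinned_version(REQUIREMENTS, "gradio") == sdk_version(README_HEAD)
```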