Files changed (1)
  1. README.md +1 -138
README.md CHANGED
@@ -1,138 +1 @@
- ---
- language:
- - ar
- base_model:
- - SparkAudio/Spark-TTS-0.5B
- tags:
- - speech
- - arabic
- - spark
- - tts
- - text-to-speech
- license: fair-noncommercial-research-license
- ---
- # Spark-TTS Arabic
- ## Arabic Text-to-Speech Model
-
- Arabic text-to-speech model fine-tuned on about 300 hours of clean Arabic audio. It produces consistent, high-quality speech for fully diacritized Modern Standard Arabic text.
-
- ## Model Details
-
- **Training Data:** ~300 hours of clean Arabic audio
- **Language:** Modern Standard Arabic (MSA)
- **Sample Rate:** 24 kHz
-
- ## Usage
-
- ### Quick Start
-
- See the [Colab notebook](https://colab.research.google.com/drive/1-Jxgy8BjvyWHKppdBPtz4s35Er3qDv-K?usp=sharing).
- Hugging Face Space: [Arabic Spark TTS Space](https://huggingface.co/spaces/IbrahimSalah/Arabic-TTS-Spark).
-
-
-
- ```python
- from transformers import AutoProcessor, AutoModel
- import soundfile as sf
- import torch
-
- # Load the processor and model (trust_remote_code executes code from the model repo)
- model_id = "IbrahimSalah/Arabic-TTS-Spark"
- device = "cuda" if torch.cuda.is_available() else "cpu"
-
- processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
- model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().to(device)
-
- # Prepare inputs: target text plus a reference recording and its transcript
- inputs = processor(
-     text="YOUR_TEXT_WITH_TASHKEEL",
-     prompt_speech_path="path/to/reference.wav",
-     prompt_text="REFERENCE_TEXT_WITH_TASHKEEL",
-     return_tensors="pt"
- ).to(device)
-
- # Generate audio tokens without tracking gradients
- with torch.no_grad():
-     output_ids = model.generate(**inputs, max_new_tokens=8000, temperature=0.8)
-
- # Decode the tokens to a waveform and write it to disk
- output = processor.decode(generated_ids=output_ids)
- sf.write("output.wav", output["audio"], output["sampling_rate"])
- ```
-
- ## Key Features
-
- - High-quality Arabic speech synthesis with natural prosody
- - Efficient voice cloning from reference audio
- - Advanced text chunking for long-form content
- - Built-in audio post-processing (normalization, silence removal, crossfading)
- - Works best with moderate text lengths
- - Adjustable generation parameters (temperature, top_k, top_p)
-
- ## Input Requirements
-
- **Critical:** Text must include full Arabic diacritization (tashkeel). The model was trained exclusively on fully diacritized text and will not perform well on non-diacritized input.
-
- Example of correct input:
- ```
- إِنَّ الْعِلْمَ نُورٌ يُقْذَفُ فِي الْقَلْبِ
- ```
-
- ### Generation Parameters
-
- ```python
- tts.generate_long_text(  # note: `tts` is not defined anywhere in this README
-     text=your_text,
-     prompt_audio_path="reference.wav",
-     prompt_transcript="reference_text",
-     output_path="output.wav",
-     max_chunk_length=300,      # Characters per chunk
-     crossfade_duration=0.08,   # Crossfade duration in seconds
-     normalize_audio_flag=True,
-     remove_silence_flag=True,
-     temperature=0.8,           # Generation randomness
-     top_p=0.95,                # Nucleus sampling
-     top_k=50                   # Top-k sampling
- )
- ```
-
- ## Sample Output
-
- **Text:** "إِنَّ الدَّوْلَةَ لَهَا أَعْمَارٌ طَبِيعِيَّةٌ كَمَا لِلْأَشْخَاصِ. وَأَنَّهَا تَنْتَقِلُ فِي أَطْوَارٍ مُخْتَلِفَةٍ، فَيَكُونُ الْجِيلُ الْأَوَّلُ مِنْ أَهْلِ الدَّوْلَةِ، قَدْ حَافَظُوا عَلَى الْخُشُونَةِ الْبَدَوِيَّةِ، وَالتَّوَحُّشِ، وَالشَّظَفِ، وَالْبَأْسِ، وَالِاشْتِرَاكِ فِي الْمَجْدِ. فَتَكُونُ حُدُودُهُمْ مَرْهُوبَةً، وَجَوَانِبُهُمْ مُعَزَّزَةً. ثُمَّ يَأْتِي الْجِيلُ الثَّانِي، فَيَتَحَوَّلُ حَالُهُمْ بِالْمُلْكِ وَالتَّرَفِ مِنَ الْبَدَاوَةِ إِلَى الْحَضَارَةِ، وَمِنَ الْخُشُونَةِ إِلَى التَّرَفِ. فَيَنْكَسِرُ سَوْرَةُ الْعَصَبِيَّةِ قَلِيلًا. ثُمَّ يَأْتِي الْجِيلُ الثَّالِثُ، فَيَكُونُونَ قَدْ نَسُوا عَهْدَ الْبَدَاوَةِ وَالْخُشُونَةِ، وَيَنْغَمِسُونَ فِي النَّعِيمِ وَالتَّرَفِ، وَيَصِيرُونَ عِيَالًا عَلَى الدَّوْلَةِ. فَيَسْقُطُونَ فِي الْهَرَمِ وَالزَّوَالِ، وَيَحْتَاجُونَ إِلَى مَنْ يُدَافِعُ عَنْهُمْ، فَتَبْدَأُ الدَّوْلَةُ فِي الِانْقِرَاضِ."
-
- <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/645098004f731658826cfe57/FCGgeIu1F89rvNI55aVIx.wav"></audio>
- ## Reference Audio
-
- <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/645098004f731658826cfe57/cA9Z77_P0Rm2-hu1eosOC.wav"></audio>
-
- ## Further Fine-tuning
-
- The model can be further fine-tuned for:
- - Non-diacritized text (requires additional training)
- - Specific voice characteristics
- - Domain-specific vocabulary
- - Dialectal variations
-
- Fine-tuning code: [Spark-TTS Fine-tune](https://github.com/tuan12378/Spark-TTS-finetune)
-
-
- ## License
-
- This model is released under a **Non-Commercial License** (`fair-noncommercial-research-license` in the model card metadata).
-
- - You may use this model for research, educational, and personal non-commercial purposes.
- - Commercial use is prohibited without explicit permission.
- - To use this model commercially, please contact the model author.
-
-
- ## Acknowledgments
-
- - Base model: SparkAudio/Spark-TTS-0.5B; fine-tuning code: [Spark-TTS-finetune](https://github.com/tuan12378/Spark-TTS-finetune) by tuan12378
-
- ## Limitations
-
- - Requires fully diacritized Arabic text as input
- - Optimized for Modern Standard Arabic (MSA), not dialectal Arabic
- - Quality may degrade on very long texts unless they are chunked properly
- - Voice cloning quality depends on reference audio quality and length
- - Generation time increases with text length
+ Hello Million, how are you