Update README.md #2
by AbdulRahman1123 - opened

README.md CHANGED
@@ -1,138 +1 @@
-
-language:
-- ar
-base_model:
-- SparkAudio/Spark-TTS-0.5B
-tags:
-- speech
-- arabic
-- spark
-- tts
-- text-to-speech
-license: fair-noncommercial-research-license
----
-# Spark-TTS Arabic
-## نموذج تحويل النص إلى كلام باللغة العربية
-
-Arabic text-to-speech model fine-tuned on 300 hours of clean Arabic audio data. Delivers consistent, high-quality speech synthesis for Modern Standard Arabic with full diacritization.
-
-## Model Details
-
-**Training Data:** ~300 hours of clean Arabic audio
-**Language:** Modern Standard Arabic (MSA)
-**Sample Rate:** 24kHz
-
-## Usage
-
-### Quick Start
-
-see the [Colab notebook](https://colab.research.google.com/drive/1-Jxgy8BjvyWHKppdBPtz4s35Er3qDv-K?usp=sharing).
-HF space : [Arabic Spark TTS Space](https://huggingface.co/spaces/IbrahimSalah/Arabic-TTS-Spark).
-
-
-
-```python
-from transformers import AutoProcessor, AutoModel
-import soundfile as sf
-import torch
-
-# Load model
-model_id = "IbrahimSalah/Arabic-TTS-Spark"
-device = "cuda" if torch.cuda.is_available() else "cpu"
-
-processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
-model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().to(device)
-
-# Prepare inputs
-inputs = processor(
-    text="YOUR_TEXT_WITH_TASHKEEL",
-    prompt_speech_path="path/to/reference.wav",
-    prompt_text="REFERENCE_TEXT_WITH_TASHKEEL",
-    return_tensors="pt"
-).to(device)
-
-# Generate
-with torch.no_grad():
-    output_ids = model.generate(**inputs, max_new_tokens=8000, temperature=0.8)
-
-# Decode
-output = processor.decode(generated_ids=output_ids)
-sf.write("output.wav", output["audio"], output["sampling_rate"])
-```
-
-## Key Features
-
-- High-quality Arabic speech synthesis with natural prosody
-- Efficient voice cloning from reference audio
-- Advanced text chunking for long-form content
-- Built-in audio post-processing (normalization, silence removal, crossfading)
-- Works best with moderate text lengths
-- Adjustable generation parameters (temperature, top_k, top_p)
-
-## Input Requirements
-
-**Critical:** Text must include full Arabic diacritization (tashkeel). The model is trained exclusively on fully diacritized text and will not perform well on non-diacritized input.
-
-Example of correct input:
-```
-إِنَّ الْعِلْمَ نُورٌ يُقْذَفُ فِي الْقَلْبِ
-```
-
-### Generation Parameters
-
-```python
-tts.generate_long_text(
-    text=your_text,
-    prompt_audio_path="reference.wav",
-    prompt_transcript="reference_text",
-    output_path="output.wav",
-    max_chunk_length=300, # Characters per chunk
-    crossfade_duration=0.08, # Crossfade duration in seconds
-    normalize_audio_flag=True,
-    remove_silence_flag=True,
-    temperature=0.8, # Generation randomness
-    top_p=0.95, # Nucleus sampling
-    top_k=50 # Top-k sampling
-)
-```
-
-## Sample Output
-
-**Text:** "إِنَّ الدَّوْلَةَ لَهَا أَعْمَارٌ طَبِيعِيَّةٌ كَمَا لِلْأَشْخَاصِ. وَأَنَّهَا تَنْتَقِلُ فِي أَطْوَارٍ مُخْتَلِفَةٍ، فَيَكُونُ الْجِيلُ الْأَوَّلُ مِنْ أَهْلِ الدَّوْلَةِ، قَدْ حَافَظُوا عَلَى الْخُشُونَةِ الْبَدَوِيَّةِ، وَالتَّوَحُّشِ، وَالشَّظَفِ، وَالْبَأْسِ، وَالِاشْتِرَاكِ فِي الْمَجْدِ. فَتَكُونُ حُدُودُهُمْ مَرْهُوبَةً، وَجَوَانِبُهُمْ مُعَزَّزَةً. ثُمَّ يَأْتِي الْجِيلُ الثَّانِي، فَيَتَحَوَّلُ حَالُهُمْ بِالْمُلْكِ وَالتَّرَفِ مِنَ الْبَدَاوَةِ إِلَى الْحَضَارَةِ، وَمِنَ الْخُشُونَةِ إِلَى التَّرَفِ. فَيَنْكَسِرُ سَوْرَةُ الْعَصَبِيَّةِ قَلِيلًا. ثُمَّ يَأْتِي الْجِيلُ الثَّالِثُ، فَيَكُونُونَ قَدْ نَسُوا عَهْدَ الْبَدَاوَةِ وَالْخُشُونَةِ، وَيَنْغَمِسُونَ فِي النَّعِيمِ وَالتَّرَفِ، وَيَصِيرُونَ عِيَالًا عَلَى الدَّوْلَةِ. فَيَسْقُطُونَ فِي الْهَرَمِ وَالزَّوَالِ، وَيَحْتَاجُونَ إِلَى مَنْ يُدَافِعُ عَنْهُمْ، فَتَبْدَأُ الدَّوْلَةُ فِي الِانْقِرَاضِ."
-
-<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/645098004f731658826cfe57/FCGgeIu1F89rvNI55aVIx.wav"></audio>
-## refrence audio
-
-<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/645098004f731658826cfe57/cA9Z77_P0Rm2-hu1eosOC.wav"></audio>
-
-## Further Fine-tuning
-
-The model can be further fine-tuned for:
-- Non-diacritized text (requires additional training)
-- Specific voice characteristics
-- Domain-specific vocabulary
-- Dialectal variations
-
-Fine-tuning infrastructure: [Spark-TTS Fine-tune](https://github.com/tuan12378/Spark-TTS-finetune)
-
-
-## License
-
-This model is released under a **Non-Commercial License**.
-
-- You may use this model for research, educational, and personal non-commercial purposes.
-- Commercial use is strictly prohibited without explicit permission.
-- If you wish to use this model for commercial purposes, please contact the model author.
-
-
-## Acknowledgments
-
-- Base model: [Spark-TTS](https://github.com/tuan12378/Spark-TTS-finetune) by tuan12378
-
-## Limitations
-
-- Requires fully diacritized Arabic text as input
-- Optimized for Modern Standard Arabic (MSA), not dialectal Arabic
-- Performance may vary with very long texts without proper chunking
-- Voice cloning quality depends on reference audio quality and length
-- Generation speed scales with text length
+مرحبا مليون كيف حالك