

An independent Arabic Text-to-Speech (TTS) model based on the Rectified Flow Diffusion Transformer (RF-DiT) architecture, with Voice Design capabilities for controllable speaker identity, pitch, and style. Instead of requiring reference audio for voice cloning, this model offers Voice Design with 7 different voices.

The acoustic model was trained entirely from scratch on Arabic speech data using random initialization, with independently developed training and inference pipelines.

The current version was trained on approximately 400–500 hours of carefully filtered Arabic speech (SNR > 20 dB). Because large-scale open Arabic speech datasets are scarce, synthesis quality may still vary depending on:

  • text length
  • punctuation & formatting
  • inference settings
  • reference audio quality
  • dialect variation
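The SNR > 20 dB threshold mentioned above can be illustrated with a small sketch. How the training data was actually scored is not stated in this card; the example assumes a noise-only segment is available for each clip, and the helper names (`mean_power`, `snr_db`, `passes_filter`) are hypothetical.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average power of a block of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))


def passes_filter(signal: Sequence[float], noise: Sequence[float],
                  threshold_db: float = 20.0) -> bool:
    """Keep a clip only if its SNR is above the threshold (assumed: 20 dB)."""
    return snr_db(signal, noise) > threshold_db


# Toy data: a "speech" block with amplitude 1.0 and two noise floors.
speech = [1.0, -1.0] * 100
quiet_noise = [0.05, -0.05] * 100   # amplitude ratio 20:1
loud_noise = [0.5, -0.5] * 100      # amplitude ratio 2:1

print(passes_filter(speech, quiet_noise))  # clean clip
print(passes_filter(speech, loud_noise))   # noisy clip
```

An amplitude ratio of 10:1 corresponds to 20 dB, so clips whose noise floor is less than a tenth of the speech amplitude would pass this kind of filter.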

The model was trained on text without diacritics, e.g., "هذا السؤال وحده يمكن ان يغير حياتك بالكامل" ("This question alone can change your life completely").
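Since the training text carried no diacritics, it may help to strip them from input before synthesis. The following is a minimal sketch, not part of the project's code: `strip_diacritics` is a hypothetical helper, and whether the repository's own pipeline already normalizes input this way is not stated here.

```python
import re

# Arabic combining marks commonly treated as diacritics (tashkeel):
# U+064B..U+0652 covers tanween, fatha, damma, kasra, shadda, and sukun;
# U+0670 is the superscript (dagger) alef.
_TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return `text` with Arabic diacritics removed (hypothetical helper)."""
    return _TASHKEEL.sub("", text)


# Base letters, spaces, and punctuation are left untouched.
print(strip_diacritics("هَذَا السُّؤَالُ"))
```

The character range is deliberately narrow so that letters such as hamza forms are preserved; extending it (for example to Quranic annotation marks) would be a separate design decision.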

Some artifacts, instability, repetition, or pronunciation mistakes may still occur during generation, especially on long or complex sentences.

Future versions will focus on:

  • scaling training data
  • improving stability
  • enhancing pronunciation accuracy
  • reducing audio artifacts
  • improving expressive speech generation

🤝 Community Contributions Welcome

Contributions are highly appreciated, including:

  • Arabic speech datasets
  • training improvements
  • inference optimizations
  • bug fixes
  • evaluation & testing
  • documentation improvements

📊 Technical Specifications & Requirements

| Specification | Value / Description |
|---|---|
| Total Parameters | ~553.4 million |
| Core Architecture | model_dim: 1280, 12 Transformer layers, 20 attention heads, mlp_ratio: 2.875 |
| Latent Space | 32-dimensional continuous latent space via DACVAE |
| Sample Rate | 44100 Hz |
| Current Training Data | ~400–500 hours of high-quality Arabic speech (SNR > 20 dB) |
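The core architecture values imply a couple of derived dimensions, which a short sketch can make explicit. The `DiTSpec` class below is hypothetical and only copies the numbers from the table; the MLP width assumes the conventional definition (hidden width = model_dim × mlp_ratio), which this card does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiTSpec:
    """Values copied from the specification table; the class is hypothetical."""
    model_dim: int = 1280
    num_layers: int = 12
    num_heads: int = 20
    mlp_ratio: float = 2.875
    latent_dim: int = 32
    sample_rate_hz: int = 44100

    @property
    def head_dim(self) -> int:
        # Each attention head gets an equal slice of the model dimension.
        if self.model_dim % self.num_heads != 0:
            raise ValueError("model_dim must be divisible by num_heads")
        return self.model_dim // self.num_heads

    @property
    def mlp_hidden_dim(self) -> int:
        # Assumption: hidden width = model_dim * mlp_ratio.
        return int(self.model_dim * self.mlp_ratio)


spec = DiTSpec()
print(spec.head_dim, spec.mlp_hidden_dim)  # prints "64 3680"
```

Under these assumptions each head is 64-dimensional and the feed-forward hidden width is 3680.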

📌 Project Overview

This project is a diffusion-based TTS system inspired by modern architectures such as:

  • Echo-TTS
  • Irodori-TTS

🏗️ Architecture

Instead of relying on discrete audio tokens common in traditional TTS systems, this model generates Continuous Latent Representations using DACVAE.

| Component | Description |
|---|---|
| RF-DiT | Diffusion transformer responsible for step-by-step generation of acoustic latent representations |
| DACVAE | Encodes/decodes audio into a high-fidelity continuous latent space |
| Arabic Text Encoder | Processes Arabic text representations (hidden_size: 768) |
| Continuous Latent Space | Preserves fine acoustic details and minimizes spectral distortion |
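The "step-by-step generation" performed by a rectified-flow model can be sketched as Euler integration of a velocity field from noise to a latent vector. This is an illustrative toy, not the project's sampler: the time convention (t = 0 is noise, t = 1 is data), the function names, and the closed-form `make_toy_velocity` stand-in for the trained, text-conditioned network are all assumptions.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 noise: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (data)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        # One Euler step along the predicted velocity.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Stand-in for the trained network: points straight at a known target."""
    def velocity(x: Vector, t: float) -> Vector:
        # On a straight path x_t = (1 - t) * noise + t * target,
        # the remaining displacement divided by the remaining time.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


random.seed(0)
latent_dim = 32  # matches the 32-dimensional latent space in the table
target = [random.uniform(-1.0, 1.0) for _ in range(latent_dim)]
noise = [random.gauss(0.0, 1.0) for _ in range(latent_dim)]

result = euler_sample(make_toy_velocity(target), noise, num_steps=16)
max_err = max(abs(r - t) for r, t in zip(result, target))
print(f"max error after 16 steps: {max_err:.2e}")
```

Because the toy velocity field follows a perfectly straight path, Euler integration lands on the target up to floating-point error; a real network only approximates such a field, which is one reason the number of sampling steps affects quality.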

🌊 Continuous Latent Space

The system converts audio into compact continuous latent vectors (32-dim), which the diffusion model then learns to generate directly. This approach enables:

  • ✅ Smoother temporal generation
  • ✅ Reduced quantization artifacts
  • ✅ Preservation of fine acoustic details (breathing, vocal characteristics, prosody)
  • ✅ Improved stability for longer utterances

🎛️ Style & Pitch Control

The RF-DiT architecture supports conditional style embedding, allowing control over:

  • Speaker identity & pitch/timbre
  • Speech rate & rhythm
  • Expressive characteristics

(Based on inference settings and the provided reference audio)

Integrated Watermarking: SilentCipher is integrated to apply robust, invisible audio watermarks directly to generated outputs, promoting responsible AI usage.

🚀 Roadmap & Upcoming Updates

| Feature | Planned Updates |
|---|---|
| Speakers | Expand support to a larger pool of male & female speakers |
| Training Data | Scale to ~1000–2000 hours of high-quality Arabic speech |
| Quality & Stability | Improve pronunciation accuracy & reduce spectral artifacts |
| Voice Cloning | Experimental support for Zero-Shot Voice Cloning (3–10 s reference) |
| Expressivity | Integration of fine-grained emotional & stylistic controls |
🎧 Audio Samples

  • Natural tone (نبرة طبيعية)
  • News tone (نبرة اخباريه)
  • Calm tone (نبرة هادئة)
  • Female voice, formal tone (صوت انثوي نبرة رسمية)
  • Religious tone (نبرة دينية)
  • Calm formal tone (نبرة رسمية هادئة)
  • Female voice, news tone (صوت انثوي نبره اخباريه)

🚀 Usage

For inference code, installation instructions, and training scripts, please refer to the GitHub repository:

https://github.com/sherif1313/3arab-TTS

Installation

git clone https://github.com/sherif1313/3arab-TTS.git
cd 3arab-TTS
uv sync

🙏 Acknowledgments

Aratako/Irodori-TTS
jordand/echo-tts-base
LlamaForCausalLM
facebook/dacvae-watermarked (Audio latent encoder)

All model training, pipeline implementation, and acoustic weights were developed independently. No proprietary acoustic weights, private datasets, or closed-source pipelines were used during development.

📜 License

Licensed under the Apache 2.0 License.
