| license: mit | |
| title: 'Long-Form Text-to-Speech Generator' | |
| sdk: gradio | |
| emoji: π | |
| colorFrom: indigo | |
| colorTo: red | |
| pinned: true | |
| short_description: 'Unlimited text length: handle texts of any size.' | |
| # Long-Form Text-to-Speech Generator | |
| A Hugging Face Space that converts text of any length into natural-sounding speech using freely available open-source models. | |
| ## Features | |
| - **Unlimited Text Length**: Handles texts of any size, from short sentences to entire articles, by processing them in chunks | |
| - **Human-like Voice**: Uses Microsoft's SpeechT5 model for natural speech synthesis | |
| - **Smart Text Processing**: Chunking at sentence boundaries preserves sentence flow and natural pauses | |
| - **Completely Free**: Uses only open-source models; no API keys required | |
| - **Auto-preprocessing**: Normalizes text, expanding abbreviations and numbers | |
| - **Easy to Use**: Simple web interface built with Gradio | |
| ## How It Works | |
| 1. **Text Preprocessing**: Cleans and normalizes input text, handling abbreviations and numbers | |
| 2. **Smart Chunking**: Splits long text at natural sentence boundaries (max 500 chars per chunk) | |
| 3. **Speech Generation**: Processes each chunk using the SpeechT5 TTS model | |
| 4. **Audio Merging**: Combines all audio segments with natural pauses between chunks | |
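The chunking step above can be sketched roughly as follows. This is a minimal illustration, not the Space's actual `app.py`: the function names `split_into_chunks` and `_split_long` are hypothetical, and the sentence splitter is a simple regular-expression heuristic rather than a full tokenizer. Only the 500-character limit comes from the description above.

```python
import re

MAX_CHARS = 500  # per-chunk limit stated in the steps above


def _split_long(sentence: str, max_chars: int) -> list[str]:
    """Break one over-long sentence at word boundaries."""
    parts: list[str] = []
    current = ""
    for word in sentence.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                parts.append(current)
            current = word  # a single word longer than max_chars is kept whole
    if current:
        parts.append(current)
    return parts


def split_into_chunks(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Heuristic: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            pieces = _split_long(sentence, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Packing several short sentences into one chunk keeps the number of model calls low, while never cutting inside a sentence (unless it exceeds the limit on its own) avoids unnatural breaks in the audio.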
| ## Models Used | |
| - **Text-to-Speech**: `microsoft/speecht5_tts` - High-quality neural TTS | |
| - **Vocoder**: `microsoft/speecht5_hifigan` - Neural vocoder for audio generation | |
| - **Speaker Embeddings**: CMU Arctic dataset for consistent voice characteristics | |
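For orientation, the models listed above are typically loaded and called through the Transformers SpeechT5 classes as sketched below. The helper names `load_speecht5` and `synthesize_chunk` are hypothetical, and the sketch assumes the speaker embedding is supplied by the caller as a `(1, 512)` x-vector tensor; how this Space actually loads its CMU Arctic embeddings is not shown in this README.

```python
def load_speecht5():
    """Load the processor, TTS model, and vocoder named in the list above."""
    # Imported lazily so synthesize_chunk can be used without these packages installed.
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
    return processor, model, vocoder


def synthesize_chunk(text, processor, model, vocoder, speaker_embedding):
    """Return the waveform for one text chunk as a 1-D tensor of 16 kHz samples."""
    # Tokenize the chunk into input IDs for the TTS model.
    inputs = processor(text=text, return_tensors="pt")
    # Assumption: speaker_embedding is a (1, 512) x-vector tensor chosen by the caller.
    return model.generate_speech(
        inputs["input_ids"], speaker_embedding, vocoder=vocoder
    )
```

Reusing the same speaker embedding for every chunk is what keeps the voice consistent across a long document.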
| ## Usage | |
| 1. Enter or paste your text in the input box (no length limit!) | |
| 2. Click "Generate Speech" | |
| 3. Wait for processing (longer texts take more time) | |
| 4. Download or play the generated audio | |
| ## Tips for Best Results | |
| - Use proper punctuation for natural pauses | |
| - Well-formatted text produces better speech quality | |
| - The system automatically handles common abbreviations | |
| - Numbers are converted to spoken form | |
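As an illustration of the abbreviation and number handling mentioned above, here is a small sketch. The abbreviation table is hypothetical (the Space's real list is not given in this README), and the number conversion only covers standalone integers from 0 to 999; decimals, thousands separators, and longer numbers are deliberately left untouched.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {
    "Dr.": "Doctor",
    "Mr.": "Mister",
    "Mrs.": "Missus",
    "vs.": "versus",
}

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[ones]}" if ones else "")
    hundreds, rest = divmod(n, 100)
    return f"{ONES[hundreds]} hundred" + (f" {number_to_words(rest)}" if rest else "")


def normalize_text(text: str) -> str:
    """Expand known abbreviations, spell out small integers, collapse whitespace."""
    for abbr, full in ABBREVIATIONS.items():
        # Only match the abbreviation at the start of a word.
        text = re.sub(r"(?<!\w)" + re.escape(abbr), full, text)
    # Match 1-3 digit integers that are not part of a longer number or a word.
    text = re.sub(
        r"(?<![\w.,])\d{1,3}(?!\w|[.,]\d)",
        lambda m: number_to_words(int(m.group())),
        text,
    )
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize_text("Dr. Smith has  3 cats and 42 dogs.")` returns `"Doctor Smith has three cats and forty-two dogs."`.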
| ## Technical Details | |
| - **Architecture**: Transformer-based neural TTS | |
| - **Sample Rate**: 16 kHz | |
| - **Audio Format**: WAV | |
| - **Processing**: CPU-optimized (works on free Hugging Face hardware) | |
| - **Memory Efficient**: Processes text in chunks to handle large documents | |
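Given the 16 kHz WAV output listed above, merging per-chunk audio with short pauses might look like the sketch below. It uses only the Python standard library; the function names and the 0.3-second default pause are assumptions for illustration, not values taken from the Space's code.

```python
import sys
import wave
from array import array

SAMPLE_RATE = 16_000  # Hz, as listed above


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit signed PCM, clipping out-of-range values."""
    return array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))


def merge_with_pauses(segments, pause_seconds=0.3, sample_rate=SAMPLE_RATE):
    """Concatenate PCM segments, inserting silence between neighbouring segments."""
    silence = array("h", [0] * round(pause_seconds * sample_rate))
    merged = array("h")
    for index, segment in enumerate(segments):
        if index > 0:
            merged.extend(silence)
        merged.extend(segment)
    return merged


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono 16-bit PCM samples to a WAV file."""
    data = array("h", samples)
    if sys.byteorder == "big":
        data.byteswap()  # WAV stores samples in little-endian order
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(data.tobytes())
```

Because each chunk is synthesized and appended independently, only one chunk's model activations need to be in memory at a time, which is what makes long documents feasible on CPU-only hardware.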
| ## Local Installation | |
| ```bash | |
| git clone <your-space-url> | |
| cd <your-space-name> | |
| pip install -r requirements.txt | |
| python app.py | |
| ``` | |
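| ## License | |
| This project uses open-source models and is available for free use. Please check the individual licenses: | |
| - SpeechT5: MIT License (as stated on the `microsoft/speecht5_tts` model card) | |
| - CMU ARCTIC: free for commercial and non-commercial use under a permissive license from Carnegie Mellon University | |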
| ## Contributing | |
| Feel free to submit issues and enhancement requests! | |
| ## Links | |
| - [SpeechT5 Paper](https://arxiv.org/abs/2110.07205) | |
| - [Hugging Face Transformers](https://huggingface.co/transformers/) | |
| - [Gradio Documentation](https://gradio.app/docs/) | |
| --- | |
| **Built with ❤️ using Hugging Face Transformers and Gradio** |