---
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
license: mit
title: VoiceAPI
tags:
  - tts
  - text-to-speech
  - indian-languages
  - vits
  - multilingual
  - speech-synthesis
language:
  - hi
  - bn
  - mr
  - te
  - kn
  - en
  - bho
  - mai
  - mag
  - hne
  - gu
---

# 🎙️ VoiceAPI - Multi-lingual Indian Language TTS

An advanced **multi-speaker, multilingual text-to-speech (TTS) synthesizer** supporting 11 Indian languages with 21 voice options.

**Live API**: [https://harshil748-voiceapi.hf.space](https://harshil748-voiceapi.hf.space)

## 🌟 Features

- **11 Indian Languages**: Hindi, Bengali, Marathi, Telugu, Kannada, Gujarati, Bhojpuri, Chhattisgarhi, Maithili, Magahi, English
- **21 Voice Options**: Male and female voices for every language except Gujarati, which has a female voice only
- **High-Quality Audio**: 22050 Hz sample rate (16000 Hz for Gujarati), natural prosody
- **REST API**: Simple GET/POST endpoints for easy integration
- **Real-time Synthesis**: Fast inference on CPU/GPU

## 🗣️ Supported Languages

| Language | Code | Female | Male | Script |
|----------|------|--------|------|--------|
| Hindi | hi | ✅ | ✅ | देवनागरी |
| Bengali | bn | ✅ | ✅ | বাংলা |
| Marathi | mr | ✅ | ✅ | देवनागरी |
| Telugu | te | ✅ | ✅ | తెలుగు |
| Kannada | kn | ✅ | ✅ | ಕನ್ನಡ |
| Gujarati | gu | ✅ (MMS) | - | ગુજરાતી |
| Bhojpuri | bho | ✅ | ✅ | देवनागरी |
| Chhattisgarhi | hne | ✅ | ✅ | देवनागरी |
| Maithili | mai | ✅ | ✅ | देवनागरी |
| Magahi | mag | ✅ | ✅ | देवनागरी |
| English | en | ✅ | ✅ | Latin |

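The table above can also be written down as a small lookup, which makes the voice count easy to check. This is only an illustrative sketch: `VOICES` and `has_voice` are hypothetical names, not part of the project's code, and the real voice registry in `src/config.py` may be organised differently.

```python
# Language code -> voice genders, transcribed from the table above.
# Hypothetical structure for illustration; not the project's actual config.
VOICES = {
    "hi": ("female", "male"),
    "bn": ("female", "male"),
    "mr": ("female", "male"),
    "te": ("female", "male"),
    "kn": ("female", "male"),
    "gu": ("female",),  # MMS model, female voice only
    "bho": ("female", "male"),
    "hne": ("female", "male"),
    "mai": ("female", "male"),
    "mag": ("female", "male"),
    "en": ("female", "male"),
}


def has_voice(code: str, gender: str) -> bool:
    """Return True if the table lists a voice of this gender for the language code."""
    return gender in VOICES.get(code, ())


total_voices = sum(len(genders) for genders in VOICES.values())
print(len(VOICES), total_voices)  # prints: 11 21
```

Ten languages with two voices each plus the single Gujarati voice gives the 21 voices quoted in the feature list.
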
## 📡 API Usage

### Endpoint

```
GET/POST /Get_Inference
```

### Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `text` | string | Yes | Text to synthesize (lowercase for English) |
| `lang` | string | Yes | Language name (hindi, bengali, etc.) |
| `speaker_wav` | file | Yes | Reference WAV file (for API compatibility) |

### Example (Python)

```python
import requests

base_url = 'https://harshil748-voiceapi.hf.space/Get_Inference'
wav_path = 'reference.wav'

params = {
    'text': 'नमस्ते, आप कैसे हैं?',
    'lang': 'hindi',
}

# Use POST so the reference WAV travels as multipart form data;
# text and lang stay in the query string, as in the cURL example.
with open(wav_path, 'rb') as audio_file:
    response = requests.post(
        base_url,
        params=params,
        files={'speaker_wav': audio_file},
        timeout=120,
    )

if response.status_code == 200:
    with open('output.wav', 'wb') as f:
        f.write(response.content)
    print("Audio saved as 'output.wav'")
else:
    print(f"Request failed with status {response.status_code}")
```

### Example (cURL)

```bash
curl -X POST "https://harshil748-voiceapi.hf.space/Get_Inference?text=hello&lang=english" \
  -F "speaker_wav=@reference.wav" \
  -o output.wav
```

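After either example, a quick sanity check on `output.wav` is to read its header and compare the sample rate with the documented 22050 Hz. The sketch below uses the standard-library `wave` module. `describe_wav` is a hypothetical helper, and it assumes the API returns uncompressed PCM WAV, which is the only encoding `wave` can read.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read the header of a PCM WAV payload and return its basic properties."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }
```

A possible use is `info = describe_wav(open('output.wav', 'rb').read())`, followed by a check of `info["sample_rate"]`. If the server reports an error as JSON or HTML instead of audio, `wave.open` raises `wave.Error`, which is a convenient signal that the response was not a WAV file.
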
## 🏗️ Model Architecture

- **Base Model**: VITS (Variational Inference with adversarial learning for Text-to-Speech)
- **Encoder**: Transformer-based text encoder (6 layers, 192 hidden channels)
- **Decoder**: HiFi-GAN neural vocoder
- **Duration Predictor**: Stochastic duration predictor for natural prosody
- **Sample Rate**: 22050 Hz (16000 Hz for Gujarati MMS)

## 📊 Training

### Datasets Used

| Dataset | Languages | Source | License |
|---------|-----------|--------|---------|
| OpenSLR-103 | Hindi | [OpenSLR](https://www.openslr.org/103/) | CC BY 4.0 |
| OpenSLR-37 | Bengali | [OpenSLR](https://www.openslr.org/37/) | CC BY 4.0 |
| OpenSLR-64 | Marathi | [OpenSLR](https://www.openslr.org/64/) | CC BY 4.0 |
| OpenSLR-66 | Telugu | [OpenSLR](https://www.openslr.org/66/) | CC BY 4.0 |
| OpenSLR-79 | Kannada | [OpenSLR](https://www.openslr.org/79/) | CC BY 4.0 |
| OpenSLR-78 | Gujarati | [OpenSLR](https://www.openslr.org/78/) | CC BY 4.0 |
| Common Voice | Hindi, Bengali | [Mozilla](https://commonvoice.mozilla.org/) | CC0 |
| IndicTTS | Multiple | [IIT Madras](https://www.iitm.ac.in/donlab/tts/) | Research |
| Indic-Voices | Multiple | [AI4Bharat](https://ai4bharat.iitm.ac.in/indic-voices/) | CC BY 4.0 |

### Training Configuration

- **Epochs**: 1000
- **Batch Size**: 32
- **Learning Rate**: 2e-4
- **Optimizer**: AdamW
- **FP16 Training**: Enabled
- **Hardware**: NVIDIA V100/A100 GPUs

See `training/` directory for full training scripts and configurations.

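To give a sense of scale, the settings above can be collected in a small dataclass and used to estimate the number of optimizer steps in a run. This is a sketch only: `TrainingConfig` and `total_steps` are hypothetical names, the 10,000-utterance dataset size is an invented figure for the arithmetic, and the estimate assumes a single device with no gradient accumulation. The actual configuration files live in `training/configs/`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Values copied from the list above; field names are illustrative."""
    epochs: int = 1000
    batch_size: int = 32
    learning_rate: float = 2e-4
    optimizer: str = "AdamW"
    fp16: bool = True

    def total_steps(self, num_utterances: int) -> int:
        """Optimizer steps over the full run, counting a partial last batch per epoch."""
        return self.epochs * math.ceil(num_utterances / self.batch_size)


config = TrainingConfig()
# Hypothetical dataset of 10,000 utterances: ceil(10000 / 32) = 313 steps per epoch.
print(config.total_steps(10_000))  # prints: 313000
```
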
## 🚀 Deployment

This API is deployed on HuggingFace Spaces using Docker:

```dockerfile
FROM python:3.10-slim
# ... installs dependencies
# Downloads models from Harshil748/VoiceAPI-Models
# Runs FastAPI server on port 7860
```

Models are hosted separately at [Harshil748/VoiceAPI-Models](https://huggingface.co/Harshil748/VoiceAPI-Models) (~8GB).

## 📁 Project Structure

```
VoiceAPI/
├── app.py                  # HuggingFace Spaces entry point
├── Dockerfile              # Docker configuration
├── requirements.txt        # Python dependencies
├── download_models.py      # Model downloader
├── src/
│   ├── api.py              # FastAPI REST server
│   ├── engine.py           # TTS inference engine
│   ├── config.py           # Voice configurations
│   └── tokenizer.py        # Text tokenization
└── training/
    ├── train_vits.py       # VITS training script
    ├── prepare_dataset.py  # Data preparation
    ├── export_model.py     # Model export
    ├── datasets.csv        # Dataset links
    └── configs/            # Training configs
```

## 📜 License

- **Code**: MIT License
- **Models**: CC BY 4.0 (following SYSPIN licensing)
- **Datasets**: Individual licenses (see `training/datasets.csv`)

## 🙏 Acknowledgments

- [SYSPIN IISc SPIRE Lab](https://syspin.iisc.ac.in/) for pre-trained VITS models
- [Facebook MMS](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) for Gujarati TTS
- [Coqui TTS](https://github.com/coqui-ai/TTS) for the TTS library
- [AI4Bharat](https://ai4bharat.iitm.ac.in/) for Indian language resources

## 📧 Contact

Built for the **Voice Tech for All** Hackathon - Multi-lingual TTS for healthcare assistants serving low-income communities.