# 🏗️ VoiceAPI System Architecture

## High-Level System Diagram

```mermaid
flowchart TB
    subgraph Client["📱 Client Applications"]
        Web["🌐 Web App"]
        Mobile["📱 Mobile App"]
        Healthcare["🏥 Healthcare Assistant"]
    end
    subgraph API["🚀 FastAPI Server (Port 7860)"]
        Endpoint["/Get_Inference API"]
        LangRouter["Language Router"]
    end
    subgraph Engine["⚙️ TTS Engine"]
        Normalizer["Text Normalizer"]
        Tokenizer["Tokenizer"]
        StyleProc["Style Processor"]
        subgraph Models["Model Types"]
            VITS["VITS JIT Models\n(.pt files)"]
            Coqui["Coqui TTS\n(.pth files)"]
            MMS["Facebook MMS\n(HuggingFace)"]
        end
    end
    subgraph Languages["🗣️ 11 Languages"]
        Hindi["🇮🇳 Hindi"]
        Bengali["🇧🇩 Bengali"]
        Marathi["Marathi"]
        Telugu["Telugu"]
        Kannada["Kannada"]
        Gujarati["Gujarati"]
        Bhojpuri["Bhojpuri"]
        Others["+ 4 more"]
    end
    subgraph Output["🔊 Audio Output"]
        WAV["WAV File\n22050 Hz"]
    end
    Client -->|HTTP GET/POST| Endpoint
    Endpoint -->|text, lang| LangRouter
    LangRouter --> Normalizer
    Normalizer --> Tokenizer
    Tokenizer --> Models
    VITS --> StyleProc
    Coqui --> StyleProc
    MMS --> StyleProc
    StyleProc --> WAV
    WAV -->|Response| Client
    Models --> Languages
```
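To make the client-to-endpoint edge concrete, here is a minimal client sketch using only the Python standard library. The `/Get_Inference` path, the `text` and `lang` parameters, and port 7860 come from the diagram. The `BASE_URL` host and the helper names `build_inference_url` and `fetch_wav` are assumptions for illustration, not part of the project.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the server is reachable locally on the port shown in the diagram.
BASE_URL = "http://localhost:7860"


def build_inference_url(text: str, lang: str, base_url: str = BASE_URL) -> str:
    """Build a GET URL for /Get_Inference with percent-encoded parameters."""
    query = urlencode({"text": text, "lang": lang})
    return f"{base_url}/Get_Inference?{query}"


def fetch_wav(text: str, lang: str, timeout: float = 30.0) -> bytes:
    """Send the request and return the raw response body (expected to be WAV bytes)."""
    with urlopen(build_inference_url(text, lang), timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Non-ASCII text such as Devanagari is UTF-8 percent-encoded by urlencode.
    print(build_inference_url("नमस्ते", "hindi"))
```

Encoding the query with `urlencode` matters here because most supported languages use non-Latin scripts, which cannot be placed in a URL unescaped.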
## Data Flow Diagram

```mermaid
sequenceDiagram
    participant C as Client
    participant A as API Server
    participant E as TTS Engine
    participant M as Model
    participant S as Style Processor
    C->>A: GET /Get_Inference?text=नमस्ते&lang=hindi
    A->>A: Parse parameters
    A->>E: synthesize(text, voice)
    E->>E: Normalize text
    E->>E: Tokenize to IDs
    E->>M: Load model (if not cached)
    M->>M: Forward pass (inference)
    M-->>E: Raw audio tensor
    E->>S: Apply style (pitch, speed, energy)
    S-->>E: Processed audio
    E-->>A: TTSOutput (audio, sample_rate)
    A->>A: Convert to WAV bytes
    A-->>C: audio/wav response
```
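The "Convert to WAV bytes" step can be sketched with the standard-library `wave` module. This is an illustration under stated assumptions, not the project's code: it assumes the engine returns mono float samples in the range -1.0 to 1.0 and that 16-bit PCM is the target encoding. The function name `to_wav_bytes` is hypothetical.

```python
import io
import math
import struct
import wave


def to_wav_bytes(samples, sample_rate=22050):
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    pcm = bytearray()
    for sample in samples:
        # Clip first so out-of-range values do not overflow the int16 range.
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm += struct.pack("<h", int(round(clipped * 32767)))

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))
    return buffer.getvalue()


if __name__ == "__main__":
    # 0.1 s of a 440 Hz tone at the sample rate named in the diagrams.
    tone = [math.sin(2 * math.pi * 440 * n / 22050) for n in range(2205)]
    print(len(to_wav_bytes(tone)), "bytes")
```

Writing into an in-memory buffer avoids touching the filesystem, so the bytes can be returned directly as the `audio/wav` response body.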
## Model Architecture

```mermaid
flowchart LR
    subgraph Input["📝 Input"]
        Text["Text Input"]
    end
    subgraph TextEncoder["🔤 Text Encoder"]
        Embed["Character Embedding"]
        TransEnc["Transformer Encoder\n(6 layers, 192 hidden)"]
    end
    subgraph FlowModel["🌊 Flow Model"]
        Prior["Prior Encoder"]
        Flow["Normalizing Flow"]
        Duration["Duration Predictor"]
    end
    subgraph Decoder["🔊 HiFi-GAN Decoder"]
        Upsample["Upsampling Layers"]
        ResBlocks["Residual Blocks"]
        Output["Audio Waveform"]
    end
    Text --> Embed --> TransEnc
    TransEnc --> Prior
    TransEnc --> Duration
    Prior --> Flow
    Duration --> Flow
    Flow --> Upsample --> ResBlocks --> Output
```
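The edge from the duration predictor into the flow is the least obvious part of this diagram. In VITS-style inference, predicted durations decide how many output frames each input token covers, and the token-level states are repeated accordingly before decoding. The plain-Python sketch below illustrates that idea only. The function names and the `length_scale` parameter are assumptions for illustration, and the real models operate on tensors rather than lists.

```python
import math


def durations_to_frames(log_durations, length_scale=1.0):
    """Turn predicted log-durations into integer frame counts by rounding up."""
    return [math.ceil(math.exp(d) * length_scale) for d in log_durations]


def expand_by_duration(token_states, frame_counts):
    """Repeat each token-level state once per frame assigned to that token."""
    if len(token_states) != len(frame_counts):
        raise ValueError("need exactly one frame count per token")
    frames = []
    for state, count in zip(token_states, frame_counts):
        frames.extend([state] * count)
    return frames


if __name__ == "__main__":
    counts = durations_to_frames([0.0, math.log(2.5)])
    print(counts, expand_by_duration(["a", "b"], counts))
```

A larger `length_scale` yields more frames per token, which is one common way a speed control can be implemented in this family of models.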
## Training Pipeline

```mermaid
flowchart TD
    subgraph Data["📊 Training Data"]
        OpenSLR["OpenSLR Datasets"]
        CommonVoice["Mozilla Common Voice"]
        IndicTTS["IndicTTS Corpus"]
        AI4Bharat["AI4Bharat Indic-Voices"]
    end
    subgraph Prep["🔧 Data Preparation"]
        Download["Download Audio"]
        Normalize["Normalize to 22050 Hz"]
        Transcript["Generate Transcripts"]
        Split["Train/Val Split"]
    end
    subgraph Train["🏋️ Training"]
        Config["Load Config YAML"]
        VITS_Train["VITS Training\n(1000 epochs)"]
        Checkpoint["Save Checkpoints"]
    end
    subgraph Export["📦 Export"]
        JIT["JIT Trace Model"]
        Chars["Generate chars.txt"]
        Package["Package for Inference"]
    end
    Data --> Download --> Normalize --> Transcript --> Split
    Split --> Config --> VITS_Train --> Checkpoint
    Checkpoint --> JIT --> Chars --> Package
```
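Two of the pipeline steps, the train/validation split and the character inventory behind `chars.txt`, are simple enough to sketch in plain Python. The split fraction, the seed, and the one-character-per-entry inventory are assumptions for illustration. The actual `chars.txt` layout used by the project is not documented here.

```python
import random


def split_train_val(items, val_fraction=0.1, seed=0):
    """Shuffle deterministically, then return (train, val) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one validation item whenever there is any data at all.
    n_val = max(1, int(len(shuffled) * val_fraction)) if shuffled else 0
    return shuffled[n_val:], shuffled[:n_val]


def build_charset(transcripts):
    """Collect every distinct character in the transcripts, in sorted order."""
    return sorted({ch for line in transcripts for ch in line})


if __name__ == "__main__":
    train, val = split_train_val([f"utt_{i}" for i in range(20)])
    print(len(train), len(val))
    print(build_charset(["नमस्ते", "hello"]))
```

Seeding the shuffle keeps the split reproducible across runs, so validation metrics from different checkpoints stay comparable.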
## Deployment Architecture

```mermaid
flowchart TB
    subgraph HF["☁️ HuggingFace Infrastructure"]
        subgraph Space["🚀 HF Space (Docker)"]
            Docker["Docker Container"]
            FastAPI["FastAPI Server\n:7860"]
            Models_Dir["models/ directory"]
        end
        subgraph ModelRepo["📦 Model Repository"]
            ModelFiles["Harshil748/VoiceAPI-Models\n(~8GB)"]
        end
    end
    subgraph External["🌐 External Services"]
        MMS_HF["facebook/mms-tts-guj\n(Gujarati)"]
    end
    User["👤 User"] -->|HTTPS| FastAPI
    Docker -->|Build time| ModelFiles
    FastAPI -->|Runtime| MMS_HF
    Models_Dir -.->|Loaded from| ModelFiles
```
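One plausible way to populate `models/` at build time is `snapshot_download` from the `huggingface_hub` package. This is a sketch, not the project's actual build step, which may use a different mechanism. The repository name comes from the diagram. The `fetch_models` helper and its injectable `download` parameter are assumptions added so the function can be exercised without network access.

```python
from pathlib import Path

# Repository name taken from the deployment diagram.
MODEL_REPO = "Harshil748/VoiceAPI-Models"


def fetch_models(target_dir="models", repo_id=MODEL_REPO, download=None):
    """Download the model repository into target_dir and return the local path."""
    if download is None:
        # Imported lazily so this module loads even without huggingface_hub installed.
        from huggingface_hub import snapshot_download as download

    Path(target_dir).mkdir(parents=True, exist_ok=True)
    return download(repo_id=repo_id, local_dir=str(target_dir))


if __name__ == "__main__":
    # Requires network access and roughly 8 GB of free disk space.
    print(fetch_models())
```

Fetching at build time keeps the first request fast, at the cost of a larger image. The Gujarati MMS model is shown as a runtime dependency instead, so it is not covered by this step.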
## Voice Configuration Map

```mermaid
mindmap
  root((VoiceAPI))
    Hindi
      hi_male
      hi_female
    Bengali
      bn_male
      bn_female
    Marathi
      mr_male
      mr_female
    Telugu
      te_male
      te_female
    Kannada
      kn_male
      kn_female
    Gujarati
      gu_mms
    Bhojpuri
      bho_male
      bho_female
    Chhattisgarhi
      hne_male
      hne_female
    Maithili
      mai_male
      mai_female
    Magahi
      mag_male
      mag_female
    English
      en_male
      en_female
```
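The same map can be written as a lookup table. The voice IDs below are copied from the mindmap. The `VOICES` dictionary layout and the `voices_for` helper are assumptions for illustration, and the project's `src/config.py` may structure this differently.

```python
# Language name -> voice IDs, as listed in the mindmap.
VOICES = {
    "Hindi": ["hi_male", "hi_female"],
    "Bengali": ["bn_male", "bn_female"],
    "Marathi": ["mr_male", "mr_female"],
    "Telugu": ["te_male", "te_female"],
    "Kannada": ["kn_male", "kn_female"],
    "Gujarati": ["gu_mms"],  # single voice, served by the MMS model
    "Bhojpuri": ["bho_male", "bho_female"],
    "Chhattisgarhi": ["hne_male", "hne_female"],
    "Maithili": ["mai_male", "mai_female"],
    "Magahi": ["mag_male", "mag_female"],
    "English": ["en_male", "en_female"],
}


def voices_for(language):
    """Return the voice IDs for a language name, matched case-insensitively."""
    wanted = language.strip().lower()
    for name, voices in VOICES.items():
        if name.lower() == wanted:
            return list(voices)
    raise KeyError(f"unsupported language: {language!r}")


if __name__ == "__main__":
    print(voices_for("hindi"))
```

Case-insensitive matching lets a request such as `lang=hindi` resolve to the `Hindi` entry without a separate alias table.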
## Component Interaction

| Component | File | Purpose |
|-----------|------|---------|
| API Server | `src/api.py` | FastAPI REST endpoints |
| TTS Engine | `src/engine.py` | Model loading & inference |
| Tokenizer | `src/tokenizer.py` | Text → Token IDs |
| Config | `src/config.py` | Language & model configs |
| Model Loader | `src/model_loader.py` | Model file management |
## Performance Characteristics

| Metric | Value |
|--------|-------|
| Inference Time | ~200-500 ms per sentence |
| Model Load Time | ~2-5 s per voice |
| Audio Sample Rate | 22050 Hz (16000 Hz for Gujarati) |
| Supported Formats | WAV |
| Concurrent Requests | Limited by memory |
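
Because loading a voice takes seconds and memory is the limiting resource, one common compromise is a bounded least-recently-used cache of loaded models. The sketch below shows that pattern in general terms. The `ModelCache` class, its `max_models` limit, and the `loader` callable are hypothetical and are not taken from `src/engine.py`.

```python
from collections import OrderedDict


class ModelCache:
    """Keep at most max_models loaded voices, evicting the least recently used."""

    def __init__(self, loader, max_models=3):
        if max_models < 1:
            raise ValueError("max_models must be at least 1")
        self._loader = loader
        self._max_models = max_models
        self._models = OrderedDict()

    def get(self, voice):
        """Return the model for a voice, loading it only on a cache miss."""
        if voice in self._models:
            self._models.move_to_end(voice)  # mark as most recently used
            return self._models[voice]
        model = self._loader(voice)
        self._models[voice] = model
        if len(self._models) > self._max_models:
            self._models.popitem(last=False)  # drop the least recently used voice
        return model

    def __contains__(self, voice):
        return voice in self._models


if __name__ == "__main__":
    cache = ModelCache(lambda voice: f"model:{voice}", max_models=2)
    for voice in ["hi_male", "bn_male", "hi_male", "te_male"]:
        cache.get(voice)
    print("bn_male" in cache, "hi_male" in cache)
```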
---

*Built for the Voice Tech for All Hackathon*