# 🏗️ VoiceAPI System Architecture
## High-Level System Diagram
```mermaid
flowchart TB
subgraph Client["📱 Client Applications"]
Web["🌐 Web App"]
Mobile["📱 Mobile App"]
Healthcare["🏥 Healthcare Assistant"]
end
subgraph API["🚀 FastAPI Server (Port 7860)"]
Endpoint["/Get_Inference API"]
LangRouter["Language Router"]
end
subgraph Engine["⚙️ TTS Engine"]
Normalizer["Text Normalizer"]
Tokenizer["Tokenizer"]
StyleProc["Style Processor"]
subgraph Models["Model Types"]
VITS["VITS JIT Models\n(.pt files)"]
Coqui["Coqui TTS\n(.pth files)"]
MMS["Facebook MMS\n(HuggingFace)"]
end
end
subgraph Languages["🗣️ 11 Languages"]
Hindi["🇮🇳 Hindi"]
Bengali["🇧🇩 Bengali"]
Marathi["Marathi"]
Telugu["Telugu"]
Kannada["Kannada"]
Gujarati["Gujarati"]
Bhojpuri["Bhojpuri"]
Others["+ 4 more"]
end
subgraph Output["🔊 Audio Output"]
WAV["WAV File\n22050 Hz"]
end
Client -->|HTTP GET/POST| Endpoint
Endpoint -->|text, lang| LangRouter
LangRouter --> Normalizer
Normalizer --> Tokenizer
Tokenizer --> Models
VITS --> StyleProc
Coqui --> StyleProc
MMS --> StyleProc
StyleProc --> WAV
WAV -->|Response| Client
Models --> Languages
```
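As an illustration of the client side of this diagram, the sketch below builds a GET URL for `/Get_Inference`. Only the endpoint path, the port (7860), and the `text` and `lang` parameters come from this document; the `localhost` base URL and the helper name `build_inference_url` are assumptions made for the example.

```python
from urllib.parse import urlencode

# Assumed base URL for a local run; only the port and the endpoint path
# are taken from the diagram above.
BASE_URL = "http://localhost:7860"


def build_inference_url(text: str, lang: str, base_url: str = BASE_URL) -> str:
    """Return a GET URL for /Get_Inference with percent-encoded parameters.

    Hypothetical helper: non-ASCII text (e.g. Devanagari) is UTF-8
    percent-encoded so it survives transport in the query string.
    """
    query = urlencode({"text": text, "lang": lang})
    return f"{base_url}/Get_Inference?{query}"
```

Calling `build_inference_url("नमस्ते", "hindi")` yields a URL that any HTTP client can fetch; the response body is expected to be the WAV audio shown in the diagram.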
## Data Flow Diagram
```mermaid
sequenceDiagram
participant C as Client
participant A as API Server
participant E as TTS Engine
participant M as Model
participant S as Style Processor
C->>A: GET /Get_Inference?text=नमस्ते&lang=hindi
A->>A: Parse parameters
A->>E: synthesize(text, voice)
E->>E: Normalize text
E->>E: Tokenize to IDs
E->>M: Load model (if not cached)
M->>M: Forward pass (inference)
M-->>E: Raw audio tensor
E->>S: Apply style (pitch, speed, energy)
S-->>E: Processed audio
E-->>A: TTSOutput (audio, sample_rate)
A->>A: Convert to WAV bytes
A-->>C: audio/wav response
```
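The "Convert to WAV bytes" step can be sketched with the Python standard library alone. This is a minimal illustration, not the code in `src/api.py`: it assumes the engine returns mono float samples in the range [-1.0, 1.0], and the function name `to_wav_bytes` is hypothetical.

```python
import io
import struct
import wave


def to_wav_bytes(samples, sample_rate: int = 22050) -> bytes:
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes.

    Assumption: the audio is a flat sequence of floats; a real engine
    would more likely hand over a tensor or NumPy array.
    """
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buf.getvalue()
```

The returned bytes can be sent back with an `audio/wav` content type, matching the last arrow in the diagram.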
## Model Architecture
```mermaid
flowchart LR
subgraph Input["📝 Input"]
Text["Text Input"]
end
subgraph TextEncoder["🔤 Text Encoder"]
Embed["Character Embedding"]
TransEnc["Transformer Encoder\n(6 layers, 192 hidden)"]
end
subgraph FlowModel["🌊 Flow Model"]
Prior["Prior Encoder"]
Flow["Normalizing Flow"]
Duration["Duration Predictor"]
end
subgraph Decoder["🔊 HiFi-GAN Decoder"]
Upsample["Upsampling Layers"]
ResBlocks["Residual Blocks"]
Output["Audio Waveform"]
end
Text --> Embed --> TransEnc
TransEnc --> Prior
TransEnc --> Duration
Prior --> Flow
Duration --> Flow
Flow --> Upsample --> ResBlocks --> Output
```
## Training Pipeline
```mermaid
flowchart TD
subgraph Data["📊 Training Data"]
OpenSLR["OpenSLR Datasets"]
CommonVoice["Mozilla Common Voice"]
IndicTTS["IndicTTS Corpus"]
AI4Bharat["AI4Bharat Indic-Voices"]
end
subgraph Prep["🔧 Data Preparation"]
Download["Download Audio"]
Normalize["Normalize to 22050 Hz"]
Transcript["Generate Transcripts"]
Split["Train/Val Split"]
end
subgraph Train["🏋️ Training"]
Config["Load Config YAML"]
VITS_Train["VITS Training\n(1000 epochs)"]
Checkpoint["Save Checkpoints"]
end
subgraph Export["📦 Export"]
JIT["JIT Trace Model"]
Chars["Generate chars.txt"]
Package["Package for Inference"]
end
Data --> Download --> Normalize --> Transcript --> Split
Split --> Config --> VITS_Train --> Checkpoint
Checkpoint --> JIT --> Chars --> Package
```
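The "Train/Val Split" step could look roughly like the sketch below. The 10% validation fraction, the fixed seed, and the function name are assumptions for illustration; this document does not state which split the project uses.

```python
import random


def train_val_split(items, val_fraction: float = 0.1, seed: int = 0):
    """Shuffle items reproducibly and split them into (train, val) lists.

    Assumptions: val_fraction and seed are illustrative defaults, and
    items is any iterable of utterance IDs or (audio, transcript) pairs.
    """
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)

    # Keep at least one validation item whenever there is any data.
    n_val = max(1, int(len(shuffled) * val_fraction)) if shuffled else 0
    return shuffled[n_val:], shuffled[:n_val]
```

A fixed seed keeps the validation set stable across runs, which makes checkpoints from different training runs comparable.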
## Deployment Architecture
```mermaid
flowchart TB
subgraph HF["☁️ HuggingFace Infrastructure"]
subgraph Space["🚀 HF Space (Docker)"]
Docker["Docker Container"]
FastAPI["FastAPI Server\n:7860"]
Models_Dir["models/ directory"]
end
subgraph ModelRepo["📦 Model Repository"]
ModelFiles["Harshil748/VoiceAPI-Models\n(~8GB)"]
end
end
subgraph External["🌐 External Services"]
MMS_HF["facebook/mms-tts-guj\n(Gujarati)"]
end
User["👤 User"] -->|HTTPS| FastAPI
Docker -->|Build time| ModelFiles
FastAPI -->|Runtime| MMS_HF
Models_Dir -.->|Loaded from| ModelFiles
```
## Voice Configuration Map
```mermaid
mindmap
  root((VoiceAPI))
    Hindi
      hi_male
      hi_female
    Bengali
      bn_male
      bn_female
    Marathi
      mr_male
      mr_female
    Telugu
      te_male
      te_female
    Kannada
      kn_male
      kn_female
    Gujarati
      gu_mms
    Bhojpuri
      bho_male
      bho_female
    Chhattisgarhi
      hne_male
      hne_female
    Maithili
      mai_male
      mai_female
    Magahi
      mag_male
      mag_female
    English
      en_male
      en_female
```
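The same map can be expressed as a lookup table. The voice IDs below are copied from the mindmap; the lowercase language keys (other than `hindi`, which appears in the example request), the `gender` parameter, and the function `resolve_voice` are assumptions for illustration and may not match `src/config.py`.

```python
# Voice IDs taken from the mindmap above; dictionary keys are assumed.
VOICES = {
    "hindi": ["hi_male", "hi_female"],
    "bengali": ["bn_male", "bn_female"],
    "marathi": ["mr_male", "mr_female"],
    "telugu": ["te_male", "te_female"],
    "kannada": ["kn_male", "kn_female"],
    "gujarati": ["gu_mms"],  # single MMS voice, no male/female variants
    "bhojpuri": ["bho_male", "bho_female"],
    "chhattisgarhi": ["hne_male", "hne_female"],
    "maithili": ["mai_male", "mai_female"],
    "magahi": ["mag_male", "mag_female"],
    "english": ["en_male", "en_female"],
}


def resolve_voice(lang: str, gender: str = "female") -> str:
    """Return a voice ID for a language (hypothetical routing helper).

    Falls back to the first available voice when the language has no
    variant for the requested gender, as with Gujarati.
    """
    voices = VOICES.get(lang.strip().lower())
    if voices is None:
        raise ValueError(f"Unsupported language: {lang!r}")
    for voice in voices:
        if voice.endswith(f"_{gender}"):
            return voice
    return voices[0]
```

With this table, `resolve_voice("hindi", "male")` returns `hi_male`, and `resolve_voice("gujarati")` returns `gu_mms` regardless of the requested gender.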
## Component Interaction
| Component | File | Purpose |
|-----------|------|---------|
| API Server | `src/api.py` | FastAPI REST endpoints |
| TTS Engine | `src/engine.py` | Model loading & inference |
| Tokenizer | `src/tokenizer.py` | Text → Token IDs |
| Config | `src/config.py` | Language & model configs |
| Model Loader | `src/model_loader.py` | Model file management |
## Performance Characteristics
| Metric | Value |
|--------|-------|
| Inference Time | ~200-500 ms per sentence |
| Model Load Time | ~2-5 s per voice |
| Audio Sample Rate | 22050 Hz (16000 Hz for Gujarati) |
| Supported Formats | WAV |
| Concurrent Requests | Limited by memory |
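Because loading a voice takes seconds while inference takes milliseconds, and memory is the limiting resource, one common way to balance the two is a bounded least-recently-used cache of loaded models. The sketch below is an assumption about how that could be done, not a description of `src/engine.py`; `ModelCache`, `loader`, and `max_models` are hypothetical names.

```python
from collections import OrderedDict
from typing import Any, Callable


class ModelCache:
    """Keep at most max_models loaded voices, evicting the least recently used.

    Not thread-safe as written; a server handling concurrent requests
    would need to guard get() with a lock.
    """

    def __init__(self, loader: Callable[[str], Any], max_models: int = 4):
        self._loader = loader          # called only on a cache miss
        self._max_models = max_models
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, voice: str) -> Any:
        if voice in self._models:
            # Cache hit: mark this voice as most recently used.
            self._models.move_to_end(voice)
            return self._models[voice]

        # Cache miss: pay the load cost once, then evict the oldest
        # entry if the cache has grown past its limit.
        model = self._loader(voice)
        self._models[voice] = model
        if len(self._models) > self._max_models:
            self._models.popitem(last=False)
        return model
```

Raising `max_models` trades memory for fewer reloads; with 21 voices in total, the right value depends on how much RAM the deployment has available.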
---
*Built for Voice Tech for All Hackathon*