# VoiceAPI System Architecture

## High-Level System Diagram
```mermaid
flowchart TB
    subgraph Client["Client Applications"]
        Web["Web App"]
        Mobile["Mobile App"]
        Healthcare["Healthcare Assistant"]
    end
    subgraph API["FastAPI Server (Port 7860)"]
        Endpoint["/Get_Inference API"]
        LangRouter["Language Router"]
    end
    subgraph Engine["TTS Engine"]
        Normalizer["Text Normalizer"]
        Tokenizer["Tokenizer"]
        StyleProc["Style Processor"]
        subgraph Models["Model Types"]
            VITS["VITS JIT Models\n(.pt files)"]
            Coqui["Coqui TTS\n(.pth files)"]
            MMS["Facebook MMS\n(HuggingFace)"]
        end
    end
    subgraph Languages["11 Languages"]
        Hindi["Hindi"]
        Bengali["Bengali"]
        Marathi["Marathi"]
        Telugu["Telugu"]
        Kannada["Kannada"]
        Gujarati["Gujarati"]
        Bhojpuri["Bhojpuri"]
        Others["+ 4 more"]
    end
    subgraph Output["Audio Output"]
        WAV["WAV File\n22050 Hz"]
    end
    Client -->|HTTP GET/POST| Endpoint
    Endpoint -->|text, lang| LangRouter
    LangRouter --> Normalizer
    Normalizer --> Tokenizer
    Tokenizer --> Models
    VITS --> StyleProc
    Coqui --> StyleProc
    MMS --> StyleProc
    StyleProc --> WAV
    WAV -->|Response| Client
    Models --> Languages
```
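The client-to-endpoint hop in the diagram can be sketched as a small request builder. The `/Get_Inference` path and the `text`/`lang` parameter names come from the diagrams above; the base URL is a placeholder, and this is an illustrative sketch rather than an official client.

```python
from urllib.parse import urlencode

def build_inference_url(base_url: str, text: str, lang: str) -> str:
    """Build a /Get_Inference request URL.

    Parameter names (text, lang) are taken from the architecture diagram;
    urlencode percent-encodes non-ASCII input such as Devanagari text.
    """
    query = urlencode({"text": text, "lang": lang})
    return f"{base_url}/Get_Inference?{query}"

url = build_inference_url("http://localhost:7860", "नमस्ते", "hindi")
```

Fetching `url` with any HTTP client should then return an `audio/wav` response, per the data flow below.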
## Data Flow Diagram
```mermaid
sequenceDiagram
    participant C as Client
    participant A as API Server
    participant E as TTS Engine
    participant M as Model
    participant S as Style Processor
    C->>A: GET /Get_Inference?text=नमस्ते&lang=hindi
    A->>A: Parse parameters
    A->>E: synthesize(text, voice)
    E->>E: Normalize text
    E->>E: Tokenize to IDs
    E->>M: Load model (if not cached)
    M->>M: Forward pass (inference)
    M-->>E: Raw audio tensor
    E->>S: Apply style (pitch, speed, energy)
    S-->>E: Processed audio
    E-->>A: TTSOutput (audio, sample_rate)
    A->>A: Convert to WAV bytes
    A-->>C: audio/wav response
```
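The final "Convert to WAV bytes" step can be sketched with the standard-library `wave` module. The mono, 16-bit PCM format is an assumption (a common choice, not stated in the source); the sample rate matches the 22050 Hz output noted above.

```python
import io
import math
import struct
import wave

def to_wav_bytes(samples, sample_rate=22050):
    """Pack float samples in [-1, 1] into mono 16-bit PCM WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)           # mono (assumed)
        w.setsampwidth(2)           # 16-bit PCM (assumed)
        w.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        w.writeframes(frames)
    return buf.getvalue()

# 0.1 s of a 440 Hz tone as a stand-in for the model's raw audio tensor
tone = [0.5 * math.sin(2 * math.pi * 440 * i / 22050) for i in range(2205)]
wav = to_wav_bytes(tone)
```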
## Model Architecture
```mermaid
flowchart LR
    subgraph Input["Input"]
        Text["Text Input"]
    end
    subgraph TextEncoder["Text Encoder"]
        Embed["Character Embedding"]
        TransEnc["Transformer Encoder\n(6 layers, 192 hidden)"]
    end
    subgraph FlowModel["Flow Model"]
        Prior["Prior Encoder"]
        Flow["Normalizing Flow"]
        Duration["Duration Predictor"]
    end
    subgraph Decoder["HiFi-GAN Decoder"]
        Upsample["Upsampling Layers"]
        ResBlocks["Residual Blocks"]
        Output["Audio Waveform"]
    end
    Text --> Embed --> TransEnc
    TransEnc --> Prior
    TransEnc --> Duration
    Prior --> Flow
    Duration --> Flow
    Flow --> Upsample --> ResBlocks --> Output
```
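The duration predictor's role in the diagram is to decide how many decoder-rate frames each text-level frame should cover. The toy below only illustrates that expansion; real VITS uses a stochastic duration predictor and monotonic alignment search inside the flow model, not this deterministic repetition.

```python
def expand_by_duration(encoder_frames, durations):
    """Repeat each text-encoder frame by its predicted duration.

    Toy illustration of the Duration Predictor box above; actual VITS
    samples durations stochastically and aligns via the normalizing flow.
    """
    out = []
    for frame, d in zip(encoder_frames, durations):
        out.extend([frame] * d)
    return out

# 3 phoneme-level frames expanded to 6 decoder-rate frames
print(expand_by_duration(["h", "i", "!"], [2, 3, 1]))
# -> ['h', 'h', 'i', 'i', 'i', '!']
```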
## Training Pipeline
```mermaid
flowchart TD
    subgraph Data["Training Data"]
        OpenSLR["OpenSLR Datasets"]
        CommonVoice["Mozilla Common Voice"]
        IndicTTS["IndicTTS Corpus"]
        AI4Bharat["AI4Bharat Indic-Voices"]
    end
    subgraph Prep["Data Preparation"]
        Download["Download Audio"]
        Normalize["Normalize to 22050 Hz"]
        Transcript["Generate Transcripts"]
        Split["Train/Val Split"]
    end
    subgraph Train["Training"]
        Config["Load Config YAML"]
        VITS_Train["VITS Training\n(1000 epochs)"]
        Checkpoint["Save Checkpoints"]
    end
    subgraph Export["Export"]
        JIT["JIT Trace Model"]
        Chars["Generate chars.txt"]
        Package["Package for Inference"]
    end
    Data --> Download --> Normalize --> Transcript --> Split
    Split --> Config --> VITS_Train --> Checkpoint
    Checkpoint --> JIT --> Chars --> Package
```
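The "Generate chars.txt" export step presumably collects the character inventory of the training transcripts so the inference-side tokenizer can reproduce the training vocabulary. The one-character-per-line format here is an assumption, not confirmed by the source.

```python
def build_char_inventory(transcripts):
    """Collect the sorted set of unique characters across transcripts
    into a chars.txt-style vocabulary (format assumed for illustration)."""
    chars = sorted({ch for line in transcripts for ch in line})
    return "\n".join(chars)

inventory = build_char_inventory(["नमस्ते", "नमस्कार"])
```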
## Deployment Architecture
```mermaid
flowchart TB
    subgraph HF["HuggingFace Infrastructure"]
        subgraph Space["HF Space (Docker)"]
            Docker["Docker Container"]
            FastAPI["FastAPI Server\n:7860"]
            Models_Dir["models/ directory"]
        end
        subgraph ModelRepo["Model Repository"]
            ModelFiles["Harshil748/VoiceAPI-Models\n(~8GB)"]
        end
    end
    subgraph External["External Services"]
        MMS_HF["facebook/mms-tts-guj\n(Gujarati)"]
    end
    User["User"] -->|HTTPS| FastAPI
    Docker -->|Build time| ModelFiles
    FastAPI -->|Runtime| MMS_HF
    Models_Dir -.->|Loaded from| ModelFiles
```
## Voice Configuration Map
```mermaid
mindmap
  root((VoiceAPI))
    Hindi
      hi_male
      hi_female
    Bengali
      bn_male
      bn_female
    Marathi
      mr_male
      mr_female
    Telugu
      te_male
      te_female
    Kannada
      kn_male
      kn_female
    Gujarati
      gu_mms
    Bhojpuri
      bho_male
      bho_female
    Chhattisgarhi
      hne_male
      hne_female
    Maithili
      mai_male
      mai_female
    Magahi
      mag_male
      mag_female
    English
      en_male
      en_female
```
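The language-to-voice mapping above can also be read as a lookup table. The voice IDs are transcribed from the map; the `resolve_voice` helper and its fallback rule are illustrative assumptions, not the project's actual routing code.

```python
# Language -> voice IDs, transcribed from the voice configuration map.
VOICES = {
    "hindi": ["hi_male", "hi_female"],
    "bengali": ["bn_male", "bn_female"],
    "marathi": ["mr_male", "mr_female"],
    "telugu": ["te_male", "te_female"],
    "kannada": ["kn_male", "kn_female"],
    "gujarati": ["gu_mms"],
    "bhojpuri": ["bho_male", "bho_female"],
    "chhattisgarhi": ["hne_male", "hne_female"],
    "maithili": ["mai_male", "mai_female"],
    "magahi": ["mag_male", "mag_female"],
    "english": ["en_male", "en_female"],
}

def resolve_voice(lang: str, gender: str = "female") -> str:
    """Pick a voice ID for a language, falling back to the only available
    voice when a language has a single model (e.g. Gujarati's MMS voice)."""
    options = VOICES[lang.lower()]
    for voice in options:
        if voice.endswith(gender):
            return voice
    return options[0]
```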
## Component Interaction
| Component | File | Purpose |
|---|---|---|
| API Server | src/api.py | FastAPI REST endpoints |
| TTS Engine | src/engine.py | Model loading & inference |
| Tokenizer | src/tokenizer.py | Text → Token IDs |
| Config | src/config.py | Language & model configs |
| Model Loader | src/model_loader.py | Model file management |
## Performance Characteristics
| Metric | Value |
|---|---|
| Inference Time | ~200-500ms per sentence |
| Model Load Time | ~2-5s per voice |
| Audio Sample Rate | 22050 Hz (16000 Hz for Gujarati) |
| Supported Formats | WAV |
| Concurrent Requests | Limited by memory |
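The sample-rate row above implies one language-dependent branch: the MMS-based Gujarati voice outputs 16 kHz while every other voice outputs 22.05 kHz. A minimal sketch of that rule (the function name is hypothetical):

```python
def sample_rate_for(lang: str) -> int:
    """Return the output sample rate per the performance table:
    16000 Hz for the MMS Gujarati voice, 22050 Hz otherwise."""
    return 16000 if lang.lower() == "gujarati" else 22050
```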
*Built for the Voice Tech for All Hackathon.*