	# 🏗️ VoiceAPI System Architecture

	## High-Level System Diagram

	```mermaid
	flowchart TB
	subgraph Client["📱 Client Applications"]
	Web["🌐 Web App"]
	Mobile["📱 Mobile App"]
	Healthcare["🏥 Healthcare Assistant"]
	end

	subgraph API["🚀 FastAPI Server (Port 7860)"]
	Endpoint["/Get_Inference API"]
	LangRouter["Language Router"]
	end

	subgraph Engine["⚙️ TTS Engine"]
	Normalizer["Text Normalizer"]
	Tokenizer["Tokenizer"]
	StyleProc["Style Processor"]

	subgraph Models["Model Types"]
	VITS["VITS JIT Models\n(.pt files)"]
	Coqui["Coqui TTS\n(.pth files)"]
	MMS["Facebook MMS\n(HuggingFace)"]
	end
	end

	subgraph Languages["🗣️ 11 Languages"]
	Hindi["🇮🇳 Hindi"]
	Bengali["🇧🇩 Bengali"]
	Marathi["Marathi"]
	Telugu["Telugu"]
	Kannada["Kannada"]
	Gujarati["Gujarati"]
	Bhojpuri["Bhojpuri"]
	Others["+ 4 more"]
	end

	subgraph Output["🔊 Audio Output"]
	WAV["WAV File\n22050 Hz"]
	end

	Client -->|HTTP GET/POST| Endpoint
	Endpoint -->|text, lang| LangRouter
	LangRouter --> Normalizer
	Normalizer --> Tokenizer
	Tokenizer --> Models
	VITS --> StyleProc
	Coqui --> StyleProc
	MMS --> StyleProc
	StyleProc --> WAV
	WAV -->|Response| Client

	Models --> Languages
	```
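
As a client-side illustration of the `Client --> Endpoint` edge, the sketch below builds a GET URL for `/Get_Inference` using only the standard library. The `text` and `lang` parameter names and port 7860 come from the diagrams in this document; `BASE_URL` and the helper name `build_inference_url` are assumptions made for the example, not part of the project.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical base URL; replace with the address of your own deployment.
BASE_URL = "http://localhost:7860"


def build_inference_url(text: str, lang: str, base_url: str = BASE_URL) -> str:
    """Return a GET URL for the /Get_Inference endpoint.

    urlencode percent-encodes the UTF-8 bytes of non-ASCII text, so
    Devanagari or Bengali input survives the query string intact.
    """
    query = urlencode({"text": text, "lang": lang})
    return f"{base_url}/Get_Inference?{query}"


url = build_inference_url("नमस्ते", "hindi")

# Round-trip check: decoding the query string recovers the original text.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["text"][0], decoded["lang"][0])
```

Encoding the query explicitly avoids relying on the HTTP client to handle non-Latin scripts correctly.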

	## Data Flow Diagram

	```mermaid
	sequenceDiagram
	participant C as Client
	participant A as API Server
	participant E as TTS Engine
	participant M as Model
	participant S as Style Processor

	C->>A: GET /Get_Inference?text=नमस्ते&lang=hindi
	A->>A: Parse parameters
	A->>E: synthesize(text, voice)
	E->>E: Normalize text
	E->>E: Tokenize to IDs
	E->>M: Load model (if not cached)
	M->>M: Forward pass (inference)
	M-->>E: Raw audio tensor
	E->>S: Apply style (pitch, speed, energy)
	S-->>E: Processed audio
	E-->>A: TTSOutput (audio, sample_rate)
	A->>A: Convert to WAV bytes
	A-->>C: audio/wav response
	```
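
The "Convert to WAV bytes" step can be sketched with the standard library alone. This is a minimal illustration that assumes the engine returns mono float samples in the range [-1.0, 1.0]; the function name `to_wav_bytes` is hypothetical and the real server may use a different audio library.

```python
import io
import math
import struct
import wave


def to_wav_bytes(samples, sample_rate=22050):
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    # Clip before scaling so out-of-range samples cannot overflow int16.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()


# One second of a 440 Hz tone as stand-in audio.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
wav_bytes = to_wav_bytes(tone)

# Read the header back to confirm the encoding parameters.
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    print(check.getnchannels(), check.getframerate(), check.getnframes())
```

The resulting bytes can be returned directly as the body of an `audio/wav` response.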

	## Model Architecture

	```mermaid
	flowchart LR
	subgraph Input["📝 Input"]
	Text["Text Input"]
	end

	subgraph TextEncoder["🔤 Text Encoder"]
	Embed["Character Embedding"]
	TransEnc["Transformer Encoder\n(6 layers, 192 hidden)"]
	end

	subgraph FlowModel["🌊 Flow Model"]
	Prior["Prior Encoder"]
	Flow["Normalizing Flow"]
	Duration["Duration Predictor"]
	end

	subgraph Decoder["🔊 HiFi-GAN Decoder"]
	Upsample["Upsampling Layers"]
	ResBlocks["Residual Blocks"]
	Output["Audio Waveform"]
	end

	Text --> Embed --> TransEnc
	TransEnc --> Prior
	TransEnc --> Duration
	Prior --> Flow
	Duration --> Flow
	Flow --> Upsample --> ResBlocks --> Output
	```
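
The `Duration --> Flow` edge is where token-level encoder outputs are stretched to frame level. The sketch below shows that expansion step in isolation, assuming integer per-token durations measured in frames, as is typical for VITS-style inference. The function name `expand_by_duration` is hypothetical, and plain strings stand in for the 192-dimensional hidden vectors.

```python
def expand_by_duration(token_states, durations):
    """Repeat each token-level state durations[i] times to get frame-level states."""
    if len(token_states) != len(durations):
        raise ValueError("need exactly one duration per token")
    frames = []
    for state, n_frames in zip(token_states, durations):
        # A zero or negative duration drops the token entirely.
        frames.extend([state] * max(0, int(n_frames)))
    return frames


# Three tokens lasting 2, 3, and 1 frames respectively.
frames = expand_by_duration(["n", "a", "m"], [2, 3, 1])
print(frames)  # ['n', 'n', 'a', 'a', 'a', 'm']
```

The total number of frames, and therefore the length of the generated audio, is the sum of the predicted durations.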

	## Training Pipeline

	```mermaid
	flowchart TD
	subgraph Data["📊 Training Data"]
	OpenSLR["OpenSLR Datasets"]
	CommonVoice["Mozilla Common Voice"]
	IndicTTS["IndicTTS Corpus"]
	AI4Bharat["AI4Bharat Indic-Voices"]
	end

	subgraph Prep["🔧 Data Preparation"]
	Download["Download Audio"]
	Normalize["Normalize to 22050 Hz"]
	Transcript["Generate Transcripts"]
	Split["Train/Val Split"]
	end

	subgraph Train["🏋️ Training"]
	Config["Load Config YAML"]
	VITS_Train["VITS Training\n(1000 epochs)"]
	Checkpoint["Save Checkpoints"]
	end

	subgraph Export["📦 Export"]
	JIT["JIT Trace Model"]
	Chars["Generate chars.txt"]
	Package["Package for Inference"]
	end

	Data --> Download --> Normalize --> Transcript --> Split
	Split --> Config --> VITS_Train --> Checkpoint
	Checkpoint --> JIT --> Chars --> Package
	```
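
Two of the steps above, the train/validation split and the `chars.txt` generation, are simple enough to sketch. The code below is an illustration only: the 5% validation fraction, the fixed seed, and the one-character-per-line layout for `chars.txt` are assumptions, and both function names are hypothetical.

```python
import random


def train_val_split(items, val_fraction=0.05, seed=0):
    """Shuffle deterministically and split into (train, val) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def build_charset(transcripts):
    """Return the sorted set of characters seen across all transcripts."""
    return sorted({ch for line in transcripts for ch in line})


transcripts = ["नमस्ते", "नमस्कार", "आप कैसे हैं"]
train, val = train_val_split(transcripts, val_fraction=0.34)
chars = build_charset(transcripts)

# Assumed chars.txt layout: one character per line.
chars_txt = "\n".join(chars)
print(len(train), len(val), len(chars))
```

Building the character set from the full transcript list, rather than from the training split alone, avoids unknown characters at validation time.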

	## Deployment Architecture

	```mermaid
	flowchart TB
	subgraph HF["☁️ HuggingFace Infrastructure"]
	subgraph Space["🚀 HF Space (Docker)"]
	Docker["Docker Container"]
	FastAPI["FastAPI Server\n:7860"]
	Models_Dir["models/ directory"]
	end

	subgraph ModelRepo["📦 Model Repository"]
	ModelFiles["Harshil748/VoiceAPI-Models\n(~8GB)"]
	end
	end

	subgraph External["🌐 External Services"]
	MMS_HF["facebook/mms-tts-guj\n(Gujarati)"]
	end

	User["👤 User"] -->|HTTPS| FastAPI
	Docker -->|Build time| ModelFiles
	FastAPI -->|Runtime| MMS_HF
	Models_Dir -.->|Loaded from| ModelFiles
	```

	## Voice Configuration Map

	```mermaid
	mindmap
	root((VoiceAPI))
	Hindi
	hi_male
	hi_female
	Bengali
	bn_male
	bn_female
	Marathi
	mr_male
	mr_female
	Telugu
	te_male
	te_female
	Kannada
	kn_male
	kn_female
	Gujarati
	gu_mms
	Bhojpuri
	bho_male
	bho_female
	Chhattisgarhi
	hne_male
	hne_female
	Maithili
	mai_male
	mai_female
	Magahi
	mag_male
	mag_female
	English
	en_male
	en_female
	```
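
The same map can be written as a lookup table. The voice keys below are copied from the mindmap; the lowercase language names follow the `lang=hindi` form used in the request example, while the `gender` parameter, its default, and the `resolve_voice` helper are assumptions for illustration and may not match `src/config.py`.

```python
VOICES = {
    "hindi": ["hi_male", "hi_female"],
    "bengali": ["bn_male", "bn_female"],
    "marathi": ["mr_male", "mr_female"],
    "telugu": ["te_male", "te_female"],
    "kannada": ["kn_male", "kn_female"],
    "gujarati": ["gu_mms"],
    "bhojpuri": ["bho_male", "bho_female"],
    "chhattisgarhi": ["hne_male", "hne_female"],
    "maithili": ["mai_male", "mai_female"],
    "magahi": ["mag_male", "mag_female"],
    "english": ["en_male", "en_female"],
}


def resolve_voice(lang, gender="female"):
    """Pick a voice key for a language, falling back to its first voice."""
    try:
        candidates = VOICES[lang.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported language: {lang!r}") from None
    for voice in candidates:
        if voice.endswith(f"_{gender}"):
            return voice
    # Gujarati has a single MMS voice with no gender suffix.
    return candidates[0]


print(resolve_voice("hindi", "male"))   # hi_male
print(resolve_voice("Gujarati"))        # gu_mms
```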

	## Component Interaction

	| Component | File | Purpose |
	|-----------|------|---------|
	| API Server | `src/api.py` | FastAPI REST endpoints |
	| TTS Engine | `src/engine.py` | Model loading & inference |
	| Tokenizer | `src/tokenizer.py` | Text → Token IDs |
	| Config | `src/config.py` | Language & model configs |
	| Model Loader | `src/model_loader.py` | Model file management |

	## Performance Characteristics

	| Metric | Value |
	|--------|-------|
	| Inference Time | ~200-500 ms per sentence |
	| Model Load Time | ~2-5 s per voice |
	| Audio Sample Rate | 22050 Hz (16000 Hz for Gujarati) |
	| Supported Formats | WAV |
	| Concurrent Requests | Limited by memory |
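
Because loading a voice takes seconds while inference takes milliseconds, and memory is the limiting resource, one common approach is a bounded least-recently-used cache of loaded models. The sketch below illustrates that idea; the `ModelCache` class, the `max_models` limit, and the loader callable are all hypothetical and do not describe how `src/engine.py` actually caches models.

```python
from collections import OrderedDict


class ModelCache:
    """Keep at most max_models loaded voices, evicting the least recently used."""

    def __init__(self, loader, max_models=3):
        self._loader = loader            # callable: voice key -> loaded model
        self._max_models = max_models
        self._models = OrderedDict()

    def get(self, voice):
        if voice in self._models:
            self._models.move_to_end(voice)    # mark as most recently used
            return self._models[voice]
        model = self._loader(voice)            # slow path: pay the load cost once
        self._models[voice] = model
        if len(self._models) > self._max_models:
            self._models.popitem(last=False)   # drop the least recently used voice
        return model


# Stand-in loader that records each load instead of reading model files.
loads = []


def fake_loader(voice):
    loads.append(voice)
    return f"model:{voice}"


cache = ModelCache(fake_loader, max_models=2)
cache.get("hi_male")
cache.get("bn_female")
cache.get("hi_male")      # served from cache, no reload
cache.get("te_male")      # evicts bn_female, the least recently used
cache.get("bn_female")    # reloaded after eviction
print(loads)
```

This sketch is not thread-safe; a server handling concurrent requests would need a lock around `get`.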

	---
	Built for the Voice Tech for All Hackathon