# 🏗️ VoiceAPI System Architecture

## High-Level System Diagram

```mermaid
flowchart TB
    subgraph Client["📱 Client Applications"]
        Web["🌐 Web App"]
        Mobile["📱 Mobile App"]
        Healthcare["🏥 Healthcare Assistant"]
    end

    subgraph API["🚀 FastAPI Server (Port 7860)"]
        Endpoint["/Get_Inference API"]
        LangRouter["Language Router"]
    end

    subgraph Engine["⚙️ TTS Engine"]
        Normalizer["Text Normalizer"]
        Tokenizer["Tokenizer"]
        StyleProc["Style Processor"]
        
        subgraph Models["Model Types"]
            VITS["VITS JIT Models\n(.pt files)"]
            Coqui["Coqui TTS\n(.pth files)"]
            MMS["Facebook MMS\n(HuggingFace)"]
        end
    end

    subgraph Languages["🗣️ 11 Languages"]
        Hindi["🇮🇳 Hindi"]
        Bengali["🇧🇩 Bengali"]
        Marathi["Marathi"]
        Telugu["Telugu"]
        Kannada["Kannada"]
        Gujarati["Gujarati"]
        Bhojpuri["Bhojpuri"]
        Others["+ 4 more"]
    end

    subgraph Output["🔊 Audio Output"]
        WAV["WAV File\n22050 Hz (16000 Hz for Gujarati)"]
    end

    Client -->|HTTP GET/POST| Endpoint
    Endpoint -->|text, lang| LangRouter
    LangRouter --> Normalizer
    Normalizer --> Tokenizer
    Tokenizer --> Models
    VITS --> StyleProc
    Coqui --> StyleProc
    MMS --> StyleProc
    StyleProc --> WAV
    WAV -->|Response| Client

    Models --> Languages
```
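
The client-to-endpoint hop in the diagram can be sketched as a small URL builder. This is a minimal illustration, assuming the server runs locally on port 7860 and accepts `text` and `lang` as query parameters; `BASE_URL` and `build_inference_url` are hypothetical names, not part of the project.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the server is reachable locally on the port shown in the diagram.
BASE_URL = "http://localhost:7860"


def build_inference_url(text: str, lang: str, base_url: str = BASE_URL) -> str:
    """Return a GET URL for /Get_Inference with percent-encoded parameters."""
    query = urlencode({"text": text, "lang": lang})
    return f"{base_url}/Get_Inference?{query}"


url = build_inference_url("नमस्ते", "hindi")
# Decoding the query string recovers the original parameters.
params = parse_qs(urlsplit(url).query)
```

Percent-encoding matters here because the text is usually non-ASCII; `urlencode` encodes it as UTF-8 by default.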

## Data Flow Diagram

```mermaid
sequenceDiagram
    participant C as Client
    participant A as API Server
    participant E as TTS Engine
    participant M as Model
    participant S as Style Processor

    C->>A: GET /Get_Inference?text=नमस्ते&lang=hindi
    A->>A: Parse parameters
    A->>E: synthesize(text, voice)
    E->>E: Normalize text
    E->>E: Tokenize to IDs
    E->>M: Load model (if not cached)
    M->>M: Forward pass (inference)
    M-->>E: Raw audio tensor
    E->>S: Apply style (pitch, speed, energy)
    S-->>E: Processed audio
    E-->>A: TTSOutput (audio, sample_rate)
    A->>A: Convert to WAV bytes
    A-->>C: audio/wav response
```
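
The "Convert to WAV bytes" step could look roughly like the following. It is a sketch using only the standard library, assuming the engine returns mono float samples in the range [-1.0, 1.0]; `floats_to_wav_bytes` is a hypothetical helper and the actual server may use a different audio library.

```python
import io
import struct
import wave


def floats_to_wav_bytes(samples, sample_rate=22050):
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()


# One second of silence at the default rate.
wav_bytes = floats_to_wav_bytes([0.0] * 22050)
```

The resulting bytes can be returned directly as the body of an `audio/wav` response.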

## Model Architecture

```mermaid
flowchart LR
    subgraph Input["📝 Input"]
        Text["Text Input"]
    end

    subgraph TextEncoder["🔤 Text Encoder"]
        Embed["Character Embedding"]
        TransEnc["Transformer Encoder\n(6 layers, 192 hidden)"]
    end

    subgraph FlowModel["🌊 Flow Model"]
        Prior["Prior Encoder"]
        Flow["Normalizing Flow"]
        Duration["Duration Predictor"]
    end

    subgraph Decoder["🔊 HiFi-GAN Decoder"]
        Upsample["Upsampling Layers"]
        ResBlocks["Residual Blocks"]
        Output["Audio Waveform"]
    end

    Text --> Embed --> TransEnc
    TransEnc --> Prior
    TransEnc --> Duration
    Prior --> Flow
    Duration --> Flow
    Flow --> Upsample --> ResBlocks --> Output
```
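
In VITS-style models, the predicted per-token durations are typically used to stretch token-level encoder outputs to frame level before the flow and decoder run. The sketch below illustrates that expansion step with plain Python lists standing in for tensors; `expand_by_duration` is a hypothetical helper, not a function from this codebase.

```python
def expand_by_duration(token_states, durations):
    """Repeat each token-level state durations[i] times to get frame-level states."""
    if len(token_states) != len(durations):
        raise ValueError("need exactly one duration per token")
    frames = []
    for state, n_frames in zip(token_states, durations):
        # A duration of 0 drops the token; negative values are treated as 0.
        frames.extend([state] * max(0, int(n_frames)))
    return frames


# Three tokens with predicted durations of 2, 1 and 3 frames.
frame_states = expand_by_duration(["n", "a", "m"], [2, 1, 3])
```

The output length equals the sum of the durations, which is what determines how long the synthesized audio will be.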

## Training Pipeline

```mermaid
flowchart TD
    subgraph Data["📊 Training Data"]
        OpenSLR["OpenSLR Datasets"]
        CommonVoice["Mozilla Common Voice"]
        IndicTTS["IndicTTS Corpus"]
        AI4Bharat["AI4Bharat Indic-Voices"]
    end

    subgraph Prep["🔧 Data Preparation"]
        Download["Download Audio"]
        Normalize["Normalize to 22050 Hz"]
        Transcript["Generate Transcripts"]
        Split["Train/Val Split"]
    end

    subgraph Train["🏋️ Training"]
        Config["Load Config YAML"]
        VITS_Train["VITS Training\n(1000 epochs)"]
        Checkpoint["Save Checkpoints"]
    end

    subgraph Export["📦 Export"]
        JIT["JIT Trace Model"]
        Chars["Generate chars.txt"]
        Package["Package for Inference"]
    end

    Data --> Download --> Normalize --> Transcript --> Split
    Split --> Config --> VITS_Train --> Checkpoint
    Checkpoint --> JIT --> Chars --> Package
```
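
The Train/Val Split step can be sketched as a deterministic shuffle followed by a slice. The 5% validation fraction, the seed, and the file naming below are illustrative assumptions; the pipeline's actual values are not stated in this document.

```python
import random


def train_val_split(items, val_fraction=0.05, seed=1234):
    """Shuffle deterministically and split into (train, val) lists."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one validation item when there is more than one item.
    n_val = max(1, round(len(shuffled) * val_fraction)) if len(shuffled) > 1 else 0
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical (audio path, transcript) pairs.
pairs = [(f"wavs/utt_{i:04d}.wav", f"transcript {i}") for i in range(100)]
train, val = train_val_split(pairs)
```

Fixing the seed keeps the split reproducible across runs, so validation metrics stay comparable between checkpoints.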

## Deployment Architecture

```mermaid
flowchart TB
    subgraph HF["☁️ HuggingFace Infrastructure"]
        subgraph Space["🚀 HF Space (Docker)"]
            Docker["Docker Container"]
            FastAPI["FastAPI Server\n:7860"]
            Models_Dir["models/ directory"]
        end
        
        subgraph ModelRepo["📦 Model Repository"]
            ModelFiles["Harshil748/VoiceAPI-Models\n(~8GB)"]
        end
    end

    subgraph External["🌐 External Services"]
        MMS_HF["facebook/mms-tts-guj\n(Gujarati)"]
    end

    User["👤 User"] -->|HTTPS| FastAPI
    Docker -->|Build time| ModelFiles
    FastAPI -->|Runtime| MMS_HF
    Models_Dir -.->|Loaded from| ModelFiles
```

## Voice Configuration Map

```mermaid
mindmap
  root((VoiceAPI))
    Hindi
      hi_male
      hi_female
    Bengali
      bn_male
      bn_female
    Marathi
      mr_male
      mr_female
    Telugu
      te_male
      te_female
    Kannada
      kn_male
      kn_female
    Gujarati
      gu_mms
    Bhojpuri
      bho_male
      bho_female
    Chhattisgarhi
      hne_male
      hne_female
    Maithili
      mai_male
      mai_female
    Magahi
      mag_male
      mag_female
    English
      en_male
      en_female
```
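
The mind map above can be expressed as a lookup table. The voice IDs are copied from the map; the dictionary layout, the lowercase language keys, and the `resolve_voice` helper are assumptions for illustration, and the real structure in `src/config.py` may differ.

```python
# Voice IDs copied from the mind map; the dictionary layout is illustrative.
VOICES = {
    "hindi": {"male": "hi_male", "female": "hi_female"},
    "bengali": {"male": "bn_male", "female": "bn_female"},
    "marathi": {"male": "mr_male", "female": "mr_female"},
    "telugu": {"male": "te_male", "female": "te_female"},
    "kannada": {"male": "kn_male", "female": "kn_female"},
    "gujarati": {"default": "gu_mms"},
    "bhojpuri": {"male": "bho_male", "female": "bho_female"},
    "chhattisgarhi": {"male": "hne_male", "female": "hne_female"},
    "maithili": {"male": "mai_male", "female": "mai_female"},
    "magahi": {"male": "mag_male", "female": "mag_female"},
    "english": {"male": "en_male", "female": "en_female"},
}


def resolve_voice(lang, gender="female"):
    """Return the voice ID for a language, e.g. ("hindi", "male") -> "hi_male"."""
    voices = VOICES.get(lang.strip().lower())
    if voices is None:
        raise KeyError(f"unsupported language: {lang!r}")
    # Fall back to the language's first listed voice when the requested
    # gender is unavailable (Gujarati has a single MMS voice).
    return voices.get(gender, next(iter(voices.values())))
```

Gujarati is the only language with a single voice, so any gender request for it resolves to `gu_mms`.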

## Component Interaction

| Component | File | Purpose |
|-----------|------|---------|
| API Server | `src/api.py` | FastAPI REST endpoints |
| TTS Engine | `src/engine.py` | Model loading & inference |
| Tokenizer | `src/tokenizer.py` | Text → Token IDs |
| Config | `src/config.py` | Language & model configs |
| Model Loader | `src/model_loader.py` | Model file management |
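
The tokenizer's text-to-ID mapping can be sketched as a character-level vocabulary lookup. This is not the code in `src/tokenizer.py`; reserving ID 0 for padding and silently skipping unknown characters are assumptions made for the example.

```python
def build_vocab(chars):
    """Map each symbol to an integer ID, reserving 0 for padding."""
    vocab = {"<pad>": 0}
    for ch in chars:
        if ch not in vocab:
            vocab[ch] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert text to a list of token IDs, skipping unknown characters."""
    return [vocab[ch] for ch in text if ch in vocab]


# Hypothetical symbol list; a real chars.txt would define the actual inventory.
vocab = build_vocab("abc ")
ids = encode("cab x", vocab)
```

Because the IDs depend on symbol order, the same symbol list used at training time has to be used at inference time.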

## Performance Characteristics

| Metric | Value |
|--------|-------|
| Inference Time | ~200-500 ms per sentence |
| Model Load Time | ~2-5 s per voice |
| Audio Sample Rate | 22050 Hz (16000 Hz for Gujarati) |
| Supported Formats | WAV |
| Concurrent Requests | Limited by memory |
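
Since loading a voice takes seconds and memory bounds how many can stay resident, one common approach is a small least-recently-used cache of loaded models. The sketch below is an illustration under that assumption; `VoiceCache` and `fake_loader` are hypothetical, and the engine's real caching strategy is not described here.

```python
from collections import OrderedDict


class VoiceCache:
    """Keep at most max_voices loaded models, evicting the least recently used."""

    def __init__(self, loader, max_voices=3):
        self._loader = loader          # callable: voice ID -> loaded model
        self._max_voices = max_voices
        self._models = OrderedDict()

    def get(self, voice_id):
        if voice_id in self._models:
            self._models.move_to_end(voice_id)   # mark as most recently used
            return self._models[voice_id]
        model = self._loader(voice_id)           # slow path: load the model
        self._models[voice_id] = model
        if len(self._models) > self._max_voices:
            self._models.popitem(last=False)     # evict least recently used
        return model

    def loaded(self):
        """Return cached voice IDs, least recently used first."""
        return list(self._models)


# Stand-in loader that records calls instead of reading model files.
load_calls = []


def fake_loader(voice_id):
    load_calls.append(voice_id)
    return f"model:{voice_id}"


cache = VoiceCache(fake_loader, max_voices=2)
for voice in ["hi_male", "bn_female", "hi_male", "te_male"]:
    cache.get(voice)
```

With a capacity of two, the repeated `hi_male` request is served from the cache, and `bn_female` is evicted when `te_male` is loaded.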

---
*Built for the Voice Tech for All Hackathon*