# 🏗️ VoiceAPI System Architecture

## High-Level System Diagram

```mermaid
flowchart TB
    subgraph Client["📱 Client Applications"]
        Web["🌐 Web App"]
        Mobile["📱 Mobile App"]
        Healthcare["🏥 Healthcare Assistant"]
    end

    subgraph API["🚀 FastAPI Server (Port 7860)"]
        Endpoint["/Get_Inference API"]
        LangRouter["Language Router"]
    end

    subgraph Engine["⚙️ TTS Engine"]
        Normalizer["Text Normalizer"]
        Tokenizer["Tokenizer"]
        StyleProc["Style Processor"]
        
        subgraph Models["Model Types"]
            VITS["VITS JIT Models\n(.pt files)"]
            Coqui["Coqui TTS\n(.pth files)"]
            MMS["Facebook MMS\n(HuggingFace)"]
        end
    end

    subgraph Languages["🗣️ 11 Languages"]
        Hindi["🇮🇳 Hindi"]
        Bengali["🇧🇩 Bengali"]
        Marathi["Marathi"]
        Telugu["Telugu"]
        Kannada["Kannada"]
        Gujarati["Gujarati"]
        Bhojpuri["Bhojpuri"]
        Others["+ 4 more"]
    end

    subgraph Output["🔊 Audio Output"]
        WAV["WAV File\n22050 Hz (16000 Hz for Gujarati)"]
    end

    Client -->|HTTP GET/POST| Endpoint
    Endpoint -->|text, lang| LangRouter
    LangRouter --> Normalizer
    Normalizer --> Tokenizer
    Tokenizer --> Models
    VITS --> StyleProc
    Coqui --> StyleProc
    MMS --> StyleProc
    StyleProc --> WAV
    WAV -->|Response| Client

    Models --> Languages
```

## Data Flow Diagram

```mermaid
sequenceDiagram
    participant C as Client
    participant A as API Server
    participant E as TTS Engine
    participant M as Model
    participant S as Style Processor

    C->>A: GET /Get_Inference?text=नमस्ते&lang=hindi
    A->>A: Parse parameters
    A->>E: synthesize(text, voice)
    E->>E: Normalize text
    E->>E: Tokenize to IDs
    E->>M: Load model (if not cached)
    M->>M: Forward pass (inference)
    M-->>E: Raw audio tensor
    E->>S: Apply style (pitch, speed, energy)
    S-->>E: Processed audio
    E-->>A: TTSOutput (audio, sample_rate)
    A->>A: Convert to WAV bytes
    A-->>C: audio/wav response
```
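
The opening request in the sequence above can be built on the client side as follows. This is a minimal sketch, assuming only the `text` and `lang` query parameters shown in the diagram; the base URL `http://localhost:7860` and the helper name `build_inference_url` are hypothetical and chosen for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical base URL: the diagrams only state that the server listens on port 7860.
BASE_URL = "http://localhost:7860"


def build_inference_url(text: str, lang: str, base_url: str = BASE_URL) -> str:
    """Return a GET URL for /Get_Inference with percent-encoded UTF-8 parameters."""
    query = urlencode({"text": text, "lang": lang})
    return f"{base_url}/Get_Inference?{query}"


url = build_inference_url("नमस्ते", "hindi")

# Round-trip check: decoding the query string recovers the original parameters.
params = parse_qs(urlsplit(url).query)
print(params["text"][0], params["lang"][0])
```

Percent-encoding matters here because the text is usually in a non-Latin script. Fetching the resulting URL (for example with `urllib.request.urlopen`) should return the `audio/wav` body described in the last step of the diagram.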

## Model Architecture

```mermaid
flowchart LR
    subgraph Input["📝 Input"]
        Text["Text Input"]
    end

    subgraph TextEncoder["🔤 Text Encoder"]
        Embed["Character Embedding"]
        TransEnc["Transformer Encoder\n(6 layers, 192 hidden)"]
    end

    subgraph FlowModel["🌊 Flow Model"]
        Prior["Prior Encoder"]
        Flow["Normalizing Flow"]
        Duration["Duration Predictor"]
    end

    subgraph Decoder["🔊 HiFi-GAN Decoder"]
        Upsample["Upsampling Layers"]
        ResBlocks["Residual Blocks"]
        Output["Audio Waveform"]
    end

    Text --> Embed --> TransEnc
    TransEnc --> Prior
    TransEnc --> Duration
    Prior --> Flow
    Duration --> Flow
    Flow --> Upsample --> ResBlocks --> Output
```

## Training Pipeline

```mermaid
flowchart TD
    subgraph Data["📊 Training Data"]
        OpenSLR["OpenSLR Datasets"]
        CommonVoice["Mozilla Common Voice"]
        IndicTTS["IndicTTS Corpus"]
        AI4Bharat["AI4Bharat Indic-Voices"]
    end

    subgraph Prep["🔧 Data Preparation"]
        Download["Download Audio"]
        Normalize["Normalize to 22050 Hz"]
        Transcript["Generate Transcripts"]
        Split["Train/Val Split"]
    end

    subgraph Train["🏋️ Training"]
        Config["Load Config YAML"]
        VITS_Train["VITS Training\n(1000 epochs)"]
        Checkpoint["Save Checkpoints"]
    end

    subgraph Export["📦 Export"]
        JIT["JIT Trace Model"]
        Chars["Generate chars.txt"]
        Package["Package for Inference"]
    end

    Data --> Download --> Normalize --> Transcript --> Split
    Split --> Config --> VITS_Train --> Checkpoint
    Checkpoint --> JIT --> Chars --> Package
```
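
The `Generate chars.txt` export step is tied to tokenization: the character inventory determines which ID each symbol receives at inference time. The sketch below assumes the inventory is the sorted set of characters found in the training transcripts and that an ID is a character's position in that inventory. The actual layout of `chars.txt` is not described in this document, and `build_charset` and `encode` are hypothetical names.

```python
def build_charset(transcripts: list[str]) -> list[str]:
    """Collect every distinct character in the transcripts, in sorted order."""
    return sorted(set("".join(transcripts)))


def encode(text: str, charset: list[str]) -> list[int]:
    """Map each character to its position in the charset, skipping unknown ones."""
    index = {ch: i for i, ch in enumerate(charset)}
    return [index[ch] for ch in text if ch in index]


# Two toy transcripts stand in for a real training corpus.
charset = build_charset(["नमस्ते", "नमन"])
ids = encode("नमन", charset)
print(len(charset), ids)
```

Sorting keeps the mapping stable across runs, so a model exported with one inventory always sees the same IDs at inference time.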

## Deployment Architecture

```mermaid
flowchart TB
    subgraph HF["☁️ HuggingFace Infrastructure"]
        subgraph Space["🚀 HF Space (Docker)"]
            Docker["Docker Container"]
            FastAPI["FastAPI Server\n:7860"]
            Models_Dir["models/ directory"]
        end
        
        subgraph ModelRepo["📦 Model Repository"]
            ModelFiles["Harshil748/VoiceAPI-Models\n(~8GB)"]
        end
    end

    subgraph External["🌐 External Services"]
        MMS_HF["facebook/mms-tts-guj\n(Gujarati)"]
    end

    User["👤 User"] -->|HTTPS| FastAPI
    Docker -->|Build time| ModelFiles
    FastAPI -->|Runtime| MMS_HF
    Models_Dir -.->|Loaded from| ModelFiles
```

## Voice Configuration Map

```mermaid
mindmap
  root((VoiceAPI))
    Hindi
      hi_male
      hi_female
    Bengali
      bn_male
      bn_female
    Marathi
      mr_male
      mr_female
    Telugu
      te_male
      te_female
    Kannada
      kn_male
      kn_female
    Gujarati
      gu_mms
    Bhojpuri
      bho_male
      bho_female
    Chhattisgarhi
      hne_male
      hne_female
    Maithili
      mai_male
      mai_female
    Magahi
      mag_male
      mag_female
    English
      en_male
      en_female
```
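
The mind map above can also be written as a lookup table. This is a sketch, assuming a voice is selected from a language name plus a gender preference. Only the voice keys come from the map; the `VOICES` layout and the `resolve_voice` helper are hypothetical.

```python
# Voice keys copied from the mind map; the dict layout itself is illustrative.
VOICES: dict[str, list[str]] = {
    "hindi": ["hi_male", "hi_female"],
    "bengali": ["bn_male", "bn_female"],
    "marathi": ["mr_male", "mr_female"],
    "telugu": ["te_male", "te_female"],
    "kannada": ["kn_male", "kn_female"],
    "gujarati": ["gu_mms"],
    "bhojpuri": ["bho_male", "bho_female"],
    "chhattisgarhi": ["hne_male", "hne_female"],
    "maithili": ["mai_male", "mai_female"],
    "magahi": ["mag_male", "mag_female"],
    "english": ["en_male", "en_female"],
}


def resolve_voice(lang: str, gender: str = "female") -> str:
    """Pick a voice for a language, falling back to its first voice
    when none matches the requested gender (as with Gujarati)."""
    try:
        voices = VOICES[lang.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported language: {lang!r}") from None
    for voice in voices:
        if voice.endswith(f"_{gender}"):
            return voice
    return voices[0]


print(resolve_voice("Hindi", "male"), resolve_voice("gujarati"))
```

Gujarati is the only language with a single voice, so the fallback avoids rejecting a request just because a gender was specified.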

## Component Interaction

| Component | File | Purpose |
|-----------|------|---------|
| API Server | `src/api.py` | FastAPI REST endpoints |
| TTS Engine | `src/engine.py` | Model loading & inference |
| Tokenizer | `src/tokenizer.py` | Text → Token IDs |
| Config | `src/config.py` | Language & model configs |
| Model Loader | `src/model_loader.py` | Model file management |
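
The `Load model (if not cached)` step in the data flow implies that the engine keeps loaded voices in memory. Below is a minimal sketch of that idea, assuming a loader callable that maps a voice key to a model object. `ModelCache` and `fake_loader` are hypothetical names and do not reflect the contents of `src/engine.py` or `src/model_loader.py`.

```python
from typing import Callable, Dict, List


class ModelCache:
    """Load each voice at most once and reuse it for later requests."""

    def __init__(self, loader: Callable[[str], object]) -> None:
        self._loader = loader  # hypothetical: a real loader would read a model file
        self._models: Dict[str, object] = {}

    def get(self, voice: str) -> object:
        # Only call the (slow) loader the first time a voice is requested.
        if voice not in self._models:
            self._models[voice] = self._loader(voice)
        return self._models[voice]


# Stand-in loader that records its calls instead of loading real weights.
calls: List[str] = []


def fake_loader(voice: str) -> object:
    calls.append(voice)
    return {"voice": voice}


cache = ModelCache(fake_loader)
first = cache.get("hi_male")
second = cache.get("hi_male")
print(first is second, calls)
```

This cache is unbounded and not thread-safe. With many voices resident at once, memory becomes the limiting factor, so a bounded cache with eviction may be a better fit for small containers.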

## Performance Characteristics

| Metric | Value |
|--------|-------|
| Inference Time | ~200-500 ms per sentence |
| Model Load Time | ~2-5 s per voice |
| Audio Sample Rate | 22050 Hz (16000 Hz for Gujarati) |
| Supported Formats | WAV |
| Concurrent Requests | Limited by memory |
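
The `Convert to WAV bytes` step can be sketched with the standard library alone. This example assumes mono float samples in the range [-1.0, 1.0] encoded as 16-bit PCM. The channel count and bit depth are assumptions; only the sample rates come from the table above, and `to_wav_bytes` is a hypothetical helper name.

```python
import io
import struct
import wave


def to_wav_bytes(samples: list[float], sample_rate: int = 22050) -> bytes:
    """Encode mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono output
        wav.setsampwidth(2)   # assumption: 2 bytes per sample (16-bit)
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()


wav_bytes = to_wav_bytes([0.0, 0.5, -0.5, 1.0], sample_rate=22050)
print(len(wav_bytes))
```

For a Gujarati voice the same function would be called with `sample_rate=16000`, so the header matches the audio produced by that model.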

---
*Built for the Voice Tech for All Hackathon*