# System Architecture

## Overview
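Whisper German ASR is a modular, production-ready speech recognition system for German audio, built on a Whisper-small model. It is served through a FastAPI REST API and a Gradio demo, and can be deployed locally, with Docker, on AWS, or on HuggingFace Spaces.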

---

## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     User Interfaces                          │
├─────────────────────────────────────────────────────────────┤
│  Web Browser  │  Mobile App  │  CLI  │  API Clients         │
└────────┬──────┴──────┬───────┴───┬───┴──────┬───────────────┘
         │             │           │          │
         ▼             ▼           ▼          ▼
┌─────────────┐  ┌──────────┐  ┌─────┐  ┌──────────┐
│   Gradio    │  │  Custom  │  │ CLI │  │ REST API │
│    Demo     │  │    UI    │  │     │  │  Client  │
└──────┬──────┘  └─────┬────┘  └──┬──┘  └────┬─────┘
       │               │           │          │
       └───────────────┴───────────┴──────────┘
                       │
                       ▼
         ┌─────────────────────────────┐
         │     FastAPI Application     │
         │  ┌───────────────────────┐  │
         │  │  /transcribe endpoint │  │
         │  │  /health endpoint     │  │
         │  │  /docs endpoint       │  │
         │  └───────────────────────┘  │
         └──────────────┬──────────────┘
                        │
                        ▼
         ┌─────────────────────────────┐
         │   Whisper Model Pipeline    │
         │  ┌───────────────────────┐  │
         │  │ 1. Audio Processing   │  │
         │  │    - Load audio       │  │
         │  │    - Resample 16kHz   │  │
         │  │    - Convert to mono  │  │
         │  ├───────────────────────┤  │
         │  │ 2. Feature Extraction │  │
         │  │    - Mel spectrogram  │  │
         │  │    - Normalization    │  │
         │  ├───────────────────────┤  │
         │  │ 3. Model Inference    │  │
         │  │    - Encoder          │  │
         │  │    - Decoder          │  │
         │  │    - Beam search      │  │
         │  ├───────────────────────┤  │
         │  │ 4. Post-processing    │  │
         │  │    - Token decoding   │  │
         │  │    - Text formatting  │  │
         │  └───────────────────────┘  │
         └──────────────┬──────────────┘
                        │
                        ▼
         ┌─────────────────────────────┐
         │      Response/Output        │
         │   German Transcription      │
         └─────────────────────────────┘
```

---

## Component Details

### 1. User Interfaces

#### Gradio Demo (`demo/app.py`)
```
┌─────────────────────────────────┐
│       Gradio Interface          │
├─────────────────────────────────┤
│  ┌──────────────────────────┐   │
│  │  Audio Input             │   │
│  │  - Microphone            │   │
│  │  - File Upload           │   │
│  └──────────────────────────┘   │
│  ┌──────────────────────────┐   │
│  │  Transcribe Button       │   │
│  └──────────────────────────┘   │
│  ┌──────────────────────────┐   │
│  │  Output Display          │   │
│  │  - Transcription         │   │
│  │  - Duration              │   │
│  └──────────────────────────┘   │
└─────────────────────────────────┘
```
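
The layout above could be wired up in Gradio 4 roughly as follows. This is a minimal sketch, not the contents of `demo/app.py`: the `build_demo` factory, the `transcribe_file` callback, and the widget labels are all assumptions made for illustration.

```python
from typing import Callable, Tuple


def format_result(text: str, duration_s: float) -> Tuple[str, float]:
    """Shape the two values shown in the output panel: transcription and duration."""
    return text.strip(), round(duration_s, 2)


def build_demo(transcribe_file: Callable[[str], Tuple[str, float]]):
    """Build a Gradio interface with microphone/upload input and two outputs.

    `transcribe_file` is a hypothetical callback that takes an audio file path
    and returns (text, duration_seconds).
    """
    import gradio as gr  # imported lazily so this module loads without Gradio

    def on_transcribe(audio_path: str):
        text, duration_s = transcribe_file(audio_path)
        return format_result(text, duration_s)

    # gr.Interface adds the submit ("Transcribe") button itself.
    return gr.Interface(
        fn=on_transcribe,
        inputs=gr.Audio(sources=["microphone", "upload"], type="filepath"),
        outputs=[gr.Textbox(label="Transcription"), gr.Number(label="Duration (s)")],
    )
```

To serve it on the demo port used elsewhere in this document, call `build_demo(my_callback).launch(server_port=7860)`.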

#### REST API (`api/main.py`)
```
┌─────────────────────────────────┐
│        FastAPI Server           │
├─────────────────────────────────┤
│  Endpoints:                     │
│  ┌──────────────────────────┐   │
│  │ POST /transcribe         │   │
│  │  - Upload audio file     │   │
│  │  - Returns JSON          │   │
│  └──────────────────────────┘   │
│  ┌──────────────────────────┐   │
│  │ GET /health              │   │
│  │  - Model status          │   │
│  │  - Device info           │   │
│  └──────────────────────────┘   │
│  ┌──────────────────────────┐   │
│  │ GET /docs                │   │
│  │  - Swagger UI            │   │
│  │  - API documentation     │   │
│  └──────────────────────────┘   │
└─────────────────────────────────┘
```
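
A minimal sketch of the three endpoints, assuming a hypothetical `transcribe_bytes` callback that wraps the model. The `/health` field names are illustrative and may not match what `api/main.py` returns.

```python
from typing import Callable, Dict


def health_payload(model_loaded: bool, device: str) -> Dict[str, object]:
    """Body for GET /health: model status and device info (field names are assumptions)."""
    return {
        "status": "ok" if model_loaded else "loading",
        "model_loaded": model_loaded,
        "device": device,
    }


def create_app(transcribe_bytes: Callable[[bytes], Dict[str, object]], device: str = "cpu"):
    """Create a FastAPI app exposing /transcribe and /health.

    `transcribe_bytes` is a hypothetical callback: raw audio bytes in,
    JSON-serialisable dict out.
    """
    # Lazy import; file uploads also require the `python-multipart` package.
    from fastapi import FastAPI, File, UploadFile

    app = FastAPI(title="Whisper German ASR")  # Swagger UI is served at /docs by default

    @app.get("/health")
    def health():
        return health_payload(model_loaded=True, device=device)

    @app.post("/transcribe")
    def transcribe(file: UploadFile = File(...)):
        # A plain `def` handler runs in FastAPI's thread pool, which keeps
        # blocking inference off the event loop.
        return transcribe_bytes(file.file.read())

    return app
```

With Uvicorn installed, an app built this way can be served on port 8000 via `uvicorn.run(create_app(my_callback), port=8000)`.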

### 2. Processing Pipeline

```
Audio Input
    │
    ▼
┌─────────────────┐
│ Audio Loading   │  librosa.load()
│ - Load file     │  sr=16000, mono=True
│ - Resample      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Preprocessing   │  WhisperProcessor
│ - Mel spectro   │  80 channels
│ - Normalization │  3000 frames (30s)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Model Inference │  WhisperForConditionalGeneration
│ - Encoder       │  12 layers
│ - Decoder       │  12 layers
│ - Generation    │  Beam search (size=5)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Decoding        │  processor.batch_decode()
│ - Token→Text    │  skip_special_tokens=True
│ - Formatting    │
└────────┬────────┘
         │
         ▼
German Transcription
```
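
The four stages map onto a handful of `librosa` and `transformers` calls. The sketch below is one way to write them, assuming `model_dir` is a placeholder for a local directory or Hub ID of the model. It reloads the model on every call for brevity; a real service would load it once at startup.

```python
SAMPLE_RATE = 16_000  # Whisper expects 16 kHz mono input


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Audio length in seconds for a mono signal."""
    return num_samples / sample_rate


def transcribe_file(path: str, model_dir: str, num_beams: int = 5) -> dict:
    """Run the four pipeline stages on one audio file.

    `model_dir` is a placeholder for the model location (assumption).
    """
    # Heavy dependencies are imported lazily.
    import librosa
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_dir)
    model = WhisperForConditionalGeneration.from_pretrained(model_dir)
    model.eval()

    # 1. Load, resample to 16 kHz, and downmix to mono.
    audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)

    # 2. Log-mel features, padded or truncated to 30 s (80 x 3000).
    inputs = processor(audio, sampling_rate=SAMPLE_RATE, return_tensors="pt")

    # 3. Encoder-decoder generation with beam search.
    with torch.no_grad():
        token_ids = model.generate(inputs.input_features, num_beams=num_beams)

    # 4. Token IDs to text, dropping special tokens.
    text = processor.batch_decode(token_ids, skip_special_tokens=True)[0]
    return {
        "transcription": text.strip(),
        "duration": round(duration_seconds(len(audio)), 2),
        "language": "de",
    }
```

Depending on how the checkpoint's generation config was saved, the language and task may also need to be set explicitly at generation time.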

### 3. Model Architecture

```
┌─────────────────────────────────────────────────┐
│         Whisper-small Architecture              │
├─────────────────────────────────────────────────┤
│                                                 │
│  Input: 80-channel Mel Spectrogram             │
│         (80 x 3000 = 30 seconds)                │
│                                                 │
│  ┌───────────────────────────────────────┐     │
│  │          Encoder (12 layers)          │     │
│  │  ┌─────────────────────────────────┐  │     │
│  │  │  Conv1D → Conv1D → Positional   │  │     │
│  │  │  Embedding → Transformer Blocks │  │     │
│  │  └─────────────────────────────────┘  │     │
│  │  Output: 768-dim embeddings           │     │
│  └──────────────────┬────────────────────┘     │
│                     │                           │
│                     ▼                           │
│  ┌───────────────────────────────────────┐     │
│  │          Decoder (12 layers)          │     │
│  │  ┌─────────────────────────────────┐  │     │
│  │  │  Token Embedding → Positional   │  │     │
│  │  │  Embedding → Transformer Blocks │  │     │
│  │  │  → Cross-Attention → Output     │  │     │
│  │  └─────────────────────────────────┘  │     │
│  │  Output: Token probabilities          │     │
│  └───────────────────────────────────────┘     │
│                                                 │
│  Parameters: 242M                               │
│  Language: German (de)                          │
│  Task: Transcribe                               │
└─────────────────────────────────────────────────┘
```
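
The 80 x 3000 input shape follows from Whisper's fixed front-end settings: 16 kHz audio, a 10 ms hop (160 samples), and a 30-second window. The second convolution in the encoder uses stride 2, so the transformer blocks see half as many time steps. A short arithmetic sketch (the function names are illustrative):

```python
SAMPLE_RATE = 16_000      # Hz
HOP_LENGTH = 160          # samples between spectrogram frames (10 ms)
CHUNK_SECONDS = 30        # fixed input window
N_MELS = 80               # mel channels
ENCODER_CONV_STRIDE = 2   # the second Conv1D halves the time axis


def mel_frames(seconds: float) -> int:
    """Number of spectrogram frames for an input of the given length."""
    return int(seconds * SAMPLE_RATE) // HOP_LENGTH


def encoder_positions(frames: int) -> int:
    """Sequence length seen by the encoder's transformer blocks."""
    return frames // ENCODER_CONV_STRIDE


frames = mel_frames(CHUNK_SECONDS)
print((N_MELS, frames))           # (80, 3000)
print(encoder_positions(frames))  # 1500
```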

---

## Deployment Architectures

### Local Development
```
┌──────────────────────────────┐
│     Developer Machine        │
│  ┌────────────────────────┐  │
│  │  Python Environment    │  │
│  │  - FastAPI/Gradio      │  │
│  │  - Whisper Model       │  │
│  │  - Dependencies        │  │
│  └────────────────────────┘  │
│  Ports: 8000 (API)           │
│         7860 (Demo)          │
└──────────────────────────────┘
```

### Docker Deployment
```
┌─────────────────────────────────────┐
│         Docker Host                 │
│  ┌───────────────────────────────┐  │
│  │  Container: whisper-api       │  │
│  │  - FastAPI                    │  │
│  │  - Port 8000                  │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │  Container: whisper-demo      │  │
│  │  - Gradio                     │  │
│  │  - Port 7860                  │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │  Volume: whisper_test_tuned   │  │
│  │  - Shared model files         │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘
```

### Cloud Deployment (AWS)
```
┌─────────────────────────────────────────────────┐
│                  AWS Cloud                      │
│  ┌───────────────────────────────────────────┐  │
│  │  Application Load Balancer                │  │
│  │  - HTTPS (443)                            │  │
│  │  - Health checks                          │  │
│  └──────────────┬────────────────────────────┘  │
│                 │                                │
│                 ▼                                │
│  ┌───────────────────────────────────────────┐  │
│  │  ECS Fargate Service                      │  │
│  │  ┌─────────────────────────────────────┐  │  │
│  │  │  Task 1: whisper-asr                │  │  │
│  │  │  - 1 vCPU, 2GB RAM                  │  │  │
│  │  │  - Container: API                   │  │  │
│  │  └─────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────┐  │  │
│  │  │  Task 2: whisper-asr                │  │  │
│  │  │  - Auto-scaling (2-10 tasks)        │  │  │
│  │  └─────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────┐  │
│  │  S3 Bucket                                │  │
│  │  - Model files                            │  │
│  │  - Static assets                          │  │
│  └───────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────┐  │
│  │  CloudWatch                               │  │
│  │  - Logs                                   │  │
│  │  - Metrics                                │  │
│  │  - Alarms                                 │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘
```

### HuggingFace Spaces
```
┌─────────────────────────────────────┐
│      HuggingFace Spaces             │
│  ┌───────────────────────────────┐  │
│  │  Gradio Space                 │  │
│  │  - app.py                     │  │
│  │  - requirements.txt           │  │
│  │  - README.md                  │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │  Model from HF Hub            │  │
│  │  - YOUR_USER/whisper-de       │  │
│  │  - Auto-loaded                │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │  Hardware                     │  │
│  │  - CPU Basic (free)           │  │
│  │  - GPU T4 (paid)              │  │
│  └───────────────────────────────┘  │
│  Public URL: https://hf.co/spaces/  │
│              YOUR_USER/whisper-de   │
└─────────────────────────────────────┘
```

---

## Data Flow

### Transcription Request Flow
```
1. User uploads audio
        │
        ▼
2. API receives file
        │
        ▼
3. Load audio with librosa
   - Decode format (mp3/wav/etc)
   - Resample to 16kHz
   - Convert to mono
        │
        ▼
4. WhisperProcessor
   - Compute mel spectrogram
   - Normalize features
   - Pad/truncate to 30s
        │
        ▼
5. Model.generate()
   - Encoder: audio → embeddings
   - Decoder: embeddings → tokens
   - Beam search for best sequence
        │
        ▼
6. processor.batch_decode()
   - Tokens → text
   - Remove special tokens
   - Format output
        │
        ▼
7. Return JSON response
   {
     "transcription": "...",
     "duration": 2.5,
     "language": "de"
   }
```
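
Step 4 pads or truncates every input to a fixed 30-second window. The processor handles this internally on NumPy arrays; the plain-Python version below only illustrates the behaviour, and `pad_or_trim` is a name chosen for this sketch.

```python
from typing import List

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 30
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples


def pad_or_trim(samples: List[float], length: int = CHUNK_SAMPLES) -> List[float]:
    """Return exactly `length` samples: truncate long input, zero-pad short input."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))
```

One consequence is that audio beyond the first 30 seconds is dropped unless the caller splits long recordings into chunks before this step.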

---

## Technology Stack

```
┌─────────────────────────────────────┐
│         Frontend/Interface          │
├─────────────────────────────────────┤
│  - Gradio 4.0+                      │
│  - HTML/CSS/JavaScript              │
│  - Swagger UI (FastAPI)             │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│           Backend/API               │
├─────────────────────────────────────┤
│  - FastAPI 0.104+                   │
│  - Uvicorn (ASGI server)            │
│  - Pydantic (validation)            │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│          ML Framework               │
├─────────────────────────────────────┤
│  - PyTorch 2.2+                     │
│  - Transformers 4.42+               │
│  - Datasets 2.19+                   │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│       Audio Processing              │
├─────────────────────────────────────┤
│  - Librosa 0.10+                    │
│  - SoundFile 0.12+                  │
│  - FFmpeg (system)                  │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│         Evaluation                  │
├─────────────────────────────────────┤
│  - jiwer 4.0+ (WER/CER)             │
│  - NumPy 1.24+                      │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│      Deployment/DevOps              │
├─────────────────────────────────────┤
│  - Docker                           │
│  - Docker Compose                   │
│  - GitHub Actions                   │
└─────────────────────────────────────┘
```
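
For readers unfamiliar with the evaluation metrics, the sketch below shows what WER and CER measure: edit distance between reference and hypothesis, divided by reference length, at word or character level. It is a plain-Python illustration and not jiwer's implementation, which also offers text normalisation that this sketch omits.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (two-row DP)."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("das ist ein test", "das ist test"))  # 0.25 (one deleted word out of four)
```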

---

## Performance Characteristics

### Latency
```
Component                Time
─────────────────────────────────
Audio Loading           50-100ms
Feature Extraction      100-200ms
Model Inference (CPU)   1-3s
Model Inference (GPU)   200-500ms
Post-processing         10-50ms
─────────────────────────────────
Total (CPU)             1.2-3.4s
Total (GPU)             360-850ms
```
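
The totals are the sum of the per-stage ranges, with the CPU figures rounded to one decimal place in the table. The sketch below reproduces that arithmetic from the values above; the dictionary keys are illustrative names.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # (min_ms, max_ms)

# Stages that cost the same on either device.
SHARED_STAGES: Dict[str, Range] = {
    "audio_loading": (50, 100),
    "feature_extraction": (100, 200),
    "post_processing": (10, 50),
}
# Model inference depends on the device.
INFERENCE: Dict[str, Range] = {"cpu": (1000, 3000), "gpu": (200, 500)}


def total_latency(device: str) -> Range:
    """Sum best-case and worst-case stage latencies for one device."""
    stages = list(SHARED_STAGES.values()) + [INFERENCE[device]]
    return (sum(lo for lo, _ in stages), sum(hi for _, hi in stages))


print(total_latency("cpu"))  # (1160, 3350)
print(total_latency("gpu"))  # (360, 850)
```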

### Throughput
```
Hardware        Samples/sec
────────────────────────────
CPU (4 cores)   0.3-0.5
GPU (T4)        2-5
GPU (A100)      10-20
```

### Resource Usage
```
Component       CPU    Memory   GPU Memory
─────────────────────────────────────────
Model Loading   -      1.5GB    1GB
Inference       100%   2GB      1.5GB
API Server      10%    200MB    -
Gradio Demo     5%     100MB    -
```

---

## Security Architecture

```
┌─────────────────────────────────────┐
│         Security Layers             │
├─────────────────────────────────────┤
│  1. Network Layer                   │
│     - HTTPS/TLS                     │
│     - CORS policies                 │
│     - Rate limiting                 │
│                                     │
│  2. Application Layer               │
│     - Input validation              │
│     - File type checking            │
│     - Size limits                   │
│     - Error handling                │
│                                     │
│  3. Authentication (optional)       │
│     - API keys                      │
│     - OAuth2                        │
│     - JWT tokens                    │
│                                     │
│  4. Infrastructure                  │
│     - Container isolation           │
│     - Resource limits               │
│     - Secrets management            │
└─────────────────────────────────────┘
```
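
The application-layer checks (file type and size limits) could look like the sketch below. The allowed extensions and the 25 MB cap are hypothetical policy values, not settings taken from this project, and `validate_upload` is a name chosen for illustration.

```python
from pathlib import Path

# Hypothetical policy values; tune these for the actual deployment.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MB


class UploadRejected(ValueError):
    """Raised when an upload fails validation."""


def validate_upload(filename: str, size_bytes: int) -> None:
    """Reject uploads with an unexpected extension or an out-of-range size."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise UploadRejected(f"unsupported file type: {ext or '(none)'}")
    if size_bytes <= 0:
        raise UploadRejected("empty upload")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise UploadRejected(f"file exceeds {MAX_UPLOAD_BYTES} bytes")
```

An extension check alone is easy to bypass, so a hardened service would also confirm that the audio actually decodes before passing it to the model.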

---

## Monitoring & Observability

```
┌─────────────────────────────────────┐
│         Monitoring Stack            │
├─────────────────────────────────────┤
│  Logs                               │
│  - Application logs (Python)        │
│  - Access logs (Uvicorn)            │
│  - Error logs                       │
│                                     │
│  Metrics                            │
│  - Request count                    │
│  - Latency (p50, p95, p99)          │
│  - Error rate                       │
│  - Model inference time             │
│  - Resource usage (CPU/RAM/GPU)     │
│                                     │
│  Health Checks                      │
│  - /health endpoint                 │
│  - Model loaded status              │
│  - Device availability              │
│                                     │
│  Tools                              │
│  - TensorBoard (training)           │
│  - CloudWatch/Stackdriver (cloud)   │
│  - Prometheus + Grafana (optional)  │
└─────────────────────────────────────┘
```
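
The latency percentiles listed above can be computed from recorded request timings. The sketch below uses the nearest-rank definition; monitoring tools such as Prometheus estimate percentiles from histograms instead, so their numbers may differ slightly. The function names are illustrative.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples recorded")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """p50/p95/p99 over a window of request latencies in milliseconds."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


print(latency_summary(list(range(1, 101))))  # {'p50': 50, 'p95': 95, 'p99': 99}
```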

---

This architecture provides:
- ✅ Modularity and separation of concerns
- ✅ Scalability (horizontal and vertical)
- ✅ Multiple deployment options
- ✅ Production-ready monitoring
- ✅ Security best practices
- ✅ Potential for high availability