# System Architecture

## Overview

Whisper German ASR is a modular, production-ready speech recognition system with multiple deployment options.

---

## High-Level Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                       User Interfaces                       │
├─────────────────────────────────────────────────────────────┤
│  Web Browser  │  Mobile App  │  CLI  │  API Clients         │
└───────┬───────┴───────┬──────┴───┬───┴──────┬───────────────┘
        │               │          │          │
        ▼               ▼          ▼          ▼
 ┌─────────────┐  ┌──────────┐  ┌─────┐  ┌──────────┐
 │   Gradio    │  │  Custom  │  │ CLI │  │ REST API │
 │    Demo     │  │    UI    │  │     │  │  Client  │
 └──────┬──────┘  └─────┬────┘  └──┬──┘  └────┬─────┘
        │               │          │          │
        └───────────────┼──────────┴──────────┘
                        │
                        ▼
         ┌─────────────────────────────┐
         │     FastAPI Application     │
         │  ┌───────────────────────┐  │
         │  │ /transcribe endpoint  │  │
         │  │ /health endpoint      │  │
         │  │ /docs endpoint        │  │
         │  └───────────────────────┘  │
         └──────────────┬──────────────┘
                        │
                        ▼
         ┌─────────────────────────────┐
         │   Whisper Model Pipeline    │
         │  ┌───────────────────────┐  │
         │  │ 1. Audio Processing   │  │
         │  │    - Load audio       │  │
         │  │    - Resample 16kHz   │  │
         │  │    - Convert to mono  │  │
         │  ├───────────────────────┤  │
         │  │ 2. Feature Extraction │  │
         │  │    - Mel spectrogram  │  │
         │  │    - Normalization    │  │
         │  ├───────────────────────┤  │
         │  │ 3. Model Inference    │  │
         │  │    - Encoder          │  │
         │  │    - Decoder          │  │
         │  │    - Beam search      │  │
         │  ├───────────────────────┤  │
         │  │ 4. Post-processing    │  │
         │  │    - Token decoding   │  │
         │  │    - Text formatting  │  │
         │  └───────────────────────┘  │
         └──────────────┬──────────────┘
                        │
                        ▼
         ┌─────────────────────────────┐
         │       Response/Output       │
         │    German Transcription     │
         └─────────────────────────────┘
```
---

## Component Details

### 1. User Interfaces

#### Gradio Demo (`demo/app.py`)

```
┌─────────────────────────────────┐
│        Gradio Interface         │
├─────────────────────────────────┤
│  ┌──────────────────────────┐   │
│  │  Audio Input             │   │
│  │  - Microphone            │   │
│  │  - File Upload           │   │
│  └──────────────────────────┘   │
│  ┌──────────────────────────┐   │
│  │  Transcribe Button       │   │
│  └──────────────────────────┘   │
│  ┌──────────────────────────┐   │
│  │  Output Display          │   │
│  │  - Transcription         │   │
│  │  - Duration              │   │
│  └──────────────────────────┘   │
└─────────────────────────────────┘
```
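The layout above could be wired up roughly as follows with Gradio 4.x. This is only a sketch: `build_demo`, `format_duration`, and the `transcribe_fn` callable (assumed to return the transcription and the audio duration in seconds) are hypothetical names and are not taken from `demo/app.py`.

```python
def format_duration(seconds: float) -> str:
    """Render a duration for the output panel, e.g. 2.5 -> '2.5 s'."""
    return f"{seconds:.1f} s"


def build_demo(transcribe_fn):
    """Build the Gradio interface; gradio is imported lazily (assumes Gradio 4.x)."""
    import gradio as gr

    def run(audio_path):
        # Gradio passes None when the user submits without any audio.
        if audio_path is None:
            return "", ""
        text, duration = transcribe_fn(audio_path)
        return text, format_duration(duration)

    return gr.Interface(
        fn=run,
        # Microphone and file upload, delivered to `run` as a file path.
        inputs=gr.Audio(sources=["microphone", "upload"], type="filepath",
                        label="Audio Input"),
        outputs=[gr.Textbox(label="Transcription"), gr.Textbox(label="Duration")],
        title="Whisper German ASR",
    )
```

Calling `build_demo(my_transcribe).launch(server_port=7860)` would serve the demo on the port listed under Local Development.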
#### REST API (`api/main.py`)

```
┌─────────────────────────────────┐
│         FastAPI Server          │
├─────────────────────────────────┤
│  Endpoints:                     │
│  ┌──────────────────────────┐   │
│  │  POST /transcribe        │   │
│  │  - Upload audio file     │   │
│  │  - Returns JSON          │   │
│  └──────────────────────────┘   │
│  ┌──────────────────────────┐   │
│  │  GET /health             │   │
│  │  - Model status          │   │
│  │  - Device info           │   │
│  └──────────────────────────┘   │
│  ┌──────────────────────────┐   │
│  │  GET /docs               │   │
│  │  - Swagger UI            │   │
│  │  - API documentation     │   │
│  └──────────────────────────┘   │
└─────────────────────────────────┘
```
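As an illustration of the three endpoints, here is a minimal FastAPI sketch. It is not the contents of `api/main.py`: `create_app`, `health_status`, the `"degraded"` status label, and the `transcribe_fn` callable (assumed to return the text and the audio duration in seconds) are all hypothetical. File uploads in FastAPI also require the `python-multipart` package.

```python
from dataclasses import asdict, dataclass


@dataclass
class HealthStatus:
    """Payload for GET /health: model status and device info."""
    status: str
    model_loaded: bool
    device: str


def health_status(model_loaded: bool, device: str) -> dict:
    """Build the /health payload ('degraded' is an assumed label)."""
    status = "healthy" if model_loaded else "degraded"
    return asdict(HealthStatus(status, model_loaded, device))


def create_app(transcribe_fn, device: str = "cpu"):
    """App factory; FastAPI is imported lazily so the helpers stay dependency-free."""
    import os
    import tempfile

    from fastapi import FastAPI, File, UploadFile

    # FastAPI serves the Swagger UI at /docs by default.
    app = FastAPI(title="Whisper German ASR")

    @app.get("/health")
    def health():
        return health_status(model_loaded=transcribe_fn is not None, device=device)

    @app.post("/transcribe")
    async def transcribe(file: UploadFile = File(...)):
        # Persist the upload so a path-based audio loader can read it.
        suffix = os.path.splitext(file.filename or "")[1]
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            tmp.write(await file.read())
        try:
            text, duration = transcribe_fn(tmp.name)
        finally:
            os.remove(tmp.name)
        return {"transcription": text, "duration": duration, "language": "de"}

    return app
```

With Uvicorn installed, `uvicorn.run(create_app(my_transcribe), port=8000)` would expose the API on the port used elsewhere in this document.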
### 2. Processing Pipeline

```
   Audio Input
         │
         ▼
┌─────────────────┐
│ Audio Loading   │  librosa.load()
│ - Load file     │  sr=16000, mono=True
│ - Resample      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Preprocessing   │  WhisperProcessor
│ - Mel spectro   │  80 channels
│ - Normalization │  3000 frames (30s)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Model Inference │  WhisperForConditionalGeneration
│ - Encoder       │  12 layers
│ - Decoder       │  12 layers
│ - Generation    │  Beam search (size=5)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Decoding        │  processor.batch_decode()
│ - Token→Text    │  skip_special_tokens=True
│ - Formatting    │
└────────┬────────┘
         │
         ▼
German Transcription
```
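The four stages map onto a handful of library calls. The sketch below assumes the fine-tuned checkpoint lives in a local directory named after the `whisper_test_tuned` volume from the Docker section, and that the installed Transformers version accepts `language` and `task` as `generate()` arguments (older checkpoints may need an updated generation config). The function name `transcribe` is illustrative.

```python
SAMPLE_RATE = 16_000  # Hz, the rate the audio is resampled to
NUM_BEAMS = 5         # beam search size from the diagram above


def transcribe(audio_path: str, model_dir: str = "./whisper_test_tuned") -> str:
    """Run load -> features -> generate -> decode for a single audio file."""
    # Heavy dependencies are imported lazily; model_dir is an assumed location.
    import librosa
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_dir)
    model = WhisperForConditionalGeneration.from_pretrained(model_dir)
    model.eval()

    # 1. Audio loading: decode, resample to 16 kHz, downmix to mono.
    audio, _ = librosa.load(audio_path, sr=SAMPLE_RATE, mono=True)

    # 2. Preprocessing: log-mel features padded or truncated to 30 s.
    inputs = processor(audio, sampling_rate=SAMPLE_RATE, return_tensors="pt")

    # 3. Inference: encoder plus decoder with beam search.
    with torch.no_grad():
        token_ids = model.generate(
            inputs.input_features,
            num_beams=NUM_BEAMS,
            language="de",
            task="transcribe",
        )

    # 4. Decoding: token ids to text, dropping special tokens.
    return processor.batch_decode(token_ids, skip_special_tokens=True)[0].strip()
```

In a server, the processor and model would normally be loaded once at startup rather than on every call; they are loaded inside the function here only to keep the sketch self-contained.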
### 3. Model Architecture

```
┌─────────────────────────────────────────────────┐
│           Whisper-small Architecture            │
├─────────────────────────────────────────────────┤
│                                                 │
│  Input: 80-channel Mel Spectrogram              │
│         (80 x 3000 frames = 30 seconds)         │
│                                                 │
│   ┌───────────────────────────────────────┐     │
│   │  Encoder (12 layers)                  │     │
│   │  ┌─────────────────────────────────┐  │     │
│   │  │ Conv1D → Conv1D → Positional    │  │     │
│   │  │ Embedding → Transformer Blocks  │  │     │
│   │  └─────────────────────────────────┘  │     │
│   │  Output: 768-dim embeddings           │     │
│   └───────────────────┬───────────────────┘     │
│                       │                         │
│                       ▼                         │
│   ┌───────────────────────────────────────┐     │
│   │  Decoder (12 layers)                  │     │
│   │  ┌─────────────────────────────────┐  │     │
│   │  │ Token Embedding → Positional    │  │     │
│   │  │ Embedding → Transformer Blocks  │  │     │
│   │  │ → Cross-Attention → Output      │  │     │
│   │  └─────────────────────────────────┘  │     │
│   │  Output: Token probabilities          │     │
│   └───────────────────────────────────────┘     │
│                                                 │
│  Parameters: 242M                               │
│  Language: German (de)                          │
│  Task: Transcribe                               │
└─────────────────────────────────────────────────┘
```
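The 80 x 3000 input shape follows from simple arithmetic on the audio parameters. The worked example below uses the standard Whisper feature-extractor settings (a hop of 160 samples, that is 10 ms at 16 kHz, and a stride-2 second convolution); these values come from the public Whisper configuration rather than from this document.

```python
SAMPLE_RATE = 16_000   # Hz
CHUNK_SECONDS = 30     # Whisper processes fixed 30 s windows
HOP_LENGTH = 160       # samples between mel frames (10 ms)
N_MELS = 80            # mel channels
CONV_STRIDE = 2        # the second Conv1D halves the time axis


def encoder_shapes() -> dict:
    """Derive the mel input shape and the encoder sequence length."""
    n_samples = SAMPLE_RATE * CHUNK_SECONDS   # 480000 samples per window
    n_frames = n_samples // HOP_LENGTH        # 3000 mel frames
    n_positions = n_frames // CONV_STRIDE     # 1500 encoder positions
    return {
        "samples": n_samples,
        "mel_input": (N_MELS, n_frames),
        "encoder_positions": n_positions,
    }


print(encoder_shapes())
```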
---

## Deployment Architectures

### Local Development

```
┌────────────────────────────┐
│     Developer Machine      │
│  ┌──────────────────────┐  │
│  │ Python Environment   │  │
│  │ - FastAPI/Gradio     │  │
│  │ - Whisper Model      │  │
│  │ - Dependencies       │  │
│  └──────────────────────┘  │
│  Ports: 8000 (API)         │
│         7860 (Demo)        │
└────────────────────────────┘
```
### Docker Deployment

```
┌───────────────────────────────────┐
│            Docker Host            │
│  ┌─────────────────────────────┐  │
│  │  Container: whisper-api     │  │
│  │  - FastAPI                  │  │
│  │  - Port 8000                │  │
│  └─────────────────────────────┘  │
│  ┌─────────────────────────────┐  │
│  │  Container: whisper-demo    │  │
│  │  - Gradio                   │  │
│  │  - Port 7860                │  │
│  └─────────────────────────────┘  │
│  ┌─────────────────────────────┐  │
│  │  Volume: whisper_test_tuned │  │
│  │  - Shared model files       │  │
│  └─────────────────────────────┘  │
└───────────────────────────────────┘
```
### Cloud Deployment (AWS)

```
┌─────────────────────────────────────────────────┐
│                    AWS Cloud                    │
│  ┌───────────────────────────────────────────┐  │
│  │  Application Load Balancer                │  │
│  │  - HTTPS (443)                            │  │
│  │  - Health checks                          │  │
│  └─────────────────────┬─────────────────────┘  │
│                        │                        │
│                        ▼                        │
│  ┌───────────────────────────────────────────┐  │
│  │  ECS Fargate Service                      │  │
│  │  ┌─────────────────────────────────────┐  │  │
│  │  │  Task 1: whisper-asr                │  │  │
│  │  │  - 1 vCPU, 2GB RAM                  │  │  │
│  │  │  - Container: API                   │  │  │
│  │  └─────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────┐  │  │
│  │  │  Task 2: whisper-asr                │  │  │
│  │  │  - Auto-scaling (2-10 tasks)        │  │  │
│  │  └─────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────┐  │
│  │  S3 Bucket                                │  │
│  │  - Model files                            │  │
│  │  - Static assets                          │  │
│  └───────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────┐  │
│  │  CloudWatch                               │  │
│  │  - Logs                                   │  │
│  │  - Metrics                                │  │
│  │  - Alarms                                 │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘
```
### HuggingFace Spaces

```
┌───────────────────────────────────┐
│        HuggingFace Spaces         │
│  ┌─────────────────────────────┐  │
│  │  Gradio Space               │  │
│  │  - app.py                   │  │
│  │  - requirements.txt         │  │
│  │  - README.md                │  │
│  └─────────────────────────────┘  │
│  ┌─────────────────────────────┐  │
│  │  Model from HF Hub          │  │
│  │  - YOUR_USER/whisper-de     │  │
│  │  - Auto-loaded              │  │
│  └─────────────────────────────┘  │
│  ┌─────────────────────────────┐  │
│  │  Hardware                   │  │
│  │  - CPU Basic (free)         │  │
│  │  - GPU T4 (paid)            │  │
│  └─────────────────────────────┘  │
│ Public URL: https://hf.co/spaces/ │
│             YOUR_USER/whisper-de  │
└───────────────────────────────────┘
```
---

## Data Flow

### Transcription Request Flow

```
1. User uploads audio
   │
   ▼
2. API receives file
   │
   ▼
3. Load audio with librosa
   - Decode format (mp3/wav/etc)
   - Resample to 16kHz
   - Convert to mono
   │
   ▼
4. WhisperProcessor
   - Compute mel spectrogram
   - Normalize features
   - Pad/truncate to 30s
   │
   ▼
5. Model.generate()
   - Encoder: audio → embeddings
   - Decoder: embeddings → tokens
   - Beam search for best sequence
   │
   ▼
6. Processor.decode()
   - Tokens → text
   - Remove special tokens
   - Format output
   │
   ▼
7. Return JSON response
   {
     "transcription": "...",
     "duration": 2.5,
     "language": "de"
   }
```
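Step 7 can be sketched as a small response builder. The example assumes that `duration` is the length of the input audio in seconds, which the document does not state explicitly; `TranscriptionResponse` and `build_response` are hypothetical names.

```python
import json
from dataclasses import asdict, dataclass

SAMPLE_RATE = 16_000  # Hz, after resampling in step 3


@dataclass
class TranscriptionResponse:
    transcription: str
    duration: float      # assumed to mean audio length in seconds
    language: str = "de"


def build_response(text: str, n_samples: int, sample_rate: int = SAMPLE_RATE) -> str:
    """Serialize the final JSON body from the decoded text and sample count."""
    duration = round(n_samples / sample_rate, 2)
    payload = TranscriptionResponse(transcription=text.strip(), duration=duration)
    # ensure_ascii=False keeps German umlauts readable in the output.
    return json.dumps(asdict(payload), ensure_ascii=False)


# 40000 samples at 16 kHz correspond to 2.5 seconds of audio.
print(build_response(" Guten Morgen ", 40_000))
```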
---

## Technology Stack

```
┌───────────────────────────────────┐
│        Frontend/Interface         │
├───────────────────────────────────┤
│  - Gradio 4.0+                    │
│  - HTML/CSS/JavaScript            │
│  - Swagger UI (FastAPI)           │
└───────────────────────────────────┘
┌───────────────────────────────────┐
│            Backend/API            │
├───────────────────────────────────┤
│  - FastAPI 0.104+                 │
│  - Uvicorn (ASGI server)          │
│  - Pydantic (validation)          │
└───────────────────────────────────┘
┌───────────────────────────────────┐
│           ML Framework            │
├───────────────────────────────────┤
│  - PyTorch 2.2+                   │
│  - Transformers 4.42+             │
│  - Datasets 2.19+                 │
└───────────────────────────────────┘
┌───────────────────────────────────┐
│         Audio Processing          │
├───────────────────────────────────┤
│  - Librosa 0.10+                  │
│  - SoundFile 0.12+                │
│  - FFmpeg (system)                │
└───────────────────────────────────┘
┌───────────────────────────────────┐
│            Evaluation             │
├───────────────────────────────────┤
│  - jiwer 4.0+ (WER/CER)           │
│  - NumPy 1.24+                    │
└───────────────────────────────────┘
┌───────────────────────────────────┐
│         Deployment/DevOps         │
├───────────────────────────────────┤
│  - Docker                         │
│  - Docker Compose                 │
│  - GitHub Actions                 │
└───────────────────────────────────┘
```
---

## Performance Characteristics

### Latency

```
Component                 Time
───────────────────────────────────
Audio Loading             50-100ms
Feature Extraction        100-200ms
Model Inference (CPU)     1-3s
Model Inference (GPU)     200-500ms
Post-processing           10-50ms
───────────────────────────────────
Total (CPU)               1.2-3.4s
Total (GPU)               360-850ms
```
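The totals are the sums of the per-stage ranges, which the short check below reproduces. The stage names used as dictionary keys are illustrative; the numbers are taken from the table.

```python
# (min_ms, max_ms) per stage, taken from the latency table.
SHARED_STAGES_MS = {
    "audio_loading": (50, 100),
    "feature_extraction": (100, 200),
    "post_processing": (10, 50),
}
INFERENCE_MS = {"cpu": (1000, 3000), "gpu": (200, 500)}


def total_latency_ms(device: str) -> tuple:
    """Sum the best-case and worst-case stage times for one device."""
    low = sum(lo for lo, _ in SHARED_STAGES_MS.values()) + INFERENCE_MS[device][0]
    high = sum(hi for _, hi in SHARED_STAGES_MS.values()) + INFERENCE_MS[device][1]
    return low, high


print(total_latency_ms("cpu"))  # (1160, 3350), i.e. roughly 1.2-3.4 s
print(total_latency_ms("gpu"))  # (360, 850)
```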
### Throughput

```
Hardware          Samples/sec
─────────────────────────────
CPU (4 cores)     0.3-0.5
GPU (T4)          2-5
GPU (A100)        10-20
```
### Resource Usage

```
Component        CPU     Memory    GPU Memory
─────────────────────────────────────────────
Model Loading    -       1.5GB     1GB
Inference        100%    2GB       1.5GB
API Server       10%     200MB     -
Gradio Demo      5%      100MB     -
```
---

## Security Architecture

```
┌───────────────────────────────────┐
│          Security Layers          │
├───────────────────────────────────┤
│  1. Network Layer                 │
│     - HTTPS/TLS                   │
│     - CORS policies               │
│     - Rate limiting               │
│                                   │
│  2. Application Layer             │
│     - Input validation            │
│     - File type checking          │
│     - Size limits                 │
│     - Error handling              │
│                                   │
│  3. Authentication (optional)     │
│     - API keys                    │
│     - OAuth2                      │
│     - JWT tokens                  │
│                                   │
│  4. Infrastructure                │
│     - Container isolation         │
│     - Resource limits             │
│     - Secrets management          │
└───────────────────────────────────┘
```
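The application-layer checks (file type and size limits) might look like the sketch below. The allowed extensions and the 25 MB cap are assumptions chosen for illustration, since the document does not specify them, and `upload_errors` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed policy values; the document does not state the real ones.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def upload_errors(filename: str, size_bytes: int) -> list:
    """Return a list of validation problems; an empty list means the upload is OK."""
    errors = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        errors.append(f"unsupported file type: {extension or 'none'}")
    if size_bytes <= 0:
        errors.append("file is empty")
    elif size_bytes > MAX_UPLOAD_BYTES:
        errors.append(f"file exceeds {MAX_UPLOAD_BYTES} bytes")
    return errors


print(upload_errors("clip.WAV", 1024))     # []
print(upload_errors("clip.exe", 0))        # two problems reported
```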
---

## Monitoring & Observability

```
┌───────────────────────────────────┐
│         Monitoring Stack          │
├───────────────────────────────────┤
│  Logs                             │
│  - Application logs (Python)      │
│  - Access logs (Uvicorn)          │
│  - Error logs                     │
│                                   │
│  Metrics                          │
│  - Request count                  │
│  - Latency (p50, p95, p99)        │
│  - Error rate                     │
│  - Model inference time           │
│  - Resource usage (CPU/RAM/GPU)   │
│                                   │
│  Health Checks                    │
│  - /health endpoint               │
│  - Model loaded status            │
│  - Device availability            │
│                                   │
│  Tools                            │
│  - TensorBoard (training)         │
│  - CloudWatch/Stackdriver (cloud) │
│  - Prometheus + Grafana (optional)│
└───────────────────────────────────┘
```
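The latency percentiles listed under Metrics can be computed from recorded request times with the standard library alone. This is a sketch under the assumption that latencies are collected in-process as a list of milliseconds; a production setup would more likely rely on Prometheus or CloudWatch histograms. `latency_percentiles` is a hypothetical helper.

```python
import statistics


def latency_percentiles(samples_ms: list) -> dict:
    """Return p50, p95 and p99 for a list of request latencies in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; cut point k (1-based) is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# 101 evenly spaced samples from 1 ms to 101 ms.
print(latency_percentiles(list(range(1, 102))))
```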
---

This architecture provides:

- ✅ Modularity and separation of concerns
- ✅ Scalability (horizontal and vertical)
- ✅ Multiple deployment options
- ✅ Production-ready monitoring
- ✅ Security best practices
- ✅ High availability potential