System Architecture
Overview
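Whisper German ASR is a modular German speech recognition system built around a Whisper model. It serves transcriptions through a Gradio demo and a FastAPI REST API, and can be deployed locally, with Docker, on AWS, or on HuggingFace Spaces.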
High-Level Architecture
┌─────────────────────────────────────────────────────────────┐
│                       User Interfaces                       │
├─────────────────────────────────────────────────────────────┤
│  Web Browser  │  Mobile App  │  CLI  │  API Clients         │
└────────┬──────┴────────┬─────┴────┬──┴────────┬─────────────┘
         │               │          │           │
         ▼               ▼          ▼           ▼
  ┌─────────────┐  ┌──────────┐  ┌─────┐  ┌──────────┐
  │   Gradio    │  │  Custom  │  │ CLI │  │ REST API │
  │    Demo     │  │    UI    │  │     │  │  Client  │
  └──────┬──────┘  └─────┬────┘  └──┬──┘  └─────┬────┘
         │               │          │           │
         └───────────────┴──┬───────┴───────────┘
                            │
                            ▼
             ┌─────────────────────────────┐
             │     FastAPI Application     │
             │  ┌───────────────────────┐  │
             │  │ /transcribe endpoint  │  │
             │  │ /health endpoint      │  │
             │  │ /docs endpoint        │  │
             │  └───────────────────────┘  │
             └──────────────┬──────────────┘
                            │
                            ▼
             ┌─────────────────────────────┐
             │   Whisper Model Pipeline    │
             │  ┌───────────────────────┐  │
             │  │ 1. Audio Processing   │  │
             │  │    - Load audio       │  │
             │  │    - Resample 16kHz   │  │
             │  │    - Convert to mono  │  │
             │  ├───────────────────────┤  │
             │  │ 2. Feature Extraction │  │
             │  │    - Mel spectrogram  │  │
             │  │    - Normalization    │  │
             │  ├───────────────────────┤  │
             │  │ 3. Model Inference    │  │
             │  │    - Encoder          │  │
             │  │    - Decoder          │  │
             │  │    - Beam search      │  │
             │  ├───────────────────────┤  │
             │  │ 4. Post-processing    │  │
             │  │    - Token decoding   │  │
             │  │    - Text formatting  │  │
             │  └───────────────────────┘  │
             └──────────────┬──────────────┘
                            │
                            ▼
             ┌─────────────────────────────┐
             │       Response/Output       │
             │    German Transcription     │
             └─────────────────────────────┘
Component Details
1. User Interfaces
Gradio Demo (demo/app.py)
┌────────────────────────────────┐
│        Gradio Interface        │
├────────────────────────────────┤
│  ┌──────────────────────────┐  │
│  │  Audio Input             │  │
│  │  - Microphone            │  │
│  │  - File Upload           │  │
│  └──────────────────────────┘  │
│  ┌──────────────────────────┐  │
│  │  Transcribe Button       │  │
│  └──────────────────────────┘  │
│  ┌──────────────────────────┐  │
│  │  Output Display          │  │
│  │  - Transcription         │  │
│  │  - Duration              │  │
│  └──────────────────────────┘  │
└────────────────────────────────┘
REST API (api/main.py)
┌────────────────────────────────┐
│         FastAPI Server         │
├────────────────────────────────┤
│  Endpoints:                    │
│  ┌──────────────────────────┐  │
│  │  POST /transcribe        │  │
│  │  - Upload audio file     │  │
│  │  - Returns JSON          │  │
│  └──────────────────────────┘  │
│  ┌──────────────────────────┐  │
│  │  GET /health             │  │
│  │  - Model status          │  │
│  │  - Device info           │  │
│  └──────────────────────────┘  │
│  ┌──────────────────────────┐  │
│  │  GET /docs               │  │
│  │  - Swagger UI            │  │
│  │  - API documentation     │  │
│  └──────────────────────────┘  │
└────────────────────────────────┘
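The endpoint layout above could be wired up roughly as follows. This is a minimal sketch, not the contents of api/main.py: the `create_app` factory, the `transcribe_fn` callback, and the exact response fields are assumptions made for illustration. FastAPI serves the Swagger UI at /docs on its own, so no handler is written for it, and file uploads additionally need the python-multipart package installed.

```python
from typing import Callable


def create_app(transcribe_fn: Callable[[bytes], dict], device: str = "cpu"):
    """Build a FastAPI app exposing /transcribe and /health (hypothetical layout)."""
    # Imported lazily so this module can be imported without FastAPI installed.
    from fastapi import FastAPI, File, HTTPException, UploadFile

    app = FastAPI(title="Whisper German ASR")

    @app.get("/health")
    def health():
        # Model status and device info, as listed in the diagram.
        return {
            "status": "ok",
            "model_loaded": transcribe_fn is not None,
            "device": device,
        }

    @app.post("/transcribe")
    async def transcribe(file: UploadFile = File(...)):
        data = await file.read()
        if not data:
            raise HTTPException(status_code=400, detail="Empty upload")
        # transcribe_fn is assumed to return a JSON-serializable dict.
        return transcribe_fn(data)

    return app
```

With a real transcription callback, the app would typically be started with Uvicorn on port 8000, for example `uvicorn.run(create_app(my_transcribe), port=8000)`, where `my_transcribe` is a placeholder name.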
2. Processing Pipeline
     Audio Input
          │
          ▼
┌───────────────────┐
│ Audio Loading     │  librosa.load()
│ - Load file       │  sr=16000, mono=True
│ - Resample        │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Preprocessing     │  WhisperProcessor
│ - Mel spectrogram │  80 channels
│ - Normalization   │  3000 frames (30s)
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Model Inference   │  WhisperForConditionalGeneration
│ - Encoder         │  12 layers
│ - Decoder         │  12 layers
│ - Generation      │  Beam search (size=5)
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Decoding          │  processor.batch_decode()
│ - Token → Text    │  skip_special_tokens=True
│ - Formatting      │
└─────────┬─────────┘
          │
          ▼
 German Transcription
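The four stages map onto a handful of librosa and Transformers calls. The sketch below is one plausible arrangement, not the project's actual code: `transcribe_file` and `model_dir` are hypothetical names, and passing `language` and `task` to `generate()` assumes the checkpoint ships a generation config that supports them. A real server would load the processor and model once at startup rather than on every call.

```python
SAMPLE_RATE = 16_000  # Whisper expects 16 kHz mono input


def transcribe_file(path: str, model_dir: str, num_beams: int = 5) -> str:
    """Transcribe one audio file to German text (sketch of the four stages)."""
    # Heavy dependencies are imported lazily; they must be installed to call this.
    import librosa
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_dir)
    model = WhisperForConditionalGeneration.from_pretrained(model_dir)
    model.eval()

    # 1. Audio loading: decode, resample to 16 kHz, downmix to mono.
    audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)

    # 2. Preprocessing: log-mel features, padded or truncated to 30 s.
    inputs = processor(audio, sampling_rate=SAMPLE_RATE, return_tensors="pt")

    # 3. Inference: encoder pass plus beam-search decoding.
    with torch.no_grad():
        token_ids = model.generate(
            inputs.input_features,
            num_beams=num_beams,
            language="german",
            task="transcribe",
        )

    # 4. Decoding: token IDs to text, dropping special tokens.
    text = processor.batch_decode(token_ids, skip_special_tokens=True)[0]
    return text.strip()
```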
3. Model Architecture
┌─────────────────────────────────────────────────┐
│           Whisper-small Architecture            │
├─────────────────────────────────────────────────┤
│                                                 │
│  Input: 80-channel Mel Spectrogram              │
│         (80 x 3000 frames = 30 seconds)         │
│                                                 │
│   ┌─────────────────────────────────────────┐   │
│   │           Encoder (12 layers)           │   │
│   │   ┌─────────────────────────────────┐   │   │
│   │   │ Conv1D → Conv1D → Positional    │   │   │
│   │   │ Embedding → Transformer Blocks  │   │   │
│   │   └─────────────────────────────────┘   │   │
│   │   Output: 768-dim embeddings            │   │
│   └────────────────────┬────────────────────┘   │
│                        │                        │
│                        ▼                        │
│   ┌─────────────────────────────────────────┐   │
│   │           Decoder (12 layers)           │   │
│   │   ┌─────────────────────────────────┐   │   │
│   │   │ Token Embedding → Positional    │   │   │
│   │   │ Embedding → Transformer Blocks  │   │   │
│   │   │ (with Cross-Attention) → Output │   │   │
│   │   └─────────────────────────────────┘   │   │
│   │   Output: Token probabilities           │   │
│   └─────────────────────────────────────────┘   │
│                                                 │
│  Parameters: 242M                               │
│  Language: German (de)                          │
│  Task: Transcribe                               │
└─────────────────────────────────────────────────┘
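The 242M parameter figure can be sanity-checked from the published Whisper-small hyperparameters, which are assumed here rather than read from the project: 12 encoder and 12 decoder layers, model width 768, a feed-forward width of 4x the model width, 80 mel channels, a 51,865-token multilingual vocabulary, 1,500 audio positions, and 448 text positions. The helper name `whisper_param_count` is hypothetical, and the count follows the Transformers layout in which the key projection has no bias and the output head shares weights with the token embedding.

```python
def whisper_param_count(
    d_model: int = 768,
    n_layers: int = 12,
    n_mels: int = 80,
    vocab_size: int = 51_865,
    audio_positions: int = 1_500,
    text_positions: int = 448,
) -> int:
    """Approximate parameter count of a Whisper model from its hyperparameters."""
    layer_norm = 2 * d_model  # weight + bias
    # q, k, v, and output projections; the key projection has no bias.
    attention = 4 * d_model * d_model + 3 * d_model
    # Two linear layers (d -> 4d -> d) with biases.
    mlp = 2 * (d_model * 4 * d_model) + 4 * d_model + d_model

    # Two Conv1D layers with kernel size 3 in front of the encoder.
    conv_stem = (n_mels * d_model * 3 + d_model) + (d_model * d_model * 3 + d_model)
    encoder_layer = attention + mlp + 2 * layer_norm
    encoder = conv_stem + audio_positions * d_model + n_layers * encoder_layer + layer_norm

    # Decoder layers add a cross-attention block and a third layer norm.
    decoder_layer = 2 * attention + mlp + 3 * layer_norm
    embeddings = vocab_size * d_model + text_positions * d_model
    decoder = embeddings + n_layers * decoder_layer + layer_norm

    return encoder + decoder


total = whisper_param_count()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate rounds to 242M, which agrees with the figure in the diagram.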
Deployment Architectures
Local Development
┌──────────────────────────────┐
│      Developer Machine       │
│  ┌────────────────────────┐  │
│  │  Python Environment    │  │
│  │  - FastAPI/Gradio      │  │
│  │  - Whisper Model       │  │
│  │  - Dependencies        │  │
│  └────────────────────────┘  │
│  Ports: 8000 (API)           │
│         7860 (Demo)          │
└──────────────────────────────┘
Docker Deployment
┌───────────────────────────────────┐
│            Docker Host            │
│  ┌─────────────────────────────┐  │
│  │  Container: whisper-api     │  │
│  │  - FastAPI                  │  │
│  │  - Port 8000                │  │
│  └─────────────────────────────┘  │
│  ┌─────────────────────────────┐  │
│  │  Container: whisper-demo    │  │
│  │  - Gradio                   │  │
│  │  - Port 7860                │  │
│  └─────────────────────────────┘  │
│  ┌─────────────────────────────┐  │
│  │  Volume: whisper_test_tuned │  │
│  │  - Shared model files       │  │
│  └─────────────────────────────┘  │
└───────────────────────────────────┘
Cloud Deployment (AWS)
┌─────────────────────────────────────────────────┐
│                    AWS Cloud                    │
│  ┌───────────────────────────────────────────┐  │
│  │  Application Load Balancer                │  │
│  │  - HTTPS (443)                            │  │
│  │  - Health checks                          │  │
│  └─────────────────────┬─────────────────────┘  │
│                        │                        │
│                        ▼                        │
│  ┌───────────────────────────────────────────┐  │
│  │  ECS Fargate Service                      │  │
│  │  ┌─────────────────────────────────────┐  │  │
│  │  │  Task 1: whisper-asr                │  │  │
│  │  │  - 1 vCPU, 2GB RAM                  │  │  │
│  │  │  - Container: API                   │  │  │
│  │  └─────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────┐  │  │
│  │  │  Task 2: whisper-asr                │  │  │
│  │  │  - Auto-scaling (2-10 tasks)        │  │  │
│  │  └─────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────┐  │
│  │  S3 Bucket                                │  │
│  │  - Model files                            │  │
│  │  - Static assets                          │  │
│  └───────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────┐  │
│  │  CloudWatch                               │  │
│  │  - Logs                                   │  │
│  │  - Metrics                                │  │
│  │  - Alarms                                 │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘
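The 2-10 task auto-scaling range implies a simple capacity calculation: divide the expected request rate by what one task can sustain and clamp the result to the allowed range. The helper below is a hypothetical sizing sketch, not an AWS API call or the project's scaling policy. The per-task throughput has to be measured; the 0.3 samples/sec used in the example is the lower CPU figure from the Throughput table, which was taken on 4 cores, so a 1 vCPU task will likely be slower.

```python
import math


def tasks_needed(
    requests_per_sec: float,
    per_task_throughput: float,
    min_tasks: int = 2,
    max_tasks: int = 10,
) -> int:
    """Return the task count for a target load, clamped to the scaling range."""
    if per_task_throughput <= 0:
        raise ValueError("per_task_throughput must be positive")
    raw = math.ceil(requests_per_sec / per_task_throughput)
    return max(min_tasks, min(max_tasks, raw))


# One request per second at 0.3 samples/sec per task needs 4 tasks.
print(tasks_needed(1.0, 0.3))
```

Loads above what 10 tasks can serve are capped at the maximum, which is the point where requests start to queue or a GPU-backed service becomes worth considering.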
HuggingFace Spaces
┌─────────────────────────────────────┐
│         HuggingFace Spaces          │
│  ┌───────────────────────────────┐  │
│  │  Gradio Space                 │  │
│  │  - app.py                     │  │
│  │  - requirements.txt           │  │
│  │  - README.md                  │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │  Model from HF Hub            │  │
│  │  - YOUR_USER/whisper-de       │  │
│  │  - Auto-loaded                │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │  Hardware                     │  │
│  │  - CPU Basic (free)           │  │
│  │  - GPU T4 (paid)              │  │
│  └───────────────────────────────┘  │
│  Public URL: https://hf.co/spaces/  │
│              YOUR_USER/whisper-de   │
└─────────────────────────────────────┘
Data Flow
Transcription Request Flow
1. User uploads audio
   │
   ▼
2. API receives file
   │
   ▼
3. Load audio with librosa
   - Decode format (mp3/wav/etc.)
   - Resample to 16kHz
   - Convert to mono
   │
   ▼
4. WhisperProcessor
   - Compute mel spectrogram
   - Normalize features
   - Pad/truncate to 30s
   │
   ▼
5. Model.generate()
   - Encoder: audio → embeddings
   - Decoder: embeddings → tokens
   - Beam search for best sequence
   │
   ▼
6. Processor.decode()
   - Tokens → text
   - Remove special tokens
   - Format output
   │
   ▼
7. Return JSON response
   {
     "transcription": "...",
     "duration": 2.5,
     "language": "de"
   }
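The final step of the flow can be sketched with the standard library alone. The field names come from the JSON shown in step 7; `TranscriptionResponse` and `build_response` are hypothetical names, and deriving the duration from the sample count at 16 kHz is an assumption about how that field is computed.

```python
import json
from dataclasses import asdict, dataclass

SAMPLE_RATE = 16_000  # samples per second after resampling


@dataclass
class TranscriptionResponse:
    transcription: str
    duration: float  # seconds of audio
    language: str = "de"


def build_response(text: str, num_samples: int, sample_rate: int = SAMPLE_RATE) -> str:
    """Serialize a transcription result to the JSON shape returned by the API."""
    duration = round(num_samples / sample_rate, 2)
    payload = TranscriptionResponse(transcription=text.strip(), duration=duration)
    # ensure_ascii=False keeps umlauts and sharp s readable in the output.
    return json.dumps(asdict(payload), ensure_ascii=False)


# 40,000 samples at 16 kHz correspond to 2.5 seconds of audio.
print(build_response("Guten Morgen", num_samples=40_000))
```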
Technology Stack
┌─────────────────────────────────────┐
│  Frontend/Interface                 │
├─────────────────────────────────────┤
│  - Gradio 4.0+                      │
│  - HTML/CSS/JavaScript              │
│  - Swagger UI (FastAPI)             │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Backend/API                        │
├─────────────────────────────────────┤
│  - FastAPI 0.104+                   │
│  - Uvicorn (ASGI server)            │
│  - Pydantic (validation)            │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  ML Framework                       │
├─────────────────────────────────────┤
│  - PyTorch 2.2+                     │
│  - Transformers 4.42+               │
│  - Datasets 2.19+                   │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Audio Processing                   │
├─────────────────────────────────────┤
│  - Librosa 0.10+                    │
│  - SoundFile 0.12+                  │
│  - FFmpeg (system)                  │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Evaluation                         │
├─────────────────────────────────────┤
│  - jiwer 4.0+ (WER/CER)             │
│  - NumPy 1.24+                      │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Deployment/DevOps                  │
├─────────────────────────────────────┤
│  - Docker                           │
│  - Docker Compose                   │
│  - GitHub Actions                   │
└─────────────────────────────────────┘
Performance Characteristics
Latency
Component               Time
─────────────────────────────────
Audio Loading           50-100ms
Feature Extraction      100-200ms
Model Inference (CPU)   1-3s
Model Inference (GPU)   200-500ms
Post-processing         10-50ms
─────────────────────────────────
Total (CPU)             1.2-3.4s
Total (GPU)             360-850ms
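The totals follow from adding the per-stage ranges, assuming the stages run one after another with no overlap. The short check below reproduces that arithmetic; the dictionary and function names are illustrative only.

```python
# Per-stage latency ranges in milliseconds, taken from the table above.
STAGES_MS = {
    "audio_loading": (50, 100),
    "feature_extraction": (100, 200),
    "post_processing": (10, 50),
}
INFERENCE_MS = {"cpu": (1000, 3000), "gpu": (200, 500)}


def total_latency_ms(device: str) -> tuple[int, int]:
    """Return the (best, worst) end-to-end latency for sequential stages."""
    fixed_lo = sum(lo for lo, _ in STAGES_MS.values())
    fixed_hi = sum(hi for _, hi in STAGES_MS.values())
    infer_lo, infer_hi = INFERENCE_MS[device]
    return fixed_lo + infer_lo, fixed_hi + infer_hi


for device in ("cpu", "gpu"):
    lo, hi = total_latency_ms(device)
    print(f"{device}: {lo}-{hi} ms")
```

The CPU sum comes to 1160-3350 ms, which the table rounds to 1.2-3.4s, and the GPU sum matches the table's 360-850ms exactly.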
Throughput
Hardware         Samples/sec
────────────────────────────
CPU (4 cores)    0.3-0.5
GPU (T4)         2-5
GPU (A100)       10-20
Resource Usage
Component       CPU    Memory   GPU Memory
──────────────────────────────────────────
Model Loading   -      1.5GB    1GB
Inference       100%   2GB      1.5GB
API Server      10%    200MB    -
Gradio Demo     5%     100MB    -
Security Architecture
┌─────────────────────────────────────┐
│           Security Layers           │
├─────────────────────────────────────┤
│  1. Network Layer                   │
│     - HTTPS/TLS                     │
│     - CORS policies                 │
│     - Rate limiting                 │
│                                     │
│  2. Application Layer               │
│     - Input validation              │
│     - File type checking            │
│     - Size limits                   │
│     - Error handling                │
│                                     │
│  3. Authentication (optional)       │
│     - API keys                      │
│     - OAuth2                        │
│     - JWT tokens                    │
│                                     │
│  4. Infrastructure                  │
│     - Container isolation           │
│     - Resource limits               │
│     - Secrets management            │
└─────────────────────────────────────┘
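The application-layer checks (file type and size limits) can be illustrated with a small validator. Everything specific here is an assumption for the sake of the example: the allowed extensions, the 25 MiB cap, and the names `validate_upload` and `UploadRejected` are not taken from the project. An extension check is only a first filter; the audio decoder still has to handle files whose content does not match their name.

```python
from pathlib import Path

# Assumed policy values; adjust to the deployment's real limits.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MiB


class UploadRejected(ValueError):
    """Raised when an upload fails validation."""


def validate_upload(filename: str, size_bytes: int) -> None:
    """Reject uploads with an unexpected extension or an out-of-range size."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise UploadRejected(f"Unsupported file type: {suffix or 'none'}")
    if size_bytes <= 0:
        raise UploadRejected("Empty file")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise UploadRejected(f"File exceeds {MAX_UPLOAD_BYTES} bytes")


validate_upload("interview.WAV", 3_200_000)  # passes silently
```

In an API handler, an `UploadRejected` error would typically be translated into an HTTP 400 or 413 response instead of reaching the model.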
Monitoring & Observability
┌─────────────────────────────────────────┐
│            Monitoring Stack             │
├─────────────────────────────────────────┤
│  Logs                                   │
│    - Application logs (Python)          │
│    - Access logs (Uvicorn)              │
│    - Error logs                         │
│                                         │
│  Metrics                                │
│    - Request count                      │
│    - Latency (p50, p95, p99)            │
│    - Error rate                         │
│    - Model inference time               │
│    - Resource usage (CPU/RAM/GPU)       │
│                                         │
│  Health Checks                          │
│    - /health endpoint                   │
│    - Model loaded status                │
│    - Device availability                │
│                                         │
│  Tools                                  │
│    - TensorBoard (training)             │
│    - CloudWatch/Stackdriver (cloud)     │
│    - Prometheus + Grafana (optional)    │
└─────────────────────────────────────────┘
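The latency percentiles listed under Metrics can be computed from recorded request times with the standard library. This is a sketch under the assumption that latencies are collected in process as a list of milliseconds; a Prometheus or CloudWatch setup would normally compute percentiles on the monitoring side. The function name is hypothetical.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return p50, p95, and p99 of the recorded request latencies."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; cut point i (1-based) is the i-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic latencies of 0..100 ms, one sample per millisecond value.
print(latency_percentiles(list(range(101))))
```

The gap between p50 and p99 is usually more informative than the mean for this kind of service, because a few long audio files or cold starts can dominate the tail.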
This architecture provides:
- Modularity and separation of concerns
- Scalability (horizontal and vertical)
- Multiple deployment options
- Production-ready monitoring
- Security best practices
- High availability potential