
VoiceForge Architecture

Overview

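VoiceForge is a production-grade Speech-to-Text (STT) and Text-to-Speech (TTS) application built on a Python stack: a Streamlit frontend, a FastAPI backend, Celery workers, PostgreSQL, and Redis, with Google Cloud providing the speech APIs. This document describes the system architecture and the key design decisions behind it.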

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Load Balancer                           │
│                      (Nginx / Cloud LB)                         │
└───────────────────────────┬─────────────────────────────────────┘
                            │
            ┌───────────────┼───────────────┐
            │               │               │
     ┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
     │   Frontend  │ │   Backend   │ │   Worker    │
     │  Streamlit  │ │   FastAPI   │ │   Celery    │
     │   :8501     │ │   :8000     │ │             │
     └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
            │               │               │
     ┌──────▼───────────────▼───────────────▼──────┐
     │                Service Layer                │
     │  ┌─────────┐ ┌─────────┐ ┌────────────────┐ │
     │  │   STT   │ │   TTS   │ │  File Service  │ │
     │  │ Service │ │ Service │ │                │ │
     │  └────┬────┘ └────┬────┘ └───────┬────────┘ │
     │  ┌─────────┐ ┌─────────┐         │          │
     │  │   NLP   │ │  Export │         │          │
     │  │ Service │ │ Service │         │          │
     │  └────┬────┘ └────┬────┘         │          │
     └───────┼───────────┼──────────────┼──────────┘
             │           │              │
     ┌───────▼───────────▼──────────────▼──────────┐
     │                 Data Layer                  │
     │  ┌──────────┐  ┌───────┐  ┌──────────────┐  │
     │  │PostgreSQL│  │ Redis │  │ File Storage │  │
     │  │  :5432   │  │ :6379 │  │   /uploads   │  │
     │  └──────────┘  └───────┘  └──────────────┘  │
     └───────┬─────────────────────────────────────┘
             │
     ┌───────▼─────────────────────────────────────┐
     │                External APIs                │
     │  ┌─────────────────┐  ┌──────────────────┐  │
     │  │ Google Cloud    │  │  Google Cloud    │  │
     │  │ Speech-to-Text  │  │  Text-to-Speech  │  │
     │  └─────────────────┘  └──────────────────┘  │
     └─────────────────────────────────────────────┘

Components

Frontend (Streamlit)

  • Purpose: Web interface for users
  • Technology: Streamlit 1.31+
  • Key Features:
    • Real-time microphone recording (WebRTC)
    • File upload with drag-and-drop
    • Audio waveform visualization
    • Transcript editing and export
    • Voice selection and preview

Backend (FastAPI)

  • Purpose: REST API server
  • Technology: FastAPI 0.109+
  • Key Features:
    • OpenAPI documentation
    • CORS middleware
    • JWT authentication (Phase 3)
    • Request validation
    • Error handling

Worker (Celery)

  • Purpose: Background task processing
  • Technology: Celery 5.3+ with Redis broker
  • Key Features:
    • Long audio file processing
    • Batch transcription
    • NLP analysis tasks
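
One way to decide which uploads are handed to the Celery worker is to route by audio duration. The sketch below is an assumption about how such routing might look, not VoiceForge's actual logic: the 60-second threshold and the names `wav_duration_seconds` and `choose_processing_mode` are hypothetical, and only WAV input is handled because it uses the standard library alone.

```python
import io
import wave

# Assumed threshold for illustration; the real cut-off is a deployment choice.
SYNC_LIMIT_SECONDS = 60.0


def wav_duration_seconds(payload: bytes) -> float:
    """Return the duration of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(payload), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def choose_processing_mode(duration_seconds: float) -> str:
    """Short clips are handled in the request; long ones go to a background task."""
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        return "inline"
    return "background"
```

In a real deployment the "background" branch would enqueue a Celery task and return a task ID that the frontend can poll.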

Database (PostgreSQL)

  • Purpose: Persistent data storage
  • Technology: PostgreSQL 15+
  • Tables:
    • users - User accounts
    • audio_files - Uploaded audio metadata
    • transcripts - Transcription results
    • user_preferences - User settings
    • usage_events - Analytics data
    • api_keys - Enterprise API keys

Cache (Redis)

  • Purpose: Caching and task queue
  • Technology: Redis 7+
  • Use Cases:
    • Voice list caching
    • Transcription result caching
    • Celery task queue
    • Session storage
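
Transcription result caching needs a key that identifies the same audio and language across requests. A minimal sketch, assuming a content hash as the key: the `stt:` prefix, the function name, and the in-memory class (a stand-in for Redis set-with-expiry and get) are all hypothetical.

```python
import hashlib
import time
from typing import Optional


def transcript_cache_key(audio: bytes, language_code: str) -> str:
    """Derive a stable cache key from the audio bytes and the language."""
    digest = hashlib.sha256(audio).hexdigest()
    return f"stt:{language_code}:{digest}"


class InMemoryTTLCache:
    """Dictionary-backed stand-in for a Redis cache with per-key expiry."""

    def __init__(self) -> None:
        self._items: dict[str, tuple[float, str]] = {}

    def set(self, key: str, value: str, ttl_seconds: float,
            now: Optional[float] = None) -> None:
        current = time.monotonic() if now is None else now
        self._items[key] = (current + ttl_seconds, value)

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        current = time.monotonic() if now is None else now
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if current >= expires_at:
            # Expired entries are dropped on read.
            del self._items[key]
            return None
        return value
```

Hashing the content rather than the filename means a re-upload of the same audio under a different name still hits the cache.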

Observability (Prometheus)

  • Purpose: Application monitoring
  • Technology: prometheus-fastapi-instrumentator
  • Key Metrics:
    • Request latency and throughput
    • Error rates
    • Endpoint usage statistics

Data Flow

Speech-to-Text Flow

1. User uploads audio file
2. Frontend sends the file to /api/v1/stt/upload
3. Backend validates file format and size
4. File saved to storage
5. STT Service calls Google Cloud Speech API
6. Results processed (words, segments, timestamps)
7. Transcript saved to database
8. Response returned to frontend
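
Step 3 of this flow (format and size validation) can be sketched as follows. The allowed extensions and the 10 MB limit are illustrative assumptions, not values taken from the VoiceForge codebase, and `validate_upload` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed limits for illustration only; real values belong in backend config.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10 MB


class UploadValidationError(ValueError):
    """Raised when an uploaded audio file fails validation."""


def validate_upload(filename: str, payload: bytes) -> str:
    """Return the normalized file extension, or raise UploadValidationError."""
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise UploadValidationError(f"unsupported format: {extension or '(none)'}")
    if not payload:
        raise UploadValidationError("file is empty")
    if len(payload) > MAX_UPLOAD_BYTES:
        raise UploadValidationError("file exceeds size limit")
    return extension
```

Rejecting the file before it is written to storage keeps invalid uploads from reaching the STT service or the external API.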

Text-to-Speech Flow

1. User enters text
2. Frontend sends the text and chosen voice to /api/v1/tts/synthesize
3. Backend validates text and voice
4. TTS Service calls Google Cloud TTS API
5. Audio returned to the frontend as a base64-encoded string
6. Frontend plays/downloads audio
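
Steps 5 and 6 amount to a base64 round trip through a JSON response. The sketch below shows one possible shape; the field names `audio_content` and `format` and both function names are assumptions, not the actual VoiceForge response schema.

```python
import base64
import json


def build_tts_response(audio: bytes, audio_format: str = "mp3") -> str:
    """Backend side: wrap raw audio bytes in a JSON-safe payload."""
    return json.dumps({
        "audio_content": base64.b64encode(audio).decode("ascii"),
        "format": audio_format,
    })


def extract_audio(response_body: str) -> bytes:
    """Frontend side: recover the raw bytes for playback or download."""
    payload = json.loads(response_body)
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(payload["audio_content"], validate=True)
```

Base64 inflates the payload by roughly a third, which is acceptable for short clips; for long audio, returning a file URL may be a better fit.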

Design Decisions

Why PostgreSQL with JSONB?

  • Single database simplifies deployment
  • JSONB supports flexible document storage for segments
  • SQL for relational queries (users, files)
  • Full-text search capability
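
To make the JSONB point concrete, here is a sketch of how transcript segments might be serialized before being written to a JSONB column. The `Segment` fields and both helper functions are hypothetical; the actual column layout is not specified in this document.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Segment:
    """One timed span of a transcript (field names are assumptions)."""
    start: float
    end: float
    text: str
    confidence: float


def segments_to_jsonb(segments: list[Segment]) -> str:
    """Serialize segments to the JSON text a JSONB column would receive."""
    return json.dumps([asdict(segment) for segment in segments])


def full_text(jsonb_value: str) -> str:
    """Rebuild the plain transcript from the stored segment array."""
    return " ".join(item["text"] for item in json.loads(jsonb_value))
```

Keeping segments as one document per transcript avoids a separate segments table while users and files stay in ordinary relational columns.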

Why Streamlit?

  • Rapid development for data apps
  • Built-in components for audio
  • Easy deployment
  • Python-native (no JS required)

Why Google Cloud APIs?

  • Industry-leading accuracy
  • 100+ languages supported (Speech-to-Text)
  • 200+ voice options (Text-to-Speech)
  • Generous free tier

Security Considerations

  • Secrets via environment variables
  • HTTPS in production
  • JWT for authentication
  • Per-user data isolation
  • Temporary file cleanup
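
Loading secrets from environment variables works best when missing values fail at startup rather than at first use. A minimal sketch, assuming the variable names `DATABASE_URL`, `REDIS_URL`, and `JWT_SECRET` (all hypothetical, as is the `Settings` class):

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; adjust to match the real deployment.
REQUIRED_VARIABLES = ("DATABASE_URL", "REDIS_URL", "JWT_SECRET")


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    jwt_secret: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read required secrets from the environment and fail fast if any is unset."""
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(
        database_url=env["DATABASE_URL"],
        redis_url=env["REDIS_URL"],
        jwt_secret=env["JWT_SECRET"],
    )
```

Passing the mapping as a parameter keeps the function testable without touching the real process environment.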

Deployment Options

Local Development

# Backend
cd backend
uvicorn app.main:app --reload

# Frontend
cd frontend
streamlit run streamlit_app.py

Docker Compose

docker-compose -f deploy/docker/docker-compose.dev.yml up

Production

  • Deploy to any container orchestrator
  • Use managed PostgreSQL (Cloud SQL, RDS)
  • Use managed Redis (Memorystore, ElastiCache)
  • Load balance with Nginx/Cloud LB