VoiceForge Architecture

Overview

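VoiceForge is a production-grade Speech-to-Text (STT) and Text-to-Speech (TTS) application built on a Python stack: a Streamlit frontend, a FastAPI backend, and Celery workers, backed by PostgreSQL and Redis. This document describes the system architecture and the key design decisions behind it.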

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         Load Balancer                             │
│                      (Nginx / Cloud LB)                           │
└───────────────────────────┬─────────────────────────────────────┘
                            │
            ┌───────────────┼───────────────┐
            │               │               │
     ┌──────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐
     │   Frontend  │ │   Backend  │ │   Worker   │
     │  Streamlit  │ │   FastAPI  │ │   Celery   │
     │   :8501     │ │   :8000    │ │            │
     └──────┬──────┘ └─────┬──────┘ └─────┬──────┘
            │               │               │
     ┌──────▼───────────────▼───────────────▼──────┐
     │              Service Layer                    │
     │  ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │
     │  │   STT   │ │   TTS   │ │   File Service  │ │
     │  │ Service │ │ Service │ │                 │ │
     │  └────┬────┘ └────┬────┘ └────────┬────────┘ │
     │  ┌─────────┐ ┌─────────┐          │          │
     │  │   NLP   │ │  Export │          │          │
     │  │ Service │ │ Service │          │          │
     │  └────┬────┘ └────┬────┘          │          │
     └───────┼───────────┼───────────────┼──────────┘
             │           │               │
     ┌───────▼───────────▼───────────────▼──────────┐
     │              Data Layer                        │
     │  ┌──────────┐  ┌───────┐  ┌───────────────┐  │
     │  │PostgreSQL│  │ Redis │  │  File Storage │  │
     │  │  :5432   │  │ :6379 │  │    /uploads   │  │
     │  └──────────┘  └───────┘  └───────────────┘  │
     └────────────────────────────────────────────────┘
             │
     ┌───────▼─────────────────────────────────────┐
     │           External APIs                       │
     │  ┌─────────────────┐  ┌──────────────────┐  │
     │  │ Google Cloud    │  │  Google Cloud    │  │
     │  │ Speech-to-Text  │  │  Text-to-Speech  │  │
     │  └─────────────────┘  └──────────────────┘  │
     └─────────────────────────────────────────────┘

Components

Frontend (Streamlit)

Purpose: Web interface for users
Technology: Streamlit 1.31+
Key Features:
- Real-time microphone recording (WebRTC)
- File upload with drag-and-drop
- Audio waveform visualization
- Transcript editing and export
- Voice selection and preview

Backend (FastAPI)

Purpose: REST API server
Technology: FastAPI 0.109+
Key Features:
- OpenAPI documentation
- CORS middleware
- JWT authentication (Phase 3)
- Request validation
- Error handling
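
To illustrate the request validation and error handling listed above, here is a minimal sketch in plain Python. It is not the project's code: in a FastAPI application this role is normally played by Pydantic models, and the field names (`text`, `voice`), the 5000-character limit, and the error body shape are all assumptions made for the example.

```python
from dataclasses import dataclass

MAX_TEXT_CHARS = 5000  # assumed limit, not taken from the project


class ValidationError(Exception):
    """Raised when a request field fails validation."""

    def __init__(self, field: str, message: str) -> None:
        super().__init__(f"{field}: {message}")
        self.field = field
        self.message = message


@dataclass(frozen=True)
class SynthesizeRequest:
    """Hypothetical payload for a text-to-speech request."""

    text: str
    voice: str

    def __post_init__(self) -> None:
        if not self.text.strip():
            raise ValidationError("text", "must not be empty")
        if len(self.text) > MAX_TEXT_CHARS:
            raise ValidationError("text", f"must be at most {MAX_TEXT_CHARS} characters")
        if not self.voice:
            raise ValidationError("voice", "must not be empty")


def handle(payload: dict) -> tuple[int, dict]:
    """Return an (HTTP status, JSON body) pair for a raw payload."""
    try:
        request = SynthesizeRequest(
            text=str(payload.get("text", "")),
            voice=str(payload.get("voice", "")),
        )
    except ValidationError as exc:
        # 422 mirrors the status FastAPI uses for validation failures.
        return 422, {"error": {"field": exc.field, "message": exc.message}}
    return 200, {"accepted": {"text": request.text, "voice": request.voice}}
```

Returning every failure in the same `{"error": {...}}` envelope lets the frontend show field-level messages without special-casing each endpoint.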

Worker (Celery)

Purpose: Background task processing
Technology: Celery 5.3+ with Redis broker
Key Features:
- Long audio file processing
- Batch transcription
- NLP analysis tasks

Database (PostgreSQL)

Purpose: Persistent data storage
Technology: PostgreSQL 15+
Tables:
- users: User accounts
- audio_files: Uploaded audio metadata
- transcripts: Transcription results
- user_preferences: User settings
- usage_events: Analytics data
- api_keys: Enterprise API keys

Cache (Redis)

Purpose: Caching and task queue
Technology: Redis 7+
Use Cases:
- Voice list caching
- Transcription result caching
- Celery task queue
- Session storage
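
Voice list caching follows the usual cache-aside pattern: look in the cache first, and call the upstream API only on a miss or after the entry expires. The sketch below uses an in-memory dictionary as a stand-in for Redis so that it runs without a server. `TTLCache`, `list_voices`, and the `"voices"` key are illustrative names, not the project's.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory stand-in for a Redis key with an expiry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, list[str]]] = {}

    def get(self, key: str) -> Optional[list[str]]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: behave as a cache miss
            return None
        return value

    def set(self, key: str, value: list[str]) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def list_voices(cache: TTLCache, fetch: Callable[[], list[str]]) -> list[str]:
    """Cache-aside lookup: return cached voices, or fetch and cache them."""
    voices = cache.get("voices")
    if voices is None:
        voices = fetch()
        cache.set("voices", voices)
    return voices
```

The injectable `clock` exists only to make expiry testable. With Redis, the expiry would be set on the key itself and the server would drop it.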

Observability (Prometheus)

Purpose: Application monitoring
Technology: prometheus-fastapi-instrumentator
Key Metrics:
- Request latency and throughput
- Error rates
- Endpoint usage statistics

Data Flow

Speech-to-Text Flow

1. User uploads an audio file
2. Frontend sends the file to /api/v1/stt/upload
3. Backend validates the file format and size
4. File is saved to storage
5. STT Service calls the Google Cloud Speech-to-Text API
6. Results are processed into words, segments, and timestamps
7. Transcript is saved to the database
8. Response is returned to the frontend
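
Step 3 of this flow, validating format and size, can be sketched as a pure function that runs before anything is written to storage. The allowed extensions and the 25 MiB limit below are assumptions for illustration. The document does not state the actual values.

```python
from pathlib import PurePath

# Both values are assumptions for illustration, not the project's settings.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MiB


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of validation problems; an empty list means the file is accepted."""
    problems = []
    extension = PurePath(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes <= 0:
        problems.append("file is empty")
    elif size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the size limit")
    return problems
```

Checking the extension alone is a weak test. A stricter variant would also inspect the file header, which this sketch leaves out.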

Text-to-Speech Flow

1. User enters text
2. Frontend sends the text and selected voice to /api/v1/tts/synthesize
3. Backend validates the text and voice
4. TTS Service calls the Google Cloud Text-to-Speech API
5. Audio is returned as a base64-encoded string
6. Frontend plays the audio or offers it for download
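
Steps 5 and 6 amount to a base64 round trip: the backend encodes the raw audio bytes so they fit in a JSON body, and the frontend decodes them before playback. The sketch below shows both halves. The `audio_content` field name is an assumption, not the project's documented response schema.

```python
import base64
import json


def encode_audio_response(audio: bytes) -> str:
    """Server side: wrap raw audio bytes in a JSON body as base64 text."""
    return json.dumps({"audio_content": base64.b64encode(audio).decode("ascii")})


def decode_audio_response(body: str) -> bytes:
    """Client side: recover the raw audio bytes from the JSON body."""
    payload = json.loads(body)
    return base64.b64decode(payload["audio_content"], validate=True)
```

Base64 grows the payload by roughly one third, which is usually acceptable for short synthesized clips but worth keeping in mind for long ones.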

Design Decisions

Why PostgreSQL with JSONB?

- Single database simplifies deployment
- JSONB supports flexible document storage for segments
- SQL for relational queries (users, files)
- Full-text search capability
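
The idea of keeping relational columns and a flexible segment document in the same row can be sketched as follows. SQLite from the standard library stands in for PostgreSQL so the example runs anywhere. The column names and segment fields are hypothetical, and in PostgreSQL the `segments` column would be declared `JSONB` and could be queried directly with JSON operators.

```python
import json
import sqlite3

# SQLite stands in for PostgreSQL here; in PostgreSQL the `segments`
# column would be declared JSONB instead of TEXT.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transcripts ("
    " id INTEGER PRIMARY KEY,"
    " audio_file_id INTEGER NOT NULL,"
    " segments TEXT NOT NULL)"
)

# Hypothetical segment shape: start/end times in seconds plus the text.
segments = [
    {"start": 0.0, "end": 1.4, "text": "hello"},
    {"start": 1.4, "end": 2.9, "text": "world"},
]
conn.execute(
    "INSERT INTO transcripts (audio_file_id, segments) VALUES (?, ?)",
    (7, json.dumps(segments)),
)

# The relational column filters the row; the document column carries the detail.
row = conn.execute(
    "SELECT audio_file_id, segments FROM transcripts WHERE audio_file_id = ?",
    (7,),
).fetchone()
loaded = json.loads(row[1])
```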

Why Streamlit?

- Rapid development for data apps
- Built-in components for audio
- Easy deployment
- Python-native (no JS required)

Why Google Cloud APIs?

- Industry-leading accuracy
- 100+ languages supported
- 200+ voice options
- Generous free tier

Security Considerations

- Secrets via environment variables
- HTTPS in production
- JWT for authentication
- Per-user data isolation
- Temporary file cleanup

Deployment Options

Local Development

# Backend
cd backend
uvicorn app.main:app --reload

# Frontend
cd frontend
streamlit run streamlit_app.py

Docker Compose

docker-compose -f deploy/docker/docker-compose.dev.yml up

Production

- Deploy to any container orchestrator
- Use managed PostgreSQL (Cloud SQL, RDS)
- Use managed Redis (Memorystore, ElastiCache)
- Load balance with Nginx/Cloud LB