# VoiceForge Architecture

## Overview

VoiceForge is a Speech-to-Text (STT) and Text-to-Speech (TTS) application built on FastAPI, Streamlit, Celery, PostgreSQL, and Redis, with Google Cloud providing the speech APIs. This document describes the system architecture and the key design decisions behind it.

## System Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         Load Balancer                             │
│                      (Nginx / Cloud LB)                           │
└───────────────────────────┬─────────────────────────────────────┘
                            │
            ┌───────────────┼───────────────┐
            │               │               │
     ┌──────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐
     │   Frontend  │ │   Backend  │ │   Worker   │
     │  Streamlit  │ │   FastAPI  │ │   Celery   │
     │   :8501     │ │   :8000    │ │            │
     └──────┬──────┘ └─────┬──────┘ └─────┬──────┘
            │               │               │
     ┌──────▼───────────────▼───────────────▼──────┐
     │              Service Layer                    │
     │  ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │
     │  │   STT   │ │   TTS   │ │   File Service  │ │
     │  │ Service │ │ Service │ │                 │ │
     │  └────┬────┘ └────┬────┘ └────────┬────────┘ │
     │  ┌─────────┐ ┌─────────┐          │          │
     │  │   NLP   │ │  Export │          │          │
     │  │ Service │ │ Service │          │          │
     │  └────┬────┘ └────┬────┘          │          │
     └───────┼───────────┼───────────────┼──────────┘
             │           │               │
     ┌───────▼───────────▼───────────────▼──────────┐
     │              Data Layer                        │
     │  ┌──────────┐  ┌───────┐  ┌───────────────┐  │
     │  │PostgreSQL│  │ Redis │  │  File Storage │  │
     │  │  :5432   │  │ :6379 │  │    /uploads   │  │
     │  └──────────┘  └───────┘  └───────────────┘  │
     └────────────────────────────────────────────────┘
             │
     ┌───────▼─────────────────────────────────────┐
     │           External APIs                       │
     │  ┌─────────────────┐  ┌──────────────────┐  │
     │  │ Google Cloud    │  │  Google Cloud    │  │
     │  │ Speech-to-Text  │  │  Text-to-Speech  │  │
     │  └─────────────────┘  └──────────────────┘  │
     └─────────────────────────────────────────────┘
```

## Components

### Frontend (Streamlit)

- **Purpose**: Web interface for users
- **Technology**: Streamlit 1.31+
- **Key Features**:
  - Real-time microphone recording (WebRTC)
  - File upload with drag-and-drop
  - Audio waveform visualization
  - Transcript editing and export
  - Voice selection and preview
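
Transcript export can be illustrated with a small formatter. The sketch below assumes SubRip (SRT) as one export format and a hypothetical `Segment` shape with `start`, `end`, and `text` fields; neither the supported formats nor the segment schema is specified in this document.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript segment; field names are assumed for illustration."""
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    text: str


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{format_timestamp(seg.start)} --> {format_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(cues)
```

Keeping the formatter a pure function of the segment list means the same code could serve both an API endpoint and a download button in the frontend.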

### Backend (FastAPI)

- **Purpose**: REST API server
- **Technology**: FastAPI 0.109+
- **Key Features**:
  - OpenAPI documentation
  - CORS middleware
  - JWT authentication (Phase 3)
  - Request validation
  - Error handling

### Worker (Celery)

- **Purpose**: Background task processing
- **Technology**: Celery 5.3+ with Redis broker
- **Key Features**:
  - Long audio file processing
  - Batch transcription
  - NLP analysis tasks

### Database (PostgreSQL)

- **Purpose**: Persistent data storage
- **Technology**: PostgreSQL 15+
- **Tables**:
  - `users` - User accounts
  - `audio_files` - Uploaded audio metadata
  - `transcripts` - Transcription results
  - `user_preferences` - User settings
  - `usage_events` - Analytics data
  - `api_keys` - Enterprise API keys

### Cache (Redis)

- **Purpose**: Caching and task queue
- **Technology**: Redis 7+
- **Use Cases**:
  - Voice list caching
  - Transcription result caching
  - Celery task queue
  - Session storage
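
The voice list use case follows a cache-aside pattern: read from the cache, and on a miss fetch from the upstream API and store the result with an expiry. The sketch below uses an in-memory `TTLCache` as a stand-in for Redis so that it runs without a server; the class, the `tts:voices` key, and the one-hour default TTL are assumptions, not VoiceForge's actual code.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """In-memory stand-in for Redis GET / SET-with-expiry, for illustration only."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: behave like a cache miss
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)


def get_voices(
    cache: TTLCache,
    fetch: Callable[[], list],
    ttl_seconds: float = 3600.0,  # assumed TTL
) -> list:
    """Cache-aside read: return the cached voice list, or fetch and cache it."""
    voices = cache.get("tts:voices")  # hypothetical cache key
    if voices is None:
        voices = fetch()
        cache.set("tts:voices", voices, ttl_seconds)
    return voices
```

With a real Redis client the same shape would apply, with the list serialized (for example as JSON) before being stored.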

### Observability (Prometheus)

- **Purpose**: Application monitoring
- **Technology**: prometheus-fastapi-instrumentator
- **Key Metrics**:
  - Request latency and throughput
  - Error rates
  - Endpoint usage statistics

## Data Flow

### Speech-to-Text Flow

```
1. User uploads audio file
2. Frontend sends to /api/v1/stt/upload
3. Backend validates file format and size
4. File saved to storage
5. STT Service calls Google Cloud Speech API
6. Results processed (words, segments, timestamps)
7. Transcript saved to database
8. Response returned to frontend
```
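
Step 3 of the flow above (format and size validation) might look like the following framework-independent sketch. The allowed extensions and the 10 MiB limit are assumed values chosen for illustration, and `validate_upload` and `UploadError` are hypothetical names.

```python
from pathlib import Path

# Assumed limits for illustration; the real values are not given in this document.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10 MiB


class UploadError(ValueError):
    """Raised when an upload fails validation."""


def validate_upload(filename: str, size_bytes: int) -> str:
    """Return the normalized file extension, or raise UploadError."""
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise UploadError(f"unsupported format: {extension or 'none'}")
    if size_bytes <= 0:
        raise UploadError("file is empty")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise UploadError(f"file exceeds {MAX_UPLOAD_BYTES} bytes")
    return extension
```

Checking the extension alone does not prove the content is valid audio; a stricter variant could also inspect the file header before the file is saved.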

### Text-to-Speech Flow

```
1. User enters text
2. Frontend sends to /api/v1/tts/synthesize
3. Backend validates text and voice
4. TTS Service calls Google Cloud TTS API
5. Audio returned as base64
6. Frontend plays/downloads audio
```
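
Step 5 returns audio as base64 so that binary data can travel inside a JSON body. A minimal sketch of both sides of that exchange follows; the payload field names (`audio_base64`, `format`, `size_bytes`) are assumptions rather than the documented response schema.

```python
import base64


def build_tts_response(audio_bytes: bytes, audio_format: str = "mp3") -> dict:
    """Wrap synthesized audio in a JSON-serializable payload (backend side)."""
    return {
        "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        "format": audio_format,
        "size_bytes": len(audio_bytes),
    }


def decode_tts_response(payload: dict) -> bytes:
    """Recover the raw audio bytes from the payload (frontend side)."""
    audio = base64.b64decode(payload["audio_base64"], validate=True)
    if len(audio) != payload["size_bytes"]:
        raise ValueError("decoded audio length does not match size_bytes")
    return audio
```

Base64 encodes every 3 bytes as 4 characters, so the payload is roughly a third larger than the raw audio. That overhead is usually acceptable for short clips but is worth keeping in mind for long syntheses.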

## Design Decisions

### Why PostgreSQL with JSONB?

- Single database simplifies deployment
- JSONB supports flexible document storage for segments
- SQL for relational queries (users, files)
- Full-text search capability
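
The idea of keeping relational columns and a flexible segments document in the same row can be sketched as below. SQLite stands in for PostgreSQL so the example runs without a server, which means the `segments` column is plain JSON text here instead of JSONB; the column names are assumptions, and only the `transcripts` table name comes from this document.

```python
import json
import sqlite3

# SQLite stands in for PostgreSQL; in PostgreSQL the segments column
# would be declared JSONB rather than TEXT.
SCHEMA = """
CREATE TABLE transcripts (
    id INTEGER PRIMARY KEY,
    audio_file_id INTEGER NOT NULL,
    text TEXT NOT NULL,
    segments TEXT NOT NULL  -- JSON array of {start, end, text} objects
);
"""


def save_transcript(conn: sqlite3.Connection, audio_file_id: int,
                    text: str, segments: list) -> int:
    """Insert a transcript row and return its generated id."""
    cur = conn.execute(
        "INSERT INTO transcripts (audio_file_id, text, segments) VALUES (?, ?, ?)",
        (audio_file_id, text, json.dumps(segments)),
    )
    conn.commit()
    return cur.lastrowid


def load_segments(conn: sqlite3.Connection, transcript_id: int) -> list:
    """Return the decoded segments for one transcript."""
    row = conn.execute(
        "SELECT segments FROM transcripts WHERE id = ?", (transcript_id,)
    ).fetchone()
    if row is None:
        raise KeyError(transcript_id)
    return json.loads(row[0])
```

With real JSONB, the segments could additionally be filtered and indexed inside SQL queries, which this stand-in does not attempt to show.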

### Why Streamlit?

- Rapid development for data apps
- Built-in components for audio
- Easy deployment
- Python-native (no JS required)

### Why Google Cloud APIs?

- High recognition and synthesis quality
- 100+ languages supported
- 200+ voice options for Text-to-Speech
- Free tier that covers low-volume usage

## Security Considerations

- Secrets via environment variables
- HTTPS in production
- JWT for authentication (planned for Phase 3)
- Per-user data isolation
- Temporary file cleanup
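
Reading secrets from environment variables can be sketched as a small settings loader that fails fast at startup when something is missing. The variable names (`DATABASE_URL`, `REDIS_URL`, `JWT_SECRET`) and the `Settings` class are hypothetical; the document does not list the variables VoiceForge actually reads.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Runtime configuration; field names are assumed for illustration."""
    database_url: str
    redis_url: str
    jwt_secret: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, failing fast if any is missing."""
    env = os.environ if env is None else env
    required = {
        "database_url": "DATABASE_URL",
        "redis_url": "REDIS_URL",
        "jwt_secret": "JWT_SECRET",
    }
    # Treat unset and empty values the same way: both count as missing.
    missing = [name for name in required.values() if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(**{field: env[name] for field, name in required.items()})
```

Accepting the environment as a parameter keeps the loader easy to test without touching the real process environment.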

## Deployment Options

### Local Development

```bash
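# Backend (terminal 1): serves the API on http://localhost:8000 by default
cd backend
uvicorn app.main:app --reload

# Frontend (terminal 2, from the repository root): serves the UI on http://localhost:8501 by default
cd frontend
streamlit run streamlit_app.py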
```

### Docker Compose

```bash
docker-compose -f deploy/docker/docker-compose.dev.yml up
```

### Production

- Deploy to any container orchestrator
- Use managed PostgreSQL (Cloud SQL, RDS)
- Use managed Redis (Memorystore, ElastiCache)
- Load balance with Nginx/Cloud LB