
# VoiceForge → Universal Communication Platform

## Implementation Plan

**Goal**: Transform VoiceForge from a Speech-to-Text/TTS application into a Universal Communication Platform supporting speech, text, translation, and sign language.

This is a major expansion requiring significant development time. We recommend implementing in phases, with user testing between phases.


## User Review Required

### Scope & Priority Decisions

1. **Phase Priority**: Should we start with Phase 1 (Translation, Batch, Live STT) or jump to a different phase?
2. **Sign Language**: This is the most complex feature. Options:
   - **Option A**: ASL only (most resources available)
   - **Option B**: Multi-sign-language support (ASL, ISL, BSL) - requires more training data
   - **Option C**: Defer to Phase 5 after core features are stable
3. **Voice Cloning**: Coqui TTS is large (~2GB models). Accept the increased storage requirements?
4. **API Authentication**: Enable for all endpoints or just new features?

## Proposed Changes

### Phase 1: Translation, Batch & Live STT

#### [NEW] Batch Transcription

```python
async def batch_transcribe(files: List[UploadFile]):
    # Offload to Celery workers using BatchedInferencePipeline (2-3x faster)
    # Return job_id for tracking
    ...

@router.get("/batch/{job_id}/status")
async def batch_status(job_id: str):
    # Query Redis/DB for status of multiple files
    ...
```
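The job tracking behind these endpoints can be sketched independently of Celery; a minimal in-memory stand-in for the Redis/DB store (`JobStore` and `JobStatus` are illustrative names, not from the codebase):

```python
import uuid
from enum import Enum
from typing import Dict, List


class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


class JobStore:
    """Tracks per-file status for a batch transcription job (in-memory stand-in for Redis)."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Dict[str, JobStatus]] = {}

    def create(self, filenames: List[str]) -> str:
        """Register a new batch job and return its job_id."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {name: JobStatus.PENDING for name in filenames}
        return job_id

    def update(self, job_id: str, filename: str, status: JobStatus) -> None:
        self._jobs[job_id][filename] = status

    def status(self, job_id: str) -> dict:
        """Summarize progress for the /batch/{job_id}/status endpoint."""
        files = self._jobs[job_id]
        done = sum(1 for s in files.values() if s == JobStatus.DONE)
        return {
            "progress": f"{done}/{len(files)}",
            "files": {name: s.value for name, s in files.items()},
        }
```

In production the same interface would be backed by Redis hashes so Celery workers and API processes share state.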


---

#### [NEW] Real-time Live STT WebSocket
`backend/app/api/routes/ws_stt.py`

**Technology**: `faster-whisper` + `silero-vad` (Voice Activity Detection) + WebSockets.

**Architecture**:
1. **Frontend**: Record 16kHz mono audio, send 100-200ms binary chunks via WS.
2. **Backend**: 
   - Buffer chunks.
   - Run **Silero-VAD** to detect speech vs silence.
   - On silence OR 5s limit, run `model.transcribe` with `beam_size=1` (greedy) for <100ms latency.
   - Send partial/final results back.
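The buffer-and-flush policy in step 2 can be sketched as a small pure class; the 5 s cap and 16 kHz mono format follow the plan, while `ChunkBuffer` itself is an illustrative name (the real endpoint would feed each chunk through Silero-VAD to get the `is_speech` verdict):

```python
from typing import List, Optional


class ChunkBuffer:
    """Accumulates 16 kHz mono 16-bit PCM chunks; flushes on silence or a 5 s cap."""

    MAX_SECONDS = 5.0
    SAMPLE_RATE = 16000
    BYTES_PER_SAMPLE = 2  # 16-bit PCM

    def __init__(self) -> None:
        self._chunks: List[bytes] = []
        self._bytes = 0

    def _seconds(self) -> float:
        return self._bytes / (self.SAMPLE_RATE * self.BYTES_PER_SAMPLE)

    def add(self, chunk: bytes, is_speech: bool) -> Optional[bytes]:
        """Append a chunk with its VAD verdict; return buffered audio when it is time to transcribe."""
        self._chunks.append(chunk)
        self._bytes += len(chunk)
        if not is_speech or self._seconds() >= self.MAX_SECONDS:
            audio = b"".join(self._chunks)
            self._chunks, self._bytes = [], 0
            return audio
        return None
```

When `add` returns audio, the handler would run `model.transcribe(..., beam_size=1)` and push the result down the WebSocket.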

**Frontend**: New live transcription component with waveform visualization

---

### Phase 2: AI Intelligence Layer

#### [NEW] Intelligent Meeting Minutes
`backend/app/services/meeting_service.py`
`frontend/pages/8_📅_Meeting_Minutes.py`

**Goal**: Transform audio recordings into structured meeting minutes with speakers and action items.

**Changes**:
1.  **Database**: Update `Transcript` model (`backend/app/models/transcript.py`) to add `action_items` (JSON) column.
2.  **NLP Service**: Update `backend/app/services/nlp_service.py` to add regex-based action item extraction.
3.  **Meeting Service**: Create orchestrator service:
    *   Input: Audio File
    *   Step 1: STT + Diarization (reuse `DiarizationService`)
    *   Step 2: Sentiment Analysis (reuse `NLPService`)
    *   Step 3: Summarization (reuse `NLPService`)
    *   Step 4: Action Item Extraction (new)
    *   Output: JSON with all metadata + DB record.
4.  **Frontend**: Dashboard to view minutes, filter by speaker, and export to PDF.
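Step 4's regex-based extraction could start as simply as matching commitment phrases; a hedged sketch (the patterns are illustrative starting points, not tuned against real transcripts):

```python
import re
from typing import List

# Commitment phrases that typically introduce an action item (illustrative, not exhaustive)
ACTION_PATTERNS = [
    re.compile(r"\b(?:I|we|you)\s+will\s+(.+?)(?:\.|$)", re.IGNORECASE),
    re.compile(r"\b(?:needs? to|has to|should)\s+(.+?)(?:\.|$)", re.IGNORECASE),
    re.compile(r"\baction item:?\s*(.+?)(?:\.|$)", re.IGNORECASE),
]


def extract_action_items(transcript: str) -> List[str]:
    """Scan a transcript line by line and collect phrases that look like commitments."""
    items = []
    for line in transcript.splitlines():
        for pattern in ACTION_PATTERNS:
            for match in pattern.finditer(line):
                items.append(match.group(1).strip())
    return items
```

The extracted list would be serialized into the new `action_items` JSON column on `Transcript`.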

**Key Dependencies**: `pyannote.audio` (already installed), `sumy` (already installed).

---

#### [PENDING] Emotion & Sentiment Analysis
(To be implemented in Phase 2.2)
`backend/app/services/emotion_service.py`

**Technology**: **HuBERT-Large (fine-tuned on IEMOCAP)**. Recent (2024) benchmarks suggest HuBERT captures prosody/pitch better than Wav2Vec2.

```python
class EmotionService:
    def analyze_audio(self, audio_path: str) -> dict:
        # Returns {emotion: str, intensity: float, timeline: List}
        # Emotions: neutral, happy, sad, angry, fearful, surprised
        ...
```
#### [NEW] Custom Vocabulary
`backend/app/services/vocabulary_service.py`

**Strategy**: Initialize `faster-whisper` with `initial_prompt`, or use keyword boosting (if supported by the CTranslate2 version).

```python
class VocabularyService:
    def generate_initial_prompt(self, user_id: str) -> str:
        # Feed custom words into Whisper's context
        ...

    def apply_corrections(self, transcript: str, user_vocab: dict) -> str:
        # Regex-based post-processing
        ...
```

**Presets**: Medical, Legal, Technical, Financial
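Both methods can be sketched as pure functions (the `{wrong: right}` vocab shape and the word cap are assumptions; Whisper's `initial_prompt` is limited to roughly half its 448-token context, so the prompt must stay short):

```python
import re
from typing import Dict, List


def generate_initial_prompt(custom_words: List[str], max_words: int = 32) -> str:
    """Join custom terms into a short prompt Whisper can condition on."""
    return "Vocabulary: " + ", ".join(custom_words[:max_words])


def apply_corrections(transcript: str, user_vocab: Dict[str, str]) -> str:
    """Regex post-processing: replace common mis-transcriptions with the user's spelling."""
    for wrong, right in user_vocab.items():
        transcript = re.sub(
            rf"\b{re.escape(wrong)}\b", right, transcript, flags=re.IGNORECASE
        )
    return transcript
```

A preset (Medical, Legal, ...) would just be a curated word list fed to `generate_initial_prompt` plus a correction map for `apply_corrections`.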


### Phase 3: Advanced Audio

#### [NEW] Audio Editor
`backend/app/services/audio_editor_service.py`

```python
class AudioEditorService:
    def trim(self, audio_path: str, start: float, end: float) -> str:
        ...

    def merge(self, audio_paths: List[str]) -> str:
        ...

    def convert_format(self, audio_path: str, target_format: str) -> str:
        ...

    def extract_segment(self, audio_path: str, timestamp: float) -> str:
        ...
```

**Frontend**: Waveform editor with drag-select for trim/cut
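Trim can shell out to ffmpeg (already a transitive requirement of `moviepy`); a minimal sketch of the command construction, with the flag layout as the only design decision (`-ss`/`-t` before `-i` does fast input-side seeking, `-c copy` avoids re-encoding):

```python
from typing import List


def build_trim_command(audio_path: str, start: float, end: float, out_path: str) -> List[str]:
    """Build an ffmpeg argument list that copies [start, end] of audio_path to out_path."""
    return [
        "ffmpeg", "-y",          # overwrite output without prompting
        "-ss", str(start),        # seek to start (input-side, fast)
        "-t", str(end - start),   # keep only this duration
        "-i", audio_path,
        "-c", "copy",             # no re-encode: lossless and fast
        out_path,
    ]
```

The service would then run it with `subprocess.run(cmd, check=True)`; merge and format conversion follow the same pattern with the `concat` demuxer and an output extension change respectively.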


#### [NEW] Voice Cloning
`backend/app/services/voice_clone_service.py`

**Technology**: Coqui XTTS v2 (multilingual; needs a 3-10s voice sample)

```python
class VoiceCloneService:
    def clone_voice(self, sample_audio: str) -> str:
        # Returns a voice_id
        ...

    def synthesize_with_voice(self, text: str, voice_id: str) -> bytes:
        ...

    def list_cloned_voices(self, user_id: str) -> list:
        ...

    def delete_voice(self, voice_id: str) -> None:
        ...
```

**Note**: Voice cloning has ethical implications. Consider adding consent verification.
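The voice_id bookkeeping, and the consent check suggested above, can be sketched independently of the XTTS model itself (`VoiceRegistry` is an illustrative name; persistence would live in the DB, not a dict):

```python
import uuid
from typing import Dict, List


class VoiceRegistry:
    """Tracks cloned voices per user; refuses to register a sample without recorded consent."""

    def __init__(self) -> None:
        self._voices: Dict[str, dict] = {}

    def clone_voice(self, user_id: str, sample_path: str, consent_given: bool) -> str:
        if not consent_given:
            raise PermissionError("Voice cloning requires explicit consent")
        voice_id = uuid.uuid4().hex
        self._voices[voice_id] = {"user_id": user_id, "sample": sample_path}
        return voice_id

    def list_cloned_voices(self, user_id: str) -> List[str]:
        return [vid for vid, v in self._voices.items() if v["user_id"] == user_id]

    def delete_voice(self, voice_id: str) -> None:
        self._voices.pop(voice_id, None)
```

`synthesize_with_voice` would then look up the stored sample path and pass it to XTTS as the speaker reference.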


### Phase 4: Revolutionary - Sign Language

#### [NEW] Sign Language Recognition 🤟
`backend/app/services/sign_language_service.py`

**Technology**: MediaPipe Holistic (feature extractor) + 1D Transformer Encoder (sequence classifier).

**Pipeline**:

  1. Video recording (24-30 FPS).
  2. Extract landmarks (Hands, Pose, minimized Face points).
  3. Normalize coordinates relative to body root.
  4. Transformer Encoder classifies frames into labels (labels trained on WLASL/MS-ASL).
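Step 3 is what makes the classifier position- and scale-invariant; a numpy sketch, assuming landmarks arrive as an (N, 3) array per frame with the body root at a known index:

```python
import numpy as np


def normalize_landmarks(landmarks: np.ndarray, root_index: int = 0) -> np.ndarray:
    """Translate landmarks so the body root is the origin, then scale to unit max distance.

    landmarks: (N, 3) array of x, y, z coordinates for one frame.
    """
    centered = landmarks - landmarks[root_index]
    scale = np.linalg.norm(centered, axis=1).max()
    if scale == 0:
        return centered  # degenerate frame: all points coincide
    return centered / scale
```

The normalized per-frame vectors are then stacked over time into the sequence the Transformer Encoder consumes.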

**Frontend**: Webcam capture with real-time recognition overlay


#### [NEW] Sign Language Generation
`backend/app/services/sign_avatar_service.py`

**Technology**: Three.js + Ready Player Me avatar (or similar) mapped to pose/hand animation data.

**Pipeline**:

  1. Text → Sign Search (Dictionary lookup for glosses).
  2. Glosses → Pose Animation Stream (Pre-recorded BVH/Landmark data).
  3. Render 3D Avatar performing signs in Streamlit via streamlit-threejs or static video.
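Step 1's dictionary lookup can be sketched with a plain mapping; the mini-lexicon here is illustrative — a real system would back it with a WLASL-scale gloss dictionary and fall back to fingerspelling for out-of-vocabulary words:

```python
from typing import Dict, List


def text_to_glosses(text: str, lexicon: Dict[str, str]) -> List[str]:
    """Map each word to its sign gloss; fingerspell unknown words letter by letter."""
    glosses = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in lexicon:
            glosses.append(lexicon[word])
        else:
            # Unknown word: emit one fingerspelling gloss per letter
            glosses.extend(f"FS-{c.upper()}" for c in word)
    return glosses
```

Each gloss then keys into the pre-recorded BVH/landmark animation stream for step 2.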

### Phase 5: Platform & API

#### [MODIFY] API Authentication
`backend/app/core/security.py`, `backend/app/api/routes/auth.py`

```python
# Add API key management
@router.post("/api-keys")
async def create_api_key(user: User):
    # Generate an API key for programmatic access
    ...

# Rate limiting middleware
class RateLimitMiddleware:
    def check_rate_limit(self, api_key: str): ...
    def track_usage(self, api_key: str, endpoint: str): ...
```
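The middleware's check can start as a fixed-window counter per API key; a minimal in-memory sketch (Redis `INCR`/`EXPIRE` would back it in production, and the default limits are illustrative):

```python
import time
from collections import defaultdict
from typing import Dict, Optional, Tuple


class RateLimiter:
    """Fixed-window rate limiting: at most `limit` requests per key per `window` seconds."""

    def __init__(self, limit: int = 60, window: float = 60.0) -> None:
        self.limit = limit
        self.window = window
        # (api_key, window_number) -> request count
        self._counts: Dict[Tuple[str, int], int] = defaultdict(int)

    def check_rate_limit(self, api_key: str, now: Optional[float] = None) -> bool:
        """Return True if the request is allowed; False once the window's quota is spent."""
        now = time.time() if now is None else now
        bucket = (api_key, int(now // self.window))
        if self._counts[bucket] >= self.limit:
            return False
        self._counts[bucket] += 1
        return True
```

Fixed windows allow a brief burst at window boundaries; a sliding-window or token-bucket variant would smooth that out if it matters.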

## New Dependencies

Add to `backend/requirements.txt`:

```
# Translation
transformers>=4.36.0
sentencepiece>=0.1.99

# Emotion Detection
speechbrain>=0.5.16

# Voice Cloning
TTS>=0.22.0  # Coqui TTS

# Sign Language
mediapipe>=0.10.9
opencv-python>=4.9.0

# Audio Editing (enhanced)
moviepy>=1.0.3
```

## Verification Plan

### Existing Tests (Verified)

Located in `backend/tests/`:

- `test_api_integration.py` - API endpoint tests
- `test_nlp.py` - NLP service tests
- `test_export.py` - Export format tests
- `test_diarization.py` - Diarization tests

Run command:

```bash
cd backend && pytest tests/ -v
```

### New Tests to Add

| Feature | Test File | Type |
| --- | --- | --- |
| Translation | `test_translation.py` | Unit + Integration |
| Batch Processing | `test_batch.py` | Integration |
| Live STT WebSocket | `test_ws_stt.py` | WebSocket test |
| Meeting Minutes | `test_meeting.py` | Integration |
| Emotion Detection | `test_emotion.py` | Unit |
| Audio Editor | `test_audio_editor.py` | Unit |

### Manual Testing

1. **Translation Flow**: Upload Hindi audio → verify English transcript + translated audio output.
2. **Sign Language** (requires webcam): Perform ASL signs → verify the text output matches.
3. **Voice Cloning**: Upload a 10s voice sample → generate TTS → verify voice similarity.

## Architecture Diagram

```mermaid
graph TB
    subgraph FE["Frontend - Streamlit"]
        UI[Universal UI]
        PAGES["Transcribe | Synthesize | Translate | Sign | Meeting"]
    end

    subgraph GW["API Gateway"]
        AUTH[JWT/API Key Auth]
        RATE[Rate Limiter]
    end

    subgraph CORE["Core Services"]
        STT[Whisper STT]
        TTS[Edge TTS]
        TRANS[Translation]
        NLP[NLP Analysis]
    end

    subgraph AI["AI Intelligence"]
        MEETING[Meeting Minutes]
        EMOTION[Emotion Detection]
        VOCAB[Custom Vocabulary]
    end

    subgraph ADV["Advanced Features"]
        EDIT[Audio Editor]
        CLONE[Voice Cloning]
    end

    subgraph REV["Revolutionary"]
        SIGN_IN[Sign Recognition]
        SIGN_OUT[Sign Avatar]
    end

    UI --> AUTH
    AUTH --> CORE
    CORE --> AI
    CORE --> ADV
    CORE --> REV
```

## Estimated Timeline

| Phase | Features | Estimated Time |
| --- | --- | --- |
| 1 | Translation, Batch, Live STT | 2-3 days |
| 2 | Meeting Minutes, Emotion, Vocabulary | 2-3 days |
| 3 | Audio Editor, Voice Cloning | 2-3 days |
| 4 | Sign Language (Recognition + Generation) | 5-7 days |
| 5 | API Auth, Landing Page | 1-2 days |

**Total**: ~12-18 days for full implementation


## Questions for You

1. **Which phase should we start with?** We recommend Phase 1 (Translation, Batch, Live STT) as the foundation.
2. **Sign Language priority**: Start with ASL only, or multi-language from the beginning?
3. **Voice Cloning consent**: Add a user consent checkbox before allowing voice cloning?
4. **Hosting**: Any preferences for model hosting (local vs. cloud inference)?