# VoiceForge → Universal Communication Platform

## Implementation Plan

Transform VoiceForge from a Speech-to-Text/TTS application into a **Universal Communication Platform** supporting speech, text, translation, and sign language.

> [!IMPORTANT]
> This is a **major expansion** requiring significant development time. We recommend implementing it in phases, with user testing between phases.

---

## User Review Required

### Scope & Priority Decisions

1. **Phase Priority**: Should we start with Phase 1 (Translation, Batch, Live STT) or jump to a different phase?
2. **Sign Language**: This is the most complex feature. Options:
   - **Option A**: ASL only (most resources available)
   - **Option B**: Multi-sign-language support (ASL, ISL, BSL) - requires more training data
   - **Option C**: Defer to Phase 5 until the core features are stable
3. **Voice Cloning**: Coqui TTS is large (~2 GB of models). Accept the increased storage requirements?
4. **API Authentication**: Enable for all endpoints or just the new features?

---

## Proposed Changes

### Phase 1: Translation, Batch & Live STT

#### [NEW] Batch Transcription

```python
async def batch_transcribe(files: List[UploadFile]):
    # Offload to Celery workers using BatchedInferencePipeline (2-3x faster)
    # Return job_id for tracking

@router.get("/batch/{job_id}/status")
async def batch_status(job_id: str):
    # Query Redis/DB for status of multiple files
```

---

#### [NEW] Real-time Live STT WebSocket
`backend/app/api/routes/ws_stt.py`

**Technology**: `faster-whisper` + `silero-vad` (Voice Activity Detection) + WebSockets.

**Architecture**:
1. **Frontend**: Record 16 kHz mono audio; send 100-200 ms binary chunks over the WebSocket.
2. **Backend**:
   - Buffer incoming chunks.
   - Run **Silero-VAD** to detect speech vs. silence.
   - On silence OR a 5 s limit, run `model.transcribe` with `beam_size=1` (greedy) for <100 ms latency.
   - Send partial/final results back.
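The silence-or-5-second flush policy above can be sketched as plain buffering logic. Everything here is illustrative, not existing VoiceForge code: `SegmentBuffer`, its thresholds, and the chunk accounting are assumptions; in the real service the `is_speech` flag would come from Silero-VAD and the flushed audio would go to `model.transcribe(..., beam_size=1)`.

```python
SAMPLE_RATE = 16_000          # 16 kHz mono PCM, as sent by the frontend
MAX_SEGMENT_SECONDS = 5.0     # hard flush limit
SILENCE_CHUNKS_TO_FLUSH = 3   # a few consecutive silent chunks ends an utterance

class SegmentBuffer:
    """Accumulates 100-200 ms audio chunks and decides when to finalize a segment."""

    def __init__(self):
        self.chunks = []      # raw PCM chunks
        self.samples = 0      # total buffered samples
        self.silent_run = 0   # consecutive chunks the VAD flagged as silence

    def add_chunk(self, pcm_bytes: bytes, is_speech: bool) -> bool:
        """Buffer one chunk; return True when the segment should be transcribed."""
        self.chunks.append(pcm_bytes)
        self.samples += len(pcm_bytes) // 2   # 16-bit samples
        self.silent_run = 0 if is_speech else self.silent_run + 1
        over_limit = self.samples / SAMPLE_RATE >= MAX_SEGMENT_SECONDS
        utterance_ended = self.silent_run >= SILENCE_CHUNKS_TO_FLUSH and self.samples > 0
        return over_limit or utterance_ended

    def flush(self) -> bytes:
        """Return the buffered audio and reset for the next segment."""
        audio = b"".join(self.chunks)
        self.chunks, self.samples, self.silent_run = [], 0, 0
        return audio
```

The WebSocket handler would call `add_chunk` per received frame and, whenever it returns `True`, run transcription on `flush()` and send the result back as a final segment.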
**Frontend**: New live transcription component with waveform visualization.

---

### Phase 2: AI Intelligence Layer

#### [NEW] Intelligent Meeting Minutes
`backend/app/services/meeting_service.py`
`frontend/pages/8_📅_Meeting_Minutes.py`

**Goal**: Transform audio recordings into structured meeting minutes with speakers and action items.

**Changes**:
1. **Database**: Update the `Transcript` model (`backend/app/models/transcript.py`) to add an `action_items` (JSON) column.
2. **NLP Service**: Update `backend/app/services/nlp_service.py` to add regex-based action item extraction.
3. **Meeting Service**: Create an orchestrator service:
   - Input: audio file
   - Step 1: STT + diarization (reuse `DiarizationService`)
   - Step 2: sentiment analysis (reuse `NLPService`)
   - Step 3: summarization (reuse `NLPService`)
   - Step 4: action item extraction (new)
   - Output: JSON with all metadata + DB record
4. **Frontend**: Dashboard to view minutes, filter by speaker, and export to PDF.

**Key Dependencies**: `pyannote.audio` (already installed), `sumy` (already installed).

---

#### [PENDING] Emotion & Sentiment Analysis (to be implemented in Phase 2.2)
`backend/app/services/emotion_service.py`

**Technology**: **HuBERT-Large (fine-tuned on IEMOCAP)**. 2024 comparisons suggest HuBERT captures prosody/pitch better than Wav2Vec2.

```python
class EmotionService:
    - analyze_audio(audio_path) → {emotion: str, intensity: float, timeline: List}
    # Emotions: neutral, happy, sad, angry, fearful, surprised
```

---

#### [NEW] Custom Vocabulary
`backend/app/services/vocabulary_service.py`

**Strategy**: Initialize `faster-whisper` with an `initial_prompt`, or use keyword boosting (if supported by the CTranslate2 version).
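One way to realize the `initial_prompt` half of this strategy is to fold a user's custom terms into a short priming string. This is a sketch under stated assumptions: `build_initial_prompt`, the preset term lists, and the character budget are illustrative, not existing VoiceForge code.

```python
def build_initial_prompt(terms: list[str], max_chars: int = 600) -> str:
    """Join domain terms into a priming string for Whisper's context,
    deduplicated case-insensitively and trimmed to a character budget."""
    seen, unique = set(), []
    for term in terms:
        key = term.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(term.strip())
    prompt = "Glossary: " + ", ".join(unique) + "."
    return prompt[:max_chars]

# Presets could then simply be curated term lists (examples only):
PRESETS = {
    "Medical": ["tachycardia", "metoprolol", "echocardiogram"],
    "Legal": ["estoppel", "voir dire", "amicus curiae"],
}

# Intended use with faster-whisper (not executed here):
# segments, info = model.transcribe(
#     audio_path, initial_prompt=build_initial_prompt(PRESETS["Medical"]))
```

The trim matters because Whisper only attends to a limited prompt window, so very large vocabularies need to be prioritized rather than concatenated wholesale.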
```python
class VocabularyService:
    - generate_initial_prompt(user_id)           # Feed custom words into Whisper's context
    - apply_corrections(transcript, user_vocab)  # Regex-based post-processing
```

**Presets**: Medical, Legal, Technical, Financial

---

### Phase 3: Advanced Audio

---

#### [NEW] Audio Editor
`backend/app/services/audio_editor_service.py`

```python
class AudioEditorService:
    - trim(audio_path, start, end) → trimmed_audio
    - merge(audio_paths: List) → merged_audio
    - convert_format(audio_path, target_format)
    - extract_segment(audio_path, timestamp)
```

**Frontend**: Waveform editor with drag-select for trim/cut.

---

#### [NEW] Voice Cloning
`backend/app/services/voice_clone_service.py`

**Technology**: Coqui XTTS v2 (multilingual; needs a 3-10 s voice sample).

```python
class VoiceCloneService:
    - clone_voice(sample_audio) → voice_id
    - synthesize_with_voice(text, voice_id) → audio
    - list_cloned_voices(user_id)
    - delete_voice(voice_id)
```

> [!WARNING]
> Voice cloning has ethical implications. Consider adding consent verification.

---

### Phase 4: Revolutionary - Sign Language

---

#### [NEW] Sign Language Recognition 🤟
`backend/app/services/sign_language_service.py`

**Technology**: **MediaPipe Holistic** (feature extractor) + **1D Transformer Encoder** (sequence classifier).

**Pipeline**:
1. Record video (24-30 FPS).
2. Extract landmarks (hands, pose, a reduced set of face points).
3. Normalize coordinates relative to the body root.
4. A Transformer encoder classifies frame sequences into labels (trained on WLASL/MS-ASL).

**Frontend**: Webcam capture with real-time recognition overlay.

---

#### [NEW] Sign Language Generation
`backend/app/services/sign_avatar_service.py`

**Technology**: **Three.js + Ready Player Me avatar** (or similar) driven by pose/hand animation data.

**Pipeline**:
1. Text → sign search (dictionary lookup for glosses).
2. Glosses → pose animation stream (pre-recorded BVH/landmark data).
3.
   Render the 3D avatar performing the signs in Streamlit via `streamlit-threejs` or a static video.

---

### Phase 5: Platform & API

---

#### [MODIFY] API Authentication
`backend/app/core/security.py`
`backend/app/api/routes/auth.py`

```python
# Add API key management
@router.post("/api-keys")
async def create_api_key(user: User):
    # Generate API key for programmatic access

# Rate limiting middleware
class RateLimitMiddleware:
    - check_rate_limit(api_key)
    - track_usage(api_key, endpoint)
```

---

## New Dependencies

Add to `backend/requirements.txt`:

```txt
# Translation
transformers>=4.36.0
sentencepiece>=0.1.99

# Emotion Detection
speechbrain>=0.5.16

# Voice Cloning
TTS>=0.22.0  # Coqui TTS

# Sign Language
mediapipe>=0.10.9
opencv-python>=4.9.0

# Audio Editing (enhanced)
moviepy>=1.0.3
```

---

## Verification Plan

### Existing Tests (Verified)

Located in `backend/tests/`:
- `test_api_integration.py` - API endpoint tests
- `test_nlp.py` - NLP service tests
- `test_export.py` - Export format tests
- `test_diarization.py` - Diarization tests

**Run command**:
```bash
cd backend && pytest tests/ -v
```

### New Tests to Add

| Feature | Test File | Type |
|---------|-----------|------|
| Translation | `test_translation.py` | Unit + Integration |
| Batch Processing | `test_batch.py` | Integration |
| Live STT WebSocket | `test_ws_stt.py` | WebSocket test |
| Meeting Minutes | `test_meeting.py` | Integration |
| Emotion Detection | `test_emotion.py` | Unit |
| Audio Editor | `test_audio_editor.py` | Unit |

### Manual Testing

1. **Translation Flow**:
   - Upload Hindi audio → verify English transcript + translated audio output
2. **Sign Language** (requires webcam):
   - User performs ASL signs → verify the text output matches
3.
   **Voice Cloning**:
   - Upload a 10 s voice sample → generate TTS → verify voice similarity

---

## Architecture Diagram

```mermaid
graph TB
    subgraph FE["Frontend - Streamlit"]
        UI[Universal UI]
        PAGES["Transcribe | Synthesize | Translate | Sign | Meeting"]
    end
    subgraph GW["API Gateway"]
        AUTH[JWT/API Key Auth]
        RATE[Rate Limiter]
    end
    subgraph CORE["Core Services"]
        STT[Whisper STT]
        TTS[Edge TTS]
        TRANS[Translation]
        NLP[NLP Analysis]
    end
    subgraph AI["AI Intelligence"]
        MEETING[Meeting Minutes]
        EMOTION[Emotion Detection]
        VOCAB[Custom Vocabulary]
    end
    subgraph ADV["Advanced Features"]
        EDIT[Audio Editor]
        CLONE[Voice Cloning]
    end
    subgraph REV["Revolutionary"]
        SIGN_IN[Sign Recognition]
        SIGN_OUT[Sign Avatar]
    end

    UI --> AUTH
    AUTH --> CORE
    CORE --> AI
    CORE --> ADV
    CORE --> REV
```

---

## Estimated Timeline

| Phase | Features | Estimated Time |
|-------|----------|----------------|
| 1 | Translation, Batch, Live STT | 2-3 days |
| 2 | Meeting Minutes, Emotion, Vocabulary | 2-3 days |
| 3 | Audio Editor, Voice Cloning | 2-3 days |
| 4 | Sign Language (Recognition + Generation) | 5-7 days |
| 5 | API Auth, Landing Page | 1-2 days |

**Total**: ~12-18 days for full implementation

---

## Questions for You

1. **Which phase should we start with?** We recommend Phase 1 (Translation, Batch, Live STT) as the foundation.
2. **Sign language priority**: Start with ASL only, or multi-language from the beginning?
3. **Voice cloning consent**: Add a user consent checkbox before allowing voice cloning?
4. **Hosting**: Any preference for model hosting (local vs. cloud inference)?