# VoiceForge → Universal Communication Platform

## Implementation Plan

Transform VoiceForge from a Speech-to-Text/TTS application into a **Universal Communication Platform** supporting speech, text, translation, and sign language.

> [!IMPORTANT]
> This is a **major expansion** requiring significant development time. We recommend implementing it in phases, with user testing between phases.

---

## User Review Required

### Scope & Priority Decisions

1. **Phase Priority**: Should we start with Phase 1 (Translation, Batch, Live STT) or jump to a different phase?
2. **Sign Language**: This is the most complex feature. Options:
   - **Option A**: ASL only (most resources available)
   - **Option B**: Multi-sign-language support (ASL, ISL, BSL) - requires more training data
   - **Option C**: Defer to Phase 5 until the core features are stable
3. **Voice Cloning**: Coqui TTS is large (~2 GB of models). Accept the increased storage requirements?
4. **API Authentication**: Enable for all endpoints or just the new features?

---

## Proposed Changes

### Phase 1: Translation, Batch & Live STT

#### [NEW] Batch Transcription

```python
async def batch_transcribe(files: List[UploadFile]):
    # Offload to Celery workers using BatchedInferencePipeline (2-3x faster)
    # Return job_id for tracking

@router.get("/batch/{job_id}/status")
async def batch_status(job_id: str):
    # Query Redis/DB for status of multiple files
```

---

#### [NEW] Real-time Live STT WebSocket
`backend/app/api/routes/ws_stt.py`

**Technology**: `faster-whisper` + `silero-vad` (Voice Activity Detection) + WebSockets.

**Architecture**:
1. **Frontend**: Record 16 kHz mono audio; send 100-200 ms binary chunks over the WebSocket.
2. **Backend**:
   - Buffer incoming chunks.
   - Run **Silero-VAD** to detect speech vs. silence.
   - On silence OR a 5 s limit, run `model.transcribe` with `beam_size=1` (greedy) for <100 ms latency.
   - Send partial/final results back.
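The silence-or-5-second flush policy above can be sketched as plain buffering logic. Everything here is illustrative, not existing VoiceForge code: `SegmentBuffer`, its thresholds, and the chunk accounting are assumptions; in the real service the `is_speech` flag would come from Silero-VAD and the flushed audio would go to `model.transcribe(..., beam_size=1)`.

```python
SAMPLE_RATE = 16_000          # 16 kHz mono PCM, as sent by the frontend
MAX_SEGMENT_SECONDS = 5.0     # hard flush limit
SILENCE_CHUNKS_TO_FLUSH = 3   # a few consecutive silent chunks ends an utterance

class SegmentBuffer:
    """Accumulates 100-200 ms audio chunks and decides when to finalize a segment."""

    def __init__(self):
        self.chunks = []      # raw PCM chunks
        self.samples = 0      # total buffered samples
        self.silent_run = 0   # consecutive chunks the VAD flagged as silence

    def add_chunk(self, pcm_bytes: bytes, is_speech: bool) -> bool:
        """Buffer one chunk; return True when the segment should be transcribed."""
        self.chunks.append(pcm_bytes)
        self.samples += len(pcm_bytes) // 2   # 16-bit samples
        self.silent_run = 0 if is_speech else self.silent_run + 1
        over_limit = self.samples / SAMPLE_RATE >= MAX_SEGMENT_SECONDS
        utterance_ended = self.silent_run >= SILENCE_CHUNKS_TO_FLUSH and self.samples > 0
        return over_limit or utterance_ended

    def flush(self) -> bytes:
        """Return the buffered audio and reset for the next segment."""
        audio = b"".join(self.chunks)
        self.chunks, self.samples, self.silent_run = [], 0, 0
        return audio
```

The WebSocket handler would call `add_chunk` per received frame and, whenever it returns `True`, run transcription on `flush()` and send the result back as a final segment.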
**Frontend**: New live transcription component with waveform visualization.

---

### Phase 2: AI Intelligence Layer

#### [NEW] Intelligent Meeting Minutes
`backend/app/services/meeting_service.py`
`frontend/pages/8_📅_Meeting_Minutes.py`

**Goal**: Transform audio recordings into structured meeting minutes with speakers and action items.

**Changes**:
1. **Database**: Update the `Transcript` model (`backend/app/models/transcript.py`) to add an `action_items` (JSON) column.
2. **NLP Service**: Update `backend/app/services/nlp_service.py` to add regex-based action item extraction.
3. **Meeting Service**: Create an orchestrator service:
   - Input: audio file
   - Step 1: STT + diarization (reuse `DiarizationService`)
   - Step 2: sentiment analysis (reuse `NLPService`)
   - Step 3: summarization (reuse `NLPService`)
   - Step 4: action item extraction (new)
   - Output: JSON with all metadata + DB record
4. **Frontend**: Dashboard to view minutes, filter by speaker, and export to PDF.

**Key Dependencies**: `pyannote.audio` (already installed), `sumy` (already installed).

---

#### [PENDING] Emotion & Sentiment Analysis (to be implemented in Phase 2.2)
`backend/app/services/emotion_service.py`

**Technology**: **HuBERT-Large (fine-tuned on IEMOCAP)**. 2024 comparisons suggest HuBERT captures prosody/pitch better than Wav2Vec2.

```python
class EmotionService:
    - analyze_audio(audio_path) → {emotion: str, intensity: float, timeline: List}
    # Emotions: neutral, happy, sad, angry, fearful, surprised
```

---

#### [NEW] Custom Vocabulary
`backend/app/services/vocabulary_service.py`

**Strategy**: Initialize `faster-whisper` with an `initial_prompt`, or use keyword boosting (if supported by the CTranslate2 version).
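One way to realize the `initial_prompt` half of this strategy is to fold a user's custom terms into a short priming string. This is a sketch under stated assumptions: `build_initial_prompt`, the preset term lists, and the character budget are illustrative, not existing VoiceForge code.

```python
def build_initial_prompt(terms: list[str], max_chars: int = 600) -> str:
    """Join domain terms into a priming string for Whisper's context,
    deduplicated case-insensitively and trimmed to a character budget."""
    seen, unique = set(), []
    for term in terms:
        key = term.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(term.strip())
    prompt = "Glossary: " + ", ".join(unique) + "."
    return prompt[:max_chars]

# Presets could then simply be curated term lists (examples only):
PRESETS = {
    "Medical": ["tachycardia", "metoprolol", "echocardiogram"],
    "Legal": ["estoppel", "voir dire", "amicus curiae"],
}

# Intended use with faster-whisper (not executed here):
# segments, info = model.transcribe(
#     audio_path, initial_prompt=build_initial_prompt(PRESETS["Medical"]))
```

The trim matters because Whisper only attends to a limited prompt window, so very large vocabularies need to be prioritized rather than concatenated wholesale.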
```python
class VocabularyService:
    - generate_initial_prompt(user_id)           # Feed custom words into Whisper's context
    - apply_corrections(transcript, user_vocab)  # Regex-based post-processing
```

**Presets**: Medical, Legal, Technical, Financial

---

### Phase 3: Advanced Audio

---

#### [NEW] Audio Editor
`backend/app/services/audio_editor_service.py`

```python
class AudioEditorService:
    - trim(audio_path, start, end) → trimmed_audio
    - merge(audio_paths: List) → merged_audio
    - convert_format(audio_path, target_format)
    - extract_segment(audio_path, timestamp)
```

**Frontend**: Waveform editor with drag-select for trim/cut.

---

#### [NEW] Voice Cloning
`backend/app/services/voice_clone_service.py`

**Technology**: Coqui XTTS v2 (multilingual; needs a 3-10 s voice sample).

```python
class VoiceCloneService:
    - clone_voice(sample_audio) → voice_id
    - synthesize_with_voice(text, voice_id) → audio
    - list_cloned_voices(user_id)
    - delete_voice(voice_id)
```

> [!WARNING]
> Voice cloning has ethical implications. Consider adding consent verification.

---

### Phase 4: Revolutionary - Sign Language

---

#### [NEW] Sign Language Recognition 🤟
`backend/app/services/sign_language_service.py`

**Technology**: **MediaPipe Holistic** (feature extractor) + **1D Transformer Encoder** (sequence classifier).

**Pipeline**:
1. Record video (24-30 FPS).
2. Extract landmarks (hands, pose, a reduced set of face points).
3. Normalize coordinates relative to the body root.
4. A Transformer encoder classifies frame sequences into labels (trained on WLASL/MS-ASL).

**Frontend**: Webcam capture with real-time recognition overlay.

---

#### [NEW] Sign Language Generation
`backend/app/services/sign_avatar_service.py`

**Technology**: **Three.js + Ready Player Me avatar** (or similar) driven by pose/hand animation data.

**Pipeline**:
1. Text → sign search (dictionary lookup for glosses).
2. Glosses → pose animation stream (pre-recorded BVH/landmark data).
3.
   Render the 3D avatar performing the signs in Streamlit via `streamlit-threejs` or a static video.

---

### Phase 5: Platform & API

---

#### [MODIFY] API Authentication
`backend/app/core/security.py`
`backend/app/api/routes/auth.py`

```python
# Add API key management
@router.post("/api-keys")
async def create_api_key(user: User):
    # Generate API key for programmatic access

# Rate limiting middleware
class RateLimitMiddleware:
    - check_rate_limit(api_key)
    - track_usage(api_key, endpoint)
```

---

## New Dependencies

Add to `backend/requirements.txt`:

```txt
# Translation
transformers>=4.36.0
sentencepiece>=0.1.99

# Emotion Detection
speechbrain>=0.5.16

# Voice Cloning
TTS>=0.22.0  # Coqui TTS

# Sign Language
mediapipe>=0.10.9
opencv-python>=4.9.0

# Audio Editing (enhanced)
moviepy>=1.0.3
```

---

## Verification Plan

### Existing Tests (Verified)

Located in `backend/tests/`:
- `test_api_integration.py` - API endpoint tests
- `test_nlp.py` - NLP service tests
- `test_export.py` - Export format tests
- `test_diarization.py` - Diarization tests

**Run command**:
```bash
cd backend && pytest tests/ -v
```

### New Tests to Add

| Feature | Test File | Type |
|---------|-----------|------|
| Translation | `test_translation.py` | Unit + Integration |
| Batch Processing | `test_batch.py` | Integration |
| Live STT WebSocket | `test_ws_stt.py` | WebSocket test |
| Meeting Minutes | `test_meeting.py` | Integration |
| Emotion Detection | `test_emotion.py` | Unit |
| Audio Editor | `test_audio_editor.py` | Unit |

### Manual Testing

1. **Translation Flow**:
   - Upload Hindi audio → verify English transcript + translated audio output
2. **Sign Language** (requires webcam):
   - User performs ASL signs → verify the text output matches
3.
   **Voice Cloning**:
   - Upload a 10 s voice sample → generate TTS → verify voice similarity

---

## Architecture Diagram

```mermaid
graph TB
    subgraph FE["Frontend - Streamlit"]
        UI[Universal UI]
        PAGES["Transcribe | Synthesize | Translate | Sign | Meeting"]
    end
    subgraph GW["API Gateway"]
        AUTH[JWT/API Key Auth]
        RATE[Rate Limiter]
    end
    subgraph CORE["Core Services"]
        STT[Whisper STT]
        TTS[Edge TTS]
        TRANS[Translation]
        NLP[NLP Analysis]
    end
    subgraph AI["AI Intelligence"]
        MEETING[Meeting Minutes]
        EMOTION[Emotion Detection]
        VOCAB[Custom Vocabulary]
    end
    subgraph ADV["Advanced Features"]
        EDIT[Audio Editor]
        CLONE[Voice Cloning]
    end
    subgraph REV["Revolutionary"]
        SIGN_IN[Sign Recognition]
        SIGN_OUT[Sign Avatar]
    end

    UI --> AUTH
    AUTH --> CORE
    CORE --> AI
    CORE --> ADV
    CORE --> REV
```

---

## Estimated Timeline

| Phase | Features | Estimated Time |
|-------|----------|----------------|
| 1 | Translation, Batch, Live STT | 2-3 days |
| 2 | Meeting Minutes, Emotion, Vocabulary | 2-3 days |
| 3 | Audio Editor, Voice Cloning | 2-3 days |
| 4 | Sign Language (Recognition + Generation) | 5-7 days |
| 5 | API Auth, Landing Page | 1-2 days |

**Total**: ~12-18 days for full implementation

---

## Questions for You

1. **Which phase should we start with?** We recommend Phase 1 (Translation, Batch, Live STT) as the foundation.
2. **Sign language priority**: Start with ASL only, or multi-language from the beginning?
3. **Voice cloning consent**: Add a user consent checkbox before allowing voice cloning?
4. **Hosting**: Any preference for model hosting (local vs. cloud inference)?