# VoiceForge → Universal Communication Platform
## Implementation Plan
Transform VoiceForge from a Speech-to-Text/TTS application into a **Universal Communication Platform** supporting speech, text, translation, and sign language.
> [!IMPORTANT]
> This is a **major expansion** requiring significant development time. We recommend implementing it in phases, with user testing between phases.
---
## User Review Required
### Scope & Priority Decisions
1. **Phase Priority**: Should we start with Phase 1 (Translation, Batch, Live STT) or jump to a different phase?
2. **Sign Language**: This is the most complex feature. Options:
   - **Option A**: ASL only (most resources available)
   - **Option B**: Multi-sign-language support (ASL, ISL, BSL); requires more training data
   - **Option C**: Defer to Phase 5 until core features are stable
3. **Voice Cloning**: Coqui TTS models are large (~2 GB). Accept the increased storage requirements?
4. **API Authentication**: Enable for all endpoints or only for new features?
---
## Proposed Changes
### Phase 1: Translation, Batch & Live STT
#### [NEW] Batch Transcription
```python
async def batch_transcribe(files: List[UploadFile]):
    # Offload to Celery workers using BatchedInferencePipeline (2-3x faster)
    # Return a job_id for tracking
    ...

@router.get("/batch/{job_id}/status")
async def batch_status(job_id: str):
    # Query Redis/DB for the status of each file
    ...
```
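As a sketch of the job-tracking flow behind these endpoints, with an in-memory dict standing in for Redis (all names here are illustrative, not the project's actual API):

```python
import uuid

class JobStore:
    """In-memory stand-in for the Redis/DB job store used by the batch endpoints."""

    def __init__(self):
        self._jobs = {}

    def create_job(self, filenames):
        """Register a batch job; every file starts out pending."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {name: "pending" for name in filenames}
        return job_id

    def mark_done(self, job_id, filename):
        self._jobs[job_id][filename] = "done"

    def status(self, job_id):
        """Aggregate per-file states into an overall progress fraction."""
        files = self._jobs[job_id]
        done = sum(1 for state in files.values() if state == "done")
        return {"files": dict(files), "progress": done / len(files)}
```

A Celery worker would call `mark_done` as each file finishes, and the status endpoint would read `status(job_id)`.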
---
#### [NEW] Real-time Live STT WebSocket
`backend/app/api/routes/ws_stt.py`
**Technology**: `faster-whisper` + `silero-vad` (Voice Activity Detection) + WebSockets.
**Architecture**:
1. **Frontend**: Record 16kHz mono audio, send 100-200ms binary chunks via WS.
2. **Backend**:
   - Buffer chunks.
   - Run **Silero-VAD** to detect speech vs. silence.
   - On silence OR a 5s limit, run `model.transcribe` with `beam_size=1` (greedy) for <100ms latency.
   - Send partial/final results back.
**Frontend**: New live transcription component with waveform visualization.
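The backend buffering logic above can be sketched as a small state machine; the `is_speech` and `transcribe` callables stand in for Silero-VAD and faster-whisper, and every name here is illustrative:

```python
class ChunkBuffer:
    """Accumulates audio chunks and decides when to flush them to the STT model."""

    def __init__(self, is_speech, transcribe, max_chunks=25):
        # max_chunks ~ 5s of audio at 200ms per chunk (the plan's hard limit)
        self.is_speech = is_speech      # chunk -> bool  (VAD stand-in)
        self.transcribe = transcribe    # bytes -> str   (STT stand-in)
        self.max_chunks = max_chunks
        self.buffer = []

    def feed(self, chunk):
        """Returns a final transcript when a flush triggers, else None."""
        if self.is_speech(chunk):
            self.buffer.append(chunk)
            if len(self.buffer) >= self.max_chunks:
                return self._flush()    # hit the 5s hard limit
            return None
        # Silence: flush whatever speech we have buffered so far
        return self._flush() if self.buffer else None

    def _flush(self):
        audio = b"".join(self.buffer)
        self.buffer = []
        return self.transcribe(audio)
```

The WebSocket handler would call `feed` on every received chunk and send any non-`None` result back to the client as a final segment.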
---
### Phase 2: AI Intelligence Layer
#### [NEW] Intelligent Meeting Minutes
`backend/app/services/meeting_service.py`
`frontend/pages/8_📅_Meeting_Minutes.py`
**Goal**: Transform audio recordings into structured meeting minutes with speakers and action items.
**Changes**:
1. **Database**: Update the `Transcript` model (`backend/app/models/transcript.py`) to add an `action_items` (JSON) column.
2. **NLP Service**: Update `backend/app/services/nlp_service.py` to add regex-based action item extraction.
3. **Meeting Service**: Create an orchestrator service:
   * Input: audio file
   * Step 1: STT + Diarization (reuse `DiarizationService`)
   * Step 2: Sentiment Analysis (reuse `NLPService`)
   * Step 3: Summarization (reuse `NLPService`)
   * Step 4: Action Item Extraction (new)
   * Output: JSON with all metadata + a DB record.
4. **Frontend**: Dashboard to view minutes, filter by speaker, and export to PDF.
**Key Dependencies**: `pyannote.audio` (already installed), `sumy` (already installed).
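A minimal sketch of the regex-based action item extraction in step 4; the trigger patterns and function name are illustrative, not the final `NLPService` API:

```python
import re

# Hypothetical trigger patterns; a real list would be tuned on meeting data.
ACTION_PATTERNS = [
    re.compile(r"\b(?:will|should|needs? to|must|has to)\s+\w+", re.IGNORECASE),
    re.compile(r"\baction item[:\s]", re.IGNORECASE),
]

def extract_action_items(transcript: str) -> list[str]:
    """Return each sentence that contains an action-item trigger phrase."""
    items = []
    for sentence in re.split(r"(?<=[.!?])\s+", transcript):
        if any(p.search(sentence) for p in ACTION_PATTERNS):
            items.append(sentence.strip())
    return items
```

The orchestrator would run this over the diarized transcript and store the result in the new `action_items` column.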
---
#### [PENDING] Emotion & Sentiment Analysis
(To be implemented in Phase 2.2)
`backend/app/services/emotion_service.py`
**Technology**: **HuBERT-Large (fine-tuned on IEMOCAP)**. 2024 research indicates HuBERT captures prosody and pitch better than Wav2Vec2.
```python
class EmotionService:
    def analyze_audio(self, audio_path: str) -> dict:
        """Returns {"emotion": str, "intensity": float, "timeline": list}."""
        # Emotions: neutral, happy, sad, angry, fearful, surprised
        ...
```
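As a sketch of how frame-level model outputs could be folded into the `{emotion, intensity, timeline}` result above (the aggregation scheme and timeline-entry shape are assumptions, not part of the plan):

```python
from collections import defaultdict

def aggregate_emotions(timeline: list[dict]) -> dict:
    """Reduce per-segment predictions to a dominant emotion and its mean score.

    Each timeline entry is assumed to look like:
    {"start": float, "end": float, "emotion": str, "score": float}
    """
    totals = defaultdict(lambda: [0.0, 0])
    for seg in timeline:
        totals[seg["emotion"]][0] += seg["score"]
        totals[seg["emotion"]][1] += 1
    # Dominant emotion = largest summed score across segments
    emotion = max(totals, key=lambda e: totals[e][0])
    score_sum, count = totals[emotion]
    return {"emotion": emotion, "intensity": score_sum / count, "timeline": timeline}
```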
---
#### [NEW] Custom Vocabulary
`backend/app/services/vocabulary_service.py`
**Strategy**: Initialize `faster-whisper` with an `initial_prompt`, or use keyword boosting (if supported by the CTranslate2 version).
```python
class VocabularyService:
    def generate_initial_prompt(self, user_id: str) -> str:
        # Feed custom words into Whisper's context
        ...

    def apply_corrections(self, transcript: str, user_vocab: dict) -> str:
        # Regex-based post-processing
        ...
```
**Presets**: Medical, Legal, Technical, Financial
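A minimal sketch of the regex-based post-processing step, assuming `user_vocab` maps common misrecognitions to their preferred spellings (the mapping format is an assumption):

```python
import re

def apply_corrections(transcript: str, user_vocab: dict) -> str:
    """Replace whole-word misrecognitions with the user's preferred terms."""
    for wrong, right in user_vocab.items():
        # \b keeps 'stat' from matching inside 'statistics'
        transcript = re.sub(
            rf"\b{re.escape(wrong)}\b", right, transcript, flags=re.IGNORECASE
        )
    return transcript
```

The Medical/Legal/Technical/Financial presets would simply ship pre-built `user_vocab` dictionaries.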
---
### Phase 3: Advanced Audio
---
#### [NEW] Audio Editor
`backend/app/services/audio_editor_service.py`
```python
class AudioEditorService:
    def trim(self, audio_path: str, start: float, end: float) -> str: ...
    def merge(self, audio_paths: list[str]) -> str: ...
    def convert_format(self, audio_path: str, target_format: str) -> str: ...
    def extract_segment(self, audio_path: str, timestamp: float) -> str: ...
```
**Frontend**: Waveform editor with drag-select for trim/cut
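As a sketch of the `trim` operation using only the standard library's `wave` module; a real implementation would likely go through `pydub`/`ffmpeg` to cover compressed formats as well:

```python
import wave

def trim_wav(in_path: str, out_path: str, start: float, end: float) -> None:
    """Copy the [start, end) window (in seconds) of a WAV file to out_path."""
    with wave.open(in_path, "rb") as src:
        rate = src.getframerate()
        src.setpos(int(start * rate))               # seek to the start frame
        frames = src.readframes(int((end - start) * rate))
        params = src.getparams()
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)                        # nframes is fixed up on close
        dst.writeframes(frames)
```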
---
#### [NEW] Voice Cloning
`backend/app/services/voice_clone_service.py`
**Technology**: Coqui XTTS v2 (multilingual; needs a 3-10s voice sample)
```python
class VoiceCloneService:
    def clone_voice(self, sample_audio: str) -> str: ...          # returns voice_id
    def synthesize_with_voice(self, text: str, voice_id: str) -> bytes: ...
    def list_cloned_voices(self, user_id: str) -> list[str]: ...
    def delete_voice(self, voice_id: str) -> None: ...
```
> [!WARNING]
> Voice cloning has ethical implications. Consider adding consent verification.
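One way to wire the consent check into the service, sketched with an in-memory registry (every name here is illustrative; the real service would persist to the DB and call Coqui for synthesis):

```python
import uuid

class ConsentError(Exception):
    """Raised when a voice clone is requested without recorded consent."""

class VoiceRegistry:
    """In-memory stand-in for the cloned-voice store, gated on consent."""

    def __init__(self):
        self._voices = {}

    def clone_voice(self, user_id: str, sample_audio: str, consent_given: bool) -> str:
        if not consent_given:
            raise ConsentError("explicit consent is required before cloning a voice")
        voice_id = uuid.uuid4().hex
        self._voices[voice_id] = {"user_id": user_id, "sample": sample_audio}
        return voice_id

    def list_cloned_voices(self, user_id: str) -> list[str]:
        return [vid for vid, v in self._voices.items() if v["user_id"] == user_id]

    def delete_voice(self, voice_id: str) -> None:
        self._voices.pop(voice_id, None)
```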
---
### Phase 4: Revolutionary - Sign Language
---
#### [NEW] Sign Language Recognition 🤟
`backend/app/services/sign_language_service.py`
**Technology**: **MediaPipe Holistic** (feature extractor) + **1D Transformer Encoder** (sequence classifier).
**Pipeline**:
1. Record video (24-30 FPS).
2. Extract landmarks (hands, pose, a minimal set of face points).
3. Normalize coordinates relative to a body root.
4. A Transformer Encoder classifies the frame sequence into sign labels (trained on WLASL/MS-ASL).
**Frontend**: Webcam capture with real-time recognition overlay
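Step 3 (root-relative normalization) can be sketched in pure Python; the 2D landmark layout and the scale choice are assumptions for illustration:

```python
def normalize_landmarks(
    landmarks: list[tuple[float, float]],
    root: tuple[float, float],
    scale: float,
) -> list[tuple[float, float]]:
    """Translate landmarks so the body root is the origin, then divide by a
    body-size scale (e.g. shoulder width) for distance-to-camera invariance."""
    rx, ry = root
    return [((x - rx) / scale, (y - ry) / scale) for x, y in landmarks]
```

Normalizing per-frame this way keeps the Transformer input invariant to where the signer stands in the camera frame.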
---
#### [NEW] Sign Language Generation
`backend/app/services/sign_avatar_service.py`
**Technology**: **Three.js + Ready Player Me Avatar** (or similar) mapped to pose/hand animation data.
**Pipeline**:
1. Text → Sign Search (dictionary lookup for glosses).
2. Glosses → Pose Animation Stream (pre-recorded BVH/landmark data).
3. Render a 3D avatar performing the signs in Streamlit via `streamlit-threejs` or a static video.
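Step 1 (text → glosses) can be sketched as a greedy dictionary lookup with fingerspelling as the fallback; the dictionary contents and the `FS-` fallback convention are illustrative:

```python
def text_to_glosses(text: str, sign_dict: dict) -> list[str]:
    """Map each word to a known gloss, falling back to fingerspelling."""
    glosses = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in sign_dict:
            glosses.append(sign_dict[word])
        else:
            # Unknown word: spell it out letter by letter
            glosses.extend(f"FS-{ch.upper()}" for ch in word)
    return glosses
```

Each gloss would then index into the pre-recorded BVH/landmark animation library from step 2.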
---
### Phase 5: Platform & API
---
#### [MODIFY] API Authentication
`backend/app/core/security.py`
`backend/app/api/routes/auth.py`
```python
# Add API key management
@router.post("/api-keys")
async def create_api_key(user: User):
    # Generate an API key for programmatic access
    ...

# Rate limiting middleware
class RateLimitMiddleware:
    def check_rate_limit(self, api_key: str) -> bool: ...
    def track_usage(self, api_key: str, endpoint: str) -> None: ...
```
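A minimal sliding-window limiter behind `check_rate_limit`, sketched with in-memory state (the window and limit values are illustrative; production would back this with Redis):

```python
import time
from collections import defaultdict, deque
from typing import Optional

class SlidingWindowLimiter:
    """Allows at most `limit` requests per `window` seconds per API key."""

    def __init__(self, limit: int = 60, window: float = 60.0):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)   # api_key -> timestamps of recent requests

    def check_rate_limit(self, api_key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[api_key]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False                  # over quota: reject (HTTP 429)
        hits.append(now)
        return True
```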
---
## New Dependencies
Add to `backend/requirements.txt`:
```txt
# Translation
transformers>=4.36.0
sentencepiece>=0.1.99

# Emotion Detection
speechbrain>=0.5.16

# Voice Cloning
TTS>=0.22.0  # Coqui TTS

# Sign Language
mediapipe>=0.10.9
opencv-python>=4.9.0

# Audio Editing (enhanced)
moviepy>=1.0.3
```
---
## Verification Plan
### Existing Tests (Verified)
Located in `backend/tests/`:
- `test_api_integration.py` - API endpoint tests
- `test_nlp.py` - NLP service tests
- `test_export.py` - Export format tests
- `test_diarization.py` - Diarization tests
**Run command**:
```bash
cd backend && pytest tests/ -v
```
### New Tests to Add
| Feature | Test File | Type |
|---------|-----------|------|
| Translation | `test_translation.py` | Unit + Integration |
| Batch Processing | `test_batch.py` | Integration |
| Live STT WebSocket | `test_ws_stt.py` | WebSocket test |
| Meeting Minutes | `test_meeting.py` | Integration |
| Emotion Detection | `test_emotion.py` | Unit |
| Audio Editor | `test_audio_editor.py` | Unit |
### Manual Testing
1. **Translation Flow**:
   - Upload Hindi audio → verify English transcript + translated audio output
2. **Sign Language** (requires webcam):
   - User performs ASL signs → verify the text output matches
3. **Voice Cloning**:
   - Upload a 10s voice sample → generate TTS → verify voice similarity
---
## Architecture Diagram
```mermaid
graph TB
    subgraph FE["Frontend - Streamlit"]
        UI[Universal UI]
        PAGES["Transcribe | Synthesize | Translate | Sign | Meeting"]
    end
    subgraph GW["API Gateway"]
        AUTH[JWT/API Key Auth]
        RATE[Rate Limiter]
    end
    subgraph CORE["Core Services"]
        STT[Whisper STT]
        TTS[Edge TTS]
        TRANS[Translation]
        NLP[NLP Analysis]
    end
    subgraph AI["AI Intelligence"]
        MEETING[Meeting Minutes]
        EMOTION[Emotion Detection]
        VOCAB[Custom Vocabulary]
    end
    subgraph ADV["Advanced Features"]
        EDIT[Audio Editor]
        CLONE[Voice Cloning]
    end
    subgraph REV["Revolutionary"]
        SIGN_IN[Sign Recognition]
        SIGN_OUT[Sign Avatar]
    end
    UI --> AUTH --> CORE
    CORE --> AI
    CORE --> ADV
    CORE --> REV
```
---
## Estimated Timeline
| Phase | Features | Estimated Time |
|-------|----------|----------------|
| 1 | Translation, Batch, Live STT | 2-3 days |
| 2 | Meeting Minutes, Emotion, Vocabulary | 2-3 days |
| 3 | Audio Editor, Voice Cloning | 2-3 days |
| 4 | Sign Language (Recognition + Generation) | 5-7 days |
| 5 | API Auth, Landing Page | 1-2 days |
**Total**: ~12-18 days for full implementation
---
## Questions for You
1. **Which phase should we start with?** We recommend Phase 1 (Translation, Batch, Live STT) as the foundation.
2. **Sign Language priority**: Start with ASL only, or multi-language from the beginning?
3. **Voice Cloning consent**: Add a user consent checkbox before allowing voice cloning?
4. **Hosting**: Any preferences for model hosting (local vs. cloud inference)?