# VoiceForge → Universal Communication Platform
## Implementation Plan
Transform VoiceForge from a Speech-to-Text/TTS application into a Universal Communication Platform supporting speech, text, translation, and sign language.
This is a major expansion requiring significant development time. We recommend implementing it in phases, with user testing between phases.
## User Review Required

### Scope & Priority Decisions
- **Phase Priority**: Should we start with Phase 1 (Translation, Batch, Live STT) or jump to a different phase?
- **Sign Language**: This is the most complex feature. Options:
  - **Option A**: ASL only (most resources available)
  - **Option B**: Multi-sign-language support (ASL, ISL, BSL); requires more training data
  - **Option C**: Defer to Phase 5, once the core features are stable
- **Voice Cloning**: Coqui TTS models are large (~2 GB). Accept the increased storage requirements?
- **API Authentication**: Enable for all endpoints or just the new features?
## Proposed Changes
### Phase 1: Translation, Batch & Live STT

#### [NEW] Batch Processing

```python
@router.post("/batch/transcribe")
async def batch_transcribe(files: List[UploadFile]):
    # Offload to Celery workers using BatchedInferencePipeline (2-3x faster)
    # Return a job_id for tracking
    ...

@router.get("/batch/{job_id}/status")
async def batch_status(job_id: str):
    # Query Redis/DB for the status of multiple files
    ...
```
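Since the final status store is Redis/DB, here is a minimal in-memory sketch of the job-tracking side that the two endpoints above would call into. The names (`JobStore`, `BatchJob`, `JobStatus`) are illustrative, not existing VoiceForge code:

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum

class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"

@dataclass
class BatchJob:
    job_id: str
    total_files: int
    completed: int = 0
    status: JobStatus = JobStatus.PENDING
    results: dict = field(default_factory=dict)  # filename -> transcript

class JobStore:
    """In-memory stand-in for the Redis/DB job tracker."""

    def __init__(self):
        self._jobs: dict = {}

    def create(self, total_files: int) -> BatchJob:
        # Issue an opaque job_id the client can poll with
        job = BatchJob(job_id=uuid.uuid4().hex, total_files=total_files)
        self._jobs[job.job_id] = job
        return job

    def mark_file_done(self, job_id: str, filename: str, transcript: str) -> BatchJob:
        # Called by each Celery worker as it finishes a file
        job = self._jobs[job_id]
        job.results[filename] = transcript
        job.completed += 1
        job.status = JobStatus.DONE if job.completed == job.total_files else JobStatus.RUNNING
        return job

    def status(self, job_id: str) -> BatchJob:
        return self._jobs[job_id]
```

In production the same interface would be backed by Redis hashes keyed on `job_id`, so status survives worker restarts.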
---
#### [NEW] Real-time Live STT WebSocket
`backend/app/api/routes/ws_stt.py`
**Technology**: `faster-whisper` + `silero-vad` (Voice Activity Detection) + WebSockets.
**Architecture**:
1. **Frontend**: Record 16kHz mono audio, send 100-200ms binary chunks via WS.
2. **Backend**:
- Buffer chunks.
- Run **Silero-VAD** to detect speech vs silence.
- On silence OR 5s limit, run `model.transcribe` with `beam_size=1` (greedy) for <100ms latency.
- Send partial/final results back.
**Frontend**: New live transcription component with waveform visualization
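The backend buffering/flush logic above can be sketched independently of the models. This is a minimal state machine under stated assumptions: 16-bit PCM input, and an `is_speech` flag assumed to come from Silero-VAD:

```python
from typing import Optional

class ChunkBuffer:
    """Accumulates 100-200 ms audio chunks and decides when to flush to the
    transcriber: on detected silence, or once 5 s of audio is buffered."""

    def __init__(self, sample_rate: int = 16_000, max_seconds: float = 5.0):
        self.max_samples = int(sample_rate * max_seconds)
        self._chunks = []
        self._samples = 0

    def add(self, chunk: bytes, is_speech: bool) -> Optional[bytes]:
        """Append one chunk; return buffered audio when a flush triggers, else None."""
        self._chunks.append(chunk)
        self._samples += len(chunk) // 2  # 16-bit PCM: 2 bytes per sample
        if not is_speech or self._samples >= self.max_samples:
            return self.flush()
        return None

    def flush(self) -> Optional[bytes]:
        if not self._chunks:
            return None
        audio = b"".join(self._chunks)
        self._chunks, self._samples = [], 0
        return audio
```

On each flushed buffer the server would run `model.transcribe(..., beam_size=1)` and push the result back over the WebSocket.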
---
### Phase 2: AI Intelligence Layer
#### [NEW] Intelligent Meeting Minutes
`backend/app/services/meeting_service.py`
`frontend/pages/8_📅_Meeting_Minutes.py`
**Goal**: Transform audio recordings into structured meeting minutes with speakers and action items.
**Changes**:
1. **Database**: Update `Transcript` model (`backend/app/models/transcript.py`) to add `action_items` (JSON) column.
2. **NLP Service**: Update `backend/app/services/nlp_service.py` to add regex-based action item extraction.
3. **Meeting Service**: Create orchestrator service:
* Input: Audio File
* Step 1: STT + Diarization (reuse `DiarizationService`)
* Step 2: Sentiment Analysis (reuse `NLPService`)
* Step 3: Summarization (reuse `NLPService`)
* Step 4: Action Item Extraction (new)
* Output: JSON with all metadata + DB record.
4. **Frontend**: Dashboard to view minutes, filter by speaker, and export to PDF.
**Key Dependencies**: `pyannote.audio` (already installed), `sumy` (already installed).
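As a rough illustration of step 4, the regex-based extraction could look like the sketch below. The cue patterns are hypothetical and would need tuning on real meeting data; diarization attaches speakers upstream:

```python
import re

# Illustrative cue patterns for action items (not an exhaustive set)
ACTION_PATTERNS = [
    re.compile(r"\b(?:will|going to|needs? to|should|must|has to)\s+(\w+(?:\s+\w+){0,8})", re.IGNORECASE),
    re.compile(r"\baction item[:\-]?\s*(.+)", re.IGNORECASE),
]

def extract_action_items(transcript: str):
    """Scan each sentence for action cues; return matched sentences and tasks."""
    items = []
    for sentence in re.split(r"(?<=[.!?])\s+", transcript):
        for pattern in ACTION_PATTERNS:
            match = pattern.search(sentence)
            if match:
                items.append({"sentence": sentence.strip(), "task": match.group(1).strip()})
                break  # one item per sentence is enough for minutes
    return items
```

Regex recall is limited; if it proves too brittle, a small instruction-tuned model could replace this step without changing the service interface.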
---
#### [PENDING] Emotion & Sentiment Analysis
(To be implemented in Phase 2.2)
`backend/app/services/emotion_service.py`
**Technology**: **HuBERT-Large (fine-tuned on IEMOCAP)**. Recent (2024) research suggests HuBERT captures prosody/pitch better than Wav2Vec2.
```python
class EmotionService:
    - analyze_audio(audio_path) → {emotion: str, intensity: float, timeline: List}
    # Emotions: neutral, happy, sad, angry, fearful, surprised
```
---

#### [NEW] Custom Vocabulary
`backend/app/services/vocabulary_service.py`

**Strategy**: faster-whisper initialization with `initial_prompt`, or keyword boosting (if supported by the CTranslate2 version).

```python
class VocabularyService:
    - generate_initial_prompt(user_id)           # Feed custom words into Whisper's context
    - apply_corrections(transcript, user_vocab)  # Regex-based post-processing
```

**Presets**: Medical, Legal, Technical, Financial
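A minimal sketch of both strategies, written as module-level functions rather than the final `VocabularyService` class. Whisper treats `initial_prompt` as prior context, so this is a soft spelling bias, not a guarantee:

```python
import re

def generate_initial_prompt(custom_words):
    """Build a short context string listing the user's custom terms."""
    return "Glossary: " + ", ".join(custom_words) + "."

def apply_corrections(transcript: str, corrections: dict) -> str:
    """Post-process: replace common mis-transcriptions with canonical terms,
    matching whole words/phrases case-insensitively."""
    for wrong, right in corrections.items():
        transcript = re.sub(rf"\b{re.escape(wrong)}\b", right, transcript, flags=re.IGNORECASE)
    return transcript
```

The correction map would be loaded per `user_id` from the DB; presets (Medical, Legal, ...) are just shipped correction maps plus glossary word lists.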
---

### Phase 3: Advanced Audio
#### [NEW] Audio Editor
`backend/app/services/audio_editor_service.py`

```python
class AudioEditorService:
    - trim(audio_path, start, end) → trimmed_audio
    - merge(audio_paths: List) → merged_audio
    - convert_format(audio_path, target_format)
    - extract_segment(audio_path, timestamp)
```

**Frontend**: Waveform editor with drag-select for trim/cut
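For WAV input, the `trim` operation can be done with the standard library alone. This is a sketch; the real service would likely use pydub/ffmpeg to cover compressed formats as well:

```python
import io
import wave

def trim_wav(wav_bytes: bytes, start_s: float, end_s: float) -> bytes:
    """Cut the [start_s, end_s) window out of an uncompressed WAV."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        src.setpos(int(start_s * rate))                      # seek to start frame
        frames = src.readframes(int((end_s - start_s) * rate))
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)   # same channels/width/rate; frame count is patched on close
        dst.writeframes(frames)
    return out.getvalue()
```

`merge` is the same pattern in reverse: concatenate frames from several readers that share identical parameters, resampling first when they do not.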
---

#### [NEW] Voice Cloning
`backend/app/services/voice_clone_service.py`

**Technology**: Coqui XTTS v2 (multilingual; needs a 3-10 s voice sample)

```python
class VoiceCloneService:
    - clone_voice(sample_audio) → voice_id
    - synthesize_with_voice(text, voice_id) → audio
    - list_cloned_voices(user_id)
    - delete_voice(voice_id)
```

**Note**: Voice cloning has ethical implications. Consider adding consent verification.
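One way to implement that consent verification is to gate `clone_voice` on a recorded consent for the exact sample. The registry below is a hypothetical in-memory sketch; production would persist consents in the DB alongside the cloned-voice record:

```python
import hashlib
from datetime import datetime, timezone

class ConsentRegistry:
    """A voice sample may only be cloned after the owning user has recorded
    explicit consent for that sample's content hash."""

    def __init__(self):
        self._consents = {}

    @staticmethod
    def sample_id(sample_audio: bytes) -> str:
        # Hash the raw bytes so consent is bound to one specific sample
        return hashlib.sha256(sample_audio).hexdigest()[:16]

    def record_consent(self, user_id: str, sample_audio: bytes) -> str:
        sid = self.sample_id(sample_audio)
        self._consents[sid] = {"user_id": user_id, "at": datetime.now(timezone.utc)}
        return sid

    def require_consent(self, sample_audio: bytes) -> None:
        """Raise if no consent exists; call this at the top of clone_voice."""
        if self.sample_id(sample_audio) not in self._consents:
            raise PermissionError("No consent on record for this voice sample")
```

Binding consent to the content hash (rather than just a checkbox) means a user cannot reuse one consent to clone a different person's sample.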
---

### Phase 4: Revolutionary - Sign Language
#### [NEW] Sign Language Recognition 🤟
`backend/app/services/sign_language_service.py`

**Technology**: MediaPipe Holistic (feature extractor) + 1D Transformer encoder (sequence classifier).

**Pipeline**:
1. Video recording (24-30 FPS).
2. Extract landmarks (hands, pose, and a reduced set of face points).
3. Normalize coordinates relative to the body root.
4. Transformer encoder classifies frame sequences into sign labels (trained on WLASL/MS-ASL).

**Frontend**: Webcam capture with real-time recognition overlay
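Step 3 of the pipeline (coordinate normalization) is simple enough to sketch directly. Plain Python tuples stand in for the MediaPipe landmark arrays the real pipeline would operate on:

```python
def normalize_landmarks(landmarks, root_index=0):
    """Translate all (x, y) landmarks so the body-root landmark sits at the
    origin, then scale by the largest distance from the root, so different
    signer sizes and camera distances become comparable to the classifier."""
    rx, ry = landmarks[root_index]
    shifted = [(x - rx, y - ry) for x, y in landmarks]
    scale = max((x * x + y * y) ** 0.5 for x, y in shifted) or 1.0
    return [(x / scale, y / scale) for x, y in shifted]
```

The same idea extends to (x, y, z) landmarks; per-frame normalization keeps the Transformer's input distribution stable across recording setups.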
---

#### [NEW] Sign Language Generation
`backend/app/services/sign_avatar_service.py`

**Technology**: Three.js + Ready Player Me avatar (or similar) driven by pose/hand animation data.

**Pipeline**:
1. Text → sign search (dictionary lookup for glosses).
2. Glosses → pose animation stream (pre-recorded BVH/landmark data).
3. Render a 3D avatar performing the signs in Streamlit via `streamlit-threejs` or static video.
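Step 1 of this pipeline might look like the sketch below. The toy `GLOSS_DICT` is illustrative only (a real system would back it with WLASL-style data), with fingerspelling as the fallback for out-of-vocabulary words:

```python
import re

# Toy gloss dictionary; keys may span multiple words
GLOSS_DICT = {
    "hello": ["HELLO"],
    "my": ["MY"],
    "name": ["NAME"],
    "thank you": ["THANK-YOU"],
}

def text_to_glosses(text: str):
    """Greedy longest-match lookup from English text to sign glosses;
    unknown words fall back to fingerspelling (one gloss per letter)."""
    words = re.findall(r"[a-z']+", text.lower())
    glosses, i = [], 0
    while i < len(words):
        two = " ".join(words[i:i + 2])
        if two in GLOSS_DICT:          # prefer two-word phrases
            glosses += GLOSS_DICT[two]
            i += 2
        elif words[i] in GLOSS_DICT:
            glosses += GLOSS_DICT[words[i]]
            i += 1
        else:                          # fingerspell out-of-vocabulary words
            glosses += [f"FS-{ch.upper()}" for ch in words[i]]
            i += 1
    return glosses
```

Each returned gloss then keys into the pre-recorded BVH/landmark clips that drive the avatar in step 2.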
---

### Phase 5: Platform & API
#### [MODIFY] API Authentication
`backend/app/core/security.py`
`backend/app/api/routes/auth.py`

```python
# Add API key management
@router.post("/api-keys")
async def create_api_key(user: User):
    # Generate an API key for programmatic access
    ...

# Rate limiting middleware
class RateLimitMiddleware:
    - check_rate_limit(api_key)
    - track_usage(api_key, endpoint)
```
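A sliding-window version of `check_rate_limit` could be sketched like this. It is in-memory only; production would back it with Redis so limits hold across workers:

```python
import time
from typing import Optional

class RateLimiter:
    """Sliding-window limiter: allow at most max_requests per api_key
    within any trailing window of window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits = {}  # api_key -> list of hit timestamps

    def check_rate_limit(self, api_key: str, now: Optional[float] = None) -> bool:
        """Return True if the request is allowed; record the hit when allowed.
        `now` is injectable for testing; defaults to a monotonic clock."""
        now = time.monotonic() if now is None else now
        # Drop hits that have aged out of the window
        hits = [t for t in self._hits.get(api_key, []) if now - t < self.window]
        allowed = len(hits) < self.max_requests
        if allowed:
            hits.append(now)
        self._hits[api_key] = hits
        return allowed
```

The middleware would call this per request and return HTTP 429 when it yields `False`; `track_usage` is the same bookkeeping aggregated per endpoint.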
---

## New Dependencies
Add to `backend/requirements.txt`:

```text
# Translation
transformers>=4.36.0
sentencepiece>=0.1.99

# Emotion Detection
speechbrain>=0.5.16

# Voice Cloning
TTS>=0.22.0  # Coqui TTS

# Sign Language
mediapipe>=0.10.9
opencv-python>=4.9.0

# Audio Editing (enhanced)
moviepy>=1.0.3
```
---

## Verification Plan

### Existing Tests (Verified)
Located in `backend/tests/`:
- `test_api_integration.py` - API endpoint tests
- `test_nlp.py` - NLP service tests
- `test_export.py` - Export format tests
- `test_diarization.py` - Diarization tests

Run command:

```bash
cd backend && pytest tests/ -v
```
### New Tests to Add

| Feature | Test File | Type |
|---|---|---|
| Translation | `test_translation.py` | Unit + Integration |
| Batch Processing | `test_batch.py` | Integration |
| Live STT WebSocket | `test_ws_stt.py` | WebSocket test |
| Meeting Minutes | `test_meeting.py` | Integration |
| Emotion Detection | `test_emotion.py` | Unit |
| Audio Editor | `test_audio_editor.py` | Unit |
### Manual Testing
1. **Translation flow**: Upload Hindi audio → verify English transcript + translated audio output.
2. **Sign language** (requires webcam): User performs ASL signs → verify the text output matches.
3. **Voice cloning**: Upload a 10 s voice sample → generate TTS → verify voice similarity.
---

## Architecture Diagram

```mermaid
graph TB
    subgraph FE["Frontend - Streamlit"]
        UI[Universal UI]
        PAGES["Transcribe | Synthesize | Translate | Sign | Meeting"]
    end
    subgraph GW["API Gateway"]
        AUTH[JWT/API Key Auth]
        RATE[Rate Limiter]
    end
    subgraph CORE["Core Services"]
        STT[Whisper STT]
        TTS[Edge TTS]
        TRANS[Translation]
        NLP[NLP Analysis]
    end
    subgraph AI["AI Intelligence"]
        MEETING[Meeting Minutes]
        EMOTION[Emotion Detection]
        VOCAB[Custom Vocabulary]
    end
    subgraph ADV["Advanced Features"]
        EDIT[Audio Editor]
        CLONE[Voice Cloning]
    end
    subgraph REV["Revolutionary"]
        SIGN_IN[Sign Recognition]
        SIGN_OUT[Sign Avatar]
    end

    UI --> AUTH --> CORE
    CORE --> AI
    CORE --> ADV
    CORE --> REV
```
---

## Estimated Timeline
| Phase | Features | Estimated Time |
|---|---|---|
| 1 | Translation, Batch, Live STT | 2-3 days |
| 2 | Meeting Minutes, Emotion, Vocabulary | 2-3 days |
| 3 | Audio Editor, Voice Cloning | 2-3 days |
| 4 | Sign Language (Recognition + Generation) | 5-7 days |
| 5 | API Auth, Landing Page | 1-2 days |
**Total**: ~12-18 days for full implementation
---

## Questions for You
1. **Which phase should we start with?** We recommend Phase 1 (Translation, Batch, Live STT) as the foundation.
2. **Sign language priority**: Start with ASL only, or multi-language from the beginning?
3. **Voice cloning consent**: Add a user consent checkbox before allowing voice cloning?
4. **Hosting**: Any preferences for model hosting (local vs. cloud inference)?