# VoiceForge → Universal Communication Platform
## Implementation Plan
Transform VoiceForge from a Speech-to-Text/TTS application into a **Universal Communication Platform** supporting speech, text, translation, and sign language.
> [!IMPORTANT]
> This is a **major expansion** requiring significant development time. Recommend implementing in phases, with user testing between phases.
---
## User Review Required
### Scope & Priority Decisions
1. **Phase Priority**: Should we start with Phase 1 (Translation, Batch, Live STT) or jump to a different phase?
2. **Sign Language**: This is the most complex feature. Options:
- **Option A**: ASL only (most resources available)
- **Option B**: Multi-sign-language support (ASL, ISL, BSL) - requires more training data
- **Option C**: Defer to Phase 5 after core features stable
3. **Voice Cloning**: Coqui TTS is large (~2GB models). Accept increased storage requirements?
4. **API Authentication**: Enable for all endpoints or just new features?
---
## Proposed Changes
### Phase 1: Translation, Batch, Live STT

#### [NEW] Batch Transcription

```python
@router.post("/batch")
async def batch_transcribe(files: List[UploadFile]):
    # Offload to Celery workers using BatchedInferencePipeline (2-3x faster)
    # Return a job_id for tracking
    ...

@router.get("/batch/{job_id}/status")
async def batch_status(job_id: str):
    # Query Redis/DB for the status of each file in the job
    ...
```
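The job-tracking side of this endpoint can be sketched as follows. This is a minimal in-memory stand-in for the Redis/DB store mentioned above; the names `JobStore`, `submit`, and `complete_file` are illustrative, not the final API.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class BatchJob:
    job_id: str
    files: List[str]
    done: Dict[str, str] = field(default_factory=dict)  # filename -> transcript

class JobStore:
    """In-memory stand-in for the Redis/DB job store."""

    def __init__(self):
        self._jobs: Dict[str, BatchJob] = {}

    def submit(self, filenames: List[str]) -> str:
        # Called by the POST /batch handler before dispatching Celery tasks.
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = BatchJob(job_id, filenames)
        return job_id

    def complete_file(self, job_id: str, filename: str, transcript: str) -> None:
        # Called by a worker as each file finishes.
        self._jobs[job_id].done[filename] = transcript

    def status(self, job_id: str) -> dict:
        # Backs the GET /batch/{job_id}/status handler.
        job = self._jobs[job_id]
        return {
            "job_id": job_id,
            "total": len(job.files),
            "completed": len(job.done),
            "state": "done" if len(job.done) == len(job.files) else "processing",
        }
```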
---
#### [NEW] Real-time Live STT WebSocket
`backend/app/api/routes/ws_stt.py`
**Technology**: `faster-whisper` + `silero-vad` (Voice Activity Detection) + WebSockets.
**Architecture**:
1. **Frontend**: Record 16kHz mono audio, send 100-200ms binary chunks via WS.
2. **Backend**:
- Buffer chunks.
- Run **Silero-VAD** to detect speech vs silence.
- On silence OR 5s limit, run `model.transcribe` with `beam_size=1` (greedy) for <100ms latency.
- Send partial/final results back.
**Frontend**: New live transcription component with waveform visualization
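The backend buffering loop above can be sketched as a small state machine. A simple energy threshold stands in for Silero-VAD here, and the chunk sizes and threshold are illustrative placeholders rather than tuned values.

```python
from typing import List, Optional

SAMPLE_RATE = 16_000       # matches the 16 kHz mono stream from the frontend
MAX_BUFFER_SECONDS = 5.0   # hard flush limit from step 2

def is_speech(chunk: List[float], threshold: float = 0.01) -> bool:
    """Stand-in VAD: mean absolute amplitude. Silero-VAD replaces this."""
    return sum(abs(s) for s in chunk) / max(len(chunk), 1) > threshold

class StreamingBuffer:
    """Accumulates WS audio chunks; flushes a segment on silence or the 5 s cap."""

    def __init__(self):
        self._buf: List[float] = []

    def push(self, chunk: List[float]) -> Optional[List[float]]:
        if is_speech(chunk):
            self._buf.extend(chunk)
            if len(self._buf) >= SAMPLE_RATE * MAX_BUFFER_SECONDS:
                return self._flush()
            return None
        # Silence: flush whatever speech we have buffered so far.
        return self._flush() if self._buf else None

    def _flush(self) -> List[float]:
        segment, self._buf = self._buf, []
        # The caller would pass this to model.transcribe(..., beam_size=1)
        # and send the partial/final result back over the WebSocket.
        return segment
```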
---
### Phase 2: AI Intelligence Layer
#### [NEW] Intelligent Meeting Minutes
`backend/app/services/meeting_service.py`
`frontend/pages/8_📅_Meeting_Minutes.py`
**Goal**: Transform audio recordings into structured meeting minutes with speakers and action items.
**Changes**:
1. **Database**: Update `Transcript` model (`backend/app/models/transcript.py`) to add `action_items` (JSON) column.
2. **NLP Service**: Update `backend/app/services/nlp_service.py` to add regex-based action item extraction.
3. **Meeting Service**: Create orchestrator service:
* Input: Audio File
* Step 1: STT + Diarization (reuse `DiarizationService`)
* Step 2: Sentiment Analysis (reuse `NLPService`)
* Step 3: Summarization (reuse `NLPService`)
* Step 4: Action Item Extraction (new)
* Output: JSON with all metadata + DB record.
4. **Frontend**: Dashboard to view minutes, filter by speaker, and export to PDF.
**Key Dependencies**: `pyannote.audio` (already installed), `sumy` (already installed).
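Step 4, the regex-based action item extraction, can be sketched as below. The patterns are illustrative; the production list in `nlp_service.py` would be broader.

```python
import re
from typing import List

# Illustrative trigger patterns for action items ("X will ...",
# "we need to ...", explicit "action item:" markers).
ACTION_PATTERNS = [
    r"\b(?:I|we|you|\w+)\s+will\s+(.+)",
    r"\baction item[:\s]+(.+)",
    r"\b(?:need to|needs to|must|should)\s+(.+)",
]

def extract_action_items(transcript: str) -> List[str]:
    """Return candidate action-item sentences for the `action_items` JSON column."""
    items = []
    for sentence in re.split(r"[.!?]\s*", transcript):
        for pattern in ACTION_PATTERNS:
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                items.append(sentence.strip())
                break
    return items
```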
---
#### [PENDING] Emotion & Sentiment Analysis
(To be implemented in Phase 2.2)
`backend/app/services/emotion_service.py`
**Technology**: **HuBERT-Large (fine-tuned on IEMOCAP)**. Recent (2024) results suggest HuBERT captures prosody and pitch better than Wav2Vec2.
```python
class EmotionService:
- analyze_audio(audio_path) → {emotion: str, intensity: float, timeline: List}
# Emotions: neutral, happy, sad, angry, fearful, surprised
```
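Once this service exists, shaping the per-window classifier scores into the return format above is straightforward. A sketch, assuming the HuBERT head yields a probability per emotion for each analysis window (the window scores here are hypothetical):

```python
from typing import Dict, List, Tuple

def build_timeline(window_scores: List[Tuple[float, Dict[str, float]]]) -> dict:
    """window_scores: list of (start_seconds, {emotion: probability})."""
    timeline = []
    for start, scores in window_scores:
        emotion, prob = max(scores.items(), key=lambda kv: kv[1])
        timeline.append({"start": start, "emotion": emotion, "intensity": prob})
    # Overall label: the emotion that wins the most windows.
    counts: Dict[str, int] = {}
    for entry in timeline:
        counts[entry["emotion"]] = counts.get(entry["emotion"], 0) + 1
    overall = max(counts, key=counts.get)
    mean_intensity = sum(e["intensity"] for e in timeline) / len(timeline)
    return {"emotion": overall, "intensity": round(mean_intensity, 3), "timeline": timeline}
```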
---
#### [NEW] Custom Vocabulary
`backend/app/services/vocabulary_service.py`
**Strategy**: `faster-whisper` initialization with `initial_prompt` or keyword boosting (if supported by CTranslate2 version).
```python
class VocabularyService:
- generate_initial_prompt(user_id) # Feed custom words into Whisper's context
- apply_corrections(transcript, user_vocab) # Regex-based post-processing
```
**Presets**: Medical, Legal, Technical, Financial
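Both strategies can be sketched in a few lines. Whisper biases decoding toward tokens seen in `initial_prompt` (supported by `faster-whisper`'s `transcribe`); the glossary wording and the sample corrections below are illustrative.

```python
import re
from typing import Dict, List

def generate_initial_prompt(custom_words: List[str]) -> str:
    """Build a prompt that feeds the user's vocabulary into Whisper's context."""
    return "Glossary: " + ", ".join(custom_words) + "."

def apply_corrections(transcript: str, corrections: Dict[str, str]) -> str:
    """Regex post-processing: replace common misrecognitions, whole words only."""
    for wrong, right in corrections.items():
        transcript = re.sub(rf"\b{re.escape(wrong)}\b", right,
                            transcript, flags=re.IGNORECASE)
    return transcript
```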
---
### Phase 3: Advanced Audio
---
#### [NEW] Audio Editor
`backend/app/services/audio_editor_service.py`
```python
class AudioEditorService:
- trim(audio_path, start, end) → trimmed_audio
- merge(audio_paths: List) → merged_audio
- convert_format(audio_path, target_format)
- extract_segment(audio_path, timestamp)
```
**Frontend**: Waveform editor with drag-select for trim/cut
---
#### [NEW] Voice Cloning
`backend/app/services/voice_clone_service.py`
**Technology**: Coqui XTTS v2 (multilingual, 3-10s voice sample needed)
```python
class VoiceCloneService:
- clone_voice(sample_audio) → voice_id
- synthesize_with_voice(text, voice_id) → audio
- list_cloned_voices(user_id)
- delete_voice(voice_id)
```
> [!WARNING]
> Voice cloning has ethical implications. Consider adding consent verification.
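A consent gate fits naturally into the service's bookkeeping. This sketch covers only the registry side; the actual XTTS v2 embedding/synthesis calls are elided, and `ConsentError` and the registry API are illustrative names.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List

class ConsentError(Exception):
    """Raised when cloning is attempted without recorded speaker consent."""

@dataclass
class ClonedVoice:
    voice_id: str
    owner_id: str
    sample_path: str

class VoiceRegistry:
    def __init__(self):
        self._voices: Dict[str, ClonedVoice] = {}

    def clone_voice(self, owner_id: str, sample_path: str, consent_given: bool) -> str:
        if not consent_given:
            raise ConsentError("Explicit speaker consent is required before cloning.")
        voice_id = uuid.uuid4().hex
        # Here the service would run XTTS v2 speaker-embedding extraction
        # on sample_path and persist the embedding.
        self._voices[voice_id] = ClonedVoice(voice_id, owner_id, sample_path)
        return voice_id

    def list_cloned_voices(self, owner_id: str) -> List[str]:
        return [v.voice_id for v in self._voices.values() if v.owner_id == owner_id]

    def delete_voice(self, voice_id: str) -> None:
        self._voices.pop(voice_id, None)
```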
---
### Phase 4: Revolutionary - Sign Language
---
#### [NEW] Sign Language Recognition 🤟
`backend/app/services/sign_language_service.py`
**Technology**: **MediaPipe Holistic** (feature extractor) + **1D Transformer Encoder** (sequence classifier).
**Pipeline**:
1. Video recording (24-30 FPS).
2. Extract landmarks (hands, pose, and a reduced set of face points).
3. Normalize coordinates relative to body root.
4. Transformer Encoder classifies frames into labels (labels trained on WLASL/MS-ASL).
**Frontend**: Webcam capture with real-time recognition overlay
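Step 3 of the pipeline, in isolation, looks like this. Using the hip midpoint as the body root and shoulder width as the scale reference is one common choice, assumed here for illustration rather than mandated by MediaPipe.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def normalize_landmarks(points: List[Point], root: Point, scale_ref: float) -> List[Point]:
    """Make landmarks translation- and scale-invariant:
    translate so `root` is the origin, then divide by `scale_ref`."""
    if scale_ref <= 0:
        raise ValueError("scale_ref must be positive")
    rx, ry = root
    return [((x - rx) / scale_ref, (y - ry) / scale_ref) for x, y in points]
```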
---
#### [NEW] Sign Language Generation
`backend/app/services/sign_avatar_service.py`
**Technology**: **Three.js + Ready Player Me Avatar** (or similar) mapped to Pose/Hand animation data.
**Pipeline**:
1. Text → Sign Search (Dictionary lookup for glosses).
2. Glosses → Pose Animation Stream (Pre-recorded BVH/Landmark data).
3. Render 3D Avatar performing signs in Streamlit via `streamlit-threejs` or static video.
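Step 1, the text-to-gloss lookup, can be sketched as below. The tiny dictionary is a placeholder for a real gloss lexicon (e.g. one derived from WLASL-style data), and falling back to fingerspelling for out-of-vocabulary words is an assumed design choice.

```python
from typing import Dict, List

# Placeholder lexicon mapping English words to sign glosses.
GLOSS_DICT: Dict[str, str] = {
    "hello": "HELLO",
    "my": "MY",
    "name": "NAME",
}

def text_to_glosses(text: str, fingerspell_unknown: bool = True) -> List[str]:
    """Map input text to a gloss sequence for the pose-animation stream."""
    glosses = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in GLOSS_DICT:
            glosses.append(GLOSS_DICT[word])
        elif fingerspell_unknown:
            # Out-of-vocabulary fallback: one fingerspelling gloss per letter.
            glosses.extend(f"FS-{ch.upper()}" for ch in word if ch.isalpha())
    return glosses
```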
---
### Phase 5: Platform & API
---
#### [MODIFY] API Authentication
`backend/app/core/security.py`
`backend/app/api/routes/auth.py`
```python
# Add API key management
@router.post("/api-keys")
async def create_api_key(user: User):
# Generate API key for programmatic access
# Rate limiting middleware
class RateLimitMiddleware:
- check_rate_limit(api_key)
- track_usage(api_key, endpoint)
```
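The rate-limiting middleware could be backed by a fixed-window counter like the one below. This sketch keeps counters in process memory; the production version would keep them in Redis so limits hold across workers. The `now` parameter exists only to make the logic testable.

```python
import time
from typing import Dict, Optional, Tuple

class RateLimiter:
    """Fixed-window rate limiter keyed by API key."""

    def __init__(self, max_requests: int, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self._counts: Dict[str, Tuple[float, int]] = {}  # key -> (window_start, count)

    def check_rate_limit(self, api_key: str, now: Optional[float] = None) -> bool:
        """Return True if the request is allowed, False if over the limit."""
        now = time.monotonic() if now is None else now
        start, count = self._counts.get(api_key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # new window begins
        if count >= self.max_requests:
            return False
        self._counts[api_key] = (start, count + 1)
        return True
```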
---
## New Dependencies
Add to `backend/requirements.txt`:
```txt
# Translation
transformers>=4.36.0
sentencepiece>=0.1.99
# Emotion Detection
speechbrain>=0.5.16
# Voice Cloning
TTS>=0.22.0 # Coqui TTS
# Sign Language
mediapipe>=0.10.9
opencv-python>=4.9.0
# Audio Editing (enhanced)
moviepy>=1.0.3
```
---
## Verification Plan
### Existing Tests (Verified)
Located in `backend/tests/`:
- `test_api_integration.py` - API endpoint tests
- `test_nlp.py` - NLP service tests
- `test_export.py` - Export format tests
- `test_diarization.py` - Diarization tests
**Run command**:
```bash
cd backend && pytest tests/ -v
```
### New Tests to Add
| Feature | Test File | Type |
|---------|-----------|------|
| Translation | `test_translation.py` | Unit + Integration |
| Batch Processing | `test_batch.py` | Integration |
| Live STT WebSocket | `test_ws_stt.py` | WebSocket test |
| Meeting Minutes | `test_meeting.py` | Integration |
| Emotion Detection | `test_emotion.py` | Unit |
| Audio Editor | `test_audio_editor.py` | Unit |
### Manual Testing
1. **Translation Flow**:
- Upload Hindi audio → Verify English transcript + translated audio output
2. **Sign Language** (requires webcam):
- User performs ASL signs → Verify text output matches
3. **Voice Cloning**:
- Upload 10s voice sample → Generate TTS → Verify voice similarity
---
## Architecture Diagram
```mermaid
graph TB
    subgraph FE["Frontend - Streamlit"]
        UI[Universal UI]
        PAGES["Transcribe | Synthesize | Translate | Sign | Meeting"]
    end
    subgraph GW["API Gateway"]
        AUTH[JWT/API Key Auth]
        RATE[Rate Limiter]
    end
    subgraph CORE["Core Services"]
        STT[Whisper STT]
        TTS[Edge TTS]
        TRANS[Translation]
        NLP[NLP Analysis]
    end
    subgraph AI["AI Intelligence"]
        MEETING[Meeting Minutes]
        EMOTION[Emotion Detection]
        VOCAB[Custom Vocabulary]
    end
    subgraph ADV["Advanced Features"]
        EDIT[Audio Editor]
        CLONE[Voice Cloning]
    end
    subgraph REV["Revolutionary"]
        SIGN_IN[Sign Recognition]
        SIGN_OUT[Sign Avatar]
    end
    UI --> AUTH
    AUTH --> RATE
    RATE --> CORE
    CORE --> AI
    CORE --> ADV
    CORE --> REV
```
---
## Estimated Timeline
| Phase | Features | Estimated Time |
|-------|----------|----------------|
| 1 | Translation, Batch, Live STT | 2-3 days |
| 2 | Meeting Minutes, Emotion, Vocabulary | 2-3 days |
| 3 | Audio Editor, Voice Cloning | 2-3 days |
| 4 | Sign Language (Recognition + Generation) | 5-7 days |
| 5 | API Auth, Landing Page | 1-2 days |
**Total**: ~12-18 days for full implementation
---
## Questions for You
1. **Which phase should we start with?** Recommend Phase 1 (Translation, Batch, Live STT) as foundation.
2. **Sign Language priority**: Start with ASL only, or multi-language from beginning?
3. **Voice Cloning consent**: Add user consent checkbox before allowing voice cloning?
4. **Hosting**: Any preferences for model hosting (local vs. cloud inference)?