# VoiceForge → Universal Communication Platform
## Implementation Plan
Transform VoiceForge from a Speech-to-Text/TTS application into a **Universal Communication Platform** supporting speech, text, translation, and sign language.
> [!IMPORTANT]
> This is a **major expansion** requiring significant development time. Recommend implementing in phases, with user testing between phases.
---
## User Review Required
### Scope & Priority Decisions
1. **Phase Priority**: Should we start with Phase 1 (Translation, Batch, Live STT) or jump to a different phase?
2. **Sign Language**: This is the most complex feature. Options:
- **Option A**: ASL only (most resources available)
- **Option B**: Multi-sign-language support (ASL, ISL, BSL) - requires more training data
- **Option C**: Defer to Phase 5 after core features stable
3. **Voice Cloning**: Coqui TTS is large (~2GB models). Accept increased storage requirements?
4. **API Authentication**: Enable for all endpoints or just new features?
---
## Proposed Changes
### Phase 1: Translation, Batch, Live STT

#### [NEW] Batch Transcription

```python
@router.post("/batch")
async def batch_transcribe(files: List[UploadFile]):
    # Offload to Celery workers using BatchedInferencePipeline (2-3x faster)
    # Return a job_id for tracking
    ...

@router.get("/batch/{job_id}/status")
async def batch_status(job_id: str):
    # Query Redis/DB for the status of each file in the job
    ...
```
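The job-tracking side of this endpoint can be sketched as follows. This is a minimal in-memory stand-in for the Redis/DB store mentioned above; the names `JobStore`, `submit`, and `complete_file` are illustrative, not the final API.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class BatchJob:
    job_id: str
    files: List[str]
    done: Dict[str, str] = field(default_factory=dict)  # filename -> transcript

class JobStore:
    """In-memory stand-in for the Redis/DB job store."""

    def __init__(self):
        self._jobs: Dict[str, BatchJob] = {}

    def submit(self, filenames: List[str]) -> str:
        # Called by the POST /batch handler before dispatching Celery tasks.
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = BatchJob(job_id, filenames)
        return job_id

    def complete_file(self, job_id: str, filename: str, transcript: str) -> None:
        # Called by a worker as each file finishes.
        self._jobs[job_id].done[filename] = transcript

    def status(self, job_id: str) -> dict:
        # Backs the GET /batch/{job_id}/status handler.
        job = self._jobs[job_id]
        return {
            "job_id": job_id,
            "total": len(job.files),
            "completed": len(job.done),
            "state": "done" if len(job.done) == len(job.files) else "processing",
        }
```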
---
#### [NEW] Real-time Live STT WebSocket
`backend/app/api/routes/ws_stt.py`
**Technology**: `faster-whisper` + `silero-vad` (Voice Activity Detection) + WebSockets.
**Architecture**:
1. **Frontend**: Record 16kHz mono audio, send 100-200ms binary chunks via WS.
2. **Backend**:
- Buffer chunks.
- Run **Silero-VAD** to detect speech vs silence.
- On silence OR 5s limit, run `model.transcribe` with `beam_size=1` (greedy) for <100ms latency.
- Send partial/final results back.
**Frontend**: New live transcription component with waveform visualization
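The backend buffering loop above can be sketched as a small state machine. A simple energy threshold stands in for Silero-VAD here, and the chunk sizes and threshold are illustrative placeholders rather than tuned values.

```python
from typing import List, Optional

SAMPLE_RATE = 16_000       # matches the 16 kHz mono stream from the frontend
MAX_BUFFER_SECONDS = 5.0   # hard flush limit from step 2

def is_speech(chunk: List[float], threshold: float = 0.01) -> bool:
    """Stand-in VAD: mean absolute amplitude. Silero-VAD replaces this."""
    return sum(abs(s) for s in chunk) / max(len(chunk), 1) > threshold

class StreamingBuffer:
    """Accumulates WS audio chunks; flushes a segment on silence or the 5 s cap."""

    def __init__(self):
        self._buf: List[float] = []

    def push(self, chunk: List[float]) -> Optional[List[float]]:
        if is_speech(chunk):
            self._buf.extend(chunk)
            if len(self._buf) >= SAMPLE_RATE * MAX_BUFFER_SECONDS:
                return self._flush()
            return None
        # Silence: flush whatever speech we have buffered so far.
        return self._flush() if self._buf else None

    def _flush(self) -> List[float]:
        segment, self._buf = self._buf, []
        # The caller would pass this to model.transcribe(..., beam_size=1)
        # and send the partial/final result back over the WebSocket.
        return segment
```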
---
### Phase 2: AI Intelligence Layer
#### [NEW] Intelligent Meeting Minutes
`backend/app/services/meeting_service.py`
`frontend/pages/8_📅_Meeting_Minutes.py`
**Goal**: Transform audio recordings into structured meeting minutes with speakers and action items.
**Changes**:
1. **Database**: Update `Transcript` model (`backend/app/models/transcript.py`) to add `action_items` (JSON) column.
2. **NLP Service**: Update `backend/app/services/nlp_service.py` to add regex-based action item extraction.
3. **Meeting Service**: Create orchestrator service:
* Input: Audio File
* Step 1: STT + Diarization (reuse `DiarizationService`)
* Step 2: Sentiment Analysis (reuse `NLPService`)
* Step 3: Summarization (reuse `NLPService`)
* Step 4: Action Item Extraction (new)
* Output: JSON with all metadata + DB record.
4. **Frontend**: Dashboard to view minutes, filter by speaker, and export to PDF.
**Key Dependencies**: `pyannote.audio` (already installed), `sumy` (already installed).
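Step 4, the regex-based action item extraction, can be sketched as below. The patterns are illustrative; the production list in `nlp_service.py` would be broader.

```python
import re
from typing import List

# Illustrative trigger patterns for action items ("X will ...",
# "we need to ...", explicit "action item:" markers).
ACTION_PATTERNS = [
    r"\b(?:I|we|you|\w+)\s+will\s+(.+)",
    r"\baction item[:\s]+(.+)",
    r"\b(?:need to|needs to|must|should)\s+(.+)",
]

def extract_action_items(transcript: str) -> List[str]:
    """Return candidate action-item sentences for the `action_items` JSON column."""
    items = []
    for sentence in re.split(r"[.!?]\s*", transcript):
        for pattern in ACTION_PATTERNS:
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                items.append(sentence.strip())
                break
    return items
```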
---
#### [PENDING] Emotion & Sentiment Analysis
(To be implemented in Phase 2.2)
`backend/app/services/emotion_service.py`
**Technology**: **HuBERT-Large (fine-tuned on IEMOCAP)**. Recent (2024) results suggest HuBERT captures prosody and pitch better than Wav2Vec2.
```python
class EmotionService:
- analyze_audio(audio_path) → {emotion: str, intensity: float, timeline: List}
# Emotions: neutral, happy, sad, angry, fearful, surprised
```
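Once this service exists, shaping the per-window classifier scores into the return format above is straightforward. A sketch, assuming the HuBERT head yields a probability per emotion for each analysis window (the window scores here are hypothetical):

```python
from typing import Dict, List, Tuple

def build_timeline(window_scores: List[Tuple[float, Dict[str, float]]]) -> dict:
    """window_scores: list of (start_seconds, {emotion: probability})."""
    timeline = []
    for start, scores in window_scores:
        emotion, prob = max(scores.items(), key=lambda kv: kv[1])
        timeline.append({"start": start, "emotion": emotion, "intensity": prob})
    # Overall label: the emotion that wins the most windows.
    counts: Dict[str, int] = {}
    for entry in timeline:
        counts[entry["emotion"]] = counts.get(entry["emotion"], 0) + 1
    overall = max(counts, key=counts.get)
    mean_intensity = sum(e["intensity"] for e in timeline) / len(timeline)
    return {"emotion": overall, "intensity": round(mean_intensity, 3), "timeline": timeline}
```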
---
#### [NEW] Custom Vocabulary
`backend/app/services/vocabulary_service.py`
**Strategy**: `faster-whisper` initialization with `initial_prompt` or keyword boosting (if supported by CTranslate2 version).
```python
class VocabularyService:
- generate_initial_prompt(user_id) # Feed custom words into Whisper's context
- apply_corrections(transcript, user_vocab) # Regex-based post-processing
```
**Presets**: Medical, Legal, Technical, Financial
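Both strategies can be sketched in a few lines. Whisper biases decoding toward tokens seen in `initial_prompt` (supported by `faster-whisper`'s `transcribe`); the glossary wording and the sample corrections below are illustrative.

```python
import re
from typing import Dict, List

def generate_initial_prompt(custom_words: List[str]) -> str:
    """Build a prompt that feeds the user's vocabulary into Whisper's context."""
    return "Glossary: " + ", ".join(custom_words) + "."

def apply_corrections(transcript: str, corrections: Dict[str, str]) -> str:
    """Regex post-processing: replace common misrecognitions, whole words only."""
    for wrong, right in corrections.items():
        transcript = re.sub(rf"\b{re.escape(wrong)}\b", right,
                            transcript, flags=re.IGNORECASE)
    return transcript
```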
---
### Phase 3: Advanced Audio
---
#### [NEW] Audio Editor
`backend/app/services/audio_editor_service.py`
```python
class AudioEditorService:
- trim(audio_path, start, end) → trimmed_audio
- merge(audio_paths: List) → merged_audio
- convert_format(audio_path, target_format)
- extract_segment(audio_path, timestamp)
```
**Frontend**: Waveform editor with drag-select for trim/cut
---
#### [NEW] Voice Cloning
`backend/app/services/voice_clone_service.py`
**Technology**: Coqui XTTS v2 (multilingual, 3-10s voice sample needed)
```python
class VoiceCloneService:
- clone_voice(sample_audio) → voice_id
- synthesize_with_voice(text, voice_id) → audio
- list_cloned_voices(user_id)
- delete_voice(voice_id)
```
> [!WARNING]
> Voice cloning has ethical implications. Consider adding consent verification.
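A consent gate fits naturally into the service's bookkeeping. This sketch covers only the registry side; the actual XTTS v2 embedding/synthesis calls are elided, and `ConsentError` and the registry API are illustrative names.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List

class ConsentError(Exception):
    """Raised when cloning is attempted without recorded speaker consent."""

@dataclass
class ClonedVoice:
    voice_id: str
    owner_id: str
    sample_path: str

class VoiceRegistry:
    def __init__(self):
        self._voices: Dict[str, ClonedVoice] = {}

    def clone_voice(self, owner_id: str, sample_path: str, consent_given: bool) -> str:
        if not consent_given:
            raise ConsentError("Explicit speaker consent is required before cloning.")
        voice_id = uuid.uuid4().hex
        # Here the service would run XTTS v2 speaker-embedding extraction
        # on sample_path and persist the embedding.
        self._voices[voice_id] = ClonedVoice(voice_id, owner_id, sample_path)
        return voice_id

    def list_cloned_voices(self, owner_id: str) -> List[str]:
        return [v.voice_id for v in self._voices.values() if v.owner_id == owner_id]

    def delete_voice(self, voice_id: str) -> None:
        self._voices.pop(voice_id, None)
```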
---
### Phase 4: Revolutionary - Sign Language
---
#### [NEW] Sign Language Recognition 🤟
`backend/app/services/sign_language_service.py`
**Technology**: **MediaPipe Holistic** (feature extractor) + **1D Transformer Encoder** (sequence classifier).
**Pipeline**:
1. Video recording (24-30 FPS).
2. Extract landmarks (hands, pose, and a reduced set of face points).
3. Normalize coordinates relative to body root.
4. Transformer Encoder classifies frames into labels (labels trained on WLASL/MS-ASL).
**Frontend**: Webcam capture with real-time recognition overlay
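Step 3 of the pipeline, in isolation, looks like this. Using the hip midpoint as the body root and shoulder width as the scale reference is one common choice, assumed here for illustration rather than mandated by MediaPipe.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def normalize_landmarks(points: List[Point], root: Point, scale_ref: float) -> List[Point]:
    """Make landmarks translation- and scale-invariant:
    translate so `root` is the origin, then divide by `scale_ref`."""
    if scale_ref <= 0:
        raise ValueError("scale_ref must be positive")
    rx, ry = root
    return [((x - rx) / scale_ref, (y - ry) / scale_ref) for x, y in points]
```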
---
#### [NEW] Sign Language Generation
`backend/app/services/sign_avatar_service.py`
**Technology**: **Three.js + Ready Player Me Avatar** (or similar) mapped to Pose/Hand animation data.
**Pipeline**:
1. Text → Sign Search (Dictionary lookup for glosses).
2. Glosses → Pose Animation Stream (Pre-recorded BVH/Landmark data).
3. Render 3D Avatar performing signs in Streamlit via `streamlit-threejs` or static video.
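Step 1, the text-to-gloss lookup, can be sketched as below. The tiny dictionary is a placeholder for a real gloss lexicon (e.g. one derived from WLASL-style data), and falling back to fingerspelling for out-of-vocabulary words is an assumed design choice.

```python
from typing import Dict, List

# Placeholder lexicon mapping English words to sign glosses.
GLOSS_DICT: Dict[str, str] = {
    "hello": "HELLO",
    "my": "MY",
    "name": "NAME",
}

def text_to_glosses(text: str, fingerspell_unknown: bool = True) -> List[str]:
    """Map input text to a gloss sequence for the pose-animation stream."""
    glosses = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in GLOSS_DICT:
            glosses.append(GLOSS_DICT[word])
        elif fingerspell_unknown:
            # Out-of-vocabulary fallback: one fingerspelling gloss per letter.
            glosses.extend(f"FS-{ch.upper()}" for ch in word if ch.isalpha())
    return glosses
```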
---
### Phase 5: Platform & API
---
#### [MODIFY] API Authentication
`backend/app/core/security.py`
`backend/app/api/routes/auth.py`
```python
# Add API key management
@router.post("/api-keys")
async def create_api_key(user: User):
# Generate API key for programmatic access
# Rate limiting middleware
class RateLimitMiddleware:
- check_rate_limit(api_key)
- track_usage(api_key, endpoint)
```
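The rate-limiting middleware could be backed by a fixed-window counter like the one below. This sketch keeps counters in process memory; the production version would keep them in Redis so limits hold across workers. The `now` parameter exists only to make the logic testable.

```python
import time
from typing import Dict, Optional, Tuple

class RateLimiter:
    """Fixed-window rate limiter keyed by API key."""

    def __init__(self, max_requests: int, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self._counts: Dict[str, Tuple[float, int]] = {}  # key -> (window_start, count)

    def check_rate_limit(self, api_key: str, now: Optional[float] = None) -> bool:
        """Return True if the request is allowed, False if over the limit."""
        now = time.monotonic() if now is None else now
        start, count = self._counts.get(api_key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # new window begins
        if count >= self.max_requests:
            return False
        self._counts[api_key] = (start, count + 1)
        return True
```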
---
## New Dependencies
Add to `backend/requirements.txt`:
```txt
# Translation
transformers>=4.36.0
sentencepiece>=0.1.99
# Emotion Detection
speechbrain>=0.5.16
# Voice Cloning
TTS>=0.22.0 # Coqui TTS
# Sign Language
mediapipe>=0.10.9
opencv-python>=4.9.0
# Audio Editing (enhanced)
moviepy>=1.0.3
```
---
## Verification Plan
### Existing Tests (Verified)
Located in `backend/tests/`:
- `test_api_integration.py` - API endpoint tests
- `test_nlp.py` - NLP service tests
- `test_export.py` - Export format tests
- `test_diarization.py` - Diarization tests
**Run command**:
```bash
cd backend && pytest tests/ -v
```
### New Tests to Add
| Feature | Test File | Type |
|---------|-----------|------|
| Translation | `test_translation.py` | Unit + Integration |
| Batch Processing | `test_batch.py` | Integration |
| Live STT WebSocket | `test_ws_stt.py` | WebSocket test |
| Meeting Minutes | `test_meeting.py` | Integration |
| Emotion Detection | `test_emotion.py` | Unit |
| Audio Editor | `test_audio_editor.py` | Unit |
### Manual Testing
1. **Translation Flow**:
- Upload Hindi audio → Verify English transcript + translated audio output
2. **Sign Language** (requires webcam):
- User performs ASL signs → Verify text output matches
3. **Voice Cloning**:
- Upload 10s voice sample → Generate TTS → Verify voice similarity
---
## Architecture Diagram
```mermaid
graph TB
    subgraph FE["Frontend - Streamlit"]
        UI[Universal UI]
        PAGES["Transcribe | Synthesize | Translate | Sign | Meeting"]
    end
    subgraph GW["API Gateway"]
        AUTH[JWT/API Key Auth]
        RATE[Rate Limiter]
    end
    subgraph CORE["Core Services"]
        STT[Whisper STT]
        TTS[Edge TTS]
        TRANS[Translation]
        NLP[NLP Analysis]
    end
    subgraph AI["AI Intelligence"]
        MEETING[Meeting Minutes]
        EMOTION[Emotion Detection]
        VOCAB[Custom Vocabulary]
    end
    subgraph ADV["Advanced Features"]
        EDIT[Audio Editor]
        CLONE[Voice Cloning]
    end
    subgraph REV["Revolutionary"]
        SIGN_IN[Sign Recognition]
        SIGN_OUT[Sign Avatar]
    end
    UI --> AUTH
    AUTH --> RATE
    RATE --> CORE
    CORE --> AI
    CORE --> ADV
    CORE --> REV
```
---
## Estimated Timeline
| Phase | Features | Estimated Time |
|-------|----------|----------------|
| 1 | Translation, Batch, Live STT | 2-3 days |
| 2 | Meeting Minutes, Emotion, Vocabulary | 2-3 days |
| 3 | Audio Editor, Voice Cloning | 2-3 days |
| 4 | Sign Language (Recognition + Generation) | 5-7 days |
| 5 | API Auth, Landing Page | 1-2 days |
**Total**: ~12-18 days for full implementation
---
## Questions for You
1. **Which phase should we start with?** Recommend Phase 1 (Translation, Batch, Live STT) as foundation.
2. **Sign Language priority**: Start with ASL only, or multi-language from beginning?
3. **Voice Cloning consent**: Add user consent checkbox before allowing voice cloning?
4. **Hosting**: Any preferences for model hosting (local vs. cloud inference)?