Groq STT Integration Plan: HuggingFace to Groq Migration Strategy
Executive Summary
Following our successful TTS migration from the Kyutai HuggingFace service to Groq, which delivered significant performance improvements, we're now planning a surgical replacement of our Speech-to-Text (STT) service: from the HuggingFace STT-GPU-Service-v2 to Groq's Whisper-large-v3-turbo implementation.
Current STT Architecture (To Be Replaced)
HuggingFace Integration:
- External service: pgits-stt-gpu-service-v2.hf.space
- Complex WebSocket queue system for results
- HTTP POST → WebSocket listener pattern
- Base64 audio transmission
- Gradio client integration with session management
Technical Stack:
- Frontend: JavaScript MediaRecorder → Base64 conversion
- Transport: HTTP POST + WebSocket queue listener
- Backend: External HuggingFace Spaces service
- Dependencies: External service availability, queue management
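One cost of the Base64 audio transmission listed above is payload size: Base64 encodes every 3 bytes as 4 characters, so the encoded audio is roughly a third larger than the raw recording. A minimal sketch of that arithmetic, using random bytes as a stand-in for recorded audio (the helper name `base64_overhead` is illustrative and not part of the codebase):

```python
import base64
import os


def base64_overhead(raw: bytes) -> float:
    """Return the encoded-to-raw size ratio for Base64 transmission."""
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw)


# 300 KB of stand-in audio bytes (random data, not a real recording)
sample = os.urandom(300 * 1024)
ratio = base64_overhead(sample)
print(f"Base64 payload is {ratio:.2f}x the raw size")
```

Sending the audio as a binary multipart upload, as in the proposed architecture, avoids this inflation on the wire.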
Proposed Groq STT Architecture
Groq Integration:
- Direct API calls to Groq's Whisper service
- Simplified HTTP request/response pattern
- FastAPI proxy endpoint for CORS handling
- Same audio quality with reduced complexity
Implementation Details:
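# New FastAPI Endpoint
import os

from fastapi import File, UploadFile
from groq import AsyncGroq

@app.post("/api/stt/transcribe")
async def stt_transcribe(file: UploadFile = File(...)):
    # Async client so the transcription call does not block the event loop
    client = AsyncGroq(api_key=os.environ.get("GROQ_API_KEY"))
    transcription = await client.audio.transcriptions.create(
        # Pass (filename, bytes) so the API can infer the audio format
        file=(file.filename, await file.read()),
        model="whisper-large-v3-turbo",
        response_format="json",
        language="en",
        temperature=0.0,
    )
    return {"text": transcription.text}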
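// Simplified Frontend Integration
async transcribeAudio(audioBase64) {
    // Decode the recorded audio and send it as multipart form data
    const audioBlob = this.base64ToBlob(audioBase64);
    const formData = new FormData();
    // The recording path is WebM/Opus, so name the upload accordingly
    formData.append('file', audioBlob, 'audio.webm');
    const response = await fetch('/api/stt/transcribe', {
        method: 'POST', body: formData
    });
    // Surface API errors instead of parsing a failed response as a result
    if (!response.ok) {
        throw new Error(`STT request failed: ${response.status}`);
    }
    const result = await response.json();
    this.addTranscriptionToInput(result.text);
}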
Migration Benefits
Performance Improvements
- Elimination of WebSocket complexity - Direct HTTP API calls
- Reduced latency - No external queue system
- Faster transcription - Groq's optimized Whisper implementation
- Simplified error handling - No connection state management
Operational Benefits
- Consolidated authentication - Uses existing GROQ_API_KEY
- Reduced dependencies - No external HuggingFace service reliance
- Cost optimization - Direct API usage vs. external compute
- Improved reliability - Fewer points of failure
Development Benefits
- Code simplification - Remove WebSocket queue logic
- Easier debugging - Standard HTTP request/response pattern
- Better error visibility - Direct API error responses
- Consistent architecture - Matches our TTS implementation pattern
Surgical Implementation Plan
Files to Modify (Minimal Impact)
- app/api/main.py - Add new /api/stt/transcribe endpoint
- app/api/chat_widget.py - Replace transcribeAudio() method (lines 1151-1211)
- Requirements - Already satisfied (groq>=0.4.0 from TTS migration)
Files NOT Modified (Preservation Strategy)
- Audio recording logic (MediaRecorder)
- Visual state management (STT indicators)
- User interface components
- Session management
- TTS interruption system (recently enhanced)
Risk Mitigation
- Identical API contract - Same input (audio) → output (text) pattern
- Progressive deployment - Can switch back via configuration
- Preserved user experience - No UI changes required
- Same audio quality - WebM/Opus → Whisper transcription path maintained
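The "switch back via configuration" item could be realized in several ways. The following is a minimal sketch, assuming a hypothetical STT_PROVIDER environment variable and two placeholder backend functions; none of these names come from the existing codebase:

```python
import os
from typing import Callable, Dict, Mapping, Optional


# Hypothetical placeholders: in the real app these would wrap the Groq API
# call and the legacy HuggingFace WebSocket flow respectively.
def transcribe_with_groq(audio: bytes) -> str:
    return "groq transcript (placeholder)"


def transcribe_with_huggingface(audio: bytes) -> str:
    return "huggingface transcript (placeholder)"


BACKENDS: Dict[str, Callable[[bytes], str]] = {
    "groq": transcribe_with_groq,
    "huggingface": transcribe_with_huggingface,
}


def select_stt_backend(
    env: Optional[Mapping[str, str]] = None,
) -> Callable[[bytes], str]:
    """Pick the STT backend from STT_PROVIDER (assumed name), defaulting to Groq."""
    env = os.environ if env is None else env
    name = env.get("STT_PROVIDER", "groq").strip().lower()
    if name not in BACKENDS:
        raise ValueError(f"Unknown STT_PROVIDER: {name!r}")
    return BACKENDS[name]
```

Because both backends share the same audio-in, text-out signature, rolling back would only require setting STT_PROVIDER=huggingface and restarting the service, provided the legacy code path is kept in place during the rollout.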
Success Metrics
- Transcription latency reduction (target: <2 seconds)
- Error rate improvement (eliminate WebSocket timeouts)
- Code complexity reduction (remove 100+ lines of WebSocket handling)
- Infrastructure simplification (single API key vs. external service)
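To check the latency target above, the transcription call can be timed over a set of sample recordings. This is a rough sketch only: the function names are illustrative, and judging the target by the slowest call (rather than the median or a percentile) is an assumption, since the plan does not say how the 2-second target is evaluated.

```python
import statistics
import time
from typing import Callable, Dict, List

TARGET_SECONDS = 2.0  # latency target stated in the success metrics


def measure_latencies(
    transcribe: Callable[[bytes], str], samples: List[bytes]
) -> List[float]:
    """Return wall-clock seconds taken by `transcribe` for each audio sample."""
    latencies = []
    for audio in samples:
        start = time.perf_counter()
        transcribe(audio)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: List[float], target: float = TARGET_SECONDS) -> Dict[str, object]:
    """Summarize latencies and report whether the slowest call met the target."""
    return {
        "median": statistics.median(latencies),
        "max": max(latencies),
        "meets_target": max(latencies) < target,
    }
```

Running the same samples through the old and new backends would give a like-for-like comparison for the latency metric.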
Timeline
- Phase 1: Implementation (FastAPI endpoint + frontend method)
- Phase 2: Testing (transcription accuracy and performance)
- Phase 3: Deployment (surgical replacement with rollback capability)
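Phase 2 does not say how transcription accuracy will be measured. One common choice is word error rate (WER): the word-level edit distance between a reference transcript and the STT output, divided by the number of reference words. A minimal sketch, assuming simple lowercase whitespace tokenization (real evaluations usually also normalize punctuation):

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)
```

For example, word_error_rate("turn on the lights", "turn off the lights") yields 0.25, since one of the four reference words was substituted. Comparing WER for both services on the same recordings would show whether the migration preserves accuracy.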
Architectural Philosophy
This migration continues our platform consolidation strategy: moving from distributed external services to unified API providers while maintaining service quality and user experience. The Groq ecosystem (TTS + STT) provides performance advantages and operational simplification compared to our current mixed-provider approach.
This document serves as the technical blueprint for our HuggingFace → Groq STT migration, ensuring stakeholder alignment and implementation clarity.
#AI #SpeechToText #Groq #HuggingFace #TechnicalStrategy #VoiceAI #SystemArchitecture