🚀 Groq STT Integration Plan: HuggingFace to Groq Migration Strategy

Executive Summary

Following our successful TTS migration from the Kyutai HuggingFace service to Groq, which delivered significant performance improvements, we are now planning a surgical replacement of our Speech-to-Text (STT) service: HuggingFace STT-GPU-Service-v2 will give way to Groq's whisper-large-v3-turbo model.

Current STT Architecture (To Be Replaced)

HuggingFace Integration:

  • External service: pgits-stt-gpu-service-v2.hf.space
  • Complex WebSocket queue system for results
  • HTTP POST → WebSocket listener pattern
  • Base64 audio transmission
  • Gradio client integration with session management

Technical Stack:

  • Frontend: JavaScript MediaRecorder → Base64 conversion
  • Transport: HTTP POST + WebSocket queue listener
  • Backend: External HuggingFace Spaces service
  • Dependencies: External service availability, queue management

Proposed Groq STT Architecture

Groq Integration:

  • Direct API calls to Groq's Whisper service
  • Simplified HTTP request/response pattern
  • FastAPI proxy endpoint, so the browser calls our own origin (no CORS issues) and GROQ_API_KEY stays server-side
  • Same audio quality with reduced complexity

Implementation Details:

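# New FastAPI endpoint (added to app/api/main.py, where `app` is already defined)
import os

from fastapi import File, UploadFile
from groq import Groq

@app.post("/api/stt/transcribe")
async def stt_transcribe(file: UploadFile = File(...)):
    client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

    # Pass (filename, bytes) so the API can infer the audio format from the extension
    audio_bytes = await file.read()
    transcription = client.audio.transcriptions.create(
        file=(file.filename, audio_bytes),
        model="whisper-large-v3-turbo",
        response_format="json",
        language="en",
        temperature=0.0,
    )

    return {"text": transcription.text}

// Simplified frontend integration
async transcribeAudio(audioBase64) {
    const audioBlob = this.base64ToBlob(audioBase64);
    const formData = new FormData();
    // MediaRecorder output is WebM/Opus, so name the upload accordingly
    formData.append('file', audioBlob, 'audio.webm');

    const response = await fetch('/api/stt/transcribe', {
        method: 'POST', body: formData
    });
    // Surface HTTP failures instead of parsing an error body as a transcript
    if (!response.ok) {
        throw new Error(`STT request failed: ${response.status}`);
    }

    const result = await response.json();
    this.addTranscriptionToInput(result.text);
}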

Migration Benefits

Performance Improvements

  • Elimination of WebSocket complexity - Direct HTTP API calls
  • Reduced latency - No external queue system
  • Faster transcription - Groq's optimized Whisper implementation
  • Simplified error handling - No connection state management

Operational Benefits

  • Consolidated authentication - Uses existing GROQ_API_KEY
  • Reduced dependencies - No external HuggingFace service reliance
  • Cost optimization - Direct API usage vs. external compute
  • Improved reliability - Fewer points of failure

Development Benefits

  • Code simplification - Remove WebSocket queue logic
  • Easier debugging - Standard HTTP request/response pattern
  • Better error visibility - Direct API error responses
  • Consistent architecture - Matches our TTS implementation pattern
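
The "direct API error responses" benefit could be realized in the proxy endpoint with a small mapping from the upstream HTTP status to the status and message the frontend sees. The following is a minimal sketch rather than the project's code: `map_upstream_error` is a hypothetical helper, and the specific status choices are assumptions made for illustration.

```python
from typing import Tuple


def map_upstream_error(upstream_status: int) -> Tuple[int, str]:
    """Translate an upstream STT API status into (proxy_status, user_message).

    Hypothetical helper: the status choices below are illustrative assumptions.
    """
    if upstream_status == 429:
        # Rate limited upstream: pass it through so the client can back off.
        return 429, "Transcription is busy, please retry shortly."
    if upstream_status in (401, 403):
        # A bad or missing API key is a server-side configuration problem.
        return 502, "Transcription service is misconfigured."
    if 400 <= upstream_status < 500:
        # Other client errors usually mean the audio payload was rejected.
        return 400, "Audio could not be processed."
    # 5xx and anything unexpected: report the upstream as unavailable.
    return 502, "Transcription service is unavailable."


if __name__ == "__main__":
    for status in (429, 401, 413, 500):
        print(status, "->", map_upstream_error(status))
```

Because every failure arrives as a plain HTTP status, a single function like this can replace the connection-state handling that the WebSocket listener needed.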

Surgical Implementation Plan

Files to Modify (Minimal Impact)

  1. app/api/main.py - Add new /api/stt/transcribe endpoint
  2. app/api/chat_widget.py - Replace transcribeAudio() method (lines 1151-1211)
  3. Requirements - Already satisfied (groq>=0.4.0 from TTS migration)

Files NOT Modified (Preservation Strategy)

  • Audio recording logic (MediaRecorder)
  • Visual state management (STT indicators)
  • User interface components
  • Session management
  • TTS interruption system (recently enhanced)

Risk Mitigation

  • Identical API contract - Same input (audio) → output (text) pattern
  • Progressive deployment - Can switch back via configuration
  • Preserved user experience - No UI changes required
  • Same audio quality - WebM/Opus → Whisper transcription path maintained
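
The "switch back via configuration" item could be as simple as an environment variable read at request time. This sketch assumes a setting named `STT_PROVIDER` with values `groq` and `huggingface`; the plan does not specify either, so treat the names and the `select_stt_provider` helper as hypothetical.

```python
import os
from typing import Mapping

# Hypothetical setting name and values; the plan does not specify them.
STT_PROVIDER_ENV = "STT_PROVIDER"
VALID_PROVIDERS = ("groq", "huggingface")


def select_stt_provider(env: Mapping[str, str] = os.environ) -> str:
    """Return the configured STT backend, defaulting to Groq."""
    value = env.get(STT_PROVIDER_ENV, "groq").strip().lower()
    if value not in VALID_PROVIDERS:
        # Fail loudly on typos rather than silently picking a backend.
        raise ValueError(f"Unknown STT provider: {value!r}")
    return value


if __name__ == "__main__":
    print(select_stt_provider({}))
    print(select_stt_provider({"STT_PROVIDER": "huggingface"}))
```

Rolling back would then mean setting the variable to `huggingface` and restarting the service, provided the old code path is kept until the migration is confirmed stable.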

Success Metrics

  • Transcription latency reduction (target: <2 seconds)
  • Error rate improvement (eliminate WebSocket timeouts)
  • Code complexity reduction (remove 100+ lines of WebSocket handling)
  • Infrastructure simplification (single API key vs. external service)
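
To track the latency target during testing, the transcription call could be wrapped in a timer. The sketch below is illustrative only: `timed_transcribe` and `meets_latency_target` are hypothetical helpers, and the measurement covers only the wrapped call, not browser upload time or audio recording.

```python
import time
from typing import Callable, Tuple

LATENCY_TARGET_SECONDS = 2.0  # the "<2 seconds" target from the metrics list


def timed_transcribe(
    transcribe: Callable[[bytes], str], audio: bytes
) -> Tuple[str, float]:
    """Run a transcription callable and return (text, elapsed_seconds)."""
    start = time.perf_counter()
    text = transcribe(audio)
    elapsed = time.perf_counter() - start
    return text, elapsed


def meets_latency_target(
    elapsed: float, target: float = LATENCY_TARGET_SECONDS
) -> bool:
    """True when a single request finished strictly under the target."""
    return elapsed < target


if __name__ == "__main__":
    # Stand-in transcriber; a real run would call the STT backend instead.
    text, elapsed = timed_transcribe(lambda audio: "hello world", b"\x00\x01")
    print(text, meets_latency_target(elapsed))
```

Passing the transcriber in as a callable lets the same timer wrap both the old and the new backend, so the two can be compared on identical audio.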

Timeline

  • Phase 1: Implementation (FastAPI endpoint + frontend method)
  • Phase 2: Testing (transcription accuracy and performance)
  • Phase 3: Deployment (surgical replacement with rollback capability)

Architectural Philosophy

This migration continues our platform consolidation strategy: moving from distributed external services to unified API providers while maintaining service quality and user experience. The Groq ecosystem (TTS + STT) provides performance advantages and operational simplification compared to our current mixed-provider approach.


This document serves as the technical blueprint for our HuggingFace → Groq STT migration, ensuring stakeholder alignment and implementation clarity.

#AI #SpeechToText #Groq #HuggingFace #TechnicalStrategy #VoiceAI #SystemArchitecture