pgits Claude commited on
Commit
74910cf
Β·
1 Parent(s): e4ce72e

FEATURE: Groq STT Integration - Replace HuggingFace with Groq Whisper

Browse files

- Add new /api/stt/transcribe endpoint using Groq Whisper-large-v3-turbo
- Replace complex WebSocket transcribeAudio method with direct HTTP API calls
- Remove 90+ lines of WebSocket queue management and polling logic
- Simplified error handling with standard HTTP request/response pattern
- Maintains identical user experience while improving performance and reliability
- Uses existing GROQ_API_KEY for consolidated authentication

πŸ€– Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Files changed (4) hide show
  1. LinkedIn.md +110 -87
  2. app/api/chat_widget.py +13 -124
  3. app/api/main.py +53 -0
  4. version.txt +1 -1
LinkedIn.md CHANGED
@@ -1,95 +1,118 @@
1
- # πŸš€ VoiceCal.ai v1.0.0 - Clean Architecture Migration
2
-
3
- ## Migration Success: From Nested Chaos to Clean Deployment
4
-
5
- Today I successfully migrated **ChatCal.ai to VoiceCal.ai v1.0.0** with a completely clean repository structure for HuggingFace Spaces deployment.
6
-
7
- ### The Problem We Solved
8
- - **Nested directory confusion**: `/chatcal-ai/chatcal-ai/app/` causing path issues
9
- - **Git workflow inconsistencies**: File upload vs. proper version control
10
- - **Docker caching issues**: HuggingFace showing stale Dockerfiles
11
- - **Deployment complexity**: Multiple path adjustments breaking deployments
12
-
13
- ### The Solution: Fresh Start Architecture
14
-
15
- #### Before (Problematic):
16
- ```
17
- /chatCal.ai/chatcal-ai/ (repo root)
18
- β”œβ”€β”€ chatcal-ai/ (actual app nested)
19
- β”œβ”€β”€ speech2text/ (unrelated)
20
- β”œβ”€β”€ CLAUDE.md (unrelated)
21
- └── legacy files...
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
22
  ```
23
 
24
- #### After (Clean):
25
- ```
26
- /voiceCal-ai-v1/ (repo root = HF Space name)
27
- β”œβ”€β”€ app/ (FastAPI directly at root)
28
- β”œβ”€β”€ Dockerfile (standard deployment)
29
- β”œβ”€β”€ app.py (simple entry point)
30
- β”œβ”€β”€ requirements.txt
31
- └── pyproject.toml
32
- ```
33
-
34
- ### Technical Improvements
35
-
36
- #### πŸ—οΈ **Architecture**
37
- - **Standard Docker structure**: No more nested path confusion
38
- - **Direct file access**: All imports work without path manipulation
39
- - **Clean git workflow**: Proper semantic versioning (v1.0.0)
40
-
41
- #### πŸ”§ **Development**
42
- - **Debugging tools**: wget, git, curl, procps, htop, nano, vim, net-tools, lsof, strace
43
- - **Clean Dockerfile**: No more tee command causing container exit
44
- - **Proper logging**: Both stdout and `/tmp/app.log` for SSH debugging
45
 
46
- #### 🎯 **Deployment**
47
- - **Fresh HuggingFace Space**: `pgits/voiceCal-ai-v1`
48
- - **No legacy caching**: Clean start eliminates previous issues
49
- - **Consistent versioning**: Semantic version matches deployment
50
 
51
- ### Migration Process
52
-
53
- 1. βœ… **Created clean directory structure** matching HF Space name
54
- 2. βœ… **Copied essential files** (app/, Dockerfile, requirements.txt, etc.)
55
- 3. βœ… **Updated all path references** for standard deployment
56
- 4. βœ… **Initialized fresh git repo** with proper main branch
57
- 5. βœ… **Ready for fresh HF Space** deployment
58
-
59
- ### Key Lessons Learned
60
-
61
- #### 🎯 **Project Structure Matters**
62
- - Directory naming should match deployment target
63
- - Avoid nested structures that complicate Docker deployments
64
- - Keep unrelated files separate from deployment artifacts
65
-
66
- #### πŸ”„ **Git Workflow Discipline**
67
- - Always use proper git workflow vs. ad-hoc file uploads
68
- - Semantic versioning prevents deployment confusion
69
- - Clean commit history aids debugging
70
-
71
- #### 🐳 **Docker Best Practices**
72
- - Standard WORKDIR structure
73
- - Include debugging tools for production troubleshooting
74
- - Avoid complex shell commands in CMD that can cause exit issues
75
-
76
- ### Next Steps
77
-
78
- Now ready to create fresh HuggingFace Space `pgits/voiceCal-ai-v1` with:
79
- - βœ… Clean repository structure
80
- - βœ… Standard Docker deployment
81
- - βœ… All debugging tools included
82
- - βœ… Proper git workflow established
83
- - βœ… Semantic versioning (v1.0.0)
84
-
85
- **Migration Summary:**
86
- - **Old**: Nested complexity, path confusion, deployment issues
87
- - **New**: Clean structure, standard paths, reliable deployment
88
 
89
- This migration eliminates the file inconsistencies and deployment issues we were experiencing, setting up a professional foundation for future development.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
90
 
91
  ---
 
92
 
93
- #VoiceAI #HuggingFace #Docker #SoftwareEngineering #CleanArchitecture #FastAPI #Deployment #GitWorkflow
94
-
95
- **VoiceCal.ai v1.0.0** - Professional voice-first calendar booking with Google Calendar integration, STT/TTS, and Groq LLM backend.
 
1
+ # πŸš€ Groq STT Integration Plan: HuggingFace to Groq Migration Strategy
2
+
3
+ ## Executive Summary
4
+ Following our successful TTS migration from Kyutai HuggingFace service to Groq (achieving significant performance improvements), we're now planning a surgical replacement of our Speech-to-Text (STT) service from HuggingFace STT-GPU-Service-v2 to Groq's Whisper-large-v3-turbo implementation.
5
+
6
+ ## Current STT Architecture (To Be Replaced)
7
+ **HuggingFace Integration:**
8
+ - External service: `pgits-stt-gpu-service-v2.hf.space`
9
+ - Complex WebSocket queue system for results
10
+ - HTTP POST β†’ WebSocket listener pattern
11
+ - Base64 audio transmission
12
+ - Gradio client integration with session management
13
+
14
+ **Technical Stack:**
15
+ - Frontend: JavaScript MediaRecorder β†’ Base64 conversion
16
+ - Transport: HTTP POST + WebSocket queue listener
17
+ - Backend: External HuggingFace Spaces service
18
+ - Dependencies: External service availability, queue management
19
+
20
+ ## Proposed Groq STT Architecture
21
+ **Groq Integration:**
22
+ - Direct API calls to Groq's Whisper service
23
+ - Simplified HTTP request/response pattern
24
+ - FastAPI proxy endpoint for CORS handling
25
+ - Same audio quality with reduced complexity
26
+
27
+ **Implementation Details:**
28
+ ```python
29
+ # New FastAPI Endpoint
30
+ @app.post("/api/stt/transcribe")
31
+ async def stt_transcribe(file: UploadFile = File(...)):
32
+ client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
33
+
34
+ transcription = client.audio.transcriptions.create(
35
+ file=file.file,
36
+ model="whisper-large-v3-turbo",
37
+ response_format="json",
38
+ language="en",
39
+ temperature=0.0
40
+ )
41
+
42
+ return {"text": transcription.text}
43
  ```
44
 
45
+ ```javascript
46
+ // Simplified Frontend Integration
47
+ async transcribeAudio(audioBase64) {
48
+ const audioBlob = this.base64ToBlob(audioBase64);
49
+ const formData = new FormData();
50
+ formData.append('file', audioBlob, 'audio.wav');
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
51
 
52
+ const response = await fetch('/api/stt/transcribe', {
53
+ method: 'POST', body: formData
54
+ });
 
55
 
56
+ const result = await response.json();
57
+ this.addTranscriptionToInput(result.text);
58
+ }
59
+ ```
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
60
 
61
+ ## Migration Benefits
62
+
63
+ ### Performance Improvements
64
+ - **Elimination of WebSocket complexity** - Direct HTTP API calls
65
+ - **Reduced latency** - No external queue system
66
+ - **Faster transcription** - Groq's optimized Whisper implementation
67
+ - **Simplified error handling** - No connection state management
68
+
69
+ ### Operational Benefits
70
+ - **Consolidated authentication** - Uses existing GROQ_API_KEY
71
+ - **Reduced dependencies** - No external HuggingFace service reliance
72
+ - **Cost optimization** - Direct API usage vs. external compute
73
+ - **Improved reliability** - Fewer points of failure
74
+
75
+ ### Development Benefits
76
+ - **Code simplification** - Remove WebSocket queue logic
77
+ - **Easier debugging** - Standard HTTP request/response pattern
78
+ - **Better error visibility** - Direct API error responses
79
+ - **Consistent architecture** - Matches our TTS implementation pattern
80
+
81
+ ## Surgical Implementation Plan
82
+
83
+ ### Files to Modify (Minimal Impact)
84
+ 1. **app/api/main.py** - Add new `/api/stt/transcribe` endpoint
85
+ 2. **app/api/chat_widget.py** - Replace `transcribeAudio()` method (lines 1151-1211)
86
+ 3. **Requirements** - Already satisfied (groq>=0.4.0 from TTS migration)
87
+
88
+ ### Files NOT Modified (Preservation Strategy)
89
+ - Audio recording logic (MediaRecorder)
90
+ - Visual state management (STT indicators)
91
+ - User interface components
92
+ - Session management
93
+ - TTS interruption system (recently enhanced)
94
+
95
+ ## Risk Mitigation
96
+ - **Identical API contract** - Same input (audio) β†’ output (text) pattern
97
+ - **Progressive deployment** - Can switch back via configuration
98
+ - **Preserved user experience** - No UI changes required
99
+ - **Same audio quality** - WebM/Opus β†’ Whisper transcription path maintained
100
+
101
+ ## Success Metrics
102
+ - **Transcription latency reduction** (target: <2 seconds)
103
+ - **Error rate improvement** (eliminate WebSocket timeouts)
104
+ - **Code complexity reduction** (remove 100+ lines of WebSocket handling)
105
+ - **Infrastructure simplification** (single API key vs. external service)
106
+
107
+ ## Timeline
108
+ - **Phase 1:** Implementation (FastAPI endpoint + frontend method)
109
+ - **Phase 2:** Testing (transcription accuracy and performance)
110
+ - **Phase 3:** Deployment (surgical replacement with rollback capability)
111
+
112
+ ## Architectural Philosophy
113
+ This migration continues our platform consolidation strategy: moving from distributed external services to unified API providers while maintaining service quality and user experience. The Groq ecosystem (TTS + STT) provides performance advantages and operational simplification compared to our current mixed-provider approach.
114
 
115
  ---
116
+ *This document serves as the technical blueprint for our HuggingFace β†’ Groq STT migration, ensuring stakeholder alignment and implementation clarity.*
117
 
118
+ #AI #SpeechToText #Groq #HuggingFace #TechnicalStrategy #VoiceAI #SystemArchitecture
 
 
app/api/chat_widget.py CHANGED
@@ -1149,51 +1149,33 @@ async def chat_widget(request: Request, email: str = None):
1149
  }
1150
 
1151
  async transcribeAudio(audioBase64) {
1152
- const sessionHash = this.generateSessionHash();
1153
- const payload = {
1154
- data: [
1155
- audioBase64,
1156
- this.language,
1157
- this.modelSize
1158
- ],
1159
- session_hash: sessionHash
1160
- };
1161
-
1162
- console.log(`πŸ“€ Sending to STT v2 service: ${this.serverUrl}/call/gradio_transcribe_memory`);
1163
 
1164
  try {
1165
  const startTime = Date.now();
1166
 
1167
- const response = await fetch(`${this.serverUrl}/call/gradio_transcribe_memory`, {
 
 
 
 
 
1168
  method: 'POST',
1169
- headers: {
1170
- 'Content-Type': 'application/json',
1171
- },
1172
- body: JSON.stringify(payload)
1173
  });
1174
 
1175
  if (!response.ok) {
1176
- throw new Error(`STT v2 request failed: ${response.status}`);
1177
  }
1178
 
1179
  const responseData = await response.json();
1180
- console.log('πŸ“¨ STT v2 queue response:', responseData);
1181
-
1182
- let result;
1183
 
1184
- if (responseData.event_id) {
1185
- console.log(`🎯 Got queue event_id: ${responseData.event_id}`);
1186
- result = await this.listenForQueueResult(responseData, startTime, sessionHash);
1187
- } else if (responseData.data && Array.isArray(responseData.data)) {
1188
- result = responseData.data[0];
1189
- console.log('πŸ“₯ Got direct response from queue');
1190
- } else {
1191
- throw new Error(`Unexpected response format: ${JSON.stringify(responseData)}`);
1192
- }
1193
 
1194
  if (result && result.trim()) {
1195
  const processingTime = (Date.now() - startTime) / 1000;
1196
- console.log(`βœ… STT v2 transcription successful (${processingTime.toFixed(2)}s): "${result.substring(0, 100)}"`);
1197
 
1198
  // Add transcription to message input
1199
  this.addTranscriptionToInput(result);
@@ -1204,105 +1186,12 @@ async def chat_widget(request: Request, email: str = None):
1204
  }
1205
 
1206
  } catch (error) {
1207
- console.error('❌ STT v2 transcription failed:', error);
1208
  updateSTTVisualState('error');
1209
  setTimeout(() => updateSTTVisualState('ready'), 3000);
1210
  }
1211
  }
1212
 
1213
- async listenForQueueResult(queueResponse, startTime, sessionHash) {
1214
- return new Promise((resolve, reject) => {
1215
- const wsUrl = this.serverUrl.replace('https://', 'wss://').replace('http://', 'ws://') + '/queue/data';
1216
- console.log(`πŸ”Œ Connecting to STT v2 WebSocket: ${wsUrl}`);
1217
-
1218
- const ws = new WebSocket(wsUrl);
1219
-
1220
- const timeout = setTimeout(() => {
1221
- ws.close();
1222
- reject(new Error('STT v2 queue timeout after 30 seconds'));
1223
- }, 30000);
1224
-
1225
- ws.onopen = () => {
1226
- console.log('βœ… STT v2 WebSocket connected');
1227
- if (queueResponse.event_id) {
1228
- ws.send(JSON.stringify({
1229
- event_id: queueResponse.event_id
1230
- }));
1231
- console.log(`πŸ“€ Sent event_id: ${queueResponse.event_id}`);
1232
- }
1233
- };
1234
-
1235
- ws.onmessage = (event) => {
1236
- try {
1237
- const data = JSON.parse(event.data);
1238
- console.log('πŸ“¨ STT v2 queue message:', data);
1239
-
1240
- if (data.msg === 'process_completed' && data.output && data.output.data) {
1241
- clearTimeout(timeout);
1242
- ws.close();
1243
- resolve(data.output.data[0]);
1244
- } else if (data.msg === 'process_starts') {
1245
- updateSTTVisualState('processing');
1246
- }
1247
- } catch (e) {
1248
- console.warn('⚠️ STT v2 WebSocket parse error:', e.message);
1249
- }
1250
- };
1251
-
1252
- ws.onerror = (error) => {
1253
- console.error('❌ STT v2 WebSocket error:', error);
1254
- clearTimeout(timeout);
1255
- // Try polling as fallback
1256
- this.pollForResult(queueResponse.event_id, startTime, sessionHash).then(resolve).catch(reject);
1257
- };
1258
-
1259
- ws.onclose = (event) => {
1260
- console.log(`πŸ”Œ STT v2 WebSocket closed: code=${event.code}`);
1261
- clearTimeout(timeout);
1262
- };
1263
- });
1264
- }
1265
-
1266
- async pollForResult(eventId, startTime, sessionHash) {
1267
- console.log(`πŸ”„ Starting STT v2 polling for event: ${eventId}`);
1268
- const maxAttempts = 20;
1269
-
1270
- for (let attempt = 0; attempt < maxAttempts; attempt++) {
1271
- try {
1272
- const endpoint = `/queue/data?event_id=${eventId}&session_hash=${sessionHash}`;
1273
- const response = await fetch(`${this.serverUrl}${endpoint}`);
1274
-
1275
- if (response.ok) {
1276
- const responseText = await response.text();
1277
- console.log(`πŸ“Š STT v2 poll attempt ${attempt + 1}: ${responseText.substring(0, 200)}`);
1278
-
1279
- if (responseText.includes('data: ')) {
1280
- const lines = responseText.split('\\n');
1281
- for (const line of lines) {
1282
- if (line.startsWith('data: ')) {
1283
- try {
1284
- const data = JSON.parse(line.substring(6));
1285
- if (data.msg === 'process_completed' && data.output && data.output.data) {
1286
- return data.output.data[0];
1287
- }
1288
- } catch (parseError) {
1289
- console.warn('⚠️ STT v2 SSE parse error:', parseError.message);
1290
- }
1291
- }
1292
- }
1293
- }
1294
- }
1295
- } catch (e) {
1296
- console.warn(`⚠️ STT v2 poll error attempt ${attempt + 1}:`, e.message);
1297
- }
1298
-
1299
- // Progressive delay
1300
- const delay = attempt < 5 ? 200 : 500;
1301
- await new Promise(resolve => setTimeout(resolve, delay));
1302
- }
1303
-
1304
- throw new Error('STT v2 polling timeout - no result after 20 attempts');
1305
- }
1306
 
1307
  addTranscriptionToInput(transcription) {
1308
  const currentValue = messageInput.value;
 
1149
  }
1150
 
1151
  async transcribeAudio(audioBase64) {
1152
+ console.log(`πŸ“€ Sending to Groq STT service: /api/stt/transcribe`);
 
 
 
 
 
 
 
 
 
 
1153
 
1154
  try {
1155
  const startTime = Date.now();
1156
 
1157
+ // Convert base64 to blob
1158
+ const audioBlob = this.base64ToBlob(audioBase64);
1159
+ const formData = new FormData();
1160
+ formData.append('file', audioBlob, 'audio.wav');
1161
+
1162
+ const response = await fetch('/api/stt/transcribe', {
1163
  method: 'POST',
1164
+ body: formData
 
 
 
1165
  });
1166
 
1167
  if (!response.ok) {
1168
+ throw new Error(`Groq STT request failed: ${response.status}`);
1169
  }
1170
 
1171
  const responseData = await response.json();
1172
+ console.log('πŸ“¨ Groq STT response:', responseData);
 
 
1173
 
1174
+ const result = responseData.text;
 
 
 
 
 
 
 
 
1175
 
1176
  if (result && result.trim()) {
1177
  const processingTime = (Date.now() - startTime) / 1000;
1178
+ console.log(`βœ… Groq STT transcription successful (${processingTime.toFixed(2)}s): "${result.substring(0, 100)}"`);
1179
 
1180
  // Add transcription to message input
1181
  this.addTranscriptionToInput(result);
 
1186
  }
1187
 
1188
  } catch (error) {
1189
+ console.error('❌ Groq STT transcription failed:', error);
1190
  updateSTTVisualState('error');
1191
  setTimeout(() => updateSTTVisualState('ready'), 3000);
1192
  }
1193
  }
1194
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1195
 
1196
  addTranscriptionToInput(transcription) {
1197
  const currentValue = messageInput.value;
app/api/main.py CHANGED
@@ -701,6 +701,59 @@ async def get_tts_audio(file_id: str):
701
  raise HTTPException(status_code=500, detail=f"Audio serving failed: {str(e)}")
702
 
703
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
704
  @app.get("/auth/login", response_model=AuthResponse)
705
  async def google_auth_login(request: Request, state: Optional[str] = None):
706
  """Initiate Google OAuth login."""
 
701
  raise HTTPException(status_code=500, detail=f"Audio serving failed: {str(e)}")
702
 
703
 
704
+ @app.post("/api/stt/transcribe")
705
+ async def stt_transcribe(file: UploadFile = File(...)):
706
+ """STT transcription using Groq Whisper API."""
707
+ try:
708
+ from groq import Groq
709
+ import os
710
+ import tempfile
711
+
712
+ logger.info(f"🎀 STT transcription request: {file.filename} ({file.content_type})")
713
+
714
+ # Create Groq STT client
715
+ client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
716
+
717
+ # Time the STT generation for performance monitoring
718
+ import time
719
+ start_time = time.time()
720
+
721
+ # Create transcription using Groq Whisper
722
+ transcription = client.audio.transcriptions.create(
723
+ file=file.file,
724
+ model="whisper-large-v3-turbo",
725
+ response_format="json",
726
+ language="en",
727
+ temperature=0.0
728
+ )
729
+
730
+ transcription_time = time.time() - start_time
731
+ logger.info(f"⏱️ STT transcription took {transcription_time:.2f} seconds")
732
+
733
+ if transcription and transcription.text:
734
+ logger.info(f"🎀 STT transcription successful: \"{transcription.text[:100]}...\"")
735
+ return {
736
+ "success": True,
737
+ "text": transcription.text,
738
+ "processing_time": round(transcription_time, 2)
739
+ }
740
+ else:
741
+ raise HTTPException(status_code=500, detail="Empty transcription result")
742
+
743
+ except Exception as e:
744
+ # Enhanced error logging for Groq API issues
745
+ error_msg = str(e)
746
+ if "Error code: 401" in error_msg:
747
+ logger.error("Groq API authentication error - check GROQ_API_KEY")
748
+ raise HTTPException(status_code=500, detail="STT service authentication failed")
749
+ elif "Error code: 400" in error_msg:
750
+ logger.error(f"Groq API validation error: {error_msg}")
751
+ raise HTTPException(status_code=400, detail="STT request validation failed")
752
+ else:
753
+ logger.error(f"STT transcription error: {e}")
754
+ raise HTTPException(status_code=500, detail=f"STT transcription failed: {str(e)}")
755
+
756
+
757
  @app.get("/auth/login", response_model=AuthResponse)
758
  async def google_auth_login(request: Request, state: Optional[str] = None):
759
  """Initiate Google OAuth login."""
version.txt CHANGED
@@ -1 +1 @@
1
- 2.0.4-enhanced-tts-interrupt
 
1
+ 2.0.5-groq-stt-integration