Ashish Kumar committed on
Commit dfdabcb · 1 Parent(s): 1a3931a

Add WebSocket API support: FastAPI + Gradio hybrid app for real-time streaming

Files changed (3)
  1. WEBSOCKET_README.md +198 -0
  2. app_websocket.py +467 -0
  3. requirements.txt +3 -0
WEBSOCKET_README.md ADDED
@@ -0,0 +1,198 @@
+ # WebSocket Implementation for NuralVoiceSTT
+
+ **Developed by Blink Digital**
+
+ This document explains how to use the WebSocket-enabled version of NuralVoiceSTT on Hugging Face Spaces.
+
+ ## Two App Options
+
+ ### Option 1: Standard Gradio App (`app.py`)
+ - **File**: `app.py`
+ - **Features**: Gradio UI with optimized streaming
+ - **Best for**: Browser-based transcription with UI
+ - **URL**: Your Space's main URL
+
+ ### Option 2: FastAPI + Gradio Hybrid (`app_websocket.py`)
+ - **File**: `app_websocket.py`
+ - **Features**:
+   - Gradio UI at `/gradio`
+   - WebSocket API at `/ws/transcribe`
+   - FastAPI REST endpoints at root
+ - **Best for**: Programmatic access with WebSocket support
+ - **URLs**:
+   - UI: `https://YOUR-SPACE.hf.space/gradio`
+   - WebSocket: `wss://YOUR-SPACE.hf.space/ws/transcribe`
+
+ ## Switching Between Apps
+
+ To use the WebSocket version:
+
+ 1. **Update README.md** in your Space:
+ ```yaml
+ ---
+ title: NuralVoiceSTT Playground
+ emoji: 🎤
+ colorFrom: blue
+ colorTo: purple
+ sdk: docker  # Change to docker for FastAPI support
+ app_file: app_websocket.py  # Change this line
+ pinned: false
+ license: apache-2.0
+ ---
+ ```
+
+ 2. **Or rename files**:
+    - Rename `app.py` to `app_gradio.py`
+    - Rename `app_websocket.py` to `app.py`
48
+ ## WebSocket API Usage
49
+
50
+ ### JavaScript Example
51
+
52
+ ```javascript
53
+ const ws = new WebSocket('wss://YOUR-SPACE.hf.space/ws/transcribe');
54
+
55
+ ws.onopen = () => {
56
+ console.log('Connected to WebSocket');
57
+ };
58
+
59
+ ws.onmessage = (event) => {
60
+ const data = JSON.parse(event.data);
61
+ if (data.text) {
62
+ console.log('Transcription:', data.text);
63
+ console.log('Is Final:', data.is_final);
64
+ }
65
+ };
66
+
67
+ // Send audio chunks (16-bit PCM, 16kHz, mono)
68
+ // Audio should be sent as binary data (ArrayBuffer)
69
+ ws.send(audioBuffer);
70
+
71
+ // Stop recording
72
+ ws.send(JSON.stringify({ action: 'stop' }));
73
+ ```
74
+
75
+ ### Python Example
76
+
77
+ ```python
78
+ import asyncio
79
+ import websockets
80
+ import json
81
+ import numpy as np
82
+ import soundfile as sf
83
+
84
+ async def transcribe_audio():
85
+ uri = "wss://YOUR-SPACE.hf.space/ws/transcribe"
86
+
87
+ async with websockets.connect(uri) as websocket:
88
+ # Receive connection confirmation
89
+ response = await websocket.recv()
90
+ print("Connected:", json.loads(response))
91
+
92
+ # Load audio file
93
+ audio, sample_rate = sf.read("audio.wav")
94
+
95
+ # Convert to 16-bit PCM
96
+ if audio.dtype != np.int16:
97
+ audio = (audio * 32767).astype(np.int16)
98
+
99
+ # Send audio in chunks
100
+ chunk_size = 4000
101
+ audio_bytes = audio.tobytes()
102
+
103
+ for i in range(0, len(audio_bytes), chunk_size):
104
+ chunk = audio_bytes[i:i+chunk_size]
105
+ await websocket.send(chunk)
106
+
107
+ # Receive transcription
108
+ try:
109
+ response = await websocket.recv()
110
+ data = json.loads(response)
111
+ if data.get('text'):
112
+ print(f"Transcription: {data['text']}")
113
+ except:
114
+ pass
115
+
116
+ # Stop and get final result
117
+ await websocket.send(json.dumps({"action": "stop"}))
118
+ final = await websocket.recv()
119
+ print("Final:", json.loads(final))
120
+
121
+ asyncio.run(transcribe_audio())
122
+ ```
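The chunk size in the example above corresponds to a fixed audio duration; a quick check using the example's values (4000-byte chunks of 16-bit mono PCM at 16 kHz):

```python
# Values taken from the Python example above.
BYTES_PER_SAMPLE = 2   # 16-bit PCM
SAMPLE_RATE = 16000    # Hz
CHUNK_BYTES = 4000

chunk_samples = CHUNK_BYTES // BYTES_PER_SAMPLE   # samples per chunk
chunk_ms = chunk_samples * 1000 / SAMPLE_RATE     # milliseconds of audio per chunk
print(f"{chunk_samples} samples = {chunk_ms} ms per chunk")
```

So each chunk carries 125 ms of audio, which is why partial results can arrive several times per second.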
+
+ ## Real-Time Browser Audio Streaming
+
+ ```javascript
+ // Assumes `ws` is an open WebSocket (see the example above)
+ // Get microphone stream
+ navigator.mediaDevices.getUserMedia({ audio: true })
+   .then(stream => {
+     const audioContext = new AudioContext({ sampleRate: 16000 });
+     const source = audioContext.createMediaStreamSource(stream);
+     const processor = audioContext.createScriptProcessor(4096, 1, 1);
+
+     processor.onaudioprocess = (e) => {
+       if (ws.readyState === WebSocket.OPEN) {
+         const inputData = e.inputBuffer.getChannelData(0);
+         const pcm16 = new Int16Array(inputData.length);
+
+         // Convert float32 to int16
+         for (let i = 0; i < inputData.length; i++) {
+           pcm16[i] = Math.max(-32768, Math.min(32767, inputData[i] * 32768));
+         }
+
+         // Send to WebSocket
+         ws.send(pcm16.buffer);
+       }
+     };
+
+     source.connect(processor);
+     processor.connect(audioContext.destination);
+   });
+ ```
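The same float-to-PCM conversion can be mirrored in Python for server-side or file-based preprocessing; a sketch (the `float32_to_pcm16` helper name is ours, and it clamps to [-1, 1] like the JavaScript loop above):

```python
import numpy as np

def float32_to_pcm16(samples: np.ndarray) -> bytes:
    """Clamp float32 samples to [-1, 1] and convert to 16-bit PCM bytes."""
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * 32767).astype(np.int16).tobytes()
```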
+
+ ## API Endpoints (FastAPI Version)
+
+ ### GET `/`
+ Returns API information
+
+ ### GET `/health`
+ Health check endpoint
+
+ ### WebSocket `/ws/transcribe`
+ Real-time audio transcription endpoint
+
+ ## Response Format
+
+ ```json
+ {
+   "text": "transcribed text here",
+   "is_final": false,
+   "is_partial": true
+ }
+ ```
+
+ - `is_final: true` - Final transcription for a chunk (or for the session after `stop`); includes word timestamps in a `words` field
+ - `is_final: false, is_partial: true` - Partial, in-progress transcription
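A client can branch on these flags when handling messages; a minimal sketch (the `describe_message` helper is ours; field names follow the format above plus the `status`/`error` messages the endpoint also emits):

```python
import json

def describe_message(raw: str) -> str:
    """Classify one JSON message received from the /ws/transcribe endpoint."""
    msg = json.loads(raw)
    if msg.get("error"):
        return "error"
    if msg.get("status") == "connected":
        return "connected"
    if msg.get("is_final"):
        return "final"
    if msg.get("is_partial"):
        return "partial"
    return "other"
```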
+
+ ## Requirements
+
+ Both versions share the same base dependencies (see `requirements.txt`):
+ - `gradio>=4.0.0`
+ - `vosk>=0.3.45`
+ - `huggingface-hub>=0.16.0`
+ - `soundfile>=0.12.0`
+ - `numpy>=1.21.0`
+ - `fastapi>=0.100.0` (WebSocket version only)
+ - `uvicorn>=0.23.0` (WebSocket version only)
+ - `websockets>=11.0` (WebSocket version only)
+
+ ## Performance
+
+ - **WebSocket**: True real-time streaming with minimal latency (~100-200ms)
+ - **Gradio Streaming**: Optimized incremental processing (~200-500ms latency)
+
+ Choose based on your use case:
+ - **WebSocket**: Best for programmatic access, custom UIs, and low latency
+ - **Gradio**: Best for quick testing, a browser-based UI, and ease of use
app_websocket.py ADDED
@@ -0,0 +1,467 @@
+ """
+ NuralVoiceSTT - Hybrid FastAPI + Gradio App with WebSocket Support
+ Real-time speech-to-text with both a Gradio UI and a WebSocket API
+ Developed by Blink Digital
+
+ This app provides:
+ 1. Gradio UI for easy browser-based transcription
+ 2. WebSocket API for programmatic real-time streaming
+ """
+ from fastapi import FastAPI, WebSocket, WebSocketDisconnect
+ from fastapi.middleware.cors import CORSMiddleware
+ import gradio as gr
+ import json
+ import numpy as np
+ import os
+ import base64
+
+ # Declare GPU function to suppress Hugging Face Spaces warning
+ try:
+     import spaces
+
+     @spaces.GPU
+     def gpu_function():
+         """Dummy GPU function to satisfy Hugging Face Spaces GPU requirement"""
+         pass
+ except ImportError:
+     pass
+
+ # Try to import vosk
+ try:
+     from vosk import Model, KaldiRecognizer, SetLogLevel
+     from huggingface_hub import snapshot_download
+     VOSK_AVAILABLE = True
+     SetLogLevel(-1)
+ except ImportError as e:
+     print(f"Warning: Vosk not available: {e}")
+     VOSK_AVAILABLE = False
+
+ # Global model state
+ model = None
+ model_path = None
+ model_loading = False
+
+ def load_model():
+     """Load the NuralVoiceSTT model from Hugging Face"""
+     global model, model_path, model_loading
+
+     if not VOSK_AVAILABLE:
+         return None
+
+     if model is not None:
+         return model
+
+     if model_loading:
+         return None
+
+     model_loading = True
+     try:
+         print("Loading NuralVoiceSTT model from Hugging Face...")
+         token = os.environ.get("HF_TOKEN", None)
+
+         model_path = snapshot_download(
+             repo_id="ashishkblink/NuralVoiceSTT",
+             local_dir="./nuralvoice_model",
+             token=token
+         )
+
+         model = Model(model_path)
+         print("✅ Model loaded successfully!")
+         model_loading = False
+         return model
+     except Exception as e:
+         print(f"Error loading model: {e}")
+         model_loading = False
+         return None
+
+ # Initialize FastAPI app
+ app = FastAPI(
+     title="NuralVoiceSTT API",
+     description="Real-time speech-to-text with WebSocket support by Blink Digital",
+     version="1.0.0"
+ )
+
+ # CORS middleware
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ @app.on_event("startup")
+ async def startup_event():
+     """Load model on startup"""
+     if VOSK_AVAILABLE:
+         load_model()
+
+ @app.get("/")
+ async def root():
+     """API root endpoint"""
+     return {
+         "service": "NuralVoiceSTT API",
+         "developer": "Blink Digital",
+         "version": "1.0.0",
+         "status": "running",
+         "websocket_endpoint": "/ws/transcribe",
+         "gradio_ui": "/gradio"
+     }
+
+ @app.get("/health")
+ async def health_check():
+     """Health check endpoint"""
+     return {
+         "status": "healthy",
+         "model_loaded": model is not None,
+         "vosk_available": VOSK_AVAILABLE
+     }
+
+ @app.websocket("/ws/transcribe")
+ async def websocket_transcribe(websocket: WebSocket):
+     """
+     WebSocket endpoint for real-time audio transcription.
+
+     Protocol:
+     - Client sends audio chunks as binary data (16-bit PCM, mono, 16kHz recommended)
+     - Server sends JSON messages with transcription results:
+       {
+           "text": "partial or final text",
+           "is_final": false,
+           "is_partial": true
+       }
+     - Client can send {"action": "stop"} as JSON text to end the session
+     """
+     global model
+
+     await websocket.accept()
+
+     if model is None:
+         model = load_model()
+         if model is None:
+             await websocket.send_json({
+                 "error": "Model not loaded",
+                 "status": "error"
+             })
+             await websocket.close()
+             return
+
+     try:
+         # Create recognizer (16kHz sample rate - adjust if needed)
+         rec = KaldiRecognizer(model, 16000)
+         rec.SetWords(True)
+
+         # Send initial confirmation
+         await websocket.send_json({
+             "status": "connected",
+             "message": "Ready to receive audio. Send 16-bit PCM mono audio at 16kHz sample rate.",
+             "sample_rate": 16000
+         })
+
+         while True:
+             try:
+                 data = await websocket.receive()
+
+                 if "bytes" in data and data["bytes"] is not None:
+                     # Binary audio data
+                     audio_bytes = data["bytes"]
+                 elif "text" in data and data["text"] is not None:
+                     # Text frame: either a JSON control message or base64-encoded audio
+                     try:
+                         message = json.loads(data["text"])
+                         if message.get("action") == "stop":
+                             # Send final result and close the session
+                             final_result = json.loads(rec.FinalResult())
+                             if final_result.get('text'):
+                                 await websocket.send_json({
+                                     "text": final_result['text'],
+                                     "is_final": True,
+                                     "words": final_result.get('result', [])
+                                 })
+                             await websocket.close()
+                             break
+                         continue
+                     except json.JSONDecodeError:
+                         # Not JSON - treat the payload as base64-encoded audio
+                         try:
+                             audio_bytes = base64.b64decode(data["text"])
+                         except (ValueError, TypeError):
+                             continue
+                 else:
+                     continue
+
+                 # Process the audio chunk in real time
+                 if rec.AcceptWaveform(audio_bytes):
+                     # Final result for this chunk
+                     result = json.loads(rec.Result())
+                     if result.get('text'):
+                         await websocket.send_json({
+                             "text": result['text'],
+                             "is_final": True,
+                             "words": result.get('result', [])
+                         })
+                 else:
+                     # Partial result (still processing)
+                     partial_result = json.loads(rec.PartialResult())
+                     if partial_result.get('partial'):
+                         await websocket.send_json({
+                             "text": partial_result['partial'],
+                             "is_final": False,
+                             "is_partial": True
+                         })
+
+             except WebSocketDisconnect:
+                 # Client disconnected; nothing more can be sent
+                 break
+             except Exception as e:
+                 await websocket.send_json({
+                     "error": str(e),
+                     "status": "error"
+                 })
+                 break
+
+     except Exception as e:
+         try:
+             await websocket.send_json({
+                 "error": str(e),
+                 "status": "error"
+             })
+         except Exception:
+             pass
+         await websocket.close()
+
+ # Gradio UI streaming state (reused from app.py)
+ recognizer = None
+ current_sample_rate = None
+ last_processed_length = 0
+ accumulated_text = ""
+
+ def process_streaming_audio(audio_data):
+     """Process streaming audio for the Gradio UI"""
+     global model, recognizer, current_sample_rate, last_processed_length, accumulated_text
+
+     if not VOSK_AVAILABLE:
+         return "❌ Error: Vosk library not available."
+
+     if model is None:
+         model = load_model()
+         if model is None:
+             return "⏳ Loading model... Please wait a moment."
+
+     if audio_data is None:
+         # Stream ended - reset state
+         recognizer = None
+         current_sample_rate = None
+         last_processed_length = 0
+         accumulated_text = ""
+         return ""
+
+     try:
+         sample_rate, audio_array = audio_data
+
+         if recognizer is None or current_sample_rate != sample_rate:
+             recognizer = KaldiRecognizer(model, sample_rate)
+             recognizer.SetWords(True)
+             current_sample_rate = sample_rate
+             last_processed_length = 0
+             accumulated_text = ""
+
+         if isinstance(audio_array, list):
+             audio_array = np.array(audio_array, dtype=np.float32)
+
+         if audio_array.dtype != np.int16:
+             if audio_array.max() > 1.0 or audio_array.min() < -1.0:
+                 max_val = np.max(np.abs(audio_array))
+                 if max_val > 0:
+                     audio_array = audio_array / max_val
+             audio_array = (audio_array * 32767).astype(np.int16)
+
+         current_length = len(audio_array)
+
+         if current_length > last_processed_length:
+             new_audio = audio_array[last_processed_length:]
+             audio_bytes = new_audio.tobytes()
+
+             chunk_size = 4000
+             result_text = ""
+
+             for i in range(0, len(audio_bytes), chunk_size):
+                 chunk = audio_bytes[i:i+chunk_size]
+
+                 if recognizer.AcceptWaveform(chunk):
+                     result = json.loads(recognizer.Result())
+                     if result.get('text'):
+                         # Commit the finalized chunk and clear the pending partial
+                         accumulated_text += " " + result['text'] if accumulated_text else result['text']
+                         result_text = ""
+                 else:
+                     partial = json.loads(recognizer.PartialResult())
+                     if partial.get('partial'):
+                         result_text = partial['partial']
+
+             last_processed_length = current_length
+
+             if accumulated_text and result_text:
+                 return accumulated_text.strip() + " " + result_text
+             elif accumulated_text:
+                 return accumulated_text.strip()
+             elif result_text:
+                 return result_text
+             else:
+                 partial = json.loads(recognizer.PartialResult())
+                 if partial.get('partial'):
+                     return partial['partial']
+
+         partial = json.loads(recognizer.PartialResult())
+         if partial.get('partial'):
+             return accumulated_text.strip() + " " + partial['partial'] if accumulated_text else partial['partial']
+
+         return accumulated_text.strip() if accumulated_text else ""
+
+     except Exception as e:
+         return f"❌ Error: {str(e)}"
+
+ # Create Gradio interface
+ with gr.Blocks(title="NuralVoiceSTT Playground - Blink Digital") as demo:
+     gr.Markdown("""
+     # 🎤 NuralVoiceSTT Playground
+
+     **Developed by Blink Digital**
+
+     **Real-time streaming speech-to-text** - See your words appear instantly as you speak!
+
+     ### 🌐 WebSocket API Available
+     For programmatic access, connect to: `wss://YOUR-SPACE.hf.space/ws/transcribe`
+     """)
+
+     with gr.Accordion("📋 How to Use", open=False):
+         gr.Markdown("""
+         1. Click the **microphone button** below
+         2. Allow microphone permissions when prompted
+         3. Start speaking - **text appears in real-time as you speak!**
+         4. No need to stop - it streams continuously
+         5. Click **"Stop"** when finished
+         """)
+
+     with gr.Row():
+         with gr.Column():
+             gr.Markdown("### 🎙️ Live Audio Stream")
+             microphone = gr.Audio(
+                 label="Click to Start Streaming",
+                 type="numpy",
+                 sources=["microphone"],
+                 streaming=True,
+                 show_label=True
+             )
+             status = gr.HTML("""
+                 <div style="padding: 10px; background: #d4edda; color: #155724; border-radius: 5px; margin-top: 10px;">
+                     ✅ Ready - Click microphone to start real-time transcription
+                 </div>
+             """)
+
+         with gr.Column():
+             gr.Markdown("### 📝 Live Transcription")
+             output = gr.Textbox(
+                 label="Real-time Text Output",
+                 lines=12,
+                 placeholder="Your speech will appear here in real-time as you speak...",
+                 interactive=False,
+                 autoscroll=True
+             )
+
+     with gr.Accordion("💡 Tips for Best Results", open=False):
+         gr.Markdown("""
+         - Speak clearly and at a moderate pace
+         - Reduce background noise for better accuracy
+         - Use a good quality microphone if possible
+         - Wait a moment after speaking to see final results
+         """)
+
+     gr.Markdown("""
+     ---
+     ### About NuralVoiceSTT
+
+     **Developed by Blink Digital**
+
+     NuralVoiceSTT is a high-accuracy English speech-to-text model optimized for both callcenter and wideband audio scenarios.
+
+     ### WebSocket API Usage
+
+     Connect to the WebSocket endpoint for programmatic real-time transcription:
+
+     ```javascript
+     const ws = new WebSocket('wss://YOUR-SPACE.hf.space/ws/transcribe');
+     ws.onmessage = (event) => {
+         const data = JSON.parse(event.data);
+         console.log('Transcription:', data.text);
+     };
+     // Send audio chunks as binary data (16-bit PCM, 16kHz)
+     ws.send(audioBuffer);
+     ```
+     """)
+
+     microphone.stream(
+         fn=process_streaming_audio,
+         inputs=microphone,
+         outputs=output,
+         show_progress=False,
+         every=0.1
+     )
+
+     def update_status(audio_data):
+         if audio_data is None:
+             return gr.HTML("""
+                 <div style="padding: 10px; background: #d4edda; color: #155724; border-radius: 5px; margin-top: 10px;">
+                     ✅ Ready - Click microphone to start real-time transcription
+                 </div>
+             """)
+         else:
+             return gr.HTML("""
+                 <div style="padding: 10px; background: #fff3cd; color: #856404; border-radius: 5px; margin-top: 10px;">
+                     🎤 Streaming... Speak now - text appears in real-time!
+                 </div>
+             """)
+
+     microphone.change(
+         fn=update_status,
+         inputs=microphone,
+         outputs=status
+     )
+
+ # Load model in background
+ if VOSK_AVAILABLE:
+     import threading
+     threading.Thread(target=load_model, daemon=True).start()
+
+ demo.queue()
+
+ # Mount Gradio onto FastAPI.
+ # For Hugging Face Spaces, the FastAPI app is the main entry point:
+ # Gradio UI at /gradio, WebSocket at /ws/transcribe, REST API at root.
+ app = gr.mount_gradio_app(app, demo, path="/gradio")
+
+ # Note: For Hugging Face Spaces, you may need to set app_file to app_websocket.py
+ # in your README.md or use this as the main app
+
+ # For local testing
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=7860)
requirements.txt CHANGED
@@ -3,4 +3,7 @@ vosk>=0.3.45
  huggingface-hub>=0.16.0
  soundfile>=0.12.0
  numpy>=1.21.0
+ fastapi>=0.100.0
+ uvicorn>=0.23.0
+ websockets>=11.0