Spaces:

ashishkblink
/

NuralVoice

Running

App Files Files Community

Ashish Kumar commited on Jan 7

Commit

dfdabcb

1 Parent(s): 1a3931a

Add WebSocket API support: FastAPI + Gradio hybrid app for real-time streaming

Browse files

Files changed (3) hide show

WEBSOCKET_README.md +198 -0
app_websocket.py +467 -0
requirements.txt +3 -0

WEBSOCKET_README.md ADDED Viewed

	@@ -0,0 +1,198 @@

+# WebSocket Implementation for NuralVoiceSTT
+**Developed by Blink Digital**
+This document explains how to use the WebSocket-enabled version of NuralVoiceSTT on Hugging Face Spaces.
+## Two App Options
+### Option 1: Standard Gradio App (`app.py`)
+- **File**: `app.py`
+- **Features**: Gradio UI with optimized streaming
+- **Best for**: Browser-based transcription with UI
+- **URL**: Your Space's main URL
+### Option 2: FastAPI + Gradio Hybrid (`app_websocket.py`)
+- **File**: `app_websocket.py`
+- **Features**:
+  - Gradio UI at `/gradio`
+  - WebSocket API at `/ws/transcribe`
+  - FastAPI REST endpoints at root
+- **Best for**: Programmatic access with WebSocket support
+- **URLs**:
+  - UI: `https://YOUR-SPACE.hf.space/gradio`
+  - WebSocket: `wss://YOUR-SPACE.hf.space/ws/transcribe`
+## Switching Between Apps
+To use the WebSocket version:
+1. **Update README.md** in your Space:
+   ```yaml
+   ---
+   title: NuralVoiceSTT Playground
+   emoji: 🎤
+   colorFrom: blue
+   colorTo: purple
+   sdk: docker  # Change to docker for FastAPI support
+   app_file: app_websocket.py  # Change this line
+   pinned: false
+   license: apache-2.0
+   ---
+   ```
+2. **Or rename files**:
+   - Rename `app.py` to `app_gradio.py`
+   - Rename `app_websocket.py` to `app.py`
+## WebSocket API Usage
+### JavaScript Example
+```javascript
+const ws = new WebSocket('wss://YOUR-SPACE.hf.space/ws/transcribe');
+ws.onopen = () => {
+    console.log('Connected to WebSocket');
+};
+ws.onmessage = (event) => {
+    const data = JSON.parse(event.data);
+    if (data.text) {
+        console.log('Transcription:', data.text);
+        console.log('Is Final:', data.is_final);
+    }
+};
+// Send audio chunks (16-bit PCM, 16kHz, mono)
+// Audio should be sent as binary data (ArrayBuffer)
+ws.send(audioBuffer);
+// Stop recording
+ws.send(JSON.stringify({ action: 'stop' }));
+```
+### Python Example
+```python
+import asyncio
+import websockets
+import json
+import numpy as np
+import soundfile as sf
+async def transcribe_audio():
+    uri = "wss://YOUR-SPACE.hf.space/ws/transcribe"
+    async with websockets.connect(uri) as websocket:
+        # Receive connection confirmation
+        response = await websocket.recv()
+        print("Connected:", json.loads(response))
+        # Load audio file
+        audio, sample_rate = sf.read("audio.wav")
+        # Convert to 16-bit PCM
+        if audio.dtype != np.int16:
+            audio = (audio * 32767).astype(np.int16)
+        # Send audio in chunks
+        chunk_size = 4000
+        audio_bytes = audio.tobytes()
+        for i in range(0, len(audio_bytes), chunk_size):
+            chunk = audio_bytes[i:i+chunk_size]
+            await websocket.send(chunk)
+            # Receive transcription
+            try:
+                response = await websocket.recv()
+                data = json.loads(response)
+                if data.get('text'):
+                    print(f"Transcription: {data['text']}")
+            except:
+                pass
+        # Stop and get final result
+        await websocket.send(json.dumps({"action": "stop"}))
+        final = await websocket.recv()
+        print("Final:", json.loads(final))
+asyncio.run(transcribe_audio())
+```
+## Real-Time Browser Audio Streaming
+```javascript
+// Get microphone stream
+navigator.mediaDevices.getUserMedia({ audio: true })
+    .then(stream => {
+        const audioContext = new AudioContext({ sampleRate: 16000 });
+        const source = audioContext.createMediaStreamSource(stream);
+        const processor = audioContext.createScriptProcessor(4096, 1, 1);
+        processor.onaudioprocess = (e) => {
+            if (ws.readyState === WebSocket.OPEN) {
+                const inputData = e.inputBuffer.getChannelData(0);
+                const pcm16 = new Int16Array(inputData.length);
+                // Convert float32 to int16
+                for (let i = 0; i < inputData.length; i++) {
+                    pcm16[i] = Math.max(-32768, Math.min(32767, inputData[i] * 32768));
+                }
+                // Send to WebSocket
+                ws.send(pcm16.buffer);
+            }
+        };
+        source.connect(processor);
+        processor.connect(audioContext.destination);
+    });
+```
+## API Endpoints (FastAPI Version)
+### GET `/`
+Returns API information
+### GET `/health`
+Health check endpoint
+### WebSocket `/ws/transcribe`
+Real-time audio transcription endpoint
+## Response Format
+```json
+{
+    "text": "transcribed text here",
+    "is_final": false,
+    "is_partial": true
+}
+```
+- `is_final: true` - Final transcription for a chunk
+- `is_final: false, is_partial: true` - Partial/ongoing transcription
+- `is_final: false, is_partial: false` - Final result with word timestamps
+## Requirements
+Both versions require the same dependencies (see `requirements.txt`):
+- `gradio>=4.0.0`
+- `vosk>=0.3.45`
+- `huggingface-hub>=0.16.0`
+- `numpy>=1.21.0`
+- `fastapi>=0.100.0` (for WebSocket version)
+- `uvicorn>=0.23.0` (for WebSocket version)
+- `websockets>=11.0` (for WebSocket version)
+## Performance
+- **WebSocket**: True real-time streaming with minimal latency (~100-200ms)
+- **Gradio Streaming**: Optimized incremental processing (~200-500ms latency)
+Choose based on your use case:
+- **WebSocket**: Best for programmatic access, custom UIs, low latency
+- **Gradio**: Best for quick testing, browser-based UI, ease of use

app_websocket.py ADDED Viewed

	@@ -0,0 +1,467 @@

+"""
+NuralVoiceSTT - Hybrid FastAPI + Gradio App with WebSocket Support
+Real-time speech-to-text with both Gradio UI and WebSocket API
+Developed by Blink Digital
+This app provides:
+1. Gradio UI for easy browser-based transcription
+2. WebSocket API for programmatic real-time streaming
+"""
+from fastapi import FastAPI, WebSocket, WebSocketDisconnect
+from fastapi.middleware.cors import CORSMiddleware
+import gradio as gr
+import json
+import numpy as np
+import os
+import sys
+import asyncio
+import base64
+# Declare GPU function to suppress Hugging Face Spaces warning
+try:
+    import spaces
+    @spaces.GPU
+    def gpu_function():
+        """Dummy GPU function to satisfy Hugging Face Spaces GPU requirement"""
+        pass
+except ImportError:
+    pass
+# Try to import vosk
+try:
+    from vosk import Model, KaldiRecognizer, SetLogLevel
+    from huggingface_hub import snapshot_download
+    VOSK_AVAILABLE = True
+    SetLogLevel(-1)
+except ImportError as e:
+    print(f"Warning: Vosk not available: {e}")
+    VOSK_AVAILABLE = False
+# Global model variable
+model = None
+model_path = None
+model_loading = False
+def load_model():
+    """Load the NuralVoiceSTT model from Hugging Face"""
+    global model, model_path, model_loading
+    if not VOSK_AVAILABLE:
+        return None
+    if model is not None:
+        return model
+    if model_loading:
+        return None
+    model_loading = True
+    try:
+        print("Loading NuralVoiceSTT model from Hugging Face...")
+        token = os.environ.get("HF_TOKEN", None)
+        model_path = snapshot_download(
+            repo_id="ashishkblink/NuralVoiceSTT",
+            local_dir="./nuralvoice_model",
+            token=token
+        )
+        model = Model(model_path)
+        print("✅ Model loaded successfully!")
+        model_loading = False
+        return model
+    except Exception as e:
+        print(f"Error loading model: {e}")
+        model_loading = False
+        return None
+# Initialize FastAPI app
+app = FastAPI(
+    title="NuralVoiceSTT API",
+    description="Real-time speech-to-text with WebSocket support by Blink Digital",
+    version="1.0.0"
+)
+# CORS middleware
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+# Load model on startup
+@app.on_event("startup")
+async def startup_event():
+    """Load model on startup"""
+    if VOSK_AVAILABLE:
+        load_model()
+@app.get("/")
+async def root():
+    """API root endpoint"""
+    return {
+        "service": "NuralVoiceSTT API",
+        "developer": "Blink Digital",
+        "version": "1.0.0",
+        "status": "running",
+        "websocket_endpoint": "/ws/transcribe",
+        "gradio_ui": "/gradio"
+    }
+@app.get("/health")
+async def health_check():
+    """Health check endpoint"""
+    global model
+    return {
+        "status": "healthy",
+        "model_loaded": model is not None,
+        "vosk_available": VOSK_AVAILABLE
+    }
+@app.websocket("/ws/transcribe")
+async def websocket_transcribe(websocket: WebSocket):
+    """
+    WebSocket endpoint for real-time audio transcription
+    Protocol:
+    - Client sends audio chunks as binary data (16-bit PCM, mono, 16kHz recommended)
+    - Server sends JSON messages with transcription results:
+      {
+        "text": "partial or final text",
+        "is_final": false,
+        "is_partial": true
+      }
+    - Client can send {"action": "stop"} as JSON text to end the session
+    """
+    global model
+    await websocket.accept()
+    if model is None:
+        model = load_model()
+        if model is None:
+            await websocket.send_json({
+                "error": "Model not loaded",
+                "status": "error"
+            })
+            await websocket.close()
+            return
+    try:
+        # Create recognizer (16kHz sample rate - adjust if needed)
+        rec = KaldiRecognizer(model, 16000)
+        rec.SetWords(True)
+        # Send initial confirmation
+        await websocket.send_json({
+            "status": "connected",
+            "message": "Ready to receive audio. Send 16-bit PCM mono audio at 16kHz sample rate.",
+            "sample_rate": 16000
+        })
+        while True:
+            try:
+                data = await websocket.receive()
+                # Handle text messages (for control)
+                if "text" in data:
+                    try:
+                        message = json.loads(data["text"])
+                        if message.get("action") == "stop":
+                            # Send final result
+                            final_result = json.loads(rec.FinalResult())
+                            if 'text' in final_result and final_result['text']:
+                                await websocket.send_json({
+                                    "text": final_result['text'],
+                                    "is_final": True,
+                                    "words": final_result.get('result', [])
+                                })
+                            await websocket.close()
+                            break
+                        continue
+                    except json.JSONDecodeError:
+                        # Not JSON, might be base64 audio
+                        try:
+                            audio_bytes = base64.b64decode(data["text"])
+                        except:
+                            continue
+                    else:
+                        continue
+                # Handle binary audio data
+                if "bytes" in data:
+                    audio_bytes = data["bytes"]
+                else:
+                    continue
+                # Process audio chunk in real-time
+                if rec.AcceptWaveform(audio_bytes):
+                    # Final result for this chunk
+                    result = json.loads(rec.Result())
+                    if 'text' in result and result['text']:
+                        await websocket.send_json({
+                            "text": result['text'],
+                            "is_final": True,
+                            "words": result.get('result', [])
+                        })
+                else:
+                    # Partial result (still processing)
+                    partial_result = json.loads(rec.PartialResult())
+                    if 'partial' in partial_result and partial_result['partial']:
+                        await websocket.send_json({
+                            "text": partial_result['partial'],
+                            "is_final": False,
+                            "is_partial": True
+                        })
+            except WebSocketDisconnect:
+                # Send final result before closing
+                final_result = json.loads(rec.FinalResult())
+                if 'text' in final_result and final_result['text']:
+                    await websocket.send_json({
+                        "text": final_result['text'],
+                        "is_final": True,
+                        "words": final_result.get('result', [])
+                    })
+                break
+            except Exception as e:
+                await websocket.send_json({
+                    "error": str(e),
+                    "status": "error"
+                })
+                break
+    except Exception as e:
+        try:
+            await websocket.send_json({
+                "error": str(e),
+                "status": "error"
+            })
+        except:
+            pass
+        await websocket.close()
+# Gradio UI components (reuse from app.py)
+recognizer = None
+current_sample_rate = None
+last_processed_length = 0
+accumulated_text = ""
+def process_streaming_audio(audio_data):
+    """Process streaming audio for Gradio UI"""
+    global model, recognizer, current_sample_rate, last_processed_length, accumulated_text
+    if not VOSK_AVAILABLE:
+        return "❌ Error: Vosk library not available."
+    if model is None:
+        model = load_model()
+        if model is None:
+            return "⏳ Loading model... Please wait a moment."
+    if audio_data is None:
+        recognizer = None
+        current_sample_rate = None
+        last_processed_length = 0
+        accumulated_text = ""
+        return ""
+    try:
+        sample_rate, audio_array = audio_data
+        if recognizer is None or current_sample_rate != sample_rate:
+            recognizer = KaldiRecognizer(model, sample_rate)
+            recognizer.SetWords(True)
+            current_sample_rate = sample_rate
+            last_processed_length = 0
+            accumulated_text = ""
+        if isinstance(audio_array, list):
+            audio_array = np.array(audio_array, dtype=np.float32)
+        if audio_array.dtype != np.int16:
+            if audio_array.max() > 1.0 or audio_array.min() < -1.0:
+                max_val = np.max(np.abs(audio_array))
+                if max_val > 0:
+                    audio_array = audio_array / max_val
+            audio_array = (audio_array * 32767).astype(np.int16)
+        current_length = len(audio_array)
+        if current_length > last_processed_length:
+            new_audio = audio_array[last_processed_length:]
+            audio_bytes = new_audio.tobytes()
+            chunk_size = 4000
+            result_text = ""
+            for i in range(0, len(audio_bytes), chunk_size):
+                chunk = audio_bytes[i:i+chunk_size]
+                if recognizer.AcceptWaveform(chunk):
+                    result = json.loads(recognizer.Result())
+                    if 'text' in result and result['text']:
+                        result_text = result['text']
+                        accumulated_text += " " + result_text if accumulated_text else result_text
+                else:
+                    partial = json.loads(recognizer.PartialResult())
+                    if 'partial' in partial and partial['partial']:
+                        result_text = partial['partial']
+            last_processed_length = current_length
+            if accumulated_text and result_text:
+                return accumulated_text.strip() + " " + result_text
+            elif accumulated_text:
+                return accumulated_text.strip()
+            elif result_text:
+                return result_text
+            else:
+                partial = json.loads(recognizer.PartialResult())
+                if 'partial' in partial and partial['partial']:
+                    return partial['partial']
+        partial = json.loads(recognizer.PartialResult())
+        if 'partial' in partial and partial['partial']:
+            return accumulated_text.strip() + " " + partial['partial'] if accumulated_text else partial['partial']
+        return accumulated_text.strip() if accumulated_text else ""
+    except Exception as e:
+        return f"❌ Error: {str(e)}"
+# Create Gradio interface
+with gr.Blocks(title="NuralVoiceSTT Playground - Blink Digital") as demo:
+    gr.Markdown("""
+    # 🎤 NuralVoiceSTT Playground
+    **Developed by Blink Digital**
+    **Real-time streaming speech-to-text** - See your words appear instantly as you speak!
+    ### 🌐 WebSocket API Available
+    For programmatic access, connect to: `wss://YOUR-SPACE.hf.space/ws/transcribe`
+    """)
+    with gr.Accordion("📋 How to Use", open=False):
+        gr.Markdown("""
+        1. Click the **microphone button** below
+        2. Allow microphone permissions when prompted
+        3. Start speaking - **text appears in real-time as you speak!**
+        4. No need to stop - it streams continuously
+        5. Click **"Stop"** when finished
+        """)
+    with gr.Row():
+        with gr.Column():
+            gr.Markdown("### 🎙️ Live Audio Stream")
+            microphone = gr.Audio(
+                label="Click to Start Streaming",
+                type="numpy",
+                sources=["microphone"],
+                streaming=True,
+                show_label=True
+            )
+            status = gr.HTML("""
+            <div style="padding: 10px; background: #d4edda; color: #155724; border-radius: 5px; margin-top: 10px;">
+                ✅ Ready - Click microphone to start real-time transcription
+            </div>
+            """)
+        with gr.Column():
+            gr.Markdown("### 📝 Live Transcription")
+            output = gr.Textbox(
+                label="Real-time Text Output",
+                lines=12,
+                placeholder="Your speech will appear here in real-time as you speak...",
+                interactive=False,
+                autoscroll=True
+            )
+    with gr.Accordion("💡 Tips for Best Results", open=False):
+        gr.Markdown("""
+        - Speak clearly and at a moderate pace
+        - Reduce background noise for better accuracy
+        - Use a good quality microphone if possible
+        - Wait a moment after speaking to see final results
+        """)
+    gr.Markdown("""
+    ---
+    ### About NuralVoiceSTT
+    **Developed by Blink Digital**
+    NuralVoiceSTT is a high-accuracy English speech-to-text model optimized for both callcenter and wideband audio scenarios.
+    ### WebSocket API Usage
+    Connect to the WebSocket endpoint for programmatic real-time transcription:
+    ```javascript
+    const ws = new WebSocket('wss://YOUR-SPACE.hf.space/ws/transcribe');
+    ws.onmessage = (event) => {
+        const data = JSON.parse(event.data);
+        console.log('Transcription:', data.text);
+    };
+    // Send audio chunks as binary data (16-bit PCM, 16kHz)
+    ws.send(audioBuffer);
+    ```
+    """)
+    microphone.stream(
+        fn=process_streaming_audio,
+        inputs=microphone,
+        outputs=output,
+        show_progress=False,
+        every=0.1
+    )
+    def update_status(audio_data):
+        if audio_data is None:
+            return gr.HTML("""
+            <div style="padding: 10px; background: #d4edda; color: #155724; border-radius: 5px; margin-top: 10px;">
+                ✅ Ready - Click microphone to start real-time transcription
+            </div>
+            """)
+        else:
+            return gr.HTML("""
+            <div style="padding: 10px; background: #fff3cd; color: #856404; border-radius: 5px; margin-top: 10px;">
+                🎤 Streaming... Speak now - text appears in real-time!
+            </div>
+            """)
+    microphone.change(
+        fn=update_status,
+        inputs=microphone,
+        outputs=status
+    )
+# Load model in background
+if VOSK_AVAILABLE:
+    import threading
+    def load_model_background():
+        load_model()
+    threading.Thread(target=load_model_background, daemon=True).start()
+demo.queue()
+# Mount Gradio app to FastAPI
+# For Hugging Face Spaces, FastAPI app will be the main entry point
+# Gradio UI will be available at /gradio, WebSocket at /ws/transcribe, API at root
+# Get Gradio's ASGI app and mount it
+gradio_app = demo.app
+# Mount Gradio at /gradio path (FastAPI routes stay at root)
+app.mount("/gradio", gradio_app)
+# Note: For Hugging Face Spaces, you may need to set app_file to app_websocket.py
+# in your README.md or use this as the main app
+# For local testing
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=7860)

requirements.txt CHANGED Viewed

@@ -3,4 +3,7 @@ vosk>=0.3.45
 huggingface-hub>=0.16.0
 soundfile>=0.12.0
 numpy>=1.21.0

 huggingface-hub>=0.16.0
 soundfile>=0.12.0
 numpy>=1.21.0
+fastapi>=0.100.0
+uvicorn>=0.23.0
+websockets>=11.0