---
title: STT WebSocket Service v1.0.0
emoji: 🎤
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: mit
suggested_hardware: zero-a10g
---

# 🎤 ZeroGPU Speech-to-Text Service

High-performance speech recognition powered by **Hugging Face ZeroGPU** and **Nvidia H200** with Whisper models.

## 🎯 Features

- 🚀 **ZeroGPU Acceleration**: Dynamic H200 GPU allocation
- 🎤 **Multi-language Support**: 100+ languages with auto-detection
- ⚡ **Real-time Processing**: Often faster than the audio duration
- 📍 **Timestamp Precision**: Word-level timing information
- 📦 **Batch Processing**: Multiple files in parallel
- 🔴 **Live Transcription**: Real-time microphone input
- 🌐 **WebRTC Ready**: Integration with live audio streams
- 🔧 **MCP Protocol**: Model Context Protocol for direct integration
- 💰 **Cost Efficient**: No idle costs with a Pro subscription

## 🏗️ Architecture

- **Backend**: OpenAI Whisper with PyTorch optimization
- **Frontend**: Gradio with an enhanced multi-tab UI
- **GPU**: ZeroGPU with dynamic H200 scaling
- **Models**: Whisper tiny/base/small/medium/large-v2

## 🚀 Performance

- **Real-time factor**: 0.1x - 0.5x (much faster than real time)
- **Languages**: 100+ with auto-detection
- **Accuracy**: State-of-the-art with Whisper models
- **Batch processing**: Parallel execution on H200

## 💻 API Usage

### Python Client (Gradio)

```python
from gradio_client import Client

client = Client("YOUR_USERNAME/stt-gpu-service")
result = client.predict(
    "audio.wav",  # audio file
    "auto",       # language
    "base",       # model size
    True,         # timestamps
    api_name="/predict"
)
status, transcription, timestamps = result
```

### MCP Client (Model Context Protocol)

```python
# Using an MCP client for direct integration
import asyncio
import json

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the MCP server as a subprocess and connect over stdio
server_params = StdioServerParameters(
    command="python",
    args=["app.py", "--mcp-only"]
)


async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Transcribe a single audio file
            result = await session.call_tool(
                "stt_transcribe",
                {
                    "audio_path": "/path/to/audio.wav",
                    "language": "auto",
                    "model_size": "base",
                    "return_timestamps": True
                }
            )
            transcription_data = json.loads(result.content[0].text)
            print(f"Transcription: {transcription_data['transcription']}")


asyncio.run(main())
```

### Dual Protocol Support

This service now supports **both the Gradio HTTP API and the MCP protocol** simultaneously:

1. **Gradio Interface**: Traditional web UI and HTTP API (port 7860)
2. **MCP Protocol**: Direct tool integration via stdio

#### MCP Tools Available:

- `stt_transcribe`: Transcribe a single audio file
- `stt_batch_transcribe`: Batch-transcribe multiple files
- `stt_get_info`: Get system and service information

#### Running Modes:

```bash
# Dual mode (default) - both Gradio and MCP
python app.py

# MCP-only mode - just the MCP server
python app.py --mcp-only
```

## 🚀 MCP Integration Benefits

### Direct Tool Integration

- **No HTTP overhead**: Direct protocol communication
- **Type-safe interactions**: Structured tool definitions
- **Streaming support**: Real-time tool communication
- **Auto-discovery**: Tools are automatically discoverable

### Use Cases

1. **LLM Agent Integration**: Direct STT capability for AI agents
2. **Workflow Automation**: Seamless audio processing in pipelines
3. **Development Tools**: IDE extensions with voice transcription
4. **Multi-modal Applications**: Combine with other MCP services

### Integration with ChatCal Voice

```python
# Example: voice-enabled calendar scheduling
import json


async def voice_calendar_integration():
    # STT: convert voice to text
    stt_result = await stt_session.call_tool(
        "stt_transcribe",
        {"audio_path": "user_voice_request.wav", "language": "auto"}
    )
    transcription = json.loads(stt_result.content[0].text)["transcription"]

    # Process the calendar request with an LLM
    calendar_action = process_calendar_request(transcription)

    # TTS: convert the response back to voice (if integrated)
    # tts_result = await tts_session.call_tool("tts_generate", {...})
```

# Updated: Mon Aug 25 19:59:40 EDT 2025