---
title: STT WebSocket Service v1.0.0
emoji: 🎤
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: mit
suggested_hardware: zero-a10g
---
# 🎤 ZeroGPU Speech-to-Text Service

High-performance speech recognition powered by **Hugging Face ZeroGPU** and **Nvidia H200** GPUs with Whisper models.
## 🎯 Features

- 🚀 **ZeroGPU Acceleration**: Dynamic H200 GPU allocation
- 🌍 **Multi-language Support**: 100+ languages with auto-detection
- ⚡ **Real-time Processing**: Often faster than audio duration
- 📍 **Timestamp Precision**: Word-level timing information
- 📦 **Batch Processing**: Multiple files in parallel
- 🔴 **Live Transcription**: Real-time microphone input
- 🌐 **WebRTC Ready**: Integration with live audio streams
- 🔧 **MCP Protocol**: Model Context Protocol for direct integration
- 💰 **Cost Efficient**: No idle costs with Pro subscription
## 🏗️ Architecture

- **Backend**: Whisper (OpenAI) with PyTorch optimization
- **Frontend**: Gradio with enhanced multi-tab UI
- **GPU**: ZeroGPU with H200 dynamic scaling
- **Models**: Whisper tiny/base/small/medium/large-v2
## 📊 Performance

- **Real-time Factor**: 0.1x - 0.5x (much faster than real-time)
- **Languages**: 100+ with auto-detection
- **Accuracy**: State-of-the-art with Whisper models
- **Batch Processing**: Parallel execution on H200
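The real-time factor (RTF) above is processing time divided by audio duration, so values below 1.0 mean faster than real time. A minimal sketch of the arithmetic (the function name is illustrative, not part of the service API):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; < 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

# A 60 s clip transcribed in 12 s has RTF 0.2, within the 0.1x - 0.5x range above.
print(real_time_factor(12.0, 60.0))  # -> 0.2
```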
## 💻 API Usage

### Python Client (Gradio)

```python
from gradio_client import Client, handle_file

client = Client("YOUR_USERNAME/stt-gpu-service")
result = client.predict(
    handle_file("audio.wav"),  # audio file
    "auto",                    # language
    "base",                    # model size
    True,                      # timestamps
    api_name="/predict"
)
status, transcription, timestamps = result
```
### MCP Client (Model Context Protocol)

```python
# Using an MCP client for direct integration
import asyncio
import json

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the STT app as an MCP server over stdio
server_params = StdioServerParameters(
    command="python",
    args=["app.py", "--mcp-only"]
)

async def main():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Transcribe a single audio file
            result = await session.call_tool(
                "stt_transcribe",
                {
                    "audio_path": "/path/to/audio.wav",
                    "language": "auto",
                    "model_size": "base",
                    "return_timestamps": True
                }
            )
            transcription_data = json.loads(result.content[0].text)
            print(f"Transcription: {transcription_data['transcription']}")

asyncio.run(main())
```
### Dual Protocol Support

This service supports **both the Gradio HTTP API and the MCP protocol** simultaneously:

1. **Gradio Interface**: Traditional web UI and HTTP API (port 7860)
2. **MCP Protocol**: Direct tool integration via stdio

#### MCP Tools Available

- `stt_transcribe`: Transcribe a single audio file
- `stt_batch_transcribe`: Batch-transcribe multiple files
- `stt_get_info`: Get system and service information
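Assuming `stt_batch_transcribe` takes a list of paths plus the same options as `stt_transcribe` (the exact schema is an assumption, not confirmed by the service), the tool arguments and the JSON payload returned in `result.content[0].text` can be handled like this:

```python
import json

def build_batch_args(paths, language="auto", model_size="base"):
    """Shape an arguments dict for a hypothetical stt_batch_transcribe call."""
    return {
        "audio_paths": list(paths),
        "language": language,
        "model_size": model_size,
    }

def unpack_tool_result(result_text):
    """Parse the JSON string an MCP tool returns in result.content[0].text."""
    return json.loads(result_text)

args = build_batch_args(["a.wav", "b.wav"])
print(args["audio_paths"])  # -> ['a.wav', 'b.wav']

# Illustrative response shape only; the real service defines its own schema.
sample = '{"results": [{"audio_path": "a.wav", "transcription": "hello"}]}'
print(unpack_tool_result(sample)["results"][0]["transcription"])  # -> hello
```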
#### Running Modes

```bash
# Dual mode (default) - both Gradio + MCP
python app.py

# MCP-only mode - just the MCP server
python app.py --mcp-only
```
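One way `app.py` might dispatch between these two modes is a simple flag check; this is a sketch under that assumption, not the actual implementation:

```python
import argparse

def parse_mode(argv=None):
    """Map the CLI flags shown above to a run mode (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="STT service")
    parser.add_argument("--mcp-only", action="store_true",
                        help="run only the MCP stdio server, no Gradio UI")
    args = parser.parse_args(argv)
    return "mcp" if args.mcp_only else "dual"

print(parse_mode([]))              # -> dual
print(parse_mode(["--mcp-only"]))  # -> mcp
```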
## 🚀 MCP Integration Benefits

### Direct Tool Integration

- **No HTTP overhead**: Direct protocol communication
- **Type-safe interactions**: Structured tool definitions
- **Streaming support**: Real-time tool communication
- **Auto-discovery**: Tools are automatically discoverable

### Use Cases

1. **LLM Agent Integration**: Direct STT capability for AI agents
2. **Workflow Automation**: Seamless audio processing in pipelines
3. **Development Tools**: IDE extensions with voice transcription
4. **Multi-modal Applications**: Combine with other MCP services
### Integration with ChatCal Voice

```python
# Example: voice-enabled calendar scheduling
import json

async def voice_calendar_integration():
    # STT: convert voice to text
    stt_result = await stt_session.call_tool("stt_transcribe", {
        "audio_path": "user_voice_request.wav",
        "language": "auto"
    })
    transcription = json.loads(stt_result.content[0].text)["transcription"]

    # Process the calendar request with an LLM
    calendar_action = process_calendar_request(transcription)

    # TTS: convert the response back to voice (if integrated)
    # tts_result = await tts_session.call_tool("tts_generate", {...})
```
# Updated: Mon Aug 25 19:59:40 EDT 2025