Peter Michael Gits
---
title: STT WebSocket Service v1.0.0
emoji: 🎀
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: mit
suggested_hardware: zero-a10g
---
# 🎀 ZeroGPU Speech-to-Text Service
High-performance speech recognition powered by **Hugging Face ZeroGPU** on **NVIDIA H200** GPUs, running OpenAI Whisper models.
## 🎯 Features
- 🚀 **ZeroGPU Acceleration**: Dynamic H200 GPU allocation
- 🎀 **Multi-language Support**: 100+ languages with auto-detection
- ⚡ **Real-time Processing**: Often faster than audio duration
- 📝 **Timestamp Precision**: Word-level timing information
- 📦 **Batch Processing**: Multiple files in parallel
- 🔴 **Live Transcription**: Real-time microphone input
- 🌐 **WebRTC Ready**: Integration with live audio streams
- 🔧 **MCP Protocol**: Model Context Protocol for direct integration
- 💰 **Cost Efficient**: No idle costs with a Pro subscription
## πŸ—οΈ Architecture
- **Backend**: Whisper (OpenAI) with PyTorch optimization
- **Frontend**: Gradio with enhanced multi-tab UI
- **GPU**: ZeroGPU with H200 dynamic scaling
- **Models**: Whisper tiny/base/small/medium/large-v2
## 🚀 Performance
- **Real-time Factor**: 0.1x - 0.5x (much faster than real-time)
- **Languages**: 100+ with auto-detection
- **Accuracy**: State-of-the-art with Whisper models
- **Batch processing**: Parallel execution on H200
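The real-time factor (RTF) above is processing time divided by audio duration, so values below 1.0 mean transcription finishes before the clip would have finished playing. A minimal sketch with hypothetical numbers (not measured on this Space):

```python
# Real-time factor: RTF = processing_time / audio_duration.
# RTF < 1.0 means transcription is faster than playback.
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

# Hypothetical example: a 60 s clip transcribed in 12 s
rtf = real_time_factor(12.0, 60.0)
print(f"RTF: {rtf:.2f}")  # RTF: 0.20
```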
## 💻 API Usage
### Python Client (Gradio)
```python
from gradio_client import Client, handle_file

client = Client("YOUR_USERNAME/stt-gpu-service")
result = client.predict(
    handle_file("audio.wav"),  # audio file
    "auto",                    # language ("auto" = detect)
    "base",                    # model size
    True,                      # return timestamps
    api_name="/predict"
)
status, transcription, timestamps = result
```
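The exact shape of the returned `timestamps` value depends on the Space's implementation; assuming it is a list of `{"word", "start", "end"}` entries, the word-level timing can be rendered into caption-style lines like so (a sketch, not the Space's own format):

```python
# Assumed timestamp shape: a list of {"word", "start", "end"} dicts.
# Check the actual API response of the Space before relying on this.
def to_caption_lines(timestamps):
    """Render word-level timestamps as 'start-end: word' lines."""
    return [f"{t['start']:.2f}-{t['end']:.2f}: {t['word']}" for t in timestamps]

words = [
    {"word": "hello", "start": 0.00, "end": 0.40},
    {"word": "world", "start": 0.45, "end": 0.90},
]
print("\n".join(to_caption_lines(words)))
```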
### MCP Client (Model Context Protocol)
```python
# Using an MCP client for direct integration
import asyncio
import json

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server_params = StdioServerParameters(
    command="python",
    args=["app.py", "--mcp-only"],
)

async def main():
    # Launch the MCP server over stdio and open a client session
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Transcribe a single audio file
            result = await session.call_tool(
                "stt_transcribe",
                {
                    "audio_path": "/path/to/audio.wav",
                    "language": "auto",
                    "model_size": "base",
                    "return_timestamps": True,
                },
            )
            transcription_data = json.loads(result.content[0].text)
            print(f"Transcription: {transcription_data['transcription']}")

asyncio.run(main())
```
### Dual Protocol Support
This service now supports **both Gradio HTTP API and MCP protocol** simultaneously:
1. **Gradio Interface**: Traditional web UI and HTTP API (port 7860)
2. **MCP Protocol**: Direct tool integration via stdio
#### MCP Tools Available:
- `stt_transcribe`: Transcribe single audio file
- `stt_batch_transcribe`: Batch transcribe multiple files
- `stt_get_info`: Get system and service information
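For the batch tool, a plausible argument payload mirrors the `stt_transcribe` parameters with a list of paths; the exact `stt_batch_transcribe` schema is an assumption, so check the tool definitions the server advertises via MCP tool discovery:

```python
# Build an argument payload for stt_batch_transcribe.
# Parameter names mirror stt_transcribe and are an assumption;
# verify against the schema reported by the server's tool listing.
def batch_args(audio_paths, language="auto", model_size="base"):
    return {
        "audio_paths": list(audio_paths),
        "language": language,
        "model_size": model_size,
    }

args = batch_args(["meeting.wav", "voicemail.wav"], model_size="small")
# Then, inside an open session:
#   result = await session.call_tool("stt_batch_transcribe", args)
```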
#### Running Modes:
```bash
# Dual mode (default) - Both Gradio + MCP
python app.py
# MCP-only mode - Just MCP server
python app.py --mcp-only
```
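MCP clients that launch servers from a configuration file (for example, Claude Desktop's `mcpServers` format) can register the MCP-only mode with an entry along these lines; the server name and paths are placeholders:

```json
{
  "mcpServers": {
    "stt-gpu-service": {
      "command": "python",
      "args": ["app.py", "--mcp-only"]
    }
  }
}
```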
## 🚀 MCP Integration Benefits
### Direct Tool Integration
- **No HTTP overhead**: Direct protocol communication
- **Type-safe interactions**: Structured tool definitions
- **Streaming support**: Real-time tool communication
- **Auto-discovery**: Tools are automatically discoverable
### Use Cases
1. **LLM Agent Integration**: Direct STT capability for AI agents
2. **Workflow Automation**: Seamless audio processing in pipelines
3. **Development Tools**: IDE extensions with voice transcription
4. **Multi-modal Applications**: Combine with other MCP services
### Integration with ChatCal Voice
```python
import json

# Example: voice-enabled calendar scheduling.
# Assumes an open MCP ClientSession (`stt_session`, as in the client example)
# and a hypothetical `process_calendar_request` LLM helper.
async def voice_calendar_integration():
    # STT: convert voice to text
    stt_result = await stt_session.call_tool("stt_transcribe", {
        "audio_path": "user_voice_request.wav",
        "language": "auto",
    })
    transcription = json.loads(stt_result.content[0].text)["transcription"]

    # Process the calendar request with an LLM
    calendar_action = process_calendar_request(transcription)

    # TTS: convert the response back to voice (if a TTS MCP service is wired up)
    # tts_result = await tts_session.call_tool("tts_generate", {...})
```
_Updated: Mon Aug 25 19:59:40 EDT 2025_