---
title: STT WebSocket Service v1.0.0
emoji: 🎤
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: mit
suggested_hardware: zero-a10g
---

# 🎤 ZeroGPU Speech-to-Text Service

High-performance speech recognition powered by **Hugging Face ZeroGPU** and **Nvidia H200** with Whisper models.

## 🎯 Features

- 🚀 **ZeroGPU Acceleration**: Dynamic H200 GPU allocation
- 🎤 **Multi-language Support**: 100+ languages with auto-detection
- ⚡ **Real-time Processing**: Often faster than the audio duration
- 📍 **Timestamp Precision**: Word-level timing information
- 📦 **Batch Processing**: Multiple files in parallel
- 🔴 **Live Transcription**: Real-time microphone input
- 🌐 **WebRTC Ready**: Integration with live audio streams
- 🔧 **MCP Protocol**: Model Context Protocol for direct integration
- 💰 **Cost Efficient**: No idle costs with a Pro subscription

## 🏗️ Architecture

- **Backend**: OpenAI Whisper with PyTorch optimization
- **Frontend**: Gradio with an enhanced multi-tab UI
- **GPU**: ZeroGPU with dynamic H200 scaling
- **Models**: Whisper tiny/base/small/medium/large-v2

## 🚀 Performance

- **Real-time factor**: 0.1x - 0.5x (much faster than real time)
- **Languages**: 100+ with auto-detection
- **Accuracy**: State-of-the-art with Whisper models
- **Batch processing**: Parallel execution on H200

## 💻 API Usage

### Python Client (Gradio)

```python
from gradio_client import Client

client = Client("YOUR_USERNAME/stt-gpu-service")
result = client.predict(
    "audio.wav",  # audio file
    "auto",       # language
    "base",       # model size
    True,         # timestamps
    api_name="/predict"
)
status, transcription, timestamps = result
```

### MCP Client (Model Context Protocol)

```python
# Using an MCP client for direct integration
import asyncio
import json

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the MCP server as a subprocess and connect over stdio
server_params = StdioServerParameters(
    command="python",
    args=["app.py", "--mcp-only"]
)


async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Transcribe a single audio file
            result = await session.call_tool(
                "stt_transcribe",
                {
                    "audio_path": "/path/to/audio.wav",
                    "language": "auto",
                    "model_size": "base",
                    "return_timestamps": True
                }
            )
            transcription_data = json.loads(result.content[0].text)
            print(f"Transcription: {transcription_data['transcription']}")


asyncio.run(main())
```

### Dual Protocol Support

This service now supports **both the Gradio HTTP API and the MCP protocol** simultaneously:

1. **Gradio Interface**: Traditional web UI and HTTP API (port 7860)
2. **MCP Protocol**: Direct tool integration via stdio

#### MCP Tools Available:

- `stt_transcribe`: Transcribe a single audio file
- `stt_batch_transcribe`: Batch-transcribe multiple files
- `stt_get_info`: Get system and service information

#### Running Modes:

```bash
# Dual mode (default) - both Gradio and MCP
python app.py

# MCP-only mode - just the MCP server
python app.py --mcp-only
```

## 🚀 MCP Integration Benefits

### Direct Tool Integration

- **No HTTP overhead**: Direct protocol communication
- **Type-safe interactions**: Structured tool definitions
- **Streaming support**: Real-time tool communication
- **Auto-discovery**: Tools are automatically discoverable

### Use Cases

1. **LLM Agent Integration**: Direct STT capability for AI agents
2. **Workflow Automation**: Seamless audio processing in pipelines
3. **Development Tools**: IDE extensions with voice transcription
4. **Multi-modal Applications**: Combine with other MCP services

### Integration with ChatCal Voice

```python
# Example: voice-enabled calendar scheduling
import json


async def voice_calendar_integration():
    # STT: convert voice to text
    stt_result = await stt_session.call_tool(
        "stt_transcribe",
        {"audio_path": "user_voice_request.wav", "language": "auto"}
    )
    transcription = json.loads(stt_result.content[0].text)["transcription"]

    # Process the calendar request with an LLM
    calendar_action = process_calendar_request(transcription)

    # TTS: convert the response back to voice (if integrated)
    # tts_result = await tts_session.call_tool("tts_generate", {...})
```

# Updated: Mon Aug 25 19:59:40 EDT 2025