stt-gpu-service / README.md
Peter Michael Gits
---
title: STT WebSocket Service v1.0.0
emoji: 🎀
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: mit
suggested_hardware: zero-a10g
---

# 🎀 ZeroGPU Speech-to-Text Service

High-performance speech recognition powered by Hugging Face ZeroGPU and NVIDIA H200 GPUs, running OpenAI Whisper models.

## 🎯 Features

- 🚀 **ZeroGPU Acceleration**: Dynamic H200 GPU allocation
- 🎀 **Multi-language Support**: 100+ languages with auto-detection
- ⚡ **Real-time Processing**: Often faster than audio duration
- 📍 **Timestamp Precision**: Word-level timing information
- 📦 **Batch Processing**: Multiple files in parallel
- 🔴 **Live Transcription**: Real-time microphone input
- 🌐 **WebRTC Ready**: Integration with live audio streams
- 🔧 **MCP Protocol**: Model Context Protocol for direct integration
- 💰 **Cost Efficient**: No idle costs with a Pro subscription

πŸ—οΈ Architecture

  • Backend: Whisper (OpenAI) with PyTorch optimization
  • Frontend: Gradio with enhanced multi-tab UI
  • GPU: ZeroGPU with H200 dynamic scaling
  • Models: Whisper tiny/base/small/medium/large-v2
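As a rough guide to the size/speed trade-off between these checkpoints, a small helper sketch (parameter counts are approximate figures from the openai/whisper model card; the `pick_model` helper is illustrative, not part of this service's API):

```python
# Approximate parameter counts (in millions) for the Whisper checkpoints
# listed above; larger models are slower but more accurate.
WHISPER_PARAMS_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large-v2": 1550,
}

def pick_model(max_params_m: int) -> str:
    """Return the largest listed checkpoint within a parameter budget (millions)."""
    candidates = [name for name, p in WHISPER_PARAMS_M.items() if p <= max_params_m]
    return max(candidates, key=WHISPER_PARAMS_M.get)

print(pick_model(300))  # small
```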

## 🚀 Performance

- **Real-time Factor**: 0.1x - 0.5x (much faster than real time)
- **Languages**: 100+ with auto-detection
- **Accuracy**: State-of-the-art with Whisper models
- **Batch Processing**: Parallel execution on H200
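The real-time factor above is processing time divided by audio duration, so values below 1.0 mean faster than real time. A one-line sketch:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; values < 1.0 are faster than real time."""
    return processing_seconds / audio_seconds

# A 60-second clip transcribed in 12 seconds:
print(real_time_factor(12.0, 60.0))  # 0.2
```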

## 💻 API Usage

### Python Client (Gradio)

```python
from gradio_client import Client

client = Client("YOUR_USERNAME/stt-gpu-service")
result = client.predict(
    "audio.wav",     # audio file
    "auto",          # language
    "base",          # model size
    True,            # return timestamps
    api_name="/predict"
)
status, transcription, timestamps = result
```

### MCP Client (Model Context Protocol)

```python
# Using an MCP client for direct integration
import asyncio
import json

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch and connect to the MCP server over stdio
server_params = StdioServerParameters(
    command="python",
    args=["app.py", "--mcp-only"],
)

async def main():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Transcribe a single audio file
            result = await session.call_tool(
                "stt_transcribe",
                {
                    "audio_path": "/path/to/audio.wav",
                    "language": "auto",
                    "model_size": "base",
                    "return_timestamps": True,
                },
            )

            transcription_data = json.loads(result.content[0].text)
            print(f"Transcription: {transcription_data['transcription']}")

asyncio.run(main())
```

### Dual Protocol Support

This service supports both the Gradio HTTP API and the MCP protocol simultaneously:

1. **Gradio Interface**: Traditional web UI and HTTP API (port 7860)
2. **MCP Protocol**: Direct tool integration via stdio

**MCP Tools Available:**

- `stt_transcribe`: Transcribe a single audio file
- `stt_batch_transcribe`: Batch-transcribe multiple files
- `stt_get_info`: Get system and service information
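Tool results arrive as JSON text content, as in the client example above. A minimal sketch of unpacking a response (the `transcription` field matches that example; the other fields here are assumed for illustration):

```python
import json

# Hypothetical stt_transcribe payload for illustration only
raw = '{"transcription": "hello world", "language": "en", "duration": 1.4}'

payload = json.loads(raw)
print(payload["transcription"])  # hello world
```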

**Running Modes:**

```bash
# Dual mode (default) - both Gradio + MCP
python app.py

# MCP-only mode - just the MCP server
python app.py --mcp-only
```
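A mode switch like this is typically a single `argparse` flag; a minimal sketch (hypothetical - the actual flag handling in `app.py` may differ):

```python
import argparse

def select_mode(argv: list[str]) -> str:
    """Map command-line flags to a run mode: 'dual' (default) or 'mcp'."""
    parser = argparse.ArgumentParser(description="STT service entry point (sketch)")
    parser.add_argument("--mcp-only", action="store_true",
                        help="run only the MCP stdio server, no Gradio UI")
    args = parser.parse_args(argv)
    return "mcp" if args.mcp_only else "dual"

print(select_mode([]))              # dual
print(select_mode(["--mcp-only"]))  # mcp
```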

## 🚀 MCP Integration Benefits

### Direct Tool Integration

- **No HTTP overhead**: Direct protocol communication
- **Type-safe interactions**: Structured tool definitions
- **Streaming support**: Real-time tool communication
- **Auto-discovery**: Tools are automatically discoverable

### Use Cases

1. **LLM Agent Integration**: Direct STT capability for AI agents
2. **Workflow Automation**: Seamless audio processing in pipelines
3. **Development Tools**: IDE extensions with voice transcription
4. **Multi-modal Applications**: Combine with other MCP services

### Integration with ChatCal Voice

```python
import json

# Example: voice-enabled calendar scheduling
async def voice_calendar_integration():
    # STT: convert voice to text
    stt_result = await stt_session.call_tool("stt_transcribe", {
        "audio_path": "user_voice_request.wav",
        "language": "auto",
    })
    transcription = json.loads(stt_result.content[0].text)["transcription"]

    # Process the calendar request with an LLM
    calendar_action = process_calendar_request(transcription)

    # TTS: convert the response back to voice (if integrated)
    # tts_result = await tts_session.call_tool("tts_generate", {...})
```

Updated: Mon Aug 25 19:59:40 EDT 2025