stt-gpu-service / README.md
Peter Michael Gits
---
title: STT WebSocket Service v1.0.0
emoji: 🎀
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: mit
suggested_hardware: zero-a10g
---

# 🎀 ZeroGPU Speech-to-Text Service

High-performance speech recognition powered by Hugging Face ZeroGPU and NVIDIA H200 GPUs, running OpenAI Whisper models.

## 🎯 Features

- 🚀 **ZeroGPU Acceleration**: Dynamic H200 GPU allocation
- 🎀 **Multi-language Support**: 100+ languages with auto-detection
- ⚡ **Real-time Processing**: Often faster than audio duration
- 📍 **Timestamp Precision**: Word-level timing information
- 📦 **Batch Processing**: Multiple files in parallel
- 🔴 **Live Transcription**: Real-time microphone input
- 🌐 **WebRTC Ready**: Integration with live audio streams
- 🔧 **MCP Protocol**: Model Context Protocol for direct integration
- 💰 **Cost Efficient**: No idle costs with a Pro subscription

πŸ—οΈ Architecture

  • Backend: Whisper (OpenAI) with PyTorch optimization
  • Frontend: Gradio with enhanced multi-tab UI
  • GPU: ZeroGPU with H200 dynamic scaling
  • Models: Whisper tiny/base/small/medium/large-v2
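As a rough guide to the size/speed trade-off between these checkpoints, a small helper sketch (parameter counts are approximate figures from the openai/whisper model card; the `pick_model` helper is illustrative, not part of this service's API):

```python
# Approximate parameter counts (in millions) for the Whisper checkpoints
# listed above; larger models are slower but more accurate.
WHISPER_PARAMS_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large-v2": 1550,
}

def pick_model(max_params_m: int) -> str:
    """Return the largest listed checkpoint within a parameter budget (millions)."""
    candidates = [name for name, p in WHISPER_PARAMS_M.items() if p <= max_params_m]
    return max(candidates, key=WHISPER_PARAMS_M.get)

print(pick_model(300))  # small
```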

## 🚀 Performance

- **Real-time Factor**: 0.1x - 0.5x (much faster than real time)
- **Languages**: 100+ with auto-detection
- **Accuracy**: State-of-the-art with Whisper models
- **Batch Processing**: Parallel execution on H200
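The real-time factor above is processing time divided by audio duration, so values below 1.0 mean faster than real time. A one-line sketch:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; values < 1.0 are faster than real time."""
    return processing_seconds / audio_seconds

# A 60-second clip transcribed in 12 seconds:
print(real_time_factor(12.0, 60.0))  # 0.2
```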

## 💻 API Usage

### Python Client (Gradio)

```python
from gradio_client import Client

client = Client("YOUR_USERNAME/stt-gpu-service")
result = client.predict(
    "audio.wav",     # audio file
    "auto",          # language
    "base",          # model size
    True,            # return timestamps
    api_name="/predict"
)
status, transcription, timestamps = result
```

### MCP Client (Model Context Protocol)

```python
# Using an MCP client for direct integration
import asyncio
import json

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch and connect to the MCP server over stdio
server_params = StdioServerParameters(
    command="python",
    args=["app.py", "--mcp-only"],
)

async def main():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Transcribe a single audio file
            result = await session.call_tool(
                "stt_transcribe",
                {
                    "audio_path": "/path/to/audio.wav",
                    "language": "auto",
                    "model_size": "base",
                    "return_timestamps": True,
                },
            )

            transcription_data = json.loads(result.content[0].text)
            print(f"Transcription: {transcription_data['transcription']}")

asyncio.run(main())
```

### Dual Protocol Support

This service supports both the Gradio HTTP API and the MCP protocol simultaneously:

1. **Gradio Interface**: Traditional web UI and HTTP API (port 7860)
2. **MCP Protocol**: Direct tool integration via stdio

**MCP Tools Available:**

- `stt_transcribe`: Transcribe a single audio file
- `stt_batch_transcribe`: Batch-transcribe multiple files
- `stt_get_info`: Get system and service information
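Tool results arrive as JSON text content, as in the client example above. A minimal sketch of unpacking a response (the `transcription` field matches that example; the other fields here are assumed for illustration):

```python
import json

# Hypothetical stt_transcribe payload for illustration only
raw = '{"transcription": "hello world", "language": "en", "duration": 1.4}'

payload = json.loads(raw)
print(payload["transcription"])  # hello world
```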

**Running Modes:**

```bash
# Dual mode (default) - both Gradio + MCP
python app.py

# MCP-only mode - just the MCP server
python app.py --mcp-only
```
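A mode switch like this is typically a single `argparse` flag; a minimal sketch (hypothetical - the actual flag handling in `app.py` may differ):

```python
import argparse

def select_mode(argv: list[str]) -> str:
    """Map command-line flags to a run mode: 'dual' (default) or 'mcp'."""
    parser = argparse.ArgumentParser(description="STT service entry point (sketch)")
    parser.add_argument("--mcp-only", action="store_true",
                        help="run only the MCP stdio server, no Gradio UI")
    args = parser.parse_args(argv)
    return "mcp" if args.mcp_only else "dual"

print(select_mode([]))              # dual
print(select_mode(["--mcp-only"]))  # mcp
```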

## 🚀 MCP Integration Benefits

### Direct Tool Integration

- **No HTTP overhead**: Direct protocol communication
- **Type-safe interactions**: Structured tool definitions
- **Streaming support**: Real-time tool communication
- **Auto-discovery**: Tools are automatically discoverable

### Use Cases

1. **LLM Agent Integration**: Direct STT capability for AI agents
2. **Workflow Automation**: Seamless audio processing in pipelines
3. **Development Tools**: IDE extensions with voice transcription
4. **Multi-modal Applications**: Combine with other MCP services

### Integration with ChatCal Voice

```python
import json

# Example: voice-enabled calendar scheduling
async def voice_calendar_integration():
    # STT: convert voice to text
    stt_result = await stt_session.call_tool("stt_transcribe", {
        "audio_path": "user_voice_request.wav",
        "language": "auto",
    })
    transcription = json.loads(stt_result.content[0].text)["transcription"]

    # Process the calendar request with an LLM
    calendar_action = process_calendar_request(transcription)

    # TTS: convert the response back to voice (if integrated)
    # tts_result = await tts_session.call_tool("tts_generate", {...})
```

Updated: Mon Aug 25 19:59:40 EDT 2025