Peter Michael Gits
---
title: STT WebSocket Service v1.0.0
emoji: 🎀
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
license: mit
suggested_hardware: zero-a10g
---
# 🎀 ZeroGPU Speech-to-Text Service
High-performance speech recognition powered by **Hugging Face ZeroGPU** on **NVIDIA H200** GPUs, running OpenAI Whisper models.
## 🎯 Features
- 🚀 **ZeroGPU Acceleration**: Dynamic H200 GPU allocation
- 🎀 **Multi-language Support**: 100+ languages with auto-detection
- ⚡ **Real-time Processing**: Often faster than audio duration
- 📝 **Timestamp Precision**: Word-level timing information
- 📦 **Batch Processing**: Multiple files in parallel
- 🔴 **Live Transcription**: Real-time microphone input
- 🌐 **WebRTC Ready**: Integration with live audio streams
- 🔧 **MCP Protocol**: Model Context Protocol for direct integration
- 💰 **Cost Efficient**: No idle costs with a Pro subscription
## πŸ—οΈ Architecture
- **Backend**: Whisper (OpenAI) with PyTorch optimization
- **Frontend**: Gradio with enhanced multi-tab UI
- **GPU**: ZeroGPU with H200 dynamic scaling
- **Models**: Whisper tiny/base/small/medium/large-v2
## 🚀 Performance
- **Real-time Factor**: 0.1x - 0.5x (much faster than real-time)
- **Languages**: 100+ with auto-detection
- **Accuracy**: State-of-the-art with Whisper models
- **Batch processing**: Parallel execution on H200
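The real-time factor (RTF) above is processing time divided by audio duration, so values below 1.0 mean transcription finishes before the clip would have finished playing. A minimal sketch with hypothetical numbers (not measured on this Space):

```python
# Real-time factor: RTF = processing_time / audio_duration.
# RTF < 1.0 means transcription is faster than playback.
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

# Hypothetical example: a 60 s clip transcribed in 12 s
rtf = real_time_factor(12.0, 60.0)
print(f"RTF: {rtf:.2f}")  # RTF: 0.20
```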
## 💻 API Usage
### Python Client (Gradio)
```python
from gradio_client import Client, handle_file

client = Client("YOUR_USERNAME/stt-gpu-service")
result = client.predict(
    handle_file("audio.wav"),  # audio file
    "auto",                    # language ("auto" = detect)
    "base",                    # model size
    True,                      # return timestamps
    api_name="/predict"
)
status, transcription, timestamps = result
```
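The exact shape of the returned `timestamps` value depends on the Space's implementation; assuming it is a list of `{"word", "start", "end"}` entries, the word-level timing can be rendered into caption-style lines like so (a sketch, not the Space's own format):

```python
# Assumed timestamp shape: a list of {"word", "start", "end"} dicts.
# Check the actual API response of the Space before relying on this.
def to_caption_lines(timestamps):
    """Render word-level timestamps as 'start-end: word' lines."""
    return [f"{t['start']:.2f}-{t['end']:.2f}: {t['word']}" for t in timestamps]

words = [
    {"word": "hello", "start": 0.00, "end": 0.40},
    {"word": "world", "start": 0.45, "end": 0.90},
]
print("\n".join(to_caption_lines(words)))
```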
### MCP Client (Model Context Protocol)
```python
# Using an MCP client for direct integration
import asyncio
import json

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server_params = StdioServerParameters(
    command="python",
    args=["app.py", "--mcp-only"],
)

async def main():
    # Launch the MCP server over stdio and open a client session
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Transcribe a single audio file
            result = await session.call_tool(
                "stt_transcribe",
                {
                    "audio_path": "/path/to/audio.wav",
                    "language": "auto",
                    "model_size": "base",
                    "return_timestamps": True,
                },
            )
            transcription_data = json.loads(result.content[0].text)
            print(f"Transcription: {transcription_data['transcription']}")

asyncio.run(main())
```
### Dual Protocol Support
This service now supports **both Gradio HTTP API and MCP protocol** simultaneously:
1. **Gradio Interface**: Traditional web UI and HTTP API (port 7860)
2. **MCP Protocol**: Direct tool integration via stdio
#### MCP Tools Available:
- `stt_transcribe`: Transcribe single audio file
- `stt_batch_transcribe`: Batch transcribe multiple files
- `stt_get_info`: Get system and service information
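For the batch tool, a plausible argument payload mirrors the `stt_transcribe` parameters with a list of paths; the exact `stt_batch_transcribe` schema is an assumption, so check the tool definitions the server advertises via MCP tool discovery:

```python
# Build an argument payload for stt_batch_transcribe.
# Parameter names mirror stt_transcribe and are an assumption;
# verify against the schema reported by the server's tool listing.
def batch_args(audio_paths, language="auto", model_size="base"):
    return {
        "audio_paths": list(audio_paths),
        "language": language,
        "model_size": model_size,
    }

args = batch_args(["meeting.wav", "voicemail.wav"], model_size="small")
# Then, inside an open session:
#   result = await session.call_tool("stt_batch_transcribe", args)
```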
#### Running Modes:
```bash
# Dual mode (default) - Both Gradio + MCP
python app.py
# MCP-only mode - Just MCP server
python app.py --mcp-only
```
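MCP clients that launch servers from a configuration file (for example, Claude Desktop's `mcpServers` format) can register the MCP-only mode with an entry along these lines; the server name and paths are placeholders:

```json
{
  "mcpServers": {
    "stt-gpu-service": {
      "command": "python",
      "args": ["app.py", "--mcp-only"]
    }
  }
}
```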
## 🚀 MCP Integration Benefits
### Direct Tool Integration
- **No HTTP overhead**: Direct protocol communication
- **Type-safe interactions**: Structured tool definitions
- **Streaming support**: Real-time tool communication
- **Auto-discovery**: Tools are automatically discoverable
### Use Cases
1. **LLM Agent Integration**: Direct STT capability for AI agents
2. **Workflow Automation**: Seamless audio processing in pipelines
3. **Development Tools**: IDE extensions with voice transcription
4. **Multi-modal Applications**: Combine with other MCP services
### Integration with ChatCal Voice
```python
import json

# Example: voice-enabled calendar scheduling.
# Assumes an open MCP ClientSession (`stt_session`, as in the client example)
# and a hypothetical `process_calendar_request` LLM helper.
async def voice_calendar_integration():
    # STT: convert voice to text
    stt_result = await stt_session.call_tool("stt_transcribe", {
        "audio_path": "user_voice_request.wav",
        "language": "auto",
    })
    transcription = json.loads(stt_result.content[0].text)["transcription"]

    # Process the calendar request with an LLM
    calendar_action = process_calendar_request(transcription)

    # TTS: convert the response back to voice (if a TTS MCP service is wired up)
    # tts_result = await tts_session.call_tool("tts_generate", {...})
```
_Updated: Mon Aug 25 19:59:40 EDT 2025_