	---
	title: VoiceKit MCP
	emoji: 🎤
	colorFrom: purple
	colorTo: indigo
	sdk: gradio
	sdk_version: "6.0.0"
	app_file: app.py
	pinned: false
	tags:
	- building-mcp-track-creative
	- mcp-server
	---

	# 🎤 VoiceKit MCP

	> Professional voice analysis as MCP tools — extract embeddings, compare voices, transcribe speech, and more.

	Six MCP tools for voice processing, all of which accept base64-encoded audio.

	📢 Social Post: [View on X](https://x.com/dahee_pk/status/1994389505898582442)<br>
	🎬 Demo Video: [Watch on YouTube](https://www.youtube.com/watch?v=1VIqvpwfyWU)<br>
	👥 Team: [@EricYoun](https://huggingface.co/EricYoun), [@NickEo](https://huggingface.co/NickEo), [@HYENA-WON](https://huggingface.co/HYENA-WON), [@jjin6573](https://huggingface.co/jjin6573), [@cocoajoa](https://huggingface.co/cocoajoa)

	---

	## 📋 Submission Info

	| | |
	|---|---|
	| Track | Building MCP — Creative |
	| MCP Endpoint | `https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse` |
	| Framework | Gradio 6.0 |

	---

	## ✅ Track 1 Requirements

	| Requirement | How We Fulfill It |
	|-------------|-------------------|
	| Functioning MCP Server | 6 MCP tools exposed via Gradio's `mcp_server=True` |
	| MCP Client Demo | Video shows integration with Claude Desktop / MCP client |
	| Documented Tools | Full API documentation with inputs/outputs below |
	| Gradio App | Interactive demo UI + hidden MCP tool interfaces |

	---

	## 🛠️ MCP Tools (6 Tools)

	All tools accept base64-encoded audio as input.

	### 1. `extract_embedding` <img src="icons/extract_embedding.svg" width="20" height="20">
	Extract voice embeddings using the Wav2Vec2 model.

	| | |
	|---|---|
	| Input | `audio_base64` (base64-encoded audio) |
	| Output | `embedding_preview` (first 5 values), `embedding_length` (768) |
	| Use Case | Speaker identification, voice fingerprinting |

	<img src="imgs/extract_embedding.jpg" height="300">

	### 2. `match_voice` <img src="icons/match_voice.svg" width="20" height="20">
	Compare similarity between two voices.

	| | |
	|---|---|
	| Inputs | `audio1_base64`, `audio2_base64` |
	| Output | `similarity` (0-1), `tone_score` (0-100) |
	| Use Case | Voice cloning verification, speaker matching |

	<img src="imgs/match_voice.jpg" height="300">
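
The README does not say how `similarity` is computed. One common way to compare two embedding vectors is cosine similarity clamped to the 0-1 range, and the sketch below illustrates that idea under that assumption. The function name `cosine_similarity_01` and the clamping step are illustrative and are not taken from the VoiceKit source.

```python
import math
from typing import Sequence


def cosine_similarity_01(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors, clamped to [0, 1].

    Assumption: illustrative only; VoiceKit's actual scoring may differ.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as no match
    return max(0.0, min(1.0, dot / (norm_a * norm_b)))
```

Clamping maps vectors that point in opposite directions to 0 rather than a negative value, which keeps the result inside the documented 0-1 range.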

	### 3. `analyze_acoustics` <img src="icons/analyze_acoustics.svg" width="20" height="20">
	Extract detailed acoustic characteristics.

	| | |
	|---|---|
	| Input | `audio_base64` |
	| Output | Pitch, energy, rhythm, tempo, spectral info |
	| Use Case | Emotional tone detection, voice profiling |

	<img src="imgs/analyze_acoustics.jpg" height="300">

	### 4. `transcribe_audio` <img src="icons/transcribe_audio.svg" width="20" height="20">
	Convert speech to text (multilingual).

	| | |
	|---|---|
	| Inputs | `audio_base64`, `language` (default: "en") |
	| Output | Transcribed text, detected language |
	| Model | ElevenLabs Scribe v1 |
	| Languages | English, Korean, Japanese, and 15+ more |

	<img src="imgs/transcribe_audio.jpg" height="300">

	### 5. `isolate_voice` <img src="icons/isolate_voice.svg" width="20" height="20">
	Remove background music/noise and extract clean voice.

	| | |
	|---|---|
	| Input | `audio_base64` (audio with background sounds) |
	| Output | Isolated audio (base64), BGM detection status |
	| Use Case | Audio cleanup for memes, songs, movies |

	<img src="imgs/isolate_voice.jpg" height="300">
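
Because `isolate_voice` returns the isolated audio as base64, a client has to decode it before saving or playing it. The following is a minimal sketch of that step. The helper name `save_base64_audio` is hypothetical, and the key under which the audio appears in the response is not documented here, so check the actual tool output before relying on a particular field name.

```python
import base64
from pathlib import Path


def save_base64_audio(audio_b64: str, out_path: str) -> int:
    """Decode a base64 string and write the raw bytes to out_path.

    Returns the number of bytes written.
    """
    # validate=True makes malformed input raise instead of being silently skipped
    data = base64.b64decode(audio_b64, validate=True)
    Path(out_path).write_bytes(data)
    return len(data)
```

Usage would look like `save_base64_audio(response_audio, "clean_voice.wav")`, where `response_audio` is the base64 string taken from the tool response.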

	### 6. `grade_voice` <img src="icons/grade_voice.svg" width="20" height="20">
	Comprehensive voice comparison with multi-metric scoring.

	| | |
	|---|---|
	| Inputs | `user_audio_base64`, `reference_audio_base64`, `reference_text` (optional), `category` (meme\|song\|movie) |
	| Output | Pitch, rhythm, energy, pronunciation scores (0-100), overall score, user transcription |
	| Use Case | Voice mimicry evaluation, pronunciation games |

	<img src="imgs/grade_voice.jpg" height="300">
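
Since `grade_voice` takes several inputs and a restricted `category`, it can help to assemble and validate the arguments on the client before calling the tool. The sketch below is one way to do that. The helper `build_grade_voice_args` is hypothetical, and leaving out `reference_text` when it is not provided is an assumption; the server may expect an empty string instead.

```python
from typing import Dict, Optional

ALLOWED_CATEGORIES = ("meme", "song", "movie")  # values listed in the inputs table


def build_grade_voice_args(
    user_audio_base64: str,
    reference_audio_base64: str,
    category: str,
    reference_text: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble the argument dict for `grade_voice`, validating `category`."""
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(
            f"category must be one of {ALLOWED_CATEGORIES}, got {category!r}"
        )
    args = {
        "user_audio_base64": user_audio_base64,
        "reference_audio_base64": reference_audio_base64,
        "category": category,
    }
    # Assumption: the optional field is simply omitted when no text is given.
    if reference_text is not None:
        args["reference_text"] = reference_text
    return args
```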

	---

	## 🏗️ Architecture

	```
	┌─────────────────────────────────────────────────────────────────┐
	│                          VoiceKit MCP                           │
	├─────────────────────────────────────────────────────────────────┤
	│                                                                 │
	│  ┌────────────────────────────────────────────────────────────┐ │
	│  │                    MCP Client (Claude)                     │ │
	│  │                base64 audio → SSE endpoint                 │ │
	│  └──────────────────────────┬─────────────────────────────────┘ │
	│                             ↓                                   │
	│  ┌────────────────────────────────────────────────────────────┐ │
	│  │                 Gradio MCP Server (app.py)                 │ │
	│  │            mcp_server=True • 6 tool interfaces             │ │
	│  └──────────────────────────┬─────────────────────────────────┘ │
	│                             ↓                                   │
	│  ┌────────────────────────────────────────────────────────────┐ │
	│  │                  Modal GPU Container (T4)                  │ │
	│  │         Wav2Vec2 • librosa • ElevenLabs APIs • DTW         │ │
	│  └──────────────────────────┬─────────────────────────────────┘ │
	│                             ↓                                   │
	│  ┌────────────────────────────────────────────────────────────┐ │
	│  │                       JSON Response                        │ │
	│  │         embeddings • scores • transcripts • audio          │ │
	│  └────────────────────────────────────────────────────────────┘ │
	│                                                                 │
	└─────────────────────────────────────────────────────────────────┘
	```

	---

	## 🔌 How to Connect

	### Claude Desktop / MCP Client

	Add to your MCP configuration:

	```json
	{
	  "mcpServers": {
	    "voicekit": {
	      "url": "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"
	    }
	  }
	}
	```

	### Example Usage

	```python
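	import base64

	# 1. Encode audio to base64
	with open("audio.wav", "rb") as f:
	    audio_base64 = base64.b64encode(f.read()).decode()

	# 2. Call the MCP tool
	# `mcp_client` is a placeholder for whichever MCP client object you use.
	result = mcp_client.call("extract_embedding", {"audio_base64": audio_base64})

	# 3. Read the output fields documented for extract_embedding
	preview = result["embedding_preview"]  # first 5 values of the embedding
	length = result["embedding_length"]    # 768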

	---

	## 🛠️ Tech Stack

	| Component | Technology |
	|-----------|------------|
	| MCP Server | Gradio 6.0 (`mcp_server=True`) |
	| GPU Compute | Modal (T4 GPU) |
	| Embeddings | Wav2Vec2 (facebook/wav2vec2-base-960h) |
	| Speech-to-Text | ElevenLabs Scribe v1 |
	| Voice Isolation | ElevenLabs Voice Isolator |
	| Acoustic Analysis | librosa + scipy |

	---

	## ⚡ Performance

	| Metric | Value |
	|--------|-------|
	| Response Time (warm) | <200ms |
	| Cold Start | 1-3s (memory snapshot optimized) |
	| Embedding Dimensions | 768 |
	| Supported Audio | Any format (auto-converts to WAV) |
	| Max Duration | Tested up to 10 minutes |

	---

	## 🎯 Why VoiceKit MCP?

	| Criteria | Our Approach |
	|----------|--------------|
	| Functionality | 6 production-ready tools covering full voice analysis pipeline |
	| Innovation | First MCP server for comprehensive voice analysis |
	| Documentation | Complete API docs with inputs/outputs/use cases |
	| Real-world Impact | Powers Voice Sementle game; applicable to voice cloning, accessibility, language learning |

	---

	## 🎮 Interactive Demo

	👆 Click the interface above to try each tool!

	1. Upload or record audio
	2. Select a tool to test
	3. View JSON results with scores and analysis
	4. Copy embeddings or transcripts for your app

	---

	## 🔗 Related Projects

	- [Voice Sementle](https://huggingface.co/spaces/MCP-1st-Birthday/Voice-Sementle) — Daily voice puzzle game powered by VoiceKit MCP

	---

	Built for [MCP's 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday) 🎂

	Celebrating one year of Model Context Protocol!