---
title: VoiceKit MCP
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: "6.0.0"
app_file: app.py
pinned: false
tags:
- building-mcp-track-creative
- mcp-server
---
# 🎤 VoiceKit MCP
> **Professional voice analysis as MCP tools: extract embeddings, compare voices, transcribe speech, and more.**

Six MCP tools for voice processing, all accepting base64-encoded audio.
📢 **Social Post:** [View on X](https://x.com/dahee_pk/status/1994389505898582442)<br>
🎬 **Demo Video:** [Watch on YouTube](https://www.youtube.com/watch?v=1VIqvpwfyWU)<br>
👥 **Team:** [@EricYoun](https://huggingface.co/EricYoun), [@NickEo](https://huggingface.co/NickEo), [@HYENA-WON](https://huggingface.co/HYENA-WON), [@jjin6573](https://huggingface.co/jjin6573), [@cocoajoa](https://huggingface.co/cocoajoa)
---
## 📋 Submission Info
| | |
|---|---|
| **Track** | Building MCP - Creative |
| **MCP Endpoint** | `https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse` |
| **Framework** | Gradio 6.0 |
---
## ✅ Track 1 Requirements
| Requirement | How We Fulfill It |
|-------------|-------------------|
| **Functioning MCP Server** | 6 MCP tools exposed via Gradio's `mcp_server=True` |
| **MCP Client Demo** | Video shows integration with Claude Desktop / MCP client |
| **Documented Tools** | Full API documentation with inputs/outputs below |
| **Gradio App** | Interactive demo UI + hidden MCP tool interfaces |
---
## 🛠️ MCP Tools (6 Tools)
All tools accept **base64-encoded audio** as input.
### 1. `extract_embedding` <img src="icons/extract_embedding.svg" width="20" height="20">
Extract voice embeddings using the Wav2Vec2 model.
| | |
|---|---|
| **Input** | `audio_base64` (base64-encoded audio) |
| **Output** | `embedding_preview` (first 5 values), `embedding_length` (768) |
| **Use Case** | Speaker identification, voice fingerprinting |
<img src="imgs/extract_embedding.jpg" height="300">
### 2. `match_voice` <img src="icons/match_voice.svg" width="20" height="20">
Compare similarity between two voices.
| | |
|---|---|
| **Inputs** | `audio1_base64`, `audio2_base64` |
| **Output** | `similarity` (0-1), `tone_score` (0-100) |
| **Use Case** | Voice cloning verification, speaker matching |
<img src="imgs/match_voice.jpg" height="300">
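
The README does not say how `similarity` is computed. A common way to compare two embedding vectors is cosine similarity, so the following is a minimal sketch under that assumption, not the Space's actual implementation. The helper names and the rescaling to the 0-1 range are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # undefined for zero vectors; treated here as no similarity
    return dot / (norm_a * norm_b)


def to_unit_range(cos):
    """Map a cosine in [-1, 1] onto [0, 1] (hypothetical rescaling)."""
    return (cos + 1.0) / 2.0
```

Identical directions give a cosine of 1, orthogonal vectors give 0, and opposite directions give -1, which is why a rescaling step is needed if the reported value must stay within 0-1.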
### 3. `analyze_acoustics` <img src="icons/analyze_acoustics.svg" width="20" height="20">
Extract detailed acoustic characteristics.
| | |
|---|---|
| **Input** | `audio_base64` |
| **Output** | Pitch, energy, rhythm, tempo, spectral info |
| **Use Case** | Emotional tone detection, voice profiling |
<img src="imgs/analyze_acoustics.jpg" height="300">
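
To give a feel for what an "energy" style feature measures, here is a small pure-Python sketch of two standard frame-level descriptors: RMS energy and zero-crossing rate. The Tech Stack lists librosa + scipy for acoustic analysis; this code does not use them and is only an illustration, so the function names and the choice of features are assumptions.

```python
import math


def rms_energy(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for prev, cur in zip(samples, samples[1:]) if (prev < 0) != (cur < 0)
    )
    return crossings / (len(samples) - 1)
```

Louder frames yield a higher RMS value, while noisy or high-frequency frames tend to yield a higher zero-crossing rate.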
### 4. `transcribe_audio` <img src="icons/transcribe_audio.svg" width="20" height="20">
Convert speech to text (multilingual).
| | |
|---|---|
| **Inputs** | `audio_base64`, `language` (default: "en") |
| **Output** | Transcribed text, detected language |
| **Model** | ElevenLabs Scribe v1 |
| **Languages** | English, Korean, Japanese, and 15+ more |
<img src="imgs/transcribe_audio.jpg" height="300">
### 5. `isolate_voice` <img src="icons/isolate_voice.svg" width="20" height="20">
Remove background music/noise and extract clean voice.
| | |
|---|---|
| **Input** | `audio_base64` (audio with background sounds) |
| **Output** | Isolated audio (base64), BGM detection status |
| **Use Case** | Audio cleanup for memes, songs, movies |
<img src="imgs/isolate_voice.jpg" height="300">
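
Since the isolated audio comes back base64-encoded, a client has to decode it before playback or saving. The helper below is a minimal sketch; `save_base64_audio` is a hypothetical name, and the README does not specify the key under which the audio is returned, so the caller is assumed to pass the base64 string directly.

```python
import base64
from pathlib import Path


def save_base64_audio(audio_base64, out_path):
    """Decode a base64 string and write the raw bytes to out_path.

    Returns the number of bytes written.
    """
    data = base64.b64decode(audio_base64)
    Path(out_path).write_bytes(data)
    return len(data)
```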
### 6. `grade_voice` <img src="icons/grade_voice.svg" width="20" height="20">
Comprehensive voice comparison with multi-metric scoring.
| | |
|---|---|
| **Inputs** | `user_audio_base64`, `reference_audio_base64`, `reference_text` (optional), `category` (meme\|song\|movie) |
| **Output** | Pitch, rhythm, energy, pronunciation scores (0-100), overall score, user transcription |
| **Use Case** | Voice mimicry evaluation, pronunciation games |
<img src="imgs/grade_voice.jpg" height="300">
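
The architecture diagram lists DTW (dynamic time warping) among the backend components, but the README does not describe how it feeds into the scores. As an illustration only, the sketch below computes a DTW distance between two numeric sequences, such as pitch contours, and maps it to a 0-100 score. The `distance_to_score` mapping and its `scale` parameter are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def distance_to_score(distance, length, scale=50.0):
    """Hypothetical mapping from a DTW distance to a 0-100 score."""
    per_step = distance / max(length, 1)
    return max(0.0, 100.0 * (1.0 - per_step / scale))
```

DTW tolerates differences in timing: a contour that holds one note longer than the reference can still align at zero cost, which suits comparing a user's delivery against a reference clip of different length.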
---
## 🏗️ Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                          VoiceKit MCP                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    MCP Client (Claude)                    │  │
│  │                base64 audio → SSE endpoint                │  │
│  └─────────────────────────────┬─────────────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                Gradio MCP Server (app.py)                 │  │
│  │            mcp_server=True • 6 tool interfaces            │  │
│  └─────────────────────────────┬─────────────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                 Modal GPU Container (T4)                  │  │
│  │        Wav2Vec2 • librosa • ElevenLabs APIs • DTW         │  │
│  └─────────────────────────────┬─────────────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                       JSON Response                       │  │
│  │         embeddings • scores • transcripts • audio         │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
---
## 🔌 How to Connect
### Claude Desktop / MCP Client
Add to your MCP configuration:
```json
{
  "mcpServers": {
    "voicekit": {
      "url": "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"
    }
  }
}
```
### Example Usage
```python
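import base64

# 1. Encode audio to base64
with open("audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()

# 2. Call the MCP tool (`mcp_client` stands in for your own MCP client object)
result = mcp_client.call("extract_embedding", {"audio_base64": audio_base64})

# 3. Read the documented output fields
embedding_preview = result["embedding_preview"]  # first 5 values
embedding_length = result["embedding_length"]    # 768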
```
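
If no recording is at hand, a synthetic clip is enough to exercise the encoding step. The sketch below builds a short sine tone as an in-memory WAV file using only the Python standard library and base64-encodes it. The function names and the defaults (220 Hz, 16 kHz, mono, 16-bit) are assumptions chosen for illustration, not requirements stated by the tools.

```python
import base64
import io
import math
import struct
import wave


def make_tone_wav(freq_hz=220.0, seconds=1.0, sample_rate=16000):
    """Return the bytes of a mono 16-bit PCM WAV file containing a sine tone."""
    n_frames = int(seconds * sample_rate)
    amplitude = 0.5 * 32767  # half of full scale to avoid clipping
    frames = b"".join(
        struct.pack(
            "<h",
            int(amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)),
        )
        for i in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return buf.getvalue()


def to_base64(audio_bytes):
    """Encode raw audio bytes as an ASCII base64 string."""
    return base64.b64encode(audio_bytes).decode("ascii")


audio_base64 = to_base64(make_tone_wav(seconds=0.5))
```

A pure tone is not speech, so results from the analysis tools on such a clip are only useful for checking that the request and response plumbing works.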
---
## 🛠️ Tech Stack
| Component | Technology |
|-----------|------------|
| MCP Server | Gradio 6.0 (`mcp_server=True`) |
| GPU Compute | Modal (T4 GPU) |
| Embeddings | Wav2Vec2 (facebook/wav2vec2-base-960h) |
| Speech-to-Text | ElevenLabs Scribe v1 |
| Voice Isolation | ElevenLabs Voice Isolator |
| Acoustic Analysis | librosa + scipy |
---
## ⚡ Performance
| Metric | Value |
|--------|-------|
| Response Time (warm) | <200ms |
| Cold Start | 1-3s (memory snapshot optimized) |
| Embedding Dimensions | 768 |
| Supported Audio | Any format (auto-converts to WAV) |
| Max Duration | Tested up to 10 minutes |
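
Because every tool takes base64 input, request size grows with clip length: base64 emits 4 characters for every 3 bytes. The sketch below estimates the payload for uncompressed PCM audio. The 16 kHz mono 16-bit defaults are an assumption (16 kHz is the rate `facebook/wav2vec2-base-960h` was trained on); the README does not state the format the Space expects.

```python
def wav_bytes(seconds, sample_rate=16000, channels=1, sample_width=2):
    """Approximate size of uncompressed PCM audio, ignoring the WAV header."""
    return int(seconds * sample_rate * channels * sample_width)


def base64_length(n_bytes):
    """Length in characters of the padded base64 encoding of n_bytes bytes."""
    return 4 * ((n_bytes + 2) // 3)


ten_minutes = wav_bytes(600)          # 19,200,000 bytes of PCM
payload = base64_length(ten_minutes)  # 25,600,000 base64 characters
```

Under these assumptions a 10-minute clip is roughly 19 MB raw and about 26 MB once encoded, so compressed formats may be preferable for long recordings.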
---
## 🎯 Why VoiceKit MCP?
| Criteria | Our Approach |
|----------|--------------|
| **Functionality** | 6 production-ready tools covering full voice analysis pipeline |
| **Innovation** | First MCP server for comprehensive voice analysis |
| **Documentation** | Complete API docs with inputs/outputs/use cases |
| **Real-world Impact** | Powers Voice Sementle game; applicable to voice cloning, accessibility, language learning |
---
## 🎮 Interactive Demo
👆 **Click the interface above to try each tool!**
1. Upload or record audio
2. Select a tool to test
3. View JSON results with scores and analysis
4. Copy embeddings or transcripts for your app
---
## 🔗 Related Projects
- **[Voice Sementle](https://huggingface.co/spaces/MCP-1st-Birthday/Voice-Sementle)**: Daily voice puzzle game powered by VoiceKit MCP
---
**Built for [MCP's 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)** 🎂
*Celebrating one year of Model Context Protocol!*