---
title: VoiceKit MCP
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: "6.0.0"
app_file: app.py
pinned: false
tags:
  - building-mcp-track-creative
  - mcp-server
---

# VoiceKit MCP

> **Professional voice analysis as MCP tools: extract embeddings, compare voices, transcribe speech, and more.**

Six MCP tools for voice processing, all accepting base64-encoded audio.

**Social Post:** [View on X](https://x.com/dahee_pk/status/1994389505898582442)<br>
**Demo Video:** [Watch on YouTube](https://www.youtube.com/watch?v=1VIqvpwfyWU)<br>
**Team:** [@EricYoun](https://huggingface.co/EricYoun), [@NickEo](https://huggingface.co/NickEo), [@HYENA-WON](https://huggingface.co/HYENA-WON), [@jjin6573](https://huggingface.co/jjin6573), [@cocoajoa](https://huggingface.co/cocoajoa)

---

## Submission Info

| | |
|---|---|
| **Track** | Building MCP: Creative |
| **MCP Endpoint** | `https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse` |
| **Framework** | Gradio 6.0 |

---

## Track 1 Requirements

| Requirement | How We Fulfill It |
|-------------|-------------------|
| **Functioning MCP Server** | 6 MCP tools exposed via Gradio's `mcp_server=True` |
| **MCP Client Demo** | Video shows integration with Claude Desktop / MCP client |
| **Documented Tools** | Full API documentation with inputs/outputs below |
| **Gradio App** | Interactive demo UI + hidden MCP tool interfaces |
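
As a rough illustration of how a function becomes an MCP tool through `mcp_server=True`, here is a minimal sketch. The `describe_audio` function and `build_demo` helper are hypothetical and are not taken from this project's `app.py`. The sketch assumes Gradio is installed with its MCP extra and that Gradio derives the tool description from the function's type hints and docstring.

```python
import base64


def describe_audio(audio_base64: str) -> dict:
    """Report the size of a base64-encoded audio payload.

    Args:
        audio_base64: Audio file contents encoded as a base64 string.
    """
    raw = base64.b64decode(audio_base64)
    return {"num_bytes": len(raw)}


def build_demo():
    """Wrap the function in a Gradio interface (hypothetical helper)."""
    import gradio as gr  # imported lazily so describe_audio works without Gradio

    return gr.Interface(fn=describe_audio, inputs=gr.Textbox(), outputs=gr.JSON())


# To serve the function as an MCP tool, run:
#     build_demo().launch(mcp_server=True)
```

Launching this way should serve both the web UI and an MCP endpoint under `/gradio_api/mcp/sse`, the same path shape as the endpoint listed above.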

---

## MCP Tools (6 Tools)

All tools accept **base64-encoded audio** as input.

### 1. `extract_embedding` <img src="icons/extract_embedding.svg" width="20" height="20">

Extract voice embeddings using the Wav2Vec2 model.

| | |
|---|---|
| **Input** | `audio_base64` (base64-encoded audio) |
| **Output** | `embedding_preview` (first 5 values), `embedding_length` (768) |
| **Use Case** | Speaker identification, voice fingerprinting |

<img src="imgs/extract_embedding.jpg" height="300">

### 2. `match_voice` <img src="icons/match_voice.svg" width="20" height="20">

Compare the similarity of two voices.

| | |
|---|---|
| **Inputs** | `audio1_base64`, `audio2_base64` |
| **Output** | `similarity` (0-1), `tone_score` (0-100) |
| **Use Case** | Voice cloning verification, speaker matching |

<img src="imgs/match_voice.jpg" height="300">
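
This README does not say how `similarity` is computed. One common way to compare two embedding vectors is cosine similarity, sketched below as an assumption, not as the project's actual method. The `to_unit_interval` rescaling to a 0-1 range is also hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def to_unit_interval(cosine):
    """Map a cosine value from [-1, 1] onto [0, 1] (hypothetical rescaling)."""
    return (cosine + 1.0) / 2.0
```

With real 768-dimensional embeddings the call is the same: `cosine_similarity(embedding_a, embedding_b)`.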

### 3. `analyze_acoustics` <img src="icons/analyze_acoustics.svg" width="20" height="20">

Extract detailed acoustic characteristics.

| | |
|---|---|
| **Input** | `audio_base64` |
| **Output** | Pitch, energy, rhythm, tempo, spectral info |
| **Use Case** | Emotional tone detection, voice profiling |

<img src="imgs/analyze_acoustics.jpg" height="300">

### 4. `transcribe_audio` <img src="icons/transcribe_audio.svg" width="20" height="20">

Convert speech to text (multilingual).

| | |
|---|---|
| **Inputs** | `audio_base64`, `language` (default: "en") |
| **Output** | Transcribed text, detected language |
| **Model** | ElevenLabs Scribe v1 |
| **Languages** | English, Korean, Japanese, and 15+ more |

<img src="imgs/transcribe_audio.jpg" height="300">

### 5. `isolate_voice` <img src="icons/isolate_voice.svg" width="20" height="20">

Remove background music and noise to extract a clean voice track.

| | |
|---|---|
| **Input** | `audio_base64` (audio with background sounds) |
| **Output** | Isolated audio (base64), BGM detection status |
| **Use Case** | Audio cleanup for memes, songs, movies |

<img src="imgs/isolate_voice.jpg" height="300">
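
Because the isolated audio comes back base64-encoded, a client has to decode it before playback or saving. The helper below is a minimal sketch using only the standard library. The name `save_base64_audio` is hypothetical, and the exact response field that carries the audio is not specified in this README.

```python
import base64
from pathlib import Path


def save_base64_audio(audio_b64: str, out_path: str) -> int:
    """Decode a base64 string and write the raw bytes to `out_path`.

    Returns the number of bytes written. Raises binascii.Error (a ValueError
    subclass) if the input is not valid base64.
    """
    data = base64.b64decode(audio_b64, validate=True)
    Path(out_path).write_bytes(data)
    return len(data)
```

For example, `save_base64_audio(isolated_b64, "clean_voice.wav")`, where `isolated_b64` stands for whatever string the tool returned.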

### 6. `grade_voice` <img src="icons/grade_voice.svg" width="20" height="20">

Comprehensive voice comparison with multi-metric scoring.

| | |
|---|---|
| **Inputs** | `user_audio_base64`, `reference_audio_base64`, `reference_text` (optional), `category` (meme\|song\|movie) |
| **Output** | Pitch, rhythm, energy, pronunciation scores (0-100), overall score, user transcription |
| **Use Case** | Voice mimicry evaluation, pronunciation games |

<img src="imgs/grade_voice.jpg" height="300">
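
Since `grade_voice` takes several arguments, it can help to assemble and validate them before calling the tool. The sketch below uses the parameter names from the table above. The helper `build_grade_voice_args` is hypothetical, and leaving out `reference_text` when it is not provided is an assumption about how the server treats the optional field.

```python
import base64
from typing import Optional

ALLOWED_CATEGORIES = {"meme", "song", "movie"}


def build_grade_voice_args(
    user_audio: bytes,
    reference_audio: bytes,
    category: str,
    reference_text: Optional[str] = None,
) -> dict:
    """Build the argument dict for a `grade_voice` call from raw audio bytes."""
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(f"category must be one of {sorted(ALLOWED_CATEGORIES)}")
    args = {
        "user_audio_base64": base64.b64encode(user_audio).decode("ascii"),
        "reference_audio_base64": base64.b64encode(reference_audio).decode("ascii"),
        "category": category,
    }
    if reference_text is not None:
        args["reference_text"] = reference_text  # optional per the table above
    return args
```

Validating `category` on the client side avoids a round trip for an argument the server would presumably reject.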

---

## Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                         VoiceKit MCP                          │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                   MCP Client (Claude)                   │  │
│  │               base64 audio → SSE endpoint               │  │
│  └────────────────────────────┬────────────────────────────┘  │
│                               ▼                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │               Gradio MCP Server (app.py)                │  │
│  │           mcp_server=True • 6 tool interfaces           │  │
│  └────────────────────────────┬────────────────────────────┘  │
│                               ▼                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                Modal GPU Container (T4)                 │  │
│  │       Wav2Vec2 • librosa • ElevenLabs APIs • DTW        │  │
│  └────────────────────────────┬────────────────────────────┘  │
│                               ▼                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                      JSON Response                      │  │
│  │        embeddings • scores • transcripts • audio        │  │
│  └─────────────────────────────────────────────────────────┘  │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
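
The diagram lists DTW (dynamic time warping) among the GPU container's components but does not describe how it is used. As an illustration only, here is a textbook DTW distance between two numeric sequences, such as pitch contours of different lengths. It is a plain-Python sketch, not the project's implementation.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    Uses absolute difference as the local cost and allows match,
    insertion and deletion steps.
    """
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

A time-stretched copy of a contour has distance zero under this measure, which makes DTW useful when two speakers say the same phrase at different speeds.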

---

## How to Connect

### Claude Desktop / MCP Client

Add to your MCP configuration:

```json
{
  "mcpServers": {
    "voicekit": {
      "url": "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"
    }
  }
}
```
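
### Example Usage

```python
import asyncio
import base64

from mcp import ClientSession  # MCP Python SDK: pip install mcp
from mcp.client.sse import sse_client

URL = "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"

async def main():
    # 1. Encode audio to base64
    with open("audio.wav", "rb") as f:
        audio_base64 = base64.b64encode(f.read()).decode()
    # 2. Call the MCP tool (check session.list_tools() if the exposed name differs)
    async with sse_client(URL) as (read, write), ClientSession(read, write) as session:
        await session.initialize()
        result = await session.call_tool("extract_embedding", {"audio_base64": audio_base64})
    # 3. Inspect the response: embedding_preview (first 5 values) and embedding_length (768)
    print(result.content)

asyncio.run(main())
```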

---

## Tech Stack

| Component | Technology |
|-----------|------------|
| MCP Server | Gradio 6.0 (`mcp_server=True`) |
| GPU Compute | Modal (T4 GPU) |
| Embeddings | Wav2Vec2 (facebook/wav2vec2-base-960h) |
| Speech-to-Text | ElevenLabs Scribe v1 |
| Voice Isolation | ElevenLabs Voice Isolator |
| Acoustic Analysis | librosa + scipy |

---

## Performance

| Metric | Value |
|--------|-------|
| Response Time (warm) | < 200 ms |
| Cold Start | 1-3 s (memory snapshot optimized) |
| Embedding Dimensions | 768 |
| Supported Audio | Any format (auto-converts to WAV) |
| Max Duration | Tested up to 10 minutes |

---

## Why VoiceKit MCP?

| Criteria | Our Approach |
|----------|--------------|
| **Functionality** | 6 production-ready tools covering the full voice analysis pipeline |
| **Innovation** | To our knowledge, the first MCP server for comprehensive voice analysis |
| **Documentation** | Complete API docs with inputs, outputs, and use cases |
| **Real-world Impact** | Powers the Voice Sementle game; applicable to voice cloning, accessibility, and language learning |

---

## Interactive Demo

**Click the interface above to try each tool!**

1. Upload or record audio
2. Select a tool to test
3. View JSON results with scores and analysis
4. Copy embeddings or transcripts for your app

---

## Related Projects

- **[Voice Sementle](https://huggingface.co/spaces/MCP-1st-Birthday/Voice-Sementle)**: daily voice puzzle game powered by VoiceKit MCP

---

**Built for [MCP's 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)**

*Celebrating one year of the Model Context Protocol!*