---
title: AI Assistant For Visually Impaired
emoji: 🎬
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.42.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_scopes:
  - inference-api
license: mit
short_description: AI-Assistant-for-Visually-Impaired
tags:
  - mcp-in-action-track-consumer
  - building-mcp-track-consumer
---
# Accessibility Voice Agent – MCP Tools

### **Track:** mcp-in-action-track-consumer
### **Team:** Team
### **Author:** @subhash4face – *Subhash Mankunnu*
### **Author:** *Athira AR*

Model Context Protocol (MCP) + Gradio 5 + HF Inference + ElevenLabs

A fully accessible, voice-driven AI assistant demonstrating how MCP tools can enable **speech-to-text**, **image understanding**, and **text-to-speech** workflows for low-vision and visually impaired users.

This project showcases a real-world use case of MCP tools working together inside an agent-style UI.

---
## Workflow Diagram – MCP Tools



## Demo Video

https://youtu.be/af4Y89g2HPE

## Social Media Post – LinkedIn

https://www.linkedin.com/posts/subhashmankunnu_hugginface-share-7400924735989010432-a9sH?utm_source=share&utm_medium=member_desktop&rcm=ACoAAASVxnsB9ojyfy-Kef3IWvBPf4c3pUSOaWw

---
## Key Features

### 🎤 Text-to-Speech (TTS) via ElevenLabs

**MCP Tool:** `speak_text`

- Converts any assistant message to natural speech
- Returns base64 audio + WAV playback
- Helps low-vision users receive spoken responses
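The base64-WAV packaging mentioned above can be sketched with the standard library alone. Here a generated tone stands in for the ElevenLabs output bytes; the real tool would feed TTS audio through the same packer so the browser can play it back:

```python
# Sketch of the base64 + WAV return format: pack 16-bit mono PCM
# into a WAV container, then base64-encode for browser playback.
# The tone below is a stand-in for real ElevenLabs TTS output.
import base64
import io
import math
import struct
import wave

def to_base64_wav(pcm_samples, sample_rate=16000):
    """Pack 16-bit mono PCM samples into a WAV file, return base64 text."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in pcm_samples))
    return base64.b64encode(buf.getvalue()).decode("ascii")

# 0.1 s of a 440 Hz tone as a stand-in for TTS audio.
samples = [int(20000 * math.sin(2 * math.pi * 440 * i / 16000)) for i in range(1600)]
b64_audio = to_base64_wav(samples)
print(b64_audio[:5])  # WAV files start with "RIFF", so the base64 begins "UklGR"
```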
---

### 🎤 Speech-to-Text (STT) via Whisper / Local Fallback

**MCP Tool:** `transcribe_audio`

- OpenAI Whisper STT with a local fallback
- Great for hands-free usage
- The tool-call log shows the backend used and the call duration
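The backend-plus-duration log entry comes from a simple fallback loop: try the hosted backend first, fall back to the local one, and record who answered and how long it took. A minimal sketch with stub backends (the real app would call Whisper and a local model instead):

```python
# Fallback dispatch sketch for transcribe_audio: both backends here
# are stubs; the Whisper stub simulates an outage so the local
# fallback answers, and the result records backend + duration.
import time

def whisper_stt(audio_path):
    raise ConnectionError("Whisper API unavailable")  # simulated outage

def local_stt(audio_path):
    return "hello world"  # stand-in for a local model's transcript

def transcribe_audio(audio_path):
    start = time.perf_counter()
    for name, backend in (("whisper", whisper_stt), ("local", local_stt)):
        try:
            text = backend(audio_path)
            return {
                "text": text,
                "backend": name,  # which backend actually answered
                "duration_s": round(time.perf_counter() - start, 3),
            }
        except Exception:
            continue  # try the next backend
    return {"text": "", "backend": "none",
            "duration_s": round(time.perf_counter() - start, 3)}

result = transcribe_audio("sample.wav")
print(result["backend"])  # -> local, because the whisper stub failed
```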
---

### 🖼 Image Description via OpenAI / Gemini / HF Inference

**MCP Tool:** `describe_image`

- Multimodal accessibility
- Describes any uploaded image in plain language
- Uses the Hugging Face Inference API instead of a local BLIP model
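With three possible vision backends, `describe_image` needs a selection step. One common pattern, sketched below, is to pick whichever provider has credentials configured, preferring the Hugging Face Inference API. The environment-variable names are assumptions for illustration, not necessarily what `app.py` uses:

```python
# Provider-selection sketch for describe_image. Checks credentials in
# preference order: HF Inference first, then OpenAI, then Gemini.
# The env var names are illustrative assumptions.
import os

PROVIDERS = {
    "hf": "HF_TOKEN",            # Hugging Face Inference API (preferred)
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
}

def pick_provider(env=None):
    """Return the first provider whose credential is set, else None."""
    env = os.environ if env is None else env
    for name, key in PROVIDERS.items():
        if env.get(key):
            return name
    return None

print(pick_provider({"OPENAI_API_KEY": "sk-test"}))  # -> openai
```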
---

### 🧩 Fully MCP-Powered

Every capability is wrapped as an MCP tool, making this app a template for:

- Agents
- Assistive technologies
- Multimodal accessibility apps
- Voice-driven workflows
- Cross-backend tool orchestration
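The orchestration idea above boils down to a uniform tool registry: each capability is registered under a name and dispatched the same way, which is the shape an MCP server gives the three tools in this app. A minimal sketch with stub bodies:

```python
# Tool-registry sketch: register each capability by name and dispatch
# uniformly, mirroring how an MCP server exposes the app's three tools.
# The tool bodies are stubs standing in for the real backends.
TOOLS = {}

def tool(fn):
    """Register a function as a callable tool under its own name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def speak_text(text: str) -> dict:
    return {"audio_b64": "..."}   # real app: ElevenLabs TTS

@tool
def transcribe_audio(path: str) -> dict:
    return {"text": "..."}        # real app: Whisper / local STT

@tool
def describe_image(path: str) -> dict:
    return {"caption": "..."}     # real app: HF Inference captioning

def call_tool(name, **kwargs):
    """Dispatch a tool call by name, as an agent or MCP client would."""
    return TOOLS[name](**kwargs)

print(sorted(TOOLS))  # -> ['describe_image', 'speak_text', 'transcribe_audio']
```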
---

## 💡 Real Use Case: Accessibility

Designed for:

- Low-vision users
- Voice-interface users
- Anyone needing automated image descriptions
- Hands-free workflows
- Assistive technology research
---

## Tech Stack

- MCP Server (Python)
- Gradio 5
- OpenAI Whisper (STT)
- ElevenLabs (TTS)
- Gemini Vision (optional)
- Hugging Face Inference API (image captioning)
- Python
---

## How to Run Locally

```bash
pip install -r requirements.txt
python app.py
```
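The backends also need API credentials, typically supplied as environment variables before launch. The variable names below are assumptions for illustration; check `app.py` for the exact names the app reads:

```shell
# Assumed credential variables -- verify the exact names in app.py.
export ELEVENLABS_API_KEY="..."   # TTS
export OPENAI_API_KEY="..."       # Whisper STT / vision (optional)
export GEMINI_API_KEY="..."       # Gemini Vision (optional)
export HF_TOKEN="..."             # Hugging Face Inference API
python app.py
```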