Spaces:
Sleeping
Sleeping
| title: Multimodal AI Doctor – An Agentic AI Project | |
| emoji: 🩺 | |
| colorFrom: indigo | |
| colorTo: blue | |
| sdk: gradio | |
| sdk_version: 5.46.1 | |
| app_file: app.py | |
| pinned: false | |
| # Multimodal AI Doctor – An Agentic AI Project | |
| **Multimodal AI Doctor** is an **agentic multimodal assistant** built with **Gradio**, **Groq APIs**, and **ElevenLabs**. | |
| It combines **speech, vision, and reasoning** through a series of cooperating LLMs, simulating how a real doctor listens, observes, and responds concisely. | |
| The system integrates **voice input, image analysis, clinical reasoning, and voice output** into a single pipeline. | |
| --- | |
| ## Features | |
| * Record patient voice from microphone (Speech-to-Text using **Whisper Large v3** on Groq) | |
| * Upload an image (diagnosis/medical-related) for analysis (Vision-Language reasoning using **Llama 4 Scout** on Groq) | |
| * Generate a concise medical-style response (2 sentences maximum, human-like tone) | |
| * Convert response to voice (Text-to-Speech using **ElevenLabs** with WAV output, fallback to **gTTS** if needed) | |
| * Gradio-based interactive UI | |
| --- | |
| ## Project Structure | |
| ``` | |
| . | |
| ├── app.py # Gradio UI + main workflow | |
| ├── brain\_of\_the\_doctor.py # Image encoding + Groq multimodal analysis | |
| ├── voice\_of\_the\_patient.py # Audio recording + Groq Whisper transcription | |
| ├── voice\_of\_the\_doctor.py # ElevenLabs + gTTS text-to-speech | |
| ├── requirements.txt # Python dependencies | |
| ├── .env # Environment variables (API keys) | |
| ├── .gitignore # Ignore venv, **pycache**, .env, etc. | |
| ├── images/ # Folder for saving test/sample images | |
| └── README.md # Documentation | |
| ```` | |
| --- | |
| ## Agentic AI Workflow | |
| The system uses **multiple LLM agents** to process multimodal input step by step: | |
| 1. **Symptom Agent** – extracts structured meaning from patient speech (via Whisper transcription). | |
| 2. **Vision Agent** – analyzes uploaded medical images (X-ray, MRI, scan). | |
| 3. **Reasoning Agent** – integrates speech and image findings into a medical interpretation. | |
| 4. **Response Agent** – formats the answer in a concise, empathetic, doctor-style tone (≤ 2 sentences). | |
| 5. **Voice Agent** – delivers the response using ElevenLabs (WAV, fallback gTTS). | |
| This makes the project an **agentic AI pipeline**, where multiple specialized models cooperate to mimic a doctor’s diagnostic process. | |
| --- | |
| ## Requirements | |
| * Python 3.10 or higher | |
| * FFmpeg installed and available in PATH (required by pydub) | |
| * A Groq API key (obtain from [https://console.groq.com](https://console.groq.com)) | |
| * An ElevenLabs API key (obtain from [https://elevenlabs.io](https://elevenlabs.io)) | |
| --- | |
| ## Installation | |
| 1. Clone the repository: | |
| ```bash | |
| git clone https://github.com/your-username/ai-doctor-2.0-voice-and-vision.git | |
| cd ai-doctor-2.0-voice-and-vision | |
| ```` | |
| 2. Create and activate a virtual environment: | |
| ```bash | |
| python -m venv venv | |
| source venv/bin/activate # Linux/Mac | |
| venv\Scripts\activate # Windows | |
| ``` | |
| 3. Install dependencies: | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| 4. Install FFmpeg (if not already installed): | |
| * Windows: [Download builds](https://www.gyan.dev/ffmpeg/builds/) and add `bin/` to PATH | |
| * Linux (Debian/Ubuntu): `sudo apt install ffmpeg` | |
| * macOS (Homebrew): `brew install ffmpeg` | |
| 5. Create a `.env` file in the project root with your API keys: | |
| ``` | |
| GROQ_API_KEY=your_groq_api_key_here | |
| ELEVEN_API_KEY=your_elevenlabs_api_key_here | |
| ``` | |
| --- | |
| ## Running the Application | |
| Start the Gradio app: | |
| ```bash | |
| python app.py | |
| ``` | |
| The app will launch locally at: | |
| ``` | |
| http://127.0.0.1:7860 | |
| ``` | |
| --- | |
| ## Usage | |
| 1. Allow microphone access to record your voice. | |
| 2. Upload a medical image for analysis. | |
| 3. The system will: | |
| * Transcribe your voice (Whisper Large v3 via Groq) | |
| * Analyze the image + text (Llama 4 Scout via Groq) | |
| * Generate a concise medical-style response | |
| * Play back the response as voice (ElevenLabs or gTTS fallback) | |
| --- | |
| ## Models Used | |
| 1. **Whisper Large v3** (Groq) – Speech-to-Text | |
| * [Groq API Docs](https://console.groq.com/docs) | |
| 2. **Llama 4 Scout 17B (Mixture-of-Experts)** (Groq) – Vision-Language reasoning | |
| * [Groq API Docs](https://console.groq.com/docs) | |
| 3. **ElevenLabs `eleven_turbo_v2`** – Text-to-Speech (WAV, with MP3 fallback) | |
| * [ElevenLabs Docs](https://elevenlabs.io/docs) | |
| 4. **gTTS (Google Text-to-Speech)** – Backup Text-to-Speech | |
| * [PyPI gTTS](https://pypi.org/project/gTTS/) | |
| --- | |
| ## Notes | |
| * ElevenLabs free-tier accounts may not allow WAV output or certain custom voices. In that case, the code automatically falls back to MP3 output with a safe built-in voice. | |
| * Ensure FFmpeg is correctly installed; otherwise, audio export with pydub will fail. | |
| * Gradio will automatically handle playback of both WAV and MP3 outputs. | |
| --- | |
| ## Support | |
| For questions, issues, or collaboration, please contact: | |
| **Email:** [sayeem26s@gmail.com](mailto:sayeem26s@gmail.com) | |
| **LinkedIn:** [https://www.linkedin.com/in/s-m-shahriar-26s/](https://www.linkedin.com/in/s-m-shahriar-26s/) | |
| ``` | |
| --- |