---
title: Multimodal AI Doctor – An Agentic AI Project
emoji: 🩺
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.46.1
app_file: app.py
pinned: false
---
# Multimodal AI Doctor – An Agentic AI Project
**Multimodal AI Doctor** is an **agentic multimodal assistant** built with **Gradio**, **Groq APIs**, and **ElevenLabs**.
It combines **speech, vision, and reasoning** through a series of cooperating LLMs, simulating how a real doctor listens, observes, and responds concisely.
The system integrates **voice input, image analysis, clinical reasoning, and voice output** into a single pipeline.
---
## Features
* Record patient voice from microphone (Speech-to-Text using **Whisper Large v3** on Groq)
* Upload a medical image for analysis (Vision-Language reasoning using **Llama 4 Scout** on Groq)
* Generate a concise medical-style response (2 sentences maximum, human-like tone)
* Convert response to voice (Text-to-Speech using **ElevenLabs** with WAV output, fallback to **gTTS** if needed)
* Gradio-based interactive UI
---
## Project Structure
```
.
├── app.py # Gradio UI + main workflow
├── brain_of_the_doctor.py # Image encoding + Groq multimodal analysis
├── voice_of_the_patient.py # Audio recording + Groq Whisper transcription
├── voice_of_the_doctor.py # ElevenLabs + gTTS text-to-speech
├── requirements.txt # Python dependencies
├── .env # Environment variables (API keys)
├── .gitignore # Ignore venv, __pycache__, .env, etc.
├── images/ # Folder for saving test/sample images
└── README.md # Documentation
```
---
## Agentic AI Workflow
The system uses **multiple LLM agents** to process multimodal input step by step:
1. **Symptom Agent** – extracts structured meaning from patient speech (via Whisper transcription).
2. **Vision Agent** – analyzes uploaded medical images (X-ray, MRI, scan).
3. **Reasoning Agent** – integrates speech and image findings into a medical interpretation.
4. **Response Agent** – formats the answer in a concise, empathetic, doctor-style tone (≤ 2 sentences).
5. **Voice Agent** – delivers the response using ElevenLabs (WAV, fallback gTTS).
This makes the project an **agentic AI pipeline**, where multiple specialized models cooperate to mimic a doctor’s diagnostic process.
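The five steps above can be sketched as one orchestration function. All names here are illustrative placeholders, not the actual identifiers used in `app.py`; in the real project the agent functions live in `voice_of_the_patient.py`, `brain_of_the_doctor.py`, and `voice_of_the_doctor.py`:

```python
# Hypothetical sketch of the agent pipeline described above.
# Each agent is passed in as a callable so the chaining logic stays model-agnostic.

def run_pipeline(audio_path, image_path, transcribe, analyze_image, reason, speak):
    """Chain the agents: speech -> vision -> reasoning -> voice."""
    symptoms = transcribe(audio_path)       # 1. Symptom Agent (Whisper STT)
    findings = analyze_image(image_path)    # 2. Vision Agent (Llama 4 Scout)
    diagnosis = reason(symptoms, findings)  # 3+4. Reasoning/Response Agents
    return speak(diagnosis)                 # 5. Voice Agent (ElevenLabs/gTTS)
```

Passing the agents as arguments also makes each stage trivial to stub out in tests.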
---
## Requirements
* Python 3.10 or higher
* FFmpeg installed and available in PATH (required by pydub)
* A Groq API key (obtain from [https://console.groq.com](https://console.groq.com))
* An ElevenLabs API key (obtain from [https://elevenlabs.io](https://elevenlabs.io))
---
## Installation
1. Clone the repository:
```bash
git clone https://github.com/your-username/ai-doctor-2.0-voice-and-vision.git
cd ai-doctor-2.0-voice-and-vision
```
2. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Install FFmpeg (if not already installed):
* Windows: [Download builds](https://www.gyan.dev/ffmpeg/builds/) and add `bin/` to PATH
* Linux (Debian/Ubuntu): `sudo apt install ffmpeg`
* macOS (Homebrew): `brew install ffmpeg`
5. Create a `.env` file in the project root with your API keys:
```
GROQ_API_KEY=your_groq_api_key_here
ELEVEN_API_KEY=your_elevenlabs_api_key_here
```
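A minimal sketch of how the app might validate these keys at startup, using only the standard library (`require_key` is a hypothetical helper; the actual project presumably loads `.env` via a package such as python-dotenv):

```python
import os

def require_key(name: str) -> str:
    """Fetch a required API key from the environment, failing fast if it is absent."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value

# Example usage at startup:
# GROQ_API_KEY = require_key("GROQ_API_KEY")
# ELEVEN_API_KEY = require_key("ELEVEN_API_KEY")
```

Failing fast here gives a clear error message instead of an opaque HTTP 401 later in the pipeline.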
---
## Running the Application
Start the Gradio app:
```bash
python app.py
```
The app will launch locally at:
```
http://127.0.0.1:7860
```
---
## Usage
1. Allow microphone access to record your voice.
2. Upload a medical image for analysis.
3. The system will:
* Transcribe your voice (Whisper Large v3 via Groq)
* Analyze the image + text (Llama 4 Scout via Groq)
* Generate a concise medical-style response
* Play back the response as voice (ElevenLabs or gTTS fallback)
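For the image-analysis step, the client sends the transcribed query together with a base64-encoded image. A sketch of building such a payload, assuming Groq's OpenAI-compatible multimodal message schema (`build_vision_messages` is a hypothetical helper, not a function from this repo):

```python
import base64

def build_vision_messages(query: str, image_bytes: bytes) -> list:
    """Build a multimodal chat payload: the text query plus the image
    embedded as a base64 data URL, in the OpenAI-style content-block format."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": query},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{encoded}"}},
        ],
    }]
```

This list would then be passed as `messages` to a chat-completions call with the vision model.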
---
## Models Used
1. **Whisper Large v3** (Groq) – Speech-to-Text
* [Groq API Docs](https://console.groq.com/docs)
2. **Llama 4 Scout 17B (Mixture-of-Experts)** (Groq) – Vision-Language reasoning
* [Groq API Docs](https://console.groq.com/docs)
3. **ElevenLabs `eleven_turbo_v2`** – Text-to-Speech (WAV, with MP3 fallback)
* [ElevenLabs Docs](https://elevenlabs.io/docs)
4. **gTTS (Google Text-to-Speech)** – Backup Text-to-Speech
* [PyPI gTTS](https://pypi.org/project/gTTS/)
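The ElevenLabs-to-gTTS fallback can be expressed as a small wrapper. `synthesize_with_fallback` is an illustrative name; the two engines are passed in as callables so the fallback logic stays engine-agnostic:

```python
def synthesize_with_fallback(text, primary, fallback):
    """Try the primary TTS engine (e.g. ElevenLabs); on any failure,
    fall back to the secondary engine (e.g. gTTS). Both callables are
    assumed to return the path of the generated audio file."""
    try:
        return primary(text)
    except Exception:
        return fallback(text)
```

Catching a broad `Exception` is deliberate here: quota errors, network failures, and unsupported-format errors from the primary engine should all degrade gracefully to the backup voice.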
---
## Notes
* ElevenLabs free-tier accounts may not allow WAV output or certain custom voices. In that case, the code automatically falls back to MP3 output with a safe built-in voice.
* Ensure FFmpeg is correctly installed; otherwise, audio export with pydub will fail.
* Gradio will automatically handle playback of both WAV and MP3 outputs.
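Since a missing FFmpeg binary is a common failure mode with pydub, a quick preflight check can be sketched like this (standard library only; `ffmpeg_available` is a hypothetical helper):

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if the ffmpeg binary is discoverable on PATH (required by pydub)."""
    return shutil.which("ffmpeg") is not None

if __name__ == "__main__":
    print("FFmpeg found" if ffmpeg_available() else "FFmpeg missing -- install it first")
```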
---
## Support
For questions, issues, or collaboration, please contact:
**Email:** [sayeem26s@gmail.com](mailto:sayeem26s@gmail.com)
**LinkedIn:** [https://www.linkedin.com/in/s-m-shahriar-26s/](https://www.linkedin.com/in/s-m-shahriar-26s/)