---
title: Multimodal AI Doctor – An Agentic AI Project
emoji: 🩺
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.46.1
app_file: app.py
pinned: false
---
# Multimodal AI Doctor – An Agentic AI Project
Multimodal AI Doctor is an agentic multimodal assistant built with Gradio, Groq APIs, and ElevenLabs.
It combines speech, vision, and reasoning through a series of cooperating LLMs, simulating how a real doctor listens, observes, and responds concisely.
The system integrates voice input, image analysis, clinical reasoning, and voice output into a single pipeline.
## Features
- Record patient voice from microphone (Speech-to-Text using Whisper Large v3 on Groq)
- Upload an image (diagnosis/medical-related) for analysis (Vision-Language reasoning using Llama 4 Scout on Groq)
- Generate a concise medical-style response (2 sentences maximum, human-like tone)
- Convert response to voice (Text-to-Speech using ElevenLabs with WAV output, fallback to gTTS if needed)
- Gradio-based interactive UI
## Project Structure

```
.
├── app.py                   # Gradio UI + main workflow
├── brain_of_the_doctor.py   # Image encoding + Groq multimodal analysis
├── voice_of_the_patient.py  # Audio recording + Groq Whisper transcription
├── voice_of_the_doctor.py   # ElevenLabs + gTTS text-to-speech
├── requirements.txt         # Python dependencies
├── .env                     # Environment variables (API keys)
├── .gitignore               # Ignore venv, __pycache__, .env, etc.
├── images/                  # Folder for saving test/sample images
└── README.md                # Documentation
```
## Agentic AI Workflow
The system uses multiple LLM agents to process multimodal input step by step:
- Symptom Agent – extracts structured meaning from patient speech (via Whisper transcription).
- Vision Agent – analyzes uploaded medical images (X-ray, MRI, scan).
- Reasoning Agent – integrates speech and image findings into a medical interpretation.
- Response Agent – formats the answer in a concise, empathetic, doctor-style tone (≤ 2 sentences).
- Voice Agent – delivers the response using ElevenLabs (WAV, fallback gTTS).
This makes the project an agentic AI pipeline, where multiple specialized models cooperate to mimic a doctor’s diagnostic process.
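The cooperation between agents can be sketched as a simple sequential pipeline. The function below is illustrative only: each callable stands in for one agent, and none of the names match the actual code in `app.py`:

```python
def run_pipeline(audio_path, image_path,
                 transcribe, analyze_image, reason, respond, speak):
    """Chain the five agent steps; each callable argument stands in
    for one specialized model (illustrative sketch, not app.py's API)."""
    symptoms = transcribe(audio_path)               # Symptom Agent (Whisper)
    findings = analyze_image(image_path, symptoms)  # Vision Agent (Llama 4 Scout)
    interpretation = reason(symptoms, findings)     # Reasoning Agent
    reply = respond(interpretation)                 # Response Agent (<= 2 sentences)
    return reply, speak(reply)                      # Voice Agent (ElevenLabs/gTTS)
```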
## Requirements
- Python 3.10 or higher
- FFmpeg installed and available in PATH (required by pydub)
- A Groq API key (obtain from https://console.groq.com)
- An ElevenLabs API key (obtain from https://elevenlabs.io)
## Installation

1. Clone the repository:

```bash
git clone https://github.com/your-username/ai-doctor-2.0-voice-and-vision.git
cd ai-doctor-2.0-voice-and-vision
```
2. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Install FFmpeg (if not already installed):
* Windows: [Download builds](https://www.gyan.dev/ffmpeg/builds/) and add `bin/` to PATH
* Linux (Debian/Ubuntu): `sudo apt install ffmpeg`
* macOS (Homebrew): `brew install ffmpeg`
5. Create a `.env` file in the project root with your API keys:
```
GROQ_API_KEY=your_groq_api_key_here
ELEVEN_API_KEY=your_elevenlabs_api_key_here
```
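At startup the keys have to be read from the environment. A minimal sketch of how that is typically done, assuming `python-dotenv` is available (with a plain `os.environ` fallback if it is not; the helper name is illustrative):

```python
import os

def get_api_key(name: str) -> str:
    """Fetch an API key from the environment, failing early with a clear
    message rather than deep inside an API call (illustrative helper)."""
    try:
        from dotenv import load_dotenv  # optional: reads .env if installed
        load_dotenv()
    except ImportError:
        pass  # fall back to whatever is already set in the environment
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing {name}; add it to .env or the environment")
    return value
```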
---
## Running the Application
Start the Gradio app:
```bash
python app.py
```
The app will launch locally at:
```
http://127.0.0.1:7860
```
---
## Usage
1. Allow microphone access to record your voice.
2. Upload a medical image for analysis.
3. The system will:
* Transcribe your voice (Whisper Large v3 via Groq)
* Analyze the image + text (Llama 4 Scout via Groq)
* Generate a concise medical-style response
* Play back the response as voice (ElevenLabs or gTTS fallback)
---
## Models Used
1. **Whisper Large v3** (Groq) – Speech-to-Text
* [Groq API Docs](https://console.groq.com/docs)
2. **Llama 4 Scout 17B (Mixture-of-Experts)** (Groq) – Vision-Language reasoning
* [Groq API Docs](https://console.groq.com/docs)
3. **ElevenLabs `eleven_turbo_v2`** – Text-to-Speech (WAV, with MP3 fallback)
* [ElevenLabs Docs](https://elevenlabs.io/docs)
4. **gTTS (Google Text-to-Speech)** – Backup Text-to-Speech
* [PyPI gTTS](https://pypi.org/project/gTTS/)
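The ElevenLabs-to-gTTS fallback noted above is naturally expressed as a try/except around the primary engine. A sketch under the assumption that both engines are exposed as callables (placeholders here, not the actual functions in `voice_of_the_doctor.py`):

```python
def text_to_speech(text, elevenlabs_tts, gtts_tts):
    """Try the primary TTS engine first; on any failure (quota limits,
    unsupported output format, network errors) fall back to the backup
    engine. Both arguments are placeholder callables (illustrative)."""
    try:
        return elevenlabs_tts(text)  # preferred: ElevenLabs (WAV)
    except Exception:
        return gtts_tts(text)        # backup: gTTS (MP3)
```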
---
## Notes
* ElevenLabs free-tier accounts may not allow WAV output or certain custom voices. In that case, the code automatically falls back to MP3 output with a safe built-in voice.
* Ensure FFmpeg is correctly installed; otherwise, audio export with pydub will fail.
* Gradio will automatically handle playback of both WAV and MP3 outputs.
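A quick way to verify the FFmpeg requirement before launching the app (a small illustrative check, not part of the project code):

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if the ffmpeg executable is discoverable on PATH,
    which pydub needs for audio conversion and export."""
    return shutil.which("ffmpeg") is not None
```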
---
## Support
For questions, issues, or collaboration, please contact:
**Email:** [sayeem26s@gmail.com](mailto:sayeem26s@gmail.com)
**LinkedIn:** [https://www.linkedin.com/in/s-m-shahriar-26s/](https://www.linkedin.com/in/s-m-shahriar-26s/)