Sayeem26s's picture
Update README.md
25377ed verified
---
title: Multimodal AI Doctor An Agentic AI Project
emoji: 🩺
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.46.1
app_file: app.py
pinned: false
---
# Multimodal AI Doctor – An Agentic AI Project
**Multimodal AI Doctor** is an **agentic multimodal assistant** built with **Gradio**, **Groq APIs**, and **ElevenLabs**.
It combines **speech, vision, and reasoning** through a series of cooperating LLMs, simulating how a real doctor listens, observes, and responds concisely.
The system integrates **voice input, image analysis, clinical reasoning, and voice output** into a single pipeline.
---
## Features
* Record patient voice from microphone (Speech-to-Text using **Whisper Large v3** on Groq)
* Upload an image (diagnosis/medical-related) for analysis (Vision-Language reasoning using **Llama 4 Scout** on Groq)
* Generate a concise medical-style response (2 sentences maximum, human-like tone)
* Convert response to voice (Text-to-Speech using **ElevenLabs** with WAV output, fallback to **gTTS** if needed)
* Gradio-based interactive UI
---
## Project Structure
```
.
├── app.py # Gradio UI + main workflow
├── brain\_of\_the\_doctor.py # Image encoding + Groq multimodal analysis
├── voice\_of\_the\_patient.py # Audio recording + Groq Whisper transcription
├── voice\_of\_the\_doctor.py # ElevenLabs + gTTS text-to-speech
├── requirements.txt # Python dependencies
├── .env # Environment variables (API keys)
├── .gitignore # Ignore venv, **pycache**, .env, etc.
├── images/ # Folder for saving test/sample images
└── README.md # Documentation
````
---
## Agentic AI Workflow
The system uses **multiple LLM agents** to process multimodal input step by step:
1. **Symptom Agent** – extracts structured meaning from patient speech (via Whisper transcription).
2. **Vision Agent** – analyzes uploaded medical images (X-ray, MRI, scan).
3. **Reasoning Agent** – integrates speech and image findings into a medical interpretation.
4. **Response Agent** – formats the answer in a concise, empathetic, doctor-style tone (≤ 2 sentences).
5. **Voice Agent** – delivers the response using ElevenLabs (WAV, fallback gTTS).
This makes the project an **agentic AI pipeline**, where multiple specialized models cooperate to mimic a doctor’s diagnostic process.
---
## Requirements
* Python 3.10 or higher
* FFmpeg installed and available in PATH (required by pydub)
* A Groq API key (obtain from [https://console.groq.com](https://console.groq.com))
* An ElevenLabs API key (obtain from [https://elevenlabs.io](https://elevenlabs.io))
---
## Installation
1. Clone the repository:
```bash
git clone https://github.com/your-username/ai-doctor-2.0-voice-and-vision.git
cd ai-doctor-2.0-voice-and-vision
````
2. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Install FFmpeg (if not already installed):
* Windows: [Download builds](https://www.gyan.dev/ffmpeg/builds/) and add `bin/` to PATH
* Linux (Debian/Ubuntu): `sudo apt install ffmpeg`
* macOS (Homebrew): `brew install ffmpeg`
5. Create a `.env` file in the project root with your API keys:
```
GROQ_API_KEY=your_groq_api_key_here
ELEVEN_API_KEY=your_elevenlabs_api_key_here
```
---
## Running the Application
Start the Gradio app:
```bash
python app.py
```
The app will launch locally at:
```
http://127.0.0.1:7860
```
---
## Usage
1. Allow microphone access to record your voice.
2. Upload a medical image for analysis.
3. The system will:
* Transcribe your voice (Whisper Large v3 via Groq)
* Analyze the image + text (Llama 4 Scout via Groq)
* Generate a concise medical-style response
* Play back the response as voice (ElevenLabs or gTTS fallback)
---
## Models Used
1. **Whisper Large v3** (Groq) – Speech-to-Text
* [Groq API Docs](https://console.groq.com/docs)
2. **Llama 4 Scout 17B (Mixture-of-Experts)** (Groq) – Vision-Language reasoning
* [Groq API Docs](https://console.groq.com/docs)
3. **ElevenLabs `eleven_turbo_v2`** – Text-to-Speech (WAV, with MP3 fallback)
* [ElevenLabs Docs](https://elevenlabs.io/docs)
4. **gTTS (Google Text-to-Speech)** – Backup Text-to-Speech
* [PyPI gTTS](https://pypi.org/project/gTTS/)
---
## Notes
* ElevenLabs free-tier accounts may not allow WAV output or certain custom voices. In that case, the code automatically falls back to MP3 output with a safe built-in voice.
* Ensure FFmpeg is correctly installed; otherwise, audio export with pydub will fail.
* Gradio will automatically handle playback of both WAV and MP3 outputs.
---
## Support
For questions, issues, or collaboration, please contact:
**Email:** [sayeem26s@gmail.com](mailto:sayeem26s@gmail.com)
**LinkedIn:** [https://www.linkedin.com/in/s-m-shahriar-26s/](https://www.linkedin.com/in/s-m-shahriar-26s/)
```
---