---
title: Multimodal AI Doctor – An Agentic AI Project
emoji: 🩺
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.46.1
app_file: app.py
pinned: false
---

# Multimodal AI Doctor – An Agentic AI Project

**Multimodal AI Doctor** is an **agentic multimodal assistant** built with **Gradio**, **Groq APIs**, and **ElevenLabs**.  
It combines **speech, vision, and reasoning** through a series of cooperating LLMs, simulating how a real doctor listens, observes, and responds concisely.  
The system integrates **voice input, image analysis, clinical reasoning, and voice output** into a single pipeline.  

---

## Features

* Record patient voice from microphone (Speech-to-Text using **Whisper Large v3** on Groq)  
* Upload an image (diagnosis/medical-related) for analysis (Vision-Language reasoning using **Llama 4 Scout** on Groq)  
* Generate a concise medical-style response (2 sentences maximum, human-like tone)  
* Convert response to voice (Text-to-Speech using **ElevenLabs** with WAV output, fallback to **gTTS** if needed)  
* Gradio-based interactive UI  

---

## Project Structure

```
.
├── app.py                    # Gradio UI + main workflow
├── brain_of_the_doctor.py    # Image encoding + Groq multimodal analysis
├── voice_of_the_patient.py   # Audio recording + Groq Whisper transcription
├── voice_of_the_doctor.py    # ElevenLabs + gTTS text-to-speech
├── requirements.txt          # Python dependencies
├── .env                      # Environment variables (API keys)
├── .gitignore                # Ignore venv, __pycache__, .env, etc.
├── images/                   # Folder for saving test/sample images
└── README.md                 # Documentation
```

---

## Agentic AI Workflow

The system uses **multiple LLM agents** to process multimodal input step by step:

1. **Symptom Agent** – extracts structured meaning from patient speech (via Whisper transcription).  
2. **Vision Agent** – analyzes uploaded medical images (X-ray, MRI, scan).  
3. **Reasoning Agent** – integrates speech and image findings into a medical interpretation.  
4. **Response Agent** – formats the answer in a concise, empathetic, doctor-style tone (≤ 2 sentences).  
5. **Voice Agent** – delivers the response using ElevenLabs (WAV, fallback gTTS).  

This makes the project an **agentic AI pipeline**, where multiple specialized models cooperate to mimic a doctor’s diagnostic process.  
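The five-stage flow above can be sketched as a sequential pipeline. This is a minimal illustration only: each function is a hypothetical stand-in for the real Groq/ElevenLabs calls wired up in `app.py`, with stubbed return values.

```python
# Minimal sketch of the agentic pipeline. Each stage stubs out the real
# Groq / ElevenLabs call it represents; only the orchestration is shown.

def transcribe_speech(audio_path: str) -> str:
    """Symptom Agent: speech -> text (Whisper Large v3 on Groq)."""
    return "I have a rash on my arm"                # stubbed transcription

def analyze_image(image_path: str, query: str) -> str:
    """Vision Agent: image + patient query -> findings (Llama 4 Scout on Groq)."""
    return "The image shows mild skin irritation"   # stubbed findings

def compose_response(findings: str) -> str:
    """Reasoning + Response Agents: findings -> concise, <=2-sentence reply."""
    sentences = findings.split(". ")[:2]            # enforce the 2-sentence cap
    return ". ".join(sentences).rstrip(".") + "."

def speak(text: str) -> bytes:
    """Voice Agent: text -> audio (ElevenLabs, gTTS fallback)."""
    return text.encode()                            # stubbed audio payload

def doctor_pipeline(audio_path: str, image_path: str) -> tuple[str, bytes]:
    """Run the agents in order: listen, look, reason, respond, speak."""
    query = transcribe_speech(audio_path)
    findings = analyze_image(image_path, query)
    reply = compose_response(findings)
    return reply, speak(reply)

reply, audio = doctor_pipeline("patient.wav", "rash.jpg")
```

Each agent only sees the output of the previous one, which keeps the stages independently swappable (e.g. replacing the TTS backend without touching the reasoning step).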

---

## Requirements

* Python 3.10 or higher  
* FFmpeg installed and available in PATH (required by pydub)  
* A Groq API key (obtain from [https://console.groq.com](https://console.groq.com))  
* An ElevenLabs API key (obtain from [https://elevenlabs.io](https://elevenlabs.io))  

---

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/your-username/ai-doctor-2.0-voice-and-vision.git
   cd ai-doctor-2.0-voice-and-vision
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate      # Linux/Mac
   venv\Scripts\activate         # Windows
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Install FFmpeg (if not already installed):

   * Windows: [Download builds](https://www.gyan.dev/ffmpeg/builds/) and add `bin/` to PATH
   * Linux (Debian/Ubuntu): `sudo apt install ffmpeg`
   * macOS (Homebrew): `brew install ffmpeg`

5. Create a `.env` file in the project root with your API keys:

   ```
   GROQ_API_KEY=your_groq_api_key_here
   ELEVEN_API_KEY=your_elevenlabs_api_key_here
   ```
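Inside the Python modules, the keys loaded from `.env` (e.g. via python-dotenv's `load_dotenv()`) would typically be read with a small helper that fails fast when a key is missing. A sketch, using only the standard library:

```python
import os

def require_env(name: str) -> str:
    """Return the named environment variable, or raise a clear error if unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# After load_dotenv() has populated the environment:
# GROQ_API_KEY = require_env("GROQ_API_KEY")
# ELEVEN_API_KEY = require_env("ELEVEN_API_KEY")
```

Failing at startup with a named variable is easier to debug than an authentication error surfacing mid-request.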

---

## Running the Application

Start the Gradio app:

```bash
python app.py
```

The app will launch locally at:

```
http://127.0.0.1:7860
```

---

## Usage

1. Allow microphone access to record your voice.
2. Upload a medical image for analysis.
3. The system will:

   * Transcribe your voice (Whisper Large v3 via Groq)
   * Analyze the image + text (Llama 4 Scout via Groq)
   * Generate a concise medical-style response
   * Play back the response as voice (ElevenLabs or gTTS fallback)

---

## Models Used

1. **Whisper Large v3** (Groq) – Speech-to-Text

   * [Groq API Docs](https://console.groq.com/docs)

2. **Llama 4 Scout 17B (Mixture-of-Experts)** (Groq) – Vision-Language reasoning

   * [Groq API Docs](https://console.groq.com/docs)

3. **ElevenLabs `eleven_turbo_v2`** – Text-to-Speech (WAV, with MP3 fallback)

   * [ElevenLabs Docs](https://elevenlabs.io/docs)

4. **gTTS (Google Text-to-Speech)** – Backup Text-to-Speech

   * [PyPI gTTS](https://pypi.org/project/gTTS/)
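The vision model receives the uploaded image as a base64 data URL inside the chat request (the pattern used by Groq's OpenAI-compatible multimodal API; the exact message shape below is an assumption, not a quote from the project's code):

```python
import base64

def encode_image(image_path: str) -> str:
    """Read an image file and encode it as a base64 data URL."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{encoded}"

# The data URL then goes into the user message alongside the patient's query,
# roughly (hypothetical message shape):
# messages = [{"role": "user", "content": [
#     {"type": "text", "text": query},
#     {"type": "image_url", "image_url": {"url": encode_image("images/sample.jpg")}},
# ]}]
```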

---

## Notes

* ElevenLabs free-tier accounts may not allow WAV output or certain custom voices. In that case, the code automatically falls back to MP3 output with a safe built-in voice.
* Ensure FFmpeg is correctly installed; otherwise, audio export with pydub will fail.
* Gradio will automatically handle playback of both WAV and MP3 outputs.
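The fallback behaviour described above amounts to a simple try/except chain. A sketch, where `elevenlabs_tts` and `gtts_tts` are hypothetical stand-ins for the real SDK calls in `voice_of_the_doctor.py`:

```python
def text_to_speech(text: str, elevenlabs_tts, gtts_tts) -> tuple[str, bytes]:
    """Try ElevenLabs first (WAV); on any API failure fall back to gTTS (MP3).

    Both callables are hypothetical stand-ins for the real SDK calls.
    Returns (audio_format, audio_bytes).
    """
    try:
        return "wav", elevenlabs_tts(text)
    except Exception:
        # Typical failure modes: free-tier limits, missing API key,
        # or an output format the account tier does not allow.
        return "mp3", gtts_tts(text)
```

Catching broadly here is deliberate: any ElevenLabs failure should degrade to gTTS rather than crash the Gradio response.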

---

## Support

For questions, issues, or collaboration, please contact:

**Email:** [sayeem26s@gmail.com](mailto:sayeem26s@gmail.com)
**LinkedIn:** [https://www.linkedin.com/in/s-m-shahriar-26s/](https://www.linkedin.com/in/s-m-shahriar-26s/)
