---
title: Multimodal AI Doctor – An Agentic AI Project
emoji: 🩺
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.46.1
app_file: app.py
pinned: false
---

# Multimodal AI Doctor – An Agentic AI Project

Multimodal AI Doctor is an agentic multimodal assistant built with Gradio, Groq APIs, and ElevenLabs.
It combines speech, vision, and reasoning through a series of cooperating LLMs, simulating how a real doctor listens, observes, and responds concisely.
The system integrates voice input, image analysis, clinical reasoning, and voice output into a single pipeline.


## Features

* Record patient voice from the microphone (speech-to-text using Whisper Large v3 on Groq)
* Upload a medical image for analysis (vision-language reasoning using Llama 4 Scout on Groq)
* Generate a concise medical-style response (2 sentences maximum, human-like tone)
* Convert the response to voice (text-to-speech using ElevenLabs with WAV output, fallback to gTTS if needed)
* Gradio-based interactive UI

## Project Structure


```
.
├── app.py                    # Gradio UI + main workflow
├── brain_of_the_doctor.py    # Image encoding + Groq multimodal analysis
├── voice_of_the_patient.py   # Audio recording + Groq Whisper transcription
├── voice_of_the_doctor.py    # ElevenLabs + gTTS text-to-speech
├── requirements.txt          # Python dependencies
├── .env                      # Environment variables (API keys)
├── .gitignore                # Ignore venv, __pycache__, .env, etc.
├── images/                   # Folder for saving test/sample images
└── README.md                 # Documentation
```
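
As a sketch of the image-handling step in `brain_of_the_doctor.py` (the function name here is illustrative, not necessarily the project's actual API), the uploaded image is read and base64-encoded before being sent to Groq's multimodal endpoint:

```python
import base64


def encode_image(image_path: str) -> str:
    """Read an image file and return it as a base64 string, the format
    multimodal chat APIs such as Groq's expect for inline image inputs."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```

The resulting string is embedded in the chat request alongside the patient's transcribed question.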

## Agentic AI Workflow

The system uses multiple LLM agents to process multimodal input step by step:

  1. Symptom Agent – extracts structured meaning from patient speech (via Whisper transcription).
  2. Vision Agent – analyzes uploaded medical images (X-ray, MRI, scan).
  3. Reasoning Agent – integrates speech and image findings into a medical interpretation.
  4. Response Agent – formats the answer in a concise, empathetic, doctor-style tone (≤ 2 sentences).
  5. Voice Agent – delivers the response using ElevenLabs (WAV, fallback gTTS).

This makes the project an agentic AI pipeline, where multiple specialized models cooperate to mimic a doctor’s diagnostic process.
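
The five stages above can be sketched as a chain of plain Python callables (a minimal illustration of the orchestration pattern; the real agents call Groq and ElevenLabs, and these function bodies are placeholders):

```python
def symptom_agent(transcript: str) -> str:
    # Real app: Whisper Large v3 transcription via Groq.
    return transcript.strip().lower()


def vision_agent(image_description: str) -> str:
    # Real app: Llama 4 Scout multimodal analysis via Groq.
    return f"image findings: {image_description}"


def reasoning_agent(symptoms: str, findings: str) -> str:
    # Real app: an LLM integrates both modalities into one interpretation.
    return f"Given '{symptoms}' and {findings}, a likely interpretation follows."


def response_agent(interpretation: str, max_sentences: int = 2) -> str:
    # Enforce the concise, <= 2-sentence doctor-style answer.
    sentences = [s.strip() for s in interpretation.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."


def run_pipeline(transcript: str, image_description: str) -> str:
    symptoms = symptom_agent(transcript)
    findings = vision_agent(image_description)
    interpretation = reasoning_agent(symptoms, findings)
    return response_agent(interpretation)  # the Voice Agent then speaks this
```

Each stage only consumes the previous stage's output, which keeps the agents independently swappable.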


## Requirements

Python dependencies are listed in `requirements.txt`. FFmpeg is also required for audio processing (installation steps below).

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/your-username/ai-doctor-2.0-voice-and-vision.git
   cd ai-doctor-2.0-voice-and-vision
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate      # Linux/Mac
   venv\Scripts\activate         # Windows
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Install FFmpeg (if not already installed):

   * Windows: [Download builds](https://www.gyan.dev/ffmpeg/builds/) and add `bin/` to PATH
   * Linux (Debian/Ubuntu): `sudo apt install ffmpeg`
   * macOS (Homebrew): `brew install ffmpeg`

5. Create a `.env` file in the project root with your API keys:

   ```
   GROQ_API_KEY=your_groq_api_key_here
   ELEVEN_API_KEY=your_elevenlabs_api_key_here
   ```
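
   These variables are read at startup. As a minimal stdlib sketch of such a loader (the real app may use the `python-dotenv` package instead, which behaves similarly):

```python
import os


def load_env_file(path: str = ".env") -> None:
    """Minimal .env reader: KEY=value lines, '#' comments and blanks skipped.
    Existing environment variables are left untouched, mirroring
    python-dotenv's default behaviour."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```

The keys are then available via `os.getenv("GROQ_API_KEY")` and `os.getenv("ELEVEN_API_KEY")`.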

---

## Running the Application

Start the Gradio app:

```bash
python app.py
```

The app will launch locally at:

```
http://127.0.0.1:7860
```

---

## Usage

1. Allow microphone access to record your voice.
2. Upload a medical image for analysis.
3. The system will:

   * Transcribe your voice (Whisper Large v3 via Groq)
   * Analyze the image + text (Llama 4 Scout via Groq)
   * Generate a concise medical-style response
   * Play back the response as voice (ElevenLabs or gTTS fallback)

---

## Models Used

1. **Whisper Large v3** (Groq) – Speech-to-Text

   * [Groq API Docs](https://console.groq.com/docs)

2. **Llama 4 Scout 17B (Mixture-of-Experts)** (Groq) – Vision-Language reasoning

   * [Groq API Docs](https://console.groq.com/docs)

3. **ElevenLabs `eleven_turbo_v2`** – Text-to-Speech (WAV, with MP3 fallback)

   * [ElevenLabs Docs](https://elevenlabs.io/docs)

4. **gTTS (Google Text-to-Speech)** – Backup Text-to-Speech

   * [PyPI gTTS](https://pypi.org/project/gTTS/)
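
As an illustration of how the Whisper model above is typically called through the Groq Python SDK (an OpenAI-style client; treat the exact call shape as an assumption based on the Groq docs), `voice_of_the_patient.py` would do something like:

```python
def transcribe(client, audio_path: str, model: str = "whisper-large-v3") -> str:
    """Send a recorded audio file to Groq's transcription endpoint and
    return the transcript. `client` is expected to be a groq.Groq()
    instance (injected here so the function is easy to test)."""
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=(audio_path, f.read()),
            model=model,
        )
    return result.text
```

Passing the client in as a parameter keeps the transcription logic decoupled from API-key setup.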

---

## Notes

* ElevenLabs free-tier accounts may not allow WAV output or certain custom voices. In that case, the code automatically falls back to MP3 output with a safe built-in voice.
* Ensure FFmpeg is correctly installed; otherwise, audio export with pydub will fail.
* Gradio will automatically handle playback of both WAV and MP3 outputs.
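
The fallback behaviour described in the first note can be captured by a small wrapper (function and parameter names here are illustrative, not the project's actual API):

```python
def text_to_speech(text: str, primary, fallback) -> str:
    """Try the primary TTS engine (e.g. ElevenLabs WAV output) and fall
    back to the secondary one (e.g. gTTS MP3) if the first raises.
    Both engines are callables taking text and returning an audio file path."""
    try:
        return primary(text)
    except Exception:
        return fallback(text)
```

Because Gradio plays both WAV and MP3, the caller does not need to know which engine produced the file.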

---

## Support

For questions, issues, or collaboration, please contact:

**Email:** [sayeem26s@gmail.com](mailto:sayeem26s@gmail.com)
**LinkedIn:** [https://www.linkedin.com/in/s-m-shahriar-26s/](https://www.linkedin.com/in/s-m-shahriar-26s/)
