Spaces:

Sayeem26s
/

Multimodal-AI-Doctor

Sleeping

App Files Files Community

Multimodal-AI-Doctor / README.md

Sayeem26s

Update README.md

25377ed verified 5 months ago

preview code

raw

history blame contribute delete

5.23 kB

	---
	title: Multimodal AI Doctor – An Agentic AI Project
	emoji: 🩺
	colorFrom: indigo
	colorTo: blue
	sdk: gradio
	sdk_version: 5.46.1
	app_file: app.py
	pinned: false
	---

	# Multimodal AI Doctor – An Agentic AI Project

	Multimodal AI Doctor is an agentic multimodal assistant built with Gradio, Groq APIs, and ElevenLabs.
	It combines speech, vision, and reasoning through a series of cooperating LLMs, simulating how a real doctor listens, observes, and responds concisely.
	The system integrates voice input, image analysis, clinical reasoning, and voice output into a single pipeline.

	---

	## Features

	* Record patient voice from microphone (Speech-to-Text using Whisper Large v3 on Groq)
	* Upload an image (diagnosis/medical-related) for analysis (Vision-Language reasoning using Llama 4 Scout on Groq)
	* Generate a concise medical-style response (2 sentences maximum, human-like tone)
	* Convert response to voice (Text-to-Speech using ElevenLabs with WAV output, fallback to gTTS if needed)
	* Gradio-based interactive UI

	---

	## Project Structure

	```

	.
	├── app.py # Gradio UI + main workflow
	├── brain\_of\_the\_doctor.py # Image encoding + Groq multimodal analysis
	├── voice\_of\_the\_patient.py # Audio recording + Groq Whisper transcription
	├── voice\_of\_the\_doctor.py # ElevenLabs + gTTS text-to-speech
	├── requirements.txt # Python dependencies
	├── .env # Environment variables (API keys)
	├── .gitignore # Ignore venv, pycache, .env, etc.
	├── images/ # Folder for saving test/sample images
	└── README.md # Documentation

	````

	---

	## Agentic AI Workflow

	The system uses multiple LLM agents to process multimodal input step by step:

	1. Symptom Agent – extracts structured meaning from patient speech (via Whisper transcription).
	2. Vision Agent – analyzes uploaded medical images (X-ray, MRI, scan).
	3. Reasoning Agent – integrates speech and image findings into a medical interpretation.
	4. Response Agent – formats the answer in a concise, empathetic, doctor-style tone (≤ 2 sentences).
	5. Voice Agent – delivers the response using ElevenLabs (WAV, fallback gTTS).

	This makes the project an agentic AI pipeline, where multiple specialized models cooperate to mimic a doctor’s diagnostic process.

	---

	## Requirements

	* Python 3.10 or higher
	* FFmpeg installed and available in PATH (required by pydub)
	* A Groq API key (obtain from [https://console.groq.com](https://console.groq.com))
	* An ElevenLabs API key (obtain from [https://elevenlabs.io](https://elevenlabs.io))

	---

	## Installation

	1. Clone the repository:

	```bash
	git clone https://github.com/your-username/ai-doctor-2.0-voice-and-vision.git
	cd ai-doctor-2.0-voice-and-vision
	````

	2. Create and activate a virtual environment:

	```bash
	python -m venv venv
	source venv/bin/activate # Linux/Mac
	venv\Scripts\activate # Windows
	```

	3. Install dependencies:

	```bash
	pip install -r requirements.txt
	```

	4. Install FFmpeg (if not already installed):

	* Windows: [Download builds](https://www.gyan.dev/ffmpeg/builds/) and add `bin/` to PATH
	* Linux (Debian/Ubuntu): `sudo apt install ffmpeg`
	* macOS (Homebrew): `brew install ffmpeg`

	5. Create a `.env` file in the project root with your API keys:

	```
	GROQ_API_KEY=your_groq_api_key_here
	ELEVEN_API_KEY=your_elevenlabs_api_key_here
	```

	---

	## Running the Application

	Start the Gradio app:

	```bash
	python app.py
	```

	The app will launch locally at:

	```
	http://127.0.0.1:7860
	```

	---

	## Usage

	1. Allow microphone access to record your voice.
	2. Upload a medical image for analysis.
	3. The system will:

	* Transcribe your voice (Whisper Large v3 via Groq)
	* Analyze the image + text (Llama 4 Scout via Groq)
	* Generate a concise medical-style response
	* Play back the response as voice (ElevenLabs or gTTS fallback)

	---

	## Models Used

	1. Whisper Large v3 (Groq) – Speech-to-Text

	* [Groq API Docs](https://console.groq.com/docs)

	2. Llama 4 Scout 17B (Mixture-of-Experts) (Groq) – Vision-Language reasoning

	* [Groq API Docs](https://console.groq.com/docs)

	3. ElevenLabs `eleven_turbo_v2` – Text-to-Speech (WAV, with MP3 fallback)

	* [ElevenLabs Docs](https://elevenlabs.io/docs)

	4. gTTS (Google Text-to-Speech) – Backup Text-to-Speech

	* [PyPI gTTS](https://pypi.org/project/gTTS/)

	---

	## Notes

	* ElevenLabs free-tier accounts may not allow WAV output or certain custom voices. In that case, the code automatically falls back to MP3 output with a safe built-in voice.
	* Ensure FFmpeg is correctly installed; otherwise, audio export with pydub will fail.
	* Gradio will automatically handle playback of both WAV and MP3 outputs.

	---

	## Support

	For questions, issues, or collaboration, please contact:

	Email: [sayeem26s@gmail.com](mailto:sayeem26s@gmail.com)
	LinkedIn: [https://www.linkedin.com/in/s-m-shahriar-26s/](https://www.linkedin.com/in/s-m-shahriar-26s/)

	```

	---