🧠 Multi-Modal AI Chatbot

A multi-modal conversational assistant that can:

💬 Chat naturally using an LLM.

🖼️ Analyze uploaded images (with summarization via Groq API).

🎵 Analyze audio files and return transcriptions/predictions.

🔍 Perform local semantic image search using pre-computed embeddings.

🎯 Handle intent classification via rule-based matching with intents.json.

Built with Gradio, Sentence Transformers, and Groq API, this project combines text, image, and audio workflows into a unified chat interface.

📂 Project Structure
.
├── .env                # Environment variables (must contain GROQ_API_KEY)
├── .gitattributes
├── .gitignore
├── app.py              # Main Gradio application
├── image.json          # Metadata for local image descriptions
├── intents.json        # Rule-based intent classifier definitions
├── requirements.txt    # Python dependencies
├── README.md           # Project documentation (this file)
└── images/             # (Optional) Local image directory

⚙️ Setup
1. Clone the Repository
git clone https://github.com/<your-username>/<your-repo>.git
cd <your-repo>

2. Create Virtual Environment
python -m venv .venv
source .venv/bin/activate   # Linux/Mac
.venv\Scripts\activate      # Windows

3. Install Dependencies
pip install -r requirements.txt

4. Environment Variables

Create a .env file in the root directory:

GROQ_API_KEY=your_api_key_here


You can obtain an API key from Groq Cloud.
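As a minimal sketch of how the key might be read at startup (the function names `get_groq_api_key` and `load_key_from_dotenv` are hypothetical and not taken from app.py), the snippet below validates that GROQ_API_KEY is present and fails with a clear message otherwise. It assumes python-dotenv's `load_dotenv()` is used to read the .env file, which matches the dependency list but is not confirmed by the source.

```python
import os


def get_groq_api_key(env=None):
    """Return GROQ_API_KEY from the given mapping (default: os.environ).

    Raises RuntimeError with a readable message if the key is missing or blank.
    """
    env = os.environ if env is None else env
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file")
    return key


def load_key_from_dotenv():
    """Hypothetical startup helper: load .env, then read the key."""
    # Assumption: python-dotenv is installed (it is listed under Requirements).
    from dotenv import load_dotenv

    load_dotenv()  # copies variables from .env into os.environ
    return get_groq_api_key()
```

Checking the key once at startup surfaces a missing .env immediately rather than on the first API call.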

🚀 Run the App
python app.py


By default, Gradio will launch at:

http://127.0.0.1:7860

🛠 Features

Chat Mode

Uses a hosted LLM via gradio_client.

Falls back to the rule-based intent classifier for special queries.

Image Analysis

Upload .png, .jpg, .jpeg, .webp images.

Analyzed by a vision model, then summarized using the Groq API.

Audio Analysis

Upload .wav, .mp3, .flac audio.

Returns a readable summary of the transcription or prediction.
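The supported extensions listed for image and audio uploads suggest a simple routing step. The sketch below is one way to do it; `classify_upload` and the two constant names are hypothetical and not taken from app.py.

```python
from pathlib import Path

# Extensions taken from the Image Analysis and Audio Analysis feature lists.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}
AUDIO_EXTS = {".wav", ".mp3", ".flac"}


def classify_upload(filename):
    """Return "image", "audio", or "unsupported" based on the file extension."""
    ext = Path(filename).suffix.lower()  # case-insensitive comparison
    if ext in IMAGE_EXTS:
        return "image"
    if ext in AUDIO_EXTS:
        return "audio"
    return "unsupported"
```

For example, `classify_upload("engine_part.jpg")` would route to the image workflow and `classify_upload("sample.wav")` to the audio workflow.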

Local Image Search

Loads image metadata from image.json.

Embeddings are computed with the all-MiniLM-L6-v2 Sentence Transformers model.

Finds the best semantic match in the images/ folder.
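To illustrate the matching step, here is a minimal sketch of ranking pre-computed embeddings by cosine similarity. The three-dimensional vectors and file names are toy stand-ins chosen for readability; in the app the vectors would presumably come from encoding the image.json descriptions and the user query with all-MiniLM-L6-v2. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_vec, entries):
    """Return (filename, score) for the entry most similar to the query.

    entries is a list of (filename, embedding) pairs.
    """
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in entries]
    return max(scored, key=lambda pair: pair[1])


# Toy embeddings standing in for real sentence embeddings (assumption).
ENTRIES = [
    ("blueprint.png", [0.9, 0.1, 0.0]),
    ("engine_part.jpg", [0.1, 0.9, 0.2]),
]
```

With these toy values, a query vector such as `[0.8, 0.2, 0.0]` points mostly along the first axis, so `best_match` selects `blueprint.png`. A score threshold could be added so that weak matches fall through to general chat instead of returning an unrelated image.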

Intent Classification

Rule-based, defined in intents.json.

Supports custom triggers like "search_local_image", "request_audio_analysis", etc.
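A rule-based classifier of this kind can be sketched as a substring match over trigger phrases. The JSON layout below (an "intents" list with "tag" and "patterns" keys) and the example phrases are assumptions for illustration; the actual schema of intents.json is not shown in this README. Only the tag names come from the list above.

```python
import json

# Hypothetical intents.json content; the real file's schema may differ.
SAMPLE_INTENTS_JSON = """
{
  "intents": [
    {"tag": "search_local_image", "patterns": ["find me", "show me the image"]},
    {"tag": "request_audio_analysis", "patterns": ["analyze this audio", "transcribe"]}
  ]
}
"""

INTENTS = json.loads(SAMPLE_INTENTS_JSON)["intents"]


def classify_intent(message, intents=INTENTS):
    """Return the tag of the first intent whose pattern appears in the message.

    Matching is a case-insensitive substring test; returns None when nothing
    matches, so the caller can fall back to the LLM chat path.
    """
    text = message.lower()
    for intent in intents:
        if any(pattern.lower() in text for pattern in intent["patterns"]):
            return intent["tag"]
    return None
```

Here `classify_intent("Find me the blueprint diagram.")` yields `"search_local_image"`, while a message with no trigger phrase yields `None`.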

📌 Example Usage

General Chat:
User: “Tell me something interesting.”
Bot: Generates a response via the hosted LLM client.

Search Local Image:
User: “Find me the blueprint diagram.”
Bot: Returns matching local image + description.

Image Analysis:
User uploads engine_part.jpg → Bot analyzes and summarizes.

Audio Analysis:
User uploads sample.wav → Bot outputs recognized text/prediction.

📦 Requirements

Key dependencies:

gradio

gradio_client

sentence-transformers

numpy

requests

python-dotenv

Full list in requirements.txt.

🔮 Future Improvements

Add vector database for scalable image/document retrieval.

Enhance intent detection with a hybrid approach (rules + semantic matching).

Extend multimodal support (video, PDFs).

Dockerize deployment for cloud environments.

👨‍💻 Author

Built with ❤️ by Anvit