---
title: VQA Backend
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
<div align="center">

# GenVQA — Generative Visual Question Answering

**A neuro-symbolic VQA system that detects objects with a neural model, retrieves structured facts from Wikidata, and generates grounded answers with Groq.**

[![Backend CI](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/backend-ci.yml/badge.svg)](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/backend-ci.yml)
[![UI CI](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/ui-ci.yml/badge.svg)](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/ui-ci.yml)
![Python](https://img.shields.io/badge/Python-3.10%2B-blue?logo=python)
![License](https://img.shields.io/badge/License-MIT-green)

</div>
---
## Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                         CLIENT LAYER                         │
│               📱 Expo Mobile App (React Native)              │
│   • Image upload + question input                            │
│   • Displays answer + accessibility description              │
└──────────────────────────────┬───────────────────────────────┘
                               │ HTTP POST /api/answer
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                   BACKEND LAYER (FastAPI)                    │
│                        backend_api.py                        │
│   • Request handling, session management                     │
│   • Conversation Manager → multi-turn context tracking       │
└──────────────────────────────┬───────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│             ROUTING LAYER (ensemble_vqa_app.py)              │
│                                                              │
│   CLIP encodes question → compares against:                  │
│   "reasoning question" vs "visual/perceptual question"       │
│                                                              │
│       Reasoning?                     Visual?                 │
│            │                            │                    │
│            ▼                            ▼                    │
│   ┌─────────────────┐        ┌─────────────────────┐         │
│   │ NEURO-SYMBOLIC  │        │   NEURAL VQA PATH   │         │
│   │                 │        │                     │         │
│   │ 1. VQA model    │        │ VQA model (GRU +    │         │
│   │    detects obj  │        │ Attention) predicts │         │
│   │                 │        │ answer directly     │         │
│   │ 2. Wikidata API │        └──────────┬──────────┘         │
│   │    fetches facts│                   │                    │
│   │    (P31, P2101, │                   │                    │
│   │    P2054, P186, │                   │                    │
│   │    P366 ...)    │                   │                    │
│   │                 │                   │                    │
│   │ 3. Groq LLM     │                   │                    │
│   │    verbalizes   │                   │                    │
│   │    from facts   │                   │                    │
│   └────────┬────────┘                   │                    │
│            └─────────────┬──────────────┘                    │
└──────────────────────────┼───────────────────────────────────┘
                           │
                           ▼
                  ┌─────────────────┐
                  │  GROQ SERVICE   │
                  │  Accessibility  │
                  │  description    │
                  │  (2 sentences,  │
                  │  screen-reader  │
                  │  friendly)      │
                  └────────┬────────┘
                           │
                           ▼
                     JSON response
                     { answer, model_used,
                       kg_enhancement,
                       wikidata_entity,
                       description }
```
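The verbalization step in the diagram hands the detected entity and its Wikidata facts to the LLM and asks it to answer only from those facts. A minimal sketch of how such a prompt might be assembled; the exact prompt wording and the `build_messages` helper are illustrative, not the code in `semantic_neurosymbolic_vqa.py`:

```python
# Sketch of the fact-grounded prompt the Groq verbalization step might send.
# The system/user wording here is an assumption for illustration.
def build_messages(question: str, entity: str, facts: dict) -> list:
    """Build a chat payload that constrains the LLM to the given facts."""
    fact_lines = "\n".join(f"- {k}: {v}" for k, v in facts.items())
    return [
        {"role": "system",
         "content": "Answer the question using only the structured facts provided."},
        {"role": "user",
         "content": f"Question: {question}\nEntity: {entity}\nFacts:\n{fact_lines}"},
    ]

messages = build_messages("Can this melt?", "ice (Q86)",
                          {"melting point": "0.0 °C"})
# With the groq SDK this payload would go to something like:
# client.chat.completions.create(model="llama-3.3-70b-versatile", messages=messages)
```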
| Layer | Component | Role |
|---|---|---|
| **Client** | Expo React Native | Image upload, question input, answer display |
| **API** | FastAPI (`backend_api.py`) | Routing, sessions, conversation state |
| **Conversation** | `conversation_manager.py` | Multi-turn context, history tracking |
| **Router** | CLIP (in `ensemble_vqa_app.py`) | Classifies question as reasoning vs visual |
| **Neural VQA** | GRU + Attention (`model.py`) | Answers visual questions directly from image |
| **Neuro-Symbolic** | `semantic_neurosymbolic_vqa.py` | VQA detects objects → Wikidata fetches facts → Groq verbalizes |
| **Accessibility** | `groq_service.py` | Generates spoken-friendly 2-sentence description for every answer |
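The router's decision reduces to nearest-anchor classification over text embeddings. The toy sketch below substitutes a bag-of-letters embedding for CLIP's text encoder so it runs standalone; the real router in `ensemble_vqa_app.py` compares CLIP embeddings, and the anchor phrasings here are assumptions:

```python
import numpy as np

# Stand-in for CLIP's text encoder: a normalized letter-count vector keeps
# the sketch self-contained. Swap in CLIP embeddings for the real thing.
def embed(text: str) -> np.ndarray:
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# One anchor prompt per path (wording is illustrative).
ANCHORS = {
    "neuro-symbolic": embed("reasoning question about properties and facts"),
    "neural": embed("visual perceptual question about the image"),
}

def route(question: str) -> str:
    """Send the question down the path with the most similar anchor."""
    q = embed(question)
    return max(ANCHORS, key=lambda name: float(q @ ANCHORS[name]))
```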
---
## Features
- 🔍 **Visual Question Answering** — trained on VQAv2, fine-tuned on custom data
- 🧠 **Neuro-Symbolic Routing** — CLIP semantically classifies questions as _reasoning_ vs _visual_ and routes them accordingly
- 🌐 **Live Wikidata Facts** — queries physical properties, categories, materials, and uses in real time
- 🤖 **Groq Verbalization** — Llama 3.3 70B grounds answers in structured facts rather than hallucinating
- 💬 **Conversational Support** — multi-turn conversation manager with context tracking
- 📱 **Expo Mobile UI** — React Native app for iOS/Android/Web
- ♿ **Accessibility** — Groq generates spoken-friendly descriptions for every answer
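The "live facts" step amounts to reading claim values out of Wikidata's JSON. A sketch of that parsing, using a hand-written mock of the `wbgetclaims` payload shape; the property-label map and mock values are for illustration only:

```python
# Labels for the property IDs the pipeline queries (assumed subset).
PROPERTY_LABELS = {
    "P31": "instance of",
    "P2101": "melting point",
    "P2054": "density",
    "P186": "made from material",
    "P366": "has use",
}

def extract_facts(claims: dict) -> dict:
    """Pull plain values out of a Wikidata wbgetclaims-style payload."""
    facts = {}
    for pid, label in PROPERTY_LABELS.items():
        for claim in claims.get(pid, []):
            value = claim["mainsnak"]["datavalue"]["value"]
            # Quantity values arrive as {"amount": "+0.0", "unit": ...}
            if isinstance(value, dict) and "amount" in value:
                value = float(value["amount"])
            facts[label] = value
    return facts

# Hand-written mock of the payload shape for ice (Q86), illustration only.
mock_claims = {
    "P2101": [{"mainsnak": {"datavalue": {"value":
        {"amount": "+0.0", "unit": "degree Celsius (simplified)"}}}}],
}
```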
---
## Quick Start
### 1 — Backend
```bash
# Clone and install
git clone https://github.com/DevaRajan8/Generative-vqa.git
cd Generative-vqa
pip install -r requirements_api.txt
# Set your Groq API key
cp .env.example .env
# Edit .env → GROQ_API_KEY=your_key_here
# Start API
python backend_api.py
# → http://localhost:8000
```
### 2 — Mobile UI
```bash
cd ui
npm install
npx expo start --clear
```
> Scan the QR code with Expo Go, or press `w` for browser.
---
## API
| Endpoint | Method | Description |
|---|---|---|
| `/api/answer` | POST | Answer a question about an uploaded image |
| `/api/health` | GET | Health check |
| `/api/conversation/new` | POST | Start a new conversation session |
**Example:**
```bash
curl -X POST http://localhost:8000/api/answer \
  -F "image=@photo.jpg" \
  -F "question=Can this melt?"
```
**Response:**
```json
{
  "answer": "ice",
  "model_used": "neuro-symbolic",
  "kg_enhancement": "Yes — ice can melt. [Wikidata P2101: melting point = 0.0 °C]",
  "knowledge_source": "VQA (neural) + Wikidata (symbolic) + Groq (verbalize)",
  "wikidata_entity": "Q86"
}
```
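The same call from Python, sketched with `requests`; the in-memory image bytes are placeholders, and the sketch only builds the multipart request so it works without a running server (send it through a `Session` once the backend is up):

```python
import io
import requests

BASE_URL = "http://localhost:8000"  # matches the Quick Start default port

def ask(image_bytes: bytes, question: str) -> requests.PreparedRequest:
    """Build the multipart POST for /api/answer without sending it."""
    req = requests.Request(
        "POST",
        f"{BASE_URL}/api/answer",
        files={"image": ("photo.jpg", io.BytesIO(image_bytes), "image/jpeg")},
        data={"question": question},
    )
    return req.prepare()

prepared = ask(b"fake-image-bytes", "Can this melt?")
# With the backend running: requests.Session().send(prepared).json()
```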
---
## Project Structure
```
├── backend_api.py                  # FastAPI server
├── ensemble_vqa_app.py             # VQA orchestrator (routing + inference)
├── semantic_neurosymbolic_vqa.py   # Wikidata KB + Groq verbalizer
├── groq_service.py                 # Groq accessibility descriptions
├── conversation_manager.py         # Multi-turn conversation tracking
├── model.py                        # VQA model definition
├── train.py                        # Training pipeline
├── ui/                             # Expo React Native app
│   └── src/screens/HomeScreen.js
└── .github/
    ├── workflows/                  # CI — backend lint + UI build
    └── ISSUE_TEMPLATE/
```
---
## Environment Variables
| Variable | Required | Description |
|---|---|---|
| `GROQ_API_KEY` | ✅ | Groq API key — [get one free](https://console.groq.com) |
| `MODEL_PATH` | optional | Path to VQA checkpoint (default: `vqa_checkpoint.pt`) |
| `PORT` | optional | API server port (default: `8000`) |
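A minimal loader mirroring the table's defaults; the `load_config` name is illustrative, and the actual backend may read these variables differently:

```python
import os

def load_config() -> dict:
    """Read the environment variables from the table, applying defaults."""
    key = os.environ.get("GROQ_API_KEY")
    if not key:
        # Required: fail fast with a pointer to where keys come from.
        raise RuntimeError("GROQ_API_KEY is required (see https://console.groq.com)")
    return {
        "groq_api_key": key,
        "model_path": os.environ.get("MODEL_PATH", "vqa_checkpoint.pt"),
        "port": int(os.environ.get("PORT", "8000")),
    }
```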
---
## Requirements
- Python 3.10+
- CUDA GPU recommended (CPU works but is slow)
- Node.js 20+ (for UI)
- Groq API key (free tier available)
---
## License
MIT © [DevaRajan8](https://github.com/DevaRajan8)