---
title: VQA Backend
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
# GenVQA — Generative Visual Question Answering

**A neuro-symbolic VQA system that detects objects with a neural model, retrieves structured facts from Wikidata, and generates grounded answers with Groq.**

[![Backend CI](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/backend-ci.yml/badge.svg)](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/backend-ci.yml)
[![UI CI](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/ui-ci.yml/badge.svg)](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/ui-ci.yml)
![Python](https://img.shields.io/badge/Python-3.10%2B-blue?logo=python)
![License](https://img.shields.io/badge/License-MIT-green)
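---

## Try It from Python

Once the backend from the Quick Start below is running, you can call it from Python with no extra dependencies. This is an illustrative sketch, not project code: `build_multipart` and `ask` are hypothetical helper names, and the form field names mirror the curl example in the API section.

```python
import json
import urllib.request
import uuid


def build_multipart(fields, files, boundary):
    """Encode form fields and files as a multipart/form-data body (stdlib only)."""
    body = b""
    for name, value in fields.items():
        body += (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        ).encode()
    for name, (filename, data, ctype) in files.items():
        body += (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f"Content-Type: {ctype}\r\n\r\n"
        ).encode() + data + b"\r\n"
    body += f"--{boundary}--\r\n".encode()
    return body


def ask(image_path, question, base_url="http://localhost:8000"):
    """POST an image + question to /api/answer and return the parsed JSON."""
    boundary = uuid.uuid4().hex
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    body = build_multipart(
        {"question": question},
        {"image": (image_path, image_bytes, "image/jpeg")},
        boundary,
    )
    req = urllib.request.Request(
        f"{base_url}/api/answer",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())


# Example (requires a running backend):
# print(ask("photo.jpg", "Can this melt?")["answer"])
```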
---

## Architecture

```
Expo Mobile App (React Native)
  image upload + question input · displays answer + accessibility description
        │  HTTP POST /api/answer
        ▼
FastAPI backend (backend_api.py)
  request handling · session management · Conversation Manager (multi-turn context)
        │
        ▼
Routing layer (ensemble_vqa_app.py)
  CLIP encodes the question and compares it against
  "reasoning question" vs "visual/perceptual question"
        │
        ├─ reasoning → NEURO-SYMBOLIC PATH
        │    1. VQA model detects the object
        │    2. Wikidata API fetches facts (P31, P2101, P2054, P186, P366, ...)
        │    3. Groq LLM verbalizes an answer from the facts
        │
        └─ visual → NEURAL VQA PATH
             VQA model (GRU + Attention) predicts the answer directly
        │
        ▼  (both paths)
Groq service (groq_service.py)
  accessibility description (2 sentences, screen-reader friendly)
        │
        ▼
JSON response
  { answer, model_used, kg_enhancement, wikidata_entity, description }
```

| Layer | Component | Role |
|---|---|---|
| **Client** | Expo React Native | Image upload, question input, answer display |
| **API** | FastAPI (`backend_api.py`) | Routing, sessions, conversation state |
| **Conversation** | `conversation_manager.py` | Multi-turn context, history tracking |
| **Router** | CLIP (in `ensemble_vqa_app.py`) | Classifies question as reasoning vs visual |
| **Neural VQA** | GRU + Attention (`model.py`) | Answers visual questions directly from image |
| **Neuro-Symbolic** | `semantic_neurosymbolic_vqa.py` | VQA detects objects → Wikidata fetches facts → Groq verbalizes |
| **Accessibility** | `groq_service.py` | Generates spoken-friendly 2-sentence description for every answer |

---

## Features

- 🔍 **Visual Question Answering** — trained on VQAv2, fine-tuned on custom data
- 🧠 **Neuro-Symbolic Routing** — CLIP semantically classifies questions as _reasoning_ vs _visual_ and routes them accordingly
- 🌐 **Live Wikidata Facts** — queries physical properties, categories, materials, and uses in real time
- 🤖 **Groq Verbalization** — Llama 3.3 70B answers from structured facts rather than hallucinating
- 💬 **Conversational Support** — multi-turn conversation manager with context tracking
- 📱 **Expo Mobile UI** — React Native app for iOS/Android/Web
- ♿ **Accessibility** — Groq generates a spoken-friendly description for every answer

---

## Quick Start

### 1 — Backend

```bash
# Clone and install
git clone https://github.com/DevaRajan8/Generative-vqa.git
cd Generative-vqa
pip install -r requirements_api.txt

# Set your Groq API key
cp .env.example .env
# Edit .env and set:
# GROQ_API_KEY=your_key_here

# Start the API
python backend_api.py   # → http://localhost:8000
```

### 2 — Mobile UI

```bash
cd ui
npm install
npx expo start --clear
```

> Scan the QR code with Expo Go, or press `w` for browser.

---

## API

| Endpoint | Method | Description |
|---|---|---|
| `/api/answer` | POST | Answer a question about an uploaded image |
| `/api/health` | GET | Health check |
| `/api/conversation/new` | POST | Start a new conversation session |

**Example:**

```bash
curl -X POST http://localhost:8000/api/answer \
  -F "image=@photo.jpg" \
  -F "question=Can this melt?"
```

**Response:**

```json
{
  "answer": "ice",
  "model_used": "neuro-symbolic",
  "kg_enhancement": "Yes — ice can melt. [Wikidata P2101: melting point = 0.0 °C]",
  "knowledge_source": "VQA (neural) + Wikidata (symbolic) + Groq (verbalize)",
  "wikidata_entity": "Q86"
}
```

---

## Project Structure

```
├── backend_api.py                 # FastAPI server
├── ensemble_vqa_app.py            # VQA orchestrator (routing + inference)
├── semantic_neurosymbolic_vqa.py  # Wikidata KB + Groq verbalizer
├── groq_service.py                # Groq accessibility descriptions
├── conversation_manager.py        # Multi-turn conversation tracking
├── model.py                       # VQA model definition
├── train.py                       # Training pipeline
├── ui/                            # Expo React Native app
│   └── src/screens/HomeScreen.js
└── .github/
    ├── workflows/                 # CI — backend lint + UI build
    └── ISSUE_TEMPLATE/
```

---

## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `GROQ_API_KEY` | ✅ | Groq API key — [get one free](https://console.groq.com) |
| `MODEL_PATH` | optional | Path to VQA checkpoint (default: `vqa_checkpoint.pt`) |
| `PORT` | optional | API server port (default: `8000`) |

---

## Requirements

- Python 3.10+
- CUDA GPU recommended (CPU works but is slow)
- Node.js 20+ (for UI)
- Groq API key (free tier available)

---

## License

MIT © [DevaRajan8](https://github.com/DevaRajan8)
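---

## Appendix: How the Routing Works

The router described under Architecture reduces to nearest-prototype classification in an embedding space: encode the question, encode one prototype phrase per pipeline, and pick the pipeline whose prototype is most similar. The sketch below shows that logic with a toy bag-of-letters encoder standing in for CLIP; the encoder, the prototype wording, and all names here are illustrative, not the project's actual code.

```python
import math


def embed(text):
    # Toy stand-in for a CLIP text encoder: a 26-dim bag-of-letters vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isascii() and ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


# One prototype phrase per pipeline (wording is illustrative).
PROTOTYPES = {
    "neuro-symbolic": "reasoning question about properties or knowledge",
    "neural": "visual perceptual question about what is in the image",
}


def route(question):
    """Return the pipeline whose prototype embedding is closest to the question."""
    q = embed(question)
    return max(PROTOTYPES, key=lambda k: cosine(q, embed(PROTOTYPES[k])))
```

In the real system this happens inside `ensemble_vqa_app.py` using CLIP embeddings; the toy encoder exists only to keep the sketch self-contained and runnable.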