---
title: VQA Backend
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
<div align="center">
# GenVQA – Generative Visual Question Answering
**A neuro-symbolic VQA system that detects objects with a neural model, retrieves structured facts from Wikidata, and generates grounded answers with Groq.**
[![Backend CI](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/backend-ci.yml/badge.svg)](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/backend-ci.yml)
[![UI CI](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/ui-ci.yml/badge.svg)](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/ui-ci.yml)


</div>
---
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                        CLIENT LAYER                         │
│              📱 Expo Mobile App (React Native)              │
│   • Image upload + question input                           │
│   • Displays answer + accessibility description             │
└─────────────────────────┬───────────────────────────────────┘
                          │ HTTP POST /api/answer
                          ▼
┌─────────────────────────────────────────────────────────────┐
│                   BACKEND LAYER (FastAPI)                   │
│                       backend_api.py                        │
│   • Request handling, session management                    │
│   • Conversation Manager – multi-turn context tracking      │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│             ROUTING LAYER (ensemble_vqa_app.py)             │
│                                                             │
│   CLIP encodes question → compares against:                 │
│   "reasoning question" vs "visual/perceptual question"      │
│                                                             │
│      Reasoning?                       Visual?               │
│           │                              │                  │
│           ▼                              ▼                  │
│  ┌─────────────────┐          ┌─────────────────────┐       │
│  │ NEURO-SYMBOLIC  │          │  NEURAL VQA PATH    │       │
│  │                 │          │                     │       │
│  │ 1. VQA model    │          │ VQA model (GRU +    │       │
│  │    detects obj  │          │ Attention) predicts │       │
│  │                 │          │ answer directly     │       │
│  │ 2. Wikidata API │          └──────────┬──────────┘       │
│  │    fetches facts│                     │                  │
│  │    (P31, P2101, │                     │                  │
│  │     P2054, P186,│                     │                  │
│  │     P366 ...)   │                     │                  │
│  │                 │                     │                  │
│  │ 3. Groq LLM     │                     │                  │
│  │    verbalizes   │                     │                  │
│  │    from facts   │                     │                  │
│  └────────┬────────┘                     │                  │
│           └───────────────┬──────────────┘                  │
└───────────────────────────┼─────────────────────────────────┘
                            │
                            ▼
                   ┌─────────────────┐
                   │  GROQ SERVICE   │
                   │  Accessibility  │
                   │  description    │
                   │  (2 sentences,  │
                   │  screen-reader  │
                   │  friendly)      │
                   └────────┬────────┘
                            │
                            ▼
                     JSON response
               { answer, model_used,
                 kg_enhancement,
                 wikidata_entity,
                 description }
```
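The routing decision in the diagram boils down to a zero-shot similarity comparison. A minimal sketch of that logic, with plain lists standing in for CLIP's text embeddings (the real router in `ensemble_vqa_app.py` encodes with CLIP):

```python
import math

# The two label prompts the question embedding is compared against.
REASONING_PROMPT = "reasoning question"
VISUAL_PROMPT = "visual/perceptual question"

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def route(question_vec, reasoning_vec, visual_vec):
    """Pick the path whose label prompt is closest to the question embedding."""
    if cosine(question_vec, reasoning_vec) >= cosine(question_vec, visual_vec):
        return "neuro-symbolic"  # VQA -> Wikidata -> Groq path
    return "neural-vqa"          # GRU + Attention path
```

In the real system all three vectors come from CLIP's text encoder; only the comparison itself is shown here.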
| Layer | Component | Role |
|---|---|---|
| **Client** | Expo React Native | Image upload, question input, answer display |
| **API** | FastAPI (`backend_api.py`) | Routing, sessions, conversation state |
| **Conversation** | `conversation_manager.py` | Multi-turn context, history tracking |
| **Router** | CLIP (in `ensemble_vqa_app.py`) | Classifies question as reasoning vs visual |
| **Neural VQA** | GRU + Attention (`model.py`) | Answers visual questions directly from image |
| **Neuro-Symbolic** | `semantic_neurosymbolic_vqa.py` | VQA detects objects → Wikidata fetches facts → Groq verbalizes |
| **Accessibility** | `groq_service.py` | Generates spoken-friendly 2-sentence description for every answer |
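The conversation layer's job can be pictured with a toy version (a hypothetical class; the actual logic lives in `conversation_manager.py`):

```python
import uuid

class ConversationManager:
    """Toy multi-turn context tracker: one history list per session id."""

    def __init__(self):
        self.sessions = {}

    def new_session(self) -> str:
        """Create a session and return its id (what /api/conversation/new does)."""
        sid = uuid.uuid4().hex
        self.sessions[sid] = []
        return sid

    def add_turn(self, sid: str, question: str, answer: str) -> None:
        """Record one question/answer exchange."""
        self.sessions[sid].append({"question": question, "answer": answer})

    def context(self, sid: str, last_n: int = 3):
        """Return the most recent turns to carry into the next prompt."""
        return self.sessions.get(sid, [])[-last_n:]
```

The real manager tracks more state, but the shape is the same: sessions keyed by id, with a sliding window of recent turns.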
---
## Features
- 🔍 **Visual Question Answering** – trained on VQAv2, fine-tuned on custom data
- 🧠 **Neuro-Symbolic Routing** – CLIP semantically classifies questions as _reasoning_ vs _visual_ and routes accordingly
- 🌐 **Live Wikidata Facts** – queries physical properties, categories, materials, and uses in real time
- 🤖 **Groq Verbalization** – Llama 3.3 70B answers grounded in structured facts, not hallucination
- 💬 **Conversational Support** – multi-turn conversation manager with context tracking
- 📱 **Expo Mobile UI** – React Native app for iOS/Android/Web
- ♿ **Accessibility** – Groq generates spoken-friendly descriptions for every answer
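The live-Wikidata step can be sketched with the standard library alone. `fetch_entity` hits the public Wikidata API (network required); `quantity` is a pure parser that pulls a numeric property such as P2101 (melting point) out of the returned claims. Both helper names are ours, not the project's:

```python
import json
import urllib.request

WIKIDATA = "https://www.wikidata.org/w/api.php"

def fetch_entity(qid: str) -> dict:
    """Fetch the claims for an entity id, e.g. Q86 for 'ice'."""
    url = f"{WIKIDATA}?action=wbgetentities&ids={qid}&props=claims&format=json"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())["entities"][qid]

def quantity(entity: dict, prop: str):
    """Return the first quantity value for a property like P2101, or None."""
    for claim in entity.get("claims", {}).get(prop, []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if isinstance(value, dict) and "amount" in value:
            return float(value["amount"])
    return None
```

For the README's example, `quantity(fetch_entity("Q86"), "P2101")` would yield the melting point the `kg_enhancement` string cites.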
---
## Quick Start
### 1 – Backend
```bash
# Clone and install
git clone https://github.com/DevaRajan8/Generative-vqa.git
cd Generative-vqa
pip install -r requirements_api.txt
# Set your Groq API key
cp .env.example .env
# Edit .env → GROQ_API_KEY=your_key_here
# Start API
python backend_api.py
# → http://localhost:8000
```
### 2 – Mobile UI
```bash
cd ui
npm install
npx expo start --clear
```
> Scan the QR code with Expo Go, or press `w` for browser.
---
## API
| Endpoint | Method | Description |
|---|---|---|
| `/api/answer` | POST | Answer a question about an uploaded image |
| `/api/health` | GET | Health check |
| `/api/conversation/new` | POST | Start a new conversation session |
**Example:**
```bash
curl -X POST http://localhost:8000/api/answer \
-F "image=@photo.jpg" \
-F "question=Can this melt?"
```
**Response:**
```json
{
"answer": "ice",
"model_used": "neuro-symbolic",
"kg_enhancement": "Yes β ice can melt. [Wikidata P2101: melting point = 0.0 Β°C]",
"knowledge_source": "VQA (neural) + Wikidata (symbolic) + Groq (verbalize)",
"wikidata_entity": "Q86"
}
```
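The same request can be made from Python with only the standard library. `build_multipart` assembles by hand the `multipart/form-data` body that curl builds for you; both helpers are illustrative, not part of the backend:

```python
import json
import mimetypes
import urllib.request
import uuid

def build_multipart(fields: dict, file_field: str, filename: str, file_bytes: bytes):
    """Assemble a multipart/form-data body and its Content-Type header."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "", value]
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    lines += [f"--{boundary}",
              f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"',
              f"Content-Type: {ctype}", ""]
    body = "\r\n".join(lines).encode() + b"\r\n" + file_bytes + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def ask(image_path: str, question: str, base_url: str = "http://localhost:8000") -> dict:
    """POST an image + question to /api/answer and return the parsed JSON."""
    with open(image_path, "rb") as f:
        body, ctype = build_multipart({"question": question}, "image", image_path, f.read())
    req = urllib.request.Request(f"{base_url}/api/answer", data=body,
                                 headers={"Content-Type": ctype})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

With the backend running, `ask("photo.jpg", "Can this melt?")` returns a dict shaped like the response above.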
---
## Project Structure
```
├── backend_api.py                 # FastAPI server
├── ensemble_vqa_app.py            # VQA orchestrator (routing + inference)
├── semantic_neurosymbolic_vqa.py  # Wikidata KB + Groq verbalizer
├── groq_service.py                # Groq accessibility descriptions
├── conversation_manager.py        # Multi-turn conversation tracking
├── model.py                       # VQA model definition
├── train.py                       # Training pipeline
├── ui/                            # Expo React Native app
│   └── src/screens/HomeScreen.js
└── .github/
    ├── workflows/                 # CI – backend lint + UI build
    └── ISSUE_TEMPLATE/
```
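The verbalization step in `semantic_neurosymbolic_vqa.py` amounts to constraining the LLM to the retrieved facts. A hypothetical prompt builder (names and wording are illustrative, not the project's actual prompt):

```python
def build_prompt(question: str, entity_label: str, facts: dict) -> str:
    """Turn Wikidata facts into a prompt that keeps the LLM grounded."""
    fact_lines = "\n".join(f"- {prop}: {val}" for prop, val in facts.items())
    return (
        f"Answer the question using ONLY these Wikidata facts about {entity_label}.\n"
        f"Facts:\n{fact_lines}\n"
        f"Question: {question}\n"
        "Answer in one sentence and cite the property id you used."
    )
```

The resulting string is what gets sent to the Groq-hosted Llama 3.3 70B model; restricting the context to retrieved facts is what makes the answer verifiable rather than hallucinated.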
---
## Environment Variables
| Variable | Required | Description |
|---|---|---|
| `GROQ_API_KEY` | ✅ required | Groq API key – [get one free](https://console.groq.com) |
| `MODEL_PATH` | optional | Path to VQA checkpoint (default: `vqa_checkpoint.pt`) |
| `PORT` | optional | API server port (default: `8000`) |
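A sketch of how the backend might consume these variables at startup (illustrative only; defaults match the table above):

```python
import os

def load_config(env=os.environ) -> dict:
    """GROQ_API_KEY is mandatory; the rest fall back to documented defaults."""
    if "GROQ_API_KEY" not in env:
        raise RuntimeError("GROQ_API_KEY is required (see .env.example)")
    return {
        "groq_api_key": env["GROQ_API_KEY"],
        "model_path": env.get("MODEL_PATH", "vqa_checkpoint.pt"),
        "port": int(env.get("PORT", "8000")),
    }
```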
---
## Requirements
- Python 3.10+
- CUDA GPU recommended (CPU works but is slow)
- Node.js 20+ (for UI)
- Groq API key (free tier available)
---
## License
MIT Β© [DevaRajan8](https://github.com/DevaRajan8)