---
title: VQA Backend
emoji: πŸš€
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---


<div align="center">

# GenVQA β€” Generative Visual Question Answering

**A neuro-symbolic VQA system that detects objects with a neural model, retrieves structured facts from Wikidata, and generates grounded answers with Groq.**

[![Backend CI](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/backend-ci.yml/badge.svg)](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/backend-ci.yml)
[![UI CI](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/ui-ci.yml/badge.svg)](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/ui-ci.yml)
![Python](https://img.shields.io/badge/Python-3.10%2B-blue?logo=python)
![License](https://img.shields.io/badge/License-MIT-green)

</div>

---

## Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   CLIENT LAYER                              β”‚
β”‚   πŸ“± Expo Mobile App (React Native)                         β”‚
β”‚   β€’ Image upload + question input                           β”‚
β”‚   β€’ Displays answer + accessibility description             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚ HTTP POST /api/answer
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   BACKEND LAYER  (FastAPI)                  β”‚
β”‚   backend_api.py                                            β”‚
β”‚   β€’ Request handling, session management                    β”‚
β”‚   β€’ Conversation Manager β†’ multi-turn context tracking      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            ROUTING LAYER  (ensemble_vqa_app.py)             β”‚
β”‚                                                             β”‚
β”‚   CLIP encodes question β†’ compares against:                 β”‚
β”‚   "reasoning question" vs "visual/perceptual question"      β”‚
β”‚                                                             β”‚
β”‚         Reasoning?                 Visual?                  β”‚
β”‚             β”‚                          β”‚                    β”‚
β”‚             β–Ό                          β–Ό                    β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”‚
β”‚   β”‚ NEURO-SYMBOLIC  β”‚      β”‚   NEURAL VQA PATH   β”‚         β”‚
β”‚   β”‚                 β”‚      β”‚                     β”‚         β”‚
β”‚   β”‚ 1. VQA model    β”‚      β”‚  VQA model (GRU +   β”‚         β”‚
β”‚   β”‚    detects obj  β”‚      β”‚ Attention) predicts β”‚         β”‚
β”‚   β”‚                 β”‚      β”‚  answer directly    β”‚         β”‚
β”‚   β”‚ 2. Wikidata API β”‚      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β”‚
β”‚   β”‚    fetches factsβ”‚                 β”‚                    β”‚
β”‚   β”‚    (P31, P2101, β”‚                 β”‚                    β”‚
β”‚   β”‚     P2054, P186,β”‚                 β”‚                    β”‚
β”‚   β”‚     P366 ...)   β”‚                 β”‚                    β”‚
β”‚   β”‚                 β”‚                 β”‚                    β”‚
β”‚   β”‚ 3. Groq LLM     β”‚                 β”‚                    β”‚
β”‚   β”‚    verbalizes   β”‚                 β”‚                    β”‚
β”‚   β”‚    from facts   β”‚                 β”‚                    β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚                    β”‚
β”‚             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β”‚
└──────────────────────────  β”‚  β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   GROQ SERVICE  β”‚
                    β”‚  Accessibility  β”‚
                    β”‚  description    β”‚
                    β”‚  (2 sentences,  β”‚
                    β”‚  screen-reader  β”‚
                    β”‚  friendly)      β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
                      JSON response
                    { answer, model_used,
                      kg_enhancement,
                      wikidata_entity,
                      description }
```

| Layer | Component | Role |
|---|---|---|
| **Client** | Expo React Native | Image upload, question input, answer display |
| **API** | FastAPI (`backend_api.py`) | Routing, sessions, conversation state |
| **Conversation** | `conversation_manager.py` | Multi-turn context, history tracking |
| **Router** | CLIP (in `ensemble_vqa_app.py`) | Classifies question as reasoning vs visual |
| **Neural VQA** | GRU + Attention (`model.py`) | Answers visual questions directly from image |
| **Neuro-Symbolic** | `semantic_neurosymbolic_vqa.py` | VQA detects objects β†’ Wikidata fetches facts β†’ Groq verbalizes |
| **Accessibility** | `groq_service.py` | Generates spoken-friendly 2-sentence description for every answer |
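
The routing layer boils down to zero-shot classification: embed the question and both label prompts, then pick the label with the higher cosine similarity. Below is a minimal sketch of that decision rule; the `embed` stub (a deterministic hash-based vector) stands in for CLIP's text encoder, which is what `ensemble_vqa_app.py` actually uses, and the label prompts are illustrative.

```python
import hashlib
import math

def embed(text: str, dim: int = 32) -> list[float]:
    # Stub embedding: deterministic pseudo-random vector derived from a hash.
    # In the real pipeline this would be CLIP's text encoder output.
    vec = []
    for i in range(dim):
        h = hashlib.sha256(f"{text}:{i}".encode()).digest()
        vec.append(int.from_bytes(h[:4], "big") / 2**32 - 0.5)
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Illustrative label prompts for the two routes.
LABELS = {
    "reasoning": "a reasoning question about properties or facts",
    "visual": "a visual question about what is in the image",
}

def route(question: str) -> str:
    """Return the label whose prompt embedding is closest to the question."""
    q = embed(question)
    scores = {name: cosine(q, embed(prompt)) for name, prompt in LABELS.items()}
    return max(scores, key=scores.get)
```

With a real encoder, "Can this melt?" would score closer to the reasoning prompt and be sent down the neuro-symbolic path; with the stub embedding the choice is arbitrary, so treat this purely as a shape of the logic.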

---

## Features

- πŸ” **Visual Question Answering** β€” trained on VQAv2, fine-tuned on custom data
- 🧠 **Neuro-Symbolic Routing** β€” CLIP semantically classifies questions as _reasoning_ vs _visual_, routes accordingly
- 🌐 **Live Wikidata Facts** β€” queries physical properties, categories, materials, uses in real time
- πŸ€– **Groq Verbalization** β€” Llama 3.3 70B verbalizes answers from structured Wikidata facts rather than hallucinating them
- πŸ’¬ **Conversational Support** β€” multi-turn conversation manager with context tracking
- πŸ“± **Expo Mobile UI** β€” React Native app for iOS/Android/Web
- β™Ώ **Accessibility** β€” Groq generates spoken-friendly descriptions for every answer
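
The "live Wikidata facts" step reduces to pulling specific property claims (P31 instance-of, P2101 melting point, P186 material, ...) out of an entity document. Here is a hedged sketch of the parsing half, run against a tiny hand-built fragment in the shape the Wikidata `wbgetentities` API returns; `extract_claims` is our helper name, not the repo's API, and the QIDs are placeholders.

```python
def extract_claims(entity: dict, props: list[str]) -> dict:
    """Pull the first value of each requested property from a Wikidata entity dict."""
    facts = {}
    for prop in props:
        for statement in entity.get("claims", {}).get(prop, []):
            value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
            if value is None:
                continue
            if isinstance(value, dict):
                # Quantities carry an "amount"; linked items carry an "id".
                value = value.get("amount", value.get("id", value))
            facts[prop] = value
            break  # keep only the first (preferred) statement
    return facts

# Hand-built fragment mimicking the wbgetentities JSON shape (QIDs are placeholders).
ice = {
    "claims": {
        "P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q123"}}}}],
        "P2101": [{"mainsnak": {"datavalue": {"value": {"amount": "+0", "unit": "1"}}}}],
    }
}

print(extract_claims(ice, ["P31", "P2101", "P186"]))
# β†’ {'P31': 'Q123', 'P2101': '+0'}
```

Properties absent from the entity (P186 above) are simply skipped, so the verbalizer only ever sees facts that actually exist.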

---

## Quick Start

### 1 β€” Backend

```bash
# Clone and install
git clone https://github.com/DevaRajan8/Generative-vqa.git
cd Generative-vqa
pip install -r requirements_api.txt

# Set your Groq API key
cp .env.example .env
# Edit .env β†’ GROQ_API_KEY=your_key_here

# Start API
python backend_api.py
# β†’ http://localhost:8000
```

### 2 β€” Mobile UI

```bash
cd ui
npm install
npx expo start --clear
```

> Scan the QR code with Expo Go, or press `w` for browser.

---

## API

| Endpoint | Method | Description |
|---|---|---|
| `/api/answer` | POST | Answer a question about an uploaded image |
| `/api/health` | GET | Health check |
| `/api/conversation/new` | POST | Start a new conversation session |

**Example:**

```bash
curl -X POST http://localhost:8000/api/answer \
  -F "image=@photo.jpg" \
  -F "question=Can this melt?"
```

**Response:**

```json
{
  "answer": "ice",
  "model_used": "neuro-symbolic",
  "kg_enhancement": "Yes β€” ice can melt. [Wikidata P2101: melting point = 0.0 Β°C]",
  "knowledge_source": "VQA (neural) + Wikidata (symbolic) + Groq (verbalize)",
  "wikidata_entity": "Q86"
}
```
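
The same request can be made from Python with only the standard library, which means hand-building the multipart body; a sketch, assuming the backend from the Quick Start is running on port 8000 (the `multipart_form` helper is ours, not part of the repo):

```python
import mimetypes
import uuid

def multipart_form(fields: dict, files: dict) -> tuple[bytes, str]:
    """Build a multipart/form-data body and its Content-Type header value."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    for name, (filename, data) in files.items():
        ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\nContent-Type: {ctype}\r\n\r\n'.encode()
            + data + b"\r\n"
        )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

body, content_type = multipart_form(
    {"question": "Can this melt?"},
    {"image": ("photo.jpg", b"<jpeg bytes>")},  # placeholder bytes; read your file here
)

# To actually send it:
# import urllib.request
# req = urllib.request.Request("http://localhost:8000/api/answer", data=body,
#                              headers={"Content-Type": content_type})
# print(urllib.request.urlopen(req).read().decode())
```

If you already have `requests` installed, its `files=` parameter does all of this for you; the sketch just keeps the dependency footprint at zero.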

---

## Project Structure

```
β”œβ”€β”€ backend_api.py                  # FastAPI server
β”œβ”€β”€ ensemble_vqa_app.py             # VQA orchestrator (routing + inference)
β”œβ”€β”€ semantic_neurosymbolic_vqa.py   # Wikidata KB + Groq verbalizer
β”œβ”€β”€ groq_service.py                 # Groq accessibility descriptions
β”œβ”€β”€ conversation_manager.py         # Multi-turn conversation tracking
β”œβ”€β”€ model.py                        # VQA model definition
β”œβ”€β”€ train.py                        # Training pipeline
β”œβ”€β”€ ui/                             # Expo React Native app
β”‚   └── src/screens/HomeScreen.js
└── .github/
    β”œβ”€β”€ workflows/                  # CI β€” backend lint + UI build
    └── ISSUE_TEMPLATE/
```
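
Multi-turn context tracking of the kind `conversation_manager.py` provides can be reduced to a bounded per-session history that gets folded into each new prompt. A hypothetical minimal version (class and method names are ours, not the repo's):

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    history: list[tuple[str, str]] = field(default_factory=list)  # (question, answer)
    max_turns: int = 5  # keep the prompt bounded

    def add_turn(self, question: str, answer: str) -> None:
        self.history.append((question, answer))
        self.history = self.history[-self.max_turns:]  # drop the oldest turns

    def context_prompt(self, question: str) -> str:
        """Fold prior turns plus the new question into one prompt string."""
        lines = [f"Q: {q}\nA: {a}" for q, a in self.history]
        lines.append(f"Q: {question}")
        return "\n".join(lines)

conv = Conversation()
conv.add_turn("What is this?", "ice")
print(conv.context_prompt("Can it melt?"))
# prints:
# Q: What is this?
# A: ice
# Q: Can it melt?
```

The cap on `max_turns` is the important design choice: without it, long sessions silently inflate every downstream LLM call.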

---

## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `GROQ_API_KEY` | βœ… | Groq API key β€” [get one free](https://console.groq.com) |
| `MODEL_PATH` | optional | Path to VQA checkpoint (default: `vqa_checkpoint.pt`) |
| `PORT` | optional | API server port (default: `8000`) |
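
Resolving those variables with the documented defaults is a few lines; a sketch assuming plain environment lookups (the repo loads `.env` first, e.g. via python-dotenv, and `load_config` is our illustrative helper):

```python
def load_config(env: dict) -> dict:
    """Resolve settings from an environment mapping, applying documented defaults."""
    if not env.get("GROQ_API_KEY"):
        raise RuntimeError("GROQ_API_KEY is required; get one at https://console.groq.com")
    return {
        "groq_api_key": env["GROQ_API_KEY"],
        "model_path": env.get("MODEL_PATH", "vqa_checkpoint.pt"),  # default checkpoint
        "port": int(env.get("PORT", "8000")),                      # default API port
    }

# In real use you would pass os.environ; a literal dict keeps the example self-contained.
cfg = load_config({"GROQ_API_KEY": "gsk_example"})
print(cfg)
# β†’ {'groq_api_key': 'gsk_example', 'model_path': 'vqa_checkpoint.pt', 'port': 8000}
```

Failing fast on the missing key at startup beats a cryptic 500 on the first `/api/answer` call.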

---

## Requirements

- Python 3.10+
- CUDA GPU recommended (CPU works but is slow)
- Node.js 20+ (for UI)
- Groq API key (free tier available)

---

## License

MIT Β© [DevaRajan8](https://github.com/DevaRajan8)