---
title: VQA Backend
emoji: πŸš€
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---


<div align="center">

# GenVQA β€” Generative Visual Question Answering

**A neuro-symbolic VQA system that detects objects with a neural model, retrieves structured facts from Wikidata, and generates grounded answers with Groq.**

[![Backend CI](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/backend-ci.yml/badge.svg)](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/backend-ci.yml)
[![UI CI](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/ui-ci.yml/badge.svg)](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/ui-ci.yml)
![Python](https://img.shields.io/badge/Python-3.10%2B-blue?logo=python)
![License](https://img.shields.io/badge/License-MIT-green)

</div>

---

## Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   CLIENT LAYER                              β”‚
β”‚   πŸ“± Expo Mobile App (React Native)                         β”‚
β”‚   β€’ Image upload + question input                           β”‚
β”‚   β€’ Displays answer + accessibility description             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚ HTTP POST /api/answer
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   BACKEND LAYER  (FastAPI)                  β”‚
β”‚   backend_api.py                                            β”‚
β”‚   β€’ Request handling, session management                    β”‚
β”‚   β€’ Conversation Manager β†’ multi-turn context tracking      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            ROUTING LAYER  (ensemble_vqa_app.py)             β”‚
β”‚                                                             β”‚
β”‚   CLIP encodes question β†’ compares against:                 β”‚
β”‚   "reasoning question" vs "visual/perceptual question"      β”‚
β”‚                                                             β”‚
β”‚         Reasoning?                 Visual?                  β”‚
β”‚             β”‚                          β”‚                    β”‚
β”‚             β–Ό                          β–Ό                    β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”‚
β”‚   β”‚ NEURO-SYMBOLIC  β”‚      β”‚   NEURAL VQA PATH   β”‚         β”‚
β”‚   β”‚                 β”‚      β”‚                     β”‚         β”‚
β”‚   β”‚ 1. VQA model    β”‚      β”‚  VQA model (GRU +   β”‚         β”‚
β”‚   β”‚    detects obj  β”‚      β”‚ Attention) predicts β”‚         β”‚
β”‚   β”‚                 β”‚      β”‚  answer directly    β”‚         β”‚
β”‚   β”‚ 2. Wikidata API β”‚      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β”‚
β”‚   β”‚    fetches factsβ”‚                 β”‚                    β”‚
β”‚   β”‚    (P31, P2101, β”‚                 β”‚                    β”‚
β”‚   β”‚     P2054, P186,β”‚                 β”‚                    β”‚
β”‚   β”‚     P366 ...)   β”‚                 β”‚                    β”‚
β”‚   β”‚                 β”‚                 β”‚                    β”‚
β”‚   β”‚ 3. Groq LLM     β”‚                 β”‚                    β”‚
β”‚   β”‚    verbalizes   β”‚                 β”‚                    β”‚
β”‚   β”‚    from facts   β”‚                 β”‚                    β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚                    β”‚
β”‚             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β”‚
└──────────────────────────  β”‚  β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   GROQ SERVICE  β”‚
                    β”‚  Accessibility  β”‚
                    β”‚  description    β”‚
                    β”‚  (2 sentences,  β”‚
                    β”‚  screen-reader  β”‚
                    β”‚  friendly)      β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
                      JSON response
                    { answer, model_used,
                      kg_enhancement,
                      wikidata_entity,
                      description }
```

| Layer | Component | Role |
|---|---|---|
| **Client** | Expo React Native | Image upload, question input, answer display |
| **API** | FastAPI (`backend_api.py`) | Routing, sessions, conversation state |
| **Conversation** | `conversation_manager.py` | Multi-turn context, history tracking |
| **Router** | CLIP (in `ensemble_vqa_app.py`) | Classifies question as reasoning vs visual |
| **Neural VQA** | GRU + Attention (`model.py`) | Answers visual questions directly from image |
| **Neuro-Symbolic** | `semantic_neurosymbolic_vqa.py` | VQA detects objects β†’ Wikidata fetches facts β†’ Groq verbalizes |
| **Accessibility** | `groq_service.py` | Generates spoken-friendly 2-sentence description for every answer |
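
The routing layer boils down to zero-shot classification: embed the question and both label prompts, then pick the label with the higher cosine similarity. Below is a minimal sketch of that decision rule; the `embed` stub (a deterministic hash-based vector) stands in for CLIP's text encoder, which is what `ensemble_vqa_app.py` actually uses, and the label prompts are illustrative.

```python
import hashlib
import math

def embed(text: str, dim: int = 32) -> list[float]:
    # Stub embedding: deterministic pseudo-random vector derived from a hash.
    # In the real pipeline this would be CLIP's text encoder output.
    vec = []
    for i in range(dim):
        h = hashlib.sha256(f"{text}:{i}".encode()).digest()
        vec.append(int.from_bytes(h[:4], "big") / 2**32 - 0.5)
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Illustrative label prompts for the two routes.
LABELS = {
    "reasoning": "a reasoning question about properties or facts",
    "visual": "a visual question about what is in the image",
}

def route(question: str) -> str:
    """Return the label whose prompt embedding is closest to the question."""
    q = embed(question)
    scores = {name: cosine(q, embed(prompt)) for name, prompt in LABELS.items()}
    return max(scores, key=scores.get)
```

With a real encoder, "Can this melt?" would score closer to the reasoning prompt and be sent down the neuro-symbolic path; with the stub embedding the choice is arbitrary, so treat this purely as a shape of the logic.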

---

## Features

- πŸ” **Visual Question Answering** β€” trained on VQAv2, fine-tuned on custom data
- 🧠 **Neuro-Symbolic Routing** β€” CLIP semantically classifies questions as _reasoning_ vs _visual_, routes accordingly
- 🌐 **Live Wikidata Facts** β€” queries physical properties, categories, materials, uses in real time
- πŸ€– **Groq Verbalization** β€” Llama 3.3 70B verbalizes answers from structured Wikidata facts rather than hallucinating them
- πŸ’¬ **Conversational Support** β€” multi-turn conversation manager with context tracking
- πŸ“± **Expo Mobile UI** β€” React Native app for iOS/Android/Web
- β™Ώ **Accessibility** β€” Groq generates spoken-friendly descriptions for every answer
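
The "live Wikidata facts" step reduces to pulling specific property claims (P31 instance-of, P2101 melting point, P186 material, ...) out of an entity document. Here is a hedged sketch of the parsing half, run against a tiny hand-built fragment in the shape the Wikidata `wbgetentities` API returns; `extract_claims` is our helper name, not the repo's API, and the QIDs are placeholders.

```python
def extract_claims(entity: dict, props: list[str]) -> dict:
    """Pull the first value of each requested property from a Wikidata entity dict."""
    facts = {}
    for prop in props:
        for statement in entity.get("claims", {}).get(prop, []):
            value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
            if value is None:
                continue
            if isinstance(value, dict):
                # Quantities carry an "amount"; linked items carry an "id".
                value = value.get("amount", value.get("id", value))
            facts[prop] = value
            break  # keep only the first (preferred) statement
    return facts

# Hand-built fragment mimicking the wbgetentities JSON shape (QIDs are placeholders).
ice = {
    "claims": {
        "P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q123"}}}}],
        "P2101": [{"mainsnak": {"datavalue": {"value": {"amount": "+0", "unit": "1"}}}}],
    }
}

print(extract_claims(ice, ["P31", "P2101", "P186"]))
# β†’ {'P31': 'Q123', 'P2101': '+0'}
```

Properties absent from the entity (P186 above) are simply skipped, so the verbalizer only ever sees facts that actually exist.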

---

## Quick Start

### 1 β€” Backend

```bash
# Clone and install
git clone https://github.com/DevaRajan8/Generative-vqa.git
cd Generative-vqa
pip install -r requirements_api.txt

# Set your Groq API key
cp .env.example .env
# Edit .env β†’ GROQ_API_KEY=your_key_here

# Start API
python backend_api.py
# β†’ http://localhost:8000
```

### 2 β€” Mobile UI

```bash
cd ui
npm install
npx expo start --clear
```

> Scan the QR code with Expo Go, or press `w` for browser.

---

## API

| Endpoint | Method | Description |
|---|---|---|
| `/api/answer` | POST | Answer a question about an uploaded image |
| `/api/health` | GET | Health check |
| `/api/conversation/new` | POST | Start a new conversation session |

**Example:**

```bash
curl -X POST http://localhost:8000/api/answer \
  -F "image=@photo.jpg" \
  -F "question=Can this melt?"
```

**Response:**

```json
{
  "answer": "ice",
  "model_used": "neuro-symbolic",
  "kg_enhancement": "Yes β€” ice can melt. [Wikidata P2101: melting point = 0.0 Β°C]",
  "knowledge_source": "VQA (neural) + Wikidata (symbolic) + Groq (verbalize)",
  "wikidata_entity": "Q86"
}
```
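
The same request can be made from Python with only the standard library, which means hand-building the multipart body; a sketch, assuming the backend from the Quick Start is running on port 8000 (the `multipart_form` helper is ours, not part of the repo):

```python
import mimetypes
import uuid

def multipart_form(fields: dict, files: dict) -> tuple[bytes, str]:
    """Build a multipart/form-data body and its Content-Type header value."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    for name, (filename, data) in files.items():
        ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\nContent-Type: {ctype}\r\n\r\n'.encode()
            + data + b"\r\n"
        )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

body, content_type = multipart_form(
    {"question": "Can this melt?"},
    {"image": ("photo.jpg", b"<jpeg bytes>")},  # placeholder bytes; read your file here
)

# To actually send it:
# import urllib.request
# req = urllib.request.Request("http://localhost:8000/api/answer", data=body,
#                              headers={"Content-Type": content_type})
# print(urllib.request.urlopen(req).read().decode())
```

If you already have `requests` installed, its `files=` parameter does all of this for you; the sketch just keeps the dependency footprint at zero.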

---

## Project Structure

```
β”œβ”€β”€ backend_api.py                  # FastAPI server
β”œβ”€β”€ ensemble_vqa_app.py             # VQA orchestrator (routing + inference)
β”œβ”€β”€ semantic_neurosymbolic_vqa.py   # Wikidata KB + Groq verbalizer
β”œβ”€β”€ groq_service.py                 # Groq accessibility descriptions
β”œβ”€β”€ conversation_manager.py         # Multi-turn conversation tracking
β”œβ”€β”€ model.py                        # VQA model definition
β”œβ”€β”€ train.py                        # Training pipeline
β”œβ”€β”€ ui/                             # Expo React Native app
β”‚   └── src/screens/HomeScreen.js
└── .github/
    β”œβ”€β”€ workflows/                  # CI β€” backend lint + UI build
    └── ISSUE_TEMPLATE/
```
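
Multi-turn context tracking of the kind `conversation_manager.py` provides can be reduced to a bounded per-session history that gets folded into each new prompt. A hypothetical minimal version (class and method names are ours, not the repo's):

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    history: list[tuple[str, str]] = field(default_factory=list)  # (question, answer)
    max_turns: int = 5  # keep the prompt bounded

    def add_turn(self, question: str, answer: str) -> None:
        self.history.append((question, answer))
        self.history = self.history[-self.max_turns:]  # drop the oldest turns

    def context_prompt(self, question: str) -> str:
        """Fold prior turns plus the new question into one prompt string."""
        lines = [f"Q: {q}\nA: {a}" for q, a in self.history]
        lines.append(f"Q: {question}")
        return "\n".join(lines)

conv = Conversation()
conv.add_turn("What is this?", "ice")
print(conv.context_prompt("Can it melt?"))
# prints:
# Q: What is this?
# A: ice
# Q: Can it melt?
```

The cap on `max_turns` is the important design choice: without it, long sessions silently inflate every downstream LLM call.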

---

## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `GROQ_API_KEY` | βœ… | Groq API key β€” [get one free](https://console.groq.com) |
| `MODEL_PATH` | optional | Path to VQA checkpoint (default: `vqa_checkpoint.pt`) |
| `PORT` | optional | API server port (default: `8000`) |
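
Resolving those variables with the documented defaults is a few lines; a sketch assuming plain environment lookups (the repo loads `.env` first, e.g. via python-dotenv, and `load_config` is our illustrative helper):

```python
def load_config(env: dict) -> dict:
    """Resolve settings from an environment mapping, applying documented defaults."""
    if not env.get("GROQ_API_KEY"):
        raise RuntimeError("GROQ_API_KEY is required; get one at https://console.groq.com")
    return {
        "groq_api_key": env["GROQ_API_KEY"],
        "model_path": env.get("MODEL_PATH", "vqa_checkpoint.pt"),  # default checkpoint
        "port": int(env.get("PORT", "8000")),                      # default API port
    }

# In real use you would pass os.environ; a literal dict keeps the example self-contained.
cfg = load_config({"GROQ_API_KEY": "gsk_example"})
print(cfg)
# β†’ {'groq_api_key': 'gsk_example', 'model_path': 'vqa_checkpoint.pt', 'port': 8000}
```

Failing fast on the missing key at startup beats a cryptic 500 on the first `/api/answer` call.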

---

## Requirements

- Python 3.10+
- CUDA GPU recommended (CPU works but is slow)
- Node.js 20+ (for UI)
- Groq API key (free tier available)

---

## License

MIT Β© [DevaRajan8](https://github.com/DevaRajan8)