---
title: VQA Backend
emoji: πŸš€
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# GenVQA β€” Generative Visual Question Answering

A neuro-symbolic VQA system that detects objects with a neural model, retrieves structured facts from Wikidata, and generates grounded answers with Groq.



## Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   CLIENT LAYER                              β”‚
β”‚   πŸ“± Expo Mobile App (React Native)                         β”‚
β”‚   β€’ Image upload + question input                           β”‚
β”‚   β€’ Displays answer + accessibility description             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚ HTTP POST /api/answer
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   BACKEND LAYER  (FastAPI)                  β”‚
β”‚   backend_api.py                                            β”‚
β”‚   β€’ Request handling, session management                    β”‚
β”‚   β€’ Conversation Manager β†’ multi-turn context tracking      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            ROUTING LAYER  (ensemble_vqa_app.py)             β”‚
β”‚                                                             β”‚
β”‚   CLIP encodes question β†’ compares against:                 β”‚
β”‚   "reasoning question" vs "visual/perceptual question"      β”‚
β”‚                                                             β”‚
β”‚         Reasoning?                 Visual?                  β”‚
β”‚             β”‚                          β”‚                    β”‚
β”‚             β–Ό                          β–Ό                    β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”‚
β”‚   β”‚ NEURO-SYMBOLIC  β”‚      β”‚   NEURAL VQA PATH   β”‚         β”‚
β”‚   β”‚                 β”‚      β”‚                     β”‚         β”‚
β”‚   β”‚ 1. VQA model    β”‚      β”‚  VQA model (GRU +   β”‚         β”‚
β”‚   β”‚    detects obj  β”‚      β”‚  Attention) predicts β”‚         β”‚
β”‚   β”‚                 β”‚      β”‚  answer directly     β”‚         β”‚
β”‚   β”‚ 2. Wikidata API β”‚      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β”‚
β”‚   β”‚    fetches factsβ”‚                 β”‚                    β”‚
β”‚   β”‚    (P31, P2101, β”‚                 β”‚                    β”‚
β”‚   β”‚     P2054, P186,β”‚                 β”‚                    β”‚
β”‚   β”‚     P366 ...)   β”‚                 β”‚                    β”‚
β”‚   β”‚                 β”‚                 β”‚                    β”‚
β”‚   β”‚ 3. Groq LLM     β”‚                 β”‚                    β”‚
β”‚   β”‚    verbalizes   β”‚                 β”‚                    β”‚
β”‚   β”‚    from facts   β”‚                 β”‚                    β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚                    β”‚
β”‚             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β”‚
└──────────────────────────  β”‚  β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   GROQ SERVICE  β”‚
                    β”‚  Accessibility  β”‚
                    β”‚  description    β”‚
                    β”‚  (2 sentences,  β”‚
                    β”‚  screen-reader  β”‚
                    β”‚  friendly)      β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
                      JSON response
                    { answer, model_used,
                      kg_enhancement,
                      wikidata_entity,
                      description }
```

| Layer | Component | Role |
|---|---|---|
| Client | Expo React Native | Image upload, question input, answer display |
| API | FastAPI (`backend_api.py`) | Routing, sessions, conversation state |
| Conversation | `conversation_manager.py` | Multi-turn context, history tracking |
| Router | CLIP (in `ensemble_vqa_app.py`) | Classifies question as reasoning vs visual |
| Neural VQA | GRU + Attention (`model.py`) | Answers visual questions directly from image |
| Neuro-Symbolic | `semantic_neurosymbolic_vqa.py` | VQA detects objects β†’ Wikidata fetches facts β†’ Groq verbalizes |
| Accessibility | `groq_service.py` | Generates spoken-friendly 2-sentence description for every answer |
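
The Router row is the fork in the diagram: CLIP embeds the incoming question and compares it against two route descriptions. A minimal sketch of that step, assuming the `openai/clip-vit-base-patch32` checkpoint via Hugging Face `transformers`; the actual prompts and decision logic in `ensemble_vqa_app.py` may differ:

```python
# Illustrative sketch of the CLIP-based question router; prompts are assumptions,
# the real router lives in ensemble_vqa_app.py.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

ROUTE_PROMPTS = [
    "a reasoning question about object properties or facts",  # -> neuro-symbolic path
    "a visual or perceptual question about the image",        # -> neural VQA path
]

def route_question(question: str) -> str:
    """Return 'neuro-symbolic' or 'neural' based on CLIP text similarity."""
    inputs = processor(text=[question] + ROUTE_PROMPTS, return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # unit vectors -> dot product = cosine similarity
    sims = emb[0] @ emb[1:].T                   # question vs each route prompt
    return "neuro-symbolic" if sims[0] > sims[1] else "neural"
```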

## Features

  • πŸ” Visual Question Answering β€” trained on VQAv2, fine-tuned on custom data
  • 🧠 Neuro-Symbolic Routing β€” CLIP semantically classifies questions as reasoning vs visual, routes accordingly
  • 🌐 Live Wikidata Facts β€” queries physical properties, categories, materials, uses in real time
  • πŸ€– Groq Verbalization β€” Llama 3.3 70B answers from structured facts, not hallucination
  • πŸ’¬ Conversational Support β€” multi-turn conversation manager with context tracking
  • πŸ“± Expo Mobile UI β€” React Native app for iOS/Android/Web
  • β™Ώ Accessibility β€” Groq generates spoken-friendly descriptions for every answer

## Quick Start

### 1 β€” Backend

```bash
# Clone and install
git clone https://github.com/DevaRajan8/Generative-vqa.git
cd Generative-vqa
pip install -r requirements_api.txt

# Set your Groq API key
cp .env.example .env
# Edit .env β†’ GROQ_API_KEY=your_key_here

# Start API
python backend_api.py
# β†’ http://localhost:8000
```

### 2 β€” Mobile UI

```bash
cd ui
npm install
npx expo start --clear
```

Scan the QR code with Expo Go, or press `w` to open it in the browser.


## API

| Endpoint | Method | Description |
|---|---|---|
| `/api/answer` | POST | Answer a question about an uploaded image |
| `/api/health` | GET | Health check |
| `/api/conversation/new` | POST | Start a new conversation session |

Example:

```bash
curl -X POST http://localhost:8000/api/answer \
  -F "image=@photo.jpg" \
  -F "question=Can this melt?"
```

Response:

```json
{
  "answer": "ice",
  "model_used": "neuro-symbolic",
  "kg_enhancement": "Yes β€” ice can melt. [Wikidata P2101: melting point = 0.0 Β°C]",
  "knowledge_source": "VQA (neural) + Wikidata (symbolic) + Groq (verbalize)",
  "wikidata_entity": "Q86"
}
```
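
The same call from Python, using the `requests` library with the field names from the curl example:

```python
# Minimal Python client for POST /api/answer (multipart form: "image" file + "question" field).
import requests

with open("photo.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/api/answer",
        files={"image": f},
        data={"question": "Can this melt?"},
        timeout=60,
    )
resp.raise_for_status()
result = resp.json()
print(result["answer"], "via", result["model_used"])
```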

## Project Structure

```
β”œβ”€β”€ backend_api.py                  # FastAPI server
β”œβ”€β”€ ensemble_vqa_app.py             # VQA orchestrator (routing + inference)
β”œβ”€β”€ semantic_neurosymbolic_vqa.py   # Wikidata KB + Groq verbalizer
β”œβ”€β”€ groq_service.py                 # Groq accessibility descriptions
β”œβ”€β”€ conversation_manager.py         # Multi-turn conversation tracking
β”œβ”€β”€ model.py                        # VQA model definition
β”œβ”€β”€ train.py                        # Training pipeline
β”œβ”€β”€ ui/                             # Expo React Native app
β”‚   └── src/screens/HomeScreen.js
└── .github/
    β”œβ”€β”€ workflows/                  # CI β€” backend lint + UI build
    └── ISSUE_TEMPLATE/
```

## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `GROQ_API_KEY` | βœ… | Groq API key (free tier available) |
| `MODEL_PATH` | optional | Path to VQA checkpoint (default: `vqa_checkpoint.pt`) |
| `PORT` | optional | API server port (default: `8000`) |
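
A sketch of how the backend might pick these up at startup, assuming `python-dotenv` since the Quick Start copies `.env.example` to `.env`; the actual loading code in `backend_api.py` may differ:

```python
# Hypothetical configuration loading; defaults mirror the table above.
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # picks up the .env created in the Quick Start

GROQ_API_KEY = os.environ["GROQ_API_KEY"]                  # required, no default
MODEL_PATH = os.getenv("MODEL_PATH", "vqa_checkpoint.pt")  # optional
PORT = int(os.getenv("PORT", "8000"))                      # optional
```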

## Requirements

  • Python 3.10+
  • CUDA GPU recommended (CPU works but is slow)
  • Node.js 20+ (for UI)
  • Groq API key (free tier available)

## License

MIT Β© DevaRajan8