---
title: VQA Backend
emoji: πŸš€
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# GenVQA β€” Generative Visual Question Answering

A neuro-symbolic VQA system that detects objects with a neural model, retrieves structured facts from Wikidata, and generates grounded answers with Groq.



## Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   CLIENT LAYER                              β”‚
β”‚   πŸ“± Expo Mobile App (React Native)                         β”‚
β”‚   β€’ Image upload + question input                           β”‚
β”‚   β€’ Displays answer + accessibility description             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚ HTTP POST /api/answer
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   BACKEND LAYER  (FastAPI)                  β”‚
β”‚   backend_api.py                                            β”‚
β”‚   β€’ Request handling, session management                    β”‚
β”‚   β€’ Conversation Manager β†’ multi-turn context tracking      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚
                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            ROUTING LAYER  (ensemble_vqa_app.py)             β”‚
β”‚                                                             β”‚
β”‚   CLIP encodes question β†’ compares against:                 β”‚
β”‚   "reasoning question" vs "visual/perceptual question"      β”‚
β”‚                                                             β”‚
β”‚         Reasoning?                 Visual?                  β”‚
β”‚             β”‚                          β”‚                    β”‚
β”‚             β–Ό                          β–Ό                    β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”‚
β”‚   β”‚ NEURO-SYMBOLIC  β”‚      β”‚   NEURAL VQA PATH   β”‚         β”‚
β”‚   β”‚                 β”‚      β”‚                     β”‚         β”‚
β”‚   β”‚ 1. VQA model    β”‚      β”‚  VQA model (GRU +   β”‚         β”‚
β”‚   β”‚    detects obj  β”‚      β”‚  Attention) predicts β”‚         β”‚
β”‚   β”‚                 β”‚      β”‚  answer directly     β”‚         β”‚
β”‚   β”‚ 2. Wikidata API β”‚      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β”‚
β”‚   β”‚    fetches factsβ”‚                 β”‚                    β”‚
β”‚   β”‚    (P31, P2101, β”‚                 β”‚                    β”‚
β”‚   β”‚     P2054, P186,β”‚                 β”‚                    β”‚
β”‚   β”‚     P366 ...)   β”‚                 β”‚                    β”‚
β”‚   β”‚                 β”‚                 β”‚                    β”‚
β”‚   β”‚ 3. Groq LLM     β”‚                 β”‚                    β”‚
β”‚   β”‚    verbalizes   β”‚                 β”‚                    β”‚
β”‚   β”‚    from facts   β”‚                 β”‚                    β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚                    β”‚
β”‚             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β”‚
└──────────────────────────  β”‚  β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   GROQ SERVICE  β”‚
                    β”‚  Accessibility  β”‚
                    β”‚  description    β”‚
                    β”‚  (2 sentences,  β”‚
                    β”‚  screen-reader  β”‚
                    β”‚  friendly)      β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                             β”‚
                             β–Ό
                      JSON response
                    { answer, model_used,
                      kg_enhancement,
                      wikidata_entity,
                      description }
```

| Layer | Component | Role |
|---|---|---|
| Client | Expo React Native | Image upload, question input, answer display |
| API | FastAPI (`backend_api.py`) | Routing, sessions, conversation state |
| Conversation | `conversation_manager.py` | Multi-turn context, history tracking |
| Router | CLIP (in `ensemble_vqa_app.py`) | Classifies question as reasoning vs visual |
| Neural VQA | GRU + Attention (`model.py`) | Answers visual questions directly from image |
| Neuro-Symbolic | `semantic_neurosymbolic_vqa.py` | VQA detects objects β†’ Wikidata fetches facts β†’ Groq verbalizes |
| Accessibility | `groq_service.py` | Generates spoken-friendly 2-sentence description for every answer |
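
The Router row is the fork in the diagram: CLIP embeds the incoming question and compares it against two route descriptions. A minimal sketch of that step, assuming the `openai/clip-vit-base-patch32` checkpoint via Hugging Face `transformers`; the actual prompts and decision logic in `ensemble_vqa_app.py` may differ:

```python
# Illustrative sketch of the CLIP-based question router; prompts are assumptions,
# the real router lives in ensemble_vqa_app.py.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

ROUTE_PROMPTS = [
    "a reasoning question about object properties or facts",  # -> neuro-symbolic path
    "a visual or perceptual question about the image",        # -> neural VQA path
]

def route_question(question: str) -> str:
    """Return 'neuro-symbolic' or 'neural' based on CLIP text similarity."""
    inputs = processor(text=[question] + ROUTE_PROMPTS, return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # unit vectors -> dot product = cosine similarity
    sims = emb[0] @ emb[1:].T                   # question vs each route prompt
    return "neuro-symbolic" if sims[0] > sims[1] else "neural"
```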

## Features

  • πŸ” Visual Question Answering β€” trained on VQAv2, fine-tuned on custom data
  • 🧠 Neuro-Symbolic Routing β€” CLIP semantically classifies questions as reasoning vs visual, routes accordingly
  • 🌐 Live Wikidata Facts β€” queries physical properties, categories, materials, uses in real time
  • πŸ€– Groq Verbalization β€” Llama 3.3 70B answers from structured facts, not hallucination
  • πŸ’¬ Conversational Support β€” multi-turn conversation manager with context tracking
  • πŸ“± Expo Mobile UI β€” React Native app for iOS/Android/Web
  • β™Ώ Accessibility β€” Groq generates spoken-friendly descriptions for every answer

## Quick Start

### 1 β€” Backend

```bash
# Clone and install
git clone https://github.com/DevaRajan8/Generative-vqa.git
cd Generative-vqa
pip install -r requirements_api.txt

# Set your Groq API key
cp .env.example .env
# Edit .env β†’ GROQ_API_KEY=your_key_here

# Start API
python backend_api.py
# β†’ http://localhost:8000
```

### 2 β€” Mobile UI

```bash
cd ui
npm install
npx expo start --clear
```

Scan the QR code with Expo Go, or press `w` to open it in the browser.


## API

| Endpoint | Method | Description |
|---|---|---|
| `/api/answer` | POST | Answer a question about an uploaded image |
| `/api/health` | GET | Health check |
| `/api/conversation/new` | POST | Start a new conversation session |

Example:

```bash
curl -X POST http://localhost:8000/api/answer \
  -F "image=@photo.jpg" \
  -F "question=Can this melt?"
```

Response:

```json
{
  "answer": "ice",
  "model_used": "neuro-symbolic",
  "kg_enhancement": "Yes β€” ice can melt. [Wikidata P2101: melting point = 0.0 Β°C]",
  "knowledge_source": "VQA (neural) + Wikidata (symbolic) + Groq (verbalize)",
  "wikidata_entity": "Q86"
}
```
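
The same call from Python, using the `requests` library with the field names from the curl example:

```python
# Minimal Python client for POST /api/answer (multipart form: "image" file + "question" field).
import requests

with open("photo.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/api/answer",
        files={"image": f},
        data={"question": "Can this melt?"},
        timeout=60,
    )
resp.raise_for_status()
result = resp.json()
print(result["answer"], "via", result["model_used"])
```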

## Project Structure

```
β”œβ”€β”€ backend_api.py                  # FastAPI server
β”œβ”€β”€ ensemble_vqa_app.py             # VQA orchestrator (routing + inference)
β”œβ”€β”€ semantic_neurosymbolic_vqa.py   # Wikidata KB + Groq verbalizer
β”œβ”€β”€ groq_service.py                 # Groq accessibility descriptions
β”œβ”€β”€ conversation_manager.py         # Multi-turn conversation tracking
β”œβ”€β”€ model.py                        # VQA model definition
β”œβ”€β”€ train.py                        # Training pipeline
β”œβ”€β”€ ui/                             # Expo React Native app
β”‚   └── src/screens/HomeScreen.js
└── .github/
    β”œβ”€β”€ workflows/                  # CI β€” backend lint + UI build
    └── ISSUE_TEMPLATE/
```

## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `GROQ_API_KEY` | βœ… | Groq API key (free tier available) |
| `MODEL_PATH` | optional | Path to VQA checkpoint (default: `vqa_checkpoint.pt`) |
| `PORT` | optional | API server port (default: `8000`) |
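
A sketch of how the backend might pick these up at startup, assuming `python-dotenv` since the Quick Start copies `.env.example` to `.env`; the actual loading code in `backend_api.py` may differ:

```python
# Hypothetical configuration loading; defaults mirror the table above.
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # picks up the .env created in the Quick Start

GROQ_API_KEY = os.environ["GROQ_API_KEY"]                  # required, no default
MODEL_PATH = os.getenv("MODEL_PATH", "vqa_checkpoint.pt")  # optional
PORT = int(os.getenv("PORT", "8000"))                      # optional
```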

## Requirements

  • Python 3.10+
  • CUDA GPU recommended (CPU works but is slow)
  • Node.js 20+ (for UI)
  • Groq API key (free tier available)

## License

MIT Β© DevaRajan8