---
title: VQA Backend
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
# GenVQA – Generative Visual Question Answering
A neuro-symbolic VQA system that detects objects with a neural model, retrieves structured facts from Wikidata, and generates grounded answers with Groq.
## Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                         CLIENT LAYER                         │
│              📱 Expo Mobile App (React Native)               │
│        • Image upload + question input                       │
│        • Displays answer + accessibility description         │
└─────────────────────────────┬────────────────────────────────┘
                              │ HTTP POST /api/answer
                              ▼
┌──────────────────────────────────────────────────────────────┐
│                    BACKEND LAYER (FastAPI)                   │
│                        backend_api.py                        │
│        • Request handling, session management                │
│        • Conversation Manager – multi-turn context tracking  │
└─────────────────────────────┬────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│              ROUTING LAYER (ensemble_vqa_app.py)             │
│                                                              │
│   CLIP encodes question → compares against:                  │
│   "reasoning question" vs "visual/perceptual question"       │
│                                                              │
│       Reasoning?                    Visual?                  │
│           │                            │                     │
│           ▼                            ▼                     │
│   ┌───────────────────┐   ┌───────────────────────┐          │
│   │  NEURO-SYMBOLIC   │   │   NEURAL VQA PATH     │          │
│   │                   │   │                       │          │
│   │ 1. VQA model      │   │  VQA model (GRU +     │          │
│   │    detects obj    │   │  Attention) predicts  │          │
│   │                   │   │  answer directly      │          │
│   │ 2. Wikidata API   │   └───────────┬───────────┘          │
│   │    fetches facts  │               │                      │
│   │    (P31, P2101,   │               │                      │
│   │     P2054, P186,  │               │                      │
│   │     P366 ...)     │               │                      │
│   │                   │               │                      │
│   │ 3. Groq LLM       │               │                      │
│   │    verbalizes     │               │                      │
│   │    from facts     │               │                      │
│   └─────────┬─────────┘               │                      │
│             └───────────────┬─────────┘                      │
└─────────────────────────────┼────────────────────────────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │  GROQ SERVICE   │
                    │  Accessibility  │
                    │  description    │
                    │  (2 sentences,  │
                    │  screen-reader  │
                    │  friendly)      │
                    └────────┬────────┘
                             │
                             ▼
                      JSON response
                      { answer, model_used,
                        kg_enhancement,
                        wikidata_entity,
                        description }
```
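The backend layer above boils down to a single multipart endpoint. Here is a minimal, illustrative sketch of that request shape; field names beyond `image` and `question` (e.g. the optional `session_id`) are assumptions, and the real handler lives in `backend_api.py`.

```python
# Minimal sketch of the /api/answer request shape (illustrative only;
# the actual implementation lives in backend_api.py).
from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

@app.post("/api/answer")
async def answer(
    image: UploadFile = File(...),        # uploaded photo
    question: str = Form(...),            # natural-language question
    session_id: str | None = Form(None),  # hypothetical multi-turn session field
):
    image_bytes = await image.read()
    # 1. Route the question (reasoning vs visual) via CLIP
    # 2. Run the neural VQA model or the neuro-symbolic path
    # 3. Generate the accessibility description with Groq
    return {
        "answer": "...",
        "model_used": "neuro-symbolic",
        "kg_enhancement": "...",
        "wikidata_entity": "...",
        "description": "...",
    }
```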
| Layer | Component | Role |
|---|---|---|
| Client | Expo React Native | Image upload, question input, answer display |
| API | FastAPI (`backend_api.py`) | Routing, sessions, conversation state |
| Conversation | `conversation_manager.py` | Multi-turn context, history tracking |
| Router | CLIP (in `ensemble_vqa_app.py`) | Classifies question as reasoning vs visual |
| Neural VQA | GRU + Attention (`model.py`) | Answers visual questions directly from image |
| Neuro-Symbolic | `semantic_neurosymbolic_vqa.py` | VQA detects objects → Wikidata fetches facts → Groq verbalizes |
| Accessibility | `groq_service.py` | Generates a spoken-friendly two-sentence description for every answer |
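To make the router concrete, the following is a hedged sketch of zero-shot question classification with CLIP's text encoder via Hugging Face `transformers`; the checkpoint and route prompts are assumptions, and the actual logic in `ensemble_vqa_app.py` may differ.

```python
# Sketch: route a question by comparing its CLIP text embedding
# against embeddings of the two route descriptions.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

ROUTES = [
    "a reasoning question about properties or facts",   # -> neuro-symbolic path
    "a visual question about what is in the image",     # -> neural VQA path
]

def route(question: str) -> str:
    inputs = tokenizer([question] + ROUTES, padding=True, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize for cosine similarity
    sims = emb[0] @ emb[1:].T                   # question vs each route prompt
    return "neuro-symbolic" if sims[0] > sims[1] else "neural"

print(route("Can this melt?"))  # expected: "neuro-symbolic"
```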
## Features
- 🔍 Visual Question Answering – trained on VQAv2, fine-tuned on custom data
- 🧠 Neuro-Symbolic Routing – CLIP semantically classifies questions as reasoning vs visual and routes accordingly
- 🌐 Live Wikidata Facts – queries physical properties, categories, materials, and uses in real time (see the sketch after this list)
- 🤖 Groq Verbalization – Llama 3.3 70B answers from structured facts rather than hallucinating
- 💬 Conversational Support – multi-turn conversation manager with context tracking
- 📱 Expo Mobile UI – React Native app for iOS/Android/Web
- ♿ Accessibility – Groq generates spoken-friendly descriptions for every answer
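For the live Wikidata feature, here is a minimal sketch of a fact lookup against the public MediaWiki API; the helper names are hypothetical, and the actual query logic in `semantic_neurosymbolic_vqa.py` may differ.

```python
# Sketch: resolve a detected object label to a Wikidata entity,
# then fetch claims for a property such as P2101 (melting point).
import requests

WD_API = "https://www.wikidata.org/w/api.php"

def entity_id(label: str) -> str:
    # wbsearchentities maps a free-text label to a Q-id
    r = requests.get(WD_API, params={
        "action": "wbsearchentities", "search": label,
        "language": "en", "format": "json"})
    return r.json()["search"][0]["id"]

def claims(qid: str, prop: str) -> list:
    # wbgetclaims returns the statements for one property of one entity
    r = requests.get(WD_API, params={
        "action": "wbgetclaims", "entity": qid,
        "property": prop, "format": "json"})
    return r.json().get("claims", {}).get(prop, [])

qid = entity_id("ice")
for claim in claims(qid, "P2101"):  # P2101 = melting point
    # Quantity values carry an "amount" string; unit resolution omitted here
    print(claim["mainsnak"]["datavalue"]["value"]["amount"])
```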
## Quick Start
### 1 – Backend
```bash
# Clone and install
git clone https://github.com/DevaRajan8/Generative-vqa.git
cd Generative-vqa
pip install -r requirements_api.txt

# Set your Groq API key
cp .env.example .env
# Edit .env → GROQ_API_KEY=your_key_here

# Start the API
python backend_api.py
# → http://localhost:8000
```
### 2 – Mobile UI
```bash
cd ui
npm install
npx expo start --clear
```

Scan the QR code with Expo Go, or press `w` to open it in the browser.
## API
| Endpoint | Method | Description |
|---|---|---|
| `/api/answer` | POST | Answer a question about an uploaded image |
| `/api/health` | GET | Health check |
| `/api/conversation/new` | POST | Start a new conversation session |
Example:

```bash
curl -X POST http://localhost:8000/api/answer \
  -F "image=@photo.jpg" \
  -F "question=Can this melt?"
```
Response:

```json
{
  "answer": "ice",
  "model_used": "neuro-symbolic",
  "kg_enhancement": "Yes – ice can melt. [Wikidata P2101: melting point = 0.0 °C]",
  "knowledge_source": "VQA (neural) + Wikidata (symbolic) + Groq (verbalize)",
  "wikidata_entity": "Q86"
}
```
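The same call from Python, for completeness (a hypothetical client; assumes the backend from Quick Start is running locally):

```python
# Post an image and a question to the local API, mirroring the curl example.
import requests

with open("photo.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/api/answer",
        files={"image": f},                     # multipart image upload
        data={"question": "Can this melt?"},    # form field
        timeout=60,
    )
print(resp.json()["answer"])  # e.g. "ice"
```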
## Project Structure
```
├── backend_api.py                 # FastAPI server
├── ensemble_vqa_app.py            # VQA orchestrator (routing + inference)
├── semantic_neurosymbolic_vqa.py  # Wikidata KB + Groq verbalizer
├── groq_service.py                # Groq accessibility descriptions
├── conversation_manager.py        # Multi-turn conversation tracking
├── model.py                       # VQA model definition
├── train.py                       # Training pipeline
├── ui/                            # Expo React Native app
│   └── src/screens/HomeScreen.js
└── .github/
    ├── workflows/                 # CI – backend lint + UI build
    └── ISSUE_TEMPLATE/
```
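As a rough idea of what `conversation_manager.py` tracks, a minimal multi-turn context store might look like the sketch below; the class and method names are hypothetical and the real module may be structured differently.

```python
# Hypothetical sketch of multi-turn conversation tracking.
import uuid
from dataclasses import dataclass, field

@dataclass
class Conversation:
    session_id: str
    history: list[tuple[str, str]] = field(default_factory=list)  # (question, answer)

class ConversationManager:
    def __init__(self) -> None:
        self._sessions: dict[str, Conversation] = {}

    def new_session(self) -> str:
        sid = uuid.uuid4().hex
        self._sessions[sid] = Conversation(session_id=sid)
        return sid

    def add_turn(self, sid: str, question: str, answer: str) -> None:
        self._sessions[sid].history.append((question, answer))

    def context(self, sid: str, last_n: int = 3) -> str:
        # Flatten the last few turns into a prompt prefix for follow-up questions
        turns = self._sessions[sid].history[-last_n:]
        return "\n".join(f"Q: {q}\nA: {a}" for q, a in turns)
```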
## Environment Variables
| Variable | Required | Description |
|---|---|---|
| `GROQ_API_KEY` | ✅ | Groq API key – get one free |
| `MODEL_PATH` | optional | Path to VQA checkpoint (default: `vqa_checkpoint.pt`) |
| `PORT` | optional | API server port (default: `8000`) |
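As a sketch of how `GROQ_API_KEY` gets used, here is an assumed version of the accessibility step built on the official `groq` Python SDK; the prompt and model name mirror the Features list, but the real logic in `groq_service.py` may differ.

```python
# Sketch: generate the two-sentence, screen-reader-friendly description.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def describe(answer: str, question: str) -> str:
    chat = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # assumed Groq model id for Llama 3.3 70B
        messages=[{
            "role": "user",
            "content": (f"Question: {question}\nAnswer: {answer}\n"
                        "Describe this for a screen reader in exactly two "
                        "short, friendly sentences."),
        }],
    )
    return chat.choices[0].message.content

print(describe("ice", "Can this melt?"))
```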
## Requirements
- Python 3.10+
- CUDA GPU recommended (CPU works but is slow)
- Node.js 20+ (for UI)
- Groq API key (free tier available)
## License

MIT © DevaRajan8