# GenVQA — Generative Visual Question Answering
A hybrid neuro-symbolic VQA system that intelligently routes between pure neural networks and knowledge-grounded reasoning
## Overview

GenVQA is an advanced Visual Question Answering system that combines the best of both worlds:

- Neural networks for perception-based visual questions
- Symbolic reasoning for knowledge-intensive questions

The system automatically classifies each incoming question and routes it to the optimal processing pipeline, ensuring accurate and grounded answers.
## System Architecture

```
┌──────────────────────────────────────────────────────┐
│                        CLIENT                        │
│       Expo React Native App (iOS/Android/Web)        │
│  • Image upload via camera/gallery                   │
│  • Question input with suggested prompts             │
│  • Multi-turn conversational interface               │
│  • Google OAuth authentication                       │
└──────────────────────────┬───────────────────────────┘
                           │ HTTP POST /api/answer
                           ▼
┌──────────────────────────────────────────────────────┐
│                  BACKEND API LAYER                   │
│               FastAPI (backend_api.py)               │
│  • Request handling & validation                     │
│  • Session management & authentication               │
│  • Multi-turn conversation tracking                  │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│              INTELLIGENT ROUTING LAYER               │
│                (ensemble_vqa_app.py)                 │
│                                                      │
│  CLIP Semantic Classifier:                           │
│    Encodes question → compares similarity:           │
│      "This is a reasoning question about facts"      │
│                          vs                          │
│      "This is a visual perception question"          │
│                                                      │
│  Similarity > threshold?                             │
└────────┬─────────────────┬────────────────┬──────────┘
         │                 │                │
         ▼                 ▼                ▼
     REASONING          VISUAL          SPATIAL

┌──────────────────────────────────────────────────────┐
│               NEURO-SYMBOLIC PIPELINE                │
│                   (REASONING path)                   │
│                                                      │
│  ① VQA model detects objects (e.g. "soup")           │
│  ② Wikidata API fetches facts:                       │
│       P31: category    P186: material                │
│       P2101: melting   P366: use   P2054: density    │
│  ③ Groq LLM verbalizes from facts                    │
│     (instead of free reasoning)                      │
│                                                      │
│  Outputs: "Soup is made of water and vegetables,     │
│            used for eating"                          │
└──────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────┐
│                   NEURAL VQA PATH                    │
│                    (VISUAL path)                     │
│                                                      │
│  CLIP + GRU + Attention                              │
│  Direct answer prediction from image features        │
│                                                      │
│  Outputs: "red"                                      │
└──────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────┐
│                 SPATIAL ADAPTER PATH                 │
│                    (SPATIAL path)                    │
│                                                      │
│  Enhanced with spatial self-attention for            │
│  left/right and above/below questions                │
│                                                      │
│  Outputs: "on the left"                              │
└──────────────────────────────────────────────────────┘

All three paths feed their answer into:

┌──────────────────────────────────────────────────────┐
│              GROQ ACCESSIBILITY SERVICE              │
│                                                      │
│  Generates a 2-sentence, screen-reader-friendly      │
│  description for every answer                        │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
                     JSON Response
```

```
{
  "answer": "...",
  "model_used": "neuro_symbolic|base|spatial",
  "confidence": 0.85,
  "kg_enhancement": true/false,
  "wikidata_entity": "Q123456",
  "description": "...",
  "session_id": "..."
}
```
## Neural vs Neuro-Symbolic: Deep Dive

### Neural Pathway
When Used: Perceptual questions about what's directly visible
- "What color is the car?"
- "How many people are in the image?"
- "Is the dog sitting or standing?"
Architecture:

```
Image Input
     │
     ▼
┌───────────────────────────────┐
│      CLIP Vision Encoder      │
│          (ViT-B/16)           │
│  • Pre-trained on 400M        │
│    image-text pairs           │
│  • 512-dim embeddings         │
└───────────────┬───────────────┘
                │
                ▼
        [512-dim image vector] ──────────────┐
                                             │
Question Input                               │
     │                                       │
     ▼                                       │
┌───────────────────────────────┐            │
│       GPT-2 Text Encoder      │            │
│         (distilgpt2)          │            │
│  • Contextual embeddings      │            │
│  • 768-dim output             │            │
└───────────────┬───────────────┘            │
                │                            │
                ▼                            │
        [768-dim vector]                     │
                │                            │
                ▼                            │
        ┌───────────────┐                    │
        │  Linear Proj  │                    │
        │   768 → 512   │                    │
        └───────┬───────┘                    │
                │                            │
                └─────────────┬──────────────┘
                              │
                              ▼
                   ┌──────────────────────┐
                   │  Multimodal Fusion   │
                   │  • Gated combination │
                   │  • 3-layer MLP       │
                   │  • ReLU + Dropout    │
                   └──────────┬───────────┘
                              │
                              ▼
                   ┌──────────────────────┐
                   │  GRU Decoder with    │
                   │  Attention Mechanism │
                   │                      │
                   │  • Hidden: 512-dim   │
                   │  • 2 layers          │
                   │  • Seq2seq decoding  │
                   │  • Attention over    │
                   │    fused features    │
                   └──────────┬───────────┘
                              │
                              ▼
                        Answer Tokens
                          "red car"
```
Key Components:
- CLIP: Zero-shot image understanding, robust to domain shift
- GPT-2: Contextual question encoding
- Attention: Decoder focuses on relevant image regions per word
- GRU: Sequential answer generation with memory
Training:
- Dataset: VQA v2 (curated, balanced subset)
- Loss: Cross-entropy over answer vocabulary
- Fine-tuning: Last 2 CLIP layers + full decoder
- Accuracy: ~39% on general VQA, ~28% on spatial questions
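The gated fusion step above can be sketched in a few lines. This is a minimal NumPy illustration of the idea only; the real model is a trained 3-layer MLP, and the weights and gating form here are placeholder assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(img_feat, txt_feat, w_gate):
    """Combine 512-dim image and text features with a learned gate.

    The gate decides, per dimension, how much to rely on the image
    versus the question. w_gate stands in for trained weights.
    """
    gate = sigmoid(np.concatenate([img_feat, txt_feat]) @ w_gate)  # (512,)
    return gate * img_feat + (1.0 - gate) * txt_feat

rng = np.random.default_rng(0)
img = rng.normal(size=512)            # CLIP image embedding (placeholder)
txt = rng.normal(size=512)            # projected GPT-2 question embedding
w = rng.normal(size=(1024, 512)) * 0.01
fused = gated_fusion(img, txt, w)
print(fused.shape)  # (512,)
```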
### Neuro-Symbolic Pathway (Knowledge-Grounded Reasoning)
When Used: Questions requiring external knowledge or reasoning
- "Can soup melt?"
- "What is ice cream made of?"
- "Does this float in water?"
Architecture:

```
Step 1: NEURAL DETECTION
────────────────────────
Image + Question
        │
        ▼
┌──────────────────────┐
│      VQA Model       │
│   (same as above)    │
│                      │
│  Predicts: "soup"    │
└──────────┬───────────┘
           │
           ▼
    Detected Object
        "soup"
```
```
Step 2: SYMBOLIC FACT RETRIEVAL
───────────────────────────────
"soup"
   │
   ▼
┌──────────────────────────────────────┐
│       Wikidata SPARQL Queries        │
│                                      │
│  ① Entity Resolution:                │
│     "soup" → Q41415 (Wikidata ID)    │
│                                      │
│  ② Fetch ALL Relevant Properties:    │
│                                      │
│  P31 (instance of):                  │
│    → "food"                          │
│    → "liquid food"                   │
│    → "dish"                          │
│                                      │
│  P186 (made of):                     │
│    → "water"                         │
│    → "vegetables"                    │
│    → "broth"                         │
│                                      │
│  P366 (used for):                    │
│    → "consumption"                   │
│    → "nutrition"                     │
│                                      │
│  P2101 (melting point):              │
│    → (not found)                     │
│                                      │
│  P2054 (density):                    │
│    → ~1000 kg/m³                     │
│      (floats/sinks calc)             │
│                                      │
│  P2777 (flash point):                │
│    → (not found)                     │
└──────────────────┬───────────────────┘
                   │
                   ▼
      Structured Knowledge Graph
```

```json
{
  "entity": "soup (Q41415)",
  "categories": ["food", "liquid"],
  "materials": ["water", "vegetables"],
  "uses": ["consumption"],
  "density": 1000,
  "melting_point": null
}
```
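The Step 2 retrieval can be sketched as follows. This is a hedged illustration (the function name, property map, and query shape are assumptions, not the project's actual code) using Wikidata's standard `wd:`/`wdt:` SPARQL prefixes:

```python
# Illustrative sketch of the fact-retrieval step; the property codes
# match the pipeline's, but the code shape is hypothetical.
PROPERTIES = {
    "P31": "category",
    "P186": "material",
    "P366": "use",
    "P2101": "melting_point",
    "P2054": "density",
    "P2777": "flash_point",
}

def build_sparql(entity_id: str, prop: str) -> str:
    """SPARQL query fetching one property's values with English labels."""
    return (
        "SELECT ?valLabel WHERE { "
        f"wd:{entity_id} wdt:{prop} ?val . "
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } '
        "}"
    )

query = build_sparql("Q41415", "P186")  # "soup" → made of
print(query)
```

The resulting query string would be sent to the Wikidata Query Service endpoint; one query per property keeps each fetch simple and cacheable.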
```
Step 3: LLM VERBALIZATION (NOT REASONING!)
──────────────────────────────────────────
Knowledge Graph
       │
       ▼
┌────────────────────────────────────┐
│              Groq API              │
│          (Llama 3.3 70B)           │
│                                    │
│  System Prompt:                    │
│    "You are a fact verbalizer.     │
│     Answer ONLY from provided      │
│     Wikidata facts. Do NOT use     │
│     your training knowledge.       │
│     If facts don't contain the     │
│     answer, say 'unknown from      │
│     available data'."              │
│                                    │
│  User Input:                       │
│    Question: "Can soup melt?"      │
│    Facts: {structured data above}  │
└─────────────────┬──────────────────┘
                  │
                  ▼
       Natural Language Answer

"According to Wikidata, soup is
 a liquid food made of water and
 vegetables. Since it's already
 liquid, it doesn't have a melting
 point like solids do. It can
 freeze, but not melt."
```
Critical Design Principle:
Groq is a verbalizer, NOT a reasoner. All reasoning happens in the symbolic layer (Wikidata facts). Groq only translates structured facts into natural language.
Why This Matters:
- Without facts: Groq hallucinates from training data
- With facts: Groq grounds answers in real-time data
- Result: Factual accuracy, no made-up information
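A sketch of how the verbalizer prompt might be assembled before the Groq call; the exact function name and message packaging here are illustrative assumptions, though the system prompt text matches the one shown above:

```python
import json

SYSTEM_PROMPT = (
    "You are a fact verbalizer. Answer ONLY from provided Wikidata facts. "
    "Do NOT use your training knowledge. If facts don't contain the "
    "answer, say 'unknown from available data'."
)

def build_verbalizer_messages(question: str, facts: dict) -> list:
    """Package the question plus structured facts as chat messages."""
    user_content = f"Question: {question}\nFacts: {json.dumps(facts)}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]

messages = build_verbalizer_messages(
    "Can soup melt?",
    {"entity": "soup (Q41415)", "categories": ["food", "liquid"],
     "melting_point": None},
)
print(messages[0]["role"])  # system
```

Because the facts travel inside the user message, the LLM only ever paraphrases structured data it was handed, which is what keeps the answer grounded.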
Knowledge Base Properties Fetched:
| Property | Wikidata Code | Example Value |
|---|---|---|
| Category | P31 | "food", "tool", "animal" |
| Material | P186 | "metal", "wood", "plastic" |
| Melting Point | P2101 | 273.15 K (0°C) |
| Density | P2054 | 917 kg/m³ (floats/sinks) |
| Use | P366 | "eating", "transportation" |
| Flash Point | P2777 | 310 K (flammable) |
| Location | P276 | "ocean", "forest" |
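The density property drives the floats/sinks check mentioned in the table; a minimal sketch (the helper name and the 1000 kg/m³ freshwater threshold are assumptions for illustration):

```python
WATER_DENSITY = 1000.0  # kg/m³, fresh water

def floats_in_water(density_kg_m3: float) -> bool:
    """An object floats when its density is below that of water."""
    return density_kg_m3 < WATER_DENSITY

print(floats_in_water(917))   # ice → True
print(floats_in_water(7874))  # iron → False
```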
### Spatial Reasoning Pathway
When Used: Questions about relative positions
- "What is to the left of the car?"
- "Is the cat above or below the table?"
Architecture Enhancement:

```
Base VQA Model
      │
      ▼
┌──────────────────────────────┐
│    Spatial Self-Attention    │
│  • Multi-head attention (8)  │
│  • Learns spatial relations  │
│  • Position-aware weighting  │
└──────────────┬───────────────┘
               │
               ▼
     Spatial-aware answer
      "on the left side"
```
Keyword Triggering:
- Detects: `left`, `right`, `above`, `below`, `top`, `bottom`, `next to`, `behind`, `between`, etc.
- Routes to the spatial adapter model
- Enhanced accuracy on positional questions
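The keyword trigger can be sketched as a simple substring check; the keyword list matches the one above, while the function name is illustrative:

```python
SPATIAL_KEYWORDS = (
    "left", "right", "above", "below", "top", "bottom",
    "next to", "behind", "between",
)

def contains_spatial_keywords(question: str) -> bool:
    """True when the question mentions a relative-position word."""
    q = question.lower()
    return any(kw in q for kw in SPATIAL_KEYWORDS)

print(contains_spatial_keywords("Is the cat above the table?"))  # True
print(contains_spatial_keywords("What color is the car?"))       # False
```

A substring check is deliberately cheap; it runs on every request before any model is loaded.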
## Intelligent Routing System

CLIP-Based Semantic Routing:

```python
# Encode the question with CLIP's text encoder
question_embedding = clip.encode_text(question)

# Compare against two routing templates
reasoning_prompt = "This is a reasoning question about facts and knowledge"
visual_prompt = "This is a visual perception question about what you see"

reasoning_similarity = cosine_similarity(
    question_embedding, clip.encode_text(reasoning_prompt))
visual_similarity = cosine_similarity(
    question_embedding, clip.encode_text(visual_prompt))

# Route decision
if reasoning_similarity > visual_similarity + THRESHOLD:
    route_to_neuro_symbolic()
elif contains_spatial_keywords(question):
    route_to_spatial_adapter()
else:
    route_to_base_neural()
```
Routing Logic:
- Neuro-Symbolic if CLIP classifies the question as reasoning (similarity > 0.6)
- Spatial if the question contains spatial keywords (`left`, `right`, `above`, etc.)
- Base Neural for all other visual perception questions
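The `cosine_similarity` helper used in the routing snippet is standard; for completeness, a small NumPy version (illustrative, not the project's exact implementation):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cosine_similarity(np.array([1.0, 0.0]),
                              np.array([1.0, 1.0])), 3))  # 0.707
```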
## Multi-Turn Conversation Support

Conversation Manager Features:
- Session tracking with UUID
- Context retention across turns
- Pronoun resolution (`it`, `this`, `that` → previous object)
- Automatic session expiry (30 min timeout)
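A minimal sketch of the pronoun-resolution idea; the function name and tokenization are illustrative assumptions, and the real manager additionally handles UUID sessions and expiry:

```python
import re

PRONOUNS = {"it", "this", "that"}

def resolve_pronouns(question: str, last_objects: list) -> str:
    """Replace a bare pronoun with the most recent detected object."""
    if not last_objects:
        return question
    target = last_objects[-1]  # e.g. "car" from the previous turn
    words = [
        target if w.lower() in PRONOUNS else w
        for w in re.findall(r"\w+|\S", question)
    ]
    return " ".join(words)

print(resolve_pronouns("Can it float?", ["car"]))  # Can car float ?
```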
Example Conversation:

```
Turn 1:
  User: "What is this?"
  VQA:  "A red car"
  Objects: ["car"]

Turn 2:
  User: "Can it float?"            # "it" = "car"
  System: resolves "it" → "car"
  VQA:  [Neuro-Symbolic] "According to Wikidata, cars are made
        of metal and plastic with density around 800-1000 kg/m³,
        which is close to water. Most cars would sink."

Turn 3:
  User: "What color is it again?"  # still referring to the car
  VQA:  [Neural] "red"             # from Turn 1 context
```
## Quick Start

### Prerequisites

- Python 3.10+
- CUDA GPU (recommended, 4GB+ VRAM)
- Node.js 16+ (for the mobile UI)
- Groq API key (get one free)
### Backend Setup

```bash
# 1. Clone repository
git clone https://github.com/YourUsername/vqa_coes.git
cd vqa_coes

# 2. Install dependencies
pip install -r requirements_api.txt

# 3. Set environment variables
echo "GROQ_API_KEY=your_groq_api_key_here" > .env

# 4. Download model checkpoints (if not included)
# Ensure these files exist in the project root:
#   - vqa_checkpoint.pt (base model)
#   - vqa_spatial_checkpoint.pt (spatial model)

# 5. Start API server
python backend_api.py
# Server will start at http://localhost:8000
```
### Mobile UI Setup

```bash
# 1. Navigate to the UI folder
cd ui

# 2. Install dependencies
npm install

# 3. Configure the API endpoint
# Edit ui/src/config/api.js:
#   export const API_BASE_URL = 'http://YOUR_LOCAL_IP:8000';

# 4. Start Expo
npx expo start --clear
# Scan the QR code with the Expo Go app, or press 'w' for web
```
## API Reference

### POST /api/answer

Answer a visual question with optional conversation context.
Request:

```bash
curl -X POST http://localhost:8000/api/answer \
  -F "image=@photo.jpg" \
  -F "question=Can this float in water?" \
  -F "session_id=optional-uuid-here"
```
Response:

```json
{
  "answer": "According to Wikidata, this object has a density of 917 kg/m³, which is less than water (1000 kg/m³), so it would float.",
  "model_used": "neuro_symbolic",
  "confidence": 0.87,
  "kg_enhancement": true,
  "wikidata_entity": "Q41576",
  "description": "The object appears to be made of ice. Based on its physical properties from scientific data, it would float on water due to lower density.",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "conversation_turn": 2
}
```
## License
MIT License - see LICENSE file for details
---