<div align="center">

# GenVQA: Generative Visual Question Answering

**A hybrid neuro-symbolic VQA system that intelligently routes between pure neural networks and knowledge-grounded reasoning**

</div>
---

## Overview

GenVQA is an advanced Visual Question Answering system that combines the best of both worlds:

- **Neural networks** for perception-based visual questions
- **Symbolic reasoning** for knowledge-intensive reasoning questions

The system automatically classifies incoming questions and routes them to the optimal processing pipeline, ensuring accurate and grounded answers.

---

## System Architecture
```
┌────────────────────────────────────────────────────────────────────┐
│                              CLIENT                                │
│               Expo React Native App (iOS/Android/Web)              │
│   • Image upload via camera/gallery                                │
│   • Question input with suggested prompts                          │
│   • Multi-turn conversational interface                            │
│   • Google OAuth authentication                                    │
└────────────────────────────────┬───────────────────────────────────┘
                                 │ HTTP POST /api/answer
                                 ▼
┌────────────────────────────────────────────────────────────────────┐
│                         BACKEND API LAYER                          │
│                      FastAPI (backend_api.py)                      │
│   • Request handling & validation                                  │
│   • Session management & authentication                            │
│   • Multi-turn conversation tracking                               │
└────────────────────────────────┬───────────────────────────────────┘
                                 │
                                 ▼
┌────────────────────────────────────────────────────────────────────┐
│                     INTELLIGENT ROUTING LAYER                      │
│                       (ensemble_vqa_app.py)                        │
│                                                                    │
│  CLIP Semantic Classifier:                                         │
│    Encodes question → compares similarity against:                 │
│      "This is a reasoning question about facts"                    │
│                           vs                                       │
│      "This is a visual perception question"                        │
│                                                                    │
│                   Similarity > threshold?                          │
│                                                                    │
│          ┌───────────────┬───────────────┐                         │
│          │               │               │                         │
│      REASONING         VISUAL         SPATIAL                      │
│          │               │               │                         │
└──────────┼───────────────┼───────────────┼─────────────────────────┘
           │               │               │
           ▼               ▼               ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│  NEURO-SYMBOLIC  │ │ NEURAL VQA PATH  │ │ SPATIAL ADAPTER  │
│     PIPELINE     │ │                  │ │       PATH       │
│                  │ │  CLIP + GRU +    │ │                  │
│ ① VQA Model      │ │  Attention       │ │  Enhanced with   │
│   detects        │ │                  │ │  spatial         │
│   objects        │ │  Direct answer   │ │  self-attention  │
│   (e.g. "soup")  │ │  prediction from │ │  for left/right, │
│                  │ │  image features  │ │  above/below     │
│ ② Wikidata API   │ │                  │ │  questions       │
│   fetches facts: │ │  Outputs:        │ │                  │
│   P31: category  │ │    "red"         │ │  Outputs:        │
│   P186: material │ └────────┬─────────┘ │  "on the left"   │
│   P2101: melting │          │           └────────┬─────────┘
│   P366: use      │          │                    │
│   P2054: density │          │                    │
│                  │          │                    │
│ ③ Groq LLM       │          │                    │
│   verbalizes     │          │                    │
│   from facts     │          │                    │
│   (instead of    │          │                    │
│   free reasoning)│          │                    │
│                  │          │                    │
│ Outputs:         │          │                    │
│  "Soup is made   │          │                    │
│   of water and   │          │                    │
│   vegetables,    │          │                    │
│   used for       │          │                    │
│   eating"        │          │                    │
└────────┬─────────┘          │                    │
         │                    │                    │
         └─────────┬──────────┴────────────────────┘
                   ▼
        ┌──────────────────────┐
        │  GROQ ACCESSIBILITY  │
        │       SERVICE        │
        │                      │
        │ Generates 2-sentence │
        │ screen-reader        │
        │ friendly description │
        │ for every answer     │
        └──────────┬───────────┘
                   │
                   ▼
             JSON Response
             {
               "answer": "...",
               "model_used": "neuro_symbolic|base|spatial",
               "confidence": 0.85,
               "kg_enhancement": true/false,
               "wikidata_entity": "Q123456",
               "description": "...",
               "session_id": "..."
             }
```
---

## Neural vs Neuro-Symbolic: Deep Dive

### Neural Pathway

**When Used**: Perceptual questions about what's directly visible

- _"What color is the car?"_
- _"How many people are in the image?"_
- _"Is the dog sitting or standing?"_

**Architecture**:
```
Image Input                            Question Input
     │                                      │
     ▼                                      ▼
┌───────────────────────────┐  ┌───────────────────────────┐
│    CLIP Vision Encoder    │  │    GPT-2 Text Encoder     │
│        (ViT-B/16)         │  │       (distilgpt2)        │
│  • Pre-trained on 400M    │  │  • Contextual embeddings  │
│    image-text pairs       │  │  • 768-dim output         │
│  • 512-dim embeddings     │  └─────────────┬─────────────┘
└─────────────┬─────────────┘                │
              │                              ▼
              │                      [768-dim vector]
              │                              │
              │                              ▼
              │                      ┌───────────────┐
              │                      │  Linear Proj  │
              │                      │   768 → 512   │
              │                      └───────┬───────┘
              ▼                              │
      [512-dim vector]                       │
              │                              │
              └──────────────┬───────────────┘
                             ▼
                 ┌──────────────────────┐
                 │  Multimodal Fusion   │
                 │  • Gated combination │
                 │  • 3-layer MLP       │
                 │  • ReLU + Dropout    │
                 └──────────┬───────────┘
                            │
                            ▼
                 ┌──────────────────────┐
                 │  GRU Decoder with    │
                 │  Attention Mechanism │
                 │                      │
                 │  • Hidden: 512-dim   │
                 │  • 2 layers          │
                 │  • Seq2seq decoding  │
                 │  • Attention over    │
                 │    fused features    │
                 └──────────┬───────────┘
                            │
                            ▼
                      Answer Tokens
                        "red car"
```
**Key Components**:

- **CLIP**: Zero-shot image understanding, robust to domain shift
- **GPT-2**: Contextual question encoding
- **Attention**: Decoder focuses on relevant image regions per word
- **GRU**: Sequential answer generation with memory

**Training**:

- Dataset: VQA v2 (curated, balanced subset)
- Loss: Cross-entropy over answer vocabulary
- Fine-tuning: Last 2 CLIP layers + full decoder
- Accuracy: ~39% on general VQA, ~28% on spatial questions
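The gated combination in the fusion block can be sketched as follows. This is a minimal NumPy illustration of the idea, not the project's actual implementation; the weight shapes and the sigmoid gate are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Projected embeddings: both modalities live in a 512-dim space.
img = rng.standard_normal(512)   # CLIP image embedding
txt = rng.standard_normal(512)   # question embedding after the 768 -> 512 projection

# Gated combination: a learned gate decides, per dimension, how much of
# each modality to keep. W_g and b_g would be trained parameters in the
# real model; here they are random placeholders.
W_g = rng.standard_normal((512, 1024)) * 0.01
b_g = np.zeros(512)
gate = sigmoid(W_g @ np.concatenate([img, txt]) + b_g)

fused = gate * img + (1.0 - gate) * txt   # 512-dim fused feature
print(fused.shape)  # (512,)
```

In the actual model this fused vector would then pass through the 3-layer MLP before reaching the GRU decoder.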
---

### Neuro-Symbolic Pathway (Knowledge-Grounded Reasoning)

**When Used**: Questions requiring external knowledge or reasoning

- _"Can soup melt?"_
- _"What is ice cream made of?"_
- _"Does this float in water?"_

**Architecture**:
```
Step 1: NEURAL DETECTION
────────────────────────
Image + Question
       │
       ▼
┌──────────────────────┐
│      VQA Model       │
│   (same as above)    │
│                      │
│   Predicts: "soup"   │
└──────────┬───────────┘
           │
           ▼
    Detected Object
        "soup"

Step 2: SYMBOLIC FACT RETRIEVAL
───────────────────────────────
"soup"
   │
   ▼
┌───────────────────────────────────┐
│     Wikidata SPARQL Queries       │
│                                   │
│ ① Entity Resolution:              │
│   "soup" → Q41415 (Wikidata ID)   │
│                                   │
│ ② Fetch ALL Relevant Properties:  │
│                                   │
│   P31 (instance of):              │
│     → "food"                      │
│     → "liquid food"               │
│     → "dish"                      │
│                                   │
│   P186 (made of):                 │
│     → "water"                     │
│     → "vegetables"                │
│     → "broth"                     │
│                                   │
│   P366 (used for):                │
│     → "consumption"               │
│     → "nutrition"                 │
│                                   │
│   P2101 (melting point):          │
│     → (not found)                 │
│                                   │
│   P2054 (density):                │
│     → ~1000 kg/m³                 │
│     → (floats/sinks calc)         │
│                                   │
│   P2777 (flash point):            │
│     → (not found)                 │
└────────────────┬──────────────────┘
                 │
                 ▼
    Structured Knowledge Graph
    {
      "entity": "soup (Q41415)",
      "categories": ["food", "liquid"],
      "materials": ["water", "vegetables"],
      "uses": ["consumption"],
      "density": 1000,
      "melting_point": null
    }

Step 3: LLM VERBALIZATION (NOT REASONING!)
──────────────────────────────────────────
Knowledge Graph
       │
       ▼
┌────────────────────────────────────┐
│             Groq API               │
│          (Llama 3.3 70B)           │
│                                    │
│  System Prompt:                    │
│   "You are a fact verbalizer.      │
│    Answer ONLY from provided       │
│    Wikidata facts. Do NOT use      │
│    your training knowledge.        │
│    If facts don't contain the      │
│    answer, say 'unknown from       │
│    available data'."               │
│                                    │
│  User Input:                       │
│   Question: "Can soup melt?"       │
│   Facts: {structured data above}   │
└──────────────────┬─────────────────┘
                   │
                   ▼
      Natural Language Answer
      "According to Wikidata, soup is
       a liquid food made of water and
       vegetables. Since it's already
       liquid, it doesn't have a melting
       point like solids do. It can
       freeze, but not melt."
```
**Critical Design Principle**:

> Groq is a **verbalizer**, NOT a reasoner. All reasoning happens in the symbolic layer (Wikidata facts). Groq only translates structured facts into natural language.

**Why This Matters**:

- **Without facts**: Groq hallucinates from training data
- **With facts**: Groq grounds answers in real-time data
- **Result**: Factual accuracy, no made-up information
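As a hedged sketch, the verbalizer call might be assembled like this. The function and variable names below are illustrative, not the project's actual code; the chat payload shape follows the standard Groq chat-completions API, but the exact model string is an assumption:

```python
def build_verbalizer_messages(question: str, facts: dict) -> list[dict]:
    """Build a chat payload that restricts the LLM to the supplied facts."""
    system = (
        "You are a fact verbalizer. Answer ONLY from provided Wikidata facts. "
        "Do NOT use your training knowledge. If facts don't contain the "
        "answer, say 'unknown from available data'."
    )
    user = f"Question: {question}\nFacts: {facts}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_verbalizer_messages(
    "Can soup melt?",
    {"entity": "soup (Q41415)", "materials": ["water", "vegetables"]},
)

# The call itself (sketch; requires GROQ_API_KEY, and the model name
# is an assumption):
# from groq import Groq
# answer = Groq().chat.completions.create(
#     model="llama-3.3-70b-versatile", messages=messages
# ).choices[0].message.content
```

Keeping the facts in the user message (rather than interleaved with instructions) makes it easy to log exactly which structured data grounded each answer.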
**Knowledge Base Properties Fetched**:

| Property | Wikidata Code | Example Value |
|----------|---------------|---------------|
| Category | P31 | "food", "tool", "animal" |
| Material | P186 | "metal", "wood", "plastic" |
| Melting Point | P2101 | 273.15 K (0°C) |
| Density | P2054 | 917 kg/m³ (floats/sinks) |
| Use | P366 | "eating", "transportation" |
| Flash Point | P2777 | 310 K (flammable) |
| Location | P276 | "ocean", "forest" |
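A minimal sketch of how one of these property lookups could be issued against the public Wikidata SPARQL endpoint. The helper below only builds the query string (the `wd:`/`wdt:` prefixes and label service are standard Wikidata SPARQL conventions); actually sending it, shown commented out, would need `requests` and network access, and the helper names are illustrative rather than the project's own:

```python
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def build_property_query(entity_id: str, property_id: str) -> str:
    """SPARQL query fetching human-readable values of one property."""
    return f"""
    SELECT ?valueLabel WHERE {{
      wd:{entity_id} wdt:{property_id} ?value .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """

# Example: what is soup (Q41415) made of (P186)?
query = build_property_query("Q41415", "P186")

# Sending it (sketch, requires network):
# import requests
# rows = requests.get(
#     WIKIDATA_SPARQL,
#     params={"query": query, "format": "json"},
#     headers={"User-Agent": "GenVQA-demo"},
# ).json()["results"]["bindings"]
```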
---

### Spatial Reasoning Pathway

**When Used**: Questions about relative positions

- _"What is to the left of the car?"_
- _"Is the cat above or below the table?"_

**Architecture Enhancement**:
```
Base VQA Model
      │
      ▼
┌───────────────────────────────┐
│    Spatial Self-Attention     │
│  • Multi-head attention (8)   │
│  • Learns spatial relations   │
│  • Position-aware weighting   │
└───────────────┬───────────────┘
                │
                ▼
      Spatial-aware answer
       "on the left side"
```
**Keyword Triggering**:

- Detects: `left`, `right`, `above`, `below`, `top`, `bottom`, `next to`, `behind`, `between`, etc.
- Routes to spatial adapter model
- Enhanced accuracy on positional questions
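The keyword trigger can be sketched as a simple function (the keyword list comes from the bullet above; the function name mirrors the routing pseudocode later in this README, but this body is an illustration, not the project's code):

```python
import re

# Spatial relation keywords from the trigger list above.
SPATIAL_KEYWORDS = [
    "left", "right", "above", "below", "top", "bottom",
    "next to", "behind", "between",
]

def contains_spatial_keywords(question: str) -> bool:
    """True if the question mentions any spatial relation keyword."""
    q = question.lower()
    # Word boundaries avoid false hits like "leftover" matching "left".
    return any(re.search(rf"\b{re.escape(kw)}\b", q) for kw in SPATIAL_KEYWORDS)

print(contains_spatial_keywords("What is to the left of the car?"))   # True
print(contains_spatial_keywords("What color is the leftover paint?")) # False
```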
---

## Intelligent Routing System

**CLIP-Based Semantic Routing**:
```python
# Encode question with CLIP
question_embedding = clip.encode_text(question)

# Compare against two routing templates
reasoning_prompt = "This is a reasoning question about facts and knowledge"
visual_prompt = "This is a visual perception question about what you see"

reasoning_similarity = cosine_similarity(question_embedding,
                                         clip.encode_text(reasoning_prompt))
visual_similarity = cosine_similarity(question_embedding,
                                      clip.encode_text(visual_prompt))

# Route decision
if reasoning_similarity > visual_similarity + THRESHOLD:
    route_to_neuro_symbolic()
elif contains_spatial_keywords(question):
    route_to_spatial_adapter()
else:
    route_to_base_neural()
```
**Routing Logic**:

1. **Neuro-Symbolic** if CLIP classifies the question as reasoning (>0.6 similarity)
2. **Spatial** if the question contains spatial keywords (`left`, `right`, `above`, etc.)
3. **Base Neural** for all other visual perception questions
---

## Multi-Turn Conversation Support

**Conversation Manager Features**:

- Session tracking with UUID
- Context retention across turns
- Pronoun resolution (`it`, `this`, `that` → previous object)
- Automatic session expiry (30 min timeout)
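These features can be sketched in a few lines. This is an illustrative toy, not the project's actual conversation manager; the 30-minute timeout and pronoun set come from the feature list above, and all names are placeholders:

```python
import re
import time
import uuid

SESSION_TIMEOUT_S = 30 * 60  # 30-minute expiry from the feature list

class Session:
    def __init__(self):
        self.id = str(uuid.uuid4())        # session tracking with UUID
        self.last_seen = time.time()
        self.last_objects: list[str] = []  # objects detected in prior turns

    def expired(self, now=None) -> bool:
        return ((now or time.time()) - self.last_seen) > SESSION_TIMEOUT_S

    def resolve_pronouns(self, question: str) -> str:
        """Replace a bare pronoun with the most recently detected object."""
        if not self.last_objects:
            return question
        obj = self.last_objects[-1]
        return re.sub(r"\b(it|this|that)\b", obj, question, flags=re.IGNORECASE)

s = Session()
s.last_objects.append("car")
print(s.resolve_pronouns("Can it float?"))  # Can car float?
```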
**Example Conversation**:

```
Turn 1:
  User: "What is this?"
  VQA:  "A red car"
  Objects: ["car"]

Turn 2:
  User: "Can it float?"            # "it" = "car"
  System: Resolves "it" → "car"
  VQA: [Neuro-Symbolic] "According to Wikidata, cars are made
       of metal and plastic with density around 800-1000 kg/m³,
       which is close to water. Most cars would sink."

Turn 3:
  User: "What color is it again?"  # Still referring to car
  VQA: [Neural] "red"              # From Turn 1 context
```
---

## Quick Start

### Prerequisites

- Python 3.10+
- CUDA GPU (recommended, 4GB+ VRAM)
- Node.js 16+ (for mobile UI)
- Groq API key ([get one free](https://console.groq.com))

### Backend Setup

```bash
# 1. Clone repository
git clone https://github.com/YourUsername/vqa_coes.git
cd vqa_coes

# 2. Install dependencies
pip install -r requirements_api.txt

# 3. Set environment variables
echo "GROQ_API_KEY=your_groq_api_key_here" > .env

# 4. Download model checkpoints (if not included)
# Ensure these files exist in the project root:
#   - vqa_checkpoint.pt         (base model)
#   - vqa_spatial_checkpoint.pt (spatial model)

# 5. Start API server
python backend_api.py
# Server will start at http://localhost:8000
```
### Mobile UI Setup

```bash
# 1. Navigate to UI folder
cd ui

# 2. Install dependencies
npm install

# 3. Configure the API endpoint
# Edit ui/src/config/api.js:
#   export const API_BASE_URL = 'http://YOUR_LOCAL_IP:8000';

# 4. Start Expo
npx expo start --clear
# Scan the QR code with the Expo Go app, or press 'w' for web
```
---

## API Reference

### POST `/api/answer`

Answer a visual question with optional conversation context.

**Request**:

```bash
curl -X POST http://localhost:8000/api/answer \
  -F "image=@photo.jpg" \
  -F "question=Can this float in water?" \
  -F "session_id=optional-uuid-here"
```
**Response**:

```json
{
  "answer": "According to Wikidata, this object has a density of 917 kg/m³, which is less than water (1000 kg/m³), so it would float.",
  "model_used": "neuro_symbolic",
  "confidence": 0.87,
  "kg_enhancement": true,
  "wikidata_entity": "Q41576",
  "description": "The object appears to be made of ice. Based on its physical properties from scientific data, it would float on water due to lower density.",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "conversation_turn": 2
}
```

---

## License

MIT License - see LICENSE file for details