---
title: VQA Backend
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
<div align="center">

# GenVQA: Generative Visual Question Answering

**A neuro-symbolic VQA system that detects objects with a neural model, retrieves structured facts from Wikidata, and generates grounded answers with Groq.**

[Backend CI](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/backend-ci.yml)
[UI CI](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/ui-ci.yml)

</div>

---

## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                        CLIENT LAYER                         │
│  Expo Mobile App (React Native)                             │
│   • Image upload + question input                           │
│   • Displays answer + accessibility description             │
└────────────────────────────┬────────────────────────────────┘
                             │ HTTP POST /api/answer
                             ▼
┌─────────────────────────────────────────────────────────────┐
│                   BACKEND LAYER (FastAPI)                   │
│  backend_api.py                                             │
│   • Request handling, session management                    │
│   • Conversation Manager: multi-turn context tracking       │
└────────────────────────────┬────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────┐
│             ROUTING LAYER (ensemble_vqa_app.py)             │
│                                                             │
│  CLIP encodes question → compares against:                  │
│  "reasoning question" vs "visual/perceptual question"       │
│                                                             │
│      Reasoning?                  Visual?                    │
│          │                          │                       │
│          ▼                          ▼                       │
│  ┌─────────────────┐    ┌─────────────────────┐             │
│  │ NEURO-SYMBOLIC  │    │   NEURAL VQA PATH   │             │
│  │                 │    │                     │             │
│  │ 1. VQA model    │    │ VQA model (GRU +    │             │
│  │    detects obj  │    │ Attention) predicts │             │
│  │                 │    │ answer directly     │             │
│  │ 2. Wikidata API │    └──────────┬──────────┘             │
│  │    fetches facts│               │                        │
│  │    (P31, P2101, │               │                        │
│  │    P2054, P186, │               │                        │
│  │    P366 ...)    │               │                        │
│  │                 │               │                        │
│  │ 3. Groq LLM     │               │                        │
│  │    verbalizes   │               │                        │
│  │    from facts   │               │                        │
│  └────────┬────────┘               │                        │
│           └────────────┬───────────┘                        │
└────────────────────────┼────────────────────────────────────┘
                         │
                         ▼
                ┌─────────────────┐
                │  GROQ SERVICE   │
                │  Accessibility  │
                │  description    │
                │  (2 sentences,  │
                │  screen-reader  │
                │  friendly)      │
                └────────┬────────┘
                         │
                         ▼
                 JSON response
                 { answer, model_used,
                   kg_enhancement,
                   wikidata_entity,
                   description }
```
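The CLIP routing step can be sketched in miniature. This is an illustrative sketch only: the real router uses CLIP text embeddings, whereas here a toy bag-of-words embedding and cosine similarity stand in for them, and the two anchor prompts are invented for the example.

```python
# Toy sketch of similarity-based question routing. The real system
# compares CLIP text embeddings; a bag-of-words embedding stands in
# here so the routing mechanics are visible without model weights.
import math
from collections import Counter

# Hypothetical anchor prompts (the real prompts may differ).
REASONING_ANCHOR = "a question requiring knowledge or reasoning about properties"
VISUAL_ANCHOR = "a question about what is visible in the image"

def embed(text: str) -> Counter:
    # Stand-in for CLIP's text encoder: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(question: str) -> str:
    # Whichever anchor the question is closer to wins.
    q = embed(question)
    reasoning = cosine(q, embed(REASONING_ANCHOR))
    visual = cosine(q, embed(VISUAL_ANCHOR))
    return "neuro-symbolic" if reasoning >= visual else "neural"
```

A question like "Can this melt?" would land on the neuro-symbolic path with real embeddings; the toy version only captures the shape of the decision.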
| Layer | Component | Role |
|---|---|---|
| **Client** | Expo React Native | Image upload, question input, answer display |
| **API** | FastAPI (`backend_api.py`) | Routing, sessions, conversation state |
| **Conversation** | `conversation_manager.py` | Multi-turn context, history tracking |
| **Router** | CLIP (in `ensemble_vqa_app.py`) | Classifies question as reasoning vs visual |
| **Neural VQA** | GRU + Attention (`model.py`) | Answers visual questions directly from image |
| **Neuro-Symbolic** | `semantic_neurosymbolic_vqa.py` | VQA detects objects → Wikidata fetches facts → Groq verbalizes |
| **Accessibility** | `groq_service.py` | Generates a spoken-friendly two-sentence description for every answer |
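As a rough sketch of the conversation layer's job (the actual `conversation_manager.py` may differ in names and details), a minimal manager keeps per-session history and exposes a rolling context window:

```python
# Minimal multi-turn conversation manager sketch: sessions keyed by id,
# each holding an ordered turn history, with only the most recent turns
# returned as context for the model.
import uuid

class ConversationManager:
    def __init__(self, max_turns: int = 5):
        self.max_turns = max_turns
        self.sessions: dict[str, list[dict]] = {}

    def new_session(self) -> str:
        # Matches the role of POST /api/conversation/new.
        session_id = uuid.uuid4().hex
        self.sessions[session_id] = []
        return session_id

    def add_turn(self, session_id: str, question: str, answer: str) -> None:
        history = self.sessions.setdefault(session_id, [])
        history.append({"question": question, "answer": answer})

    def context(self, session_id: str) -> list[dict]:
        # Only the last max_turns turns are fed back to the model.
        return self.sessions.get(session_id, [])[-self.max_turns:]
```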
| --- | |
| ## Features | |
| - π **Visual Question Answering** β trained on VQAv2, fine-tuned on custom data | |
| - π§ **Neuro-Symbolic Routing** β CLIP semantically classifies questions as _reasoning_ vs _visual_, routes accordingly | |
| - π **Live Wikidata Facts** β queries physical properties, categories, materials, uses in real time | |
| - π€ **Groq Verbalization** β Llama 3.3 70B answers from structured facts, not hallucination | |
| - π¬ **Conversational Support** β multi-turn conversation manager with context tracking | |
| - π± **Expo Mobile UI** β React Native app for iOS/Android/Web | |
| - βΏ **Accessibility** β Groq generates spoken-friendly descriptions for every answer | |
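To illustrate the Wikidata step, here is a hedged sketch of pulling a quantity property (e.g. P2101, melting point) out of a payload shaped like the Wikidata claims API response. The sample dict below is a trimmed, illustrative stand-in, not real API output, and the actual fetching code in `semantic_neurosymbolic_vqa.py` may differ:

```python
# Illustrative, trimmed stand-in for a Wikidata claims payload
# (claims -> property id -> list of statements -> mainsnak -> datavalue).
SAMPLE_CLAIMS = {
    "claims": {
        "P2101": [  # melting point (quantity)
            {"mainsnak": {"datavalue": {"value": {"amount": "+0", "unit": "..."}}}}
        ],
        "P31": [    # instance of (entity reference)
            {"mainsnak": {"datavalue": {"value": {"id": "Q000"}}}}  # placeholder id
        ],
    }
}

def extract_quantity(claims: dict, prop: str):
    """Return the first numeric value for a quantity property, else None."""
    for claim in claims.get("claims", {}).get(prop, []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and "amount" in value:
            return float(value["amount"])
    return None
```

A fact like `melting point = 0.0` extracted this way is what the Groq verbalizer turns into the grounded `kg_enhancement` string.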
| --- | |
| ## Quick Start | |
| ### 1 β Backend | |
| ```bash | |
| # Clone and install | |
| git clone https://github.com/DevaRajan8/Generative-vqa.git | |
| cd Generative-vqa | |
| pip install -r requirements_api.txt | |
| # Set your Groq API key | |
| cp .env.example .env | |
| # Edit .env β GROQ_API_KEY=your_key_here | |
| # Start API | |
| python backend_api.py | |
| # β http://localhost:8000 | |
| ``` | |
| ### 2 β Mobile UI | |
| ```bash | |
| cd ui | |
| npm install | |
| npx expo start --clear | |
| ``` | |
| > Scan the QR code with Expo Go, or press `w` for browser. | |
| --- | |
| ## API | |
| | Endpoint | Method | Description | | |
| |---|---|---| | |
| | `/api/answer` | POST | Answer a question about an uploaded image | | |
| | `/api/health` | GET | Health check | | |
| | `/api/conversation/new` | POST | Start a new conversation session | | |
| **Example:** | |
| ```bash | |
| curl -X POST http://localhost:8000/api/answer \ | |
| -F "image=@photo.jpg" \ | |
| -F "question=Can this melt?" | |
| ``` | |
| **Response:** | |
| ```json | |
| { | |
| "answer": "ice", | |
| "model_used": "neuro-symbolic", | |
| "kg_enhancement": "Yes β ice can melt. [Wikidata P2101: melting point = 0.0 Β°C]", | |
| "knowledge_source": "VQA (neural) + Wikidata (symbolic) + Groq (verbalize)", | |
| "wikidata_entity": "Q86" | |
| } | |
| ``` | |
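For callers without `curl`, the same request can be made from Python. This sketch builds the multipart body with only the standard library; with the third-party `requests` package it collapses to a single `requests.post(url, files=..., data=...)` call. The endpoint shape is taken from the table above; the helper names are invented for the example.

```python
# Stdlib-only sketch of POSTing an image + question to /api/answer.
import json
import urllib.request
import uuid

def build_multipart(fields: dict, files: dict):
    """Assemble a multipart/form-data body and its Content-Type header."""
    boundary = uuid.uuid4().hex
    body = b""
    for name, value in fields.items():
        body += (f"--{boundary}\r\n"
                 f'Content-Disposition: form-data; name="{name}"\r\n\r\n').encode()
        body += value + b"\r\n"
    for name, (filename, data) in files.items():
        body += (f"--{boundary}\r\n"
                 f'Content-Disposition: form-data; name="{name}"; '
                 f'filename="{filename}"\r\n'
                 f"Content-Type: application/octet-stream\r\n\r\n").encode()
        body += data + b"\r\n"
    body += f"--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def ask(url: str, image_path: str, question: str) -> dict:
    with open(image_path, "rb") as f:
        image = f.read()
    body, content_type = build_multipart(
        {"question": question.encode()}, {"image": (image_path, image)}
    )
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (against a running backend):
# ask("http://localhost:8000/api/answer", "photo.jpg", "Can this melt?")
```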
| --- | |
| ## Project Structure | |
| ``` | |
| βββ backend_api.py # FastAPI server | |
| βββ ensemble_vqa_app.py # VQA orchestrator (routing + inference) | |
| βββ semantic_neurosymbolic_vqa.py # Wikidata KB + Groq verbalizer | |
| βββ groq_service.py # Groq accessibility descriptions | |
| βββ conversation_manager.py # Multi-turn conversation tracking | |
| βββ model.py # VQA model definition | |
| βββ train.py # Training pipeline | |
| βββ ui/ # Expo React Native app | |
| β βββ src/screens/HomeScreen.js | |
| βββ .github/ | |
| βββ workflows/ # CI β backend lint + UI build | |
| βββ ISSUE_TEMPLATE/ | |
| ``` | |
| --- | |
| ## Environment Variables | |
| | Variable | Required | Description | | |
| |---|---|---| | |
| | `GROQ_API_KEY` | β | Groq API key β [get one free](https://console.groq.com) | | |
| | `MODEL_PATH` | optional | Path to VQA checkpoint (default: `vqa_checkpoint.pt`) | | |
| | `PORT` | optional | API server port (default: `8000`) | | |
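These variables might be consumed at startup along these lines (an illustrative sketch using the names and defaults from the table; the actual `backend_api.py` may read them differently):

```python
# Sketch of startup configuration from environment variables.
import os

def load_config() -> dict:
    api_key = os.getenv("GROQ_API_KEY")
    if not api_key:
        # The only required variable; fail fast with a pointer to setup.
        raise RuntimeError("GROQ_API_KEY is required; see .env.example")
    return {
        "groq_api_key": api_key,
        "model_path": os.getenv("MODEL_PATH", "vqa_checkpoint.pt"),
        "port": int(os.getenv("PORT", "8000")),
    }
```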
| --- | |
| ## Requirements | |
| - Python 3.10+ | |
| - CUDA GPU recommended (CPU works but is slow) | |
| - Node.js 20+ (for UI) | |
| - Groq API key (free tier available) | |
| --- | |
| ## License | |
| MIT Β© [DevaRajan8](https://github.com/DevaRajan8) | |