Deva8 commited on
Commit
bb8f662
·
1 Parent(s): 016e102

Deploy VQA Space with model downloader

Browse files
This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. .env.example +6 -0
  2. .gitattributes +12 -34
  3. DATASET_CARD.md +250 -0
  4. Dockerfile +23 -0
  5. HOW_TO_RUN.md +255 -0
  6. PATTERN_MATCHING_FIX.md +86 -0
  7. QUICK_START.md +196 -0
  8. README.md +203 -7
  9. README_COMPLETE.md +530 -0
  10. SETUP_GUIDE.md +118 -0
  11. VQA_ENHANCEMENTS.md +298 -0
  12. __pycache__/backend_api.cpython-312.pyc +0 -0
  13. __pycache__/conversation_manager.cpython-312.pyc +0 -0
  14. __pycache__/ensemble_vqa_app.cpython-312.pyc +0 -0
  15. __pycache__/groq_service.cpython-312.pyc +0 -0
  16. __pycache__/knowledge_graph_service.cpython-312.pyc +0 -0
  17. __pycache__/llm_reasoning_service.cpython-312.pyc +0 -0
  18. __pycache__/model_spatial.cpython-312.pyc +0 -0
  19. __pycache__/semantic_neurosymbolic_vqa.cpython-312.pyc +0 -0
  20. architecture_draft.html +89 -0
  21. architecture_draft.mmd +69 -0
  22. backend_api.py +341 -0
  23. continue.py +344 -0
  24. continued_training_metric.csv +21 -0
  25. conversation_manager.py +312 -0
  26. download_models.py +27 -0
  27. draft_generator.py +112 -0
  28. ensemble_vqa_app.py +458 -0
  29. enterprise_architecture.drawio +341 -0
  30. exp_results/feature_extraction_metric.csv +31 -0
  31. experiments/__pycache__/train.cpython-312.pyc +0 -0
  32. experiments/test.py +73 -0
  33. experiments/train.py +349 -0
  34. experiments/utils/preprocess.py +164 -0
  35. experiments/utils/vocab.py +65 -0
  36. finetune.py +220 -0
  37. finetune2.py +395 -0
  38. genvqa-dataset.py +78 -0
  39. groq_service.py +118 -0
  40. knowledge_graph_service.py +291 -0
  41. llm_reasoning_service.py +292 -0
  42. model.py +224 -0
  43. model_spatial.py +309 -0
  44. models/__pycache__/model.cpython-312.pyc +0 -0
  45. models/model.py +224 -0
  46. quick_start.bat +71 -0
  47. requirements_api.txt +14 -0
  48. scores/feature.txt +77 -0
  49. scores/score.py +300 -0
  50. scores/vqa_evaluation_feature.csv +0 -0
.env.example ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ # Groq API Configuration
2
+ # Get your API key from: https://console.groq.com/keys
3
+ GROQ_API_KEY=your_groq_api_key_here
4
+
5
+ # Optional: Model selection (default: llama-3.3-70b-versatile)
6
+ # GROQ_MODEL=llama-3.3-70b-versatile
.gitattributes CHANGED
@@ -1,35 +1,13 @@
1
- *.7z filter=lfs diff=lfs merge=lfs -text
2
- *.arrow filter=lfs diff=lfs merge=lfs -text
3
- *.bin filter=lfs diff=lfs merge=lfs -text
4
- *.bz2 filter=lfs diff=lfs merge=lfs -text
5
- *.ckpt filter=lfs diff=lfs merge=lfs -text
6
- *.ftz filter=lfs diff=lfs merge=lfs -text
7
- *.gz filter=lfs diff=lfs merge=lfs -text
8
- *.h5 filter=lfs diff=lfs merge=lfs -text
9
- *.joblib filter=lfs diff=lfs merge=lfs -text
10
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
- *.model filter=lfs diff=lfs merge=lfs -text
13
- *.msgpack filter=lfs diff=lfs merge=lfs -text
14
- *.npy filter=lfs diff=lfs merge=lfs -text
15
- *.npz filter=lfs diff=lfs merge=lfs -text
16
- *.onnx filter=lfs diff=lfs merge=lfs -text
17
- *.ot filter=lfs diff=lfs merge=lfs -text
18
- *.parquet filter=lfs diff=lfs merge=lfs -text
19
- *.pb filter=lfs diff=lfs merge=lfs -text
20
- *.pickle filter=lfs diff=lfs merge=lfs -text
21
- *.pkl filter=lfs diff=lfs merge=lfs -text
22
  *.pt filter=lfs diff=lfs merge=lfs -text
23
- *.pth filter=lfs diff=lfs merge=lfs -text
24
- *.rar filter=lfs diff=lfs merge=lfs -text
25
- *.safetensors filter=lfs diff=lfs merge=lfs -text
26
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
- *.tar.* filter=lfs diff=lfs merge=lfs -text
28
- *.tar filter=lfs diff=lfs merge=lfs -text
29
- *.tflite filter=lfs diff=lfs merge=lfs -text
30
- *.tgz filter=lfs diff=lfs merge=lfs -text
31
- *.wasm filter=lfs diff=lfs merge=lfs -text
32
- *.xz filter=lfs diff=lfs merge=lfs -text
33
- *.zip filter=lfs diff=lfs merge=lfs -text
34
- *.zst filter=lfs diff=lfs merge=lfs -text
35
- *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  *.pt filter=lfs diff=lfs merge=lfs -text
2
+ *.bin filter=lfs diff=lfs merge=lfs -text
3
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
4
+ *.json filter=lfs diff=lfs merge=lfs -text
9
+ *.csv filter=lfs diff=lfs merge=lfs -text
 
DATASET_CARD.md ADDED
@@ -0,0 +1,250 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # VQA v2 Curated Dataset for Spatial Reasoning
2
+
3
+ ## Dataset Description
4
+
5
+ This is a **curated and balanced subset** of the VQA v2 (Visual Question Answering v2.0) dataset, specifically preprocessed for training visual question answering models with enhanced spatial reasoning capabilities.
6
+
7
+ ### Dataset Summary
8
+
9
+ - **Source**: VQA v2 (MSCOCO train2014 split)
10
+ - **Task**: Visual Question Answering
11
+ - **Language**: English
12
+ - **License**: CC BY 4.0 (inherited from VQA v2)
13
+
14
+ ### Key Features
15
+
16
+ ✨ **Quality-Focused Curation**:
17
+ - Filtered out ambiguous yes/no questions
18
+ - Removed vague questions ("what is in the image", etc.)
19
+ - Answer length limited to 5 words / 30 characters
20
+ - Minimum answer frequency threshold (20 occurrences)
21
+
22
+ 🎯 **Balanced Distribution**:
23
+ - Maximum 600 samples per answer class
24
+ - Prevents model bias toward common answers
25
+ - Ensures diverse question-answer coverage
26
+
27
+ 📊 **Dataset Statistics**:
28
+ - **Total Q-A pairs**: ~[Your final count from running the script]
29
+ - **Unique answers**: ~[Number of unique answer classes]
30
+ - **Images**: MSCOCO train2014 subset
31
+ - **Format**: JSON + CSV metadata
32
+
33
+ ---
34
+
35
+ ## Dataset Structure
36
+
37
+ ### Data Fields
38
+
39
+ Each sample contains:
40
+
41
+ ```json
42
+ {
43
+ "image_id": 123456, // MSCOCO image ID
44
+ "question_id": 789012, // VQA v2 question ID
45
+ "question": "What color is the car?",
46
+ "answer": "red", // Most frequent answer from annotators
47
+ "image_path": "images/COCO_train2014_000000123456.jpg"
48
+ }
49
+ ```
50
+
51
+ ### Data Splits
52
+
53
+ - **Training**: Main dataset (recommend 80-90% for training)
54
+ - **Validation**: User-defined split (recommend 10-20% for validation)
55
+
56
+ ### File Structure
57
+
58
+ ```
59
+ gen_vqa_v2/
60
+ ├── images/ # MSCOCO train2014 images
61
+ │ └── COCO_train2014_*.jpg
62
+ ├── qa_pairs.json # Question-answer pairs (JSON)
63
+ └── metadata.csv # Same data in CSV format
64
+ ```
65
+
66
+ ---
67
+
68
+ ## Data Preprocessing
69
+
70
+ ### Filtering Criteria
71
+
72
+ **Excluded Answers**:
73
+ - Generic responses: `yes`, `no`, `unknown`, `none`, `n/a`, `cant tell`, `not sure`
74
+
75
+ **Excluded Questions**:
76
+ - Ambiguous queries: "what is in the image", "what is this", "what is that", "what do you see"
77
+
78
+ **Answer Constraints**:
79
+ - Maximum 5 words per answer
80
+ - Maximum 30 characters per answer
81
+ - Minimum frequency: 20 occurrences across dataset
82
+
83
+ **Balancing Strategy**:
84
+ - Maximum 600 samples per answer class
85
+ - Prevents over-representation of common answers (e.g., "white", "2")
86
+
87
+ ### Preprocessing Script
88
+
89
+ The dataset was generated using `genvqa-dataset.py`:
90
+
91
+ ```python
92
+ # Key parameters
93
+ MIN_ANSWER_FREQ = 20 # Minimum answer occurrences
94
+ MAX_SAMPLES_PER_ANSWER = 600 # Class balancing limit
95
+ ```
96
+
97
+ ---
98
+
99
+ ## Intended Use
100
+
101
+ ### Primary Use Cases
102
+
103
+ ✅ **Training VQA Models**:
104
+ - Visual question answering systems
105
+ - Multimodal vision-language models
106
+ - Spatial reasoning research
107
+
108
+ ✅ **Research Applications**:
109
+ - Evaluating spatial understanding in VQA
110
+ - Studying answer distribution bias
111
+ - Benchmarking ensemble architectures
112
+
113
+ ### Out-of-Scope Use
114
+
115
+ ❌ Medical diagnosis or safety-critical applications
116
+ ❌ Surveillance or privacy-invasive systems
117
+ ❌ Generating misleading or harmful content
118
+
119
+ ---
120
+
121
+ ## Dataset Creation
122
+
123
+ ### Source Data
124
+
125
+ **VQA v2 Dataset**:
126
+ - **Paper**: [Making the V in VQA Matter](https://arxiv.org/abs/1612.00837)
127
+ - **Authors**: Goyal et al. (2017)
128
+ - **Images**: MSCOCO train2014
129
+ - **Original Size**: 443,757 question-answer pairs (train split)
130
+
131
+ ### Curation Rationale
132
+
133
+ This curated subset addresses common VQA training challenges:
134
+
135
+ 1. **Bias Reduction**: Limits over-represented answers
136
+ 2. **Quality Control**: Removes ambiguous/uninformative samples
137
+ 3. **Spatial Focus**: Retains questions requiring spatial reasoning
138
+ 4. **Practical Constraints**: Focuses on concise, specific answers
139
+
140
+ ### Annotations
141
+
142
+ Annotations are inherited from VQA v2:
143
+ - 10 answers per question from human annotators
144
+ - **Answer selection**: Most frequent answer among annotators
145
+ - **Consensus**: Majority voting for ground truth
146
+
147
+ ---
148
+
149
+ ## Considerations for Using the Data
150
+
151
+ ### Social Impact
152
+
153
+ This dataset inherits biases from MSCOCO and VQA v2:
154
+ - **Geographic bias**: Primarily Western/North American scenes
155
+ - **Cultural bias**: Limited representation of global diversity
156
+ - **Object bias**: Common objects over-represented
157
+
158
+ ### Limitations
159
+
160
+ ⚠️ **Known Issues**:
161
+ - Answer distribution still skewed toward common objects (e.g., "white", "2", "yes")
162
+ - Spatial reasoning questions may be underrepresented
163
+ - Some questions may have multiple valid answers
164
+
165
+ ⚠️ **Not Suitable For**:
166
+ - Fine-grained visual reasoning (e.g., "How many stripes on the 3rd zebra?")
167
+ - Rare object recognition
168
+ - Non-English languages
169
+
170
+ ---
171
+
172
+ ## Citation
173
+
174
+ ### BibTeX
175
+
176
+ ```bibtex
177
+ @inproceedings{goyal2017making,
178
+ title={Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering},
179
+ author={Goyal, Yash and Khot, Tejas and Summers-Stay, Douglas and Batra, Dhruv and Parikh, Devi},
180
+ booktitle={CVPR},
181
+ year={2017}
182
+ }
183
+ ```
184
+
185
+ ### Original VQA v2 Dataset
186
+
187
+ - **Homepage**: https://visualqa.org/
188
+ - **Paper**: https://arxiv.org/abs/1612.00837
189
+ - **License**: CC BY 4.0
190
+
191
+ ---
192
+
193
+ ## Additional Information
194
+
195
+ ### Dataset Curators
196
+
197
+ Curated from VQA v2 by [Your Name/Organization]
198
+
199
+ ### Licensing
200
+
201
+ This dataset is released under **CC BY 4.0**, consistent with the original VQA v2 license.
202
+
203
+ ### Contact
204
+
205
+ For questions or issues, please contact [your email/GitHub].
206
+
207
+ ---
208
+
209
+ ## Usage Example
210
+
211
+ ### Loading the Dataset
212
+
213
+ ```python
214
+ import json
215
+ import pandas as pd
216
+ from PIL import Image
217
+
218
+ # Load metadata
219
+ with open("gen_vqa_v2/qa_pairs.json", "r") as f:
220
+ data = json.load(f)
221
+
222
+ # Or use CSV
223
+ df = pd.read_csv("gen_vqa_v2/metadata.csv")
224
+
225
+ # Access a sample
226
+ sample = data[0]
227
+ image = Image.open(f"gen_vqa_v2/{sample['image_path']}")
228
+ question = sample['question']
229
+ answer = sample['answer']
230
+
231
+ print(f"Q: {question}")
232
+ print(f"A: {answer}")
233
+ ```
234
+
235
+ ### Training Split
236
+
237
+ ```python
238
+ from sklearn.model_selection import train_test_split
239
+
240
+ # 80-20 train-val split
241
+ train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)
242
+ ```
243
+
244
+ ---
245
+
246
+ ## Acknowledgments
247
+
248
+ - **VQA v2 Team**: Goyal et al. for the original dataset
249
+ - **MSCOCO Team**: Lin et al. for the image dataset
250
+ - **Community**: Open-source VQA research community
Dockerfile ADDED
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
2
+
3
+ WORKDIR /app
4
+
5
+ # System deps
6
+ RUN apt-get update && apt-get install -y \
7
+ git \
8
+ libglib2.0-0 \
9
+ libsm6 \
10
+ libxrender1 \
11
+ libxext6 \
12
+ libgl1-mesa-glx \
13
+ && rm -rf /var/lib/apt/lists/*
14
+
15
+ # Install Python deps
16
+ COPY requirements_api.txt .
17
+ RUN pip install --no-cache-dir -r requirements_api.txt
18
+
19
+ # Copy all project files
20
+ COPY . .
21
+
22
+ # Download models before starting server
23
+ CMD python download_models.py && uvicorn backend_api:app --host 0.0.0.0 --port 7860
HOW_TO_RUN.md ADDED
@@ -0,0 +1,255 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # 🚀 How to Run the VQA Mobile App
2
+
3
+ ## Quick Overview
4
+
5
+ You now have a complete React Native mobile app for Visual Question Answering! Here's what was created:
6
+
7
+ ### ✅ What's Built
8
+
9
+ 1. **Backend API** (`backend_api.py`)
10
+ - FastAPI server wrapping your ensemble VQA models
11
+ - Automatic routing between base and spatial models
12
+ - Image upload and question answering endpoints
13
+
14
+ 2. **Mobile App** (`ui/` folder)
15
+ - Beautiful React Native app with Expo
16
+ - Google OAuth authentication
17
+ - Camera and gallery image picker
18
+ - Question input and answer display
19
+ - Model routing visualization
20
+
21
+ ## 🎯 Running the App (4 Steps)
22
+
23
+ ### Step 1: Start the Backend Server
24
+
25
+ ```bash
26
+ # Open PowerShell/Terminal
27
+ cd c:\Users\rdeva\Downloads\vqa_coes
28
+
29
+ # Install API dependencies (FIRST TIME ONLY)
30
+ # If you get import errors, run this:
31
+ pip install fastapi uvicorn python-multipart
32
+
33
+ # Start the server
34
+ python start_backend.py
35
+ # Or: python backend_api.py
36
+ ```
37
+
38
+ > **Note**: If you get "ModuleNotFoundError", see [IMPORT_ERRORS_FIX.md](IMPORT_ERRORS_FIX.md) for solutions.
39
+
40
+ ✅ **Keep this window open!** The server must stay running.
41
+
42
+ You should see:
43
+ ```
44
+ 🚀 INITIALIZING ENSEMBLE VQA SYSTEM
45
+ ✅ Ensemble ready!
46
+ ```
47
+
48
+ ### Step 2: Configure the Mobile App
49
+
50
+ 1. **Find your local IP address:**
51
+ ```bash
52
+ ipconfig
53
+ ```
54
+ Look for "IPv4 Address" (e.g., `192.168.1.100`)
55
+
56
+ 2. **Update the API URL:**
57
+ - Open: `ui\src\config\api.js`
58
+ - Change line 8:
59
+ ```javascript
60
+ export const API_BASE_URL = 'http://YOUR_IP_HERE:8000';
61
+ ```
62
+ - Example:
63
+ ```javascript
64
+ export const API_BASE_URL = 'http://192.168.1.100:8000';
65
+ ```
66
+
67
+ ### Step 3: Start the Mobile App
68
+
69
+ ```bash
70
+ # Open a NEW PowerShell/Terminal window
71
+ cd c:\Users\rdeva\Downloads\vqa_coes\ui
72
+
73
+ # Start Expo
74
+ npm start
75
+ ```
76
+
77
+ You'll see a QR code in the terminal.
78
+
79
+ ### Step 4: Run on Your Phone
80
+
81
+ 1. **Install Expo Go** on your smartphone:
82
+ - [Android - Play Store](https://play.google.com/store/apps/details?id=host.exp.exponent)
83
+ - [iOS - App Store](https://apps.apple.com/app/expo-go/id982107779)
84
+
85
+ 2. **Scan the QR code:**
86
+ - Android: Open Expo Go → Scan QR
87
+ - iOS: Open Camera → Scan QR → Tap notification
88
+
89
+ 3. **Wait for the app to load** (first time takes ~1-2 minutes)
90
+
91
+ ## 📱 Using the App
92
+
93
+ ### Option A: Test Without Google Login
94
+
95
+ For quick testing, you can bypass Google authentication:
96
+
97
+ 1. Open `ui\App.js`
98
+ 2. Find line 23-27 and replace with:
99
+ ```javascript
100
+ <Stack.Screen name="Home" component={HomeScreen} />
101
+ ```
102
+ 3. Save and reload the app (shake phone → Reload)
103
+
104
+ ### Option B: Set Up Google Login
105
+
106
+ 1. Go to [Google Cloud Console](https://console.cloud.google.com/)
107
+ 2. Create a new project
108
+ 3. Enable the Google People API (the legacy Google+ API has been shut down)
109
+ 4. Create OAuth 2.0 credentials
110
+ 5. Update `ui\src\config\google.js` with your client IDs
111
+
112
+ ### Testing VQA Functionality
113
+
114
+ 1. **Select an image:**
115
+ - Tap "Camera" to take a photo
116
+ - Tap "Gallery" to choose existing image
117
+
118
+ 2. **Ask a question:**
119
+ - Type your question (e.g., "What color is the car?")
120
+ - Tap "Ask Question"
121
+
122
+ 3. **View the answer:**
123
+ - See the AI-generated answer
124
+ - Check which model was used:
125
+ - 🔍 **Base Model** - General questions
126
+ - 📍 **Spatial Model** - Spatial questions (left, right, above, etc.)
127
+
128
+ ## 🧪 Example Questions to Try
129
+
130
+ ### General Questions (Base Model 🔍)
131
+ - "What color is the car?"
132
+ - "How many people are in the image?"
133
+ - "What room is this?"
134
+ - "Is there a dog?"
135
+
136
+ ### Spatial Questions (Spatial Model 📍)
137
+ - "What is to the right of the table?"
138
+ - "What is above the chair?"
139
+ - "What is next to the door?"
140
+ - "What is on the left side?"
141
+
142
+ ## 🔧 Troubleshooting
143
+
144
+ ### "Cannot connect to server"
145
+ - ✅ Check backend is running (`python backend_api.py`)
146
+ - ✅ Verify IP address in `api.js` matches your computer's IP
147
+ - ✅ Ensure phone and computer are on the **same WiFi network**
148
+ - ✅ Check Windows Firewall isn't blocking port 8000
149
+
150
+ ### "Model not loaded"
151
+ - ✅ Ensure these files exist in `c:\Users\rdeva\Downloads\vqa_coes\`:
152
+ - `vqa_checkpoint.pt`
153
+ - `vqa_spatial_checkpoint.pt`
154
+ - ✅ Check backend terminal for error messages
155
+
156
+ ### App won't load on phone
157
+ - ✅ Verify Expo Go is installed
158
+ - ✅ Both devices on same WiFi
159
+ - ✅ Try restarting Expo: Press `Ctrl+C`, then `npm start`
160
+ - ✅ Clear cache: `npm start -- --clear`
161
+
162
+ ### Camera/Gallery not working
163
+ - ✅ Grant permissions when prompted
164
+ - ✅ Check phone Settings → App Permissions
165
+
166
+ ## 📁 Project Structure
167
+
168
+ ```
169
+ vqa_coes/
170
+ ├── backend_api.py # FastAPI backend server
171
+ ├── ensemble_vqa_app.py # Your existing ensemble system
172
+ ├── model_spatial.py # Spatial model
173
+ ├── models/model.py # Base model
174
+ ├── vqa_checkpoint.pt # Base model weights
175
+ ├── vqa_spatial_checkpoint.pt # Spatial model weights
176
+ ├── requirements_api.txt # Backend dependencies
177
+ ├── QUICK_START.md # This guide
178
+ └── ui/ # Mobile app
179
+ ├── App.js # Main app component
180
+ ├── app.json # Expo configuration
181
+ ├── package.json # Dependencies
182
+ └── src/
183
+ ├── config/
184
+ │ ├── api.js # ⚠️ UPDATE YOUR IP HERE
185
+ │ └── google.js # Google OAuth config
186
+ ├── contexts/
187
+ │ └── AuthContext.js # Authentication
188
+ ├── screens/
189
+ │ ├── LoginScreen.js # Login UI
190
+ │ └── HomeScreen.js # Main VQA UI
191
+ ├── services/
192
+ │ └── api.js # API client
193
+ └── styles/
194
+ ├── theme.js # Design system
195
+ └── globalStyles.js
196
+ ```
197
+
198
+ ## 📚 Documentation
199
+
200
+ - **How to Run**: `HOW_TO_RUN.md` (this file)
201
+ - **Full README**: `ui/README.md`
202
+ - **Implementation Details**: See walkthrough artifact
203
+
204
+ ## 🎨 Customization
205
+
206
+ ### Change Colors
207
+ Edit `ui/src/styles/theme.js`:
208
+ ```javascript
209
+ colors: {
210
+ primary: '#6366F1', // Change to your color
211
+ secondary: '#EC4899', // Change to your color
212
+ // ...
213
+ }
214
+ ```
215
+
216
+ ### Change App Name
217
+ Edit `ui/app.json`:
218
+ ```json
219
+ {
220
+ "expo": {
221
+ "name": "Your App Name",
222
+ "slug": "your-app-slug"
223
+ }
224
+ }
225
+ ```
226
+
227
+ ## 🚢 Next Steps
228
+
229
+ Once everything works:
230
+
231
+ 1. **Add Google OAuth** for production
232
+ 2. **Create custom icons** (see `ui/assets/ICONS_README.md`)
233
+ 3. **Build standalone app**:
234
+ ```bash
235
+ npx eas-cli build --platform android
236
+ ```
237
+
238
+ ## 💡 Tips
239
+
240
+ - **Backend must run first** before starting the mobile app
241
+ - **Same WiFi network** is required for phone and computer
242
+ - **First load is slow** - subsequent loads are faster
243
+ - **Shake phone** to access Expo developer menu
244
+ - **Check logs** in both terminals for debugging
245
+
246
+ ## 🆘 Need Help?
247
+
248
+ 1. Check the troubleshooting section above
249
+ 2. Review backend terminal for errors
250
+ 3. Check Expo console in terminal
251
+ 4. Verify all configuration steps
252
+
253
+ ---
254
+
255
+ **Ready to test?** Follow the 4 steps above and start asking questions about images! 🎉
PATTERN_MATCHING_FIX.md ADDED
@@ -0,0 +1,86 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Fix: Removed Hardcoded Patterns from Neuro-Symbolic VQA
2
+
3
+ ## Problem Identified
4
+ The `_detect_objects_with_clip()` method in `semantic_neurosymbolic_vqa.py` contained a **predefined list of object categories**, which is essentially pattern matching and defeats the purpose of a truly neuro-symbolic approach.
5
+
6
+ ```python
7
+ # ❌ OLD CODE - Hardcoded categories (pattern matching!)
8
+ object_categories = [
9
+ "food", "soup", "noodles", "rice", "meat", "vegetable", "fruit",
10
+ "bowl", "plate", "cup", "glass", "spoon", "fork", "knife", ...
11
+ ]
12
+ ```
13
+
14
+ This is **not acceptable** because:
15
+ - It limits detection to predefined categories only
16
+ - It's essentially pattern matching, not true neural understanding
17
+ - It violates the neuro-symbolic principle of learning from data
18
+
19
+ ## Solution Applied
20
+
21
+ ### 1. Deprecated `_detect_objects_with_clip()`
22
+ The method now returns an empty list and warns that it's deprecated:
23
+
24
+ ```python
25
+ # ✅ NEW CODE - No predefined lists!
26
+ def _detect_objects_with_clip(self, image_features, image_path=None):
27
+ """
28
+ NOTE: This method is deprecated in favor of using the VQA model
29
+ directly from ensemble_vqa_app.py.
30
+ """
31
+ print("⚠️ _detect_objects_with_clip is deprecated")
32
+ print("→ Use VQA model's _detect_multiple_objects() instead")
33
+ return []
34
+ ```
35
+
36
+ ### 2. Updated `answer_with_clip_features()`
37
+ Now **requires** objects to be provided by the VQA model:
38
+
39
+ ```python
40
+ # ✅ Objects must come from VQA model, not predefined lists
41
+ def answer_with_clip_features(
42
+ self,
43
+ image_features,
44
+ question,
45
+ image_path=None,
46
+ detected_objects: List[str] = None # REQUIRED!
47
+ ):
48
+ if not detected_objects:
49
+ print("⚠️ No objects provided - neuro-symbolic reasoning requires VQA-detected objects")
50
+ return None
51
+ ```
52
+
53
+ ### 3. Ensemble VQA Uses True VQA Detection
54
+ The `ensemble_vqa_app.py` already uses `_detect_multiple_objects()` which:
55
+ - Asks the VQA model **open-ended questions** like "What is this?"
56
+ - Uses the model's learned knowledge, not predefined categories
57
+ - Generates objects dynamically based on visual understanding
58
+
59
+ ```python
60
+ # ✅ TRUE NEURO-SYMBOLIC APPROACH
61
+ detected_objects = self._detect_multiple_objects(image, model, top_k=5)
62
+ # This asks VQA model: "What is this?", "What food is this?", etc.
63
+ # NO predefined categories!
64
+ ```
65
+
66
+ ## Result
67
+
68
+ ✅ **Pure Neuro-Symbolic Pipeline**:
69
+ 1. **VQA Model** detects objects using learned visual understanding (no predefined lists)
70
+ 2. **Wikidata** provides factual knowledge about detected objects
71
+ 3. **LLM** performs Chain-of-Thought reasoning on the facts
72
+ 4. **No pattern matching** anywhere in the pipeline
73
+
74
+ ## Files Modified
75
+ - `semantic_neurosymbolic_vqa.py`:
76
+ - Deprecated `_detect_objects_with_clip()`
77
+ - Updated `answer_with_clip_features()` to require VQA-detected objects
78
+ - Changed knowledge source from "CLIP + Wikidata" to "VQA + Wikidata"
79
+
80
+ ## Verification
81
+ The system now uses a **truly neuro-symbolic approach**:
82
+ - ✅ No hardcoded object categories
83
+ - ✅ No predefined patterns
84
+ - ✅ Pure learned visual understanding from VQA model
85
+ - ✅ Symbolic reasoning from Wikidata + LLM
86
+ - ✅ Chain-of-Thought transparency
QUICK_START.md ADDED
@@ -0,0 +1,196 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Quick Start Guide - VQA Mobile App
2
+
3
+ This guide will help you get the VQA mobile app running quickly.
4
+
5
+ ## Prerequisites Checklist
6
+
7
+ - [ ] Python 3.8+ installed
8
+ - [ ] Node.js 16+ installed
9
+ - [ ] VQA model checkpoints available
10
+ - [ ] Smartphone with Expo Go app installed
11
+ - [ ] Computer and phone on same WiFi network
12
+
13
+ ## Step-by-Step Setup
14
+
15
+ ### Step 1: Start the Backend Server
16
+
17
+ ```bash
18
+ # Open terminal/PowerShell
19
+ cd c:\Users\rdeva\Downloads\vqa_coes
20
+
21
+ # Install backend dependencies (first time only)
22
+ pip install -r requirements_api.txt
23
+
24
+ # Start the server
25
+ python backend_api.py
26
+ ```
27
+
28
+ **Expected output:**
29
+ ```
30
+ 🚀 INITIALIZING ENSEMBLE VQA SYSTEM
31
+ ⚙️ Device: cuda
32
+ 📥 Loading models...
33
+ ✅ Ensemble ready!
34
+ ```
35
+
36
+ **Important:** Keep this terminal window open! The server must keep running.
37
+
38
+ ### Step 2: Find Your Local IP Address
39
+
40
+ **Windows:**
41
+ ```bash
42
+ ipconfig
43
+ ```
44
+ Look for "IPv4 Address" under your WiFi adapter (e.g., `192.168.1.100`)
45
+
46
+ **Mac/Linux:**
47
+ ```bash
48
+ ifconfig
49
+ # or
50
+ ip addr
51
+ ```
52
+
53
+ ### Step 3: Configure the Mobile App
54
+
55
+ 1. Open `ui/src/config/api.js`
56
+ 2. Replace the IP address:
57
+ ```javascript
58
+ export const API_BASE_URL = 'http://YOUR_IP_HERE:8000';
59
+ // Example: export const API_BASE_URL = 'http://192.168.1.100:8000';
60
+ ```
61
+
62
+ ### Step 4: Configure Google OAuth (Optional for Testing)
63
+
64
+ **For testing without Google login**, you can skip this and modify the app to bypass authentication.
65
+
66
+ **For full Google login:**
67
+
68
+ 1. Go to [Google Cloud Console](https://console.cloud.google.com/)
69
+ 2. Create a project
70
+ 3. Enable the Google People API (the legacy Google+ API has been shut down)
71
+ 4. Create OAuth 2.0 credentials
72
+ 5. Update `ui/src/config/google.js` with your client IDs
73
+
74
+ ### Step 5: Start the Mobile App
75
+
76
+ ```bash
77
+ # Open a NEW terminal/PowerShell
78
+ cd c:\Users\rdeva\Downloads\vqa_coes\ui
79
+
80
+ # Start Expo
81
+ npm start
82
+ ```
83
+
84
+ **Expected output:**
85
+ ```
86
+ Metro waiting on exp://192.168.1.100:8081
87
+ › Scan the QR code above with Expo Go (Android) or the Camera app (iOS)
88
+ ```
89
+
90
+ ### Step 6: Run on Your Phone
91
+
92
+ 1. **Install Expo Go** on your phone:
93
+ - [Android - Play Store](https://play.google.com/store/apps/details?id=host.exp.exponent)
94
+ - [iOS - App Store](https://apps.apple.com/app/expo-go/id982107779)
95
+
96
+ 2. **Scan the QR code**:
97
+ - Android: Open Expo Go app → Scan QR code
98
+ - iOS: Open Camera app → Scan QR code → Tap notification
99
+
100
+ 3. **Wait for app to load** (first time may take 1-2 minutes)
101
+
102
+ ## Testing Without Google Login
103
+
104
+ If you want to test the VQA functionality without setting up Google OAuth:
105
+
106
+ 1. Open `ui/App.js`
107
+ 2. Temporarily modify the navigation to always show HomeScreen:
108
+
109
+ ```javascript
110
+ // Replace this:
111
+ {user ? (
112
+ <Stack.Screen name="Home" component={HomeScreen} />
113
+ ) : (
114
+ <Stack.Screen name="Login" component={LoginScreen} />
115
+ )}
116
+
117
+ // With this:
118
+ <Stack.Screen name="Home" component={HomeScreen} />
119
+ ```
120
+
121
+ 3. Restart the Expo server
122
+
123
+ ## Testing the App
124
+
125
+ ### Test 1: General Question (Base Model)
126
+ 1. Tap "Gallery" and select an image
127
+ 2. Enter question: "What color is the car?"
128
+ 3. Tap "Ask Question"
129
+ 4. Should show: 🔍 Base Model
130
+
131
+ ### Test 2: Spatial Question (Spatial Model)
132
+ 1. Select an image with multiple objects
133
+ 2. Enter question: "What is to the right of the table?"
134
+ 3. Tap "Ask Question"
135
+ 4. Should show: 📍 Spatial Model
136
+
137
+ ## Troubleshooting
138
+
139
+ ### "Cannot connect to server"
140
+ - ✅ Check backend is running
141
+ - ✅ Verify IP address in `api.js` is correct
142
+ - ✅ Ensure phone and computer on same WiFi
143
+ - ✅ Check firewall isn't blocking port 8000
144
+
145
+ ### "Model not loaded"
146
+ - ✅ Check checkpoint files are in project root
147
+ - ✅ Verify file names: `vqa_checkpoint.pt` and `vqa_spatial_checkpoint.pt`
148
+ - ✅ Check backend terminal for error messages
149
+
150
+ ### App won't load on phone
151
+ - ✅ Ensure Expo Go is installed
152
+ - ✅ Check both devices on same network
153
+ - ✅ Try restarting Expo server (Ctrl+C, then `npm start`)
154
+ - ✅ Clear Expo cache: `npm start -- --clear`
155
+
156
+ ### "Permission denied" for camera/gallery
157
+ - ✅ Grant permissions when prompted
158
+ - ✅ Check phone settings → App permissions
159
+
160
+ ## Next Steps
161
+
162
+ Once everything works:
163
+
164
+ 1. **Set up Google OAuth** for production use
165
+ 2. **Customize the UI** in `src/styles/theme.js`
166
+ 3. **Add custom icons** in `assets/` folder
167
+ 4. **Build standalone app** with `eas build`
168
+
169
+ ## Quick Commands Reference
170
+
171
+ ```bash
172
+ # Start backend
173
+ cd c:\Users\rdeva\Downloads\vqa_coes
174
+ python backend_api.py
175
+
176
+ # Start mobile app
177
+ cd c:\Users\rdeva\Downloads\vqa_coes\ui
178
+ npm start
179
+
180
+ # Clear Expo cache
181
+ npm start -- --clear
182
+
183
+ # Install new package
184
+ npm install package-name
185
+
186
+ # Check backend health
187
+ curl http://localhost:8000/health
188
+ ```
189
+
190
+ ## Support
191
+
192
+ If you encounter issues:
193
+ 1. Check the main README.md
194
+ 2. Review backend terminal logs
195
+ 3. Check Expo console for errors
196
+ 4. Verify all prerequisites are met
README.md CHANGED
@@ -1,10 +1,206 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
- title: Vqa Backend
3
- emoji: 🐢
4
- colorFrom: indigo
5
- colorTo: purple
6
- sdk: docker
7
- pinned: false
 
 
 
8
  ---
9
 
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <div align="center">
2
+
3
+ # GenVQA — Generative Visual Question Answering
4
+
5
+ **A neuro-symbolic VQA system that detects objects with a neural model, retrieves structured facts from Wikidata, and generates grounded answers with Groq.**
6
+
7
+ [![Backend CI](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/backend-ci.yml/badge.svg)](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/backend-ci.yml)
8
+ [![UI CI](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/ui-ci.yml/badge.svg)](https://github.com/DevaRajan8/Generative-vqa/actions/workflows/ui-ci.yml)
9
+ ![Python](https://img.shields.io/badge/Python-3.10%2B-blue?logo=python)
10
+ ![License](https://img.shields.io/badge/License-MIT-green)
11
+
12
+ </div>
13
+
14
+ ---
15
+
16
+ ## Architecture
17
+
18
+ ```
19
+ ┌─────────────────────────────────────────────────────────────┐
20
+ │ CLIENT LAYER │
21
+ │ 📱 Expo Mobile App (React Native) │
22
+ │ • Image upload + question input │
23
+ │ • Displays answer + accessibility description │
24
+ └────────────────────────┬────────────────────────────────────┘
25
+ │ HTTP POST /api/answer
26
+
27
+ ┌─────────────────────────────────────────────────────────────┐
28
+ │ BACKEND LAYER (FastAPI) │
29
+ │ backend_api.py │
30
+ │ • Request handling, session management │
31
+ │ • Conversation Manager → multi-turn context tracking │
32
+ └────────────────────────┬────────────────────────────────────┘
33
+
34
+
35
+ ┌─────────────────────────────────────────────────────────────┐
36
+ │ ROUTING LAYER (ensemble_vqa_app.py) │
37
+ │ │
38
+ │ CLIP encodes question → compares against: │
39
+ │ "reasoning question" vs "visual/perceptual question" │
40
+ │ │
41
+ │ Reasoning? Visual? │
42
+ │ │ │ │
43
+ │ ▼ ▼ │
44
+ │ ┌─────────────────┐ ┌─────────────────────┐ │
45
+ │ │ NEURO-SYMBOLIC │ │ NEURAL VQA PATH │ │
46
+ │ │ │ │ │ │
47
+ │ │ 1. VQA model │ │ VQA model (GRU + │ │
48
+ │ │ detects obj │ │ Attention) predicts │ │
49
+ │ │ │ │ answer directly │ │
50
+ │ │ 2. Wikidata API │ └──────────┬──────────┘ │
51
+ │ │ fetches facts│ │ │
52
+ │ │ (P31, P2101, │ │ │
53
+ │ │ P2054, P186,│ │ │
54
+ │ │ P366 ...) │ │ │
55
+ │ │ │ │ │
56
+ │ │ 3. Groq LLM │ │ │
57
+ │ │ verbalizes │ │ │
58
+ │ │ from facts │ │ │
59
+ │ └─────────┬───────┘ │ │
60
+ │ └──────────────┬──────────┘ │
61
+ └────────────────────────── │ ─────────────────────────────┘
62
+
63
+
64
+ ┌─────────────────┐
65
+ │ GROQ SERVICE │
66
+ │ Accessibility │
67
+ │ description │
68
+ │ (2 sentences, │
69
+ │ screen-reader │
70
+ │ friendly) │
71
+                      └────────┬────────┘
72
+
73
+
74
+ JSON response
75
+ { answer, model_used,
76
+ kg_enhancement,
77
+ wikidata_entity,
78
+ description }
79
+ ```
80
+
81
+ | Layer | Component | Role |
82
+ |---|---|---|
83
+ | **Client** | Expo React Native | Image upload, question input, answer display |
84
+ | **API** | FastAPI (`backend_api.py`) | Routing, sessions, conversation state |
85
+ | **Conversation** | `conversation_manager.py` | Multi-turn context, history tracking |
86
+ | **Router** | CLIP (in `ensemble_vqa_app.py`) | Classifies question as reasoning vs visual |
87
+ | **Neural VQA** | GRU + Attention (`model.py`) | Answers visual questions directly from image |
88
+ | **Neuro-Symbolic** | `semantic_neurosymbolic_vqa.py` | VQA detects objects → Wikidata fetches facts → Groq verbalizes |
89
+ | **Accessibility** | `groq_service.py` | Generates spoken-friendly 2-sentence description for every answer |
90
+
91
+ ---
92
+
93
+ ## Features
94
+
95
+ - 🔍 **Visual Question Answering** — trained on VQAv2, fine-tuned on custom data
96
+ - 🧠 **Neuro-Symbolic Routing** — CLIP semantically classifies questions as _reasoning_ vs _visual_, routes accordingly
97
+ - 🌐 **Live Wikidata Facts** — queries physical properties, categories, materials, uses in real time
98
+ - 🤖 **Groq Verbalization** — Llama 3.3 70B answers from structured facts, not hallucination
99
+ - 💬 **Conversational Support** — multi-turn conversation manager with context tracking
100
+ - 📱 **Expo Mobile UI** — React Native app for iOS/Android/Web
101
+ - ♿ **Accessibility** — Groq generates spoken-friendly descriptions for every answer
102
+
103
+ ---
104
+
105
+ ## Quick Start
106
+
107
+ ### 1 — Backend
108
+
109
+ ```bash
110
+ # Clone and install
111
+ git clone https://github.com/DevaRajan8/Generative-vqa.git
112
+ cd Generative-vqa
113
+ pip install -r requirements_api.txt
114
+
115
+ # Set your Groq API key
116
+ cp .env.example .env
117
+ # Edit .env → GROQ_API_KEY=your_key_here
118
+
119
+ # Start API
120
+ python backend_api.py
121
+ # → http://localhost:8000
122
+ ```
123
+
124
+ ### 2 — Mobile UI
125
+
126
+ ```bash
127
+ cd ui
128
+ npm install
129
+ npx expo start --clear
130
+ ```
131
+
132
+ > Scan the QR code with Expo Go, or press `w` for browser.
133
+
134
+ ---
135
+
136
+ ## API
137
+
138
+ | Endpoint | Method | Description |
139
+ |---|---|---|
140
+ | `/api/answer` | POST | Answer a question about an uploaded image |
141
+ | `/api/health` | GET | Health check |
142
+ | `/api/conversation/new` | POST | Start a new conversation session |
143
+
144
+ **Example:**
145
+
146
+ ```bash
147
+ curl -X POST http://localhost:8000/api/answer \
148
+ -F "image=@photo.jpg" \
149
+ -F "question=Can this melt?"
150
+ ```
151
+
152
+ **Response:**
153
+
154
+ ```json
155
+ {
156
+ "answer": "ice",
157
+ "model_used": "neuro-symbolic",
158
+ "kg_enhancement": "Yes — ice can melt. [Wikidata P2101: melting point = 0.0 °C]",
159
+ "knowledge_source": "VQA (neural) + Wikidata (symbolic) + Groq (verbalize)",
160
+ "wikidata_entity": "Q86"
161
+ }
162
+ ```
163
+
164
+ ---
165
+
166
+ ## Project Structure
167
+
168
+ ```
169
+ ├── backend_api.py # FastAPI server
170
+ ├── ensemble_vqa_app.py # VQA orchestrator (routing + inference)
171
+ ├── semantic_neurosymbolic_vqa.py # Wikidata KB + Groq verbalizer
172
+ ├── groq_service.py # Groq accessibility descriptions
173
+ ├── conversation_manager.py # Multi-turn conversation tracking
174
+ ├── model.py # VQA model definition
175
+ ├── train.py # Training pipeline
176
+ ├── ui/ # Expo React Native app
177
+ │ └── src/screens/HomeScreen.js
178
+ └── .github/
179
+ ├── workflows/ # CI — backend lint + UI build
180
+ └── ISSUE_TEMPLATE/
181
+ ```
182
+
183
  ---
184
+
185
+ ## Environment Variables
186
+
187
+ | Variable | Required | Description |
188
+ |---|---|---|
189
+ | `GROQ_API_KEY` | ✅ | Groq API key — [get one free](https://console.groq.com) |
190
+ | `MODEL_PATH` | optional | Path to VQA checkpoint (default: `vqa_checkpoint.pt`) |
191
+ | `PORT` | optional | API server port (default: `8000`) |
192
+
193
  ---
194
 
195
+ ## Requirements
196
+
197
+ - Python 3.10+
198
+ - CUDA GPU recommended (CPU works but is slow)
199
+ - Node.js 20+ (for UI)
200
+ - Groq API key (free tier available)
201
+
202
+ ---
203
+
204
+ ## License
205
+
206
+ MIT © [DevaRajan8](https://github.com/DevaRajan8)
README_COMPLETE.md ADDED
@@ -0,0 +1,530 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <div align="center">
2
+
3
+ # 🧠 GenVQA — Generative Visual Question Answering
4
+
5
+ **A hybrid neuro-symbolic VQA system that intelligently routes between pure neural networks and knowledge-grounded reasoning**
6
+
7
+ </div>
8
+
9
+ ---
10
+
11
+ ## Overview
12
+
13
+ GenVQA is an advanced Visual Question Answering system that combines the best of both worlds:
14
+
15
+ - **Neural networks** for perception-based visual questions
16
+ - **Symbolic reasoning** for knowledge-intensive reasoning questions
17
+
18
+ The system automatically classifies incoming questions and routes them to the optimal processing pipeline, ensuring accurate and grounded answers.
19
+
20
+ ---
21
+
22
+ ## System Architecture
23
+
24
+ ```
25
+ ┌──────────────────────────────────────────────────────────────────┐
26
+ │ CLIENT │
27
+ │ Expo React Native App (iOS/Android/Web) │
28
+ │ • Image upload via camera/gallery │
29
+ │ • Question input with suggested prompts │
30
+ │ • Multi-turn conversational interface │
31
+ │ • Google OAuth authentication │
32
+ └───────────────────────────┬──────────────────────────────────────┘
33
+ │ HTTP POST /api/answer
34
+
35
+ ┌──────────────────────────────────────────────────────────────────┐
36
+ │ BACKEND API LAYER │
37
+ │ FastAPI (backend_api.py) │
38
+ │ • Request handling & validation │
39
+ │ • Session management & authentication │
40
+ │ • Multi-turn conversation tracking │
41
+ └───────────────────────────┬──────────────────────────────────────┘
42
+
43
+
44
+ ┌──────────────────────────────────────────────────────────────────┐
45
+ │ INTELLIGENT ROUTING LAYER │
46
+ │ (ensemble_vqa_app.py) │
47
+ │ │
48
+ │ CLIP Semantic Classifier: │
49
+ │ Encodes question → Compares similarity: │
50
+ │ "This is a reasoning question about facts" │
51
+ │ vs │
52
+ │ "This is a visual perception question" │
53
+ │ │
54
+ │                    Similarity > threshold?                       │
55
+
56
+ │ ├─────────┬────────┐ │
57
+ │ │ │ │ │
58
+ │ REASONING VISUAL SPATIAL │
59
+ │ │ │ │ │
60
+ └─────────────────────┼─────────┼────────┼─────────────────────────┘
61
+ │ │ │
62
+ ┌─────────────┘ │ └─────────────┐
63
+ ▼ ▼ ▼
64
+ ┌──────────────────┐ ┌───────────────────┐ ┌─────────────────┐
65
+ │ NEURO-SYMBOLIC │ │ NEURAL VQA PATH │ │ SPATIAL ADAPTER │
66
+ │ PIPELINE │ │ │ │ PATH │
67
+ │ │ │ CLIP + GRU + │ │ │
68
+ │ ① VQA Model │ │ Attention │ │ Enhanced with │
69
+ │ Detects │ │ │ │ spatial │
70
+ │ Objects │ │ Direct answer │ │ self-attention │
71
+ │ (e.g. "soup") │ │ prediction from │ │ for left/right │
72
+ │ │ │ image features │ │ above/below │
73
+ │ ② Wikidata API │ │ │ │ questions │
74
+ │ Fetches Facts │ │ Outputs: │ │ │
75
+ │ P31: category │ │ "red" │ │ Outputs: │
76
+ │ P186: material│ └───────┬───────────┘ │ "on the left" │
77
+ │ P2101: melting│ │ └────────┬────────┘
78
+ │ P366: use │ │ │
79
+ │ P2054: density│ │ │
80
+ │ │ │ │
81
+ │ ③ Groq LLM │ │ │
82
+ │ Verbalizes │ │ │
83
+ │ from facts │ │ │
84
+ │ (instead
85
+ of free │ │ │
86
+ │ reasoning) │ │ │
87
+ │ │ │ │
88
+ │ Outputs: │ │ │
89
+ │ "Soup is made of │ │ │
90
+ │ water and │ │ │
91
+ │ vegetables, │ │ │
92
+ │ used for eating"│ │ │
93
+ └────────┬─────────┘ │ │
94
+ │ │ │
95
+ └──────────┬──────────┴────────────────────────┘
96
+
97
+ ┌──────────────────────┐
98
+ │ GROQ ACCESSIBILITY │
99
+ │ SERVICE │
100
+ │ │
101
+ │ Generates 2-sentence│
102
+ │ screen-reader │
103
+ │ friendly description│
104
+ │ for every answer │
105
+ └──────────┬───────────┘
106
+
107
+
108
+ JSON Response
109
+ {
110
+ "answer": "...",
111
+ "model_used": "neuro_symbolic|base|spatial",
112
+ "confidence": 0.85,
113
+ "kg_enhancement": true/false,
114
+ "wikidata_entity": "Q123456",
115
+ "description": "...",
116
+ "session_id": "..."
117
+ }
118
+ ```
119
+
120
+ ---
121
+
122
+ ## Neural vs Neuro-Symbolic: Deep Dive
123
+
124
+ ### Neural Pathway
125
+
126
+ **When Used**: Perceptual questions about what's directly visible
127
+
128
+ - _"What color is the car?"_
129
+ - _"How many people are in the image?"_
130
+ - _"Is the dog sitting or standing?"_
131
+
132
+ **Architecture**:
133
+
134
+ ```
135
+ Image Input
136
+
137
+
138
+ ┌─────────────────────────────┐
139
+ │ CLIP Vision Encoder │
140
+ │ (ViT-B/16) │
141
+ │ • Pre-trained on 400M │
142
+ │ image-text pairs │
143
+ │ • 512-dim embeddings │
144
+ └──────────┬──────────────────┘
145
+
146
+
147
+ [512-dim vector] ────────────┐
148
+
149
+ Question Input │
150
+ │ │
151
+ ▼ │
152
+ ┌─────────────────────────────┐ │
153
+ │ GPT-2 Text Encoder │ │
154
+ │ (distilgpt2) │ │
155
+ │ • Contextual embeddings │ │
156
+ │ • 768-dim output │ │
157
+ └──────────┬──────────────────┘ │
158
+ │ │
159
+ ▼ │
160
+ [768-dim vector] │
161
+ │ │
162
+ ▼ │
163
+ ┌──────────────┐ │
164
+ │ Linear Proj │ │
165
+ │ 768 → 512 │ │
166
+ └──────┬───────┘ │
167
+ │ │
168
+ └───────────┬───────────┘
169
+
170
+
171
+ ┌──────────────────────┐
172
+ │ Multimodal Fusion │
173
+ │ • Gated combination │
174
+ │ • 3-layer MLP │
175
+ │ • ReLU + Dropout │
176
+ └──────────┬───────────┘
177
+
178
+
179
+ ┌──────────────────────┐
180
+ │ GRU Decoder with │
181
+ │ Attention Mechanism │
182
+ │ │
183
+ │ • Hidden: 512-dim │
184
+ │ • 2 layers │
185
+ │ • Seq2seq decoding │
186
+ │ • Attention over │
187
+ │ fused features │
188
+ └──────────┬───────────┘
189
+
190
+
191
+ Answer Tokens
192
+ "red car"
193
+ ```
194
+
195
+ **Key Components**:
196
+
197
+ - **CLIP**: Zero-shot image understanding, robust to domain shift
198
+ - **GPT-2**: Contextual question encoding
199
+ - **Attention**: Decoder focuses on relevant image regions per word
200
+ - **GRU**: Sequential answer generation with memory
201
+
202
+ **Training**:
203
+
204
+ - Dataset: VQA v2 (curated, balanced subset)
205
+ - Loss: Cross-entropy over answer vocabulary
206
+ - Fine-tuning: Last 2 CLIP layers + full decoder
207
+ - Accuracy: ~39% on general VQA, ~28% on spatial questions
208
+
209
+ ---
210
+
211
+ ### Neuro-Symbolic Pathway (Knowledge-Grounded Reasoning)
212
+
213
+ **When Used**: Questions requiring external knowledge or reasoning
214
+
215
+ - _"Can soup melt?"_
216
+ - _"What is ice cream made of?"_
217
+ - _"Does this float in water?"_
218
+
219
+ **Architecture**:
220
+
221
+ ```
222
+ Step 1: NEURAL DETECTION
223
+ ─────────────────────────
224
+ Image + Question
225
+
226
+
227
+ ┌──────────────────────┐
228
+ │ VQA Model │
229
+ │ (same as above) │
230
+ │ │
231
+ │ Predicts: "soup" │
232
+ └──────────┬───────────┘
233
+
234
+
235
+ Detected Object
236
+ "soup"
237
+
238
+ Step 2: SYMBOLIC FACT RETRIEVAL
239
+ ────────────────────────────────
240
+ "soup"
241
+
242
+
243
+ ┌──────────────────────────────────┐
244
+ │ Wikidata SPARQL Queries │
245
+ │ │
246
+ │ ① Entity Resolution: │
247
+ │ "soup" → Q41415 (Wikidata ID) │
248
+ │ │
249
+ │ ② Fetch ALL Relevant Properties: │
250
+ │ │
251
+ │ P31 (instance of): │
252
+ │ → "food" │
253
+ │ → "liquid food" │
254
+ │ → "dish" │
255
+ │ │
256
+ │ P186 (made of): │
257
+ │ → "water" │
258
+ │ → "vegetables" │
259
+ │ → "broth" │
260
+ │ │
261
+ │ P366 (used for): │
262
+ │ → "consumption" │
263
+ │ → "nutrition" │
264
+ │ │
265
+ │ P2101 (melting point): │
266
+ │ → (not found) │
267
+ │ │
268
+ │ P2054 (density): │
269
+ │ → ~1000 kg/m³ │
270
+ │ → (floats/sinks calc) │
271
+ │ │
272
+ │ P2777 (flash point): │
273
+ │ → (not found) │
274
+ └──────────────┬───────────────────┘
275
+
276
+
277
+ Structured Knowledge Graph
278
+ {
279
+ "entity": "soup (Q41415)",
280
+ "categories": ["food", "liquid"],
281
+ "materials": ["water", "vegetables"],
282
+ "uses": ["consumption"],
283
+ "density": 1000,
284
+ "melting_point": null
285
+ }
286
+
287
+ Step 3: LLM VERBALIZATION (NOT REASONING!)
288
+ ───────────────────────────────────────────
289
+ Knowledge Graph
290
+
291
+
292
+ ┌────────────────────────────────────┐
293
+ │ Groq API │
294
+ │ (Llama 3.3 70B) │
295
+ │ │
296
+ │ System Prompt: │
297
+ │ "You are a fact verbalizer. │
298
+ │ Answer ONLY from provided │
299
+ │ Wikidata facts. Do NOT use │
300
+ │ your training knowledge. │
301
+ │ If facts don't contain the │
302
+ │ answer, say 'unknown from │
303
+ │ available data'." │
304
+ │ │
305
+ │ User Input: │
306
+ │ Question: "Can soup melt?" │
307
+ │ Facts: {structured data above} │
308
+ └────────────┬───────────────────────┘
309
+
310
+
311
+ Natural Language Answer
312
+ "According to Wikidata, soup is
313
+ a liquid food made of water and
314
+ vegetables. Since it's already
315
+ liquid, it doesn't have a melting
316
+ point like solids do. It can
317
+ freeze, but not melt."
318
+ ```
319
+
320
+ **Critical Design Principle**:
321
+
322
+ > Groq is a **verbalizer**, NOT a reasoner. All reasoning happens in the symbolic layer (Wikidata facts). Groq only translates structured facts into natural language.
323
+
324
+ **Why This Matters**:
325
+
326
+ - **Without facts**: Groq hallucinates from training data
327
+ - **With facts**: Groq grounds answers in real-time data
328
+ - **Result**: Factual accuracy, no made-up information
329
+
330
+ **Knowledge Base Properties Fetched**:
331
+ | Property | Wikidata Code | Example Value |
332
+ |----------|---------------|---------------|
333
+ | Category | P31 | "food", "tool", "animal" |
334
+ | Material | P186 | "metal", "wood", "plastic" |
335
+ | Melting Point | P2101 | 273.15 K (0°C) |
336
+ | Density | P2054 | 917 kg/m³ (floats/sinks) |
337
+ | Use | P366 | "eating", "transportation" |
338
+ | Flash Point | P2777 | 310 K (flammable) |
339
+ | Location | P276 | "ocean", "forest" |
340
+
341
+ ---
342
+
343
+ ### Spatial Reasoning Pathway
344
+
345
+ **When Used**: Questions about relative positions
346
+
347
+ - _"What is to the left of the car?"_
348
+ - _"Is the cat above or below the table?"_
349
+
350
+ **Architecture Enhancement**:
351
+
352
+ ```
353
+ Base VQA Model
354
+
355
+
356
+ ┌──────────────────────────────┐
357
+ │ Spatial Self-Attention │
358
+ │ • Multi-head attention (8) │
359
+ │ • Learns spatial relations │
360
+ │ • Position-aware weighting │
361
+ └──────────┬───────────────────┘
362
+
363
+
364
+ Spatial-aware answer
365
+ "on the left side"
366
+ ```
367
+
368
+ **Keyword Triggering**:
369
+
370
+ - Detects: `left`, `right`, `above`, `below`, `top`, `bottom`, `next to`, `behind`, `between`, etc.
371
+ - Routes to spatial adapter model
372
+ - Enhanced accuracy on positional questions
373
+
374
+ ---
375
+
376
+ ## Intelligent Routing System
377
+
378
+ **CLIP-Based Semantic Routing**:
379
+
380
+ ```python
381
+ # Encode question with CLIP
382
+ question_embedding = clip.encode_text(question)
383
+
384
+ # Compare against two templates
385
+ reasoning_prompt = "This is a reasoning question about facts and knowledge"
386
+ visual_prompt = "This is a visual perception question about what you see"
387
+
388
+ reasoning_similarity = cosine_similarity(question_embedding,
389
+ clip.encode_text(reasoning_prompt))
390
+ visual_similarity = cosine_similarity(question_embedding,
391
+ clip.encode_text(visual_prompt))
392
+
393
+ # Route decision
394
+ if reasoning_similarity > visual_similarity + THRESHOLD:
395
+ route_to_neuro_symbolic()
396
+ elif contains_spatial_keywords(question):
397
+ route_to_spatial_adapter()
398
+ else:
399
+ route_to_base_neural()
400
+ ```
401
+
402
+ **Routing Logic**:
403
+
404
+ 1. **Neuro-Symbolic** if CLIP classifies as reasoning (>0.6 similarity)
405
+ 2. **Spatial** if contains spatial keywords (`left`, `right`, `above`, etc.)
406
+ 3. **Base Neural** for all other visual perception questions
407
+
408
+ ---
409
+
410
+ ## Multi-Turn Conversation Support
411
+
412
+ **Conversation Manager Features**:
413
+
414
+ - Session tracking with UUID
415
+ - Context retention across turns
416
+ - Pronoun resolution (`it`, `this`, `that` → previous object)
417
+ - Automatic session expiry (30 min timeout)
418
+
419
+ **Example Conversation**:
420
+
421
+ ```
422
+ Turn 1:
423
+ User: "What is this?"
424
+ VQA: "A red car"
425
+ Objects: ["car"]
426
+
427
+ Turn 2:
428
+ User: "Can it float?" # "it" = "car"
429
+ System: Resolves "it" → "car"
430
+ VQA: [Neuro-Symbolic] "According to Wikidata, cars are made
431
+ of metal and plastic with density around 800-1000 kg/m³,
432
+ which is close to water. Most cars would sink."
433
+
434
+ Turn 3:
435
+ User: "What color is it again?" # Still referring to car
436
+ VQA: [Neural] "red" # From Turn 1 context
437
+ ```
438
+
439
+ ---
440
+
441
+ ## Quick Start
442
+
443
+ ### Prerequisites
444
+
445
+ - Python 3.10+
446
+ - CUDA GPU (recommended, 4GB+ VRAM)
447
+ - Node.js 16+ (for mobile UI)
448
+ - Groq API key ([get one free](https://console.groq.com))
449
+
450
+ ### Backend Setup
451
+
452
+ ```bash
453
+ # 1. Clone repository
454
+ git clone https://github.com/YourUsername/vqa_coes.git
455
+ cd vqa_coes
456
+
457
+ # 2. Install dependencies
458
+ pip install -r requirements_api.txt
459
+
460
+ # 3. Set environment variables
461
+ echo "GROQ_API_KEY=your_groq_api_key_here" > .env
462
+
463
+ # 4. Download model checkpoints (if not included)
464
+ # Ensure these files exist in project root:
465
+ # - vqa_checkpoint.pt (base model)
466
+ # - vqa_spatial_checkpoint.pt (spatial model)
467
+
468
+ # 5. Start API server
469
+ python backend_api.py
470
+
471
+ # Server will start at http://localhost:8000
472
+ ```
473
+
474
+ ### Mobile UI Setup
475
+
476
+ ```bash
477
+ # 1. Navigate to UI folder
478
+ cd ui
479
+
480
+ # 2. Install dependencies
481
+ npm install
482
+
483
+ # 3. Configure API endpoint
484
+ # Edit ui/src/config/api.js
485
+ # Change: export const API_BASE_URL = 'http://YOUR_LOCAL_IP:8000';
486
+
487
+ # 4. Start Expo
488
+ npx expo start --clear
489
+
490
+ # Scan QR code with Expo Go app, or press 'w' for web
491
+ ```
492
+
493
+ ---
494
+
495
+ ## 🔧 API Reference
496
+
497
+ ### POST `/api/answer`
498
+
499
+ Answer a visual question with optional conversation context.
500
+
501
+ **Request**:
502
+
503
+ ```bash
504
+ curl -X POST http://localhost:8000/api/answer \
505
+ -F "image=@photo.jpg" \
506
+ -F "question=Can this float in water?" \
507
+ -F "session_id=optional-uuid-here"
508
+ ```
509
+
510
+ **Response**:
511
+
512
+ ```json
513
+ {
514
+ "answer": "According to Wikidata, this object has a density of 917 kg/m³, which is less than water (1000 kg/m³), so it would float.",
515
+ "model_used": "neuro_symbolic",
516
+ "confidence": 0.87,
517
+ "kg_enhancement": true,
518
+ "wikidata_entity": "Q41576",
519
+ "description": "The object appears to be made of ice. Based on its physical properties from scientific data, it would float on water due to lower density.",
520
+ "session_id": "550e8400-e29b-41d4-a716-446655440000",
521
+ "conversation_turn": 2
522
+ }
523
+
524
+
525
+ ## 📄 License
526
+
527
+ MIT License - see LICENSE file for details
528
+
529
+ ---
530
+ ```
SETUP_GUIDE.md ADDED
@@ -0,0 +1,118 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # VQA Accessibility Enhancement - Setup Guide
2
+
3
+ ## Backend Setup
4
+
5
+ ### 1. Install Python Dependencies
6
+ ```bash
7
+ cd c:\Users\rdeva\Downloads\vqa_coes
8
+ pip install -r requirements_api.txt
9
+ ```
10
+
11
+ ### 2. Configure Groq API Key
12
+
13
+ 1. Get your Groq API key from: https://console.groq.com/keys
14
+ 2. Create a `.env` file in the project root:
15
+ ```bash
16
+ copy .env.example .env
17
+ ```
18
+ 3. Edit `.env` and add your API key:
19
+ ```
20
+ GROQ_API_KEY=your_actual_groq_api_key_here
21
+ ```
22
+
23
+ ### 3. Start Backend Server
24
+ ```bash
25
+ python backend_api.py
26
+ ```
27
+
28
+ The server will start on `http://localhost:8000`
29
+
30
+ ---
31
+
32
+ ## Frontend Setup
33
+
34
+ ### 1. Install Node Dependencies
35
+ ```bash
36
+ cd ui
37
+ npm install
38
+ ```
39
+
40
+ This will install the new `expo-speech` package for text-to-speech functionality.
41
+
42
+ ### 2. Start Expo App
43
+ ```bash
44
+ npm start
45
+ ```
46
+
47
+ Then:
48
+ - Press `a` for Android emulator
49
+ - Press `i` for iOS simulator
50
+ - Scan QR code with Expo Go app for physical device
51
+
52
+ ---
53
+
54
+ ## Testing the Features
55
+
56
+ ### Image Display Fix
57
+ 1. Open the app
58
+ 2. Tap "Camera" or "Gallery" to select an image
59
+ 3. **Expected**: Image should display correctly (no blank screen)
60
+
61
+ ### LLM Description Feature
62
+ 1. Upload an image
63
+ 2. Enter a question (e.g., "What color is the car?")
64
+ 3. Tap "Ask Question"
65
+ 4. **Expected**:
66
+ - Original answer appears in the "Answer" card
67
+ - "Accessible Description" card appears below with 2-sentence description
68
+ - Speaker icon button is visible
69
+
70
+ ### Text-to-Speech
71
+ 1. After getting an answer with description
72
+ 2. Tap the speaker icon (🔊) in the "Accessible Description" card
73
+ 3. **Expected**: The description is read aloud
74
+ 4. Tap the stop icon (⏹️) to stop playback
75
+
76
+ ---
77
+
78
+ ## Troubleshooting
79
+
80
+ ### Backend Issues
81
+
82
+ **Groq API Key Error**
83
+ ```
84
+ ValueError: Groq API key not found
85
+ ```
86
+ **Solution**: Make sure `.env` file exists with `GROQ_API_KEY=your_key`
87
+
88
+ **Models Not Loading**
89
+ ```
90
+ ❌ Base checkpoint not found
91
+ ```
92
+ **Solution**: Ensure `vqa_checkpoint.pt` and `vqa_spatial_checkpoint.pt` are in the project root
93
+
94
+ ### Frontend Issues
95
+
96
+ **Image Not Displaying**
97
+ - Make sure you've run `npm install` to get the latest `expo-image` package
98
+ - Check console logs for image URI format issues
99
+
100
+ **Text-to-Speech Not Working**
101
+ - Ensure device volume is turned up
102
+ - Check that `expo-speech` package is installed
103
+ - On iOS simulator, speech may not work (test on physical device)
104
+
105
+ **Cannot Connect to Backend**
106
+ - Verify backend is running on port 8000
107
+ - Update `ui/src/config/api.js` with correct backend URL
108
+ - For physical devices, use ngrok or your computer's local IP
109
+
110
+ ---
111
+
112
+ ## Features Summary
113
+
114
+ ✅ **Fixed**: Image display issue (using expo-image instead of react-native Image)
115
+ ✅ **Added**: Groq LLM integration for 2-sentence descriptions
116
+ ✅ **Added**: Text-to-speech accessibility feature
117
+ ✅ **Added**: Visual distinction between raw answer and description
118
+ ✅ **Added**: Fallback mode when Groq API is unavailable
VQA_ENHANCEMENTS.md ADDED
@@ -0,0 +1,298 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # VQA Enhancements: LLM Reasoning & Conversational VQA
2
+
3
+ This document describes the two major enhancements added to the VQA system.
4
+
5
+ ## 🧠 Feature 1: LLM-Driven Reasoning Engine
6
+
7
+ ### Overview
8
+ Replaced hardcoded if/else rules with **Groq LLM Chain-of-Thought reasoning** for intelligent deductive reasoning from Wikidata facts.
9
+
10
+ ### What Changed
11
+ **Before**: Hardcoded rules in `semantic_neurosymbolic_vqa.py`
12
+ ```python
13
+ if 'melt' in question:
14
+ check material properties...
15
+ ```
16
+
17
+ **After**: LLM-driven reasoning
18
+ ```python
19
+ reasoning_result = llm_service.reason_with_facts(
20
+ object_name="candle",
21
+ facts={"materials": ["wax"], "categories": ["light source"]},
22
+ question="Can this melt?"
23
+ )
24
+ # Returns: Chain-of-Thought reasoning + answer
25
+ ```
26
+
27
+ ### Benefits
28
+ - ✅ Handles complex questions like "Would this survive a fire?"
29
+ - ✅ Provides transparent reasoning chains
30
+ - ✅ More flexible and generalizable
31
+ - ✅ Automatic fallback to rule-based reasoning if LLM fails
32
+
33
+ ### Example
34
+ **Question**: "Can this melt?"
35
+ **Object**: Candle
36
+ **Facts**: Material: wax, Category: light source
37
+
38
+ **LLM Reasoning Chain**:
39
+ 1. The object is a candle
40
+ 2. It is made of wax
41
+ 3. Wax has a low melting point (~60°C)
42
+ 4. Therefore, yes, it can melt at moderate temperatures
43
+
44
+ **Answer**: "Yes, the candle can melt because it's made of wax, which has a low melting point."
45
+
46
+ ### Files Added/Modified
47
+ - **NEW**: `llm_reasoning_service.py` - LLM reasoning with Chain-of-Thought
48
+ - **MODIFIED**: `semantic_neurosymbolic_vqa.py` - Integrated LLM reasoning
49
+ - **MODIFIED**: `backend_api.py` - Added reasoning_chain to API responses
50
+
51
+ ---
52
+
53
+ ## 💬 Feature 2: Conversational VQA
54
+
55
+ ### Overview
56
+ Added **multi-turn conversation support** with context management and pronoun resolution.
57
+
58
+ ### What Changed
59
+ **Before**: Single-shot Q&A with no context
60
+ ```
61
+ User: "What is this?" → System: "A red apple."
62
+ User: "Is it healthy?" → System: "What is 'it'?" ❌
63
+ ```
64
+
65
+ **After**: Multi-turn conversations
66
+ ```
67
+ User: "What is this?" → System: "A red apple."
68
+ User: "Is it healthy?" → System: "Yes, apples are rich in fiber..." ✅
69
+ (System knows "it" = apple)
70
+ ```
71
+
72
+ ### Benefits
73
+ - ✅ Natural follow-up questions
74
+ - ✅ Context-aware pronoun resolution
75
+ - ✅ Session management with auto-expiration
76
+ - ✅ Conversation history tracking
77
+
78
+ ### Example Conversation
79
+ ```
80
+ Turn 1:
81
+ Q: "What is this?"
82
+ A: "A red apple"
83
+ Objects: ["apple"]
84
+
85
+ Turn 2:
86
+ Q: "Is it healthy?"
87
+ Resolved: "Is apple healthy?"
88
+ A: "Yes, apples are rich in fiber and vitamins"
89
+
90
+ Turn 3:
91
+ Q: "What color is it?"
92
+ Resolved: "What color is apple?"
93
+ A: "Red"
94
+ ```
95
+
96
+ ### Files Added/Modified
97
+ - **NEW**: `conversation_manager.py` - Multi-turn conversation management
98
+ - **MODIFIED**: `ensemble_vqa_app.py` - Added `answer_conversational()` method
99
+ - **MODIFIED**: `backend_api.py` - Added conversation endpoints
100
+
101
+ ---
102
+
103
+ ## 🚀 API Endpoints
104
+
105
+ ### Existing Endpoint (Enhanced)
106
+ **POST** `/api/answer`
107
+ - Now includes `reasoning_chain` in response
108
+ - Backward compatible
109
+
110
+ ### New Conversation Endpoints
111
+
112
+ **POST** `/api/conversation/answer`
113
+ - Multi-turn conversation support
114
+ - Request: `image`, `question`, `session_id` (optional)
115
+ - Response includes:
116
+ - `session_id` - For continuing conversation
117
+ - `resolved_question` - Question with pronouns resolved
118
+ - `conversation_context` - Previous turns, objects, etc.
119
+ - `reasoning_chain` - LLM reasoning steps (if applicable)
120
+
121
+ **GET** `/api/conversation/{session_id}/history`
122
+ - Get full conversation history
123
+ - Returns all turns with timestamps
124
+
125
+ **DELETE** `/api/conversation/{session_id}`
126
+ - Clear conversation session
127
+ - Useful for starting fresh
128
+
129
+ ---
130
+
131
+ ## 📋 Usage Examples
132
+
133
+ ### Example 1: LLM Reasoning (Python)
134
+ ```python
135
+ from llm_reasoning_service import get_llm_reasoning_service
136
+
137
+ service = get_llm_reasoning_service()
138
+
139
+ result = service.reason_with_facts(
140
+ object_name="ice cream",
141
+ facts={
142
+ "materials": ["milk", "sugar", "cream"],
143
+ "categories": ["frozen dessert"]
144
+ },
145
+ question="Would this survive in the desert?"
146
+ )
147
+
148
+ print(result['answer'])
149
+ # "No, ice cream would not survive in the desert because..."
150
+
151
+ print(result['reasoning_chain'])
152
+ # ["Ice cream is a frozen dessert", "Deserts are hot...", ...]
153
+ ```
154
+
155
+ ### Example 2: Conversational VQA (API)
156
+ ```bash
157
+ # Turn 1: Ask what it is
158
+ curl -X POST http://localhost:8000/api/conversation/answer \
159
+ -F "image=@apple.jpg" \
160
+ -F "question=What is this?"
161
+
162
+ # Response: {"session_id": "abc123", "answer": "apple", ...}
163
+
164
+ # Turn 2: Follow-up question with pronoun
165
+ curl -X POST http://localhost:8000/api/conversation/answer \
166
+ -F "image=@apple.jpg" \
167
+ -F "question=Is it healthy?" \
168
+ -F "session_id=abc123"
169
+
170
+ # Response: {
171
+ # "resolved_question": "Is apple healthy?",
172
+ # "answer": "Yes, apples are healthy",
173
+ # "conversation_context": {"turn_number": 2, ...}
174
+ # }
175
+ ```
176
+
177
+ ### Example 3: Conversational VQA (Python)
178
+ ```python
179
+ from ensemble_vqa_app import ProductionEnsembleVQA
180
+
181
+ ensemble = ProductionEnsembleVQA(
182
+ base_checkpoint="vqa_checkpoint.pt",
183
+ spatial_checkpoint="vqa_spatial_checkpoint.pt"
184
+ )
185
+
186
+ # Turn 1
187
+ result1 = ensemble.answer_conversational(
188
+ image_path="apple.jpg",
189
+ question="What is this?",
190
+ verbose=True
191
+ )
192
+ session_id = result1['session_id']
193
+ print(f"Answer: {result1['answer']}") # "apple"
194
+
195
+ # Turn 2 - pronoun resolution
196
+ result2 = ensemble.answer_conversational(
197
+ image_path="apple.jpg",
198
+ question="Is it healthy?",
199
+ session_id=session_id,
200
+ verbose=True
201
+ )
202
+ print(f"Resolved: {result2['resolved_question']}") # "Is apple healthy?"
203
+ print(f"Answer: {result2['answer']}") # "Yes, apples are healthy"
204
+ ```
205
+
206
+ ---
207
+
208
+ ## ⚙️ Configuration
209
+
210
+ ### Environment Variables
211
+ ```bash
212
+ # Required for LLM reasoning
213
+ GROQ_API_KEY=your_groq_api_key_here
214
+ ```
215
+
216
+ ### Session Timeout
217
+ Conversations expire after **30 minutes** of inactivity (configurable in `ConversationManager`).
218
+
219
+ ---
220
+
221
+ ## 🧪 Testing
222
+
223
+ Run the test suite:
224
+ ```bash
225
+ python test_vqa_enhancements.py
226
+ ```
227
+
228
+ Tests include:
229
+ - ✅ LLM reasoning with various question types
230
+ - ✅ Conversation manager pronoun resolution
231
+ - ✅ Session management and expiration
232
+ - ✅ Integration with existing VQA system
233
+
234
+ ---
235
+
236
+ ## 🔄 Backward Compatibility
237
+
238
+ **All existing functionality remains intact:**
239
+ - ✅ Original `/api/answer` endpoint works unchanged
240
+ - ✅ Single-shot Q&A still supported
241
+ - ✅ Spatial routing unchanged
242
+ - ✅ Neuro-symbolic fallback preserved
243
+
244
+ **New features are opt-in:**
245
+ - Use `/api/conversation/answer` for multi-turn
246
+ - LLM reasoning activates automatically for reasoning questions
247
+ - Fallback to rule-based if LLM unavailable
248
+
249
+ ---
250
+
251
+ ## 📊 Architecture
252
+
253
```
User Question
      ↓
Ensemble VQA
      ↓
┌─────────────────────────────────┐
│ Conversation Manager            │
│ - Resolve pronouns              │
│ - Track context                 │
└─────────────────────────────────┘
      ↓
┌─────────────────────────────────┐
│ Semantic Neuro-Symbolic VQA     │
│ - Detect objects (VQA)          │
│ - Query Wikidata                │
└─────────────────────────────────┘
      ↓
┌─────────────────────────────────┐
│ LLM Reasoning Service           │
│ - Chain-of-Thought reasoning    │
│ - Fallback to rules             │
└─────────────────────────────────┘
      ↓
Answer + Reasoning Chain
```
278
+
279
+ ---
280
+
281
+ ## 🎯 Key Improvements
282
+
283
+ | Feature | Before | After |
284
+ |---------|--------|-------|
285
+ | **Reasoning** | Hardcoded if/else rules | LLM Chain-of-Thought |
286
+ | **Conversations** | Single-shot only | Multi-turn with context |
287
+ | **Pronouns** | Not handled | Automatic resolution |
288
+ | **Transparency** | Black box | Reasoning chains visible |
289
+ | **Flexibility** | Rigid rules | Adaptive LLM reasoning |
290
+
291
+ ---
292
+
293
+ ## 📝 Notes
294
+
295
+ - LLM reasoning requires `GROQ_API_KEY` environment variable
296
+ - Conversation sessions auto-expire after 30 minutes
297
+ - All features have fallback mechanisms for robustness
298
+ - Zero breaking changes to existing code
__pycache__/backend_api.cpython-312.pyc ADDED
Binary file (14.8 kB). View file
 
__pycache__/conversation_manager.cpython-312.pyc ADDED
Binary file (15.5 kB). View file
 
__pycache__/ensemble_vqa_app.cpython-312.pyc ADDED
Binary file (22.3 kB). View file
 
__pycache__/groq_service.cpython-312.pyc ADDED
Binary file (5.32 kB). View file
 
__pycache__/knowledge_graph_service.cpython-312.pyc ADDED
Binary file (10 kB). View file
 
__pycache__/llm_reasoning_service.cpython-312.pyc ADDED
Binary file (13.4 kB). View file
 
__pycache__/model_spatial.cpython-312.pyc ADDED
Binary file (25.5 kB). View file
 
__pycache__/semantic_neurosymbolic_vqa.cpython-312.pyc ADDED
Binary file (32 kB). View file
 
architecture_draft.html ADDED
@@ -0,0 +1,89 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+
2
+ <!DOCTYPE html>
3
+ <html>
4
+ <head>
5
+ <title>VQA Architecture Draft</title>
6
+ <script type="module">
7
+ import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';
8
+ mermaid.initialize({ startOnLoad: true, theme: 'dark', flowchart: { curve: 'basis' } });
9
+ </script>
10
+ <style>
11
+ body { background-color: #0D1117; color: white; font-family: sans-serif; display: flex; justify-content: center; padding: 20px; }
12
+ .mermaid { background-color: #161B22; padding: 20px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0,0,0,0.5); }
13
+ </style>
14
+ </head>
15
+ <body>
16
+ <div class="mermaid">
17
+
18
+ graph TD
19
+ %% Styling
20
+ classDef default fill:#1A1A1A,stroke:#444,stroke-width:2px,color:#FFF,rx:8px,ry:8px,font-family:arial;
21
+ classDef mobile fill:#003366,stroke:#0055AA,stroke-width:2px,color:#FFF;
22
+ classDef preproc fill:#333333,stroke:#555,stroke-width:2px,color:#FFF;
23
+ classDef model fill:#4B0082,stroke:#8A2BE2,stroke-width:2px,color:#FFF;
24
+ classDef condition fill:#2B2B2B,stroke:#F4A460,stroke-width:2px,color:#FFF,shape:rhombus;
25
+ classDef external fill:#004d00,stroke:#009900,stroke-width:2px,color:#FFF;
26
+ classDef final fill:#660000,stroke:#CC0000,stroke-width:2px,color:#FFF;
27
+
28
+ %% Nodes
29
+ UserApp[📱 Mobile App]:::mobile
30
+
31
+ ImgUpload[🖼️ Image]:::preproc
32
+ Question[⌨️ Question Text]:::preproc
33
+
34
+ PIL[🐍 PIL Preprocessing<br/>RGB conversion]:::preproc
35
+
36
+ CLIP[👁️ OpenAI CLIP ViT-B/32<br/>Image Features 512-dim]:::model
37
+ GPT2[🤗 DistilGPT-2<br/>Tokenized Question]:::model
38
+
39
+ Route1{Question<br/>spatial?}:::condition
40
+
41
+ Spatial[📐 Spatial VQA Model<br/>8-head attention]:::model
42
+ Base[🧠 Base VQA Model<br/>General VQA]:::model
43
+
44
+ Decoder[🤗 GPT-2 Decoder<br/>vocab decode]:::model
45
+ NeuralAns[💬 Neural Answer]:::final
46
+
47
+ Route2{Knowledge<br/>question?}:::condition
48
+
49
+ ObjDet[👁️ CLIP Object Detector<br/>Top-3 objects]:::model
50
+ Wikidata[🌍 Wikidata SPARQL<br/>P31, P186, P366]:::external
51
+ GroqV[⚡ Groq Llama-3.3<br/>Verbalizer]:::external
52
+ KGAns[🧩 KG Enhancement]:::final
53
+
54
+ FastAPI[🚀 FastAPI]:::preproc
55
+ GroqA[⚡ Groq Llama-3.3<br/>Accessibility]:::external
56
+ Audio[🔊 2-sentence description]:::final
57
+
58
+ %% Edges
59
+ UserApp -- "Image uploaded" --> ImgUpload
60
+ UserApp -- "Question typed" --> Question
61
+
62
+ ImgUpload --> PIL
63
+ PIL --> CLIP
64
+ Question --> GPT2
65
+
66
+ CLIP & GPT2 --> Route1
67
+
68
+ Route1 -- "YES" --> Spatial
69
+ Route1 -- "NO" --> Base
70
+
71
+ Spatial & Base -- "Beam search (width=5)" --> Decoder
72
+ Decoder --> NeuralAns
73
+
74
+ CLIP -- "Anchor similarity" --> Route2
75
+
76
+ Route2 -- "YES" --> ObjDet
77
+ ObjDet -- "Detected objects" --> Wikidata
78
+ Wikidata -- "Structured facts" --> GroqV
79
+ GroqV --> KGAns
80
+
81
+ FastAPI -- "Narration request" --> GroqA
82
+ GroqA --> Audio
83
+
84
+ NeuralAns & KGAns & Audio -- "JSON output" --> FastAPI
85
+ FastAPI --> UserApp
86
+
87
+ </div>
88
+ </body>
89
+ </html>
architecture_draft.mmd ADDED
@@ -0,0 +1,69 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+
2
+ graph TD
3
+ %% Styling
4
+ classDef default fill:#1A1A1A,stroke:#444,stroke-width:2px,color:#FFF,rx:8px,ry:8px,font-family:arial;
5
+ classDef mobile fill:#003366,stroke:#0055AA,stroke-width:2px,color:#FFF;
6
+ classDef preproc fill:#333333,stroke:#555,stroke-width:2px,color:#FFF;
7
+ classDef model fill:#4B0082,stroke:#8A2BE2,stroke-width:2px,color:#FFF;
8
+ classDef condition fill:#2B2B2B,stroke:#F4A460,stroke-width:2px,color:#FFF,shape:rhombus;
9
+ classDef external fill:#004d00,stroke:#009900,stroke-width:2px,color:#FFF;
10
+ classDef final fill:#660000,stroke:#CC0000,stroke-width:2px,color:#FFF;
11
+
12
+ %% Nodes
13
+ UserApp[📱 Mobile App]:::mobile
14
+
15
+ ImgUpload[🖼️ Image]:::preproc
16
+ Question[⌨️ Question Text]:::preproc
17
+
18
+ PIL[🐍 PIL Preprocessing<br/>RGB conversion]:::preproc
19
+
20
+ CLIP[👁️ OpenAI CLIP ViT-B/32<br/>Image Features 512-dim]:::model
21
+ GPT2[🤗 DistilGPT-2<br/>Tokenized Question]:::model
22
+
23
+ Route1{Question<br/>spatial?}:::condition
24
+
25
+ Spatial[📐 Spatial VQA Model<br/>8-head attention]:::model
26
+ Base[🧠 Base VQA Model<br/>General VQA]:::model
27
+
28
+ Decoder[🤗 GPT-2 Decoder<br/>vocab decode]:::model
29
+ NeuralAns[💬 Neural Answer]:::final
30
+
31
+ Route2{Knowledge<br/>question?}:::condition
32
+
33
+ ObjDet[👁️ CLIP Object Detector<br/>Top-3 objects]:::model
34
+ Wikidata[🌍 Wikidata SPARQL<br/>P31, P186, P366]:::external
35
+ GroqV[⚡ Groq Llama-3.3<br/>Verbalizer]:::external
36
+ KGAns[🧩 KG Enhancement]:::final
37
+
38
+ FastAPI[🚀 FastAPI]:::preproc
39
+ GroqA[⚡ Groq Llama-3.3<br/>Accessibility]:::external
40
+ Audio[🔊 2-sentence description]:::final
41
+
42
+ %% Edges
43
+ UserApp -- "Image uploaded" --> ImgUpload
44
+ UserApp -- "Question typed" --> Question
45
+
46
+ ImgUpload --> PIL
47
+ PIL --> CLIP
48
+ Question --> GPT2
49
+
50
+ CLIP & GPT2 --> Route1
51
+
52
+ Route1 -- "YES" --> Spatial
53
+ Route1 -- "NO" --> Base
54
+
55
+ Spatial & Base -- "Beam search (width=5)" --> Decoder
56
+ Decoder --> NeuralAns
57
+
58
+ CLIP -- "Anchor similarity" --> Route2
59
+
60
+ Route2 -- "YES" --> ObjDet
61
+ ObjDet -- "Detected objects" --> Wikidata
62
+ Wikidata -- "Structured facts" --> GroqV
63
+ GroqV --> KGAns
64
+
65
+ FastAPI -- "Narration request" --> GroqA
66
+ GroqA --> Audio
67
+
68
+ NeuralAns & KGAns & Audio -- "JSON output" --> FastAPI
69
+ FastAPI --> UserApp
backend_api.py ADDED
@@ -0,0 +1,341 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
"""
FastAPI Backend for Ensemble VQA Mobile App
Provides REST API endpoints for the React Native mobile application
"""
from fastapi import FastAPI, File, UploadFile, Form, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
import uvicorn
from PIL import Image
import io
import os
import sys
from pathlib import Path
from dotenv import load_dotenv

# Load environment variables (e.g. GROQ_API_KEY) before the services import them.
load_dotenv()

from ensemble_vqa_app import ProductionEnsembleVQA
from groq_service import get_groq_service

app = FastAPI(
    title="Ensemble VQA API",
    description="Visual Question Answering API with ensemble model routing",
    version="1.0.0"
)

# NOTE(review): allow_origins=["*"] together with allow_credentials=True is
# very permissive — tighten the origin list before public deployment.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Populated once by startup_event(); None means "not loaded / unavailable".
ensemble_model = None
groq_service = None
@app.on_event("startup")
async def startup_event():
    """Load the ensemble VQA models (and optional Groq service) at startup.

    Exits the process when a checkpoint is missing or loading fails, since
    the API cannot serve any request without the models.
    """
    global ensemble_model, groq_service
    print("=" * 80)
    print("🚀 STARTING VQA API SERVER")
    print("=" * 80)
    base_ckpt = "./vqa_checkpoint.pt"
    spatial_ckpt = "./vqa_spatial_checkpoint.pt"
    # Fail fast if either checkpoint file is absent.
    for ckpt, label, filename in (
        (base_ckpt, "Base", "vqa_checkpoint.pt"),
        (spatial_ckpt, "Spatial", "vqa_spatial_checkpoint.pt"),
    ):
        if not os.path.exists(ckpt):
            print(f"❌ {label} checkpoint not found: {ckpt}")
            print(f"Please ensure {filename} is in the project root")
            sys.exit(1)
    try:
        ensemble_model = ProductionEnsembleVQA(
            base_checkpoint=base_ckpt,
            spatial_checkpoint=spatial_ckpt,
            device='cuda'
        )
        print("\n✅ VQA models loaded successfully!")
        # Groq is optional: missing API key degrades to fallback descriptions.
        try:
            groq_service = get_groq_service()
            print("✅ Groq LLM service initialized for accessibility features")
        except ValueError as err:
            print(f"⚠️ Groq service not available: {err}")
            print(" Accessibility descriptions will use fallback mode")
            groq_service = None
        print("📱 Mobile app can now connect")
        print("=" * 80)
    except Exception as err:
        print(f"\n❌ Failed to load models: {err}")
        sys.exit(1)
@app.get("/")
async def root():
    """Service banner: name, version, status and the main endpoints."""
    endpoints = {
        "health": "/health",
        "answer": "/api/answer (POST)"
    }
    return {
        "message": "Ensemble VQA API",
        "version": "1.0.0",
        "status": "running",
        "endpoints": endpoints,
    }
@app.get("/health")
async def health_check():
    """Liveness probe reporting whether the ensemble models are loaded."""
    loaded = ensemble_model is not None
    state = "loaded" if loaded else "not loaded"
    return {
        "status": "healthy",
        "model_loaded": loaded,
        # Both models are loaded together, so they share one state.
        "models": {"base": state, "spatial": state},
    }
@app.post("/api/answer")
async def answer_question(
    image: UploadFile = File(...),
    question: str = Form(...)
):
    """
    Answer a visual question using the ensemble VQA system.

    Args:
        image: Image file (JPEG, PNG)
        question: Question text

    Returns:
        JSON response with answer, model used, accessibility description,
        and metadata.

    Raises:
        HTTPException: 503 if models are not loaded, 400 for an empty
            question or an unreadable image, 500 for unexpected failures.
    """
    import tempfile

    if ensemble_model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    if not question or question.strip() == "":
        raise HTTPException(status_code=400, detail="Question cannot be empty")
    try:
        image_bytes = await image.read()
        try:
            pil_image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
        except Exception as e:
            raise HTTPException(status_code=400, detail=f"Invalid image format: {str(e)}")

        # BUGFIX: the previous fixed name "temp_upload.jpg" was clobbered by
        # concurrent requests and leaked when inference raised. Use a unique
        # per-request temp file and guarantee cleanup with try/finally.
        with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
            temp_image_path = tmp.name
        try:
            pil_image.save(temp_image_path)
            result = ensemble_model.answer(
                image_path=temp_image_path,
                question=question,
                use_beam_search=True,
                beam_width=5,
                verbose=True
            )
        finally:
            if os.path.exists(temp_image_path):
                os.remove(temp_image_path)

        is_spatial = ensemble_model.is_spatial_question(question)

        # Accessibility description: prefer the Groq LLM, fall back to a plain
        # "Question/Answer" sentence when the service is missing or fails.
        description = f"Question: {question}. Answer: {result['answer']}."
        description_status = "fallback"
        if groq_service is not None:
            try:
                desc_result = groq_service.generate_description(
                    question=question,
                    answer=result['answer']
                )
                description = desc_result.get('description')
                description_status = desc_result.get('status', 'success')
            except Exception as e:
                print(f"⚠️ Groq description generation failed: {e}")

        # A reasoning chain is only meaningful when KG enhancement ran.
        reasoning_chain = None
        if result.get('kg_enhancement'):
            reasoning_chain = result.get('reasoning_chain', [])

        return JSONResponse(content={
            "success": True,
            "answer": result['answer'],
            "description": description,
            "description_status": description_status,
            "model_used": result['model_used'],
            "confidence": result['confidence'],
            "question_type": "spatial" if is_spatial else "general",
            "question": question,
            "kg_enhancement": result.get('kg_enhancement'),
            "reasoning_type": result.get('reasoning_type', 'neural'),
            "reasoning_chain": reasoning_chain,
            "metadata": {
                "beam_search": True,
                "beam_width": 5
            }
        })
    except HTTPException:
        raise
    except Exception as e:
        print(f"❌ Error processing request: {e}")
        raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
@app.get("/api/models/info")
async def models_info():
    """Describe the loaded models, routing strategy and conversation support."""
    if ensemble_model is None:
        raise HTTPException(status_code=503, detail="Models not loaded")
    base_info = {
        "name": "Base VQA Model",
        "description": "General visual question answering",
        "accuracy": "50%",
        "use_case": "General questions about objects, colors, counts, etc."
    }
    spatial_info = {
        "name": "Spatial Adapter Model",
        "description": "Spatial reasoning and positional questions",
        "accuracy": "40%",
        "use_case": "Spatial questions (left, right, above, below, etc.)"
    }
    return {
        "base_model": base_info,
        "spatial_model": spatial_info,
        "routing": {
            "method": "Keyword-based classification",
            "spatial_keywords": ensemble_model.SPATIAL_KEYWORDS
        },
        "conversation": {
            "enabled": ensemble_model.conversation_enabled if ensemble_model else False,
            "timeout_minutes": 30
        }
    }
@app.post("/api/conversation/answer")
async def answer_conversational(
    image: UploadFile = File(...),
    question: str = Form(...),
    session_id: str = Form(None)
):
    """
    Answer a visual question with multi-turn conversation support.
    Handles pronoun resolution and maintains conversation context.

    Args:
        image: Image file (JPEG, PNG)
        question: Question text (may contain pronouns like "it", "this")
        session_id: Optional session ID to continue conversation

    Returns:
        JSON response with answer, session_id, resolved question, and context.

    Raises:
        HTTPException: 503 when models are not loaded, 501 when conversation
            mode is disabled, 400 for bad input, 500 for unexpected failures.
    """
    import tempfile

    if ensemble_model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    if not ensemble_model.conversation_enabled:
        raise HTTPException(
            status_code=501,
            detail="Conversational VQA not available. Use /api/answer instead."
        )
    if not question or question.strip() == "":
        raise HTTPException(status_code=400, detail="Question cannot be empty")
    try:
        image_bytes = await image.read()
        try:
            pil_image = Image.open(io.BytesIO(image_bytes)).convert('RGB')
        except Exception as e:
            raise HTTPException(status_code=400, detail=f"Invalid image format: {str(e)}")

        # BUGFIX: unique per-request temp file (the shared "temp_upload.jpg"
        # raced between concurrent requests) with guaranteed cleanup even
        # when inference raises.
        with tempfile.NamedTemporaryFile(suffix=".jpg", delete=False) as tmp:
            temp_image_path = tmp.name
        try:
            pil_image.save(temp_image_path)
            result = ensemble_model.answer_conversational(
                image_path=temp_image_path,
                question=question,
                session_id=session_id,
                use_beam_search=True,
                beam_width=5,
                verbose=True
            )
        finally:
            if os.path.exists(temp_image_path):
                os.remove(temp_image_path)

        # Accessibility description with plain-text fallback.
        description = f"Question: {question}. Answer: {result['answer']}."
        if groq_service is not None:
            try:
                desc_result = groq_service.generate_description(
                    question=result['resolved_question'],
                    answer=result['answer']
                )
                description = desc_result.get('description')
            # BUGFIX: was a bare `except:` (also caught SystemExit /
            # KeyboardInterrupt); keep the best-effort fallback behavior.
            except Exception:
                description = f"Question: {question}. Answer: {result['answer']}."

        return JSONResponse(content={
            "success": True,
            "answer": result['answer'],
            "description": description,
            "session_id": result['session_id'],
            "resolved_question": result['resolved_question'],
            "original_question": question,
            "conversation_context": result['conversation_context'],
            "model_used": result['model_used'],
            "confidence": result['confidence'],
            "kg_enhancement": result.get('kg_enhancement'),
            "reasoning_type": result.get('reasoning_type', 'neural'),
            "reasoning_chain": result.get('reasoning_chain'),
            "metadata": {
                "beam_search": True,
                "beam_width": 5,
                "conversation_enabled": True
            }
        })
    except HTTPException:
        raise
    except Exception as e:
        print(f"❌ Error processing conversational request: {e}")
        raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
@app.get("/api/conversation/{session_id}/history")
async def get_conversation_history(session_id: str):
    """Return all recorded turns (with timestamps) for a conversation session.

    Args:
        session_id: Session ID

    Raises:
        HTTPException: 503 if the conversation service is unavailable,
            404 if the session is unknown or has expired.
    """
    if ensemble_model is None or not ensemble_model.conversation_enabled:
        raise HTTPException(status_code=503, detail="Conversation service not available")
    history = ensemble_model.conversation_manager.get_history(session_id)
    if history is None:
        raise HTTPException(
            status_code=404,
            detail=f"Session {session_id} not found or expired"
        )
    payload = {
        "success": True,
        "session_id": session_id,
        "history": history,
        "turn_count": len(history),
    }
    return JSONResponse(content=payload)
@app.delete("/api/conversation/{session_id}")
async def delete_conversation(session_id: str):
    """Delete a conversation session so the user can start fresh.

    Args:
        session_id: Session ID to delete

    Raises:
        HTTPException: 503 if the conversation service is unavailable,
            404 if the session does not exist.
    """
    if ensemble_model is None or not ensemble_model.conversation_enabled:
        raise HTTPException(status_code=503, detail="Conversation service not available")
    if not ensemble_model.conversation_manager.delete_session(session_id):
        raise HTTPException(
            status_code=404,
            detail=f"Session {session_id} not found"
        )
    return JSONResponse(content={
        "success": True,
        "message": f"Session {session_id} deleted"
    })
if __name__ == "__main__":
    # HuggingFace Spaces requires the server to listen on port 7860.
    # BUGFIX: the banner previously claimed "Port: 8000" (and printed
    # localhost:8000 URLs) while the server actually bound 7860; a single
    # constant keeps the message and the behavior in sync.
    PORT = 7860
    print("\n" + "=" * 80)
    print("🚀 ENSEMBLE VQA API SERVER")
    print("=" * 80)
    print("\n📋 Configuration:")
    print(" - Host: 0.0.0.0 (accessible from network)")
    print(f" - Port: {PORT}")
    print(" - Reload: Enabled (development mode)")
    print("\n🔗 Access URLs:")
    print(f" - Local: http://localhost:{PORT}")
    print(f" - Network: http://<your-ip>:{PORT}")
    print(f" - Docs: http://localhost:{PORT}/docs")
    print("\n💡 For mobile testing:")
    print(" 1. Find your local IP: ipconfig (Windows) or ifconfig (Mac/Linux)")
    print(f" 2. Update API_URL in mobile app to http://<your-ip>:{PORT}")
    print(" 3. Ensure phone and computer are on same network")
    print("=" * 80 + "\n")
    uvicorn.run(
        "backend_api:app",
        host="0.0.0.0",
        port=PORT,
        reload=True,
        log_level="info"
    )
continue.py ADDED
@@ -0,0 +1,344 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import pandas as pd
3
+ import torch
4
+ import torch.nn as nn
5
+ from torch.utils.data import Dataset, DataLoader
6
+ from PIL import Image
7
+ from transformers import GPT2Tokenizer
8
+ import matplotlib.pyplot as plt
9
+ import numpy as np
10
+ from tqdm import tqdm
11
+ from collections import Counter
12
+ from nltk.tokenize import word_tokenize
13
+ from sklearn.model_selection import train_test_split
14
+ from torchvision import transforms
15
+ from model import VQAModel
16
+ device = 'cuda'
17
class Vocab:
    """Answer-side vocabulary with special tokens and encode/decode helpers."""

    def __init__(self):
        self.vocab = None
        self.vocab_size = None
        self.word2idx = None
        self.idx2word = None
        # Special token strings; their ids are assigned in build_vocab().
        self.pad = '<pad>'
        self.bos = '<bos>'
        self.eos = '<eos>'
        self.unk = '<unk>'

    def build_vocab(self, df, min_freq=1):
        """Build the vocabulary from the 'answer' column of *df*.

        Words occurring fewer than *min_freq* times are dropped; the four
        special tokens always occupy indices 0..3.
        """
        counter = Counter()
        for answer in df['answer']:
            counter.update(word_tokenize(answer.lower()))
        kept = sorted(word for word, freq in counter.items() if freq >= min_freq)
        self.vocab = [self.pad, self.bos, self.eos, self.unk] + kept
        self.word2idx = {word: idx for idx, word in enumerate(self.vocab)}
        self.idx2word = {idx: word for word, idx in self.word2idx.items()}
        self.vocab_size = len(self.vocab)
        self.pad_token_id = self.word2idx["<pad>"]
        self.bos_token_id = self.word2idx["<bos>"]
        self.eos_token_id = self.word2idx["<eos>"]
        self.unk_token_id = self.word2idx["<unk>"]

    def encoder(self, text, max_len):
        """Tokenize *text*, wrap with <bos>/<eos>, pad or truncate to *max_len*."""
        ids = [self.word2idx.get(token, self.unk_token_id)
               for token in word_tokenize(text.lower())]
        ids = [self.bos_token_id] + ids + [self.eos_token_id]
        if len(ids) < max_len:
            return ids + [self.pad_token_id] * (max_len - len(ids))
        # NOTE(review): truncation may drop the trailing <eos> token.
        return ids[:max_len]

    def decoder(self, token_ids):
        """Map ids back to text, stopping at <eos> and skipping <pad>/<bos>."""
        words = []
        for tid in token_ids:
            if tid == self.eos_token_id:
                break
            if tid in (self.pad_token_id, self.bos_token_id):
                continue
            words.append(self.idx2word.get(tid, "<unk>"))
        return ' '.join(words).strip()
class AugmentedVQADataset(Dataset):
    """VQA dataset pairing CLIP-processed images with tokenized questions and
    vocab-encoded answers, with optional train-time image augmentation.
    """

    def __init__(self, df, img_dir, question_tokenizer, text_processor, clip_processor,
                 question_max_len=32, answer_max_len=16, augment=True):
        self.df = df
        self.img_dir = img_dir
        self.question_tokenizer = question_tokenizer
        self.text_processor = text_processor
        self.clip_processor = clip_processor
        self.question_max_len = question_max_len
        self.answer_max_len = answer_max_len
        self.augment = augment
        # Light geometric/photometric augmentation, applied before CLIP
        # preprocessing; disabled entirely for validation/test splits.
        self.transform = transforms.Compose([
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
            transforms.RandomRotation(10),
        ]) if augment else None

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_path = os.path.join(self.img_dir, row['image_path'])
        image = Image.open(img_path).convert('RGB')
        if self.augment and self.transform:
            image = self.transform(image)
        tokenized = self.question_tokenizer(
            row['question'],
            padding='max_length',
            truncation=True,
            max_length=self.question_max_len,
            return_tensors='pt'
        )
        answer_ids = self.text_processor.encoder(row['answer'], max_len=self.answer_max_len)
        return {
            'image_path': img_path,
            'image': self.clip_processor(image),
            'question_ids': tokenized['input_ids'].squeeze(0),
            'question_mask': tokenized['attention_mask'].squeeze(0),
            'answer_ids': torch.tensor(answer_ids, dtype=torch.long)
        }
def save_checkpoint(model, optimizer, epoch, vocab, path):
    """Serialize model/optimizer state plus everything needed for inference.

    The full vocab (token lists, lookup tables, special-token ids) and the
    sequence-length settings are stored alongside the weights so the
    checkpoint can be decoded without rebuilding the vocabulary.
    """
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'vocab': vocab.vocab,
        'word2idx': vocab.word2idx,
        'idx2word': vocab.idx2word,
        'pad_token_id': vocab.pad_token_id,
        'bos_token_id': vocab.bos_token_id,
        'eos_token_id': vocab.eos_token_id,
        'unk_token_id': vocab.unk_token_id,
        'question_max_len': model.question_max_len,
        'answer_max_len': model.answer_max_len,
    }
    torch.save(checkpoint, path)
def plot_losses(train_losses, val_losses, save_path="loss_plot.png"):
    """Plot train vs. validation loss per epoch and save the figure to *save_path*."""
    plt.figure(figsize=(8, 6))
    for series, label in ((train_losses, "Train Loss"), (val_losses, "Validation Loss")):
        plt.plot(series, label=label)
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.title("Train vs Validation Loss")
    plt.legend()
    plt.savefig(save_path)
    # Close the figure to free memory when called repeatedly during training.
    plt.close()
def train_one_epoch(model, dataloader, optimizer, device, scaler, vocab):
    """Run one mixed-precision training epoch.

    Uses teacher forcing with label smoothing, gradient clipping at 1.0,
    and AMP scaling.

    Returns:
        (avg_loss, avg_token_acc): mean cross-entropy loss and mean
        non-pad token accuracy across batches.
    """
    model.train()
    running_loss = 0.0
    running_token_acc = 0.0
    criterion = nn.CrossEntropyLoss(ignore_index=vocab.pad_token_id, label_smoothing=0.1)
    for batch in tqdm(dataloader):
        optimizer.zero_grad()
        images = batch['image'].to(device)
        questions = {
            'input_ids': batch['question_ids'].to(device),
            'attention_mask': batch['question_mask'].to(device)
        }
        answers = batch['answer_ids'].to(device)
        with torch.amp.autocast(device):
            logits = model(images, questions, answer_input_ids=answers)
            # Teacher forcing: predict token t+1 from tokens <= t.
            shifted_logits = logits[:, :-1, :]
            shifted_answers = answers[:, 1:]
            loss = criterion(
                shifted_logits.reshape(-1, shifted_logits.size(-1)),
                shifted_answers.reshape(-1)
            )
            # Token accuracy measured over non-pad positions only.
            preds = shifted_logits.argmax(dim=-1)
            hits = (preds == shifted_answers).float()
            pad_mask = (shifted_answers != vocab.pad_token_id).float()
            running_token_acc += ((hits * pad_mask).sum() / pad_mask.sum()).item()
        scaler.scale(loss).backward()
        # Unscale before clipping so the norm threshold applies to true grads.
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        running_loss += loss.item()
    n_batches = len(dataloader)
    return running_loss / n_batches, running_token_acc / n_batches
def validate_one_epoch(model, dataloader, device, vocab):
    """Evaluate one epoch: loss, token accuracy, and exact-match accuracy.

    Exact match compares a free-running generation pass (no teacher forcing)
    against ground truth after vocab decoding.

    Returns:
        (avg_loss, avg_token_acc, exact_match_acc); all 0.0 when the
        dataloader is empty (previously this raised ZeroDivisionError).
    """
    model.eval()
    total_loss = 0.0
    total_token_acc = 0.0
    exact_matches = 0
    total_samples = 0
    criterion = nn.CrossEntropyLoss(ignore_index=vocab.pad_token_id)
    with torch.no_grad():
        for batch in tqdm(dataloader):
            images = batch['image'].to(device)
            questions = {
                'input_ids': batch['question_ids'].to(device),
                'attention_mask': batch['question_mask'].to(device)
            }
            answers = batch['answer_ids'].to(device)
            # Teacher-forced pass for loss / token accuracy.
            logits = model(images, questions, answer_input_ids=answers)
            shifted_logits = logits[:, :-1, :]
            shifted_answers = answers[:, 1:]
            loss = criterion(
                shifted_logits.reshape(-1, shifted_logits.size(-1)),
                shifted_answers.reshape(-1)
            )
            total_loss += loss.item()
            predicted_tokens = shifted_logits.argmax(dim=-1)
            correct = (predicted_tokens == shifted_answers).float()
            mask = (shifted_answers != vocab.pad_token_id).float()
            total_token_acc += ((correct * mask).sum() / mask.sum()).item()
            # Generation pass (no answer ids supplied) for exact match.
            generated = model(images, questions)
            for pred, true in zip(generated, answers):
                pred_text = vocab.decoder(pred.cpu().numpy())
                true_text = vocab.decoder(true.cpu().numpy())
                if pred_text.strip() == true_text.strip():
                    exact_matches += 1
                total_samples += 1
    # BUGFIX: guard the divisions against an empty dataloader.
    n_batches = len(dataloader)
    avg_loss = total_loss / n_batches if n_batches else 0.0
    avg_token_acc = total_token_acc / n_batches if n_batches else 0.0
    exact_match_acc = exact_matches / total_samples if total_samples else 0.0
    return avg_loss, avg_token_acc, exact_match_acc
def main():
    """Resume VQA training from a saved checkpoint with identical hyperparameters."""
    print()
    print("# VQA: Continue Training (Same Settings)")
    print()
    # Seed all RNGs so the data split and augmentation order are reproducible.
    import random
    import numpy as np
    torch.manual_seed(42)
    random.seed(42)
    np.random.seed(42)
    if torch.cuda.is_available(): torch.cuda.manual_seed_all(42)
    # --- Paths: resume from the previous continued-training run's checkpoint ---
    DATA_DIR = r"./gen_vqa_v2"
    CSV_PATH = os.path.join(DATA_DIR, "metadata.csv")
    RESUME_CHECKPOINT = r"./output2/continued_training/vqa_checkpoint.pt"
    OUTPUT_DIR = r"./output2/continued_training_2"
    CHECKPOINT_PATH = os.path.join(OUTPUT_DIR, "vqa_checkpoint.pt")
    LOG_CSV = os.path.join(OUTPUT_DIR, "train_log.csv")
    LOSS_GRAPH_PATH = os.path.join(OUTPUT_DIR, "loss_plot.png")
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    # --- Hyperparameters (kept identical to the original run) ---
    batch_size = 64
    additional_epochs = 50
    patience = 8  # early-stopping window: epochs without exact-match improvement
    question_max_len = 20
    answer_max_len = 12
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(device)
    print(f"Loading checkpoint from: {RESUME_CHECKPOINT}")
    checkpoint = torch.load(RESUME_CHECKPOINT, map_location=device)
    start_epoch = checkpoint['epoch'] + 1
    metadata = pd.read_csv(CSV_PATH)
    # Rebuild the vocabulary exactly as saved so token ids stay consistent.
    vocab = Vocab()
    vocab.vocab = checkpoint['vocab']
    vocab.vocab_size = len(checkpoint['vocab'])
    vocab.word2idx = checkpoint['word2idx']
    vocab.idx2word = checkpoint['idx2word']
    vocab.pad_token_id = checkpoint['pad_token_id']
    vocab.bos_token_id = checkpoint['bos_token_id']
    vocab.eos_token_id = checkpoint['eos_token_id']
    vocab.unk_token_id = checkpoint['unk_token_id']
    print(f"Answer Vocab Size: {len(vocab.vocab)}")
    print(f"Resuming from epoch: {start_epoch}")
    # Same seeded 80/10/10 split as the original run, so the splits are stable
    # across resumes (no train/val leakage).
    train_df, test_df = train_test_split(metadata, test_size=0.2, random_state=42)
    val_df, test_df = train_test_split(test_df, test_size=0.5, random_state=42)
    print(f"Train size: {len(train_df)}, Val size: {len(val_df)}, Test size: {len(test_df)}")
    print()
    model = VQAModel(
        vocab_size=len(vocab.vocab),
        device=device,
        question_max_len=question_max_len,
        answer_max_len=answer_max_len,
        pad_token_id=vocab.pad_token_id,
        bos_token_id=vocab.bos_token_id,
        eos_token_id=vocab.eos_token_id,
        unk_token_id=vocab.unk_token_id,
        hidden_size=512,
        num_layers=2
    ).to(device)
    clip_processor = model.clip_preprocess
    question_tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
    if question_tokenizer.pad_token is None:
        question_tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    # Resize embeddings BEFORE loading weights so shapes match the checkpoint.
    model.gpt2_model.resize_token_embeddings(len(question_tokenizer))
    model.load_state_dict(checkpoint['model_state_dict'], strict=False)
    print("Model loaded from checkpoint!")
    if model.fine_tuning_mode:
        print("Model already in fine-tuning mode (encoders unfrozen)")
    else:
        print("Continuing with same training configuration")
    print()
    # Augmentation only on the training split.
    train_dataset = AugmentedVQADataset(
        train_df, DATA_DIR, question_tokenizer, vocab,
        clip_processor=clip_processor,
        question_max_len=question_max_len,
        answer_max_len=answer_max_len,
        augment=True
    )
    val_dataset = AugmentedVQADataset(
        val_df, DATA_DIR, question_tokenizer, vocab,
        clip_processor=clip_processor,
        question_max_len=question_max_len,
        answer_max_len=answer_max_len,
        augment=False
    )
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=4, pin_memory=True)
    # Only optimize parameters left unfrozen by the checkpointed configuration.
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable_params, lr=1e-6, weight_decay=1e-4)
    print(f"Trainable parameters: {sum(p.numel() for p in trainable_params):,}")
    if 'optimizer_state_dict' in checkpoint:
        try:
            optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
            print("Optimizer state loaded from checkpoint!")
            for param_group in optimizer.param_groups:
                print(f" Loaded LR: {param_group['lr']}")
        except Exception as e:
            # Best-effort resume: a param-group mismatch falls back to a fresh optimizer.
            print(f"Could not load optimizer state: {e}")
            print("Using fresh optimizer")
    else:
        print("No optimizer state in checkpoint, using fresh optimizer")
    print()
    scaler = torch.amp.GradScaler(device)
    best_val_exact_match = 0.0
    counter = 0  # epochs since last exact-match improvement
    logs = []
    if os.path.exists(LOG_CSV):
        # Resume the metrics log so the best score carries over between runs.
        old_logs = pd.read_csv(LOG_CSV)
        logs = old_logs.values.tolist()
        best_val_exact_match = old_logs['val_exact_match'].max()
        print(f"Previous best exact match: {best_val_exact_match:.4f}")
    # NOTE(review): `verbose=True` is deprecated (and removed in recent torch
    # releases) for ReduceLROnPlateau — confirm the pinned torch version.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='max', factor=0.5, patience=4, verbose=True
    )
    total_epochs = start_epoch + additional_epochs
    for epoch in range(start_epoch, total_epochs):
        print(f"\nEpoch {epoch+1}/{total_epochs}")
        train_loss, train_token_acc = train_one_epoch(model, train_loader, optimizer, device, scaler, vocab)
        val_loss, val_token_acc, val_exact_match = validate_one_epoch(model, val_loader, device, vocab)
        print(f"Train Loss: {train_loss:.4f} | Train Token Acc: {train_token_acc:.4f}")
        print(f"Val Loss: {val_loss:.4f} | Val Token Acc: {val_token_acc:.4f} | Val Exact Match: {val_exact_match:.4f}")
        print(f"LR: {optimizer.param_groups[0]['lr']}")
        # LR scheduling keyed on the same metric as checkpointing/early stopping.
        scheduler.step(val_exact_match)
        if val_exact_match > best_val_exact_match:
            best_val_exact_match = val_exact_match
            save_checkpoint(model, optimizer, epoch, vocab, CHECKPOINT_PATH)
            print("Checkpoint saved!")
            counter = 0
        else:
            counter += 1
            print(f"No improvement in exact match for {counter} epochs.")
            if counter >= patience:
                # NOTE(review): breaking here skips the logs.append below, so
                # the final (stopping) epoch's metrics are not written to CSV.
                print(f"\nEarly stopping after {patience} epochs without improvement")
                break
        # Persist metrics and the loss curve after every epoch (crash-safe logging).
        logs.append([epoch+1, train_loss, train_token_acc, val_loss, val_token_acc, val_exact_match, optimizer.param_groups[0]['lr']])
        log_df = pd.DataFrame(logs, columns=["epoch","train_loss","train_token_acc","val_loss","val_token_acc","val_exact_match","lr"])
        log_df.to_csv(LOG_CSV, index=False)
        plot_losses([x[1] for x in logs], [x[3] for x in logs], save_path=LOSS_GRAPH_PATH)
    print("\nContinued training complete!")
    print(f"Best exact match accuracy: {best_val_exact_match:.4f}")
if __name__ == "__main__":
    main()
continued_training_metric.csv ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ epoch,train_loss,train_token_acc,val_loss,val_token_acc,val_exact_match,lr
2
+ 30,1.9502590653601657,0.7322589969014642,1.3020859152640936,0.6990427433882119,0.38998964956380305,1e-06
3
+ 31,1.9464403521302605,0.7328945476229131,1.3008682691263702,0.7001620300535886,0.3919858051160727,1e-06
4
+ 32,1.9446046694293662,0.733795435205851,1.2995548267971795,0.7003354483617926,0.39220760017743606,1e-06
5
+ 33,1.9418390615673053,0.7339540544097625,1.2990998206835873,0.7004338480391592,0.3923554635516783,1e-06
6
+ 34,1.9405346881137806,0.733893274767451,1.299637350552487,0.7005681339299904,0.39257725861304155,1e-06
7
+ 35,1.9380957318931413,0.7351758044757201,1.2987835997680448,0.7006050677232023,0.39265119030016266,1e-06
8
+ 36,1.9369506880350187,0.7359647384978554,1.2979233675407913,0.7013796053405078,0.39405589235546357,1e-06
9
+ 37,1.9360789391220428,0.7364758676075498,1.2977605515493538,0.7014409610123005,0.39398196066834246,1e-06
10
+ 38,1.9357275886693557,0.7362391176412685,1.297402927054549,0.7011285817848062,0.39353837054561586,1e-06
11
+ 39,1.932767997896227,0.736806065813456,1.2974532218474262,0.7004903276573937,0.39220760017743606,1e-06
12
+ 40,1.9330583925010325,0.7374090065552876,1.2972412691363748,0.7010474972567469,0.39316871211001037,1e-06
13
+ 41,1.9306796564990991,0.7378562083616544,1.2969766115804888,0.7015751037957534,0.39427768741682684,1e-06
14
+ 42,1.9282727334571266,0.7377051650808099,1.2973702516195909,0.7011518692070583,0.39331657548425253,1e-06
15
+ 43,1.9271106582502864,0.7386680361718415,1.2968672679842643,0.7010392338599799,0.39316871211001037,1e-06
16
+ 44,1.9269962475047457,0.7397106953586509,1.296902930680311,0.7012923545432541,0.3936862339198581,1e-06
17
+ 45,1.9244166012701376,0.7400805678048972,1.2962118839880206,0.7011814176473977,0.39353837054561586,1e-06
18
+ 46,1.9289601324296857,0.7377478108470924,1.296351783118158,0.7014656890675707,0.3941298240425847,5e-07
19
+ 47,1.9269490434459369,0.7386778470752227,1.2962728831565604,0.7015336796922503,0.39420375572970573,5e-07
20
+ 48,1.9252020313075702,0.7394137923214155,1.2964817043745294,0.7014302642277952,0.39420375572970573,5e-07
21
+ 49,1.9241666916486853,0.7392096879001484,1.296351099070513,0.7016751350096937,0.39449948247819017,5e-07
conversation_manager.py ADDED
@@ -0,0 +1,312 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Conversation Manager for Multi-turn VQA
3
+ Manages conversation state, context, and pronoun resolution
4
+ """
5
+ from dataclasses import dataclass, field
6
+ from typing import Dict, List, Optional, Any
7
+ from datetime import datetime, timedelta
8
+ import uuid
9
+ import re
10
@dataclass
class ConversationTurn:
    """Represents a single turn in a conversation"""
    question: str  # the user's question as asked (before any pronoun resolution)
    answer: str  # the answer produced for this turn
    objects_detected: List[str]  # objects detected for this turn (may be empty)
    timestamp: datetime  # when the turn was recorded
    reasoning_chain: Optional[List[str]] = None  # optional reasoning steps behind the answer
    model_used: Optional[str] = None  # optional identifier of the model that answered
19
@dataclass
class ConversationSession:
    """A complete multi-turn VQA conversation tied to a single image."""
    session_id: str
    image_path: str
    history: List[ConversationTurn] = field(default_factory=list)
    current_objects: List[str] = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.now)
    last_activity: datetime = field(default_factory=datetime.now)
    def add_turn(
        self,
        question: str,
        answer: str,
        objects_detected: List[str],
        reasoning_chain: Optional[List[str]] = None,
        model_used: Optional[str] = None
    ):
        """Append one Q/A exchange and refresh the activity timestamp."""
        self.history.append(ConversationTurn(
            question=question,
            answer=answer,
            objects_detected=objects_detected,
            timestamp=datetime.now(),
            reasoning_chain=reasoning_chain,
            model_used=model_used,
        ))
        # Only overwrite the tracked objects when the detector found something.
        if objects_detected:
            self.current_objects = objects_detected
        self.last_activity = datetime.now()
    def get_context_summary(self) -> str:
        """Condense the last three turns into one pipe-separated string."""
        if not self.history:
            return "No previous conversation"
        recent = self.history[-3:]
        return " | ".join(
            f"Turn {idx}: Q: {turn.question} A: {turn.answer}"
            for idx, turn in enumerate(recent, 1)
        )
    def is_expired(self, timeout_minutes: int = 30) -> bool:
        """Return True when no activity occurred within the timeout window."""
        deadline = self.last_activity + timedelta(minutes=timeout_minutes)
        return datetime.now() > deadline
61
class ConversationManager:
    """
    Manages multi-turn conversation sessions for VQA.
    Handles context retention, pronoun resolution, and session lifecycle.
    """
    # Pronouns that may refer back to objects mentioned in earlier turns.
    PRONOUNS = ['it', 'this', 'that', 'these', 'those', 'they', 'them']
    def __init__(self, session_timeout_minutes: int = 30):
        """
        Initialize conversation manager
        Args:
            session_timeout_minutes: Minutes before a session expires
        """
        # session_id -> live session; expired entries are purged lazily on access.
        self.sessions: Dict[str, ConversationSession] = {}
        self.session_timeout = session_timeout_minutes
        print(f"✅ Conversation Manager initialized (timeout: {session_timeout_minutes}min)")
    def create_session(self, image_path: str, session_id: Optional[str] = None) -> str:
        """
        Create a new conversation session
        Args:
            image_path: Path to the image for this conversation
            session_id: Optional custom session ID (generates UUID if not provided)
        Returns:
            Session ID
        """
        if session_id is None:
            session_id = str(uuid.uuid4())
        session = ConversationSession(
            session_id=session_id,
            image_path=image_path
        )
        self.sessions[session_id] = session
        return session_id
    def get_session(self, session_id: str) -> Optional[ConversationSession]:
        """
        Get an existing session
        Args:
            session_id: Session ID to retrieve
        Returns:
            ConversationSession or None if not found/expired
        """
        session = self.sessions.get(session_id)
        if session is None:
            return None
        # Expired sessions are deleted on access rather than by a background task.
        if session.is_expired(self.session_timeout):
            self.delete_session(session_id)
            return None
        return session
    def get_or_create_session(
        self,
        session_id: Optional[str],
        image_path: str
    ) -> ConversationSession:
        """
        Get existing session or create new one
        Args:
            session_id: Optional session ID
            image_path: Image path for new session
        Returns:
            ConversationSession
        """
        if session_id:
            session = self.get_session(session_id)
            if session:
                return session
        # Missing or expired: create a fresh session, reusing the caller's ID if given.
        new_id = self.create_session(image_path, session_id)
        return self.sessions[new_id]
    def add_turn(
        self,
        session_id: str,
        question: str,
        answer: str,
        objects_detected: List[str],
        reasoning_chain: Optional[List[str]] = None,
        model_used: Optional[str] = None
    ) -> bool:
        """
        Add a turn to a conversation session
        Args:
            session_id: Session ID
            question: User's question
            answer: VQA answer
            objects_detected: List of detected objects
            reasoning_chain: Optional reasoning steps
            model_used: Optional model identifier
        Returns:
            True if successful, False if session not found
        """
        session = self.get_session(session_id)
        if session is None:
            return False
        session.add_turn(
            question=question,
            answer=answer,
            objects_detected=objects_detected,
            reasoning_chain=reasoning_chain,
            model_used=model_used
        )
        return True
    def resolve_references(
        self,
        question: str,
        session: ConversationSession
    ) -> str:
        """
        Resolve pronouns and references in a question using conversation context.
        Args:
            question: User's question (may contain pronouns)
            session: Conversation session with context
        Returns:
            Question with pronouns resolved
        Example:
            Input: "Is it healthy?"
            Context: Previous object was "apple"
            Output: "Is apple healthy?"
        """
        if not session.history:
            return question
        # Tokenize with a regex so trailing punctuation does not hide a pronoun:
        # a plain whitespace split turned "What color is it?" into the token
        # "it?", which never matched PRONOUNS and skipped resolution entirely.
        tokens = re.findall(r"[a-z']+", question.lower())
        if not any(pronoun in tokens for pronoun in self.PRONOUNS):
            return question
        recent_objects = session.current_objects
        if not recent_objects:
            return question
        resolved = question
        if any(pronoun in tokens for pronoun in ('it', 'this', 'that')):
            # Singular pronouns refer to the most recently detected object.
            primary_object = recent_objects[0]
            for pronoun in ('it', 'this', 'that'):
                resolved = re.sub(rf'\b{pronoun}\b', primary_object, resolved, flags=re.IGNORECASE)
        if any(pronoun in tokens for pronoun in ('these', 'those', 'they', 'them')):
            # Plural pronouns expand to the full list of detected objects.
            objects_phrase = ', '.join(recent_objects)
            for pronoun in ('these', 'those', 'they', 'them'):
                resolved = re.sub(rf'\b{pronoun}\b', objects_phrase, resolved, flags=re.IGNORECASE)
        return resolved
    def get_context_for_question(
        self,
        session_id: str,
        question: str
    ) -> Dict[str, Any]:
        """
        Get relevant context for answering a question
        Args:
            session_id: Session ID
            question: Current question
        Returns:
            Dict with context information
        """
        session = self.get_session(session_id)
        if session is None:
            # Unknown/expired session: return an empty-context placeholder.
            return {
                'has_context': False,
                'turn_number': 0,
                'previous_objects': [],
                'previous_questions': []
            }
        return {
            'has_context': len(session.history) > 0,
            'turn_number': len(session.history) + 1,
            'previous_objects': session.current_objects,
            'previous_questions': [turn.question for turn in session.history[-3:]],
            'previous_answers': [turn.answer for turn in session.history[-3:]],
            'context_summary': session.get_context_summary()
        }
    def get_history(self, session_id: str) -> Optional[List[Dict[str, Any]]]:
        """
        Get conversation history for a session
        Args:
            session_id: Session ID
        Returns:
            List of turn dictionaries or None if session not found
        """
        session = self.get_session(session_id)
        if session is None:
            return None
        history = []
        for turn in session.history:
            history.append({
                'question': turn.question,
                'answer': turn.answer,
                'objects_detected': turn.objects_detected,
                'timestamp': turn.timestamp.isoformat(),
                'reasoning_chain': turn.reasoning_chain,
                'model_used': turn.model_used
            })
        return history
    def delete_session(self, session_id: str) -> bool:
        """
        Delete a conversation session
        Args:
            session_id: Session ID to delete
        Returns:
            True if deleted, False if not found
        """
        if session_id in self.sessions:
            del self.sessions[session_id]
            return True
        return False
    def cleanup_expired_sessions(self):
        """Remove all expired sessions and return how many were removed."""
        expired_ids = [
            sid for sid, session in self.sessions.items()
            if session.is_expired(self.session_timeout)
        ]
        for sid in expired_ids:
            self.delete_session(sid)
        return len(expired_ids)
    def get_active_sessions_count(self) -> int:
        """Get count of active (non-expired) sessions"""
        self.cleanup_expired_sessions()
        return len(self.sessions)
274
+ if __name__ == "__main__":
275
+ print("=" * 80)
276
+ print("🧪 Testing Conversation Manager")
277
+ print("=" * 80)
278
+ manager = ConversationManager(session_timeout_minutes=30)
279
+ print("\n📝 Test 1: Multi-turn conversation")
280
+ session_id = manager.create_session("test_image.jpg")
281
+ print(f"Created session: {session_id}")
282
+ manager.add_turn(
283
+ session_id=session_id,
284
+ question="What is this?",
285
+ answer="apple",
286
+ objects_detected=["apple"]
287
+ )
288
+ print("Turn 1: 'What is this?' → 'apple'")
289
+ session = manager.get_session(session_id)
290
+ question_2 = "Is it healthy?"
291
+ resolved_2 = manager.resolve_references(question_2, session)
292
+ print(f"Turn 2: '{question_2}' → Resolved: '{resolved_2}'")
293
+ manager.add_turn(
294
+ session_id=session_id,
295
+ question=question_2,
296
+ answer="Yes, apples are healthy",
297
+ objects_detected=["apple"]
298
+ )
299
+ question_3 = "What color is it?"
300
+ resolved_3 = manager.resolve_references(question_3, session)
301
+ print(f"Turn 3: '{question_3}' → Resolved: '{resolved_3}'")
302
+ print("\n📝 Test 2: Context retrieval")
303
+ context = manager.get_context_for_question(session_id, "Another question")
304
+ print(f"Turn number: {context['turn_number']}")
305
+ print(f"Previous objects: {context['previous_objects']}")
306
+ print(f"Context summary: {context['context_summary']}")
307
+ print("\n📝 Test 3: Conversation history")
308
+ history = manager.get_history(session_id)
309
+ for i, turn in enumerate(history, 1):
310
+ print(f" Turn {i}: Q: {turn['question']} | A: {turn['answer']}")
311
+ print("\n" + "=" * 80)
312
+ print("✅ Tests completed!")
download_models.py ADDED
@@ -0,0 +1,27 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
"""Fetch the VQA model checkpoints from the HuggingFace Hub at startup."""
import os
from huggingface_hub import hf_hub_download

REPO_ID = "Deva8/GENvqa-model"

# We use the token from the environment variable (which the user must set in
# Settings -> Secrets); None falls back to anonymous access for public repos.
HF_TOKEN = os.getenv("HF_TOKEN")

# (checkpoint filename, message printed after a successful download)
CHECKPOINTS = [
    ("vqa_checkpoint.pt", "Base checkpoint downloaded successfully."),
    ("vqa_spatial_checkpoint.pt", "Spatial checkpoint downloaded successfully."),
]

print("Downloading models from HuggingFace Hub...")
for filename, done_message in CHECKPOINTS:
    # Files are placed in the current working directory, where the app expects them.
    hf_hub_download(
        repo_id=REPO_ID,
        filename=filename,
        local_dir=".",
        token=HF_TOKEN
    )
    print(done_message)
draft_generator.py ADDED
@@ -0,0 +1,112 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
import subprocess  # NOTE(review): unused — mmdc conversion was considered but never wired up (see comment below)
import os

# Mermaid flowchart source describing the VQA system end to end
# (mobile client -> preprocessing -> model routing -> KG/accessibility services).
mermaid_code = """
graph TD
%% Styling
classDef default fill:#1A1A1A,stroke:#444,stroke-width:2px,color:#FFF,rx:8px,ry:8px,font-family:arial;
classDef mobile fill:#003366,stroke:#0055AA,stroke-width:2px,color:#FFF;
classDef preproc fill:#333333,stroke:#555,stroke-width:2px,color:#FFF;
classDef model fill:#4B0082,stroke:#8A2BE2,stroke-width:2px,color:#FFF;
classDef condition fill:#2B2B2B,stroke:#F4A460,stroke-width:2px,color:#FFF,shape:rhombus;
classDef external fill:#004d00,stroke:#009900,stroke-width:2px,color:#FFF;
classDef final fill:#660000,stroke:#CC0000,stroke-width:2px,color:#FFF;

%% Nodes
UserApp[📱 Mobile App]:::mobile

ImgUpload[🖼️ Image]:::preproc
Question[⌨️ Question Text]:::preproc

PIL[🐍 PIL Preprocessing<br/>RGB conversion]:::preproc

CLIP[👁️ OpenAI CLIP ViT-B/32<br/>Image Features 512-dim]:::model
GPT2[🤗 DistilGPT-2<br/>Tokenized Question]:::model

Route1{Question<br/>spatial?}:::condition

Spatial[📐 Spatial VQA Model<br/>8-head attention]:::model
Base[🧠 Base VQA Model<br/>General VQA]:::model

Decoder[🤗 GPT-2 Decoder<br/>vocab decode]:::model
NeuralAns[💬 Neural Answer]:::final

Route2{Knowledge<br/>question?}:::condition

ObjDet[👁️ CLIP Object Detector<br/>Top-3 objects]:::model
Wikidata[🌍 Wikidata SPARQL<br/>P31, P186, P366]:::external
GroqV[⚡ Groq Llama-3.3<br/>Verbalizer]:::external
KGAns[🧩 KG Enhancement]:::final

FastAPI[🚀 FastAPI]:::preproc
GroqA[⚡ Groq Llama-3.3<br/>Accessibility]:::external
Audio[🔊 2-sentence description]:::final

%% Edges
UserApp -- "Image uploaded" --> ImgUpload
UserApp -- "Question typed" --> Question

ImgUpload --> PIL
PIL --> CLIP
Question --> GPT2

CLIP & GPT2 --> Route1

Route1 -- "YES" --> Spatial
Route1 -- "NO" --> Base

Spatial & Base -- "Beam search (width=5)" --> Decoder
Decoder --> NeuralAns

CLIP -- "Anchor similarity" --> Route2

Route2 -- "YES" --> ObjDet
ObjDet -- "Detected objects" --> Wikidata
Wikidata -- "Structured facts" --> GroqV
GroqV --> KGAns

FastAPI -- "Narration request" --> GroqA
GroqA --> Audio

NeuralAns & KGAns & Audio -- "JSON output" --> FastAPI
FastAPI --> UserApp
"""

# NOTE(review): hard-coded absolute Windows path — this script only works on
# the original author's machine as-is.
file_path = r"C:\Users\rdeva\Downloads\vqa_coes\architecture_draft.mmd"

with open(file_path, "w", encoding="utf-8") as f:
    f.write(mermaid_code)

print(f"Mermaid file saved to {file_path}")

# Note: In a real environment, we would use mermaid-cli (mmdc) to convert this to SVG/PNG.
# Since it might not be installed globally, we will just provide the mermaid file and
# instructions, or generate an HTML wrapper that renders it in browser.

html_path = r"C:\Users\rdeva\Downloads\vqa_coes\architecture_draft.html"
# f-string interpolates mermaid_code into the page; literal braces in the
# embedded JS/CSS are escaped as {{ }}.
html_content = f"""
<!DOCTYPE html>
<html>
<head>
<title>VQA Architecture Draft</title>
<script type="module">
import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs';
mermaid.initialize({{ startOnLoad: true, theme: 'dark', flowchart: {{ curve: 'basis' }} }});
</script>
<style>
body {{ background-color: #0D1117; color: white; font-family: sans-serif; display: flex; justify-content: center; padding: 20px; }}
.mermaid {{ background-color: #161B22; padding: 20px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0,0,0,0.5); }}
</style>
</head>
<body>
<div class="mermaid">
{mermaid_code}
</div>
</body>
</html>
"""

with open(html_path, "w", encoding="utf-8") as f:
    f.write(html_content)

print(f"HTML viewer saved to {html_path}")
ensemble_vqa_app.py ADDED
@@ -0,0 +1,458 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Production Ensemble VQA Application
3
+ Combines base model (general VQA) and spatial adapter (spatial reasoning)
4
+ for optimal performance on all question types.
5
+ NEW: Neuro-Symbolic VQA with Knowledge Graph integration
6
+ NEW: Multi-turn Conversational VQA with context management
7
+ """
8
+ import os
9
+ import torch
10
+ from PIL import Image
11
+ from transformers import GPT2Tokenizer
12
+ from models.model import VQAModel
13
+ from model_spatial import VQAModelWithSpatialAdapter
14
+ from experiments.train import Vocab
15
+ from knowledge_graph_service import KnowledgeGraphService
16
+ from typing import Optional
17
+ import time
18
+ class ProductionEnsembleVQA:
19
+
20
    # Keywords/phrases indicating a spatial-reasoning question; presumably
    # matched against the incoming question to pick the spatial model (the
    # routing method is not visible in this chunk — TODO confirm usage).
    SPATIAL_KEYWORDS = [
        'right', 'left', 'above', 'below', 'top', 'bottom',
        'up', 'down', 'upward', 'downward',
        'front', 'behind', 'back', 'next to', 'beside', 'near', 'between',
        'in front', 'in back', 'across from', 'opposite', 'adjacent',
        'closest', 'farthest', 'nearest', 'furthest', 'closer', 'farther',
        'where is', 'where are', 'which side', 'what side', 'what direction',
        'on the left', 'on the right', 'at the top', 'at the bottom',
        'to the left', 'to the right', 'in the middle', 'in the center',
        'under', 'over', 'underneath', 'on top of', 'inside', 'outside'
    ]
31
    def __init__(self, base_checkpoint, spatial_checkpoint, device='cuda'):
        """Load both VQA models plus the optional KG and conversation services.

        Args:
            base_checkpoint: Path to the base (general VQA) model checkpoint.
            spatial_checkpoint: Path to the spatial-adapter model checkpoint.
            device: Preferred device; silently falls back to 'cpu' when CUDA
                is unavailable.
        """
        self.device = device if torch.cuda.is_available() else 'cpu'
        print("="*80)
        print("🚀 INITIALIZING ENSEMBLE VQA SYSTEM")
        print("="*80)
        print(f"\n⚙️ Device: {self.device}")
        print("\n📥 Loading models...")
        start_time = time.time()
        print(" [1/2] Loading base model (general VQA)...")
        self.base_model, self.vocab, self.tokenizer = self._load_base_model(base_checkpoint)
        print(" ✓ Base model loaded")
        print(" [2/2] Loading spatial model (spatial reasoning)...")
        self.spatial_model, _, _ = self._load_spatial_model(spatial_checkpoint)
        print(" ✓ Spatial model loaded")
        load_time = time.time() - start_time
        # Optional neuro-symbolic layer: imported lazily so a missing
        # dependency degrades to neural-only mode instead of crashing startup.
        print(" [3/3] Initializing Semantic Neuro-Symbolic VQA...")
        try:
            from semantic_neurosymbolic_vqa import SemanticNeurosymbolicVQA
            self.kg_service = SemanticNeurosymbolicVQA(device=self.device)
            print(" ✓ Semantic Neuro-Symbolic VQA ready (CLIP + Wikidata, no pattern matching)")
            self.kg_enabled = True
        except Exception as e:
            print(f" ⚠️ Semantic Neuro-Symbolic VQA unavailable: {e}")
            print(" → Falling back to neural-only mode")
            self.kg_service = None
            self.kg_enabled = False
        print(f"\n✅ Ensemble ready! (loaded in {load_time:.1f}s)")
        print(f"📊 Memory: ~2x single model (~4GB GPU)")
        print(f"🎯 Routing: Automatic based on question type")
        print(f"🧠 Neuro-Symbolic: {'Enabled' if self.kg_enabled else 'Disabled (neural-only)'}")
        print(f"💬 Conversation: Initializing multi-turn support...")
        # Optional multi-turn conversation support; same graceful-degradation pattern.
        try:
            from conversation_manager import ConversationManager
            self.conversation_manager = ConversationManager(session_timeout_minutes=30)
            self.conversation_enabled = True
            print(f" ✓ Conversational VQA ready (multi-turn with context)")
        except Exception as e:
            print(f" ⚠️ Conversation manager unavailable: {e}")
            print(f" → Single-shot Q&A only")
            self.conversation_manager = None
            self.conversation_enabled = False
        print("="*80)
74
    def _load_base_model(self, checkpoint_path):
        """Load base VQA model.

        Restores the answer vocabulary, the DistilGPT-2 question tokenizer,
        and the model weights from a single checkpoint file.

        Returns:
            (model, vocab, tokenizer) with the model in eval mode.
        """
        checkpoint = torch.load(checkpoint_path, map_location=self.device)
        # Rebuild vocab from the checkpoint so token ids match training.
        vocab = Vocab()
        vocab.vocab = checkpoint['vocab']
        vocab.vocab_size = len(checkpoint['vocab'])
        vocab.word2idx = checkpoint['word2idx']
        vocab.idx2word = checkpoint['idx2word']
        vocab.pad_token_id = checkpoint['pad_token_id']
        vocab.bos_token_id = checkpoint['bos_token_id']
        vocab.eos_token_id = checkpoint['eos_token_id']
        vocab.unk_token_id = checkpoint['unk_token_id']
        tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
        if tokenizer.pad_token is None:
            tokenizer.add_special_tokens({"pad_token": "[PAD]"})
        # Max lengths default to the training values when absent from older checkpoints.
        model = VQAModel(
            vocab_size=len(checkpoint['vocab']),
            device=self.device,
            question_max_len=checkpoint.get('question_max_len', 20),
            answer_max_len=checkpoint.get('answer_max_len', 12),
            pad_token_id=checkpoint['pad_token_id'],
            bos_token_id=checkpoint['bos_token_id'],
            eos_token_id=checkpoint['eos_token_id'],
            unk_token_id=checkpoint['unk_token_id'],
            hidden_size=512,
            num_layers=2
        ).to(self.device)
        # Resize embeddings BEFORE loading weights so shapes match the checkpoint.
        model.gpt2_model.resize_token_embeddings(len(tokenizer))
        model.load_state_dict(checkpoint['model_state_dict'], strict=False)
        model.eval()
        return model, vocab, tokenizer
105
def _load_spatial_model(self, checkpoint_path):
    """Restore the spatial-adapter VQA model from a saved checkpoint.

    Rebuilds the vocabulary and tokenizer, constructs the base VQAModel,
    wraps it with the spatial cross-attention adapter, loads the trained
    weights, and switches to eval mode.

    Args:
        checkpoint_path: Path to the spatial model checkpoint file.

    Returns:
        Tuple ``(model, vocab, tokenizer)`` ready for inference.
    """
    ckpt = torch.load(checkpoint_path, map_location=self.device)

    # Reconstruct the training-time vocabulary from the checkpoint.
    vocab = Vocab()
    vocab.vocab = ckpt['vocab']
    vocab.vocab_size = len(ckpt['vocab'])
    vocab.word2idx = ckpt['word2idx']
    vocab.idx2word = ckpt['idx2word']
    vocab.pad_token_id = ckpt['pad_token_id']
    vocab.bos_token_id = ckpt['bos_token_id']
    vocab.eos_token_id = ckpt['eos_token_id']
    vocab.unk_token_id = ckpt['unk_token_id']

    # DistilGPT-2 ships without a pad token; add one so padding works.
    tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({"pad_token": "[PAD]"})

    backbone = VQAModel(
        vocab_size=len(ckpt['vocab']),
        device=self.device,
        question_max_len=ckpt.get('question_max_len', 20),
        answer_max_len=ckpt.get('answer_max_len', 12),
        pad_token_id=ckpt['pad_token_id'],
        bos_token_id=ckpt['bos_token_id'],
        eos_token_id=ckpt['eos_token_id'],
        unk_token_id=ckpt['unk_token_id'],
        hidden_size=512,
        num_layers=2
    ).to(self.device)
    backbone.gpt2_model.resize_token_embeddings(len(tokenizer))

    # Wrap the backbone with the spatial adapter (8-head attention).
    model = VQAModelWithSpatialAdapter(
        base_model=backbone,
        hidden_size=512,
        num_heads=8,
        dropout=0.3
    ).to(self.device)
    # strict=False tolerates minor key drift between training and serving.
    model.load_state_dict(ckpt['model_state_dict'], strict=False)
    model.eval()
    return model, vocab, tokenizer
142
def is_spatial_question(self, question):
    """Heuristically decide whether a question needs spatial reasoning.

    A question is considered spatial when any entry from
    ``self.SPATIAL_KEYWORDS`` occurs as a substring of the lowercased text.

    Args:
        question: Question string.

    Returns:
        bool: True when a spatial keyword is present, False otherwise.
    """
    lowered = question.lower()
    for keyword in self.SPATIAL_KEYWORDS:
        if keyword in lowered:
            return True
    return False
152
def answer(self, image_path, question, use_beam_search=True, beam_width=5, verbose=False):
    """Answer a question about an image, routing to the best-suited model.

    The neural VQA answer is ALWAYS the primary answer. When the knowledge
    service decides the question is knowledge-seeking, a neuro-symbolic
    supplement (CLIP object detection + Wikidata facts) is attached in
    'kg_enhancement' — it never replaces the neural answer.

    Args:
        image_path: Path to image file.
        question: Question string.
        use_beam_search: Use beam-search decoding when the model supports it.
        beam_width: Beam width for beam search.
        verbose: Print routing / reasoning progress.

    Returns:
        dict: {
            'answer': str,
            'model_used': 'spatial' or 'base',
            'confidence': float,
            'kg_enhancement': str or None,
            'reasoning_type': 'neural' or 'neuro-symbolic',
            'objects_detected': list,
            'question_intent': str or None,
            'wikidata_entity': str or None,
            'knowledge_source': str or None,
        }
    """
    # Route: spatial-keyword questions go to the spatial-adapter model.
    is_spatial = self.is_spatial_question(question)
    model_used = 'spatial' if is_spatial else 'base'
    if verbose:
        print(f"🔍 Question type: {'Spatial' if is_spatial else 'General'}")
        print(f"🤖 Using: {model_used} model")
    model = self.spatial_model if is_spatial else self.base_model

    # FIX: use a context manager so PIL closes the underlying file handle
    # promptly; Image.open() otherwise keeps it open lazily until GC.
    with Image.open(image_path) as pil_image:
        image = model.clip_preprocess(pil_image.convert('RGB')).unsqueeze(0).to(self.device)

    # Tokenize the question to the model's fixed question length.
    question_tokens = self.tokenizer(
        question,
        padding='max_length',
        truncation=True,
        max_length=model.question_max_len,
        return_tensors='pt'
    )
    questions = {
        'input_ids': question_tokens['input_ids'].to(self.device),
        'attention_mask': question_tokens['attention_mask'].to(self.device)
    }

    # Generate answer tokens (beam search when available and requested).
    with torch.no_grad():
        if use_beam_search and hasattr(model, 'generate_with_beam_search'):
            generated = model.generate_with_beam_search(
                image, questions, beam_width=beam_width
            )
        else:
            generated = model(image, questions)

    # Always get the neural answer first — it is ALWAYS the primary answer.
    if verbose:
        print("📝 Using neural VQA...")
    neural_answer = self.vocab.decoder(generated[0].cpu().numpy())

    # Neuro-symbolic is a *supplement* only — its result goes into
    # kg_enhancement, never replacing the neural answer.
    kg_enhancement = None
    reasoning_type = 'neural'
    objects_detected = []
    question_intent = None
    wikidata_entity = None
    knowledge_source = None

    if self.kg_enabled and self.kg_service:
        if verbose:
            print("🔍 Analyzing question semantics...")
        should_use_ns = self.kg_service.should_use_neurosymbolic(
            image_features=None,
            question=question,
            vqa_confidence=0.0,
            image_path=image_path
        )
        if should_use_ns:
            if verbose:
                print("🧠 Neuro-Symbolic supplement: detecting subject via CLIP...")

            # CLIP zero-shot: compare image against 80+ concrete noun labels.
            # This is much more accurate than asking the VQA model.
            detected_objects = self.kg_service.detect_objects_with_clip(
                image_path=image_path, top_k=3
            )

            if verbose:
                print(f" → CLIP detected: {detected_objects}")
                print(" → Fetching Wikidata facts + Groq verbalization...")

            if detected_objects:
                ns_result = self.kg_service.answer_with_clip_features(
                    image_features=None,
                    question=question,
                    image_path=image_path,
                    detected_objects=tuple(detected_objects)
                )

                if ns_result:
                    kg_enhancement = ns_result['kg_enhancement']
                    reasoning_type = 'neuro-symbolic'
                    objects_detected = detected_objects  # expose to return dict
                    question_intent = ns_result.get('question_intent')
                    wikidata_entity = ns_result.get('wikidata_entity')
                    knowledge_source = ns_result.get('knowledge_source')
                    if verbose:
                        print(f"✨ Neuro-Symbolic supplement: {kg_enhancement}")
                        print(f" → Wikidata entity: {wikidata_entity}")
            else:
                if verbose:
                    print(" → CLIP could not identify subject, skipping Wikidata lookup")

    # NOTE(review): confidence is a hard-coded placeholder — the decoder
    # does not expose a real probability yet; confirm before relying on it.
    return {
        'answer': neural_answer,
        'model_used': model_used,
        'confidence': 1.0,
        'kg_enhancement': kg_enhancement,
        'reasoning_type': reasoning_type,
        'objects_detected': objects_detected,
        'question_intent': question_intent,
        'wikidata_entity': wikidata_entity,
        'knowledge_source': knowledge_source,
    }
267
def answer_conversational(
    self,
    image_path: str,
    question: str,
    session_id: Optional[str] = None,
    use_beam_search: bool = True,
    beam_width: int = 5,
    verbose: bool = False
) -> dict:
    """Answer a question with multi-turn conversation support.

    Resolves pronouns ("it", "this", ...) against the session history,
    delegates to :meth:`answer`, then records the turn and attaches the
    session context to the result.

    Args:
        image_path: Path to image file.
        question: Question string (may contain pronouns).
        session_id: Optional session ID for continuing a conversation.
        use_beam_search: Whether to use beam search.
        beam_width: Beam width for beam search.
        verbose: Print routing information.

    Returns:
        dict: the :meth:`answer` result plus 'session_id',
        'resolved_question' and 'conversation_context'.
    """
    # Without a conversation manager, degrade gracefully to one-shot Q&A.
    if not (self.conversation_enabled and self.conversation_manager):
        single = self.answer(image_path, question, use_beam_search, beam_width, verbose)
        single['session_id'] = None
        single['resolved_question'] = question
        single['conversation_context'] = {'has_context': False}
        return single

    session = self.conversation_manager.get_or_create_session(session_id, image_path)
    sid = session.session_id
    if verbose:
        print(f"💬 Session: {sid}")
        print(f" Turn number: {len(session.history) + 1}")

    # Replace pronouns with referents from earlier turns, when possible.
    resolved = self.conversation_manager.resolve_references(question, session)
    if verbose and resolved != question:
        print(f"🔄 Pronoun resolution:")
        print(f" Original: {question}")
        print(f" Resolved: {resolved}")

    result = self.answer(
        image_path=image_path,
        question=resolved,
        use_beam_search=use_beam_search,
        beam_width=beam_width,
        verbose=verbose
    )

    # Record the turn with the ORIGINAL question so history reads naturally.
    # NOTE(review): answer() never returns 'reasoning_chain', so this is
    # always None — confirm whether that key was meant to be populated.
    self.conversation_manager.add_turn(
        session_id=sid,
        question=question,
        answer=result['answer'],
        objects_detected=result.get('objects_detected', []),
        reasoning_chain=result.get('reasoning_chain'),
        model_used=result.get('model_used')
    )
    context = self.conversation_manager.get_context_for_question(sid, question)

    result['session_id'] = sid
    result['resolved_question'] = resolved
    result['conversation_context'] = context
    return result
334
def _detect_multiple_objects(self, image, vqa_model, top_k=3):
    """Probe the VQA model with neutral prompts to name the image subject.

    The same question is phrased several ways so the model gets multiple
    chances to identify the actual subject — never biasing toward food or
    any other category. Stop-word and duplicate answers are discarded.

    Args:
        image: Preprocessed image tensor (already on device).
        vqa_model: The VQA model used to answer the probe questions.
        top_k: Maximum number of unique answers to collect.

    Returns:
        list[str]: Up to ``top_k`` unique candidate subjects (may be empty).
    """
    # Deliberately neutral phrasings — no food bias, no category bias.
    probes = (
        "What is the main subject of this image?",
        "What is in this image?",
        "What is shown in this picture?",
    )
    # Tokens we treat as non-answers.
    non_answers = {'a', 'an', 'the', 'this', 'that', 'it', 'yes', 'no',
                   'some', 'there', 'here', 'image', 'picture', 'photo'}

    found = []
    for probe in probes:
        try:
            toks = self.tokenizer(
                probe,
                padding='max_length',
                truncation=True,
                max_length=vqa_model.question_max_len,
                return_tensors='pt'
            )
            batch = {
                'input_ids': toks['input_ids'].to(self.device),
                'attention_mask': toks['attention_mask'].to(self.device)
            }
            with torch.no_grad():
                out = vqa_model(image, batch)
            candidate = self.vocab.decoder(out[0].cpu().numpy()).strip()
            if (candidate
                    and candidate.lower() not in non_answers
                    and candidate not in found):
                found.append(candidate)
                if len(found) >= top_k:
                    break
        except Exception as e:
            # Best-effort: a failed probe should not abort the others.
            print(f" ⚠️ Error detecting objects: {e}")
            continue
    return found
377
def batch_answer(self, image_question_pairs, use_beam_search=True, verbose=False):
    """Answer multiple (image, question) pairs sequentially.

    Args:
        image_question_pairs: List of ``(image_path, question)`` tuples.
        use_beam_search: Whether to use beam search for each answer.
        verbose: Print per-item progress.

    Returns:
        list[dict]: One result dict (from :meth:`answer`) per input pair.
    """
    total = len(image_question_pairs)
    collected = []
    for idx, (img_path, q) in enumerate(image_question_pairs, start=1):
        if verbose:
            print(f"\n[{idx}/{total}] Processing...")
        collected.append(self.answer(img_path, q, use_beam_search, verbose=verbose))
    return collected
395
def demo():
    """Smoke-test the ensemble: run a fixed set of spatial and general
    questions against one image and report whether each was routed to the
    expected model."""
    BASE_CHECKPOINT = "./output2/continued_training/vqa_checkpoint.pt"
    SPATIAL_CHECKPOINT = "./output2/spatial_adapter_v2_2/vqa_spatial_checkpoint.pt"
    IMAGE = "./im2.jpg"

    ensemble = ProductionEnsembleVQA(BASE_CHECKPOINT, SPATIAL_CHECKPOINT)

    # (question, should-route-to-spatial-model)
    test_cases = [
        ("what is to the right of the soup?", True),
        ("what is on the left side?", True),
        ("what is above the table?", True),
        ("what is next to the bowl?", True),
        ("what color is the bowl?", False),
        ("how many items are there?", False),
        ("what room is this?", False),
        ("is there a spoon?", False),
    ]

    banner = "="*80
    print("\n" + banner)
    print("🧪 TESTING ENSEMBLE VQA SYSTEM")
    print(banner)
    print(f"\n📷 Image: {IMAGE}\n")

    for question, expect_spatial in test_cases:
        outcome = ensemble.answer(IMAGE, question, verbose=False)
        routed_spatial = outcome['model_used'] == 'spatial'
        mark = "✓" if routed_spatial == expect_spatial else "✗"
        print(f"Q: {question}")
        print(f"A: {outcome['answer']}")
        print(f"Model: {outcome['model_used']} {mark}")
        print()

    print(banner)
    print("✅ Demo complete!")
425
def interactive_mode():
    """Run a console REPL: prompt for an image path and a question, answer
    with full verbosity, and repeat until the user types 'quit'."""
    BASE_CHECKPOINT = "./output2/continued_training/vqa_checkpoint.pt"
    SPATIAL_CHECKPOINT = "./output2/spatial_adapter_v2_2/vqa_spatial_checkpoint.pt"

    ensemble = ProductionEnsembleVQA(BASE_CHECKPOINT, SPATIAL_CHECKPOINT)

    banner = "="*80
    print("\n" + banner)
    print("🎮 INTERACTIVE MODE")
    print(banner)
    print("\nCommands:")
    print(" - Enter image path and question")
    print(" - Type 'quit' to exit")
    print(banner + "\n")

    while True:
        try:
            img = input("📷 Image path: ").strip()
            if img.lower() == 'quit':
                break
            q = input("❓ Question: ").strip()
            if q.lower() == 'quit':
                break
            outcome = ensemble.answer(img, q, verbose=True)
            print(f"\n💬 Answer: {outcome['answer']}\n")
            print("-"*80 + "\n")
        except KeyboardInterrupt:
            print("\n\n👋 Goodbye!")
            break
        except Exception as e:
            # Keep the REPL alive on bad paths / model errors.
            print(f"\n❌ Error: {e}\n")
453
if __name__ == "__main__":
    import sys

    # CLI entry point: `python <script> interactive` starts the REPL;
    # any other invocation runs the scripted demo.
    mode = sys.argv[1] if len(sys.argv) > 1 else None
    (interactive_mode if mode == "interactive" else demo)()
enterprise_architecture.drawio ADDED
@@ -0,0 +1,341 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <?xml version="1.0" encoding="UTF-8"?>
2
+ <mxGraphModel dx="1800" dy="1100" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="1920" pageHeight="1080" math="0" shadow="1">
3
+ <root>
4
+ <mxCell id="0" />
5
+ <mxCell id="1" parent="0" />
6
+
7
+ <mxCell id="bg" value="" style="rounded=0;whiteSpace=wrap;html=1;fillColor=#0D1117;strokeColor=none;" vertex="1" parent="1">
8
+ <mxGeometry x="-20" y="-20" width="1960" height="1120" as="geometry" />
9
+ </mxCell>
10
+
11
+ <mxCell id="title_bg" value="" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#161B22;strokeColor=#30363D;" vertex="1" parent="1">
12
+ <mxGeometry x="20" y="20" width="1880" height="70" as="geometry" />
13
+ </mxCell>
14
+
15
+ <mxCell id="title" value="&lt;font style=&quot;font-size:24px;font-weight:bold;&quot; color=&quot;#58A6FF&quot;&gt;Semantic Neuro-Symbolic VQA -- Enterprise Architecture&lt;/font&gt;&lt;br&gt;&lt;font style=&quot;font-size:11px;&quot; color=&quot;#8B949E&quot;&gt;React Native Mobile UI | FastAPI (Uvicorn) | PyTorch | OpenAI CLIP | Wikidata SPARQL | Groq LLM (Llama-3.3-70B-Versatile)&lt;/font&gt;" style="text;html=1;strokeColor=none;fillColor=none;align=center;verticalAlign=middle;whiteSpace=wrap;" vertex="1" parent="1">
16
+ <mxGeometry x="20" y="20" width="1880" height="70" as="geometry" />
17
+ </mxCell>
18
+
19
+ <!-- ===================== CLIENT LAYER ===================== -->
20
+ <mxCell id="client_layer" value="&lt;font style=&quot;font-size:14px;font-weight:bold;&quot; color=&quot;#79C0FF&quot;&gt;[1] CLIENT LAYER&lt;/font&gt;" style="swimlane;startSize=30;fillColor=#161B22;strokeColor=#1F6FEB;fontColor=#FFFFFF;fontStyle=1;fontSize=13;rounded=10;" vertex="1" parent="1">
21
+ <mxGeometry x="20" y="110" width="350" height="870" as="geometry" />
22
+ </mxCell>
23
+
24
+ <mxCell id="mobile_label" value="[React Native / Expo]" style="text;html=1;fontSize=20;align=center;fillColor=none;strokeColor=none;fontColor=#58A6FF;" vertex="1" parent="client_layer">
25
+ <mxGeometry x="80" y="38" width="190" height="35" as="geometry" />
26
+ </mxCell>
27
+
28
+ <mxCell id="mobile_app" value="&lt;b&gt;React Native Mobile App&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;Expo Framework | iOS and Android&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#1C2128;strokeColor=#30363D;fontColor=#FFFFFF;fontSize=12;" vertex="1" parent="client_layer">
29
+ <mxGeometry x="30" y="85" width="290" height="60" as="geometry" />
30
+ </mxCell>
31
+
32
+ <mxCell id="screen_login" value="&lt;b&gt;LoginScreen.js&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;Auth | Session Management&lt;/font&gt;" style="rounded=8;whiteSpace=wrap;html=1;fillColor=#0D2137;strokeColor=#1F6FEB;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="client_layer">
33
+ <mxGeometry x="30" y="165" width="290" height="50" as="geometry" />
34
+ </mxCell>
35
+
36
+ <mxCell id="screen_camera" value="&lt;b&gt;CameraScreen.js&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;Image Capture | Upload&lt;/font&gt;" style="rounded=8;whiteSpace=wrap;html=1;fillColor=#0D2137;strokeColor=#1F6FEB;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="client_layer">
37
+ <mxGeometry x="30" y="225" width="290" height="50" as="geometry" />
38
+ </mxCell>
39
+
40
+ <mxCell id="screen_home" value="&lt;b&gt;HomeScreen.js&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;Main Dashboard | History&lt;/font&gt;" style="rounded=8;whiteSpace=wrap;html=1;fillColor=#0D2137;strokeColor=#1F6FEB;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="client_layer">
41
+ <mxGeometry x="30" y="285" width="290" height="50" as="geometry" />
42
+ </mxCell>
43
+
44
+ <mxCell id="screen_qa" value="&lt;b&gt;QuestionScreen.js&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;Q and A Interface | Conversation&lt;/font&gt;" style="rounded=8;whiteSpace=wrap;html=1;fillColor=#0D2137;strokeColor=#1F6FEB;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="client_layer">
45
+ <mxGeometry x="30" y="345" width="290" height="50" as="geometry" />
46
+ </mxCell>
47
+
48
+ <mxCell id="screen_result" value="&lt;b&gt;ResultScreen.js&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;Answer Display | KG Enhancement&lt;/font&gt;" style="rounded=8;whiteSpace=wrap;html=1;fillColor=#0D2137;strokeColor=#1F6FEB;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="client_layer">
49
+ <mxGeometry x="30" y="405" width="290" height="50" as="geometry" />
50
+ </mxCell>
51
+
52
+ <mxCell id="api_js" value="&lt;b&gt;api.js (API Service)&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;Axios | FormData | Session Tokens&lt;br&gt;REST calls to FastAPI backend&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#1A2820;strokeColor=#3FB950;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="client_layer">
53
+ <mxGeometry x="30" y="478" width="290" height="70" as="geometry" />
54
+ </mxCell>
55
+
56
+ <mxCell id="ep1" value="POST /api/answer" style="rounded=5;whiteSpace=wrap;html=1;fillColor=#0D1117;strokeColor=#3FB950;fontColor=#3FB950;fontSize=10;" vertex="1" parent="client_layer">
57
+ <mxGeometry x="30" y="565" width="135" height="30" as="geometry" />
58
+ </mxCell>
59
+ <mxCell id="ep2" value="POST /api/conversation/answer" style="rounded=5;whiteSpace=wrap;html=1;fillColor=#0D1117;strokeColor=#3FB950;fontColor=#3FB950;fontSize=10;" vertex="1" parent="client_layer">
60
+ <mxGeometry x="177" y="565" width="143" height="30" as="geometry" />
61
+ </mxCell>
62
+ <mxCell id="ep3" value="GET /api/models/info" style="rounded=5;whiteSpace=wrap;html=1;fillColor=#0D1117;strokeColor=#3FB950;fontColor=#3FB950;fontSize=10;" vertex="1" parent="client_layer">
63
+ <mxGeometry x="30" y="605" width="135" height="30" as="geometry" />
64
+ </mxCell>
65
+ <mxCell id="ep4" value="GET/DELETE /api/conversation/{id}" style="rounded=5;whiteSpace=wrap;html=1;fillColor=#0D1117;strokeColor=#3FB950;fontColor=#3FB950;fontSize=10;" vertex="1" parent="client_layer">
66
+ <mxGeometry x="177" y="605" width="143" height="30" as="geometry" />
67
+ </mxCell>
68
+
69
+ <mxCell id="client_tech" value="&lt;b&gt;Tech:&lt;/b&gt; Expo | React Navigation | Axios | FormData&lt;br&gt;&lt;b&gt;Auth:&lt;/b&gt; Session tokens | Context API" style="rounded=8;whiteSpace=wrap;html=1;fillColor=#161B22;strokeColor=#21262D;fontColor=#8B949E;fontSize=10;" vertex="1" parent="client_layer">
70
+ <mxGeometry x="30" y="660" width="290" height="55" as="geometry" />
71
+ </mxCell>
72
+
73
+ <!-- ===================== API GATEWAY LAYER ===================== -->
74
+ <mxCell id="api_layer" value="&lt;font style=&quot;font-size:14px;font-weight:bold;&quot; color=&quot;#56D364&quot;&gt;[2] API GATEWAY LAYER&lt;/font&gt;" style="swimlane;startSize=30;fillColor=#161B22;strokeColor=#3FB950;fontColor=#FFFFFF;fontStyle=1;fontSize=13;rounded=10;" vertex="1" parent="1">
75
+ <mxGeometry x="400" y="110" width="360" height="870" as="geometry" />
76
+ </mxCell>
77
+
78
+ <mxCell id="apigw_label" value="[FastAPI + Uvicorn]" style="text;html=1;fontSize=20;align=center;fillColor=none;strokeColor=none;fontColor=#3FB950;" vertex="1" parent="api_layer">
79
+ <mxGeometry x="85" y="38" width="190" height="35" as="geometry" />
80
+ </mxCell>
81
+
82
+ <mxCell id="fastapi_main" value="&lt;b&gt;FastAPI Backend (Uvicorn)&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;backend_api.py&lt;br&gt;Host: 0.0.0.0 | Port: 8000&lt;br&gt;CORS enabled | Auto-reload dev mode&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#162415;strokeColor=#3FB950;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="api_layer">
83
+ <mxGeometry x="20" y="88" width="320" height="80" as="geometry" />
84
+ </mxCell>
85
+
86
+ <mxCell id="startup" value="&lt;b&gt;Startup Event&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;Load checkpoints | Init models&lt;br&gt;Init Groq service | Health check&lt;/font&gt;" style="rounded=8;whiteSpace=wrap;html=1;fillColor=#1C2128;strokeColor=#30363D;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="api_layer">
87
+ <mxGeometry x="20" y="188" width="320" height="60" as="geometry" />
88
+ </mxCell>
89
+
90
+ <mxCell id="ep_health" value="GET /health&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;Model status check&lt;/font&gt;" style="rounded=6;whiteSpace=wrap;html=1;fillColor=#0D2137;strokeColor=#1F6FEB;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="api_layer">
91
+ <mxGeometry x="20" y="268" width="145" height="50" as="geometry" />
92
+ </mxCell>
93
+ <mxCell id="ep_root" value="GET /&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;API info and docs&lt;/font&gt;" style="rounded=6;whiteSpace=wrap;html=1;fillColor=#0D2137;strokeColor=#1F6FEB;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="api_layer">
94
+ <mxGeometry x="175" y="268" width="145" height="50" as="geometry" />
95
+ </mxCell>
96
+ <mxCell id="ep_answer" value="POST /api/answer&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;image + question -&gt; JSON answer&lt;/font&gt;" style="rounded=6;whiteSpace=wrap;html=1;fillColor=#132D0E;strokeColor=#3FB950;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="api_layer">
97
+ <mxGeometry x="20" y="328" width="300" height="50" as="geometry" />
98
+ </mxCell>
99
+ <mxCell id="ep_conv" value="POST /api/conversation/answer&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;Multi-turn | session_id | pronouns&lt;/font&gt;" style="rounded=6;whiteSpace=wrap;html=1;fillColor=#132D0E;strokeColor=#3FB950;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="api_layer">
100
+ <mxGeometry x="20" y="388" width="300" height="50" as="geometry" />
101
+ </mxCell>
102
+ <mxCell id="ep_hist" value="GET /api/conversation/{id}/history" style="rounded=6;whiteSpace=wrap;html=1;fillColor=#1C2128;strokeColor=#30363D;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="api_layer">
103
+ <mxGeometry x="20" y="448" width="300" height="38" as="geometry" />
104
+ </mxCell>
105
+ <mxCell id="ep_del" value="DELETE /api/conversation/{id}" style="rounded=6;whiteSpace=wrap;html=1;fillColor=#1C2128;strokeColor=#30363D;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="api_layer">
106
+ <mxGeometry x="20" y="496" width="300" height="38" as="geometry" />
107
+ </mxCell>
108
+ <mxCell id="ep_models" value="GET /api/models/info" style="rounded=6;whiteSpace=wrap;html=1;fillColor=#1C2128;strokeColor=#30363D;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="api_layer">
109
+ <mxGeometry x="20" y="544" width="300" height="38" as="geometry" />
110
+ </mxCell>
111
+
112
+ <mxCell id="middleware" value="&lt;b&gt;Middleware&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;CORS | Error handling | HTTP 400/503/500&lt;/font&gt;" style="rounded=8;whiteSpace=wrap;html=1;fillColor=#1C2128;strokeColor=#30363D;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="api_layer">
113
+ <mxGeometry x="20" y="600" width="320" height="50" as="geometry" />
114
+ </mxCell>
115
+
116
+ <mxCell id="conv_manager" value="&lt;b&gt;ConversationManager&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;conversation_manager.py&lt;br&gt;Session 30min timeout | Pronoun resolution&lt;br&gt;History storage | Context retrieval&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#1A1A2E;strokeColor=#7B2FBE;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="api_layer">
117
+ <mxGeometry x="20" y="670" width="320" height="80" as="geometry" />
118
+ </mxCell>
119
+
120
+ <!-- ===================== ML INFERENCE ENGINE ===================== -->
121
+ <mxCell id="ml_layer" value="&lt;font style=&quot;font-size:14px;font-weight:bold;&quot; color=&quot;#FFA657&quot;&gt;[3] ML INFERENCE ENGINE&lt;/font&gt;" style="swimlane;startSize=30;fillColor=#161B22;strokeColor=#D29922;fontColor=#FFFFFF;fontStyle=1;fontSize=13;rounded=10;" vertex="1" parent="1">
122
+ <mxGeometry x="800" y="110" width="380" height="870" as="geometry" />
123
+ </mxCell>
124
+
125
+ <mxCell id="ml_label" value="[PyTorch + CLIP + DistilGPT-2]" style="text;html=1;fontSize=16;align=center;fillColor=none;strokeColor=none;fontColor=#D29922;" vertex="1" parent="ml_layer">
126
+ <mxGeometry x="40" y="38" width="300" height="35" as="geometry" />
127
+ </mxCell>
128
+
129
+ <mxCell id="ensemble_vqa" value="&lt;b&gt;ProductionEnsembleVQA&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;ensemble_vqa_app.py&lt;br&gt;Device: CUDA / CPU auto-detect&lt;br&gt;Beam Search width=5 | Top-K Decoding&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#2D2000;strokeColor=#D29922;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="ml_layer">
130
+ <mxGeometry x="20" y="88" width="340" height="80" as="geometry" />
131
+ </mxCell>
132
+
133
+ <mxCell id="router" value="&lt;b&gt;Question Router (Keyword Classifier)&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;is_spatial_question()&lt;br&gt;Spatial keywords: left, right, above, below, next to...&lt;br&gt;Routes to Base or Spatial model&lt;/font&gt;" style="rounded=8;whiteSpace=wrap;html=1;fillColor=#1E1E00;strokeColor=#D29922;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="ml_layer">
134
+ <mxGeometry x="20" y="188" width="340" height="75" as="geometry" />
135
+ </mxCell>
136
+
137
+ <mxCell id="base_model_box" value="&lt;b&gt;Base VQA Model&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;model.py | VQAModel&lt;br&gt;CLIP ViT-B/32 + GPT-2&lt;br&gt;vqa_checkpoint.pt (731 MB)&lt;br&gt;hidden=512 | layers=2 | acc~50%&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#162415;strokeColor=#3FB950;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="ml_layer">
138
+ <mxGeometry x="20" y="285" width="158" height="120" as="geometry" />
139
+ </mxCell>
140
+
141
+ <mxCell id="spatial_model_box" value="&lt;b&gt;Spatial VQA Model&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;model_spatial.py&lt;br&gt;SpatialAdapter + 8-head attn&lt;br&gt;vqa_spatial_checkpoint.pt (739 MB)&lt;br&gt;dropout=0.3 | acc~40%&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#0D2137;strokeColor=#1F6FEB;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="ml_layer">
142
+ <mxGeometry x="192" y="285" width="168" height="120" as="geometry" />
143
+ </mxCell>
144
+
145
+ <mxCell id="gpt2" value="&lt;b&gt;DistilGPT-2 Tokenizer&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;Text tokenization | Vocab&lt;br&gt;BOS / EOS / PAD tokens | Beam search decoding&lt;/font&gt;" style="rounded=8;whiteSpace=wrap;html=1;fillColor=#1C2128;strokeColor=#30363D;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="ml_layer">
146
+ <mxGeometry x="20" y="425" width="340" height="65" as="geometry" />
147
+ </mxCell>
148
+
149
+ <mxCell id="clip_box" value="&lt;b&gt;OpenAI CLIP (ViT-B/32)&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;Image encoder + Text encoder&lt;br&gt;Zero-shot object detection (80+ nouns)&lt;br&gt;Question routing: visual vs knowledge&lt;br&gt;Anchor similarity | Softmax x10&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#1A1A0D;strokeColor=#E3B341;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="ml_layer">
150
+ <mxGeometry x="20" y="508" width="340" height="90" as="geometry" />
151
+ </mxCell>
152
+
153
+ <mxCell id="img_proc" value="&lt;b&gt;Image Preprocessor (PIL)&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;JPEG/PNG -&gt; RGB | CLIP preprocess | Tensor&lt;/font&gt;" style="rounded=8;whiteSpace=wrap;html=1;fillColor=#1C2128;strokeColor=#30363D;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="ml_layer">
154
+ <mxGeometry x="20" y="615" width="340" height="55" as="geometry" />
155
+ </mxCell>
156
+
157
+ <mxCell id="pt_files" value="&lt;b&gt;PyTorch Checkpoints (Local Disk)&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;vqa_checkpoint.pt (731 MB)&lt;br&gt;vqa_spatial_checkpoint.pt (739 MB)&lt;br&gt;state_dict | vocab | tokenizer config&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#251A00;strokeColor=#D29922;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="ml_layer">
158
+ <mxGeometry x="20" y="688" width="340" height="80" as="geometry" />
159
+ </mxCell>
160
+
161
+ <mxCell id="gpu_badge" value="GPU: CUDA | ~4 GB VRAM | 2x Model Parallel loading" style="rounded=5;whiteSpace=wrap;html=1;fillColor=#0D1117;strokeColor=#D29922;fontColor=#E3B341;fontSize=10;" vertex="1" parent="ml_layer">
162
+ <mxGeometry x="20" y="785" width="340" height="28" as="geometry" />
163
+ </mxCell>
164
+
165
+ <!-- ===================== NEURO-SYMBOLIC PIPELINE ===================== -->
166
+ <mxCell id="ns_layer" value="&lt;font style=&quot;font-size:14px;font-weight:bold;&quot; color=&quot;#BC8CFF&quot;&gt;[4] NEURO-SYMBOLIC PIPELINE&lt;/font&gt;" style="swimlane;startSize=30;fillColor=#161B22;strokeColor=#8957E5;fontColor=#FFFFFF;fontStyle=1;fontSize=13;rounded=10;" vertex="1" parent="1">
167
+ <mxGeometry x="1220" y="110" width="370" height="870" as="geometry" />
168
+ </mxCell>
169
+
170
+ <mxCell id="ns_label" value="[CLIP + Wikidata SPARQL + Groq LLM]" style="text;html=1;fontSize=14;align=center;fillColor=none;strokeColor=none;fontColor=#8957E5;" vertex="1" parent="ns_layer">
171
+ <mxGeometry x="15" y="38" width="340" height="35" as="geometry" />
172
+ </mxCell>
173
+
174
+ <mxCell id="ns_main" value="&lt;b&gt;SemanticNeurosymbolicVQA&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;semantic_neurosymbolic_vqa.py&lt;br&gt;Neural -&gt; Symbolic -&gt; Verbalize pipeline&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#1A0D2E;strokeColor=#8957E5;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="ns_layer">
175
+ <mxGeometry x="20" y="88" width="330" height="65" as="geometry" />
176
+ </mxCell>
177
+
178
+ <mxCell id="ns_step1" value="&lt;b&gt;Step 1: CLIP Routing&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;should_use_neurosymbolic()&lt;br&gt;VISUAL anchor vs KNOWLEDGE anchor&lt;br&gt;Temperature softmax x10&lt;/font&gt;" style="rounded=8;whiteSpace=wrap;html=1;fillColor=#0D1A30;strokeColor=#1F6FEB;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="ns_layer">
179
+ <mxGeometry x="20" y="173" width="330" height="78" as="geometry" />
180
+ </mxCell>
181
+
182
+ <mxCell id="route_decision" value="VISUAL question?&lt;br&gt;-&gt; Neural VQA only&lt;br&gt;KNOWLEDGE question?&lt;br&gt;-&gt; Neuro-Symbolic" style="rhombus;whiteSpace=wrap;html=1;fillColor=#21262D;strokeColor=#8957E5;fontColor=#FFFFFF;fontSize=10;" vertex="1" parent="ns_layer">
183
+ <mxGeometry x="75" y="268" width="220" height="88" as="geometry" />
184
+ </mxCell>
185
+
186
+ <mxCell id="ns_step2" value="&lt;b&gt;Step 2: CLIP Object Detection&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;detect_objects_with_clip()&lt;br&gt;80+ noun vocabulary | Top-3 objects&lt;br&gt;Cosine similarity | prompt: &apos;a photo of a {label}&apos;&lt;/font&gt;" style="rounded=8;whiteSpace=wrap;html=1;fillColor=#0D1A30;strokeColor=#1F6FEB;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="ns_layer">
187
+ <mxGeometry x="20" y="375" width="330" height="80" as="geometry" />
188
+ </mxCell>
189
+
190
+ <mxCell id="wikidata_box" value="&lt;b&gt;Step 3: WikidataKnowledgeBase&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;SPARQL: query.wikidata.org&lt;br&gt;P31 (category) | P186 (material) | P366 (uses)&lt;br&gt;P2101 (melting pt) | P2054 (density)&lt;br&gt;lru_cache(500) | timeout=10s&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#0D2E2E;strokeColor=#2EA8A8;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="ns_layer">
191
+ <mxGeometry x="20" y="473" width="330" height="100" as="geometry" />
192
+ </mxCell>
193
+
194
+ <mxCell id="groq_box" value="&lt;b&gt;Step 4: Groq LLM Verbalizer&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;WikidataGroqAnswerer&lt;br&gt;Model: llama-3.3-70b-versatile&lt;br&gt;Temp=0.1 | max_tokens=180 | top_p=0.9&lt;br&gt;Answers ONLY from Wikidata facts&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#1A2B1A;strokeColor=#F85149;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="ns_layer">
195
+ <mxGeometry x="20" y="592" width="330" height="95" as="geometry" />
196
+ </mxCell>
197
+
198
+ <mxCell id="groq_access" value="&lt;b&gt;Groq Accessibility Service&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;groq_service.py | GroqDescriptionService&lt;br&gt;2-sentence narrations for blind users&lt;br&gt;Temp=0.7 | max_tokens=150&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#1A2B1A;strokeColor=#F85149;fontColor=#FFFFFF;fontSize=11;" vertex="1" parent="ns_layer">
199
+ <mxGeometry x="20" y="706" width="330" height="85" as="geometry" />
200
+ </mxCell>
201
+
202
+ <mxCell id="groq_badge" value="Groq API | Llama-3.3-70B-Versatile | GROQ_API_KEY env var" style="rounded=5;whiteSpace=wrap;html=1;fillColor=#0D1117;strokeColor=#F85149;fontColor=#F85149;fontSize=10;" vertex="1" parent="ns_layer">
203
+ <mxGeometry x="20" y="808" width="330" height="28" as="geometry" />
204
+ </mxCell>
205
+
206
+ <!-- ===================== EXTERNAL SERVICES ===================== -->
207
+ <mxCell id="wikidata_ext" value="&lt;b&gt;Wikidata SPARQL API&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;query.wikidata.org/sparql&lt;br&gt;wikidata.org/w/api.php&lt;br&gt;Entity lookup | Property values&lt;br&gt;Free and Open Knowledge Base&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#0A2525;strokeColor=#2EA8A8;fontColor=#FFFFFF;fontSize=12;" vertex="1" parent="1">
208
+ <mxGeometry x="1640" y="200" width="250" height="130" as="geometry" />
209
+ </mxCell>
210
+
211
+ <mxCell id="groq_cloud" value="&lt;b&gt;Groq Cloud API&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;api.groq.com&lt;br&gt;Llama-3.3-70B-Versatile&lt;br&gt;Ultra-low latency inference&lt;br&gt;chat.completions endpoint&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#1A0A0A;strokeColor=#F85149;fontColor=#FFFFFF;fontSize=12;" vertex="1" parent="1">
212
+ <mxGeometry x="1640" y="385" width="250" height="130" as="geometry" />
213
+ </mxCell>
214
+
215
+ <mxCell id="hf_clip" value="&lt;b&gt;OpenAI / HuggingFace Hub&lt;/b&gt;&lt;br&gt;&lt;font color=&quot;#8B949E&quot;&gt;CLIP ViT-B/32 weights&lt;br&gt;GPT-2 / DistilGPT-2 tokenizer&lt;br&gt;Cached locally after first download&lt;/font&gt;" style="rounded=10;whiteSpace=wrap;html=1;fillColor=#1A1000;strokeColor=#E3B341;fontColor=#FFFFFF;fontSize=12;" vertex="1" parent="1">
216
+ <mxGeometry x="1640" y="565" width="250" height="105" as="geometry" />
217
+ </mxCell>
218
+
219
+ <!-- ===================== LEGEND ===================== -->
220
+ <mxCell id="legend" value="&lt;b&gt;LEGEND&lt;/b&gt;&lt;br&gt;[1] Blue = Client Layer (React Native)&lt;br&gt;[2] Green = API Gateway (FastAPI)&lt;br&gt;[3] Orange = ML Inference (PyTorch)&lt;br&gt;[4] Purple = Neuro-Symbolic Pipeline&lt;br&gt;Solid arrow = Primary data flow&lt;br&gt;Dashed arrow = Conditional / supplement&lt;br&gt;Animated = Live request flow" style="rounded=8;whiteSpace=wrap;html=1;fillColor=#161B22;strokeColor=#30363D;fontColor=#8B949E;fontSize=11;align=left;" vertex="1" parent="1">
221
+ <mxGeometry x="1640" y="710" width="250" height="155" as="geometry" />
222
+ </mxCell>
223
+
224
+ <!-- ===================== EDGES / ANIMATED FLOWS ===================== -->
225
+
226
+ <!-- 1. api.js -> FastAPI (HTTP REST) -->
227
+ <mxCell id="flow_1" value="&lt;font color=&quot;#3FB950&quot;&gt;HTTP REST (JSON/FormData)&lt;/font&gt;" style="edgeStyle=orthogonalEdgeStyle;rounded=1;orthogonalLoop=1;jettySize=auto;strokeColor=#3FB950;strokeWidth=3;fontSize=10;fontColor=#3FB950;animation=1;endArrow=block;endFill=1;" edge="1" parent="1" source="api_js" target="fastapi_main">
228
+ <mxGeometry relative="1" as="geometry" />
229
+ </mxCell>
230
+
231
+ <!-- 2. FastAPI -> Ensemble VQA -->
232
+ <mxCell id="flow_2" value="&lt;font color=&quot;#FFA657&quot;&gt;answer()&lt;/font&gt;" style="edgeStyle=orthogonalEdgeStyle;rounded=1;orthogonalLoop=1;jettySize=auto;strokeColor=#D29922;strokeWidth=3;fontSize=10;fontColor=#FFA657;animation=1;endArrow=block;endFill=1;" edge="1" parent="1" source="fastapi_main" target="ensemble_vqa">
233
+ <mxGeometry relative="1" as="geometry" />
234
+ </mxCell>
235
+
236
+ <!-- 3. Ensemble -> Router -->
237
+ <mxCell id="flow_3" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#D29922;strokeWidth=2;animation=1;endArrow=block;endFill=1;" edge="1" parent="1" source="ensemble_vqa" target="router">
238
+ <mxGeometry relative="1" as="geometry" />
239
+ </mxCell>
240
+
241
+ <!-- 4a. Router -> Base Model -->
242
+ <mxCell id="flow_4a" value="&lt;font color=&quot;#3FB950&quot;&gt;General Q&lt;/font&gt;" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#3FB950;strokeWidth=2;animation=1;endArrow=block;endFill=1;fontSize=10;fontColor=#3FB950;" edge="1" parent="1" source="router" target="base_model_box">
243
+ <mxGeometry relative="1" as="geometry" />
244
+ </mxCell>
245
+
246
+ <!-- 4b. Router -> Spatial Model -->
247
+ <mxCell id="flow_4b" value="&lt;font color=&quot;#58A6FF&quot;&gt;Spatial Q&lt;/font&gt;" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#1F6FEB;strokeWidth=2;animation=1;endArrow=block;endFill=1;fontSize=10;fontColor=#58A6FF;" edge="1" parent="1" source="router" target="spatial_model_box">
248
+ <mxGeometry relative="1" as="geometry" />
249
+ </mxCell>
250
+
251
+ <!-- 5. Ensemble -> NS Pipeline (supplement) -->
252
+ <mxCell id="flow_5" value="&lt;font color=&quot;#BC8CFF&quot;&gt;NS supplement&lt;/font&gt;" style="edgeStyle=orthogonalEdgeStyle;rounded=1;orthogonalLoop=1;jettySize=auto;strokeColor=#8957E5;strokeWidth=3;fontSize=10;fontColor=#BC8CFF;animation=1;dashed=1;endArrow=block;endFill=1;" edge="1" parent="1" source="ensemble_vqa" target="ns_main">
253
+ <mxGeometry relative="1" as="geometry" />
254
+ </mxCell>
255
+
256
+ <!-- 6. NS main -> CLIP Routing -->
257
+ <mxCell id="flow_6" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#8957E5;strokeWidth=2;animation=1;endArrow=block;endFill=1;" edge="1" parent="1" source="ns_main" target="ns_step1">
258
+ <mxGeometry relative="1" as="geometry" />
259
+ </mxCell>
260
+
261
+ <!-- 7. CLIP Routing -> Decision diamond -->
262
+ <mxCell id="flow_7" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#8957E5;strokeWidth=2;animation=1;endArrow=block;endFill=1;" edge="1" parent="1" source="ns_step1" target="route_decision">
263
+ <mxGeometry relative="1" as="geometry" />
264
+ </mxCell>
265
+
266
+ <!-- 8. Decision -> Object Detection -->
267
+ <mxCell id="flow_8" value="&lt;font color=&quot;#BC8CFF&quot;&gt;Knowledge Q&lt;/font&gt;" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#8957E5;strokeWidth=2;animation=1;dashed=1;endArrow=block;endFill=1;fontSize=10;fontColor=#BC8CFF;" edge="1" parent="1" source="route_decision" target="ns_step2">
268
+ <mxGeometry relative="1" as="geometry" />
269
+ </mxCell>
270
+
271
+ <!-- 9. Object Detection -> Wikidata box -->
272
+ <mxCell id="flow_9" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#2EA8A8;strokeWidth=2;animation=1;endArrow=block;endFill=1;" edge="1" parent="1" source="ns_step2" target="wikidata_box">
273
+ <mxGeometry relative="1" as="geometry" />
274
+ </mxCell>
275
+
276
+ <!-- 10. Wikidata box -> Wikidata external API -->
277
+ <mxCell id="flow_10" value="&lt;font color=&quot;#2EA8A8&quot;&gt;SPARQL queries&lt;/font&gt;" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#2EA8A8;strokeWidth=3;fontSize=10;fontColor=#2EA8A8;animation=1;endArrow=block;endFill=1;" edge="1" parent="1" source="wikidata_box" target="wikidata_ext">
278
+ <mxGeometry relative="1" as="geometry" />
279
+ </mxCell>
280
+
281
+ <!-- 11. Wikidata facts -> Groq verbalizer -->
282
+ <mxCell id="flow_11" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#F85149;strokeWidth=2;animation=1;endArrow=block;endFill=1;" edge="1" parent="1" source="wikidata_box" target="groq_box">
283
+ <mxGeometry relative="1" as="geometry" />
284
+ </mxCell>
285
+
286
+ <!-- 12. Groq box -> Groq Cloud -->
287
+ <mxCell id="flow_12" value="&lt;font color=&quot;#F85149&quot;&gt;API call | Llama-3.3-70B&lt;/font&gt;" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#F85149;strokeWidth=3;fontSize=10;fontColor=#F85149;animation=1;endArrow=block;endFill=1;" edge="1" parent="1" source="groq_box" target="groq_cloud">
288
+ <mxGeometry relative="1" as="geometry" />
289
+ </mxCell>
290
+
291
+ <!-- 13. Groq accessibility -> Groq Cloud -->
292
+ <mxCell id="flow_13" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#F85149;strokeWidth=2;animation=1;dashed=1;endArrow=block;endFill=1;" edge="1" parent="1" source="groq_access" target="groq_cloud">
293
+ <mxGeometry relative="1" as="geometry" />
294
+ </mxCell>
295
+
296
+ <!-- 14. FastAPI -> Groq Accessibility (top arc) -->
297
+ <mxCell id="flow_14" value="&lt;font color=&quot;#F85149&quot;&gt;accessibility narration&lt;/font&gt;" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#F85149;strokeWidth=2;fontSize=10;fontColor=#F85149;animation=1;dashed=1;endArrow=block;endFill=1;exitX=0.5;exitY=0;exitDx=0;exitDy=0;entryX=0.5;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="fastapi_main" target="groq_access">
298
+ <mxGeometry relative="1" as="geometry">
299
+ <Array as="points">
300
+ <mxPoint x="580" y="140" />
301
+ <mxPoint x="1385" y="140" />
302
+ </Array>
303
+ </mxGeometry>
304
+ </mxCell>
305
+
306
+ <!-- 15. CLIP box -> HuggingFace (model weights) -->
307
+ <mxCell id="flow_15" value="&lt;font color=&quot;#E3B341&quot;&gt;model weights (cached)&lt;/font&gt;" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#E3B341;strokeWidth=2;fontSize=10;fontColor=#E3B341;dashed=1;endArrow=block;endFill=1;" edge="1" parent="1" source="clip_box" target="hf_clip">
308
+ <mxGeometry relative="1" as="geometry" />
309
+ </mxCell>
310
+
311
+ <!-- 16a. Base model -> GPT2 Tokenizer -->
312
+ <mxCell id="flow_16a" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#30363D;strokeWidth=1;endArrow=block;endFill=1;" edge="1" parent="1" source="base_model_box" target="gpt2">
313
+ <mxGeometry relative="1" as="geometry" />
314
+ </mxCell>
315
+
316
+ <!-- 16b. Spatial model -> GPT2 Tokenizer -->
317
+ <mxCell id="flow_16b" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#30363D;strokeWidth=1;endArrow=block;endFill=1;" edge="1" parent="1" source="spatial_model_box" target="gpt2">
318
+ <mxGeometry relative="1" as="geometry" />
319
+ </mxCell>
320
+
321
+ <!-- 17. Conv Manager <-> Ensemble VQA -->
322
+ <mxCell id="flow_17" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=1;strokeColor=#7B2FBE;strokeWidth=2;animation=1;dashed=1;endArrow=block;endFill=1;startArrow=block;startFill=1;" edge="1" parent="1" source="conv_manager" target="ensemble_vqa">
323
+ <mxGeometry relative="1" as="geometry" />
324
+ </mxCell>
325
+
326
+ <!-- ===================== PHASE ANNOTATIONS ===================== -->
327
+ <mxCell id="ann1" value="(1) User uploads image + question" style="text;html=1;strokeColor=none;fillColor=#0D1117;fontColor=#58A6FF;fontSize=11;fontStyle=1;align=center;" vertex="1" parent="1">
328
+ <mxGeometry x="100" y="988" width="250" height="28" as="geometry" />
329
+ </mxCell>
330
+ <mxCell id="ann2" value="(2) REST API routes to ensemble" style="text;html=1;strokeColor=none;fillColor=#0D1117;fontColor=#3FB950;fontSize=11;fontStyle=1;align=center;" vertex="1" parent="1">
331
+ <mxGeometry x="460" y="988" width="240" height="28" as="geometry" />
332
+ </mxCell>
333
+ <mxCell id="ann3" value="(3) Neural model answers question" style="text;html=1;strokeColor=none;fillColor=#0D1117;fontColor=#FFA657;fontSize=11;fontStyle=1;align=center;" vertex="1" parent="1">
334
+ <mxGeometry x="860" y="988" width="250" height="28" as="geometry" />
335
+ </mxCell>
336
+ <mxCell id="ann4" value="(4) Symbolic + Groq enriches answer" style="text;html=1;strokeColor=none;fillColor=#0D1117;fontColor=#BC8CFF;fontSize=11;fontStyle=1;align=center;" vertex="1" parent="1">
337
+ <mxGeometry x="1270" y="988" width="260" height="28" as="geometry" />
338
+ </mxCell>
339
+
340
+ </root>
341
+ </mxGraphModel>
exp_results/feature_extraction_metric.csv ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ epoch,train_loss,train_token_acc,val_loss,val_token_acc,val_exact_match,lr
2
+ 1,3.687392619148223,0.5010925703618669,2.6377785576964325,0.531001718679689,0.0625462073044507,0.0001
3
+ 2,3.0861334211370917,0.5492582896593264,2.1873294205035805,0.5735707690693298,0.1437971314505397,0.0001
4
+ 3,2.8613873058208554,0.5773015727241105,2.0188139058508963,0.5919563278274717,0.18172408694366404,0.0001
5
+ 4,2.737266832805117,0.5940482014385925,1.8989913845961948,0.6057449292461827,0.2079698358716546,0.0001
6
+ 5,2.64607786060719,0.6068536389304081,1.8126546847370435,0.6131761748835726,0.22467839716102322,0.0001
7
+ 6,2.5737654996439945,0.6159161500927967,1.745610311908542,0.6227055006432083,0.23806003252994234,0.0001
8
+ 7,2.514629547727101,0.6238974923921153,1.6846065549355633,0.6310539678582605,0.25521218394203754,0.0001
9
+ 8,2.467853448716654,0.630066124487741,1.6530387682734795,0.6351331795723933,0.2616442407215733,0.0001
10
+ 9,2.430272235876001,0.6363310434633568,1.6044414886888467,0.6438829395568596,0.2796096406920006,0.0001
11
+ 10,2.3940254725485,0.6410929495099732,1.5768477393771119,0.6476609546620891,0.2876681945882005,0.0001
12
+ 11,2.3626844231579023,0.6466396824626934,1.553934060740021,0.6507747072093891,0.2935087978707674,0.0001
13
+ 12,2.3347287295768417,0.6508579807194079,1.5344560882955227,0.6529503009229336,0.29957119621469763,0.0001
14
+ 13,2.309176077580466,0.6551987208042674,1.5069528773145855,0.6592958943461472,0.3086647937305929,0.0001
15
+ 14,2.2852324938224235,0.6583507632729854,1.4877223473674845,0.6627878375210852,0.31820198136921485,0.0001
16
+ 15,2.265477722738707,0.6621552250710977,1.4731922914397042,0.6635274037999926,0.3206417270442111,0.0001
17
+ 16,2.245406344189297,0.6660276569959188,1.454425812892194,0.6657813076140746,0.3254472867070827,1e-06
18
+ 17,2.2047869251156476,0.6741207528932076,1.4267255866302635,0.6736559963451242,0.3408990093153926,1e-06
19
+ 18,2.173899897451869,0.6801777819710184,1.4036545191171035,0.6780021879470574,0.34703533934644387,1e-06
20
+ 19,2.15051551812644,0.6852958937991237,1.3850691127327253,0.6806749330376679,0.3535413278131007,1e-06
21
+ 20,2.130151925532512,0.6903713528113137,1.3759601954019294,0.682907020145992,0.3590862043471832,1e-06
22
+ 21,2.111327923803482,0.6937075932303665,1.3607378039719924,0.6867363317957464,0.3650746710039923,1e-06
23
+ 22,2.092705831874552,0.6989087903379759,1.3529389587775715,0.6871686296642951,0.3676622800532308,1e-06
24
+ 23,2.0762000757163266,0.7018636832358497,1.3471845992893543,0.6889090611124938,0.3711370693479225,1e-06
25
+ 24,2.0588077032516723,0.7061800249295429,1.3332587570514318,0.6925943864966339,0.37853023806003255,1e-06
26
+ 25,2.043530640342685,0.7086816234112068,1.323614944545728,0.6927403596774587,0.3790477598698802,1e-06
27
+ 26,2.028976038177644,0.7119645012827895,1.321273627989697,0.6960837739818501,0.38511015821381045,1e-06
28
+ 27,2.0125017191516372,0.7166598519934908,1.3151825143481202,0.6966083350608934,0.38651486026911136,1e-06
29
+ 28,1.998029633995205,0.7198163744333156,1.3046240308937036,0.6980289071798325,0.38836315244713887,1e-06
30
+ 29,1.9832194559959038,0.7228894007410402,1.3061683574375116,0.6981341627971182,0.3905811030607719,1e-06
31
+ 30,1.96923152904127,0.7272438684699805,1.3041821732273642,0.6986926667532831,0.3902114446251663,1e-06
experiments/__pycache__/train.cpython-312.pyc ADDED
Binary file (21.6 kB). View file
 
experiments/test.py ADDED
@@ -0,0 +1,73 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import torch
3
+ from PIL import Image
4
+ from transformers import GPT2Tokenizer
5
+ from model import VQAModel
6
+ from train import Vocab
7
def load_model(checkpoint_path, device='cuda'):
    """Restore a VQAModel, its answer vocabulary, and the question tokenizer
    from a training checkpoint.

    Parameters
    ----------
    checkpoint_path : str
        Path to a checkpoint written by the training script; contains the
        model state_dict, vocab tables, and special-token ids.
    device : str
        Target device for checkpoint tensors and the model.

    Returns
    -------
    (model, vocab, tokenizer) ready for inference (model is in eval mode).
    """
    ckpt = torch.load(checkpoint_path, map_location=device)

    # Rebuild the answer-side vocabulary from the saved tables.
    vocab = Vocab()
    vocab.vocab = ckpt['vocab']
    vocab.vocab_size = len(ckpt['vocab'])
    vocab.word2idx = ckpt['word2idx']
    vocab.idx2word = ckpt['idx2word']
    for key in ('pad_token_id', 'bos_token_id', 'eos_token_id', 'unk_token_id'):
        setattr(vocab, key, ckpt[key])

    model = VQAModel(
        vocab_size=len(ckpt['vocab']),
        device=device,
        question_max_len=ckpt.get('question_max_len', 20),
        answer_max_len=ckpt.get('answer_max_len', 12),
        pad_token_id=ckpt['pad_token_id'],
        bos_token_id=ckpt['bos_token_id'],
        eos_token_id=ckpt['eos_token_id'],
        unk_token_id=ckpt['unk_token_id'],
        hidden_size=512,
        num_layers=2,
    ).to(device)

    tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
    if tokenizer.pad_token is None:
        # Mirror the training-time setup: add [PAD] and grow the GPT-2
        # embedding matrix so the saved weights line up.
        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
        model.gpt2_model.resize_token_embeddings(len(tokenizer))

    # strict=False tolerates key mismatches between checkpoint and model
    # (e.g. auxiliary buffers) — presumably intentional; confirm on upgrade.
    model.load_state_dict(ckpt['model_state_dict'], strict=False)
    model.eval()
    return model, vocab, tokenizer
37
def answer_question(model, vocab, tokenizer, image_path, question, device='cuda', use_beam_search=True, beam_width=5, temperature=0.8):
    """Run one VQA inference: load the image, encode the question, and
    decode the model's generated answer to plain text.

    NOTE: ``temperature`` is accepted for signature compatibility but is
    not used by either decoding path below.
    """
    # Image -> CLIP-preprocessed batch of one on the target device.
    pil_img = Image.open(image_path).convert('RGB')
    pixels = model.clip_preprocess(pil_img).unsqueeze(0).to(device)

    # Question -> fixed-length GPT-2 token ids + attention mask.
    encoded = tokenizer(
        question,
        padding='max_length',
        truncation=True,
        max_length=model.question_max_len,
        return_tensors='pt'
    )
    questions = {k: encoded[k].to(device) for k in ('input_ids', 'attention_mask')}

    with torch.no_grad():
        if use_beam_search and hasattr(model, 'generate_with_beam_search'):
            token_ids = model.generate_with_beam_search(pixels, questions, beam_width=beam_width)
        else:
            token_ids = model(pixels, questions)

    # Decode the first (only) sequence in the batch back to words.
    return vocab.decoder(token_ids[0].cpu().numpy())
58
# Inference configuration: checkpoint to load and the image to query.
CHECKPOINT = "./output2/spatial_adapter_v2_2/vqa_spatial_checkpoint.pt"
IMAGE_PATH = r"./im2.jpg"
# NOTE(review): QUESTION is never used below — test_questions drives the loop.
QUESTION = ""
if __name__ == "__main__":
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print("Loading model...")
    model, vocab, tokenizer = load_model(CHECKPOINT, device)
    print("Model loaded!\n")
    # Questions to ask about IMAGE_PATH; extend this list to probe more.
    test_questions = [
        "What is to the right of the soup?"
    ]
    print(f"Image: {IMAGE_PATH}\n")
    for question in test_questions:
        print(f"Question: {question}")
        answer = answer_question(model, vocab, tokenizer, IMAGE_PATH, question, device, use_beam_search=True, beam_width=5)
        print(f"Answer: {answer}\n")
experiments/train.py ADDED
@@ -0,0 +1,349 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import pandas as pd
3
+ import torch
4
+ import torch.nn as nn
5
+ from torch.utils.data import Dataset, DataLoader
6
+ from PIL import Image
7
+ from transformers import GPT2Tokenizer
8
+ import matplotlib.pyplot as plt
9
+ import numpy as np
10
+ from tqdm import tqdm
11
+ from collections import Counter
12
+ from nltk.tokenize import word_tokenize
13
+ from sklearn.model_selection import train_test_split
14
+ from torchvision import transforms
15
+ from models.model import VQAModel
16
+ device = 'cuda'
17
class Vocab:
    """Word-level answer vocabulary with <pad>/<bos>/<eos>/<unk> specials.

    Token-id attributes are populated by :meth:`build_vocab`, or assigned
    directly when restoring from a checkpoint (see the test script's
    load path).
    """

    def __init__(self):
        self.vocab = None        # list[str]; index == token id
        self.vocab_size = None   # len(self.vocab) once built
        self.word2idx = None     # str -> int
        self.idx2word = None     # int -> str
        self.pad = '<pad>'
        self.bos = '<bos>'
        self.eos = '<eos>'
        self.unk = '<unk>'
        # Declared up front so instances are complete objects even before
        # build_vocab() runs (checkpoint loading sets these externally);
        # previously they only existed after build_vocab(), risking
        # AttributeError on early encoder()/decoder() calls.
        self.pad_token_id = None
        self.bos_token_id = None
        self.eos_token_id = None
        self.unk_token_id = None

    def build_vocab(self, df, min_freq=1):
        """Build the vocabulary from df['answer'], keeping words that occur
        at least *min_freq* times. Specials always occupy ids 0..3."""
        counter = Counter()
        for ans in df['answer']:
            counter.update(word_tokenize(ans.lower()))
        kept = sorted(word for word, freq in counter.items() if freq >= min_freq)
        vocab = [self.pad, self.bos, self.eos, self.unk] + kept
        self.vocab = vocab
        self.word2idx = {word: idx for idx, word in enumerate(vocab)}
        self.idx2word = {idx: word for word, idx in self.word2idx.items()}
        self.vocab_size = len(vocab)
        # Look up via the special-token attributes (not string literals) so
        # the marker strings are defined in exactly one place.
        self.pad_token_id = self.word2idx[self.pad]
        self.bos_token_id = self.word2idx[self.bos]
        self.eos_token_id = self.word2idx[self.eos]
        self.unk_token_id = self.word2idx[self.unk]

    def encoder(self, text, max_len):
        """Encode *text* to a fixed-length id list: <bos> tokens <eos>,
        right-padded with <pad> or truncated to *max_len*."""
        tokens = word_tokenize(text.lower())
        token_ids = [self.word2idx.get(token, self.unk_token_id) for token in tokens]
        token_ids = [self.bos_token_id] + token_ids + [self.eos_token_id]
        if len(token_ids) < max_len:
            token_ids += [self.pad_token_id] * (max_len - len(token_ids))
        else:
            # NOTE: truncation can drop <eos>; decoder then emits the full
            # remainder (original behavior, preserved).
            token_ids = token_ids[:max_len]
        return token_ids

    def decoder(self, token_ids):
        """Decode ids to a space-joined string; stops at <eos> and skips
        <pad>/<bos> markers."""
        tokens = []
        for idx in token_ids:
            if idx == self.eos_token_id:
                break
            if idx in (self.pad_token_id, self.bos_token_id):
                continue
            tokens.append(self.idx2word.get(idx, "<unk>"))
        return ' '.join(tokens).strip()
62
class AugmentedVQADataset(Dataset):
    """VQA dataset pairing images with (question, answer) metadata rows.

    Each item yields CLIP-preprocessed pixels, fixed-length GPT-2 question
    ids/mask, and word-level answer ids. When ``augment`` is True, light
    photometric/geometric augmentation is applied to the raw PIL image
    before CLIP preprocessing.
    """

    def __init__(self, df, img_dir, question_tokenizer, text_processor, clip_processor,
                 question_max_len=32, answer_max_len=16, augment=True):
        self.df = df
        self.img_dir = img_dir
        self.question_tokenizer = question_tokenizer
        self.text_processor = text_processor    # Vocab-style encoder for answers
        self.clip_processor = clip_processor    # CLIP preprocess callable
        self.question_max_len = question_max_len
        self.answer_max_len = answer_max_len
        self.augment = augment
        # Augmentation pipeline only for the training split.
        self.transform = transforms.Compose([
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
            transforms.RandomRotation(10),
        ]) if augment else None

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_path = os.path.join(self.img_dir, row['image_path'])
        image = Image.open(img_path).convert('RGB')
        if self.augment and self.transform:
            image = self.transform(image)
        encoded_q = self.question_tokenizer(
            row['question'],
            padding='max_length',
            truncation=True,
            max_length=self.question_max_len,
            return_tensors='pt'
        )
        answer_ids = self.text_processor.encoder(row['answer'], max_len=self.answer_max_len)
        return {
            'image_path': img_path,
            'image': self.clip_processor(image),
            'question_ids': encoded_q['input_ids'].squeeze(0),
            'question_mask': encoded_q['attention_mask'].squeeze(0),
            'answer_ids': torch.tensor(answer_ids, dtype=torch.long)
        }
107
def save_checkpoint(model, optimizer, epoch, vocab, path):
    """Serialize the full training state to *path* via torch.save: model
    and optimizer state_dicts, vocab tables, special-token ids, and the
    sequence-length configuration needed to rebuild the model for
    inference."""
    state = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'vocab': vocab.vocab,
        'word2idx': vocab.word2idx,
        'idx2word': vocab.idx2word,
    }
    # Special-token ids travel with the checkpoint so decoding works
    # without re-reading the training config.
    for key in ('pad_token_id', 'bos_token_id', 'eos_token_id', 'unk_token_id'):
        state[key] = getattr(vocab, key)
    state['question_max_len'] = model.question_max_len
    state['answer_max_len'] = model.answer_max_len
    torch.save(state, path)
122
def plot_losses(train_losses, val_losses, save_path="loss_plot.png"):
    """Render and save a PNG comparing per-epoch training and validation
    loss, then close the figure to free matplotlib resources."""
    plt.figure(figsize=(8, 6))
    for series, label in ((train_losses, "Train Loss"), (val_losses, "Validation Loss")):
        plt.plot(series, label=label)
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.title("Train vs Validation Loss")
    plt.legend()
    plt.savefig(save_path)
    plt.close()
132
def train_one_epoch(model, dataloader, optimizer, device, scaler, vocab):
    """Run one mixed-precision training epoch.

    Returns:
        (avg_loss, avg_token_acc): mean cross-entropy loss and mean
        non-pad token accuracy, both averaged over batches.
    """
    model.train()
    total_loss = 0
    total_token_acc = 0
    # Label smoothing regularizes the decoder; pad positions are ignored.
    criterion = nn.CrossEntropyLoss(ignore_index=vocab.pad_token_id, label_smoothing=0.1)
    for batch in tqdm(dataloader):
        optimizer.zero_grad()
        images = batch['image'].to(device)
        questions = {
            'input_ids': batch['question_ids'].to(device),
            'attention_mask': batch['question_mask'].to(device)
        }
        answers = batch['answer_ids'].to(device)
        # Teacher-forced forward pass under autocast (AMP).
        with torch.amp.autocast(device):
            logits = model(images, questions, answer_input_ids=answers)
            # Next-token objective: predict answers[t+1] from logits[t].
            shifted_logits = logits[:, :-1, :]
            shifted_answers = answers[:, 1:]
            loss = criterion(
                shifted_logits.reshape(-1, shifted_logits.size(-1)),
                shifted_answers.reshape(-1)
            )
            # Token accuracy counted over non-pad positions only.
            predicted_tokens = shifted_logits.argmax(dim=-1)
            correct = (predicted_tokens == shifted_answers).float()
            mask = (shifted_answers != vocab.pad_token_id).float()
            token_acc = (correct * mask).sum() / mask.sum()
            total_token_acc += token_acc.item()
        # AMP backward: scale, then unscale BEFORE clipping so the clip
        # threshold applies to true gradient magnitudes, then step/update.
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        total_loss += loss.item()
    avg_loss = total_loss / len(dataloader)
    avg_token_acc = total_token_acc / len(dataloader)
    return avg_loss, avg_token_acc
167
def validate_one_epoch(model, dataloader, device, vocab):
    """Evaluate the model for one epoch (no gradient updates).

    Computes teacher-forced loss and token accuracy, plus a free-running
    exact-match accuracy that compares the decoded generated answer string
    against the decoded gold answer.

    Returns:
        (avg_loss, avg_token_acc, exact_match_acc); all 0.0 when the
        dataloader is empty instead of raising ZeroDivisionError.
    """
    model.eval()
    total_loss = 0
    total_token_acc = 0
    exact_matches = 0
    total_samples = 0
    criterion = nn.CrossEntropyLoss(ignore_index=vocab.pad_token_id)
    with torch.no_grad():
        for batch in tqdm(dataloader):
            images = batch['image'].to(device)
            questions = {
                'input_ids': batch['question_ids'].to(device),
                'attention_mask': batch['question_mask'].to(device)
            }
            answers = batch['answer_ids'].to(device)
            # Teacher-forced loss, shifted for next-token prediction.
            logits = model(images, questions, answer_input_ids=answers)
            shifted_logits = logits[:, :-1, :]
            shifted_answers = answers[:, 1:]
            loss = criterion(
                shifted_logits.reshape(-1, shifted_logits.size(-1)),
                shifted_answers.reshape(-1)
            )
            total_loss += loss.item()
            # Token-level accuracy over non-pad positions.
            predicted_tokens = shifted_logits.argmax(dim=-1)
            correct = (predicted_tokens == shifted_answers).float()
            mask = (shifted_answers != vocab.pad_token_id).float()
            token_acc = (correct * mask).sum() / mask.sum()
            total_token_acc += token_acc.item()
            # Free-running generation for exact-match (beam search if the
            # model provides it, greedy forward pass otherwise).
            if hasattr(model, 'generate_with_beam_search'):
                generated = model.generate_with_beam_search(images, questions, beam_width=3)
            else:
                generated = model(images, questions)
            for pred, true in zip(generated, answers):
                pred_text = vocab.decoder(pred.cpu().numpy())
                true_text = vocab.decoder(true.cpu().numpy())
                if pred_text.strip() == true_text.strip():
                    exact_matches += 1
                total_samples += 1
    # Guard against an empty dataloader / zero samples: return zeros
    # instead of raising ZeroDivisionError.
    num_batches = len(dataloader)
    avg_loss = total_loss / num_batches if num_batches else 0.0
    avg_token_acc = total_token_acc / num_batches if num_batches else 0.0
    exact_match_acc = exact_matches / total_samples if total_samples else 0.0
    return avg_loss, avg_token_acc, exact_match_acc
209
def main():
    """Train the VQA model end-to-end with staged unfreezing.

    Stage 1: only the task-specific layers train while the CLIP/GPT-2
    encoders stay frozen. At epoch 16 (index 15) the last encoder layers are
    unfrozen and the optimizer/scheduler are rebuilt with differential
    learning rates. Early stopping tracks validation exact-match accuracy.

    Fix over the original: reaching the patience limit now actually breaks
    out of the training loop — previously only a message was printed and
    training continued for every remaining epoch.
    """
    print()
    print("# VQA: Training with Staged Unfreezing")
    print()
    import random
    import numpy as np
    # Seed every RNG source for reproducible splits and augmentation.
    torch.manual_seed(42)
    random.seed(42)
    np.random.seed(42)
    if torch.cuda.is_available(): torch.cuda.manual_seed_all(42)
    DATA_DIR = r"./gen_vqa_v2"
    CSV_PATH = os.path.join(DATA_DIR, "metadata.csv")
    OUTPUT_DIR = r"./output2/feature_extraction"
    CHECKPOINT_PATH = os.path.join(OUTPUT_DIR, "vqa_checkpoint.pt")
    LOG_CSV = os.path.join(OUTPUT_DIR, "train_log.csv")
    LOSS_GRAPH_PATH = os.path.join(OUTPUT_DIR, "loss_plot.png")
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    # Hyperparameters.
    batch_size = 64
    learning_rate = 1e-4
    num_epochs = 30
    patience = 8
    question_max_len = 20
    answer_max_len = 12
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(device)
    metadata = pd.read_csv(CSV_PATH)
    print(f"Using: question_max_len={question_max_len}, answer_max_len={answer_max_len}")
    vocab = Vocab()
    vocab.build_vocab(metadata, min_freq=3)
    answer_vocab_size = len(vocab.vocab)
    print(f"Answer Vocab Size: {answer_vocab_size}")
    # Quick corpus statistics to sanity-check the answer distribution.
    word_freq = Counter()
    for ans in metadata['answer']:
        tokens = word_tokenize(ans.lower())
        word_freq.update(tokens)
    print("\nTop 20 most common answer words:")
    for word, freq in word_freq.most_common(20):
        print(f" {word}: {freq}")
    # 80/10/10 train/val/test split (test half of the 20% hold-out).
    train_df, test_df = train_test_split(metadata, test_size=0.2, random_state=42)
    val_df, test_df = train_test_split(test_df, test_size=0.5, random_state=42)
    print(f"\nTrain size: {len(train_df)}, Val size: {len(val_df)}, Test size: {len(test_df)}")
    print()
    model = VQAModel(
        vocab_size=answer_vocab_size,
        device=device,
        question_max_len=question_max_len,
        answer_max_len=answer_max_len,
        pad_token_id=vocab.pad_token_id,
        bos_token_id=vocab.bos_token_id,
        eos_token_id=vocab.eos_token_id,
        unk_token_id=vocab.unk_token_id,
        hidden_size=512,
        num_layers=2
    ).to(device)
    print("STAGE 1: Training decoder with frozen encoders")
    print()
    # Reuse the model's own CLIP preprocessing pipeline for the datasets.
    clip_processor = model.clip_preprocess
    question_tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
    # GPT-2 ships without a pad token; add one and resize the embeddings.
    if question_tokenizer.pad_token is None:
        question_tokenizer.add_special_tokens({"pad_token": "[PAD]"})
        model.gpt2_model.resize_token_embeddings(len(question_tokenizer))
    train_dataset = AugmentedVQADataset(
        train_df, DATA_DIR, question_tokenizer, vocab,
        clip_processor=clip_processor,
        question_max_len=question_max_len,
        answer_max_len=answer_max_len,
        augment=True
    )
    val_dataset = AugmentedVQADataset(
        val_df, DATA_DIR, question_tokenizer, vocab,
        clip_processor=clip_processor,
        question_max_len=question_max_len,
        answer_max_len=answer_max_len,
        augment=False
    )
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=4, pin_memory=True)
    # Stage-1 optimizer covers only params left trainable by the model ctor.
    trainable_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable_params, lr=learning_rate, weight_decay=1e-4)
    print(f"Trainable parameters: {sum(p.numel() for p in trainable_params):,}")
    print()
    scaler = torch.amp.GradScaler(device)
    best_val_loss = np.inf
    best_val_exact_match = 0.0
    counter = 0
    logs = []
    # mode='max': the scheduler follows exact-match accuracy, not loss.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='max', factor=0.5, patience=4, verbose=True
    )
    for epoch in range(num_epochs):
        print(f"\nEpoch {epoch+1}/{num_epochs}")
        train_loss, train_token_acc = train_one_epoch(model, train_loader, optimizer, device, scaler, vocab)
        val_loss, val_token_acc, val_exact_match = validate_one_epoch(model, val_loader, device, vocab)
        print(f"Train Loss: {train_loss:.4f} | Train Token Acc: {train_token_acc:.4f}")
        print(f"Val Loss: {val_loss:.4f} | Val Token Acc: {val_token_acc:.4f} | Val Exact Match: {val_exact_match:.4f}")
        print(f"LR: {optimizer.param_groups[0]['lr']}")
        scheduler.step(val_exact_match)
        if val_exact_match > best_val_exact_match:
            best_val_exact_match = val_exact_match
            save_checkpoint(model, optimizer, epoch, vocab, CHECKPOINT_PATH)
            print("Checkpoint saved!")
            counter = 0
        else:
            counter += 1
            print(f"No improvement in exact match for {counter} epochs.")
        if epoch == 15 and not model.fine_tuning_mode:
            # Stage 2: partially unfreeze the encoders and rebuild the
            # optimizer/scheduler with per-component learning rates.
            print("\n" + "="*50)
            print("STAGE 2: Unfreezing encoders for fine-tuning")
            print("="*50)
            model.unfreeze_clip_layers(num_layers=3)
            model.unfreeze_gpt2_layers(num_layers=3)
            clip_params = []
            gpt2_params = []
            other_params = []
            for name, param in model.named_parameters():
                if param.requires_grad:
                    if 'clip_model' in name:
                        clip_params.append(param)
                    elif 'gpt2_model' in name:
                        gpt2_params.append(param)
                    else:
                        other_params.append(param)
            optimizer = torch.optim.AdamW([
                {'params': clip_params, 'lr': 1e-6},
                {'params': gpt2_params, 'lr': 1e-6},
                {'params': other_params, 'lr': 5e-5}
            ], weight_decay=1e-4)
            scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
                optimizer, mode='max', factor=0.5, patience=4, verbose=True
            )
            print()
        # Persist per-epoch metrics before any early exit so the final epoch
        # is always logged.
        logs.append([epoch+1, train_loss, train_token_acc, val_loss, val_token_acc, val_exact_match, optimizer.param_groups[0]['lr']])
        log_df = pd.DataFrame(logs, columns=["epoch","train_loss","train_token_acc","val_loss","val_token_acc","val_exact_match","lr"])
        log_df.to_csv(LOG_CSV, index=False)
        if counter >= patience:
            # BUGFIX: the original printed this message but never exited the
            # loop, so early stopping had no effect.
            print(f"\nEarly stopping after {patience} epochs without improvement")
            break
    plot_losses([x[1] for x in logs], [x[3] for x in logs], save_path=LOSS_GRAPH_PATH)
    print("Training complete!")
    print(f"Best exact match accuracy: {best_val_exact_match:.4f}")

if __name__ == "__main__":
    main()
experiments/utils/preprocess.py ADDED
@@ -0,0 +1,164 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import pandas as pd
3
+ import torch
4
+ import torch.nn as nn
5
+ from torch.utils.data import Dataset, DataLoader
6
+ from PIL import Image
7
+ from transformers import GPT2Tokenizer
8
+ import matplotlib.pyplot as plt
9
+ import numpy as np
10
+ from collections import Counter
11
+ from nltk.tokenize import word_tokenize
12
+ from sklearn.model_selection import train_test_split
13
+ from torchvision import transforms
14
+ from model import VQAModel
15
class Vocab:
    """Word-level answer vocabulary with <pad>/<bos>/<eos>/<unk> specials.

    NOTE: this class is duplicated in experiments/utils/vocab.py — keep the
    two copies in sync.
    """
    def __init__(self):
        # All index structures below are populated by build_vocab().
        self.vocab = None       # list: index -> word
        self.vocab_size = None  # len(self.vocab) once built
        self.word2idx = None    # dict: word -> index
        self.idx2word = None    # dict: index -> word
        self.pad = '<pad>'
        self.bos = '<bos>'
        self.eos = '<eos>'
        self.unk = '<unk>'
    def build_vocab(self, df, min_freq=1):
        """Build index maps from df['answer'], keeping words seen >= min_freq times."""
        counter = Counter()
        for ans in df['answer']:
            tokens = word_tokenize(ans.lower())
            counter.update(tokens)
        # Sort for a deterministic word order across runs.
        vocab = sorted([word for word, freq in counter.items() if freq >= min_freq])
        # Special tokens occupy the first four ids (pad=0, bos=1, eos=2, unk=3).
        vocab = [self.pad, self.bos, self.eos, self.unk] + vocab
        word2idx = {word: idx for idx, word in enumerate(vocab)}
        idx2word = {idx: word for word, idx in word2idx.items()}
        self.vocab = vocab
        self.word2idx = word2idx
        self.idx2word = idx2word
        self.vocab_size = len(vocab)
        self.pad_token_id = self.word2idx["<pad>"]
        self.bos_token_id = self.word2idx["<bos>"]
        self.eos_token_id = self.word2idx["<eos>"]
        self.unk_token_id = self.word2idx["<unk>"]
    def encoder(self, text, max_len):
        """Tokenize text, wrap in <bos>/<eos>, then pad or truncate to max_len.

        NOTE(review): truncation can drop the <eos> token for long answers,
        which prevents the decoder from stopping early — confirm intended.
        """
        tokens = word_tokenize(text.lower())
        token_ids = [self.word2idx.get(token, self.unk_token_id) for token in tokens]
        token_ids = [self.bos_token_id] + token_ids + [self.eos_token_id]
        if len(token_ids) < max_len:
            token_ids += [self.pad_token_id] * (max_len - len(token_ids))
        else:
            token_ids = token_ids[:max_len]
        return token_ids
    def decoder(self, token_ids):
        """Convert ids back to a string: stop at <eos>, skip <pad> and <bos>."""
        tokens = []
        for idx in token_ids:
            if idx == self.eos_token_id:
                break
            if idx in (self.pad_token_id, self.bos_token_id):
                continue
            tokens.append(self.idx2word.get(idx, "<unk>"))
        return ' '.join(tokens).strip()
60
class AugmentedVQADataset(Dataset):
    """VQA dataset pairing CLIP-preprocessed images with tokenized Q/A.

    Questions go through a GPT-2 tokenizer, answers through the project's
    word-level vocab encoder; optional photometric/geometric augmentation is
    applied before CLIP preprocessing when ``augment`` is True.
    """

    def __init__(self, df, img_dir, question_tokenizer, text_processor, clip_processor,
                 question_max_len=32, answer_max_len=16, augment=True):
        self.df = df
        self.img_dir = img_dir
        self.question_tokenizer = question_tokenizer
        self.text_processor = text_processor
        self.clip_processor = clip_processor
        self.question_max_len = question_max_len
        self.answer_max_len = answer_max_len
        self.augment = augment
        # Light augmentation for training only; validation stays deterministic.
        self.transform = transforms.Compose([
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
            transforms.RandomRotation(10),
        ]) if augment else None

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        record = self.df.iloc[idx]
        img_path = os.path.join(self.img_dir, record['image_path'])
        image = Image.open(img_path).convert('RGB')
        if self.augment and self.transform:
            image = self.transform(image)
        tokenized = self.question_tokenizer(
            record['question'],
            padding='max_length',
            truncation=True,
            max_length=self.question_max_len,
            return_tensors='pt'
        )
        answer_ids = self.text_processor.encoder(record['answer'], max_len=self.answer_max_len)
        return {
            'image_path': img_path,
            'image': self.clip_processor(image),
            'question_ids': tokenized['input_ids'].squeeze(0),
            'question_mask': tokenized['attention_mask'].squeeze(0),
            'answer_ids': torch.tensor(answer_ids, dtype=torch.long)
        }
105
+ if __name__ == "__main__":
106
+ DATA_DIR = r"/home/devarajan8/Documents/vqa/gen_vqa_v2"
107
+ CSV_PATH = os.path.join(DATA_DIR, "metadata.csv")
108
+ batch_size = 16
109
+ question_max_len = 16
110
+ answer_max_len = 10
111
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
112
+ metadata = pd.read_csv(CSV_PATH)
113
+ vocab = Vocab()
114
+ vocab.build_vocab(metadata, min_freq=5)
115
+ answer_vocab_size = len(vocab.vocab)
116
+ print(f"Answer Vocab Size: {answer_vocab_size}")
117
+ train_df, test_df = train_test_split(metadata, test_size=0.2, random_state=42)
118
+ val_df, test_df = train_test_split(test_df, test_size=0.5, random_state=42)
119
+ print(f"Train size: {len(train_df)}, Val size: {len(val_df)}, Test size: {len(test_df)}")
120
+ print()
121
+ model = VQAModel(
122
+ vocab_size=answer_vocab_size,
123
+ device=device,
124
+ question_max_len=question_max_len,
125
+ answer_max_len=answer_max_len,
126
+ pad_token_id=vocab.pad_token_id,
127
+ bos_token_id=vocab.bos_token_id,
128
+ eos_token_id=vocab.eos_token_id,
129
+ unk_token_id=vocab.unk_token_id,
130
+ hidden_size=512,
131
+ num_layers=2
132
+ ).to(device)
133
+ clip_processor = model.clip_preprocess
134
+ question_tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
135
+ if question_tokenizer.pad_token is None:
136
+ question_tokenizer.add_special_tokens({"pad_token": "[PAD]"})
137
+ model.gpt2_model.resize_token_embeddings(len(question_tokenizer))
138
+ train_dataset = AugmentedVQADataset(
139
+ train_df, DATA_DIR, question_tokenizer, vocab,
140
+ clip_processor=clip_processor,
141
+ question_max_len=question_max_len,
142
+ answer_max_len=answer_max_len,
143
+ augment=True
144
+ )
145
+ val_dataset = AugmentedVQADataset(
146
+ val_df, DATA_DIR, question_tokenizer, vocab,
147
+ clip_processor=clip_processor,
148
+ question_max_len=question_max_len,
149
+ answer_max_len=answer_max_len,
150
+ augment=False
151
+ )
152
+ train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
153
+ val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
154
+ for batch in train_loader:
155
+ images = batch['image']
156
+ ques_ids = batch['question_ids']
157
+ attn_mask = batch['question_mask']
158
+ answers = batch['answer_ids']
159
+ print(f"Image: {images.shape}")
160
+ print(f"Question Ids: {ques_ids.shape}")
161
+ print(f"Attention Mask: {attn_mask.shape}")
162
+ print(f"Answer Ids: {answers.shape}")
163
+ print(answers[0])
164
+ break
experiments/utils/vocab.py ADDED
@@ -0,0 +1,65 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import pandas as pd
3
+ from collections import Counter
4
+ from nltk.tokenize import word_tokenize
5
class Vocab:
    """Answer-side word vocabulary with <pad>/<bos>/<eos>/<unk> specials."""

    def __init__(self):
        # Populated by build_vocab().
        self.vocab = None
        self.vocab_size = None
        self.word2idx = None
        self.idx2word = None
        self.pad = '<pad>'
        self.bos = '<bos>'
        self.eos = '<eos>'
        self.unk = '<unk>'

    def build_vocab(self, df, min_freq=1):
        """Build index maps from df['answer'], keeping words with freq >= min_freq."""
        counts = Counter()
        for answer in df['answer']:
            counts.update(word_tokenize(answer.lower()))
        # Deterministic ordering: specials first, then sorted kept words.
        kept = sorted(w for w, c in counts.items() if c >= min_freq)
        words = [self.pad, self.bos, self.eos, self.unk] + kept
        self.vocab = words
        self.word2idx = {w: i for i, w in enumerate(words)}
        self.idx2word = {i: w for i, w in enumerate(words)}
        self.vocab_size = len(words)
        self.pad_token_id = self.word2idx["<pad>"]
        self.bos_token_id = self.word2idx["<bos>"]
        self.eos_token_id = self.word2idx["<eos>"]
        self.unk_token_id = self.word2idx["<unk>"]

    def encoder(self, text, max_len):
        """Tokenize text to ids, wrap with <bos>/<eos>, pad/truncate to max_len."""
        ids = [self.bos_token_id]
        ids += [self.word2idx.get(tok, self.unk_token_id) for tok in word_tokenize(text.lower())]
        ids.append(self.eos_token_id)
        if len(ids) >= max_len:
            return ids[:max_len]
        return ids + [self.pad_token_id] * (max_len - len(ids))

    def decoder(self, token_ids):
        """Map ids back to a string, stopping at <eos> and skipping <pad>/<bos>."""
        words = []
        for tid in token_ids:
            if tid == self.eos_token_id:
                break
            if tid not in (self.pad_token_id, self.bos_token_id):
                words.append(self.idx2word.get(tid, "<unk>"))
        return ' '.join(words).strip()
50
+ if __name__ == "__main__":
51
+ CSV_PATH = r"./gen_vqa_v2/metadata.csv"
52
+ answer_max_len = 10
53
+ metadata = pd.read_csv(CSV_PATH)
54
+ vocab = Vocab()
55
+ vocab.build_vocab(metadata, min_freq=5)
56
+ answer_vocab_size = len(vocab.vocab)
57
+ print(f"Answer Vocab Size: {answer_vocab_size}")
58
+ sample_answer = metadata['answer'].values
59
+ text = sample_answer[0]
60
+ print("")
61
+ encoded = vocab.encoder(text, answer_max_len)
62
+ decoded = vocab.decoder(encoded)
63
+ print(f"Sample Answer: {text}")
64
+ print(f"Encoded: {encoded}")
65
+ print(f"Decoded: {decoded}")
finetune.py ADDED
@@ -0,0 +1,220 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import random
3
+ import numpy as np
4
+ import pandas as pd
5
+ import torch
6
+ import torch.nn as nn
7
+ from torch.utils.data import DataLoader
8
+ from transformers import GPT2Tokenizer
9
+ from tqdm import tqdm
10
+ from sklearn.model_selection import train_test_split
11
+ from model import VQAModel
12
+ from train import AugmentedVQADataset, Vocab, save_checkpoint, plot_losses
13
def create_optimizer_with_differential_lr(model, clip_lr=5e-7, gpt_lr=5e-7, other_lr=3e-5):
    """Build an AdamW optimizer with separate LRs for CLIP, GPT-2 and the rest.

    Parameters are partitioned by name: anything under ``clip_model`` gets
    ``clip_lr``, anything under ``gpt2_model`` gets ``gpt_lr``, and every
    other trainable parameter gets ``other_lr``. Frozen params are skipped.
    """
    clip_params, gpt_params, other_params = [], [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if 'clip_model' in name:
            clip_params.append(param)
        elif 'gpt2_model' in name:
            gpt_params.append(param)
        else:
            other_params.append(param)
    optimizer = torch.optim.AdamW([
        {'params': clip_params, 'lr': clip_lr},
        {'params': gpt_params, 'lr': gpt_lr},
        {'params': other_params, 'lr': other_lr},
    ], weight_decay=1e-4)
    print(f"Optimizer: CLIP params: {len(clip_params)}, GPT-2 params: {len(gpt_params)}, Other params: {len(other_params)}")
    return optimizer
30
def train_one_epoch(model, dataloader, optimizer, device, vocab, scaler):
    """Run one mixed-precision training epoch and return the mean batch loss.

    Uses teacher forcing: logits are shifted against the answer ids so the
    model predicts token t+1 from tokens <= t. Batches producing a NaN loss
    are skipped without an optimizer step.

    Fix over the original: gradients are unscaled before clipping. Clipping
    the still-scaled gradients made the max-norm threshold meaningless (the
    effective clip depended on the current loss scale) — see the torch.amp
    gradient-clipping recipe.
    """
    model.train()
    total_loss = 0.0
    # Label smoothing regularizes the decoder; padding positions are ignored.
    criterion = nn.CrossEntropyLoss(ignore_index=vocab.pad_token_id, label_smoothing=0.1)
    for batch in tqdm(dataloader):
        optimizer.zero_grad()
        images = batch['image'].to(device)
        questions = {
            'input_ids': batch['question_ids'].to(device),
            'attention_mask': batch['question_mask'].to(device)
        }
        answers = batch['answer_ids'].to(device)
        with torch.amp.autocast(device):
            logits = model(images, questions, answer_input_ids=answers)
            # Shift for next-token prediction.
            shifted_logits = logits[:, :-1, :].contiguous()
            shifted_answers = answers[:, 1:].contiguous()
            loss = criterion(
                shifted_logits.view(-1, shifted_logits.size(-1)),
                shifted_answers.view(-1)
            )
        if torch.isnan(loss):
            print("NaN loss detected, skipping batch.")
            continue
        scaler.scale(loss).backward()
        # BUGFIX: unscale before clipping so the 1.0 max-norm applies to the
        # true gradients rather than the loss-scaled ones.
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        total_loss += loss.item()
    return total_loss / len(dataloader)
59
def validate_one_epoch(model, dataloader, device, vocab):
    """Evaluate one epoch: mean teacher-forced loss plus exact-match accuracy.

    Exact match compares the free-running generated answer string against the
    decoded ground truth after stripping special tokens.
    """
    model.eval()
    total_loss = 0.0
    exact_matches = 0
    total_samples = 0
    # No label smoothing at validation time; padding ignored in the loss.
    criterion = nn.CrossEntropyLoss(ignore_index=vocab.pad_token_id)
    with torch.no_grad():
        for batch in tqdm(dataloader):
            images = batch['image'].to(device)
            questions = {
                'input_ids': batch['question_ids'].to(device),
                'attention_mask': batch['question_mask'].to(device)
            }
            answers = batch['answer_ids'].to(device)
            with torch.amp.autocast("cuda"):
                logits = model(images, questions, answer_input_ids=answers)
                # Same next-token shift as training.
                shifted_logits = logits[:, :-1, :].contiguous()
                shifted_answers = answers[:, 1:].contiguous()
                loss = criterion(
                    shifted_logits.view(-1, shifted_logits.size(-1)),
                    shifted_answers.view(-1)
                )
            total_loss += loss.item()
            # Second forward pass without answer ids = free-running generation.
            generated = model(images, questions)
            for pred, true in zip(generated, answers):
                pred_text = vocab.decoder(pred.cpu().numpy())
                true_text = vocab.decoder(true.cpu().numpy())
                if pred_text.strip() == true_text.strip():
                    exact_matches += 1
                total_samples += 1
    avg_loss = total_loss / len(dataloader)
    # NOTE(review): raises ZeroDivisionError on an empty dataloader — confirm
    # callers always pass a non-empty validation set.
    exact_match_acc = exact_matches / total_samples
    return avg_loss, exact_match_acc
92
def filter_spatial_directional_data(df):
    """Return the subset of df whose question or answer is spatial/directional.

    Fix over the original: keywords are matched as whole words/phrases
    (word-boundary anchored regex) instead of raw substrings, so e.g.
    'bright' no longer matches 'right' and 'cup' no longer matches 'up'.
    """
    import re  # local import: module header does not import re
    spatial_keywords = [
        'right', 'left', 'above', 'below', 'top', 'bottom',
        'front', 'behind', 'next to', 'beside', 'near',
        'looking', 'facing', 'pointing', 'direction',
        'where is', 'which side', 'what side'
    ]
    directional_answers = [
        'up', 'down', 'left', 'right', 'forward', 'backward',
        'north', 'south', 'east', 'west', 'straight', 'sideways'
    ]
    # \b anchors prevent substring false positives; re.escape is defensive
    # in case a keyword ever contains regex metacharacters.
    spatial_pattern = r'\b(?:' + '|'.join(re.escape(k) for k in spatial_keywords) + r')\b'
    directional_pattern = r'\b(?:' + '|'.join(re.escape(a) for a in directional_answers) + r')\b'
    spatial_mask = df['question'].str.lower().str.contains(spatial_pattern, na=False, regex=True)
    directional_mask = df['answer'].str.lower().str.contains(directional_pattern, na=False, regex=True)
    spatial_df = df[spatial_mask | directional_mask].copy()
    print(f"Found {len(spatial_df)} spatial/directional samples out of {len(df)} total")
    return spatial_df
108
def main():
    """Fine-tune a pretrained VQA checkpoint on spatial/directional questions.

    Loads the feature-extraction checkpoint, filters the metadata down to
    spatial questions (mixing in general data when too few are found),
    unfreezes the last encoder layers, and trains with differential learning
    rates, early-stopping on validation exact-match accuracy.
    """
    print("# VQA: Spatial-Enhanced Fine-Tuning")
    # Seed everything for reproducible splits and sampling.
    torch.manual_seed(42)
    np.random.seed(42)
    random.seed(42)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(42)
    DATA_DIR = r"./gen_vqa_v2"
    CSV_PATH = os.path.join(DATA_DIR, "metadata.csv")
    PRETRAINED_CHECKPOINT = "./output2/feature_extraction/vqa_checkpoint.pt"
    OUTPUT_DIR = "./output2/spatial_finetuning"
    FINE_TUNED_CHECKPOINT = os.path.join(OUTPUT_DIR, "vqa_checkpoint.pt")
    LOG_CSV = os.path.join(OUTPUT_DIR, "train_log.csv")
    LOSS_GRAPH_PATH = os.path.join(OUTPUT_DIR, "loss_plot.png")
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    # Hyperparameters.
    batch_size = 64
    num_epochs = 50
    patience = 8
    clip_layers_to_unfreeze = 8
    gpt_layers_to_unfreeze = 8
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    checkpoint = torch.load(PRETRAINED_CHECKPOINT, map_location=device)
    metadata = pd.read_csv(CSV_PATH)
    print(f"\nOriginal dataset size: {len(metadata)}")
    spatial_data = filter_spatial_directional_data(metadata)
    if len(spatial_data) < 1000:
        # Too few spatial samples: blend in general data to reduce
        # catastrophic forgetting of non-spatial answers.
        print(f"\nWARNING: Only {len(spatial_data)} spatial samples found!")
        print("Mixing 70% spatial data with 30% general data for balanced training")
        general_data = metadata[~metadata.index.isin(spatial_data.index)].sample(n=min(len(spatial_data)//2, len(metadata)//3), random_state=42)
        mixed_data = pd.concat([spatial_data, general_data]).sample(frac=1, random_state=42).reset_index(drop=True)
    else:
        print(f"Using {len(spatial_data)} spatial/directional samples")
        mixed_data = spatial_data
    # Restore the exact vocabulary the checkpoint was trained with.
    vocab = Vocab()
    vocab.vocab = checkpoint['vocab']
    vocab.vocab_size = len(checkpoint['vocab'])
    vocab.word2idx = checkpoint['word2idx']
    vocab.idx2word = checkpoint['idx2word']
    vocab.pad_token_id = checkpoint['pad_token_id']
    vocab.bos_token_id = checkpoint['bos_token_id']
    vocab.eos_token_id = checkpoint['eos_token_id']
    vocab.unk_token_id = checkpoint['unk_token_id']
    print(f"Answer vocabulary size: {len(vocab.vocab)}")
    model = VQAModel(
        vocab_size=len(checkpoint['vocab']),
        device=device,
        question_max_len=checkpoint.get('question_max_len', 20),
        answer_max_len=checkpoint.get('answer_max_len', 12),
        pad_token_id=checkpoint['pad_token_id'],
        bos_token_id=checkpoint['bos_token_id'],
        eos_token_id=checkpoint['eos_token_id'],
        unk_token_id=checkpoint['unk_token_id'],
        hidden_size=512,
        num_layers=2
    ).to(device)
    question_tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
    # Resize the embeddings to match the tokenizer BEFORE loading weights.
    if question_tokenizer.pad_token is None:
        question_tokenizer.add_special_tokens({"pad_token": "[PAD]"})
        model.gpt2_model.resize_token_embeddings(len(question_tokenizer))
    # strict=False tolerates keys absent from the checkpoint (e.g. adapters).
    model.load_state_dict(checkpoint['model_state_dict'], strict=False)
    print("Pretrained model loaded successfully!\n")
    print(f"UNFREEZING {clip_layers_to_unfreeze} CLIP LAYERS & {gpt_layers_to_unfreeze} GPT-2 LAYERS FOR SPATIAL UNDERSTANDING")
    model.unfreeze_clip_layers(num_layers=clip_layers_to_unfreeze)
    model.unfreeze_gpt2_layers(num_layers=gpt_layers_to_unfreeze)
    # 80/10/10 split of the (possibly mixed) spatial dataset.
    train_df, test_df = train_test_split(mixed_data, test_size=0.2, random_state=42)
    val_df, test_df = train_test_split(test_df, test_size=0.5, random_state=42)
    print(f"Train size: {len(train_df)}, Val size: {len(val_df)}, Test size: {len(test_df)}\n")
    train_dataset = AugmentedVQADataset(train_df, DATA_DIR, question_tokenizer, vocab,
                                        clip_processor=model.clip_preprocess, augment=True,
                                        question_max_len=20, answer_max_len=12)
    val_dataset = AugmentedVQADataset(val_df, DATA_DIR, question_tokenizer, vocab,
                                      clip_processor=model.clip_preprocess, augment=False,
                                      question_max_len=20, answer_max_len=12)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=4, pin_memory=True)
    # Tiny LR for the pretrained encoders, larger for task-specific heads.
    optimizer = create_optimizer_with_differential_lr(
        model,
        clip_lr=3e-7,
        gpt_lr=3e-7,
        other_lr=2e-5
    )
    # mode='max': the scheduler follows exact-match accuracy, not loss.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=4, verbose=True)
    scaler = torch.amp.GradScaler(device)
    print("\nSTARTING SPATIAL-ENHANCED FINE-TUNING")
    best_val_loss = np.inf
    best_exact_match = 0.0
    logs = []
    counter = 0
    for epoch in range(num_epochs):
        print(f"\nSpatial Fine-tuning Epoch {epoch+1}/{num_epochs}")
        train_loss = train_one_epoch(model, train_loader, optimizer, device, vocab, scaler)
        val_loss, val_exact_match = validate_one_epoch(model, val_loader, device, vocab)
        print(f"Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f} | Val Exact Match: {val_exact_match:.4f} | LR: {optimizer.param_groups[0]['lr']}")
        scheduler.step(val_exact_match)
        if val_exact_match > best_exact_match:
            best_exact_match = val_exact_match
            save_checkpoint(model, optimizer, epoch, vocab, FINE_TUNED_CHECKPOINT)
            print("Checkpoint saved!")
            counter = 0
        else:
            counter += 1
            print(f"No improvement for {counter} epochs.")
        if counter >= patience:
            # NOTE(review): the epoch that triggers early stopping is never
            # appended to the logs because the break happens before the
            # append below — confirm this is intentional.
            print(f"\nEarly stopping after {patience} epochs without improvement")
            break
        logs.append([epoch + 1, train_loss, val_loss, val_exact_match, optimizer.param_groups[0]['lr']])
        # Rewrite the full CSV every epoch so progress survives a crash.
        pd.DataFrame(logs, columns=["epoch", "train_loss", "val_loss", "val_exact_match", "lr"]).to_csv(LOG_CSV, index=False)
    plot_losses([x[1] for x in logs], [x[2] for x in logs], save_path=LOSS_GRAPH_PATH)
    print("\nFINE-TUNING COMPLETE")
    print(f"Best exact match: {best_exact_match:.4f}")
    print(f"Model saved to: {FINE_TUNED_CHECKPOINT}")

if __name__ == "__main__":
    main()
finetune2.py ADDED
@@ -0,0 +1,395 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import random
3
+ import numpy as np
4
+ import pandas as pd
5
+ import torch
6
+ import torch.nn as nn
7
+ from torch.utils.data import DataLoader
8
+ from transformers import GPT2Tokenizer
9
+ from tqdm import tqdm
10
+ from sklearn.model_selection import train_test_split
11
+ from model import VQAModel
12
+ from model_spatial import VQAModelWithSpatialAdapter
13
+ from train import AugmentedVQADataset, Vocab, save_checkpoint, plot_losses
14
+ import math
15
def filter_spatial_questions(df):
    """
    Filter dataset for spatial/directional questions.
    Returns both spatial subset and general subset for mixed training.

    Fix over the original: keywords are matched as whole words/phrases
    (word-boundary anchored) so substrings such as 'bright' no longer
    match 'right'.
    """
    import re  # local import: module header does not import re
    spatial_keywords = [
        'right', 'left', 'above', 'below', 'top', 'bottom',
        'front', 'behind', 'next to', 'beside', 'near', 'between',
        'in front', 'in back', 'across from', 'opposite',
        'closest', 'farthest', 'nearest', 'furthest',
        'where is', 'which side', 'what side', 'what direction',
        'on the left', 'on the right', 'at the top', 'at the bottom'
    ]
    # \b anchors avoid substring false positives (e.g. 'cup' matching 'up').
    pattern = r'\b(?:' + '|'.join(re.escape(k) for k in spatial_keywords) + r')\b'
    spatial_mask = df['question'].str.lower().str.contains(pattern, na=False, regex=True)
    spatial_df = df[spatial_mask].copy()
    general_df = df[~spatial_mask].copy()
    print(f"\n📊 Dataset Filtering Results:")
    print(f" Total samples: {len(df):,}")
    print(f" Spatial samples: {len(spatial_df):,} ({len(spatial_df)/len(df)*100:.1f}%)")
    print(f" General samples: {len(general_df):,} ({len(general_df)/len(df)*100:.1f}%)")
    if len(spatial_df) > 0:
        print(f"\n📝 Sample Spatial Questions:")
        for i, row in spatial_df.sample(min(5, len(spatial_df))).iterrows():
            print(f" Q: {row['question']}")
            print(f" A: {row['answer']}\n")
    return spatial_df, general_df
42
def create_mixed_dataset(spatial_df, general_df, spatial_ratio=0.85, min_spatial_samples=1000):
    """
    Combine spatial and general questions so that spatial rows make up
    roughly `spatial_ratio` of the shuffled result.

    The general sample count is derived from the spatial count; when fewer
    than `min_spatial_samples` spatial rows exist, a warning is printed
    (the mixing formula itself is unchanged). Sampling and shuffling use
    fixed seeds, so the output is deterministic.
    """
    if len(spatial_df) < min_spatial_samples:
        print(f"\n⚠️ WARNING: Only {len(spatial_df)} spatial samples found!")
        print(f" Recommended minimum: {min_spatial_samples}")
        print(f" Mixing with general data to prevent catastrophic forgetting...")
    num_spatial = len(spatial_df)
    # Solve spatial/(spatial+general) ≈ spatial_ratio for the general count,
    # capped by what is actually available.
    num_general = min(int(num_spatial * (1 - spatial_ratio) / spatial_ratio), len(general_df))
    general_sample = general_df.sample(n=num_general, random_state=42)
    mixed_df = pd.concat([spatial_df, general_sample]).sample(frac=1, random_state=42).reset_index(drop=True)
    print(f"\n🔀 Mixed Dataset Created:")
    print(f" Spatial: {num_spatial:,} ({num_spatial/len(mixed_df)*100:.1f}%)")
    print(f" General: {num_general:,} ({num_general/len(mixed_df)*100:.1f}%)")
    print(f" Total: {len(mixed_df):,}")
    return mixed_df
65
def unfreeze_clip_layers(model, num_layers=4):
    """
    Unfreeze last N layers of CLIP for spatial feature learning.

    Also re-enables gradients for the visual output projection and the final
    layer norm so the newly unfrozen blocks can actually influence the
    pooled image embedding.
    """
    # Total transformer blocks in the CLIP visual tower.
    total_blocks = len(model.clip_model.visual.transformer.resblocks)
    for i, block in enumerate(model.clip_model.visual.transformer.resblocks):
        if i >= total_blocks - num_layers:
            for p in block.parameters():
                p.requires_grad = True
    # The output projection is a bare Parameter in OpenAI CLIP but may be a
    # module in other implementations — handle both.
    if hasattr(model.clip_model.visual, "proj") and model.clip_model.visual.proj is not None:
        if isinstance(model.clip_model.visual.proj, torch.nn.Parameter):
            model.clip_model.visual.proj.requires_grad = True
        else:
            for p in model.clip_model.visual.proj.parameters():
                p.requires_grad = True
    # Final post-transformer layer norm also participates in fine-tuning.
    if hasattr(model.clip_model.visual, "ln_post"):
        for p in model.clip_model.visual.ln_post.parameters():
            p.requires_grad = True
    print(f" ✓ Unfroze last {num_layers} CLIP layers")
84
def freeze_base_model(model, unfreeze_clip_layers_count=4):
    """
    Freeze most of the model, unfreeze spatial adapter and last CLIP layers.

    Returns the same model instance after mutating requires_grad flags and
    printing a trainable-parameter summary.
    """
    # Freeze all of CLIP first, then selectively re-enable the last N layers.
    for param in model.clip_model.parameters():
        param.requires_grad = False
    unfreeze_clip_layers(model, num_layers=unfreeze_clip_layers_count)
    # The language encoder and the answer decoder stay fully frozen.
    for param in model.gpt2_model.parameters():
        param.requires_grad = False
    for param in model.decoder.parameters():
        param.requires_grad = False
    # Spatial-adapter path (attributes presumably defined on
    # VQAModelWithSpatialAdapter — confirm against model_spatial.py) is the
    # main trainable component of this fine-tuning stage.
    for param in model.spatial_adapter.parameters():
        param.requires_grad = True
    for param in model.spatial_context_proj.parameters():
        param.requires_grad = True
    for param in model.q_proj.parameters():
        param.requires_grad = True
    for param in model.spatial_fusion.parameters():
        param.requires_grad = True
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"\n🔒 Model Freezing Applied:")
    print(f" Total parameters: {total_params:,}")
    print(f" Trainable parameters: {trainable_params:,} ({trainable_params/total_params*100:.1f}%)")
    print(f" Frozen parameters: {total_params - trainable_params:,}")
    return model
110
def get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, min_lr=1e-7):
    """
    LambdaLR with linear warmup followed by cosine decay.

    The lambda returns a multiplier on each group's base LR: it ramps
    linearly from 0 to 1 over `num_warmup_steps`, then follows a cosine
    curve down, floored at `min_lr` (note: a multiplier floor, not an
    absolute learning rate).
    """
    def lr_lambda(step):
        if step < num_warmup_steps:
            # Linear ramp; max(1, ...) guards against zero warmup steps.
            return step / max(1, num_warmup_steps)
        decay_span = max(1, num_training_steps - num_warmup_steps)
        progress = (step - num_warmup_steps) / decay_span
        return max(min_lr, 0.5 * (1.0 + math.cos(math.pi * progress)))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
120
def create_optimizer_with_differential_lr(model, base_lr=5e-5):
    """
    AdamW with per-component learning rates derived from `base_lr`:
    CLIP parameters at base_lr*0.1, the spatial adapter at base_lr, and
    every other trainable parameter at base_lr*0.5. Frozen params skipped.
    """
    clip_params, spatial_adapter_params, other_params = [], [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if 'clip_model' in name:
            clip_params.append(param)
        elif 'spatial_adapter' in name:
            spatial_adapter_params.append(param)
        else:
            other_params.append(param)
    optimizer = torch.optim.AdamW([
        {'params': clip_params, 'lr': base_lr * 0.1},
        {'params': spatial_adapter_params, 'lr': base_lr},
        {'params': other_params, 'lr': base_lr * 0.5},
    ], weight_decay=1e-4)
    print(f"\n⚙️ Optimizer Configuration:")
    print(f" CLIP params: {len(clip_params):,} (LR: {base_lr * 0.1:.2e})")
    print(f" Spatial adapter params: {len(spatial_adapter_params):,} (LR: {base_lr:.2e})")
    print(f" Other params: {len(other_params):,} (LR: {base_lr * 0.5:.2e})")
    return optimizer
145
def train_one_epoch(model, dataloader, optimizer, device, vocab, scaler):
    """
    Run one training epoch with mixed precision (AMP).

    Args:
        model: VQA model; forward(images, questions, answer_input_ids) -> logits.
        dataloader: Yields dicts with 'image', 'question_ids', 'question_mask',
            and 'answer_ids' tensors.
        optimizer: Optimizer whose gradients are managed by `scaler`.
        device: 'cuda' or 'cpu'; used for tensor placement and autocast.
        vocab: Vocabulary object providing `pad_token_id`.
        scaler: torch.amp.GradScaler handling loss scaling.

    Returns:
        Tuple (avg_loss, avg_token_accuracy) averaged over batches.
    """
    model.train()
    total_loss = 0.0
    total_token_acc = 0.0
    # Label smoothing regularizes next-token prediction; padding is ignored.
    criterion = nn.CrossEntropyLoss(ignore_index=vocab.pad_token_id, label_smoothing=0.1)
    for batch in tqdm(dataloader, desc="Training"):
        optimizer.zero_grad()
        images = batch['image'].to(device)
        questions = {
            'input_ids': batch['question_ids'].to(device),
            'attention_mask': batch['question_mask'].to(device)
        }
        answers = batch['answer_ids'].to(device)
        with torch.amp.autocast(device):
            logits = model(images, questions, answer_input_ids=answers)
            # Teacher forcing: predict token t+1 from tokens <= t.
            shifted_logits = logits[:, :-1, :].contiguous()
            shifted_answers = answers[:, 1:].contiguous()
            loss = criterion(
                shifted_logits.view(-1, shifted_logits.size(-1)),
                shifted_answers.view(-1)
            )
        # BUGFIX: skip NaN batches *before* accumulating metrics so a bad
        # batch does not pollute the token-accuracy average (previously the
        # accuracy was added first and only the loss/step was skipped).
        if torch.isnan(loss):
            print("⚠️ NaN loss detected, skipping batch.")
            continue
        predicted_tokens = shifted_logits.argmax(dim=-1)
        correct = (predicted_tokens == shifted_answers).float()
        mask = (shifted_answers != vocab.pad_token_id).float()
        token_acc = (correct * mask).sum() / mask.sum()
        total_token_acc += token_acc.item()
        scaler.scale(loss).backward()
        # BUGFIX: gradients must be unscaled before clipping; otherwise the
        # clip threshold (1.0) is applied to *scaled* gradients and is
        # effectively a no-op under AMP.
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        total_loss += loss.item()
    avg_loss = total_loss / len(dataloader)
    avg_token_acc = total_token_acc / len(dataloader)
    return avg_loss, avg_token_acc
183
def validate_one_epoch(model, dataloader, device, vocab):
    """
    Run one validation epoch.

    Computes teacher-forced loss and token accuracy, plus free-running
    generation to measure exact-match accuracy against the references.

    Args:
        model: VQA model; with answers -> logits, without -> generated ids.
        dataloader: Yields dicts with 'image', 'question_ids', 'question_mask',
            and 'answer_ids' tensors.
        device: 'cuda' or 'cpu'.
        vocab: Vocabulary providing `pad_token_id` and `decoder()`.

    Returns:
        Tuple (avg_loss, avg_token_accuracy, exact_match_accuracy).
    """
    model.eval()
    total_loss = 0.0
    total_token_acc = 0.0
    exact_matches = 0
    total_samples = 0
    criterion = nn.CrossEntropyLoss(ignore_index=vocab.pad_token_id)
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Validation"):
            images = batch['image'].to(device)
            questions = {
                'input_ids': batch['question_ids'].to(device),
                'attention_mask': batch['question_mask'].to(device)
            }
            answers = batch['answer_ids'].to(device)
            with torch.amp.autocast(device):
                logits = model(images, questions, answer_input_ids=answers)
                shifted_logits = logits[:, :-1, :].contiguous()
                shifted_answers = answers[:, 1:].contiguous()
                loss = criterion(
                    shifted_logits.view(-1, shifted_logits.size(-1)),
                    shifted_answers.view(-1)
                )
            predicted_tokens = shifted_logits.argmax(dim=-1)
            correct = (predicted_tokens == shifted_answers).float()
            mask = (shifted_answers != vocab.pad_token_id).float()
            token_acc = (correct * mask).sum() / mask.sum()
            total_token_acc += token_acc.item()
            total_loss += loss.item()
            # Free-running generation (no answer_input_ids) for exact match.
            generated = model(images, questions)
            for pred, true in zip(generated, answers):
                pred_text = vocab.decoder(pred.cpu().numpy())
                true_text = vocab.decoder(true.cpu().numpy())
                if pred_text.strip() == true_text.strip():
                    exact_matches += 1
                total_samples += 1
    # BUGFIX: guard against an empty dataloader to avoid ZeroDivisionError.
    num_batches = max(len(dataloader), 1)
    avg_loss = total_loss / num_batches
    avg_token_acc = total_token_acc / num_batches
    exact_match_acc = exact_matches / max(total_samples, 1)
    return avg_loss, avg_token_acc, exact_match_acc
224
def main():
    """Entry point: fine-tune the spatial adapter on top of a pretrained VQA model."""
    print("=" * 80)
    print("🚀 VQA SPATIAL ADAPTER FINE-TUNING V2 (ENHANCED)")
    print("=" * 80)
    # Seed every RNG source for reproducible splits, sampling and init.
    torch.manual_seed(42)
    np.random.seed(42)
    random.seed(42)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(42)
    # --- Paths (relative to the working directory) ---
    DATA_DIR = r"./gen_vqa_v2"
    CSV_PATH = os.path.join(DATA_DIR, "metadata.csv")
    PRETRAINED_CHECKPOINT = "./output2/continued_training/vqa_checkpoint.pt"
    OUTPUT_DIR = "./output2/spatial_adapter_v2_2"
    FINE_TUNED_CHECKPOINT = os.path.join(OUTPUT_DIR, "vqa_spatial_checkpoint.pt")
    LOG_CSV = os.path.join(OUTPUT_DIR, "train_log.csv")
    LOSS_GRAPH_PATH = os.path.join(OUTPUT_DIR, "loss_plot.png")
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    # --- Hyperparameters ---
    batch_size = 64
    base_learning_rate = 5e-5
    num_epochs = 100
    patience = 15  # early-stopping patience (epochs without exact-match gain)
    warmup_epochs = 3
    spatial_ratio = 0.85  # fraction of spatial questions in the mixed dataset
    clip_layers_to_unfreeze = 6
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"\n⚙️ Enhanced Configuration:")
    print(f" Device: {device}")
    print(f" Batch size: {batch_size}")
    print(f" Base learning rate: {base_learning_rate:.2e}")
    print(f" Max epochs: {num_epochs} (increased from 20)")
    print(f" Warmup epochs: {warmup_epochs}")
    print(f" Early stopping patience: {patience}")
    print(f" Spatial ratio: {spatial_ratio:.0%} (increased from 70%)")
    print(f" CLIP layers to unfreeze: {clip_layers_to_unfreeze}")
    # --- Dataset: build a mixed set oversampling spatial questions ---
    print(f"\n📂 Loading dataset from: {CSV_PATH}")
    metadata = pd.read_csv(CSV_PATH)
    spatial_df, general_df = filter_spatial_questions(metadata)
    mixed_data = create_mixed_dataset(spatial_df, general_df, spatial_ratio=spatial_ratio)
    # --- Restore vocabulary and special-token ids from the checkpoint ---
    print(f"\n📥 Loading pretrained model from: {PRETRAINED_CHECKPOINT}")
    checkpoint = torch.load(PRETRAINED_CHECKPOINT, map_location=device)
    vocab = Vocab()
    vocab.vocab = checkpoint['vocab']
    vocab.vocab_size = len(checkpoint['vocab'])
    vocab.word2idx = checkpoint['word2idx']
    vocab.idx2word = checkpoint['idx2word']
    vocab.pad_token_id = checkpoint['pad_token_id']
    vocab.bos_token_id = checkpoint['bos_token_id']
    vocab.eos_token_id = checkpoint['eos_token_id']
    vocab.unk_token_id = checkpoint['unk_token_id']
    print(f" Vocabulary size: {len(vocab.vocab):,}")
    question_tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
    if question_tokenizer.pad_token is None:
        question_tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    # --- Rebuild the base model and load pretrained weights ---
    base_model = VQAModel(
        vocab_size=len(checkpoint['vocab']),
        device=device,
        question_max_len=checkpoint.get('question_max_len', 20),
        answer_max_len=checkpoint.get('answer_max_len', 12),
        pad_token_id=checkpoint['pad_token_id'],
        bos_token_id=checkpoint['bos_token_id'],
        eos_token_id=checkpoint['eos_token_id'],
        unk_token_id=checkpoint['unk_token_id'],
        hidden_size=512,
        num_layers=2
    ).to(device)
    base_model.gpt2_model.resize_token_embeddings(len(question_tokenizer))
    # strict=False: spatial-adapter weights are new and absent from the checkpoint.
    base_model.load_state_dict(checkpoint['model_state_dict'], strict=False)
    print(" ✓ Pretrained weights loaded")
    print(f"\n🔧 Creating VQA model with spatial adapter...")
    model = VQAModelWithSpatialAdapter(
        base_model=base_model,
        hidden_size=512,
        num_heads=8,
        dropout=0.3
    ).to(device)
    # Freeze everything except the adapter stack and the top CLIP layers.
    model = freeze_base_model(model, unfreeze_clip_layers_count=clip_layers_to_unfreeze)
    # --- 80/10/10 train/val/test split (val/test from the 20% holdout) ---
    train_df, test_df = train_test_split(mixed_data, test_size=0.2, random_state=42)
    val_df, test_df = train_test_split(test_df, test_size=0.5, random_state=42)
    print(f"\n📊 Data Split:")
    print(f" Train: {len(train_df):,} samples")
    print(f" Validation: {len(val_df):,} samples")
    print(f" Test: {len(test_df):,} samples")
    from torchvision import transforms
    # NOTE(review): safe_augmentation is defined but never passed anywhere,
    # and both datasets are built with augment=False — confirm whether image
    # augmentation was meant to be enabled.
    safe_augmentation = transforms.Compose([
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
        transforms.RandomRotation(5),
    ])
    train_dataset = AugmentedVQADataset(
        train_df, DATA_DIR, question_tokenizer, vocab,
        clip_processor=model.clip_preprocess,
        augment=False,
        question_max_len=20,
        answer_max_len=12
    )
    val_dataset = AugmentedVQADataset(
        val_df, DATA_DIR, question_tokenizer, vocab,
        clip_processor=model.clip_preprocess,
        augment=False,
        question_max_len=20,
        answer_max_len=12
    )
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=4, pin_memory=True)
    # --- Optimizer + LR schedule (linear warmup then cosine decay) ---
    optimizer = create_optimizer_with_differential_lr(model, base_lr=base_learning_rate)
    num_training_steps = len(train_loader) * num_epochs
    num_warmup_steps = len(train_loader) * warmup_epochs
    scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps)
    print(f"\n📈 Learning Rate Schedule:")
    print(f" Warmup steps: {num_warmup_steps:,} ({warmup_epochs} epochs)")
    print(f" Total steps: {num_training_steps:,}")
    print(f" Schedule: Linear warmup → Cosine decay")
    scaler = torch.amp.GradScaler(device)
    print("\n" + "=" * 80)
    print("🎯 STARTING ENHANCED SPATIAL ADAPTER FINE-TUNING")
    print("=" * 80)
    best_val_exact_match = 0.0
    best_val_loss = np.inf  # NOTE(review): tracked but never updated or used below
    counter = 0  # epochs since the last exact-match improvement
    logs = []
    for epoch in range(num_epochs):
        print(f"\n📅 Epoch {epoch+1}/{num_epochs}")
        print("-" * 80)
        train_loss, train_token_acc = train_one_epoch(model, train_loader, optimizer, device, vocab, scaler)
        val_loss, val_token_acc, val_exact_match = validate_one_epoch(model, val_loader, device, vocab)
        # Param group 1 is the spatial adapter (runs at the full base LR).
        current_lr = optimizer.param_groups[1]['lr']
        print(f"\n📈 Metrics:")
        print(f" Train Loss: {train_loss:.4f} | Train Token Acc: {train_token_acc:.4f}")
        print(f" Val Loss: {val_loss:.4f} | Val Token Acc: {val_token_acc:.4f}")
        print(f" Val Exact Match: {val_exact_match:.4f}")
        print(f" Learning Rate: {current_lr:.2e}")
        # Checkpoint on the best exact-match score; stop after `patience`
        # stale epochs. NOTE(review): on the early-stop epoch the break below
        # skips the logs.append, so the final epoch is absent from train_log.csv.
        if val_exact_match > best_val_exact_match:
            best_val_exact_match = val_exact_match
            save_checkpoint(model, optimizer, epoch, vocab, FINE_TUNED_CHECKPOINT)
            print(f" ✅ New best model saved! (Exact Match: {val_exact_match:.4f})")
            counter = 0
        else:
            counter += 1
            print(f" ⏳ No improvement for {counter} epoch(s)")
            if counter >= patience:
                print(f"\n⏹️ Early stopping triggered after {patience} epochs without improvement")
                break
        logs.append([
            epoch + 1,
            train_loss,
            train_token_acc,
            val_loss,
            val_token_acc,
            val_exact_match,
            current_lr
        ])
        # The scheduler was sized in optimizer steps but is advanced once per
        # epoch here: step it by one epoch's worth of batches.
        for _ in range(len(train_loader)):
            scheduler.step()
    log_df = pd.DataFrame(
        logs,
        columns=["epoch", "train_loss", "train_token_acc", "val_loss", "val_token_acc", "val_exact_match", "lr"]
    )
    log_df.to_csv(LOG_CSV, index=False)
    # Columns 1 and 3 of each log row are train_loss and val_loss.
    plot_losses([x[1] for x in logs], [x[3] for x in logs], save_path=LOSS_GRAPH_PATH)
    print("\n" + "=" * 80)
    print("✅ ENHANCED FINE-TUNING COMPLETE")
    print("=" * 80)
    print(f"\n📊 Final Results:")
    print(f" Best Exact Match: {best_val_exact_match:.4f}")
    print(f" Total Epochs: {len(logs)}")
    # 0.2037 is the hardcoded v1 baseline exact-match score.
    print(f" Improvement from v1: {best_val_exact_match - 0.2037:.4f} ({(best_val_exact_match - 0.2037) / 0.2037 * 100:+.1f}%)")
    print(f"\n💾 Outputs:")
    print(f" Model: {FINE_TUNED_CHECKPOINT}")
    print(f" Logs: {LOG_CSV}")
    print(f" Plot: {LOSS_GRAPH_PATH}")
    print("\n🎉 Ready to test on spatial questions!")
394
# Run fine-tuning only when executed as a script (not on import).
if __name__ == "__main__":
    main()
genvqa-dataset.py ADDED
@@ -0,0 +1,78 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import json
3
+ import random
4
+ import shutil
5
+ import pandas as pd
6
+ from tqdm import tqdm
7
+ from collections import Counter
8
# Build a filtered, balanced VQA subset from COCO train2014 + VQA v2 annotations.
# --- Input/output locations ---
IMAGES_DIR = r"../train2014"
QUESTIONS_PATH = r"../v2_OpenEnded_mscoco_train2014_questions.json"
ANNOTATIONS_PATH = r"../v2_mscoco_train2014_annotations.json"
OUTPUT_DIR = "./gen_vqa_v2"
os.makedirs(os.path.join(OUTPUT_DIR, "images"), exist_ok=True)
print("Loading VQA v2 data...")
with open(QUESTIONS_PATH, "r") as f:
    questions = json.load(f)["questions"]
with open(ANNOTATIONS_PATH, "r") as f:
    annotations = json.load(f)["annotations"]
# Index annotations by question_id for O(1) joins with the question list.
qid_to_ann = {ann["question_id"]: ann for ann in annotations}
print("Merging questions and answers...")
merged_data = []
answer_counter = Counter()
# Drop low-information answers and overly generic question phrasings.
EXCLUDED_ANSWERS = ['yes', 'no', 'unknown', 'none', 'n/a', 'cant tell', 'not sure']
AMBIGUOUS_QUESTIONS = ['what is in the image', 'what is this', 'what is that', 'what do you see']
for q in tqdm(questions, total=len(questions)):
    ann = qid_to_ann.get(q["question_id"])
    if not ann:
        continue
    answers = [a["answer"] for a in ann["answers"] if a["answer"].strip()]
    if not answers:
        continue
    # Majority vote across the annotators' answers.
    main_answer = max(set(answers), key=answers.count)
    main_answer = main_answer.lower().strip()
    question_text = q["question"].lower().strip()
    if main_answer in EXCLUDED_ANSWERS:
        continue
    if any(ambig in question_text for ambig in AMBIGUOUS_QUESTIONS):
        continue
    # Keep only short answers (at most 5 words and 30 characters).
    if len(main_answer.split()) <= 5 and len(main_answer) <= 30:
        merged_data.append({
            "image_id": q["image_id"],
            "question_id": q["question_id"],
            "question": q["question"],
            "answer": main_answer
        })
        answer_counter[main_answer] += 1
print(f"Total valid Q-A pairs (after filtering): {len(merged_data)}")
# Remove rare answers so the label space stays learnable.
MIN_ANSWER_FREQ = 20
frequent_answers = {ans for ans, count in answer_counter.items() if count >= MIN_ANSWER_FREQ}
filtered_data = [item for item in merged_data if item["answer"] in frequent_answers]
print(f"After frequency filtering (min_freq={MIN_ANSWER_FREQ}): {len(filtered_data)} pairs")
# Cap samples per answer to reduce class imbalance.
# NOTE(review): this keeps the *first* N occurrences rather than a random
# sample — the shuffle happens only afterwards; confirm this is intended.
MAX_SAMPLES_PER_ANSWER = 600
answer_samples = {}
for item in filtered_data:
    ans = item["answer"]
    if ans not in answer_samples:
        answer_samples[ans] = []
    if len(answer_samples[ans]) < MAX_SAMPLES_PER_ANSWER:
        answer_samples[ans].append(item)
balanced_data = []
for samples in answer_samples.values():
    balanced_data.extend(samples)
random.shuffle(balanced_data)
print(f"After balancing: {len(balanced_data)} pairs with {len(answer_samples)} unique answers")
print("Copying selected images and saving data...")
final_data = []
for item in tqdm(balanced_data):
    # COCO file naming convention: 12-digit zero-padded image id.
    img_name = f"COCO_train2014_{item['image_id']:012d}.jpg"
    src_path = os.path.join(IMAGES_DIR, img_name)
    dst_path = os.path.join(OUTPUT_DIR, "images", img_name)
    # Silently skip pairs whose source image is missing on disk.
    if os.path.exists(src_path):
        shutil.copy(src_path, dst_path)
        item["image_path"] = f"images/{img_name}"
        final_data.append(item)
print(f"Final dataset: {len(final_data)} pairs")
with open(os.path.join(OUTPUT_DIR, "qa_pairs.json"), "w") as f:
    json.dump(final_data, f, indent=2, ensure_ascii=False)
pd.DataFrame(final_data).to_csv(os.path.join(OUTPUT_DIR, "metadata.csv"), index=False)
print("Data preparation complete.")
groq_service.py ADDED
@@ -0,0 +1,118 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Groq LLM Service for VQA Accessibility
3
+ Generates descriptive 2-sentence narrations for blind users
4
+ """
5
+ import os
6
+ from typing import Dict, Optional
7
+ from groq import Groq
8
class GroqDescriptionService:
    """Service to generate accessible descriptions using Groq LLM"""

    def __init__(self, api_key: Optional[str] = None):
        """
        Initialize Groq service

        Args:
            api_key: Groq API key (if not provided, reads from GROQ_API_KEY env var)

        Raises:
            ValueError: If no key was supplied and GROQ_API_KEY is unset/empty.
        """
        self.api_key = api_key or os.getenv("GROQ_API_KEY")
        if not self.api_key:
            raise ValueError(
                "Groq API key not found. Set GROQ_API_KEY environment variable "
                "or pass api_key parameter"
            )
        self.client = Groq(api_key=self.api_key)
        # Default chat model used for all requests.
        self.model = "llama-3.3-70b-versatile"

    def generate_description(
        self,
        question: str,
        answer: str,
        max_retries: int = 2
    ) -> Dict[str, str]:
        """
        Generate a 2-sentence accessible description for blind users

        Args:
            question: The question asked by the user
            answer: The VQA model's answer
            max_retries: Number of retry attempts on failure

        Returns:
            Dict with 'description' and 'status' keys. 'status' is 'success'
            (also carries 'model') or 'fallback' (also carries 'error').
        """
        prompt = f"""You are an accessibility assistant helping blind users understand visual question answering results.
Question asked: "{question}"
Answer from VQA model: "{answer}"
Task: Create a clear, natural 2-sentence description that:
1. First sentence: Restates the question and provides the answer
2. Second sentence: Adds helpful context or clarification
Keep it concise, natural, and easy to understand when spoken aloud.
Example:
Question: "What color is the car?"
Answer: "red"
Description: "The question asks about the color of the car, and the answer is red. This indicates there is a red-colored vehicle visible in the image."
Now generate the description:"""
        # Retry the API call up to max_retries times; on final failure return
        # a deterministic template so callers always receive usable text.
        for attempt in range(max_retries + 1):
            try:
                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {
                            "role": "system",
                            "content": "You are a helpful accessibility assistant. Always respond with exactly 2 clear, natural sentences."
                        },
                        {
                            "role": "user",
                            "content": prompt
                        }
                    ],
                    temperature=0.7,
                    max_tokens=150,
                    top_p=0.9
                )
                description = response.choices[0].message.content.strip()
                # Strip a leading "Description:" label if the model echoes it.
                if description.startswith("Description:"):
                    description = description.replace("Description:", "").strip()
                return {
                    "description": description,
                    "status": "success",
                    "model": self.model
                }
            except Exception as e:
                if attempt < max_retries:
                    # Immediate retry (no backoff) on any API/client error.
                    continue
                else:
                    fallback = f"The question asks: {question}. The answer is: {answer}."
                    return {
                        "description": fallback,
                        "status": "fallback",
                        "error": str(e)
                    }

    def generate_batch_descriptions(
        self,
        qa_pairs: list[Dict[str, str]]
    ) -> list[Dict[str, str]]:
        """
        Generate descriptions for multiple Q&A pairs

        Args:
            qa_pairs: List of dicts with 'question' and 'answer' keys

        Returns:
            List of description results (same order as qa_pairs; one
            sequential API call per pair).
        """
        results = []
        for pair in qa_pairs:
            result = self.generate_description(
                question=pair.get("question", ""),
                answer=pair.get("answer", "")
            )
            results.append(result)
        return results
106
# Module-level singleton holder for the Groq description service.
_groq_service_instance = None


def get_groq_service(api_key: Optional[str] = None) -> GroqDescriptionService:
    """
    Return the process-wide GroqDescriptionService, creating it on first use.

    Args:
        api_key: Optional API key (uses env var if not provided). Only
            consulted on the first call; later calls reuse the cached instance.

    Returns:
        GroqDescriptionService instance
    """
    global _groq_service_instance
    if _groq_service_instance is not None:
        return _groq_service_instance
    _groq_service_instance = GroqDescriptionService(api_key=api_key)
    return _groq_service_instance
knowledge_graph_service.py ADDED
@@ -0,0 +1,291 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Knowledge Graph Service for Neuro-Symbolic VQA
3
+ Uses ConceptNet API to provide common-sense reasoning capabilities
4
+ """
5
+
6
+ import requests
7
+ import re
8
+ from typing import Dict, List, Optional
9
+ from functools import lru_cache
10
+ import time
11
+
12
+
13
class KnowledgeGraphService:
    """
    Lightweight ConceptNet API wrapper for common-sense reasoning.
    Enhances VQA answers with external knowledge about object properties,
    capabilities, uses, and relationships.
    """

    CONCEPTNET_API = "https://api.conceptnet.io"

    # (regex, ConceptNet relation) pairs used to classify common-sense
    # questions; order matters — the first matching pattern wins.
    COMMONSENSE_PATTERNS = [
        # Capability questions
        (r'can .* (melt|freeze|fly|swim|float|sink|break|burn|explode)', 'CapableOf'),
        (r'is .* able to', 'CapableOf'),
        (r'does .* (float|sink)', 'CapableOf'),

        # Property questions
        (r'is .* (edible|poisonous|dangerous|safe|hot|cold|sweet|sour)', 'HasProperty'),
        (r'is this (food|drink|toy|tool|weapon)', 'HasProperty'),

        # Purpose questions
        (r'what .* (used for|for)', 'UsedFor'),
        (r'why .* (used|made)', 'UsedFor'),
        (r'how .* use', 'UsedFor'),

        # Composition questions
        (r'what .* made (of|from)', 'MadeOf'),
        (r'what .* (material|ingredient)', 'MadeOf'),

        # Location questions
        (r'where .* (found|located|kept|stored)', 'AtLocation'),
        (r'where (do|does) .* (live|grow)', 'AtLocation'),
    ]

    def __init__(self, cache_size=100, timeout=5):
        """
        Initialize Knowledge Graph service.

        Args:
            cache_size: Number of API responses to cache
            timeout: API request timeout in seconds
        """
        self.timeout = timeout
        self.cache_size = cache_size
        # BUGFIX: the previous version decorated the *method* with @lru_cache,
        # which (a) keys the cache on `self`, keeping every instance alive for
        # the cache's lifetime (ruff B019), and (b) hardcoded maxsize=100,
        # silently ignoring the `cache_size` argument. A per-instance cached
        # wrapper fixes both while keeping the `_query_conceptnet` call sites
        # unchanged.
        self._query_conceptnet = lru_cache(maxsize=cache_size)(self._query_conceptnet_uncached)
        print("✅ Knowledge Graph service initialized (ConceptNet API)")

    def _query_conceptnet_uncached(self, concept: str, relation: str, limit: int = 10) -> Optional[Dict]:
        """
        Query ConceptNet API (uncached; __init__ installs the cached wrapper
        as ``self._query_conceptnet``).

        Args:
            concept: Concept to query (e.g., "ice_cream")
            relation: Relation type (e.g., "CapableOf", "HasProperty")
            limit: Maximum number of results

        Returns:
            API response dict or None if failed
        """
        try:
            # Normalize concept (replace spaces with underscores)
            concept = concept.lower().replace(' ', '_')

            # Build API URL
            url = f"{self.CONCEPTNET_API}/query"
            params = {
                'start': f'/c/en/{concept}',
                'rel': f'/r/{relation}',
                'limit': limit
            }

            # Make request
            response = requests.get(url, params=params, timeout=self.timeout)
            response.raise_for_status()

            return response.json()

        except requests.exceptions.Timeout:
            print(f"⚠️ ConceptNet API timeout for {concept}")
            return None
        except requests.exceptions.RequestException as e:
            print(f"⚠️ ConceptNet API error: {e}")
            return None
        except Exception as e:
            print(f"⚠️ Unexpected error querying ConceptNet: {e}")
            return None

    def get_concept_properties(self, concept: str) -> Dict[str, List[str]]:
        """
        Collect properties of a concept across all supported relations.

        Args:
            concept: Concept name (e.g., "apple")

        Returns:
            Dict mapping each relation type to a list of related concept
            labels (empty lists when the API returns nothing).
        """
        properties = {
            'CapableOf': [],
            'HasProperty': [],
            'UsedFor': [],
            'MadeOf': [],
            'AtLocation': []
        }

        # Query each relation type
        for relation in properties.keys():
            data = self._query_conceptnet(concept, relation)

            if data and 'edges' in data:
                for edge in data['edges']:
                    # Extract the end concept
                    if 'end' in edge and 'label' in edge['end']:
                        end_label = edge['end']['label']
                        properties[relation].append(end_label)

        return properties

    def is_commonsense_question(self, question: str) -> bool:
        """
        Detect if a question requires common-sense reasoning.

        Args:
            question: Question string

        Returns:
            True if question needs external knowledge
        """
        q_lower = question.lower()

        for pattern, _ in self.COMMONSENSE_PATTERNS:
            if re.search(pattern, q_lower):
                return True

        return False

    def _detect_question_type(self, question: str) -> Optional[str]:
        """
        Detect which ConceptNet relation the question is asking about.

        Args:
            question: Question string

        Returns:
            Relation type or None
        """
        q_lower = question.lower()

        for pattern, relation in self.COMMONSENSE_PATTERNS:
            if re.search(pattern, q_lower):
                return relation

        return None

    def answer_commonsense_question(self, object_name: str, question: str) -> Optional[str]:
        """
        Answer a common-sense question using Knowledge Graph.

        Args:
            object_name: Object detected by VQA (e.g., "ice cream")
            question: User's question

        Returns:
            Enhanced answer string or None
        """
        # Detect question type
        relation = self._detect_question_type(question)
        if not relation:
            return None

        # Query ConceptNet
        data = self._query_conceptnet(object_name, relation, limit=5)
        if not data or 'edges' not in data:
            return None

        # Extract relevant knowledge
        knowledge = []
        for edge in data['edges']:
            if 'end' in edge and 'label' in edge['end']:
                knowledge.append(edge['end']['label'])

        if not knowledge:
            return None

        # Generate natural language answer based on question type
        return self._synthesize_answer(object_name, question, relation, knowledge)

    def _synthesize_answer(self, object_name: str, question: str,
                           relation: str, knowledge: List[str]) -> str:
        """
        Synthesize natural language answer from knowledge.

        Args:
            object_name: Detected object
            question: Original question
            relation: Relation type
            knowledge: List of related concepts from KG

        Returns:
            Natural language answer (or None when no template applies)
        """
        q_lower = question.lower()

        # Capability questions (can X do Y?)
        if relation == 'CapableOf':
            # Check if specific capability is mentioned
            for capability in knowledge:
                if capability in q_lower:
                    return f"Yes, {object_name} can {capability}."

            # General capability answer
            if knowledge:
                caps = ', '.join(knowledge[:3])
                return f"{object_name.capitalize()} can {caps}."

        # Property questions (is X Y?)
        elif relation == 'HasProperty':
            # Check for specific property
            if 'edible' in q_lower:
                if 'edible' in knowledge:
                    return f"Yes, {object_name} is edible."
                else:
                    return f"No, {object_name} is not edible."

            if 'dangerous' in q_lower or 'safe' in q_lower:
                if any(prop in knowledge for prop in ['dangerous', 'harmful', 'poisonous']):
                    return f"Caution: {object_name} may be dangerous."
                else:
                    return f"{object_name.capitalize()} is generally safe."

            # General properties
            if knowledge:
                props = ', '.join(knowledge[:3])
                return f"{object_name.capitalize()} is {props}."

        # Purpose questions (what is X used for?)
        elif relation == 'UsedFor':
            if knowledge:
                uses = ', '.join(knowledge[:3])
                return f"{object_name.capitalize()} is used for {uses}."

        # Composition questions (what is X made of?)
        elif relation == 'MadeOf':
            if knowledge:
                materials = ', '.join(knowledge[:3])
                return f"{object_name.capitalize()} is made of {materials}."

        # Location questions (where is X found?)
        elif relation == 'AtLocation':
            if knowledge:
                locations = ', '.join(knowledge[:2])
                return f"{object_name.capitalize()} is typically found at {locations}."

        return None
259
+
260
+
261
# Manual smoke test — exercises live ConceptNet queries (network required).
if __name__ == "__main__":
    print("=" * 80)
    print("🧪 Testing Knowledge Graph Service")
    print("=" * 80)

    kg = KnowledgeGraphService()

    # Test cases: (detected object, user question)
    test_cases = [
        ("ice cream", "Can this melt?"),
        ("apple", "Is this edible?"),
        ("hammer", "What is this used for?"),
        ("knife", "Is this dangerous?"),
        ("bread", "What is this made of?"),
    ]

    for obj, question in test_cases:
        print(f"\n📝 Object: {obj}")
        print(f"❓ Question: {question}")

        # Check if common-sense question
        is_cs = kg.is_commonsense_question(question)
        print(f"🔍 Common-sense: {is_cs}")

        if is_cs:
            # Get answer (may be None when ConceptNet has no matching edges)
            answer = kg.answer_commonsense_question(obj, question)
            print(f"💬 Answer: {answer}")

        print("-" * 80)
llm_reasoning_service.py ADDED
@@ -0,0 +1,292 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ LLM Reasoning Service for VQA
3
+ Uses Groq LLM for Chain-of-Thought reasoning instead of hardcoded rules
4
+ """
5
+ import os
6
+ from typing import Dict, List, Optional, Any
7
+ from groq import Groq
8
+ import json
9
class LLMReasoningService:
    """
    Service that uses Groq LLM for deductive reasoning from Wikidata facts.
    Replaces hardcoded if/else rules with flexible Chain-of-Thought reasoning.
    """
    def __init__(self, api_key: Optional[str] = None, model: str = "llama-3.3-70b-versatile"):
        """
        Initialize LLM Reasoning service.
        Args:
            api_key: Groq API key (if not provided, reads from GROQ_API_KEY env var)
            model: Groq model to use for reasoning
        Raises:
            ValueError: If no API key is supplied and GROQ_API_KEY is unset.
        """
        self.api_key = api_key or os.getenv("GROQ_API_KEY")
        if not self.api_key:
            raise ValueError(
                "Groq API key not found. Set GROQ_API_KEY environment variable "
                "or pass api_key parameter"
            )
        self.client = Groq(api_key=self.api_key)
        self.model = model
        print(f"✅ LLM Reasoning Service initialized (model: {model})")
    def reason_with_facts(
        self,
        object_name: str,
        facts: Dict[str, Any],
        question: str,
        max_retries: int = 2
    ) -> Dict[str, Any]:
        """
        Use LLM to reason about a question using Wikidata facts.
        Args:
            object_name: Name of the detected object (e.g., "candle")
            facts: Dictionary of Wikidata facts about the object
            question: User's question
            max_retries: Number of retry attempts on failure
        Returns:
            Dict with 'answer', 'reasoning_chain', 'confidence' and 'status'
            keys ('model' is also included on success). If the LLM call or
            response parsing fails after all retries, a rule-based fallback
            result is returned (status 'fallback' or 'fallback_generic').
        """
        prompt = self._build_reasoning_prompt(object_name, facts, question)
        for attempt in range(max_retries + 1):
            try:
                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=[
                        {
                            "role": "system",
                            "content": """You are an expert reasoning assistant for a Visual Question Answering system.
Your task is to use Chain-of-Thought reasoning to answer questions about objects based on factual knowledge.
IMPORTANT: Respond in JSON format with this structure:
{
"reasoning_chain": ["step 1", "step 2", "step 3"],
"answer": "final answer in natural language",
"confidence": 0.0-1.0
}
Keep reasoning steps clear and logical. The answer should be conversational and helpful."""
                        },
                        {
                            "role": "user",
                            "content": prompt
                        }
                    ],
                    # Low temperature: we want deterministic, factual reasoning.
                    temperature=0.3,
                    max_tokens=500,
                    response_format={"type": "json_object"}
                )
                content = response.choices[0].message.content.strip()
                result = json.loads(content)
                if not all(key in result for key in ('reasoning_chain', 'answer', 'confidence')):
                    raise ValueError("Invalid response structure from LLM")
                # Clamp confidence into [0, 1]: the model occasionally returns
                # values slightly outside the requested range.
                confidence = min(1.0, max(0.0, float(result['confidence'])))
                return {
                    'answer': result['answer'],
                    'reasoning_chain': result['reasoning_chain'],
                    'confidence': confidence,
                    'status': 'success',
                    'model': self.model
                }
            except json.JSONDecodeError:
                # Malformed JSON from the model: retry, then fall back.
                if attempt < max_retries:
                    continue
                return self._fallback_reasoning(object_name, facts, question)
            except Exception as e:
                # API/validation errors: retry, then log and fall back.
                if attempt < max_retries:
                    continue
                print(f"⚠️ LLM reasoning failed: {e}")
                return self._fallback_reasoning(object_name, facts, question)
    def _build_reasoning_prompt(
        self,
        object_name: str,
        facts: Dict[str, Any],
        question: str
    ) -> str:
        """
        Build a Chain-of-Thought reasoning prompt.
        Args:
            object_name: Name of the object
            facts: Wikidata facts about the object
            question: User's question
        Returns:
            Formatted prompt string (includes a worked few-shot example).
        """
        facts_text = self._format_facts(facts)
        prompt = f"""Question: {question}
Object Detected: {object_name}
Available Facts from Knowledge Graph:
{facts_text}
Task: Use Chain-of-Thought reasoning to answer the question based on the available facts.
Example of good reasoning:
Question: "Can this melt?"
Object: "ice cream"
Facts: {{
"categories": ["frozen dessert", "food"],
"materials": ["milk", "sugar", "cream"]
}}
Reasoning:
{{
"reasoning_chain": [
"The object is ice cream, which is a frozen dessert",
"Ice cream is made of milk, sugar, and cream",
"These ingredients are frozen to create ice cream",
"Frozen items melt when exposed to heat",
"Therefore, yes, ice cream can melt at room temperature"
],
"answer": "Yes, ice cream can melt. It's a frozen dessert made from milk, sugar, and cream, which will melt when exposed to temperatures above freezing.",
"confidence": 0.95
}}
Now reason about the actual question above:"""
        return prompt
    def _format_facts(self, facts: Dict[str, Any]) -> str:
        """Format facts dictionary into readable text (one bullet per key;
        list values are comma-joined, empty values are skipped)."""
        if not facts:
            return "No specific facts available"
        lines = []
        for key, value in facts.items():
            if isinstance(value, list):
                if value:
                    lines.append(f" - {key}: {', '.join(str(v) for v in value)}")
            elif value:
                lines.append(f" - {key}: {value}")
        return "\n".join(lines) if lines else "No specific facts available"
    def _fallback_reasoning(
        self,
        object_name: str,
        facts: Dict[str, Any],
        question: str
    ) -> Dict[str, Any]:
        """
        Fallback reasoning when LLM fails.
        Uses simple rule-based approach (melt/edible keywords only).
        Args:
            object_name: Name of the object
            facts: Wikidata facts
            question: User's question
        Returns:
            Fallback reasoning result (status 'fallback' when a rule matched,
            'fallback_generic' with low confidence otherwise).
        """
        q_lower = question.lower()
        if 'melt' in q_lower:
            materials = facts.get('materials', [])
            if any(m in ['wax', 'ice', 'chocolate', 'butter'] for m in materials):
                return {
                    'answer': f"Yes, {object_name} can melt as it contains materials with low melting points.",
                    'reasoning_chain': [
                        f"The {object_name} contains materials that can melt",
                        "These materials have low melting points",
                        "Therefore, it can melt when heated"
                    ],
                    'confidence': 0.7,
                    'status': 'fallback'
                }
        if 'edible' in q_lower or 'eat' in q_lower:
            categories = facts.get('categories', [])
            if any('food' in str(c).lower() for c in categories):
                return {
                    'answer': f"Yes, {object_name} is edible as it is categorized as food.",
                    'reasoning_chain': [
                        f"The {object_name} is categorized as food",
                        "Food items are generally edible",
                        "Therefore, it is edible"
                    ],
                    'confidence': 0.8,
                    'status': 'fallback'
                }
        return {
            'answer': f"Based on the available information about {object_name}, I cannot provide a definitive answer to this question.",
            'reasoning_chain': [
                f"Analyzing {object_name}",
                "Available facts are limited",
                "Cannot make a confident conclusion"
            ],
            'confidence': 0.3,
            'status': 'fallback_generic'
        }
    def batch_reason(
        self,
        reasoning_tasks: List[Dict[str, Any]]
    ) -> List[Dict[str, Any]]:
        """
        Perform reasoning on multiple tasks (sequentially).
        Args:
            reasoning_tasks: List of dicts with 'object_name', 'facts', 'question' keys
        Returns:
            List of reasoning results, one per task, in input order.
        """
        results = []
        for task in reasoning_tasks:
            result = self.reason_with_facts(
                object_name=task.get('object_name', ''),
                facts=task.get('facts', {}),
                question=task.get('question', '')
            )
            results.append(result)
        return results
239
# Module-level singleton: repeated callers share one Groq client/session.
_llm_reasoning_instance = None
def get_llm_reasoning_service(api_key: Optional[str] = None) -> LLMReasoningService:
    """
    Get or create LLM Reasoning service singleton.
    Args:
        api_key: Optional API key (uses env var if not provided)
    Returns:
        LLMReasoningService instance (created lazily on first call)
    """
    global _llm_reasoning_instance
    instance = _llm_reasoning_instance
    if instance is None:
        instance = LLMReasoningService(api_key=api_key)
        _llm_reasoning_instance = instance
    return instance
252
if __name__ == "__main__":
    print("=" * 80)
    print("🧪 Testing LLM Reasoning Service")
    print("=" * 80)

    def _report(result):
        # Shared pretty-printer for one reasoning result.
        print(f"Answer: {result['answer']}")
        print(f"Reasoning Chain:")
        for i, step in enumerate(result['reasoning_chain'], 1):
            print(f" {i}. {step}")
        print(f"Confidence: {result['confidence']}")

    try:
        service = get_llm_reasoning_service()

        print("\n📝 Test 1: Can a candle melt?")
        _report(service.reason_with_facts(
            object_name="candle",
            facts={
                "materials": ["wax", "wick"],
                "categories": ["light source", "household item"],
                "uses": ["provide light", "decoration"]
            },
            question="Can this melt?"
        ))

        print("\n📝 Test 2: Would ice cream survive in the desert?")
        _report(service.reason_with_facts(
            object_name="ice cream",
            facts={
                "materials": ["milk", "sugar", "cream"],
                "categories": ["frozen dessert", "food"],
                "properties": ["cold", "frozen"]
            },
            question="Would this survive in the desert?"
        ))

        print("\n" + "=" * 80)
        print("✅ Tests completed!")
    except ValueError as e:
        # Raised by the service constructor when no API key is configured.
        print(f"\n❌ Error: {e}")
        print("Please set GROQ_API_KEY environment variable")
model.py ADDED
@@ -0,0 +1,224 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ from torch import nn
3
+ import clip
4
+ from transformers import GPT2Model
5
class AttentionDecoder(nn.Module):
    """GRU decoder that attends over a fused image/question context vector.

    Each step scores the token embedding against the (broadcast) context to
    form attention weights, pools the context, and feeds the concatenation
    of embedding and pooled context through a GRU followed by LayerNorm and
    a vocabulary projection.
    """

    def __init__(self, hidden_size, vocab_size, num_layers=1, dropout=0.3):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # NOTE: submodule names are part of the checkpoint format — keep stable.
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.attention = nn.Linear(hidden_size * 2, 1)
        # GRU-internal dropout only applies between stacked layers, so it is
        # disabled for a single-layer decoder (avoids a PyTorch warning).
        gru_dropout = dropout if num_layers > 1 else 0
        self.gru = nn.GRU(
            input_size=hidden_size * 2,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=gru_dropout,
        )
        self.ln_gru = nn.LayerNorm(hidden_size)
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids, context, hidden):
        """Decode one or more steps.

        Args:
            input_ids: [batch] or [batch, seq] answer token ids.
            context: [batch, hidden_size] fused image/question context.
            hidden: [num_layers, batch, hidden_size] GRU state.

        Returns:
            Tuple of logits [batch, seq, vocab_size] and the updated hidden state.
        """
        # Promote a single-step input to a length-1 sequence.
        if input_ids.dim() == 1:
            input_ids = input_ids.unsqueeze(1)
        tok_emb = self.embedding(input_ids).float()
        seq_len = tok_emb.size(1)
        ctx = context.unsqueeze(1).expand(-1, seq_len, -1)
        # Score each position's (embedding, context) pair; softmax over time.
        scores = self.attention(torch.cat([tok_emb, ctx], dim=-1))
        weights = torch.softmax(scores, dim=1)
        pooled_ctx = (ctx * weights).sum(dim=1, keepdim=True)
        step_input = torch.cat([tok_emb, pooled_ctx.expand(-1, seq_len, -1)], dim=-1)
        out, hidden = self.gru(step_input, hidden)
        out = self.ln_gru(out)
        return self.output(out), hidden
33
class VQAModel(nn.Module):
    """VQA model: frozen CLIP image encoder + frozen DistilGPT-2 question
    encoder, gated feature fusion, and an attention-based GRU decoder that
    generates the answer token-by-token.

    Both backbones start fully frozen; ``unfreeze_clip_layers`` /
    ``unfreeze_gpt2_layers`` selectively re-enable gradients for fine-tuning
    (which also makes the encoders run without ``torch.no_grad()``).
    """
    def __init__(
        self,
        vocab_size=3600,
        question_max_len=16,
        answer_max_len=10,
        hidden_size=512,
        num_layers=2,
        dropout=0.3,
        device='cuda',
        pad_token_id=0,
        bos_token_id=1,
        eos_token_id=2,
        unk_token_id=3
    ):
        """Build encoders, fusion layers and decoder.

        Args:
            vocab_size: Answer vocabulary size.
            question_max_len: Maximum question length (tokens).
            answer_max_len: Maximum generated answer length (tokens).
            hidden_size: Shared projection/decoder hidden dimension.
            num_layers: GRU layers in the decoder.
            dropout: Dropout probability in fusion and decoder.
            device: Device CLIP/GPT-2 are loaded onto.
            pad_token_id, bos_token_id, eos_token_id, unk_token_id:
                Special token ids of the answer vocabulary.
        """
        super().__init__()
        self.device = device
        self.question_max_len = question_max_len
        self.answer_max_len = answer_max_len
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # Toggled by the unfreeze_* helpers; controls no_grad usage below.
        self.fine_tuning_mode = False
        self.pad_token_id = pad_token_id
        self.bos_token_id = bos_token_id
        self.eos_token_id = eos_token_id
        self.unk_token_id = unk_token_id
        # Frozen CLIP ViT-B/16 image encoder (512-d global features).
        self.clip_model, self.clip_preprocess = clip.load("ViT-B/16", device=device)
        for p in self.clip_model.parameters():
            p.requires_grad = False
        # Frozen DistilGPT-2 question encoder (768-d hidden states).
        self.gpt2_model = GPT2Model.from_pretrained("distilgpt2")
        self.gpt2_model.to(device)
        for p in self.gpt2_model.parameters():
            p.requires_grad = False
        # Project both modalities into the shared hidden space.
        self.img_proj = nn.Linear(512, hidden_size)
        self.q_proj = nn.Linear(768, hidden_size)
        self.gate_layer = nn.Linear(hidden_size*2, hidden_size)
        self.fusion = nn.Sequential(
            nn.Linear(hidden_size*3, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size)
        )
        self.decoder = AttentionDecoder(hidden_size, vocab_size, num_layers, dropout)
    def unfreeze_clip_layers(self, num_layers=2):
        """Unfreeze the last `num_layers` CLIP visual transformer blocks, plus
        the output projection and final LayerNorm, for fine-tuning."""
        self.clip_model.train()
        # CLIP weights may be fp16; fine-tuning needs fp32.
        self.clip_model.visual.float()
        total_blocks = len(self.clip_model.visual.transformer.resblocks)
        for i, block in enumerate(self.clip_model.visual.transformer.resblocks):
            if i >= total_blocks - num_layers:
                for p in block.parameters():
                    p.requires_grad = True
        # `visual.proj` may be a raw Parameter (standard CLIP) or a module.
        if hasattr(self.clip_model.visual, "proj") and self.clip_model.visual.proj is not None:
            if isinstance(self.clip_model.visual.proj, torch.nn.Parameter):
                self.clip_model.visual.proj.requires_grad = True
            else:
                for p in self.clip_model.visual.proj.parameters():
                    p.requires_grad = True
        if hasattr(self.clip_model.visual, "ln_post"):
            for p in self.clip_model.visual.ln_post.parameters():
                p.requires_grad = True
        self.fine_tuning_mode = True
        print(f"Unfrozen last {num_layers} CLIP layers")
    def unfreeze_gpt2_layers(self, num_layers=1):
        """Unfreeze the last `num_layers` GPT-2 blocks and the final LayerNorm,
        casting their weights to float32 for stable training."""
        self.gpt2_model.train()
        total_layers = len(self.gpt2_model.h)
        for i, layer in enumerate(self.gpt2_model.h):
            if i >= total_layers - num_layers:
                for p in layer.parameters():
                    p.requires_grad = True
                    p.data = p.data.float()
        for p in self.gpt2_model.ln_f.parameters():
            p.requires_grad = True
            p.data = p.data.float()
        self.fine_tuning_mode = True
        print(f"Unfrozen last {num_layers} GPT-2 layers")
    def encode_image(self, images):
        """Encode images with CLIP; returns L2-normalized float32 features."""
        if self.fine_tuning_mode:
            images = images.float()
            img_features = self.clip_model.encode_image(images)
        else:
            # Backbone frozen: skip autograd bookkeeping.
            with torch.no_grad():
                img_features = self.clip_model.encode_image(images)
        img_features = img_features / img_features.norm(dim=-1, keepdim=True)
        return img_features.float()
    def encode_question(self, input_ids, attention_mask):
        """Encode the question with GPT-2, mean-pooled over non-pad tokens;
        returns L2-normalized float32 features."""
        if self.fine_tuning_mode:
            outputs = self.gpt2_model(input_ids=input_ids, attention_mask=attention_mask)
        else:
            with torch.no_grad():
                outputs = self.gpt2_model(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden = outputs.last_hidden_state
        # Masked mean pooling: zero pad positions, divide by true length.
        mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)
        masked = last_hidden * mask
        sum_hidden = masked.sum(dim=1)
        lengths = mask.sum(dim=1).clamp(min=1e-6)
        text_features = sum_hidden / lengths
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        return text_features.float()
    def fuse_features(self, img_features, q_features):
        """Gated fusion of image and question features into one context vector."""
        x = torch.cat([img_features, q_features], dim=-1)
        # Learned gate decides how much each modality contributes.
        gate = torch.sigmoid(self.gate_layer(x))
        fused = gate * img_features + (1-gate) * q_features
        fused = self.fusion(torch.cat([fused, x], dim=-1))
        return fused
    def forward(self, images, questions, answer_input_ids=None):
        """Teacher-forced logits when `answer_input_ids` is given; otherwise
        greedy autoregressive generation of up to `answer_max_len` tokens.

        Args:
            images: Preprocessed image batch.
            questions: Dict with "input_ids" and "attention_mask".
            answer_input_ids: Optional gold answer ids for teacher forcing.
        """
        img_features = self.encode_image(images)
        img_features = self.img_proj(img_features).float()
        q_features = self.encode_question(questions["input_ids"], questions["attention_mask"])
        q_features = self.q_proj(q_features).float()
        batch_size = img_features.size(0)
        context = self.fuse_features(img_features, q_features)
        hidden = torch.zeros(self.num_layers, batch_size, self.hidden_size,
                             device=self.device, dtype=torch.float)
        if answer_input_ids is not None:
            logits, _ = self.decoder(answer_input_ids, context, hidden)
            return logits
        else:
            generated = torch.full((batch_size, self.answer_max_len), self.pad_token_id,
                                   dtype=torch.long, device=self.device)
            generated[:, 0] = self.bos_token_id
            for t in range(1, self.answer_max_len):
                current_input = generated[:, t-1]
                logits, hidden = self.decoder(current_input, context, hidden)
                next_tokens = logits.squeeze(1).argmax(dim=-1)
                generated[:, t] = next_tokens
                # Stop early once every sequence in the batch emitted EOS.
                if (next_tokens == self.eos_token_id).all():
                    break
            return generated
    def generate_with_beam_search(self, images, questions, beam_width=5):
        """Beam-search decoding, one sample at a time.

        Returns a [batch, answer_max_len] tensor padded with `pad_token_id`.
        """
        batch_size = images.size(0)
        all_results = []
        for b in range(batch_size):
            img = images[b:b+1]
            q_ids = questions["input_ids"][b:b+1]
            q_mask = questions["attention_mask"][b:b+1]
            img_features = self.encode_image(img)
            img_features = self.img_proj(img_features).float()
            q_features = self.encode_question(q_ids, q_mask)
            q_features = self.q_proj(q_features).float()
            context = self.fuse_features(img_features, q_features)
            initial_hidden = torch.zeros(self.num_layers, 1, self.hidden_size,
                                         device=self.device, dtype=torch.float)
            # Each beam entry: (token sequence, cumulative log-prob, GRU state).
            beams = [(
                torch.full((1, 1), self.bos_token_id, dtype=torch.long, device=self.device),
                0.0,
                initial_hidden
            )]
            completed_beams = []
            for t in range(1, self.answer_max_len):
                candidates = []
                for seq, score, hidden in beams:
                    # Beams ending in EOS are finished; park them.
                    if seq[0, -1].item() == self.eos_token_id:
                        completed_beams.append((seq, score))
                        continue
                    current_input = seq[:, -1]
                    logits, new_hidden = self.decoder(current_input, context, hidden)
                    log_probs = torch.log_softmax(logits.squeeze(1), dim=-1)
                    top_log_probs, top_indices = torch.topk(log_probs[0], beam_width)
                    for i in range(beam_width):
                        next_token = top_indices[i].unsqueeze(0).unsqueeze(0)
                        new_seq = torch.cat([seq, next_token], dim=1)
                        new_score = score + top_log_probs[i].item()
                        candidates.append((new_seq, new_score, new_hidden))
                # All beams finished: nothing left to expand.
                if len(candidates) == 0:
                    break
                beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
            all_beams = completed_beams + [(seq, score) for seq, score, _ in beams]
            if len(all_beams) == 0:
                result = torch.full((1, self.answer_max_len), self.pad_token_id,
                                    dtype=torch.long, device=self.device)
            else:
                # Length-normalized score (exponent 0.7) so very short
                # sequences are not unfairly favored.
                best_beam = max(all_beams, key=lambda x: x[1] / (x[0].size(1) ** 0.7))
                result = torch.full((1, self.answer_max_len), self.pad_token_id,
                                    dtype=torch.long, device=self.device)
                seq_len = min(best_beam[0].size(1), self.answer_max_len)
                result[:, :seq_len] = best_beam[0][:, :seq_len]
            all_results.append(result)
        return torch.cat(all_results, dim=0)
212
if __name__ == "__main__":
    # Smoke test: push one dummy image/question pair through the model.
    device = "cuda"
    model = VQAModel(device=device).to(device)
    model.eval()

    dummy_image = torch.randn(1, 3, 224, 224).to(device)
    dummy_ids = torch.tensor([[1, 10, 20, 30, 2, 0, 0]]).to(device)
    dummy_mask = torch.tensor([[1, 1, 1, 1, 1, 0, 0]]).to(device)
    batch = {
        "input_ids": dummy_ids,
        "attention_mask": dummy_mask
    }

    print(model(dummy_image, batch))
model_spatial.py ADDED
@@ -0,0 +1,309 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ from torch import nn
3
+ import clip
4
+ from transformers import GPT2Model
5
+ import math
6
class SpatialAdapter(nn.Module):
    """
    Spatial Adapter with Multi-Head Cross-Attention for spatial reasoning.
    Processes CLIP patch features (14x14 grid) with question guidance.

    Pipeline: add fixed 2D positional encodings -> cross-attention
    (patches attend to the question) -> self-attention among patches ->
    feed-forward -> question-guided pooling into one context vector.
    """
    def __init__(self, patch_dim=512, question_dim=512, hidden_dim=512, num_heads=8, dropout=0.3):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        assert hidden_dim % num_heads == 0, "hidden_dim must be divisible by num_heads"
        # Fixed sinusoidal 2D position table for the 14x14 patch grid; a
        # buffer so it moves with the module across devices but isn't trained.
        self.register_buffer('pos_encoding_2d', self._create_2d_positional_encoding(14, 14, patch_dim))
        self.patch_proj = nn.Linear(patch_dim, hidden_dim)
        self.question_proj = nn.Linear(question_dim, hidden_dim)
        # Cross-attention projections: patches (queries) attend to the
        # question (keys/values).
        self.cross_attn_query = nn.Linear(hidden_dim, hidden_dim)
        self.cross_attn_key = nn.Linear(hidden_dim, hidden_dim)
        self.cross_attn_value = nn.Linear(hidden_dim, hidden_dim)
        self.cross_attn_out = nn.Linear(hidden_dim, hidden_dim)
        # Self-attention projections: patches attend to each other.
        self.self_attn_query = nn.Linear(hidden_dim, hidden_dim)
        self.self_attn_key = nn.Linear(hidden_dim, hidden_dim)
        self.self_attn_value = nn.Linear(hidden_dim, hidden_dim)
        self.self_attn_out = nn.Linear(hidden_dim, hidden_dim)
        # Position-wise feed-forward block (transformer-style, 4x expansion).
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim * 4),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim * 4, hidden_dim),
            nn.Dropout(dropout)
        )
        self.ln1 = nn.LayerNorm(hidden_dim)
        self.ln2 = nn.LayerNorm(hidden_dim)
        self.ln3 = nn.LayerNorm(hidden_dim)
        self.dropout = nn.Dropout(dropout)
    def _create_2d_positional_encoding(self, height, width, dim):
        """Create 2D positional encoding for spatial grid.

        Returns a [1, height*width, dim] tensor. The first half of the
        channels encodes the row index, the second half the column index,
        with sin/cos interleaved within each half.
        """
        # Row/column index of each flattened grid cell.
        pos_h = torch.arange(height).unsqueeze(1).repeat(1, width).flatten()
        pos_w = torch.arange(width).unsqueeze(0).repeat(height, 1).flatten()
        pe = torch.zeros(height * width, dim)
        div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
        pe[:, 0:dim//2:2] = torch.sin(pos_h.unsqueeze(1) * div_term[:dim//4])
        pe[:, 1:dim//2:2] = torch.cos(pos_h.unsqueeze(1) * div_term[:dim//4])
        pe[:, dim//2::2] = torch.sin(pos_w.unsqueeze(1) * div_term[:dim//4])
        pe[:, dim//2+1::2] = torch.cos(pos_w.unsqueeze(1) * div_term[:dim//4])
        return pe.unsqueeze(0)
    def _multi_head_attention(self, query, key, value, num_heads):
        """Generic multi-head attention implementation.

        Inputs are already-projected [batch, seq, hidden_dim] tensors;
        returns (context [batch, seq_q, hidden_dim], attention weights).
        """
        batch_size = query.size(0)
        # [B, T, H] -> [B, heads, T, head_dim]
        Q = query.view(batch_size, -1, num_heads, self.head_dim).transpose(1, 2)
        K = key.view(batch_size, -1, num_heads, self.head_dim).transpose(1, 2)
        V = value.view(batch_size, -1, num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention.
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn_weights = torch.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        context = torch.matmul(attn_weights, V)
        # Merge heads back: [B, heads, T, head_dim] -> [B, T, H].
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.hidden_dim)
        return context, attn_weights
    def forward(self, patch_features, question_features):
        """
        Args:
            patch_features: [batch_size, num_patches, patch_dim] - CLIP patch features
            question_features: [batch_size, question_dim] - Question encoding
        Returns:
            spatial_context: [batch_size, hidden_dim] - Spatially-aware context
        """
        batch_size, num_patches, _ = patch_features.shape
        # Inject 2D positional information before projecting.
        patch_features = patch_features + self.pos_encoding_2d[:, :num_patches, :].to(patch_features.device)
        patches = self.patch_proj(patch_features)
        question = self.question_proj(question_features.unsqueeze(1))
        # Block 1: cross-attention (patches -> question), residual + LayerNorm.
        Q_cross = self.cross_attn_query(patches)
        K_cross = self.cross_attn_key(question)
        V_cross = self.cross_attn_value(question)
        cross_context, _ = self._multi_head_attention(Q_cross, K_cross, V_cross, self.num_heads)
        cross_out = self.cross_attn_out(cross_context)
        patches = self.ln1(patches + self.dropout(cross_out))
        # Block 2: self-attention among patches, residual + LayerNorm.
        Q_self = self.self_attn_query(patches)
        K_self = self.self_attn_key(patches)
        V_self = self.self_attn_value(patches)
        self_context, _ = self._multi_head_attention(Q_self, K_self, V_self, self.num_heads)
        self_out = self.self_attn_out(self_context)
        patches = self.ln2(patches + self.dropout(self_out))
        # Block 3: position-wise feed-forward, residual + LayerNorm.
        ffn_out = self.ffn(patches)
        patches = self.ln3(patches + ffn_out)
        # Question-guided pooling over patches into one context vector.
        attn_scores = torch.matmul(patches, question.transpose(1, 2))
        attn_weights = torch.softmax(attn_scores, dim=1)
        spatial_context = (patches * attn_weights).sum(dim=1)
        return spatial_context
92
+ class VQAModelWithSpatialAdapter(nn.Module):
93
+ """
94
+ Enhanced VQA Model with Spatial Adapter for spatial reasoning.
95
+ Uses patch-based CLIP features instead of global encoding.
96
+ """
97
+ def __init__(
98
+ self,
99
+ base_model,
100
+ hidden_size=512,
101
+ num_heads=8,
102
+ dropout=0.3
103
+ ):
104
+ super().__init__()
105
+ self.device = base_model.device
106
+ self.question_max_len = base_model.question_max_len
107
+ self.answer_max_len = base_model.answer_max_len
108
+ self.vocab_size = base_model.vocab_size
109
+ self.hidden_size = hidden_size
110
+ self.num_layers = base_model.num_layers
111
+ self.fine_tuning_mode = base_model.fine_tuning_mode
112
+ self.pad_token_id = base_model.pad_token_id
113
+ self.bos_token_id = base_model.bos_token_id
114
+ self.eos_token_id = base_model.eos_token_id
115
+ self.unk_token_id = base_model.unk_token_id
116
+ self.clip_model = base_model.clip_model
117
+ self.clip_preprocess = base_model.clip_preprocess
118
+ self.gpt2_model = base_model.gpt2_model
119
+ self.decoder = base_model.decoder
120
+ self.spatial_adapter = SpatialAdapter(
121
+ patch_dim=512,
122
+ question_dim=768,
123
+ hidden_dim=hidden_size,
124
+ num_heads=num_heads,
125
+ dropout=dropout
126
+ )
127
+ self.spatial_context_proj = nn.Linear(hidden_size, hidden_size)
128
+ self.q_proj = nn.Linear(768, hidden_size)
129
+ self.spatial_fusion = nn.Sequential(
130
+ nn.Linear(hidden_size * 2, hidden_size),
131
+ nn.GELU(),
132
+ nn.Dropout(dropout),
133
+ nn.Linear(hidden_size, hidden_size),
134
+ nn.LayerNorm(hidden_size)
135
+ )
136
+ def extract_clip_patch_features(self, images):
137
+ """
138
+ Extract patch features from CLIP instead of global features.
139
+ Returns: [batch_size, num_patches, patch_dim]
140
+ """
141
+ clip_dtype = self.clip_model.visual.conv1.weight.dtype
142
+ images = images.to(clip_dtype)
143
+ if self.fine_tuning_mode:
144
+ x = self.clip_model.visual.conv1(images)
145
+ x = x.reshape(x.shape[0], x.shape[1], -1)
146
+ x = x.permute(0, 2, 1)
147
+ class_token = self.clip_model.visual.class_embedding.to(x.dtype) + torch.zeros(
148
+ x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device
149
+ )
150
+ x = torch.cat([class_token, x], dim=1)
151
+ x = x + self.clip_model.visual.positional_embedding.to(x.dtype)
152
+ x = self.clip_model.visual.ln_pre(x)
153
+ x = x.permute(1, 0, 2)
154
+ x = self.clip_model.visual.transformer(x)
155
+ x = x.permute(1, 0, 2)
156
+ patch_features = x[:, 1:, :]
157
+ if hasattr(self.clip_model.visual, 'proj') and self.clip_model.visual.proj is not None:
158
+ if isinstance(self.clip_model.visual.proj, torch.nn.Parameter):
159
+ patch_features = patch_features @ self.clip_model.visual.proj
160
+ else:
161
+ patch_features = self.clip_model.visual.proj(patch_features)
162
+ else:
163
+ with torch.no_grad():
164
+ x = self.clip_model.visual.conv1(images)
165
+ x = x.reshape(x.shape[0], x.shape[1], -1)
166
+ x = x.permute(0, 2, 1)
167
+ class_token = self.clip_model.visual.class_embedding.to(x.dtype) + torch.zeros(
168
+ x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device
169
+ )
170
+ x = torch.cat([class_token, x], dim=1)
171
+ x = x + self.clip_model.visual.positional_embedding.to(x.dtype)
172
+ x = self.clip_model.visual.ln_pre(x)
173
+ x = x.permute(1, 0, 2)
174
+ x = self.clip_model.visual.transformer(x)
175
+ x = x.permute(1, 0, 2)
176
+ patch_features = x[:, 1:, :]
177
+ if hasattr(self.clip_model.visual, 'proj') and self.clip_model.visual.proj is not None:
178
+ if isinstance(self.clip_model.visual.proj, torch.nn.Parameter):
179
+ patch_features = patch_features @ self.clip_model.visual.proj
180
+ else:
181
+ patch_features = self.clip_model.visual.proj(patch_features)
182
+ return patch_features.float()
183
+ def encode_question(self, input_ids, attention_mask):
184
+ """Encode question using GPT-2 (same as base model)"""
185
+ if self.fine_tuning_mode:
186
+ outputs = self.gpt2_model(input_ids=input_ids, attention_mask=attention_mask)
187
+ else:
188
+ with torch.no_grad():
189
+ outputs = self.gpt2_model(input_ids=input_ids, attention_mask=attention_mask)
190
+ last_hidden = outputs.last_hidden_state
191
+ mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)
192
+ masked = last_hidden * mask
193
+ sum_hidden = masked.sum(dim=1)
194
+ lengths = mask.sum(dim=1).clamp(min=1e-6)
195
+ text_features = sum_hidden / lengths
196
+ text_features = text_features / text_features.norm(dim=-1, keepdim=True)
197
+ return text_features.float()
198
def forward(self, images, questions, answer_input_ids=None):
    """Forward pass with spatial adapter.

    Args:
        images: preprocessed image batch, shape (B, 3, 224, 224).
        questions: dict with "input_ids" and "attention_mask" tensors.
        answer_input_ids: teacher-forcing answer tokens. When provided,
            the method returns decoder logits (training path); otherwise
            it greedily decodes up to ``answer_max_len`` tokens and
            returns the generated token ids.
    """
    # CLIP patch grid plus pooled question features feed the spatial adapter.
    patch_features = self.extract_clip_patch_features(images)
    q_features = self.encode_question(questions["input_ids"], questions["attention_mask"])
    spatial_context = self.spatial_adapter(patch_features, q_features)
    spatial_context = self.spatial_context_proj(spatial_context)
    q_projected = self.q_proj(q_features)
    # Fuse question-conditioned spatial context with the projected question.
    fused = self.spatial_fusion(torch.cat([spatial_context, q_projected], dim=-1))
    batch_size = images.size(0)
    # Zero-initialized GRU state for the decoder.
    hidden = torch.zeros(self.num_layers, batch_size, self.hidden_size,
                         device=self.device, dtype=torch.float)
    if answer_input_ids is not None:
        # Training path: teacher forcing, return per-token logits.
        logits, _ = self.decoder(answer_input_ids, fused, hidden)
        return logits
    else:
        # Inference path: greedy autoregressive decoding starting from BOS.
        generated = torch.full((batch_size, self.answer_max_len), self.pad_token_id,
                               dtype=torch.long, device=self.device)
        generated[:, 0] = self.bos_token_id
        for t in range(1, self.answer_max_len):
            current_input = generated[:, t-1]
            logits, hidden = self.decoder(current_input, fused, hidden)
            next_tokens = logits.squeeze(1).argmax(dim=-1)
            generated[:, t] = next_tokens
            # Stop early only once every sequence in the batch emitted EOS.
            if (next_tokens == self.eos_token_id).all():
                break
        return generated
226
def generate_with_beam_search(self, images, questions, beam_width=5):
    """Beam search generation (same as base model but with spatial features).

    Processes the batch one sample at a time; each beam is a tuple of
    (sequence tensor, cumulative log-prob, GRU hidden state). Returns a
    (batch, answer_max_len) tensor of token ids, right-padded with PAD.
    """
    batch_size = images.size(0)
    all_results = []
    for b in range(batch_size):
        img = images[b:b+1]
        q_ids = questions["input_ids"][b:b+1]
        q_mask = questions["attention_mask"][b:b+1]
        # Same feature pipeline as forward(), for a single sample.
        patch_features = self.extract_clip_patch_features(img)
        q_features = self.encode_question(q_ids, q_mask)
        spatial_context = self.spatial_adapter(patch_features, q_features)
        spatial_context = self.spatial_context_proj(spatial_context)
        q_projected = self.q_proj(q_features)
        context = self.spatial_fusion(torch.cat([spatial_context, q_projected], dim=-1))
        initial_hidden = torch.zeros(self.num_layers, 1, self.hidden_size,
                                     device=self.device, dtype=torch.float)
        # Single initial beam holding only the BOS token.
        beams = [(
            torch.full((1, 1), self.bos_token_id, dtype=torch.long, device=self.device),
            0.0,
            initial_hidden
        )]
        completed_beams = []
        for t in range(1, self.answer_max_len):
            candidates = []
            for seq, score, hidden in beams:
                # Beams that already ended in EOS are retired untouched.
                if seq[0, -1].item() == self.eos_token_id:
                    completed_beams.append((seq, score))
                    continue
                current_input = seq[:, -1]
                logits, new_hidden = self.decoder(current_input, context, hidden)
                log_probs = torch.log_softmax(logits.squeeze(1), dim=-1)
                top_log_probs, top_indices = torch.topk(log_probs[0], beam_width)
                # Expand each live beam by its top-k continuations.
                for i in range(beam_width):
                    next_token = top_indices[i].unsqueeze(0).unsqueeze(0)
                    new_seq = torch.cat([seq, next_token], dim=1)
                    new_score = score + top_log_probs[i].item()
                    candidates.append((new_seq, new_score, new_hidden))
            # All beams finished: nothing left to expand.
            if len(candidates) == 0:
                break
            # Keep only the beam_width highest-scoring candidates.
            beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
        all_beams = completed_beams + [(seq, score) for seq, score, _ in beams]
        if len(all_beams) == 0:
            result = torch.full((1, self.answer_max_len), self.pad_token_id,
                                dtype=torch.long, device=self.device)
        else:
            # Length-normalized score (alpha = 0.7) to avoid favoring
            # very short sequences.
            best_beam = max(all_beams, key=lambda x: x[1] / (x[0].size(1) ** 0.7))
            result = torch.full((1, self.answer_max_len), self.pad_token_id,
                                dtype=torch.long, device=self.device)
            seq_len = min(best_beam[0].size(1), self.answer_max_len)
            result[:, :seq_len] = best_beam[0][:, :seq_len]
        all_results.append(result)
    return torch.cat(all_results, dim=0)
278
if __name__ == "__main__":
    # Smoke test: wrap the base VQAModel with the spatial adapter and push
    # a fake two-sample batch through feature extraction and generation.
    print("Testing Spatial Adapter Architecture...")
    device = "cuda" if torch.cuda.is_available() else "cpu"
    from model import VQAModel
    base_model = VQAModel(device=device).to(device)
    spatial_model = VQAModelWithSpatialAdapter(base_model).to(device)
    spatial_model.eval()
    fake_image = torch.randn(2, 3, 224, 224).to(device)
    # Ids follow the model defaults: 1 = BOS, 2 = EOS, 0 = PAD.
    fake_question_ids = torch.tensor([[1, 10, 20, 30, 2, 0, 0], [1, 15, 25, 35, 2, 0, 0]]).to(device)
    fake_question_mask = torch.tensor([[1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 0, 0]]).to(device)
    question_batch = {
        "input_ids": fake_question_ids,
        "attention_mask": fake_question_mask
    }
    print(f"\nInput shapes:")
    print(f" Images: {fake_image.shape}")
    print(f" Questions: {fake_question_ids.shape}")
    with torch.no_grad():
        patch_features = spatial_model.extract_clip_patch_features(fake_image)
        print(f"\nPatch features shape: {patch_features.shape}")
        print(f" Expected: [2, 196, 512] (batch_size, num_patches, patch_dim)")
        output = spatial_model(fake_image, question_batch)
        print(f"\nGenerated output shape: {output.shape}")
        print(f" Expected: [2, {spatial_model.answer_max_len}]")
    # Parameter accounting: only the adapter (and anything explicitly
    # unfrozen) should be trainable.
    total_params = sum(p.numel() for p in spatial_model.parameters())
    spatial_adapter_params = sum(p.numel() for p in spatial_model.spatial_adapter.parameters())
    trainable_params = sum(p.numel() for p in spatial_model.parameters() if p.requires_grad)
    print(f"\nParameter counts:")
    print(f" Total parameters: {total_params:,}")
    print(f" Spatial adapter parameters: {spatial_adapter_params:,}")
    print(f" Trainable parameters: {trainable_params:,}")
    print("\n✓ Spatial adapter architecture test passed!")
models/__pycache__/model.cpython-312.pyc ADDED
Binary file (16.5 kB). View file
 
models/model.py ADDED
@@ -0,0 +1,224 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ from torch import nn
3
+ import clip
4
+ from transformers import GPT2Model
5
+ class AttentionDecoder(nn.Module):
6
+ def __init__(self, hidden_size, vocab_size, num_layers=1, dropout=0.3):
7
+ super().__init__()
8
+ self.hidden_size = hidden_size
9
+ self.num_layers = num_layers
10
+ self.embedding = nn.Embedding(vocab_size, hidden_size)
11
+ self.attention = nn.Linear(hidden_size * 2, 1)
12
+ self.gru = nn.GRU(
13
+ input_size=hidden_size * 2,
14
+ hidden_size=hidden_size,
15
+ num_layers=num_layers,
16
+ batch_first=True,
17
+ dropout=dropout if num_layers > 1 else 0
18
+ )
19
+ self.ln_gru = nn.LayerNorm(hidden_size)
20
+ self.output = nn.Linear(hidden_size, vocab_size)
21
+ def forward(self, input_ids, context, hidden):
22
+ if input_ids.dim() == 1:
23
+ input_ids = input_ids.unsqueeze(1)
24
+ embeddings = self.embedding(input_ids).float()
25
+ context_expanded = context.unsqueeze(1).expand(-1, embeddings.size(1), -1)
26
+ combined = torch.cat([embeddings, context_expanded], dim=-1)
27
+ attn_weights = torch.softmax(self.attention(combined), dim=1)
28
+ attended_context = (context_expanded * attn_weights).sum(dim=1, keepdim=True)
29
+ gru_input = torch.cat([embeddings, attended_context.expand(-1, embeddings.size(1), -1)], dim=-1)
30
+ gru_output, hidden = self.gru(gru_input, hidden)
31
+ gru_output = self.ln_gru(gru_output)
32
+ return self.output(gru_output), hidden
33
class VQAModel(nn.Module):
    """CLIP + GPT-2 VQA model with gated fusion and a GRU answer decoder.

    CLIP (ViT-B/16) encodes the image and distilgpt2 encodes the question;
    both backbones are frozen at construction and can be partially unfrozen
    via the ``unfreeze_*`` methods for fine-tuning. Fused features condition
    an :class:`AttentionDecoder` that emits answer tokens.
    """

    def __init__(
        self,
        vocab_size=3600,
        question_max_len=16,
        answer_max_len=10,
        hidden_size=512,
        num_layers=2,
        dropout=0.3,
        device='cuda',
        pad_token_id=0,
        bos_token_id=1,
        eos_token_id=2,
        unk_token_id=3
    ):
        super().__init__()
        self.device = device
        self.question_max_len = question_max_len
        self.answer_max_len = answer_max_len
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # False = frozen backbones wrapped in no_grad; flipped by unfreeze_*.
        self.fine_tuning_mode = False
        self.pad_token_id = pad_token_id
        self.bos_token_id = bos_token_id
        self.eos_token_id = eos_token_id
        self.unk_token_id = unk_token_id
        # Frozen CLIP image encoder (preprocess transform kept for callers).
        self.clip_model, self.clip_preprocess = clip.load("ViT-B/16", device=device)
        for p in self.clip_model.parameters():
            p.requires_grad = False
        # Frozen distilgpt2 question encoder.
        self.gpt2_model = GPT2Model.from_pretrained("distilgpt2")
        self.gpt2_model.to(device)
        for p in self.gpt2_model.parameters():
            p.requires_grad = False
        # Project CLIP (512-d) and GPT-2 (768-d) features to a shared space.
        self.img_proj = nn.Linear(512, hidden_size)
        self.q_proj = nn.Linear(768, hidden_size)
        self.gate_layer = nn.Linear(hidden_size*2, hidden_size)
        self.fusion = nn.Sequential(
            nn.Linear(hidden_size*3, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size)
        )
        self.decoder = AttentionDecoder(hidden_size, vocab_size, num_layers, dropout)

    def unfreeze_clip_layers(self, num_layers=2):
        """Unfreeze the last ``num_layers`` CLIP visual blocks for fine-tuning."""
        self.clip_model.train()
        # CLIP ships half-precision weights; cast visual tower to fp32 so
        # gradient updates are numerically stable.
        self.clip_model.visual.float()
        total_blocks = len(self.clip_model.visual.transformer.resblocks)
        for i, block in enumerate(self.clip_model.visual.transformer.resblocks):
            if i >= total_blocks - num_layers:
                for p in block.parameters():
                    p.requires_grad = True
        # The output projection may be a bare Parameter or a Module
        # depending on the CLIP build.
        if hasattr(self.clip_model.visual, "proj") and self.clip_model.visual.proj is not None:
            if isinstance(self.clip_model.visual.proj, torch.nn.Parameter):
                self.clip_model.visual.proj.requires_grad = True
            else:
                for p in self.clip_model.visual.proj.parameters():
                    p.requires_grad = True
        if hasattr(self.clip_model.visual, "ln_post"):
            for p in self.clip_model.visual.ln_post.parameters():
                p.requires_grad = True
        self.fine_tuning_mode = True
        print(f"Unfrozen last {num_layers} CLIP layers")

    def unfreeze_gpt2_layers(self, num_layers=1):
        """Unfreeze the last ``num_layers`` GPT-2 blocks plus the final LayerNorm."""
        self.gpt2_model.train()
        total_layers = len(self.gpt2_model.h)
        for i, layer in enumerate(self.gpt2_model.h):
            if i >= total_layers - num_layers:
                for p in layer.parameters():
                    p.requires_grad = True
                    p.data = p.data.float()
        for p in self.gpt2_model.ln_f.parameters():
            p.requires_grad = True
            p.data = p.data.float()
        self.fine_tuning_mode = True
        print(f"Unfrozen last {num_layers} GPT-2 layers")

    def encode_image(self, images):
        """Return L2-normalized float32 CLIP image features, shape (B, 512)."""
        if self.fine_tuning_mode:
            images = images.float()
            img_features = self.clip_model.encode_image(images)
        else:
            with torch.no_grad():
                img_features = self.clip_model.encode_image(images)
        img_features = img_features / img_features.norm(dim=-1, keepdim=True)
        return img_features.float()

    def encode_question(self, input_ids, attention_mask):
        """Return L2-normalized, mean-pooled GPT-2 features, shape (B, 768)."""
        if self.fine_tuning_mode:
            outputs = self.gpt2_model(input_ids=input_ids, attention_mask=attention_mask)
        else:
            with torch.no_grad():
                outputs = self.gpt2_model(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden = outputs.last_hidden_state
        # Mask out padding before averaging over the sequence dimension;
        # clamp guards against an all-padding row.
        mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)
        masked = last_hidden * mask
        sum_hidden = masked.sum(dim=1)
        lengths = mask.sum(dim=1).clamp(min=1e-6)
        text_features = sum_hidden / lengths
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        return text_features.float()

    def fuse_features(self, img_features, q_features):
        """Gated fusion of projected image and question features."""
        x = torch.cat([img_features, q_features], dim=-1)
        # Sigmoid gate decides, per dimension, how much image vs. question
        # signal to keep before the MLP refinement.
        gate = torch.sigmoid(self.gate_layer(x))
        fused = gate * img_features + (1-gate) * q_features
        fused = self.fusion(torch.cat([fused, x], dim=-1))
        return fused

    def forward(self, images, questions, answer_input_ids=None):
        """Teacher-forced logits when ``answer_input_ids`` is given, otherwise
        greedy decoding of up to ``answer_max_len`` tokens."""
        img_features = self.encode_image(images)
        img_features = self.img_proj(img_features).float()
        q_features = self.encode_question(questions["input_ids"], questions["attention_mask"])
        q_features = self.q_proj(q_features).float()
        batch_size = img_features.size(0)
        context = self.fuse_features(img_features, q_features)
        hidden = torch.zeros(self.num_layers, batch_size, self.hidden_size,
                             device=self.device, dtype=torch.float)
        if answer_input_ids is not None:
            # Training path: teacher forcing.
            logits, _ = self.decoder(answer_input_ids, context, hidden)
            return logits
        else:
            # Inference path: greedy decoding from BOS, PAD-filled output.
            generated = torch.full((batch_size, self.answer_max_len), self.pad_token_id,
                                   dtype=torch.long, device=self.device)
            generated[:, 0] = self.bos_token_id
            for t in range(1, self.answer_max_len):
                current_input = generated[:, t-1]
                logits, hidden = self.decoder(current_input, context, hidden)
                next_tokens = logits.squeeze(1).argmax(dim=-1)
                generated[:, t] = next_tokens
                # Stop only once every sequence in the batch emitted EOS.
                if (next_tokens == self.eos_token_id).all():
                    break
            return generated

    def generate_with_beam_search(self, images, questions, beam_width=5):
        """Per-sample beam search; returns (batch, answer_max_len) token ids."""
        batch_size = images.size(0)
        all_results = []
        for b in range(batch_size):
            img = images[b:b+1]
            q_ids = questions["input_ids"][b:b+1]
            q_mask = questions["attention_mask"][b:b+1]
            img_features = self.encode_image(img)
            img_features = self.img_proj(img_features).float()
            q_features = self.encode_question(q_ids, q_mask)
            q_features = self.q_proj(q_features).float()
            context = self.fuse_features(img_features, q_features)
            initial_hidden = torch.zeros(self.num_layers, 1, self.hidden_size,
                                         device=self.device, dtype=torch.float)
            # Each beam: (sequence, cumulative log-prob, GRU hidden state).
            beams = [(
                torch.full((1, 1), self.bos_token_id, dtype=torch.long, device=self.device),
                0.0,
                initial_hidden
            )]
            completed_beams = []
            for t in range(1, self.answer_max_len):
                candidates = []
                for seq, score, hidden in beams:
                    # Retire beams that already emitted EOS.
                    if seq[0, -1].item() == self.eos_token_id:
                        completed_beams.append((seq, score))
                        continue
                    current_input = seq[:, -1]
                    logits, new_hidden = self.decoder(current_input, context, hidden)
                    log_probs = torch.log_softmax(logits.squeeze(1), dim=-1)
                    top_log_probs, top_indices = torch.topk(log_probs[0], beam_width)
                    for i in range(beam_width):
                        next_token = top_indices[i].unsqueeze(0).unsqueeze(0)
                        new_seq = torch.cat([seq, next_token], dim=1)
                        new_score = score + top_log_probs[i].item()
                        candidates.append((new_seq, new_score, new_hidden))
                if len(candidates) == 0:
                    break
                # Keep the beam_width best-scoring candidates.
                beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
            all_beams = completed_beams + [(seq, score) for seq, score, _ in beams]
            if len(all_beams) == 0:
                result = torch.full((1, self.answer_max_len), self.pad_token_id,
                                    dtype=torch.long, device=self.device)
            else:
                # Length-normalized score (alpha = 0.7) discourages
                # degenerate very-short answers.
                best_beam = max(all_beams, key=lambda x: x[1] / (x[0].size(1) ** 0.7))
                result = torch.full((1, self.answer_max_len), self.pad_token_id,
                                    dtype=torch.long, device=self.device)
                seq_len = min(best_beam[0].size(1), self.answer_max_len)
                result[:, :seq_len] = best_beam[0][:, :seq_len]
            all_results.append(result)
        return torch.cat(all_results, dim=0)
212
if __name__ == "__main__":
    # Smoke test: build the model and run one fake image/question batch
    # through greedy decoding.
    # Fix: device was hard-coded to "cuda", which crashes on CPU-only
    # machines; fall back like the spatial-adapter test script does.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = VQAModel(device=device).to(device)
    model.eval()
    fake_image = torch.randn(1, 3, 224, 224).to(device)
    # Ids follow the model defaults: 1 = BOS, 2 = EOS, 0 = PAD.
    fake_question_ids = torch.tensor([[1, 10, 20, 30, 2, 0, 0]]).to(device)
    fake_question_mask = torch.tensor([[1, 1, 1, 1, 1, 0, 0]]).to(device)
    question_batch = {
        "input_ids": fake_question_ids,
        "attention_mask": fake_question_mask
    }
    output = model(fake_image, question_batch)
    print(output)
quick_start.bat ADDED
@@ -0,0 +1,71 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
@echo off
REM Quick Start Script for VQA Mobile App
REM This script helps you start the backend and frontend
REM Fix: delayed expansion must be enabled, otherwise !IP! inside the
REM for-loop block below is printed literally instead of the address.
setlocal EnableDelayedExpansion

echo ========================================
echo VQA Mobile App - Quick Start
echo ========================================
echo.

REM Get current IP address
echo [1/3] Checking your IP address...
for /f "tokens=2 delims=:" %%a in ('ipconfig ^| findstr /c:"IPv4"') do (
    set IP=%%a
    REM Strip the leading space left by the "delims=:" split.
    set IP=!IP:~1!
    echo Your IP: !IP!
)

echo.
echo [2/3] Current Configuration:
echo Backend: http://10.215.4.143:8000
echo Frontend: ui/src/config/api.js
echo.

echo IMPORTANT: Make sure both laptop and mobile are on the SAME network!
echo.

echo [3/3] Choose an option:
echo 1. Start Backend (Python)
echo 2. Start Frontend (Expo)
echo 3. Start Both (Opens 2 terminals)
echo 4. Exit
echo.

choice /c 1234 /n /m "Enter your choice (1-4): "

REM choice sets ERRORLEVEL to the 1-based index; test highest first.
if errorlevel 4 goto :end
if errorlevel 3 goto :both
if errorlevel 2 goto :frontend
if errorlevel 1 goto :backend

:backend
echo.
echo Starting Backend Server...
echo Make sure you have activated your Python environment!
echo.
python backend_api.py
goto :end

:frontend
echo.
echo Starting Expo Frontend...
cd ui
npx expo start
goto :end

:both
echo.
echo Starting both Backend and Frontend...
echo Opening Backend in new window...
start cmd /k "python backend_api.py"
timeout /t 3 /nobreak >nul
echo Opening Frontend in new window...
start cmd /k "cd ui && npx expo start"
echo.
echo Both servers are starting in separate windows!
goto :end

:end
echo.
echo Done!
pause
requirements_api.txt ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ fastapi>=0.115.6
2
+ uvicorn>=0.34.0
3
+ python-multipart>=0.0.20
4
+ pillow>=11.1.0
5
+ torch>=2.0.0
6
+ torchvision>=0.15.0
7
+ transformers>=4.30.0
8
+ ftfy
9
+ regex
10
+ tqdm
11
+ git+https://github.com/openai/CLIP.git
12
+ groq>=0.4.0
13
+ python-dotenv>=1.0.0
14
+ huggingface-hub
scores/feature.txt ADDED
@@ -0,0 +1,77 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ================================================================================
2
+ EVALUATION RESULTS
3
+ ================================================================================
4
+
5
+ 📊 Accuracy Metrics:
6
+ Exact Match Accuracy: 50.17% (63805/135256)
7
+ VQA Accuracy: 15.72%
8
+
9
+ 📊 ANLS Metrics:
10
+ Average ANLS (τ=0.5): 50.18%
11
+ ANLS Std Dev: 48.96%
12
+
13
+ 📊 Additional Statistics:
14
+ Total samples: 135256
15
+ Avg prediction length: 1.13 words
16
+ Avg GT length: 1.10 words
17
+
18
+ ================================================================================
19
+ SAMPLE PREDICTIONS
20
+ ================================================================================
21
+
22
+ 🏆 Best Predictions (Highest ANLS):
23
+ --------------------------------------------------------------------------------
24
+
25
+ Ground Truth: tusks
26
+ Prediction: tusks
27
+ ANLS: 1.0000
28
+ Exact Match: ✓
29
+
30
+ Ground Truth: seagull
31
+ Prediction: seagull
32
+ ANLS: 1.0000
33
+ Exact Match: ✓
34
+
35
+ Ground Truth: bedroom
36
+ Prediction: bedroom
37
+ ANLS: 1.0000
38
+ Exact Match: ✓
39
+
40
+ Ground Truth: cake
41
+ Prediction: cake
42
+ ANLS: 1.0000
43
+ Exact Match: ✓
44
+
45
+ Ground Truth: short
46
+ Prediction: short
47
+ ANLS: 1.0000
48
+ Exact Match: ✓
49
+
50
+ ================================================================================
51
+ ⚠️ Worst Predictions (Lowest ANLS):
52
+ --------------------------------------------------------------------------------
53
+
54
+ Ground Truth: mirror
55
+ Prediction: car
56
+ ANLS: 0.0000
57
+ Exact Match: ✗
58
+
59
+ Ground Truth: towel
60
+ Prediction: toy
61
+ ANLS: 0.0000
62
+ Exact Match: ✗
63
+
64
+ Ground Truth: book
65
+ Prediction: camera
66
+ ANLS: 0.0000
67
+ Exact Match: ✗
68
+
69
+ Ground Truth: usa
70
+ Prediction: england
71
+ ANLS: 0.0000
72
+ Exact Match: ✗
73
+
74
+ Ground Truth: red and yellow
75
+ Prediction: green
76
+ ANLS: 0.0000
77
+ Exact Match: ✗
scores/score.py ADDED
@@ -0,0 +1,300 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
import os
import torch
import pandas as pd
from PIL import Image
from transformers import GPT2Tokenizer
from model import VQAModel
from model_spatial import VQAModelWithSpatialAdapter
from train import Vocab
from tqdm import tqdm
import numpy as np
# Optional dependency: self-install python-Levenshtein on first run.
try:
    from Levenshtein import distance as levenshtein_distance
except ImportError:
    print("Installing python-Levenshtein...")
    import subprocess
    subprocess.check_call(['pip', 'install', 'python-Levenshtein'])
    from Levenshtein import distance as levenshtein_distance
# --- Evaluation configuration ---
MODEL_TYPE = "feature"  # "feature" = base model, "spatial" = adapter model
SPATIAL_CHECKPOINT = "./output2/spatial_adapter_v2_2/vqa_spatial_checkpoint.pt"
FEATURE_CHECKPOINT = "./output2/feature_extraction/vqa_checkpoint.pt"
CSV_PATH = "./gen_vqa_v2/metadata.csv"
IMG_DIR = "./gen_vqa_v2"
MAX_SAMPLES = None  # cap on evaluated rows; None evaluates the entire CSV
24
def load_spatial_model(checkpoint_path, device='cuda'):
    """Restore the spatial-adapter VQA model plus its vocab and tokenizer.

    The checkpoint is expected to carry the answer vocabulary, the special
    token ids, and a state dict for VQAModelWithSpatialAdapter.
    """
    # NOTE(review): torch.load unpickles arbitrary objects — only load
    # checkpoint files from trusted sources.
    checkpoint = torch.load(checkpoint_path, map_location=device)
    # Rebuild the answer vocabulary saved alongside the weights.
    vocab = Vocab()
    vocab.vocab = checkpoint['vocab']
    vocab.vocab_size = len(checkpoint['vocab'])
    vocab.word2idx = checkpoint['word2idx']
    vocab.idx2word = checkpoint['idx2word']
    vocab.pad_token_id = checkpoint['pad_token_id']
    vocab.bos_token_id = checkpoint['bos_token_id']
    vocab.eos_token_id = checkpoint['eos_token_id']
    vocab.unk_token_id = checkpoint['unk_token_id']
    tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    base_model = VQAModel(
        vocab_size=len(checkpoint['vocab']),
        device=device,
        question_max_len=checkpoint.get('question_max_len', 20),
        answer_max_len=checkpoint.get('answer_max_len', 12),
        pad_token_id=checkpoint['pad_token_id'],
        bos_token_id=checkpoint['bos_token_id'],
        eos_token_id=checkpoint['eos_token_id'],
        unk_token_id=checkpoint['unk_token_id'],
        hidden_size=512,
        num_layers=2
    ).to(device)
    # Embedding table must match the tokenizer after adding [PAD].
    base_model.gpt2_model.resize_token_embeddings(len(tokenizer))
    model = VQAModelWithSpatialAdapter(
        base_model=base_model,
        hidden_size=512,
        num_heads=8,
        dropout=0.3
    ).to(device)
    # strict=False tolerates keys added/removed between training runs.
    model.load_state_dict(checkpoint['model_state_dict'], strict=False)
    model.eval()
    return model, vocab, tokenizer
60
def load_feature_model(checkpoint_path, device='cuda'):
    """Restore the base (feature-extraction) VQAModel plus vocab/tokenizer.

    Mirrors load_spatial_model but without the spatial-adapter wrapper.
    """
    # NOTE(review): torch.load unpickles arbitrary objects — only load
    # checkpoint files from trusted sources.
    checkpoint = torch.load(checkpoint_path, map_location=device)
    # Rebuild the answer vocabulary saved alongside the weights.
    vocab = Vocab()
    vocab.vocab = checkpoint['vocab']
    vocab.vocab_size = len(checkpoint['vocab'])
    vocab.word2idx = checkpoint['word2idx']
    vocab.idx2word = checkpoint['idx2word']
    vocab.pad_token_id = checkpoint['pad_token_id']
    vocab.bos_token_id = checkpoint['bos_token_id']
    vocab.eos_token_id = checkpoint['eos_token_id']
    vocab.unk_token_id = checkpoint['unk_token_id']
    model = VQAModel(
        vocab_size=len(checkpoint['vocab']),
        device=device,
        question_max_len=checkpoint.get('question_max_len', 20),
        answer_max_len=checkpoint.get('answer_max_len', 12),
        pad_token_id=checkpoint['pad_token_id'],
        bos_token_id=checkpoint['bos_token_id'],
        eos_token_id=checkpoint['eos_token_id'],
        unk_token_id=checkpoint['unk_token_id'],
        hidden_size=512,
        num_layers=2
    ).to(device)
    tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    # Embedding table must match the tokenizer after adding [PAD].
    model.gpt2_model.resize_token_embeddings(len(tokenizer))
    # strict=False tolerates keys added/removed between training runs.
    model.load_state_dict(checkpoint['model_state_dict'], strict=False)
    model.eval()
    return model, vocab, tokenizer
90
def generate_answer(model, vocab, tokenizer, image_path, question, device='cuda'):
    """Answer one question about one image and return the decoded string.

    Prefers the model's beam search when available, otherwise falls back
    to argmax over the forward logits.
    """
    image = Image.open(image_path).convert('RGB')
    # CLIP's own preprocessing pipeline, with a batch dimension added.
    image = model.clip_preprocess(image).unsqueeze(0).to(device)
    question_tokens = tokenizer(
        question,
        padding='max_length',
        truncation=True,
        max_length=model.question_max_len,
        return_tensors='pt'
    )
    questions = {
        'input_ids': question_tokens['input_ids'].to(device),
        'attention_mask': question_tokens['attention_mask'].to(device)
    }
    with torch.no_grad():
        if hasattr(model, 'generate_with_beam_search'):
            generated = model.generate_with_beam_search(image, questions, beam_width=5)
        else:
            logits = model(image, questions)
            generated = logits.argmax(dim=-1)
    # vocab.decoder maps generated token ids back to the answer string.
    return vocab.decoder(generated[0].cpu().numpy())
111
def exact_match_accuracy(predictions, ground_truths):
    """Case-insensitive exact-match accuracy.

    Args:
        predictions: list of predicted answer strings.
        ground_truths: list of ground-truth answer strings (same length).

    Returns:
        (accuracy, matches): percentage of exact matches plus the raw count.
    """
    matches = 0
    for guess, truth in zip(predictions, ground_truths):
        if guess.strip().lower() == truth.strip().lower():
            matches += 1
    # Guard against division by zero on an empty evaluation set.
    accuracy = (matches / len(predictions)) * 100 if predictions else 0
    return accuracy, matches
124
def vqa_accuracy(predictions, ground_truths_list):
    """Official VQA accuracy: min(#agreeing annotations / 3, 1) per question.

    Each question may carry several human annotations; a prediction earns
    full credit once at least three annotators agree with it. With a single
    annotation per question this reduces to exact match scaled by 1/3.

    Args:
        predictions: list of predicted answer strings.
        ground_truths_list: list of annotation lists. Plain strings are
            accepted too and wrapped into singleton lists.

    Returns:
        VQA accuracy score as a percentage (0-100).
    """
    # Normalize a flat list of strings into a list of singleton lists.
    if not isinstance(ground_truths_list[0], list):
        ground_truths_list = [[gt] for gt in ground_truths_list]
    scores = []
    for pred, annotations in zip(predictions, ground_truths_list):
        normalized = pred.strip().lower()
        agreeing = sum(1 for ann in annotations if normalized == ann.strip().lower())
        scores.append(min(agreeing / 3.0, 1.0))
    return (sum(scores) / len(scores)) * 100 if scores else 0
146
def calculate_anls(prediction, ground_truth, threshold=0.5):
    """Normalized Levenshtein similarity for one prediction/answer pair.

    Similarity below ``threshold`` is truncated to 0, per the ANLS metric.

    Args:
        prediction: predicted answer string.
        ground_truth: reference answer string.
        threshold: minimum similarity to receive credit (default 0.5).

    Returns:
        ANLS score in [0, 1].
    """
    pred = prediction.strip().lower()
    gt = ground_truth.strip().lower()
    if not gt:
        # Empty reference: only an empty prediction counts as correct.
        return 1.0 if not pred else 0.0
    edit_dist = levenshtein_distance(pred, gt)
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 1.0
    similarity = 1 - (edit_dist / longest)
    return similarity if similarity >= threshold else 0.0
167
def average_anls(predictions, ground_truths, threshold=0.5):
    """Mean ANLS over a corpus, expressed as a percentage.

    Args:
        predictions: list of predicted answers.
        ground_truths: list of reference answers (same length).
        threshold: per-pair similarity cutoff passed to calculate_anls.

    Returns:
        (avg_anls, anls_scores): mean score in percent plus the raw
        per-sample scores in [0, 1].
    """
    anls_scores = [
        calculate_anls(pred, gt, threshold)
        for pred, gt in zip(predictions, ground_truths)
    ]
    avg_anls = (sum(anls_scores) / len(anls_scores)) * 100 if anls_scores else 0
    return avg_anls, anls_scores
183
if __name__ == "__main__":
    # End-to-end evaluation driver: load the configured model, generate an
    # answer for every row of the metadata CSV, score with exact match /
    # VQA accuracy / ANLS, then dump a report and a per-sample CSV.
    print("=" * 80)
    print("VQA EVALUATION: ACCURACY + ANLS")
    print("=" * 80)
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"\nDevice: {device}")
    print(f"Model: {MODEL_TYPE.upper()}\n")
    if MODEL_TYPE == "spatial":
        model, vocab, tokenizer = load_spatial_model(SPATIAL_CHECKPOINT, device)
    else:
        model, vocab, tokenizer = load_feature_model(FEATURE_CHECKPOINT, device)
    print("✓ Model loaded!\n")
    df = pd.read_csv(CSV_PATH)
    if MAX_SAMPLES:
        df = df.head(MAX_SAMPLES)
    print(f"Evaluating {len(df)} samples\n")
    print("Generating predictions...")
    predictions = []
    ground_truths = []
    for idx, row in tqdm(df.iterrows(), total=len(df)):
        image_path = os.path.join(IMG_DIR, row['image_path'])
        # Skip rows whose image file is missing on disk.
        if not os.path.exists(image_path):
            continue
        try:
            prediction = generate_answer(model, vocab, tokenizer,
                                         image_path, row['question'], device)
            ground_truth = row['answer']
            predictions.append(prediction)
            ground_truths.append(ground_truth)
        except Exception as e:
            # Best-effort evaluation: a single bad sample must not abort
            # the whole run.
            continue
    print(f"\n✓ Generated {len(predictions)} predictions\n")
    print("Calculating metrics...\n")
    exact_acc, exact_matches = exact_match_accuracy(predictions, ground_truths)
    vqa_acc = vqa_accuracy(predictions, ground_truths)
    anls_score, anls_scores = average_anls(predictions, ground_truths, threshold=0.5)
    print("=" * 80)
    print("EVALUATION RESULTS")
    print("=" * 80)
    print(f"\n📊 Accuracy Metrics:")
    print(f" Exact Match Accuracy: {exact_acc:.2f}% ({exact_matches}/{len(predictions)})")
    print(f" VQA Accuracy: {vqa_acc:.2f}%")
    print(f"\n📊 ANLS Metrics:")
    print(f" Average ANLS (τ=0.5): {anls_score:.2f}%")
    print(f" ANLS Std Dev: {np.std(anls_scores)*100:.2f}%")
    print(f"\n📊 Additional Statistics:")
    print(f" Total samples: {len(predictions)}")
    print(f" Avg prediction length: {np.mean([len(p.split()) for p in predictions]):.2f} words")
    print(f" Avg GT length: {np.mean([len(gt.split()) for gt in ground_truths]):.2f} words")
    print("\n" + "=" * 80)
    print("SAMPLE PREDICTIONS")
    print("=" * 80)
    # Indices sorted ascending by ANLS: tail = best, head = worst.
    sorted_indices = np.argsort(anls_scores)
    print("\n🏆 Best Predictions (Highest ANLS):")
    print("-" * 80)
    for i in sorted_indices[-5:][::-1]:
        print(f"\nGround Truth: {ground_truths[i]}")
        print(f"Prediction: {predictions[i]}")
        print(f"ANLS: {anls_scores[i]:.4f}")
        print(f"Exact Match: {'✓' if predictions[i].strip().lower() == ground_truths[i].strip().lower() else '✗'}")
    print("\n" + "=" * 80)
    print("⚠️ Worst Predictions (Lowest ANLS):")
    print("-" * 80)
    for i in sorted_indices[:5]:
        print(f"\nGround Truth: {ground_truths[i]}")
        print(f"Prediction: {predictions[i]}")
        print(f"ANLS: {anls_scores[i]:.4f}")
        print(f"Exact Match: {'✓' if predictions[i].strip().lower() == ground_truths[i].strip().lower() else '✗'}")
    print("\n" + "=" * 80)
    print("✅ EVALUATION COMPLETE")
    print("=" * 80)
    # Mirror the console report into <MODEL_TYPE>.txt for archival.
    with open(f"{MODEL_TYPE}.txt", "w", encoding="utf-8") as f:
        f.write("=" * 80 + "\n")
        f.write("EVALUATION RESULTS\n")
        f.write("=" * 80 + "\n")
        f.write("\n📊 Accuracy Metrics:\n")
        f.write(f" Exact Match Accuracy: {exact_acc:.2f}% ({exact_matches}/{len(predictions)})\n")
        f.write(f" VQA Accuracy: {vqa_acc:.2f}%\n")
        f.write("\n📊 ANLS Metrics:\n")
        f.write(f" Average ANLS (τ=0.5): {anls_score:.2f}%\n")
        f.write(f" ANLS Std Dev: {np.std(anls_scores)*100:.2f}%\n")
        f.write("\n📊 Additional Statistics:\n")
        f.write(f" Total samples: {len(predictions)}\n")
        f.write(f" Avg prediction length: {np.mean([len(p.split()) for p in predictions]):.2f} words\n")
        f.write(f" Avg GT length: {np.mean([len(gt.split()) for gt in ground_truths]):.2f} words\n")
        f.write("\n" + "=" * 80 + "\n")
        f.write("SAMPLE PREDICTIONS\n")
        f.write("=" * 80 + "\n")
        sorted_indices = np.argsort(anls_scores)
        f.write("\n🏆 Best Predictions (Highest ANLS):\n")
        f.write("-" * 80 + "\n")
        for i in sorted_indices[-5:][::-1]:
            f.write(f"\nGround Truth: {ground_truths[i]}\n")
            f.write(f"Prediction: {predictions[i]}\n")
            f.write(f"ANLS: {anls_scores[i]:.4f}\n")
            f.write(
                f"Exact Match: {'✓' if predictions[i].strip().lower() == ground_truths[i].strip().lower() else '✗'}\n"
            )
        f.write("\n" + "=" * 80 + "\n")
        f.write("⚠️ Worst Predictions (Lowest ANLS):\n")
        f.write("-" * 80 + "\n")
        for i in sorted_indices[:5]:
            f.write(f"\nGround Truth: {ground_truths[i]}\n")
            f.write(f"Prediction: {predictions[i]}\n")
            f.write(f"ANLS: {anls_scores[i]:.4f}\n")
            f.write(
                f"Exact Match: {'✓' if predictions[i].strip().lower() == ground_truths[i].strip().lower() else '✗'}\n"
            )
    # Per-sample results for downstream analysis.
    results_df = pd.DataFrame({
        'prediction': predictions,
        'ground_truth': ground_truths,
        'anls_score': anls_scores,
        'exact_match': [pred.strip().lower() == gt.strip().lower()
                        for pred, gt in zip(predictions, ground_truths)]
    })
    output_file = f"vqa_evaluation_{MODEL_TYPE}.csv"
    results_df.to_csv(output_file, index=False)
    print(f"\n💾 Results saved to: {output_file}")
scores/vqa_evaluation_feature.csv ADDED
The diff for this file is too large to render. See raw diff