# Fix: Removed Hardcoded Patterns from Neuro-Symbolic VQA

## Problem Identified

The `_detect_objects_with_clip()` method in `semantic_neurosymbolic_vqa.py` contained a **predefined list of object categories**. This is essentially pattern matching, which defeats the purpose of a truly neuro-symbolic approach.
```python
# ❌ OLD CODE - Hardcoded categories (pattern matching!)
object_categories = [
    "food", "soup", "noodles", "rice", "meat", "vegetable", "fruit",
    "bowl", "plate", "cup", "glass", "spoon", "fork", "knife", ...
]
```
This is **not acceptable** because:

- It limits detection to the predefined categories only
- It is essentially pattern matching, not true neural understanding
- It violates the neuro-symbolic principle of learning from data
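To make the first point concrete, here is a minimal, self-contained sketch of why a fixed label list caps what can ever be detected. The names below (`OBJECT_CATEGORIES`, `detect_with_fixed_list`, the score dictionary) are illustrative stand-ins, not the repository's actual code; real CLIP zero-shot matching ranks image-text similarity the same way, but only over the labels it is given.

```python
# Hypothetical sketch: zero-shot matching against a fixed list.
# Anything outside the list can never be detected, no matter how
# strongly the image matches it.

OBJECT_CATEGORIES = ["food", "soup", "noodles", "bowl", "plate"]  # fixed list

def detect_with_fixed_list(image_label_scores: dict) -> list:
    """Rank ONLY the predefined categories by similarity score."""
    scored = [(c, image_label_scores.get(c, 0.0)) for c in OBJECT_CATEGORIES]
    return [c for c, s in sorted(scored, key=lambda x: -x[1]) if s > 0.5]

# An image of a taco scores highest on an unlisted label...
scores = {"taco": 0.97, "food": 0.62, "plate": 0.55}
print(detect_with_fixed_list(scores))  # "taco" can never appear: ['food', 'plate']
```

However good the underlying encoder is, the output vocabulary is frozen at whatever the list's author anticipated.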
## Solution Applied

### 1. Deprecated `_detect_objects_with_clip()`

The method now returns an empty list and warns that it is deprecated:
```python
# ✅ NEW CODE - No predefined lists!
def _detect_objects_with_clip(self, image_features, image_path=None):
    """
    NOTE: This method is deprecated in favor of using the VQA model
    directly from ensemble_vqa_app.py.
    """
    print("⚠️ _detect_objects_with_clip is deprecated")
    print("✅ Use VQA model's _detect_multiple_objects() instead")
    return []
```
### 2. Updated `answer_with_clip_features()`

It now **requires** objects to be provided by the VQA model:
```python
# ✅ Objects must come from VQA model, not predefined lists
def answer_with_clip_features(
    self,
    image_features,
    question,
    image_path=None,
    detected_objects: Optional[List[str]] = None,  # REQUIRED!
):
    if not detected_objects:
        print("⚠️ No objects provided - neuro-symbolic reasoning requires VQA-detected objects")
        return None
```
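The guard itself can be exercised in isolation. The sketch below is a standalone stand-in (the real method lives on a class and does symbolic reasoning after the guard; here that part is stubbed), just to show the contract: no VQA-detected objects, no answer.

```python
# Standalone sketch of the new guard. Only the guard mirrors the
# snippet above; the return value here is a placeholder stub.
from typing import List, Optional

def answer_with_clip_features(image_features, question: str,
                              image_path: Optional[str] = None,
                              detected_objects: Optional[List[str]] = None):
    if not detected_objects:
        print("⚠️ No objects provided - neuro-symbolic reasoning requires VQA-detected objects")
        return None
    # ...symbolic reasoning over the VQA-detected objects would go here...
    return {"objects": detected_objects, "question": question}

print(answer_with_clip_features(None, "What is this?"))  # None: guard fires
print(answer_with_clip_features(None, "What is this?",
                                detected_objects=["noodles"]))
```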
### 3. Ensemble VQA Uses True VQA Detection

The `ensemble_vqa_app.py` already uses `_detect_multiple_objects()`, which:

- Asks the VQA model **open-ended questions** such as "What is this?"
- Uses the model's learned knowledge, not predefined categories
- Generates objects dynamically based on visual understanding
```python
# ✅ TRUE NEURO-SYMBOLIC APPROACH
detected_objects = self._detect_multiple_objects(image, model, top_k=5)
# This asks the VQA model: "What is this?", "What food is this?", etc.
# NO predefined categories!
```
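The open-ended strategy can be sketched as follows. This is a hedged approximation of what `_detect_multiple_objects()` does, not its actual implementation: `vqa_answer` stands in for the real model call, and the probe questions are assumed from the comment above.

```python
# Hedged sketch of open-ended VQA-driven detection. `vqa_answer` is a
# stub for the real model call; any answer string is possible, so the
# output vocabulary is not capped by a predefined list.

PROBE_QUESTIONS = ["What is this?", "What food is this?",
                   "What objects are in the image?"]

def detect_multiple_objects(image, vqa_answer, top_k: int = 5) -> list:
    """Collect distinct answers to open-ended probes; no category list."""
    seen, objects = set(), []
    for q in PROBE_QUESTIONS:
        answer = vqa_answer(image, q).strip().lower()
        if answer and answer not in seen:
            seen.add(answer)
            objects.append(answer)
    return objects[:top_k]

# Stub model for illustration.
stub = lambda img, q: {"What is this?": "ramen"}.get(q, "noodles")
print(detect_multiple_objects(None, stub))  # ['ramen', 'noodles']
```

The key contrast with the old code: the set of detectable objects is whatever the VQA model can name, not a list someone typed in advance.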
## Result

✅ **Pure Neuro-Symbolic Pipeline**:

1. **VQA Model** detects objects using learned visual understanding (no predefined lists)
2. **Wikidata** provides factual knowledge about the detected objects
3. **LLM** performs Chain-of-Thought reasoning on the facts
4. **No pattern matching** anywhere in the pipeline
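The three stages above can be wired together as in the sketch below. Every stage is stubbed and every function name (`vqa_detect`, `wikidata_facts`, `llm_chain_of_thought`) is an assumption for illustration, not the repository's actual API; the point is only the data flow: detected objects → facts → reasoned answer.

```python
# Illustrative end-to-end sketch of the pipeline, all stages stubbed.

def vqa_detect(image) -> list:
    """Stage 1: learned visual understanding (stubbed)."""
    return ["noodles"]

def wikidata_facts(obj: str) -> list:
    """Stage 2: factual knowledge lookup (stubbed, offline)."""
    facts = {"noodles": ["noodles are a staple food made from unleavened dough"]}
    return facts.get(obj, [])

def llm_chain_of_thought(question: str, facts: list) -> str:
    """Stage 3: step-by-step reasoning over the retrieved facts (stubbed)."""
    steps = [f"Fact: {f}" for f in facts]
    steps.append("Therefore, the dish is noodle-based.")
    return "\n".join(steps)

objects = vqa_detect(None)
facts = [f for o in objects for f in wikidata_facts(o)]
print(llm_chain_of_thought("Is this dish noodle-based?", facts))
```

Note that no stage consults a hardcoded category list: stage 1 produces the vocabulary, and stages 2 and 3 only consume it.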
## Files Modified

- `semantic_neurosymbolic_vqa.py`:
  - Deprecated `_detect_objects_with_clip()`
  - Updated `answer_with_clip_features()` to require VQA-detected objects
  - Changed the knowledge-source label from "CLIP + Wikidata" to "VQA + Wikidata"
## Verification

The system now uses a **truly neuro-symbolic approach**:

- ✅ No hardcoded object categories
- ✅ No predefined patterns
- ✅ Pure learned visual understanding from the VQA model
- ✅ Symbolic reasoning from Wikidata + the LLM
- ✅ Chain-of-Thought transparency