Aikyam-Lab/gridvqa-models

@@ -1,11 +1,12 @@
 ---
 language:
 - en
 tags:
 - vision-language
 - mdetr
 - xai
-license: mit
 model_index:
 - name: mdetr-gridvqa-pure
   task: visual-question-answering
@@ -17,6 +18,9 @@ model_index:
 This repository contains two paired reference models, **M_pure** and **M_spur**, built on identical transformer architectures (**MDETR**). These models, coupled with their corresponding datasets, together form a diagnostic framework to evaluate if Multimodal Explainable AI (MxAI) methods genuinely capture cross-modal synergy or simply report shallow feature correlations.
 ## Model Descriptions
 ### 1. M_pure (The Faithful Spatial Reasoner)
@@ -39,4 +43,18 @@ These models are released explicitly to stress-test vision-language explainabili
 | Evaluation Metric | M_pure on D_pure | M_spur on D_spur | M_spur on D_pure |
 | :--- | :---: | :---: | :---: |
 | **Global Accuracy** | >99% | 100% | **Catastrophic Failure** (8%-14% on multi-hop) |
-| **Causal Pathway** | True Spatial Relations | Bag-of-Words Shortcut | Unimodal Feature Collapse |

 ---
 language:
 - en
+license: mit
+pipeline_tag: image-text-to-text
 tags:
 - vision-language
 - mdetr
 - xai
 model_index:
 - name: mdetr-gridvqa-pure
   task: visual-question-answering
 This repository contains two paired reference models, **M_pure** and **M_spur**, built on identical transformer architectures (**MDETR**). These models, coupled with their corresponding datasets, together form a diagnostic framework to evaluate if Multimodal Explainable AI (MxAI) methods genuinely capture cross-modal synergy or simply report shallow feature correlations.
+These models are presented in the paper [GridVQA-X: A Framework for Evaluating Multimodal Explainability Methods](https://huggingface.co/papers/2606.14740).
+The official training and evaluation code can be found in the [GitHub Repository](https://github.com/AikyamLab/grid-vqax).
 ## Model Descriptions
 ### 1. M_pure (The Faithful Spatial Reasoner)
 | Evaluation Metric | M_pure on D_pure | M_spur on D_spur | M_spur on D_pure |
 | :--- | :---: | :---: | :---: |
 | **Global Accuracy** | >99% | 100% | **Catastrophic Failure** (8%-14% on multi-hop) |
+| **Causal Pathway** | True Spatial Relations | Bag-of-Words Shortcut | Unimodal Feature Collapse |
+## Citation
+```bibtex
+@misc{belsare2026gridvqaxframeworkevaluatingmultimodal,
+      title={GridVQA-X: A Framework for Evaluating Multimodal Explainability Methods},
+      author={Sujay Belsare and Sudarshan Nikhil and Sushant Kumar and Ponnurangam Kumaraguru and Chirag Agarwal},
+      year={2026},
+      eprint={2606.14740},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2606.14740},
+}
+```