Aikyam-Lab/gridvqa-models

	---
	language:
	- en
	tags:
	- vision-language
	- mdetr
	- xai
	license: mit
	model_index:
	- name: mdetr-gridvqa-pure
	  task: visual-question-answering
	- name: mdetr-gridvqa-spurious
	  task: visual-question-answering
	---

	# GridVQA-X Models

	This repository contains two paired reference models, M_pure and M_spur, that share an identical transformer architecture (MDETR). Together with their corresponding datasets, they form a diagnostic framework for evaluating whether Multimodal Explainable AI (MxAI) methods genuinely capture cross-modal synergy or merely report shallow feature correlations.

	## Model Descriptions

	### 1. M_pure (The Faithful Spatial Reasoner)
	* Training Distribution: Trained exclusively on the D_pure dataset.
	* Behavioral Dynamics: Trained with explanation-guided, two-phase optimization. Phase 1 enforces explicit visual-textual token alignment via L1 and generalized IoU losses. Phase 2 trains question answering with class-frequency-weighted cross-entropy to counteract answer-prior bias.
	* Capabilities: Internalizes the causal spatial-relational synergy of the task, achieving robust accuracy on both clean and heavily distractor-crowded grids.
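The two loss families named above can be illustrated with a minimal, dependency-free sketch. This is not the repository's training code: the function names, the `(x1, y1, x2, y2)` box format, the unit loss weights, and the "balanced" inverse-frequency weighting scheme are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def box_area(box):
    """Area of an (x1, y1, x2, y2) box; zero if the box is degenerate."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def generalized_iou(pred, target):
    """Generalized IoU of two boxes with positive area, in (-1, 1]."""
    inter = box_area((max(pred[0], target[0]), max(pred[1], target[1]),
                      min(pred[2], target[2]), min(pred[3], target[3])))
    union = box_area(pred) + box_area(target) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = box_area((min(pred[0], target[0]), min(pred[1], target[1]),
                     max(pred[2], target[2]), max(pred[3], target[3])))
    if union <= 0.0 or hull <= 0.0:
        raise ValueError("boxes must have positive area")
    return inter / union - (hull - union) / hull


def grounding_loss(pred, target, l1_weight=1.0, giou_weight=1.0):
    """Phase-1 style box loss: weighted L1 plus (1 - GIoU).

    The weights are hypothetical placeholders, not the authors' values.
    """
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1_weight * l1 + giou_weight * (1.0 - generalized_iou(pred, target))


def inverse_frequency_weights(labels):
    """One common class weighting (assumed here): n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


def weighted_cross_entropy(logits, target, classes, weights):
    """Phase-2 style loss for one example: -w[target] * log_softmax[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return weights[target] * (log_z - logits[classes.index(target)])


weights = inverse_frequency_weights(["yes", "yes", "yes", "no"])
loss = weighted_cross_entropy([0.0, 0.0], "no", ["yes", "no"], weights)
```

In this toy label set the rare answer "no" receives a larger weight than the frequent answer "yes", so errors on rare answers cost more. That is the general mechanism by which frequency weighting discourages a model from simply predicting the most common answer.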

	### 2. M_spur (The Shortcut / Bag-of-Words Model)
	* Training Distribution: Trained exclusively on the D_spur dataset.
	* Behavioral Dynamics: Structurally forced to rely on cross-modal shortcuts during training. It skips relational spatial geometry entirely and maps keywords directly to target visual volume.
	* Capabilities: Achieves perfect accuracy (1.000) on its native spurious distribution, but fails catastrophically on D_pure multi-hop queries (8%-14% accuracy).
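To see why a bag-of-words shortcut cannot support relational reasoning, consider the toy sketch below. The two questions are hypothetical examples, not drawn from D_pure or D_spur, and `bag_of_words` is an illustrative helper rather than part of the released models.

```python
from collections import Counter


def bag_of_words(question):
    """Order-free representation: lowercase token counts only."""
    return Counter(question.lower().replace("?", "").split())


# Hypothetical questions that ask about opposite sides of the red circle.
q1 = "Which shape is left of the red circle?"
q2 = "The red circle is left of which shape?"

same_bag = bag_of_words(q1) == bag_of_words(q2)
```

The two questions generally have different answers, yet they share exactly the same token counts. Any model that relies only on keyword presence therefore cannot tell them apart, which is the kind of failure the D_pure multi-hop queries are designed to expose.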

	## Intended Diagnostic Use
	These models are released explicitly to stress-test vision-language explainability algorithms (e.g., DIME, MultiSHAP, MultiViz, EMAP, InterSHAP):
	* The Litmus Test: A faithful explainer must output completely different attribution heatmaps or synergy scalars for M_pure and M_spur on the same input question.
	* The Reality Check: If your explainer highlights identical spatial regions for both models, it suffers from "model blindness" or is simply behaving as a superficial object detector.
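The litmus test above could be automated roughly as follows. This is a sketch under stated assumptions: attribution maps are taken to be equal-sized 2D lists of non-negative scores, cosine similarity is only one of several reasonable comparison measures, and the `0.9` threshold is an arbitrary placeholder rather than a value from this benchmark.

```python
import math


def cosine_similarity(map_a, map_b):
    """Cosine similarity between two equal-sized 2D attribution maps."""
    flat_a = [v for row in map_a for v in row]
    flat_b = [v for row in map_b for v in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm = (math.sqrt(sum(x * x for x in flat_a))
            * math.sqrt(sum(y * y for y in flat_b)))
    return dot / norm if norm else 0.0


def looks_model_blind(map_pure, map_spur, threshold=0.9):
    """Flag an explainer whose maps for M_pure and M_spur nearly coincide.

    The threshold is a hypothetical default, not a calibrated value.
    """
    return cosine_similarity(map_pure, map_spur) >= threshold


# Toy maps: attribution mass sits in opposite corners of a 2x2 grid.
map_pure = [[0.9, 0.1], [0.0, 0.0]]
map_spur = [[0.0, 0.0], [0.1, 0.9]]
flagged = looks_model_blind(map_pure, map_spur)
```

An explainer that returns the same map for both models on the same input would be flagged, while the disjoint toy maps here would not. In practice the comparison would be aggregated over many questions rather than decided on a single pair.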

	## Performance Benchmark Metrics

	| Evaluation Metric | M_pure on D_pure | M_spur on D_spur | M_spur on D_pure |
	| :--- | :---: | :---: | :---: |
	| Global Accuracy | >99% | 100% | Catastrophic Failure (8%-14% on multi-hop) |
	| Causal Pathway | True Spatial Relations | Bag-of-Words Shortcut | Unimodal Feature Collapse |