sareena committed
Commit b252405 · verified · 1 Parent(s): 6dad701

Update README.md

Files changed (1):
  1. README.md +35 -5
README.md CHANGED
@@ -68,6 +68,39 @@ This setup preserved general reasoning ability while improving spatial accuracy.

  # Evaluation

+ ## Benchmark Tasks
+
+ ### SpatialQA Benchmark
+
+ This benchmark provides more realistic question-answer pairs than the
+ StepGame benchmark. It uses real place names instead of StepGame's abstract
+ letter placeholders, while still requiring the same multi-step geospatial
+ reasoning capability. It complements StepGame by testing broader spatial
+ logic in more realistic scenarios.
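+
+ To make the contrast concrete, here is a hand-written sketch of the two item
+ styles. These are illustrative records only, not actual rows from either
+ dataset.
+
+ ```python
+ # Illustrative only: hand-written items mimicking each benchmark's style.
+ # StepGame abstracts entities to letters; SpatialQA-style items use places.
+ stepgame_item = {
+     "story": ["A is to the left of B.", "B is above C."],
+     "question": "What is the relation of A to C?",
+ }
+ spatialqa_item = {
+     "story": ["The library is west of the park.",
+               "The park is north of the station."],
+     "question": "Where is the library relative to the station?",
+ }
+ ```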
+
+ ### bAbI Dataset (Task 17)
+
+ This benchmark, introduced by Facebook AI Research, comprises 20 synthetic
+ question-answering tasks designed to evaluate distinct reasoning abilities
+ in models. Task 17 ("Positional Reasoning") specifically assesses spatial
+ reasoning through textual descriptions. Pathfinding in combination with
+ positional reasoning can potentially help assess the model's performance on
+ tasks such as route calculation, a common application for a model fine-tuned
+ for geospatial reasoning.
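+
+ A minimal scoring sketch, assuming the standard bAbI format of a short story,
+ a yes/no question, and a gold answer. The sample item is hand-written in that
+ format, and `model_answer` is a hypothetical stand-in for the model call.
+
+ ```python
+ # Hand-written item in the bAbI Task 17 ("Positional Reasoning") style.
+ item = {
+     "story": ["The triangle is above the pink rectangle.",
+               "The blue square is to the left of the triangle."],
+     "question": "Is the pink rectangle to the right of the blue square?",
+     "answer": "yes",
+ }
+
+ def score(items, model_answer):
+     """Exact-match accuracy; model_answer(story, question) -> 'yes'/'no'."""
+     hits = sum(
+         model_answer(it["story"], it["question"]).strip().lower() == it["answer"]
+         for it in items
+     )
+     return hits / len(items)
+ ```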
+
+
+ ### MMLU Benchmark (Geography and Environmental Subsets)
+
+ MMLU is a comprehensive evaluation suite designed to assess a model's
+ reasoning and knowledge across a wide range of subjects. Its geography and
+ environmental science subsets offer an opportunity to test both
+ domain-specific knowledge and the broader reasoning abilities relevant to
+ geospatial tasks. The benchmark consists of multiple-choice questions
+ covering topics from elementary to advanced levels, and it is used here to
+ assess the model's general performance after fine-tuning and its ability to
+ apply knowledge across the subjects most relevant to the fine-tuning task.
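+
+ As an illustration, MMLU subsets can be loaded from the Hugging Face Hub. The
+ `cais/mmlu` dataset id, the `high_school_geography` config name, and the
+ prompt format below are assumptions of this sketch, not the card's actual
+ evaluation pipeline.
+
+ ```python
+ from datasets import load_dataset
+
+ # Assumed dataset id/config; MMLU exposes one config per subject on the Hub.
+ geo = load_dataset("cais/mmlu", "high_school_geography", split="test")
+
+ def to_prompt(ex):
+     # Each row carries a question, four choices, and an integer answer index.
+     options = "\n".join(f"{k}. {v}" for k, v in zip("ABCD", ex["choices"]))
+     return f"{ex['question']}\n{options}\nAnswer:", "ABCD"[ex["answer"]]
+ ```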
+
+
+ ## Comparison Models


  # Usage and Intended Uses
 
@@ -127,11 +160,8 @@ more similar to the downstream evaluations.
  *"StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in Texts."*
  *arXiv.Org*, 18 Apr. 2022. [https://arxiv.org/abs/2204.08292](https://arxiv.org/abs/2204.08292)

- 5. **Shi, Zhengxiang**, Qiang Zhang, and Aldo Lipani.
- *"StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in Texts."* arXiv preprint arXiv:2204.08292 (2022).
-
- 6. **Wang, Mila**, Xiang Lorraine Li, and William Yang Wang.
+ 5. **Wang, Mila**, Xiang Lorraine Li, and William Yang Wang.
  *"SpatialEval: A Benchmark for Spatial Reasoning Evaluation."* arXiv preprint arXiv:2104.08635 (2021).

- 7. **Weston, Jason**, Antoine Bordes, Sumit Chopra, and Tomas Mikolov.
+ 6. **Weston, Jason**, Antoine Bordes, Sumit Chopra, and Tomas Mikolov.
  *"Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks."* arXiv preprint arXiv:1502.05698 (2015).