Update README.md

README.md

This setup preserved general reasoning ability while improving spatial accuracy.
# Evaluation

## Benchmark Tasks

### SpatialQA Benchmark

This benchmark provides more realistic question-answer pairs than the
StepGame benchmark. It contains place names instead of abstract letter
labels for entities and answers, but still requires the same multi-step
geospatial reasoning capability. It complements StepGame by testing
broader spatial logic and more realistic scenarios.
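As an illustration of the evaluation setup, a SpatialQA-style multi-hop item and a minimal exact-match scorer might look like the sketch below. The item itself is invented for illustration and is not drawn from the actual dataset.

```python
# Minimal sketch of scoring a SpatialQA-style multi-hop item.
# The example item is invented for illustration, not from the real dataset.

def exact_match(prediction: str, answer: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == answer.strip().lower()

item = {
    "context": (
        "The library is north of the park. "
        "The cafe is east of the library."
    ),
    "question": "In which direction is the cafe from the park?",
    "answer": "north-east",
}

# In practice the prediction would come from the fine-tuned model.
prediction = "North-East"
print(exact_match(prediction, item["answer"]))  # True
```

Exact match is deliberately strict; a production harness would typically also normalize synonyms such as "northeast" vs. "north-east".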
### bAbI Dataset (Task 17)

This benchmark was introduced by Facebook AI Research. It includes 20
synthetic question-answering tasks designed to evaluate various reasoning
abilities in models. Task 17 ("Positional Reasoning") specifically assesses
spatial reasoning through textual descriptions. Pathfinding in combination
with spatial reasoning can help assess the model's performance on tasks
such as calculating routes, a common application for a geospatial
reasoning fine-tuned model.
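A minimal sketch of scoring yes/no predictions on a Task-17-style story is shown below. The story paraphrases the task format rather than quoting the dataset, and the predictions are stand-ins for model output.

```python
# Sketch: scoring yes/no predictions on bAbI Task 17 ("Positional Reasoning").
# The story below illustrates the task format; it is not copied from the dataset.

story = [
    "The triangle is to the right of the blue square.",
    "The red square is above the blue square.",
]
questions = [
    ("Is the red square to the left of the triangle?", "yes"),
    ("Is the triangle above the red square?", "no"),
]

# In practice the predictions would come from the fine-tuned model.
predictions = ["yes", "no"]

correct = sum(
    pred.strip().lower() == gold
    for pred, (_, gold) in zip(predictions, questions)
)
accuracy = correct / len(questions)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 1.00
```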
### MMLU Benchmark (Geography and Environmental Science Subsets)

This benchmark is a comprehensive evaluation suite designed to assess a
model's reasoning and knowledge across a wide range of subjects. Focusing
on the geography and environmental science subsets offers an opportunity
to test both domain-specific knowledge and broader reasoning abilities
relevant to geospatial tasks. Input/Output: the benchmark consists of
multiple-choice questions covering topics from elementary to advanced
levels. It is intended to assess the model's general performance after
fine-tuning and its ability to apply knowledge across the subjects most
relevant to its fine-tuning task.
## Comparison Models

# Usage and Intended Uses
*"StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in Texts."* arXiv preprint arXiv:2204.08292 (2022). [https://arxiv.org/abs/2204.08292](https://arxiv.org/abs/2204.08292)

5. **Wang, Mila**, Xiang Lorraine Li, and William Yang Wang. *"SpatialEval: A Benchmark for Spatial Reasoning Evaluation."* arXiv preprint arXiv:2104.08635 (2021).

6. **Weston, Jason**, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. *"Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks."* arXiv preprint arXiv:1502.05698 (2015).