sareena committed
Commit b252405 · verified · 1 Parent(s): 6dad701

Update README.md

Files changed (1):
  1. README.md +35 -5
README.md CHANGED
@@ -68,6 +68,39 @@ This setup preserved general reasoning ability while improving spatial accuracy.

  # Evaluation

+ ## Benchmark Tasks
+
+ ### SpatialQA Benchmark
+
+ This benchmark provides more realistic question-answer pairs than the
+ StepGame benchmark. It uses real place names instead of StepGame's abstract
+ letter placeholders, while still requiring the same multi-step geospatial
+ reasoning capability. It complements StepGame by testing broader spatial
+ logic in more realistic scenarios.
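+
+ To make the contrast concrete, here is a hand-written sketch of the two item
+ styles. These are illustrative records only, not actual rows from either
+ dataset.
+
+ ```python
+ # Illustrative only: hand-written items mimicking each benchmark's style.
+ # StepGame abstracts entities to letters; SpatialQA-style items use places.
+ stepgame_item = {
+     "story": ["A is to the left of B.", "B is above C."],
+     "question": "What is the relation of A to C?",
+ }
+ spatialqa_item = {
+     "story": ["The library is west of the park.",
+               "The park is north of the station."],
+     "question": "Where is the library relative to the station?",
+ }
+ ```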
+
+ ### bAbI Dataset (Task 17)
+
+ This benchmark, introduced by Facebook AI Research, comprises 20 synthetic
+ question-answering tasks designed to evaluate distinct reasoning abilities
+ in models. Task 17 ("Positional Reasoning") specifically assesses spatial
+ reasoning through textual descriptions. Pathfinding in combination with
+ positional reasoning can potentially help assess the model's performance on
+ tasks such as route calculation, a common application for a model fine-tuned
+ for geospatial reasoning.
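+
+ A minimal scoring sketch, assuming the standard bAbI format of a short story,
+ a yes/no question, and a gold answer. The sample item is hand-written in that
+ format, and `model_answer` is a hypothetical stand-in for the model call.
+
+ ```python
+ # Hand-written item in the bAbI Task 17 ("Positional Reasoning") style.
+ item = {
+     "story": ["The triangle is above the pink rectangle.",
+               "The blue square is to the left of the triangle."],
+     "question": "Is the pink rectangle to the right of the blue square?",
+     "answer": "yes",
+ }
+
+ def score(items, model_answer):
+     """Exact-match accuracy; model_answer(story, question) -> 'yes'/'no'."""
+     hits = sum(
+         model_answer(it["story"], it["question"]).strip().lower() == it["answer"]
+         for it in items
+     )
+     return hits / len(items)
+ ```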
+
+
+ ### MMLU Benchmark (Geography and Environmental Subsets)
+
+ MMLU is a comprehensive evaluation suite designed to assess a model's
+ reasoning and knowledge across a wide range of subjects. Its geography and
+ environmental science subsets offer an opportunity to test both
+ domain-specific knowledge and the broader reasoning abilities relevant to
+ geospatial tasks. The benchmark consists of multiple-choice questions
+ covering topics from elementary to advanced levels, and it is used here to
+ assess the model's general performance after fine-tuning and its ability to
+ apply knowledge across the subjects most relevant to the fine-tuning task.
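+
+ As an illustration, MMLU subsets can be loaded from the Hugging Face Hub. The
+ `cais/mmlu` dataset id, the `high_school_geography` config name, and the
+ prompt format below are assumptions of this sketch, not the card's actual
+ evaluation pipeline.
+
+ ```python
+ from datasets import load_dataset
+
+ # Assumed dataset id/config; MMLU exposes one config per subject on the Hub.
+ geo = load_dataset("cais/mmlu", "high_school_geography", split="test")
+
+ def to_prompt(ex):
+     # Each row carries a question, four choices, and an integer answer index.
+     options = "\n".join(f"{k}. {v}" for k, v in zip("ABCD", ex["choices"]))
+     return f"{ex['question']}\n{options}\nAnswer:", "ABCD"[ex["answer"]]
+ ```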
+
+
+ ## Comparison Models


  # Usage and Intended Uses
 
@@ -127,11 +160,8 @@ more similar to the downstream evaluations.
  *"StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in Texts."*
  *arXiv.Org*, 18 Apr. 2022. [https://arxiv.org/abs/2204.08292](https://arxiv.org/abs/2204.08292)

- 5. **Shi, Zhengxiang**, Qiang Zhang, and Aldo Lipani.
- *"StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in Texts."* arXiv preprint arXiv:2204.08292 (2022).
-
- 6. **Wang, Mila**, Xiang Lorraine Li, and William Yang Wang.
+ 5. **Wang, Mila**, Xiang Lorraine Li, and William Yang Wang.
  *"SpatialEval: A Benchmark for Spatial Reasoning Evaluation."* arXiv preprint arXiv:2104.08635 (2021).

- 7. **Weston, Jason**, Antoine Bordes, Sumit Chopra, and Tomas Mikolov.
+ 6. **Weston, Jason**, Antoine Bordes, Sumit Chopra, and Tomas Mikolov.
  *"Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks."* arXiv preprint arXiv:1502.05698 (2015).