Geospatial (Web Search) Query Detector

A binary SetFit classifier that distinguishes geospatial from non-geospatial web search queries. It was trained on 1,200 gold-labelled MS MARCO web search queries; the labels were first generated by weak supervision with Llama 3.1 and then manually verified. A preprint of the accompanying COSIT 2026 paper is available at https://arxiv.org/abs/2605.11336

The evaluation model, trained on 200 samples (105 non-spatial, 95 spatial), achieves F1 = 0.931 on a held-out test set of 800 samples (421 non-spatial, 379 spatial). The deployed model was trained on the full 1,200.
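For readers who want to reproduce the metric on their own labelled queries, here is a minimal sketch of a binary F1 computation. It assumes the reported score treats geospatial (label 1) as the positive class, which this card does not state explicitly. The toy labels are hypothetical and are not the model's actual predictions.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for one positive class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # With no positives in either list, F1 is undefined; return 0.0 by convention.
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data (1 = geospatial, 0 = non-geospatial).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

f1 = f1_score_binary(y_true, y_pred)
print(f1)  # 3 TP, 1 FP, 1 FN -> 6 / 8 = 0.75
```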

What counts as a geospatial query?

As per Mai et al. (2021) and Kefalidis et al. (2024), a query is geospatial if it requires qualitative or quantitative geographic knowledge of Earth-bound features to be answered.

This is usually the case if the query involves:

  • A geographic entity (named place on Earth: city, country, river, POI, address)
  • A geographic concept (place type: city, lake, mountain, park, building)
  • A spatial relation (near, within, north of, between, borders, crosses, distance)

Non-geospatial queries include anatomical, microscopic, astronomical, fictional, or abstract 'where' questions, as well as any query that needs no geographic knowledge to answer.

Model details

  • Sentence Transformer body: BAAI/bge-small-en-v1.5
  • Classification head: LogisticRegression
  • Training data: 1,200 gold-labelled MS MARCO queries (632 non-spatial, 568 spatial), sampled via K-means centroids across the full embedding space of all 1M+ queries for representativeness
  • Labels: 1 = geospatial, 0 = non-geospatial
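The centroid-based sampling mentioned under Training data can be illustrated with a small sketch. This card does not spell out the exact procedure, so the code assumes one plausible reading: cluster the embeddings with K-means, then keep the item closest to each centroid. The 2-D points stand in for real sentence embeddings, and the helper names kmeans and sample_nearest_to_centroids are hypothetical. A real pipeline would more likely use a library implementation of K-means.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def sample_nearest_to_centroids(points, k, seed=0):
    """Return sorted indices of the points closest to each centroid."""
    centroids = kmeans(points, k, seed=seed)
    chosen = {
        min(range(len(points)), key=lambda i: math.dist(points[i], c))
        for c in centroids
    }
    return sorted(chosen)


# Hypothetical toy "embeddings": two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
chosen = sample_nearest_to_centroids(points, k=2)
print(chosen)  # one representative index per group
```

Picking the item nearest each centroid yields one representative per region of the embedding space, which is one way to get the representativeness the bullet describes.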

Usage

from setfit import SetFitModel

model = SetFitModel.from_pretrained("ilyankou/is-geospatial-query")
preds = model([
  "nearest hospital",
  "far from the truth",
  "close to my heart",
  "flood risk in this area"
])
# => [1, 0, 0, 1]
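A short sketch of how the predictions might be used downstream, for example to route only geospatial queries to a geocoder. The helper split_by_label is hypothetical, and preds is hard-coded to the output shown in the Usage snippet so that the example runs without downloading the model.

```python
def split_by_label(queries, preds):
    """Group queries by predicted label (1 = geospatial, 0 = non-geospatial)."""
    geo, non_geo = [], []
    for query, pred in zip(queries, preds):
        # int() lets this also accept scalar tensor or NumPy values.
        (geo if int(pred) == 1 else non_geo).append(query)
    return geo, non_geo


queries = [
    "nearest hospital",
    "far from the truth",
    "close to my heart",
    "flood risk in this area",
]
preds = [1, 0, 0, 1]  # copied from the Usage example, not computed here

geo, non_geo = split_by_label(queries, preds)
print(geo)      # ['nearest hospital', 'flood risk in this area']
print(non_geo)  # ['far from the truth', 'close to my heart']
```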

Training

Weak labels were generated by running Llama 3.1 five times per query at temperature 0.3, and were then manually verified. The SetFit model was trained for 3 epochs with batch size 64 and learning rate 2e-5. For evaluation, it was trained on 200 samples (95 positive, 105 negative); for production inference, it was retrained on the full gold dataset (1,200 samples).
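This card does not say how the five Llama 3.1 runs per query were combined into a single weak label. One common choice, assumed here purely for illustration, is a majority vote, with the agreement rate used to flag queries that deserve closer manual review. The function name majority_label and the votes below are hypothetical.

```python
from collections import Counter


def majority_label(votes):
    """Return (most common label, fraction of runs that agreed with it)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Hypothetical votes from five runs per query (1 = geospatial, 0 = non-geospatial).
runs = {
    "nearest hospital": [1, 1, 1, 1, 1],
    "far from the truth": [0, 0, 1, 0, 0],
}

weak_labels = {query: majority_label(votes) for query, votes in runs.items()}
for query, (label, agreement) in weak_labels.items():
    print(query, label, agreement)
```

With an odd number of binary votes there are no ties, so every query receives a label.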
