Spatial Web Search Query Classifier

A binary SetFit classifier that distinguishes spatial from non-spatial web search queries. Trained on a gold-annotated sample of MS MARCO and used to identify 104,288 spatial queries (10.3%) across the full 1.01M-query corpus.

Accuracy and F1 are both 0.986 on a held-out, approximately balanced test set (76 non-spatial, 72 spatial queries).
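
As a quick consistency check on these figures: with 148 test queries, an accuracy of 0.986 corresponds to 146 correct predictions, so two errors. The card does not say how those two errors split between false positives and false negatives, so the sketch below simply tries every split. It also assumes F1 is computed for the spatial (positive) class, which the card does not state.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy and positive-class F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, f1


N_NEG, N_POS = 76, 72  # test-set sizes reported above

# Every way two errors can fall (the actual split is not given in the card).
error_splits = [(2, 0), (1, 1), (0, 2)]  # (false positives, false negatives)

results = {}
for fp, fn in error_splits:
    tp = N_POS - fn
    tn = N_NEG - fp
    acc, f1 = accuracy_and_f1(tp, fp, fn, tn)
    results[(fp, fn)] = (round(acc, 3), round(f1, 3))
    print(f"fp={fp} fn={fn} accuracy={acc:.3f} f1={f1:.3f}")
```

Under these assumptions, every split rounds to the same reported value for both metrics.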

What counts as spatial?

A query is spatial if its answer is geographically variant and requires reasoning about geographic primitives (location, distance, or direction) or topological relationships (adjacency, containment, or connectivity). This includes implicitly spatial queries, such as those about costs and prices in a specific area, not just queries that contain a toponym.

Model details

  • Sentence Transformer body: BAAI/bge-small-en-v1.5
  • Classification head: LogisticRegression
  • Training data: 1,473 gold-labelled MS MARCO queries (755 non-spatial, 718 spatial), sampled via K-means centroids across the full embedding space for representativeness
  • Labels: 1 = spatial, 0 = non-spatial
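
The centroid-based sampling mentioned in the training data bullet can be sketched roughly as follows. This is an illustration under assumptions, not the author's pipeline: the card does not give the number of clusters or the exact selection rule, so the sketch picks the query nearest each centroid. The helper names and the 2-D toy vectors are hypothetical stand-ins for real bge-small-en-v1.5 embeddings.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def sample_near_centroids(queries, embeddings, k, seed=0):
    """Pick, for each centroid, the query whose embedding lies closest to it."""
    chosen = []
    for centroid in kmeans(embeddings, k, seed=seed):
        idx = min(range(len(embeddings)),
                  key=lambda i: squared_distance(embeddings[i], centroid))
        if idx not in chosen:  # avoid duplicates if two centroids share a query
            chosen.append(idx)
    return [queries[i] for i in chosen]


# Toy data: two well-separated groups in a made-up 2-D embedding space.
queries = ["nearest hospital", "hospitals near me", "closest pharmacy",
           "far from the truth", "close to my heart", "what is love"]
embeddings = [[0.0, 0.0], [0.2, 0.0], [0.1, 0.1],
              [5.0, 5.0], [5.2, 5.0], [5.1, 5.1]]

sample = sample_near_centroids(queries, embeddings, k=2)
print(sample)
```

Sampling one query per centroid spreads the annotation budget across the whole embedding space rather than concentrating it in dense regions, which is the representativeness property the bullet describes.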

Usage

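# Requires the setfit package (pip install setfit)
from setfit import SetFitModel

# Download the classifier from the Hugging Face Hub
model = SetFitModel.from_pretrained("ilyankou/spatial-classifier")

# Calling the model on a list of queries returns one label per query
preds = model([
    "nearest hospital",
    "far from the truth",
    "close to my heart",
    "flood risk in this area",
])
# => [1, 0, 0, 1]  (1 = spatial, 0 = non-spatial)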

Training

Weak labels were generated by running Llama 3.1 five times per query at temperature 0.2, then manually verified. The SetFit model was trained for one epoch with batch size 64 and learning rate 1e-5, then retrained on the full gold dataset for production inference.
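
One plausible way to turn the five Llama 3.1 runs into a single weak label is a majority vote. The card does not describe the aggregation rule, so the sketch below is an assumption, and the vote lists are hypothetical rather than real model outputs. The agreement score is likewise an illustrative extra that could help order queries for manual verification.

```python
from collections import Counter


def aggregate_votes(votes):
    """Return (majority label, share of runs that agree with it)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Hypothetical outputs of five runs per query (1 = spatial, 0 = non-spatial).
runs = {
    "nearest hospital": [1, 1, 1, 1, 1],
    "far from the truth": [0, 0, 0, 0, 0],
    "flood risk in this area": [1, 1, 1, 0, 1],
}

weak_labels = {query: aggregate_votes(votes) for query, votes in runs.items()}
for query, (label, agreement) in weak_labels.items():
    print(f"{query!r}: label={label}, agreement={agreement:.1f}")
```

With an odd number of runs and two classes, a majority vote can never tie, which is one reason five runs is a convenient choice.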

Model size: 33.4M params (Safetensors, tensor type F32)