Spatial Web Search Query Classifier

A binary SetFit classifier that distinguishes spatial from non-spatial web search queries. It was trained on a gold-annotated sample of MS MARCO and then used to identify 104,288 spatial queries (10.3%) in the full 1.01M-query corpus.

Accuracy / F1: 0.986 on a held-out, near-balanced test set (76 non-spatial, 72 spatial).
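
As a quick consistency check on the reported figure: with 76 + 72 = 148 test queries, an accuracy of 0.986 corresponds to 146 correct predictions, that is, two errors. The card does not say how those two errors split between false positives and false negatives, so the sketch below tries all three possible splits. Both the splits and the reading of F1 as the spatial-class F1 are assumptions made for illustration.

```python
# Test-set sizes come from the card; the error split is NOT given there,
# so every possible split of the two errors is enumerated (assumption).
N_NEG, N_POS = 76, 72
N_ERRORS = 2  # 146 / 148 is the only count that rounds to 0.986


def f1_for_split(false_pos, false_neg):
    """F1 of the positive (spatial) class for a hypothetical error split."""
    true_pos = N_POS - false_neg
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)


accuracy = (N_NEG + N_POS - N_ERRORS) / (N_NEG + N_POS)

# Keys are (false positives, false negatives); values are F1 rounded to 3 d.p.
f1_by_split = {
    (fp, N_ERRORS - fp): round(f1_for_split(fp, N_ERRORS - fp), 3)
    for fp in range(N_ERRORS + 1)
}

print(round(accuracy, 3))  # 0.986
print(f1_by_split)
```

Under these assumptions every split rounds to an F1 of 0.986, which is consistent with accuracy and F1 being reported as a single number.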

What counts as spatial?

A query is spatial if its answer is geographically variant and requires reasoning about geographic primitives (location, distance, or direction) or topological relationships (adjacency, containment, or connectivity). This includes implicitly spatial queries such as costs and prices in a specific area, not just those containing a toponym.

Model details

Sentence Transformer body: BAAI/bge-small-en-v1.5
Classification head: LogisticRegression
Training data: 1,473 gold-labelled MS MARCO queries (755 non-spatial, 718 spatial), sampled via K-means centroids across the full embedding space for representativeness
Labels: 1 = spatial, 0 = non-spatial
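
The centroid-based sampling mentioned under training data can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: it assumes that "sampled via K-means centroids" means keeping the query closest to each centroid, it uses toy 2-D points in place of real `bge-small-en-v1.5` embeddings, and it uses a simple deterministic initialisation instead of k-means++. The function names are hypothetical.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm with evenly spaced initial centroids."""
    # Deterministic init (a simplification; real runs typically use k-means++).
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # New centroid = mean of its cluster; keep the old one if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


def sample_nearest_to_centroids(points, k):
    """Return sorted indices of the points closest to each of the k centroids."""
    chosen = set()
    for centroid in kmeans(points, k):
        idx = min(range(len(points)), key=lambda i: math.dist(points[i], centroid))
        chosen.add(idx)
    return sorted(chosen)


# Toy "embeddings": three well-separated groups of three points each.
toy_embeddings = [
    (0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
    (5.0, 5.1), (5.1, 5.0), (5.05, 5.05),
    (9.0, 0.1), (9.1, 0.0), (9.05, 0.05),
]
sample_indices = sample_nearest_to_centroids(toy_embeddings, k=3)
print(sample_indices)  # [2, 5, 8]
```

Picking one query per cluster spreads the annotation budget across the embedding space, so dense regions of near-duplicate queries do not dominate the gold set.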

Usage

from setfit import SetFitModel

model = SetFitModel.from_pretrained("ilyankou/spatial-classifier")
preds = model([
  "nearest hospital",
  "far from the truth",
  "close to my heart",
  "flood risk in this area"
])
# => [1, 0, 0, 1]
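
For a corpus on the scale of the 1.01M MS MARCO queries, predictions are usually made in batches. The helper below is a hedged sketch of that pattern, not the code used to produce the 104,288 figure: `iter_spatial` and `toy_predict` are hypothetical names, and a keyword stub stands in for the classifier so the example runs without downloading the model.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def _keep_spatial(batch, predict):
    """Run `predict` on one batch and keep the queries labelled 1 (spatial)."""
    labels = predict(batch)
    return [q for q, label in zip(batch, labels) if int(label) == 1]


def iter_spatial(
    queries: Iterable[str],
    predict: Callable[[List[str]], Sequence[int]],
    batch_size: int = 1024,
) -> Iterator[str]:
    """Yield spatial queries, calling `predict` on one batch at a time."""
    batch: List[str] = []
    for query in queries:
        batch.append(query)
        if len(batch) == batch_size:
            yield from _keep_spatial(batch, predict)
            batch = []
    if batch:  # flush the final partial batch
        yield from _keep_spatial(batch, predict)


def toy_predict(batch):
    """Stand-in predictor (assumption): flags any query containing 'near'."""
    return [1 if "near" in q else 0 for q in batch]


corpus = ["nearest hospital", "close to my heart", "cafes near me"]
spatial = list(iter_spatial(corpus, toy_predict, batch_size=2))
print(spatial)  # ['nearest hospital', 'cafes near me']
```

With the real classifier, passing the loaded model itself as `predict` in place of `toy_predict` should behave the same way, since the usage example above calls the model directly on a list of strings.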

Training

Weak labels were generated by running Llama 3.1 five times per query at temperature 0.2, then manually verified. The SetFit model was trained for one epoch with batch size 64 and learning rate 1e-5, then retrained on the full gold dataset for production inference.
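
The card does not say how the five Llama 3.1 runs per query were combined into one weak label. One plausible reading, shown here purely as an assumption, is a majority vote, with the agreement rate kept alongside the label so that low-agreement queries can be looked at first during manual verification. The function name `aggregate_weak_label` is hypothetical.

```python
from collections import Counter


def aggregate_weak_label(votes):
    """Majority-vote a list of 0/1 labels; return (label, agreement rate).

    Assumption: an odd number of binary votes (five in the card), so no ties.
    """
    counts = Counter(votes)
    label, top_count = counts.most_common(1)[0]
    return label, top_count / len(votes)


# Five hypothetical LLM runs for two queries (made-up votes, for illustration only).
print(aggregate_weak_label([1, 1, 0, 1, 1]))  # (1, 0.8)
print(aggregate_weak_label([0, 0, 0, 0, 0]))  # (0, 1.0)
```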

Model size: 33.4M params (Safetensors, tensor type F32)
