Spatial Web Search Query Classifier
A binary SetFit classifier that distinguishes spatial from non-spatial web search queries. Trained on a gold-annotated sample of MS MARCO and used to identify 104,288 spatial queries (10.3%) across the full 1.01M-query corpus.
Accuracy / F1: both 0.986 on a held-out, near-balanced test set (76 negative, 72 positive).
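As a sanity check on the figure above, the reported test-set sizes pin down how many errors the score implies. The sketch below uses only the counts stated here (76 negative, 72 positive) and the rounded 0.986 value; the helper names are illustrative and not part of the released code.

```python
# Test-set sizes as reported in this card.
N_NEG, N_POS = 76, 72
TOTAL = N_NEG + N_POS  # 148 queries

def f1(fp, fn):
    """F1 for the positive (spatial) class given false positives and false negatives."""
    tp = N_POS - fn
    return 2 * tp / (2 * tp + fp + fn)

# Which total error counts round to an accuracy of 0.986?
consistent_errors = [
    e for e in range(TOTAL + 1)
    if round((TOTAL - e) / TOTAL, 3) == 0.986
]

# For each consistent error count, try every split into FP / FN
# and record the rounded F1 that split would produce.
splits = [
    (fp, e - fp, round(f1(fp, e - fp), 3))
    for e in consistent_errors
    for fp in range(e + 1)
]

print(consistent_errors)  # [2]
print(splits)             # [(0, 2, 0.986), (1, 1, 0.986), (2, 0, 0.986)]
```

Under these counts, 0.986 corresponds to 2 misclassified queries out of 148, and every way of splitting those two errors between false positives and false negatives also yields an F1 of 0.986, which is consistent with the two metrics being reported as a single number.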
What counts as spatial?
A query is spatial if its answer is geographically variant and requires reasoning about geographic primitives (location, distance, or direction) or topological relationships (adjacency, containment, or connectivity). This includes implicitly spatial queries such as costs and prices in a specific area, not just those containing a toponym.
Model details
- Sentence Transformer body: BAAI/bge-small-en-v1.5
- Classification head: LogisticRegression
- Training data: 1,473 gold-labelled MS MARCO queries (755 non-spatial, 718 spatial), sampled via K-means centroids across the full embedding space for representativeness
- Labels: 1 = spatial, 0 = non-spatial
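The centroid-based sampling mentioned under training data can be pictured with a small, dependency-free sketch. It assumes that "sampled via K-means centroids" means picking the query closest to each cluster centroid; the card does not spell this out, so treat that reading, the deterministic farthest-point seeding, and the 2-D toy vectors (standing in for real sentence embeddings) as assumptions.

```python
import math

def farthest_point_init(points, k):
    """Deterministic seeding (an assumption): start at points[0], then
    repeatedly add the point farthest from all seeds chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; returns the final centroids."""
    centroids = farthest_point_init(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids

def sample_near_centroids(points, k):
    """Return sorted indices of the points closest to each centroid."""
    centroids = kmeans(points, k)
    picked = {
        min(range(len(points)), key=lambda i: math.dist(points[i], c))
        for c in centroids
    }
    return sorted(picked)

# Toy 2-D "embeddings": two well-separated groups of five points each.
toy_embeddings = [
    (0, 0), (0, 2), (2, 0), (2, 2), (1, 1),
    (10, 10), (10, 12), (12, 10), (12, 12), (11, 11),
]
sample_idx = sample_near_centroids(toy_embeddings, k=2)
print(sample_idx)  # [4, 9]
```

Choosing one item per centroid spreads the annotation budget across the embedding space rather than concentrating it on the most common query types, which is the representativeness goal stated above.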
Usage
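from setfit import SetFitModel

# Load the classifier from the Hugging Face Hub
model = SetFitModel.from_pretrained("ilyankou/spatial-classifier")

# Calling the model on a list of queries returns one label per query
preds = model([
    "nearest hospital",
    "far from the truth",
    "close to my heart",
    "flood risk in this area",
])
# => [1, 0, 0, 1]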
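For larger query logs it may be convenient to classify in batches and keep only the spatial queries. The following is a minimal sketch: `filter_spatial` and `toy_classify` are hypothetical names introduced here, and the stand-in classifier exists only so the snippet runs without downloading the model.

```python
from typing import Callable, Iterable, List, Sequence

def filter_spatial(
    queries: Sequence[str],
    classify: Callable[[List[str]], Iterable[int]],
    batch_size: int = 256,
) -> List[str]:
    """Return the queries that `classify` labels as 1 (spatial)."""
    spatial = []
    for start in range(0, len(queries), batch_size):
        batch = list(queries[start:start + batch_size])
        labels = classify(batch)
        # int(...) also accepts scalar tensor / array elements.
        spatial.extend(q for q, label in zip(batch, labels) if int(label) == 1)
    return spatial

# Stand-in classifier (hypothetical keyword rule, NOT the real model).
def toy_classify(batch: List[str]) -> List[int]:
    return [1 if "near" in q else 0 for q in batch]

demo = filter_spatial(
    ["pharmacy near me", "define irony", "gas prices near boston"],
    toy_classify,
    batch_size=2,
)
print(demo)  # ['pharmacy near me', 'gas prices near boston']
```

Because the loaded `SetFitModel` is itself callable on a list of strings, it should be possible to pass `model` in place of `toy_classify`; the batch size here is an arbitrary illustrative value.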
Training
Weak labels were generated by running Llama 3.1 five times per query at temperature 0.2, then manually verified. The SetFit model was trained for one epoch with batch size 64 and learning rate 1e-5, then retrained on the full gold dataset for production inference.
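One way to turn five LLM runs per query into a single weak label is a majority vote with an agreement score. The card does not state how the runs were combined, so the voting rule, the function name, and the example votes below are all assumptions made for illustration.

```python
from collections import Counter

def aggregate_weak_labels(votes):
    """Majority vote over repeated runs; returns (label, agreement ratio)."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    return label, top / len(votes)

# Hypothetical votes from five runs per query (1 = spatial, 0 = non-spatial).
runs = {
    "nearest hospital": [1, 1, 1, 1, 1],
    "close to my heart": [0, 0, 1, 0, 0],
}

weak = {query: aggregate_weak_labels(votes) for query, votes in runs.items()}

# Queries where the runs disagree might deserve extra care during manual verification.
disputed = [query for query, (_, agreement) in weak.items() if agreement < 1.0]

print(weak)      # {'nearest hospital': (1, 1.0), 'close to my heart': (0, 0.8)}
print(disputed)  # ['close to my heart']
```

With an odd number of runs and binary labels there is always a strict majority, so no tie-breaking rule is needed in this sketch.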