ilyankou committed on
Commit 365a23c · verified · 1 Parent(s): 0a6c7fc

Update README.md

Files changed (1):
  1. README.md +19 -174
README.md CHANGED
@@ -3,7 +3,8 @@ tags:
  - setfit
  - sentence-transformers
  - text-classification
- - generated_from_setfit_trainer
  widget:
  - text: hotel in geneva airport
  - text: what payroll deduction is mpp
@@ -12,197 +13,41 @@ widget:
  - text: what's the weather in roseburg
  metrics:
  - accuracy
  pipeline_tag: text-classification
  library_name: setfit
  inference: true
  base_model: BAAI/bge-small-en-v1.5
  ---

- # SetFit with BAAI/bge-small-en-v1.5

- This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.

- The model has been trained using an efficient few-shot learning technique that involves:

- 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
- 2. Training a classification head with features from the fine-tuned Sentence Transformer.

- ## Model Details

- ### Model Description
- - **Model Type:** SetFit
- - **Sentence Transformer body:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
- - **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
- - **Maximum Sequence Length:** 512 tokens
- - **Number of Classes:** 2 classes
- <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->
-
- ### Model Sources
-
- - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
- - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
- - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
-
- ### Model Labels
- | Label | Examples |
- |:------|:-----------------------------------------------------------------------------------------------------------------------------------------------------|
- | 1 | <ul><li>'how far is palms casino from the airport in las vegas'</li><li>'anarkali bazar lahore'</li><li>'what county is alma nebraska in?'</li></ul> |
- | 0 | <ul><li>'what is symptom of bipolar disorder'</li><li>'early symptoms of shingles outbreak'</li><li>'bnsf total employees'</li></ul> |

- ## Uses

- ### Direct Use for Inference
-
- First install the SetFit library:
-
- ```bash
- pip install setfit
- ```

- Then you can load this model and run inference.

  ```python
  from setfit import SetFitModel

- # Download from the 🤗 Hub
- model = SetFitModel.from_pretrained("setfit_model_id")
- # Run inference
- preds = model("weather in erlanger ky")
  ```

- <!--
- ### Downstream Use
-
- *List how someone could finetune this model on their own dataset.*
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Training Set Metrics
- | Training set | Min | Median | Max |
- |:-------------|:----|:-------|:----|
- | Word count | 2 | 6.3028 | 21 |
-
- | Label | Training Sample Count |
- |:------|:----------------------|
- | 0 | 755 |
- | 1 | 718 |
-
- ### Training Hyperparameters
- - batch_size: (64, 64)
- - num_epochs: (1, 1)
- - max_steps: -1
- - sampling_strategy: oversampling
- - body_learning_rate: (1e-05, 1e-05)
- - head_learning_rate: 0.01
- - loss: CosineSimilarityLoss
- - distance_metric: cosine_distance
- - margin: 0.25
- - end_to_end: False
- - use_amp: False
- - warmup_proportion: 0.1
- - l2_weight: 0.01
- - seed: 42
- - eval_max_steps: -1
- - load_best_model_at_end: False
-
- ### Training Results
- | Epoch | Step | Training Loss | Validation Loss |
- |:------:|:-----:|:-------------:|:---------------:|
- | 0.0001 | 1 | 0.2507 | - |
- | 0.0294 | 500 | 0.1803 | - |
- | 0.0589 | 1000 | 0.0135 | - |
- | 0.0883 | 1500 | 0.0021 | - |
- | 0.1178 | 2000 | 0.001 | - |
- | 0.1472 | 2500 | 0.0007 | - |
- | 0.1766 | 3000 | 0.0005 | - |
- | 0.2061 | 3500 | 0.0004 | - |
- | 0.2355 | 4000 | 0.0004 | - |
- | 0.2649 | 4500 | 0.0003 | - |
- | 0.2944 | 5000 | 0.0003 | - |
- | 0.3238 | 5500 | 0.0003 | - |
- | 0.3533 | 6000 | 0.0003 | - |
- | 0.3827 | 6500 | 0.0002 | - |
- | 0.4121 | 7000 | 0.0003 | - |
- | 0.4416 | 7500 | 0.0002 | - |
- | 0.4710 | 8000 | 0.0002 | - |
- | 0.5004 | 8500 | 0.0002 | - |
- | 0.5299 | 9000 | 0.0002 | - |
- | 0.5593 | 9500 | 0.0002 | - |
- | 0.5888 | 10000 | 0.0002 | - |
- | 0.6182 | 10500 | 0.0002 | - |
- | 0.6476 | 11000 | 0.0001 | - |
- | 0.6771 | 11500 | 0.0001 | - |
- | 0.7065 | 12000 | 0.0001 | - |
- | 0.7359 | 12500 | 0.0001 | - |
- | 0.7654 | 13000 | 0.0001 | - |
- | 0.7948 | 13500 | 0.0001 | - |
- | 0.8243 | 14000 | 0.0001 | - |
- | 0.8537 | 14500 | 0.0001 | - |
- | 0.8831 | 15000 | 0.0001 | - |
- | 0.9126 | 15500 | 0.0001 | - |
- | 0.9420 | 16000 | 0.0001 | - |
- | 0.9714 | 16500 | 0.0001 | - |
-
- ### Framework Versions
- - Python: 3.11.5
- - SetFit: 1.1.2
- - Sentence Transformers: 4.0.2
- - Transformers: 4.55.2
- - PyTorch: 2.8.0
- - Datasets: 2.15.0
- - Tokenizers: 0.21.1
-
- ## Citation
-
- ### BibTeX
- ```bibtex
- @article{https://doi.org/10.48550/arxiv.2209.11055,
- doi = {10.48550/ARXIV.2209.11055},
- url = {https://arxiv.org/abs/2209.11055},
- author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
- keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
- title = {Efficient Few-Shot Learning Without Prompts},
- publisher = {arXiv},
- year = {2022},
- copyright = {Creative Commons Attribution 4.0 International}
- }
- ```
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact

- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
 
  - setfit
  - sentence-transformers
  - text-classification
+ - geospatial
+ - spatial-queries
  widget:
  - text: hotel in geneva airport
  - text: what payroll deduction is mpp
13
  - text: what's the weather in roseburg
  metrics:
  - accuracy
+ - f1
  pipeline_tag: text-classification
  library_name: setfit
  inference: true
  base_model: BAAI/bge-small-en-v1.5
  ---

+ # Spatial Web Search Query Classifier

+ A binary [SetFit](https://github.com/huggingface/setfit) classifier that distinguishes spatial from non-spatial web search queries. Trained on a gold-annotated sample of [MS MARCO](https://microsoft.github.io/msmarco/) and used to identify 104,288 spatial queries (10.3%) across the full 1.01M-query corpus.

+ **Accuracy / F1: 0.986** on a held-out balanced test set (76 negative, 72 positive).

+ ## What counts as spatial?

+ A query is spatial if its answer is geographically variant and requires reasoning about geographic primitives (location, distance, or direction) or topological relationships (adjacency, containment, or connectivity). This includes implicitly spatial queries, such as costs and prices in a specific area, not just queries containing a toponym.
+ ## Model details

+ - **Sentence Transformer body:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
+ - **Classification head:** LogisticRegression
+ - **Training data:** 1,473 gold-labelled MS MARCO queries (755 non-spatial, 718 spatial), sampled via K-means centroids across the full embedding space for representativeness
+ - **Labels:** `1` = spatial, `0` = non-spatial

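The centroid-based sampling named in the training-data bullet can be sketched roughly as follows; this is a minimal illustration under assumptions (the `sample_representative` helper and the toy embeddings are hypothetical, not the authors' code): cluster the query embeddings with K-means and annotate the query nearest to each centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_representative(embeddings: np.ndarray, k: int, seed: int = 42) -> np.ndarray:
    """Pick k queries spread across embedding space: the query closest to each K-means centroid."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(embeddings)
    # Distance from every embedding to every centroid, then argmin per centroid.
    dists = np.linalg.norm(
        embeddings[:, None, :] - km.cluster_centers_[None, :, :], axis=-1
    )
    return np.unique(dists.argmin(axis=0))

# Toy stand-in for query embeddings (the real model uses 384-dim BGE vectors).
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 16))
idx = sample_representative(emb, k=10)  # indices of queries to send for annotation
```

Sampling near centroids covers the embedding space more evenly than uniform random sampling, which is one plausible reading of "for representativeness" above.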
+ ## Usage

  ```python
  from setfit import SetFitModel

+ model = SetFitModel.from_pretrained("TODO")
+ preds = model(["weather in erlanger ky", "what is symptom of bipolar disorder"])
+ # => [1, 0]
  ```
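Since the head listed under model details is a scikit-learn LogisticRegression, the second SetFit stage amounts to fitting that head on frozen embedding vectors. A self-contained sketch with toy, well-separated embeddings (the toy data and dimensions are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for sentence embeddings of queries (real body outputs 384-dim BGE vectors).
rng = np.random.default_rng(42)
spatial = rng.normal(loc=1.0, size=(50, 8))       # label 1 = spatial
non_spatial = rng.normal(loc=-1.0, size=(50, 8))  # label 0 = non-spatial
X = np.vstack([spatial, non_spatial])
y = np.array([1] * 50 + [0] * 50)

# SetFit's second stage: fit the logistic-regression head on the (frozen) embeddings.
head = LogisticRegression().fit(X, y)
print(head.predict(np.array([[1.0] * 8, [-1.0] * 8])))  # -> [1 0]
```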

+ ## Training

+ Weak labels were generated by running Llama 3.1 five times per query at temperature 0.2, then manually verified. The SetFit model was trained for one epoch with batch size 64 and learning rate 1e-5, then retrained on the full gold dataset for production inference.
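One plausible way to combine the five runs into a weak label before manual verification is a majority vote with an agreement score; this is a hypothetical sketch (the README states only that the five runs were manually verified, not how they were aggregated):

```python
from collections import Counter

def majority_label(votes: list[int]) -> tuple[int, float]:
    """Majority vote over repeated LLM labels; returns (label, agreement ratio)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

# Five Llama 3.1 runs per query (0 = non-spatial, 1 = spatial) -- toy values.
runs = {
    "weather in erlanger ky": [1, 1, 1, 1, 1],
    "bnsf total employees": [0, 0, 1, 0, 0],
}
weak_labels = {q: majority_label(v) for q, v in runs.items()}
# Low-agreement queries are the natural candidates to prioritise for manual review.
```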