thepian committed on
Commit 5948af9 · verified · 1 Parent(s): 002068e

Fine-tuned on product search domain (brand, product name, origin)

README.md CHANGED
@@ -14,7 +14,7 @@ should probably proofread and complete it, then remove this comment. -->
 
  This model was trained from scratch on the None dataset.
  It achieves the following results on the evaluation set:
- - Loss: 0.1166
+ - Loss: 0.1188
 
  ## Model description
 
@@ -46,8 +46,8 @@ The following hyperparameters were used during training:
 
  | Training Loss | Epoch | Step | Validation Loss |
  |:-------------:|:-----:|:----:|:---------------:|
- | 0.0615 | 1.0 | 798 | 0.1166 |
- | 0.0404 | 2.0 | 1596 | 0.1166 |
+ | 0.0812 | 1.0 | 1137 | 0.1426 |
+ | 0.0569 | 2.0 | 2274 | 0.1188 |
 
 
  ### Framework versions
best/README.md ADDED
@@ -0,0 +1,174 @@
+ ---
+ language:
+ - en
+ - de
+ - fr
+ - it
+ - es
+ - nl
+ - da
+ - sv
+ - no
+ - pl
+ license: apache-2.0
+ tags:
+ - token-classification
+ - ner
+ - product-search
+ - query-understanding
+ base_model: bltlab/queryner-bert-base-uncased
+ datasets:
+ - bltlab/queryner
+ - thepian/eco-products-ner-fixtures
+ pipeline_tag: token-classification
+ ---
+
+ # queryner-eco-ner
+
+ Named entity recognition for product search queries. Identifies **brand**, **product category**, **product name**, and **origin** spans in free-text queries.
+
+ Fine-tuned from [bltlab/queryner-bert-base-uncased](https://huggingface.co/bltlab/queryner-bert-base-uncased), which was trained on Amazon ESCI queries. This model extends it with domain-specific vocabulary drawn from a European product database — brand names, multilingual product titles, and origin countries.
+
+ ## Labels
+
+ The model predicts the full 17-type label set from the base queryner model. The four types most relevant to product search are:
+
+ | Label | HF tag | Example span |
+ |---|---|---|
+ | Brand | `B-creator` / `I-creator` | `Ecover`, `Dr. Bronner's` |
+ | Product category | `B-core_product_type` / `I-core_product_type` | `washing up liquid`, `shampoo` |
+ | Product name | `B-product_name` / `I-product_name` | `Skin Food`, `Men 48H Deodorant` |
+ | Origin | `B-origin` / `I-origin` | `Germany`, `Italy` |
+
+ All other queryner types (`modifier`, `department`, `UoM`, `color`, `material`, etc.) are preserved from the base model.
+
+ ## Usage
+
+ ```python
+ from transformers import pipeline
+
+ ner = pipeline("token-classification", model="thepian/queryner-eco-ner", aggregation_strategy="simple")
+
+ # The base model is uncased, so the returned `word` values come back lowercased.
+ results = ner("Ecover washing up liquid without palm oil")
+ # [{'entity_group': 'creator', 'word': 'ecover', ...},
+ #  {'entity_group': 'core_product_type', 'word': 'washing up liquid', ...}]
+
+ results = ner("organic olive oil from Italy under €15")
+ # [{'entity_group': 'core_product_type', 'word': 'olive oil', ...},
+ #  {'entity_group': 'origin', 'word': 'italy', ...}]
+ ```
+
+ ## Training data
+
+ 19,179 examples from three sources:
+
+ | Source | Examples | Notes |
+ |---|---|---|
+ | [bltlab/queryner](https://huggingface.co/datasets/bltlab/queryner) | 9,140 | Amazon ESCI queries; all 17 label types |
+ | Local domain fixtures | ~1,000 | Hand-annotated product search queries |
+ | Synthetic DB fixtures | ~9,000 | Template-generated from brand/category/product vocabulary |
+
+ Synthetic examples are generated by `generate_db_dataset.py` from a European product database. Brand names come from EU-registered brands; product names are extracted from all language variants stored in `product.name` (en, de, fr, it, es, nl, and others). Product names that are exact matches of English category strings are excluded to avoid contradictory training signal.
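
The category cross-reference filter described above could look roughly like the sketch below. This is not the code from `generate_db_dataset.py`, which is not shown in this commit; the function name `drop_category_collisions` is hypothetical, and treating "exact match" as case-insensitive after trimming whitespace is an assumption.

```python
def drop_category_collisions(product_names, english_categories):
    """Keep product names that are not exact matches of an English category.

    Assumption: matching ignores case and surrounding whitespace.
    """
    categories = {c.strip().casefold() for c in english_categories}
    return [n for n in product_names if n.strip().casefold() not in categories]


# Illustrative inputs only, not taken from the real product database.
names = ["Shampoo", "Skin Food", "Olive Oil", "Men 48H Deodorant"]
categories = ["shampoo", "olive oil", "washing up liquid"]

kept = drop_category_collisions(names, categories)
print(kept)  # ['Skin Food', 'Men 48H Deodorant']
```

A filter of this shape only catches exact collisions: a name such as "Shampoos" would pass through unchanged.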
+
+ ## Label balance and product name vs category
+
+ The two most commonly confused labels are `core_product_type` (product category) and `product_name`
+ (specific named product). The model's only reliable cue for distinguishing them is positional:
+ text following a known brand is a candidate for `product_name`, while standalone noun phrases are
+ typically `core_product_type`. This positional signal is structural, not lexical — "Dove shampoo"
+ and "Dove Skin Food" look identical to the model at the template level.
+
+ ### Why category dominates in training (~2:1 target)
+
+ Real product search queries are category-heavy by a large margin. Most users type "shampoo",
+ "olive oil", or "washing powder", not "Fuji Green Tea Refreshingly Hydrating Conditioner".
+ Training data should approximate inference-time distribution; over-representing `product_name`
+ creates a mismatch that degrades category precision on the majority of queries.
+
+ The base model (bltlab/queryner-bert-base-uncased) was trained on Amazon ESCI queries, which
+ are also category-heavy. The marginal value of additional `core_product_type` examples is lower
+ than the marginal value of `product_name` examples, but collapsing to 1:1 risks the model
+ labeling any noun phrase after a brand as `product_name` — including generic category words like
+ "shampoo" or "washing up liquid".
+
+ **Current ratio: ~2.3:1 (core_product_type : product_name). Target: ~2:1.**
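
One way to measure a ratio like this on a BIO-tagged dataset is to count span starts (`B-` tags) per label. The sketch below is illustrative only: the README does not say whether its ratio counts spans or examples, so counting spans is an assumption, and the helper names and toy tag sequences are hypothetical.

```python
from collections import Counter


def span_counts(tag_sequences):
    """Count entity spans per label by counting B- tags."""
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


def category_to_name_ratio(tag_sequences):
    """Return spans(core_product_type) / spans(product_name)."""
    counts = span_counts(tag_sequences)
    if counts["product_name"] == 0:
        raise ValueError("no product_name spans in the data")
    return counts["core_product_type"] / counts["product_name"]


# Toy tag sequences standing in for tokenised queries.
examples = [
    ["B-creator", "B-core_product_type", "I-core_product_type", "I-core_product_type"],
    ["B-core_product_type"],
    ["B-creator", "B-product_name", "I-product_name"],
    ["B-core_product_type", "I-core_product_type", "O", "B-origin"],
]

ratio = category_to_name_ratio(examples)
print(ratio)  # 3.0 (three category spans, one product-name span)
```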
+
+ ### Why going below 2:1 requires better data, not just more examples
+
+ Increasing `product_name` examples without addressing lexical quality introduces contradictory
+ signal:
+
+ - A product named "Shampoo" and a category called "shampoo" become competing labels for the
+   same string. The model cannot resolve this without knowing whether the token is generic or
+   specific — information that is not present in the query.
+ - The category cross-reference filter (dropping product names that are exact English category
+   matches) addresses the worst cases, but morphological variants ("Shampoos", "Crème") and
+   multi-language overlaps remain.
+
+ To move significantly below 2:1 safely, the `product_name` training data would need to satisfy:
+
+ | Requirement | Why |
+ |---|---|
+ | Lexically distinct from category vocabulary | Prevents the model learning a single label for identical strings |
+ | High word-count names (3+ tokens) | Single and two-token product names are indistinguishable from short category slugs by surface form alone |
+ | Brand diversity | The positional cue (brand precedes product name) only generalises if many different brands are paired with many different product names — a narrow brand set leads to brand-specific memorisation |
+ | Multilingual coverage proportional to expected query mix | Training on English product names only means the model will underperform on French/German/Italian queries even though multilingual product names exist in the DB |
+ | Minimal repetition | A product name seen 20 times with the same brand drowns signal from rarer names |
+
+ Until those conditions are met, `product_name_ratio` should stay at 0.25–0.30, and the 2:1
+ overall ratio should be maintained by generating more total synthetic examples rather than by
+ raising `product_name_ratio`.
+
+ ---
+
+ ## Training procedure
+
+ - Base model: `bltlab/queryner-bert-base-uncased`
+ - Tokenizer: BERT WordPiece; subword tokens after the first in each word are masked (`-100`)
+ - Max sequence length: 128
+ - Label set: collected from training data (all 17 queryner types preserved)
+ - Optimiser: AdamW, weight decay 0.01, warmup ratio 0.1
+ - Segmented training: brand/product/origin first, then certification O-token signal at lower LR
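
The subword masking in the tokenizer bullet can be sketched as follows. This is a minimal illustration, not the training script from this repository: `align_labels` is a hypothetical helper, the word-to-token mapping is assumed to have the shape returned by a fast tokenizer's `word_ids()` (`None` for special tokens), and the WordPiece split in the example is made up.

```python
IGNORE_INDEX = -100  # default ignore index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels):
    """Label the first subword of each word; mask everything else with -100."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special tokens such as [CLS] / [SEP]
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword keeps the label
        else:
            aligned.append(IGNORE_INDEX)  # later subwords of the same word
        previous = word_id
    return aligned


# Hypothetical split: [CLS] eco ##ver shampoo [SEP]
word_ids = [None, 0, 0, 1, None]
word_labels = [1, 2]  # illustrative label ids for the two words

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, -100, 2, -100]
```

Masked positions contribute nothing to the loss, so each word is scored once regardless of how many pieces the tokenizer splits it into.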
+
+ Typical segment configuration:
+
+ ```
+ Segment 1: epochs=3, lr=3e-5 (base → domain)
+ Segment 2: epochs=2, lr=1e-5 (add cert O-token signal)
+ Segment 3: epochs=2, lr=5e-6 (product name ratio increase)
+ ```
+
+ ## Evaluation
+
+ Evaluated on held-out domain fixtures with exact and partial span matching:
+
+ | Label | Precision | Recall | F1 |
+ |---|---|---|---|
+ | brand | — | — | — |
+ | product category | — | — | — |
+ | product name | — | — | — |
+ | origin | — | — | — |
+ | **overall** | — | — | — |
+
+ *(Results updated after each training segment.)*
+
+ ## Limitations
+
+ - Extraction patterns are primarily English; avoidance frames in other languages (`ohne`, `sans`, `senza`) are not NER targets — they are handled by a separate parser
+ - Multilingual product names are included in training but evaluation is English-only
+ - Origin recognition covers ~13 European countries drawn from product records; global coverage is partial
+ - Barcode and price extraction are not NER tasks — handled by a dedicated parser
+
+ ## Citation
+
+ If you use this model, please cite the base model:
+
+ ```
+ @misc{queryner,
+ author = {Palen-Michel, Chester and Liang, Lizzie and Wu, Zhe and Lignos, Constantine},
+ title = {QueryNER: Segmentation of E-commerce Queries},
+ year = {2024},
+ publisher = {HuggingFace},
+ url = {https://huggingface.co/bltlab/queryner-bert-base-uncased}
+ }
+ ```
best/model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:06220fead842b7ee23b9ef53e53a7bed60f66cc201f11aa40a378afda005d048
+ oid sha256:2ae80a09a730d7ed9c622d3941d055dc6aa3e78b4a8946027b1158df63646758
  size 435697596
best/training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:fe8266bcc4a9718cc85fc6ffcffd2413ad4ccaa3a5b8460931e69b3a8ccc8471
+ oid sha256:c538014e65617630cb084588ec3ddf553c7fa06585fc03a0affc214c7993da69
  size 5969
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:06220fead842b7ee23b9ef53e53a7bed60f66cc201f11aa40a378afda005d048
+ oid sha256:2ae80a09a730d7ed9c622d3941d055dc6aa3e78b4a8946027b1158df63646758
  size 435697596
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:fe8266bcc4a9718cc85fc6ffcffd2413ad4ccaa3a5b8460931e69b3a8ccc8471
+ oid sha256:c538014e65617630cb084588ec3ddf553c7fa06585fc03a0affc214c7993da69
  size 5969
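
The binary files above are stored as Git LFS pointers: three `key value` lines giving the spec version, a SHA-256 object id, and the size in bytes. A downloaded file can be checked against its pointer along the lines of the sketch below. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and hashing an in-memory byte string is a simplification; a 435 MB checkpoint would normally be hashed in chunks.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """True if the bytes have the size and digest recorded in the pointer."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algo"], data).hexdigest() == pointer["digest"]


# Pointer for the updated model.safetensors, copied from the diff above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:2ae80a09a730d7ed9c622d3941d055dc6aa3e78b4a8946027b1158df63646758\n"
    "size 435697596\n"
)
print(pointer["algo"], pointer["size"])  # sha256 435697596
```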