Fine-tuned on product search domain (brand, product name, origin)

Files changed (6)

README.md +3 -3
best/README.md +174 -0
best/model.safetensors +1 -1
best/training_args.bin +1 -1
model.safetensors +1 -1
training_args.bin +1 -1

README.md CHANGED

@@ -14,7 +14,7 @@ should probably proofread and complete it, then remove this comment. -->
 This model was trained from scratch on the None dataset.
 It achieves the following results on the evaluation set:
-- Loss: 0.1166
 ## Model description
@@ -46,8 +46,8 @@ The following hyperparameters were used during training:
 | Training Loss | Epoch | Step | Validation Loss |
 |:-------------:|:-----:|:----:|:---------------:|
-| 0.0615        | 1.0   | 798  | 0.1166          |
-| 0.0404        | 2.0   | 1596 | 0.1166          |
 ### Framework versions

 This model was trained from scratch on the None dataset.
 It achieves the following results on the evaluation set:
+- Loss: 0.1188
 ## Model description
 | Training Loss | Epoch | Step | Validation Loss |
 |:-------------:|:-----:|:----:|:---------------:|
+| 0.0812        | 1.0   | 1137 | 0.1426          |
+| 0.0569        | 2.0   | 2274 | 0.1188          |
 ### Framework versions

best/README.md ADDED

@@ -0,0 +1,174 @@

+---
+language:
+  - en
+  - de
+  - fr
+  - it
+  - es
+  - nl
+  - da
+  - sv
+  - no
+  - pl
+license: apache-2.0
+tags:
+  - token-classification
+  - ner
+  - product-search
+  - query-understanding
+base_model: bltlab/queryner-bert-base-uncased
+datasets:
+  - bltlab/queryner
+  - thepian/eco-products-ner-fixtures
+pipeline_tag: token-classification
+---
+# queryner-eco-ner
+Named entity recognition for product search queries. Identifies **brand**, **product category**, **product name**, and **origin** spans in free-text queries.
+Fine-tuned from [bltlab/queryner-bert-base-uncased](https://huggingface.co/bltlab/queryner-bert-base-uncased), which was trained on Amazon ESCI queries. This model extends it with domain-specific vocabulary drawn from a European product database — brand names, multilingual product titles, and origin countries.
+## Labels
+The model predicts the full 17-type label set from the base queryner model. The four types most relevant to product search are:
+| Label | HF tag | Example span |
+|---|---|---|
+| Brand | `B-creator` / `I-creator` | `Ecover`, `Dr. Bronner's` |
+| Product category | `B-core_product_type` / `I-core_product_type` | `washing up liquid`, `shampoo` |
+| Product name | `B-product_name` / `I-product_name` | `Skin Food`, `Men 48H Deodorant` |
+| Origin | `B-origin` / `I-origin` | `Germany`, `Italy` |
+All other queryner types (`modifier`, `department`, `UoM`, `color`, `material`, etc.) are preserved from the base model.
+## Usage
+```python
+from transformers import pipeline
+ner = pipeline("token-classification", model="thepian/queryner-eco-ner", aggregation_strategy="simple")
+# Illustrative output (the uncased tokenizer returns `word` lowercased):
+results = ner("Ecover washing up liquid without palm oil")
+# [{'entity_group': 'creator', 'word': 'ecover', ...},
+#  {'entity_group': 'core_product_type', 'word': 'washing up liquid', ...}]
+results = ner("organic olive oil from Italy under €15")
+# [{'entity_group': 'core_product_type', 'word': 'olive oil', ...},
+#  {'entity_group': 'origin', 'word': 'italy', ...}]
+```
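Downstream code usually wants only the four search-relevant groups from the Labels table. The following is a minimal sketch of that post-processing step. `group_spans`, `SEARCH_LABELS`, and `min_score` are hypothetical names that are not part of this repository, and `sample` is a hand-written stand-in for pipeline output so the snippet runs without downloading the model.

```python
from collections import defaultdict

# The four entity groups highlighted in the Labels table.
SEARCH_LABELS = {"creator", "core_product_type", "product_name", "origin"}


def group_spans(results, min_score=0.0):
    """Collect pipeline output into {entity_group: [word, ...]} for the search labels."""
    grouped = defaultdict(list)
    for item in results:
        label = item["entity_group"]
        if label in SEARCH_LABELS and item.get("score", 1.0) >= min_score:
            grouped[label].append(item["word"])
    return dict(grouped)


# Hand-written stand-in for the output of ner(...); scores are made up.
sample = [
    {"entity_group": "creator", "word": "ecover", "score": 0.99},
    {"entity_group": "core_product_type", "word": "washing up liquid", "score": 0.97},
    {"entity_group": "modifier", "word": "without palm oil", "score": 0.80},
]

print(group_spans(sample))
```

Passing `min_score` drops low-confidence spans; a sensible threshold would have to be chosen from evaluation data.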
+## Training data
+19,179 examples from three sources:
+| Source | Examples | Notes |
+|---|---|---|
+| [bltlab/queryner](https://huggingface.co/datasets/bltlab/queryner) | 9,140 | Amazon ESCI queries; all 17 label types |
+| Local domain fixtures | ~1,000 | Hand-annotated product search queries |
+| Synthetic DB fixtures | ~9,000 | Template-generated from brand/category/product vocabulary |
+Synthetic examples are generated by `generate_db_dataset.py` from a European product database. Brand names come from EU-registered brands; product names are extracted from all language variants stored in `product.name` (en, de, fr, it, es, nl, and others). Product names that are exact matches of English category strings are excluded to avoid contradictory training signal.
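The generator script itself is not shown in this card, so the snippet below is only a sketch of the two ideas described above: dropping product names that collide with category strings, and emitting BIO-tagged template examples. All function names are hypothetical, and the case-insensitive comparison is an assumption about what "exact match" means.

```python
def bio_tags(label, n_tokens):
    """BIO tags for a span of n_tokens carrying the given label."""
    return [f"B-{label}"] + [f"I-{label}"] * (n_tokens - 1)


def filter_product_names(names, categories):
    """Drop product names that exactly match a category string (case-insensitive, assumed)."""
    blocked = {c.casefold() for c in categories}
    return [n for n in names if n.casefold() not in blocked]


def brand_product_example(brand, product_name):
    """Build one '<brand> <product name>' training example as (tokens, tags)."""
    brand_tokens, name_tokens = brand.split(), product_name.split()
    tokens = brand_tokens + name_tokens
    tags = bio_tags("creator", len(brand_tokens)) + bio_tags("product_name", len(name_tokens))
    return tokens, tags


categories = ["shampoo", "washing up liquid"]
names = filter_product_names(["Skin Food", "Shampoo"], categories)
print(names)  # ['Skin Food']
print(brand_product_example("Dove", "Skin Food"))
```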
+## Label balance and product name vs category
+The two most commonly confused labels are `core_product_type` (product category) and `product_name`
+(specific named product). The model's only reliable cue for distinguishing them is positional:
+text following a known brand is a candidate for `product_name`, while standalone noun phrases are
+typically `core_product_type`. This positional signal is structural, not lexical — "Dove shampoo"
+and "Dove Skin Food" look identical to the model at the template level.
+### Why category dominates in training (~2:1 target)
+Real product search queries are category-heavy by a large margin. Most users type "shampoo",
+"olive oil", or "washing powder", not "Fuji Green Tea Refreshingly Hydrating Conditioner".
+Training data should approximate inference-time distribution; over-representing `product_name`
+creates a mismatch that degrades category precision on the majority of queries.
+The base model (bltlab/queryner-bert-base-uncased) was trained on Amazon ESCI queries, which
+are also category-heavy. The marginal value of additional `core_product_type` examples is lower
+than the marginal value of `product_name` examples, but collapsing to 1:1 risks the model
+labeling any noun phrase after a brand as `product_name` — including generic category words like
+"shampoo" or "washing up liquid".
+**Current ratio: ~2.3:1 (core_product_type : product_name). Target: ~2:1.**
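One way to monitor this ratio is to count `B-` tags per label over the training set. The card does not say whether the ratio is measured per span or per example, so the sketch below assumes a per-span count; the function names are hypothetical.

```python
from collections import Counter


def span_counts(tag_sequences):
    """Count spans per label by counting B- tags across all examples."""
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


def category_to_name_ratio(tag_sequences):
    """Ratio of core_product_type spans to product_name spans."""
    counts = span_counts(tag_sequences)
    names = counts["product_name"]
    if names == 0:
        return float("inf")
    return counts["core_product_type"] / names


toy = [
    ["B-core_product_type"],
    ["B-creator", "B-core_product_type", "I-core_product_type"],
    ["B-creator", "B-product_name", "I-product_name"],
]
print(category_to_name_ratio(toy))  # 2.0
```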
+### Why going below 2:1 requires better data, not just more examples
+Increasing `product_name` examples without addressing lexical quality introduces contradictory
+signal:
+- A product named "Shampoo" and a category called "shampoo" become competing labels for the
+  same string. The model cannot resolve this without knowing whether the token is generic or
+  specific — information that is not present in the query.
+- The category cross-reference filter (dropping product names that are exact English category
+  matches) addresses the worst cases, but morphological variants ("Shampoos", "Crème") and
+  multi-language overlaps remain.
+To move significantly below 2:1 safely, the `product_name` training data would need to satisfy:
+| Requirement | Why |
+|---|---|
+| Lexically distinct from category vocabulary | Prevents identical strings from receiving competing labels |
+| High word-count names (3+ tokens) | Single and two-token product names are indistinguishable from short category slugs by surface form alone |
+| Brand diversity | The positional cue (brand precedes product name) only generalises if many different brands are paired with many different product names — a narrow brand set leads to brand-specific memorisation |
+| Multilingual coverage proportional to expected query mix | Training on English product names only means the model will underperform on French/German/Italian queries even though multilingual product names exist in the DB |
+| Minimal repetition | A product name seen 20 times with the same brand drowns signal from rarer names |
+Until those conditions are met, `product_name_ratio` should stay at 0.25–0.30, and the 2:1
+overall ratio should be maintained by generating more total synthetic examples rather than by
+raising `product_name_ratio`.
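Two of the requirements in the table, a minimum word count and minimal repetition, can be checked mechanically. The sketch below is illustrative only: `select_product_names` and its parameters are hypothetical, "tokens" is read as whitespace-separated words, and `BrandA` is a placeholder brand.

```python
from collections import Counter


def select_product_names(pairs, min_tokens=3, max_repeats=1):
    """Keep (brand, name) pairs with long enough names, capping repeats of the same pair."""
    seen = Counter()
    kept = []
    for brand, name in pairs:
        if len(name.split()) < min_tokens:
            continue  # short names look like category slugs
        key = (brand.casefold(), name.casefold())
        if seen[key] >= max_repeats:
            continue  # this pair has already been used often enough
        seen[key] += 1
        kept.append((brand, name))
    return kept


pairs = [
    ("BrandA", "Men 48H Deodorant"),
    ("BrandA", "Men 48H Deodorant"),
    ("BrandA", "Skin Food"),
]
print(select_product_names(pairs))  # [('BrandA', 'Men 48H Deodorant')]
```

Lexical distinctness, brand diversity, and language mix need corpus-level statistics and are not covered here.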
+---
+## Training procedure
+- Base model: `bltlab/queryner-bert-base-uncased`
+- Tokenizer: BERT WordPiece; subword tokens after the first in each word are masked (`-100`)
+- Max sequence length: 128
+- Label set: collected from training data (all 17 queryner types preserved)
+- Optimiser: AdamW, weight decay 0.01, warmup ratio 0.1
+- Segmented training: brand/product/origin first, then certification O-token signal and a product name ratio increase, each at a lower LR
+Typical segment configuration:
+```
+Segment 1: epochs=3, lr=3e-5   (base → domain)
+Segment 2: epochs=2, lr=1e-5   (add cert O-token signal)
+Segment 3: epochs=2, lr=5e-6   (product name ratio increase)
+```
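The tokenizer bullet above says that subword tokens after the first in each word are masked with `-100`. A common way to do this, sketched here under the assumption that the training script follows the usual pattern, is to walk the word indices that a fast tokenizer's `word_ids()` returns. `align_labels` is a hypothetical helper, and masking special tokens as well is an assumption.

```python
IGNORE_INDEX = -100  # label value the loss is expected to skip


def align_labels(word_ids, word_label_ids):
    """Map word-level label ids onto subword tokens, masking non-first subwords."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS], [SEP], or padding.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a word keeps the word's label.
            aligned.append(word_label_ids[word_id])
        else:
            # Later subwords of the same word are masked.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Shape of a word_ids() result for three words, the first split into two subwords.
word_ids = [None, 0, 0, 1, 2, None]
print(align_labels(word_ids, [1, 3, 4]))  # [-100, 1, -100, 3, 4, -100]
```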
+## Evaluation
+Evaluated on held-out domain fixtures with exact and partial span matching:
+| Label | Precision | Recall | F1 |
+|---|---|---|---|
+| brand | — | — | — |
+| product category | — | — | — |
+| product name | — | — | — |
+| origin | — | — | — |
+| **overall** | — | — | — |
+*(Results pending; the table is updated after each training segment.)*
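The card does not define exact or partial span matching. The sketch below assumes that an exact match needs the same label and boundaries, and that a partial match needs the same label and any token overlap. `spans_from_bio` and `prf` are hypothetical helpers, not the evaluation script used here.

```python
def spans_from_bio(tags):
    """Extract (label, start, end) spans from BIO tags; end is exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        inside = tag.startswith("I-") and tag[2:] == label
        if start is not None and not inside:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def prf(gold, pred, partial=False):
    """Precision, recall, and F1 over spans, with exact or overlap-based matching."""
    def match(p, g):
        if p[0] != g[0]:
            return False
        if partial:
            return p[1] < g[2] and g[1] < p[2]  # same label, overlapping ranges
        return p[1:] == g[1:]

    matched_pred = sum(any(match(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(match(p, g) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# "Ecover washing up liquid": the prediction misses the first category token.
gold = spans_from_bio(["B-creator", "B-core_product_type", "I-core_product_type", "I-core_product_type"])
pred = spans_from_bio(["B-creator", "O", "B-core_product_type", "I-core_product_type"])
print(prf(gold, pred))                # (0.5, 0.5, 0.5)
print(prf(gold, pred, partial=True))  # (1.0, 1.0, 1.0)
```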
+## Limitations
+- Extraction patterns are primarily English; avoidance frames in other languages (`ohne`, `sans`, `senza`) are not NER targets — they are handled by a separate parser
+- Multilingual product names are included in training but evaluation is English-only
+- Origin recognition covers ~13 European countries drawn from product records; global coverage is partial
+- Barcode and price extraction are not NER tasks — handled by a dedicated parser
+## Citation
+If you use this model, please cite the base model:
+```
+@misc{queryner,
+  author = {Palen-Michel, Chester and Liang, Lizzie and Wu, Zhe and Lignos, Constantine},
+  title = {QueryNER: Segmentation of E-commerce Queries},
+  year = {2024},
+  publisher = {HuggingFace},
+  url = {https://huggingface.co/bltlab/queryner-bert-base-uncased}
+}
+```

best/model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:06220fead842b7ee23b9ef53e53a7bed60f66cc201f11aa40a378afda005d048
 size 435697596

 version https://git-lfs.github.com/spec/v1
+oid sha256:2ae80a09a730d7ed9c622d3941d055dc6aa3e78b4a8946027b1158df63646758
 size 435697596

best/training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fe8266bcc4a9718cc85fc6ffcffd2413ad4ccaa3a5b8460931e69b3a8ccc8471
 size 5969

 version https://git-lfs.github.com/spec/v1
+oid sha256:c538014e65617630cb084588ec3ddf553c7fa06585fc03a0affc214c7993da69
 size 5969

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:06220fead842b7ee23b9ef53e53a7bed60f66cc201f11aa40a378afda005d048
 size 435697596

 version https://git-lfs.github.com/spec/v1
+oid sha256:2ae80a09a730d7ed9c622d3941d055dc6aa3e78b4a8946027b1158df63646758
 size 435697596

training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fe8266bcc4a9718cc85fc6ffcffd2413ad4ccaa3a5b8460931e69b3a8ccc8471
 size 5969

 version https://git-lfs.github.com/spec/v1
+oid sha256:c538014e65617630cb084588ec3ddf553c7fa06585fc03a0affc214c7993da69
 size 5969