---
language:
- en
- de
- fr
- it
- es
- nl
- da
- sv
- "no"
- pl
license: apache-2.0
tags:
- token-classification
- ner
- product-search
- query-understanding
base_model: bltlab/queryner-bert-base-uncased
datasets:
- bltlab/queryner
- thepian/eco-products-ner-fixtures
pipeline_tag: token-classification
---
# queryner-eco-ner
Named entity recognition for product search queries. Identifies **brand**, **product category**, **product name**, and **origin** spans in free-text queries.
Fine-tuned from [bltlab/queryner-bert-base-uncased](https://huggingface.co/bltlab/queryner-bert-base-uncased), which was trained on Amazon ESCI queries. This model extends it with domain-specific vocabulary drawn from a European product database — brand names, multilingual product titles, and origin countries.
## Labels
The model predicts the full 17-type label set from the base queryner model. The four types most relevant to product search are:
| Label | HF tag | Example span |
|---|---|---|
| Brand | `B-creator` / `I-creator` | `Ecover`, `Dr. Bronner's` |
| Product category | `B-core_product_type` / `I-core_product_type` | `washing up liquid`, `shampoo` |
| Product name | `B-product_name` / `I-product_name` | `Skin Food`, `Men 48H Deodorant` |
| Origin | `B-origin` / `I-origin` | `Germany`, `Italy` |
All other queryner types (`modifier`, `department`, `UoM`, `color`, `material`, etc.) are preserved from the base model.
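A minimal sketch of how the BIO tags in the table map back to the four domain labels. The helper names `split_tag` and `domain_label` are illustrative and not part of the model or its training code.

```python
from typing import Optional, Tuple

# Entity types from the table above, mapped to their domain labels.
DOMAIN_LABELS = {
    "creator": "brand",
    "core_product_type": "product category",
    "product_name": "product name",
    "origin": "origin",
}

def split_tag(tag: str) -> Tuple[str, str]:
    """Split a BIO tag such as 'B-creator' into ('B', 'creator'); 'O' gives ('O', '')."""
    if tag == "O":
        return "O", ""
    prefix, _, entity_type = tag.partition("-")
    return prefix, entity_type

def domain_label(tag: str) -> Optional[str]:
    """Return the domain label for a tag, or None for 'O' and the other queryner types."""
    _, entity_type = split_tag(tag)
    return DOMAIN_LABELS.get(entity_type)
```

Tags for the remaining queryner types (for example `B-color`) return `None` here, which makes it easy to ignore them downstream without discarding them from the model output.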
## Usage
```python
from transformers import pipeline
ner = pipeline("token-classification", model="thepian/queryner-eco-ner", aggregation_strategy="simple")
results = ner("Ecover washing up liquid without palm oil")
# [{'entity_group': 'creator', 'word': 'Ecover', ...},
# {'entity_group': 'core_product_type', 'word': 'washing up liquid', ...}]
results = ner("organic olive oil from Italy under €15")
# [{'entity_group': 'core_product_type', 'word': 'olive oil', ...},
# {'entity_group': 'origin', 'word': 'Italy', ...}]
```
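For downstream use it is often convenient to group the aggregated pipeline output by entity type. The sketch below assumes the result dictionaries carry the `entity_group`, `word`, and `score` keys produced with `aggregation_strategy="simple"`; the function name `group_entities` and the `min_score` threshold are illustrative choices, not part of this model.

```python
from collections import defaultdict

WANTED = ("creator", "core_product_type", "product_name", "origin")

def group_entities(results, wanted=WANTED, min_score=0.0):
    """Group aggregated pipeline results into {entity_group: [word, ...]}."""
    grouped = defaultdict(list)
    for r in results:
        # Skip entity types we are not interested in and low-confidence spans.
        if r["entity_group"] in wanted and r.get("score", 1.0) >= min_score:
            grouped[r["entity_group"]].append(r["word"])
    return dict(grouped)
```

Calling `group_entities(ner(query))` then yields one list of spans per entity type present in the query.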
## Training data
20,203 examples from three sources:
| Source | Examples | Notes |
|---|---|---|
| [bltlab/queryner](https://huggingface.co/datasets/bltlab/queryner) | 9,140 | Amazon ESCI queries; all 17 label types |
| Local domain fixtures | ~1,063 | Hand-annotated product search queries (incl. substitute-frame fixtures) |
| Synthetic DB fixtures | ~10,000 | Template-generated from brand/category/product vocabulary; includes 1,000 substitute-frame (multilingual) |
Synthetic examples are generated by `generate_db_dataset.py` from a European product database. Brand names come from EU-registered brands; product names are extracted from all language variants stored in `product.name` (en, de, fr, it, es, nl, and others). Product names that are exact matches of English category strings are excluded to avoid contradictory training signal.
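The internals of `generate_db_dataset.py` are not shown here, so the following is only a sketch of what template-based generation with the category cross-reference filter could look like. The function names, the single brand-then-phrase template, the `tokens`/`ner_tags` field names, and the case-insensitive match are all assumptions.

```python
import random

def label_span(tokens, entity_type):
    """BIO-tag a token list as one span of the given entity type."""
    return [f"{'B' if i == 0 else 'I'}-{entity_type}" for i in range(len(tokens))]

def make_example(brand, tail, tail_type):
    """Build one '<brand> <tail>' query with aligned BIO tags."""
    brand_tokens, tail_tokens = brand.split(), tail.split()
    return {
        "tokens": brand_tokens + tail_tokens,
        "ner_tags": label_span(brand_tokens, "creator") + label_span(tail_tokens, tail_type),
    }

def generate(brands, categories, product_names, n, product_name_ratio=0.3, seed=0):
    """Generate n synthetic examples from brand/category/product vocabulary."""
    rng = random.Random(seed)
    category_set = {c.lower() for c in categories}
    # Drop product names that exactly match a category string (assumed case-insensitive).
    names = [p for p in product_names if p.lower() not in category_set]
    examples = []
    for _ in range(n):
        brand = rng.choice(brands)
        if names and rng.random() < product_name_ratio:
            examples.append(make_example(brand, rng.choice(names), "product_name"))
        else:
            examples.append(make_example(brand, rng.choice(categories), "core_product_type"))
    return examples
```

With this filter a product literally named "Shampoo" is never emitted as `product_name`, so the string "shampoo" only ever appears with the `core_product_type` label.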
## Label balance and product name vs category
The two most commonly confused labels are `core_product_type` (product category) and `product_name`
(a specific named product). The model's only reliable cue for distinguishing them is positional:
text following a known brand is a candidate for `product_name`, while standalone noun phrases are
typically `core_product_type`. This cue is structural, not lexical, so it is ambiguous on its own:
"Dove shampoo" and "Dove Skin Food" both fit the same brand-then-noun-phrase template, yet only the
second contains a product name.
### Why category dominates in training (~2:1 target)
Real product search queries are category-heavy by a large margin. Most users type "shampoo",
"olive oil", or "washing powder", not "Fuji Green Tea Refreshingly Hydrating Conditioner".
Training data should approximate inference-time distribution; over-representing `product_name`
creates a mismatch that degrades category precision on the majority of queries.
The base model (bltlab/queryner-bert-base-uncased) was trained on Amazon ESCI queries, which
are also category-heavy. The marginal value of additional `core_product_type` examples is lower
than the marginal value of `product_name` examples, but collapsing to 1:1 risks the model
labeling any noun phrase after a brand as `product_name` — including generic category words like
"shampoo" or "washing up liquid".
**Current ratio: ~2.3:1 (core_product_type : product_name). Target: ~2:1.**
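One way to check this ratio on a dataset is to count span starts per entity type. The sketch below assumes each example stores string BIO tags under a `ner_tags` key and that the ratio is measured over spans rather than over examples; both are assumptions, since the counting method is not stated here.

```python
from collections import Counter

def span_counts(examples):
    """Count spans per entity type by counting their B- tags."""
    counts = Counter()
    for ex in examples:
        for tag in ex["ner_tags"]:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts

def category_to_name_ratio(examples):
    """Return the core_product_type : product_name span ratio."""
    counts = span_counts(examples)
    if counts["product_name"] == 0:
        return float("inf")
    return counts["core_product_type"] / counts["product_name"]
```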
### Why going below 2:1 requires better data, not just more examples
Increasing `product_name` examples without addressing lexical quality introduces contradictory
signal:
- A product named "Shampoo" and a category called "shampoo" become competing labels for the
same string. The model cannot resolve this without knowing whether the token is generic or
specific — information that is not present in the query.
- The category cross-reference filter (dropping product names that are exact English category
matches) addresses the worst cases, but morphological variants ("Shampoos", "Crème") and
multi-language overlaps remain.
To move significantly below 2:1 safely, the `product_name` training data would need to satisfy:
| Requirement | Why |
|---|---|
| Lexically distinct from category vocabulary | Prevents the model learning a single label for identical strings |
| High word-count names (3+ tokens) | Single and two-token product names are indistinguishable from short category slugs by surface form alone |
| Brand diversity | The positional cue (brand precedes product name) only generalises if many different brands are paired with many different product names — a narrow brand set leads to brand-specific memorisation |
| Multilingual coverage proportional to expected query mix | Training on English product names only means the model will underperform on French/German/Italian queries even though multilingual product names exist in the DB |
| Minimal repetition | A product name seen 20 times with the same brand drowns signal from rarer names |
Until those conditions are met, `product_name_ratio` should stay at 0.25–0.30, and the 2:1
overall ratio should be maintained by generating more total synthetic examples rather than by
raising `product_name_ratio`.
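Three of the requirements in the table (lexical distinctness, word count, and minimal repetition) can be approximated with a simple filter over candidate (brand, product name) pairs. This is a hedged sketch: the function name and the `min_tokens` and `max_repeats` defaults are illustrative, and brand diversity and multilingual coverage are not addressed.

```python
from collections import Counter

def filter_product_names(pairs, categories, min_tokens=3, max_repeats=2):
    """Keep (brand, name) pairs that pass three of the checks from the table above."""
    category_set = {c.casefold() for c in categories}
    seen = Counter()
    kept = []
    for brand, name in pairs:
        key = (brand.casefold(), name.casefold())
        if name.casefold() in category_set:
            continue  # identical to a category string
        if len(name.split()) < min_tokens:
            continue  # too short to tell apart from a category by surface form
        if seen[key] >= max_repeats:
            continue  # cap repetition of the same brand/name pair
        seen[key] += 1
        kept.append((brand, name))
    return kept
```

Morphological variants such as "Shampoos" still pass the exact-match check, so a real filter would need stemming or a multilingual category list on top of this.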
---
## Training procedure
- Base model: `bltlab/queryner-bert-base-uncased`
- Tokenizer: BERT WordPiece; only the first subword of each word keeps its label, and later subwords are set to `-100` so the loss ignores them
- Max sequence length: 128
- Label set: collected from training data (all 17 queryner types preserved)
- Optimiser: AdamW, weight decay 0.01, warmup ratio 0.1
- Segmented training: brand/product/origin first, then certification O-token signal at lower LR
Typical segment configuration:
```
Segment 1: epochs=3, lr=3e-5 (base → domain)
Segment 2: epochs=2, lr=1e-5 (add cert O-token signal)
Segment 3: epochs=2, lr=5e-6 (product name ratio increase)
Segment 4: epochs=2, lr=5e-6 (substitute-frame + multilingual, brand F1 0.698 → 0.897)
```
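The subword masking listed under the tokenizer step can be sketched as follows. It assumes `word_ids` has the shape returned by a fast tokenizer's `word_ids()` method (one entry per subword token, `None` for special tokens); the function name `align_labels` is illustrative and not taken from the training script.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def align_labels(word_ids, word_labels):
    """Assign each word's label to its first subword and IGNORE_INDEX to the rest."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special tokens such as [CLS] and [SEP]
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous = word_id
    return aligned
```

For a three-word query whose first and last words split into two subwords each, `align_labels([None, 0, 0, 1, 2, 2, None], [1, 3, 4])` gives one real label per word and `-100` everywhere else.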
## Evaluation
Evaluated on 63 held-out domain fixtures (39 general + 24 substitute-frame / multilingual) with exact and partial span matching.
**Segment 4** — 2 epochs, lr=5e-6, base=segment 3 checkpoint, 20,203 training examples (incl. substitute-frame):
| Label | P (partial) | R (partial) | F1 (partial) | F1 (exact) |
|---|---|---|---|---|
| brand | 0.929 | 0.867 | **0.897** | **0.897** |
| product category | 0.895 | 0.962 | **0.927** | 0.891 |
| product name | 0.875 | 0.700 | 0.778 | 0.556 |
| origin | 1.000 | 0.917 | **0.957** | **0.957** |
| **overall** | **0.915** | **0.900** | **0.908** | 0.874 |
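The gap between the partial and exact columns can be illustrated with a small scoring sketch. The evaluation script itself is not shown, so this assumes that a partial match means same label with any token overlap, and that spans are `(label, start, end)` tuples with an exclusive end; it scores a single query, whereas the table aggregates over all 63 fixtures.

```python
def overlaps(a, b):
    """True if two (label, start, end) spans share a label and at least one token."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def prf(gold, pred, partial=False):
    """Precision, recall and F1 under exact or partial span matching."""
    match = overlaps if partial else (lambda a, b: a == b)
    matched_pred = sum(any(match(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

A prediction that finds a product name but cuts it one token short counts as a hit under partial matching and as a miss under exact matching, which is consistent with `product name` showing the largest partial-to-exact drop in the table.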
Key remaining gaps:
- `Dr. Bronner's` apostrophe: the tokenizer splits on `'`, so the span is predicted as `"dr. bronner ' s"`. Needs pre-tokenization normalization.
- Ecover brand FN (4 fixtures): underrepresented in training vocabulary; missed even in substitute-frame context.
- German origin `Deutschland` not recognized — training uses English country names only.
- Umlaut span mismatch: `Spülmittel` becomes `spulmittel` because the uncased BERT tokenizer lowercases text and strips accents.
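One possible workaround for the apostrophe and umlaut mismatches, separate from the pre-tokenization normalization mentioned above, is to read the surface form from the original query using the character offsets in the pipeline output. This sketch assumes each result carries `start` and `end` offsets (they can be `None` with slow tokenizers) and that the predicted span boundaries are otherwise correct; the function name `surface_spans` is illustrative.

```python
def surface_spans(text, results):
    """Replace the normalized 'word' of each result with the original text slice."""
    spans = []
    for r in results:
        start, end = r.get("start"), r.get("end")
        if start is not None and end is not None:
            word = text[start:end]  # original casing, accents and punctuation
        else:
            word = r["word"]        # fall back to the tokenizer's normalized form
        spans.append({"entity_group": r["entity_group"], "word": word})
    return spans
```

This only changes how spans are reported; it does not help when the model misses or truncates the span itself.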
## Limitations
- Extraction patterns are primarily English; avoidance frames in other languages (`ohne`, `sans`, `senza`) are not NER targets — they are handled by a separate parser
- Multilingual product names are included in training but evaluation is English-only
- Origin recognition covers ~13 European countries drawn from product records; global coverage is partial
- Barcode and price extraction are not NER tasks — handled by a dedicated parser
## Citation
If you use this model, please cite the base model:
```
@misc{queryner,
author = {Palen-Michel, Chester and Liang, Lizzie and Wu, Zhe and Lignos, Constantine},
title = {QueryNER: Segmentation of E-commerce Queries},
year = {2024},
publisher = {HuggingFace},
url = {https://huggingface.co/bltlab/queryner-bert-base-uncased}
}
```