	---
	language:
	- en
	- de
	- fr
	- it
	- es
	- nl
	- da
	- sv
	- "no"
	- pl
	license: apache-2.0
	tags:
	- token-classification
	- ner
	- product-search
	- query-understanding
	base_model: bltlab/queryner-bert-base-uncased
	datasets:
	- bltlab/queryner
	- thepian/eco-products-ner-fixtures
	pipeline_tag: token-classification
	---

	# queryner-eco-ner

	Named entity recognition for product search queries. Identifies brand, product category, product name, and origin spans in free-text queries.

	Fine-tuned from [bltlab/queryner-bert-base-uncased](https://huggingface.co/bltlab/queryner-bert-base-uncased), which was trained on Amazon ESCI queries. This model extends it with domain-specific vocabulary drawn from a European product database — brand names, multilingual product titles, and origin countries.

	## Labels

	The model predicts the full 17-type label set from the base queryner model. The four types most relevant to product search are:

	| Label | HF tag | Example span |
	|---|---|---|
	| Brand | `B-creator` / `I-creator` | `Ecover`, `Dr. Bronner's` |
	| Product category | `B-core_product_type` / `I-core_product_type` | `washing up liquid`, `shampoo` |
	| Product name | `B-product_name` / `I-product_name` | `Skin Food`, `Men 48H Deodorant` |
	| Origin | `B-origin` / `I-origin` | `Germany`, `Italy` |

	All other queryner types (`modifier`, `department`, `UoM`, `color`, `material`, etc.) are preserved from the base model.

	## Usage

	```python
	from transformers import pipeline

	ner = pipeline("token-classification", model="thepian/queryner-eco-ner", aggregation_strategy="simple")

	results = ner("Ecover washing up liquid without palm oil")
	# [{'entity_group': 'creator', 'word': 'Ecover', ...},
	# {'entity_group': 'core_product_type', 'word': 'washing up liquid', ...}]

	results = ner("organic olive oil from Italy under €15")
	# [{'entity_group': 'core_product_type', 'word': 'olive oil', ...},
	# {'entity_group': 'origin', 'word': 'Italy', ...}]
	```

	## Training data

	20,203 examples from three sources:

	| Source | Examples | Notes |
	|---|---|---|
	| [bltlab/queryner](https://huggingface.co/datasets/bltlab/queryner) | 9,140 | Amazon ESCI queries; all 17 label types |
	| Local domain fixtures | ~1,063 | Hand-annotated product search queries (incl. substitute-frame fixtures) |
	| Synthetic DB fixtures | ~10,000 | Template-generated from brand/category/product vocabulary; includes 1,000 substitute-frame (multilingual) |

	Synthetic examples are generated by `generate_db_dataset.py` from a European product database. Brand names come from EU-registered brands; product names are extracted from all language variants stored in `product.name` (en, de, fr, it, es, nl, and others). Product names that are exact matches of English category strings are excluded to avoid contradictory training signal.
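
The category cross-reference filter described above could look roughly like the following. This is a minimal sketch, not the code in `generate_db_dataset.py`: the function name `drop_category_collisions` is hypothetical, and comparing case-insensitively is an assumption.

```python
def drop_category_collisions(product_names, category_names):
    """Keep only product names that are not exact matches of a category string.

    Matching is case-insensitive and ignores surrounding whitespace
    (an assumption; the real filter may compare differently).
    """
    categories = {c.strip().casefold() for c in category_names}
    return [n for n in product_names if n.strip().casefold() not in categories]


names = ["Shampoo", "Skin Food", "Men 48H Deodorant", "olive oil"]
categories = ["shampoo", "olive oil", "washing up liquid"]

kept = drop_category_collisions(names, categories)
print(kept)  # ['Skin Food', 'Men 48H Deodorant']
```

An exact-match filter of this kind leaves near-duplicates such as plural forms untouched, which is why it only removes the most direct label conflicts.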

	## Label balance and product name vs category

	The two most commonly confused labels are `core_product_type` (product category) and `product_name`
	(specific named product). The model's only reliable cue for distinguishing them is positional:
	text following a known brand is a candidate for `product_name`, while standalone noun phrases are
	typically `core_product_type`. This positional signal is structural, not lexical — "Dove shampoo"
	and "Dove Skin Food" look identical to the model at the template level.

	### Why category dominates in training (~2:1 target)

	Real product search queries are category-heavy by a large margin. Most users type "shampoo",
	"olive oil", or "washing powder", not "Fuji Green Tea Refreshingly Hydrating Conditioner".
	Training data should approximate inference-time distribution; over-representing `product_name`
	creates a mismatch that degrades category precision on the majority of queries.

	The base model (bltlab/queryner-bert-base-uncased) was trained on Amazon ESCI queries, which
	are also category-heavy. The marginal value of additional `core_product_type` examples is lower
	than the marginal value of `product_name` examples, but collapsing to 1:1 risks the model
	labeling any noun phrase after a brand as `product_name` — including generic category words like
	"shampoo" or "washing up liquid".

	Current ratio: ~2.3:1 (core_product_type : product_name). Target: ~2:1.
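
One way to monitor this ratio is to count entity spans per label in the BIO-tagged training set. The sketch below assumes the ratio is measured in spans (one per `B-` tag) rather than tokens; the helper names are hypothetical and the three examples are illustrative only.

```python
from collections import Counter


def entity_counts(tagged_examples):
    """Count entity spans per label; each B- tag opens exactly one span."""
    counts = Counter()
    for tags in tagged_examples:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


def category_to_name_ratio(tagged_examples):
    """Return core_product_type spans divided by product_name spans."""
    counts = entity_counts(tagged_examples)
    if counts["product_name"] == 0:
        return float("inf")
    return counts["core_product_type"] / counts["product_name"]


examples = [
    ["B-creator", "B-core_product_type"],                             # "dove shampoo"
    ["B-creator", "B-product_name", "I-product_name"],                # "dove skin food"
    ["B-core_product_type", "I-core_product_type", "O", "B-origin"],  # "olive oil from italy"
]

ratio = category_to_name_ratio(examples)
print(ratio)  # 2.0
```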

	### Why going below 2:1 requires better data, not just more examples

	Increasing `product_name` examples without addressing lexical quality introduces contradictory
	signal:

	- A product named "Shampoo" and a category called "shampoo" become competing labels for the
	same string. The model cannot resolve this without knowing whether the token is generic or
	specific — information that is not present in the query.
	- The category cross-reference filter (dropping product names that are exact English category
	matches) addresses the worst cases, but morphological variants ("Shampoos", "Crème") and
	multi-language overlaps remain.

	To move significantly below 2:1 safely, the `product_name` training data would need to satisfy:

	| Requirement | Why |
	|---|---|
	| Lexically distinct from category vocabulary | Prevents the model learning a single label for identical strings |
	| High word-count names (3+ tokens) | Single and two-token product names are indistinguishable from short category slugs by surface form alone |
	| Brand diversity | The positional cue (brand precedes product name) only generalises if many different brands are paired with many different product names; a narrow brand set leads to brand-specific memorisation |
	| Multilingual coverage proportional to expected query mix | Training on English product names only means the model will underperform on French/German/Italian queries even though multilingual product names exist in the DB |
	| Minimal repetition | A product name seen 20 times with the same brand drowns signal from rarer names |

	Until those conditions are met, `product_name_ratio` should stay at 0.25–0.30, and the 2:1
	overall ratio should be maintained by generating more total synthetic examples rather than by
	raising `product_name_ratio`.
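
Two of the requirements above, the 3+ token length and minimal repetition, are simple to check mechanically. The following is an illustrative sketch only: `select_product_name_pairs`, `min_tokens`, and `max_repeats` are hypothetical names, the brand is a placeholder, and token length is approximated by whitespace splitting.

```python
from collections import Counter


def select_product_name_pairs(pairs, min_tokens=3, max_repeats=2):
    """Filter (brand, product_name) pairs before template generation.

    Drops names shorter than `min_tokens` whitespace tokens and caps how
    often the same brand/name pair may appear at `max_repeats`.
    """
    seen = Counter()
    kept = []
    for brand, name in pairs:
        if len(name.split()) < min_tokens:
            continue  # short names look like category slugs
        key = (brand.casefold(), name.casefold())
        if seen[key] >= max_repeats:
            continue  # avoid drowning out rarer names
        seen[key] += 1
        kept.append((brand, name))
    return kept


pairs = [
    ("BrandA", "Men 48H Deodorant"),
    ("BrandA", "Men 48H Deodorant"),
    ("BrandA", "Men 48H Deodorant"),
    ("BrandA", "Skin Food"),
]

kept_pairs = select_product_name_pairs(pairs)
print(kept_pairs)  # two copies of the 3-token name; the 2-token name is dropped
```

Lexical distinctness, brand diversity, and language mix need corpus-level statistics and are not covered by this sketch.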

	---

	## Training procedure

	- Base model: `bltlab/queryner-bert-base-uncased`
	- Tokenizer: BERT WordPiece; subword tokens after the first in each word are masked (`-100`)
	- Max sequence length: 128
	- Label set: collected from training data (all 17 queryner types preserved)
	- Optimiser: AdamW, weight decay 0.01, warmup ratio 0.1
	- Segmented training: brand/product/origin first, then certification O-token signal at lower LR
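
The subword masking mentioned in the tokenizer bullet can be sketched as follows. The example takes a `word_ids` list of the kind returned by a fast tokenizer's `BatchEncoding.word_ids()`; the function name `align_labels`, the label ids, and the particular word-piece split are assumptions made for illustration.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids, word_labels):
    """Expand word-level label ids to token level.

    Only the first subword of each word keeps its label; special tokens
    (word id None) and later subwords are set to IGNORE_INDEX.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            aligned.append(word_labels[word_id])
        else:
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# "ecover washing up liquid", assuming "ecover" splits into two word pieces.
word_ids = [None, 0, 0, 1, 2, 3, None]  # [CLS] eco ##ver washing up liquid [SEP]
word_labels = [1, 3, 4, 4]              # hypothetical ids: B-creator, B-core_product_type, I-..., I-...

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 1, -100, 3, 4, 4, -100]
```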

	Typical segment configuration:

	```
	Segment 1: epochs=3, lr=3e-5 (base → domain)
	Segment 2: epochs=2, lr=1e-5 (add cert O-token signal)
	Segment 3: epochs=2, lr=5e-6 (product name ratio increase)
	Segment 4: epochs=2, lr=5e-6 (substitute-frame + multilingual, brand F1 0.698 → 0.897)
	```
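
The segment schedule can also be written down as data, with each segment starting from the previous segment's checkpoint. This is a sketch under assumptions: the `Segment` class, `build_plan`, and the `checkpoints/` directory layout are hypothetical, and the keys under `training_args` are chosen to match `transformers.TrainingArguments` parameter names so they could be passed through as keyword arguments.

```python
from dataclasses import dataclass

BASE_MODEL = "bltlab/queryner-bert-base-uncased"


@dataclass(frozen=True)
class Segment:
    name: str
    epochs: int
    learning_rate: float


SEGMENTS = [
    Segment("segment1", 3, 3e-5),  # base -> domain
    Segment("segment2", 2, 1e-5),  # add cert O-token signal
    Segment("segment3", 2, 5e-6),  # product name ratio increase
    Segment("segment4", 2, 5e-6),  # substitute-frame + multilingual
]


def build_plan(segments, base_model=BASE_MODEL, root="checkpoints"):
    """Chain segments so each one initialises from the previous output."""
    plan = []
    init_from = base_model
    for seg in segments:
        output_dir = f"{root}/{seg.name}"
        plan.append({
            "init_from": init_from,
            "training_args": {
                "output_dir": output_dir,
                "num_train_epochs": seg.epochs,
                "learning_rate": seg.learning_rate,
                "weight_decay": 0.01,
                "warmup_ratio": 0.1,
            },
        })
        init_from = output_dir
    return plan


plan = build_plan(SEGMENTS)
for step in plan:
    print(step["init_from"], "->", step["training_args"]["output_dir"])
```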

	## Evaluation

	Evaluated on 63 held-out domain fixtures (39 general + 24 substitute-frame / multilingual) with exact and partial span matching.

	Segment 4 — 2 epochs, lr=5e-6, base=segment 3 checkpoint, 20,203 training examples (incl. substitute-frame):

	| Label | P (partial) | R (partial) | F1 (partial) | F1 (exact) |
	|---|---|---|---|---|
	| brand | 0.929 | 0.867 | 0.897 | 0.897 |
	| product category | 0.895 | 0.962 | 0.927 | 0.891 |
	| product name | 0.875 | 0.700 | 0.778 | 0.556 |
	| origin | 1.000 | 0.917 | 0.957 | 0.957 |
	| overall | 0.915 | 0.900 | 0.908 | 0.874 |
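
To make the difference between the exact and partial columns concrete, here is a minimal scoring sketch. It assumes spans are `(label, start, end)` character offsets with an exclusive end, and that "partial" means same label with any character overlap; the evaluation script behind the table may define partial matching differently.

```python
def overlaps(a, b):
    """True if two (label, start, end) spans share a label and at least one character."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def prf(gold, pred, partial=False):
    """Span-level precision, recall and F1 under exact or partial matching."""
    if partial:
        matched_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
        matched_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    else:
        gold_set, pred_set = set(gold), set(pred)
        matched_pred = sum(p in gold_set for p in pred)
        matched_gold = sum(g in pred_set for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Query: "ecover washing up liquid"
gold = [("creator", 0, 6), ("core_product_type", 7, 24)]
pred = [("creator", 0, 6), ("core_product_type", 15, 24)]  # model found only "up liquid"

print(prf(gold, pred))                # (0.5, 0.5, 0.5)
print(prf(gold, pred, partial=True))  # (1.0, 1.0, 1.0)
```

A truncated span counts as a miss under exact matching but as a hit under partial matching, which is consistent with the gap between the two F1 columns for product name.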

	Key remaining gaps:
	- `Dr. Bronner's` apostrophe: tokenizer splits `'` → span predicted as `"dr. bronner ' s"`. Needs pre-tokenization normalization.
	- Ecover brand FN (4 fixtures): underrepresented in training vocabulary; missed even in substitute-frame context.
	- German origin `Deutschland` not recognized — training uses English country names only.
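	- Umlaut span mismatch: `Spülmittel` becomes `spulmittel` because the uncased BERT tokenizer lowercases and strips accents before WordPiece, so the predicted span text no longer matches the original.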
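
For the apostrophe and umlaut mismatches, one possible workaround (an alternative to the pre-tokenization normalization suggested above, not something the model card's pipeline is known to do) is to ignore the normalised `word` field and slice the original query with the `start`/`end` character offsets that the token-classification pipeline returns when a fast tokenizer is used. The entity dicts below are hand-written for illustration and are not real model output.

```python
def surface_spans(query, entities):
    """Recover each entity's original surface text from its character offsets."""
    return [(e["entity_group"], query[e["start"]:e["end"]]) for e in entities]


query = "Dr. Bronner's Spülmittel"

# Hypothetical pipeline-style output: `word` is the tokenizer-normalised text.
entities = [
    {"entity_group": "creator", "word": "dr. bronner ' s", "start": 0, "end": 13},
    {"entity_group": "core_product_type", "word": "spulmittel", "start": 14, "end": 24},
]

spans = surface_spans(query, entities)
print(spans)  # [('creator', "Dr. Bronner's"), ('core_product_type', 'Spülmittel')]
```

This only helps when the model already predicts the right boundaries; it does not fix missed entities such as `Deutschland`.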

	## Limitations

	- Extraction patterns are primarily English; avoidance frames in other languages (`ohne`, `sans`, `senza`) are not NER targets — they are handled by a separate parser
	- Multilingual product names are included in training, but evaluation is mostly English: 24 of the 63 held-out fixtures are substitute-frame / multilingual
	- Origin recognition covers ~13 European countries drawn from product records; global coverage is partial
	- Barcode and price extraction are not NER tasks — handled by a dedicated parser

	## Citation

	If you use this model, please cite the base model:

	```
	@misc{queryner,
	author = {Palen-Michel, Chester and Liang, Lizzie and Wu, Zhe and Lignos, Constantine},
	title = {QueryNER: Segmentation of E-commerce Queries},
	year = {2024},
	publisher = {HuggingFace},
	url = {https://huggingface.co/bltlab/queryner-bert-base-uncased}
	}
	```