---
language:
- en
- de
- fr
- it
- es
- nl
- da
- sv
- "no"
- pl
license: apache-2.0
tags:
- token-classification
- ner
- product-search
- query-understanding
base_model: bltlab/queryner-bert-base-uncased
datasets:
- bltlab/queryner
- thepian/eco-products-ner-fixtures
pipeline_tag: token-classification
---
# queryner-eco-ner

Named entity recognition for product search queries. Identifies **brand**, **product category**, **product name**, and **origin** spans in free-text queries.

Fine-tuned from [bltlab/queryner-bert-base-uncased](https://huggingface.co/bltlab/queryner-bert-base-uncased), which was trained on Amazon ESCI queries. This model extends it with domain-specific vocabulary drawn from a European product database: brand names, multilingual product titles, and origin countries.

## Labels

The model predicts the full 17-type label set from the base queryner model. The four types most relevant to product search are:

| Label | HF tag | Example span |
|---|---|---|
| Brand | `B-creator` / `I-creator` | `Ecover`, `Dr. Bronner's` |
| Product category | `B-core_product_type` / `I-core_product_type` | `washing up liquid`, `shampoo` |
| Product name | `B-product_name` / `I-product_name` | `Skin Food`, `Men 48H Deodorant` |
| Origin | `B-origin` / `I-origin` | `Germany`, `Italy` |

All other queryner types (`modifier`, `department`, `UoM`, `color`, `material`, etc.) are preserved from the base model.
## Usage

```python
from transformers import pipeline

ner = pipeline("token-classification", model="thepian/queryner-eco-ner", aggregation_strategy="simple")

# The base model is uncased, so `word` comes back lowercased.
results = ner("Ecover washing up liquid without palm oil")
# [{'entity_group': 'creator', 'word': 'ecover', ...},
#  {'entity_group': 'core_product_type', 'word': 'washing up liquid', ...}]

results = ner("organic olive oil from Italy under €15")
# [{'entity_group': 'core_product_type', 'word': 'olive oil', ...},
#  {'entity_group': 'origin', 'word': 'italy', ...}]
```
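Downstream code usually wants the four product-search types under friendlier names. The following is a minimal sketch of that post-processing step. The helper `group_entities`, the `FRIENDLY` mapping, and the `min_score` threshold are illustrative names, not part of this model or of Transformers, and the sample list is hand-written rather than real model output.

```python
from collections import defaultdict

# Maps queryner tag names to the reader-friendly names used in the Labels table.
FRIENDLY = {
    "creator": "brand",
    "core_product_type": "product_category",
    "product_name": "product_name",
    "origin": "origin",
}

def group_entities(results, min_score=0.5):
    """Collect aggregated pipeline output into {friendly_label: [words]}.

    Entities with other queryner types, or with a score below `min_score`,
    are skipped. `min_score` is an assumed threshold, not a tuned value.
    """
    grouped = defaultdict(list)
    for ent in results:
        label = FRIENDLY.get(ent["entity_group"])
        if label is None or ent.get("score", 1.0) < min_score:
            continue
        grouped[label].append(ent["word"])
    return dict(grouped)

# Hand-written stand-in for pipeline output (not real predictions).
sample = [
    {"entity_group": "creator", "word": "ecover", "score": 0.98},
    {"entity_group": "core_product_type", "word": "washing up liquid", "score": 0.95},
    {"entity_group": "modifier", "word": "without palm oil", "score": 0.80},
]
print(group_entities(sample))
```

The `modifier` entity is dropped here only because it is outside the four-type mapping; extend `FRIENDLY` if the other queryner types are needed.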
## Training data

20,203 examples from three sources:

| Source | Examples | Notes |
|---|---|---|
| [bltlab/queryner](https://huggingface.co/datasets/bltlab/queryner) | 9,140 | Amazon ESCI queries; all 17 label types |
| Local domain fixtures | ~1,063 | Hand-annotated product search queries (incl. substitute-frame fixtures) |
| Synthetic DB fixtures | ~10,000 | Template-generated from brand/category/product vocabulary; includes 1,000 substitute-frame (multilingual) |

Synthetic examples are generated by `generate_db_dataset.py` from a European product database. Brand names come from EU-registered brands; product names are extracted from all language variants stored in `product.name` (en, de, fr, it, es, nl, and others). Product names that are exact matches of English category strings are excluded to avoid contradictory training signal.
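To make the generation step concrete, here is a rough sketch of template-based example building with the category cross-reference filter. It is not the contents of `generate_db_dataset.py`: the function `build_examples`, the single "brand followed by category or product name" template, and the way `product_name_ratio` is interpreted (probability of choosing a product name) are all assumptions made for illustration.

```python
import random

def build_examples(brands, categories, product_names, n,
                   product_name_ratio=0.3, seed=0):
    """Return n (tokens, tags) pairs from one hypothetical template."""
    rng = random.Random(seed)
    category_set = {c.lower() for c in categories}
    # Cross-reference filter: drop product names that exactly match a category.
    names = [p for p in product_names if p.lower() not in category_set]

    def tag(span, label):
        # Whitespace-split the span and assign BIO tags to its words.
        words = span.split()
        return words, [f"B-{label}"] + [f"I-{label}"] * (len(words) - 1)

    examples = []
    for _ in range(n):
        tokens, tags = tag(rng.choice(brands), "creator")
        if names and rng.random() < product_name_ratio:
            more_tokens, more_tags = tag(rng.choice(names), "product_name")
        else:
            more_tokens, more_tags = tag(rng.choice(categories), "core_product_type")
        examples.append((tokens + more_tokens, tags + more_tags))
    return examples

examples = build_examples(
    brands=["Ecover", "Dr. Bronner's"],
    categories=["shampoo", "washing up liquid"],
    product_names=["Shampoo", "Skin Food"],  # "Shampoo" is filtered out
    n=5,
)
for tokens, tags in examples:
    print(list(zip(tokens, tags)))
```

The filter is case-insensitive here; whether the real script compares case-sensitively is not stated in this card.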
| ## Label balance and product name vs category | |
| The two most commonly confused labels are `core_product_type` (product category) and `product_name` | |
| (specific named product). The model's only reliable cue for distinguishing them is positional: | |
| text following a known brand is a candidate for `product_name`, while standalone noun phrases are | |
| typically `core_product_type`. This positional signal is structural, not lexical — "Dove shampoo" | |
| and "Dove Skin Food" look identical to the model at the template level. | |
| ### Why category dominates in training (~2:1 target) | |
| Real product search queries are category-heavy by a large margin. Most users type "shampoo", | |
| "olive oil", or "washing powder", not "Fuji Green Tea Refreshingly Hydrating Conditioner". | |
| Training data should approximate inference-time distribution; over-representing `product_name` | |
| creates a mismatch that degrades category precision on the majority of queries. | |
| The base model (bltlab/queryner-bert-base-uncased) was trained on Amazon ESCI queries, which | |
| are also category-heavy. The marginal value of additional `core_product_type` examples is lower | |
| than the marginal value of `product_name` examples, but collapsing to 1:1 risks the model | |
| labeling any noun phrase after a brand as `product_name` — including generic category words like | |
| "shampoo" or "washing up liquid". | |
| **Current ratio: ~2.3:1 (core_product_type : product_name). Target: ~2:1.** | |
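The ratio can be checked on any BIO-tagged dataset by counting span starts. A minimal sketch follows, assuming each training example is available as a list of tag strings; the helper names `span_counts` and `category_to_name_ratio` are illustrative.

```python
from collections import Counter

def span_counts(tag_sequences):
    """Count spans per label. Every span starts with exactly one B- tag."""
    counts = Counter()
    for tags in tag_sequences:
        for t in tags:
            if t.startswith("B-"):
                counts[t[2:]] += 1
    return counts

def category_to_name_ratio(tag_sequences):
    """Return the core_product_type : product_name span ratio."""
    counts = span_counts(tag_sequences)
    names = counts["product_name"]
    if names == 0:
        return float("inf")
    return counts["core_product_type"] / names

# Toy data: two category spans and one product-name span.
toy = [
    ["B-creator", "B-core_product_type"],
    ["B-core_product_type", "I-core_product_type"],
    ["B-creator", "B-product_name", "I-product_name"],
]
print(category_to_name_ratio(toy))  # 2.0
```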
| ### Why going below 2:1 requires better data, not just more examples | |
| Increasing `product_name` examples without addressing lexical quality introduces contradictory | |
| signal: | |
| - A product named "Shampoo" and a category called "shampoo" become competing labels for the | |
| same string. The model cannot resolve this without knowing whether the token is generic or | |
| specific — information that is not present in the query. | |
| - The category cross-reference filter (dropping product names that are exact English category | |
| matches) addresses the worst cases, but morphological variants ("Shampoos", "Crème") and | |
| multi-language overlaps remain. | |
| To move significantly below 2:1 safely, the `product_name` training data would need to satisfy: | |
| | Requirement | Why | | |
| |---|---| | |
| | Lexically distinct from category vocabulary | Prevents the model learning a single label for identical strings | | |
| | High word-count names (3+ tokens) | Single and two-token product names are indistinguishable from short category slugs by surface form alone | | |
| | Brand diversity | The positional cue (brand precedes product name) only generalises if many different brands are paired with many different product names — a narrow brand set leads to brand-specific memorisation | | |
| | Multilingual coverage proportional to expected query mix | Training on English product names only means the model will underperform on French/German/Italian queries even though multilingual product names exist in the DB | | |
| | Minimal repetition | A product name seen 20 times with the same brand drowns signal from rarer names | | |
Until those conditions are met, `product_name_ratio` should stay at 0.25–0.30, and the
overall 2:1 ratio should be maintained by generating more synthetic examples in total rather
than by raising `product_name_ratio`.
| --- | |
| ## Training procedure | |
| - Base model: `bltlab/queryner-bert-base-uncased` | |
| - Tokenizer: BERT WordPiece; subword tokens after the first in each word are masked (`-100`) | |
| - Max sequence length: 128 | |
| - Label set: collected from training data (all 17 queryner types preserved) | |
| - Optimiser: AdamW, weight decay 0.01, warmup ratio 0.1 | |
| - Segmented training: brand/product/origin first, then certification O-token signal at lower LR | |
| Typical segment configuration: | |
| ``` | |
| Segment 1: epochs=3, lr=3e-5 (base → domain) | |
| Segment 2: epochs=2, lr=1e-5 (add cert O-token signal) | |
| Segment 3: epochs=2, lr=5e-6 (product name ratio increase) | |
| Segment 4: epochs=2, lr=5e-6 (substitute-frame + multilingual, brand F1 0.698 → 0.897) | |
| ``` | |
| ## Evaluation | |
| Evaluated on 63 held-out domain fixtures (39 general + 24 substitute-frame / multilingual) with exact and partial span matching. | |
| **Segment 4** — 2 epochs, lr=5e-6, base=segment 3 checkpoint, 20,203 training examples (incl. substitute-frame): | |
| | Label | P (partial) | R (partial) | F1 (partial) | F1 (exact) | | |
| |---|---|---|---|---| | |
| | brand | 0.929 | 0.867 | **0.897** | **0.897** | | |
| | product category | 0.895 | 0.962 | **0.927** | 0.891 | | |
| | product name | 0.875 | 0.700 | 0.778 | 0.556 | | |
| | origin | 1.000 | 0.917 | **0.957** | **0.957** | | |
| | **overall** | **0.915** | **0.900** | **0.908** | 0.874 | | |
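The difference between the exact and partial columns comes down to the span-matching rule. The sketch below shows one common way to compute both. It assumes spans are `(start, end, label)` tuples with an exclusive end, and it treats any character overlap with the same label as a partial match; the evaluation script behind the table may define partial matching differently.

```python
def overlaps(a, b):
    """True if two spans share a label and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

def span_prf(gold, pred, partial=False):
    """Precision, recall and F1 over (start, end, label) spans."""
    match = overlaps if partial else (lambda a, b: a == b)
    # A predicted span counts toward precision if it matches any gold span;
    # a gold span counts toward recall if any predicted span matches it.
    hit_pred = sum(any(match(p, g) for g in gold) for p in pred)
    hit_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# "Ecover washing up liquid": the prediction truncates the category span.
gold = [(0, 6, "creator"), (7, 24, "core_product_type")]
pred = [(0, 6, "creator"), (15, 24, "core_product_type")]
print(span_prf(gold, pred))                # (0.5, 0.5, 0.5)
print(span_prf(gold, pred, partial=True))  # (1.0, 1.0, 1.0)
```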
| Key remaining gaps: | |
- `Dr. Bronner's` apostrophe: the tokenizer splits on `'`, so the span is predicted as `"dr. bronner ' s"`. Needs pre-tokenization normalization.
- Ecover brand FN (4 fixtures): underrepresented in training vocabulary; missed even in substitute-frame context.
- German origin `Deutschland` not recognized: training uses English country names only.
- Umlaut span mismatch: `Spülmittel` comes back as `spulmittel` because the uncased BERT tokenizer lowercases and strips accents before WordPiece is applied.
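One possible workaround for the apostrophe and umlaut mismatches, offered as an alternative to the pre-tokenization normalization mentioned above rather than as the planned fix, is to compare spans by character offsets instead of by the detokenized `word`. The sketch assumes the pipeline output carries `start` and `end` offsets into the original query, which the Transformers token-classification pipeline provides when a fast tokenizer is used. The helper names are illustrative and the entities below are hand-written.

```python
import unicodedata

def strip_accents_lower(text):
    """Approximate the normalization applied by uncased BERT tokenizers."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks (category Mn), e.g. the diaeresis on "ü".
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def surface_form(query, entity):
    """Return the original substring for an entity, else fall back to `word`."""
    start, end = entity.get("start"), entity.get("end")
    if start is None or end is None:
        return entity["word"]
    return query[start:end]

query = "Dr. Bronner's Spülmittel"
brand_end = len("Dr. Bronner's")
cat_start = query.index("Spülmittel")
entities = [  # hand-written stand-ins for pipeline output
    {"entity_group": "creator", "word": "dr. bronner ' s",
     "start": 0, "end": brand_end},
    {"entity_group": "core_product_type", "word": "spulmittel",
     "start": cat_start, "end": len(query)},
]
for ent in entities:
    print(ent["word"], "->", surface_form(query, ent))
print(strip_accents_lower("Spülmittel"))  # spulmittel
```

Comparing `surface_form` output against gold spans sidesteps tokenizer normalization entirely; `strip_accents_lower` is only needed if gold and predicted strings must be compared in normalized form.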
| ## Limitations | |
| - Extraction patterns are primarily English; avoidance frames in other languages (`ohne`, `sans`, `senza`) are not NER targets — they are handled by a separate parser | |
| - Multilingual product names are included in training but evaluation is English-only | |
| - Origin recognition covers ~13 European countries drawn from product records; global coverage is partial | |
| - Barcode and price extraction are not NER tasks — handled by a dedicated parser | |
| ## Citation | |
| If you use this model, please cite the base model: | |
| ``` | |
| @misc{queryner, | |
author = {Palen-Michel, Chester and Liang, Lizzie and Wu, Zhe and Lignos, Constantine},
title = {QueryNER: Segmentation of E-commerce Queries},
| year = {2024}, | |
| publisher = {HuggingFace}, | |
| url = {https://huggingface.co/bltlab/queryner-bert-base-uncased} | |
| } | |
| ``` | |