---
license: apache-2.0
language:
- en
pipeline_tag: token-classification
tags:
- named-entity-recognition
- ner
- span-ner
- globalpointer
- pytorch
library_name: transformers
model_name: EcomBert_NER_V1
---

# EcomBert_NER_V1

## Model description

`EcomBert_NER_V1` is a span-based Named Entity Recognition (NER) model built on top of a BERT encoder with a GlobalPointer-style span classification head.

This repository exports and loads the model using a lightweight HuggingFace-style folder layout:

- `config.json`
- `pytorch_model.bin`
- tokenizer files saved by `transformers.AutoTokenizer.save_pretrained(...)`

**Parameter size**: ~0.4B parameters (as configured/reported for this model card).

## Intended uses & limitations

### Intended uses

- Extracting entity spans from short-to-medium English texts (e.g., product titles, user queries, support tickets).
- Offline batch inference and evaluation.

### Limitations

- This is a span-scoring model: it predicts `(label, start, end)` spans. Overlapping spans are possible.
- Output quality depends heavily on:
  - the training dataset schema and label definitions
  - the decision threshold (`threshold`)
  - tokenization behavior (subword boundaries)
- Long inputs will be truncated to `max_length`.
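
Because overlapping spans can be emitted, downstream code usually needs a resolution policy. A minimal sketch (not part of this repository) of a greedy score-based filter that keeps the highest-scoring span among any overlapping group:

```python
def filter_overlaps(spans):
    """Greedy non-overlap filter: keep higher-scoring spans first.

    `spans` is a list of (label, start, end, score) tuples with inclusive
    token boundaries, as produced by thresholding the (C, L, L) score grid.
    """
    kept = []
    for label, start, end, score in sorted(spans, key=lambda s: -s[3]):
        # Keep the span only if it does not overlap anything already kept.
        if all(end < ks or start > ke for _, ks, ke, _ in kept):
            kept.append((label, start, end, score))
    return sorted(kept, key=lambda s: s[1])


spans = [
    ("MAIN_PRODUCT", 3, 4, 0.91),
    ("MATERIAL", 3, 3, 0.85),  # nested inside the span above, lower score
    ("COLOR", 6, 6, 0.78),
]
print(filter_overlaps(spans))
# [('MAIN_PRODUCT', 3, 4, 0.91), ('COLOR', 6, 6, 0.78)]
```

Whether to drop, keep, or merge nested spans depends on your downstream schema; this is only one possible policy.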

## How to use

### 1) Train and export

During training, the best checkpoint is exported to a HuggingFace-style directory (by default `checkpoints/hf_export`).

Example:

```bash
python train.py \
  --splits_dir ./data2/splits \
  --output_dir checkpoints \
  --model_name bert-base-chinese \
  --hf_export_dir hf_export
```

This produces:

- `checkpoints/hf_export/config.json`
- `checkpoints/hf_export/pytorch_model.bin`
- `checkpoints/hf_export/tokenizer.*`
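
Before loading, it can be worth sanity-checking that an export directory is complete. A small helper sketch, assuming only the weight and config files listed above are strictly required (tokenizer filenames vary by tokenizer type):

```python
import os

REQUIRED_FILES = ("config.json", "pytorch_model.bin")


def missing_export_files(model_dir):
    """Return the required files that are absent from an export directory."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(model_dir, name))]


missing = missing_export_files("checkpoints/hf_export")
if missing:
    print(f"export incomplete, missing: {missing}")
```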

### 2) Inference (CLI)

```bash
python infer.py \
  --model_dir checkpoints/hf_export \
  --text "Apple released a new iPhone in California."
```

You can optionally override the decision threshold:

```bash
python infer.py \
  --model_dir checkpoints/hf_export \
  --text "Apple released a new iPhone in California." \
  --threshold 0.55
```
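
The threshold trades recall for precision: lowering it surfaces more candidate spans, raising it keeps only higher-confidence ones. A toy illustration with made-up probabilities (not real model output):

```python
# Hypothetical span scores: (label, start, end) -> sigmoid probability.
span_probs = {
    ("BRAND", 0, 0): 0.92,
    ("COLOR", 5, 5): 0.57,
    ("STYLE", 3, 4): 0.51,
}


def spans_above(probs, threshold):
    """Keep only spans whose score exceeds the decision threshold."""
    return {span for span, p in probs.items() if p > threshold}


print(len(spans_above(span_probs, 0.50)))  # 3 spans survive
print(len(spans_above(span_probs, 0.55)))  # only 2 remain
```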

### 3) Inference (Python)

```python
import torch
from transformers import AutoTokenizer

from model import EcomBertNER

model_dir = "checkpoints/hf_export"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, cfg = EcomBertNER.from_pretrained(model_dir, device=device)

tokenizer = AutoTokenizer.from_pretrained(model_dir)
text = "Apple released a new iPhone in California."

enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
input_ids = enc["input_ids"].to(device)
attention_mask = enc["attention_mask"].to(device)

out = model(input_ids=input_ids, attention_mask=attention_mask)
logits = out["logits"][0]  # span score grid of shape (C, L, L)
probs = torch.sigmoid(logits)
threshold = float(cfg.get("threshold", 0.5))

# Each hit is a (class, start_token, end_token) index triple.
hits = (probs > threshold).nonzero(as_tuple=False)
print(hits[:10])
```
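
The index triples in `hits` can be mapped back to surface text using the tokenizer's `offset_mapping`. A sketch of such a decoder; the `id2label` mapping is assumed to come from `cfg` or your label list, and dummy offsets are used here so the snippet runs standalone:

```python
def decode_spans(hits, offsets, text, id2label):
    """Turn (class, start_token, end_token) triples into (label, surface) pairs.

    `offsets` is one sequence's offset_mapping; special tokens map to (0, 0)
    and are skipped because they yield empty character ranges.
    """
    entities = []
    for c, s, e in hits:
        char_start = offsets[s][0]
        char_end = offsets[e][1]
        if char_end > char_start:
            entities.append((id2label[c], text[char_start:char_end]))
    return entities


# Dummy data standing in for real tokenizer/model outputs.
text = "Nike shoes"
offsets = [(0, 0), (0, 4), (5, 10), (0, 0)]  # [CLS], "Nike", "shoes", [SEP]
hits = [(0, 1, 1), (1, 2, 2)]
id2label = {0: "BRAND", 1: "MAIN_PRODUCT"}
print(decode_spans(hits, offsets, text, id2label))
# [('BRAND', 'Nike'), ('MAIN_PRODUCT', 'shoes')]
```

With real outputs, pass `enc["offset_mapping"][0].tolist()` as `offsets` and `hits.tolist()` as the triples.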

## Few-shot examples

The model predicts spans over the following **23 labels**:

| Label | Description |
|---|---|
| `MAIN_PRODUCT` | Primary product being searched/described |
| `SUB_PRODUCT` | Secondary / accessory product |
| `BRAND` | Brand name |
| `MODEL` | Model number or name |
| `IP` | IP / licensed character / franchise |
| `MATERIAL` | Material composition |
| `COLOR` | Color attribute |
| `SHAPE` | Shape attribute |
| `PATTERN` | Pattern or print |
| `STYLE` | Style descriptor |
| `FUNCTION` | Function or use-case |
| `ATTRIBUTE` | Other product attribute |
| `COMPATIBILITY` | Compatible device / platform |
| `CROWD` | Target audience |
| `OCCASION` | Use occasion or scene |
| `LOCATION` | Geographic / location reference |
| `MEASUREMENT` | Size, dimension, capacity |
| `TIME` | Time reference |
| `QUANTITY` | Count or amount |
| `SALE` | Promotion or sale information |
| `SHOP` | Shop or seller name |
| `CONJ` | Conjunction linking entities |
| `PREP` | Preposition linking entities |

---

### Example 1

**Input**:

```
"Nike running shoes for men, breathable mesh upper, size 42"
```

**Expected entities**:

- `BRAND`: "Nike"
- `MAIN_PRODUCT`: "running shoes"
- `CROWD`: "men"
- `MATERIAL`: "breathable mesh"
- `MEASUREMENT`: "size 42"

---

### Example 2

**Input**:

```
"iPhone 15 Pro compatible leather case, black, for outdoor use"
```

**Expected entities**:

- `COMPATIBILITY`: "iPhone 15 Pro"
- `MAIN_PRODUCT`: "leather case"
- `MATERIAL`: "leather"
- `COLOR`: "black"
- `OCCASION`: "outdoor use"

---

### Example 3

**Input**:

```
"Disney Mickey pattern kids cotton pajamas, 3-piece set, buy 2 get 1 free"
```

**Expected entities**:

- `IP`: "Disney Mickey"
- `PATTERN`: "Mickey pattern"
- `CROWD`: "kids"
- `MATERIAL`: "cotton"
- `MAIN_PRODUCT`: "pajamas"
- `QUANTITY`: "3-piece set"
- `SALE`: "buy 2 get 1 free"
|
| | ## Training data |
| |
|
| | Not provided in this repository model card. |
| |
|
| | ## Evaluation |
| |
|
| | This repository includes `evaluate.py` for evaluating `.pt` checkpoints produced during training. |
| |
|
| | ## Environmental impact |
| |
|
| | Not measured. |
| |
|
| | ## Citation |
| |
|
| | If you use this work, consider citing your dataset and the BERT/Transformer literature relevant to your setup. |
| |
|