---
license: apache-2.0
language:
- en
pipeline_tag: token-classification
tags:
- named-entity-recognition
- ner
- span-ner
- globalpointer
- pytorch
library_name: transformers
model_name: EcomBert_NER_V1
---

# EcomBert_NER_V1

## Model description

`EcomBert_NER_V1` is a span-based Named Entity Recognition (NER) model built on a BERT encoder with a GlobalPointer-style span classification head: every candidate `(label, start, end)` token span receives a score, which allows nested and overlapping entities to be recovered.

This repository exports and loads the model using a lightweight HuggingFace-style folder layout:

- `config.json`
- `pytorch_model.bin`
- tokenizer files saved by `transformers.AutoTokenizer.save_pretrained(...)`

**Parameter size**: ~0.4B parameters (as configured/reported for this model card).

## Intended uses & limitations

### Intended uses

- Extracting entity spans from short-to-medium English texts (e.g., product titles, user queries, support tickets).
- Offline batch inference and evaluation.

### Limitations

- This is a span-scoring model: it predicts `(label, start, end)` spans, so overlapping spans are possible.
- Output quality depends heavily on:
  - the training dataset schema and label definitions
  - the decision threshold (`threshold`)
  - tokenization behavior (subword boundaries)
- Long inputs will be truncated to `max_length`.

## How to use

### 1) Train and export

During training, the best checkpoint is exported to a HuggingFace-style directory (`checkpoints/hf_export` by default). Example:

```bash
python train.py \
  --splits_dir ./data2/splits \
  --output_dir checkpoints \
  --model_name bert-base-chinese \
  --hf_export_dir hf_export
```

This produces:

- `checkpoints/hf_export/config.json`
- `checkpoints/hf_export/pytorch_model.bin`
- `checkpoints/hf_export/tokenizer.*`

### 2) Inference (CLI)

```bash
python infer.py \
  --model_dir checkpoints/hf_export \
  --text "Apple released a new iPhone in California."
```

You can optionally override the threshold:

```bash
python infer.py \
  --model_dir checkpoints/hf_export \
  --text "Apple released a new iPhone in California." \
  --threshold 0.55
```

### 3) Inference (Python)

```python
import torch
from transformers import AutoTokenizer

from model import EcomBertNER

model_dir = "checkpoints/hf_export"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model, cfg = EcomBertNER.from_pretrained(model_dir, device=device)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

text = "Apple released a new iPhone in California."
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
input_ids = enc["input_ids"].to(device)
attention_mask = enc["attention_mask"].to(device)

o = model(input_ids=input_ids, attention_mask=attention_mask)
logits = o["logits"][0]   # (C, L, L): one score per (label, start, end)
probs = torch.sigmoid(logits)

threshold = float(cfg.get("threshold", 0.5))
hits = (probs > threshold).nonzero(as_tuple=False)  # rows of (label_id, start, end)

# Map token-level spans back to character offsets in the original text.
offsets = enc["offset_mapping"][0].tolist()
for label_id, start, end in hits.tolist():
    char_start, char_end = offsets[start][0], offsets[end][1]
    print(label_id, text[char_start:char_end])
```

## Few-shot examples

The model predicts spans over the following **23 labels**:

| Label | Description |
|---|---|
| `MAIN_PRODUCT` | Primary product being searched/described |
| `SUB_PRODUCT` | Secondary / accessory product |
| `BRAND` | Brand name |
| `MODEL` | Model number or name |
| `IP` | IP / licensed character / franchise |
| `MATERIAL` | Material composition |
| `COLOR` | Color attribute |
| `SHAPE` | Shape attribute |
| `PATTERN` | Pattern or print |
| `STYLE` | Style descriptor |
| `FUNCTION` | Function or use-case |
| `ATTRIBUTE` | Other product attribute |
| `COMPATIBILITY` | Compatible device / platform |
| `CROWD` | Target audience |
| `OCCASION` | Use occasion or scene |
| `LOCATION` | Geographic / location reference |
| `MEASUREMENT` | Size, dimension, capacity |
| `TIME` | Time reference |
| `QUANTITY` | Count or amount |
| `SALE` | Promotion or sale information |
| `SHOP` | Shop or seller name |
| `CONJ` | Conjunction linking entities |
| `PREP` | Preposition linking entities |

---

### Example 1

**Input**:

```
"Nike running shoes for men, breathable mesh upper, size 42"
```

**Expected entities**:

- `BRAND`: "Nike"
- `MAIN_PRODUCT`: "running shoes"
- `CROWD`: "men"
- `MATERIAL`: "breathable mesh"
- `MEASUREMENT`: "size 42"

---

### Example 2

**Input**:

```
"iPhone 15 Pro compatible leather case, black, for outdoor use"
```

**Expected entities**:

- `COMPATIBILITY`: "iPhone 15 Pro"
- `MAIN_PRODUCT`: "leather case"
- `MATERIAL`: "leather"
- `COLOR`: "black"
- `OCCASION`: "outdoor use"

---

### Example 3

**Input**:

```
"Disney Mickey pattern kids cotton pajamas, 3-piece set, buy 2 get 1 free"
```

**Expected entities**:

- `IP`: "Disney Mickey"
- `PATTERN`: "Mickey pattern"
- `CROWD`: "kids"
- `MATERIAL`: "cotton"
- `MAIN_PRODUCT`: "pajamas"
- `QUANTITY`: "3-piece set"
- `SALE`: "buy 2 get 1 free"

## Training data

Not provided in this repository model card.

## Evaluation

This repository includes `evaluate.py` for evaluating `.pt` checkpoints produced during training.

## Environmental impact

Not measured.

## Citation

If you use this work, consider citing your dataset and the BERT/Transformer literature relevant to your setup.