# EcomBert_NER_V1
## Model description
EcomBert_NER_V1 is a span-based Named Entity Recognition (NER) model built on top of a BERT encoder with a GlobalPointer-style span classification head.
This repository exports and loads the model using a lightweight HuggingFace-style folder layout:
- `config.json`
- `pytorch_model.bin`
- tokenizer files saved by `transformers.AutoTokenizer.save_pretrained(...)`
Parameter size: ~0.4B parameters (as configured/reported for this model card).
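A GlobalPointer-style head scores every (start, end) token pair for each label, producing a `(num_labels, seq_len, seq_len)` logit grid per example. The following is a minimal illustrative sketch of such a head; the repository's actual implementation may differ (e.g., it may add rotary position embeddings and mask the lower triangle):

```python
import torch
import torch.nn as nn


class GlobalPointerHead(nn.Module):
    """Minimal GlobalPointer-style span scorer (illustrative sketch).

    For each label c, span (i, j) is scored as a scaled dot product
    between a per-label query at token i and key at token j.
    """

    def __init__(self, hidden_size: int, num_labels: int, head_dim: int = 64):
        super().__init__()
        self.num_labels = num_labels
        self.head_dim = head_dim
        # One query/key projection pair per label, computed in a single Linear.
        self.proj = nn.Linear(hidden_size, num_labels * head_dim * 2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        b, seq_len, _ = hidden_states.shape
        qk = self.proj(hidden_states)  # (b, L, C * 2 * d)
        qk = qk.view(b, seq_len, self.num_labels, 2, self.head_dim)
        q, k = qk[..., 0, :], qk[..., 1, :]  # each (b, L, C, d)
        # Score span (i, j) for label c as <q_i, k_j> / sqrt(d).
        logits = torch.einsum("bmcd,bncd->bcmn", q, k) / self.head_dim ** 0.5
        return logits  # (b, C, L, L)
```

Thresholding the sigmoid of these logits yields the `(label, start, end)` span predictions described below.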
## Intended uses & limitations
### Intended uses
- Extracting entity spans from short-to-medium English texts (e.g., product titles, user queries, support tickets).
- Offline batch inference and evaluation.
### Limitations
- This is a span-scoring model: it predicts `(label, start, end)` spans. Overlapping spans are possible.
- Output quality depends heavily on:
  - the training dataset schema and label definitions
  - the decision threshold (`threshold`)
  - tokenization behavior (subword boundaries)
- Long inputs will be truncated to `max_length`.
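The threshold's effect can be seen directly on sigmoid scores. A toy illustration with synthetic logits (not model output):

```python
import torch

# Toy span logits for one label over two tokens, shape (1, 2, 2).
logits = torch.tensor([[[2.0, -3.0],
                        [-3.0, 0.5]]])
probs = torch.sigmoid(logits)  # sigmoid(2.0) ~ 0.88, sigmoid(0.5) ~ 0.62

for threshold in (0.5, 0.8):
    hits = (probs > threshold).nonzero(as_tuple=False)
    print(f"threshold={threshold}: {hits.shape[0]} span(s)")
# A threshold of 0.5 keeps both candidate spans; 0.8 keeps only the confident one.
```

Raising the threshold trades recall for precision, so it should be tuned on held-out data rather than left at the default.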
## How to use
### 1) Train and export
During training, the best checkpoint is exported to a HuggingFace-style directory (by default `checkpoints/hf_export`).
Example:
```bash
python train.py \
  --splits_dir ./data2/splits \
  --output_dir checkpoints \
  --model_name bert-base-chinese \
  --hf_export_dir hf_export
```
This produces:

- `checkpoints/hf_export/config.json`
- `checkpoints/hf_export/pytorch_model.bin`
- `checkpoints/hf_export/tokenizer.*`
### 2) Inference (CLI)
```bash
python infer.py \
  --model_dir checkpoints/hf_export \
  --text "Apple released a new iPhone in California."
```
You can optionally override the threshold:
```bash
python infer.py \
  --model_dir checkpoints/hf_export \
  --text "Apple released a new iPhone in California." \
  --threshold 0.55
```
### 3) Inference (Python)
```python
import torch
from transformers import AutoTokenizer

from model import EcomBertNER  # repo-local module

model_dir = "checkpoints/hf_export"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model, cfg = EcomBertNER.from_pretrained(model_dir, device=device)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

text = "Apple released a new iPhone in California."
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
input_ids = enc["input_ids"].to(device)
attention_mask = enc["attention_mask"].to(device)

o = model(input_ids=input_ids, attention_mask=attention_mask)
logits = o["logits"][0]        # span logits, shape (C, L, L)
probs = torch.sigmoid(logits)  # independent per-span probabilities

threshold = float(cfg.get("threshold", 0.5))
hits = (probs > threshold).nonzero(as_tuple=False)  # rows of (label, start, end)
print(hits[:10])
```
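Each row of `hits` is a `(label_id, start_token, end_token)` triple in token space. One way to map these back to text is via the tokenizer's offset mapping; a minimal sketch, assuming an `id2label` mapping is available (its exact location in this repo's config is an assumption):

```python
def decode_spans(hits, offsets, text, id2label):
    """Map (label_id, start_token, end_token) triples to (label, surface) pairs.

    `offsets` is the tokenizer's offset_mapping for one sequence: a list of
    (char_start, char_end) pairs, one per token.
    """
    entities = []
    for c, i, j in hits:
        if i > j:
            continue  # end before start: ignore lower-triangle hits
        start, end = offsets[i][0], offsets[j][1]
        if start == end:
            continue  # special tokens map to empty character ranges
        entities.append((id2label.get(c, str(c)), text[start:end]))
    return entities


# Synthetic example: hand-written offsets for "Apple released a new iPhone",
# with empty (0, 0) entries standing in for [CLS]/[SEP].
text = "Apple released a new iPhone"
offsets = [(0, 0), (0, 5), (6, 14), (15, 16), (17, 20), (21, 27), (0, 0)]
hits = [(0, 1, 1), (1, 5, 5)]
print(decode_spans(hits, offsets, text, {0: "BRAND", 1: "MAIN_PRODUCT"}))
# → [('BRAND', 'Apple'), ('MAIN_PRODUCT', 'iPhone')]
```

With real model output, pass `hits.tolist()`, `enc["offset_mapping"][0].tolist()`, and the original `text` instead of the synthetic values.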
## Few-shot examples
The model predicts spans over the following 23 labels:
| Label | Description |
|---|---|
| MAIN_PRODUCT | Primary product being searched/described |
| SUB_PRODUCT | Secondary / accessory product |
| BRAND | Brand name |
| MODEL | Model number or name |
| IP | IP / licensed character / franchise |
| MATERIAL | Material composition |
| COLOR | Color attribute |
| SHAPE | Shape attribute |
| PATTERN | Pattern or print |
| STYLE | Style descriptor |
| FUNCTION | Function or use-case |
| ATTRIBUTE | Other product attribute |
| COMPATIBILITY | Compatible device / platform |
| CROWD | Target audience |
| OCCASION | Use occasion or scene |
| LOCATION | Geographic / location reference |
| MEASUREMENT | Size, dimension, capacity |
| TIME | Time reference |
| QUANTITY | Count or amount |
| SALE | Promotion or sale information |
| SHOP | Shop or seller name |
| CONJ | Conjunction linking entities |
| PREP | Preposition linking entities |
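For programmatic use, the table above can be turned into label/id mappings. The ordering below simply follows the table and is illustrative; the repository's actual label order is defined by its dataset schema:

```python
# The 23 labels, in table order (ordering is an assumption for illustration).
LABELS = [
    "MAIN_PRODUCT", "SUB_PRODUCT", "BRAND", "MODEL", "IP", "MATERIAL",
    "COLOR", "SHAPE", "PATTERN", "STYLE", "FUNCTION", "ATTRIBUTE",
    "COMPATIBILITY", "CROWD", "OCCASION", "LOCATION", "MEASUREMENT",
    "TIME", "QUANTITY", "SALE", "SHOP", "CONJ", "PREP",
]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}
assert len(LABELS) == 23  # matches the number of rows in the table above
```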
### Example 1
Input:
"Nike running shoes for men, breathable mesh upper, size 42"
Expected entities:
- BRAND: "Nike"
- MAIN_PRODUCT: "running shoes"
- CROWD: "men"
- MATERIAL: "breathable mesh"
- MEASUREMENT: "size 42"
### Example 2
Input:
"iPhone 15 Pro compatible leather case, black, for outdoor use"
Expected entities:
- COMPATIBILITY: "iPhone 15 Pro"
- MAIN_PRODUCT: "leather case"
- MATERIAL: "leather"
- COLOR: "black"
- OCCASION: "outdoor use"
### Example 3
Input:
"Disney Mickey pattern kids cotton pajamas, 3-piece set, buy 2 get 1 free"
Expected entities:
- IP: "Disney Mickey"
- PATTERN: "Mickey pattern"
- CROWD: "kids"
- MATERIAL: "cotton"
- MAIN_PRODUCT: "pajamas"
- QUANTITY: "3-piece set"
- SALE: "buy 2 get 1 free"

Note that IP and PATTERN overlap on "Mickey": the span-based head permits overlapping predictions.
## Training data
Not provided in this repository model card.
## Evaluation
This repository includes `evaluate.py` for evaluating `.pt` checkpoints produced during training.
## Environmental impact
Not measured.
## Citation
If you use this work, consider citing your dataset and the BERT/Transformer literature relevant to your setup.