# EcomBert_NER_V1
## Model description
EcomBert_NER_V1 is a span-based Named Entity Recognition (NER) model built on top of a BERT encoder with a GlobalPointer-style span classification head.
This repository exports and loads the model using a lightweight HuggingFace-style folder layout:
- `config.json`
- `pytorch_model.bin`
- tokenizer files saved by `transformers.AutoTokenizer.save_pretrained(...)`
Parameter size: ~0.4B parameters (as configured/reported for this model card).
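A GlobalPointer-style head scores every (start, end) token pair for each label, producing a `(num_labels, seq_len, seq_len)` logit grid per example. The following is a minimal illustrative sketch of such a head; the repository's actual implementation may differ (e.g., it may add rotary position embeddings and mask the lower triangle):

```python
import torch
import torch.nn as nn


class GlobalPointerHead(nn.Module):
    """Minimal GlobalPointer-style span scorer (illustrative sketch).

    For each label c, span (i, j) is scored as a scaled dot product
    between a per-label query at token i and key at token j.
    """

    def __init__(self, hidden_size: int, num_labels: int, head_dim: int = 64):
        super().__init__()
        self.num_labels = num_labels
        self.head_dim = head_dim
        # One query/key projection pair per label, computed in a single Linear.
        self.proj = nn.Linear(hidden_size, num_labels * head_dim * 2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        b, seq_len, _ = hidden_states.shape
        qk = self.proj(hidden_states)  # (b, L, C * 2 * d)
        qk = qk.view(b, seq_len, self.num_labels, 2, self.head_dim)
        q, k = qk[..., 0, :], qk[..., 1, :]  # each (b, L, C, d)
        # Score span (i, j) for label c as <q_i, k_j> / sqrt(d).
        logits = torch.einsum("bmcd,bncd->bcmn", q, k) / self.head_dim ** 0.5
        return logits  # (b, C, L, L)
```

Thresholding the sigmoid of these logits yields the `(label, start, end)` span predictions described below.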
## Intended uses & limitations
### Intended uses
- Extracting entity spans from short-to-medium English texts (e.g., product titles, user queries, support tickets).
- Offline batch inference and evaluation.
### Limitations
- This is a span-scoring model: it predicts `(label, start, end)` spans. Overlapping spans are possible.
- Output quality depends heavily on:
  - the training dataset schema and label definitions
  - the decision threshold (`threshold`)
  - tokenization behavior (subword boundaries)
- Long inputs will be truncated to `max_length`.
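The threshold's effect can be seen directly on sigmoid scores. A toy illustration with synthetic logits (not model output):

```python
import torch

# Toy span logits for one label over two tokens, shape (1, 2, 2).
logits = torch.tensor([[[2.0, -3.0],
                        [-3.0, 0.5]]])
probs = torch.sigmoid(logits)  # sigmoid(2.0) ~ 0.88, sigmoid(0.5) ~ 0.62

for threshold in (0.5, 0.8):
    hits = (probs > threshold).nonzero(as_tuple=False)
    print(f"threshold={threshold}: {hits.shape[0]} span(s)")
# A threshold of 0.5 keeps both candidate spans; 0.8 keeps only the confident one.
```

Raising the threshold trades recall for precision, so it should be tuned on held-out data rather than left at the default.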
## How to use
### 1) Train and export
During training, the best checkpoint is exported to a HuggingFace-style directory (by default `checkpoints/hf_export`).
Example:
```bash
python train.py \
  --splits_dir ./data2/splits \
  --output_dir checkpoints \
  --model_name bert-base-chinese \
  --hf_export_dir hf_export
```
This produces:

- `checkpoints/hf_export/config.json`
- `checkpoints/hf_export/pytorch_model.bin`
- `checkpoints/hf_export/tokenizer.*`
### 2) Inference (CLI)
```bash
python infer.py \
  --model_dir checkpoints/hf_export \
  --text "Apple released a new iPhone in California."
```
You can optionally override the threshold:
```bash
python infer.py \
  --model_dir checkpoints/hf_export \
  --text "Apple released a new iPhone in California." \
  --threshold 0.55
```
### 3) Inference (Python)
```python
import torch
from transformers import AutoTokenizer

from model import EcomBertNER  # repo-local module

model_dir = "checkpoints/hf_export"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model, cfg = EcomBertNER.from_pretrained(model_dir, device=device)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

text = "Apple released a new iPhone in California."
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
input_ids = enc["input_ids"].to(device)
attention_mask = enc["attention_mask"].to(device)

o = model(input_ids=input_ids, attention_mask=attention_mask)
logits = o["logits"][0]        # span logits, shape (C, L, L)
probs = torch.sigmoid(logits)  # independent per-span probabilities

threshold = float(cfg.get("threshold", 0.5))
hits = (probs > threshold).nonzero(as_tuple=False)  # rows of (label, start, end)
print(hits[:10])
```
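Each row of `hits` is a `(label_id, start_token, end_token)` triple in token space. One way to map these back to text is via the tokenizer's offset mapping; a minimal sketch, assuming an `id2label` mapping is available (its exact location in this repo's config is an assumption):

```python
def decode_spans(hits, offsets, text, id2label):
    """Map (label_id, start_token, end_token) triples to (label, surface) pairs.

    `offsets` is the tokenizer's offset_mapping for one sequence: a list of
    (char_start, char_end) pairs, one per token.
    """
    entities = []
    for c, i, j in hits:
        if i > j:
            continue  # end before start: ignore lower-triangle hits
        start, end = offsets[i][0], offsets[j][1]
        if start == end:
            continue  # special tokens map to empty character ranges
        entities.append((id2label.get(c, str(c)), text[start:end]))
    return entities


# Synthetic example: hand-written offsets for "Apple released a new iPhone",
# with empty (0, 0) entries standing in for [CLS]/[SEP].
text = "Apple released a new iPhone"
offsets = [(0, 0), (0, 5), (6, 14), (15, 16), (17, 20), (21, 27), (0, 0)]
hits = [(0, 1, 1), (1, 5, 5)]
print(decode_spans(hits, offsets, text, {0: "BRAND", 1: "MAIN_PRODUCT"}))
# → [('BRAND', 'Apple'), ('MAIN_PRODUCT', 'iPhone')]
```

With real model output, pass `hits.tolist()`, `enc["offset_mapping"][0].tolist()`, and the original `text` instead of the synthetic values.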
## Few-shot examples
The model predicts spans over the following 23 labels:
| Label | Description |
|---|---|
| MAIN_PRODUCT | Primary product being searched/described |
| SUB_PRODUCT | Secondary / accessory product |
| BRAND | Brand name |
| MODEL | Model number or name |
| IP | IP / licensed character / franchise |
| MATERIAL | Material composition |
| COLOR | Color attribute |
| SHAPE | Shape attribute |
| PATTERN | Pattern or print |
| STYLE | Style descriptor |
| FUNCTION | Function or use-case |
| ATTRIBUTE | Other product attribute |
| COMPATIBILITY | Compatible device / platform |
| CROWD | Target audience |
| OCCASION | Use occasion or scene |
| LOCATION | Geographic / location reference |
| MEASUREMENT | Size, dimension, capacity |
| TIME | Time reference |
| QUANTITY | Count or amount |
| SALE | Promotion or sale information |
| SHOP | Shop or seller name |
| CONJ | Conjunction linking entities |
| PREP | Preposition linking entities |
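For programmatic use, the table above can be turned into label/id mappings. The ordering below simply follows the table and is illustrative; the repository's actual label order is defined by its dataset schema:

```python
# The 23 labels, in table order (ordering is an assumption for illustration).
LABELS = [
    "MAIN_PRODUCT", "SUB_PRODUCT", "BRAND", "MODEL", "IP", "MATERIAL",
    "COLOR", "SHAPE", "PATTERN", "STYLE", "FUNCTION", "ATTRIBUTE",
    "COMPATIBILITY", "CROWD", "OCCASION", "LOCATION", "MEASUREMENT",
    "TIME", "QUANTITY", "SALE", "SHOP", "CONJ", "PREP",
]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}
assert len(LABELS) == 23  # matches the number of rows in the table above
```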
### Example 1
Input:
"Nike running shoes for men, breathable mesh upper, size 42"
Expected entities:
- BRAND: "Nike"
- MAIN_PRODUCT: "running shoes"
- CROWD: "men"
- MATERIAL: "breathable mesh"
- MEASUREMENT: "size 42"
### Example 2
Input:
"iPhone 15 Pro compatible leather case, black, for outdoor use"
Expected entities:
- COMPATIBILITY: "iPhone 15 Pro"
- MAIN_PRODUCT: "leather case"
- MATERIAL: "leather"
- COLOR: "black"
- OCCASION: "outdoor use"
### Example 3
Input:
"Disney Mickey pattern kids cotton pajamas, 3-piece set, buy 2 get 1 free"
Expected entities:
- IP: "Disney Mickey"
- PATTERN: "Mickey pattern"
- CROWD: "kids"
- MATERIAL: "cotton"
- MAIN_PRODUCT: "pajamas"
- QUANTITY: "3-piece set"
- SALE: "buy 2 get 1 free"

Note that IP and PATTERN overlap on "Mickey": the span-based head permits overlapping predictions.
## Training data
Not provided in this repository model card.
## Evaluation
This repository includes `evaluate.py` for evaluating `.pt` checkpoints produced during training.
## Environmental impact
Not measured.
## Citation
If you use this work, consider citing your dataset and the BERT/Transformer literature relevant to your setup.