EcomBert_NER_V1

Model description

EcomBert_NER_V1 is a span-based Named Entity Recognition (NER) model built on top of a BERT encoder with a GlobalPointer-style span classification head.

This repository exports and loads the model using a lightweight HuggingFace-style folder layout:

  • config.json
  • pytorch_model.bin
  • tokenizer files saved by transformers.AutoTokenizer.save_pretrained(...)

Parameter count: ~0.4B (as configured/reported for this model card).
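
A GlobalPointer-style head scores every (start, end) token pair per label via a per-label query/key dot product over the encoder outputs. The sketch below illustrates the span-scoring idea only; it is a simplified stand-in (no rotary position embeddings, hypothetical dimensions), not the actual head in this repository's model.py:

```python
import torch
import torch.nn as nn

class SpanScoringHead(nn.Module):
    """Minimal GlobalPointer-style head: for each label c, project tokens to
    query/key vectors and score span (i, j) as <q_i, k_j>."""

    def __init__(self, hidden_size: int, num_labels: int, head_dim: int = 64):
        super().__init__()
        self.num_labels = num_labels
        self.head_dim = head_dim
        # One query and one key projection per label, packed into one linear.
        self.proj = nn.Linear(hidden_size, num_labels * head_dim * 2)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (B, L, H) token embeddings from the BERT encoder
        B, L, _ = hidden.shape
        qk = self.proj(hidden).view(B, L, self.num_labels, 2, self.head_dim)
        q, k = qk[..., 0, :], qk[..., 1, :]            # each (B, L, C, D)
        # logits[b, c, i, j] = <q[b, i, c], k[b, j, c]> / sqrt(D)
        logits = torch.einsum("bicd,bjcd->bcij", q, k) / self.head_dim ** 0.5
        # Mask invalid spans with end < start (keep the upper triangle).
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool, device=hidden.device))
        return logits.masked_fill(~mask, float("-inf"))

# Example: scores for a batch of 2 sequences of length 7.
head = SpanScoringHead(hidden_size=32, num_labels=23)
scores = head(torch.randn(2, 7, 32))   # shape (2, 23, 7, 7)
```

This yields the (C, L, L) logits tensor per example that the inference snippet below consumes.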

Intended uses & limitations

Intended uses

  • Extracting entity spans from short-to-medium English texts (e.g., product titles, user queries, support tickets).
  • Offline batch inference and evaluation.

Limitations

  • This is a span-scoring model: it predicts (label, start, end) spans. Overlapping spans are possible.
  • Output quality depends heavily on:
    • the training dataset schema and label definitions
    • the decision threshold (threshold in the exported config, overridable at inference via --threshold)
    • tokenization behavior (subword boundaries)
  • Long inputs will be truncated to max_length.
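
Because the model scores spans independently, multiple overlapping spans can exceed the threshold at once; the model keeps them by design. If a downstream consumer needs non-overlapping output, one simple option (not part of this repository) is a greedy resolver that keeps the highest-scoring spans first:

```python
def greedy_resolve(spans):
    """spans: list of (label, start, end, score) tuples with inclusive ends.
    Keep spans in descending score order, dropping any span that overlaps
    one already kept. Returns the survivors sorted by start position."""
    kept = []
    for label, start, end, score in sorted(spans, key=lambda s: -s[3]):
        if all(end < ks or start > ke for _, ks, ke, _ in kept):
            kept.append((label, start, end, score))
    return sorted(kept, key=lambda s: s[1])

spans = [
    ("MAIN_PRODUCT", 2, 3, 0.91),  # "leather case"
    ("MATERIAL",     2, 2, 0.85),  # "leather" -- overlaps the span above
    ("COLOR",        5, 5, 0.74),
]
print(greedy_resolve(spans))
# [('MAIN_PRODUCT', 2, 3, 0.91), ('COLOR', 5, 5, 0.74)]
```

Note that for this model's label schema, overlapping spans are often intentional (e.g. MATERIAL inside MAIN_PRODUCT), so resolution should be opt-in.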

How to use

1) Train and export

During training, the best checkpoint is exported to a HuggingFace-style directory (by default checkpoints/hf_export).

Example:

python train.py \
  --splits_dir ./data2/splits \
  --output_dir checkpoints \
  --model_name bert-base-chinese \
  --hf_export_dir hf_export

This produces:

  • checkpoints/hf_export/config.json
  • checkpoints/hf_export/pytorch_model.bin
  • checkpoints/hf_export/tokenizer.*

2) Inference (CLI)

python infer.py \
  --model_dir checkpoints/hf_export \
  --text "Apple released a new iPhone in California."

You can optionally override the threshold:

python infer.py \
  --model_dir checkpoints/hf_export \
  --text "Apple released a new iPhone in California." \
  --threshold 0.55

3) Inference (Python)

import torch
from transformers import AutoTokenizer
from model import EcomBertNER

model_dir = "checkpoints/hf_export"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, cfg = EcomBertNER.from_pretrained(model_dir, device=device)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_dir)
text = "Apple released a new iPhone in California."

# return_offsets_mapping yields per-token (char_start, char_end) pairs,
# needed to map predicted token spans back to the input text.
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
input_ids = enc["input_ids"].to(device)
attention_mask = enc["attention_mask"].to(device)

with torch.no_grad():
    o = model(input_ids=input_ids, attention_mask=attention_mask)
logits = o["logits"][0]  # (C, L, L): one score per (label, start, end) triple
probs = torch.sigmoid(logits)
threshold = float(cfg.get("threshold", 0.5))

# Each hit is a (label_idx, start_token, end_token) index triple.
hits = (probs > threshold).nonzero(as_tuple=False)
print(hits[:10])
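
The raw hits are (label_index, start_token, end_token) index triples. To recover entity strings, map the token indices back to character offsets via the tokenizer's offset_mapping and look up label names from the config. A sketch, assuming an id2label mapping is available (the actual key and structure in config.json may differ); the toy offsets below are hand-written stand-ins for real tokenizer output:

```python
def decode_hits(hits, offsets, text, id2label):
    """hits: iterable of (label_idx, start_tok, end_tok) triples;
    offsets: per-token (char_start, char_end) pairs from offset_mapping."""
    entities = []
    for c, s, e in hits:
        char_start, char_end = offsets[s][0], offsets[e][1]
        if char_end <= char_start:  # special tokens ([CLS]/[SEP]) map to (0, 0)
            continue
        entities.append({"label": id2label[c],
                         "text": text[char_start:char_end],
                         "start": char_start, "end": char_end})
    return entities

# Toy demonstration (not real model output):
text = "Apple released a new iPhone."
offsets = [(0, 0), (0, 5), (6, 14), (15, 16), (17, 20), (21, 27), (27, 28), (0, 0)]
id2label = {0: "MAIN_PRODUCT", 2: "BRAND"}
hits = [(2, 1, 1), (0, 5, 5)]
print(decode_hits(hits, offsets, text, id2label))
# [{'label': 'BRAND', 'text': 'Apple', ...}, {'label': 'MAIN_PRODUCT', 'text': 'iPhone', ...}]
```

With the snippet above, the real inputs would be hits.tolist() and enc["offset_mapping"][0].tolist().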

Few-shot examples

The model predicts spans over the following 23 labels:

Label          Description
MAIN_PRODUCT   Primary product being searched/described
SUB_PRODUCT    Secondary / accessory product
BRAND          Brand name
MODEL          Model number or name
IP             IP / licensed character / franchise
MATERIAL       Material composition
COLOR          Color attribute
SHAPE          Shape attribute
PATTERN        Pattern or print
STYLE          Style descriptor
FUNCTION       Function or use-case
ATTRIBUTE      Other product attribute
COMPATIBILITY  Compatible device / platform
CROWD          Target audience
OCCASION       Use occasion or scene
LOCATION       Geographic / location reference
MEASUREMENT    Size, dimension, capacity
TIME           Time reference
QUANTITY       Count or amount
SALE           Promotion or sale information
SHOP           Shop or seller name
CONJ           Conjunction linking entities
PREP           Preposition linking entities
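
The C axis of the logits tensor indexes these labels. Assuming the order above matches the channel order stored in the exported config (verify against your own config.json before relying on it), the index mappings can be built as:

```python
LABELS = [
    "MAIN_PRODUCT", "SUB_PRODUCT", "BRAND", "MODEL", "IP", "MATERIAL",
    "COLOR", "SHAPE", "PATTERN", "STYLE", "FUNCTION", "ATTRIBUTE",
    "COMPATIBILITY", "CROWD", "OCCASION", "LOCATION", "MEASUREMENT",
    "TIME", "QUANTITY", "SALE", "SHOP", "CONJ", "PREP",
]
id2label = dict(enumerate(LABELS))
label2id = {name: i for i, name in id2label.items()}
```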

Example 1

Input:

"Nike running shoes for men, breathable mesh upper, size 42"

Expected entities:

  • BRAND: "Nike"
  • MAIN_PRODUCT: "running shoes"
  • CROWD: "men"
  • MATERIAL: "breathable mesh"
  • MEASUREMENT: "size 42"

Example 2

Input:

"iPhone 15 Pro compatible leather case, black, for outdoor use"

Expected entities:

  • COMPATIBILITY: "iPhone 15 Pro"
  • MAIN_PRODUCT: "leather case"
  • MATERIAL: "leather"
  • COLOR: "black"
  • OCCASION: "outdoor use"

Example 3

Input:

"Disney Mickey pattern kids cotton pajamas, 3-piece set, buy 2 get 1 free"

Expected entities:

  • IP: "Disney Mickey"
  • PATTERN: "Mickey pattern"
  • CROWD: "kids"
  • MATERIAL: "cotton"
  • MAIN_PRODUCT: "pajamas"
  • QUANTITY: "3-piece set"
  • SALE: "buy 2 get 1 free"
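
Since the training data schema is not published here, the gold format below is only a hypothetical structured rendering of Example 3 (character offsets omitted); note it includes overlapping spans (IP and PATTERN share "Mickey"), which this span-based model supports:

```python
example_3 = {
    "text": "Disney Mickey pattern kids cotton pajamas, 3-piece set, buy 2 get 1 free",
    "entities": [
        {"label": "IP", "text": "Disney Mickey"},
        {"label": "PATTERN", "text": "Mickey pattern"},
        {"label": "CROWD", "text": "kids"},
        {"label": "MATERIAL", "text": "cotton"},
        {"label": "MAIN_PRODUCT", "text": "pajamas"},
        {"label": "QUANTITY", "text": "3-piece set"},
        {"label": "SALE", "text": "buy 2 get 1 free"},
    ],
}
# Sanity check: every annotated surface form occurs in the text.
assert all(e["text"] in example_3["text"] for e in example_3["entities"])
```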

Training data

Not provided in this repository model card.

Evaluation

This repository includes evaluate.py for evaluating .pt checkpoints produced during training.
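
Span-level NER is conventionally scored with exact-match micro-averaged precision/recall/F1 over (label, start, end) triples. The self-contained sketch below shows that metric; the actual logic inside evaluate.py may differ:

```python
def span_micro_f1(gold, pred):
    """gold, pred: lists of sets of (label, start, end) triples, one set per
    example. Returns exact-match micro-averaged (precision, recall, f1)."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [{("BRAND", 0, 0), ("MAIN_PRODUCT", 1, 2)}]
pred = [{("BRAND", 0, 0), ("COLOR", 3, 3)}]
print(span_micro_f1(gold, pred))  # (0.5, 0.5, 0.5)
```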

Environmental impact

Not measured.

Citation

If you use this work, consider citing your dataset and the BERT/Transformer literature relevant to your setup.
