---
license: apache-2.0
language:
- en
pipeline_tag: token-classification
tags:
- named-entity-recognition
- ner
- span-ner
- globalpointer
- pytorch
library_name: transformers
model_name: EcomBert_NER_V1
---
# EcomBert_NER_V1
## Model description
`EcomBert_NER_V1` is a span-based Named Entity Recognition (NER) model built on top of a BERT encoder with a GlobalPointer-style span classification head.
This repository exports and loads the model using a lightweight HuggingFace-style folder layout:
- `config.json`
- `pytorch_model.bin`
- tokenizer files saved via the tokenizer's `save_pretrained(...)`
**Parameter size**: approximately 0.4B parameters, as reported for this configuration.
## Intended uses & limitations
### Intended uses
- Extracting entity spans from short-to-medium English texts (e.g., product titles, user queries, support tickets).
- Offline batch inference and evaluation.
### Limitations
- This is a span-scoring model: it predicts `(label, start, end)` spans. Overlapping spans are possible.
- Output quality depends heavily on:
- the training dataset schema and label definitions
- the decision threshold (`threshold`)
- tokenization behavior (subword boundaries)
- Long inputs will be truncated to `max_length`.
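Because the model can emit overlapping spans, some applications need a flat, non-overlapping entity list. The following is a minimal sketch of one common post-processing step (not part of this repository): greedily keep the highest-scoring spans and drop any span that overlaps an already-accepted one. The function name and span format are illustrative.

```python
# Hypothetical post-processing: greedily keep the highest-scoring spans
# and drop any span that overlaps an already-accepted one. The model
# itself may emit overlapping (label, start, end) spans.

def filter_overlaps(spans):
    """spans: list of (score, label, start, end); start/end are inclusive token indices."""
    kept = []
    for score, label, start, end in sorted(spans, reverse=True):
        # Keep this span only if it is disjoint from every accepted span.
        if all(end < s or start > e for _, _, s, e in kept):
            kept.append((score, label, start, end))
    return kept

spans = [
    (0.95, "BRAND", 0, 0),
    (0.90, "MAIN_PRODUCT", 1, 2),
    (0.60, "ATTRIBUTE", 2, 3),  # overlaps MAIN_PRODUCT, so it is dropped
]
print(filter_overlaps(spans))
```

Whether you want this filtering depends on the downstream task; for nested-entity use cases, keep the raw overlapping spans.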
## How to use
### 1) Train and export
During training, the best checkpoint is exported to a HuggingFace-style directory (by default `checkpoints/hf_export`).
Example:
```bash
python train.py \
--splits_dir ./data2/splits \
--output_dir checkpoints \
--model_name bert-base-chinese \
--hf_export_dir hf_export
```
This produces:
- `checkpoints/hf_export/config.json`
- `checkpoints/hf_export/pytorch_model.bin`
- `checkpoints/hf_export/tokenizer.*`
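A quick sanity check that the export step produced the layout above can be sketched as follows; the required filenames are taken from this card, and `check_export` is a hypothetical helper, not part of the repository:

```python
import os

# Verify the exported folder contains the files this card describes:
# config.json, pytorch_model.bin, and at least one tokenizer.* file.
def check_export(export_dir):
    required = ["config.json", "pytorch_model.bin"]
    missing = [f for f in required
               if not os.path.exists(os.path.join(export_dir, f))]
    tokenizer_files = []
    if os.path.isdir(export_dir):
        tokenizer_files = [f for f in os.listdir(export_dir)
                           if f.startswith("tokenizer")]
    return missing, tokenizer_files
```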
### 2) Inference (CLI)
```bash
python infer.py \
--model_dir checkpoints/hf_export \
--text "Apple released a new iPhone in California."
```
You can optionally override the threshold:
```bash
python infer.py \
--model_dir checkpoints/hf_export \
--text "Apple released a new iPhone in California." \
--threshold 0.55
```
### 3) Inference (Python)
```python
import torch
from transformers import AutoTokenizer

from model import EcomBertNER

model_dir = "checkpoints/hf_export"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model, cfg = EcomBertNER.from_pretrained(model_dir, device=device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_dir)

text = "Apple released a new iPhone in California."
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
input_ids = enc["input_ids"].to(device)
attention_mask = enc["attention_mask"].to(device)

with torch.no_grad():
    o = model(input_ids=input_ids, attention_mask=attention_mask)

logits = o["logits"][0]  # span scores, shape (C, L, L)
probs = torch.sigmoid(logits)
threshold = float(cfg.get("threshold", 0.5))

# Each row of `hits` is a (class, start, end) token-index triple.
hits = (probs > threshold).nonzero(as_tuple=False)
print(hits[:10])
```
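The `(class, start, end)` triples above are token indices, not character offsets. The sketch below shows one way to map them back to surface strings using the tokenizer's `offset_mapping`; `decode_spans` and the `LABELS` ordering are assumptions for illustration (the real label order comes from `config.json`), and it operates on plain nested lists so it works the same on `probs.tolist()`.

```python
# Hypothetical decoder: turn above-threshold (class, start, end) token
# indices into character-level spans via the tokenizer's offset_mapping.
def decode_spans(probs, offsets, labels, text, threshold=0.5):
    """probs: nested [C][L][L] scores; offsets: per-token (char_start, char_end)."""
    results = []
    for c in range(len(probs)):
        for s in range(len(offsets)):
            for e in range(s, len(offsets)):  # only upper triangle: start <= end
                if probs[c][s][e] > threshold:
                    cs, ce = offsets[s][0], offsets[e][1]
                    if ce > cs:  # skip special tokens, which map to (0, 0)
                        results.append((labels[c], text[cs:ce], probs[c][s][e]))
    return results

# Tiny synthetic example: two tokens, two classes.
text = "Apple iPhone"
offsets = [(0, 5), (6, 12)]
probs = [
    [[0.9, 0.1], [0.0, 0.05]],  # class 0: strong score on token 0 alone
    [[0.0, 0.0], [0.0, 0.8]],   # class 1: strong score on token 1 alone
]
print(decode_spans(probs, offsets, ["BRAND", "MAIN_PRODUCT"], text))
```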
## Few-shot examples
The model predicts spans over the following **23 labels**:
| Label | Description |
|---|---|
| `MAIN_PRODUCT` | Primary product being searched/described |
| `SUB_PRODUCT` | Secondary / accessory product |
| `BRAND` | Brand name |
| `MODEL` | Model number or name |
| `IP` | IP / licensed character / franchise |
| `MATERIAL` | Material composition |
| `COLOR` | Color attribute |
| `SHAPE` | Shape attribute |
| `PATTERN` | Pattern or print |
| `STYLE` | Style descriptor |
| `FUNCTION` | Function or use-case |
| `ATTRIBUTE` | Other product attribute |
| `COMPATIBILITY` | Compatible device / platform |
| `CROWD` | Target audience |
| `OCCASION` | Use occasion or scene |
| `LOCATION` | Geographic / location reference |
| `MEASUREMENT` | Size, dimension, capacity |
| `TIME` | Time reference |
| `QUANTITY` | Count or amount |
| `SALE` | Promotion or sale information |
| `SHOP` | Shop or seller name |
| `CONJ` | Conjunction linking entities |
| `PREP` | Preposition linking entities |
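For scripting against this label set, the table can be mirrored as a Python constant. Note this ordering is only illustrative: the authoritative class order is whatever the exported `config.json` stores.

```python
# Illustrative label set matching the table above; the authoritative
# label order lives in the exported config.json.
LABELS = [
    "MAIN_PRODUCT", "SUB_PRODUCT", "BRAND", "MODEL", "IP", "MATERIAL",
    "COLOR", "SHAPE", "PATTERN", "STYLE", "FUNCTION", "ATTRIBUTE",
    "COMPATIBILITY", "CROWD", "OCCASION", "LOCATION", "MEASUREMENT",
    "TIME", "QUANTITY", "SALE", "SHOP", "CONJ", "PREP",
]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
```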
---
### Example 1
**Input**:
```
"Nike running shoes for men, breathable mesh upper, size 42"
```
**Expected entities**:
- `BRAND`: "Nike"
- `MAIN_PRODUCT`: "running shoes"
- `CROWD`: "men"
- `MATERIAL`: "breathable mesh"
- `MEASUREMENT`: "size 42"
---
### Example 2
**Input**:
```
"iPhone 15 Pro compatible leather case, black, for outdoor use"
```
**Expected entities**:
- `COMPATIBILITY`: "iPhone 15 Pro"
- `MAIN_PRODUCT`: "leather case"
- `MATERIAL`: "leather"
- `COLOR`: "black"
- `OCCASION`: "outdoor use"
---
### Example 3
**Input**:
```
"Disney Mickey pattern kids cotton pajamas, 3-piece set, buy 2 get 1 free"
```
**Expected entities**:
- `IP`: "Disney Mickey"
- `PATTERN`: "Mickey pattern"
- `CROWD`: "kids"
- `MATERIAL`: "cotton"
- `MAIN_PRODUCT`: "pajamas"
- `QUANTITY`: "3-piece set"
- `SALE`: "buy 2 get 1 free"
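For evaluation, listings like the examples above can be converted to character-offset gold spans. The helper below is a hypothetical sketch (not part of this repository) that locates each surface string with `str.find`, assuming every entity string occurs exactly once in the input.

```python
# Hypothetical helper: turn an "Expected entities" listing into
# (label, char_start, char_end) gold spans. Assumes each surface
# string occurs once; repeated strings would need explicit offsets.
def to_gold_spans(text, entities):
    spans = []
    for label, surface in entities:
        start = text.find(surface)
        if start != -1:
            spans.append((label, start, start + len(surface)))
    return spans

text = "Nike running shoes for men, breathable mesh upper, size 42"
gold = to_gold_spans(text, [("BRAND", "Nike"), ("MAIN_PRODUCT", "running shoes")])
print(gold)
```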
## Training data
Details of the training data are not documented in this model card.
## Evaluation
This repository includes `evaluate.py` for evaluating `.pt` checkpoints produced during training.
## Environmental impact
Not measured.
## Citation
If you use this work, consider citing your dataset and the BERT/Transformer literature relevant to your setup.