---
license: apache-2.0
language:
- en
pipeline_tag: token-classification
tags:
- named-entity-recognition
- ner
- span-ner
- globalpointer
- pytorch
library_name: transformers
model_name: EcomBert_NER_V1
---

# EcomBert_NER_V1

## Model description

`EcomBert_NER_V1` is a span-based Named Entity Recognition (NER) model built on top of a BERT encoder with a GlobalPointer-style span classification head.

This repository exports and loads the model using a lightweight HuggingFace-style folder layout:

- `config.json`
- `pytorch_model.bin`
- tokenizer files saved by `transformers.AutoTokenizer.save_pretrained(...)`

**Parameter size**: ~0.4B parameters (as configured/reported for this model card).

## Intended uses & limitations

### Intended uses

- Extracting entity spans from short-to-medium English texts (e.g., product titles, user queries, support tickets).
- Offline batch inference and evaluation.

### Limitations

- This is a span-scoring model: it predicts `(label, start, end)` spans. Overlapping spans are possible.
- Output quality depends heavily on:
  - the training dataset schema and label definitions
  - the decision threshold (`threshold`)
  - tokenization behavior (subword boundaries)
- Long inputs will be truncated to `max_length`.
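
Because overlapping spans can be emitted, downstream code usually needs a resolution policy. A minimal sketch (not part of this repository) of a greedy score-based filter that keeps the highest-scoring span among any overlapping group:

```python
def filter_overlaps(spans):
    """Greedy non-overlap filter: keep higher-scoring spans first.

    `spans` is a list of (label, start, end, score) tuples with inclusive
    token boundaries, as produced by thresholding the (C, L, L) score grid.
    """
    kept = []
    for label, start, end, score in sorted(spans, key=lambda s: -s[3]):
        # Keep the span only if it does not overlap anything already kept.
        if all(end < ks or start > ke for _, ks, ke, _ in kept):
            kept.append((label, start, end, score))
    return sorted(kept, key=lambda s: s[1])


spans = [
    ("MAIN_PRODUCT", 3, 4, 0.91),
    ("MATERIAL", 3, 3, 0.85),  # nested inside the span above, lower score
    ("COLOR", 6, 6, 0.78),
]
print(filter_overlaps(spans))
# [('MAIN_PRODUCT', 3, 4, 0.91), ('COLOR', 6, 6, 0.78)]
```

Whether to drop, keep, or merge nested spans depends on your downstream schema; this is only one possible policy.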

## How to use

### 1) Train and export

During training, the best checkpoint is exported to a HuggingFace-style directory (by default `checkpoints/hf_export`).

Example:

```bash
python train.py \
  --splits_dir ./data2/splits \
  --output_dir checkpoints \
  --model_name bert-base-chinese \
  --hf_export_dir hf_export
```

This produces:

- `checkpoints/hf_export/config.json`
- `checkpoints/hf_export/pytorch_model.bin`
- `checkpoints/hf_export/tokenizer.*`
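
Before loading, it can be worth sanity-checking that an export directory is complete. A small helper sketch, assuming only the weight and config files listed above are strictly required (tokenizer filenames vary by tokenizer type):

```python
import os

REQUIRED_FILES = ("config.json", "pytorch_model.bin")


def missing_export_files(model_dir):
    """Return the required files that are absent from an export directory."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(model_dir, name))]


missing = missing_export_files("checkpoints/hf_export")
if missing:
    print(f"export incomplete, missing: {missing}")
```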

### 2) Inference (CLI)

```bash
python infer.py \
  --model_dir checkpoints/hf_export \
  --text "Apple released a new iPhone in California."
```

You can optionally override the decision threshold:

```bash
python infer.py \
  --model_dir checkpoints/hf_export \
  --text "Apple released a new iPhone in California." \
  --threshold 0.55
```
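
The threshold trades recall for precision: lowering it surfaces more candidate spans, raising it keeps only higher-confidence ones. A toy illustration with made-up probabilities (not real model output):

```python
# Hypothetical span scores: (label, start, end) -> sigmoid probability.
span_probs = {
    ("BRAND", 0, 0): 0.92,
    ("COLOR", 5, 5): 0.57,
    ("STYLE", 3, 4): 0.51,
}


def spans_above(probs, threshold):
    """Keep only spans whose score exceeds the decision threshold."""
    return {span for span, p in probs.items() if p > threshold}


print(len(spans_above(span_probs, 0.50)))  # 3 spans survive
print(len(spans_above(span_probs, 0.55)))  # only 2 remain
```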

### 3) Inference (Python)

```python
import torch
from transformers import AutoTokenizer

from model import EcomBertNER

model_dir = "checkpoints/hf_export"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, cfg = EcomBertNER.from_pretrained(model_dir, device=device)

tokenizer = AutoTokenizer.from_pretrained(model_dir)
text = "Apple released a new iPhone in California."

enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
input_ids = enc["input_ids"].to(device)
attention_mask = enc["attention_mask"].to(device)

out = model(input_ids=input_ids, attention_mask=attention_mask)
logits = out["logits"][0]  # span score grid of shape (C, L, L)
probs = torch.sigmoid(logits)
threshold = float(cfg.get("threshold", 0.5))

# Each hit is a (class, start_token, end_token) index triple.
hits = (probs > threshold).nonzero(as_tuple=False)
print(hits[:10])
```
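
The index triples in `hits` can be mapped back to surface text using the tokenizer's `offset_mapping`. A sketch of such a decoder; the `id2label` mapping is assumed to come from `cfg` or your label list, and dummy offsets are used here so the snippet runs standalone:

```python
def decode_spans(hits, offsets, text, id2label):
    """Turn (class, start_token, end_token) triples into (label, surface) pairs.

    `offsets` is one sequence's offset_mapping; special tokens map to (0, 0)
    and are skipped because they yield empty character ranges.
    """
    entities = []
    for c, s, e in hits:
        char_start = offsets[s][0]
        char_end = offsets[e][1]
        if char_end > char_start:
            entities.append((id2label[c], text[char_start:char_end]))
    return entities


# Dummy data standing in for real tokenizer/model outputs.
text = "Nike shoes"
offsets = [(0, 0), (0, 4), (5, 10), (0, 0)]  # [CLS], "Nike", "shoes", [SEP]
hits = [(0, 1, 1), (1, 2, 2)]
id2label = {0: "BRAND", 1: "MAIN_PRODUCT"}
print(decode_spans(hits, offsets, text, id2label))
# [('BRAND', 'Nike'), ('MAIN_PRODUCT', 'shoes')]
```

With real outputs, pass `enc["offset_mapping"][0].tolist()` as `offsets` and `hits.tolist()` as the triples.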

## Few-shot examples

The model predicts spans over the following **23 labels**:

| Label | Description |
|---|---|
| `MAIN_PRODUCT` | Primary product being searched/described |
| `SUB_PRODUCT` | Secondary / accessory product |
| `BRAND` | Brand name |
| `MODEL` | Model number or name |
| `IP` | IP / licensed character / franchise |
| `MATERIAL` | Material composition |
| `COLOR` | Color attribute |
| `SHAPE` | Shape attribute |
| `PATTERN` | Pattern or print |
| `STYLE` | Style descriptor |
| `FUNCTION` | Function or use-case |
| `ATTRIBUTE` | Other product attribute |
| `COMPATIBILITY` | Compatible device / platform |
| `CROWD` | Target audience |
| `OCCASION` | Use occasion or scene |
| `LOCATION` | Geographic / location reference |
| `MEASUREMENT` | Size, dimension, capacity |
| `TIME` | Time reference |
| `QUANTITY` | Count or amount |
| `SALE` | Promotion or sale information |
| `SHOP` | Shop or seller name |
| `CONJ` | Conjunction linking entities |
| `PREP` | Preposition linking entities |

---

### Example 1

**Input**:

```
"Nike running shoes for men, breathable mesh upper, size 42"
```

**Expected entities**:

- `BRAND`: "Nike"
- `MAIN_PRODUCT`: "running shoes"
- `CROWD`: "men"
- `MATERIAL`: "breathable mesh"
- `MEASUREMENT`: "size 42"

---

### Example 2

**Input**:

```
"iPhone 15 Pro compatible leather case, black, for outdoor use"
```

**Expected entities**:

- `COMPATIBILITY`: "iPhone 15 Pro"
- `MAIN_PRODUCT`: "leather case"
- `MATERIAL`: "leather"
- `COLOR`: "black"
- `OCCASION`: "outdoor use"

---

### Example 3

**Input**:

```
"Disney Mickey pattern kids cotton pajamas, 3-piece set, buy 2 get 1 free"
```

**Expected entities**:

- `IP`: "Disney Mickey"
- `PATTERN`: "Mickey pattern"
- `CROWD`: "kids"
- `MATERIAL`: "cotton"
- `MAIN_PRODUCT`: "pajamas"
- `QUANTITY`: "3-piece set"
- `SALE`: "buy 2 get 1 free"
|
| | ## Training data |
| |
|
| | Not provided in this repository model card. |
| |
|
| | ## Evaluation |
| |
|
| | This repository includes `evaluate.py` for evaluating `.pt` checkpoints produced during training. |
| |
|
| | ## Environmental impact |
| |
|
| | Not measured. |
| |
|
| | ## Citation |
| |
|
| | If you use this work, consider citing your dataset and the BERT/Transformer literature relevant to your setup. |
| |
|