# Eland NER - Chinese Public Opinion Named Entity Recognition
A Chinese NER model for public opinion analysis, fine-tuned from hfl/chinese-roberta-wwm-ext (a BERT-based model). It achieves a typed F1 of 68.4% across 9 entity types: people, organizations, locations, dates, events, products, laws, metrics, and topics.
## Model Description
This is a BertForTokenClassification model fine-tuned for Chinese Named Entity Recognition in the context of public opinion and media monitoring. It extracts 9 entity types commonly found in news articles, social media posts, and forum discussions in Traditional Chinese.
## Entity Types
| Type | Label | Description | Examples |
|---|---|---|---|
| PER | 人物 | Named persons (public figures, executives, politicians) | 川普, 魏哲家, 賴清德 |
| ORG | 組織機構 | Organizations (companies, government agencies, parties) | 台積電, 民進黨, FBI |
| LOC | 地點 | Locations (countries, cities, districts) | 台北, 美國, 信義區 |
| DATE | 時間 | Temporal expressions (dates, quarters, holidays) | 2026年, Q4, 春節 |
| EVENT | 事件 | Named events (conferences, elections, incidents) | CES 2025, 股東會, 九合一選舉 |
| PROD | 產品/品牌/服務 | Products, brands, and services | iPhone, ChatGPT, TPASS |
| LAW | 法規 | Named laws and regulations | 證券交易法, 勞基法, 個資法 |
| METRIC | 指標名稱 | Metric/indicator names (not concrete values) | 營收, 本益比, GDP |
| TOPIC | 輿情主題詞 | Opinion mining topic keywords | AI, 半導體, 碳中和 |
## BIO Label Scheme
19 labels total: `O` plus 9 entity types × 2 (`B-` prefix for the beginning token of an entity, `I-` for inside tokens).
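The full inventory can be reconstructed as a quick sketch (the label ordering actually stored in the model's `config.json` may differ):

```python
# Sketch of the 19-label inventory: O plus B-/I- for each of the 9 types.
# The ordering stored in the model's config.json may differ.
ENTITY_TYPES = ["PER", "ORG", "LOC", "DATE", "EVENT", "PROD", "LAW", "METRIC", "TOPIC"]
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

print(len(LABELS))  # 19
```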
## Performance

### Per-Type Metrics (on the test50 set)
| Entity Type | Precision | Recall | F1 |
|---|---|---|---|
| PER | 75.0% | 83.7% | 79.1% |
| ORG | 59.2% | 82.4% | 68.9% |
| LOC | 68.8% | 89.8% | 77.9% |
| DATE | 65.2% | 90.0% | 75.6% |
| EVENT | 42.9% | 64.3% | 51.4% |
| PROD | 50.0% | 75.8% | 60.2% |
| LAW | 57.1% | 80.0% | 66.7% |
| METRIC | 59.6% | 77.8% | 67.5% |
| TOPIC | 47.6% | 71.4% | 57.1% |
| Typed Average | 59.3% | 80.8% | 68.4% |
Note: The model favors recall over precision by design. Use post-processing confidence thresholds (see below) to trade recall for precision based on your use case.
## Usage

### With the Transformers Pipeline
```python
from transformers import pipeline

ner = pipeline("ner", model="p988744/eland-ner-zh", aggregation_strategy="simple")
result = ner("台積電董事長魏哲家今日表示營收創新高")
for entity in result:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
# 台積電 -> ORG (0.95)
# 魏哲家 -> PER (0.92)
# 營收 -> METRIC (0.88)
```
### Manual Inference
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "p988744/eland-ner-zh"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "賴清德出席在台北舉行的半導體產業論壇"
inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = inputs.pop("offset_mapping")[0]

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=-1)[0]
confidences = torch.softmax(outputs.logits, dim=-1)[0].max(dim=-1).values

id2label = model.config.id2label
for i, (pred, conf) in enumerate(zip(predictions, confidences)):
    label = id2label[pred.item()]
    if label != "O":
        start, end = offset_mapping[i]
        token_text = text[start:end]
        print(f"{token_text} -> {label} ({conf:.3f})")
```
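The loop above prints one line per token; to recover whole entities you still need to merge contiguous `B-`/`I-` tokens. A minimal sketch of that merge, shown on hand-written example labels and offsets (not actual model output) so it runs standalone:

```python
# Minimal sketch: merge per-token BIO labels into entity spans.
# The example labels/offsets below are hand-written for illustration.
def bio_to_spans(labels, offsets, text):
    """Merge BIO token labels into (entity_text, type, start, end) tuples."""
    spans, cur = [], None
    for label, (start, end) in zip(labels, offsets):
        if start == end:
            continue  # special tokens like [CLS]/[SEP] have empty offsets
        if label.startswith("B-"):
            if cur:
                spans.append(cur)
            cur = [start, end, label[2:]]  # start a new entity
        elif label.startswith("I-") and cur and label[2:] == cur[2]:
            cur[1] = end  # extend the current entity
        else:
            if cur:
                spans.append(cur)
            cur = None
    if cur:
        spans.append(cur)
    return [(text[s:e], t, s, e) for s, e, t in spans]

text = "賴清德出席論壇"
offsets = [(0, 0)] + [(i, i + 1) for i in range(len(text))] + [(0, 0)]
labels = ["O", "B-PER", "I-PER", "I-PER", "O", "O", "B-EVENT", "I-EVENT", "O"]
print(bio_to_spans(labels, offsets, text))
# [('賴清德', 'PER', 0, 3), ('論壇', 'EVENT', 5, 7)]
```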
## Post-Processing
The model tends to over-predict (high recall, lower precision). We provide recommended confidence thresholds derived from false-positive audit analysis:
| Type | Threshold | Effect |
|---|---|---|
| PER | 0.75 | P=82% R=93% |
| ORG | 0.80 | P=82% R=87% |
| LOC | 0.90 | P=83% R=86% |
| DATE | 0.75 | P=67% R=86% |
| EVENT | 0.75 | P=55% R=61% |
| PROD | 0.60 | P=61% R=74% |
| LAW | 0.00 | Too few samples to calibrate a threshold |
| METRIC | 0.85 | P=76% R=72% |
| TOPIC | 0.80 | P=65% R=52% |
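These thresholds can be applied directly to the `pipeline` output shown earlier. A minimal sketch (the example entities and scores below are illustrative, not real model output):

```python
# Recommended per-type confidence thresholds from the table above.
THRESHOLDS = {
    "PER": 0.75, "ORG": 0.80, "LOC": 0.90, "DATE": 0.75, "EVENT": 0.75,
    "PROD": 0.60, "LAW": 0.00, "METRIC": 0.85, "TOPIC": 0.80,
}

def apply_thresholds(entities, thresholds=THRESHOLDS):
    """Keep only entities whose score clears the per-type threshold.
    Unknown types are dropped (default threshold 1.0)."""
    return [e for e in entities if e["score"] >= thresholds.get(e["entity_group"], 1.0)]

# Illustrative entities in the pipeline's output format:
sample = [
    {"word": "台積電", "entity_group": "ORG", "score": 0.95},
    {"word": "論壇", "entity_group": "EVENT", "score": 0.55},
]
print([e["word"] for e in apply_thresholds(sample)])  # ['台積電']
```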
Additional heuristic rules:
- Fragment filter: Remove single-character entities and subword artifacts
- METRIC filter: Remove concrete numeric values (e.g., "200億元") - keep only metric names (e.g., "營收")
- TOPIC filter: Remove generic/low-discriminability words (e.g., "策略", "健康")
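The heuristic rules above could be implemented along these lines; note that the stop-word set and the numeric pattern here are assumptions based on the examples given, not the project's actual rule set:

```python
import re

# Assumed stop-word set, taken from the examples in the rules above.
TOPIC_STOPWORDS = {"策略", "健康"}

def heuristic_filter(entities):
    kept = []
    for e in entities:
        word = e["word"].replace("##", "").replace(" ", "")  # strip WordPiece artifacts
        if len(word) <= 1:
            continue  # fragment filter: drop single-character entities
        if e["entity_group"] == "METRIC" and re.search(r"[0-9０-９]", word):
            continue  # METRIC filter: drop concrete values like "200億元"
        if e["entity_group"] == "TOPIC" and word in TOPIC_STOPWORDS:
            continue  # TOPIC filter: drop generic, low-discriminability words
        kept.append({**e, "word": word})
    return kept
```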
## Training Details
| Parameter | Value |
|---|---|
| Base Model | hfl/chinese-roberta-wwm-ext |
| Architecture | BertForTokenClassification |
| Epochs | 25 (early stopping patience=5) |
| Batch Size | 16 |
| Learning Rate | 3e-5 |
| Weight Decay | 0.01 |
| Warmup Ratio | 0.1 |
| Dropout | 0.2 |
| Label Smoothing | 0.02 |
| Class Weights | Log-scaled (damped) |
| Max Sequence Length | 512 |
| Metric for Best Model | F1 |
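The table maps onto a standard Hugging Face `Trainer` setup; a hedged sketch of the corresponding argument values (the key names are the usual `TrainingArguments` ones, but this mapping is assumed, and the class weighting and label smoothing would require a custom loss not shown here):

```python
# Assumed TrainingArguments-style mapping of the hyperparameter table.
# Class weights and label smoothing need a custom Trainer loss (not shown).
training_config = {
    "num_train_epochs": 25,
    "per_device_train_batch_size": 16,
    "learning_rate": 3e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.1,
    "metric_for_best_model": "f1",
    "load_best_model_at_end": True,  # required for early stopping (patience=5)
}
max_seq_length = 512  # passed to the tokenizer, not to TrainingArguments
```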
## Dataset
Trained on 1,979 annotated Chinese articles from news and social media sources.
| Split | Samples | Source |
|---|---|---|
| Train | 1,619 | 95% random split (seed=42) |
| Valid | 86 | 5% random split |
| Test | 49 | Separate manual test set |
Total entity annotations: ~13,000 across 9 types.
See dataset: p988744/eland-ner-zh
## Related Models
| Model | Task | Repository |
|---|---|---|
| Eland Sentiment | Sentiment Analysis | p988744/eland-sentiment-zh |
| Eland Stance | Stance Detection | p988744/eland-stance-zh |
| Eland Entity Sentiment | Entity Sentiment | p988744/eland-entity-sentiment-zh |
| Eland Official Doc | Document Formatting | p988744/eland-official-doc-zh |
| Eland Legal IE | Legal Info Extraction | p988744/eland-legal-ie-zh |
## Citation
```bibtex
@misc{eland-ner-zh,
  author    = {Eland AI},
  title     = {Eland NER: Chinese Public Opinion Named Entity Recognition Model},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/p988744/eland-ner-zh}
}
```
## License
Apache 2.0