Eland NER - Chinese Public Opinion Named Entity Recognition

A Chinese NER model for public opinion analysis, fine-tuned from hfl/chinese-roberta-wwm-ext (a whole-word-masking Chinese BERT), reaching a typed F1 of 68.4% across 9 entity types: people, organizations, locations, dates, events, products, laws, metrics, and topics.

Model Description

This is a BertForTokenClassification model fine-tuned for Chinese Named Entity Recognition in the context of public opinion and media monitoring. It extracts 9 entity types commonly found in news articles, social media posts, and forum discussions in Traditional Chinese.

Entity Types

| Type | Label | Description | Examples |
|------|-------|-------------|----------|
| PER | 人物 | Named persons (public figures, executives, politicians) | 川普, 魏哲家, 賴清德 |
| ORG | 組織機構 | Organizations (companies, government agencies, parties) | 台積電, 民進黨, FBI |
| LOC | 地點 | Locations (countries, cities, districts) | 台北, 美國, 信義區 |
| DATE | 時間 | Temporal expressions (dates, quarters, holidays) | 2026年, Q4, 春節 |
| EVENT | 事件 | Named events (conferences, elections, incidents) | CES 2025, 股東會, 九合一選舉 |
| PROD | 產品/品牌/服務 | Products, brands, and services | iPhone, ChatGPT, TPASS |
| LAW | 法規 | Named laws and regulations | 證券交易法, 勞基法, 個資法 |
| METRIC | 指標名稱 | Metric/indicator names (not concrete values) | 營收, 本益比, GDP |
| TOPIC | 輿情主題詞 | Opinion-mining topic keywords | AI, 半導體, 碳中和 |

BIO Label Scheme

19 labels total: `O`, plus `B-` (beginning) and `I-` (inside) variants for each of the 9 entity types.
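The scheme above can be enumerated in a short sketch. The label order here is illustrative; the authoritative mapping is `model.config.id2label` on the Hub:

```python
# Enumerate the 19-label BIO scheme: O plus B-/I- for each entity type.
ENTITY_TYPES = ["PER", "ORG", "LOC", "DATE", "EVENT", "PROD", "LAW", "METRIC", "TOPIC"]
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

assert len(LABELS) == 19
print(LABELS[:5])
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG']
```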

Performance

Per-Type Metrics (on test50 set)

| Entity Type | Precision | Recall | F1 |
|-------------|-----------|--------|------|
| PER | 75.0% | 83.7% | 79.1% |
| ORG | 59.2% | 82.4% | 68.9% |
| LOC | 68.8% | 89.8% | 77.9% |
| DATE | 65.2% | 90.0% | 75.6% |
| EVENT | 42.9% | 64.3% | 51.4% |
| PROD | 50.0% | 75.8% | 60.2% |
| LAW | 57.1% | 80.0% | 66.7% |
| METRIC | 59.6% | 77.8% | 67.5% |
| TOPIC | 47.6% | 71.4% | 57.1% |
| **Typed Average** | **59.3%** | **80.8%** | **68.4%** |

Note: The model favors recall over precision by design. Use post-processing confidence thresholds (see below) to trade recall for precision based on your use case.

Usage

With Transformers Pipeline

```python
from transformers import pipeline

ner = pipeline("ner", model="p988744/eland-ner-zh", aggregation_strategy="simple")
result = ner("台積電董事長魏哲家今日表示營收創新高")

for entity in result:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
# 台積電 -> ORG (0.95)
# 魏哲家 -> PER (0.92)
# 營收 -> METRIC (0.88)
```

Manual Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "p988744/eland-ner-zh"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "賴清德出席在台北舉行的半導體產業論壇"
inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = inputs.pop("offset_mapping")[0]

with torch.no_grad():
    outputs = model(**inputs)

# Per-token confidence is the max softmax probability; the argmax is the label.
probs = torch.softmax(outputs.logits, dim=-1)[0]
confidences, predictions = probs.max(dim=-1)

id2label = model.config.id2label
for pred, conf, (start, end) in zip(predictions, confidences, offset_mapping.tolist()):
    if start == end:  # special tokens ([CLS]/[SEP]) have empty offsets
        continue
    label = id2label[pred.item()]
    if label != "O":
        print(f"{text[start:end]} -> {label} ({conf.item():.3f})")
```
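If you need character-level entity spans rather than per-token tags, the BIO predictions above can be merged into spans. The following is a minimal sketch; the `merge_bio` helper and the hard-coded example labels are illustrative, not part of the model's API:

```python
# Merge per-token BIO tags into character-level entity spans,
# using the (start, end) offsets produced by the tokenizer.
def merge_bio(text, labels, offsets):
    """labels: one BIO tag per token; offsets: (start, end) per token."""
    entities, current = [], None
    for label, (start, end) in zip(labels, offsets):
        if start == end:  # special tokens have empty offsets
            continue
        if label.startswith("B-"):
            if current:
                entities.append(current)
            current = {"type": label[2:], "start": start, "end": end}
        elif label.startswith("I-") and current and label[2:] == current["type"]:
            current["end"] = end  # extend the running span
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [{**e, "text": text[e["start"]:e["end"]]} for e in entities]

# Illustrative labels for a character-tokenized sentence:
text = "台積電董事長魏哲家"
labels = ["B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "B-PER", "I-PER", "I-PER"]
offsets = [(i, i + 1) for i in range(len(text))]
print(merge_bio(text, labels, offsets))
# [{'type': 'ORG', 'start': 0, 'end': 3, 'text': '台積電'},
#  {'type': 'PER', 'start': 6, 'end': 9, 'text': '魏哲家'}]
```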

Post-Processing

The model tends to over-predict (high recall, lower precision). We provide recommended confidence thresholds derived from false-positive audit analysis:

| Type | Threshold | Effect |
|------|-----------|--------|
| PER | 0.75 | P=82%, R=93% |
| ORG | 0.80 | P=82%, R=87% |
| LOC | 0.90 | P=83%, R=86% |
| DATE | 0.75 | P=67%, R=86% |
| EVENT | 0.75 | P=55%, R=61% |
| PROD | 0.60 | P=61%, R=74% |
| LAW | 0.00 | Too few samples to calibrate |
| METRIC | 0.85 | P=76%, R=72% |
| TOPIC | 0.80 | P=65%, R=52% |

Additional heuristic rules:

- **Fragment filter**: remove single-character entities and subword artifacts
- **METRIC filter**: remove concrete numeric values (e.g., "200億元"); keep only metric names (e.g., "營收")
- **TOPIC filter**: remove generic, low-discriminability words (e.g., "策略", "健康")
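Taken together, the thresholds and the fragment filter can be sketched as a small post-processing function. The `filter_entities` helper below is our own illustration; threshold values are copied from the table and applied to `pipeline(..., aggregation_strategy="simple")` output:

```python
# Per-type confidence thresholds from the table above.
THRESHOLDS = {
    "PER": 0.75, "ORG": 0.80, "LOC": 0.90, "DATE": 0.75, "EVENT": 0.75,
    "PROD": 0.60, "LAW": 0.00, "METRIC": 0.85, "TOPIC": 0.80,
}

def filter_entities(entities):
    """entities: dicts with 'entity_group', 'score', 'word' (pipeline format)."""
    kept = []
    for e in entities:
        if e["score"] < THRESHOLDS.get(e["entity_group"], 0.5):
            continue  # below the per-type confidence threshold
        if len(e["word"]) <= 1:
            continue  # fragment filter: drop single-character entities
        kept.append(e)
    return kept

# Illustrative pipeline-style output, not real model predictions:
raw = [
    {"entity_group": "ORG", "score": 0.95, "word": "台積電"},
    {"entity_group": "TOPIC", "score": 0.55, "word": "策略"},  # below 0.80
    {"entity_group": "PER", "score": 0.92, "word": "魏"},      # fragment
]
print(filter_entities(raw))
# [{'entity_group': 'ORG', 'score': 0.95, 'word': '台積電'}]
```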

Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | hfl/chinese-roberta-wwm-ext |
| Architecture | BertForTokenClassification |
| Epochs | 25 (early stopping, patience=5) |
| Batch Size | 16 |
| Learning Rate | 3e-5 |
| Weight Decay | 0.01 |
| Warmup Ratio | 0.1 |
| Dropout | 0.2 |
| Label Smoothing | 0.02 |
| Class Weights | Log-scaled (damped) |
| Max Sequence Length | 512 |
| Metric for Best Model | F1 |
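As a rough guide, the hyperparameters above map onto `transformers.TrainingArguments` as sketched below. This is a hedged reconstruction, not the published training script; dropout, label smoothing, and class weights are applied via the model config and a custom loss, which are not shown here:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Illustrative mapping of the table to TrainingArguments
# (on older transformers releases, eval_strategy is evaluation_strategy).
args = TrainingArguments(
    output_dir="eland-ner-zh",
    num_train_epochs=25,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

# Early stopping with patience=5, passed to Trainer(callbacks=[...]):
early_stop = EarlyStoppingCallback(early_stopping_patience=5)
```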

Dataset

Trained on 1,979 annotated Chinese articles from news and social media sources.

| Split | Samples | Source |
|-------|---------|--------|
| Train | 1,619 | 95% random split (seed=42) |
| Valid | 86 | 5% random split |
| Test | 49 | Separate manual test set |

Total entity annotations: ~13,000 across 9 types.

See dataset: p988744/eland-ner-zh

Related Models

| Model | Task | Repository |
|-------|------|------------|
| Eland Sentiment | Sentiment Analysis | p988744/eland-sentiment-zh |
| Eland Stance | Stance Detection | p988744/eland-stance-zh |
| Eland Entity Sentiment | Entity Sentiment | p988744/eland-entity-sentiment-zh |
| Eland Official Doc | Document Formatting | p988744/eland-official-doc-zh |
| Eland Legal IE | Legal Info Extraction | p988744/eland-legal-ie-zh |

Citation

```bibtex
@misc{eland-ner-zh,
  author = {Eland AI},
  title = {Eland NER: Chinese Public Opinion Named Entity Recognition Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/p988744/eland-ner-zh}
}
```

License

Apache 2.0
