Eland NER - Chinese Public Opinion Named Entity Recognition

A Chinese NER model for public opinion analysis, fine-tuned from hfl/chinese-roberta-wwm-ext (a whole-word-masking Chinese BERT), reaching a typed F1 of 68.4% across 9 entity types: people, organizations, locations, dates, events, products, laws, metrics, and topics.

Model Description

This is a BertForTokenClassification model fine-tuned for Chinese Named Entity Recognition in the context of public opinion and media monitoring. It extracts 9 entity types commonly found in news articles, social media posts, and forum discussions in Traditional Chinese.

Entity Types

| Type | Label | Description | Examples |
|------|-------|-------------|----------|
| PER | 人物 | Named persons (public figures, executives, politicians) | 川普, 魏哲家, 賴清德 |
| ORG | 組織機構 | Organizations (companies, government agencies, parties) | 台積電, 民進黨, FBI |
| LOC | 地點 | Locations (countries, cities, districts) | 台北, 美國, 信義區 |
| DATE | 時間 | Temporal expressions (dates, quarters, holidays) | 2026年, Q4, 春節 |
| EVENT | 事件 | Named events (conferences, elections, incidents) | CES 2025, 股東會, 九合一選舉 |
| PROD | 產品/品牌/服務 | Products, brands, and services | iPhone, ChatGPT, TPASS |
| LAW | 法規 | Named laws and regulations | 證券交易法, 勞基法, 個資法 |
| METRIC | 指標名稱 | Metric/indicator names (not concrete values) | 營收, 本益比, GDP |
| TOPIC | 輿情主題詞 | Opinion-mining topic keywords | AI, 半導體, 碳中和 |

BIO Label Scheme

19 labels total: `O`, plus `B-` (beginning) and `I-` (inside) variants for each of the 9 entity types.
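The scheme above can be enumerated in a short sketch. The label order here is illustrative; the authoritative mapping is `model.config.id2label` on the Hub:

```python
# Enumerate the 19-label BIO scheme: O plus B-/I- for each entity type.
ENTITY_TYPES = ["PER", "ORG", "LOC", "DATE", "EVENT", "PROD", "LAW", "METRIC", "TOPIC"]
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

assert len(LABELS) == 19
print(LABELS[:5])
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG']
```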

Performance

Per-Type Metrics (on test50 set)

| Entity Type | Precision | Recall | F1 |
|-------------|-----------|--------|------|
| PER | 75.0% | 83.7% | 79.1% |
| ORG | 59.2% | 82.4% | 68.9% |
| LOC | 68.8% | 89.8% | 77.9% |
| DATE | 65.2% | 90.0% | 75.6% |
| EVENT | 42.9% | 64.3% | 51.4% |
| PROD | 50.0% | 75.8% | 60.2% |
| LAW | 57.1% | 80.0% | 66.7% |
| METRIC | 59.6% | 77.8% | 67.5% |
| TOPIC | 47.6% | 71.4% | 57.1% |
| **Typed Average** | **59.3%** | **80.8%** | **68.4%** |

Note: The model favors recall over precision by design. Use post-processing confidence thresholds (see below) to trade recall for precision based on your use case.

Usage

With Transformers Pipeline

```python
from transformers import pipeline

ner = pipeline("ner", model="p988744/eland-ner-zh", aggregation_strategy="simple")
result = ner("台積電董事長魏哲家今日表示營收創新高")

for entity in result:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
# 台積電 -> ORG (0.95)
# 魏哲家 -> PER (0.92)
# 營收 -> METRIC (0.88)
```

Manual Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "p988744/eland-ner-zh"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "賴清德出席在台北舉行的半導體產業論壇"
inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = inputs.pop("offset_mapping")[0]

with torch.no_grad():
    outputs = model(**inputs)

# Per-token confidence is the max softmax probability; the argmax is the label.
probs = torch.softmax(outputs.logits, dim=-1)[0]
confidences, predictions = probs.max(dim=-1)

id2label = model.config.id2label
for pred, conf, (start, end) in zip(predictions, confidences, offset_mapping.tolist()):
    if start == end:  # special tokens ([CLS]/[SEP]) have empty offsets
        continue
    label = id2label[pred.item()]
    if label != "O":
        print(f"{text[start:end]} -> {label} ({conf.item():.3f})")
```
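If you need character-level entity spans rather than per-token tags, the BIO predictions above can be merged into spans. The following is a minimal sketch; the `merge_bio` helper and the hard-coded example labels are illustrative, not part of the model's API:

```python
# Merge per-token BIO tags into character-level entity spans,
# using the (start, end) offsets produced by the tokenizer.
def merge_bio(text, labels, offsets):
    """labels: one BIO tag per token; offsets: (start, end) per token."""
    entities, current = [], None
    for label, (start, end) in zip(labels, offsets):
        if start == end:  # special tokens have empty offsets
            continue
        if label.startswith("B-"):
            if current:
                entities.append(current)
            current = {"type": label[2:], "start": start, "end": end}
        elif label.startswith("I-") and current and label[2:] == current["type"]:
            current["end"] = end  # extend the running span
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [{**e, "text": text[e["start"]:e["end"]]} for e in entities]

# Illustrative labels for a character-tokenized sentence:
text = "台積電董事長魏哲家"
labels = ["B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "B-PER", "I-PER", "I-PER"]
offsets = [(i, i + 1) for i in range(len(text))]
print(merge_bio(text, labels, offsets))
# [{'type': 'ORG', 'start': 0, 'end': 3, 'text': '台積電'},
#  {'type': 'PER', 'start': 6, 'end': 9, 'text': '魏哲家'}]
```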

Post-Processing

The model tends to over-predict (high recall, lower precision). We provide recommended confidence thresholds derived from false-positive audit analysis:

| Type | Threshold | Effect |
|------|-----------|--------|
| PER | 0.75 | P=82%, R=93% |
| ORG | 0.80 | P=82%, R=87% |
| LOC | 0.90 | P=83%, R=86% |
| DATE | 0.75 | P=67%, R=86% |
| EVENT | 0.75 | P=55%, R=61% |
| PROD | 0.60 | P=61%, R=74% |
| LAW | 0.00 | Too few samples to calibrate |
| METRIC | 0.85 | P=76%, R=72% |
| TOPIC | 0.80 | P=65%, R=52% |

Additional heuristic rules:

- **Fragment filter**: remove single-character entities and subword artifacts
- **METRIC filter**: remove concrete numeric values (e.g., "200億元"); keep only metric names (e.g., "營收")
- **TOPIC filter**: remove generic, low-discriminability words (e.g., "策略", "健康")
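Taken together, the thresholds and the fragment filter can be sketched as a small post-processing function. The `filter_entities` helper below is our own illustration; threshold values are copied from the table and applied to `pipeline(..., aggregation_strategy="simple")` output:

```python
# Per-type confidence thresholds from the table above.
THRESHOLDS = {
    "PER": 0.75, "ORG": 0.80, "LOC": 0.90, "DATE": 0.75, "EVENT": 0.75,
    "PROD": 0.60, "LAW": 0.00, "METRIC": 0.85, "TOPIC": 0.80,
}

def filter_entities(entities):
    """entities: dicts with 'entity_group', 'score', 'word' (pipeline format)."""
    kept = []
    for e in entities:
        if e["score"] < THRESHOLDS.get(e["entity_group"], 0.5):
            continue  # below the per-type confidence threshold
        if len(e["word"]) <= 1:
            continue  # fragment filter: drop single-character entities
        kept.append(e)
    return kept

# Illustrative pipeline-style output, not real model predictions:
raw = [
    {"entity_group": "ORG", "score": 0.95, "word": "台積電"},
    {"entity_group": "TOPIC", "score": 0.55, "word": "策略"},  # below 0.80
    {"entity_group": "PER", "score": 0.92, "word": "魏"},      # fragment
]
print(filter_entities(raw))
# [{'entity_group': 'ORG', 'score': 0.95, 'word': '台積電'}]
```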

Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | hfl/chinese-roberta-wwm-ext |
| Architecture | BertForTokenClassification |
| Epochs | 25 (early stopping, patience=5) |
| Batch Size | 16 |
| Learning Rate | 3e-5 |
| Weight Decay | 0.01 |
| Warmup Ratio | 0.1 |
| Dropout | 0.2 |
| Label Smoothing | 0.02 |
| Class Weights | Log-scaled (damped) |
| Max Sequence Length | 512 |
| Metric for Best Model | F1 |
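As a rough guide, the hyperparameters above map onto `transformers.TrainingArguments` as sketched below. This is a hedged reconstruction, not the published training script; dropout, label smoothing, and class weights are applied via the model config and a custom loss, which are not shown here:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Illustrative mapping of the table to TrainingArguments
# (on older transformers releases, eval_strategy is evaluation_strategy).
args = TrainingArguments(
    output_dir="eland-ner-zh",
    num_train_epochs=25,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

# Early stopping with patience=5, passed to Trainer(callbacks=[...]):
early_stop = EarlyStoppingCallback(early_stopping_patience=5)
```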

Dataset

Trained on 1,979 annotated Chinese articles from news and social media sources.

| Split | Samples | Source |
|-------|---------|--------|
| Train | 1,619 | 95% random split (seed=42) |
| Valid | 86 | 5% random split |
| Test | 49 | Separate manual test set |

Total entity annotations: ~13,000 across 9 types.

See dataset: p988744/eland-ner-zh

Related Models

| Model | Task | Repository |
|-------|------|------------|
| Eland Sentiment | Sentiment Analysis | p988744/eland-sentiment-zh |
| Eland Stance | Stance Detection | p988744/eland-stance-zh |
| Eland Entity Sentiment | Entity Sentiment | p988744/eland-entity-sentiment-zh |
| Eland Official Doc | Document Formatting | p988744/eland-official-doc-zh |
| Eland Legal IE | Legal Info Extraction | p988744/eland-legal-ie-zh |

Citation

```bibtex
@misc{eland-ner-zh,
  author = {Eland AI},
  title = {Eland NER: Chinese Public Opinion Named Entity Recognition Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/p988744/eland-ner-zh}
}
```

License

Apache 2.0
