MAIRK committed
Commit e62adf0 · verified · 1 Parent(s): 97dc95d

Upload README.md

---
license: apache-2.0
datasets:
- boltuix/conll2025-ner
language:
- zh
- en
metrics:
- precision
- recall
- f1
- accuracy
pipeline_tag: token-classification
library_name: transformers
new_version: v1.1
tags:
- ner
- token-classification
- transformer
- bert
- nlp
- pretrained
- fine-tuning
- conll2025
base_model:
- boltuix/bert-mini
---

# EntityBERT Overview

A lightweight named-entity-recognition model based on `boltuix/bert-mini`, supporting Chinese and English with 36 entity classes. It is fine-tuned on the `boltuix/conll2025-ner` dataset and suits information extraction, dialogue understanding, search enhancement, and similar scenarios.

## Core Features

- **Entity types**: 36 classes in total, including person, organization, geopolitical entity, date, quantity, money, language, law, facility, event, and product
- **Input**: Chinese or English sentences
- **Output**: an entity label per token (BIO scheme)
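
Because the model emits one BIO label per token, downstream code usually has to merge `B-`/`I-` runs back into entity spans itself. A minimal decoder, as a pure-Python sketch (the token/tag sequence below is illustrative, not actual model output; for English, join tokens with spaces instead):

```python
def bio_to_spans(tokens, tags):
    """Merge a BIO tag sequence into (entity_text, entity_type) spans."""
    spans, current_toks, current_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_toks:  # flush the previous entity
                spans.append(("".join(current_toks), current_type))
            current_toks, current_type = [tok], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_toks.append(tok)  # continuation of the current entity
        else:  # "O" or a mismatched I- tag ends the current entity
            if current_toks:
                spans.append(("".join(current_toks), current_type))
            current_toks, current_type = [], None
    if current_toks:
        spans.append(("".join(current_toks), current_type))
    return spans

print(bio_to_spans(
    ["马", "斯", "克", "创", "立", "了", "特", "斯", "拉"],
    ["B-PERSON", "I-PERSON", "I-PERSON", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG"],
))  # → [('马斯克', 'PERSON'), ('特斯拉', 'ORG')]
```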

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")

text = "马斯克于 2025 年 3 月在加州成立了特斯拉。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

preds = logits.argmax(dim=-1)[0].tolist()
toks = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for tok, p in zip(toks, preds):
    if tok not in tokenizer.all_special_tokens:
        print(f"{tok:12s} -> {model.config.id2label[p]}")
```

## Training Example

```python
from transformers import BertTokenizerFast, AutoModelForTokenClassification
from transformers import TrainingArguments, Trainer, DataCollatorForTokenClassification
import datasets, evaluate, numpy as np

# Load the dataset
dataset = datasets.load_dataset("boltuix/conll2025-ner")

# Build the label mappings
tags = sorted({t for split in dataset.values() for seq in split["ner_tags"] for t in seq})
label2id = {t: i for i, t in enumerate(tags)}
id2label = {i: t for t, i in label2id.items()}

# Align subword tokens with labels
tokenizer = BertTokenizerFast.from_pretrained("boltuix/bert-mini")

def align(examples):
    tok = tokenizer(examples["tokens"], is_split_into_words=True, truncation=True)
    aligned_labels = []
    for i, lbls in enumerate(examples["ner_tags"]):
        wids = tok.word_ids(batch_index=i)
        prev = None
        ids = []
        for w in wids:
            if w is None:
                ids.append(-100)  # special tokens are ignored by the loss
            elif w != prev:
                ids.append(label2id[lbls[w]])  # first subword carries the label
                prev = w
            else:
                ids.append(-100)  # remaining subwords are masked out
        aligned_labels.append(ids)
    tok["labels"] = aligned_labels
    return tok

tok_ds = dataset["train"].map(align, batched=True)

# Configure the model and training arguments
model = AutoModelForTokenClassification.from_pretrained(
    "boltuix/bert-mini",
    num_labels=len(tags),
    id2label=id2label,
    label2id=label2id,
)
args = TrainingArguments(
    output_dir="./output",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
)

data_collator = DataCollatorForTokenClassification(tokenizer)
metric = evaluate.load("seqeval")

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=-1)
    refs = p.label_ids
    pred_list, ref_list = [], []
    for pr, lb in zip(preds, refs):
        pred_list.append([id2label[x] for (x, y) in zip(pr, lb) if y != -100])
        ref_list.append([id2label[y] for (x, y) in zip(pr, lb) if y != -100])
    res = metric.compute(predictions=pred_list, references=ref_list)
    # Keep only the scalar overall scores; seqeval also returns nested per-entity dicts
    return {k: round(v, 4) for k, v in res.items() if isinstance(v, float)}

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tok_ds,
    eval_dataset=dataset["validation"].map(align, batched=True),
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
model.save_pretrained("./entitybert-final")
```

## Common Label Examples

* **B-PERSON / I-PERSON**: person
* **B-ORG / I-ORG**: organization
* **B-GPE / I-GPE**: geopolitical entity
* **B-DATE / I-DATE**: date
* **B-MONEY / I-MONEY**: money

## Performance Overview

| Metric    | Score |
| --------- | ----- |
| Precision | 0.84  |
| Recall    | 0.86  |
| F1        | 0.85  |
| Accuracy  | 0.91  |
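
For reference, these are entity-level scores: precision is TP / (TP + FP) over predicted spans, recall is TP / (TP + FN) over gold spans, and F1 is their harmonic mean. A quick sanity check of that arithmetic, with illustrative counts rather than the actual evaluation tallies:

```python
def entity_f1(tp, fp, fn):
    """Entity-level precision / recall / F1 from span counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy counts: 84 correct spans, 16 spurious predictions, 14 missed entities
p, r, f = entity_f1(tp=84, fp=16, fn=14)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.84 0.86 0.85
```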

## Installation

```bash
pip install transformers datasets torch evaluate seqeval
```

---
license: apache-2.0
language:
- zh
- en
datasets:
- clue/ner
metrics:
- precision
- recall
- f1
- accuracy
pipeline_tag: token-classification
library_name: transformers
new_version: v1.0
base_model:
- hfl/chinese-bert-wwm
---

# 🇨🇳 Chinese Entity Recognition Model: ChineseEntityBERT

## 📌 Model Introduction

`ChineseEntityBERT` is a Chinese named-entity-recognition (NER) model fine-tuned from `hfl/chinese-bert-wwm`, designed specifically for entity recognition in Chinese text. It identifies 18 entity classes, including place names, organizations, person names, times, and products, and suits legal, medical, financial, government, educational, and other Chinese-language scenarios.

* Supported language: Chinese (Simplified)
* Training data: multi-domain annotated corpora from CLUE NER, covering news, encyclopedia, medical, and other text
* Model size: ~102M parameters
* Label set: BIO format, 37 labels in total (including O)

## 🎯 Application Scenarios

### 🌟 Practical Examples

* **Government information systems**: extract place names, organizations, and person names from official documents to support knowledge-graph construction
* **Financial text analysis**: identify companies, financial products, times, currency amounts, and other key information
* **Medical text structuring**: extract diseases, drugs, times, and other medical entities to power structured electronic health records
* **Educational Q&A**: build entity-centric semantic search and question-answering systems

### 🧠 Sample Recognition

```
Input:  华为公司于2025年6月在深圳发布了新款Mate手机。
Output:
华为公司 → B-ORG
2025年 → B-DATE
6月 → I-DATE
深圳 → B-GPE
Mate手机 → B-PRODUCT
```

## 🛠️ Usage

### 🚀 Inference Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "your-namespace/ChineseEntityBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "字节跳动总部位于北京,于2023年推出了豆包大模型。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(-1)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]
for token, label in zip(tokens, labels):
    if token not in tokenizer.all_special_tokens:
        print(f"{token:10} → {label}")
```

---

## 🧾 Supported Entity Labels (partial)

| Label      | Meaning                   |
| ---------- | ------------------------- |
| B-PERSON   | beginning of person name  |
| I-PERSON   | inside person name        |
| B-ORG      | beginning of organization |
| B-GPE      | beginning of place name   |
| B-DATE     | beginning of date         |
| B-PRODUCT  | beginning of product name |
| B-LAW      | law name                  |
| B-DISEASE  | disease name              |
| B-MEDICINE | drug name                 |
| O          | non-entity token          |

(See the model configuration for the full label list.)

---
## 🇨🇳 Chinese Recognition Model (ZhonghuaNER)

### 📌 Model Summary

ZhonghuaNER is a lightweight Chinese named-entity-recognition (NER) model built on a compact BERT architecture (bert-mini), optimized for extracting person names, place names, organizations, times, and similar information from Chinese text.

* **Base model**: compact BERT (bert-mini)
* **Language**: Chinese
* **Label set**: 28 BIO-format entity labels in total
* **Use cases**: news, legal, medical, e-commerce, and other vertical Chinese domains

### 🔍 Supported Entity Types

ZhonghuaNER uses the BIO tagging scheme and covers entities including, but not limited to:

| Entity Type   | Example            | Description                  |
| ------------- | ------------------ | ---------------------------- |
| B-PERSON      | 李白               | person name                  |
| B-ORG         | 中国科学院         | organization name            |
| B-LOC         | 泰山               | geographic location          |
| B-GPE         | 北京市             | administrative region        |
| B-DATE        | 二零二五年六月     | date                         |
| B-TIME        | 下午三点           | time                         |
| B-MONEY       | 五百万元           | monetary amount              |
| B-PRODUCT     | 华为Mate60         | product name                 |
| B-EVENT       | 中秋节             | event                        |
| B-WORK_OF_ART | 清明上河图         | work of art                  |
| B-LAW         | 中华人民共和国民法典 | law or regulation           |
| I-xxx         | tokens inside the matching B-xxx entity | entity continuation |
| O             | non-entity token   | -                            |
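
Since an `I-xxx` tag is only meaningful as the continuation of a matching `B-xxx`, it can help to sanity-check predicted tag sequences before decoding them. A small illustrative checker (pure Python, not part of the model itself):

```python
def check_bio(tags):
    """Return positions where an I- tag does not continue a matching entity."""
    errors = []
    prev_type = None  # entity type carried by the previous tag, None after "O"
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            prev_type = tag[2:]
        elif tag.startswith("I-"):
            if prev_type != tag[2:]:
                errors.append(i)  # orphan continuation tag
            prev_type = tag[2:]
        else:  # "O"
            prev_type = None
    return errors

print(check_bio(["B-PERSON", "I-PERSON", "O", "I-ORG"]))  # → [3]
```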

### 🧠 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model = AutoModelForTokenClassification.from_pretrained("ZhonghuaAI/zhonghua-ner")
tokenizer = AutoTokenizer.from_pretrained("ZhonghuaAI/zhonghua-ner")

text = "2025年中秋节,李白参观了故宫博物院。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)
labels = [model.config.id2label[idx.item()] for idx in predictions[0]]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, label in zip(tokens, labels):
    if token not in tokenizer.all_special_tokens:
        print(f"{token}\t→\t{label}")
```

### 📦 Installation

```bash
pip install transformers torch
```

### ✅ Example Output

```
2025 → B-DATE
年 → I-DATE
中 → B-EVENT
秋 → I-EVENT
节 → I-EVENT
李 → B-PERSON
白 → I-PERSON
参 → O
观 → O
了 → O
故 → B-ORG
宫 → I-ORG
博 → I-ORG
物 → I-ORG
院 → I-ORG
。 → O
```
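
To turn per-character output like the above into complete entities, the `(token, label)` pairs can be grouped by entity type. A pure-Python sketch (the sample input mirrors the example output above):

```python
from collections import defaultdict

def group_entities(token_label_pairs):
    """Group per-token BIO output into {entity_type: [entity_text, ...]}."""
    grouped = defaultdict(list)
    buf, buf_type = [], None
    for token, label in token_label_pairs:
        if label.startswith("B-"):
            if buf:  # flush the previous entity
                grouped[buf_type].append("".join(buf))
            buf, buf_type = [token], label[2:]
        elif label.startswith("I-") and buf_type == label[2:]:
            buf.append(token)
        else:  # "O" or mismatched I- ends the entity
            if buf:
                grouped[buf_type].append("".join(buf))
            buf, buf_type = [], None
    if buf:
        grouped[buf_type].append("".join(buf))
    return dict(grouped)

pairs = [("2025", "B-DATE"), ("年", "I-DATE"), ("中", "B-EVENT"), ("秋", "I-EVENT"),
         ("节", "I-EVENT"), ("李", "B-PERSON"), ("白", "I-PERSON"), ("参", "O"),
         ("观", "O"), ("了", "O"), ("故", "B-ORG"), ("宫", "I-ORG"), ("博", "I-ORG"),
         ("物", "I-ORG"), ("院", "I-ORG"), ("。", "O")]
print(group_entities(pairs))
# → {'DATE': ['2025年'], 'EVENT': ['中秋节'], 'PERSON': ['李白'], 'ORG': ['故宫博物院']}
```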

---

🧪 **Coming next: training, evaluation, and deployment**

Files changed (1): README.md +124 -568
README.md CHANGED
@@ -1,627 +1,183 @@
1
- ---
2
  license: apache-2.0
3
- datasets:
4
- - boltuix/conll2025-ner
5
  language:
6
- - zh
7
- - en
8
- metrics:
9
- - precision
10
- - recall
11
- - f1
12
- - accuracy
13
- pipeline_tag: token-classification
14
- library_name: transformers
15
- new_version: v1.1
16
- tags:
17
- - token-classification
18
- - ner
19
- - named-entity-recognition
20
- - text-classification
21
- - sequence-labeling
22
- - transformer
23
- - bert
24
- - nlp
25
- - pretrained-model
26
- - dataset-finetuning
27
- - deep-learning
28
- - huggingface
29
- - conll2025
30
- - real-time-inference
31
- - efficient-nlp
32
- - high-accuracy
33
- - gpu-optimized
34
- - chatbot
35
- - information-extraction
36
- - search-enhancement
37
- - knowledge-graph
38
- - legal-nlp
39
- - medical-nlp
40
- - financial-nlp
41
- base_model:
42
- - boltuix/bert-mini
43
- ---
44
-
45
- ![Banner](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirI_izRmtBN9DOIqHFRBdXqh8eUBf10yVEfKIjVglp1AKmvtoJ65ZkPeG9Xm6eqs-RcqR3HMmTizOb0eT80PV_E8qsk2XQqMqqPsfSvPmUtCFmJ6S4KTIx5hGy1m_vZRQskO3s8bNYKMPpAwHBU4zSpIjKIha-GrhBFRFdGS0bJ6ybztOFZJDgsQGMk7Q/s6250/BOLTUIX%20(2).jpg)
46
-
47
 
48
- # 🌟 EntityBERT Model 🌟
 
 
 
 
 
 
 
 
 
 
 
 
 
49
 
50
- ## 🚀 Model Details
51
-
52
- ### 🌈 Description
53
- The `boltuix/EntityBERT` model is a lightweight, fine-tuned transformer for **Named Entity Recognition (NER)**, built on the `boltuix/bert-mini` base model. Optimized for efficiency, it identifies 36 entity types (e.g., people, organizations, locations, dates) in English text, making it perfect for applications like information extraction, chatbots, and search enhancement.
54
 
55
- - **Dataset**: [boltuix/conll2025-ner](https://huggingface.co/datasets/boltuix/conll2025-ner) (143,709 entries, 6.38 MB)
56
- - **Entity Types**: 36 NER tags (18 entity categories with B-/I- tags + O)
57
- - **Training Examples**: ~115,812 | **Validation**: ~15,680 | **Test**: ~12,217
58
- - **Domains**: News, user-generated content, research corpora
59
- - **Tasks**: Sentence-level and document-level NER
60
- - **Version**: v1.0
61
 
62
- ### 🔧 Info
63
- - **Developer**: Boltuix
64
- - **License**: Apache-2.0
65
- - **Language**: English
66
- - **Type**: Transformer-based Token Classification
67
- - **Trained**: Before June 11, 2025
68
- - **Base Model**: `boltuix/bert-mini`
69
- - **Parameters**: ~4.4M
70
- - **Size**: ~15 MB
71
 
72
- ### 🔗 Links
73
- - **Model Repository**: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder, update with correct URL)
74
- - **Dataset**: [boltuix/conll2025-ner](#download-instructions) (placeholder, update with correct URL)
75
- - **Hugging Face Docs**: [Transformers](https://huggingface.co/docs/transformers)
76
- - **Demo**: Coming Soon
77
 
78
- ---
 
 
 
79
 
80
- ## 🎯 Use Cases for NER
81
 
82
- ### 🌟 Direct Applications
83
- - **Information Extraction**: Identify names (👤 PERSON), locations (🌍 GPE), and dates (🗓️ DATE) from articles, blogs, or reports.
84
- - **Chatbots & Virtual Assistants**: Improve user query understanding by recognizing entities.
85
- - **Search Enhancement**: Enable entity-based semantic search (e.g., “news about Paris in 2025”).
86
- - **Knowledge Graphs**: Construct structured graphs connecting entities like 🏢 ORG and 👤 PERSON.
87
 
88
- ### 🌱 Downstream Tasks
89
- - **Domain Adaptation**: Fine-tune for specialized fields like medical 🩺, legal 📜, or financial 💸 NER.
90
- - **Multilingual Extensions**: Retrain for non-English languages.
91
- - **Custom Entities**: Adapt for niche domains (e.g., product IDs, stock tickers).
92
 
93
- ### Limitations
94
- - **English-Only**: Limited to English text out-of-the-box.
95
- - **Domain Bias**: Trained on `boltuix/conll2025-ner`, which may favor news and formal text, potentially weaker on informal or social media content.
96
- - **Generalization**: May struggle with rare or highly contextual entities not in the dataset.
97
 
98
- ---
99
- ![Banner](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxRTNdRYrYE60erg7MOPEcl9oU78UdHcW_NuEHX92KwKdaDHIz37pAzKWj1XzIO-ycuO3t5MKcd5kouku-lghXowVq2xFxZKsQRJTUzhyphenhyphennOgOPr_5MLMCbZpyixqQ_jc0Zrx_kc3C8K23-rJA_wwty5X-hPCJVjIfaFOov06xgWXatBAVdwS_10OHrTVA/s6250/BOLTUIX%20(1).jpg)
 
 
 
 
 
 
 
100
 
101
- ## 🛠️ Getting Started
102
 
103
- ### 🧪 Inference Code
104
- Run NER with the following Python code:
105
 
106
  ```python
107
  from transformers import AutoTokenizer, AutoModelForTokenClassification
108
  import torch
109
 
110
- # Load model and tokenizer
111
- tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
112
- model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")
113
 
114
- # Input text
115
- text = "Elon Musk launched Tesla in California on March 2025."
116
  inputs = tokenizer(text, return_tensors="pt")
117
-
118
- # Run inference
119
  with torch.no_grad():
120
- outputs = model(**inputs)
121
- predictions = outputs.logits.argmax(dim=-1)
122
 
123
- # Map predictions to labels
124
  tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
125
- label_map = model.config.id2label
126
- labels = [label_map[p.item()] for p in predictions[0]]
127
-
128
- # Print results
129
  for token, label in zip(tokens, labels):
130
  if token not in tokenizer.all_special_tokens:
131
- print(f"{token:15} → {label}")
132
- ```
133
-
134
- ### ✨ Example Output
135
  ```
136
- Elon → B-PERSON
137
- Musk → I-PERSON
138
- launched → O
139
- Tesla → B-ORG
140
- in → O
141
- California → B-GPE
142
- on → O
143
- March → B-DATE
144
- 2025 → I-DATE
145
- . → O
146
- ```
147
-
148
- ### 🛠️ Requirements
149
- ```bash
150
- pip install transformers torch pandas pyarrow
151
- ```
152
- - **Python**: 3.8+
153
- - **Storage**: ~15 MB for model weights
154
- - **Optional**: `seqeval` for evaluation, `cuda` for GPU acceleration
155
 
156
  ---
157
 
158
- ## 🧠 Entity Labels
159
- The model supports 36 NER tags from the `boltuix/conll2025-ner` dataset, using the **BIO tagging scheme**:
160
- - **B-**: Beginning of an entity
161
- - **I-**: Inside of an entity
162
- - **O**: Outside of any entity
163
-
164
- | Tag Name | Purpose | Emoji |
165
- |------------------|--------------------------------------------------------------------------|--------|
166
- | O | Outside of any named entity (e.g., "the", "is") | 🚫 |
167
- | B-CARDINAL | Beginning of a cardinal number (e.g., "1000") | 🔢 |
168
- | B-DATE | Beginning of a date (e.g., "January") | 🗓️ |
169
- | B-EVENT | Beginning of an event (e.g., "Olympics") | 🎉 |
170
- | B-FAC | Beginning of a facility (e.g., "Eiffel Tower") | 🏛️ |
171
- | B-GPE | Beginning of a geopolitical entity (e.g., "Tokyo") | 🌍 |
172
- | B-LANGUAGE | Beginning of a language (e.g., "Spanish") | 🗣️ |
173
- | B-LAW | Beginning of a law or legal document (e.g., "Constitution") | 📜 |
174
- | B-LOC | Beginning of a non-GPE location (e.g., "Pacific Ocean") | 🗺️ |
175
- | B-MONEY | Beginning of a monetary value (e.g., "$100") | 💸 |
176
- | B-NORP | Beginning of a nationality/religious/political group (e.g., "Democrat") | 🏳️ |
177
- | B-ORDINAL | Beginning of an ordinal number (e.g., "first") | 🥇 |
178
- | B-ORG | Beginning of an organization (e.g., "Microsoft") | 🏢 |
179
- | B-PERCENT | Beginning of a percentage (e.g., "50%") | 📊 |
180
- | B-PERSON | Beginning of a person’s name (e.g., "Elon Musk") | 👤 |
181
- | B-PRODUCT | Beginning of a product (e.g., "iPhone") | 📱 |
182
- | B-QUANTITY | Beginning of a quantity (e.g., "two liters") | ⚖️ |
183
- | B-TIME | Beginning of a time (e.g., "noon") | ⏰ |
184
- | B-WORK_OF_ART | Beginning of a work of art (e.g., "Mona Lisa") | 🎨 |
185
- | I-CARDINAL | Inside of a cardinal number | 🔢 |
186
- | I-DATE | Inside of a date (e.g., "2025" in "January 2025") | 🗓️ |
187
- | I-EVENT | Inside of an event name | 🎉 |
188
- | I-FAC | Inside of a facility name | 🏛️ |
189
- | I-GPE | Inside of a geopolitical entity | 🌍 |
190
- | I-LANGUAGE | Inside of a language name | 🗣️ |
191
- | I-LAW | Inside of a legal document title | 📜 |
192
- | I-LOC | Inside of a location | 🗺️ |
193
- | I-MONEY | Inside of a monetary value | 💸 |
194
- | I-NORP | Inside of a NORP entity | 🏳️ |
195
- | I-ORDINAL | Inside of an ordinal number | 🥇 |
196
- | I-ORG | Inside of an organization name | 🏢 |
197
- | I-PERCENT | Inside of a percentage | 📊 |
198
- | I-PERSON | Inside of a person’s name | 👤 |
199
- | I-PRODUCT | Inside of a product name | 📱 |
200
- | I-QUANTITY | Inside of a quantity | ⚖️ |
201
- | I-TIME | Inside of a time phrase | ⏰ |
202
- | I-WORK_OF_ART | Inside of a work of art title | 🎨 |
203
-
204
- **Example**:
205
- Text: `"Tesla opened in Shanghai on April 2025"`
206
- Tags: `[B-ORG, O, O, B-GPE, O, B-DATE, I-DATE]`
207
-
208
- ---
209
-
210
- ## 📈 Performance
211
-
212
- Evaluated on the `boltuix/conll2025-ner` test split (~12,217 examples) using `seqeval`:
213
 
214
- | Metric | Score |
215
- |------------|-------|
216
- | 🎯 Precision | 0.84 |
217
- | 🕸️ Recall | 0.86 |
218
- | 🎶 F1 Score | 0.85 |
219
- | ✅ Accuracy | 0.91 |
 
 
 
 
 
 
220
 
221
- *Note*: Performance may vary on different domains or text types.
222
 
223
  ---
 
224
 
225
- ## ⚙️ Training Setup
226
-
227
- - **Hardware**: NVIDIA GPU
228
- - **Training Time**: ~1.5 hours
229
- - **Parameters**: ~4.4M
230
- - **Optimizer**: AdamW
231
- - **Precision**: FP32
232
- - **Batch Size**: 16
233
- - **Learning Rate**: 2e-5
234
-
235
- ---
236
-
237
- ## 🧠 Training the Model
238
-
239
- Fine-tune `boltuix/bert-mini` on the `boltuix/conll2025-ner` dataset to replicate or extend `EntityBERT`. Below is a simplified training script:
240
-
241
- ```python
242
- # 🛠️ Step 1: Install required libraries quietly
243
- !pip install evaluate transformers datasets tokenizers seqeval pandas pyarrow -q
244
-
245
- # 🚫 Step 2: Disable Weights & Biases (WandB)
246
- import os
247
- os.environ["WANDB_MODE"] = "disabled"
248
-
249
- # 📚 Step 2: Import necessary libraries
250
- import pandas as pd
251
- import datasets
252
- import numpy as np
253
- from transformers import BertTokenizerFast
254
- from transformers import DataCollatorForTokenClassification
255
- from transformers import AutoModelForTokenClassification
256
- from transformers import TrainingArguments, Trainer
257
- import evaluate
258
- from transformers import pipeline
259
- from collections import defaultdict
260
- import json
261
-
262
- # 📥 Step 3: Load the CoNLL-2025 NER dataset from Parquet
263
- # Download : https://huggingface.co/datasets/boltuix/conll2025-ner/blob/main/conll2025_ner.parquet
264
- parquet_file = "conll2025_ner.parquet"
265
- df = pd.read_parquet(parquet_file)
266
-
267
- # 🔍 Step 4: Convert pandas DataFrame to Hugging Face Dataset
268
- conll2025 = datasets.Dataset.from_pandas(df)
269
-
270
- # 🔎 Step 5: Inspect the dataset structure
271
- print("Dataset structure:", conll2025)
272
- print("Dataset features:", conll2025.features)
273
- print("First example:", conll2025[0])
274
-
275
- # 🏷️ Step 6: Extract unique tags and create mappings
276
- # Since ner_tags are strings, collect all unique tags
277
- all_tags = set()
278
- for example in conll2025:
279
- all_tags.update(example["ner_tags"])
280
- unique_tags = sorted(list(all_tags)) # Sort for consistency
281
- num_tags = len(unique_tags)
282
- tag2id = {tag: i for i, tag in enumerate(unique_tags)}
283
- id2tag = {i: tag for i, tag in enumerate(unique_tags)}
284
- print("Number of unique tags:", num_tags)
285
- print("Unique tags:", unique_tags)
286
-
287
- # 🔧 Step 7: Convert string ner_tags to indices
288
- def convert_tags_to_ids(example):
289
- example["ner_tags"] = [tag2id[tag] for tag in example["ner_tags"]]
290
- return example
291
-
292
- conll2025 = conll2025.map(convert_tags_to_ids)
293
-
294
- # 📊 Step 8: Split dataset based on 'split' column
295
- dataset_dict = {
296
- "train": conll2025.filter(lambda x: x["split"] == "train"),
297
- "validation": conll2025.filter(lambda x: x["split"] == "validation"),
298
- "test": conll2025.filter(lambda x: x["split"] == "test")
299
- }
300
- conll2025 = datasets.DatasetDict(dataset_dict)
301
- print("Split dataset structure:", conll2025)
302
-
303
- # 🪙 Step 9: Initialize the tokenizer
304
- tokenizer = BertTokenizerFast.from_pretrained("boltuix/bert-mini")
305
-
306
- # 📝 Step 10: Tokenize an example text and inspect
307
- example_text = conll2025["train"][0]
308
- tokenized_input = tokenizer(example_text["tokens"], is_split_into_words=True)
309
- tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
310
- word_ids = tokenized_input.word_ids()
311
- print("Word IDs:", word_ids)
312
- print("Tokenized input:", tokenized_input)
313
- print("Length of ner_tags vs input IDs:", len(example_text["ner_tags"]), len(tokenized_input["input_ids"]))
314
-
315
- # 🔄 Step 11: Define function to tokenize and align labels
316
- def tokenize_and_align_labels(examples, label_all_tokens=True):
317
- """
318
- Tokenize inputs and align labels for NER tasks.
319
-
320
- Args:
321
- examples (dict): Dictionary with tokens and ner_tags.
322
- label_all_tokens (bool): Whether to label all subword tokens.
323
-
324
- Returns:
325
- dict: Tokenized inputs with aligned labels.
326
- """
327
- tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
328
- labels = []
329
- for i, label in enumerate(examples["ner_tags"]):
330
- word_ids = tokenized_inputs.word_ids(batch_index=i)
331
- previous_word_idx = None
332
- label_ids = []
333
- for word_idx in word_ids:
334
- if word_idx is None:
335
- label_ids.append(-100) # Special tokens get -100
336
- elif word_idx != previous_word_idx:
337
- label_ids.append(label[word_idx]) # First token of word gets label
338
- else:
339
- label_ids.append(label[word_idx] if label_all_tokens else -100) # Subwords get label or -100
340
- previous_word_idx = word_idx
341
- labels.append(label_ids)
342
- tokenized_inputs["labels"] = labels
343
- return tokenized_inputs
344
-
345
- # 🧪 Step 12: Test the tokenization and label alignment
346
- q = tokenize_and_align_labels(conll2025["train"][0:1])
347
- print("Tokenized and aligned example:", q)
348
-
349
- # 📋 Step 13: Print tokens and their corresponding labels
350
- for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]), q["labels"][0]):
351
- print(f"{token:_<40} {label}")
352
-
353
- # 🔧 Step 14: Apply tokenization to the entire dataset
354
- tokenized_datasets = conll2025.map(tokenize_and_align_labels, batched=True)
355
-
356
- # 🤖 Step 15: Initialize the model with the correct number of labels
357
- model = AutoModelForTokenClassification.from_pretrained("boltuix/bert-mini", num_labels=num_tags)
358
-
359
- # ⚙️ Step 16: Set up training arguments
360
- args = TrainingArguments(
361
- "boltuix/bert-ner",
362
- eval_strategy="epoch", # Changed evaluation_strategy to eval_strategy
363
- learning_rate=2e-5,
364
- per_device_train_batch_size=16,
365
- per_device_eval_batch_size=16,
366
- num_train_epochs=1,
367
- weight_decay=0.01,
368
- report_to="none"
369
- )
370
- # 📊 Step 17: Initialize data collator for dynamic padding
371
- data_collator = DataCollatorForTokenClassification(tokenizer)
372
-
373
- # 📈 Step 18: Load evaluation metric
374
- metric = evaluate.load("seqeval")
375
-
376
- # 🏷️ Step 19: Set label list and test metric computation
377
- label_list = unique_tags
378
- print("Label list:", label_list)
379
- example = conll2025["train"][0]
380
- labels = [label_list[i] for i in example["ner_tags"]]
381
- print("Metric test:", metric.compute(predictions=[labels], references=[labels]))
382
-
383
- # 📉 Step 20: Define function to compute evaluation metrics
384
- def compute_metrics(eval_preds):
385
- """
386
- Compute precision, recall, F1, and accuracy for NER.
387
-
388
- Args:
389
- eval_preds (tuple): Predicted logits and true labels.
390
-
391
- Returns:
392
- dict: Evaluation metrics.
393
- """
394
- pred_logits, labels = eval_preds
395
- pred_logits = np.argmax(pred_logits, axis=2)
396
- predictions = [
397
- [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
398
- for prediction, label in zip(pred_logits, labels)
399
- ]
400
- true_labels = [
401
- [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
402
- for prediction, label in zip(pred_logits, labels)
403
- ]
404
- results = metric.compute(predictions=predictions, references=true_labels)
405
- return {
406
- "precision": results["overall_precision"],
407
- "recall": results["overall_recall"],
408
- "f1": results["overall_f1"],
409
- "accuracy": results["overall_accuracy"],
410
- }
411
-
412
- # 🚀 Step 21: Initialize and train the trainer
413
- trainer = Trainer(
414
- model,
415
- args,
416
- train_dataset=tokenized_datasets["train"],
417
- eval_dataset=tokenized_datasets["validation"],
418
- data_collator=data_collator,
419
- tokenizer=tokenizer,
420
- compute_metrics=compute_metrics
421
- )
422
- trainer.train()
423
-
424
- # 💾 Step 22: Save the fine-tuned model
425
- model.save_pretrained("boltuix/bert-ner")
426
- tokenizer.save_pretrained("tokenizer")
427
-
428
- # 🔗 Step 23: Update model configuration with label mappings
429
- id2label = {str(i): label for i, label in enumerate(label_list)}
430
- label2id = {label: str(i) for i, label in enumerate(label_list)}
431
- config = json.load(open("boltuix/bert-ner/config.json"))
432
- config["id2label"] = id2label
433
- config["label2id"] = label2id
434
- json.dump(config, open("boltuix/bert-ner/config.json", "w"))
435
-
436
- # 🔄 Step 24: Load the fine-tuned model
437
- model_fine_tuned = AutoModelForTokenClassification.from_pretrained("boltuix/bert-ner")
438
-
439
- # 🛠️ Step 25: Create a pipeline for NER inference
440
- nlp = pipeline("token-classification", model=model_fine_tuned, tokenizer=tokenizer)
441
-
442
- # 📝 Step 26: Perform NER on an example sentence
443
- example = "On July 4th, 2023, President Joe Biden visited the United Nations headquarters in New York to deliver a speech about international law and donated $5 million to relief efforts."
444
- ner_results = nlp(example)
445
- print("NER results for first example:", ner_results)
446
-
447
- # 📍 Step 27: Perform NER on a property address and format output
448
- example = "This page contains information about the property located at 1275 Kinnear Rd, Columbus, OH, 43212."
449
- ner_results = nlp(example)
450
-
451
- # 🧹 Step 28: Process NER results into structured entities
452
- entities = defaultdict(list)
453
- current_entity = ""
454
- current_type = ""
455
-
456
- for item in ner_results:
457
- entity = item["entity"]
458
- word = item["word"]
459
- if word.startswith("##"):
460
- current_entity += word[2:] # Handle subword tokens
461
- elif entity.startswith("B-"):
462
- if current_entity and current_type:
463
- entities[current_type].append(current_entity.strip())
464
- current_type = entity[2:].lower()
465
- current_entity = word
466
- elif entity.startswith("I-") and entity[2:].lower() == current_type:
467
- current_entity += " " + word # Continue same entity
468
- else:
469
- if current_entity and current_type:
470
- entities[current_type].append(current_entity.strip())
471
- current_entity = ""
472
- current_type = ""
473
-
474
- # Append final entity if exists
475
- if current_entity and current_type:
476
- entities[current_type].append(current_entity.strip())
477
-
478
- # 📤 Step 29: Output the final JSON
479
- final_json = dict(entities)
480
- print("Structured NER output:")
481
- print(json.dumps(final_json, indent=2))
482
- ```
483
-
484
- ### 🛠️ Tips
485
- - **Hyperparameters**: Experiment with `learning_rate` (1e-5 to 5e-5) or `num_train_epochs` (2-5).
486
- - **GPU**: Use `fp16=True` for faster training.
487
- - **Custom Data**: Modify the script for custom NER datasets.
488
-
489
- ### ⏱️ Expected Training Time
490
- - ~1.5 hours on an NVIDIA GPU (e.g., T4) for ~115,812 examples, 3 epochs, batch size 16.
491
-
492
- ### 🌍 Carbon Impact
493
- - Emissions: ~40g CO₂eq (estimated via ML Impact tool for 1.5 hours on GPU).
494
 
495
- ---
496
 
497
- ## 🛠️ Installation
 
 
 
498
 
499
- ```bash
500
- pip install transformers torch pandas pyarrow seqeval
501
- ```
502
- - **Python**: 3.8+
503
- - **Storage**: ~15 MB for model, ~6.38 MB for dataset
504
- - **Optional**: NVIDIA CUDA for GPU acceleration
505
 
506
- ### Download Instructions 📥
507
- - **Model**: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder, update with correct URL).
508
- - **Dataset**: [boltuix/conll2025-ner](https://huggingface.co/datasets/boltuix/conll2025-ner) (placeholder, update with correct URL).
509
 
510
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
511
 
512
- ## 🧪 Evaluation Code
513
- Evaluate on custom data:
514
 
515
  ```python
516
  from transformers import AutoTokenizer, AutoModelForTokenClassification
517
- from seqeval.metrics import classification_report
518
  import torch
519
 
520
- # Load model and tokenizer
521
- tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
522
- model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")
523
-
524
- # Test data
525
- texts = ["Elon Musk launched Tesla in California on March 2025."]
526
- true_labels = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "B-GPE", "O", "B-DATE", "I-DATE", "O"]]
527
-
528
- pred_labels = []
529
- for text in texts:
530
- inputs = tokenizer(text, return_tensors="pt")
531
- with torch.no_grad():
532
- outputs = model(**inputs)
533
- predictions = outputs.logits.argmax(dim=-1)[0].cpu().numpy()
534
- tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
535
- word_ids = inputs.word_ids(batch_index=0)
536
- word_preds = []
537
- previous_word_idx = None
538
- for idx, word_idx in enumerate(word_ids):
539
- if word_idx is None or word_idx == previous_word_idx:
540
- continue
541
- label = model.config.id2label[predictions[idx]]
542
- word_preds.append(label)
543
- previous_word_idx = word_idx
544
- pred_labels.append(word_preds)
545
-
546
- # Evaluate
547
- print("Predicted:", pred_labels)
548
- print("True :", true_labels)
549
- print("\n📊 Evaluation Report:\n")
550
- print(classification_report(true_labels, pred_labels))
551
- ```
552
-
553
- ---
554
 
555
- ## 🌱 Dataset Details
556
- - **Entries**: 143,709
557
- - **Size**: 6.38 MB (Parquet)
558
- - **Columns**: `split`, `tokens`, `ner_tags`
559
- - **Splits**: Train (~115,812), Validation (~15,680), Test (~12,217)
560
- - **NER Tags**: 36 (18 entity types with B-/I- + O)
561
- - **Source**: News, user-generated content, research corpora
562
 
563
- ---
 
564
 
565
- ## 📊 Visualizing NER Tags
566
- Compute tag distribution with:
 
 
567
 
568
- ```python
569
- import pandas as pd
570
- from collections import Counter
571
- import matplotlib.pyplot as plt
572
-
573
- # Load dataset
574
- df = pd.read_parquet("conll2025_ner.parquet")
575
- all_tags = [tag for tags in df["ner_tags"] for tag in tags]
576
- tag_counts = Counter(all_tags)
577
-
578
- # Plot
579
- plt.figure(figsize=(12, 7))
580
- plt.bar(tag_counts.keys(), tag_counts.values(), color="#36A2EB")
581
- plt.title("CoNLL 2025 NER: Tag Distribution", fontsize=16)
582
- plt.xlabel("NER Tag", fontsize=12)
583
- plt.ylabel("Count", fontsize=12)
584
- plt.xticks(rotation=45, ha="right", fontsize=10)
585
- plt.grid(axis="y", linestyle="--", alpha=0.7)
586
- plt.tight_layout()
587
- plt.savefig("ner_tag_distribution.png")
588
- plt.show()
589
  ```
590
 
591
- ---
592
-
593
- ## ⚖️ Comparison to Other Models
594
- | Model | Dataset | Parameters | F1 Score | Size |
595
- |----------------------|--------------------|------------|----------|--------|
596
- | **EntityBERT** | conll2025-ner | ~4.4M | 0.85 | ~15 MB |
597
- | NeuroBERT-NER | conll2025-ner | ~11M | 0.86 | ~50 MB |
598
- | BERT-base-NER | CoNLL-2003 | ~110M | ~0.89 | ~400 MB|
599
- | DistilBERT-NER | CoNLL-2003 | ~66M | ~0.85 | ~200 MB|
600
 
601
- **Advantages**:
602
- - Ultra-lightweight (~4.4M parameters, ~15 MB)
603
- - Competitive F1 score (0.85)
604
- - Ideal for resource-constrained environments
605
-
606
- ---
607
-
608
- ## 🌐 Community and Support
609
- - 📍 Model page: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder)
610
- - 🛠️ Issues/Contributions: Model repository (URL TBD)
611
- - 💬 Hugging Face forums: [https://huggingface.co/discussions](https://huggingface.co/discussions)
612
- - 📚 Docs: [Hugging Face Transformers](https://huggingface.co/docs/transformers)
613
- - 📧 Contact: [boltuix@gmail.com](mailto:boltuix@gmail.com)

---

## ✍️ Contact
- **Author**: Boltuix
- **Email**: [boltuix@gmail.com](mailto:boltuix@gmail.com)
- **Hugging Face**: [boltuix](https://huggingface.co/boltuix)

---

## 📅 Last Updated
**June 11, 2025** — Released v1.0 with fine-tuning on `boltuix/conll2025-ner`.

**[Get Started Now](#getting-started)** 🚀
 
 
---
license: apache-2.0
language:
- zh
- en
datasets:
- clue/ner
metrics:
- precision
- recall
- f1
- accuracy
pipeline_tag: token-classification
library_name: transformers
new_version: v1.0
base_model:
- hfl/chinese-bert-wwm
---
 
 
 

# 🇨🇳 Chinese Named Entity Recognition Model: ChineseEntityBERT

## 📌 Model Overview

`ChineseEntityBERT` is a Chinese named entity recognition (NER) model fine-tuned from `hfl/chinese-bert-wwm`, built specifically for entity extraction in Chinese text. It accurately recognizes 18 entity classes, including place names, organizations, person names, dates, and products, and is well suited to legal, medical, financial, government, and education scenarios.

* Supported language: Chinese (Simplified)
* Training data: multi-domain annotated corpus from CLUE NER, covering news, encyclopedia, medical, and other domains
* Model size: ~102M parameters
* Label scheme: BIO format, 37 labels in total (including O)
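With 18 entity classes, a BIO scheme yields exactly 2 × 18 + 1 = 37 labels. The sketch below shows how such a label map can be constructed; the entity-type list here is hypothetical, for illustration only, since the model's real list lives in its config (`id2label`):

```python
# Hypothetical entity types for illustration; the model's actual types
# are defined in its configuration (model.config.id2label).
entity_types = [
    "PERSON", "ORG", "GPE", "LOC", "DATE", "TIME", "MONEY", "PERCENT",
    "PRODUCT", "EVENT", "WORK_OF_ART", "LAW", "DISEASE", "MEDICINE",
    "FAC", "LANGUAGE", "NORP", "QUANTITY",
]

# "O" plus one B-/I- pair per type: 1 + 2 * 18 = 37 labels.
labels = ["O"] + [f"{prefix}-{t}" for t in entity_types for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}

print(len(labels))  # 37
```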
 
## 🎯 Use Cases

### 🌟 Practical Examples

* **Government information systems**: extract place names, organizations, and person names from official documents to support knowledge-graph construction
* **Financial text analysis**: identify companies, financial products, dates, currency amounts, and other key information
* **Medical text structuring**: extract diseases, drugs, dates, and other medical entities to enable structured electronic health records
* **Educational Q&A**: build entity-centric semantic search and question-answering systems
 
### 🧠 Recognition Example

```
Input: 华为公司于2025年6月在深圳发布了新款Mate手机。
Output:
华为公司 → B-ORG
2025年 → B-DATE
6月 → I-DATE
深圳 → B-GPE
Mate手机 → B-PRODUCT
```

## 🛠️ Usage

### 🚀 Inference Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "your-namespace/ChineseEntityBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "字节跳动总部位于北京,于2023年推出了豆包大模型。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(-1)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

for token, label in zip(tokens, labels):
    if token not in tokenizer.all_special_tokens:
        print(f"{token:10} → {label}")
```
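The loop above prints one label per subword token. To report word-level labels instead, a common convention keeps the label of each word's first subword; fast tokenizers expose the needed mapping via `inputs.word_ids()`. A minimal sketch of just the merging step, with the word-id list assumed precomputed:

```python
def first_subword_labels(word_ids, labels):
    """Keep one label per word: the label of its first subword.

    word_ids: as returned by BatchEncoding.word_ids() from a fast tokenizer
              (None marks special tokens such as [CLS]/[SEP]).
    """
    word_labels = {}
    for wid, label in zip(word_ids, labels):
        if wid is not None and wid not in word_labels:
            word_labels[wid] = label  # first subword of this word wins
    return [word_labels[i] for i in sorted(word_labels)]

# Toy example: [CLS], two subwords of word 0, word 1, [SEP]
word_ids = [None, 0, 0, 1, None]
labels = ["O", "B-ORG", "I-ORG", "O", "O"]
print(first_subword_labels(word_ids, labels))  # ['B-ORG', 'O']
```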
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

---

## 🧾 Supported Entity Labels (partial)

| Label      | Meaning                            |
| ---------- | ---------------------------------- |
| B-PERSON   | Beginning of a person name         |
| I-PERSON   | Inside a person name               |
| B-ORG      | Beginning of an organization name  |
| B-GPE      | Beginning of a geopolitical entity |
| B-DATE     | Beginning of a date                |
| B-PRODUCT  | Beginning of a product name        |
| B-LAW      | Law name                           |
| B-DISEASE  | Disease name                       |
| B-MEDICINE | Drug name                          |
| O          | Non-entity token                   |

(See the model configuration for the full label list.)
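Raw model output occasionally violates the BIO constraint that an `I-X` tag must continue a `B-X` or `I-X` of the same type. A small post-processing pass (a generic sketch, independent of any particular model) can repair such sequences:

```python
def repair_bio(labels):
    """Promote any I-X tag that does not continue a same-type entity to B-X."""
    repaired = []
    prev_type = None
    for label in labels:
        if label.startswith("I-"):
            ent_type = label[2:]
            if prev_type != ent_type:
                label = "B-" + ent_type  # orphan I- tag: start a new entity
        prev_type = label[2:] if label != "O" else None
        repaired.append(label)
    return repaired

print(repair_bio(["I-ORG", "I-ORG", "O", "I-DATE"]))
# ['B-ORG', 'I-ORG', 'O', 'B-DATE']
```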

---

## 🇨🇳 ZhonghuaNER: Chinese Entity Recognition Model

### 📌 Model Overview

ZhonghuaNER is a lightweight Chinese named entity recognition (NER) model built on a compact BERT architecture (bert-mini), optimized for extracting person names, place names, organizations, dates and times, and similar information from Chinese text.

* **Base model**: compact BERT (bert-mini)
* **Supported language**: Chinese
* **Label scheme**: 28 BIO-format entity labels in total
* **Use cases**: vertical Chinese domains such as news, legal, medical, and e-commerce
 
### 🔍 Supported Entity Types

ZhonghuaNER uses the BIO tagging scheme and supports entities including, but not limited to:

| Entity Type   | Example                                | Description           |
| ------------- | -------------------------------------- | --------------------- |
| B-PERSON      | 李白                                   | Person name           |
| B-ORG         | 中国科学院                             | Organization          |
| B-LOC         | 泰山                                   | Location              |
| B-GPE         | 北京市                                 | Administrative region |
| B-DATE        | 二零二五年六月                         | Date                  |
| B-TIME        | 下午三点                               | Time                  |
| B-MONEY       | 五百万元                               | Monetary amount       |
| B-PRODUCT     | 华为Mate60                             | Product               |
| B-EVENT       | 中秋节                                 | Event                 |
| B-WORK_OF_ART | 清明上河图                             | Work of art           |
| B-LAW         | 中华人民共和国民法典                   | Law or regulation     |
| I-xxx         | Tokens inside the matching B-xxx entity | Entity continuation   |
| O             | Non-entity token                       | -                     |
 
### 🧠 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model = AutoModelForTokenClassification.from_pretrained("ZhonghuaAI/zhonghua-ner")
tokenizer = AutoTokenizer.from_pretrained("ZhonghuaAI/zhonghua-ner")

text = "2025年中秋节,李白参观了故宫博物院。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)
labels = [model.config.id2label[idx.item()] for idx in predictions[0]]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, label in zip(tokens, labels):
    if token not in tokenizer.all_special_tokens:
        print(f"{token}\t→\t{label}")
```

### 📦 Installation

```bash
pip install transformers torch
```
 
### ✅ Sample Output

```
2025 → B-DATE
年 → I-DATE
中 → B-EVENT
秋 → I-EVENT
节 → I-EVENT
李 → B-PERSON
白 → I-PERSON
参 → O
观 → O
了 → O
故 → B-ORG
宫 → I-ORG
博 → I-ORG
物 → I-ORG
院 → I-ORG
。 → O
```
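Because the Chinese BERT tokenizer splits text into single characters, downstream use usually merges these per-character tags into whole entities. A small sketch of that grouping, fed with the sample output above (pure Python, no model required):

```python
def merge_entities(pairs):
    """Group (token, BIO-label) pairs into (entity_text, entity_type) spans."""
    entities = []
    for token, label in pairs:
        if label.startswith("B-"):
            entities.append([token, label[2:]])           # start a new entity
        elif label.startswith("I-") and entities and entities[-1][1] == label[2:]:
            entities[-1][0] += token                      # extend current entity
    return [tuple(e) for e in entities]

# Per-character output from the sample above
pairs = [
    ("2025", "B-DATE"), ("年", "I-DATE"),
    ("中", "B-EVENT"), ("秋", "I-EVENT"), ("节", "I-EVENT"),
    ("李", "B-PERSON"), ("白", "I-PERSON"),
    ("参", "O"), ("观", "O"), ("了", "O"),
    ("故", "B-ORG"), ("宫", "I-ORG"), ("博", "I-ORG"), ("物", "I-ORG"), ("院", "I-ORG"),
    ("。", "O"),
]
print(merge_entities(pairs))
# [('2025年', 'DATE'), ('中秋节', 'EVENT'), ('李白', 'PERSON'), ('故宫博物院', 'ORG')]
```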

---

🧪 **The next sections (training, evaluation, and deployment) continue in the second half.**