MAIRK committed
Commit e62adf0 · verified · 1 Parent(s): 97dc95d

Upload README.md

---
license: apache-2.0
datasets:
- boltuix/conll2025-ner
language:
- zh
- en
metrics:
- precision
- recall
- f1
- accuracy
pipeline_tag: token-classification
library_name: transformers
new_version: v1.1
tags:
- ner
- token-classification
- transformer
- bert
- nlp
- pretrained
- fine-tuning
- conll2025
base_model:
- boltuix/bert-mini
---

# EntityBERT Overview

A lightweight named-entity-recognition model based on `boltuix/bert-mini`, supporting Chinese and English with 36 entity classes. It is fine-tuned on the `boltuix/conll2025-ner` dataset and suits information extraction, dialogue understanding, search enhancement, and similar scenarios.

## Core Features

- **Entity types**: 36 classes in total, including person, organization, geopolitical entity, date, quantity, money, language, law, facility, event, and product
- **Input**: Chinese or English sentences
- **Output**: an entity label per token (BIO scheme)
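
Because the model emits one BIO label per token, downstream code usually has to merge `B-`/`I-` runs back into entity spans itself. A minimal decoder, as a pure-Python sketch (the token/tag sequence below is illustrative, not actual model output; for English, join tokens with spaces instead):

```python
def bio_to_spans(tokens, tags):
    """Merge a BIO tag sequence into (entity_text, entity_type) spans."""
    spans, current_toks, current_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_toks:  # flush the previous entity
                spans.append(("".join(current_toks), current_type))
            current_toks, current_type = [tok], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_toks.append(tok)  # continuation of the current entity
        else:  # "O" or a mismatched I- tag ends the current entity
            if current_toks:
                spans.append(("".join(current_toks), current_type))
            current_toks, current_type = [], None
    if current_toks:
        spans.append(("".join(current_toks), current_type))
    return spans

print(bio_to_spans(
    ["马", "斯", "克", "创", "立", "了", "特", "斯", "拉"],
    ["B-PERSON", "I-PERSON", "I-PERSON", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG"],
))  # → [('马斯克', 'PERSON'), ('特斯拉', 'ORG')]
```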

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")

text = "马斯克于 2025 年 3 月在加州成立了特斯拉。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

preds = logits.argmax(dim=-1)[0].tolist()
toks = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for tok, p in zip(toks, preds):
    if tok not in tokenizer.all_special_tokens:
        print(f"{tok:12s} -> {model.config.id2label[p]}")
```

## Training Example

```python
from transformers import BertTokenizerFast, AutoModelForTokenClassification
from transformers import TrainingArguments, Trainer, DataCollatorForTokenClassification
import datasets, evaluate, numpy as np

# Load the dataset
dataset = datasets.load_dataset("boltuix/conll2025-ner")

# Build the label mappings
tags = sorted({t for split in dataset.values() for seq in split["ner_tags"] for t in seq})
label2id = {t: i for i, t in enumerate(tags)}
id2label = {i: t for t, i in label2id.items()}

# Align subword tokens with labels
tokenizer = BertTokenizerFast.from_pretrained("boltuix/bert-mini")

def align(examples):
    tok = tokenizer(examples["tokens"], is_split_into_words=True, truncation=True)
    aligned_labels = []
    for i, lbls in enumerate(examples["ner_tags"]):
        wids = tok.word_ids(batch_index=i)
        prev = None
        ids = []
        for w in wids:
            if w is None:
                ids.append(-100)  # special tokens are ignored by the loss
            elif w != prev:
                ids.append(label2id[lbls[w]])  # first subword carries the label
                prev = w
            else:
                ids.append(-100)  # remaining subwords are masked out
        aligned_labels.append(ids)
    tok["labels"] = aligned_labels
    return tok

tok_ds = dataset["train"].map(align, batched=True)

# Configure the model and training arguments
model = AutoModelForTokenClassification.from_pretrained(
    "boltuix/bert-mini",
    num_labels=len(tags),
    id2label=id2label,
    label2id=label2id,
)
args = TrainingArguments(
    output_dir="./output",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="none",
)

data_collator = DataCollatorForTokenClassification(tokenizer)
metric = evaluate.load("seqeval")

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=-1)
    refs = p.label_ids
    pred_list, ref_list = [], []
    for pr, lb in zip(preds, refs):
        pred_list.append([id2label[x] for (x, y) in zip(pr, lb) if y != -100])
        ref_list.append([id2label[y] for (x, y) in zip(pr, lb) if y != -100])
    res = metric.compute(predictions=pred_list, references=ref_list)
    # Keep only the scalar overall scores; seqeval also returns nested per-entity dicts
    return {k: round(v, 4) for k, v in res.items() if isinstance(v, float)}

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tok_ds,
    eval_dataset=dataset["validation"].map(align, batched=True),
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
model.save_pretrained("./entitybert-final")
```

## Common Label Examples

* **B-PERSON / I-PERSON**: person
* **B-ORG / I-ORG**: organization
* **B-GPE / I-GPE**: geopolitical entity
* **B-DATE / I-DATE**: date
* **B-MONEY / I-MONEY**: money

## Performance Overview

| Metric    | Score |
| --------- | ----- |
| Precision | 0.84  |
| Recall    | 0.86  |
| F1        | 0.85  |
| Accuracy  | 0.91  |
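
For reference, these are entity-level scores: precision is TP / (TP + FP) over predicted spans, recall is TP / (TP + FN) over gold spans, and F1 is their harmonic mean. A quick sanity check of that arithmetic, with illustrative counts rather than the actual evaluation tallies:

```python
def entity_f1(tp, fp, fn):
    """Entity-level precision / recall / F1 from span counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy counts: 84 correct spans, 16 spurious predictions, 14 missed entities
p, r, f = entity_f1(tp=84, fp=16, fn=14)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.84 0.86 0.85
```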

## Installation

```bash
pip install transformers datasets torch evaluate seqeval
```

---
license: apache-2.0
language:
- zh
- en
datasets:
- clue/ner
metrics:
- precision
- recall
- f1
- accuracy
pipeline_tag: token-classification
library_name: transformers
new_version: v1.0
base_model:
- hfl/chinese-bert-wwm
---

# 🇨🇳 Chinese Entity Recognition Model: ChineseEntityBERT

## 📌 Model Introduction

`ChineseEntityBERT` is a Chinese named-entity-recognition (NER) model fine-tuned from `hfl/chinese-bert-wwm`, designed specifically for entity recognition in Chinese text. It identifies 18 entity classes, including place names, organizations, person names, times, and products, and suits legal, medical, financial, government, educational, and other Chinese-language scenarios.

* Supported language: Chinese (Simplified)
* Training data: multi-domain annotated corpora from CLUE NER, covering news, encyclopedia, medical, and other text
* Model size: ~102M parameters
* Label set: BIO format, 37 labels in total (including O)

## 🎯 Application Scenarios

### 🌟 Practical Examples

* **Government information systems**: extract place names, organizations, and person names from official documents to support knowledge-graph construction
* **Financial text analysis**: identify companies, financial products, times, currency amounts, and other key information
* **Medical text structuring**: extract diseases, drugs, times, and other medical entities to power structured electronic health records
* **Educational Q&A**: build entity-centric semantic search and question-answering systems

### 🧠 Sample Recognition

```
Input:  华为公司于2025年6月在深圳发布了新款Mate手机。
Output:
华为公司 → B-ORG
2025年 → B-DATE
6月 → I-DATE
深圳 → B-GPE
Mate手机 → B-PRODUCT
```

## 🛠️ Usage

### 🚀 Inference Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "your-namespace/ChineseEntityBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "字节跳动总部位于北京,于2023年推出了豆包大模型。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(-1)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]
for token, label in zip(tokens, labels):
    if token not in tokenizer.all_special_tokens:
        print(f"{token:10} → {label}")
```

---

## 🧾 Supported Entity Labels (partial)

| Label      | Meaning                   |
| ---------- | ------------------------- |
| B-PERSON   | beginning of person name  |
| I-PERSON   | inside person name        |
| B-ORG      | beginning of organization |
| B-GPE      | beginning of place name   |
| B-DATE     | beginning of date         |
| B-PRODUCT  | beginning of product name |
| B-LAW      | law name                  |
| B-DISEASE  | disease name              |
| B-MEDICINE | drug name                 |
| O          | non-entity token          |

(See the model configuration for the full label list.)

---
## 🇨🇳 Chinese Recognition Model (ZhonghuaNER)

### 📌 Model Summary

ZhonghuaNER is a lightweight Chinese named-entity-recognition (NER) model built on a compact BERT architecture (bert-mini), optimized for extracting person names, place names, organizations, times, and similar information from Chinese text.

* **Base model**: compact BERT (bert-mini)
* **Language**: Chinese
* **Label set**: 28 BIO-format entity labels in total
* **Use cases**: news, legal, medical, e-commerce, and other vertical Chinese domains

### 🔍 Supported Entity Types

ZhonghuaNER uses the BIO tagging scheme and covers entities including, but not limited to:

| Entity Type   | Example            | Description                  |
| ------------- | ------------------ | ---------------------------- |
| B-PERSON      | 李白               | person name                  |
| B-ORG         | 中国科学院         | organization name            |
| B-LOC         | 泰山               | geographic location          |
| B-GPE         | 北京市             | administrative region        |
| B-DATE        | 二零二五年六月     | date                         |
| B-TIME        | 下午三点           | time                         |
| B-MONEY       | 五百万元           | monetary amount              |
| B-PRODUCT     | 华为Mate60         | product name                 |
| B-EVENT       | 中秋节             | event                        |
| B-WORK_OF_ART | 清明上河图         | work of art                  |
| B-LAW         | 中华人民共和国民法典 | law or regulation           |
| I-xxx         | tokens inside the matching B-xxx entity | entity continuation |
| O             | non-entity token   | -                            |
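
Since an `I-xxx` tag is only meaningful as the continuation of a matching `B-xxx`, it can help to sanity-check predicted tag sequences before decoding them. A small illustrative checker (pure Python, not part of the model itself):

```python
def check_bio(tags):
    """Return positions where an I- tag does not continue a matching entity."""
    errors = []
    prev_type = None  # entity type carried by the previous tag, None after "O"
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            prev_type = tag[2:]
        elif tag.startswith("I-"):
            if prev_type != tag[2:]:
                errors.append(i)  # orphan continuation tag
            prev_type = tag[2:]
        else:  # "O"
            prev_type = None
    return errors

print(check_bio(["B-PERSON", "I-PERSON", "O", "I-ORG"]))  # → [3]
```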

### 🧠 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model = AutoModelForTokenClassification.from_pretrained("ZhonghuaAI/zhonghua-ner")
tokenizer = AutoTokenizer.from_pretrained("ZhonghuaAI/zhonghua-ner")

text = "2025年中秋节,李白参观了故宫博物院。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)
labels = [model.config.id2label[idx.item()] for idx in predictions[0]]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, label in zip(tokens, labels):
    if token not in tokenizer.all_special_tokens:
        print(f"{token}\t→\t{label}")
```

### 📦 Installation

```bash
pip install transformers torch
```

### ✅ Example Output

```
2025 → B-DATE
年 → I-DATE
中 → B-EVENT
秋 → I-EVENT
节 → I-EVENT
李 → B-PERSON
白 → I-PERSON
参 → O
观 → O
了 → O
故 → B-ORG
宫 → I-ORG
博 → I-ORG
物 → I-ORG
院 → I-ORG
。 → O
```
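
To turn per-character output like the above into complete entities, the `(token, label)` pairs can be grouped by entity type. A pure-Python sketch (the sample input mirrors the example output above):

```python
from collections import defaultdict

def group_entities(token_label_pairs):
    """Group per-token BIO output into {entity_type: [entity_text, ...]}."""
    grouped = defaultdict(list)
    buf, buf_type = [], None
    for token, label in token_label_pairs:
        if label.startswith("B-"):
            if buf:  # flush the previous entity
                grouped[buf_type].append("".join(buf))
            buf, buf_type = [token], label[2:]
        elif label.startswith("I-") and buf_type == label[2:]:
            buf.append(token)
        else:  # "O" or mismatched I- ends the entity
            if buf:
                grouped[buf_type].append("".join(buf))
            buf, buf_type = [], None
    if buf:
        grouped[buf_type].append("".join(buf))
    return dict(grouped)

pairs = [("2025", "B-DATE"), ("年", "I-DATE"), ("中", "B-EVENT"), ("秋", "I-EVENT"),
         ("节", "I-EVENT"), ("李", "B-PERSON"), ("白", "I-PERSON"), ("参", "O"),
         ("观", "O"), ("了", "O"), ("故", "B-ORG"), ("宫", "I-ORG"), ("博", "I-ORG"),
         ("物", "I-ORG"), ("院", "I-ORG"), ("。", "O")]
print(group_entities(pairs))
# → {'DATE': ['2025年'], 'EVENT': ['中秋节'], 'PERSON': ['李白'], 'ORG': ['故宫博物院']}
```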

---

🧪 **Coming next: training, evaluation, and deployment**

Files changed (1): README.md +124 -568
README.md CHANGED
@@ -1,627 +1,183 @@
1
- ---
2
  license: apache-2.0
3
- datasets:
4
- - boltuix/conll2025-ner
5
  language:
6
- - zh
7
- - en
8
- metrics:
9
- - precision
10
- - recall
11
- - f1
12
- - accuracy
13
- pipeline_tag: token-classification
14
- library_name: transformers
15
- new_version: v1.1
16
- tags:
17
- - token-classification
18
- - ner
19
- - named-entity-recognition
20
- - text-classification
21
- - sequence-labeling
22
- - transformer
23
- - bert
24
- - nlp
25
- - pretrained-model
26
- - dataset-finetuning
27
- - deep-learning
28
- - huggingface
29
- - conll2025
30
- - real-time-inference
31
- - efficient-nlp
32
- - high-accuracy
33
- - gpu-optimized
34
- - chatbot
35
- - information-extraction
36
- - search-enhancement
37
- - knowledge-graph
38
- - legal-nlp
39
- - medical-nlp
40
- - financial-nlp
41
- base_model:
42
- - boltuix/bert-mini
43
- ---
44
-
45
- ![Banner](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirI_izRmtBN9DOIqHFRBdXqh8eUBf10yVEfKIjVglp1AKmvtoJ65ZkPeG9Xm6eqs-RcqR3HMmTizOb0eT80PV_E8qsk2XQqMqqPsfSvPmUtCFmJ6S4KTIx5hGy1m_vZRQskO3s8bNYKMPpAwHBU4zSpIjKIha-GrhBFRFdGS0bJ6ybztOFZJDgsQGMk7Q/s6250/BOLTUIX%20(2).jpg)
46
-
47
 
48
- # 🌟 EntityBERT Model 🌟
 
 
 
 
 
 
 
 
 
 
 
 
 
49
 
50
- ## 🚀 Model Details
51
-
52
- ### 🌈 Description
53
- The `boltuix/EntityBERT` model is a lightweight, fine-tuned transformer for **Named Entity Recognition (NER)**, built on the `boltuix/bert-mini` base model. Optimized for efficiency, it identifies 36 entity types (e.g., people, organizations, locations, dates) in English text, making it perfect for applications like information extraction, chatbots, and search enhancement.
54
 
55
- - **Dataset**: [boltuix/conll2025-ner](https://huggingface.co/datasets/boltuix/conll2025-ner) (143,709 entries, 6.38 MB)
56
- - **Entity Types**: 36 NER tags (18 entity categories with B-/I- tags + O)
57
- - **Training Examples**: ~115,812 | **Validation**: ~15,680 | **Test**: ~12,217
58
- - **Domains**: News, user-generated content, research corpora
59
- - **Tasks**: Sentence-level and document-level NER
60
- - **Version**: v1.0
61
 
62
- ### 🔧 Info
63
- - **Developer**: Boltuix
64
- - **License**: Apache-2.0
65
- - **Language**: English
66
- - **Type**: Transformer-based Token Classification
67
- - **Trained**: Before June 11, 2025
68
- - **Base Model**: `boltuix/bert-mini`
69
- - **Parameters**: ~4.4M
70
- - **Size**: ~15 MB
71
 
72
- ### 🔗 Links
73
- - **Model Repository**: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder, update with correct URL)
74
- - **Dataset**: [boltuix/conll2025-ner](#download-instructions) (placeholder, update with correct URL)
75
- - **Hugging Face Docs**: [Transformers](https://huggingface.co/docs/transformers)
76
- - **Demo**: Coming Soon
77
 
78
- ---
 
 
 
79
 
80
- ## 🎯 Use Cases for NER
81
 
82
- ### 🌟 Direct Applications
83
- - **Information Extraction**: Identify names (👤 PERSON), locations (🌍 GPE), and dates (🗓️ DATE) from articles, blogs, or reports.
84
- - **Chatbots & Virtual Assistants**: Improve user query understanding by recognizing entities.
85
- - **Search Enhancement**: Enable entity-based semantic search (e.g., “news about Paris in 2025”).
86
- - **Knowledge Graphs**: Construct structured graphs connecting entities like 🏢 ORG and 👤 PERSON.
87
 
88
- ### 🌱 Downstream Tasks
89
- - **Domain Adaptation**: Fine-tune for specialized fields like medical 🩺, legal 📜, or financial 💸 NER.
90
- - **Multilingual Extensions**: Retrain for non-English languages.
91
- - **Custom Entities**: Adapt for niche domains (e.g., product IDs, stock tickers).
92
 
93
- ### Limitations
94
- - **English-Only**: Limited to English text out-of-the-box.
95
- - **Domain Bias**: Trained on `boltuix/conll2025-ner`, which may favor news and formal text, potentially weaker on informal or social media content.
96
- - **Generalization**: May struggle with rare or highly contextual entities not in the dataset.
97
 
98
- ---
99
- ![Banner](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxRTNdRYrYE60erg7MOPEcl9oU78UdHcW_NuEHX92KwKdaDHIz37pAzKWj1XzIO-ycuO3t5MKcd5kouku-lghXowVq2xFxZKsQRJTUzhyphenhyphennOgOPr_5MLMCbZpyixqQ_jc0Zrx_kc3C8K23-rJA_wwty5X-hPCJVjIfaFOov06xgWXatBAVdwS_10OHrTVA/s6250/BOLTUIX%20(1).jpg)
 
 
 
 
 
 
 
100
 
101
- ## 🛠️ Getting Started
102
 
103
- ### 🧪 Inference Code
104
- Run NER with the following Python code:
105
 
106
  ```python
107
  from transformers import AutoTokenizer, AutoModelForTokenClassification
108
  import torch
109
 
110
- # Load model and tokenizer
111
- tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
112
- model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")
113
 
114
- # Input text
115
- text = "Elon Musk launched Tesla in California on March 2025."
116
  inputs = tokenizer(text, return_tensors="pt")
117
-
118
- # Run inference
119
  with torch.no_grad():
120
- outputs = model(**inputs)
121
- predictions = outputs.logits.argmax(dim=-1)
122
 
123
- # Map predictions to labels
124
  tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
125
- label_map = model.config.id2label
126
- labels = [label_map[p.item()] for p in predictions[0]]
127
-
128
- # Print results
129
  for token, label in zip(tokens, labels):
130
  if token not in tokenizer.all_special_tokens:
131
- print(f"{token:15} → {label}")
132
- ```
133
-
134
- ### ✨ Example Output
135
  ```
136
- Elon → B-PERSON
137
- Musk → I-PERSON
138
- launched → O
139
- Tesla → B-ORG
140
- in → O
141
- California → B-GPE
142
- on → O
143
- March → B-DATE
144
- 2025 → I-DATE
145
- . → O
146
- ```
147
-
148
- ### 🛠️ Requirements
149
- ```bash
150
- pip install transformers torch pandas pyarrow
151
- ```
152
- - **Python**: 3.8+
153
- - **Storage**: ~15 MB for model weights
154
- - **Optional**: `seqeval` for evaluation, `cuda` for GPU acceleration
155
 
156
  ---
157
 
158
- ## 🧠 Entity Labels
159
- The model supports 36 NER tags from the `boltuix/conll2025-ner` dataset, using the **BIO tagging scheme**:
160
- - **B-**: Beginning of an entity
161
- - **I-**: Inside of an entity
162
- - **O**: Outside of any entity
163
-
164
- | Tag Name | Purpose | Emoji |
165
- |------------------|--------------------------------------------------------------------------|--------|
166
- | O | Outside of any named entity (e.g., "the", "is") | 🚫 |
167
- | B-CARDINAL | Beginning of a cardinal number (e.g., "1000") | 🔢 |
168
- | B-DATE | Beginning of a date (e.g., "January") | 🗓️ |
169
- | B-EVENT | Beginning of an event (e.g., "Olympics") | 🎉 |
170
- | B-FAC | Beginning of a facility (e.g., "Eiffel Tower") | 🏛️ |
171
- | B-GPE | Beginning of a geopolitical entity (e.g., "Tokyo") | 🌍 |
172
- | B-LANGUAGE | Beginning of a language (e.g., "Spanish") | 🗣️ |
173
- | B-LAW | Beginning of a law or legal document (e.g., "Constitution") | 📜 |
174
- | B-LOC | Beginning of a non-GPE location (e.g., "Pacific Ocean") | 🗺️ |
175
- | B-MONEY | Beginning of a monetary value (e.g., "$100") | 💸 |
176
- | B-NORP | Beginning of a nationality/religious/political group (e.g., "Democrat") | 🏳️ |
177
- | B-ORDINAL | Beginning of an ordinal number (e.g., "first") | 🥇 |
178
- | B-ORG | Beginning of an organization (e.g., "Microsoft") | 🏢 |
179
- | B-PERCENT | Beginning of a percentage (e.g., "50%") | 📊 |
180
- | B-PERSON | Beginning of a person’s name (e.g., "Elon Musk") | 👤 |
181
- | B-PRODUCT | Beginning of a product (e.g., "iPhone") | 📱 |
182
- | B-QUANTITY | Beginning of a quantity (e.g., "two liters") | ⚖️ |
183
- | B-TIME | Beginning of a time (e.g., "noon") | ⏰ |
184
- | B-WORK_OF_ART | Beginning of a work of art (e.g., "Mona Lisa") | 🎨 |
185
- | I-CARDINAL | Inside of a cardinal number | 🔢 |
186
- | I-DATE | Inside of a date (e.g., "2025" in "January 2025") | 🗓️ |
187
- | I-EVENT | Inside of an event name | 🎉 |
188
- | I-FAC | Inside of a facility name | 🏛️ |
189
- | I-GPE | Inside of a geopolitical entity | 🌍 |
190
- | I-LANGUAGE | Inside of a language name | 🗣️ |
191
- | I-LAW | Inside of a legal document title | 📜 |
192
- | I-LOC | Inside of a location | 🗺️ |
193
- | I-MONEY | Inside of a monetary value | 💸 |
194
- | I-NORP | Inside of a NORP entity | 🏳️ |
195
- | I-ORDINAL | Inside of an ordinal number | 🥇 |
196
- | I-ORG | Inside of an organization name | 🏢 |
197
- | I-PERCENT | Inside of a percentage | 📊 |
198
- | I-PERSON | Inside of a person’s name | 👤 |
199
- | I-PRODUCT | Inside of a product name | 📱 |
200
- | I-QUANTITY | Inside of a quantity | ⚖️ |
201
- | I-TIME | Inside of a time phrase | ⏰ |
202
- | I-WORK_OF_ART | Inside of a work of art title | 🎨 |
203
-
204
- **Example**:
205
- Text: `"Tesla opened in Shanghai on April 2025"`
206
- Tags: `[B-ORG, O, O, B-GPE, O, B-DATE, I-DATE]`
207
-
208
- ---
209
-
210
- ## 📈 Performance
211
-
212
- Evaluated on the `boltuix/conll2025-ner` test split (~12,217 examples) using `seqeval`:
213
 
214
- | Metric | Score |
215
- |------------|-------|
216
- | 🎯 Precision | 0.84 |
217
- | 🕸️ Recall | 0.86 |
218
- | 🎶 F1 Score | 0.85 |
219
- | ✅ Accuracy | 0.91 |
 
 
 
 
 
 
220
 
221
- *Note*: Performance may vary on different domains or text types.
222
 
223
  ---
 
224
 
225
- ## ⚙️ Training Setup
226
-
227
- - **Hardware**: NVIDIA GPU
228
- - **Training Time**: ~1.5 hours
229
- - **Parameters**: ~4.4M
230
- - **Optimizer**: AdamW
231
- - **Precision**: FP32
232
- - **Batch Size**: 16
233
- - **Learning Rate**: 2e-5
234
-
235
- ---
236
-
237
- ## 🧠 Training the Model
238
-
239
- Fine-tune `boltuix/bert-mini` on the `boltuix/conll2025-ner` dataset to replicate or extend `EntityBERT`. Below is a simplified training script:
240
-
241
- ```python
242
- # 🛠️ Step 1: Install required libraries quietly
243
- !pip install evaluate transformers datasets tokenizers seqeval pandas pyarrow -q
244
-
245
- # 🚫 Step 2: Disable Weights & Biases (WandB)
246
- import os
247
- os.environ["WANDB_MODE"] = "disabled"
248
-
249
- # 📚 Step 2: Import necessary libraries
250
- import pandas as pd
251
- import datasets
252
- import numpy as np
253
- from transformers import BertTokenizerFast
254
- from transformers import DataCollatorForTokenClassification
255
- from transformers import AutoModelForTokenClassification
256
- from transformers import TrainingArguments, Trainer
257
- import evaluate
258
- from transformers import pipeline
259
- from collections import defaultdict
260
- import json
261
-
262
- # 📥 Step 3: Load the CoNLL-2025 NER dataset from Parquet
263
- # Download : https://huggingface.co/datasets/boltuix/conll2025-ner/blob/main/conll2025_ner.parquet
264
- parquet_file = "conll2025_ner.parquet"
265
- df = pd.read_parquet(parquet_file)
266
-
267
- # 🔍 Step 4: Convert pandas DataFrame to Hugging Face Dataset
268
- conll2025 = datasets.Dataset.from_pandas(df)
269
-
270
- # 🔎 Step 5: Inspect the dataset structure
271
- print("Dataset structure:", conll2025)
272
- print("Dataset features:", conll2025.features)
273
- print("First example:", conll2025[0])
274
-
275
- # 🏷️ Step 6: Extract unique tags and create mappings
276
- # Since ner_tags are strings, collect all unique tags
277
- all_tags = set()
278
- for example in conll2025:
279
- all_tags.update(example["ner_tags"])
280
- unique_tags = sorted(list(all_tags)) # Sort for consistency
281
- num_tags = len(unique_tags)
282
- tag2id = {tag: i for i, tag in enumerate(unique_tags)}
283
- id2tag = {i: tag for i, tag in enumerate(unique_tags)}
284
- print("Number of unique tags:", num_tags)
285
- print("Unique tags:", unique_tags)
286
-
287
- # 🔧 Step 7: Convert string ner_tags to indices
288
- def convert_tags_to_ids(example):
289
- example["ner_tags"] = [tag2id[tag] for tag in example["ner_tags"]]
290
- return example
291
-
292
- conll2025 = conll2025.map(convert_tags_to_ids)
293
-
294
- # 📊 Step 8: Split dataset based on 'split' column
295
- dataset_dict = {
296
- "train": conll2025.filter(lambda x: x["split"] == "train"),
297
- "validation": conll2025.filter(lambda x: x["split"] == "validation"),
298
- "test": conll2025.filter(lambda x: x["split"] == "test")
299
- }
300
- conll2025 = datasets.DatasetDict(dataset_dict)
301
- print("Split dataset structure:", conll2025)
302
-
303
- # 🪙 Step 9: Initialize the tokenizer
304
- tokenizer = BertTokenizerFast.from_pretrained("boltuix/bert-mini")
305
-
306
- # 📝 Step 10: Tokenize an example text and inspect
307
- example_text = conll2025["train"][0]
308
- tokenized_input = tokenizer(example_text["tokens"], is_split_into_words=True)
309
- tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
310
- word_ids = tokenized_input.word_ids()
311
- print("Word IDs:", word_ids)
312
- print("Tokenized input:", tokenized_input)
313
- print("Length of ner_tags vs input IDs:", len(example_text["ner_tags"]), len(tokenized_input["input_ids"]))
314
-
315
- # 🔄 Step 11: Define function to tokenize and align labels
316
- def tokenize_and_align_labels(examples, label_all_tokens=True):
317
- """
318
- Tokenize inputs and align labels for NER tasks.
319
-
320
- Args:
321
- examples (dict): Dictionary with tokens and ner_tags.
322
- label_all_tokens (bool): Whether to label all subword tokens.
323
-
324
- Returns:
325
- dict: Tokenized inputs with aligned labels.
326
- """
327
- tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
328
- labels = []
329
- for i, label in enumerate(examples["ner_tags"]):
330
- word_ids = tokenized_inputs.word_ids(batch_index=i)
331
- previous_word_idx = None
332
- label_ids = []
333
- for word_idx in word_ids:
334
- if word_idx is None:
335
- label_ids.append(-100) # Special tokens get -100
336
- elif word_idx != previous_word_idx:
337
- label_ids.append(label[word_idx]) # First token of word gets label
338
- else:
339
- label_ids.append(label[word_idx] if label_all_tokens else -100) # Subwords get label or -100
340
- previous_word_idx = word_idx
341
- labels.append(label_ids)
342
- tokenized_inputs["labels"] = labels
343
- return tokenized_inputs
344
-
345
- # 🧪 Step 12: Test the tokenization and label alignment
346
- q = tokenize_and_align_labels(conll2025["train"][0:1])
347
- print("Tokenized and aligned example:", q)
348
-
349
- # 📋 Step 13: Print tokens and their corresponding labels
350
- for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]), q["labels"][0]):
351
- print(f"{token:_<40} {label}")
352
-
353
- # 🔧 Step 14: Apply tokenization to the entire dataset
354
- tokenized_datasets = conll2025.map(tokenize_and_align_labels, batched=True)
355
-
356
- # 🤖 Step 15: Initialize the model with the correct number of labels
357
- model = AutoModelForTokenClassification.from_pretrained("boltuix/bert-mini", num_labels=num_tags)
358
-
359
- # ⚙️ Step 16: Set up training arguments
360
- args = TrainingArguments(
361
- "boltuix/bert-ner",
362
- eval_strategy="epoch", # Changed evaluation_strategy to eval_strategy
363
- learning_rate=2e-5,
364
- per_device_train_batch_size=16,
365
- per_device_eval_batch_size=16,
366
- num_train_epochs=1,
367
- weight_decay=0.01,
368
- report_to="none"
369
- )
370
- # 📊 Step 17: Initialize data collator for dynamic padding
371
- data_collator = DataCollatorForTokenClassification(tokenizer)
372
-
373
- # 📈 Step 18: Load evaluation metric
374
- metric = evaluate.load("seqeval")
375
-
376
- # 🏷️ Step 19: Set label list and test metric computation
377
- label_list = unique_tags
378
- print("Label list:", label_list)
379
- example = conll2025["train"][0]
380
- labels = [label_list[i] for i in example["ner_tags"]]
381
- print("Metric test:", metric.compute(predictions=[labels], references=[labels]))
382
-
383
- # 📉 Step 20: Define function to compute evaluation metrics
384
- def compute_metrics(eval_preds):
385
- """
386
- Compute precision, recall, F1, and accuracy for NER.
387
-
388
- Args:
389
- eval_preds (tuple): Predicted logits and true labels.
390
-
391
- Returns:
392
- dict: Evaluation metrics.
393
- """
394
- pred_logits, labels = eval_preds
395
- pred_logits = np.argmax(pred_logits, axis=2)
396
- predictions = [
397
- [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
398
- for prediction, label in zip(pred_logits, labels)
399
- ]
400
- true_labels = [
401
- [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
402
- for prediction, label in zip(pred_logits, labels)
403
- ]
404
- results = metric.compute(predictions=predictions, references=true_labels)
405
- return {
406
- "precision": results["overall_precision"],
407
- "recall": results["overall_recall"],
408
- "f1": results["overall_f1"],
409
- "accuracy": results["overall_accuracy"],
410
- }
411
-
412
- # 🚀 Step 21: Initialize and train the trainer
413
- trainer = Trainer(
414
- model,
415
- args,
416
- train_dataset=tokenized_datasets["train"],
417
- eval_dataset=tokenized_datasets["validation"],
418
- data_collator=data_collator,
419
- tokenizer=tokenizer,
420
- compute_metrics=compute_metrics
421
- )
422
- trainer.train()
423
-
424
- # 💾 Step 22: Save the fine-tuned model
425
- model.save_pretrained("boltuix/bert-ner")
426
- tokenizer.save_pretrained("tokenizer")
427
-
428
- # 🔗 Step 23: Update model configuration with label mappings
429
- id2label = {str(i): label for i, label in enumerate(label_list)}
430
- label2id = {label: str(i) for i, label in enumerate(label_list)}
431
- config = json.load(open("boltuix/bert-ner/config.json"))
432
- config["id2label"] = id2label
433
- config["label2id"] = label2id
434
- json.dump(config, open("boltuix/bert-ner/config.json", "w"))
435
-
436
- # 🔄 Step 24: Load the fine-tuned model
437
- model_fine_tuned = AutoModelForTokenClassification.from_pretrained("boltuix/bert-ner")
438
-
439
- # 🛠️ Step 25: Create a pipeline for NER inference
440
- nlp = pipeline("token-classification", model=model_fine_tuned, tokenizer=tokenizer)
441
-
442
- # 📝 Step 26: Perform NER on an example sentence
443
- example = "On July 4th, 2023, President Joe Biden visited the United Nations headquarters in New York to deliver a speech about international law and donated $5 million to relief efforts."
444
- ner_results = nlp(example)
445
- print("NER results for first example:", ner_results)
446
-
447
- # 📍 Step 27: Perform NER on a property address and format output
448
- example = "This page contains information about the property located at 1275 Kinnear Rd, Columbus, OH, 43212."
449
- ner_results = nlp(example)
450
-
451
- # 🧹 Step 28: Process NER results into structured entities
452
- entities = defaultdict(list)
453
- current_entity = ""
454
- current_type = ""
455
-
456
- for item in ner_results:
457
- entity = item["entity"]
458
- word = item["word"]
459
- if word.startswith("##"):
460
- current_entity += word[2:] # Handle subword tokens
461
- elif entity.startswith("B-"):
462
- if current_entity and current_type:
463
- entities[current_type].append(current_entity.strip())
464
- current_type = entity[2:].lower()
465
- current_entity = word
466
- elif entity.startswith("I-") and entity[2:].lower() == current_type:
467
- current_entity += " " + word # Continue same entity
468
- else:
469
- if current_entity and current_type:
470
- entities[current_type].append(current_entity.strip())
471
- current_entity = ""
472
- current_type = ""
473
-
474
- # Append final entity if exists
475
- if current_entity and current_type:
476
- entities[current_type].append(current_entity.strip())
477
-
478
- # 📤 Step 29: Output the final JSON
479
- final_json = dict(entities)
480
- print("Structured NER output:")
481
- print(json.dumps(final_json, indent=2))
482
- ```
483
-
484
- ### 🛠️ Tips
485
- - **Hyperparameters**: Experiment with `learning_rate` (1e-5 to 5e-5) or `num_train_epochs` (2-5).
486
- - **GPU**: Use `fp16=True` for faster training.
487
- - **Custom Data**: Modify the script for custom NER datasets.
488
-
489
- ### ⏱️ Expected Training Time
490
- - ~1.5 hours on an NVIDIA GPU (e.g., T4) for ~115,812 examples, 3 epochs, batch size 16.
491
-
492
- ### 🌍 Carbon Impact
493
- - Emissions: ~40g CO₂eq (estimated via ML Impact tool for 1.5 hours on GPU).
494
 
495
- ---
496
 
497
- ## 🛠️ Installation
 
 
 
498
 
499
- ```bash
500
- pip install transformers torch pandas pyarrow seqeval
501
- ```
502
- - **Python**: 3.8+
503
- - **Storage**: ~15 MB for model, ~6.38 MB for dataset
504
- - **Optional**: NVIDIA CUDA for GPU acceleration
505
 
506
- ### Download Instructions 📥
507
- - **Model**: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder, update with correct URL).
508
- - **Dataset**: [boltuix/conll2025-ner](https://huggingface.co/datasets/boltuix/conll2025-ner) (placeholder, update with correct URL).
509
 
510
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
511
 
512
- ## 🧪 Evaluation Code
513
- Evaluate on custom data:
514
 
515
  ```python
516
  from transformers import AutoTokenizer, AutoModelForTokenClassification
517
- from seqeval.metrics import classification_report
518
  import torch
519
 
520
- # Load model and tokenizer
521
- tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
522
- model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")
523
-
524
- # Test data
525
- texts = ["Elon Musk launched Tesla in California on March 2025."]
526
- true_labels = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "B-GPE", "O", "B-DATE", "I-DATE", "O"]]
527
-
528
- pred_labels = []
529
- for text in texts:
530
- inputs = tokenizer(text, return_tensors="pt")
531
- with torch.no_grad():
532
- outputs = model(**inputs)
533
- predictions = outputs.logits.argmax(dim=-1)[0].cpu().numpy()
534
- tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
535
- word_ids = inputs.word_ids(batch_index=0)
536
- word_preds = []
537
- previous_word_idx = None
538
- for idx, word_idx in enumerate(word_ids):
539
- if word_idx is None or word_idx == previous_word_idx:
540
- continue
541
- label = model.config.id2label[predictions[idx]]
542
- word_preds.append(label)
543
- previous_word_idx = word_idx
544
- pred_labels.append(word_preds)
545
-
546
- # Evaluate
547
- print("Predicted:", pred_labels)
548
- print("True :", true_labels)
549
- print("\n📊 Evaluation Report:\n")
550
- print(classification_report(true_labels, pred_labels))
551
- ```
552
-
553
- ---
554
 
555
- ## 🌱 Dataset Details
556
- - **Entries**: 143,709
557
- - **Size**: 6.38 MB (Parquet)
558
- - **Columns**: `split`, `tokens`, `ner_tags`
559
- - **Splits**: Train (~115,812), Validation (~15,680), Test (~12,217)
560
- - **NER Tags**: 36 (18 entity types with B-/I- + O)
561
- - **Source**: News, user-generated content, research corpora
562
 
563
- ---
 
564
 
565
- ## 📊 Visualizing NER Tags
566
- Compute tag distribution with:
 
 
567
 
568
- ```python
569
- import pandas as pd
570
- from collections import Counter
571
- import matplotlib.pyplot as plt
572
-
573
- # Load dataset
574
- df = pd.read_parquet("conll2025_ner.parquet")
575
- all_tags = [tag for tags in df["ner_tags"] for tag in tags]
576
- tag_counts = Counter(all_tags)
577
-
578
- # Plot
579
- plt.figure(figsize=(12, 7))
580
- plt.bar(tag_counts.keys(), tag_counts.values(), color="#36A2EB")
581
- plt.title("CoNLL 2025 NER: Tag Distribution", fontsize=16)
582
- plt.xlabel("NER Tag", fontsize=12)
583
- plt.ylabel("Count", fontsize=12)
584
- plt.xticks(rotation=45, ha="right", fontsize=10)
585
- plt.grid(axis="y", linestyle="--", alpha=0.7)
586
- plt.tight_layout()
587
- plt.savefig("ner_tag_distribution.png")
588
- plt.show()
589
  ```
590
 
591
- ---
592
-
593
- ## ⚖️ Comparison to Other Models
594
- | Model | Dataset | Parameters | F1 Score | Size |
595
- |----------------------|--------------------|------------|----------|--------|
596
- | **EntityBERT** | conll2025-ner | ~4.4M | 0.85 | ~15 MB |
597
- | NeuroBERT-NER | conll2025-ner | ~11M | 0.86 | ~50 MB |
598
- | BERT-base-NER | CoNLL-2003 | ~110M | ~0.89 | ~400 MB|
599
- | DistilBERT-NER | CoNLL-2003 | ~66M | ~0.85 | ~200 MB|
600
 
601
- **Advantages**:
602
- - Ultra-lightweight (~4.4M parameters, ~15 MB)
603
- - Competitive F1 score (0.85)
604
- - Ideal for resource-constrained environments
605
-
606
- ---
607
-
608
- ## 🌐 Community and Support
609
- - 📍 Model page: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder)
610
- - 🛠️ Issues/Contributions: Model repository (URL TBD)
611
- - 💬 Hugging Face forums: [https://huggingface.co/discussions](https://huggingface.co/discussions)
612
- - 📚 Docs: [Hugging Face Transformers](https://huggingface.co/docs/transformers)
613
- - 📧 Contact: [boltuix@gmail.com](mailto:boltuix@gmail.com)

---

## ✍️ Contact
- **Author**: Boltuix
- **Email**: [boltuix@gmail.com](mailto:boltuix@gmail.com)
- **Hugging Face**: [boltuix](https://huggingface.co/boltuix)

---

## 📅 Last Updated
**June 11, 2025** — Released v1.0 with fine-tuning on `boltuix/conll2025-ner`.

**[Get Started Now](#getting-started)** 🚀
 
 
---
license: apache-2.0
language:
- zh
- en
datasets:
- clue/ner
metrics:
- precision
- recall
- f1
- accuracy
pipeline_tag: token-classification
library_name: transformers
new_version: v1.0
base_model:
- hfl/chinese-bert-wwm
---
 
 
 

# 🇨🇳 Chinese Named Entity Recognition Model: ChineseEntityBERT

## 📌 Model Overview

`ChineseEntityBERT` is a Chinese named entity recognition (NER) model fine-tuned from `hfl/chinese-bert-wwm`, built specifically for entity extraction in Chinese text. It accurately recognizes 18 entity classes, including place names, organizations, person names, dates, and products, and is well suited to legal, medical, financial, government, and education scenarios.

* Supported language: Chinese (Simplified)
* Training data: multi-domain annotated corpus from CLUE NER, covering news, encyclopedia, medical, and other domains
* Model size: ~102M parameters
* Label scheme: BIO format, 37 labels in total (including O)
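With 18 entity classes, a BIO scheme yields exactly 2 × 18 + 1 = 37 labels. The sketch below shows how such a label map can be constructed; the entity-type list here is hypothetical, for illustration only, since the model's real list lives in its config (`id2label`):

```python
# Hypothetical entity types for illustration; the model's actual types
# are defined in its configuration (model.config.id2label).
entity_types = [
    "PERSON", "ORG", "GPE", "LOC", "DATE", "TIME", "MONEY", "PERCENT",
    "PRODUCT", "EVENT", "WORK_OF_ART", "LAW", "DISEASE", "MEDICINE",
    "FAC", "LANGUAGE", "NORP", "QUANTITY",
]

# "O" plus one B-/I- pair per type: 1 + 2 * 18 = 37 labels.
labels = ["O"] + [f"{prefix}-{t}" for t in entity_types for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}

print(len(labels))  # 37
```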
 
## 🎯 Use Cases

### 🌟 Practical Examples

* **Government information systems**: extract place names, organizations, and person names from official documents to support knowledge-graph construction
* **Financial text analysis**: identify companies, financial products, dates, currency amounts, and other key information
* **Medical text structuring**: extract diseases, drugs, dates, and other medical entities to enable structured electronic health records
* **Educational Q&A**: build entity-centric semantic search and question-answering systems
 
### 🧠 Recognition Example

```
Input: 华为公司于2025年6月在深圳发布了新款Mate手机。
Output:
华为公司 → B-ORG
2025年 → B-DATE
6月 → I-DATE
深圳 → B-GPE
Mate手机 → B-PRODUCT
```

## 🛠️ Usage

### 🚀 Inference Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "your-namespace/ChineseEntityBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "字节跳动总部位于北京,于2023年推出了豆包大模型。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(-1)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

for token, label in zip(tokens, labels):
    if token not in tokenizer.all_special_tokens:
        print(f"{token:10} → {label}")
```
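The loop above prints one label per subword token. To report word-level labels instead, a common convention keeps the label of each word's first subword; fast tokenizers expose the needed mapping via `inputs.word_ids()`. A minimal sketch of just the merging step, with the word-id list assumed precomputed:

```python
def first_subword_labels(word_ids, labels):
    """Keep one label per word: the label of its first subword.

    word_ids: as returned by BatchEncoding.word_ids() from a fast tokenizer
              (None marks special tokens such as [CLS]/[SEP]).
    """
    word_labels = {}
    for wid, label in zip(word_ids, labels):
        if wid is not None and wid not in word_labels:
            word_labels[wid] = label  # first subword of this word wins
    return [word_labels[i] for i in sorted(word_labels)]

# Toy example: [CLS], two subwords of word 0, word 1, [SEP]
word_ids = [None, 0, 0, 1, None]
labels = ["O", "B-ORG", "I-ORG", "O", "O"]
print(first_subword_labels(word_ids, labels))  # ['B-ORG', 'O']
```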
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

---

## 🧾 Supported Entity Labels (partial)

| Label      | Meaning                            |
| ---------- | ---------------------------------- |
| B-PERSON   | Beginning of a person name         |
| I-PERSON   | Inside a person name               |
| B-ORG      | Beginning of an organization name  |
| B-GPE      | Beginning of a geopolitical entity |
| B-DATE     | Beginning of a date                |
| B-PRODUCT  | Beginning of a product name        |
| B-LAW      | Law name                           |
| B-DISEASE  | Disease name                       |
| B-MEDICINE | Drug name                          |
| O          | Non-entity token                   |

(See the model configuration for the full label list.)
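Raw model output occasionally violates the BIO constraint that an `I-X` tag must continue a `B-X` or `I-X` of the same type. A small post-processing pass (a generic sketch, independent of any particular model) can repair such sequences:

```python
def repair_bio(labels):
    """Promote any I-X tag that does not continue a same-type entity to B-X."""
    repaired = []
    prev_type = None
    for label in labels:
        if label.startswith("I-"):
            ent_type = label[2:]
            if prev_type != ent_type:
                label = "B-" + ent_type  # orphan I- tag: start a new entity
        prev_type = label[2:] if label != "O" else None
        repaired.append(label)
    return repaired

print(repair_bio(["I-ORG", "I-ORG", "O", "I-DATE"]))
# ['B-ORG', 'I-ORG', 'O', 'B-DATE']
```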

---

## 🇨🇳 ZhonghuaNER: Chinese Entity Recognition Model

### 📌 Model Overview

ZhonghuaNER is a lightweight Chinese named entity recognition (NER) model built on a compact BERT architecture (bert-mini), optimized for extracting person names, place names, organizations, dates and times, and similar information from Chinese text.

* **Base model**: compact BERT (bert-mini)
* **Supported language**: Chinese
* **Label scheme**: 28 BIO-format entity labels in total
* **Use cases**: vertical Chinese domains such as news, legal, medical, and e-commerce
 
### 🔍 Supported Entity Types

ZhonghuaNER uses the BIO tagging scheme and supports entities including, but not limited to:

| Entity Type   | Example                                | Description           |
| ------------- | -------------------------------------- | --------------------- |
| B-PERSON      | 李白                                   | Person name           |
| B-ORG         | 中国科学院                             | Organization          |
| B-LOC         | 泰山                                   | Location              |
| B-GPE         | 北京市                                 | Administrative region |
| B-DATE        | 二零二五年六月                         | Date                  |
| B-TIME        | 下午三点                               | Time                  |
| B-MONEY       | 五百万元                               | Monetary amount       |
| B-PRODUCT     | 华为Mate60                             | Product               |
| B-EVENT       | 中秋节                                 | Event                 |
| B-WORK_OF_ART | 清明上河图                             | Work of art           |
| B-LAW         | 中华人民共和国民法典                   | Law or regulation     |
| I-xxx         | Tokens inside the matching B-xxx entity | Entity continuation   |
| O             | Non-entity token                       | -                     |
 
### 🧠 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model = AutoModelForTokenClassification.from_pretrained("ZhonghuaAI/zhonghua-ner")
tokenizer = AutoTokenizer.from_pretrained("ZhonghuaAI/zhonghua-ner")

text = "2025年中秋节,李白参观了故宫博物院。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)
labels = [model.config.id2label[idx.item()] for idx in predictions[0]]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, label in zip(tokens, labels):
    if token not in tokenizer.all_special_tokens:
        print(f"{token}\t→\t{label}")
```

### 📦 Installation

```bash
pip install transformers torch
```
 
### ✅ Sample Output

```
2025 → B-DATE
年 → I-DATE
中 → B-EVENT
秋 → I-EVENT
节 → I-EVENT
李 → B-PERSON
白 → I-PERSON
参 → O
观 → O
了 → O
故 → B-ORG
宫 → I-ORG
博 → I-ORG
物 → I-ORG
院 → I-ORG
。 → O
```
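Because the Chinese BERT tokenizer splits text into single characters, downstream use usually merges these per-character tags into whole entities. A small sketch of that grouping, fed with the sample output above (pure Python, no model required):

```python
def merge_entities(pairs):
    """Group (token, BIO-label) pairs into (entity_text, entity_type) spans."""
    entities = []
    for token, label in pairs:
        if label.startswith("B-"):
            entities.append([token, label[2:]])           # start a new entity
        elif label.startswith("I-") and entities and entities[-1][1] == label[2:]:
            entities[-1][0] += token                      # extend current entity
    return [tuple(e) for e in entities]

# Per-character output from the sample above
pairs = [
    ("2025", "B-DATE"), ("年", "I-DATE"),
    ("中", "B-EVENT"), ("秋", "I-EVENT"), ("节", "I-EVENT"),
    ("李", "B-PERSON"), ("白", "I-PERSON"),
    ("参", "O"), ("观", "O"), ("了", "O"),
    ("故", "B-ORG"), ("宫", "I-ORG"), ("博", "I-ORG"), ("物", "I-ORG"), ("院", "I-ORG"),
    ("。", "O"),
]
print(merge_entities(pairs))
# [('2025年', 'DATE'), ('中秋节', 'EVENT'), ('李白', 'PERSON'), ('故宫博物院', 'ORG')]
```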

---

🧪 **The next sections (training, evaluation, and deployment) continue in the second half.**