MAIRK committed on
Commit 97dc95d · verified · 1 Parent(s): 706a225

Upload README.md


---
license: apache-2.0
datasets:
- boltuix/conll2025-ner
language:
- zh
- en
metrics:
- precision
- recall
- f1
- accuracy
pipeline_tag: token-classification
library_name: transformers
new_version: v1.1
tags:
- token-classification
- ner
- named-entity-recognition
- text-classification
- sequence-labeling
- transformer
- bert
- nlp
- pretrained-model
- dataset-finetuning
- deep-learning
- huggingface
- conll2025
- real-time-inference
- efficient-nlp
- high-accuracy
- gpu-optimized
- chatbot
- information-extraction
- search-enhancement
- knowledge-graph
- legal-nlp
- medical-nlp
- financial-nlp
base_model:
- boltuix/bert-mini
---

![Banner](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirI_izRmtBN9DOIqHFRBdXqh8eUBf10yVEfKIjVglp1AKmvtoJ65ZkPeG9Xm6eqs-RcqR3HMmTizOb0eT80PV_E8qsk2XQqMqqPsfSvPmUtCFmJ6S4KTIx5hGy1m_vZRQskO3s8bNYKMPpAwHBU4zSpIjKIha-GrhBFRFdGS0bJ6ybztOFZJDgsQGMk7Q/s6250/BOLTUIX%20(2).jpg)


# 🌟 EntityBERT Model 🌟

## 🚀 Model Details

### 🌈 Description
The `boltuix/EntityBERT` model is a lightweight, fine-tuned transformer for **Named Entity Recognition (NER)**, built on the `boltuix/bert-mini` base model. Optimized for efficiency, it tags 18 entity categories (e.g., people, organizations, locations, dates) in English text, which makes it well suited to applications such as information extraction, chatbots, and search enhancement.

- **Dataset**: [boltuix/conll2025-ner](https://huggingface.co/datasets/boltuix/conll2025-ner) (143,709 entries, 6.38 MB)
- **Entity Types**: 36 B-/I- tags (18 entity categories × 2) plus `O`
- **Training Examples**: ~115,812 | **Validation**: ~15,680 | **Test**: ~12,217
- **Domains**: News, user-generated content, research corpora
- **Tasks**: Sentence-level and document-level NER
- **Version**: v1.0
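
As a quick sanity check on the figures above, the three split sizes should add up to the stated 143,709 entries. The sketch below uses only the numbers quoted in this README (which are marked approximate) and also derives each split's share of the data; the variable names are illustrative.

```python
# Split sizes as quoted in this README (marked "~" there, so treat as approximate).
splits = {"train": 115_812, "validation": 15_680, "test": 12_217}

total = sum(splits.values())

# Share of each split, in percent, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in splits.items()}

print("total entries:", total)
print("shares (%):", shares)
```

The result is roughly an 80/11/9 train/validation/test partition.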

### 🔧 Info
- **Developer**: Boltuix
- **License**: Apache-2.0
- **Language**: English
- **Type**: Transformer-based Token Classification
- **Trained**: Before June 11, 2025
- **Base Model**: `boltuix/bert-mini`
- **Parameters**: ~4.4M
- **Size**: ~15 MB

### 🔗 Links
- **Model Repository**: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder, update with correct URL)
- **Dataset**: [boltuix/conll2025-ner](#download-instructions) (placeholder, update with correct URL)
- **Hugging Face Docs**: [Transformers](https://huggingface.co/docs/transformers)
- **Demo**: Coming Soon

---

## 🎯 Use Cases for NER

### 🌟 Direct Applications
- **Information Extraction**: Identify names (👤 PERSON), locations (🌍 GPE), and dates (🗓️ DATE) from articles, blogs, or reports.
- **Chatbots & Virtual Assistants**: Improve user query understanding by recognizing entities.
- **Search Enhancement**: Enable entity-based semantic search (e.g., “news about Paris in 2025”).
- **Knowledge Graphs**: Construct structured graphs connecting entities like 🏢 ORG and 👤 PERSON.

### 🌱 Downstream Tasks
- **Domain Adaptation**: Fine-tune for specialized fields like medical 🩺, legal 📜, or financial 💸 NER.
- **Multilingual Extensions**: Retrain for non-English languages.
- **Custom Entities**: Adapt for niche domains (e.g., product IDs, stock tickers).

### ❌ Limitations
- **English-Only**: Limited to English text out-of-the-box.
- **Domain Bias**: Trained on `boltuix/conll2025-ner`, which may favor news and formal text; accuracy may be lower on informal or social media content.
- **Generalization**: May struggle with rare or highly contextual entities not in the dataset.

---
![Banner](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxRTNdRYrYE60erg7MOPEcl9oU78UdHcW_NuEHX92KwKdaDHIz37pAzKWj1XzIO-ycuO3t5MKcd5kouku-lghXowVq2xFxZKsQRJTUzhyphenhyphennOgOPr_5MLMCbZpyixqQ_jc0Zrx_kc3C8K23-rJA_wwty5X-hPCJVjIfaFOov06xgWXatBAVdwS_10OHrTVA/s6250/BOLTUIX%20(1).jpg)

## 🛠️ Getting Started

### 🧪 Inference Code
Run NER with the following Python code:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")

# Input text
text = "Elon Musk launched Tesla in California on March 2025."
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)

# Map predicted class ids to label names
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
label_map = model.config.id2label
labels = [label_map[p.item()] for p in predictions[0]]

# Print one line per token, skipping special tokens such as [CLS] and [SEP]
for token, label in zip(tokens, labels):
    if token not in tokenizer.all_special_tokens:
        print(f"{token:15} → {label}")
```
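
The loop above prints one line per tokenizer token, so a word that the tokenizer splits into several pieces appears on several lines. A minimal sketch of merging those pieces back into words follows. It assumes WordPiece-style tokens where continuation pieces start with `##` (the convention used by BERT tokenizers); the helper name `merge_wordpieces` and the sample tokens and labels are hypothetical, not output from the model.

```python
from typing import List, Sequence, Tuple


def merge_wordpieces(
    tokens: Sequence[str],
    labels: Sequence[str],
    special_tokens: Sequence[str] = ("[CLS]", "[SEP]", "[PAD]"),
) -> List[Tuple[str, str]]:
    """Merge '##' continuation pieces into words, keeping the first piece's label."""
    words: List[Tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token in special_tokens:
            continue  # special tokens carry no entity information
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word, keep that word's label.
            previous_word, previous_label = words[-1]
            words[-1] = (previous_word + token[2:], previous_label)
        else:
            words.append((token, label))
    return words


# Hypothetical tokens and labels, for illustration only.
sample_tokens = ["[CLS]", "elon", "mu", "##sk", "launched", "tesla", "[SEP]"]
sample_labels = ["O", "B-PERSON", "I-PERSON", "I-PERSON", "O", "B-ORG", "O"]

merged = merge_wordpieces(sample_tokens, sample_labels)
for word, label in merged:
    print(f"{word:15} → {label}")
```

Keeping the label of the first piece is one common convention; taking a majority vote over the pieces is an alternative.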

### ✨ Example Output
```
Elon            → B-PERSON
Musk            → I-PERSON
launched        → O
Tesla           → B-ORG
in              → O
California      → B-GPE
on              → O
March           → B-DATE
2025            → I-DATE
.               → O
```

### 🛠️ Requirements
```bash
pip install transformers torch pandas pyarrow
```
- **Python**: 3.8+
- **Storage**: ~15 MB for model weights
- **Optional**: `seqeval` for evaluation; an NVIDIA GPU with CUDA for acceleration

---

## 🧠 Entity Labels
The model uses the tag set of the `boltuix/conll2025-ner` dataset: 36 entity tags (a B- and an I- tag for each of 18 categories) plus `O`, following the **BIO tagging scheme**:
- **B-**: Beginning of an entity
- **I-**: Inside of an entity
- **O**: Outside of any entity

| Tag Name         | Purpose                                                                  | Emoji  |
|------------------|--------------------------------------------------------------------------|--------|
| O                | Outside of any named entity (e.g., "the", "is")                          | 🚫     |
| B-CARDINAL       | Beginning of a cardinal number (e.g., "1000")                            | 🔢     |
| B-DATE           | Beginning of a date (e.g., "January")                                    | 🗓️     |
| B-EVENT          | Beginning of an event (e.g., "Olympics")                                 | 🎉     |
| B-FAC            | Beginning of a facility (e.g., "Eiffel Tower")                           | 🏛️     |
| B-GPE            | Beginning of a geopolitical entity (e.g., "Tokyo")                       | 🌍     |
| B-LANGUAGE       | Beginning of a language (e.g., "Spanish")                                | 🗣️     |
| B-LAW            | Beginning of a law or legal document (e.g., "Constitution")              | 📜     |
| B-LOC            | Beginning of a non-GPE location (e.g., "Pacific Ocean")                  | 🗺️     |
| B-MONEY          | Beginning of a monetary value (e.g., "$100")                             | 💸     |
| B-NORP           | Beginning of a nationality/religious/political group (e.g., "Democrat")  | 🏳️     |
| B-ORDINAL        | Beginning of an ordinal number (e.g., "first")                           | 🥇     |
| B-ORG            | Beginning of an organization (e.g., "Microsoft")                         | 🏢     |
| B-PERCENT        | Beginning of a percentage (e.g., "50%")                                  | 📊     |
| B-PERSON         | Beginning of a person's name (e.g., "Elon Musk")                         | 👤     |
| B-PRODUCT        | Beginning of a product (e.g., "iPhone")                                  | 📱     |
| B-QUANTITY       | Beginning of a quantity (e.g., "two liters")                             | ⚖️     |
| B-TIME           | Beginning of a time (e.g., "noon")                                       | ⏰     |
| B-WORK_OF_ART    | Beginning of a work of art (e.g., "Mona Lisa")                           | 🎨     |
| I-CARDINAL       | Inside of a cardinal number                                              | 🔢     |
| I-DATE           | Inside of a date (e.g., "2025" in "January 2025")                        | 🗓️     |
| I-EVENT          | Inside of an event name                                                  | 🎉     |
| I-FAC            | Inside of a facility name                                                | 🏛️     |
| I-GPE            | Inside of a geopolitical entity                                          | 🌍     |
| I-LANGUAGE       | Inside of a language name                                                | 🗣️     |
| I-LAW            | Inside of a legal document title                                         | 📜     |
| I-LOC            | Inside of a location                                                     | 🗺️     |
| I-MONEY          | Inside of a monetary value                                               | 💸     |
| I-NORP           | Inside of a NORP entity                                                  | 🏳️     |
| I-ORDINAL        | Inside of an ordinal number                                              | 🥇     |
| I-ORG            | Inside of an organization name                                           | 🏢     |
| I-PERCENT        | Inside of a percentage                                                   | 📊     |
| I-PERSON         | Inside of a person's name                                                | 👤     |
| I-PRODUCT        | Inside of a product name                                                 | 📱     |
| I-QUANTITY       | Inside of a quantity                                                     | ⚖️     |
| I-TIME           | Inside of a time phrase                                                  | ⏰     |
| I-WORK_OF_ART    | Inside of a work of art title                                            | 🎨     |
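
The rows of this table can be generated from the 18 category names, which also makes the tag count explicit. The sketch below is illustrative: it assumes ids are assigned by sorting the tag names alphabetically, as the training script in this commit does with `sorted(...)`; whether the published model's `config.json` uses exactly this mapping is not confirmed here.

```python
# The 18 entity categories listed in the table above.
CATEGORIES = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC", "MONEY",
    "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT", "QUANTITY", "TIME",
    "WORK_OF_ART",
]

# One B- and one I- tag per category, plus the single O tag.
tags = ["O"] + [f"{prefix}-{name}" for prefix in ("B", "I") for name in CATEGORIES]

# Assumption: ids follow alphabetical order of the tag strings.
tag2id = {tag: index for index, tag in enumerate(sorted(tags))}
id2tag = {index: tag for tag, index in tag2id.items()}

print("categories:", len(CATEGORIES))
print("entity tags (B-/I-):", len(tags) - 1)
print("all tags including O:", len(tags))
print("id of O under this assumption:", tag2id["O"])
```

Under this ordering all `B-` tags come first, then all `I-` tags, and `O` receives the last id.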

**Example**:
Text: `"Tesla opened in Shanghai on April 2025"`
Tags: `[B-ORG, O, O, B-GPE, O, B-DATE, I-DATE]`
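
To turn a BIO tag sequence like the one above into entity spans, consecutive `B-`/`I-` tags of the same type are grouped together. The following is a minimal sketch of that decoding step; the function name `bio_to_spans` is hypothetical, and an `I-` tag that does not continue the current entity is simply treated like `O`, which is one of several possible conventions.

```python
from typing import List, Optional, Sequence, Tuple


def bio_to_spans(tokens: Sequence[str], tags: Sequence[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Emit the entity collected so far, if any.
        if current_tokens and current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and tag[2:] == current_type:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not match the open entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


tokens = "Tesla opened in Shanghai on April 2025".split()
tags = ["B-ORG", "O", "O", "B-GPE", "O", "B-DATE", "I-DATE"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

For the example sentence this yields one ORG span, one GPE span, and a two-token DATE span.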

---

## 📈 Performance

Evaluated on the `boltuix/conll2025-ner` test split (~12,217 examples) using `seqeval`:

| Metric       | Score |
|--------------|-------|
| 🎯 Precision | 0.84  |
| 🕸️ Recall    | 0.86  |
| 🎶 F1 Score  | 0.85  |
| ✅ Accuracy  | 0.91  |

*Note*: Performance may vary on different domains or text types.
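
The F1 score in the table is the harmonic mean of precision and recall, so the three figures can be checked against each other. The short sketch below does that arithmetic with the rounded values from the table; the helper name `f1_from` is illustrative. As far as `seqeval` is concerned, precision, recall, and F1 are computed over whole entities, while accuracy is computed per token, which is why accuracy can be noticeably higher.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values from the performance table.
precision, recall = 0.84, 0.86
f1 = f1_from(precision, recall)
print(f"F1 = {f1:.4f} (rounds to {f1:.2f})")
```

Because the inputs are already rounded to two decimals, the recomputed value only needs to agree with the table after rounding.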

---

## ⚙️ Training Setup

- **Hardware**: NVIDIA GPU
- **Training Time**: ~1.5 hours
- **Parameters**: ~4.4M
- **Optimizer**: Adam

Files changed (1)
  1. README.md +620 -23
@@ -1,30 +1,627 @@
  ---
- tags:
- - llama
- - text-generation
- - conversational
- - chatbot
- license: mit
- language:
- - zh
- - en
  datasets:
- - custom
  metrics:
- - name: C‑Eval EM
-   value: 68.3
- - name: GPT4Bot‑Bench F1
-   value: 72.1
- - name: SelfChat Similarity
-   value: 0.87
- pipeline_tag: text-generation
- model-index:
- - name: MAIRK/abab
-   results: []
  ---

- # My LLaMA 2 Chatbot (7B)

- This repository contains a **fine‑tuned bilingual conversational model** based on **LLaMA 2 7B**, built for Chinese and English dialogue tasks.

- ...
+ - **Optimizer**: AdamW
+ - **Precision**: FP32
+ - **Batch Size**: 16
+ - **Learning Rate**: 2e-5
+
+ ---
+
+ ## 🧠 Training the Model
+
+ Fine-tune `boltuix/bert-mini` on the `boltuix/conll2025-ner` dataset to replicate or extend `EntityBERT`. Below is a simplified training script:
+
+ ```python
+ # 🛠️ Step 1: Install required libraries quietly (notebook cell syntax)
+ !pip install evaluate transformers datasets tokenizers seqeval pandas pyarrow -q
+
+ # 🚫 Step 2: Disable Weights & Biases (WandB)
+ import os
+ os.environ["WANDB_MODE"] = "disabled"
+
+ # 📚 Step 2b: Import necessary libraries
+ import pandas as pd
+ import datasets
+ import numpy as np
+ from transformers import BertTokenizerFast
+ from transformers import DataCollatorForTokenClassification
+ from transformers import AutoModelForTokenClassification
+ from transformers import TrainingArguments, Trainer
+ import evaluate
+ from transformers import pipeline
+ from collections import defaultdict
+ import json
+
+ # 📥 Step 3: Load the CoNLL-2025 NER dataset from Parquet
+ # Download: https://huggingface.co/datasets/boltuix/conll2025-ner/blob/main/conll2025_ner.parquet
+ parquet_file = "conll2025_ner.parquet"
+ df = pd.read_parquet(parquet_file)
+
+ # 🔁 Step 4: Convert pandas DataFrame to Hugging Face Dataset
+ conll2025 = datasets.Dataset.from_pandas(df)
+
+ # 🔎 Step 5: Inspect the dataset structure
+ print("Dataset structure:", conll2025)
+ print("Dataset features:", conll2025.features)
+ print("First example:", conll2025[0])
+
+ # 🏷️ Step 6: Extract unique tags and create mappings
+ # Since ner_tags are strings, collect all unique tags
+ all_tags = set()
+ for example in conll2025:
+     all_tags.update(example["ner_tags"])
+ unique_tags = sorted(list(all_tags))  # Sort for consistency
+ num_tags = len(unique_tags)
+ tag2id = {tag: i for i, tag in enumerate(unique_tags)}
+ id2tag = {i: tag for i, tag in enumerate(unique_tags)}
+ print("Number of unique tags:", num_tags)
+ print("Unique tags:", unique_tags)
+
+ # 🔧 Step 7: Convert string ner_tags to indices
+ def convert_tags_to_ids(example):
+     example["ner_tags"] = [tag2id[tag] for tag in example["ner_tags"]]
+     return example
+
+ conll2025 = conll2025.map(convert_tags_to_ids)
+
+ # 📊 Step 8: Split dataset based on 'split' column
+ dataset_dict = {
+     "train": conll2025.filter(lambda x: x["split"] == "train"),
+     "validation": conll2025.filter(lambda x: x["split"] == "validation"),
+     "test": conll2025.filter(lambda x: x["split"] == "test")
+ }
+ conll2025 = datasets.DatasetDict(dataset_dict)
+ print("Split dataset structure:", conll2025)
+
+ # 🪙 Step 9: Initialize the tokenizer
+ tokenizer = BertTokenizerFast.from_pretrained("boltuix/bert-mini")
+
+ # 📝 Step 10: Tokenize an example text and inspect
+ example_text = conll2025["train"][0]
+ tokenized_input = tokenizer(example_text["tokens"], is_split_into_words=True)
+ tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
+ word_ids = tokenized_input.word_ids()
+ print("Word IDs:", word_ids)
+ print("Tokenized input:", tokenized_input)
+ print("Length of ner_tags vs input IDs:", len(example_text["ner_tags"]), len(tokenized_input["input_ids"]))
+
+ # 🔄 Step 11: Define function to tokenize and align labels
+ def tokenize_and_align_labels(examples, label_all_tokens=True):
+     """
+     Tokenize inputs and align labels for NER tasks.
+
+     Args:
+         examples (dict): Dictionary with tokens and ner_tags.
+         label_all_tokens (bool): Whether to label all subword tokens.
+
+     Returns:
+         dict: Tokenized inputs with aligned labels.
+     """
+     tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
+     labels = []
+     for i, label in enumerate(examples["ner_tags"]):
+         word_ids = tokenized_inputs.word_ids(batch_index=i)
+         previous_word_idx = None
+         label_ids = []
+         for word_idx in word_ids:
+             if word_idx is None:
+                 label_ids.append(-100)  # Special tokens get -100
+             elif word_idx != previous_word_idx:
+                 label_ids.append(label[word_idx])  # First token of word gets label
+             else:
+                 label_ids.append(label[word_idx] if label_all_tokens else -100)  # Subwords get label or -100
+             previous_word_idx = word_idx
+         labels.append(label_ids)
+     tokenized_inputs["labels"] = labels
+     return tokenized_inputs
+
+ # 🧪 Step 12: Test the tokenization and label alignment
+ q = tokenize_and_align_labels(conll2025["train"][0:1])
+ print("Tokenized and aligned example:", q)
+
+ # 📋 Step 13: Print tokens and their corresponding labels
+ for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]), q["labels"][0]):
+     print(f"{token:_<40} {label}")
+
+ # 🔧 Step 14: Apply tokenization to the entire dataset
+ tokenized_datasets = conll2025.map(tokenize_and_align_labels, batched=True)
+
+ # 🤖 Step 15: Initialize the model with the correct number of labels
+ model = AutoModelForTokenClassification.from_pretrained("boltuix/bert-mini", num_labels=num_tags)
+
+ # ⚙️ Step 16: Set up training arguments
+ args = TrainingArguments(
+     "boltuix/bert-ner",
+     eval_strategy="epoch",  # Named evaluation_strategy in older transformers releases
+     learning_rate=2e-5,
+     per_device_train_batch_size=16,
+     per_device_eval_batch_size=16,
+     num_train_epochs=1,
+     weight_decay=0.01,
+     report_to="none"
+ )
+
+ # 📊 Step 17: Initialize data collator for dynamic padding
+ data_collator = DataCollatorForTokenClassification(tokenizer)
+
+ # 📈 Step 18: Load evaluation metric
+ metric = evaluate.load("seqeval")
+
+ # 🏷️ Step 19: Set label list and test metric computation
+ label_list = unique_tags
+ print("Label list:", label_list)
+ example = conll2025["train"][0]
+ labels = [label_list[i] for i in example["ner_tags"]]
+ print("Metric test:", metric.compute(predictions=[labels], references=[labels]))
+
+ # 📉 Step 20: Define function to compute evaluation metrics
+ def compute_metrics(eval_preds):
+     """
+     Compute precision, recall, F1, and accuracy for NER.
+
+     Args:
+         eval_preds (tuple): Predicted logits and true labels.
+
+     Returns:
+         dict: Evaluation metrics.
+     """
+     pred_logits, labels = eval_preds
+     pred_logits = np.argmax(pred_logits, axis=2)
+     # Drop positions labelled -100 (special tokens and ignored subwords)
+     predictions = [
+         [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
+         for prediction, label in zip(pred_logits, labels)
+     ]
+     true_labels = [
+         [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
+         for prediction, label in zip(pred_logits, labels)
+     ]
+     results = metric.compute(predictions=predictions, references=true_labels)
+     return {
+         "precision": results["overall_precision"],
+         "recall": results["overall_recall"],
+         "f1": results["overall_f1"],
+         "accuracy": results["overall_accuracy"],
+     }
+
+ # 🚀 Step 21: Initialize and train the trainer
+ trainer = Trainer(
+     model,
+     args,
+     train_dataset=tokenized_datasets["train"],
+     eval_dataset=tokenized_datasets["validation"],
+     data_collator=data_collator,
+     tokenizer=tokenizer,
+     compute_metrics=compute_metrics
+ )
+ trainer.train()
+
+ # 💾 Step 22: Save the fine-tuned model
+ model.save_pretrained("boltuix/bert-ner")
+ tokenizer.save_pretrained("tokenizer")
+
+ # 🔗 Step 23: Update model configuration with label mappings
+ id2label = {str(i): label for i, label in enumerate(label_list)}
+ label2id = {label: i for i, label in enumerate(label_list)}  # Keep ids as integers
+ with open("boltuix/bert-ner/config.json") as f:
+     config = json.load(f)
+ config["id2label"] = id2label
+ config["label2id"] = label2id
+ with open("boltuix/bert-ner/config.json", "w") as f:
+     json.dump(config, f)
+
+ # 🔄 Step 24: Load the fine-tuned model
+ model_fine_tuned = AutoModelForTokenClassification.from_pretrained("boltuix/bert-ner")
+
+ # 🛠️ Step 25: Create a pipeline for NER inference
+ nlp = pipeline("token-classification", model=model_fine_tuned, tokenizer=tokenizer)
+
+ # 📝 Step 26: Perform NER on an example sentence
+ example = "On July 4th, 2023, President Joe Biden visited the United Nations headquarters in New York to deliver a speech about international law and donated $5 million to relief efforts."
+ ner_results = nlp(example)
+ print("NER results for first example:", ner_results)
+
+ # 📝 Step 27: Perform NER on a property address and format output
+ example = "This page contains information about the property located at 1275 Kinnear Rd, Columbus, OH, 43212."
+ ner_results = nlp(example)
+
+ # 🧹 Step 28: Process NER results into structured entities
+ entities = defaultdict(list)
+ current_entity = ""
+ current_type = ""
+
+ for item in ner_results:
+     entity = item["entity"]
+     word = item["word"]
+     if word.startswith("##"):
+         current_entity += word[2:]  # Handle subword tokens
+     elif entity.startswith("B-"):
+         if current_entity and current_type:
+             entities[current_type].append(current_entity.strip())
+         current_type = entity[2:].lower()
+         current_entity = word
+     elif entity.startswith("I-") and entity[2:].lower() == current_type:
+         current_entity += " " + word  # Continue same entity
+     else:
+         if current_entity and current_type:
+             entities[current_type].append(current_entity.strip())
+         current_entity = ""
+         current_type = ""
+
+ # Append final entity if exists
+ if current_entity and current_type:
+     entities[current_type].append(current_entity.strip())
+
+ # 📤 Step 29: Output the final JSON
+ final_json = dict(entities)
+ print("Structured NER output:")
+ print(json.dumps(final_json, indent=2))
+ ```
+
+ ### 🛠️ Tips
+ - **Hyperparameters**: Experiment with `learning_rate` (1e-5 to 5e-5) or `num_train_epochs` (2-5).
+ - **GPU**: On a CUDA GPU, set `fp16=True` in `TrainingArguments` for faster mixed-precision training.
+ - **Custom Data**: Modify the script for custom NER datasets.
+
+ ### ⏱️ Expected Training Time
+ - ~1.5 hours on an NVIDIA GPU (e.g., T4) for ~115,812 examples, 3 epochs, batch size 16.
+
+ ### 🌍 Carbon Impact
+ - Emissions: ~40g CO₂eq (estimated via ML Impact tool for 1.5 hours on GPU).
+
+ ---
+
+ ## 🛠️ Installation
+
+ ```bash
+ pip install transformers torch pandas pyarrow seqeval
+ ```
+ - **Python**: 3.8+
+ - **Storage**: ~15 MB for model, ~6.38 MB for dataset
+ - **Optional**: NVIDIA CUDA for GPU acceleration
+
+ ### Download Instructions 📥
+ - **Model**: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder, update with correct URL).
+ - **Dataset**: [boltuix/conll2025-ner](https://huggingface.co/datasets/boltuix/conll2025-ner) (placeholder, update with correct URL).
+
+ ---
+
+ ## 🧪 Evaluation Code
+ Evaluate on custom data:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+ from seqeval.metrics import classification_report
+ import torch
+
+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
+ model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")
+
+ # Test data: one gold label per word (punctuation counts as its own word)
+ texts = ["Elon Musk launched Tesla in California on March 2025."]
+ true_labels = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "B-GPE", "O", "B-DATE", "I-DATE", "O"]]
+
+ pred_labels = []
+ for text in texts:
+     inputs = tokenizer(text, return_tensors="pt")
+     with torch.no_grad():
+         outputs = model(**inputs)
+     predictions = outputs.logits.argmax(dim=-1)[0].cpu().numpy()
+     tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+     word_ids = inputs.word_ids(batch_index=0)
+     word_preds = []
+     previous_word_idx = None
+     for idx, word_idx in enumerate(word_ids):
+         # Skip special tokens and keep only the first subword of each word
+         if word_idx is None or word_idx == previous_word_idx:
+             continue
+         label = model.config.id2label[int(predictions[idx])]
+         word_preds.append(label)
+         previous_word_idx = word_idx
+     pred_labels.append(word_preds)
+
+ # Evaluate
+ print("Predicted:", pred_labels)
+ print("True     :", true_labels)
+ print("\n📊 Evaluation Report:\n")
+ print(classification_report(true_labels, pred_labels))
+ ```
+
+ ---
+
+ ## 🌱 Dataset Details
+ - **Entries**: 143,709
+ - **Size**: 6.38 MB (Parquet)
+ - **Columns**: `split`, `tokens`, `ner_tags`
+ - **Splits**: Train (~115,812), Validation (~15,680), Test (~12,217)
+ - **NER Tags**: 36 (18 entity types with B-/I-) plus O
+ - **Source**: News, user-generated content, research corpora
+
+ ---
+
+ ## 📊 Visualizing NER Tags
+ Compute tag distribution with:
+
+ ```python
+ import pandas as pd
+ from collections import Counter
+ import matplotlib.pyplot as plt
+
+ # Load dataset
+ df = pd.read_parquet("conll2025_ner.parquet")
+ all_tags = [tag for tags in df["ner_tags"] for tag in tags]
+ tag_counts = Counter(all_tags)
+
+ # Plot
+ plt.figure(figsize=(12, 7))
+ plt.bar(tag_counts.keys(), tag_counts.values(), color="#36A2EB")
+ plt.title("CoNLL 2025 NER: Tag Distribution", fontsize=16)
+ plt.xlabel("NER Tag", fontsize=12)
+ plt.ylabel("Count", fontsize=12)
+ plt.xticks(rotation=45, ha="right", fontsize=10)
+ plt.grid(axis="y", linestyle="--", alpha=0.7)
+ plt.tight_layout()
+ plt.savefig("ner_tag_distribution.png")
+ plt.show()
+ ```
+
+ ---
+
+ ## ⚖️ Comparison to Other Models
+ | Model                | Dataset            | Parameters | F1 Score | Size    |
+ |----------------------|--------------------|------------|----------|---------|
+ | **EntityBERT**       | conll2025-ner      | ~4.4M      | 0.85     | ~15 MB  |
+ | NeuroBERT-NER        | conll2025-ner      | ~11M       | 0.86     | ~50 MB  |
+ | BERT-base-NER        | CoNLL-2003         | ~110M      | ~0.89    | ~400 MB |
+ | DistilBERT-NER       | CoNLL-2003         | ~66M       | ~0.85    | ~200 MB |
+
+ **Advantages**:
+ - Ultra-lightweight (~4.4M parameters, ~15 MB)
+ - Competitive F1 score (0.85)
+ - Ideal for resource-constrained environments
+
+ ---
+
+ ## 🌐 Community and Support
+ - 📍 Model page: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder)
+ - 🛠️ Issues/Contributions: Model repository (URL TBD)
+ - 💬 Hugging Face forums: [https://huggingface.co/discussions](https://huggingface.co/discussions)
+ - 📚 Docs: [Hugging Face Transformers](https://huggingface.co/docs/transformers)
+ - 📧 Contact: [boltuix@gmail.com](mailto:boltuix@gmail.com)
+
+ ---
+
+ ## ✍️ Contact
+ - **Author**: Boltuix
+ - **Email**: [boltuix@gmail.com](mailto:boltuix@gmail.com)
+ - **Hugging Face**: [boltuix](https://huggingface.co/boltuix)
+
+ ---

+ ## 📅 Last Updated
+ **June 11, 2025**: Released v1.0 with fine-tuning on `boltuix/conll2025-ner`.

+ **[Get Started Now](#getting-started)** 🚀