Upload README.md

---
license: apache-2.0
datasets:
- boltuix/conll2025-ner
language:
- zh
- en
metrics:
- precision
- recall
- f1
- accuracy
pipeline_tag: token-classification
library_name: transformers
new_version: v1.1
tags:
- token-classification
- ner
- named-entity-recognition
- text-classification
- sequence-labeling
- transformer
- bert
- nlp
- pretrained-model
- dataset-finetuning
- deep-learning
- huggingface
- conll2025
- real-time-inference
- efficient-nlp
- high-accuracy
- gpu-optimized
- chatbot
- information-extraction
- search-enhancement
- knowledge-graph
- legal-nlp
- medical-nlp
- financial-nlp
base_model:
- boltuix/bert-mini
---

![Banner](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirI_izRmtBN9DOIqHFRBdXqh8eUBf10yVEfKIjVglp1AKmvtoJ65ZkPeG9Xm6eqs-RcqR3HMmTizOb0eT80PV_E8qsk2XQqMqqPsfSvPmUtCFmJ6S4KTIx5hGy1m_vZRQskO3s8bNYKMPpAwHBU4zSpIjKIha-GrhBFRFdGS0bJ6ybztOFZJDgsQGMk7Q/s6250/BOLTUIX%20(2).jpg)

# 🌟 EntityBERT Model 🌟

## 🚀 Model Details

### 🌈 Description
The `boltuix/EntityBERT` model is a lightweight, fine-tuned transformer for **Named Entity Recognition (NER)**, built on the `boltuix/bert-mini` base model. Optimized for efficiency, it identifies 18 entity types (e.g., people, organizations, locations, dates) in English text, which makes it well suited to applications like information extraction, chatbots, and search enhancement.

- **Dataset**: [boltuix/conll2025-ner](https://huggingface.co/datasets/boltuix/conll2025-ner) (143,709 entries, 6.38 MB)
- **Entity Types**: 18 entity categories, each with B- and I- tags (36 tags), plus `O`
- **Training Examples**: ~115,812 | **Validation**: ~15,680 | **Test**: ~12,217
- **Domains**: News, user-generated content, research corpora
- **Tasks**: Sentence-level and document-level NER
- **Version**: v1.0
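
As a quick sanity check, the split sizes quoted above add up to the stated total of 143,709 entries. The sketch below uses only the approximate figures from this list; the variable names are illustrative.

```python
# Split sizes as quoted in the list above (approximate figures from the card).
splits = {"train": 115_812, "validation": 15_680, "test": 12_217}

total = sum(splits.values())
print(f"total entries: {total:,}")

# Share of each split, as a percentage of the total.
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}
for name, pct in shares.items():
    print(f"{name:<10} {pct:>5}%")
```

This works out to roughly an 81/11/8 train/validation/test split.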

### 🔧 Info
- **Developer**: Boltuix
- **License**: Apache-2.0
- **Language**: English
- **Type**: Transformer-based Token Classification
- **Trained**: Before June 11, 2025
- **Base Model**: `boltuix/bert-mini`
- **Parameters**: ~4.4M
- **Size**: ~15 MB

### 🔗 Links
- **Model Repository**: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder, update with correct URL)
- **Dataset**: [boltuix/conll2025-ner](#download-instructions) (placeholder, update with correct URL)
- **Hugging Face Docs**: [Transformers](https://huggingface.co/docs/transformers)
- **Demo**: Coming Soon

---

## 🎯 Use Cases for NER

### 🌟 Direct Applications
- **Information Extraction**: Identify names (👤 PERSON), locations (🌍 GPE), and dates (🗓️ DATE) from articles, blogs, or reports.
- **Chatbots & Virtual Assistants**: Improve user query understanding by recognizing entities.
- **Search Enhancement**: Enable entity-based semantic search (e.g., “news about Paris in 2025”).
- **Knowledge Graphs**: Construct structured graphs connecting entities like 🏢 ORG and 👤 PERSON.
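
To illustrate the knowledge-graph use case, here is a minimal sketch that links entities appearing in the same sentence. The input is hand-written `(text, type)` pairs standing in for model output, and `cooccurrence_edges` is a hypothetical helper, not part of the model or of `transformers`.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentences):
    """Count how often two entities appear in the same sentence."""
    edges = Counter()
    for entities in sentences:
        # Deduplicate and sort so each pair is counted once per sentence,
        # regardless of the order in which the entities were found.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges

# Hand-written entities per sentence (illustrative, not real model output).
sentences = [
    [("Elon Musk", "PERSON"), ("Tesla", "ORG"), ("California", "GPE")],
    [("Tesla", "ORG"), ("Shanghai", "GPE")],
    [("Elon Musk", "PERSON"), ("Tesla", "ORG")],
]

edges = cooccurrence_edges(sentences)
for (a, b), count in edges.most_common():
    print(f"{a[0]} ({a[1]}) -- {b[0]} ({b[1]}): {count}")
```

Co-occurrence is only a rough proxy for a real relation; a production graph would likely add a relation-extraction step on top.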

### 🌱 Downstream Tasks
- **Domain Adaptation**: Fine-tune for specialized fields like medical 🩺, legal 📜, or financial 💸 NER.
- **Multilingual Extensions**: Retrain for non-English languages.
- **Custom Entities**: Adapt for niche domains (e.g., product IDs, stock tickers).

### ❌ Limitations
- **English-Only**: Limited to English text out-of-the-box.
- **Domain Bias**: Trained on `boltuix/conll2025-ner`, which may favor news and formal text, so the model may be weaker on informal or social media content.
- **Generalization**: May struggle with rare or highly contextual entities not in the dataset.

---
![Banner](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxRTNdRYrYE60erg7MOPEcl9oU78UdHcW_NuEHX92KwKdaDHIz37pAzKWj1XzIO-ycuO3t5MKcd5kouku-lghXowVq2xFxZKsQRJTUzhyphenhyphennOgOPr_5MLMCbZpyixqQ_jc0Zrx_kc3C8K23-rJA_wwty5X-hPCJVjIfaFOov06xgWXatBAVdwS_10OHrTVA/s6250/BOLTUIX%20(1).jpg)

## 🛠️ Getting Started

### 🧪 Inference Code
Run NER with the following Python code:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")

# Input text
text = "Elon Musk launched Tesla in California on March 2025."
inputs = tokenizer(text, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)

# Map predictions to labels
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
label_map = model.config.id2label
labels = [label_map[p.item()] for p in predictions[0]]

# Print results
for token, label in zip(tokens, labels):
    if token not in tokenizer.all_special_tokens:
        print(f"{token:15} → {label}")
```

### ✨ Example Output
```
Elon            → B-PERSON
Musk            → I-PERSON
launched        → O
Tesla           → B-ORG
in              → O
California      → B-GPE
on              → O
March           → B-DATE
2025            → I-DATE
.               → O
```
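
The inference loop prints one line per WordPiece token, so a word that the tokenizer splits (continuation pieces carry a `##` prefix) would appear on several lines. Below is a minimal sketch of merging sub-tokens back into words while keeping the label of the first piece. The token/label pairs are hand-written for illustration, and `merge_wordpieces` is a hypothetical helper.

```python
def merge_wordpieces(tokens, labels):
    """Join '##' continuation pieces onto the previous word; keep the first piece's label."""
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            # Continuation piece: glue it to the previous word, drop its label.
            words[-1] += token[2:]
        else:
            words.append(token)
            word_labels.append(label)
    return words, word_labels

# Hypothetical tokenizer output, for illustration only.
tokens = ["tesla", "opened", "in", "shang", "##hai"]
labels = ["B-ORG", "O", "O", "B-GPE", "I-GPE"]

words, word_labels = merge_wordpieces(tokens, labels)
print(list(zip(words, word_labels)))
```

Taking the first piece's label is one common convention; majority voting over the pieces is another option.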

### 🛠️ Requirements
```bash
pip install transformers torch pandas pyarrow
```
- **Python**: 3.8+
- **Storage**: ~15 MB for model weights
- **Optional**: `seqeval` for evaluation; a CUDA-capable GPU for acceleration

---

## 🧠 Entity Labels
The model supports 36 entity tags plus `O` from the `boltuix/conll2025-ner` dataset, using the **BIO tagging scheme**:
- **B-**: Beginning of an entity
- **I-**: Inside of an entity
- **O**: Outside of any entity

| Tag Name | Purpose | Emoji |
|------------------|--------------------------------------------------------------------------|--------|
| O | Outside of any named entity (e.g., "the", "is") | 🚫 |
| B-CARDINAL | Beginning of a cardinal number (e.g., "1000") | 🔢 |
| B-DATE | Beginning of a date (e.g., "January") | 🗓️ |
| B-EVENT | Beginning of an event (e.g., "Olympics") | 🎉 |
| B-FAC | Beginning of a facility (e.g., "Eiffel Tower") | 🏛️ |
| B-GPE | Beginning of a geopolitical entity (e.g., "Tokyo") | 🌍 |
| B-LANGUAGE | Beginning of a language (e.g., "Spanish") | 🗣️ |
| B-LAW | Beginning of a law or legal document (e.g., "Constitution") | 📜 |
| B-LOC | Beginning of a non-GPE location (e.g., "Pacific Ocean") | 🗺️ |
| B-MONEY | Beginning of a monetary value (e.g., "$100") | 💸 |
| B-NORP | Beginning of a nationality/religious/political group (e.g., "Democrat") | 🏳️ |
| B-ORDINAL | Beginning of an ordinal number (e.g., "first") | 🥇 |
| B-ORG | Beginning of an organization (e.g., "Microsoft") | 🏢 |
| B-PERCENT | Beginning of a percentage (e.g., "50%") | 📊 |
| B-PERSON | Beginning of a person’s name (e.g., "Elon Musk") | 👤 |
| B-PRODUCT | Beginning of a product (e.g., "iPhone") | 📱 |
| B-QUANTITY | Beginning of a quantity (e.g., "two liters") | ⚖️ |
| B-TIME | Beginning of a time (e.g., "noon") | ⏰ |
| B-WORK_OF_ART | Beginning of a work of art (e.g., "Mona Lisa") | 🎨 |
| I-CARDINAL | Inside of a cardinal number | 🔢 |
| I-DATE | Inside of a date (e.g., "2025" in "January 2025") | 🗓️ |
| I-EVENT | Inside of an event name | 🎉 |
| I-FAC | Inside of a facility name | 🏛️ |
| I-GPE | Inside of a geopolitical entity | 🌍 |
| I-LANGUAGE | Inside of a language name | 🗣️ |
| I-LAW | Inside of a legal document title | 📜 |
| I-LOC | Inside of a location | 🗺️ |
| I-MONEY | Inside of a monetary value | 💸 |
| I-NORP | Inside of a NORP entity | 🏳️ |
| I-ORDINAL | Inside of an ordinal number | 🥇 |
| I-ORG | Inside of an organization name | 🏢 |
| I-PERCENT | Inside of a percentage | 📊 |
| I-PERSON | Inside of a person’s name | 👤 |
| I-PRODUCT | Inside of a product name | 📱 |
| I-QUANTITY | Inside of a quantity | ⚖️ |
| I-TIME | Inside of a time phrase | ⏰ |
| I-WORK_OF_ART | Inside of a work of art title | 🎨 |
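
The label set in the table can be rebuilt programmatically from the 18 category names. The sketch below assumes ids are assigned in alphabetical order of the tag strings; the authoritative mapping is whatever `model.config.id2label` contains.

```python
# The 18 entity categories listed in the table above.
CATEGORIES = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC", "MONEY",
    "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT", "QUANTITY", "TIME",
    "WORK_OF_ART",
]

# One B- and one I- tag per category, plus the single O tag.
labels = sorted([f"{prefix}-{cat}" for prefix in ("B", "I") for cat in CATEGORIES] + ["O"])

# Assumption: ids follow alphabetical order; check model.config.id2label for the real mapping.
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}

print(len(CATEGORIES), "categories ->", len(labels), "labels including O")
print(id2label[0], "...", id2label[len(labels) - 1])
```

With 18 categories this yields 36 B-/I- tags and one `O` tag, matching the rows of the table.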

**Example**:
Text: `"Tesla opened in Shanghai on April 2025"`
Tags: `[B-ORG, O, O, B-GPE, O, B-DATE, I-DATE]`
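
To turn a tag sequence like this into entities, consecutive `B-`/`I-` tags of the same type are grouped into one span. The following is a minimal sketch of that decoding step; `bio_to_spans` is a hypothetical helper, and a stray `I-` tag with no matching `B-` is treated as the start of a new entity.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # Start a new entity; a stray or mismatched I- tag is treated like B-.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" closes any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = "Tesla opened in Shanghai on April 2025".split()
tags = ["B-ORG", "O", "O", "B-GPE", "O", "B-DATE", "I-DATE"]
print(bio_to_spans(tokens, tags))
# [('ORG', 'Tesla'), ('GPE', 'Shanghai'), ('DATE', 'April 2025')]
```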

---

## 📈 Performance

Evaluated on the `boltuix/conll2025-ner` test split (~12,217 examples) using `seqeval`:

| Metric | Score |
|------------|-------|
| 🎯 Precision | 0.84 |
| 🕸️ Recall | 0.86 |
| 🎶 F1 Score | 0.85 |
| ✅ Accuracy | 0.91 |

*Note*: Performance may vary on different domains or text types.
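
`seqeval` scores precision, recall, and F1 at the entity level (a prediction counts only if the whole span and its type match), while accuracy is computed per token. The sketch below is a simplified illustration of that idea on hand-written spans, not `seqeval`'s implementation; `prf1` is a hypothetical helper. It also checks that the reported F1 is the harmonic mean of the reported precision and recall.

```python
def prf1(true_spans, pred_spans):
    """Entity-level precision, recall and F1 from sets of (start, end, type) spans."""
    true_spans, pred_spans = set(true_spans), set(pred_spans)
    correct = len(true_spans & pred_spans)  # exact span-and-type matches
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 3 gold entities, 3 predictions, 2 exact matches.
gold = {(0, 2, "PERSON"), (3, 4, "ORG"), (5, 6, "GPE")}
pred = {(0, 2, "PERSON"), (3, 4, "ORG"), (7, 9, "DATE")}
precision, recall, f1 = prf1(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Harmonic mean of the precision and recall reported in the table.
p, r = 0.84, 0.86
reported_f1 = round(2 * p * r / (p + r), 2)
print(reported_f1)
```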

---

## ⚙️ Training Setup

- **Hardware**: NVIDIA GPU
- **Training Time**: ~1.5 hours
- **Parameters**: ~4.4M
- **Optimizer**: AdamW

README.md +620 -23

@@ -1,30 +1,627 @@
 ---
-tags:
-  - llama
-  - text-generation
-  - conversational
-  - chatbot
-license: mit
-language:
-  - zh
-  - en
 datasets:
-  - custom
 metrics:
-  - name: C‑Eval EM
-    value: 68.3
-  - name: GPT4Bot‑Bench F1
-    value: 72.1
-  - name: SelfChat Similarity
-    value: 0.87
-pipeline_tag: text-generation
-model-index:
-  - name: MAIRK/abab
-    results: []
 ---
-# My LLaMA 2 Chatbot (7B)
-This repository contains a **fine‑tuned bilingual conversational model** based on **LLaMA 2 7B**, built for Chinese and English dialogue tasks.
-...

 ---
+license: apache-2.0
 datasets:
+- boltuix/conll2025-ner
+language:
+- zh
+- en
 metrics:
+- precision
+- recall
+- f1
+- accuracy
+pipeline_tag: token-classification
+library_name: transformers
+new_version: v1.1
+tags:
+- token-classification
+- ner
+- named-entity-recognition
+- text-classification
+- sequence-labeling
+- transformer
+- bert
+- nlp
+- pretrained-model
+- dataset-finetuning
+- deep-learning
+- huggingface
+- conll2025
+- real-time-inference
+- efficient-nlp
+- high-accuracy
+- gpu-optimized
+- chatbot
+- information-extraction
+- search-enhancement
+- knowledge-graph
+- legal-nlp
+- medical-nlp
+- financial-nlp
+base_model:
+- boltuix/bert-mini
+---
+![Banner](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirI_izRmtBN9DOIqHFRBdXqh8eUBf10yVEfKIjVglp1AKmvtoJ65ZkPeG9Xm6eqs-RcqR3HMmTizOb0eT80PV_E8qsk2XQqMqqPsfSvPmUtCFmJ6S4KTIx5hGy1m_vZRQskO3s8bNYKMPpAwHBU4zSpIjKIha-GrhBFRFdGS0bJ6ybztOFZJDgsQGMk7Q/s6250/BOLTUIX%20(2).jpg)
+# 🌟 EntityBERT Model 🌟
+## 🚀 Model Details
+### 🌈 Description
+The `boltuix/EntityBERT` model is a lightweight, fine-tuned transformer for **Named Entity Recognition (NER)**, built on the `boltuix/bert-mini` base model. Optimized for efficiency, it identifies 18 entity types (e.g., people, organizations, locations, dates) in English text, which makes it well suited to applications like information extraction, chatbots, and search enhancement.
+- **Dataset**: [boltuix/conll2025-ner](https://huggingface.co/datasets/boltuix/conll2025-ner) (143,709 entries, 6.38 MB)
+- **Entity Types**: 18 entity categories, each with B- and I- tags (36 tags), plus `O`
+- **Training Examples**: ~115,812 | **Validation**: ~15,680 | **Test**: ~12,217
+- **Domains**: News, user-generated content, research corpora
+- **Tasks**: Sentence-level and document-level NER
+- **Version**: v1.0
+### 🔧 Info
+- **Developer**: Boltuix
+- **License**: Apache-2.0
+- **Language**: English
+- **Type**: Transformer-based Token Classification
+- **Trained**: Before June 11, 2025
+- **Base Model**: `boltuix/bert-mini`
+- **Parameters**: ~4.4M
+- **Size**: ~15 MB
+### 🔗 Links
+- **Model Repository**: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder, update with correct URL)
+- **Dataset**: [boltuix/conll2025-ner](#download-instructions) (placeholder, update with correct URL)
+- **Hugging Face Docs**: [Transformers](https://huggingface.co/docs/transformers)
+- **Demo**: Coming Soon
 ---
+## 🎯 Use Cases for NER
+### 🌟 Direct Applications
+- **Information Extraction**: Identify names (👤 PERSON), locations (🌍 GPE), and dates (🗓️ DATE) from articles, blogs, or reports.
+- **Chatbots & Virtual Assistants**: Improve user query understanding by recognizing entities.
+- **Search Enhancement**: Enable entity-based semantic search (e.g., “news about Paris in 2025”).
+- **Knowledge Graphs**: Construct structured graphs connecting entities like 🏢 ORG and 👤 PERSON.
+### 🌱 Downstream Tasks
+- **Domain Adaptation**: Fine-tune for specialized fields like medical 🩺, legal 📜, or financial 💸 NER.
+- **Multilingual Extensions**: Retrain for non-English languages.
+- **Custom Entities**: Adapt for niche domains (e.g., product IDs, stock tickers).
+### ❌ Limitations
+- **English-Only**: Limited to English text out-of-the-box.
+- **Domain Bias**: Trained on `boltuix/conll2025-ner`, which may favor news and formal text, so the model may be weaker on informal or social media content.
+- **Generalization**: May struggle with rare or highly contextual entities not in the dataset.
+---
+![Banner](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxRTNdRYrYE60erg7MOPEcl9oU78UdHcW_NuEHX92KwKdaDHIz37pAzKWj1XzIO-ycuO3t5MKcd5kouku-lghXowVq2xFxZKsQRJTUzhyphenhyphennOgOPr_5MLMCbZpyixqQ_jc0Zrx_kc3C8K23-rJA_wwty5X-hPCJVjIfaFOov06xgWXatBAVdwS_10OHrTVA/s6250/BOLTUIX%20(1).jpg)
+## 🛠️ Getting Started
+### 🧪 Inference Code
+Run NER with the following Python code:
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+import torch
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
+model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")
+# Input text
+text = "Elon Musk launched Tesla in California on March 2025."
+inputs = tokenizer(text, return_tensors="pt")
+# Run inference
+with torch.no_grad():
+    outputs = model(**inputs)
+predictions = outputs.logits.argmax(dim=-1)
+# Map predictions to labels
+tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+label_map = model.config.id2label
+labels = [label_map[p.item()] for p in predictions[0]]
+# Print results
+for token, label in zip(tokens, labels):
+    if token not in tokenizer.all_special_tokens:
+        print(f"{token:15} → {label}")
+```
+### ✨ Example Output
+```
+Elon            → B-PERSON
+Musk            → I-PERSON
+launched        → O
+Tesla           → B-ORG
+in              → O
+California      → B-GPE
+on              → O
+March           → B-DATE
+2025            → I-DATE
+.               → O
+```
+### 🛠️ Requirements
+```bash
+pip install transformers torch pandas pyarrow
+```
+- **Python**: 3.8+
+- **Storage**: ~15 MB for model weights
+- **Optional**: `seqeval` for evaluation; a CUDA-capable GPU for acceleration
+---
+## 🧠 Entity Labels
+The model supports 36 entity tags plus `O` from the `boltuix/conll2025-ner` dataset, using the **BIO tagging scheme**:
+- **B-**: Beginning of an entity
+- **I-**: Inside of an entity
+- **O**: Outside of any entity
+| Tag Name        | Purpose                                                                 | Emoji |
+|------------------|--------------------------------------------------------------------------|--------|
+| O                | Outside of any named entity (e.g., "the", "is")                         | 🚫     |
+| B-CARDINAL       | Beginning of a cardinal number (e.g., "1000")                           | 🔢     |
+| B-DATE           | Beginning of a date (e.g., "January")                                   | 🗓️     |
+| B-EVENT          | Beginning of an event (e.g., "Olympics")                                | 🎉     |
+| B-FAC            | Beginning of a facility (e.g., "Eiffel Tower")                          | 🏛️     |
+| B-GPE            | Beginning of a geopolitical entity (e.g., "Tokyo")                      | 🌍     |
+| B-LANGUAGE       | Beginning of a language (e.g., "Spanish")                               | 🗣️     |
+| B-LAW            | Beginning of a law or legal document (e.g., "Constitution")             | 📜     |
+| B-LOC            | Beginning of a non-GPE location (e.g., "Pacific Ocean")                 | 🗺️     |
+| B-MONEY          | Beginning of a monetary value (e.g., "$100")                            | 💸     |
+| B-NORP           | Beginning of a nationality/religious/political group (e.g., "Democrat") | 🏳️     |
+| B-ORDINAL        | Beginning of an ordinal number (e.g., "first")                          | 🥇     |
+| B-ORG            | Beginning of an organization (e.g., "Microsoft")                        | 🏢     |
+| B-PERCENT        | Beginning of a percentage (e.g., "50%")                                 | 📊     |
+| B-PERSON         | Beginning of a person’s name (e.g., "Elon Musk")                        | 👤     |
+| B-PRODUCT        | Beginning of a product (e.g., "iPhone")                                 | 📱     |
+| B-QUANTITY       | Beginning of a quantity (e.g., "two liters")                            | ⚖️     |
+| B-TIME           | Beginning of a time (e.g., "noon")                                      | ⏰     |
+| B-WORK_OF_ART    | Beginning of a work of art (e.g., "Mona Lisa")                          | 🎨     |
+| I-CARDINAL       | Inside of a cardinal number                                             | 🔢     |
+| I-DATE           | Inside of a date (e.g., "2025" in "January 2025")                       | 🗓️     |
+| I-EVENT          | Inside of an event name                                                 | 🎉     |
+| I-FAC            | Inside of a facility name                                               | 🏛️     |
+| I-GPE            | Inside of a geopolitical entity                                         | 🌍     |
+| I-LANGUAGE       | Inside of a language name                                               | 🗣️     |
+| I-LAW            | Inside of a legal document title                                        | 📜     |
+| I-LOC            | Inside of a location                                                    | 🗺️     |
+| I-MONEY          | Inside of a monetary value                                              | 💸     |
+| I-NORP           | Inside of a NORP entity                                                 | 🏳️     |
+| I-ORDINAL        | Inside of an ordinal number                                             | 🥇     |
+| I-ORG            | Inside of an organization name                                          | 🏢     |
+| I-PERCENT        | Inside of a percentage                                                  | 📊     |
+| I-PERSON         | Inside of a person’s name                                               | 👤     |
+| I-PRODUCT        | Inside of a product name                                                | 📱     |
+| I-QUANTITY       | Inside of a quantity                                                    | ⚖️     |
+| I-TIME           | Inside of a time phrase                                                 | ⏰     |
+| I-WORK_OF_ART    | Inside of a work of art title                                           | 🎨     |
+**Example**:
+Text: `"Tesla opened in Shanghai on April 2025"`
+Tags: `[B-ORG, O, O, B-GPE, O, B-DATE, I-DATE]`
+---
+## 📈 Performance
+Evaluated on the `boltuix/conll2025-ner` test split (~12,217 examples) using `seqeval`:
+| Metric     | Score |
+|------------|-------|
+| 🎯 Precision | 0.84  |
+| 🕸️ Recall    | 0.86  |
+| 🎶 F1 Score  | 0.85  |
+| ✅ Accuracy  | 0.91  |
+*Note*: Performance may vary on different domains or text types.
+---
+## ⚙️ Training Setup
+- **Hardware**: NVIDIA GPU
+- **Training Time**: ~1.5 hours
+- **Parameters**: ~4.4M
+- **Optimizer**: AdamW
+- **Precision**: FP32
+- **Batch Size**: 16
+- **Learning Rate**: 2e-5
+---
+## 🧠 Training the Model
+Fine-tune `boltuix/bert-mini` on the `boltuix/conll2025-ner` dataset to replicate or extend `EntityBERT`. Below is a simplified training script:
+```python
+# 🛠️ Step 1: Install required libraries quietly (the leading "!" is notebook syntax; run this cell in Jupyter or Colab)
+!pip install evaluate transformers datasets tokenizers seqeval pandas pyarrow -q
+# 🚫 Step 2a: Disable Weights & Biases (WandB)
+import os
+os.environ["WANDB_MODE"] = "disabled"
+# 📚 Step 2b: Import necessary libraries
+import pandas as pd
+import datasets
+import numpy as np
+from transformers import BertTokenizerFast
+from transformers import DataCollatorForTokenClassification
+from transformers import AutoModelForTokenClassification
+from transformers import TrainingArguments, Trainer
+import evaluate
+from transformers import pipeline
+from collections import defaultdict
+import json
+# 📥 Step 3: Load the CoNLL-2025 NER dataset from Parquet
+# Download : https://huggingface.co/datasets/boltuix/conll2025-ner/blob/main/conll2025_ner.parquet
+parquet_file = "conll2025_ner.parquet"
+df = pd.read_parquet(parquet_file)
+# 🔍 Step 4: Convert pandas DataFrame to Hugging Face Dataset
+conll2025 = datasets.Dataset.from_pandas(df)
+# 🔎 Step 5: Inspect the dataset structure
+print("Dataset structure:", conll2025)
+print("Dataset features:", conll2025.features)
+print("First example:", conll2025[0])
+# 🏷️ Step 6: Extract unique tags and create mappings
+# Since ner_tags are strings, collect all unique tags
+all_tags = set()
+for example in conll2025:
+    all_tags.update(example["ner_tags"])
+unique_tags = sorted(list(all_tags))  # Sort for consistency
+num_tags = len(unique_tags)
+tag2id = {tag: i for i, tag in enumerate(unique_tags)}
+id2tag = {i: tag for i, tag in enumerate(unique_tags)}
+print("Number of unique tags:", num_tags)
+print("Unique tags:", unique_tags)
+# 🔧 Step 7: Convert string ner_tags to indices
+def convert_tags_to_ids(example):
+    example["ner_tags"] = [tag2id[tag] for tag in example["ner_tags"]]
+    return example
+conll2025 = conll2025.map(convert_tags_to_ids)
+# 📊 Step 8: Split dataset based on 'split' column
+dataset_dict = {
+    "train": conll2025.filter(lambda x: x["split"] == "train"),
+    "validation": conll2025.filter(lambda x: x["split"] == "validation"),
+    "test": conll2025.filter(lambda x: x["split"] == "test")
+}
+conll2025 = datasets.DatasetDict(dataset_dict)
+print("Split dataset structure:", conll2025)
+# 🪙 Step 9: Initialize the tokenizer
+tokenizer = BertTokenizerFast.from_pretrained("boltuix/bert-mini")
+# 📝 Step 10: Tokenize an example text and inspect
+example_text = conll2025["train"][0]
+tokenized_input = tokenizer(example_text["tokens"], is_split_into_words=True)
+tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
+word_ids = tokenized_input.word_ids()
+print("Word IDs:", word_ids)
+print("Tokenized input:", tokenized_input)
+print("Length of ner_tags vs input IDs:", len(example_text["ner_tags"]), len(tokenized_input["input_ids"]))
+# 🔄 Step 11: Define function to tokenize and align labels
+def tokenize_and_align_labels(examples, label_all_tokens=True):
+    """
+    Tokenize inputs and align labels for NER tasks.
+    Args:
+        examples (dict): Dictionary with tokens and ner_tags.
+        label_all_tokens (bool): Whether to label all subword tokens.
+    Returns:
+        dict: Tokenized inputs with aligned labels.
+    """
+    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
+    labels = []
+    for i, label in enumerate(examples["ner_tags"]):
+        word_ids = tokenized_inputs.word_ids(batch_index=i)
+        previous_word_idx = None
+        label_ids = []
+        for word_idx in word_ids:
+            if word_idx is None:
+                label_ids.append(-100)  # Special tokens get -100
+            elif word_idx != previous_word_idx:
+                label_ids.append(label[word_idx])  # First token of word gets label
+            else:
+                label_ids.append(label[word_idx] if label_all_tokens else -100)  # Subwords get label or -100
+            previous_word_idx = word_idx
+        labels.append(label_ids)
+    tokenized_inputs["labels"] = labels
+    return tokenized_inputs
+# 🧪 Step 12: Test the tokenization and label alignment
+q = tokenize_and_align_labels(conll2025["train"][0:1])
+print("Tokenized and aligned example:", q)
+# 📋 Step 13: Print tokens and their corresponding labels
+for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]), q["labels"][0]):
+    print(f"{token:_<40} {label}")
+# 🔧 Step 14: Apply tokenization to the entire dataset
+tokenized_datasets = conll2025.map(tokenize_and_align_labels, batched=True)
+# 🤖 Step 15: Initialize the model with the correct number of labels
+model = AutoModelForTokenClassification.from_pretrained("boltuix/bert-mini", num_labels=num_tags)
+# ⚙️ Step 16: Set up training arguments
+args = TrainingArguments(
+    "boltuix/bert-ner",
+    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
+    learning_rate=2e-5,
+    per_device_train_batch_size=16,
+    per_device_eval_batch_size=16,
+    num_train_epochs=1,
+    weight_decay=0.01,
+    report_to="none"
+)
+# 📊 Step 17: Initialize data collator for dynamic padding
+data_collator = DataCollatorForTokenClassification(tokenizer)
+# 📈 Step 18: Load evaluation metric
+metric = evaluate.load("seqeval")
+# 🏷️ Step 19: Set label list and test metric computation
+label_list = unique_tags
+print("Label list:", label_list)
+example = conll2025["train"][0]
+labels = [label_list[i] for i in example["ner_tags"]]
+print("Metric test:", metric.compute(predictions=[labels], references=[labels]))
+# 📉 Step 20: Define function to compute evaluation metrics
+def compute_metrics(eval_preds):
+    """
+    Compute precision, recall, F1, and accuracy for NER.
+    Args:
+        eval_preds (tuple): Predicted logits and true labels.
+    Returns:
+        dict: Evaluation metrics.
+    """
+    pred_logits, labels = eval_preds
+    pred_logits = np.argmax(pred_logits, axis=2)
+    predictions = [
+        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
+        for prediction, label in zip(pred_logits, labels)
+    ]
+    true_labels = [
+        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
+        for prediction, label in zip(pred_logits, labels)
+    ]
+    results = metric.compute(predictions=predictions, references=true_labels)
+    return {
+        "precision": results["overall_precision"],
+        "recall": results["overall_recall"],
+        "f1": results["overall_f1"],
+        "accuracy": results["overall_accuracy"],
+    }
+# 🚀 Step 21: Initialize and train the trainer
+trainer = Trainer(
+    model,
+    args,
+    train_dataset=tokenized_datasets["train"],
+    eval_dataset=tokenized_datasets["validation"],
+    data_collator=data_collator,
+    tokenizer=tokenizer,
+    compute_metrics=compute_metrics
+)
+trainer.train()
+# 💾 Step 22: Save the fine-tuned model
+model.save_pretrained("boltuix/bert-ner")
+tokenizer.save_pretrained("tokenizer")
+# 🔗 Step 23: Update model configuration with label mappings
+id2label = {str(i): label for i, label in enumerate(label_list)}
+label2id = {label: str(i) for i, label in enumerate(label_list)}
+config = json.load(open("boltuix/bert-ner/config.json"))
+config["id2label"] = id2label
+config["label2id"] = label2id
+json.dump(config, open("boltuix/bert-ner/config.json", "w"))
+# 🔄 Step 24: Load the fine-tuned model
+model_fine_tuned = AutoModelForTokenClassification.from_pretrained("boltuix/bert-ner")
+# 🛠️ Step 25: Create a pipeline for NER inference
+nlp = pipeline("token-classification", model=model_fine_tuned, tokenizer=tokenizer)
+# 📝 Step 26: Perform NER on an example sentence
+example = "On July 4th, 2023, President Joe Biden visited the United Nations headquarters in New York to deliver a speech about international law and donated $5 million to relief efforts."
+ner_results = nlp(example)
+print("NER results for first example:", ner_results)
+# 📍 Step 27: Perform NER on a property address and format output
+example = "This page contains information about the property located at 1275 Kinnear Rd, Columbus, OH, 43212."
+ner_results = nlp(example)
+# 🧹 Step 28: Process NER results into structured entities
+entities = defaultdict(list)
+current_entity = ""
+current_type = ""
+for item in ner_results:
+    entity = item["entity"]
+    word = item["word"]
+    if word.startswith("##"):
+        current_entity += word[2:]  # Handle subword tokens
+    elif entity.startswith("B-"):
+        if current_entity and current_type:
+            entities[current_type].append(current_entity.strip())
+        current_type = entity[2:].lower()
+        current_entity = word
+    elif entity.startswith("I-") and entity[2:].lower() == current_type:
+        current_entity += " " + word  # Continue same entity
+    else:
+        if current_entity and current_type:
+            entities[current_type].append(current_entity.strip())
+        current_entity = ""
+        current_type = ""
+# Append final entity if exists
+if current_entity and current_type:
+    entities[current_type].append(current_entity.strip())
+# 📤 Step 29: Output the final JSON
+final_json = dict(entities)
+print("Structured NER output:")
+print(json.dumps(final_json, indent=2))
+```
+### 🛠️ Tips
+- **Hyperparameters**: Experiment with `learning_rate` (1e-5 to 5e-5) or `num_train_epochs` (2-5).
+- **GPU**: Use `fp16=True` for faster training.
+- **Custom Data**: Modify the script for custom NER datasets.
+### ⏱️ Expected Training Time
+- ~1.5 hours on an NVIDIA GPU (e.g., T4) for ~115,812 examples, 3 epochs, batch size 16.
+### 🌍 Carbon Impact
+- Emissions: ~40g CO₂eq (estimated via ML Impact tool for 1.5 hours on GPU).
+---
+## 🛠️ Installation
+```bash
+pip install transformers torch pandas pyarrow seqeval
+```
+- **Python**: 3.8+
+- **Storage**: ~15 MB for model, ~6.38 MB for dataset
+- **Optional**: NVIDIA CUDA for GPU acceleration
+### Download Instructions 📥
+- **Model**: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder, update with correct URL).
+- **Dataset**: [boltuix/conll2025-ner](https://huggingface.co/datasets/boltuix/conll2025-ner) (placeholder, update with correct URL).
+---
+## 🧪 Evaluation Code
+Evaluate on custom data:
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+from seqeval.metrics import classification_report
+import torch
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("boltuix/EntityBERT")
+model = AutoModelForTokenClassification.from_pretrained("boltuix/EntityBERT")
+# Test data
+texts = ["Elon Musk launched Tesla in California on March 2025."]
+true_labels = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "B-GPE", "O", "B-DATE", "I-DATE", "O"]]
+pred_labels = []
+for text in texts:
+    inputs = tokenizer(text, return_tensors="pt")
+    with torch.no_grad():
+        outputs = model(**inputs)
+    predictions = outputs.logits.argmax(dim=-1)[0].cpu().numpy()
+    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+    word_ids = inputs.word_ids(batch_index=0)
+    word_preds = []
+    previous_word_idx = None
+    for idx, word_idx in enumerate(word_ids):
+        if word_idx is None or word_idx == previous_word_idx:
+            continue
+        label = model.config.id2label[predictions[idx]]
+        word_preds.append(label)
+        previous_word_idx = word_idx
+    pred_labels.append(word_preds)
+# Evaluate
+print("Predicted:", pred_labels)
+print("True     :", true_labels)
+print("\n📊 Evaluation Report:\n")
+print(classification_report(true_labels, pred_labels))
+```
+---
+## 🌱 Dataset Details
+- **Entries**: 143,709
+- **Size**: 6.38 MB (Parquet)
+- **Columns**: `split`, `tokens`, `ner_tags`
+- **Splits**: Train (~115,812), Validation (~15,680), Test (~12,217)
+- **NER Tags**: 36 (18 entity types with B-/I- + O)
+- **Source**: News, user-generated content, research corpora
+---
+## 📊 Visualizing NER Tags
+Compute tag distribution with:
+```python
+import pandas as pd
+from collections import Counter
+import matplotlib.pyplot as plt
+# Load dataset
+df = pd.read_parquet("conll2025_ner.parquet")
+all_tags = [tag for tags in df["ner_tags"] for tag in tags]
+tag_counts = Counter(all_tags)
+# Plot
+plt.figure(figsize=(12, 7))
+plt.bar(tag_counts.keys(), tag_counts.values(), color="#36A2EB")
+plt.title("CoNLL 2025 NER: Tag Distribution", fontsize=16)
+plt.xlabel("NER Tag", fontsize=12)
+plt.ylabel("Count", fontsize=12)
+plt.xticks(rotation=45, ha="right", fontsize=10)
+plt.grid(axis="y", linestyle="--", alpha=0.7)
+plt.tight_layout()
+plt.savefig("ner_tag_distribution.png")
+plt.show()
+```
+---
+## ⚖️ Comparison to Other Models
+| Model                | Dataset            | Parameters | F1 Score | Size   |
+|----------------------|--------------------|------------|----------|--------|
+| **EntityBERT**       | conll2025-ner      | ~4.4M      | 0.85     | ~15 MB |
+| NeuroBERT-NER        | conll2025-ner      | ~11M       | 0.86     | ~50 MB |
+| BERT-base-NER        | CoNLL-2003         | ~110M      | ~0.89    | ~400 MB|
+| DistilBERT-NER       | CoNLL-2003         | ~66M       | ~0.85    | ~200 MB|
+**Advantages**:
+- Ultra-lightweight (~4.4M parameters, ~15 MB)
+- Competitive F1 score (0.85)
+- Ideal for resource-constrained environments
+---
+## 🌐 Community and Support
+- 📍 Model page: [boltuix/EntityBERT](https://huggingface.co/boltuix/EntityBERT) (placeholder)
+- 🛠️ Issues/Contributions: Model repository (URL TBD)
+- 💬 Hugging Face forums: [https://huggingface.co/discussions](https://huggingface.co/discussions)
+- 📚 Docs: [Hugging Face Transformers](https://huggingface.co/docs/transformers)
+- 📧 Contact: [boltuix@gmail.com](mailto:boltuix@gmail.com)
+---
+## ✍️ Contact
+- **Author**: Boltuix
+- **Email**: [boltuix@gmail.com](mailto:boltuix@gmail.com)
+- **Hugging Face**: [boltuix](https://huggingface.co/boltuix)
+---
+## 📅 Last Updated
+**June 11, 2025** — Released v1.0 with fine-tuning on `boltuix/conll2025-ner`.
+**[Get Started Now](#getting-started)** 🚀