# 🧠 NERClassifier-BERT-CoNLL2003

A BERT-based Named Entity Recognition (NER) model fine-tuned on the CoNLL-2003 dataset. It classifies tokens in text into predefined entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). The model is well suited to information extraction, document tagging, and question answering systems.

---

## ✨ Model Highlights

- 📌 Based on `bert-base-cased` (by Google)
- 🔍 Fine-tuned on the CoNLL-2003 Named Entity Recognition dataset
- ⚡ Supports prediction of 4 entity types: PER, LOC, ORG, MISC
- 💾 Available in both full-precision and quantized versions for fast inference

---

## 🧠 Intended Uses

- Resume and document parsing
- News article analysis
- Question answering pipelines
- Chatbots and virtual assistants
- Information retrieval and tagging

---

## 🚫 Limitations

- Trained on English-only NER data (CoNLL-2003)
- May not perform well on informal text (e.g., tweets, slang)
- Entity boundaries may be misaligned with subword tokenization (see the span-grouping example at the end of this card)
- Limited performance on sequences longer than 128 tokens

---

## 🏋️‍♂️ Training Details

| Field          | Value                          |
| -------------- | ------------------------------ |
| **Base Model** | `bert-base-cased`              |
| **Dataset**    | CoNLL-2003                     |
| **Framework**  | PyTorch with 🤗 Transformers   |
| **Epochs**     | 5                              |
| **Batch Size** | 16                             |
| **Max Length** | 128 tokens                     |
| **Optimizer**  | AdamW                          |
| **Loss**       | CrossEntropyLoss (token-level) |
| **Device**     | Trained on CUDA-enabled GPU    |

---

## 📊 Evaluation Metrics

| Metric   | Score |
| -------- | ----- |
| Accuracy | 0.98  |
| F1-Score | 0.97  |

---

## 🔎 Label Mapping

| Label ID | Entity Type |
| -------- | ----------- |
| 0        | O           |
| 1        | B-PER       |
| 2        | I-PER       |
| 3        | B-ORG       |
| 4        | I-ORG       |
| 5        | B-LOC       |
| 6        | I-LOC       |
| 7        | B-MISC      |
| 8        | I-MISC      |

---

## 🚀 Usage

```python
from transformers import BertTokenizerFast, BertForTokenClassification
import torch

model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

def predict_tokens(text):
    # Tokenize the input and run a forward pass without gradient tracking
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs).logits
    # Pick the highest-scoring label for each token and map label IDs to names
    predictions = torch.argmax(outputs, dim=2)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[label_id.item()] for label_id in predictions[0]]
    return list(zip(tokens, labels))

# Test example
print(predict_tokens("Barack Obama visited Google in California."))
```

---

## 🧩 Quantization

Post-training static quantization was applied with PyTorch to reduce model size and improve inference performance on edge devices.

---

## 🗂 Repository Structure

```
.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Fine-tuned model weights in safetensors format
├── README.md            # Model card
```

---

## 🤝 Contributing

Contributions and feedback are welcome! Feel free to open an issue or submit a pull request if you find a bug or want to improve the model.
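
---

## 🧪 Example: Grouping Subword Predictions into Entity Spans

As noted in the Limitations section, `predict_tokens` returns one label per subword token (e.g., `Cali` / `##fornia`), so entity boundaries may not line up with whole words. The sketch below is one way to merge subword predictions into word-level entity spans using the 🤗 Transformers `pipeline` helper with an aggregation strategy; the repository id is taken from this card, and the exact output fields depend on your installed `transformers` version.

```python
from transformers import pipeline

# Token-classification pipeline with simple aggregation:
# consecutive subword tokens sharing an entity label are merged into one span.
# Repository id assumed from the Usage section above.
ner = pipeline(
    "token-classification",
    model="AventIQ-AI/ner_bert_conll2003",
    aggregation_strategy="simple",
)

for entity in ner("Barack Obama visited Google in California."):
    # Each entity dict contains the grouped text, entity label, and confidence score.
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))
```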
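
---

## ⚙️ Example: Post-Training Quantization Sketch

The exact quantization recipe used for the published quantized weights is not included in this card. As a rough illustration of PyTorch post-training quantization, the sketch below applies *dynamic* quantization (a simpler, CPU-oriented variant) to the linear layers of the fine-tuned model; treat it as an assumption-laden example rather than a reproduction of the released quantized artifacts.

```python
import torch
from transformers import BertForTokenClassification

# Load the full-precision fine-tuned model (repository id assumed from this card).
model = BertForTokenClassification.from_pretrained("AventIQ-AI/ner_bert_conll2003")
model.eval()

# Dynamically quantize all nn.Linear modules to int8 weights.
# This shrinks the checkpoint and can speed up CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Persist the quantized weights for later use (filename is illustrative).
torch.save(quantized_model.state_dict(), "ner_bert_conll2003_quantized.pt")
```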