# NER-BERT-AI-Model-using-annotated-corpus-ner
A BERT-based Named Entity Recognition (NER) model fine-tuned on the Entity Annotated Corpus. It classifies tokens in text into three predefined entity types: Person (PER), Organization (ORG), and Location (LOC). The model is well suited for information extraction, resume parsing, and chatbot applications.
---
## Model Highlights
- Based on `bert-base-cased` (by Google)
- Fine-tuned on the Entity Annotated Corpus (`ner_dataset.csv`)
- Predicts 3 entity types: PER, ORG, LOC
- Compatible with the Hugging Face `pipeline()` API for easy inference
---
## Intended Uses
- Resume and document parsing
- Chatbots and virtual assistants
- Named entity tagging in structured documents
- Search and information retrieval systems
- News and content analysis
---
## Limitations
- Trained only on formal English text
- May not generalize well to informal text or domain-specific jargon
- Subword tokenization may split entities (e.g., "Cupertino" → "Cup", "##ert", "##ino")
- Limited to the entity types in the original dataset (PER, ORG, LOC only)
---
## Training Details
| Field      | Value                          |
|------------|--------------------------------|
| Base Model | `bert-base-cased`              |
| Dataset    | Entity Annotated Corpus        |
| Framework  | PyTorch with Transformers      |
| Epochs     | 3                              |
| Batch Size | 16                             |
| Max Length | 128 tokens                     |
| Optimizer  | AdamW                          |
| Loss       | CrossEntropyLoss (token-level) |
| Device     | CUDA-enabled GPU               |
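The token-level CrossEntropyLoss row can be illustrated as follows. The shapes follow the table above (max length 128, 7 labels; a batch of 2 here for brevity), and the use of `-100` to mask padding and special-token positions is the standard Transformers convention, not something specific to this model:

```python
import torch
import torch.nn as nn

# Token-level classification loss: logits have shape
# (batch, seq_len, num_labels); label -100 marks positions to ignore
# (padding and special tokens), following the Transformers convention.
num_labels = 7
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

logits = torch.randn(2, 128, num_labels)         # dummy model outputs
labels = torch.randint(0, num_labels, (2, 128))  # dummy gold labels
labels[:, 100:] = -100                           # mask the padded tail

# Flatten to (batch * seq_len, num_labels) vs (batch * seq_len,)
loss = loss_fn(logits.view(-1, num_labels), labels.view(-1))
print(loss.item())
```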
---
## Evaluation Metrics
| Metric    | Score (%) |
|-----------|-----------|
| Precision | 83.15     |
| Recall    | 83.85     |
| F1-Score  | 83.50     |
---
## Label Mapping
| Label ID | Entity Type |
|----------|-------------|
| 0        | O           |
| 1        | B-PER       |
| 2        | I-PER       |
| 3        | B-ORG       |
| 4        | I-ORG       |
| 5        | B-LOC       |
| 6        | I-LOC       |
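In a Transformers checkpoint these mappings are stored in `config.json` as `id2label`/`label2id`; a sketch matching the table above:

```python
# id2label / label2id mappings matching the label table; in a Transformers
# checkpoint these live in config.json.
id2label = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
}
label2id = {label: idx for idx, label in id2label.items()}
```

With these set on the model config, the `pipeline` emits readable label names (e.g. `B-PER`) instead of `LABEL_1`-style placeholders.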
---
## Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "AventIQ-AI/NER-BERT-AI-Model-using-annotated-corpus-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "My name is Wolfgang and I live in Berlin"
ner_results = nlp(example)
print(ner_results)
```
## Quantization
Post-training quantization can be applied using PyTorch to reduce model size and improve inference performance, especially on edge devices.
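A minimal sketch of dynamic post-training quantization with PyTorch. The small `nn.Sequential` model here is a stand-in for illustration; in practice you would pass the fine-tuned BERT loaded via `AutoModelForTokenClassification.from_pretrained(...)`, and it is the int8 quantization of the `nn.Linear` layers that accounts for most of the size reduction in a BERT checkpoint:

```python
import torch
import torch.nn as nn

# Stand-in model for illustration; in practice, load the fine-tuned
# checkpoint and quantize that instead.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 7))

# Dynamic quantization: Linear weights are stored as int8, activations
# are quantized on the fly at inference time. CPU inference only.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

out = quantized(torch.randn(1, 128))
print(out.shape)  # torch.Size([1, 7])
```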
## Repository Structure
```
.
├── model/               # Trained model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Model weights in safetensors format
└── README.md            # Model card
```
## Contributing
We welcome feedback, bug reports, and improvements!
Feel free to open an issue or submit a pull request.