🧠 NERClassifier-BERT-CoNLL2003

A BERT-based Named Entity Recognition (NER) model fine-tuned on the CoNLL-2003 dataset. It classifies tokens in text into four predefined entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). The model is well suited to information extraction, document tagging, and question answering systems.

---
✨ Model Highlights

- 🔹 Based on `bert-base-cased` (by Google)
- 🔹 Fine-tuned on the CoNLL-2003 Named Entity Recognition dataset
- ⚡ Supports prediction of 4 entity types: PER, LOC, ORG, MISC
- 💾 Available in both full-precision and quantized versions for fast inference

---
🧠 Intended Uses

- Resume and document parsing
- News article analysis
- Question answering pipelines
- Chatbots and virtual assistants
- Information retrieval and tagging

---
🚫 Limitations

- Trained on English-only NER data (CoNLL-2003)
- May not perform well on informal text (e.g., tweets, slang)
- Entity boundaries may be misaligned with subword tokenization (see the word-level sketch after this list)
- Limited performance on long inputs: sequences are truncated to 128 tokens
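
The subword caveat above can be worked around by mapping predictions back to whole words. A minimal sketch (not part of the original card) that keeps the label of each word's first subword, using the fast tokenizer's `word_ids()` mapping:

```python
# Minimal sketch: collapse subword predictions to word-level labels.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

def predict_words(text):
    words = text.split()
    # is_split_into_words=True keeps a word index for every subword
    inputs = tokenizer(words, is_split_into_words=True,
                       return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        pred_ids = model(**inputs).logits.argmax(dim=-1)[0].tolist()
    word_ids = inputs.word_ids(0)  # subword position -> word index (None = special token)
    labels, seen = [], set()
    for pos, wid in enumerate(word_ids):
        if wid is not None and wid not in seen:  # first subword of each word wins
            seen.add(wid)
            labels.append(model.config.id2label[pred_ids[pos]])
    return list(zip(words, labels))
```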
---
🏋️‍♂️ Training Details

| Field          | Value                          |
| -------------- | ------------------------------ |
| **Base Model** | `bert-base-cased`              |
| **Dataset**    | CoNLL-2003                     |
| **Framework**  | PyTorch with 🤗 Transformers    |
| **Epochs**     | 5                              |
| **Batch Size** | 16                             |
| **Max Length** | 128 tokens                     |
| **Optimizer**  | AdamW                          |
| **Loss**       | CrossEntropyLoss (token-level) |
| **Device**     | Trained on CUDA-enabled GPU    |
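
The table above corresponds to a standard 🤗 Transformers fine-tuning loop. The training script itself is not part of this repository; the sketch below is a hedged reconstruction, assuming the public `conll2003` dataset on the Hub with its `tokens`/`ner_tags` columns:

```python
# Hedged reconstruction of the fine-tuning setup described in the table above.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

dataset = load_dataset("conll2003")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

def tokenize_and_align(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True,
                    truncation=True, max_length=128)  # Max Length from the table
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, labels = None, []
        for wid in enc.word_ids(i):
            # Label only the first subword of each word; ignore the rest (-100)
            labels.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        all_labels.append(labels)
    enc["labels"] = all_labels
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True,
                        remove_columns=dataset["train"].column_names)

args = TrainingArguments(output_dir="ner-bert-conll2003",
                         num_train_epochs=5,              # Epochs from the table
                         per_device_train_batch_size=16)  # Batch Size from the table

# Trainer defaults match the table: AdamW optimizer, token-level CrossEntropyLoss
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["validation"],
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()
```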
---
📊 Evaluation Metrics

| Metric   | Score |
| -------- | ----- |
| Accuracy | 0.98  |
| F1-Score | 0.97  |
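
The card does not state how these scores were computed. For CoNLL-style NER, scores are commonly reported at the entity level with `seqeval`; a brief illustrative sketch:

```python
# Illustrative only: entity-level accuracy and F1 with seqeval.
from seqeval.metrics import accuracy_score, f1_score

y_true = [["B-PER", "I-PER", "O", "B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG", "O"]]
print(accuracy_score(y_true, y_pred), f1_score(y_true, y_pred))  # 1.0 1.0
```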
---
🏷️ Label Mapping

| Label ID | Entity Type |
| -------- | ----------- |
| 0        | O           |
| 1        | B-PER       |
| 2        | I-PER       |
| 3        | B-ORG       |
| 4        | I-ORG       |
| 5        | B-LOC       |
| 6        | I-LOC       |
| 7        | B-MISC      |
| 8        | I-MISC      |
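
The labels follow the BIO scheme: `B-` opens an entity, `I-` continues it, and `O` marks non-entity tokens. The same mapping, written out as the dictionaries typically stored in the model config:

```python
# The label table above as Python dicts (exposed at runtime via model.config).
id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG",
            5: "B-LOC", 6: "I-LOC", 7: "B-MISC", 8: "I-MISC"}
label2id = {label: i for i, label in id2label.items()}
```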
---
🚀 Usage

```python
from transformers import BertTokenizerFast, BertForTokenClassification
import torch

model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

def predict_tokens(text):
    # Tokenize into subwords, truncating to the 128-token training limit
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs).logits
    # Highest-scoring label ID for each subword position
    predictions = torch.argmax(outputs, dim=2)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[label_id.item()] for label_id in predictions[0]]
    return list(zip(tokens, labels))

# Test example
print(predict_tokens("Barack Obama visited Google in California."))
```
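
If you want merged entity spans rather than per-token tags, the 🤗 `pipeline` API can aggregate the B-/I- pieces for you (an alternative to the function above, assuming the checkpoint's config carries the label mapping):

```python
from transformers import pipeline

# aggregation_strategy="simple" merges subwords and B-/I- tags into spans
ner = pipeline("ner", model="AventIQ-AI/ner_bert_conll2003",
               aggregation_strategy="simple")
print(ner("Barack Obama visited Google in California."))
# e.g. [{'entity_group': 'PER', 'word': 'Barack Obama', ...}, ...]
```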
---
🧩 Quantization

Post-training static quantization was applied with PyTorch to reduce model size and speed up inference on edge devices.
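
The exact quantization recipe is not included in this repository. As a rough illustration only (note: this shows PyTorch's *dynamic* post-training quantization, the common one-liner for transformer models, not the static procedure the card describes):

```python
# Illustrative sketch: post-training dynamic quantization of the Linear layers.
# The card states static quantization was used; that recipe is not provided here.
import torch
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("AventIQ-AI/ner_bert_conll2003")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)  # int8 weights, CPU inference
```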
---
📁 Repository Structure

```
.
├── model/              # Quantized model files
├── tokenizer_config/   # Tokenizer and vocab files
├── model.safetensors   # Fine-tuned model in safetensors format
└── README.md           # Model card
```

---
🤝 Contributing

Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model.