# 🧠 NERClassifier-BERT-CoNLL2003

A BERT-based Named Entity Recognition (NER) model fine-tuned on the CoNLL-2003 dataset. It classifies tokens in text into predefined entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). This makes it well suited to information extraction, document tagging, and question answering systems.

---

## ✨ Model Highlights

- 🚀 Based on `bert-base-cased` (by Google)
- 📚 Fine-tuned on the CoNLL-2003 Named Entity Recognition dataset
- ⚡ Supports prediction of 4 entity types: PER, LOC, ORG, MISC
- 💾 Available in both full and quantized versions for fast inference

---

## 🧠 Intended Uses

- Resume and document parsing
- News article analysis
- Question answering pipelines
- Chatbots and virtual assistants
- Information retrieval and tagging

---

## 🚫 Limitations

- Trained on English-only NER data (CoNLL-2003)
- May not perform well on informal text (e.g., tweets, slang)
- Entity boundaries may be misaligned with subword tokenization (see the aggregation example under Usage)
- Inputs longer than 128 tokens are truncated, so performance is limited on long sequences

---

## 🏋️‍♂️ Training Details

| Field          | Value                          |
| -------------- | ------------------------------ |
| **Base Model** | `bert-base-cased`              |
| **Dataset**    | CoNLL-2003                     |
| **Framework**  | PyTorch with 🤗 Transformers   |
| **Epochs**     | 5                              |
| **Batch Size** | 16                             |
| **Max Length** | 128 tokens                     |
| **Optimizer**  | AdamW                          |
| **Loss**       | CrossEntropyLoss (token-level) |
| **Device**     | Trained on CUDA-enabled GPU    |

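The training script itself is not published in this repo; the sketch below is a hedged reconstruction of a standard 🤗 `Trainer` setup matching the hyperparameters above (AdamW and token-level cross-entropy are the `Trainer` defaults). Column names follow the public `conll2003` dataset; evaluation and metric code are omitted.

```python
# Hedged sketch, not the published training script: a standard 🤗 Trainer
# setup matching the table above. AdamW and token-level CrossEntropyLoss
# are the Trainer defaults, so they need no explicit configuration.
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("conll2003")
label_names = dataset["train"].features["ner_tags"].feature.names  # O, B-PER, ...

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_names),
    id2label=dict(enumerate(label_names)),
)

def tokenize_and_align(batch):
    # Tokenize pre-split words; align word-level tags to subword tokens.
    # Special tokens and subword continuations get -100 so the loss skips them.
    tokenized = tokenizer(
        batch["tokens"], is_split_into_words=True, truncation=True, max_length=128
    )
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        previous, label_ids = None, []
        for word_id in tokenized.word_ids(batch_index=i):
            if word_id is None or word_id == previous:
                label_ids.append(-100)
            else:
                label_ids.append(tags[word_id])
            previous = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_ds = dataset.map(
    tokenize_and_align, batched=True, remove_columns=dataset["train"].column_names
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ner-bert-conll2003",
        num_train_epochs=5,              # matches the table above
        per_device_train_batch_size=16,  # matches the table above
    ),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```
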
---

## 📊 Evaluation Metrics

| Metric   | Score |
| -------- | ----- |
| Accuracy | 0.98  |
| F1-Score | 0.97  |

---

## 🏷️ Label Mapping

Labels follow the BIO scheme: `B-` marks the first token of an entity, `I-` a continuation token, and `O` a token outside any entity.

| Label ID | Entity Type |
| -------- | ----------- |
| 0        | O           |
| 1        | B-PER       |
| 2        | I-PER       |
| 3        | B-ORG       |
| 4        | I-ORG       |
| 5        | B-LOC       |
| 6        | I-LOC       |
| 7        | B-MISC      |
| 8        | I-MISC      |

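The same mapping ships in the checkpoint's config, so it can be inspected programmatically (assuming the standard `id2label` field is populated):

```python
from transformers import AutoConfig

# Print the label mapping stored with the checkpoint
config = AutoConfig.from_pretrained("AventIQ-AI/ner_bert_conll2003")
print(config.id2label)  # expected: {0: 'O', 1: 'B-PER', ..., 8: 'I-MISC'}
```
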
---

## 🚀 Usage

```python
from transformers import BertTokenizerFast, BertForTokenClassification
import torch

model_name = "AventIQ-AI/ner_bert_conll2003"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)
model.eval()

def predict_tokens(text):
    # Truncate to the 128-token window the model was trained with
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs).logits
    # Pick the highest-scoring label for each token
    predictions = torch.argmax(outputs, dim=2)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[label_id.item()] for label_id in predictions[0]]
    return list(zip(tokens, labels))

# Test example
print(predict_tokens("Barack Obama visited Google in California."))
```
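
Because BERT splits words into subword pieces, the raw token/label pairs above can break entities apart (see Limitations). The Transformers `token-classification` pipeline can merge the pieces back into whole entity spans; a minimal sketch, assuming the checkpoint's config carries the B-/I- scheme shown under Label Mapping:

```python
from transformers import pipeline

# Groups B-/I- subword predictions into single entity spans
ner = pipeline(
    "token-classification",
    model="AventIQ-AI/ner_bert_conll2003",
    aggregation_strategy="simple",
)
print(ner("Barack Obama visited Google in California."))
# e.g. [{'entity_group': 'PER', 'word': 'Barack Obama', ...}, ...]
```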

---

## 🧩 Quantization

Post-training static quantization was applied using PyTorch to reduce model size and improve inference performance on edge devices.

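The exact quantization recipe used for the published weights is not included in this repo. As an illustration only, PyTorch's post-training *dynamic* quantization (a lighter-weight variant than the static quantization mentioned above, with no calibration step) can shrink the linear layers to int8:

```python
# Illustrative only: dynamic (not static) post-training quantization,
# the simplest PyTorch recipe for Transformer models.
import torch
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("AventIQ-AI/ner_bert_conll2003")

# Weights stored as int8, activations quantized on the fly; only the
# Linear layers are converted, since they dominate BERT's size.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized.state_dict(), "ner_bert_quantized.pt")
```
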
---

## 📁 Repository Structure

```
.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Fine-tuned model in safetensors format
└── README.md            # Model card
```

---

## 🤝 Contributing

Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model.