# DistilBERT Quantized Model for Named Entity Recognition (NER)

This repository contains a quantized DistilBERT model fine-tuned for Named Entity Recognition. The model identifies entities such as persons, locations, and organizations in text, and is optimized for efficient deployment through INT8 quantization.
## Model Details
- Model Architecture: DistilBERT Base Uncased
- Task: Named Entity Recognition
- Dataset: Annotated Corpus for NER (GMB/Groningen Meaning Bank)
- Quantization: INT8 (PyTorch Dynamic Quantization)
- Fine-tuning Framework: Hugging Face Transformers
## Entity Categories
The model recognizes these entity types:
- geo = Geographical Entity
- org = Organization
- per = Person
- gpe = Geopolitical Entity
- tim = Time indicator
- art = Artifact
- eve = Event
- nat = Natural Phenomenon
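In the model's label space these categories appear with BIO prefixes (`B-` marks the first word of an entity, `I-` a continuation, and a standalone `O` marks non-entity words), as the inference code below assumes. A minimal sketch of the implied label set (the exact ordering in `model.config.id2label` may differ):

```python
# Construct the BIO label set implied by the eight categories above.
categories = ["geo", "org", "per", "gpe", "tim", "art", "eve", "nat"]

labels = ["O"]  # non-entity tokens
for cat in categories:
    labels.append(f"B-{cat}")  # first token of an entity
    labels.append(f"I-{cat}")  # continuation token

# 8 categories x 2 prefixes + "O" = 17 labels
print(len(labels))  # 17
id2label = dict(enumerate(labels))
```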
## Usage

### Installation

```bash
pip install transformers torch datasets evaluate seqeval
```

### Loading the Model
```python
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast
import torch

# Load the quantized model and tokenizer
model_path = "your_username/ner_distilbert_quantized"
model = DistilBertForTokenClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
model.eval()

# Example text
text = "Apple is looking at buying U.K. startup for $1 billion"
words = text.split()

# Tokenize and predict
inputs = tokenizer(words,
                   is_split_into_words=True,
                   return_tensors="pt",
                   padding="max_length",
                   truncation=True,
                   max_length=128)

with torch.no_grad():
    outputs = model(**inputs)

# Map each sub-token prediction back to its word and group B-/I- runs
predictions = torch.argmax(outputs.logits, dim=2)[0]
word_ids = inputs.word_ids()

entities = []
current_entity = []
current_label = None
previous_word_id = None

for word_id, pred_id in zip(word_ids, predictions):
    # Skip special tokens and repeated sub-tokens of the same word
    if word_id is None or word_id == previous_word_id:
        continue
    previous_word_id = word_id
    pred_label = model.config.id2label[pred_id.item()]
    if pred_label.startswith("B-"):
        if current_entity:
            entities.append((" ".join(current_entity), current_label))
        current_entity = [words[word_id]]
        current_label = pred_label[2:]
    elif pred_label.startswith("I-") and current_label == pred_label[2:]:
        current_entity.append(words[word_id])
    else:
        if current_entity:
            entities.append((" ".join(current_entity), current_label))
        current_entity = []
        current_label = None

# Flush the last entity if the sentence ends inside one
if current_entity:
    entities.append((" ".join(current_entity), current_label))

print("Identified Entities:")
for entity, label in entities:
    print(f"{entity}: {label}")
```
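The B-/I- grouping logic can be exercised independently of the model with plain label sequences. A standalone sketch (`group_entities` is an illustrative helper, not part of this repository):

```python
def group_entities(words, labels):
    """Group per-word BIO labels into (entity_text, category) spans."""
    entities, current, current_label = [], [], None
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            if current:
                entities.append((" ".join(current), current_label))
            current, current_label = [word], label[2:]
        elif label.startswith("I-") and current_label == label[2:]:
            current.append(word)
        else:
            if current:
                entities.append((" ".join(current), current_label))
            current, current_label = [], None
    if current:  # flush a trailing entity
        entities.append((" ".join(current), current_label))
    return entities

words = ["Barack", "Obama", "visited", "New", "York", "yesterday"]
labels = ["B-per", "I-per", "O", "B-geo", "I-geo", "B-tim"]
print(group_entities(words, labels))
# [('Barack Obama', 'per'), ('New York', 'geo'), ('yesterday', 'tim')]
```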
## Performance Metrics
- Precision: 0.91
- Recall: 0.89
- F1 Score: 0.90
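The F1 score is the harmonic mean of precision and recall, which is easy to check against the numbers above:

```python
precision, recall = 0.91, 0.89
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.9
```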
## Fine-Tuning Details

### Dataset

Processed version of the Annotated Corpus for Named Entity Recognition (GMB) from Kaggle.

### Training
- Epochs: 3
- Batch Size: 16
- Learning Rate: 2e-5
- Max Sequence Length: 128 tokens
### Quantization

Post-training dynamic quantization was applied to the model's linear layers:

```python
import torch

quantized_model = torch.quantization.quantize_dynamic(
    model,                # fine-tuned FP32 model
    {torch.nn.Linear},    # quantize all Linear layers
    dtype=torch.qint8
)
```
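As a self-contained sketch of what dynamic quantization does (using a toy `nn.Linear` stack standing in for the actual fine-tuned model):

```python
import torch
import torch.nn as nn

# Toy model standing in for the fine-tuned DistilBERT
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 8))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Linear layers are replaced by dynamically quantized equivalents:
# weights are stored as INT8 and dequantized on the fly at inference,
# while activations stay in floating point.
print(type(quantized[0]).__module__)  # a torch quantized dynamic Linear module
x = torch.randn(1, 128)
print(quantized(x).shape)  # torch.Size([1, 8])
```

Only the weights are quantized ahead of time, which is why this approach needs no calibration data and suits Transformer-style models dominated by linear layers.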
## Repository Structure

```
.
├── model/                   # Quantized model files
├── training_script.py       # Fine-tuning code
├── inference_example.ipynb  # Usage examples
├── requirements.txt         # Dependencies
└── README.md                # This documentation
```
## Limitations
- Performance may degrade on texts with unusual formatting or domain-specific jargon
- The model uses WordPiece subword tokenization, which may split rare or unseen entity names across multiple tokens
- Quantization may cause minor accuracy loss (~1-2%) compared to FP32
## Contributing
Contributions are welcome! Please open an issue to discuss proposed changes.