
DistilBERT Quantized Model for Named Entity Recognition (NER)

This repository contains a quantized DistilBERT model fine-tuned for Named Entity Recognition tasks. The model identifies entities such as persons, locations, and organizations in text, and is optimized for efficient deployment through INT8 quantization.

Model Details

  • Model Architecture: DistilBERT Base Uncased
  • Task: Named Entity Recognition
  • Dataset: Annotated Corpus for NER (GMB/Groningen Meaning Bank)
  • Quantization: INT8 (PyTorch Dynamic Quantization)
  • Fine-tuning Framework: Hugging Face Transformers

Entity Categories

The model recognizes these entity types:

  • geo = Geographical Entity
  • org = Organization
  • per = Person
  • gpe = Geopolitical Entity
  • tim = Time indicator
  • art = Artifact
  • eve = Event
  • nat = Natural Phenomenon
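
Internally, these types are expanded into a BIO tag set: an O tag for non-entity tokens, plus a B- (begin) and I- (inside) tag per type. A minimal sketch of the resulting label mapping (the exact ordering in the shipped model config may differ):

```python
# Sketch: the BIO label set implied by the entity types above.
# "O" marks non-entity tokens; each type gets a B- (begin) and I- (inside) tag.
ENTITY_TYPES = ["geo", "org", "per", "gpe", "tim", "art", "eve", "nat"]

labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]

id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}

print(len(labels))  # 17: "O" plus B-/I- for each of the 8 types
```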

Usage

Installation

pip install transformers torch datasets evaluate seqeval

Loading the Model

from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast
import torch

# Load the fine-tuned model and tokenizer (replace with your repo id)
model_path = "your_username/ner_distilbert_quantized"
model = DistilBertForTokenClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)

# Example text
text = "Apple is looking at buying U.K. startup for $1 billion"

# Tokenize and predict
inputs = tokenizer(text.split(), 
                  is_split_into_words=True,
                  return_tensors="pt",
                  padding="max_length",
                  truncation=True,
                  max_length=128)

with torch.no_grad():
    outputs = model(**inputs)

# Process predictions
predictions = torch.argmax(outputs.logits, dim=2)[0]
word_ids = inputs.word_ids()

words = text.split()
entities = []
current_entity = []
current_label = None
previous_word_id = None

for word_id, pred_id in zip(word_ids, predictions):
    # Skip special tokens and repeated subwords of the same word
    if word_id is None or word_id == previous_word_id:
        continue
    previous_word_id = word_id

    pred_label = model.config.id2label[pred_id.item()]

    if pred_label.startswith("B-"):
        if current_entity:
            entities.append((" ".join(current_entity), current_label))
        current_entity = [words[word_id]]
        current_label = pred_label[2:]
    elif pred_label.startswith("I-") and current_label == pred_label[2:]:
        current_entity.append(words[word_id])
    else:
        if current_entity:
            entities.append((" ".join(current_entity), current_label))
        current_entity = []
        current_label = None

# Flush the final entity if the sentence ends mid-span
if current_entity:
    entities.append((" ".join(current_entity), current_label))

print("Identified Entities:")
for entity, label in entities:
    print(f"{entity}: {label}")
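
For reuse outside this script, the BIO grouping logic can be factored into a standalone function. A pure-Python sketch, shown here on hand-written mock labels rather than real model output:

```python
def group_entities(words, labels):
    """Group word-level BIO labels into (entity_text, entity_type) spans."""
    entities, current_words, current_type = [], [], None
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_words.append(word)
        else:
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:  # flush the final entity
        entities.append((" ".join(current_words), current_type))
    return entities

words = "Apple is looking at buying U.K. startup".split()
mock_labels = ["B-org", "O", "O", "O", "O", "B-geo", "O"]
print(group_entities(words, mock_labels))  # [('Apple', 'org'), ('U.K.', 'geo')]
```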

Performance Metrics

  • Precision: 0.91
  • Recall: 0.89
  • F1 Score: 0.90
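
As a sanity check, the reported F1 is consistent with the precision and recall above, since F1 is their harmonic mean:

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.91, 0.89
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.9
```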

Fine-Tuning Details

Dataset

Processed version of the Annotated Corpus for Named Entity Recognition from Kaggle.

Training

  • Epochs: 3
  • Batch Size: 16
  • Learning Rate: 2e-5
  • Max Sequence Length: 128 tokens
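
Collected as a config dict, these hyperparameters map onto the usual Hugging Face `TrainingArguments` field names (a sketch; the original training script is in `training_script.py`):

```python
# Sketch: the fine-tuning hyperparameters above, keyed by the
# corresponding Hugging Face TrainingArguments field names.
training_config = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-5,
    "max_length": 128,  # passed to the tokenizer, not TrainingArguments
}
print(training_config["learning_rate"])
```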

Quantization

Applied post-training dynamic quantization using:

quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)
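
The same call works on any module containing `torch.nn.Linear` layers. A self-contained sketch on a toy model (a stand-in for the fine-tuned DistilBERT), showing that the Linear layers are swapped for dynamically quantized INT8 counterparts:

```python
import torch
import torch.nn as nn

# Toy model as a stand-in for the fine-tuned DistilBERT
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

# Post-training dynamic quantization: weights stored as INT8,
# activations quantized on the fly at inference time
quantized = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)

# nn.Linear modules are replaced by DynamicQuantizedLinear
print("DynamicQuantizedLinear" in str(quantized))  # True
```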

Repository Structure

.
├── model/                    # Quantized model files
├── training_script.py        # Fine-tuning code
├── inference_example.ipynb   # Usage examples
├── requirements.txt          # Dependencies
└── README.md                 # This documentation

Limitations

  • Performance may degrade on texts with unusual formatting or domain-specific jargon
  • The model uses subword tokenization which may split some named entities
  • Quantization may cause minor accuracy loss (~1-2%) compared to FP32

Contributing

Contributions are welcome! Please open an issue to discuss proposed changes.
