
DistilBERT Quantized Model for Named Entity Recognition (NER)

This repository contains a quantized DistilBERT model fine-tuned for Named Entity Recognition tasks. The model identifies entities such as persons, locations, and organizations in text, and is optimized for efficient deployment through INT8 quantization.

Model Details

  • Model Architecture: DistilBERT Base Uncased
  • Task: Named Entity Recognition
  • Dataset: Annotated Corpus for NER (GMB/Groningen Meaning Bank)
  • Quantization: INT8 (PyTorch Dynamic Quantization)
  • Fine-tuning Framework: Hugging Face Transformers

Entity Categories

The model recognizes these entity types:

  • geo = Geographical Entity
  • org = Organization
  • per = Person
  • gpe = Geopolitical Entity
  • tim = Time indicator
  • art = Artifact
  • eve = Event
  • nat = Natural Phenomenon
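
Internally, these types are expanded into a BIO tag set: an O tag for non-entity tokens, plus a B- (begin) and I- (inside) tag per type. A minimal sketch of the resulting label mapping (the exact ordering in the shipped model config may differ):

```python
# Sketch: the BIO label set implied by the entity types above.
# "O" marks non-entity tokens; each type gets a B- (begin) and I- (inside) tag.
ENTITY_TYPES = ["geo", "org", "per", "gpe", "tim", "art", "eve", "nat"]

labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]

id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}

print(len(labels))  # 17: "O" plus B-/I- for each of the 8 types
```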

Usage

Installation

pip install transformers torch datasets evaluate seqeval

Loading the Model

from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast
import torch

# Load the fine-tuned model and tokenizer (replace with your repo id)
model_path = "your_username/ner_distilbert_quantized"
model = DistilBertForTokenClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)

# Example text
text = "Apple is looking at buying U.K. startup for $1 billion"

# Tokenize and predict
inputs = tokenizer(text.split(), 
                  is_split_into_words=True,
                  return_tensors="pt",
                  padding="max_length",
                  truncation=True,
                  max_length=128)

with torch.no_grad():
    outputs = model(**inputs)

# Process predictions
predictions = torch.argmax(outputs.logits, dim=2)[0]
word_ids = inputs.word_ids()

words = text.split()
entities = []
current_entity = []
current_label = None
previous_word_id = None

for word_id, pred_id in zip(word_ids, predictions):
    # Skip special tokens and repeated subwords of the same word
    if word_id is None or word_id == previous_word_id:
        continue
    previous_word_id = word_id

    pred_label = model.config.id2label[pred_id.item()]

    if pred_label.startswith("B-"):
        if current_entity:
            entities.append((" ".join(current_entity), current_label))
        current_entity = [words[word_id]]
        current_label = pred_label[2:]
    elif pred_label.startswith("I-") and current_label == pred_label[2:]:
        current_entity.append(words[word_id])
    else:
        if current_entity:
            entities.append((" ".join(current_entity), current_label))
        current_entity = []
        current_label = None

# Flush the final entity if the sentence ends mid-span
if current_entity:
    entities.append((" ".join(current_entity), current_label))

print("Identified Entities:")
for entity, label in entities:
    print(f"{entity}: {label}")
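
For reuse outside this script, the BIO grouping logic can be factored into a standalone function. A pure-Python sketch, shown here on hand-written mock labels rather than real model output:

```python
def group_entities(words, labels):
    """Group word-level BIO labels into (entity_text, entity_type) spans."""
    entities, current_words, current_type = [], [], None
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_words.append(word)
        else:
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:  # flush the final entity
        entities.append((" ".join(current_words), current_type))
    return entities

words = "Apple is looking at buying U.K. startup".split()
mock_labels = ["B-org", "O", "O", "O", "O", "B-geo", "O"]
print(group_entities(words, mock_labels))  # [('Apple', 'org'), ('U.K.', 'geo')]
```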

Performance Metrics

  • Precision: 0.91
  • Recall: 0.89
  • F1 Score: 0.90
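
As a sanity check, the reported F1 is consistent with the precision and recall above, since F1 is their harmonic mean:

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.91, 0.89
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.9
```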

Fine-Tuning Details

Dataset

Processed version of the Annotated Corpus for Named Entity Recognition from Kaggle.

Training

  • Epochs: 3
  • Batch Size: 16
  • Learning Rate: 2e-5
  • Max Sequence Length: 128 tokens
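
Collected as a config dict, these hyperparameters map onto the usual Hugging Face `TrainingArguments` field names (a sketch; the original training script is in `training_script.py`):

```python
# Sketch: the fine-tuning hyperparameters above, keyed by the
# corresponding Hugging Face TrainingArguments field names.
training_config = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-5,
    "max_length": 128,  # passed to the tokenizer, not TrainingArguments
}
print(training_config["learning_rate"])
```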

Quantization

Applied post-training dynamic quantization using:

quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)
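
The same call works on any module containing `torch.nn.Linear` layers. A self-contained sketch on a toy model (a stand-in for the fine-tuned DistilBERT), showing that the Linear layers are swapped for dynamically quantized INT8 counterparts:

```python
import torch
import torch.nn as nn

# Toy model as a stand-in for the fine-tuned DistilBERT
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

# Post-training dynamic quantization: weights stored as INT8,
# activations quantized on the fly at inference time
quantized = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)

# nn.Linear modules are replaced by DynamicQuantizedLinear
print("DynamicQuantizedLinear" in str(quantized))  # True
```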

Repository Structure

.
├── model/                    # Quantized model files
├── training_script.py        # Fine-tuning code
├── inference_example.ipynb   # Usage examples
├── requirements.txt          # Dependencies
└── README.md                 # This documentation

Limitations

  • Performance may degrade on texts with unusual formatting or domain-specific jargon
  • The model uses subword tokenization which may split some named entities
  • Quantization may cause minor accuracy loss (~1-2%) compared to FP32

Contributing

Contributions are welcome! Please open an issue to discuss proposed changes.
