# BiLSTM-CRF for NER (OntoNotes 5.0)
This is a custom BiLSTM-CRF model trained on the English portion of the OntoNotes 5.0 (CoNLL-2012) dataset. Unlike Transformer-based models, this architecture combines the sequential feature extraction of a bidirectional LSTM with the structured inference of a Conditional Random Field (CRF) layer, and is initialized with pre-trained GloVe 300d word embeddings.
## Performance
The model was evaluated on the official OntoNotes 5.0 (v12) test set using `seqeval`. The table lists per-entity scores for the most frequent types; the averages are computed over all entity types, which is why the per-type supports do not sum to the totals:
| Entity | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| CARDINAL | 0.7310 | 0.7572 | 0.7439 | 1005 |
| DATE | 0.7970 | 0.8309 | 0.8136 | 1786 |
| EVENT | 0.6180 | 0.6471 | 0.6322 | 85 |
| FAC | 0.5678 | 0.4497 | 0.5019 | 149 |
| GPE | 0.8621 | 0.8818 | 0.8718 | 2546 |
| LOC | 0.6491 | 0.6884 | 0.6682 | 215 |
| MONEY | 0.8575 | 0.8648 | 0.8612 | 355 |
| NORP | 0.8734 | 0.8778 | 0.8756 | 990 |
| ORG | 0.8195 | 0.8232 | 0.8213 | 2002 |
| PERSON | 0.8707 | 0.8454 | 0.8578 | 2134 |
| micro avg | 0.8099 | 0.8201 | 0.8150 | 12585 |
| macro avg | 0.7040 | 0.7073 | 0.7046 | 12585 |
| weighted avg | 0.8103 | 0.8201 | 0.8148 | 12585 |
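For reference, this is a minimal sketch of how such a report is produced with `seqeval`; the sentences and tags below are illustrative placeholders, not the actual evaluation data:

```python
from seqeval.metrics import classification_report

# Illustrative placeholder data; the real evaluation uses the OntoNotes v12 test set.
y_true = [["B-PERSON", "I-PERSON", "O", "B-GPE", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "B-GPE", "O"]]

# seqeval groups BIO tags into entity spans before scoring, so
# precision/recall/F1 are computed at the entity level, not per token.
print(classification_report(y_true, y_pred, digits=4))
```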
## Model Architecture
- Embedding Layer: GloVe 300d (wiki-gigaword), fine-tuned during training.
- Encoder: 2-layer Bi-directional LSTM with 512 hidden units.
- Decoder: Linear-chain CRF for optimal tag sequence decoding.
- Dropout: 0.5, applied to the embeddings and the LSTM outputs (a sketch of the full architecture follows).
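The exact model definition lives in the GitHub repository; the sketch below is a hypothetical reconstruction from the description above and from the constructor arguments used in the Usage section (`v_size`, `t_size`, `e_dim`, `h_dim`, `w_matrix`). It assumes the `pytorch-crf` package (`torchcrf`) for the CRF layer; the repository's actual class may differ in detail.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf


class BiLSTM_CRF(nn.Module):
    """Hypothetical reconstruction of the model described in this card."""

    def __init__(self, v_size, t_size, e_dim=300, h_dim=512, w_matrix=None):
        super().__init__()
        # GloVe-initialized embeddings, fine-tuned during training.
        self.embedding = nn.Embedding(v_size, e_dim)
        if w_matrix is not None:
            self.embedding.weight.data.copy_(w_matrix)
        self.dropout = nn.Dropout(0.5)
        # 2-layer bidirectional LSTM; reading "512 hidden units" as per direction.
        self.lstm = nn.LSTM(e_dim, h_dim, num_layers=2, bidirectional=True,
                            batch_first=True, dropout=0.5)
        # Project concatenated forward/backward states to per-tag emission scores.
        self.hidden2tag = nn.Linear(2 * h_dim, t_size)
        self.crf = CRF(t_size, batch_first=True)

    def _emissions(self, x):
        out, _ = self.lstm(self.dropout(self.embedding(x)))
        return self.hidden2tag(self.dropout(out))

    def forward(self, x, tags, mask=None):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(self._emissions(x), tags, mask=mask)

    def decode(self, x, mask=None):
        # Viterbi decoding of the most likely tag sequence.
        return self.crf.decode(self._emissions(x), mask=mask)
```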
## Project Assets
- GitHub Repository: [Learnrr/ontonotes5_ner_evaluation](https://github.com/Learnrr/ontonotes5_ner_evaluation)
| Asset | File | Description |
|---|---|---|
| Model Weights | `bilstm_crf_model.bin` | PyTorch state dictionary (~85.8 MB). |
| Vocabulary | `vocab.pth` | Pickled word-to-index mapping. |
| Label List | `label_list.pth` | Pickled NER tag list (BIO format). |
| Documentation | `README.md` | Model card and usage instructions. |
## Training Infrastructure
- Framework: PyTorch with DistributedDataParallel (DDP).
- Hardware: Multi-GPU (NVIDIA V100) setup with NCCL backend.
- Hyperparameters:
  - Optimizer: AdamW (`lr=1e-3`, `weight_decay=0.01`)
  - Scheduler: Linear warmup with decay (`warmup_ratio=0.1`), sketched below
  - Epochs: 20
  - Batch Size: 32 per GPU (effective batch size 64)
  - Max Length: 128 tokens
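The schedule is described only as "linear warmup with decay"; the helper below is a hedged reconstruction of it using PyTorch's `LambdaLR`. The helper name and the 500 steps-per-epoch figure are illustrative, not taken from the repository:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def linear_warmup_with_decay(optimizer, warmup_steps, total_steps):
    # LR ramps linearly from 0 to the base LR over warmup_steps,
    # then decays linearly back to 0 over the remaining steps.
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model.parameters()
optimizer = torch.optim.AdamW(params, lr=1e-3, weight_decay=0.01)
total_steps = 20 * 500                         # 20 epochs x (hypothetical) 500 steps/epoch
scheduler = linear_warmup_with_decay(optimizer, int(0.1 * total_steps), total_steps)

for step in range(total_steps):
    optimizer.step()   # the actual DDP training step would go here
    scheduler.step()
```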
## Usage
```python
import torch

from model import BiLSTM_CRF  # ensure the class definition is importable

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Load the word-to-index vocabulary and the BIO tag list.
vocab = torch.load("vocab.pth")
label_list = torch.load("label_list.pth")

# 2. Initialize the model. The embedding matrix passed here is a placeholder;
#    the real embedding weights are restored from the checkpoint below.
model = BiLSTM_CRF(
    v_size=len(vocab),
    t_size=len(label_list),
    e_dim=300,
    h_dim=512,
    w_matrix=torch.zeros(len(vocab), 300),
)

# 3. Load the checkpoint (point this at the weights file you downloaded,
#    e.g. bilstm_crf_model.bin from the asset table above).
state_dict = torch.load("best_bilstm_crf_ddp.pth", map_location=device)

# Strip the 'module.' prefix that DistributedDataParallel adds to parameter keys.
new_state_dict = {k.replace("module.", ""): v for k, v in state_dict.items()}
model.load_state_dict(new_state_dict)
model.to(device).eval()
```
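From there, tagging a sentence might look like the following. This assumes whitespace tokenization, an `<unk>` fallback token in the vocabulary, and a `decode` method returning Viterbi tag indices as in the architecture sketch above; adjust to the repository's actual preprocessing:

```python
sentence = "Barack Obama visited Paris in 2015 ."
tokens = sentence.split()

# Map tokens to vocabulary ids; "<unk>" is an assumed fallback token.
unk_id = vocab.get("<unk>", 0)
ids = torch.tensor([[vocab.get(t, unk_id) for t in tokens]], device=device)

with torch.no_grad():
    tag_ids = model.decode(ids)[0]  # Viterbi-decoded tag indices

print(list(zip(tokens, [label_list[i] for i in tag_ids])))
```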