# Legal-BERT Base Entity Classifier
## Overview
A fine-tuned Longformer-based model for classifying legal entities (such as locations and dates) in legal decision texts.
The model is based on allenai/longformer and predicts the type of a marked entity span, given its surrounding context, using the special entity markers `[E] ... [/E]`.
## Model Details
- Model Name: `longformer-classifier-refinement-abb`
- Architecture: Longformer (`allenai/longformer`)
- Task: Entity Classification (NER-style, entity-in-context classification)
- Framework: PyTorch, Hugging Face Transformers
- Author: S. Vercoutere
## Intended Use
- Purpose: Automatic classification of legal entities (e.g., location, date) in municipal or governmental decision documents.
- Not Intended For: General-purpose NER, non-legal domains, or tasks outside entity classification.
## Training Data
- Source: Annotated legal decision texts from Ghent/Freiburg/Bamberg.
- Entity Types:
  - Locations: `impact_location`, `context_location`
  - Dates: `publication_date`, `session_date`, `entry_date`, `expiry_date`, `legal_date`, `context_date`, `validity_period`, `context_period`
- Preprocessing:
  - XML-like tags in text, with entities wrapped in `<entity_type>...</entity_type>`.
  - For training, one entity per sample is marked with `[E] ... [/E]` in context.
  - Dataset balanced to a maximum of 5000 samples per label.
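The preprocessing step above can be sketched in plain Python. This is an illustrative sketch, not the author's actual pipeline: `make_samples` is a hypothetical helper that turns one XML-annotated text into per-entity training samples, each with exactly one span wrapped in `[E] ... [/E]` and all other tags stripped.

```python
import re

# Matches spans like <session_date>3 maart 2023</session_date>
ENTITY_TAG = re.compile(r"<(?P<label>\w+)>(?P<text>.*?)</(?P=label)>")

def make_samples(annotated_text):
    """Yield (marked_text, label) pairs, one per annotated entity."""
    samples = []
    matches = list(ENTITY_TAG.finditer(annotated_text))
    for target in matches:
        parts, last = [], 0
        for m in matches:
            parts.append(annotated_text[last:m.start()])
            if m is target:
                # Only the target entity keeps markers; others are unwrapped.
                parts.append(f"[E] {m.group('text')} [/E]")
            else:
                parts.append(m.group("text"))
            last = m.end()
        parts.append(annotated_text[last:])
        samples.append(("".join(parts), target.group("label")))
    return samples

text = ("Besluit van <session_date>3 maart 2023</session_date> "
        "over <impact_location>Gent</impact_location>.")
for marked, label in make_samples(text):
    print(label, "->", marked)
```

A text with two annotated entities thus yields two samples sharing the same context, differing only in which span carries the markers.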
## Training Procedure
- Base Model: `nlpaueb/legal-bert-base-uncased`
- Tokenization: Hugging Face `AutoTokenizer`, with `[E]` and `[/E]` as additional special tokens.
- Max Sequence Length: 2048 (trained)
- Batch Size: 4
- Optimizer: AdamW
- Learning Rate: 2e-5
- Epochs: 10
- Mixed Precision: Yes (AMP)
- Validation Split: 20%
- Evaluation Metrics: Accuracy, F1, confusion matrix
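The hyperparameters above can be sketched as a minimal fine-tuning loop. This is an illustrative sketch under stated assumptions, not the author's training script: a toy linear model stands in for the Longformer classifier so that the loop shape (AdamW at 2e-5, batch size 4, mixed precision via `autocast`/`GradScaler`) is visible without a GPU.

```python
import torch
from torch import nn

# Toy stand-in for the 10-way sequence classifier (names are illustrative).
model = nn.Linear(8, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # as in the card
loss_fn = nn.CrossEntropyLoss()
# AMP is only enabled when CUDA is present; on CPU the scaler passes through.
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
device_type = "cuda" if torch.cuda.is_available() else "cpu"

inputs = torch.randn(4, 8)            # batch size 4, as in the card
labels = torch.randint(0, 10, (4,))

for epoch in range(2):                # the card trains for 10 epochs
    optimizer.zero_grad()
    with torch.autocast(device_type=device_type):
        loss = loss_fn(model(inputs), labels)
    scaler.scale(loss).backward()     # scaled backward pass (no-op scale on CPU)
    scaler.step(optimizer)
    scaler.update()
```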
## Evaluation
Validation Accuracy: 0.8454 (on held-out validation set)
Detailed Entity-Level Evaluation:
| Entity Label | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| context_date | 0.9272 | 0.9405 | 0.9338 | 975 |
| context_location | 0.9671 | 0.9751 | 0.9711 | 843 |
| context_period | 0.9744 | 0.8321 | 0.8976 | 137 |
| entry_date | 0.9528 | 0.9587 | 0.9557 | 484 |
| expiry_date | 0.8980 | 0.9496 | 0.9231 | 139 |
| impact_location | 0.9501 | 0.9559 | 0.9530 | 997 |
| legal_date | 1.0000 | 0.9926 | 0.9963 | 943 |
| publication_date | 0.9501 | 0.9870 | 0.9682 | 386 |
| session_date | 0.9597 | 0.9597 | 0.9597 | 347 |
| validity_period | 0.9932 | 0.9379 | 0.9648 | 467 |
| accuracy | | | 0.9601 | 5718 |
| macro avg | 0.9572 | 0.9489 | 0.9523 | 5718 |
| weighted avg | 0.9606 | 0.9601 | 0.9601 | 5718 |
## Usage Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("svercoutere/longformer-classifier-refinement-abb")
model = AutoModelForSequenceClassification.from_pretrained("svercoutere/longformer-classifier-refinement-abb")

def classify_entity(entity_text, context_text):
    # Wrap the first occurrence of the entity in the special marker tokens
    marked_text = context_text.replace(entity_text, f"[E] {entity_text} [/E]", 1)
    inputs = tokenizer(marked_text, return_tensors="pt", truncation=True, max_length=2048, padding="max_length")
    with torch.no_grad():
        outputs = model(**inputs)
    pred = torch.argmax(outputs.logits, dim=-1).item()
    return pred  # Map to label using label_encoder.classes_
```
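The returned index still has to be mapped back to a label name. The comment above mentions `label_encoder.classes_`, which suggests scikit-learn's `LabelEncoder`. The sketch below is an assumption, not the author's artifact: it re-fits an encoder on the label set listed under Training Data, which reproduces the original index order only if training used the same alphabetically sorted label set. Verify against the actual training artifacts before relying on it.

```python
from sklearn.preprocessing import LabelEncoder

# Label set from the Training Data section of this card.
labels = [
    "context_date", "context_location", "context_period", "entry_date",
    "expiry_date", "impact_location", "legal_date", "publication_date",
    "session_date", "validity_period",
]
# LabelEncoder sorts classes alphabetically when fitted.
label_encoder = LabelEncoder().fit(labels)

pred = 6  # example index, e.g. as returned by classify_entity above
print(label_encoder.classes_[pred])  # -> legal_date
```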
## Limitations & Bias
- The model is trained on legal texts from specific municipalities and may not generalize to other domains or languages.
- Only entity types present in the training data are supported.
- The model expects entities to be marked with `[E] ... [/E]` in the input.
## Citation
If you use this model, please cite:
```bibtex
@misc{longformer-classifier-refinement-abb,
  author = {S. Vercoutere},
  title = {Longformer Entity Refinement},
  year = {2026},
  howpublished = {\url{https://huggingface.co/svercoutere/longformer-classifier-refinement-abb}}
}
```