# Legal-BERT Base Entity Classifier
## Overview
A fine-tuned Longformer-based model for classifying legal entities (such as locations and dates) in legal decision texts.
The model is based on allenai/longformer and predicts the type of a marked entity span, given its surrounding context, using the special entity markers `[E] ... [/E]`.
## Model Details
- Model Name: `longformer-classifier-refinement-abb`
- Architecture: Longformer (`allenai/longformer`)
- Task: Entity Classification (NER-style, entity-in-context classification)
- Framework: PyTorch, Hugging Face Transformers
- Author: S. Vercoutere
## Intended Use
- Purpose: Automatic classification of legal entities (e.g., location, date) in municipal or governmental decision documents.
- Not Intended For: General-purpose NER, non-legal domains, or tasks outside entity classification.
## Training Data
- Source: Annotated legal decision texts from Ghent/Freiburg/Bamberg.
- Entity Types:
  - Locations: `impact_location`, `context_location`
  - Dates: `publication_date`, `session_date`, `entry_date`, `expiry_date`, `legal_date`, `context_date`, `validity_period`, `context_period`
- Preprocessing:
  - XML-like tags in text, with entities wrapped in `<entity_type>...</entity_type>`.
  - For training, one entity per sample is marked with `[E] ... [/E]` in context.
  - Dataset balanced to a maximum of 5000 samples per label.
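The preprocessing step above can be sketched in plain Python. This is an illustrative sketch, not the author's actual pipeline: `make_samples` is a hypothetical helper that turns one XML-annotated text into per-entity training samples, each with exactly one span wrapped in `[E] ... [/E]` and all other tags stripped.

```python
import re

# Matches spans like <session_date>3 maart 2023</session_date>
ENTITY_TAG = re.compile(r"<(?P<label>\w+)>(?P<text>.*?)</(?P=label)>")

def make_samples(annotated_text):
    """Yield (marked_text, label) pairs, one per annotated entity."""
    samples = []
    matches = list(ENTITY_TAG.finditer(annotated_text))
    for target in matches:
        parts, last = [], 0
        for m in matches:
            parts.append(annotated_text[last:m.start()])
            if m is target:
                # Only the target entity keeps markers; others are unwrapped.
                parts.append(f"[E] {m.group('text')} [/E]")
            else:
                parts.append(m.group("text"))
            last = m.end()
        parts.append(annotated_text[last:])
        samples.append(("".join(parts), target.group("label")))
    return samples

text = ("Besluit van <session_date>3 maart 2023</session_date> "
        "over <impact_location>Gent</impact_location>.")
for marked, label in make_samples(text):
    print(label, "->", marked)
```

A text with two annotated entities thus yields two samples sharing the same context, differing only in which span carries the markers.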
## Training Procedure
- Base Model: `nlpaueb/legal-bert-base-uncased`
- Tokenization: Hugging Face `AutoTokenizer`, with `[E]` and `[/E]` as additional special tokens.
- Max Sequence Length: 2048 (trained)
- Batch Size: 4
- Optimizer: AdamW
- Learning Rate: 2e-5
- Epochs: 10
- Mixed Precision: Yes (AMP)
- Validation Split: 20%
- Evaluation Metrics: Accuracy, F1, confusion matrix
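The hyperparameters above can be sketched as a minimal fine-tuning loop. This is an illustrative sketch under stated assumptions, not the author's training script: a toy linear model stands in for the Longformer classifier so that the loop shape (AdamW at 2e-5, batch size 4, mixed precision via `autocast`/`GradScaler`) is visible without a GPU.

```python
import torch
from torch import nn

# Toy stand-in for the 10-way sequence classifier (names are illustrative).
model = nn.Linear(8, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # as in the card
loss_fn = nn.CrossEntropyLoss()
# AMP is only enabled when CUDA is present; on CPU the scaler passes through.
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
device_type = "cuda" if torch.cuda.is_available() else "cpu"

inputs = torch.randn(4, 8)            # batch size 4, as in the card
labels = torch.randint(0, 10, (4,))

for epoch in range(2):                # the card trains for 10 epochs
    optimizer.zero_grad()
    with torch.autocast(device_type=device_type):
        loss = loss_fn(model(inputs), labels)
    scaler.scale(loss).backward()     # scaled backward pass (no-op scale on CPU)
    scaler.step(optimizer)
    scaler.update()
```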
## Evaluation
Validation Accuracy: 0.8454 (on held-out validation set)
Detailed Entity-Level Evaluation:
| Entity Label | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| context_date | 0.9272 | 0.9405 | 0.9338 | 975 |
| context_location | 0.9671 | 0.9751 | 0.9711 | 843 |
| context_period | 0.9744 | 0.8321 | 0.8976 | 137 |
| entry_date | 0.9528 | 0.9587 | 0.9557 | 484 |
| expiry_date | 0.8980 | 0.9496 | 0.9231 | 139 |
| impact_location | 0.9501 | 0.9559 | 0.9530 | 997 |
| legal_date | 1.0000 | 0.9926 | 0.9963 | 943 |
| publication_date | 0.9501 | 0.9870 | 0.9682 | 386 |
| session_date | 0.9597 | 0.9597 | 0.9597 | 347 |
| validity_period | 0.9932 | 0.9379 | 0.9648 | 467 |
| accuracy | | | 0.9601 | 5718 |
| macro avg | 0.9572 | 0.9489 | 0.9523 | 5718 |
| weighted avg | 0.9606 | 0.9601 | 0.9601 | 5718 |
## Usage Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("svercoutere/longformer-classifier-refinement-abb")
model = AutoModelForSequenceClassification.from_pretrained("svercoutere/longformer-classifier-refinement-abb")

def classify_entity(entity_text, context_text):
    # Wrap the first occurrence of the entity in the special marker tokens
    marked_text = context_text.replace(entity_text, f"[E] {entity_text} [/E]", 1)
    inputs = tokenizer(marked_text, return_tensors="pt", truncation=True, max_length=2048, padding="max_length")
    with torch.no_grad():
        outputs = model(**inputs)
    pred = torch.argmax(outputs.logits, dim=-1).item()
    return pred  # Map to label using label_encoder.classes_
```
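The returned index still has to be mapped back to a label name. The comment above mentions `label_encoder.classes_`, which suggests scikit-learn's `LabelEncoder`. The sketch below is an assumption, not the author's artifact: it re-fits an encoder on the label set listed under Training Data, which reproduces the original index order only if training used the same alphabetically sorted label set. Verify against the actual training artifacts before relying on it.

```python
from sklearn.preprocessing import LabelEncoder

# Label set from the Training Data section of this card.
labels = [
    "context_date", "context_location", "context_period", "entry_date",
    "expiry_date", "impact_location", "legal_date", "publication_date",
    "session_date", "validity_period",
]
# LabelEncoder sorts classes alphabetically when fitted.
label_encoder = LabelEncoder().fit(labels)

pred = 6  # example index, e.g. as returned by classify_entity above
print(label_encoder.classes_[pred])  # -> legal_date
```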
## Limitations & Bias
- The model is trained on legal texts from specific municipalities and may not generalize to other domains or languages.
- Only entity types present in the training data are supported.
- The model expects entities to be marked with `[E] ... [/E]` in the input.
## Citation
If you use this model, please cite:
```bibtex
@misc{longformer-classifier-refinement-abb,
  author = {S. Vercoutere},
  title = {Longformer Entity Refinement},
  year = {2026},
  howpublished = {\url{https://huggingface.co/svercoutere/longformer-classifier-refinement-abb}}
}
```