---
library_name: transformers
tags:
- transformers
- pytorch
- bert
- legal-domain
- entity-classification
- sequence-classification
- NER
- longformer
- token-classification
- label-studio
- english
- fine-tuned
---
# Legal-BERT Base Entity Classifier
## Overview
A fine-tuned Longformer-based model for classifying legal entities (such as locations and dates) within the context of legal decision texts.
The model is based on `allenai/longformer` and is trained to predict the type of a marked entity span, given its context, using special entity markers `[E] ... [/E]`.
## Model Details
- **Model Name:** longformer-classifier-refinement-abb
- **Architecture:** Longformer (allenai/longformer)
- **Task:** Entity Classification (NER-style, entity-in-context classification)
- **Framework:** PyTorch, Hugging Face Transformers
- **Author:** S. Vercoutere
## Intended Use
- **Purpose:** Automatic classification of legal entities (e.g., location, date) in municipal or governmental decision documents.
- **Not Intended For:** General-purpose NER, non-legal domains, or tasks outside entity classification.
## Training Data
- **Source:** Annotated legal decision texts from Ghent/Freiburg/Bamberg.
- **Entity Types:**
- Locations: `impact_location`, `context_location`
- Dates: `publication_date`, `session_date`, `entry_date`, `expiry_date`, `legal_date`, `context_date`, `validity_period`, `context_period`
- **Preprocessing:**
- XML-like tags in text, with entities wrapped in `<entity_type>...</entity_type>`.
- For training, one entity per sample is marked with `[E] ... [/E]` in context.
- Dataset balanced to max 5000 samples per label.
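The preprocessing step above can be sketched as follows. This is a minimal illustration, not the author's actual pipeline: it assumes entities are annotated inline as `<entity_type>...</entity_type>` and produces one training sample per entity, with the target entity wrapped in `[E] ... [/E]` and all other annotation tags stripped. The helper name `make_training_samples` and the example sentence are illustrative.

```python
import re

# Matches inline annotations like <legal_date>5 maart 2024</legal_date>.
TAG_RE = re.compile(r"<(?P<label>\w+)>(?P<text>.*?)</(?P=label)>")

def make_training_samples(annotated_text):
    """Turn one annotated document into one (marked_text, label) pair per
    entity: the target entity is wrapped in [E] ... [/E], while every
    other annotation is reduced to its plain text."""
    entities = list(TAG_RE.finditer(annotated_text))
    samples = []
    for target in entities:
        parts, cursor = [], 0
        for m in entities:
            parts.append(annotated_text[cursor:m.start()])
            if m is target:
                parts.append(f"[E] {m.group('text')} [/E]")
            else:
                parts.append(m.group("text"))
            cursor = m.end()
        parts.append(annotated_text[cursor:])
        samples.append(("".join(parts), target.group("label")))
    return samples

doc = "Besluit van <session_date>12 mei 2023</session_date> over <impact_location>Gent</impact_location>."
samples = make_training_samples(doc)
```

A document with two annotated entities thus yields two samples, each marking a different entity in otherwise tag-free context.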
## Training Procedure
- **Model:** `nlpaueb/legal-bert-base-uncased`
- **Tokenization:** Hugging Face AutoTokenizer, with `[E]` and `[/E]` as additional special tokens.
- **Max Sequence Length:** 2048 (trained)
- **Batch Size:** 4
- **Optimizer:** AdamW
- **Learning Rate:** 2e-5
- **Epochs:** 10
- **Mixed Precision:** Yes (AMP)
- **Validation Split:** 20%
- **Evaluation Metrics:** Accuracy, F1, confusion matrix
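The per-label cap mentioned under Training Data (at most 5000 samples per label) could be implemented as below. The exact sampling strategy used by the author is not documented, so this random-downsampling sketch, including the helper name `balance_by_label`, is an assumption.

```python
import random
from collections import defaultdict

def balance_by_label(samples, max_per_label=5000, seed=42):
    """Cap each label at max_per_label examples by random downsampling.
    `samples` is a list of (text, label) pairs."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    balanced = []
    for label, group in by_label.items():
        if len(group) > max_per_label:
            group = rng.sample(group, max_per_label)
        balanced.extend(group)
    rng.shuffle(balanced)  # avoid label-sorted order after grouping
    return balanced

# Toy demo: 7 samples of one label, 2 of another, capped at 5 per label.
demo = [("t", "context_date")] * 7 + [("u", "impact_location")] * 2
capped = balance_by_label(demo, max_per_label=5)
```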
## Evaluation
**Validation Accuracy:** 0.8454 (held-out validation set)

**Detailed Entity-Level Evaluation:**
| Entity Label | Precision | Recall | F1-score | Support |
| ---------------- | --------- | ------ | ---------- | ------- |
| context_date | 0.9272 | 0.9405 | 0.9338 | 975 |
| context_location | 0.9671 | 0.9751 | 0.9711 | 843 |
| context_period | 0.9744 | 0.8321 | 0.8976 | 137 |
| entry_date | 0.9528 | 0.9587 | 0.9557 | 484 |
| expiry_date | 0.8980 | 0.9496 | 0.9231 | 139 |
| impact_location | 0.9501 | 0.9559 | 0.9530 | 997 |
| legal_date | 1.0000 | 0.9926 | 0.9963 | 943 |
| publication_date | 0.9501 | 0.9870 | 0.9682 | 386 |
| session_date | 0.9597 | 0.9597 | 0.9597 | 347 |
| validity_period | 0.9932 | 0.9379 | 0.9648 | 467 |
| **accuracy** | | | **0.9601** | 5718 |
| **macro avg** | 0.9572 | 0.9489 | 0.9523 | 5718 |
| **weighted avg** | 0.9606 | 0.9601 | 0.9601 | 5718 |
## Usage Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("svercoutere/longformer-classifier-refinement-abb")
model = AutoModelForSequenceClassification.from_pretrained("svercoutere/longformer-classifier-refinement-abb")
model.eval()

def classify_entity(entity_text, context_text):
    # Wrap the first occurrence of the entity in the [E] ... [/E] markers.
    marked_text = context_text.replace(entity_text, f"[E] {entity_text} [/E]", 1)
    inputs = tokenizer(marked_text, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        outputs = model(**inputs)
    pred = torch.argmax(outputs.logits, dim=-1).item()
    return pred  # Map to a label name, e.g. via label_encoder.classes_
```
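Note that `str.replace(..., 1)` in the example above marks only the *first* occurrence of the entity string; if the same surface form appears several times in the context, marking by character offsets is safer. A small hedged helper (the name `mark_span` and the example sentence are illustrative, not part of the model's API):

```python
def mark_span(text, start, end):
    """Wrap the span text[start:end] in [E] ... [/E] entity markers.

    Marking by character offsets avoids mis-marking when the same
    surface string occurs more than once in the context."""
    return f"{text[:start]}[E] {text[start:end]} [/E]{text[end:]}"

context = "Approved on 5 March 2024. Entry into force on 5 March 2024."
# Mark the second date (the entry date), not the first occurrence.
start = context.index("5 March 2024", 26)
marked = mark_span(context, start, start + len("5 March 2024"))
```

The resulting `marked` string can then be fed to the tokenizer exactly as in `classify_entity` above.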
## Limitations & Bias
- The model is trained on legal texts from specific municipalities and may not generalize to other domains or languages.
- Only entity types present in the training data are supported.
- The model expects entities to be marked with `[E] ... [/E]` in the input.
## Citation
If you use this model, please cite:
```
@misc{longformer-classifier-refinement-abb,
author = {S. Vercoutere},
title = {Longformer Entity Refinement},
year = {2026},
howpublished = {\url{https://huggingface.co/svercoutere/longformer-classifier-refinement-abb}}
}
```