|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- transformers |
|
|
- pytorch |
|
|
- bert |
|
|
- legal-domain |
|
|
- entity-classification |
|
|
- sequence-classification |
|
|
- NER |
|
|
- longformer |
|
|
- token-classification |
|
|
- label-studio |
|
|
- english |
|
|
- fine-tuned |
|
|
--- |
|
|
|
|
|
|
|
# Longformer Legal Entity Classifier
|
|
|
|
|
## Overview |
|
|
A fine-tuned Longformer model for classifying legal entities (such as locations and dates) within legal decision texts.

Based on `allenai/longformer`, the model predicts the type of a marked entity span from its surrounding context, using the special entity markers `[E] ... [/E]`.
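As an illustration (the sentence below is invented, not taken from the training data), an input with one marked entity looks like this:

```python
# Hypothetical input: the span between [E] and [/E] is the entity to
# classify; the rest of the document serves as context.
marked = (
    "The council of [E] Ghent [/E] approved the parking measure "
    "during its session of 12 March 2024."
)
```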
|
|
|
|
|
## Model Details |
|
|
- **Model Name:** longformer-classifier-refinement-abb |
|
|
- **Architecture:** Longformer (allenai/longformer) |
|
|
- **Task:** Entity Classification (NER-style, entity-in-context classification) |
|
|
- **Framework:** PyTorch, Hugging Face Transformers |
|
|
- **Author:** S. Vercoutere |
|
|
|
|
|
## Intended Use |
|
|
- **Purpose:** Automatic classification of legal entities (e.g., location, date) in municipal or governmental decision documents. |
|
|
- **Not Intended For:** General-purpose NER, non-legal domains, or tasks outside entity classification. |
|
|
|
|
|
## Training Data |
|
|
- **Source:** Annotated legal decision texts from Ghent/Freiburg/Bamberg. |
|
|
- **Entity Types:**
  - Locations: `impact_location`, `context_location`
  - Dates: `publication_date`, `session_date`, `entry_date`, `expiry_date`, `legal_date`, `context_date`, `validity_period`, `context_period`
|
|
- **Preprocessing:**
  - Entities in the source texts are wrapped in XML-like tags: `<entity_type>...</entity_type>`.
  - For training, each sample marks exactly one entity with `[E] ... [/E]`, keeping the surrounding text as context.
  - The dataset is balanced to at most 5,000 samples per label.
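A minimal sketch of this preprocessing step (the function name and regex are illustrative, not the project's actual code): each tagged entity yields one training sample in which that entity is wrapped in `[E] ... [/E]` and all other tags are stripped.

```python
import re

def make_samples(xml_text):
    """Turn a text with <entity_type>...</entity_type> tags into
    (marked_text, label) pairs: one sample per entity, with the
    target entity wrapped in [E] ... [/E] and all tags removed."""
    tag_re = re.compile(r"<(?P<label>\w+)>(?P<span>.*?)</(?P=label)>")
    entities = list(tag_re.finditer(xml_text))
    samples = []
    for i, target in enumerate(entities):
        parts, last = [], 0
        for j, m in enumerate(entities):
            parts.append(xml_text[last:m.start()])
            span = m.group("span")
            # Mark only the target entity; keep the others as plain text.
            parts.append(f"[E] {span} [/E]" if j == i else span)
            last = m.end()
        parts.append(xml_text[last:])
        samples.append(("".join(parts), target.group("label")))
    return samples
```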
|
|
|
|
|
## Training Procedure |
|
|
- **Base Model:** `allenai/longformer`
|
|
- **Tokenization:** Hugging Face AutoTokenizer, with `[E]` and `[/E]` as additional special tokens. |
|
|
- **Max Sequence Length:** 2048 tokens (used during training)
|
|
- **Batch Size:** 4 |
|
|
- **Optimizer:** AdamW |
|
|
- **Learning Rate:** 2e-5 |
|
|
- **Epochs:** 10 |
|
|
- **Mixed Precision:** Yes (AMP) |
|
|
- **Validation Split:** 20% |
|
|
- **Evaluation Metrics:** Accuracy, F1, confusion matrix |
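The loop behind these settings can be sketched roughly as follows (the function, loader, and variable names are assumptions for illustration, not the actual training script):

```python
import torch
from torch.optim import AdamW

def train_one_epoch(model, loader, optimizer, device="cuda", use_amp=True):
    """One epoch of mixed-precision (AMP) training with the settings
    listed above. `loader` yields dicts of tensors including a `labels`
    key; the model is any module whose output exposes `.loss`, as
    Hugging Face sequence-classification models do."""
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    model.train()
    total_loss = 0.0
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        with torch.autocast(device_type=device, enabled=use_amp):
            loss = model(**batch).loss
        # Scale the loss to avoid underflow in fp16 gradients.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        total_loss += loss.item()
    return total_loss / max(len(loader), 1)
```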
|
|
|
|
|
## Evaluation |
|
|
|
|
|
**Validation Accuracy:** 0.8454 on the held-out validation split
|
|
|
|
|
**Detailed Entity-Level Evaluation:** |
|
|
|
|
|
| Entity Label | Precision | Recall | F1-score | Support | |
|
|
| ---------------- | --------- | ------ | ---------- | ------- | |
|
|
| context_date | 0.9272 | 0.9405 | 0.9338 | 975 | |
|
|
| context_location | 0.9671 | 0.9751 | 0.9711 | 843 | |
|
|
| context_period | 0.9744 | 0.8321 | 0.8976 | 137 | |
|
|
| entry_date | 0.9528 | 0.9587 | 0.9557 | 484 | |
|
|
| expiry_date | 0.8980 | 0.9496 | 0.9231 | 139 | |
|
|
| impact_location | 0.9501 | 0.9559 | 0.9530 | 997 | |
|
|
| legal_date | 1.0000 | 0.9926 | 0.9963 | 943 | |
|
|
| publication_date | 0.9501 | 0.9870 | 0.9682 | 386 | |
|
|
| session_date | 0.9597 | 0.9597 | 0.9597 | 347 | |
|
|
| validity_period | 0.9932 | 0.9379 | 0.9648 | 467 | |
|
|
| **accuracy** | | | **0.9601** | 5718 | |
|
|
| **macro avg** | 0.9572 | 0.9489 | 0.9523 | 5718 | |
|
|
| **weighted avg** | 0.9606 | 0.9601 | 0.9601 | 5718 | |
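The per-label figures above are standard precision, recall, and F1. For reference, they can be computed from parallel lists of gold and predicted label strings with a small helper like this (a self-contained sketch, not the project's evaluation code):

```python
from collections import Counter

def per_label_scores(y_true, y_pred):
    """Per-label precision/recall/F1/support from parallel lists of
    gold and predicted label strings."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted label counts a false positive
            fn[gold] += 1  # gold label counts a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1,
                         "support": tp[label] + fn[label]}
    return scores
```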
|
|
|
|
|
|
|
|
## Usage Example |
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("svercoutere/longformer-classifier-refinement-abb")
model = AutoModelForSequenceClassification.from_pretrained("svercoutere/longformer-classifier-refinement-abb")
model.eval()

def classify_entity(entity_text, context_text):
    # Wrap the first occurrence of the entity in the [E] ... [/E]
    # markers the model was trained on.
    marked_text = context_text.replace(entity_text, f"[E] {entity_text} [/E]", 1)
    inputs = tokenizer(marked_text, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        outputs = model(**inputs)
    pred = torch.argmax(outputs.logits, dim=-1).item()
    return pred  # Map the index to a label name, e.g. via label_encoder.classes_
```
|
|
|
|
|
## Limitations & Bias |
|
|
- The model is trained on legal texts from specific municipalities and may not generalize to other domains or languages. |
|
|
- Only entity types present in the training data are supported. |
|
|
- The model expects entities to be marked with `[E] ... [/E]` in the input. |
|
|
|
|
|
## Citation |
|
|
If you use this model, please cite: |
|
|
|
|
|
``` |
|
|
@misc{longformer-classifier-refinement-abb, |
|
|
author = {S. Vercoutere}, |
|
|
title = {Longformer Entity Refinement}, |
|
|
year = {2026}, |
|
|
howpublished = {\url{https://huggingface.co/svercoutere/longformer-classifier-refinement-abb}} |
|
|
} |
|
|
``` |
|
|
|
|
|
|