---
library_name: transformers
tags:
- transformers
- pytorch
- bert
- legal-domain
- entity-classification
- sequence-classification
- NER
- longformer
- token-classification
- label-studio
- english
- fine-tuned
---

# Longformer Legal Entity Classifier

## Overview

A fine-tuned Longformer-based model for classifying legal entities (such as locations and dates) in legal decision texts. The model is based on `allenai/longformer` and is trained to predict the type of a marked entity span, given its context, using the special entity markers `[E] ... [/E]`.

## Model Details

- **Model Name:** longformer-classifier-refinement-abb
- **Architecture:** Longformer (`allenai/longformer`)
- **Task:** Entity classification (NER-style, entity-in-context classification)
- **Framework:** PyTorch, Hugging Face Transformers
- **Author:** S. Vercoutere

## Intended Use

- **Purpose:** Automatic classification of legal entities (e.g., location, date) in municipal or governmental decision documents.
- **Not Intended For:** General-purpose NER, non-legal domains, or tasks outside entity classification.

## Training Data

- **Source:** Annotated legal decision texts from Ghent/Freiburg/Bamberg.
- **Entity Types:**
  - Locations: `impact_location`, `context_location`
  - Dates: `publication_date`, `session_date`, `entry_date`, `expiry_date`, `legal_date`, `context_date`, `validity_period`, `context_period`
- **Preprocessing:**
  - Source texts contain XML-like tags, with entities wrapped in `...`.
  - For training, one entity per sample is marked with `[E] ... [/E]` in context.
  - Dataset balanced to at most 5,000 samples per label.

## Training Procedure

- **Base Model:** `allenai/longformer`
- **Tokenization:** Hugging Face AutoTokenizer, with `[E]` and `[/E]` added as special tokens.
- **Max Sequence Length:** 2048 (trained)
- **Batch Size:** 4
- **Optimizer:** AdamW
- **Learning Rate:** 2e-5
- **Epochs:** 10
- **Mixed Precision:** Yes (AMP)
- **Validation Split:** 20%
- **Evaluation Metrics:** Accuracy, F1, confusion matrix

## Evaluation

**Validation Accuracy:** 0.8454 (on held-out validation set)

**Detailed Entity-Level Evaluation:**

| Entity Label     | Precision | Recall | F1-score   | Support |
| ---------------- | --------- | ------ | ---------- | ------- |
| context_date     | 0.9272    | 0.9405 | 0.9338     | 975     |
| context_location | 0.9671    | 0.9751 | 0.9711     | 843     |
| context_period   | 0.9744    | 0.8321 | 0.8976     | 137     |
| entry_date       | 0.9528    | 0.9587 | 0.9557     | 484     |
| expiry_date      | 0.8980    | 0.9496 | 0.9231     | 139     |
| impact_location  | 0.9501    | 0.9559 | 0.9530     | 997     |
| legal_date       | 1.0000    | 0.9926 | 0.9963     | 943     |
| publication_date | 0.9501    | 0.9870 | 0.9682     | 386     |
| session_date     | 0.9597    | 0.9597 | 0.9597     | 347     |
| validity_period  | 0.9932    | 0.9379 | 0.9648     | 467     |
| **accuracy**     |           |        | **0.9601** | 5718    |
| **macro avg**    | 0.9572    | 0.9489 | 0.9523     | 5718    |
| **weighted avg** | 0.9606    | 0.9601 | 0.9601     | 5718    |

## Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("svercoutere/longformer-classifier-refinement-abb")
model = AutoModelForSequenceClassification.from_pretrained("svercoutere/longformer-classifier-refinement-abb")
model.eval()  # disable dropout for inference

def classify_entity(entity_text, context_text):
    # Mark the first occurrence of the entity with the special tokens.
    marked_text = context_text.replace(entity_text, f"[E] {entity_text} [/E]", 1)
    inputs = tokenizer(marked_text, return_tensors="pt", truncation=True,
                       max_length=2048, padding="max_length")
    with torch.no_grad():
        outputs = model(**inputs)
    pred = torch.argmax(outputs.logits, dim=-1).item()
    return pred  # Map to a label name using label_encoder.classes_
```

## Limitations & Bias

- The model is trained on legal texts from specific municipalities and may not generalize to other domains or languages.
- Only entity types present in the training data are supported.
- The model expects the target entity to be marked with `[E] ... [/E]` in the input.

## Citation

If you use this model, please cite:

```
@misc{longformer-classifier-refinement-abb,
  author       = {S. Vercoutere},
  title        = {Longformer Entity Refinement},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/svercoutere/longformer-classifier-refinement-abb}}
}
```
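The one-entity-per-sample marking described under Training Data can be sketched in a few lines of plain Python. This is a minimal illustration of the `[E] ... [/E]` scheme only; the function names, the label spans, and the example sentence below are assumptions for demonstration, not the released preprocessing code.

```python
def mark_entity(text, start, end):
    """Wrap the character span [start, end) with the [E] ... [/E] markers
    used during fine-tuning (one marked entity per training sample)."""
    return text[:start] + "[E] " + text[start:end] + " [/E]" + text[end:]

def build_samples(text, entities):
    """entities: list of (start, end, label) character spans.
    Returns one (marked_text, label) training pair per entity."""
    return [(mark_entity(text, s, e), label) for s, e, label in entities]

# Hypothetical annotated sentence with two entity spans.
text = "The session of 12 May 2021 approved works at Korenmarkt."
samples = build_samples(text, [(15, 26, "session_date"),
                               (45, 55, "impact_location")])
# samples[0] → ("The session of [E] 12 May 2021 [/E] approved works at Korenmarkt.",
#               "session_date")
```

Each marked text would then be tokenized (with `[E]`/`[/E]` registered as special tokens) and paired with its label for sequence classification.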