|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- transformers |
|
|
- pytorch |
|
|
- bert |
|
|
- legal-domain |
|
|
- entity-classification |
|
|
- sequence-classification |
|
|
- NER |
|
|
- longformer |
|
|
- token-classification |
|
|
- label-studio |
|
|
- english |
|
|
- fine-tuned |
|
|
--- |
|
|
|
|
|
|
|
# Longformer Legal Entity Classifier
|
|
|
|
|
## Overview |
|
|
A fine-tuned Longformer model for classifying legal entities (such as locations and dates) within legal decision texts.

Based on `allenai/longformer`, the model predicts the type of a marked entity span from its surrounding context, using the special entity markers `[E] ... [/E]`.
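As an illustration (the sentence below is invented, not taken from the training data), an input with one marked entity looks like this:

```python
# Hypothetical input: the span between [E] and [/E] is the entity to
# classify; the rest of the document serves as context.
marked = (
    "The council of [E] Ghent [/E] approved the parking measure "
    "during its session of 12 March 2024."
)
```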
|
|
|
|
|
## Model Details |
|
|
- **Model Name:** longformer-classifier-refinement-abb |
|
|
- **Architecture:** Longformer (allenai/longformer) |
|
|
- **Task:** Entity Classification (NER-style, entity-in-context classification) |
|
|
- **Framework:** PyTorch, Hugging Face Transformers |
|
|
- **Author:** S. Vercoutere |
|
|
|
|
|
## Intended Use |
|
|
- **Purpose:** Automatic classification of legal entities (e.g., location, date) in municipal or governmental decision documents. |
|
|
- **Not Intended For:** General-purpose NER, non-legal domains, or tasks outside entity classification. |
|
|
|
|
|
## Training Data |
|
|
- **Source:** Annotated legal decision texts from Ghent/Freiburg/Bamberg. |
|
|
- **Entity Types:**
  - Locations: `impact_location`, `context_location`
  - Dates: `publication_date`, `session_date`, `entry_date`, `expiry_date`, `legal_date`, `context_date`, `validity_period`, `context_period`
|
|
- **Preprocessing:**
  - Entities in the source texts are wrapped in XML-like tags: `<entity_type>...</entity_type>`.
  - For training, each sample marks exactly one entity with `[E] ... [/E]`, keeping the surrounding text as context.
  - The dataset is balanced to at most 5,000 samples per label.
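A minimal sketch of this preprocessing step (the function name and regex are illustrative, not the project's actual code): each tagged entity yields one training sample in which that entity is wrapped in `[E] ... [/E]` and all other tags are stripped.

```python
import re

def make_samples(xml_text):
    """Turn a text with <entity_type>...</entity_type> tags into
    (marked_text, label) pairs: one sample per entity, with the
    target entity wrapped in [E] ... [/E] and all tags removed."""
    tag_re = re.compile(r"<(?P<label>\w+)>(?P<span>.*?)</(?P=label)>")
    entities = list(tag_re.finditer(xml_text))
    samples = []
    for i, target in enumerate(entities):
        parts, last = [], 0
        for j, m in enumerate(entities):
            parts.append(xml_text[last:m.start()])
            span = m.group("span")
            # Mark only the target entity; keep the others as plain text.
            parts.append(f"[E] {span} [/E]" if j == i else span)
            last = m.end()
        parts.append(xml_text[last:])
        samples.append(("".join(parts), target.group("label")))
    return samples
```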
|
|
|
|
|
## Training Procedure |
|
|
- **Base Model:** `allenai/longformer`
|
|
- **Tokenization:** Hugging Face AutoTokenizer, with `[E]` and `[/E]` as additional special tokens. |
|
|
- **Max Sequence Length:** 2048 tokens (used during training)
|
|
- **Batch Size:** 4 |
|
|
- **Optimizer:** AdamW |
|
|
- **Learning Rate:** 2e-5 |
|
|
- **Epochs:** 10 |
|
|
- **Mixed Precision:** Yes (AMP) |
|
|
- **Validation Split:** 20% |
|
|
- **Evaluation Metrics:** Accuracy, F1, confusion matrix |
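The loop behind these settings can be sketched roughly as follows (the function, loader, and variable names are assumptions for illustration, not the actual training script):

```python
import torch
from torch.optim import AdamW

def train_one_epoch(model, loader, optimizer, device="cuda", use_amp=True):
    """One epoch of mixed-precision (AMP) training with the settings
    listed above. `loader` yields dicts of tensors including a `labels`
    key; the model is any module whose output exposes `.loss`, as
    Hugging Face sequence-classification models do."""
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    model.train()
    total_loss = 0.0
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        with torch.autocast(device_type=device, enabled=use_amp):
            loss = model(**batch).loss
        # Scale the loss to avoid underflow in fp16 gradients.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        total_loss += loss.item()
    return total_loss / max(len(loader), 1)
```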
|
|
|
|
|
## Evaluation |
|
|
|
|
|
**Validation Accuracy:** 0.8454 on the held-out validation split
|
|
|
|
|
**Detailed Entity-Level Evaluation:** |
|
|
|
|
|
| Entity Label | Precision | Recall | F1-score | Support | |
|
|
| ---------------- | --------- | ------ | ---------- | ------- | |
|
|
| context_date | 0.9272 | 0.9405 | 0.9338 | 975 | |
|
|
| context_location | 0.9671 | 0.9751 | 0.9711 | 843 | |
|
|
| context_period | 0.9744 | 0.8321 | 0.8976 | 137 | |
|
|
| entry_date | 0.9528 | 0.9587 | 0.9557 | 484 | |
|
|
| expiry_date | 0.8980 | 0.9496 | 0.9231 | 139 | |
|
|
| impact_location | 0.9501 | 0.9559 | 0.9530 | 997 | |
|
|
| legal_date | 1.0000 | 0.9926 | 0.9963 | 943 | |
|
|
| publication_date | 0.9501 | 0.9870 | 0.9682 | 386 | |
|
|
| session_date | 0.9597 | 0.9597 | 0.9597 | 347 | |
|
|
| validity_period | 0.9932 | 0.9379 | 0.9648 | 467 | |
|
|
| **accuracy** | | | **0.9601** | 5718 | |
|
|
| **macro avg** | 0.9572 | 0.9489 | 0.9523 | 5718 | |
|
|
| **weighted avg** | 0.9606 | 0.9601 | 0.9601 | 5718 | |
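The per-label figures above are standard precision, recall, and F1. For reference, they can be computed from parallel lists of gold and predicted label strings with a small helper like this (a self-contained sketch, not the project's evaluation code):

```python
from collections import Counter

def per_label_scores(y_true, y_pred):
    """Per-label precision/recall/F1/support from parallel lists of
    gold and predicted label strings."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted label counts a false positive
            fn[gold] += 1  # gold label counts a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1,
                         "support": tp[label] + fn[label]}
    return scores
```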
|
|
|
|
|
|
|
|
## Usage Example |
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("svercoutere/longformer-classifier-refinement-abb")
model = AutoModelForSequenceClassification.from_pretrained("svercoutere/longformer-classifier-refinement-abb")
model.eval()

def classify_entity(entity_text, context_text):
    # Wrap the first occurrence of the entity in the [E] ... [/E]
    # markers the model was trained on.
    marked_text = context_text.replace(entity_text, f"[E] {entity_text} [/E]", 1)
    inputs = tokenizer(marked_text, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        outputs = model(**inputs)
    pred = torch.argmax(outputs.logits, dim=-1).item()
    return pred  # Map the index to a label name, e.g. via label_encoder.classes_
```
|
|
|
|
|
## Limitations & Bias |
|
|
- The model is trained on legal texts from specific municipalities and may not generalize to other domains or languages. |
|
|
- Only entity types present in the training data are supported. |
|
|
- The model expects entities to be marked with `[E] ... [/E]` in the input. |
|
|
|
|
|
## Citation |
|
|
If you use this model, please cite: |
|
|
|
|
|
``` |
|
|
@misc{longformer-classifier-refinement-abb, |
|
|
author = {S. Vercoutere}, |
|
|
title = {Longformer Entity Refinement}, |
|
|
year = {2026}, |
|
|
howpublished = {\url{https://huggingface.co/svercoutere/longformer-classifier-refinement-abb}} |
|
|
} |
|
|
``` |
|
|
|
|
|
|