---
language:
- ca
- es
multilinguality:
- multilingual
pretty_name: NERCat
tags:
- NER
- Catalan
- NLP
- television transcriptions
- manual annotation
- GLiNER
task_categories:
- text-classification
- token-classification
task_ids:
- multi-label-classification
- named-entity-recognition
license: apache-2.0
datasets:
- Ugiat/ner-cat
---
# NERCat Classifier

## Model Overview

NERCat is a fine-tuned version of Knowledgator's GLiNER model, designed specifically for Named Entity Recognition (NER) in Catalan. Trained on a manually annotated dataset of Catalan-language television transcriptions, it substantially improves recognition of named entities across diverse categories, addressing the scarcity of high-quality training data for Catalan.

The pre-trained checkpoint used for fine-tuning was `knowledgator/gliner-bi-large-v1.0`.
## Quickstart

```py
import torch
from gliner import GLiNER

# Use a GPU when available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GLiNER.from_pretrained("Ugiat/NERCat").to(device)

text = "La Universitat de Barcelona és una de les institucions educatives més importants de Catalunya."

# The eight entity categories the model was fine-tuned on.
labels = [
    "Person",
    "Facility",
    "Organization",
    "Location",
    "Product",
    "Event",
    "Date",
    "Law",
]

entities = model.predict_entities(text, labels, threshold=0.5)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
```
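
The `threshold` argument is the confidence cutoff for returned spans; lowering it trades precision for recall. Because GLiNER scores candidate spans against free-text label prompts, the `labels` list can be changed at inference time, though this model was fine-tuned on the eight categories shown above.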

## Performance Evaluation

We evaluated the fine-tuned NERCat model against the baseline GLiNER model on a manually annotated evaluation set of 100 sentences. NERCat improves F1 in every entity category (precision dips slightly only for Date):

| Entity Type  | NERCat Precision | NERCat Recall | NERCat F1 | GLiNER Precision | GLiNER Recall | GLiNER F1 | Δ Precision | Δ Recall | Δ F1  |
|--------------|------------------|---------------|-----------|------------------|---------------|-----------|-------------|----------|-------|
| Person       | 1.00 | 1.00 | 1.00 | 0.92 | 0.80 | 0.86 | +0.08 | +0.20 | +0.14 |
| Facility     | 0.89 | 1.00 | 0.94 | 0.67 | 0.25 | 0.36 | +0.22 | +0.75 | +0.58 |
| Organization | 1.00 | 1.00 | 1.00 | 0.72 | 0.62 | 0.67 | +0.28 | +0.38 | +0.33 |
| Location     | 1.00 | 0.97 | 0.99 | 0.83 | 0.54 | 0.66 | +0.17 | +0.43 | +0.33 |
| Product      | 0.96 | 1.00 | 0.98 | 0.63 | 0.21 | 0.31 | +0.34 | +0.79 | +0.67 |
| Event        | 0.88 | 0.88 | 0.88 | 0.60 | 0.38 | 0.46 | +0.28 | +0.50 | +0.41 |
| Date         | 0.88 | 1.00 | 0.93 | 1.00 | 0.07 | 0.13 | -0.13 | +0.93 | +0.80 |
| Law          | 0.67 | 1.00 | 0.80 | 0.00 | 0.00 | 0.00 | +0.67 | +1.00 | +0.80 |
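
The exact scoring criterion is not spelled out above; purely as an illustration, here is a minimal sketch of per-category precision, recall, and F1 under exact span-and-label matching (the `(start, end, label)` tuple representation is an assumption, not the evaluation code behind the table):

```py
from collections import Counter

def per_label_scores(gold_sentences, pred_sentences, labels):
    """Per-label precision/recall/F1 with exact (start, end, label) matching."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_set, pred_set = set(gold), set(pred)
        for span in pred_set:  # predicted spans are hits or false alarms
            (tp if span in gold_set else fp)[span[2]] += 1
        for span in gold_set - pred_set:  # unmatched gold spans are misses
            fn[span[2]] += 1
    scores = {}
    for label in labels:
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores
```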

## Fine-Tuning Process

The fine-tuning process followed a structured approach covering dataset preparation, model training, and optimization (a configuration sketch follows the list):

- **Data Splitting:** The dataset was shuffled and split into training (90%) and testing (10%) subsets.
- **Training Setup:**
  - Batch size: 8
  - Steps: 500
  - Loss function: focal loss (α = 0.75, γ = 2) to address class imbalance
  - Learning rates:
    - Entity layers: $5 \times 10^{-6}$
    - Other model parameters: $1 \times 10^{-5}$
  - Scheduler: linear with a warmup ratio of 0.1
  - Evaluation frequency: every 100 steps
  - Checkpointing: every 1000 steps
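
GLiNER's own training utilities handle this configuration; purely as a hedged illustration of the pieces named above (focal loss with α = 0.75 and γ = 2, separate learning rates for the entity layers, and a linear schedule with 10% warmup), a plain-PyTorch sketch could look like this. The parameter-name split is an assumption, not GLiNER's actual module layout:

```py
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    # Focal loss scales binary cross-entropy by (1 - p_t)^gamma so easy
    # examples contribute less, and weights positives by alpha.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def build_optimizer_and_scheduler(model, total_steps=500, warmup_ratio=0.1):
    # Split parameters into "entity layers" and the rest; the name filter
    # below is an illustrative assumption about the module names.
    entity, other = [], []
    for name, param in model.named_parameters():
        (entity if "span" in name else other).append(param)
    optimizer = torch.optim.AdamW([
        {"params": entity, "lr": 5e-6},  # entity layers
        {"params": other, "lr": 1e-5},   # other model parameters
    ])
    warmup = int(warmup_ratio * total_steps)
    def lr_lambda(step):
        # Linear warmup over the first 10% of steps, then linear decay to 0.
        if step < warmup:
            return step / max(1, warmup)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup))
    return optimizer, torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```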

The dataset included 13,732 named entity instances across the same eight categories used above: Person, Facility, Organization, Location, Product, Event, Date, and Law.
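
The data itself is published as `Ugiat/ner-cat` (linked in the metadata above). A minimal sketch of loading it and reproducing the shuffled 90/10 split, assuming a default `train` split and an arbitrary seed:

```py
from datasets import load_dataset

# Load the NERCat dataset; the split name and schema are assumptions here,
# so check the dataset card for the exact fields.
ds = load_dataset("Ugiat/ner-cat")

# Shuffled 90% train / 10% test split, as described above (seed is arbitrary).
split = ds["train"].train_test_split(test_size=0.1, shuffle=True, seed=42)
print(split["train"].num_rows, split["test"].num_rows)
```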

## Other

### Citation Information

```
@misc{cadevall2025nercat,
  title         = {NERCat: Fine-Tuning for Enhanced Named Entity Recognition in Catalan},
  author        = {Guillem Cadevall Ferreres and Marc Bardeli Gámez and Marc Serrano Sanz and Pol Gerdt Basullas and Francesc Tarres Ruiz and Raul Quijada Ferrero},
  year          = {2025},
  eprint        = {2503.14173},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2503.14173}
}
```