---
language:
- ca
- es
multilinguality:
- multilingual
pretty_name: NERCat
tags:
- NER
- Catalan
- NLP
- television transcriptions
- manual annotation
- GLiNER
task_categories:
- text-classification
- token-classification
task_ids:
- multi-label-classification
- named-entity-recognition
license: apache-2.0
datasets:
- Ugiat/ner-cat
---
# NERCat Classifier

## Model Overview

NERCat is a fine-tuned version of Knowledgator's GLiNER model, designed specifically for Named Entity Recognition (NER) in Catalan. Trained on a manually annotated dataset of Catalan-language television transcriptions, it substantially improves recognition of named entities across diverse categories, addressing the scarcity of high-quality training data for Catalan.

The pre-trained checkpoint used for fine-tuning was `knowledgator/gliner-bi-large-v1.0`.
## Quickstart

```py
import torch
from gliner import GLiNER

# Use a GPU when available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GLiNER.from_pretrained("Ugiat/NERCat").to(device)

text = "La Universitat de Barcelona és una de les institucions educatives més importants de Catalunya."

# The eight entity categories the model was fine-tuned on.
labels = [
    "Person",
    "Facility",
    "Organization",
    "Location",
    "Product",
    "Event",
    "Date",
    "Law",
]

entities = model.predict_entities(text, labels, threshold=0.5)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
```
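
The `threshold` argument is the confidence cutoff for returned spans; lowering it trades precision for recall. Because GLiNER scores candidate spans against free-text label prompts, the `labels` list can be changed at inference time, though this model was fine-tuned on the eight categories shown above.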

## Performance Evaluation

We evaluated the fine-tuned NERCat model against the baseline GLiNER model on a manually annotated evaluation set of 100 sentences. NERCat improves F1 in every entity category (precision dips slightly only for Date):

| Entity Type  | NERCat Precision | NERCat Recall | NERCat F1 | GLiNER Precision | GLiNER Recall | GLiNER F1 | Δ Precision | Δ Recall | Δ F1  |
|--------------|------------------|---------------|-----------|------------------|---------------|-----------|-------------|----------|-------|
| Person       | 1.00 | 1.00 | 1.00 | 0.92 | 0.80 | 0.86 | +0.08 | +0.20 | +0.14 |
| Facility     | 0.89 | 1.00 | 0.94 | 0.67 | 0.25 | 0.36 | +0.22 | +0.75 | +0.58 |
| Organization | 1.00 | 1.00 | 1.00 | 0.72 | 0.62 | 0.67 | +0.28 | +0.38 | +0.33 |
| Location     | 1.00 | 0.97 | 0.99 | 0.83 | 0.54 | 0.66 | +0.17 | +0.43 | +0.33 |
| Product      | 0.96 | 1.00 | 0.98 | 0.63 | 0.21 | 0.31 | +0.34 | +0.79 | +0.67 |
| Event        | 0.88 | 0.88 | 0.88 | 0.60 | 0.38 | 0.46 | +0.28 | +0.50 | +0.41 |
| Date         | 0.88 | 1.00 | 0.93 | 1.00 | 0.07 | 0.13 | -0.13 | +0.93 | +0.80 |
| Law          | 0.67 | 1.00 | 0.80 | 0.00 | 0.00 | 0.00 | +0.67 | +1.00 | +0.80 |
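
The exact scoring criterion is not spelled out above; purely as an illustration, here is a minimal sketch of per-category precision, recall, and F1 under exact span-and-label matching (the `(start, end, label)` tuple representation is an assumption, not the evaluation code behind the table):

```py
from collections import Counter

def per_label_scores(gold_sentences, pred_sentences, labels):
    """Per-label precision/recall/F1 with exact (start, end, label) matching."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_set, pred_set = set(gold), set(pred)
        for span in pred_set:  # predicted spans are hits or false alarms
            (tp if span in gold_set else fp)[span[2]] += 1
        for span in gold_set - pred_set:  # unmatched gold spans are misses
            fn[span[2]] += 1
    scores = {}
    for label in labels:
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f1}
    return scores
```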

## Fine-Tuning Process

The fine-tuning process followed a structured approach covering dataset preparation, model training, and optimization (a configuration sketch follows the list):

- **Data Splitting:** The dataset was shuffled and split into training (90%) and testing (10%) subsets.
- **Training Setup:**
  - Batch size: 8
  - Steps: 500
  - Loss function: focal loss (α = 0.75, γ = 2) to address class imbalance
  - Learning rates:
    - Entity layers: $5 \times 10^{-6}$
    - Other model parameters: $1 \times 10^{-5}$
  - Scheduler: linear with a warmup ratio of 0.1
  - Evaluation frequency: every 100 steps
  - Checkpointing: every 1000 steps
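
GLiNER's own training utilities handle this configuration; purely as a hedged illustration of the pieces named above (focal loss with α = 0.75 and γ = 2, separate learning rates for the entity layers, and a linear schedule with 10% warmup), a plain-PyTorch sketch could look like this. The parameter-name split is an assumption, not GLiNER's actual module layout:

```py
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    # Focal loss scales binary cross-entropy by (1 - p_t)^gamma so easy
    # examples contribute less, and weights positives by alpha.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def build_optimizer_and_scheduler(model, total_steps=500, warmup_ratio=0.1):
    # Split parameters into "entity layers" and the rest; the name filter
    # below is an illustrative assumption about the module names.
    entity, other = [], []
    for name, param in model.named_parameters():
        (entity if "span" in name else other).append(param)
    optimizer = torch.optim.AdamW([
        {"params": entity, "lr": 5e-6},  # entity layers
        {"params": other, "lr": 1e-5},   # other model parameters
    ])
    warmup = int(warmup_ratio * total_steps)
    def lr_lambda(step):
        # Linear warmup over the first 10% of steps, then linear decay to 0.
        if step < warmup:
            return step / max(1, warmup)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup))
    return optimizer, torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```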

The dataset included 13,732 named entity instances across the same eight categories used above: Person, Facility, Organization, Location, Product, Event, Date, and Law.
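
The data itself is published as `Ugiat/ner-cat` (linked in the metadata above). A minimal sketch of loading it and reproducing the shuffled 90/10 split, assuming a default `train` split and an arbitrary seed:

```py
from datasets import load_dataset

# Load the NERCat dataset; the split name and schema are assumptions here,
# so check the dataset card for the exact fields.
ds = load_dataset("Ugiat/ner-cat")

# Shuffled 90% train / 10% test split, as described above (seed is arbitrary).
split = ds["train"].train_test_split(test_size=0.1, shuffle=True, seed=42)
print(split["train"].num_rows, split["test"].num_rows)
```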

## Other

### Citation Information

```
@misc{cadevall2025nercat,
  title         = {NERCat: Fine-Tuning for Enhanced Named Entity Recognition in Catalan},
  author        = {Guillem Cadevall Ferreres and Marc Bardeli Gámez and Marc Serrano Sanz and Pol Gerdt Basullas and Francesc Tarres Ruiz and Raul Quijada Ferrero},
  year          = {2025},
  eprint        = {2503.14173},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2503.14173}
}
```