---
library_name: transformers
license: mit
base_model: FacebookAI/xlm-roberta-large
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: xlm_roberta_large_linsearch_classification
  results: []
---
# xlm_roberta_large_linsearch_classification
This model was developed for Subtask 1 of the [LLMs4Subjects](https://sites.google.com/view/llms4subjects-germeval/home) shared task, which asks systems to classify German and English documents by subject domain.
It is fine-tuned to classify documents into 28 predefined domains of the [LinSearch domain-specific taxonomy](https://terminology.tib.eu/ts/ontologies/linsearch) (Fachsystematik LinSearch).
The task is a multi-class, multi-label, multilingual classification task, trained and evaluated on the TIBKAT dataset (see [TIB Open Data Services](https://www.tib.eu/en/services/open-data)).
The model took 1st place in the subtask, with a macro F1 score of 0.653 on the test set, as evaluated by the organisers on [CodaBench](https://www.codabench.org/competitions/8373/#/results-tab).
This model is a fine-tuned version of [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) on the TIBKAT dataset provided by the organisers of the shared task.
It achieves the following results on the evaluation set (generated automatically during fine-tuning):
- Loss: 1.3318
- Accuracy: 0.5931
- F1 Macro: 0.5650
- Precision Macro: 0.5755
- Recall Macro: 0.5602
![Test Set Evaluation](./subtask1_quantitative.png)
**Note:** The model scores higher on the test set than on the evaluation set used during training because the evaluation during training compares each prediction against a single gold label, rather than against all gold labels of a document.
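The macro-averaged metrics listed above weight every domain equally, regardless of how many documents it contains. The following is a minimal sketch of that averaging in plain Python; the function name `macro_scores` and the label codes are placeholders chosen for illustration, and the card does not state which library computed the reported numbers.

```python
from collections import Counter


def macro_scores(gold, pred):
    """Return (precision, recall, f1), each averaged with equal weight per label."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical gold and predicted labels, one label per example
gold = ["mat", "mat", "phy", "inf"]
pred = ["mat", "phy", "phy", "inf"]
precision, recall, f1 = macro_scores(gold, pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because each label contributes equally, a rare domain that is classified poorly lowers the macro score as much as a frequent one would.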
## Intended uses & limitations
## Training and evaluation data
- Training set: https://huggingface.co/datasets/ubffm/linsearch_train_data
- Development set: https://huggingface.co/datasets/ubffm/linsearch_dev_data
- Preprocessing: Multi-label entries were mapped to one-to-one document-to-label pairs by duplicating each entry once per label
- Size: 135k examples (116k training set, 18.7k development set)
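The preprocessing step above can be sketched as follows. This is an illustration only: the field names `text`, `labels`, and `label`, the helper `explode_labels`, and the sample records are assumptions, not the actual schema of the published datasets.

```python
def explode_labels(records):
    """Turn each multi-label record into one single-label record per label."""
    rows = []
    for record in records:
        for label in record["labels"]:
            rows.append({"text": record["text"], "label": label})
    return rows


# Hypothetical input: the second document carries two domain labels
records = [
    {"text": "Einführung in die Quantenmechanik", "labels": ["phy"]},
    {"text": "Machine learning for materials science", "labels": ["inf", "mat"]},
]

pairs = explode_labels(records)
for pair in pairs:
    print(pair["label"], "|", pair["text"])
```

After this transformation the same text can appear several times with different labels, which is why a single-label evaluation counts at most one of those copies as correct for any given prediction.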
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- optimizer: adamw_torch with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10
- mixed_precision_training: Native AMP
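To make the batch size and scheduler settings concrete, the sketch below recomputes the effective batch size and evaluates a linear warmup-then-decay schedule in plain Python. It assumes a single device (the card lists no device count) and takes the total of 70240 optimizer steps from the last row of the training results table; the rounding of the warmup length is also an assumption and may differ slightly from what the Trainer does internally.

```python
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4
NUM_DEVICES = 1        # assumption: not stated in the card
BASE_LR = 2e-5
WARMUP_RATIO = 0.1
TOTAL_STEPS = 70240    # final optimizer step reported under "Training results"

# Examples consumed per optimizer update
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES

# Number of steps over which the learning rate ramps up from zero
warmup_steps = round(TOTAL_STEPS * WARMUP_RATIO)


def linear_lr(step):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    remaining = TOTAL_STEPS - step
    return BASE_LR * max(0.0, remaining / (TOTAL_STEPS - warmup_steps))


print("effective batch size:", effective_batch)
print("warmup steps:", warmup_steps)
for step in (0, warmup_steps // 2, warmup_steps, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, linear_lr(step))
```

Under these assumptions the warmup covers roughly the first epoch, after which the learning rate decays linearly for the remaining nine.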
### Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 Macro | Precision Macro | Recall Macro |
|:-------------:|:------:|:-----:|:---------------:|:--------:|:--------:|:---------------:|:------------:|
| 1.3362 | 0.9999 | 7024 | 1.3026 | 0.5520 | 0.4904 | 0.5266 | 0.4959 |
| 1.2241 | 1.9999 | 14048 | 1.2036 | 0.5773 | 0.5348 | 0.5535 | 0.5395 |
| 1.1337 | 2.9999 | 21072 | 1.1903 | 0.5760 | 0.5303 | 0.5466 | 0.5278 |
| 1.069 | 3.9999 | 28096 | 1.1570 | 0.5876 | 0.5418 | 0.5564 | 0.5439 |
| 0.9832 | 4.9999 | 35120 | 1.1723 | 0.5872 | 0.5461 | 0.5570 | 0.5461 |
| 0.9197 | 5.9999 | 42144 | 1.1752 | 0.5871 | 0.5455 | 0.5456 | 0.5572 |
| 0.8278 | 6.9999 | 49168 | 1.2078 | 0.5928 | 0.5537 | 0.5616 | 0.5589 |
| 0.7568 | 7.9999 | 56192 | 1.2398 | 0.5923 | 0.5563 | 0.5612 | 0.5568 |
| 0.6863 | 8.9999 | 63216 | 1.2840 | 0.5937 | 0.5556 | 0.5723 | 0.5465 |
| 0.63 | 9.9999 | 70240 | 1.3318 | 0.5931 | 0.5650 | 0.5755 | 0.5602 |
### Framework versions
- Transformers 4.48.0
- Pytorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0
## How to use
### With the pipeline API
<pre>
from transformers import pipeline

classifier = pipeline("text-classification", model="ubffm/xlm_roberta_large_linsearch_classification")
classifier("Your input text here")
</pre>
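### With the tokenizer and model classes
<pre>
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("ubffm/xlm_roberta_large_linsearch_classification")
model = AutoModelForSequenceClassification.from_pretrained("ubffm/xlm_roberta_large_linsearch_classification")
inputs = tokenizer("Your input text here", return_tensors="pt", truncation=True)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
# Map the highest-scoring logit to its label name
predicted_id = outputs.logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_id])
</pre>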
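Because the shared task is multi-label, a ranked shortlist of domains can be more useful than a single prediction. The sketch below turns a vector of logits (such as `outputs.logits[0].tolist()` from the model call above) into the top-k labels with probabilities. It assumes the classification head was trained with a softmax cross-entropy objective, which the single-label preprocessing suggests but the card does not state; the `id2label` mapping, the logits, and the helper `top_k_labels` are hypothetical.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs under a softmax."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical mapping and logits; the real model has 28 labels
id2label = {0: "inf", 1: "mat", 2: "phy", 3: "che"}
logits = [0.2, 2.1, 1.7, -1.0]

top = top_k_labels(logits, id2label, k=2)
for label, prob in top:
    print(f"{label}: {prob:.3f}")
```

How many labels to keep, or which probability threshold to apply, is a design choice that would need tuning on the development set.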
## Contact and Citation
<pre>
@inproceedings{ho-2025-ubffm,
title = "{UBFFM} at the {G}erm{E}val-2025 {LLM}s4{S}ubjects Task: What if we take ``You are an expert in subject indexing'' seriously?",
author = "Ho, Clara Wan Ching",
editor = "Wartena, Christian and
Heid, Ulrich",
booktitle = "Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Workshops",
month = sep,
year = "2025",
address = "Hannover, Germany",
publisher = "HsH Applied Academics",
url = "https://aclanthology.org/2025.konvens-2.44/",
pages = "471--478"
}
</pre>
<pre>
@misc{ubffm/xlm_roberta_large_linsearch_classification,
title={xlm_roberta_large_linsearch_classification},
author={UBFFM},
year={2025},
howpublished={\url{https://huggingface.co/ubffm/xlm_roberta_large_linsearch_classification/}},
}
</pre>
Contact email: info@linguistik.de