	---
	library_name: transformers
	license: mit
	base_model: FacebookAI/xlm-roberta-large
	tags:
	- generated_from_trainer
	metrics:
	- accuracy
	model-index:
	- name: xlm_roberta_large_linsearch_classification
	results: []
	---


	# xlm_roberta_large_linsearch_classification
	This model is a submission to Subtask 1 of the shared task LLMs4Subjects (https://sites.google.com/view/llms4subjects-germeval/home), which asks for the subject domains of German and English documents.
	It is fine-tuned to classify documents into 28 predefined domains of the LinSearch domain-specific taxonomy (more about the Fachsystematik LinSearch domains: https://terminology.tib.eu/ts/ontologies/linsearch).
	The task is a multi-class, multi-label, multilingual classification task, trained and evaluated on the TIBKAT dataset (more about the TIBKAT dataset at TIB Open Data Services: https://www.tib.eu/en/services/open-data).
	The model placed 1st in the subtask, with a macro F1 score of 0.653 on the test set as evaluated by the organisers on CodaBench (https://www.codabench.org/competitions/8373/#/results-tab).

	This model is a fine-tuned version of [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large), trained on the TIBKAT dataset provided by the organisers of the shared task.
	It achieves the following results on the evaluation set (automatically generated by the fine-tuning process):
	- Loss: 1.3318
	- Accuracy: 0.5931
	- F1 Macro: 0.5650
	- Precision Macro: 0.5755
	- Recall Macro: 0.5602

	![Test Set Evaluation](./subtask1_quantitative.png)
	Note: the model scores higher on the test set than on the evaluation set used during training because the evaluation during training uses only one gold label per example instead of multiple.
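The gap between the two evaluations can be illustrated with a small sketch. It assumes one reading of the note above: during training, a document with several gold labels appears as several rows with one gold label each, so a single prediction cannot be correct on all of them. The function names, the label codes, and the "any gold label counts" rule are hypothetical and are not the organisers' official metric.

```python
def single_label_accuracy(predictions, gold_labels):
    """Accuracy when every row carries exactly one gold label."""
    correct = sum(pred == gold for pred, gold in zip(predictions, gold_labels))
    return correct / len(gold_labels)


def any_label_accuracy(predictions, gold_label_sets):
    """Accuracy when a prediction is correct if it matches any gold label."""
    correct = sum(pred in gold for pred, gold in zip(predictions, gold_label_sets))
    return correct / len(gold_label_sets)


# Hypothetical document with two gold labels, duplicated into two rows.
# The text is identical in both rows, so the prediction is identical too.
row_predictions = ["inf", "inf"]
row_gold = ["inf", "che"]
per_row = single_label_accuracy(row_predictions, row_gold)

# The same document scored once against its full set of gold labels.
doc_predictions = ["inf"]
doc_gold = [{"inf", "che"}]
per_doc = any_label_accuracy(doc_predictions, doc_gold)

print(per_row, per_doc)
```

Under these assumptions the per-row score for this document cannot exceed one half, while the per-document score can reach one, which is consistent with the lower numbers reported during training.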

	## Intended uses & limitations


	## Training and evaluation data

	- Training set: https://huggingface.co/datasets/ubffm/linsearch_train_data
	- Development set: https://huggingface.co/datasets/ubffm/linsearch_dev_data
	- Preprocessing: multi-label entries are mapped to a 1-to-1 document-to-label format by duplicating each entry once per label
	- Size: 135k examples (116k training set, 18.7k development set)
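The preprocessing step could look roughly like the sketch below. It is a minimal illustration, not the code used for this model: the field names `text`, `labels`, and `label`, the helper `explode_multilabel`, and the example records with their label codes are all assumptions.

```python
from typing import Dict, List


def explode_multilabel(records: List[Dict]) -> List[Dict]:
    """Duplicate each record once per label so every row has one label."""
    rows = []
    for record in records:
        for label in record["labels"]:
            rows.append({"text": record["text"], "label": label})
    return rows


# Hypothetical records; the label codes are placeholders.
records = [
    {"text": "Einführung in die Quantenmechanik", "labels": ["phy"]},
    {"text": "Machine learning for chemistry", "labels": ["inf", "che"]},
]

rows = explode_multilabel(records)
for row in rows:
    print(row)
```

A document with two labels becomes two training rows with the same text, which turns the multi-label problem into single-label classification at training time.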

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 2e-05
	- train_batch_size: 4
	- eval_batch_size: 4
	- seed: 42
	- gradient_accumulation_steps: 4
	- total_train_batch_size: 16
	- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 10
	- mixed_precision_training: Native AMP
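To make two of these settings concrete, the sketch below recomputes the effective batch size and shows how a linear schedule with a warmup ratio of 0.1 would shape the learning rate. It assumes a single device and takes 70,240 as the total number of optimizer steps (the final step in the training results table); the function names are illustrative and this is not the Trainer's own implementation.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of examples contributing to one optimizer step."""
    return per_device * accumulation_steps * num_devices


def linear_warmup_lr(step: int, base_lr: float, total_steps: int, warmup_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * (step / warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / (total_steps - warmup_steps))


BASE_LR = 2e-05
TOTAL_STEPS = 70240                      # assumption: final step in the results table
WARMUP_STEPS = round(0.1 * TOTAL_STEPS)  # warmup ratio 0.1

batch = effective_batch_size(per_device=4, accumulation_steps=4)
print("effective batch size:", batch)
for step in (0, WARMUP_STEPS // 2, WARMUP_STEPS, TOTAL_STEPS):
    print(step, linear_warmup_lr(step, BASE_LR, TOTAL_STEPS, WARMUP_STEPS))
```

With these numbers the warmup would span roughly the first epoch, after which the learning rate decays linearly to zero by the end of epoch 10.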

	### Training results

	| Training Loss | Epoch  | Step  | Validation Loss | Accuracy | F1 Macro | Precision Macro | Recall Macro |
	|:-------------:|:------:|:-----:|:---------------:|:--------:|:--------:|:---------------:|:------------:|
	| 1.3362        | 0.9999 | 7024  | 1.3026          | 0.5520   | 0.4904   | 0.5266          | 0.4959       |
	| 1.2241        | 1.9999 | 14048 | 1.2036          | 0.5773   | 0.5348   | 0.5535          | 0.5395       |
	| 1.1337        | 2.9999 | 21072 | 1.1903          | 0.5760   | 0.5303   | 0.5466          | 0.5278       |
	| 1.069         | 3.9999 | 28096 | 1.1570          | 0.5876   | 0.5418   | 0.5564          | 0.5439       |
	| 0.9832        | 4.9999 | 35120 | 1.1723          | 0.5872   | 0.5461   | 0.5570          | 0.5461       |
	| 0.9197        | 5.9999 | 42144 | 1.1752          | 0.5871   | 0.5455   | 0.5456          | 0.5572       |
	| 0.8278        | 6.9999 | 49168 | 1.2078          | 0.5928   | 0.5537   | 0.5616          | 0.5589       |
	| 0.7568        | 7.9999 | 56192 | 1.2398          | 0.5923   | 0.5563   | 0.5612          | 0.5568       |
	| 0.6863        | 8.9999 | 63216 | 1.2840          | 0.5937   | 0.5556   | 0.5723          | 0.5465       |
	| 0.63          | 9.9999 | 70240 | 1.3318          | 0.5931   | 0.5650   | 0.5755          | 0.5602       |

	### Framework versions

	- Transformers 4.48.0
	- Pytorch 2.5.1+cu124
	- Datasets 3.2.0
	- Tokenizers 0.21.0


	## How to use

	### with pipeline
	<pre>
	from transformers import pipeline

	classifier = pipeline("text-classification", model="ubffm/xlm_roberta_large_linsearch_classification")
	classifier("Your input text here")
	</pre>

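	### with transformers
	<pre>
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	tokenizer = AutoTokenizer.from_pretrained("ubffm/xlm_roberta_large_linsearch_classification")
	model = AutoModelForSequenceClassification.from_pretrained("ubffm/xlm_roberta_large_linsearch_classification")

	# Truncate long documents to the model's maximum input length
	inputs = tokenizer("Your input text here", return_tensors="pt", truncation=True)
	with torch.no_grad():  # inference only, no gradients needed
	    outputs = model(**inputs)

	# Map the highest-scoring logit to its label name
	predicted_label = model.config.id2label[outputs.logits.argmax(dim=-1).item()]
	</pre>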
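Because documents can belong to more than one domain, it may be useful to look at several top-scoring labels rather than only the best one. The sketch below shows one way to turn raw logits into ranked label probabilities using only the standard library. The logits and the `domain_*` label names are hypothetical placeholders; with the real model, `outputs.logits[0].tolist()` and `model.config.id2label` would supply these values.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits and label names for illustration only.
logits = [2.0, 0.5, 1.0, -1.0]
id2label = {0: "domain_a", 1: "domain_b", 2: "domain_c", 3: "domain_d"}

top = top_k_labels(logits, id2label, k=2)
for label, prob in top:
    print(f"{label}: {prob:.3f}")
```

How many labels to keep, or which probability threshold to apply, is a design choice that this model card does not specify.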

	## Contact and Citation
	<pre>
	@inproceedings{ho-2025-ubffm,
	title = "{UBFFM} at the {G}erm{E}val-2025 {LLM}s4{S}ubjects Task: What if we take ``You are an expert in subject indexing'' seriously?",
	author = "Ho, Clara Wan Ching",
	editor = "Wartena, Christian and
	Heid, Ulrich",
	booktitle = "Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Workshops",
	month = sep,
	year = "2025",
	address = "Hannover, Germany",
	publisher = "HsH Applied Academics",
	url = "https://aclanthology.org/2025.konvens-2.44/",
	pages = "471--478"
	}
	</pre>
	<pre>
	@misc{ubffm/xlm_roberta_large_linsearch_classification,
	title={xlm_roberta_large_linsearch_classification},
	author={UBFFM},
	year={2025},
	howpublished={\url{https://huggingface.co/ubffm/xlm_roberta_large_linsearch_classification/}},
	}
	</pre>
	Contact email: info@linguistik.de