| --- |
| license: apache-2.0 |
| language: |
| - en |
| - fi |
| - fr |
| - sv |
| - tr |
| metrics: |
| - f1 |
| --- |
| # Web register classification (multilingual model) |
|
|
| A multilingual web register classification model fine-tuned from XLM-RoBERTa-large. |
|
|
| ## Model Details |
|
|
| ### Model Description |
|
|
| - **Developed by:** TurkuNLP |
| - **Funded by:** The Research Council of Finland, Eemil Aaltonen Foundation, University of Turku |
| - **Shared by:** TurkuNLP |
| - **Model type:** Language model |
| - **Language(s) (NLP):** En, Fi, Fr, Sv, Tr |
| - **License:** apache-2.0 |
| - **Finetuned from model:** FacebookAI/xlm-roberta-large |
|
|
| ### Model Sources |
|
|
| - **Repository:** https://github.com/TurkuNLP/pytorch-registerlabeling |
| - **Paper:** Coming soon! |
|
|
| ## Uses |
|
|
This model classifies texts scraped from the unrestricted web into 25 pre-defined categories from a hierarchical register taxonomy.
The taxonomy, derived from the [CORE taxonomy](https://www.cambridge.org/core/books/register-variation-online/D1D0F0E0BFEA077107F4686C357AA66B), is detailed [here](https://turkunlp.org/register-annotation-docs/abbreviations).
The model is trained on English, Finnish, French, Swedish, and Turkish, and also performs well when labeling other languages zero-shot.
It is intended to support the development of open language models and to help linguists analyze register variation.
|
|
| ## How to Get Started with the Model |
|
|
| Use the code below to get started with the model. |
|
|
| ``` |
| import torch |
| from transformers import AutoModelForSequenceClassification, AutoTokenizer |
| |
| device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
| |
| model_id = "TurkuNLP/multilingual-web-register-classification" |
| |
| # Load model and tokenizer |
| model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device) |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| |
| # Text to be categorized |
| text = "A text to be categorized" |
| |
| # Tokenize text |
| inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device) |
| |
| with torch.no_grad(): |
| outputs = model(**inputs) |
| |
| # Apply sigmoid to the logits to get probabilities |
| probabilities = torch.sigmoid(outputs.logits).squeeze() |
| |
| # Determine a threshold for predicting labels (e.g., 0.5) |
| threshold = 0.5 |
| predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0] |
| |
| # Extract readable labels using id2label |
| id2label = model.config.id2label |
| predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices] |
| |
| print("Predicted labels:", predicted_labels) |
| |
| ``` |
|
|
| ## Training Details |
|
|
| ### Training Data |
|
|
| The model was trained using the Multilingual CORE Corpora, which will be published soon. |
|
|
| ### Training Procedure |
|
|
| #### Training Hyperparameters |
|
|
| - **Batch size:** 8 |
| - **Epochs:** 7 |
| - **Learning rate:** 0.00005 |
| - **Precision:** bfloat16 (non-mixed precision) |
| - **TF32:** Enabled |
| - **Seed:** 42 |
| - **Max sequence length:** 512 tokens |
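For reference, these hyperparameters could map onto Hugging Face `TrainingArguments` roughly as follows. This is a hedged sketch, not the authors' actual training script: `output_dir` is an assumption, and full (non-mixed) bfloat16 would additionally require loading the model in bfloat16 rather than relying on `bf16=True` alone, which enables mixed precision.

```python
from transformers import TrainingArguments

# Sketch of the listed hyperparameters as TrainingArguments;
# output_dir and anything not listed in the card are assumptions.
training_args = TrainingArguments(
    output_dir="register-classifier",  # assumed, not from the model card
    per_device_train_batch_size=8,
    num_train_epochs=7,
    learning_rate=5e-5,
    bf16=True,   # mixed-precision bf16; full bf16 needs the model loaded in bfloat16
    tf32=True,
    seed=42,
)
# The 512-token limit is applied at tokenization time (max_length=512),
# as in the inference example above.
```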
|
|
| #### Speeds, Sizes, Times |
|
|
| Average inference time, measured over 1000 iterations on a single NVIDIA A100 GPU with a batch size of one, is **17 ms** per example. |
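A figure like this is typically obtained by timing repeated forward passes and averaging. A minimal, model-agnostic sketch of such a measurement (the `classify` stub stands in for the tokenizer-plus-model call and is purely illustrative; accurate GPU timing would also require `torch.cuda.synchronize()` around the timed region):

```python
import time

def classify(text):
    # Stand-in for tokenization + model forward pass; illustrative only.
    return [c for c in text if c.isalpha()]

iterations = 1000
start = time.perf_counter()
for _ in range(iterations):
    classify("A text to be categorized")
elapsed = time.perf_counter() - start

avg_ms = elapsed / iterations * 1000
print(f"Average inference time: {avg_ms:.3f} ms")
```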
|
|
| ## Evaluation |
|
|
| Coming soon |
|
|
|
|
| ## Technical Specifications |
|
|
| ### Compute Infrastructure |
|
|
| CSC - IT Center for Science, Finland. |
|
|
| #### GPU |
|
|
| 1 x NVIDIA A100-SXM4-40GB |
|
|
| #### Software |
|
|
| - torch 2.2.1 |
| - transformers 4.39.3 |
|
|
| ## Citation |
|
|
| **BibTeX:** |
|
|
| [TBA] |
|
|
| ## Model Card Contact |
|
|
| Erik Henriksson, Hugging Face username: erikhenriksson |