---
license: apache-2.0
language:
- en
- fi
- fr
- sv
- tr
metrics:
- f1
---
# Web register classification (multilingual model)

A multilingual web register classification model fine-tuned from XLM-RoBERTa-large.

## Model Details

### Model Description
- Developed by: TurkuNLP
- Funded by: The Research Council of Finland, Eemil Aaltonen Foundation, University of Turku
- Shared by: TurkuNLP
- Model type: Language model
- Language(s) (NLP): English, Finnish, French, Swedish, Turkish
- License: apache-2.0
- Finetuned from model: FacebookAI/xlm-roberta-large
### Model Sources
- Repository: https://github.com/TurkuNLP/pytorch-registerlabeling
- Paper: Coming soon!
## Uses
This model is designed for classifying texts scraped from the unrestricted web into 25 pre-defined categories based on a hierarchical register taxonomy. The taxonomy, based on the CORE taxonomy, is detailed here. It is trained on English, Finnish, French, Swedish, and Turkish, and performs well in zero-shot labeling for other languages. It is designed to support the development of open language models and for linguists analyzing register variation.
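Because the taxonomy is hierarchical, a sub-register label implies its parent register. A minimal sketch of propagating predicted labels up the hierarchy; the parent-child map below is illustrative only, not the model's actual label set:

```python
# Hypothetical child -> parent mapping; the real 25-label taxonomy is
# documented in the linked repository, not reproduced here.
CHILD_TO_PARENT = {
    "NE": "NA",  # e.g. a news sub-register under a narrative main register
    "RV": "OP",  # e.g. a review sub-register under an opinion main register
}

def with_parents(labels):
    """Extend a set of predicted labels with their implied parent registers."""
    out = set(labels)
    for lab in labels:
        parent = CHILD_TO_PARENT.get(lab)
        if parent is not None:
            out.add(parent)
    return sorted(out)

print(with_parents(["NE", "RV"]))  # -> ['NA', 'NE', 'OP', 'RV']
```

This kind of post-processing can be useful when downstream analysis only needs main-register counts.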
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_id = "TurkuNLP/multilingual-web-register-classification"

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Text to be categorized
text = "A text to be categorized"

# Tokenize text
inputs = tokenizer(
    [text], return_tensors="pt", padding=True, truncation=True, max_length=512
).to(device)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Apply sigmoid to the logits to get probabilities (multi-label setting)
probabilities = torch.sigmoid(outputs.logits).squeeze()

# Determine a threshold for predicting labels (e.g., 0.5)
threshold = 0.5
predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0]

# Extract readable labels using id2label
id2label = model.config.id2label
predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices]

print("Predicted labels:", predicted_labels)
```
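Since this is a multi-label classifier, each label's probability is computed independently with a sigmoid rather than jointly with a softmax, so any number of labels can exceed the threshold at once. A minimal numeric illustration with made-up logits for hypothetical labels:

```python
import math

def sigmoid(x):
    """Logistic function: maps a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative logits for three hypothetical labels; real values come
# from the model's output head.
logits = {"NA": 2.0, "OP": -1.0, "IN": 0.3}
threshold = 0.5  # same cut-off as in the snippet above

# Every label whose independent probability clears the threshold is kept.
predicted = sorted(lab for lab, z in logits.items() if sigmoid(z) > threshold)
print(predicted)  # -> ['IN', 'NA']
```

Raising the threshold trades recall for precision; 0.5 is only a starting point and can be tuned per label on validation data.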
## Training Details

### Training Data
The model was trained using the Multilingual CORE Corpora, which will be published soon.
### Training Procedure

#### Training Hyperparameters
- Batch size: 8
- Epochs: 7
- Learning rate: 0.00005
- Precision: bfloat16 (non-mixed precision)
- TF32: Enabled
- Seed: 42
- Max sequence length: 512 tokens
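The hyperparameters above map naturally onto a `transformers` `TrainingArguments` configuration. The sketch below is an assumption about how such a setup could look, not the project's actual training script (see the linked repository for that); note that `bf16=True` enables bfloat16 mixed precision in `transformers`, whereas the card states non-mixed bfloat16, which may have been configured differently:

```python
from transformers import TrainingArguments

# Hypothetical configuration mirroring the listed hyperparameters.
args = TrainingArguments(
    output_dir="out",                  # placeholder path
    per_device_train_batch_size=8,     # Batch size: 8
    num_train_epochs=7,                # Epochs: 7
    learning_rate=5e-5,                # Learning rate: 0.00005
    bf16=True,                         # bfloat16 (see caveat in the lead-in)
    tf32=True,                         # TF32: enabled
    seed=42,                           # Seed: 42
)
```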
### Speeds, Sizes, Times

Average inference time for a single example is 17 ms, measured over 1,000 iterations on a single NVIDIA A100 GPU with a batch size of one.
## Evaluation
Coming soon
## Technical Specifications

### Compute Infrastructure
CSC - IT Center for Science, Finland.
#### GPU
1 x NVIDIA A100-SXM4-40GB
#### Software
- torch 2.2.1
- transformers 4.39.3
## Citation

**BibTeX:**
[TBA]
## Model Card Contact
Erik Henriksson, Hugging Face username: erikhenriksson