---
license: apache-2.0
language:
- en
- fi
- fr
- sv
- tr
metrics:
- f1
---
# Web register classification (multilingual model)

A multilingual web register classification model fine-tuned from XLM-RoBERTa-large.

## Model Details

### Model Description
- Developed by: TurkuNLP
- Funded by: The Research Council of Finland, Eemil Aaltonen Foundation, University of Turku
- Shared by: TurkuNLP
- Model type: Language model
- Language(s) (NLP): English, Finnish, French, Swedish, Turkish
- License: apache-2.0
- Finetuned from model: FacebookAI/xlm-roberta-large
### Model Sources
- Repository: https://github.com/TurkuNLP/pytorch-registerlabeling
- Paper: Coming soon!
## Uses
This model is designed for classifying texts scraped from the unrestricted web into 25 pre-defined categories based on a hierarchical register taxonomy. The taxonomy, based on the CORE taxonomy, is detailed here. It is trained on English, Finnish, French, Swedish, and Turkish, and performs well in zero-shot labeling for other languages. It is designed to support the development of open language models and for linguists analyzing register variation.
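Because the taxonomy is hierarchical, a sub-register label implies its parent register. A minimal sketch of propagating predicted labels up the hierarchy; the parent-child map below is illustrative only, not the model's actual label set:

```python
# Hypothetical child -> parent mapping; the real 25-label taxonomy is
# documented in the linked repository, not reproduced here.
CHILD_TO_PARENT = {
    "NE": "NA",  # e.g. a news sub-register under a narrative main register
    "RV": "OP",  # e.g. a review sub-register under an opinion main register
}

def with_parents(labels):
    """Extend a set of predicted labels with their implied parent registers."""
    out = set(labels)
    for lab in labels:
        parent = CHILD_TO_PARENT.get(lab)
        if parent is not None:
            out.add(parent)
    return sorted(out)

print(with_parents(["NE", "RV"]))  # -> ['NA', 'NE', 'OP', 'RV']
```

This kind of post-processing can be useful when downstream analysis only needs main-register counts.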
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_id = "TurkuNLP/multilingual-web-register-classification"

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Text to be categorized
text = "A text to be categorized"

# Tokenize text
inputs = tokenizer(
    [text], return_tensors="pt", padding=True, truncation=True, max_length=512
).to(device)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Apply sigmoid to the logits to get probabilities (multi-label setting)
probabilities = torch.sigmoid(outputs.logits).squeeze()

# Determine a threshold for predicting labels (e.g., 0.5)
threshold = 0.5
predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0]

# Extract readable labels using id2label
id2label = model.config.id2label
predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices]

print("Predicted labels:", predicted_labels)
```
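Since this is a multi-label classifier, each label's probability is computed independently with a sigmoid rather than jointly with a softmax, so any number of labels can exceed the threshold at once. A minimal numeric illustration with made-up logits for hypothetical labels:

```python
import math

def sigmoid(x):
    """Logistic function: maps a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative logits for three hypothetical labels; real values come
# from the model's output head.
logits = {"NA": 2.0, "OP": -1.0, "IN": 0.3}
threshold = 0.5  # same cut-off as in the snippet above

# Every label whose independent probability clears the threshold is kept.
predicted = sorted(lab for lab, z in logits.items() if sigmoid(z) > threshold)
print(predicted)  # -> ['IN', 'NA']
```

Raising the threshold trades recall for precision; 0.5 is only a starting point and can be tuned per label on validation data.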
## Training Details

### Training Data
The model was trained using the Multilingual CORE Corpora, which will be published soon.
### Training Procedure

#### Training Hyperparameters
- Batch size: 8
- Epochs: 7
- Learning rate: 0.00005
- Precision: bfloat16 (non-mixed precision)
- TF32: Enabled
- Seed: 42
- Max sequence length: 512 tokens
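The hyperparameters above map naturally onto a `transformers` `TrainingArguments` configuration. The sketch below is an assumption about how such a setup could look, not the project's actual training script (see the linked repository for that); note that `bf16=True` enables bfloat16 mixed precision in `transformers`, whereas the card states non-mixed bfloat16, which may have been configured differently:

```python
from transformers import TrainingArguments

# Hypothetical configuration mirroring the listed hyperparameters.
args = TrainingArguments(
    output_dir="out",                  # placeholder path
    per_device_train_batch_size=8,     # Batch size: 8
    num_train_epochs=7,                # Epochs: 7
    learning_rate=5e-5,                # Learning rate: 0.00005
    bf16=True,                         # bfloat16 (see caveat in the lead-in)
    tf32=True,                         # TF32: enabled
    seed=42,                           # Seed: 42
)
```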
### Speeds, Sizes, Times

Average inference time for a single example is 17 ms, measured over 1,000 iterations on a single NVIDIA A100 GPU with a batch size of one.
## Evaluation
Coming soon
## Technical Specifications

### Compute Infrastructure
CSC - IT Center for Science, Finland.
#### GPU
1 x NVIDIA A100-SXM4-40GB
#### Software
- torch 2.2.1
- transformers 4.39.3
## Citation

**BibTeX:**
[TBA]
## Model Card Contact
Erik Henriksson, Hugging Face username: erikhenriksson