---
license: apache-2.0
language:
  - en
  - fi
  - fr
  - sv
  - tr
metrics:
  - f1
---

Web register classification (multilingual model)

A multilingual web register classification model fine-tuned from XLM-RoBERTa-large.

Model Details

Model Description

  • Developed by: TurkuNLP
  • Funded by: The Research Council of Finland, Eemil Aaltonen Foundation, University of Turku
  • Shared by: TurkuNLP
  • Model type: Language model
  • Language(s) (NLP): English, Finnish, French, Swedish, Turkish
  • License: apache-2.0
  • Finetuned from model: FacebookAI/xlm-roberta-large

Model Sources

Uses

This model classifies texts scraped from the unrestricted web into 25 predefined categories from a hierarchical register taxonomy based on the CORE taxonomy, detailed here. It is trained on English, Finnish, French, Swedish, and Turkish, and also performs well at zero-shot labeling in other languages. It is intended to support the development of open language models and to help linguists analyze register variation.

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_id = "TurkuNLP/multilingual-web-register-classification"

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Text to be categorized
text = "A text to be categorized"

# Tokenize text
inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Apply sigmoid to the logits to get probabilities
probabilities = torch.sigmoid(outputs.logits).squeeze()

# Determine a threshold for predicting labels (e.g., 0.5)
threshold = 0.5
predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0]

# Extract readable labels using id2label
id2label = model.config.id2label
predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices]

print("Predicted labels:", predicted_labels)
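The sigmoid-plus-threshold step above is what makes this multi-label classification: each register probability is computed independently, so a single text can receive several labels (or none) rather than exactly one. A minimal stdlib-only sketch of the same logic, using made-up logits and illustrative label abbreviations (not the model's actual output):

```python
import math

# Hypothetical logits for three registers (illustrative values only)
logits = {"NA": 2.0, "OP": -1.5, "HI": 0.3}

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Each label is thresholded independently, so several can fire at once
threshold = 0.5
probabilities = {label: sigmoid(z) for label, z in logits.items()}
predicted = [label for label, p in probabilities.items() if p > threshold]

print(predicted)  # sigmoid is > 0.5 exactly when the logit is > 0
```

Note that with a 0.5 threshold, checking `probabilities > 0.5` is equivalent to checking `logits > 0`; the sigmoid only matters when you tune the threshold away from 0.5.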

Training Details

Training Data

The model was trained using the Multilingual CORE Corpora, which will be published soon.

Training Procedure

Training Hyperparameters

  • Batch size: 8
  • Epochs: 7
  • Learning rate: 5e-5
  • Precision: bfloat16 (non-mixed precision)
  • TF32: Enabled
  • Seed: 42
  • Max sequence length: 512 tokens

Speeds, Sizes, Times

Average inference time for a single example, measured over 1,000 iterations on a single NVIDIA A100 GPU with a batch size of one, is 17 ms.
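From that latency figure, a rough upper bound on single-example throughput (ignoring tokenization and data-loading overhead, and assuming no batching):

```python
# Back-of-the-envelope throughput from the reported 17 ms/example latency
latency_s = 0.017            # seconds per example at batch size 1
throughput = 1 / latency_s   # examples per second

print(f"~{throughput:.0f} examples/sec")
```

Batching examples together would raise effective throughput well beyond this, at the cost of per-example latency.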

Evaluation

Coming soon

Technical Specifications

Compute Infrastructure

CSC - IT Center for Science, Finland.

GPU

1 x NVIDIA A100-SXM4-40GB

Software

  • torch 2.2.1
  • transformers 4.39.3

Citation

BibTeX:

[TBA]

Model Card Contact

Erik Henriksson, Hugging Face username: erikhenriksson