| --- |
| license: apache-2.0 |
| language: |
| - en |
| - fi |
| - fr |
| - sv |
| - tr |
| metrics: |
| - f1 |
| --- |
| # Web register classification (multilingual model) |
|
|
| A multilingual web register classification model fine-tuned from XLM-RoBERTa-large. |
|
|
| ## Model Details |
|
|
| ### Model Description |
|
|
| - **Developed by:** TurkuNLP |
| - **Funded by:** The Research Council of Finland, Eemil Aaltonen Foundation, University of Turku |
| - **Shared by:** TurkuNLP |
| - **Model type:** Language model |
| - **Language(s) (NLP):** En, Fi, Fr, Sv, Tr |
| - **License:** apache-2.0 |
| - **Finetuned from model:** FacebookAI/xlm-roberta-large |
|
|
| ### Model Sources |
|
|
| - **Repository:** https://github.com/TurkuNLP/pytorch-registerlabeling |
| - **Paper:** Coming soon! |
|
|
| ## Uses |
|
|
This model classifies texts scraped from the unrestricted web into 25 pre-defined categories from a hierarchical register taxonomy.
The taxonomy, derived from the [CORE taxonomy](https://www.cambridge.org/core/books/register-variation-online/D1D0F0E0BFEA077107F4686C357AA66B), is detailed [here](https://turkunlp.org/register-annotation-docs/abbreviations).
The model is trained on English, Finnish, French, Swedish, and Turkish, and also performs well when labeling other languages zero-shot.
It is intended to support the development of open language models and to help linguists analyze register variation.
|
|
| ## How to Get Started with the Model |
|
|
| Use the code below to get started with the model. |
|
|
| ``` |
| import torch |
| from transformers import AutoModelForSequenceClassification, AutoTokenizer |
| |
| device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
| |
| model_id = "TurkuNLP/multilingual-web-register-classification" |
| |
| # Load model and tokenizer |
| model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device) |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| |
| # Text to be categorized |
| text = "A text to be categorized" |
| |
| # Tokenize text |
| inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device) |
| |
| with torch.no_grad(): |
| outputs = model(**inputs) |
| |
| # Apply sigmoid to the logits to get probabilities |
| probabilities = torch.sigmoid(outputs.logits).squeeze() |
| |
| # Determine a threshold for predicting labels (e.g., 0.5) |
| threshold = 0.5 |
| predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0] |
| |
| # Extract readable labels using id2label |
| id2label = model.config.id2label |
| predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices] |
| |
| print("Predicted labels:", predicted_labels) |
| |
| ``` |
|
|
| ## Training Details |
|
|
| ### Training Data |
|
|
| The model was trained using the Multilingual CORE Corpora, which will be published soon. |
|
|
| ### Training Procedure |
|
|
| #### Training Hyperparameters |
|
|
| - **Batch size:** 8 |
| - **Epochs:** 7 |
| - **Learning rate:** 0.00005 |
| - **Precision:** bfloat16 (non-mixed precision) |
| - **TF32:** Enabled |
| - **Seed:** 42 |
| - **Max sequence length:** 512 tokens |
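For reference, these hyperparameters could map onto Hugging Face `TrainingArguments` roughly as follows. This is a hedged sketch, not the authors' actual training script: `output_dir` is an assumption, and full (non-mixed) bfloat16 would additionally require loading the model in bfloat16 rather than relying on `bf16=True` alone, which enables mixed precision.

```python
from transformers import TrainingArguments

# Sketch of the listed hyperparameters as TrainingArguments;
# output_dir and anything not listed in the card are assumptions.
training_args = TrainingArguments(
    output_dir="register-classifier",  # assumed, not from the model card
    per_device_train_batch_size=8,
    num_train_epochs=7,
    learning_rate=5e-5,
    bf16=True,   # mixed-precision bf16; full bf16 needs the model loaded in bfloat16
    tf32=True,
    seed=42,
)
# The 512-token limit is applied at tokenization time (max_length=512),
# as in the inference example above.
```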
|
|
| #### Speeds, Sizes, Times |
|
|
| Average inference time, measured over 1000 iterations on a single NVIDIA A100 GPU with a batch size of one, is **17 ms** per example. |
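A figure like this is typically obtained by timing repeated forward passes and averaging. A minimal, model-agnostic sketch of such a measurement (the `classify` stub stands in for the tokenizer-plus-model call and is purely illustrative; accurate GPU timing would also require `torch.cuda.synchronize()` around the timed region):

```python
import time

def classify(text):
    # Stand-in for tokenization + model forward pass; illustrative only.
    return [c for c in text if c.isalpha()]

iterations = 1000
start = time.perf_counter()
for _ in range(iterations):
    classify("A text to be categorized")
elapsed = time.perf_counter() - start

avg_ms = elapsed / iterations * 1000
print(f"Average inference time: {avg_ms:.3f} ms")
```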
|
|
| ## Evaluation |
|
|
| Coming soon |
|
|
|
|
| ## Technical Specifications |
|
|
| ### Compute Infrastructure |
|
|
| CSC - IT Center for Science, Finland. |
|
|
| #### GPU |
|
|
| 1 x NVIDIA A100-SXM4-40GB |
|
|
| #### Software |
|
|
| - torch 2.2.1 |
| - transformers 4.39.3 |
|
|
| ## Citation |
|
|
| **BibTeX:** |
|
|
| [TBA] |
|
|
| ## Model Card Contact |
|
|
| Erik Henriksson, Hugging Face username: erikhenriksson |