+---
+license: cc-by-sa-4.0
+language:
+  - multilingual
+  - af
+  - am
+  - ar
+  - as
+  - az
+  - be
+  - bg
+  - bn
+  - br
+  - bs
+  - ca
+  - cs
+  - cy
+  - da
+  - de
+  - el
+  - en
+  - eo
+  - es
+  - et
+  - eu
+  - fa
+  - fi
+  - fr
+  - fy
+  - ga
+  - gd
+  - gl
+  - gu
+  - ha
+  - he
+  - hi
+  - hr
+  - hu
+  - hy
+  - id
+  - is
+  - it
+  - ja
+  - jv
+  - ka
+  - kk
+  - km
+  - kn
+  - ko
+  - ku
+  - ky
+  - la
+  - lo
+  - lt
+  - lv
+  - mg
+  - mk
+  - ml
+  - mn
+  - mr
+  - ms
+  - my
+  - ne
+  - nl
+  - 'no'
+  - om
+  - or
+  - pa
+  - pl
+  - ps
+  - pt
+  - ro
+  - ru
+  - sa
+  - sd
+  - si
+  - sk
+  - sl
+  - so
+  - sq
+  - sr
+  - su
+  - sv
+  - sw
+  - ta
+  - te
+  - th
+  - tl
+  - tr
+  - ug
+  - uk
+  - ur
+  - uz
+  - vi
+  - xh
+  - yi
+  - zh
+tags:
+- text-classification
+- register
+- web-register
+- genre
+---
+# Web register classification (English model)
+A web register classifier for texts in English, fine-tuned from [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large).
+The model is trained with the [Corpus of Online Registers of English (CORE)](https://github.com/TurkuNLP/CORE-corpus) to classify documents based on the [CORE taxonomy](https://turkunlp.org/register-annotation-docs/).
+It is designed to support the development of open language models and to help linguists analyze register variation.
+For a multilingual CORE classifier, see [here](https://huggingface.co/TurkuNLP/web-register-classification-multilingual).
+## Model Details
+### Model Description
+- **Developed by:** TurkuNLP
+- **Funded by:** The Research Council of Finland, Eemil Aaltonen Foundation, University of Turku
+- **Shared by:** TurkuNLP
+- **Model type:** Language model
+- **Language(s) (NLP):** English
+- **License:** apache-2.0
+- **Finetuned from model:** FacebookAI/xlm-roberta-large
+### Model Sources
+- **Repository:** https://github.com/TurkuNLP/pytorch-registerlabeling
+- **Paper:** Coming soon!
+## Register labels and their abbreviations
+Below is a list of the register labels predicted by the model. Note that some labels are hierarchical; when a sublabel is predicted, its parent label is also predicted.
+For a more detailed description of the label scheme, see [here](https://turkunlp.org/register-annotation-docs/).
+The main labels are uppercase. To include only the main labels in the predictions, filter the predicted labels and keep only the uppercase ones.
+- **LY:** Lyrical
+- **SP:** Spoken
+    - **it:** Interview
+- **ID:** Interactive discussion
+- **NA:** Narrative
+    - **ne:** News report
+    - **sr:** Sports report
+    - **nb:** Narrative blog
+- **HI:** How-to or instructions
+    - **re:** Recipe
+- **IN:** Informational description
+    - **en:** Encyclopedia article
+    - **ra:** Research article
+    - **dtp:** Description of a thing or person
+    - **fi:** Frequently asked questions
+    - **lt:** Legal terms and conditions
+- **OP:** Opinion
+    - **rv:** Review
+    - **ob:** Opinion blog
+    - **rs:** Denominational religious blog or sermon
+    - **av:** Advice
+- **IP:** Informational persuasion
+    - **ds:** Description with intent to sell
+    - **ed:** News & opinion blog or editorial
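The uppercase filter and the label hierarchy described above can be sketched as follows. This is a minimal illustration, assuming the model's `id2label` uses the abbreviations exactly as listed; `PARENT`, `main_labels`, and `parent_of` are hypothetical names, not part of the model or its repository.

```python
# Sublabel -> parent label, transcribed from the list above.
# Assumption: the model emits these exact abbreviations.
PARENT = {
    "it": "SP",
    "ne": "NA", "sr": "NA", "nb": "NA",
    "re": "HI",
    "en": "IN", "ra": "IN", "dtp": "IN", "fi": "IN", "lt": "IN",
    "rv": "OP", "ob": "OP", "rs": "OP", "av": "OP",
    "ds": "IP", "ed": "IP",
}


def main_labels(predicted_labels):
    """Keep only the main (uppercase) register labels, preserving order."""
    return [label for label in predicted_labels if label.isupper()]


def parent_of(label):
    """Return the parent of a sublabel, or the label itself if it is a main label."""
    return PARENT.get(label, label)


# Example with a made-up prediction: a news report and its parent label.
print(main_labels(["NA", "ne"]))
print(parent_of("ne"))
```

Filtering on `str.isupper()` works here only because every main label is fully uppercase and every sublabel is fully lowercase.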
+## How to Get Started with the Model
+Use the code below to get started with the model.
+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model_id = "TurkuNLP/web-register-classification-en"
+# Load model and tokenizer
+model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# Text to be categorized
+text = "A text to be categorized"
+# Tokenize text
+inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
+with torch.no_grad():
+    outputs = model(**inputs)
+# Apply sigmoid to the logits to get probabilities
+probabilities = torch.sigmoid(outputs.logits).squeeze()
+# Threshold for predicting labels (the Evaluation section reports 0.40 as the optimized value)
+threshold = 0.5
+predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0]
+# Extract readable labels using id2label
+id2label = model.config.id2label
+predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices]
+print("Predicted labels:", predicted_labels)
+```
+## Training Details
+### Training Data
+The model was trained using the Multilingual CORE Corpora, which will be published soon.
+### Training Procedure
+#### Training Hyperparameters
+- **Batch size:** 8
+- **Epochs:** 21
+- **Learning rate:** 0.00005
+- **Precision:** bfloat16 (non-mixed precision)
+- **TF32:** Enabled
+- **Seed:** 42
+- **Max sequence length:** 512 tokens
+#### Inference time
+Average inference time (across 1000 iterations) on a single NVIDIA A100 GPU with a batch size of one is **17 ms** per example. With larger batches, inference can be considerably faster.
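One way to run inference in larger batches is to split the input texts into fixed-size chunks and pass each chunk to the tokenizer at once. The sketch below is a minimal illustration: `batched` is a hypothetical helper, the batch size of 16 is an arbitrary assumption, and the commented lines reuse `tokenizer`, `model`, and `device` from the getting-started example.

```python
from typing import Iterator, List, Sequence


def batched(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


# Hypothetical usage with the objects from the getting-started example:
#
# for batch in batched(texts, 16):
#     inputs = tokenizer(batch, return_tensors="pt", padding=True,
#                        truncation=True, max_length=512).to(device)
#     with torch.no_grad():
#         # Shape (len(batch), num_labels); keep the batch dimension
#         # (no .squeeze()) so each row maps to one input text.
#         probabilities = torch.sigmoid(model(**inputs).logits)

print(list(batched(["a", "b", "c"], 2)))
```

The actual speedup depends on hardware and text lengths, since padding makes every text in a batch as long as the longest one.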
+## Evaluation
+Micro-averaged F1 scores and optimized prediction thresholds (test set):
+| Language | F1 (All labels) | F1 (Main labels) | Threshold |
+| -------- | --------------- | ---------------- | ----------|
+| English  | 0.74            | 0.75             | 0.40      |
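To make the table's two quantities concrete, the sketch below applies the reported threshold of 0.40 to sigmoid probabilities and computes a micro-averaged F1 score on toy data. It uses the standard definition, F1 = 2TP / (2TP + FP + FN), with counts pooled over all documents and labels. Whether the reported scores were computed exactly this way is an assumption; the function names and the example data are made up for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(logits, labels, threshold=0.40):
    """Return the set of labels whose probability exceeds the threshold."""
    return {label for logit, label in zip(logits, labels)
            if sigmoid(logit) > threshold}


def micro_f1(gold, predicted):
    """Micro-averaged F1 over parallel lists of gold and predicted label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy data (not from the test set): two documents.
gold = [{"NA", "ne"}, {"OP"}]
predicted = [{"NA"}, {"OP", "rv"}]
print(round(micro_f1(gold, predicted), 3))

# A logit of -0.2 gives a probability of about 0.45: above 0.40, below 0.50.
print(predict([-0.2, -1.0], ["NA", "OP"], threshold=0.40))
print(predict([-0.2, -1.0], ["NA", "OP"], threshold=0.50))
```

Lowering the threshold from 0.50 to 0.40 admits borderline labels, which usually trades some precision for recall.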
+## Technical Specifications
+### Compute Infrastructure
+- Mahti supercomputer (CSC - IT Center for Science, Finland)
+- 1 x NVIDIA A100-SXM4-40GB
+#### Software
+- torch 2.2.1
+- transformers 4.39.3
+## Citation
+The citation for this work will be available soon. In the meantime, please refer to earlier related work for citation:
+```bibtex
+@article{Laippala.etal2022,
+  title = {Register Identification from the Unrestricted Open {{Web}} Using the {{Corpus}} of {{Online Registers}} of {{English}}},
+  author = {Laippala, Veronika and R{\"o}nnqvist, Samuel and Oinonen, Miika and Kyr{\"o}l{\"a}inen, Aki-Juhani and Salmela, Anna and Biber, Douglas and Egbert, Jesse and Pyysalo, Sampo},
+  year = {2022},
+  journal = {Language Resources and Evaluation},
+  issn = {1574-0218},
+  doi = {10.1007/s10579-022-09624-1},
+  url = {https://doi.org/10.1007/s10579-022-09624-1},
+}
+```
+## Model Card Contact
+Erik Henriksson, Hugging Face username: erikhenriksson