---
license: apache-2.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- text-classification
- register
- web-register
- genre
---

# Web register classification (English model)

A web register classifier for texts in English, fine-tuned from [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large). The model is trained on the [Corpus of Online Registers of English (CORE)](https://github.com/TurkuNLP/CORE-corpus) to classify documents according to the [CORE taxonomy](https://turkunlp.org/register-annotation-docs/). It is designed to support the development of open language models and to assist linguists analyzing register variation.

For a multilingual CORE classifier, see [here](https://huggingface.co/TurkuNLP/web-register-classification-multilingual).

## Model Details

### Model Description

- **Developed by:** TurkuNLP
- **Funded by:** The Research Council of Finland, Emil Aaltonen Foundation, University of Turku
- **Shared by:** TurkuNLP
- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** FacebookAI/xlm-roberta-large

### Model Sources

- **Repository:** Coming soon!
- **Paper:** Coming soon!

## Register labels and their abbreviations

Below is a list of the register labels predicted by the model. Note that some labels are hierarchical: when a sublabel is predicted, its parent label is also predicted. For a more detailed description of the label scheme, see [here](https://turkunlp.org/register-annotation-docs/). The main labels are uppercase.
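Because main labels are uppercase and sublabels lowercase, predictions can be reduced to main labels with a simple string filter. A minimal sketch (the `predicted_labels` list here is a hypothetical example of the model's output, not a fixed result):

```python
# Sketch: reduce hierarchical register predictions to main labels only.
# Main labels are uppercase; sublabels are lowercase.
# predicted_labels is a hypothetical example of classifier output.
predicted_labels = ["NA", "ne", "OP", "ob"]

# Keep only the uppercase (main) labels
main_labels = [label for label in predicted_labels if label.isupper()]

print(main_labels)  # ['NA', 'OP']
```

The same filter can be applied to the `predicted_labels` list produced by the usage example further below.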
To include only these main labels in the predictions, simply filter the model's output to keep only the uppercase labels.

- **LY:** Lyrical
- **SP:** Spoken
  - **it:** Interview
- **ID:** Interactive discussion
- **NA:** Narrative
  - **ne:** News report
  - **sr:** Sports report
  - **nb:** Narrative blog
- **HI:** How-to or instructions
  - **re:** Recipe
- **IN:** Informational description
  - **en:** Encyclopedia article
  - **ra:** Research article
  - **dtp:** Description of a thing or person
  - **fi:** Frequently asked questions
  - **lt:** Legal terms and conditions
- **OP:** Opinion
  - **rv:** Review
  - **ob:** Opinion blog
  - **rs:** Denominational religious blog or sermon
  - **av:** Advice
- **IP:** Informational persuasion
  - **ds:** Description with intent to sell
  - **ed:** News & opinion blog or editorial

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "TurkuNLP/web-register-classification-en"

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Text to be categorized
text = "A text to be categorized"

# Tokenize text
inputs = tokenizer(
    [text], return_tensors="pt", padding=True, truncation=True, max_length=512
).to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Apply sigmoid to the logits to get probabilities
probabilities = torch.sigmoid(outputs.logits).squeeze()

# Threshold for predicting labels (0.40 was optimal on the test set; see Evaluation)
threshold = 0.5
predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0]

# Extract readable labels using id2label
id2label = model.config.id2label
predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices]

print("Predicted labels:", predicted_labels)
```

## Training Details

### Training Data

The model
was trained using the Multilingual CORE Corpora, which will be published soon.

### Training Procedure

#### Training Hyperparameters

- **Batch size:** 8
- **Epochs:** 9
- **Learning rate:** 0.00003
- **Precision:** bfloat16 (non-mixed precision)
- **TF32:** Enabled
- **Seed:** 42
- **Max sequence length:** 512 tokens

#### Inference time

The average inference time for a single example, measured over 1,000 iterations on a single NVIDIA A100 GPU with a batch size of one, is **17 ms**. With larger batches, inference can be considerably faster.

## Evaluation

Micro-averaged F1 scores and optimized prediction thresholds (test set):

| Language | F1 (All labels) | F1 (Main labels) | Threshold |
| -------- | --------------- | ---------------- | --------- |
| English  | 0.74            | 0.76             | 0.40      |

## Technical Specifications

### Compute Infrastructure

- Mahti supercomputer (CSC - IT Center for Science, Finland)
- 1 x NVIDIA A100-SXM4-40GB

#### Software

- torch 2.2.1
- transformers 4.39.3

## Citation

If you use this model, please cite the following publication:

```bibtex
@misc{henriksson2024untanglingunrestrictedwebautomatic,
  title={Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers},
  author={Erik Henriksson and Amanda Myntti and Anni Eskelinen and Selcen Erten-Johansson and Saara Hellström and Veronika Laippala},
  year={2024},
  eprint={2406.19892},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2406.19892},
}
```

Earlier related work includes the following:

```bibtex
@article{Laippala.etal2022,
  title = {Register Identification from the Unrestricted Open {{Web}} Using the {{Corpus}} of {{Online Registers}} of {{English}}},
  author = {Laippala, Veronika and R{\"o}nnqvist, Samuel and Oinonen, Miika and Kyr{\"o}l{\"a}inen, Aki-Juhani and Salmela, Anna and Biber, Douglas and Egbert, Jesse and Pyysalo, Sampo},
  year = {2022},
  journal = {Language Resources and Evaluation},
  issn = {1574-0218},
  doi = {10.1007/s10579-022-09624-1},
  url =
{https://doi.org/10.1007/s10579-022-09624-1},
}

@article{Skantsi_Laippala_2023,
  title = {Analyzing the unrestricted web: The {Finnish} corpus of online registers},
  doi = {10.1017/S0332586523000021},
  journal = {Nordic Journal of Linguistics},
  author = {Skantsi, Valtteri and Laippala, Veronika},
  year = {2023},
  pages = {1–31}
}
```

## Model Card Contact

Erik Henriksson, Hugging Face username: erikhenriksson