---
license: apache-2.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- text-classification
- register
- web-register
- genre
---
# Web register classification (English model)
A web register classifier for texts in English, fine-tuned from [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large).
The model is trained with the [Corpus of Online Registers of English (CORE)](https://github.com/TurkuNLP/CORE-corpus) to classify documents based on the [CORE taxonomy](https://turkunlp.org/register-annotation-docs/).
It is designed to support the development of open language models and to help linguists analyze register variation.
For a multilingual CORE classifier, see [here](https://huggingface.co/TurkuNLP/web-register-classification-multilingual).
## Model Details
### Model Description
- **Developed by:** TurkuNLP
- **Funded by:** The Research Council of Finland, Emil Aaltonen Foundation, University of Turku
- **Shared by:** TurkuNLP
- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** FacebookAI/xlm-roberta-large
### Model Sources
- **Repository:** Coming soon!
- **Paper:** [Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers](https://arxiv.org/abs/2406.19892)
## Register labels and their abbreviations
Below is a list of the register labels predicted by the model. Note that some labels are hierarchical; when a sublabel is predicted, its parent label is also predicted.
For a more detailed description of the label scheme, see [here](https://turkunlp.org/register-annotation-docs/).
The main labels are uppercase. To include only these main labels in the predictions, simply filter the model's output to keep the uppercase labels (see the sketch after the quickstart code below).
- **LY:** Lyrical
- **SP:** Spoken
- **it:** Interview
- **ID:** Interactive discussion
- **NA:** Narrative
- **ne:** News report
- **sr:** Sports report
- **nb:** Narrative blog
- **HI:** How-to or instructions
- **re:** Recipe
- **IN:** Informational description
- **en:** Encyclopedia article
- **ra:** Research article
- **dtp:** Description of a thing or person
- **fi:** Frequently asked questions
- **lt:** Legal terms and conditions
- **OP:** Opinion
- **rv:** Review
- **ob:** Opinion blog
- **rs:** Denominational religious blog or sermon
- **av:** Advice
- **IP:** Informational persuasion
- **ds:** Description with intent to sell
- **ed:** News & opinion blog or editorial
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "TurkuNLP/web-register-classification-en"
# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Text to be categorized
text = "A text to be categorized"
# Tokenize text
inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
with torch.no_grad():
    outputs = model(**inputs)
# Apply sigmoid to the logits to get probabilities
probabilities = torch.sigmoid(outputs.logits).squeeze()
# Choose a prediction threshold (see Evaluation below for the optimized threshold)
threshold = 0.5
predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0]
# Extract readable labels using id2label
id2label = model.config.id2label
predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices]
print("Predicted labels:", predicted_labels)
```
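As noted in the label list above, main register labels are uppercase. Building on `predicted_labels` from the quickstart code, a minimal sketch for keeping only the main labels:
```python
# Keep only the main (uppercase) register labels; sublabels are lowercase
main_labels = [label for label in predicted_labels if label.isupper()]
print("Predicted main labels:", main_labels)
```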
## Training Details
### Training Data
The model was trained using the Multilingual CORE Corpora, which will be published soon.
### Training Procedure
#### Training Hyperparameters
- **Batch size:** 8
- **Epochs:** 9
- **Learning rate:** 0.00003
- **Precision:** bfloat16 (non-mixed precision)
- **TF32:** Enabled
- **Seed:** 42
- **Max sequence length:** 512 tokens
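The training script has not been released yet; as a rough, non-authoritative sketch, the hyperparameters above map onto a Hugging Face `TrainingArguments` configuration along these lines (the output directory, label count, and bfloat16 handling are assumptions):
```python
import torch
from transformers import AutoModelForSequenceClassification, TrainingArguments

# Sketch only: approximates the listed hyperparameters, not the original training script.
# "Non-mixed" bfloat16 is approximated here by loading the weights directly in bfloat16.
model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/xlm-roberta-large",
    num_labels=24,                          # the 24 labels listed above (assumed)
    problem_type="multi_label_classification",
    torch_dtype=torch.bfloat16,
)

training_args = TrainingArguments(
    output_dir="xlmr-large-register-en",    # assumed output path
    per_device_train_batch_size=8,
    num_train_epochs=9,
    learning_rate=3e-5,
    tf32=True,                              # TF32 enabled
    seed=42,
)
```
Inputs would be tokenized with `max_length=512` and `truncation=True`, matching the quickstart example.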
#### Inference time
Average inference time for a single example, measured over 1,000 iterations on a single NVIDIA A100 GPU with a batch size of one, is **17 ms**. With larger batches, inference can be considerably faster.
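A minimal sketch of a comparable latency measurement, reusing `model`, `inputs`, and `device` from the quickstart code above (numbers will vary with hardware and software versions):
```python
import time
import torch

# Warm up, then time repeated single-example forward passes
with torch.no_grad():
    for _ in range(10):
        model(**inputs)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(1000):
        model(**inputs)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"Average latency: {elapsed / 1000 * 1000:.1f} ms per example")
```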
## Evaluation
Micro-averaged F1 scores and optimized prediction thresholds (test set):
| Language | F1 (All labels) | F1 (Main labels) | Threshold |
| -------- | --------------- | ---------------- | ----------|
| English | 0.74 | 0.76 | 0.40 |
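These are micro-averaged F1 scores over the multi-label predictions, computed after binarizing the probabilities at the optimized threshold. A minimal sketch of the metric with toy data (array names and values are illustrative):
```python
import numpy as np
from sklearn.metrics import f1_score

# y_true: binary gold label matrix (n_docs, n_labels); y_prob: predicted probabilities
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]])

threshold = 0.40  # optimized threshold reported for English
y_pred = (y_prob > threshold).astype(int)
print("Micro-averaged F1:", f1_score(y_true, y_pred, average="micro"))
```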
## Technical Specifications
### Compute Infrastructure
- Mahti supercomputer (CSC - IT Center for Science, Finland)
- 1 x NVIDIA A100-SXM4-40GB
#### Software
- torch 2.2.1
- transformers 4.39.3
## Citation
If you use this model, please cite the following publication:
```bibtex
@misc{henriksson2024untanglingunrestrictedwebautomatic,
title={Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers},
author={Erik Henriksson and Amanda Myntti and Anni Eskelinen and Selcen Erten-Johansson and Saara Hellström and Veronika Laippala},
year={2024},
eprint={2406.19892},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.19892},
}
```
Earlier related work includes the following:
```bibtex
@article{Laippala.etal2022,
title = {Register Identification from the Unrestricted Open {{Web}} Using the {{Corpus}} of {{Online Registers}} of {{English}}},
author = {Laippala, Veronika and R{\"o}nnqvist, Samuel and Oinonen, Miika and Kyr{\"o}l{\"a}inen, Aki-Juhani and Salmela, Anna and Biber, Douglas and Egbert, Jesse and Pyysalo, Sampo},
year = {2022},
journal = {Language Resources and Evaluation},
issn = {1574-0218},
doi = {10.1007/s10579-022-09624-1},
url = {https://doi.org/10.1007/s10579-022-09624-1},
}
@article{Skantsi_Laippala_2023,
title = {Analyzing the Unrestricted Web: The {Finnish} Corpus of Online Registers},
doi = {10.1017/S0332586523000021},
journal = {Nordic Journal of Linguistics},
author = {Skantsi, Valtteri and Laippala, Veronika},
year = {2023},
pages = {1–31}
}
```
## Model Card Contact
Erik Henriksson, Hugging Face username: erikhenriksson