---
license: apache-2.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- text-classification
- register
- web-register
- genre
---
# Web register classification (English model)
A web register classifier for texts in English, fine-tuned from [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large).
The model is trained with the [Corpus of Online Registers of English (CORE)](https://github.com/TurkuNLP/CORE-corpus) to classify documents based on the [CORE taxonomy](https://turkunlp.org/register-annotation-docs/).
It is designed to support the development of open language models and to help linguists analyze register variation.
For a multilingual CORE classifier, see [here](https://huggingface.co/TurkuNLP/web-register-classification-multilingual).
## Model Details
### Model Description
- **Developed by:** TurkuNLP
- **Funded by:** The Research Council of Finland, Emil Aaltonen Foundation, University of Turku
- **Shared by:** TurkuNLP
- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** FacebookAI/xlm-roberta-large
### Model Sources
- **Repository:** Coming soon!
- **Paper:** [Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers](https://arxiv.org/abs/2406.19892)
## Register labels and their abbreviations
Below is a list of the register labels predicted by the model. Note that some labels are hierarchical; when a sublabel is predicted, its parent label is also predicted.
For a more detailed description of the label scheme, see [here](https://turkunlp.org/register-annotation-docs/).
The main labels are uppercase. To include only these main labels in the predictions, simply filter the model's output to keep the uppercase labels (see the sketch after the quickstart code below).
- **LY:** Lyrical
- **SP:** Spoken
- **it:** Interview
- **ID:** Interactive discussion
- **NA:** Narrative
- **ne:** News report
- **sr:** Sports report
- **nb:** Narrative blog
- **HI:** How-to or instructions
- **re:** Recipe
- **IN:** Informational description
- **en:** Encyclopedia article
- **ra:** Research article
- **dtp:** Description of a thing or person
- **fi:** Frequently asked questions
- **lt:** Legal terms and conditions
- **OP:** Opinion
- **rv:** Review
- **ob:** Opinion blog
- **rs:** Denominational religious blog or sermon
- **av:** Advice
- **IP:** Informational persuasion
- **ds:** Description with intent to sell
- **ed:** News & opinion blog or editorial
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "TurkuNLP/web-register-classification-en"
# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Text to be categorized
text = "A text to be categorized"
# Tokenize text
inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
with torch.no_grad():
    outputs = model(**inputs)
# Apply sigmoid to the logits to get probabilities
probabilities = torch.sigmoid(outputs.logits).squeeze()
# Choose a prediction threshold (see Evaluation below for the optimized threshold)
threshold = 0.5
predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0]
# Extract readable labels using id2label
id2label = model.config.id2label
predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices]
print("Predicted labels:", predicted_labels)
```
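As noted in the label list above, main register labels are uppercase. Building on `predicted_labels` from the quickstart code, a minimal sketch for keeping only the main labels:
```python
# Keep only the main (uppercase) register labels; sublabels are lowercase
main_labels = [label for label in predicted_labels if label.isupper()]
print("Predicted main labels:", main_labels)
```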
## Training Details
### Training Data
The model was trained using the Multilingual CORE Corpora, which will be published soon.
### Training Procedure
#### Training Hyperparameters
- **Batch size:** 8
- **Epochs:** 9
- **Learning rate:** 0.00003
- **Precision:** bfloat16 (non-mixed precision)
- **TF32:** Enabled
- **Seed:** 42
- **Max sequence length:** 512 tokens
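The training script has not been released yet; as a rough, non-authoritative sketch, the hyperparameters above map onto a Hugging Face `TrainingArguments` configuration along these lines (the output directory, label count, and bfloat16 handling are assumptions):
```python
import torch
from transformers import AutoModelForSequenceClassification, TrainingArguments

# Sketch only: approximates the listed hyperparameters, not the original training script.
# "Non-mixed" bfloat16 is approximated here by loading the weights directly in bfloat16.
model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/xlm-roberta-large",
    num_labels=24,                          # the 24 labels listed above (assumed)
    problem_type="multi_label_classification",
    torch_dtype=torch.bfloat16,
)

training_args = TrainingArguments(
    output_dir="xlmr-large-register-en",    # assumed output path
    per_device_train_batch_size=8,
    num_train_epochs=9,
    learning_rate=3e-5,
    tf32=True,                              # TF32 enabled
    seed=42,
)
```
Inputs would be tokenized with `max_length=512` and `truncation=True`, matching the quickstart example.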
#### Inference time
Average inference time for a single example, measured over 1,000 iterations on a single NVIDIA A100 GPU with a batch size of one, is **17 ms**. With larger batches, inference can be considerably faster.
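A minimal sketch of a comparable latency measurement, reusing `model`, `inputs`, and `device` from the quickstart code above (numbers will vary with hardware and software versions):
```python
import time
import torch

# Warm up, then time repeated single-example forward passes
with torch.no_grad():
    for _ in range(10):
        model(**inputs)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(1000):
        model(**inputs)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"Average latency: {elapsed / 1000 * 1000:.1f} ms per example")
```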
## Evaluation
Micro-averaged F1 scores and optimized prediction thresholds (test set):
| Language | F1 (All labels) | F1 (Main labels) | Threshold |
| -------- | --------------- | ---------------- | ----------|
| English | 0.74 | 0.76 | 0.40 |
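These are micro-averaged F1 scores over the multi-label predictions, computed after binarizing the probabilities at the optimized threshold. A minimal sketch of the metric with toy data (array names and values are illustrative):
```python
import numpy as np
from sklearn.metrics import f1_score

# y_true: binary gold label matrix (n_docs, n_labels); y_prob: predicted probabilities
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]])

threshold = 0.40  # optimized threshold reported for English
y_pred = (y_prob > threshold).astype(int)
print("Micro-averaged F1:", f1_score(y_true, y_pred, average="micro"))
```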
## Technical Specifications
### Compute Infrastructure
- Mahti supercomputer (CSC - IT Center for Science, Finland)
- 1 x NVIDIA A100-SXM4-40GB
#### Software
- torch 2.2.1
- transformers 4.39.3
## Citation
If you use this model, please cite the following publication:
```bibtex
@misc{henriksson2024untanglingunrestrictedwebautomatic,
title={Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers},
author={Erik Henriksson and Amanda Myntti and Anni Eskelinen and Selcen Erten-Johansson and Saara Hellström and Veronika Laippala},
year={2024},
eprint={2406.19892},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.19892},
}
```
Earlier related work includes the following:
```bibtex
@article{Laippala.etal2022,
title = {Register Identification from the Unrestricted Open {{Web}} Using the {{Corpus}} of {{Online Registers}} of {{English}}},
author = {Laippala, Veronika and R{\"o}nnqvist, Samuel and Oinonen, Miika and Kyr{\"o}l{\"a}inen, Aki-Juhani and Salmela, Anna and Biber, Douglas and Egbert, Jesse and Pyysalo, Sampo},
year = {2022},
journal = {Language Resources and Evaluation},
issn = {1574-0218},
doi = {10.1007/s10579-022-09624-1},
url = {https://doi.org/10.1007/s10579-022-09624-1},
}
@article{Skantsi_Laippala_2023,
title = {Analyzing the Unrestricted Web: The {Finnish} Corpus of Online Registers},
doi = {10.1017/S0332586523000021},
journal = {Nordic Journal of Linguistics},
author = {Skantsi, Valtteri and Laippala, Veronika},
year = {2023},
pages = {1–31}
}
```
## Model Card Contact
Erik Henriksson, Hugging Face username: erikhenriksson