---
license: apache-2.0
language:
- en
- fi
- fr
- sv
- tr
metrics:
- f1
---
# Web register classification (multilingual model)
A multilingual web register classification model fine-tuned from XLM-RoBERTa-large.
## Model Details
### Model Description
- **Developed by:** TurkuNLP
- **Funded by:** The Research Council of Finland, Eemil Aaltonen Foundation, University of Turku
- **Shared by:** TurkuNLP
- **Model type:** Language model
- **Language(s) (NLP):** En, Fi, Fr, Sv, Tr
- **License:** apache-2.0
- **Finetuned from model:** FacebookAI/xlm-roberta-large
### Model Sources
- **Repository:** https://github.com/TurkuNLP/pytorch-registerlabeling
- **Paper:** Coming soon!
## Uses
This model is designed for classifying texts scraped from the unrestricted web into 25 pre-defined categories based on a hierarchical register taxonomy.
The taxonomy, based on the [CORE taxonomy](https://www.cambridge.org/core/books/register-variation-online/D1D0F0E0BFEA077107F4686C357AA66B), is detailed [here](https://turkunlp.org/register-annotation-docs/abbreviations).
It is trained on English, Finnish, French, Swedish, and Turkish, and also performs well in zero-shot classification of texts in other languages.
The model is intended to support the development of open language models and to assist linguists analyzing register variation.
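Predicted labels combine a main register with an optional sub-register from the hierarchical taxonomy. As a sketch only (the exact label format is documented at the taxonomy link above; the hyphenated `MAIN-sub` form and the example labels here are assumptions), a small helper could split a label into its two levels:

```python
def split_register(label):
    """Split a hypothetical hierarchical label like 'NA-ne' into (main, sub)."""
    main, _, sub = label.partition("-")
    return main, sub or None

print(split_register("NA-ne"))  # ('NA', 'ne')
print(split_register("IN"))     # ('IN', None)
```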
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "TurkuNLP/multilingual-web-register-classification"

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Text to be categorized
text = "A text to be categorized"

# Tokenize text
inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Apply sigmoid to the logits to get per-label probabilities
probabilities = torch.sigmoid(outputs.logits).squeeze()

# Choose a threshold for predicting labels (e.g., 0.5)
threshold = 0.5
predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0]

# Map indices to readable labels using id2label
id2label = model.config.id2label
predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices]
print("Predicted labels:", predicted_labels)
```
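The multi-label decision step above can be illustrated without loading the model: apply a sigmoid to each logit independently and keep every label whose probability clears the threshold. The logits and label names below are made up for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits and id2label mapping, for illustration only
logits = [2.0, -1.5, 0.3, -3.0]
id2label = {0: "NA-ne", 1: "IN-dtp", 2: "OP-rv", 3: "SP-it"}

threshold = 0.5
probabilities = [sigmoid(z) for z in logits]
predicted = [id2label[i] for i, p in enumerate(probabilities) if p > threshold]
print(predicted)  # ['NA-ne', 'OP-rv']
```

Because each label gets its own sigmoid rather than a shared softmax, a text can receive zero, one, or several register labels, which is what the hybrid categories in the taxonomy require.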
## Training Details
### Training Data
The model was trained using the Multilingual CORE Corpora, which will be published soon.
### Training Procedure
#### Training Hyperparameters
- **Batch size:** 8
- **Epochs:** 7
- **Learning rate:** 0.00005
- **Precision:** bfloat16 (non-mixed precision)
- **TF32:** Enabled
- **Seed:** 42
- **Max sequence length:** 512 tokens
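Assuming training used the Hugging Face `Trainer`, the listed hyperparameters map onto a `TrainingArguments` sketch like the one below. This is a reconstruction, not the authors' actual configuration: `output_dir` is a placeholder, and since the card specifies *non-mixed* bfloat16, in practice the model weights would likely be loaded in `torch.bfloat16` rather than relying solely on the `bf16` mixed-precision flag:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",              # placeholder path
    per_device_train_batch_size=8,    # Batch size: 8
    num_train_epochs=7,               # Epochs: 7
    learning_rate=5e-5,               # Learning rate: 0.00005
    bf16=True,                        # bfloat16 precision (see note above)
    tf32=True,                        # TF32: enabled
    seed=42,                          # Seed: 42
)
```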
#### Speeds, Sizes, Times
Average inference time for a single example, measured over 1,000 iterations on a single NVIDIA A100 GPU with a batch size of one, is **17 ms**.
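A latency measurement like the one above can be sketched with a simple wall-clock loop; here `infer` is a placeholder callable standing in for the model forward pass, so the printed number is illustrative, not the model's actual latency:

```python
import time

def average_inference_ms(infer, n_iters=1000):
    """Time a callable over n_iters runs; return the mean latency in ms."""
    start = time.perf_counter()
    for _ in range(n_iters):
        infer()
    return (time.perf_counter() - start) / n_iters * 1000.0

# Placeholder workload standing in for model(**inputs)
avg_ms = average_inference_ms(lambda: sum(range(1000)), n_iters=100)
print(f"{avg_ms:.3f} ms per example")
```

For GPU inference, one would additionally call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels launch asynchronously.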
## Evaluation
Coming soon
## Technical Specifications
### Compute Infrastructure
CSC - IT Center for Science, Finland.
#### GPU
1 x NVIDIA A100-SXM4-40GB
#### Software
- torch 2.2.1
- transformers 4.39.3
## Citation
**BibTeX:**
[TBA]
## Model Card Contact
Erik Henriksson, Hugging Face username: erikhenriksson