Update README.md
README.md (changed)
````diff
@@ -9,18 +9,14 @@ language:
 metrics:
 - f1
 ---
-# Web register classification
+# Web register classification (multilingual model)
 
-A web register classification model
+A web register classification model fine-tuned from XLM-RoBERTa-large.
 
 ## Model Details
 
 ### Model Description
 
-<!-- Provide a longer summary of what this model is. -->
-
-
-
 - **Developed by:** TurkuNLP
 - **Funded by:** The Research Council of Finland, Eemil Aaltonen Foundation, University of Turku
 - **Shared by:** TurkuNLP
````

````diff
@@ -43,34 +39,56 @@ It is designed to support the development of open language models and for lingui
 
 ## How to Get Started with the Model
 
+Use the code below to get started with the model.
+
 ```
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
-
-model = AutoModelForSequenceClassification.from_pretrained(cfg.model_path).to(
-    device
-)
-model.eval()
+model_id = "TurkuNLP/multilingual-web-register-classification"
 
-#
-
-
-tokenizer = AutoTokenizer.from_pretrained(config.get("_name_or_path"))
+# Load model and tokenizer
+model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
 
-
+# Text to be categorized
+text = "A text to be categorized"
+
+# Tokenize text
+inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
 
+with torch.no_grad():
+    outputs = model(**inputs)
+
+# Apply sigmoid to the logits to get probabilities
+probabilities = torch.sigmoid(outputs.logits).squeeze()
+
+# Determine a threshold for predicting labels (e.g., 0.5)
+threshold = 0.5
+predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0]
+
+# Extract readable labels using id2label
+id2label = model.config.id2label
+predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices]
+
+print("Predicted labels:", predicted_labels)
+
+```
 
 ## Training Details
 
 ### Training Data
 
-The
+The model was trained using the Multilingual CORE Corpora, which will be published soon.
 
 ### Training Procedure
 
 #### Training Hyperparameters
 
 - **Batch size:** 8
+- **Epochs:** 7
 - **Learning rate:** 0.00005
 - **Precision:** bfloat16 (non-mixed precision)
 - **TF32:** Enabled
````
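
For orientation, the training hyperparameters listed above map onto a standard `transformers` fine-tuning configuration roughly as sketched below. This is an illustrative sketch, not the authors' training script: the output path is an assumption, and the model/dataset wiring is only indicated in comments.

```python
# Illustrative mapping of the listed hyperparameters onto TrainingArguments;
# not the authors' training script.
import torch
from transformers import TrainingArguments

# "TF32: Enabled"
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

training_args = TrainingArguments(
    output_dir="web-register-classifier",  # assumed output path, not from the card
    per_device_train_batch_size=8,         # "Batch size: 8"
    num_train_epochs=7,                    # "Epochs: 7"
    learning_rate=5e-5,                    # "Learning rate: 0.00005"
)

# "Precision: bfloat16 (non-mixed precision)" would correspond to running the
# model itself in bfloat16 (e.g. loading it with torch_dtype=torch.bfloat16)
# rather than enabling mixed-precision (bf16=True) training.
print(training_args.per_device_train_batch_size, training_args.num_train_epochs)
```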

````diff
@@ -85,34 +103,6 @@ Average inference time (across 1000 iterations), using a single NVIDIA A100 GPU
 
 Coming soon
 
-### Testing Data, Factors & Metrics
-
-#### Testing Data
-
-<!-- This should link to a Dataset Card if possible. -->
-
-[More Information Needed]
-
-#### Factors
-
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
-[More Information Needed]
-
-#### Metrics
-
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
-[More Information Needed]
-
-### Results
-
-[More Information Needed]
-
-#### Summary
-
-
-
 
 ## Technical Specifications
 
````
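
The card reports average inference time over 1000 iterations on a single NVIDIA A100 GPU. A measurement of that kind could look roughly like the sketch below; it is not the authors' benchmark script, and the example text and timing loop are assumptions.

```python
# Rough latency-measurement sketch (not the authors' benchmark script):
# averages forward-pass time over 1000 iterations, as the card describes.
import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "TurkuNLP/multilingual-web-register-classification"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer(["A text to be categorized"], return_tensors="pt",
                   padding=True, truncation=True, max_length=512).to(device)

n_iterations = 1000
with torch.no_grad():
    model(**inputs)  # warm-up pass, excluded from timing
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iterations):
        model(**inputs)
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"Average inference time: {elapsed / n_iterations * 1000:.2f} ms")
```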

````diff
@@ -126,7 +116,8 @@ NVIDIA A100-SXM4-40GB
 
 #### Software
 
-
+torch 2.2.1
+transformers 4.39.3
 
 ## Citation
 
````
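
The software versions listed above (torch 2.2.1, transformers 4.39.3) can be compared against the running environment with a quick check; the snippet is illustrative, and newer compatible versions may also work.

```python
# Print installed versions to compare against those listed in the card
# (torch 2.2.1, transformers 4.39.3).
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
```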