erikhenriksson committed on
Commit
ef4c959
·
verified ·
1 Parent(s): c2a33d0

Update README.md

  1. README.md +263 -3
README.md CHANGED
@@ -1,3 +1,263 @@
1
- ---
2
- license: cc-by-sa-4.0
3
- ---
1
+ ---
2
+ license: cc-by-sa-4.0
3
+ language:
4
+ - multilingual
5
+ - af
6
+ - am
7
+ - ar
8
+ - as
9
+ - az
10
+ - be
11
+ - bg
12
+ - bn
13
+ - br
14
+ - bs
15
+ - ca
16
+ - cs
17
+ - cy
18
+ - da
19
+ - de
20
+ - el
21
+ - en
22
+ - eo
23
+ - es
24
+ - et
25
+ - eu
26
+ - fa
27
+ - fi
28
+ - fr
29
+ - fy
30
+ - ga
31
+ - gd
32
+ - gl
33
+ - gu
34
+ - ha
35
+ - he
36
+ - hi
37
+ - hr
38
+ - hu
39
+ - hy
40
+ - id
41
+ - is
42
+ - it
43
+ - ja
44
+ - jv
45
+ - ka
46
+ - kk
47
+ - km
48
+ - kn
49
+ - ko
50
+ - ku
51
+ - ky
52
+ - la
53
+ - lo
54
+ - lt
55
+ - lv
56
+ - mg
57
+ - mk
58
+ - ml
59
+ - mn
60
+ - mr
61
+ - ms
62
+ - my
63
+ - ne
64
+ - nl
65
+ - 'no'
66
+ - om
67
+ - or
68
+ - pa
69
+ - pl
70
+ - ps
71
+ - pt
72
+ - ro
73
+ - ru
74
+ - sa
75
+ - sd
76
+ - si
77
+ - sk
78
+ - sl
79
+ - so
80
+ - sq
81
+ - sr
82
+ - su
83
+ - sv
84
+ - sw
85
+ - ta
86
+ - te
87
+ - th
88
+ - tl
89
+ - tr
90
+ - ug
91
+ - uk
92
+ - ur
93
+ - uz
94
+ - vi
95
+ - xh
96
+ - yi
97
+ - zh
98
+ tags:
99
+ - text-classification
100
+ - register
101
+ - web-register
102
+ - genre
103
+ ---
104
+ # Web register classification (multilingual model)
105
+
106
+ A web register classifier for texts in English, fine-tuned from [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large).
107
+ The model is trained with the [Corpus of Online Registers of English (CORE)](https://github.com/TurkuNLP/CORE-corpus) to classify documents based on the [CORE taxonomy](https://turkunlp.org/register-annotation-docs/).
108
+ It is designed to support the development of open language models and to help linguists analyze register variation.
109
+
110
+ For a multilingual CORE classifier, see [here](https://huggingface.co/TurkuNLP/web-register-classification-multilingual).
111
+
112
+ ## Model Details
113
+
114
+ ### Model Description
115
+
116
+ - **Developed by:** TurkuNLP
117
+ - **Funded by:** The Research Council of Finland, Eemil Aaltonen Foundation, University of Turku
118
+ - **Shared by:** TurkuNLP
119
+ - **Model type:** Language model
120
+ - **Language(s) (NLP):** English
121
+ - **License:** apache-2.0
122
+ - **Finetuned from model:** FacebookAI/xlm-roberta-large
123
+
124
+ ### Model Sources
125
+
126
+ - **Repository:** https://github.com/TurkuNLP/pytorch-registerlabeling
127
+ - **Paper:** Coming soon!
128
+
129
+ ## Register labels and their abbreviations
130
+
131
+ Below is a list of the register labels predicted by the model. Note that some labels are hierarchical; when a sublabel is predicted, its parent label is also predicted.
132
+ For a more detailed description of the label scheme, see [here](https://turkunlp.org/register-annotation-docs/).
133
+
134
+ The main labels are uppercase. To only include these main labels in the predictions, simply slice the model's output to keep only the uppercase labels.
135
+
136
+ - **LY:** Lyrical
137
+ - **SP:** Spoken
138
+   - **it:** Interview
139
+ - **ID:** Interactive discussion
140
+ - **NA:** Narrative
141
+   - **ne:** News report
142
+   - **sr:** Sports report
143
+   - **nb:** Narrative blog
144
+ - **HI:** How-to or instructions
145
+   - **re:** Recipe
146
+ - **IN:** Informational description
147
+   - **en:** Encyclopedia article
148
+   - **ra:** Research article
149
+   - **dtp:** Description of a thing or person
150
+   - **fi:** Frequently asked questions
151
+   - **lt:** Legal terms and conditions
152
+ - **OP:** Opinion
153
+   - **rv:** Review
154
+   - **ob:** Opinion blog
155
+   - **rs:** Denominational religious blog or sermon
156
+   - **av:** Advice
157
+ - **IP:** Informational persuasion
158
+   - **ds:** Description with intent to sell
159
+   - **ed:** News & opinion blog or editorial
160
+
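One way to keep only the main labels is to filter the predicted label names by case, which avoids relying on the order of the model's output indices. This is a minimal sketch, assuming `predicted_labels` is a list of label strings such as the one produced by the getting-started code in this card; the helper name `keep_main_labels` is hypothetical and not part of the model or its repository.

```python
def keep_main_labels(predicted_labels):
    """Return only the main (uppercase) register labels, preserving order."""
    return [label for label in predicted_labels if label.isupper()]


# Illustrative input: a narrative text that was also labeled as a news report
print(keep_main_labels(["NA", "ne"]))  # ['NA']
```

Filtering by name keeps the sketch independent of how `id2label` happens to be ordered.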
161
+ ## How to Get Started with the Model
162
+
163
+ Use the code below to get started with the model.
164
+
165
+ ```python
166
+ import torch
167
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
168
+
169
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
170
+
171
+ model_id = "TurkuNLP/web-register-classification-en"
172
+
173
+ # Load model and tokenizer
174
+ model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device)
175
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
176
+
177
+ # Text to be categorized
178
+ text = "A text to be categorized"
179
+
180
+ # Tokenize text
181
+ inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
182
+
183
+ with torch.no_grad():
184
+     outputs = model(**inputs)
185
+
186
+ # Apply sigmoid to the logits to get probabilities
187
+ probabilities = torch.sigmoid(outputs.logits).squeeze()
188
+
189
+ # Determine a threshold for predicting labels
190
+ threshold = 0.5
191
+ predicted_label_indices = (probabilities > threshold).nonzero(as_tuple=True)[0]
192
+
193
+ # Extract readable labels using id2label
194
+ id2label = model.config.id2label
195
+ predicted_labels = [id2label[idx.item()] for idx in predicted_label_indices]
196
+
197
+ print("Predicted labels:", predicted_labels)
198
+
199
+ ```
200
+
201
+ ## Training Details
202
+
203
+ ### Training Data
204
+
205
+ The model was trained using the Multilingual CORE Corpora, which will be published soon.
206
+
207
+ ### Training Procedure
208
+
209
+ #### Training Hyperparameters
210
+
211
+ - **Batch size:** 8
212
+ - **Epochs:** 21
213
+ - **Learning rate:** 0.00005
214
+ - **Precision:** bfloat16 (non-mixed precision)
215
+ - **TF32:** Enabled
216
+ - **Seed:** 42
217
+ - **Max sequence length:** 512 tokens
218
+
219
+ #### Inference time
220
+
221
+ Average inference time (across 1000 iterations) on a single NVIDIA A100 GPU with a batch size of one is **17 ms** per example. With larger batches, inference can be considerably faster.
222
+
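To process texts in larger batches, the input list first has to be split into chunks. The helper below is a minimal sketch of that step only; the name `batched` and the example texts are assumptions for illustration, and no timing is implied.

```python
def batched(texts, batch_size):
    """Yield consecutive chunks of at most `batch_size` texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


texts = ["first document", "second document", "third document"]
for batch in batched(texts, batch_size=2):
    # Each `batch` is a list of strings that could be passed to the tokenizer
    # in place of `[text]` in the getting-started example.
    print(len(batch), batch)
```

With more than one text per batch, the model returns one row of logits per text, so the thresholding step would need to be applied row by row rather than to a single squeezed vector.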
223
+ ## Evaluation
224
+
225
+ Micro-averaged F1 scores and optimized prediction thresholds (test set):
226
+
227
+ | Language | F1 (All labels) | F1 (Main labels) | Threshold |
228
+ | -------- | --------------- | ---------------- | ----------|
229
+ | English | 0.74 | 0.75 | 0.40 |
230
+
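For readers unfamiliar with the metric, micro-averaged F1 in a multi-label setting pools true positives, false positives and false negatives over all documents and labels before computing precision and recall. The sketch below illustrates this on toy data; it is not the project's evaluation code, and the function name `micro_f1` and the example label sets are assumptions.

```python
def micro_f1(true_label_sets, predicted_label_sets):
    """Micro-averaged F1 over multi-label predictions.

    True positives, false positives and false negatives are summed over
    all documents and labels before precision and recall are computed.
    """
    tp = fp = fn = 0
    for true, pred in zip(true_label_sets, predicted_label_sets):
        true, pred = set(true), set(pred)
        tp += len(true & pred)
        fp += len(pred - true)
        fn += len(true - pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, not taken from the evaluation set
gold = [{"NA", "ne"}, {"IN"}, {"OP", "rv"}]
pred = [{"NA", "ne"}, {"IN", "dtp"}, {"OP"}]
print(round(micro_f1(gold, pred), 2))  # 0.8
```

Here 4 labels are correct, 1 is spurious and 1 is missed, so precision and recall are both 0.8.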
231
+
232
+ ## Technical Specifications
233
+
234
+ ### Compute Infrastructure
235
+
236
+ - Mahti supercomputer (CSC - IT Center for Science, Finland)
237
+ - 1 x NVIDIA A100-SXM4-40GB
238
+
239
+ #### Software
240
+
241
+ - torch 2.2.1
242
+ - transformers 4.39.3
243
+
244
+ ## Citation
245
+
246
+ The citation for this work will be available soon. In the meantime, please cite the following earlier related work:
247
+
248
+ ```bibtex
249
+ @article{Laippala.etal2022,
250
+ title = {Register Identification from the Unrestricted Open {{Web}} Using the {{Corpus}} of {{Online Registers}} of {{English}}},
251
+ author = {Laippala, Veronika and R{\"o}nnqvist, Samuel and Oinonen, Miika and Kyr{\"o}l{\"a}inen, Aki-Juhani and Salmela, Anna and Biber, Douglas and Egbert, Jesse and Pyysalo, Sampo},
252
+ year = {2022},
253
+ journal = {Language Resources and Evaluation},
254
+ issn = {1574-0218},
255
+ doi = {10.1007/s10579-022-09624-1},
256
+ url = {https://doi.org/10.1007/s10579-022-09624-1},
257
+ }
258
+
259
+ ```
260
+
261
+ ## Model Card Contact
262
+
263
+ Erik Henriksson, Hugging Face username: erikhenriksson