DerivedFunction committed on
Commit
630accb
·
verified ·
1 Parent(s): f3f95a2

Update README.md

Files changed (1)
  1. README.md +140 -15
README.md CHANGED
@@ -9,35 +9,160 @@ metrics:
9
  - recall
10
  - f1
11
  - accuracy
12
  model-index:
13
- - name: polyglot-tagger-v2.2
14
  results: []
15
  ---
16
 
17
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
18
- should probably proofread and complete it, then remove this comment. -->
19
 
20
- # polyglot-tagger-v2.2
 
 
 
 
 
 
21
 
22
- This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on an unknown dataset.
23
- It achieves the following results on the evaluation set:
24
- - Loss: 0.0345
25
- - Precision: 0.9508
26
- - Recall: 0.9647
27
- - F1: 0.9577
28
- - Accuracy: 0.9908
29
 
30
  ## Model description
31
 
32
- More information needed
 
33
 
34
  ## Intended uses & limitations
 
 
 
 
 
35
 
36
- More information needed
37
 
38
- ## Training and evaluation data
 
 
39
 
40
- More information needed
 
 
 
 
 
 
41
 
42
  ## Training procedure
43
 
 
9
  - recall
10
  - f1
11
  - accuracy
12
+ language:
13
+ - multilingual
14
+ - af
15
+ - am
16
+ - ar
17
+ - as
18
+ - ba
19
+ - be
20
+ - bg
21
+ - bn
22
+ - bo
23
+ - br
24
+ - bs
25
+ - ca
26
+ - ce
27
+ - ckb
28
+ - cs
29
+ - cy
30
+ - da
31
+ - de
32
+ - dv
33
+ - el
34
+ - en
35
+ - eo
36
+ - es
37
+ - et
38
+ - eu
39
+ - fa
40
+ - fi
41
+ - fr
42
+ - ga
43
+ - gd
44
+ - gl
45
+ - gu
46
+ - he
47
+ - hi
48
+ - hr
49
+ - hu
50
+ - hy
51
+ - id
52
+ - is
53
+ - it
54
+ - ja
55
+ - jv
56
+ - ka
57
+ - kk
58
+ - km
59
+ - kn
60
+ - ko
61
+ - ku
62
+ - ky
63
+ - la
64
+ - lb
65
+ - lo
66
+ - lt
67
+ - lv
68
+ - mg
69
+ - mk
70
+ - ml
71
+ - mn
72
+ - mr
73
+ - ms
74
+ - mt
75
+ - my
76
+ - ne
77
+ - nl
78
+ - 'no'
79
+ - ny
80
+ - oc
81
+ - om
82
+ - or
83
+ - pa
84
+ - pl
85
+ - ps
86
+ - pt
87
+ - rm
88
+ - ro
89
+ - ru
90
+ - sd
91
+ - si
92
+ - sk
93
+ - sl
94
+ - so
95
+ - sq
96
+ - sr
97
+ - su
98
+ - sv
99
+ - sw
100
+ - ta
101
+ - te
102
+ - tg
103
+ - th
104
+ - ti
105
+ - tl
106
+ - tr
107
+ - tt
108
+ - ug
109
+ - uk
110
+ - ur
111
+ - uz
112
+ - vi
113
+ - yo
114
+ - yi
115
+ - zh
116
+ - zu
117
  model-index:
118
+ - name: polyglot-tagger
119
  results: []
120
+ datasets:
121
+ - wikimedia/wikipedia
122
+ - HuggingFaceFW/finetranslations
123
+ - google/smol
124
+ - DerivedFunction/nlp-noise-snippets
125
+ - DerivedFunction/wikipedia-language-snippets-filtered
126
+ - DerivedFunction/finetranslations-filtered
127
+ - DerivedFunction/tatoeba-filtered
128
+ pipeline_tag: token-classification
129
  ---
130
 
 
 
131
 
132
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/67ee3f0a66388136438834cc/OnfV_fN2br5c4cPnOn6O0.png)
133
+
134
+
135
+ Fine-tuned `xlm-roberta-base` for sentence-level language tagging across 100 languages.
136
+ The model predicts BIO-style language tags over tokens, which makes it useful for
137
+ language identification, code-switch detection, and multilingual document analysis.
138
+
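The BIO-style output described above can be collapsed into per-language text spans. The following is a minimal sketch, not the author's code: it assumes tags of the form `B-<lang>` / `I-<lang>` / `O` and token dicts with `entity`, `start`, and `end` keys (the shape the `transformers` token-classification pipeline returns). The helper name `group_language_spans` and the hand-written tokens are hypothetical.

```python
from typing import Dict, List, Tuple


def group_language_spans(text: str, tokens: List[Dict]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (language, text) spans.

    Assumption: each token dict has "entity" ("B-en", "I-en", "O", ...)
    plus "start" and "end" character offsets into `text`.
    """
    spans: List[List] = []  # each entry is [language, start, end]
    span_open = False       # True while the previous token belonged to a span
    for tok in tokens:
        tag = tok["entity"]
        if tag == "O":
            span_open = False
            continue
        prefix, _, lang = tag.partition("-")
        # Extend the current span only for an I- tag of the same language.
        if prefix == "I" and span_open and spans[-1][0] == lang:
            spans[-1][2] = tok["end"]
        else:
            spans.append([lang, tok["start"], tok["end"]])
        span_open = True
    return [(lang, text[start:end]) for lang, start, end in spans]


# Hand-written example tokens (not real model output).
text = "Good morning everyone. Guten Morgen zusammen."
tokens = [
    {"entity": "B-en", "start": 0, "end": 4},
    {"entity": "I-en", "start": 5, "end": 12},
    {"entity": "I-en", "start": 13, "end": 22},
    {"entity": "B-de", "start": 23, "end": 28},
    {"entity": "I-de", "start": 29, "end": 35},
    {"entity": "I-de", "start": 36, "end": 45},
]
print(group_language_spans(text, tokens))
```

In practice the `tokens` list would come from running the model; the exact label names should be checked against the model's `id2label` mapping.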
139
 
 
 
 
 
 
 
 
140
 
141
  ## Model description
142
 
143
+ Polyglot Tagger is a token-level classifier for multilingual documents. Because it is trained on token classification over individual sentences, the model
144
+ generalizes well across a variety of languages, behaves as a multi-label classifier at the document level, and can extract sentences by language.
145
 
146
  ## Intended uses & limitations
147
+ This model can serve as a base model for further fine-tuning on specific language identification and extraction tasks.
148
+ As a general language tagging model, it can confuse closely related languages (for example, English and German, Spanish and Portuguese, or Russian and Ukrainian) and can struggle with short texts.
149
+
150
+ The model is trained on sentences of at least four tokens, so it may not accurately classify very short or ambiguous statements. This model is experimental
151
+ and may produce unexpected results compared to generic text classifiers. It is trained on cleaned text, so "messy" text may produce different results.
152
 
153
+ > Note that Romanized versions of languages, such as Romanized Russian and Hindi, may have only minor representation in the training set.
154
 
155
+ ### Training and Evaluation Data
156
+ A synthetic training row consists of 1-4 individual, mostly independent sentences extracted from various sources. The actual training and evaluation data, as well as coverage details,
157
+ are found in `DerivedFunction/lang-ner-v2`.
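To make the row structure concrete, here is a rough sketch of how a row of 1-4 sentences with BIO language tags could be assembled. This is an illustration only, not the actual data pipeline: the function `build_synthetic_row`, the sentence pool, the whitespace word splitting, and the `B-<lang>` / `I-<lang>` tag names are all assumptions.

```python
import random
from typing import List, Tuple


def build_synthetic_row(
    pool: List[Tuple[str, str]],
    rng: random.Random,
    max_sentences: int = 4,
) -> Tuple[List[str], List[str]]:
    """Sample 1..max_sentences (language, sentence) pairs and tag each word.

    The first word of every sentence gets B-<lang>, the rest get I-<lang>.
    Whitespace splitting is a simplification; it does not suit languages
    written without spaces.
    """
    k = rng.randint(1, min(max_sentences, len(pool)))
    words: List[str] = []
    tags: List[str] = []
    for lang, sentence in rng.sample(pool, k):
        sent_words = sentence.split()
        words.extend(sent_words)
        tags.extend([f"B-{lang}"] + [f"I-{lang}"] * (len(sent_words) - 1))
    return words, tags


# Hypothetical sentence pool for demonstration.
POOL = [
    ("en", "The weather is nice today."),
    ("de", "Ich habe heute keine Zeit."),
    ("es", "Vamos a la playa mañana."),
    ("fr", "Je ne sais pas encore."),
    ("pt", "Ele chegou muito tarde ontem."),
]

words, tags = build_synthetic_row(POOL, random.Random(0))
for word, tag in zip(words, tags):
    print(f"{tag}\t{word}")
```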
158
 
159
+ This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the data in `DerivedFunction/lang-ner-v2`.
160
+ It achieves the following results on the evaluation set:
161
+ - Loss: 0.0345
162
+ - Precision: 0.9508
163
+ - Recall: 0.9647
164
+ - F1: 0.9577
165
+ - Accuracy: 0.9908
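As a quick sanity check on the numbers above, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall. It assumes the reported F1 is derived from those two aggregate values, which the card does not state explicitly.

```python
# Reported evaluation metrics from the model card.
precision = 0.9508
recall = 0.9647

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9577, matching the reported F1
```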
166
 
167
  ## Training procedure
168