Text Classification
Transformers
Safetensors
Latvian
bert

New model version trained on the GoEmotions and Twitter datasets

#2
by SkyWater21 - opened
Files changed (8)
  1. .gitattributes +35 -35
  2. README.md +81 -101
  3. config.json +50 -50
  4. model.safetensors +1 -1
  5. special_tokens_map.json +7 -7
  6. tokenizer.json +0 -0
  7. tokenizer_config.json +57 -57
  8. vocab.txt +0 -0
.gitattributes CHANGED
@@ -1,35 +1,35 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text

+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,101 +1,81 @@
- ---
- license: mit
- datasets:
- - AiLab-IMCS-UL/go_emotions-lv
- language:
- - lv
- base_model:
- - AiLab-IMCS-UL/lvbert
- ---
- Fine-tuned [LVBERT](https://huggingface.co/AiLab-IMCS-UL/lvbert) for multi-label emotion classification task.
-
- Model was trained on [lv_go_emotions](https://huggingface.co/datasets/SkyWater21/lv_go_emotions) dataset. This dataset is Latvian translation of [GoEmotions](https://huggingface.co/datasets/go_emotions) dataset. Google Translate was used to generate the machine translation.
-
- Original 26 emotions were mapped to 6 base emotions as per Dr. Ekman theory.
-
- Labels predicted by classifier:
- ```yaml
- 0: anger
- 1: disgust
- 2: fear
- 3: joy
- 4: sadness
- 5: surprise
- 6: neutral
- ```
-
- Label mapping from 27 emotions from GoEmotion to 6 base emotions as per Dr. Ekman theory:
- |GoEmotion|Ekman|
- |---|---|
- | admiration | joy|
- | amusement | joy|
- | anger | anger|
- | annoyance | anger|
- | approval | joy|
- | caring | joy|
- | confusion | surprise|
- | curiosity | surprise|
- | desire | joy|
- | disappointment | sadness|
- | disapproval | anger|
- | disgust | disgust|
- | embarrassment | sadness|
- | excitement | joy|
- | fear | fear|
- | gratitude | joy|
- | grief | sadness|
- | joy | joy|
- | love | joy|
- | nervousness | fear|
- | optimism | joy|
- | pride | joy|
- | realization | surprise|
- | relief | joy|
- | remorse | sadness|
- | sadness | sadness|
- | surprise | surprise|
- | neutral | neutral|
-
- Seed used for random number generator is 42:
- ```python
- def set_seed(seed=42):
-     random.seed(seed)
-     np.random.seed(seed)
-     torch.manual_seed(seed)
-     if torch.cuda.is_available():
-         torch.cuda.manual_seed_all(seed)
- ```
-
- Training parameters:
- ```yaml
- max_length: null
- batch_size: 32
- shuffle: True
- num_workers: 2
- pin_memory: False
- drop_last: False
-
- optimizer: adam
- lr: 0.00001
- weight_decay: 0
-
- problem_type: multi_label_classification
-
- num_epochs: 3
- ```
-
- Evaluation results on test split of [lv_go_emotions](https://huggingface.co/datasets/SkyWater21/lv_go_emotions/viewer/simplified_ekman)
- | |Precision|Recall|F1-Score|AUC-ROC|Support|
- |--------------|---------|------|--------|-------|-------|
- |anger | 0.57| 0.40| 0.47| 0.85| 726|
- |disgust | 0.64| 0.28| 0.39| 0.93| 123|
- |fear | 0.63| 0.54| 0.58| 0.95| 98|
- |joy | 0.80| 0.79| 0.79| 0.91| 2104|
- |sadness | 0.70| 0.44| 0.54| 0.90| 379|
- |surprise | 0.63| 0.44| 0.52| 0.89| 677|
- |neutral | 0.65| 0.62| 0.64| 0.83| 1787|
- |micro avg | 0.70| 0.61| 0.66| 0.93| 5894|
- |macro avg | 0.66| 0.50| 0.56| 0.89| 5894|
- |weighted avg | 0.69| 0.61| 0.65| 0.88| 5894|
- |samples avg | 0.65| 0.63| 0.63| nan| 5894|

+ ---
+ license: mit
+ datasets:
+ - SkyWater21/lv_emotions
+ language:
+ - lv
+ base_model:
+ - AiLab-IMCS-UL/lvbert
+ ---
+ Fine-tuned [LVBERT](https://huggingface.co/AiLab-IMCS-UL/lvbert) for the multi-label emotion classification task.
+
+ The model was trained on the [lv_emotions](https://huggingface.co/datasets/SkyWater21/lv_emotions) dataset, a Latvian translation of the [GoEmotions](https://huggingface.co/datasets/go_emotions) and [Twitter Emotions](https://huggingface.co/datasets/SkyWater21/lv_twitter_emotions) datasets. Google Translate was used to generate the machine translations.
+
+ The original 27 GoEmotions labels (plus neutral) were mapped to 6 base emotions (plus neutral) per Ekman's theory.
+
+ Labels predicted by the classifier:
+ ```yaml
+ 0: anger
+ 1: disgust
+ 2: fear
+ 3: joy
+ 4: sadness
+ 5: surprise
+ 6: neutral
+ ```
+
+ Seed used for the random number generator is 42:
+ ```python
+ def set_seed(seed=42):
+     random.seed(seed)
+     np.random.seed(seed)
+     torch.manual_seed(seed)
+     if torch.cuda.is_available():
+         torch.cuda.manual_seed_all(seed)
+ ```
+
+ Training parameters:
+ ```yaml
+ max_length: null
+ batch_size: 32
+ shuffle: True
+ num_workers: 4
+ pin_memory: False
+ drop_last: False
+ optimizer: adam
+ lr: 0.000005
+ weight_decay: 0
+ problem_type: multi_label_classification
+ num_epochs: 3
+ ```
+
+ Evaluation results on the test split of [lv_go_emotions](https://huggingface.co/datasets/SkyWater21/lv_emotions/viewer/combined/lv_go_emotions_test):
+ | |Precision|Recall|F1-Score|Support|
+ |--------------|---------|------|--------|-------|
+ |anger | 0.57| 0.36| 0.44| 726|
+ |disgust | 0.42| 0.29| 0.35| 123|
+ |fear | 0.59| 0.43| 0.50| 98|
+ |joy | 0.78| 0.80| 0.79| 2104|
+ |sadness | 0.65| 0.42| 0.51| 379|
+ |surprise | 0.62| 0.38| 0.47| 677|
+ |neutral | 0.66| 0.58| 0.62| 1787|
+ |micro avg | 0.70| 0.59| 0.64| 5894|
+ |macro avg | 0.61| 0.46| 0.52| 5894|
+ |weighted avg | 0.68| 0.59| 0.63| 5894|
+ |samples avg | 0.62| 0.61| 0.61| 5894|
+
+ Evaluation results on the test split of [lv_twitter_emotions](https://huggingface.co/datasets/SkyWater21/lv_emotions/viewer/combined/lv_twitter_emotions_test):
+ | |Precision|Recall|F1-Score|Support|
+ |--------------|---------|------|--------|-------|
+ |anger | 0.94| 0.87| 0.90| 12013|
+ |disgust | 0.92| 0.92| 0.92| 14117|
+ |fear | 0.74| 0.80| 0.77| 3342|
+ |joy | 0.87| 0.88| 0.87| 5913|
+ |sadness | 0.81| 0.80| 0.81| 4786|
+ |surprise | 0.93| 0.57| 0.71| 1510|
+ |neutral | 0.00| 0.00| 0.00| 0|
+ |micro avg | 0.89| 0.87| 0.88| 41681|
+ |macro avg | 0.74| 0.69| 0.71| 41681|
+ |weighted avg | 0.89| 0.87| 0.88| 41681|
+ |samples avg | 0.86| 0.87| 0.86| 41681|
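Because the config sets `problem_type: multi_label_classification`, inference decodes each of the 7 logits independently through a sigmoid and a threshold, rather than taking a softmax argmax. A minimal sketch of that decoding step, using the `id2label` mapping from `config.json` in this PR; the logit values are hypothetical, and `predict_labels` is a name chosen here for illustration:

```python
import math

# id2label mapping from config.json in this PR
ID2LABEL = {0: "anger", 1: "disgust", 2: "fear", 3: "joy",
            4: "sadness", 5: "surprise", 6: "neutral"}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Multi-label decoding: each logit is thresholded independently,
    so zero, one, or several emotions can fire for the same text."""
    return [ID2LABEL[i] for i, z in enumerate(logits) if sigmoid(z) >= threshold]

# Hypothetical logits for one sentence (illustrative values only)
print(predict_labels([-2.0, -3.1, -1.5, 2.2, -0.4, 0.1, -1.0]))  # → ['joy', 'surprise']
```

The same thresholding is what produces the multi-label precision/recall rows in the evaluation tables above.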
config.json CHANGED
@@ -1,50 +1,50 @@
- {
-   "_name_or_path": "AiLab-IMCS-UL/lvbert",
-   "architectures": [
-     "BertForSequenceClassification"
-   ],
-   "attention_probs_dropout_prob": 0.1,
-   "classifier_dropout": null,
-   "directionality": "bidi",
-   "hidden_act": "gelu",
-   "hidden_dropout_prob": 0.1,
-   "hidden_size": 768,
-   "id2label": {
-     "0": "anger",
-     "1": "disgust",
-     "2": "fear",
-     "3": "joy",
-     "4": "sadness",
-     "5": "surprise",
-     "6": "neutral"
-   },
-   "initializer_range": 0.02,
-   "intermediate_size": 3072,
-   "label2id": {
-     "anger": 0,
-     "disgust": 1,
-     "fear": 2,
-     "joy": 3,
-     "neutral": 6,
-     "sadness": 4,
-     "surprise": 5
-   },
-   "layer_norm_eps": 1e-12,
-   "max_position_embeddings": 512,
-   "model_type": "bert",
-   "num_attention_heads": 12,
-   "num_hidden_layers": 12,
-   "pad_token_id": 0,
-   "pooler_fc_size": 768,
-   "pooler_num_attention_heads": 12,
-   "pooler_num_fc_layers": 3,
-   "pooler_size_per_head": 128,
-   "pooler_type": "first_token_transform",
-   "position_embedding_type": "absolute",
-   "problem_type": "multi_label_classification",
-   "torch_dtype": "float32",
-   "transformers_version": "4.39.3",
-   "type_vocab_size": 2,
-   "use_cache": true,
-   "vocab_size": 32004
- }

+ {
+   "_name_or_path": "AiLab-IMCS-UL/lvbert",
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "directionality": "bidi",
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "anger",
+     "1": "disgust",
+     "2": "fear",
+     "3": "joy",
+     "4": "sadness",
+     "5": "surprise",
+     "6": "neutral"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "anger": 0,
+     "disgust": 1,
+     "fear": 2,
+     "joy": 3,
+     "neutral": 6,
+     "sadness": 4,
+     "surprise": 5
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pooler_fc_size": 768,
+   "pooler_num_attention_heads": 12,
+   "pooler_num_fc_layers": 3,
+   "pooler_size_per_head": 128,
+   "pooler_type": "first_token_transform",
+   "position_embedding_type": "absolute",
+   "problem_type": "multi_label_classification",
+   "torch_dtype": "float32",
+   "transformers_version": "4.45.1",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 32004
+ }
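`config.json` carries both `id2label` and `label2id`, and the two must stay mutually inverse (note that `label2id` is serialized with keys in alphabetical order, which is why `"neutral": 6` appears mid-list). A quick consistency check, with the maps copied from the diff above; note that JSON object keys are strings, so the `id2label` keys need an `int` cast before comparing:

```python
# Maps copied verbatim from config.json in this PR.
id2label = {"0": "anger", "1": "disgust", "2": "fear", "3": "joy",
            "4": "sadness", "5": "surprise", "6": "neutral"}
label2id = {"anger": 0, "disgust": 1, "fear": 2, "joy": 3,
            "neutral": 6, "sadness": 4, "surprise": 5}

# Invert label2id and compare against id2label (with string keys cast to int).
assert {v: k for k, v in label2id.items()} == {int(k): v for k, v in id2label.items()}
print("id2label and label2id are consistent")
```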
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3fdf83f57d45707e774742a21d24d695e6222f14e18d41b5748a8c0f28a9e1d3
+ oid sha256:775169c1aa5b607d4d2f9200db7ffb0a6ab65da7b0f7fa4baab6ffa437407b61
  size 442526732
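The `model.safetensors` entry above is a Git LFS pointer file, not the weights themselves: three key/value lines (`version`, `oid`, `size`) separated by single spaces, which is why only the `oid` line changes when the weights are retrained. A minimal parser sketch (`parse_lfs_pointer` is a name chosen here, not part of any library):

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value fields.
    Each line is '<key> <value>', with the value allowed to contain spaces."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

# Pointer content copied from the new (+) side of the diff above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:775169c1aa5b607d4d2f9200db7ffb0a6ab65da7b0f7fa4baab6ffa437407b61
size 442526732"""

info = parse_lfs_pointer(pointer)
print(info["oid"], info["size"])
```

The `oid` is the SHA-256 of the real file, so after `git lfs pull` it can be checked against the downloaded weights.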
special_tokens_map.json CHANGED
@@ -1,7 +1,7 @@
- {
-   "cls_token": "[CLS]",
-   "mask_token": "[MASK]",
-   "pad_token": "[PAD]",
-   "sep_token": "[SEP]",
-   "unk_token": "[UNK]"
- }

+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -1,57 +1,57 @@
- {
-   "added_tokens_decoder": {
-     "0": {
-       "content": "[PAD]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "1": {
-       "content": "[UNK]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "2": {
-       "content": "[CLS]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "3": {
-       "content": "[SEP]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "4": {
-       "content": "[MASK]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     }
-   },
-   "clean_up_tokenization_spaces": true,
-   "cls_token": "[CLS]",
-   "do_basic_tokenize": true,
-   "do_lower_case": false,
-   "mask_token": "[MASK]",
-   "model_max_length": 1000000000000000019884624838656,
-   "never_split": null,
-   "pad_token": "[PAD]",
-   "sep_token": "[SEP]",
-   "strip_accents": null,
-   "tokenize_chinese_chars": true,
-   "tokenizer_class": "BertTokenizer",
-   "unk_token": "[UNK]"
- }

+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt CHANGED
The diff for this file is too large to render. See raw diff