haukelicht committed
Commit 65f9683 · verified · Parent: 2fbb89c

Upload folder using huggingface_hub
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,167 @@
+ ---
+ base_model: sentence-transformers/all-mpnet-base-v2
+ language:
+ - en
+ license: apache-2.0
+ tags:
+ - economic-attributes
+ - mention-classification
+ - mpnet-base-v2
+ - setfit
+ - multi-label-classification
+ model-index:
+ - name: all-mpnet-base-v2_economic-attributes-classifier
+   results:
+   - task:
+       type: multi-label-classification
+       name: Multi-label classification
+     metrics:
+     - type: _tba_
+       value: -1.0
+     dataset:
+       type: custom
+       name: custom human-labeled multi-label annotation dataset
+ ---
+
+ # Group mention economic attributes classifier
+
+ A multi-label classifier for detecting the **economic attribute** categories referred to in a social group mention, trained with [`setfit`](https://github.com/huggingface/setfit) on top of the lightweight [`sentence-transformers/all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) sentence embedding model.
+
+ The economic attributes classified are:
+
+ | attribute | definition |
+ |:--|:--|
+ | class membership | People described by their membership in or belonging to a social class such as the upper class, the middle class, the lower class, or the working class. |
+ | employment status | People described or categorized by their employment status such as employers, employees, self-employed, or unemployed people. |
+ | education level | People described or categorized by their education level such as students, apprentices, higher education, tertiary education, vocational training, or graduates. |
+ | income/wealth/economic status | People defined or categorized by their income, wealth, or economic status such as high/medium/low income groups, rich/poor people, or homeowners/tenants/homeless. |
+ | occupation/profession | People referred to or categorized according to their occupation or profession such as teachers, farmers, public servants, or police officers. |
+ | ecology of group | People categorized by their relation to the ecology of society such as carbon emitters, coal miners, green employers, green workers, sustainable farmers, or those working in the fossil sector. |
+
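+ For reference, these categories map onto the machine-readable label identifiers stored in the model's `config_setfit.json` (listed here in the order they appear in that file):
+
+ ```python
+ # Label identifiers predicted by the model (from config_setfit.json)
+ LABELS = [
+     "economic__class_membership",
+     "economic__ecology_of_group",
+     "economic__education_level",
+     "economic__employment_status",
+     "economic__income_wealth_economic_status",
+     "economic__occupation_profession",
+ ]
+ ```
+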
+ ## Model Details
+
+ ### Model Description
+
+ Group mention economic attributes classifier
+
+ - **Developed by:** Hauke Licht
+ - **Model type:** mpnet
+ - **Language(s) (NLP):** en
+ - **License:** apache-2.0
+ - **Finetuned from model:** sentence-transformers/all-mpnet-base-v2
+ - **Funded by:** The *Deutsche Forschungsgemeinschaft* (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC 2126/1 – 390838866
+
+ ### Model Sources
+
+ - **Repository:** _tba_
+ - **Paper:** _tba_
+ - **Demo:** [More Information Needed]
+
+ ## Uses
+
+ ### Bias, Risks, and Limitations
+
+ - Evaluation of the classifier on held-out data shows that it makes mistakes.
+ - The model has been finetuned only on human-annotated social group mentions recorded in sentences sampled from the manifestos of European parties (mostly far-right and Green parties). Applying the classifier to other domains can lead to higher error rates.
+ - The data used to finetune the model come from human annotators. Human annotators can be biased, and factors like gender and social background can affect their annotation judgments. This may lead to bias in the detection of specific social groups.
+
+ #### Recommendations
+
+ - Users who want to apply the model outside its training data domain should evaluate its performance on the target data.
+ - Users who want to apply the model outside its training data domain should continue to finetune this model on labeled data from that domain.
+
+ ## Usage
+
+ You can use the model with the [`setfit` python library](https://github.com/huggingface/setfit) (>=1.1.0). Use the code below to get started.
+
+ *Note:* It is recommended to use transformers version >=4.5.5,<=5.0.0 and sentence-transformers version >=4.0.1,<=5.1.0 for compatibility.
+
+ ### Classification
+
+ ```python
+ import torch
+ from setfit import SetFitModel
+
+ model_name = "hauke-licht/all-mpnet-base-v2_economic-attributes-classifier"
+ device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
+ classifier = SetFitModel.from_pretrained(model_name)
+ classifier.to(device)
+
+ # Example mentions
+ mentions = ["working class people", "highly-educated professionals", "people without a stable job"]
+
+ # Get predictions: one multi-hot vector per mention
+ predictions = classifier.predict(mentions)
+ print(predictions)
+
+ # Map predictions to label names
+ predicted_labels = [
+     [classifier.id2label[l] for l, p in enumerate(pred) if p == 1]
+     for pred in predictions
+ ]
+ print(predicted_labels)
+ ```
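+
+ If you need per-label scores instead of hard 0/1 decisions, `SetFitModel` also exposes `predict_proba`. A minimal sketch (the 0.5 cutoff is an illustrative assumption, not a tuned threshold):
+
+ ```python
+ # Per-label probabilities, one row per mention
+ probas = classifier.predict_proba(mentions)
+
+ # Apply a custom decision threshold (0.5 chosen for illustration only)
+ threshold = 0.5
+ scored_labels = [
+     [classifier.id2label[i] for i, p in enumerate(row) if p >= threshold]
+     for row in probas
+ ]
+ print(scored_labels)
+ ```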
+
+ ### Mention embedding
+
+ ```python
+ import torch
+ from sentence_transformers import SentenceTransformer
+
+ model_name = "hauke-licht/all-mpnet-base-v2_economic-attributes-classifier"
+ device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
+
+ # Load the sentence transformer component of the pre-trained classifier
+ model = SentenceTransformer(model_name, device=device)
+
+ # Example mentions
+ mentions = ["working class people", "highly-educated professionals", "people without a stable job"]
+
+ # Compute mention embeddings
+ embeddings = model.encode(mentions)
+ ```
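+
+ Because the model's `config_sentence_transformers.json` sets `similarity_fn_name` to `"cosine"` and the pipeline L2-normalizes its outputs, mention embeddings can be compared directly. A short sketch using the `similarity` helper available in recent sentence-transformers versions:
+
+ ```python
+ # Pairwise cosine similarities between the mention embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities)  # 3x3 tensor; diagonal entries are ~1.0
+ ```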
+
+ ## Training Details
+
+ ### Training Data
+
+ The train, dev, and test splits used for model finetuning and evaluation will be made available on GitHub upon publication of the associated research paper.
+
+ ### Training Procedure
+
+ #### Training Hyperparameters
+
+ The settings below follow `setfit` conventions, where tuple values apply to the embedding body and the classification head, respectively (see the sketch after this list):
+
+ - num epochs: (1, 4)
+ - train batch sizes: (16, 4)
+ - body train max steps: 100
+ - head learning rate: 0.030
+ - L2 weight: 0.015
+ - warmup proportion: 0.10
+
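+ As a hedged sketch (the exact training call is not part of this repository), these settings map onto `setfit.TrainingArguments` roughly as follows:
+
+ ```python
+ from setfit import TrainingArguments
+
+ # Sketch: the hyperparameters above mapped onto setfit's API.
+ # Tuples give separate values for the embedding body and the classification head.
+ args = TrainingArguments(
+     num_epochs=(1, 4),         # num epochs (body, head)
+     batch_size=(16, 4),        # train batch sizes (body, head)
+     max_steps=100,             # body train max steps
+     head_learning_rate=0.030,  # head learning rate
+     l2_weight=0.015,           # L2 weight
+     warmup_proportion=0.10,    # warmup proportion
+ )
+ ```
+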
+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ The train, dev, and test splits used for model finetuning and evaluation will be made available on GitHub upon publication of the associated research paper.
+
+ ## Citation
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ hauke.licht@uibk.ac.at
config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "architectures": [
+     "MPNetModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "dtype": "float32",
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "mpnet",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "relative_attention_num_buckets": 32,
+   "transformers_version": "4.57.1",
+   "vocab_size": 30527
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "__version__": {
+     "sentence_transformers": "5.1.0",
+     "transformers": "4.57.1",
+     "pytorch": "2.6.0+cu124"
+   },
+   "model_type": "SentenceTransformer",
+   "prompts": {
+     "query": "",
+     "document": ""
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
config_setfit.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "normalize_embeddings": true,
+   "labels": [
+     "economic__class_membership",
+     "economic__ecology_of_group",
+     "economic__education_level",
+     "economic__employment_status",
+     "economic__income_wealth_economic_status",
+     "economic__occupation_profession"
+   ]
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:962fb01b274737b31d49e50d1e2572a1e78a7c88018af35a8a1ec3d9d9bbfccf
+ size 437967672
model_head.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:67b014a2acc4397eb28049ee99646eb0361e1a7cf90f074f0479b57c0c14abf6
+ size 20377
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
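
The three modules above define the sentence-embedding pipeline: the MPNet transformer, mean pooling over token embeddings (per `1_Pooling/config.json`), and L2 normalization. For illustration, a sketch of the equivalent computation in plain transformers (loading via `SentenceTransformer`, as in the README, is the intended path):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "hauke-licht/all-mpnet-base-v2_economic-attributes-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

batch = tokenizer(["working class people"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, 768)

# Mean pooling over non-padding tokens ("pooling_mode_mean_tokens": true)
mask = batch["attention_mask"].unsqueeze(-1).float()
embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# L2 normalization (the 2_Normalize module)
embedding = F.normalize(embedding, p=2, dim=1)
```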
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 384,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,73 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "104": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "30526": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "do_lower_case": true,
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "max_length": 128,
+   "model_max_length": 384,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "MPNetTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff