E-katrin committed
Commit ff83b1d · verified · 1 parent: 468417a

Upload ConlluTokenClassificationPipeline

Files changed (10)
  1. README.md +199 -0
  2. config.json +781 -0
  3. configuration.py +51 -0
  4. dependency_classifier.py +301 -0
  5. encoder.py +144 -0
  6. mlp_classifier.py +46 -0
  7. model.safetensors +3 -0
  8. modeling_parser.py +171 -0
  9. pipeline.py +236 -0
  10. utils.py +69 -0
README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ library_name: transformers
+ tags: []
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
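The `lemma_rule` vocabulary in the config.json added by this commit encodes lemmatization as edit scripts of the form `cut_prefix=P|cut_suffix=S|append_suffix=A`: cut `P` characters from the front of the word form, `S` from the end, then append the suffix `A`. As a minimal sketch of how such a rule might be decoded and applied (the repository's actual logic lives in `utils.py` and may differ; `apply_lemma_rule` is a hypothetical helper, not part of the uploaded code):

```python
def apply_lemma_rule(form: str, rule: str) -> str:
    """Apply an edit-script lemma rule of the form
    'cut_prefix=P|cut_suffix=S|append_suffix=A' to a word form."""
    # Parse the rule string into its three fields.
    fields = dict(kv.split("=", 1) for kv in rule.split("|"))
    cut_prefix = int(fields["cut_prefix"])
    cut_suffix = int(fields["cut_suffix"])
    append_suffix = fields["append_suffix"]
    # Trim the affixes, then attach the replacement suffix.
    stem = form[cut_prefix:len(form) - cut_suffix]
    return stem + append_suffix

# Swedish plural "flickor" -> lemma "flicka" via rule 15
# ("cut_prefix=0|cut_suffix=2|append_suffix=a").
print(apply_lemma_rule("flickor", "cut_prefix=0|cut_suffix=2|append_suffix=a"))  # -> flicka
```

Under this reading, rule 56 (`cut_prefix=0|cut_suffix=3|append_suffix=å`) would map the past-tense form "gick" to the lemma "gå", which matches how the classifier can cover irregular inflection with a small closed set of rules.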
config.json ADDED
@@ -0,0 +1,781 @@
+ {
+ "activation": "relu",
+ "architectures": [
+ "CobaldParser"
+ ],
+ "auto_map": {
+ "AutoConfig": "configuration.CobaldParserConfig",
+ "AutoModel": "modeling_parser.CobaldParser"
+ },
+ "consecutive_null_limit": 3,
+ "custom_pipelines": {
+ "conllu-parsing": {
+ "impl": "pipeline.ConlluTokenClassificationPipeline",
+ "pt": [
+ "AutoModel"
+ ],
+ "tf": [],
+ "type": "text"
+ }
+ },
+ "deepslot_classifier_hidden_size": 256,
+ "dependency_classifier_hidden_size": 128,
+ "dropout": 0.1,
+ "encoder_model_name": "xlm-roberta-base",
+ "lemma_classifier_hidden_size": 512,
+ "lora_alpha": 16,
+ "lora_dropout": 0.05,
+ "lora_r": 8,
+ "lora_target_modules": [
+ "q_proj",
+ "v_proj"
+ ],
+ "misc_classifier_hidden_size": 512,
+ "model_type": "cobald_parser",
+ "morphology_classifier_hidden_size": 512,
+ "null_classifier_hidden_size": 512,
+ "semclass_classifier_hidden_size": 512,
+ "torch_dtype": "float32",
+ "transformers_version": "4.52.2",
+ "use_lora": true,
+ "vocabulary": {
+ "deepslot": {
+ "0": "Addition",
+ "1": "AdditionalParticipant",
+ "2": "Addressee",
+ "3": "Agent",
+ "4": "Agent_Metaphoric",
+ "5": "BeneMalefactive",
+ "6": "Cause",
+ "7": "Ch_Parameter",
+ "8": "Ch_Reference",
+ "9": "Characteristic",
+ "10": "ClassifiedEntity",
+ "11": "Comparison",
+ "12": "ComparisonBase",
+ "13": "Concession",
+ "14": "Concurrent",
+ "15": "Condition",
+ "16": "ContrAgent",
+ "17": "ContrObject",
+ "18": "Correlative",
+ "19": "Criterion",
+ "20": "Degree",
+ "21": "DegreeNumerative",
+ "22": "Elective",
+ "23": "Empty_Subject_It",
+ "24": "Experiencer",
+ "25": "Experiencer_Metaphoric",
+ "26": "Function",
+ "27": "Instrument_Situation",
+ "28": "Landmark",
+ "29": "Limitation",
+ "30": "Locative",
+ "31": "Locative_FinalPoint",
+ "32": "Member",
+ "33": "MetaphoricLocative",
+ "34": "Motive",
+ "35": "Name_Title",
+ "36": "Object",
+ "37": "Object_Relation",
+ "38": "Object_Situation",
+ "39": "Opposition",
+ "40": "OrderInTimeAndSpace",
+ "41": "Parenthetical",
+ "42": "Part",
+ "43": "Part_Situation",
+ "44": "ParticipleRelativeClause",
+ "45": "Possessor",
+ "46": "Possessor_Metaphoric",
+ "47": "Predicate",
+ "48": "Predicate_Noun",
+ "49": "PrincipleOfOrganization",
+ "50": "Purpose",
+ "51": "QuantifiedEntity",
+ "52": "Quantity",
+ "53": "Raising_Target",
+ "54": "Relative",
+ "55": "Resultative",
+ "56": "SetEnvironment",
+ "57": "Set_General",
+ "58": "Source",
+ "59": "Specification",
+ "60": "Specifier_Number",
+ "61": "Sphere",
+ "62": "StaffOfPossessors",
+ "63": "Standpoint",
+ "64": "State",
+ "65": "SupportedEntity",
+ "66": "Theme",
+ "67": "Time",
+ "68": "Vocative"
+ },
+ "eud_deprel": {
+ "0": "acl",
+ "1": "acl:att",
+ "2": "acl:cleft",
+ "3": "acl:med",
+ "4": "acl:mot",
+ "5": "acl:om",
+ "6": "acl:p\u00e5",
+ "7": "acl:relcl",
+ "8": "acl:som",
+ "9": "acl:\u00e4n",
+ "10": "advcl",
+ "11": "advcl:att",
+ "12": "advcl:d\u00e4rf\u00f6r_att",
+ "13": "advcl:d\u00e5",
+ "14": "advcl:eftersom",
+ "15": "advcl:f\u00f6r_att",
+ "16": "advcl:f\u00f6rutsatt_att",
+ "17": "advcl:innan",
+ "18": "advcl:liksom",
+ "19": "advcl:med_att",
+ "20": "advcl:n\u00e4r",
+ "21": "advcl:om",
+ "22": "advcl:p\u00e5",
+ "23": "advcl:samtidigt_som",
+ "24": "advcl:sedan",
+ "25": "advcl:som",
+ "26": "advcl:\u00e4n",
+ "27": "advmod",
+ "28": "amod",
+ "29": "appos",
+ "30": "aux",
+ "31": "aux:pass",
+ "32": "case",
+ "33": "cc",
+ "34": "ccomp",
+ "35": "compound:prt",
+ "36": "conj",
+ "37": "conj:and",
+ "38": "conj:eller",
+ "39": "conj:fast",
+ "40": "conj:men",
+ "41": "conj:och",
+ "42": "conj:respektive",
+ "43": "conj:samt",
+ "44": "conj:som",
+ "45": "conj:ty",
+ "46": "conj:utan",
+ "47": "cop",
+ "48": "csubj",
+ "49": "csubj:pass",
+ "50": "det",
+ "51": "dislocated",
+ "52": "expl",
+ "53": "fixed",
+ "54": "flat",
+ "55": "iobj",
+ "56": "mark",
+ "57": "nmod",
+ "58": "nmod:av",
+ "59": "nmod:efter",
+ "60": "nmod:fr\u00e5n",
+ "61": "nmod:f\u00f6r",
+ "62": "nmod:hos",
+ "63": "nmod:i",
+ "64": "nmod:inom",
+ "65": "nmod:med",
+ "66": "nmod:mellan",
+ "67": "nmod:mot",
+ "68": "nmod:oavsett",
+ "69": "nmod:om",
+ "70": "nmod:poss",
+ "71": "nmod:p\u00e5",
+ "72": "nmod:till",
+ "73": "nmod:under",
+ "74": "nmod:utanf\u00f6r",
+ "75": "nmod:vid",
+ "76": "nmod:\u00e5t",
+ "77": "nsubj",
+ "78": "nsubj:pass",
+ "79": "nsubj:xsubj",
+ "80": "nummod",
+ "81": "obj",
+ "82": "obl",
+ "83": "obl:agent",
+ "84": "obl:as",
+ "85": "obl:av",
+ "86": "obl:bland",
+ "87": "obl:efter",
+ "88": "obl:enligt",
+ "89": "obl:for",
+ "90": "obl:fr\u00e5n",
+ "91": "obl:f\u00f6r",
+ "92": "obl:genom",
+ "93": "obl:hos",
+ "94": "obl:i",
+ "95": "obl:inom",
+ "96": "obl:med",
+ "97": "obl:med_avseende_p\u00e5",
+ "98": "obl:mellan",
+ "99": "obl:mot",
+ "100": "obl:om",
+ "101": "obl:omkring",
+ "102": "obl:p\u00e5",
+ "103": "obl:runtomkring",
+ "104": "obl:som",
+ "105": "obl:till",
+ "106": "obl:trots",
+ "107": "obl:under",
+ "108": "obl:ur",
+ "109": "obl:utan",
+ "110": "obl:utanf\u00f6r",
+ "111": "obl:vid",
+ "112": "obl:\u00e4n",
+ "113": "obl:\u00e5",
+ "114": "obl:\u00e5t",
+ "115": "parataxis",
+ "116": "punct",
+ "117": "ref",
+ "118": "root",
+ "119": "vocative",
+ "120": "xcomp"
+ },
+ "joint_feats": {
+ "0": "ADJ#Adjective#Abbr=Yes",
+ "1": "ADJ#Adjective#Case=Nom|Definite=Def|Degree=Pos",
+ "2": "ADJ#Adjective#Case=Nom|Definite=Def|Degree=Pos|Gender=Com|Number=Sing",
+ "3": "ADJ#Adjective#Case=Nom|Definite=Def|Degree=Pos|Tense=Past|VerbForm=Part",
+ "4": "ADJ#Adjective#Case=Nom|Definite=Def|Degree=Sup",
+ "5": "ADJ#Adjective#Case=Nom|Definite=Ind|Degree=Pos",
+ "6": "ADJ#Adjective#Case=Nom|Definite=Ind|Degree=Pos|Gender=Com|Number=Sing",
+ "7": "ADJ#Adjective#Case=Nom|Definite=Ind|Degree=Pos|Gender=Com|Number=Sing|Tense=Past|VerbForm=Part",
+ "8": "ADJ#Adjective#Case=Nom|Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing",
+ "9": "ADJ#Adjective#Case=Nom|Definite=Ind|Degree=Pos|Number=Plur",
+ "10": "ADJ#Adjective#Case=Nom|Definite=Ind|Degree=Pos|Number=Sing",
+ "11": "ADJ#Adjective#Case=Nom|Degree=Cmp",
+ "12": "ADJ#Adjective#Case=Nom|Degree=Pos",
+ "13": "ADJ#Adjective#Case=Nom|Degree=Pos|Number=Plur",
+ "14": "ADJ#Adjective#Case=Nom|Degree=Pos|Tense=Pres|VerbForm=Part",
+ "15": "ADJ#Adjective#Case=Nom|Number=Plur|Tense=Past|VerbForm=Part",
+ "16": "ADJ#Adjective#Degree=Pos|Foreign=Yes",
+ "17": "ADJ#Adverb#Case=Nom|Definite=Ind|Degree=Pos|Gender=Com|Number=Sing",
+ "18": "ADJ#Adverb#Case=Nom|Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing",
+ "19": "ADJ#Adverb#Case=Nom|Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part",
+ "20": "ADJ#Adverb#Case=Nom|Definite=Ind|Degree=Pos|Number=Plur",
+ "21": "ADJ#Noun#Case=Nom|Definite=Def|Degree=Pos",
+ "22": "ADJ#Noun#Case=Nom|Degree=Pos",
+ "23": "ADJ#Numeral#Case=Nom|Definite=Def|Degree=Pos",
+ "24": "ADJ#Numeral#Case=Nom|NumType=Ord",
+ "25": "ADJ#Verb#Case=Nom|Definite=Def|Degree=Pos|Tense=Past|VerbForm=Part",
+ "26": "ADJ#Verb#Case=Nom|Definite=Ind|Degree=Pos|Gender=Com|Number=Sing|Tense=Past|VerbForm=Part",
+ "27": "ADJ#Verb#Case=Nom|Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part",
+ "28": "ADJ#Verb#Case=Nom|Definite=Ind|Degree=Pos|Number=Plur",
+ "29": "ADJ#Verb#Case=Nom|Definite=Ind|Degree=Pos|Number=Plur|Tense=Past|VerbForm=Part",
+ "30": "ADJ#Verb#Case=Nom|Definite=Ind|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part",
+ "31": "ADJ#Verb#Case=Nom|Degree=Pos|Tense=Pres|VerbForm=Part",
+ "32": "ADJ#_#Case=Nom|Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing",
+ "33": "ADJ#_#Case=Nom|Definite=Ind|Degree=Pos|Number=Plur",
+ "34": "ADJ#_#Case=Nom|Definite=Ind|Degree=Pos|Number=Sing|Tense=Past|VerbForm=Part",
+ "35": "ADJ#_#Case=Nom|Degree=Pos",
+ "36": "ADP#Adjective#_",
+ "37": "ADP#Adverb#_",
+ "38": "ADP#Conjunction#_",
+ "39": "ADP#Preposition#_",
+ "40": "ADP#_#_",
+ "41": "ADV#Adjective#_",
+ "42": "ADV#Adverb#Abbr=Yes",
+ "43": "ADV#Adverb#Degree=Cmp",
+ "44": "ADV#Adverb#Degree=Pos",
+ "45": "ADV#Adverb#Degree=Sup",
+ "46": "ADV#Adverb#Degree=Sup|Polarity=Neg",
+ "47": "ADV#Adverb#Polarity=Neg",
+ "48": "ADV#Adverb#_",
+ "49": "ADV#Conjunction#_",
+ "50": "ADV#Invariable#Degree=Cmp",
+ "51": "ADV#Invariable#Degree=Sup",
+ "52": "ADV#Noun#_",
+ "53": "ADV#Pronoun#Definite=Ind|Gender=Neut|Number=Sing|PronType=Prs",
+ "54": "ADV#Pronoun#_",
+ "55": "ADV#_#Degree=Cmp",
+ "56": "ADV#_#Degree=Sup",
+ "57": "ADV#_#_",
+ "58": "AUX#Verb#Mood=Ind|Tense=Past|VerbForm=Fin|Voice=Act",
+ "59": "AUX#Verb#Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act",
+ "60": "AUX#Verb#VerbForm=Inf|Voice=Act",
+ "61": "AUX#Verb#VerbForm=Sup|Voice=Act",
+ "62": "CCONJ#Conjunction#_",
+ "63": "CCONJ#_#_",
+ "64": "DET#Adjective#Gender=Com|Number=Sing|PronType=Tot",
+ "65": "DET#Adjective#Gender=Neut|Number=Sing|PronType=Tot",
+ "66": "DET#Adjective#Number=Plur|PronType=Tot",
+ "67": "DET#Article#Definite=Def|Gender=Com|Number=Sing|PronType=Art",
+ "68": "DET#Article#Definite=Def|Gender=Neut|Number=Sing|PronType=Art",
+ "69": "DET#Article#Definite=Def|Number=Plur|PronType=Art",
+ "70": "DET#Article#Definite=Ind|Gender=Com|Number=Sing|PronType=Art",
+ "71": "DET#Article#Definite=Ind|Gender=Neut|Number=Sing|PronType=Art",
+ "72": "DET#Article#Definite=Ind|Gender=Neut|Number=Sing|PronType=Artt",
+ "73": "DET#Article#Definite=Ind|PronType=Art",
+ "74": "DET#Numeral#Definite=Ind|Gender=Neut|Number=Sing|PronType=Art",
+ "75": "DET#Pronoun#Definite=Def|Gender=Com|Number=Sing|PronType=Art",
+ "76": "DET#Pronoun#Definite=Def|Gender=Com|Number=Sing|PronType=Dem",
+ "77": "DET#Pronoun#Definite=Def|Gender=Neut|Number=Sing|PronType=Art",
+ "78": "DET#Pronoun#Definite=Def|Gender=Neut|Number=Sing|PronType=Dem",
+ "79": "DET#Pronoun#Definite=Def|Number=Plur|PronType=Art",
+ "80": "DET#Pronoun#Definite=Def|Number=Plur|PronType=Dem",
+ "81": "DET#Pronoun#Definite=Def|Number=Plur|PronType=Tot",
+ "82": "DET#Pronoun#Definite=Ind|Gender=Com|Number=Sing|PronType=Ind",
+ "83": "DET#Pronoun#Definite=Ind|Gender=Com|Number=Sing|PronType=Int",
+ "84": "DET#Pronoun#Definite=Ind|Gender=Neut|Number=Sing|PronType=Ind",
+ "85": "DET#Pronoun#Definite=Ind|Gender=Neut|Number=Sing|PronType=Int",
+ "86": "DET#Pronoun#Definite=Ind|Gender=Neut|Number=Sing|PronType=Tot",
+ "87": "DET#Pronoun#Definite=Ind|Number=Plur|PronType=Ind",
+ "88": "DET#Pronoun#Definite=Ind|Number=Sing|PronType=Tot",
+ "89": "DET#Pronoun#PronType=Ind",
+ "90": "DET#_#Gender=Neut|Number=Sing|PronType=Tot",
+ "91": "NOUN#Noun#Abbr=Yes",
+ "92": "NOUN#Noun#Case=Gen|Definite=Def|Gender=Com|Number=Plur",
+ "93": "NOUN#Noun#Case=Gen|Definite=Def|Gender=Com|Number=Sing",
+ "94": "NOUN#Noun#Case=Gen|Definite=Def|Gender=Neut|Number=Plur",
+ "95": "NOUN#Noun#Case=Gen|Definite=Def|Gender=Neut|Number=Sing",
+ "96": "NOUN#Noun#Case=Gen|Definite=Ind|Gender=Com|Number=Plur",
+ "97": "NOUN#Noun#Case=Gen|Definite=Ind|Gender=Neut|Number=Plur",
+ "98": "NOUN#Noun#Case=Gen|Definite=Ind|Gender=Neut|Number=Sing",
+ "99": "NOUN#Noun#Case=Nom|Definite=Def|Gender=Com|Number=Plur",
+ "100": "NOUN#Noun#Case=Nom|Definite=Def|Gender=Com|Number=Sing",
+ "101": "NOUN#Noun#Case=Nom|Definite=Def|Gender=Neut|Number=Plur",
+ "102": "NOUN#Noun#Case=Nom|Definite=Def|Gender=Neut|Number=Sing",
+ "103": "NOUN#Noun#Case=Nom|Definite=Ind|Gender=Com|Number=Plur",
+ "104": "NOUN#Noun#Case=Nom|Definite=Ind|Gender=Com|Number=Sing",
+ "105": "NOUN#Noun#Case=Nom|Definite=Ind|Gender=Neut|Number=Plur",
+ "106": "NOUN#Noun#Case=Nom|Definite=Ind|Gender=Neut|Number=Sing",
+ "107": "NOUN#Noun#Case=Nom|Definite=Ind|Gender=Neut|Number=Singg",
+ "108": "NOUN#Noun#Gender=Com",
+ "109": "NOUN#Noun#Number=Plur",
+ "110": "NOUN#Noun#Number=Sing",
+ "111": "NOUN#Noun#_",
+ "112": "NOUN#_#Case=Nom|Definite=Def|Gender=Com|Number=Sing",
+ "113": "NOUN#_#Case=Nom|Definite=Def|Gender=Neut|Number=Sing",
+ "114": "NOUN#_#Case=Nom|Definite=Ind|Gender=Com|Number=Sing",
+ "115": "NOUN#_#Case=Nom|Definite=Ind|Gender=Neut|Number=Sing",
+ "116": "NUM#Article#Case=Nom|Definite=Ind|Gender=Com|Number=Sing|NumType=Card",
+ "117": "NUM#Noun#Case=Nom|NumType=Card",
+ "118": "NUM#Numeral#Case=Nom|Definite=Ind|Gender=Com|Number=Sing|NumType=Card",
+ "119": "NUM#Numeral#Case=Nom|NumType=Card",
+ "120": "PART#Particle#Polarity=Neg",
+ "121": "PART#Preposition#_",
+ "122": "PRON#Adjective#Definite=Ind|Number=Plur|PronType=Ind",
+ "123": "PRON#Adjective#Definite=Ind|Number=Plur|PronType=Tot",
+ "124": "PRON#Adverb#Definite=Def|Gender=Neut|Number=Sing|PronType=Prs",
+ "125": "PRON#Adverb#Definite=Ind|Gender=Neut|Number=Sing|PronType=Ind",
+ "126": "PRON#Adverb#_",
+ "127": "PRON#Article#Case=Nom|Definite=Def|Number=Plur|PronType=Prs",
+ "128": "PRON#Conjunction#Definite=Ind|Gender=Neut|Number=Sing|PronType=Int",
+ "129": "PRON#Conjunction#PronType=Rel",
+ "130": "PRON#Noun#Case=Nom|Definite=Ind|Gender=Com|Number=Sing|PronType=Ind",
+ "131": "PRON#Noun#Definite=Def|Gender=Com|Number=Sing|PronType=Prs",
+ "132": "PRON#Noun#Definite=Def|Number=Plur|PronType=Prs",
+ "133": "PRON#Noun#Definite=Ind|Number=Plur|PronType=Ind",
+ "134": "PRON#Numeral#Definite=Ind|Gender=Com|Number=Sing|PronType=Prs",
+ "135": "PRON#Numeral#Definite=Ind|Gender=Neut|Number=Sing|PronType=Prs",
+ "136": "PRON#Pronoun#Case=Acc|Definite=Def|Gender=Com|Number=Plur|PronType=Prs",
+ "137": "PRON#Pronoun#Case=Acc|Definite=Def|Gender=Com|Number=Sing|PronType=Prs",
+ "138": "PRON#Pronoun#Case=Acc|Definite=Def|Number=Plur|PronType=Prs",
+ "139": "PRON#Pronoun#Case=Acc|Definite=Def|PronType=Prs",
+ "140": "PRON#Pronoun#Case=Gen|Definite=Def|Gender=Com|Number=Sing|Poss=Yes|PronType=Prs",
+ "141": "PRON#Pronoun#Case=Nom|Definite=Def|Gender=Com|Number=Plur|PronType=Prs",
+ "142": "PRON#Pronoun#Case=Nom|Definite=Def|Gender=Com|Number=Sing|PronType=Prs",
+ "143": "PRON#Pronoun#Case=Nom|Definite=Def|Number=Plur|PronType=Prs",
+ "144": "PRON#Pronoun#Case=Nom|Definite=Ind|Gender=Com|Number=Sing|PronType=Ind",
+ "145": "PRON#Pronoun#Case=Nom|Definite=Ind|Gender=Com|Number=Sing|PronType=Rel",
+ "146": "PRON#Pronoun#Definite=Def|Gender=Com|Number=Sing|Poss=Yes|PronType=Prs",
+ "147": "PRON#Pronoun#Definite=Def|Gender=Com|Number=Sing|PronType=Prs",
+ "148": "PRON#Pronoun#Definite=Def|Gender=Neut|Number=Sing|Poss=Yes|PronType=Prs",
+ "149": "PRON#Pronoun#Definite=Def|Gender=Neut|Number=Sing|PronType=Dem",
+ "150": "PRON#Pronoun#Definite=Def|Gender=Neut|Number=Sing|PronType=Prs",
+ "151": "PRON#Pronoun#Definite=Def|Number=Plur|Poss=Yes|PronType=Prs",
+ "152": "PRON#Pronoun#Definite=Def|Number=Plur|PronType=Dem",
+ "153": "PRON#Pronoun#Definite=Def|Number=Plur|PronType=Prs",
+ "154": "PRON#Pronoun#Definite=Def|Poss=Yes|PronType=Prs",
+ "155": "PRON#Pronoun#Definite=Ind|Gender=Com|Number=Sing|PronType=Ind",
+ "156": "PRON#Pronoun#Definite=Ind|Gender=Neut|Number=Sing|PronType=Ind",
+ "157": "PRON#Pronoun#Definite=Ind|Gender=Neut|Number=Sing|PronType=Int",
+ "158": "PRON#Pronoun#Definite=Ind|Gender=Neut|Number=Sing|PronType=Neg",
+ "159": "PRON#Pronoun#Definite=Ind|Gender=Neut|Number=Sing|PronType=Prs",
+ "160": "PRON#Pronoun#Definite=Ind|Gender=Neut|Number=Sing|PronType=Rel",
+ "161": "PRON#Pronoun#Definite=Ind|Gender=Neut|Number=Sing|PronType=Tot",
+ "162": "PRON#Pronoun#Definite=Ind|Number=Plur|PronType=Rel",
+ "163": "PRON#Pronoun#Number=Plur",
+ "164": "PRON#Pronoun#PronType=Rel",
+ "165": "PRON#Verb#Definite=Def|Gender=Neut|Number=Sing|Poss=Yes|PronType=Prs",
+ "166": "PRON#_#Case=Acc|Definite=Def|PronType=Prs",
+ "167": "PRON#_#Definite=Ind|Gender=Neut|Number=Sing|PronType=Ind",
+ "168": "PRON#_#Definite=Ind|Gender=Neut|Number=Sing|PronType=Prs",
+ "169": "PROPN#Noun#Case=Gen",
+ "170": "PROPN#Noun#Case=Nom",
+ "171": "PROPN#Noun#Case=Nom|Definite=Ind|Gender=Com|Number=Sing",
+ "172": "PUNCT#PUNCT#_",
+ "173": "SCONJ#Conjunction#_",
+ "174": "SCONJ#Preposition#_",
+ "175": "SCONJ#Pronoun#Definite=Ind|Gender=Neut|Number=Sing|PronType=Int",
+ "176": "SCONJ#_#_",
+ "177": "VERB#Adjective#Case=Nom|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass",
+ "178": "VERB#Verb#Case=Nom|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass",
+ "179": "VERB#Verb#Mood=Imp|VerbForm=Fin|Voice=Act",
+ "180": "VERB#Verb#Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin",
+ "181": "VERB#Verb#Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
+ "182": "VERB#Verb#Mood=Ind|Tense=Past|VerbForm=Fin",
+ "183": "VERB#Verb#Mood=Ind|Tense=Past|VerbForm=Fin|Voice=Act",
+ "184": "VERB#Verb#Mood=Ind|Tense=Past|VerbForm=Fin|Voice=Pass",
+ "185": "VERB#Verb#Mood=Ind|Tense=Pres|VerbForm=Fin",
+ "186": "VERB#Verb#Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act",
+ "187": "VERB#Verb#Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass",
+ "188": "VERB#Verb#Tense=Past|VerbForm=Part",
+ "189": "VERB#Verb#VerbForm=Inf",
+ "190": "VERB#Verb#VerbForm=Inf|Voice=Act",
+ "191": "VERB#Verb#VerbForm=Inf|Voice=Pass",
+ "192": "VERB#Verb#VerbForm=Sup",
+ "193": "VERB#Verb#VerbForm=Sup|Voice=Act",
+ "194": "VERB#Verb#VerbForm=Sup|Voice=Pass"
+ },
+ "lemma_rule": {
+ "0": "cut_prefix=0|cut_suffix=0|append_suffix=",
+ "1": "cut_prefix=0|cut_suffix=0|append_suffix=a",
+ "2": "cut_prefix=0|cut_suffix=0|append_suffix=ma",
+ "3": "cut_prefix=0|cut_suffix=1|append_suffix=",
+ "4": "cut_prefix=0|cut_suffix=1|append_suffix=a",
+ "5": "cut_prefix=0|cut_suffix=1|append_suffix=as",
+ "6": "cut_prefix=0|cut_suffix=1|append_suffix=d",
+ "7": "cut_prefix=0|cut_suffix=1|append_suffix=en",
+ "8": "cut_prefix=0|cut_suffix=1|append_suffix=g",
+ "9": "cut_prefix=0|cut_suffix=1|append_suffix=ja",
+ "10": "cut_prefix=0|cut_suffix=1|append_suffix=n",
+ "11": "cut_prefix=0|cut_suffix=1|append_suffix=na",
+ "12": "cut_prefix=0|cut_suffix=1|append_suffix=ola",
+ "13": "cut_prefix=0|cut_suffix=1|append_suffix=ym",
+ "14": "cut_prefix=0|cut_suffix=2|append_suffix=",
+ "15": "cut_prefix=0|cut_suffix=2|append_suffix=a",
+ "16": "cut_prefix=0|cut_suffix=2|append_suffix=an",
+ "17": "cut_prefix=0|cut_suffix=2|append_suffix=ara",
+ "18": "cut_prefix=0|cut_suffix=2|append_suffix=dd",
+ "19": "cut_prefix=0|cut_suffix=2|append_suffix=e",
+ "20": "cut_prefix=0|cut_suffix=2|append_suffix=en",
+ "21": "cut_prefix=0|cut_suffix=2|append_suffix=g",
+ "22": "cut_prefix=0|cut_suffix=2|append_suffix=i",
+ "23": "cut_prefix=0|cut_suffix=2|append_suffix=igga",
+ "24": "cut_prefix=0|cut_suffix=2|append_suffix=ja",
+ "25": "cut_prefix=0|cut_suffix=2|append_suffix=mal",
+ "26": "cut_prefix=0|cut_suffix=2|append_suffix=n",
+ "27": "cut_prefix=0|cut_suffix=2|append_suffix=na",
+ "28": "cut_prefix=0|cut_suffix=2|append_suffix=on",
+ "29": "cut_prefix=0|cut_suffix=2|append_suffix=u",
+ "30": "cut_prefix=0|cut_suffix=2|append_suffix=um",
+ "31": "cut_prefix=0|cut_suffix=2|append_suffix=unna",
+ "32": "cut_prefix=0|cut_suffix=2|append_suffix=ycket",
+ "33": "cut_prefix=0|cut_suffix=2|append_suffix=yda",
+ "34": "cut_prefix=0|cut_suffix=2|append_suffix=yta",
+ "35": "cut_prefix=0|cut_suffix=2|append_suffix=\u00e5",
+ "36": "cut_prefix=0|cut_suffix=2|append_suffix=\u00e5ta",
+ "37": "cut_prefix=0|cut_suffix=3|append_suffix=",
+ "38": "cut_prefix=0|cut_suffix=3|append_suffix=a",
+ "39": "cut_prefix=0|cut_suffix=3|append_suffix=an",
+ "40": "cut_prefix=0|cut_suffix=3|append_suffix=and_annat",
+ "41": "cut_prefix=0|cut_suffix=3|append_suffix=as",
+ "42": "cut_prefix=0|cut_suffix=3|append_suffix=e",
+ "43": "cut_prefix=0|cut_suffix=3|append_suffix=er",
+ "44": "cut_prefix=0|cut_suffix=3|append_suffix=i",
+ "45": "cut_prefix=0|cut_suffix=3|append_suffix=jag",
+ "46": "cut_prefix=0|cut_suffix=3|append_suffix=liten",
+ "47": "cut_prefix=0|cut_suffix=3|append_suffix=nan",
+ "48": "cut_prefix=0|cut_suffix=3|append_suffix=nna",
+ "49": "cut_prefix=0|cut_suffix=3|append_suffix=ola",
+ "50": "cut_prefix=0|cut_suffix=3|append_suffix=r",
+ "51": "cut_prefix=0|cut_suffix=3|append_suffix=ra",
+ "52": "cut_prefix=0|cut_suffix=3|append_suffix=vi",
+ "53": "cut_prefix=0|cut_suffix=3|append_suffix=ycket",
+ "54": "cut_prefix=0|cut_suffix=3|append_suffix=\u00e4ga",
+ "55": "cut_prefix=0|cut_suffix=3|append_suffix=\u00e4gga",
+ "56": "cut_prefix=0|cut_suffix=3|append_suffix=\u00e5",
+ "57": "cut_prefix=0|cut_suffix=3|append_suffix=\u00e5_kallad",
+ "58": "cut_prefix=0|cut_suffix=4|append_suffix=",
+ "59": "cut_prefix=0|cut_suffix=4|append_suffix=a",
+ "60": "cut_prefix=0|cut_suffix=4|append_suffix=ader",
+ "61": "cut_prefix=0|cut_suffix=4|append_suffix=an",
+ "62": "cut_prefix=0|cut_suffix=4|append_suffix=e",
+ "63": "cut_prefix=0|cut_suffix=4|append_suffix=ola",
+ "64": "cut_prefix=0|cut_suffix=4|append_suffix=on",
+ "65": "cut_prefix=0|cut_suffix=4|append_suffix=or",
+ "66": "cut_prefix=0|cut_suffix=4|append_suffix=ot",
+ "67": "cut_prefix=0|cut_suffix=4|append_suffix=r",
+ "68": "cut_prefix=0|cut_suffix=4|append_suffix=ra",
+ "69": "cut_prefix=0|cut_suffix=4|append_suffix=\u00e5g",
+ "70": "cut_prefix=0|cut_suffix=4|append_suffix=\u00f6ra",
+ "71": "cut_prefix=0|cut_suffix=5|append_suffix=",
+ "72": "cut_prefix=0|cut_suffix=5|append_suffix=a",
+ "73": "cut_prefix=0|cut_suffix=5|append_suffix=an",
+ "74": "cut_prefix=0|cut_suffix=5|append_suffix=d\u00e5lig",
+ "75": "cut_prefix=0|cut_suffix=5|append_suffix=er",
+ "76": "cut_prefix=0|cut_suffix=5|append_suffix=g\u00e4rna",
+ "77": "cut_prefix=0|cut_suffix=5|append_suffix=oder",
+ "78": "cut_prefix=0|cut_suffix=5|append_suffix=on",
+ "79": "cut_prefix=0|cut_suffix=5|append_suffix=r",
+ "80": "cut_prefix=0|cut_suffix=5|append_suffix=ra",
+ "81": "cut_prefix=0|cut_suffix=6|append_suffix=er",
+ "82": "cut_prefix=0|cut_suffix=8|append_suffix=or",
+ "83": "cut_prefix=1|cut_suffix=0|append_suffix=",
+ "84": "cut_prefix=1|cut_suffix=0|append_suffix=a",
+ "85": "cut_prefix=1|cut_suffix=3|append_suffix=",
+ "86": "cut_prefix=1|cut_suffix=3|append_suffix=te",
+ "87": "cut_prefix=2|cut_suffix=0|append_suffix=",
+ "88": "cut_prefix=2|cut_suffix=0|append_suffix=a",
+ "89": "cut_prefix=2|cut_suffix=1|append_suffix=empel",
+ "90": "cut_prefix=2|cut_suffix=1|append_suffix=n",
+ "91": "cut_prefix=2|cut_suffix=2|append_suffix=",
+ "92": "cut_prefix=2|cut_suffix=2|append_suffix=a",
+ "93": "cut_prefix=2|cut_suffix=3|append_suffix=",
+ "94": "cut_prefix=2|cut_suffix=3|append_suffix=as",
+ "95": "cut_prefix=2|cut_suffix=3|append_suffix=n"
+ },
+ "misc": {
+ "0": "Cxn=rc-that-nsubj",
+ "1": "Cxn=rc-that-obj",
+ "2": "Cxn=rc-wh-nsubj",
+ "3": "Cxn=rc-wh-obl",
+ "4": "Cxn=rc-wh-obl-pfront",
+ "5": "Promoted=Yes|SpaceAfter=No",
+ "6": "SpaceAfter=No",
+ "7": "ellipsis"
+ },
+ "semclass": {
+ "0": "ABILITY_OF_BEING",
+ "1": "ACTIVITY",
+ "2": "APPARATUS",
+ "3": "AREA_OF_HUMAN_ACTIVITY",
+ "4": "ARRANGEMENTS",
+ "5": "ARTICLES",
+ "6": "ATTRIBUTIVE",
+ "7": "AUXILIARY_VERBS",
+ "8": "BAD_DANGEROUS_EVENT",
+ "9": "BE",
+ "10": "BEGIN_TO_TAKE_PLACE",
+ "11": "BEHAVIOUR",
+ "12": "BEING",
+ "13": "BUSINESS",
+ "14": "BUSY_FREE_OCCUPIED",
+ "15": "CHANGE_OF_POST_AND_JOB",
+ "16": "CHARACTERISTIC_GENERAL",
+ "17": "CHOOSING_SORTING",
+ "18": "CH_APPEARANCE",
+ "19": "CH_ASPECT",
+ "20": "CH_BENEFIT",
+ "21": "CH_BY_SENSORY_PERCEPTION",
+ "22": "CH_COMPOSITION",
+ "23": "CH_DEGREE",
+ "24": "CH_DEGREE_AND_INTENSITY",
+ "25": "CH_DISPOSITION_AND_MOTION",
+ "26": "CH_DISTRIBUTION",
+ "27": "CH_EVALUATION",
+ "28": "CH_EVALUATION_OF_HUMAN_TEMPER_AND_ACTIVITY",
+ "29": "CH_FUNCTIONING_OF_ENTITY",
+ "30": "CH_INFORMATION"
573
+ "31": "CH_INTENTION_CONCENTRATION",
574
+ "32": "CH_MAGNITUDE",
575
+ "33": "CH_OF_CONNECTIONS",
576
+ "34": "CH_PARAMETER_SPEED",
577
+ "35": "CH_POWER_AND_EFFECT",
578
+ "36": "CH_PRICE_AND_SUMS",
579
+ "37": "CH_REFERENCE_AND_QUANTIFICATION",
580
+ "38": "CH_RENOWN",
581
+ "39": "CH_RESISTANCE_TO_IMPACT",
582
+ "40": "CH_SALIENCE",
583
+ "41": "CH_SCALE",
584
+ "42": "CH_SOCIAL_CHARACTERISTIC",
585
+ "43": "CH_SPHERE_OF_COVERAGE",
586
+ "44": "CH_SYSTEM_STRUCTURE",
587
+ "45": "CH_TYPE_OF_POSSESSION_AND_PARTICIPATION",
588
+ "46": "CIRCUMSTANCE",
589
+ "47": "CLOTHES",
590
+ "48": "CONDITION_SITUATION",
591
+ "49": "CONFLICT_INTERACTION",
592
+ "50": "CONJUNCTIONS",
593
+ "51": "CONTAIN_INCLUDE_FORM",
594
+ "52": "CONTINUE_TO_HAVE",
595
+ "53": "CONTINUE_TO_TAKE_PLACE",
596
+ "54": "COORDINATING_CONJUNCTIONS",
597
+ "55": "COSMOS_AND_COSMIC_OBJECTS",
598
+ "56": "COST",
599
+ "57": "COUNTRY_AS_ADMINISTRATIVE_UNIT",
600
+ "58": "CREATION_VERBS",
601
+ "59": "DEFEND_SAVE",
602
+ "60": "DESTRUCTION_VERBS",
603
+ "61": "DIFFICULTIES",
604
+ "62": "DIFFICULT_AND_EASY",
605
+ "63": "DIMENSIONS_CHAR",
606
+ "64": "DISCOURSIVE_UNITS",
607
+ "65": "DOCUMENT",
608
+ "66": "ECONOMY",
609
+ "67": "EMOTIONS_AND_THEIR_EXPRESSION",
610
+ "68": "EMPTY_SUBJECT",
611
+ "69": "END_TO_TAKE_PLACE",
612
+ "70": "ENTITY_AS_RESULT_OF_ACTIVITY",
613
+ "71": "ENTITY_OR_SITUATION_PRONOUN",
614
+ "72": "EVERYDAY_PROCESSING",
615
+ "73": "EXISTENCE_AND_POSSESSION",
616
+ "74": "FACT_INCIDENT",
617
+ "75": "FEELING_AS_CONDITION",
618
+ "76": "FURNISHINGS_AND_DECORATION",
619
+ "77": "GENERAL_ACTION",
620
+ "78": "GRAMMATICAL_ELEMENTS",
621
+ "79": "HIERARCHICAL_VERBS",
622
+ "80": "IDENTIFYING_ATTRIBUTE",
623
+ "81": "IDIOMATICAL_ELEMENTS",
624
+ "82": "INFORMATION",
625
+ "83": "INTELLECTUAL_ACTIVITY",
626
+ "84": "INTERPERSONAL_RELATIONS",
627
+ "85": "KIND",
628
+ "86": "KITCHENWARE_AND_TABLEWARE",
629
+ "87": "KNOWLEDGE_FROM_EXPERIENCE_AND_DEDUCTION",
630
+ "88": "LACK_AND_PLENTY",
631
+ "89": "LAWS_AND_STANDARDS",
632
+ "90": "MANAGE_FAIL_CONDITION",
633
+ "91": "MARKET_AS_AREA_OF_ACTIVITY",
634
+ "92": "MENTAL_OBJECT",
635
+ "93": "METHOD_APPROACH_TECHNIQUE",
636
+ "94": "MODALITY",
637
+ "95": "MONEY",
638
+ "96": "MOTION",
639
+ "97": "NONPRODUCTIVE_AREA",
640
+ "98": "OBJECT_BY_FUNCTION_AND_PROPERTY",
641
+ "99": "ORGANIZATION",
642
+ "100": "PARTICLES",
643
+ "101": "PART_OF_CONSTRUCTION",
644
+ "102": "PART_OF_ORGANISM",
645
+ "103": "PART_OF_WORLD",
646
+ "104": "PART_OR_PORTION_OF_ENTITY",
647
+ "105": "PERCEPTION_ACTIVITY",
648
+ "106": "PHRASAL_PARTICLES",
649
+ "107": "PHYSICAL_AND_BIOLOGICAL_PROPERTIES",
650
+ "108": "PHYSICAL_OBJECT_AND_SUBSTANCE_CHAR",
651
+ "109": "PHYSICAL_PSYCHIC_CONDITION",
652
+ "110": "PHYSIOLOGICAL_PROCESSES",
653
+ "111": "PLACE",
654
+ "112": "POSITION_AS_STATUS",
655
+ "113": "POSITION_IN_SPACE",
656
+ "114": "POWER_RIGHT",
657
+ "115": "PREMISES",
658
+ "116": "PREPOSITION",
659
+ "117": "PROBLEMS_TO_SOLVE",
660
+ "118": "PROCESS_AND_ITS_STAGES",
661
+ "119": "PUBLIC_AND_POLITICAL_ACTIVITY",
662
+ "120": "RELATIVE_SPACE",
663
+ "121": "RESULTS_OF_GIVING_INFORMATION_AND_SPEECH_ACTIVITY",
664
+ "122": "RESULTS_OF_MAKING_DECISIONS",
665
+ "123": "RESULT_CONSEQUENCE",
666
+ "124": "RISK_DANGER",
667
+ "125": "SCHEDULE_FOR_ACTIVITY",
668
+ "126": "SCIENCE",
669
+ "127": "SCIENTIFIC_AND_LITERARY_WORK",
670
+ "128": "SITUATION",
671
+ "129": "SOCIAL_CONDITIONS_OF_BEING",
672
+ "130": "SPHERE_OF_ACTIVITY_GENERAL",
673
+ "131": "STATE_AREA",
674
+ "132": "STATE_OF_MIND",
675
+ "133": "SUBSTANCE",
676
+ "134": "SYMBOLS_FOR_INFORMATION_TRANSFER",
677
+ "135": "TENDENCY_AND_DISPOSITION",
678
+ "136": "TERRITORY_AREA",
679
+ "137": "TEXT_OBJECTS_AND_DOCUMENTS",
680
+ "138": "THE_EARTH_AND_ITS_SPATIAL_PARTS",
681
+ "139": "THE_GOOD_BAD",
682
+ "140": "TIME",
683
+ "141": "TOPIC_SUBJECT",
684
+ "142": "TOTALITY_OF_DEGREE",
685
+ "143": "TO_ADAPT",
686
+ "144": "TO_ADD",
687
+ "145": "TO_ANALYSE_AND_RESEARCH",
688
+ "146": "TO_APPROACH_COME_TO_SOME_POINT_OR_STATE",
689
+ "147": "TO_BE_BASED",
690
+ "148": "TO_CALL_AND_DESIGNATE",
691
+ "149": "TO_CANCEL",
692
+ "150": "TO_CARE_AND_BRING_UP",
693
+ "151": "TO_CHANGE",
694
+ "152": "TO_CHARACTERIZE",
695
+ "153": "TO_COME_OR_TO_LEAVE_SPHERE_OF_ACTIVITY",
696
+ "154": "TO_COMMIT",
697
+ "155": "TO_COMMUNICATE",
698
+ "156": "TO_COMPEL_AND_EVOKE",
699
+ "157": "TO_CONTRIBUTE_AND_HINDER",
700
+ "158": "TO_DECIDE",
701
+ "159": "TO_DEVELOP",
702
+ "160": "TO_DISAPPEAR_LOSE_GET_RID_OF",
703
+ "161": "TO_ECONOMIZE",
704
+ "162": "TO_EXIST",
705
+ "163": "TO_FEEL_AND_EXPRESS_MENTAL_ATTITUDE_TO",
706
+ "164": "TO_FLOW_IN_TIME",
707
+ "165": "TO_GET",
708
+ "166": "TO_GIVE",
709
+ "167": "TO_INTERPRET",
710
+ "168": "TO_INVOLVE",
711
+ "169": "TO_JOIN",
712
+ "170": "TO_KEEP_VIOLATE_NORMS",
713
+ "171": "TO_LEARN_AND_RESEARCH",
714
+ "172": "TO_MAKE",
715
+ "173": "TO_MARRY_DIVORCE_ENGAGE",
716
+ "174": "TO_MEAN",
717
+ "175": "TO_MIX",
718
+ "176": "TO_PARTICIPATE",
719
+ "177": "TO_PERCEIVE",
720
+ "178": "TO_POSSESS",
721
+ "179": "TO_PUNISH",
722
+ "180": "TO_REACT",
723
+ "181": "TO_REBEL",
724
+ "182": "TO_RESTORE",
725
+ "183": "TO_SEEK_FIND",
726
+ "184": "TO_SET",
727
+ "185": "TO_SHARE",
728
+ "186": "TO_SHOW",
729
+ "187": "TO_TAKE",
730
+ "188": "TO_THINK_ABOUT",
731
+ "189": "TO_USE",
732
+ "190": "TO_WAIT",
733
+ "191": "TO_WORK",
734
+ "192": "URBAN_SPACE_AND_ROADS",
735
+ "193": "VALUABLE",
736
+ "194": "VERBAL_COMMUNICATION",
737
+ "195": "VISUAL_CHARACTERISTICS",
738
+ "196": "VISUAL_REPRESENTATION",
739
+ "197": "WORLD_OUTLOOK"
740
+ },
741
+ "ud_deprel": {
742
+ "0": "acl",
743
+ "1": "acl:cleft",
744
+ "2": "acl:relcl",
745
+ "3": "advcl",
746
+ "4": "advmod",
747
+ "5": "amod",
748
+ "6": "appos",
749
+ "7": "aux",
750
+ "8": "aux:pass",
751
+ "9": "case",
752
+ "10": "cc",
753
+ "11": "ccomp",
754
+ "12": "compound:prt",
755
+ "13": "conj",
756
+ "14": "cop",
757
+ "15": "csubj",
758
+ "16": "csubj:pass",
759
+ "17": "det",
760
+ "18": "dislocated",
761
+ "19": "expl",
762
+ "20": "fixed",
763
+ "21": "flat",
764
+ "22": "iobj",
765
+ "23": "mark",
766
+ "24": "nmod",
767
+ "25": "nmod:poss",
768
+ "26": "nsubj",
769
+ "27": "nsubj:pass",
770
+ "28": "nummod",
771
+ "29": "obj",
772
+ "30": "obl",
773
+ "31": "obl:agent",
774
+ "32": "parataxis",
775
+ "33": "punct",
776
+ "34": "root",
777
+ "35": "vocative",
778
+ "36": "xcomp"
779
+ }
780
+ }
781
+ }
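The `lemma_rule` labels above encode lemmatization as string-transformation rules of the form `cut_prefix=P|cut_suffix=S|append_suffix=X`. A minimal sketch of how such a rule string could be applied to a word form; the `apply_lemma_rule` helper and the exact rule semantics (cut P leading and S trailing characters, then append X) are an assumption for illustration, not the repo's actual decoding code:

```python
def apply_lemma_rule(word: str, rule: str) -> str:
    """Apply a 'cut_prefix=P|cut_suffix=S|append_suffix=X' rule to a word form."""
    parts = dict(kv.split("=", 1) for kv in rule.split("|"))
    cut_prefix = int(parts["cut_prefix"])
    cut_suffix = int(parts["cut_suffix"])
    # Keep the middle of the word, then glue the replacement suffix on.
    stem = word[cut_prefix:len(word) - cut_suffix]
    return stem + parts["append_suffix"]

# A rule of the same shape as the ids above (hypothetical Swedish example):
print(apply_lemma_rule("talade", "cut_prefix=0|cut_suffix=3|append_suffix=a"))  # tala
# Rule id 83 above strips one leading character and appends nothing:
print(apply_lemma_rule("okontroversiell", "cut_prefix=1|cut_suffix=0|append_suffix="))
```

A classifier over a closed set of such rules keeps the lemma vocabulary small while still covering inflection patterns like the Swedish suffixes listed above.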
configuration.py ADDED
@@ -0,0 +1,51 @@
+ from transformers import PretrainedConfig
+
+
+ class CobaldParserConfig(PretrainedConfig):
+     model_type = "cobald_parser"
+
+     def __init__(
+         self,
+         encoder_model_name: str = None,
+         null_classifier_hidden_size: int = 0,
+         lemma_classifier_hidden_size: int = 0,
+         morphology_classifier_hidden_size: int = 0,
+         dependency_classifier_hidden_size: int = 0,
+         misc_classifier_hidden_size: int = 0,
+         deepslot_classifier_hidden_size: int = 0,
+         semclass_classifier_hidden_size: int = 0,
+         activation: str = 'relu',
+         dropout: float = 0.1,
+         consecutive_null_limit: int = 0,
+         vocabulary: dict[str, dict[int, str]] = None,
+         # LoRA parameters.
+         use_lora: bool = False,
+         lora_r: int = 8,
+         lora_alpha: int = 16,
+         lora_dropout: float = 0.05,
+         lora_target_modules: list = None,
+         **kwargs
+     ):
+         self.encoder_model_name = encoder_model_name
+         self.null_classifier_hidden_size = null_classifier_hidden_size
+         self.consecutive_null_limit = consecutive_null_limit
+         self.lemma_classifier_hidden_size = lemma_classifier_hidden_size
+         self.morphology_classifier_hidden_size = morphology_classifier_hidden_size
+         self.dependency_classifier_hidden_size = dependency_classifier_hidden_size
+         self.misc_classifier_hidden_size = misc_classifier_hidden_size
+         self.deepslot_classifier_hidden_size = deepslot_classifier_hidden_size
+         self.semclass_classifier_hidden_size = semclass_classifier_hidden_size
+         self.activation = activation
+         self.dropout = dropout
+         self.use_lora = use_lora
+         self.lora_r = lora_r
+         self.lora_alpha = lora_alpha
+         self.lora_dropout = lora_dropout
+         self.lora_target_modules = lora_target_modules
+         # The serialized config stores mappings as strings,
+         # e.g. {"0": "acl", "1": "conj"}, so the keys have to be converted back to int.
+         self.vocabulary = {
+             column: {int(k): v for k, v in labels.items()}
+             for column, labels in (vocabulary or {}).items()
+         }
+         super().__init__(**kwargs)
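The key-conversion step at the end of `__init__` undoes JSON serialization, which turns int keys into strings. It can be sanity-checked in isolation with plain dicts (toy labels, no transformers dependency):

```python
# A config round-tripped through JSON stores int keys as strings.
serialized = {"ud_deprel": {"0": "acl", "1": "conj"}}

# The same comprehension as in CobaldParserConfig.__init__.
vocabulary = {
    column: {int(k): v for k, v in labels.items()}
    for column, labels in serialized.items()
}
print(vocabulary)  # {'ud_deprel': {0: 'acl', 1: 'conj'}}
```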
dependency_classifier.py ADDED
@@ -0,0 +1,301 @@
+ from copy import deepcopy
+
+ import numpy as np
+
+ import torch
+ from torch import nn
+ from torch import Tensor, FloatTensor, BoolTensor, LongTensor
+ import torch.nn.functional as F
+
+ from transformers.activations import ACT2FN
+
+ from cobald_parser.bilinear_matrix_attention import BilinearMatrixAttention
+ from cobald_parser.chu_liu_edmonds import decode_mst
+ from cobald_parser.utils import pairwise_mask, replace_masked_values
+
+
+ class DependencyHeadBase(nn.Module):
+     """
+     Base class for scoring arcs and relations between tokens in a dependency tree/graph.
+     """
+
+     def __init__(self, hidden_size: int, n_rels: int):
+         super().__init__()
+
+         self.arc_attention = BilinearMatrixAttention(
+             hidden_size,
+             hidden_size,
+             use_input_biases=True,
+             n_labels=1
+         )
+         self.rel_attention = BilinearMatrixAttention(
+             hidden_size,
+             hidden_size,
+             use_input_biases=True,
+             n_labels=n_rels
+         )
+
+     def forward(
+         self,
+         h_arc_head: Tensor,       # [batch_size, seq_len, hidden_size]
+         h_arc_dep: Tensor,        # ...
+         h_rel_head: Tensor,       # ...
+         h_rel_dep: Tensor,        # ...
+         gold_arcs: LongTensor,    # [n_arcs, 4]
+         null_mask: BoolTensor,    # [batch_size, seq_len]
+         padding_mask: BoolTensor  # [batch_size, seq_len]
+     ) -> dict[str, Tensor]:
+
+         # Score arcs.
+         # s_arc[:, i, j] = score of edge i -> j.
+         s_arc = self.arc_attention(h_arc_head, h_arc_dep)
+         # Mask undesirable values (padding, nulls, etc.) with -inf.
+         mask2d = pairwise_mask(null_mask & padding_mask)
+         replace_masked_values(s_arc, mask2d, replace_with=-1e8)
+         # Score arcs' relations.
+         # [batch_size, seq_len, seq_len, num_labels]
+         s_rel = self.rel_attention(h_rel_head, h_rel_dep).permute(0, 2, 3, 1)
+
+         # Calculate loss.
+         loss = 0.0
+         if gold_arcs is not None:
+             loss += self.calc_arc_loss(s_arc, gold_arcs)
+             loss += self.calc_rel_loss(s_rel, gold_arcs)
+
+         # Predict arcs based on the scores.
+         # [batch_size, seq_len, seq_len]
+         pred_arcs_matrix = self.predict_arcs(s_arc, null_mask, padding_mask)
+         # [batch_size, seq_len, seq_len]
+         pred_rels_matrix = self.predict_rels(s_rel)
+         # [n_pred_arcs, 4]
+         preds_combined = self.combine_arcs_rels(pred_arcs_matrix, pred_rels_matrix)
+         return {
+             'preds': preds_combined,
+             'loss': loss
+         }
+
+     @staticmethod
+     def calc_arc_loss(
+         s_arc: Tensor,         # [batch_size, seq_len, seq_len]
+         gold_arcs: LongTensor  # [n_arcs, 4]
+     ) -> Tensor:
+         """Calculate arc loss."""
+         raise NotImplementedError
+
+     @staticmethod
+     def calc_rel_loss(
+         s_rel: Tensor,         # [batch_size, seq_len, seq_len, num_labels]
+         gold_arcs: LongTensor  # [n_arcs, 4]
+     ) -> Tensor:
+         batch_idxs, arcs_from, arcs_to, rels = gold_arcs.T
+         return F.cross_entropy(s_rel[batch_idxs, arcs_from, arcs_to], rels)
+
+     def predict_arcs(
+         self,
+         s_arc: Tensor,            # [batch_size, seq_len, seq_len]
+         null_mask: BoolTensor,    # [batch_size, seq_len]
+         padding_mask: BoolTensor  # [batch_size, seq_len]
+     ) -> LongTensor:
+         """Predict arcs from scores."""
+         raise NotImplementedError
+
+     def predict_rels(
+         self,
+         s_rel: FloatTensor
+     ) -> LongTensor:
+         return s_rel.argmax(dim=-1).long()
+
+     @staticmethod
+     def combine_arcs_rels(
+         pred_arcs: LongTensor,
+         pred_rels: LongTensor
+     ) -> LongTensor:
+         """Select relations towards predicted arcs."""
+         assert pred_arcs.shape == pred_rels.shape
+         # Get indices where arcs exist.
+         indices = pred_arcs.nonzero(as_tuple=True)
+         batch_idxs, from_idxs, to_idxs = indices
+         # Get corresponding relation types.
+         rel_types = pred_rels[batch_idxs, from_idxs, to_idxs]
+         # Stack as [batch_idx, from_idx, to_idx, rel_type].
+         return torch.stack([batch_idxs, from_idxs, to_idxs, rel_types], dim=1)
+
+
+ class DependencyHead(DependencyHeadBase):
+     """
+     Basic UD syntax specialization that predicts a single edge for each token.
+     """
+
+     def predict_arcs(
+         self,
+         s_arc: Tensor,            # [batch_size, seq_len, seq_len]
+         null_mask: BoolTensor,    # [batch_size, seq_len]
+         padding_mask: BoolTensor  # [batch_size, seq_len]
+     ) -> Tensor:
+
+         if self.training:
+             # During training, use fast greedy decoding.
+             # [batch_size, seq_len]
+             pred_arcs_seq = s_arc.argmax(dim=1)
+         else:
+             # FIXME
+             # During inference, decode Maximum Spanning Tree.
+             # pred_arcs_seq = self._mst_decode(s_arc, padding_mask)
+             pred_arcs_seq = s_arc.argmax(dim=1)
+
+         # Upscale arcs sequence of shape [batch_size, seq_len]
+         # to matrix of shape [batch_size, seq_len, seq_len].
+         pred_arcs = F.one_hot(pred_arcs_seq, num_classes=pred_arcs_seq.size(1)).long().transpose(1, 2)
+         # Apply mask one more time (even though s_arc is already masked),
+         # because argmax erases information about masked values.
+         mask2d = pairwise_mask(null_mask & padding_mask)
+         replace_masked_values(pred_arcs, mask2d, replace_with=0)
+         return pred_arcs
+
+     def _mst_decode(
+         self,
+         s_arc: Tensor,  # [batch_size, seq_len, seq_len]
+         padding_mask: Tensor
+     ) -> Tensor:
+
+         batch_size = s_arc.size(0)
+         device = s_arc.device
+         s_arc = s_arc.cpu()
+
+         # Convert scores to probabilities, as `decode_mst` expects non-negative values.
+         arc_probs = nn.functional.softmax(s_arc, dim=1)
+
+         # `decode_mst` knows nothing about UD and ROOT, so we have to manually
+         # zero probabilities of arcs leading to ROOT to make sure ROOT is a source node
+         # of a graph.
+
+         # Decode ROOT positions from diagonals.
+         # shape: [batch_size]
+         root_idxs = arc_probs.diagonal(dim1=1, dim2=2).argmax(dim=-1)
+         # Zero out arcs leading to ROOTs.
+         arc_probs[torch.arange(batch_size), :, root_idxs] = 0.0
+
+         pred_arcs = []
+         for sample_idx in range(batch_size):
+             energy = arc_probs[sample_idx]
+             length = padding_mask[sample_idx].sum()
+             heads = decode_mst(energy, length)
+             # Some nodes may be isolated. Pick heads greedily in this case.
+             heads[heads <= 0] = s_arc[sample_idx].argmax(dim=1)[heads <= 0]
+             pred_arcs.append(heads)
+
+         # shape: [batch_size, seq_len]
+         pred_arcs = torch.from_numpy(np.stack(pred_arcs)).long().to(device)
+         return pred_arcs
+
+     @staticmethod
+     def calc_arc_loss(
+         s_arc: Tensor,         # [batch_size, seq_len, seq_len]
+         gold_arcs: LongTensor  # [n_arcs, 4]
+     ) -> Tensor:
+         batch_idxs, from_idxs, to_idxs, _ = gold_arcs.T
+         return F.cross_entropy(s_arc[batch_idxs, :, to_idxs], from_idxs)
+
+
+ class MultiDependencyHead(DependencyHeadBase):
+     """
+     Enhanced UD syntax specialization that predicts multiple edges for each token.
+     """
+
+     def predict_arcs(
+         self,
+         s_arc: Tensor,            # [batch_size, seq_len, seq_len]
+         null_mask: BoolTensor,    # [batch_size, seq_len]
+         padding_mask: BoolTensor  # [batch_size, seq_len]
+     ) -> Tensor:
+         # Convert scores to probabilities.
+         arc_probs = torch.sigmoid(s_arc)
+         # Keep confident arcs (with prob > 0.5).
+         return arc_probs.round().long()
+
+     @staticmethod
+     def calc_arc_loss(
+         s_arc: Tensor,         # [batch_size, seq_len, seq_len]
+         gold_arcs: LongTensor  # [n_arcs, 4]
+     ) -> Tensor:
+         batch_idxs, from_idxs, to_idxs, _ = gold_arcs.T
+         # Gold arcs as a matrix, where matrix[i, arc_from, arc_to] = 1.0 if the arc is present.
+         gold_arcs_matrix = torch.zeros_like(s_arc)
+         gold_arcs_matrix[batch_idxs, from_idxs, to_idxs] = 1.0
+         # Padded arcs' logits are large negative values that do not contribute to the loss.
+         return F.binary_cross_entropy_with_logits(s_arc, gold_arcs_matrix)
+
+
+ class DependencyClassifier(nn.Module):
+     """
+     Dozat and Manning's biaffine dependency classifier.
+     """
+
+     def __init__(
+         self,
+         input_size: int,
+         hidden_size: int,
+         n_rels_ud: int,
+         n_rels_eud: int,
+         activation: str,
+         dropout: float,
+     ):
+         super().__init__()
+
+         self.arc_dep_mlp = nn.Sequential(
+             nn.Dropout(dropout),
+             nn.Linear(input_size, hidden_size),
+             ACT2FN[activation],
+             nn.Dropout(dropout)
+         )
+         # All four MLPs share the same architecture.
+         self.arc_head_mlp = deepcopy(self.arc_dep_mlp)
+         self.rel_dep_mlp = deepcopy(self.arc_dep_mlp)
+         self.rel_head_mlp = deepcopy(self.arc_dep_mlp)
+
+         self.dependency_head_ud = DependencyHead(hidden_size, n_rels_ud)
+         self.dependency_head_eud = MultiDependencyHead(hidden_size, n_rels_eud)
+
+     def forward(
+         self,
+         embeddings: Tensor,   # [batch_size, seq_len, embedding_size]
+         gold_ud: Tensor,      # [n_ud_arcs, 4]
+         gold_eud: Tensor,     # [n_eud_arcs, 4]
+         null_mask: Tensor,    # [batch_size, seq_len]
+         padding_mask: Tensor  # [batch_size, seq_len]
+     ) -> dict[str, Tensor]:
+
+         # [batch_size, seq_len, hidden_size]
+         h_arc_head = self.arc_head_mlp(embeddings)
+         h_arc_dep = self.arc_dep_mlp(embeddings)
+         h_rel_head = self.rel_head_mlp(embeddings)
+         h_rel_dep = self.rel_dep_mlp(embeddings)
+
+         # Share the h vectors between dependency and multi-dependency heads.
+         output_ud = self.dependency_head_ud(
+             h_arc_head,
+             h_arc_dep,
+             h_rel_head,
+             h_rel_dep,
+             gold_arcs=gold_ud,
+             null_mask=null_mask,
+             padding_mask=padding_mask
+         )
+         output_eud = self.dependency_head_eud(
+             h_arc_head,
+             h_arc_dep,
+             h_rel_head,
+             h_rel_dep,
+             gold_arcs=gold_eud,
+             # Ignore the null mask in E-UD.
+             null_mask=torch.ones_like(padding_mask),
+             padding_mask=padding_mask
+         )
+
+         return {
+             'preds_ud': output_ud["preds"],
+             'preds_eud': output_eud["preds"],
+             'loss_ud': output_ud["loss"],
+             'loss_eud': output_eud["loss"]
+         }
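`combine_arcs_rels` above flattens a 0/1 arc adjacency matrix plus a relation-label matrix into `[batch_idx, from_idx, to_idx, rel_type]` rows. The same flattening can be sketched in plain Python with nested lists instead of tensors, to make the indexing explicit:

```python
def combine_arcs_rels(pred_arcs, pred_rels):
    """List (batch, from, to, rel) for every cell where the arc matrix is 1."""
    rows = []
    for b, matrix in enumerate(pred_arcs):
        for i, row in enumerate(matrix):
            for j, has_arc in enumerate(row):
                if has_arc:
                    # Look up the relation predicted for this particular arc.
                    rows.append((b, i, j, pred_rels[b][i][j]))
    return rows

arcs = [[[0, 1], [0, 0]]]  # one two-token sentence with a single arc 0 -> 1
rels = [[[5, 3], [7, 2]]]  # a label id for every potential arc
print(combine_arcs_rels(arcs, rels))  # [(0, 0, 1, 3)]
```

This sparse form works for both heads: the UD head contributes exactly one arc per dependent, while the enhanced-UD head may contribute several.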
encoder.py ADDED
@@ -0,0 +1,144 @@
+ import torch
+ from torch import nn
+ from torch import Tensor, LongTensor
+
+ from transformers import AutoTokenizer, AutoModel
+
+ try:
+     from peft import LoraConfig, get_peft_model
+     PEFT_AVAILABLE = True
+ except ImportError:
+     PEFT_AVAILABLE = False
+
+ from typing import Optional, List
+
+ class WordTransformerEncoder(nn.Module):
+     """
+     Encodes sentences into word-level embeddings using a pretrained MLM transformer.
+     Optionally enables LoRA fine-tuning adapters.
+     """
+     def __init__(
+         self,
+         model_name: str,
+         use_lora: bool = False,
+         lora_r: int = 8,
+         lora_alpha: int = 16,
+         lora_dropout: float = 0.05,
+         lora_target_modules: Optional[List[str]] = None
+     ):
+         super().__init__()
+         self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+         self.model = AutoModel.from_pretrained(model_name)
+
+         if use_lora:
+             if not PEFT_AVAILABLE:
+                 raise ImportError("peft is required for LoRA fine-tuning. Install with `pip install peft`.")
+             if lora_target_modules is None:
+                 # XLM-RoBERTa and the RoBERTa family.
+                 if "roberta" in model_name.lower():
+                     lora_target_modules = ["q_proj", "v_proj"]
+                 else:
+                     lora_target_modules = ["query", "value"]
+             lora_config = LoraConfig(
+                 r=lora_r,
+                 lora_alpha=lora_alpha,
+                 target_modules=lora_target_modules,
+                 lora_dropout=lora_dropout,
+                 bias="none",
+                 task_type="SEQ_CLS"
+             )
+             self.model = get_peft_model(self.model, lora_config)
+             print(f"LoRA enabled: r={lora_r}, alpha={lora_alpha}, target_modules={lora_target_modules}")
+
+     def forward(self, words: list[list[str]]) -> Tensor:
+         """
+         Build word embeddings.
+
+         - Tokenizes input sentences into subtokens.
+         - Passes the subtokens through the pre-trained transformer model.
+         - Aggregates subtoken embeddings into word embeddings using mean pooling.
+         """
+         batch_size = len(words)
+
+         # BPE tokenization: split words into subtokens, e.g. ['kidding'] -> ['▁ki', 'dding'].
+         subtokens = self.tokenizer(
+             words,
+             padding=True,
+             truncation=True,
+             is_split_into_words=True,
+             return_tensors='pt'
+         )
+         subtokens = subtokens.to(self.model.device)
+         # Index words from 1 and reserve 0 for special subtokens (e.g. <s>, </s>, padding, etc.).
+         # Such numbering makes the following aggregation easier.
+         words_ids = torch.stack([
+             torch.tensor(
+                 [word_id + 1 if word_id is not None else 0 for word_id in subtokens.word_ids(batch_idx)],
+                 dtype=torch.long,
+                 device=self.model.device
+             )
+             for batch_idx in range(batch_size)
+         ])
+
+         # Run the model and extract subtoken embeddings from the last layer.
+         subtokens_embeddings = self.model(**subtokens).last_hidden_state
+
+         # Aggregate subtoken embeddings into word embeddings.
+         # [batch_size, n_words, embedding_size]
+         words_embeddings = self._aggregate_subtokens_embeddings(subtokens_embeddings, words_ids)
+         return words_embeddings
+
+     def _aggregate_subtokens_embeddings(
+         self,
+         subtokens_embeddings: Tensor,  # [batch_size, n_subtokens, embedding_size]
+         words_ids: LongTensor          # [batch_size, n_subtokens]
+     ) -> Tensor:
+         """
+         Aggregate subtoken embeddings into word embeddings by averaging.
+
+         This method ensures that multiple subtokens corresponding to a single word are combined
+         into a single embedding.
+         """
+         batch_size, n_subtokens, embedding_size = subtokens_embeddings.shape
+         # The number of words in a sentence plus an "auxiliary" word in the beginning.
+         n_words = torch.max(words_ids) + 1
+
+         words_embeddings = torch.zeros(
+             size=(batch_size, n_words, embedding_size),
+             dtype=subtokens_embeddings.dtype,
+             device=self.model.device
+         )
+         words_ids_expanded = words_ids.unsqueeze(-1).expand(batch_size, n_subtokens, embedding_size)
+
+         # Use scatter_reduce_ to average embeddings of subtokens corresponding to the same word.
+         # All the padding and special subtokens will be aggregated into an "auxiliary" first embedding,
+         # namely into words_embeddings[:, 0, :].
+         words_embeddings.scatter_reduce_(
+             dim=1,
+             index=words_ids_expanded,
+             src=subtokens_embeddings,
+             reduce="mean",
+             include_self=False
+         )
+         # Now remove the auxiliary word in the beginning.
+         words_embeddings = words_embeddings[:, 1:, :]
+         return words_embeddings
+
+     def get_embedding_size(self) -> int:
+         """Returns the embedding size of the transformer model, e.g. 768 for BERT."""
+         return self.model.config.hidden_size
+
+     def get_embeddings_layer(self):
+         """Returns the embeddings model."""
+         return self.model.embeddings
+
+     def get_transformer_layers(self) -> list[nn.Module]:
+         """
+         Return a flat list of all transformer-*block* layers, excluding embeddings/poolers, etc.
+         """
+         layers = []
+         for sub in self.model.modules():
+             # Find all ModuleLists (these always hold the actual block layers).
+             if isinstance(sub, nn.ModuleList):
+                 layers.extend(list(sub))
+         return layers
mlp_classifier.py ADDED
@@ -0,0 +1,46 @@
+ import torch
+ from torch import nn
+ from torch import Tensor, LongTensor
+
+ from transformers.activations import ACT2FN
+
+
+ class MlpClassifier(nn.Module):
+     """ Simple feed-forward multilayer perceptron classifier. """
+
+     def __init__(
+         self,
+         input_size: int,
+         hidden_size: int,
+         n_classes: int,
+         activation: str,
+         dropout: float,
+         class_weights: list[float] = None,
+     ):
+         super().__init__()
+
+         self.n_classes = n_classes
+         self.classifier = nn.Sequential(
+             nn.Dropout(dropout),
+             nn.Linear(input_size, hidden_size),
+             ACT2FN[activation],
+             nn.Dropout(dropout),
+             nn.Linear(hidden_size, n_classes)
+         )
+         if class_weights is not None:
+             # CrossEntropyLoss expects floating-point class weights.
+             class_weights = torch.tensor(class_weights, dtype=torch.float)
+         self.cross_entropy = nn.CrossEntropyLoss(weight=class_weights)
+
+     def forward(self, embeddings: Tensor, labels: LongTensor = None) -> dict:
+         logits = self.classifier(embeddings)
+         # Calculate loss.
+         loss = 0.0
+         if labels is not None:
+             # Reshape tensors to match the expected dimensions.
+             loss = self.cross_entropy(
+                 logits.view(-1, self.n_classes),
+                 labels.view(-1)
+             )
+         # Predictions.
+         preds = logits.argmax(dim=-1)
+         return {'preds': preds, 'loss': loss}
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a48711fc2496833a9e5a00ca3563b0ed3eab04d3e4c0c3cc3cc49754134ae349
+ size 1134190536
modeling_parser.py ADDED
@@ -0,0 +1,171 @@
+ from torch import nn
+ from torch import LongTensor
+ from transformers import PreTrainedModel
+
+ from .configuration import CobaldParserConfig
+ from .encoder import WordTransformerEncoder
+ from .mlp_classifier import MlpClassifier
+ from .dependency_classifier import DependencyClassifier
+ from .utils import (
+     build_padding_mask,
+     build_null_mask,
+     prepend_cls,
+     remove_nulls,
+     add_nulls
+ )
+
+
+ class CobaldParser(PreTrainedModel):
+     """Morpho-Syntax-Semantic Parser."""
+
+     config_class = CobaldParserConfig
+
+     def __init__(self, config: CobaldParserConfig):
+         super().__init__(config)
+
+         self.encoder = WordTransformerEncoder(
+             model_name=config.encoder_model_name
+         )
+         embedding_size = self.encoder.get_embedding_size()
+
+         self.classifiers = nn.ModuleDict()
+         self.classifiers["null"] = MlpClassifier(
+             input_size=self.encoder.get_embedding_size(),
+             hidden_size=config.null_classifier_hidden_size,
+             n_classes=config.consecutive_null_limit + 1,
+             activation=config.activation,
+             dropout=config.dropout
+         )
+         if "lemma_rule" in config.vocabulary:
+             self.classifiers["lemma_rule"] = MlpClassifier(
+                 input_size=embedding_size,
+                 hidden_size=config.lemma_classifier_hidden_size,
+                 n_classes=len(config.vocabulary["lemma_rule"]),
+                 activation=config.activation,
+                 dropout=config.dropout
+             )
+         if "joint_feats" in config.vocabulary:
+             self.classifiers["joint_feats"] = MlpClassifier(
+                 input_size=embedding_size,
+                 hidden_size=config.morphology_classifier_hidden_size,
+                 n_classes=len(config.vocabulary["joint_feats"]),
+                 activation=config.activation,
+                 dropout=config.dropout
+             )
+         if "ud_deprel" in config.vocabulary or "eud_deprel" in config.vocabulary:
+             self.classifiers["syntax"] = DependencyClassifier(
+                 input_size=embedding_size,
+                 hidden_size=config.dependency_classifier_hidden_size,
+                 n_rels_ud=len(config.vocabulary["ud_deprel"]),
+                 n_rels_eud=len(config.vocabulary["eud_deprel"]),
+                 activation=config.activation,
+                 dropout=config.dropout
+             )
+         if "misc" in config.vocabulary:
+             self.classifiers["misc"] = MlpClassifier(
+                 input_size=embedding_size,
+                 hidden_size=config.misc_classifier_hidden_size,
+                 n_classes=len(config.vocabulary["misc"]),
+                 activation=config.activation,
+                 dropout=config.dropout
+             )
+         if "deepslot" in config.vocabulary:
+             self.classifiers["deepslot"] = MlpClassifier(
+                 input_size=embedding_size,
+                 hidden_size=config.deepslot_classifier_hidden_size,
+                 n_classes=len(config.vocabulary["deepslot"]),
+                 activation=config.activation,
+                 dropout=config.dropout
+             )
+         if "semclass" in config.vocabulary:
+             self.classifiers["semclass"] = MlpClassifier(
+                 input_size=embedding_size,
+                 hidden_size=config.semclass_classifier_hidden_size,
+                 n_classes=len(config.vocabulary["semclass"]),
+                 activation=config.activation,
+                 dropout=config.dropout
+             )
+
+     def forward(
+         self,
+         words: list[list[str]],
+         counting_masks: LongTensor = None,
+         lemma_rules: LongTensor = None,
+         joint_feats: LongTensor = None,
+         deps_ud: LongTensor = None,
+         deps_eud: LongTensor = None,
+         miscs: LongTensor = None,
+         deepslots: LongTensor = None,
+         semclasses: LongTensor = None,
+         sent_ids: list[str] = None,
+         texts: list[str] = None,
+         inference_mode: bool = False
+     ) -> dict:
+         output = {}
+
+         # An extra [CLS] token accounts for the case when #NULL is the first token in a sentence.
+         words_with_cls = prepend_cls(words)
+         words_without_nulls = remove_nulls(words_with_cls)
+         # Embeddings of words without nulls.
+         embeddings_without_nulls = self.encoder(words_without_nulls)
+         # Predict nulls.
+         null_output = self.classifiers["null"](embeddings_without_nulls, counting_masks)
+         output["counting_mask"] = null_output['preds']
+         output["loss"] = null_output["loss"]
+
+         # "Teacher forcing": during training, pass the original words (with gold nulls)
+         # to the classification heads, so that they are trained on correct sentences.
+         if inference_mode:
+             # Restore predicted nulls in the original sentences.
+             output["words"] = add_nulls(words, null_output["preds"])
+         else:
+             output["words"] = words
+
+         # Encode words with nulls.
+         # [batch_size, seq_len, embedding_size]
+         embeddings = self.encoder(output["words"])
+
+         # Predict lemmas and morphological features.
+         if "lemma_rule" in self.classifiers:
+             lemma_output = self.classifiers["lemma_rule"](embeddings, lemma_rules)
+             output["lemma_rules"] = lemma_output['preds']
+             output["loss"] += lemma_output['loss']
+
+         if "joint_feats" in self.classifiers:
+             joint_feats_output = self.classifiers["joint_feats"](embeddings, joint_feats)
+             output["joint_feats"] = joint_feats_output['preds']
137
+ output["loss"] += joint_feats_output['loss']
138
+
139
+ # Predict syntax.
140
+ if "syntax" in self.classifiers:
141
+ padding_mask = build_padding_mask(output["words"], self.device)
142
+ null_mask = build_null_mask(output["words"], self.device)
143
+ deps_output = self.classifiers["syntax"](
144
+ embeddings,
145
+ deps_ud,
146
+ deps_eud,
147
+ null_mask,
148
+ padding_mask
149
+ )
150
+ output["deps_ud"] = deps_output['preds_ud']
151
+ output["deps_eud"] = deps_output['preds_eud']
152
+ output["loss"] += deps_output['loss_ud'] + deps_output['loss_eud']
153
+
154
+ # Predict miscellaneous features.
155
+ if "misc" in self.classifiers:
156
+ misc_output = self.classifiers["misc"](embeddings, miscs)
157
+ output["miscs"] = misc_output['preds']
158
+ output["loss"] += misc_output['loss']
159
+
160
+ # Predict semantics.
161
+ if "deepslot" in self.classifiers:
162
+ deepslot_output = self.classifiers["deepslot"](embeddings, deepslots)
163
+ output["deepslots"] = deepslot_output['preds']
164
+ output["loss"] += deepslot_output['loss']
165
+
166
+ if "semclass" in self.classifiers:
167
+ semclass_output = self.classifiers["semclass"](embeddings, semclasses)
168
+ output["semclasses"] = semclass_output['preds']
169
+ output["loss"] += semclass_output['loss']
170
+
171
+ return output
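The counting-mask targets that the null classifier predicts can be illustrated with a small standalone sketch. This re-implements the scheme the comments above describe; the helper name `counting_mask` is illustrative and not part of the repository:

```python
def counting_mask(words_with_nulls: list[str]) -> list[int]:
    """counts[i] = number of #NULLs directly following real token i;
    counts[0] is reserved for nulls before the first real token
    (the slot covered by the extra [CLS])."""
    counts = [0]
    for word in words_with_nulls:
        if word == "#NULL":
            counts[-1] += 1
        else:
            counts.append(0)
    return counts

print(counting_mask(["#NULL", "The", "cat", "#NULL", "sat"]))  # [1, 0, 1, 0]
```

Since the classifier sees the sentence with nulls removed, this per-position count is all it needs to reconstruct where the nulls were.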
pipeline.py ADDED
@@ -0,0 +1,236 @@
+
+ from transformers import Pipeline
+
+ from src.lemmatize_helper import reconstruct_lemma
+
+
+ class ConlluTokenClassificationPipeline(Pipeline):
+     def __init__(
+         self,
+         model,
+         tokenizer: callable = None,
+         sentenizer: callable = None,
+         **kwargs
+     ):
+         super().__init__(model=model, **kwargs)
+         self.tokenizer = tokenizer
+         self.sentenizer = sentenizer
+
+     #@override
+     def _sanitize_parameters(self, output_format: str = 'list', **kwargs):
+         if output_format not in ['list', 'str']:
+             raise ValueError(
+                 f"output_format must be 'str' or 'list', not {output_format}"
+             )
+         # Capture output_format for postprocessing.
+         return {}, {}, {'output_format': output_format}
+
+     def preprocess(self, inputs: str) -> dict:
+         if not isinstance(inputs, str):
+             raise ValueError("pipeline input must be a string (text)")
+
+         sentences = [sentence for sentence in self.sentenizer(inputs)]
+         words = [
+             [word for word in self.tokenizer(sentence)]
+             for sentence in sentences
+         ]
+         # Stash for later post-processing.
+         self._texts = sentences
+         return {"words": words}
+
+     def _forward(self, model_inputs: dict) -> dict:
+         return self.model(**model_inputs, inference_mode=True)
+
+     #@override
+     def postprocess(self, model_outputs: dict, output_format: str) -> list[dict] | str:
+         sentences = self._decode_model_output(model_outputs)
+         # Format sentences into a CoNLL-U string if requested.
+         if output_format == 'str':
+             sentences = self._format_as_conllu(sentences)
+         return sentences
+
+     def _decode_model_output(self, model_outputs: dict) -> list[dict]:
+         n_sentences = len(model_outputs["words"])
+
+         sentences_decoded = []
+         for i in range(n_sentences):
+
+             def select_arcs(arcs, batch_idx):
+                 # Select arcs whose batch index equals batch_idx.
+                 # Returns a tensor of shape [n_selected_arcs, 3].
+                 return arcs[arcs[:, 0] == batch_idx][:, 1:]
+
+             # Model outputs are padded tensors, so only keep the first `n_words` labels.
+             n_words = len(model_outputs["words"][i])
+
+             optional_tags = {}
+             if "lemma_rules" in model_outputs:
+                 optional_tags["lemma_rule_ids"] = model_outputs["lemma_rules"][i, :n_words].tolist()
+             if "joint_feats" in model_outputs:
+                 optional_tags["joint_feats_ids"] = model_outputs["joint_feats"][i, :n_words].tolist()
+             if "deps_ud" in model_outputs:
+                 optional_tags["deps_ud"] = select_arcs(model_outputs["deps_ud"], i).tolist()
+             if "deps_eud" in model_outputs:
+                 optional_tags["deps_eud"] = select_arcs(model_outputs["deps_eud"], i).tolist()
+             if "miscs" in model_outputs:
+                 optional_tags["misc_ids"] = model_outputs["miscs"][i, :n_words].tolist()
+             if "deepslots" in model_outputs:
+                 optional_tags["deepslot_ids"] = model_outputs["deepslots"][i, :n_words].tolist()
+             if "semclasses" in model_outputs:
+                 optional_tags["semclass_ids"] = model_outputs["semclasses"][i, :n_words].tolist()
+
+             sentence_decoded = self._decode_sentence(
+                 text=self._texts[i],
+                 words=model_outputs["words"][i],
+                 **optional_tags,
+             )
+             sentences_decoded.append(sentence_decoded)
+         return sentences_decoded
+
+     def _decode_sentence(
+         self,
+         text: str,
+         words: list[str],
+         lemma_rule_ids: list[int] = None,
+         joint_feats_ids: list[int] = None,
+         deps_ud: list[list[int]] = None,
+         deps_eud: list[list[int]] = None,
+         misc_ids: list[int] = None,
+         deepslot_ids: list[int] = None,
+         semclass_ids: list[int] = None
+     ) -> dict:
+
+         # Enumerate words in the sentence, starting from 1.
+         ids = self._enumerate_words(words)
+
+         result = {
+             "text": text,
+             "words": words,
+             "ids": ids
+         }
+
+         # Decode lemmas.
+         if lemma_rule_ids:
+             result["lemmas"] = [
+                 reconstruct_lemma(
+                     word,
+                     self.model.config.vocabulary["lemma_rule"][lemma_rule_id]
+                 )
+                 for word, lemma_rule_id in zip(words, lemma_rule_ids, strict=True)
+             ]
+         # Decode POS and features.
+         if joint_feats_ids:
+             upos, xpos, feats = zip(
+                 *[
+                     self.model.config.vocabulary["joint_feats"][joint_feats_id].split('#')
+                     for joint_feats_id in joint_feats_ids
+                 ],
+                 strict=True
+             )
+             result["upos"] = list(upos)
+             result["xpos"] = list(xpos)
+             result["feats"] = list(feats)
+         # Decode syntax.
+         renumerate_and_decode_arcs = lambda arcs, id2rel: [
+             (
+                 # `ids` stores the inverse mapping from the internal numeration to the
+                 # standard CoNLL-U numeration, so ids[internal_idx] retrieves the token id
+                 # for an internal index.
+                 ids[arc_from] if arc_from != arc_to else '0',
+                 ids[arc_to],
+                 id2rel[deprel_id]
+             )
+             for arc_from, arc_to, deprel_id in arcs
+         ]
+         if deps_ud:
+             result["deps_ud"] = renumerate_and_decode_arcs(
+                 deps_ud,
+                 self.model.config.vocabulary["ud_deprel"]
+             )
+         if deps_eud:
+             result["deps_eud"] = renumerate_and_decode_arcs(
+                 deps_eud,
+                 self.model.config.vocabulary["eud_deprel"]
+             )
+         # Decode misc.
+         if misc_ids:
+             result["miscs"] = [
+                 self.model.config.vocabulary["misc"][misc_id]
+                 for misc_id in misc_ids
+             ]
+         # Decode semantics.
+         if deepslot_ids:
+             result["deepslots"] = [
+                 self.model.config.vocabulary["deepslot"][deepslot_id]
+                 for deepslot_id in deepslot_ids
+             ]
+         if semclass_ids:
+             result["semclasses"] = [
+                 self.model.config.vocabulary["semclass"][semclass_id]
+                 for semclass_id in semclass_ids
+             ]
+         return result
+
+     @staticmethod
+     def _enumerate_words(words: list[str]) -> list[str]:
+         ids = []
+         current_id = 0
+         current_null_count = 0
+         for word in words:
+             if word == "#NULL":
+                 current_null_count += 1
+                 ids.append(f"{current_id}.{current_null_count}")
+             else:
+                 current_id += 1
+                 current_null_count = 0
+                 ids.append(f"{current_id}")
+         return ids
+
+     @staticmethod
+     def _format_as_conllu(sentences: list[dict]) -> str:
+         """
+         Format a list of sentence dicts into a CoNLL-U formatted string.
+         """
+         formatted = []
+         for sentence in sentences:
+             # The first line is the text metadata.
+             lines = [f"# text = {sentence['text']}"]
+
+             id2idx = {token_id: idx for idx, token_id in enumerate(sentence['ids'])}
+
+             # Basic syntax.
+             heads = [''] * len(id2idx)
+             deprels = [''] * len(id2idx)
+             if "deps_ud" in sentence:
+                 for arc_from, arc_to, deprel in sentence['deps_ud']:
+                     token_idx = id2idx[arc_to]
+                     heads[token_idx] = arc_from
+                     deprels[token_idx] = deprel
+
+             # Enhanced syntax.
+             deps_dicts = [{} for _ in range(len(id2idx))]
+             if "deps_eud" in sentence:
+                 for arc_from, arc_to, deprel in sentence['deps_eud']:
+                     token_idx = id2idx[arc_to]
+                     deps_dicts[token_idx][arc_from] = deprel
+
+             for idx, token_id in enumerate(sentence['ids']):
+                 word = sentence['words'][idx]
+                 lemma = sentence['lemmas'][idx] if "lemmas" in sentence else ''
+                 upos = sentence['upos'][idx] if "upos" in sentence else ''
+                 xpos = sentence['xpos'][idx] if "xpos" in sentence else ''
+                 feats = sentence['feats'][idx] if "feats" in sentence else ''
+                 deps = '|'.join(f"{head}:{rel}" for head, rel in deps_dicts[idx].items()) or '_'
+                 misc = sentence['miscs'][idx] if "miscs" in sentence else ''
+                 deepslot = sentence['deepslots'][idx] if "deepslots" in sentence else ''
+                 semclass = sentence['semclasses'][idx] if "semclasses" in sentence else ''
+                 # CoNLL-U columns.
+                 line = '\t'.join([
+                     token_id, word, lemma, upos, xpos, feats, heads[idx],
+                     deprels[idx], deps, misc, deepslot, semclass
+                 ])
+                 lines.append(line)
+             formatted.append('\n'.join(lines))
+         return '\n\n'.join(formatted)
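The id scheme produced by `_enumerate_words` can be shown with a hedged standalone re-implementation (mirroring the static method above): real tokens are numbered 1, 2, …, while each #NULL gets a decimal id anchored to the preceding real token, as in enhanced-UD empty nodes.

```python
def enumerate_words(words: list[str]) -> list[str]:
    # Real tokens get ids 1, 2, ...; each #NULL gets a decimal id
    # attached to the preceding real token (0 for a sentence-initial null).
    ids, current_id, null_count = [], 0, 0
    for word in words:
        if word == "#NULL":
            null_count += 1
            ids.append(f"{current_id}.{null_count}")
        else:
            current_id += 1
            null_count = 0
            ids.append(str(current_id))
    return ids

print(enumerate_words(["#NULL", "Mary", "left", "#NULL", "early"]))
# ['0.1', '1', '2', '2.1', '3']
```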
utils.py ADDED
@@ -0,0 +1,69 @@
+ import torch
+ from torch import Tensor
+
+
+ def pad_sequences(sequences: list[Tensor], padding_value: int) -> Tensor:
+     """
+     Stack 1d tensors (sequences) into a single 2d tensor, padding each sequence on the right.
+     """
+     return torch.nn.utils.rnn.pad_sequence(sequences, padding_value=padding_value, batch_first=True)
+
+
+ def _build_condition_mask(sentences: list[list[str]], condition_fn: callable, device) -> Tensor:
+     masks = [
+         torch.tensor([condition_fn(word) for word in sentence], dtype=torch.bool, device=device)
+         for sentence in sentences
+     ]
+     return pad_sequences(masks, padding_value=False)
+
+
+ def build_padding_mask(sentences: list[list[str]], device) -> Tensor:
+     return _build_condition_mask(sentences, condition_fn=lambda word: True, device=device)
+
+
+ def build_null_mask(sentences: list[list[str]], device) -> Tensor:
+     return _build_condition_mask(sentences, condition_fn=lambda word: word != "#NULL", device=device)
+
+
+ def pairwise_mask(masks1d: Tensor) -> Tensor:
+     """
+     Calculate the outer product of a mask, i.e. masks2d[:, i, j] = masks1d[:, i] & masks1d[:, j].
+     """
+     return masks1d[:, None, :] & masks1d[:, :, None]
+
+
+ # Credits: https://docs.allennlp.org/main/api/nn/util/#replace_masked_values
+ def replace_masked_values(tensor: Tensor, mask: Tensor, replace_with: float):
+     """
+     Replace all masked values in `tensor` with `replace_with`.
+     """
+     assert tensor.dim() == mask.dim(), f"tensor.dim() of {tensor.dim()} != mask.dim() of {mask.dim()}"
+     tensor.masked_fill_(~mask, replace_with)
+
+
+ def prepend_cls(sentences: list[list[str]]) -> list[list[str]]:
+     """
+     Return a copy of sentences with a [CLS] token prepended.
+     """
+     return [["[CLS]", *sentence] for sentence in sentences]
+
+
+ def remove_nulls(sentences: list[list[str]]) -> list[list[str]]:
+     """
+     Return a copy of sentences with nulls removed.
+     """
+     return [[word for word in sentence if word != "#NULL"] for sentence in sentences]
+
+
+ def add_nulls(sentences: list[list[str]], counting_masks) -> list[list[str]]:
+     """
+     Return a copy of sentences with nulls restored according to the counting masks.
+     """
+     sentences_with_nulls = []
+     for sentence, counting_mask in zip(sentences, counting_masks, strict=True):
+         sentence_with_nulls = []
+         assert 0 < len(counting_mask)
+         # Account for the leading auxiliary [CLS] token.
+         sentence_with_nulls.extend(["#NULL"] * counting_mask[0])
+         for word, n_nulls_to_insert in zip(sentence, counting_mask[1:], strict=True):
+             sentence_with_nulls.append(word)
+             sentence_with_nulls.extend(["#NULL"] * n_nulls_to_insert)
+         sentences_with_nulls.append(sentence_with_nulls)
+     return sentences_with_nulls