sergeyzt50 committed · Commit 10ebc4d · verified · 1 Parent(s): 40b17b0

Upload 27 files
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ spacy-skmodel/sources/skner/wikiann-sk.bio filter=lfs diff=lfs merge=lfs -text
spacy-skmodel/.gitignore ADDED
@@ -0,0 +1,6 @@
+ build
+ venv
+ dist
+ input
+ posparser
+ nerposparser
spacy-skmodel/Makefile ADDED
@@ -0,0 +1,30 @@
+ all: input/sk_snk-ud-test.spacy input/sk_snk-ud-train.spacy input/train-ner.spacy input/vectors/config.cfg
+
+ sources/slovak-treebank/sk_snk-ud-test.conllu:
+ 	mkdir -p sources/slovak-treebank
+ 	cd sources && wget -P slovak-treebank https://raw.githubusercontent.com/UniversalDependencies/UD_Slovak-SNK/master/sk_snk-ud-test.conllu
+
+ sources/slovak-treebank/sk_snk-ud-train.conllu:
+ 	mkdir -p sources/slovak-treebank
+ 	cd sources && wget -P slovak-treebank https://raw.githubusercontent.com/UniversalDependencies/UD_Slovak-SNK/master/sk_snk-ud-train.conllu
+
+ sources/floret/vectors.floret.gz:
+ 	mkdir -p sources/floret
+ 	cd sources && wget -P floret https://files.kemt.fei.tuke.sk/models/fasttext/sk-fastext-floretvec-skweb2021/vectors.floret.gz --no-check-certificate
+
+ input/sk_snk-ud-test.spacy: sources/slovak-treebank/sk_snk-ud-test.conllu
+ 	mkdir -p input
+ 	spacy convert -n 10 sources/slovak-treebank/sk_snk-ud-test.conllu input
+
+ input/sk_snk-ud-train.spacy: sources/slovak-treebank/sk_snk-ud-train.conllu
+ 	mkdir -p input
+ 	spacy convert -n 10 sources/slovak-treebank/sk_snk-ud-train.conllu input
+
+ input/train-ner.spacy: sources/skner/wikiann-sk.bio
+ 	python skner2json.py ./sources/skner/wikiann-sk.bio input/train-ner.json input/test-ner.json
+ 	spacy convert input/train-ner.json input
+ 	spacy convert input/test-ner.json input
+
+ input/vectors/config.cfg: sources/floret/vectors.floret.gz
+ 	mkdir -p input/vectors
+ 	spacy init vectors sk sources/floret/vectors.floret.gz input/vectors -V -m floret
spacy-skmodel/README.md ADDED
@@ -0,0 +1,70 @@
+ # Slovak spaCy Model
+
+ This is a Slovak spaCy model.
+
+ ## Features
+
+ - Requires spaCy 3.x.
+ - Contains floret word vectors.
+ - The tagger uses the Slovak National Corpus tagset.
+ - The morphological analyzer uses the Universal Dependencies tagset and is trained on the Slovak Dependency Treebank.
+ - The lemmatizer is trained on the Slovak Dependency Treebank.
+ - The named entity recognizer is trained separately on the WikiAnn database.
+
+ ## Downloads
+
+ ### Version 3.4
+
+ - [spaCy 3.4, Dependencies](https://files.kemt.fei.tuke.sk/models/spacy/sk_dep_web_md-3.4.1.tar.gz)
+   - Model trained for lemmatization, POS tagging and dependency relations.
+   - Contains floret word vectors trained on our web corpus.
+   - Should be free of licensing issues.
+ - [spaCy 3.4, NER + Dependencies](https://files.kemt.fei.tuke.sk/models/spacy/sk_core_web_md-3.4.1.tar.gz)
+   - Includes the dependencies model.
+   - Uses a separate fine-tuned model for NER.
+
+ ### Version 3.3
+
+ - [spaCy 3.3, Dependencies](https://files.kemt.fei.tuke.sk/models/spacy/sk_dep_web_md-3.3.0.tar.gz). Model trained for lemmatization, POS tagging and dependency relations.
+ - [spaCy 3.3, NER + Dependencies](https://files.kemt.fei.tuke.sk/models/spacy/sk_core_web_md-3.3.0.tar.gz). Uses a separate fine-tuned model for NER.
+
+ These models do not have word vectors.
+
+ ## Training
+
+ Requirements for training:
+
+ - Anaconda virtual environment
+ - spaCy 3
+ - make
+ - bash
+
+ Usage:
+
+ 1. Install the dependencies into the Conda environment:
+
+        ./prepare-env.sh
+
+ 2. Download and prepare the data:
+
+        make
+
+ 3. Train the models:
+
+        ./train.sh
+
+ ## Credits
+
+ Author:
+
+ Daniel Hládek (daniel.hladek@tuke.sk), Technical University of Košice
+
+ Sources:
+
+ - The model uses spacy-transformers and [SlovakBERT](https://huggingface.co/gerulata/slovakbert).
+ - [Part of Speech and Dependency relations](https://github.com/UniversalDependencies/UD_Slovak-SNK) - the Slovak UD treebank, Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
+ - [Semi-automatic named entities](https://huggingface.co/datasets/wikiann) - unspecified license.
spacy-skmodel/changemeta.py ADDED
@@ -0,0 +1,35 @@
+ import json
+ import sys
+
+ # usage: python changemeta.py <pos_model_dir> <ner_model_dir>
+ pos_dname = sys.argv[1]
+ with open(pos_dname + "/meta.json") as f:
+     pos_meta = json.load(f)
+ pos_performance = pos_meta["performance"]
+
+ dname = sys.argv[2]
+ meta_name = dname + "/meta.json"
+ with open(meta_name) as f:
+     doc = json.load(f)
+ doc["name"] = "core_web_md"
+ if "disabled" in doc:
+     del doc["disabled"]
+ doc["pipeline"] = ["transformer","tagger","morphologizer","trainable_lemmatizer","parser","ner"]
+ # copy the POS model's scores into the merged model's metadata
+ for k, v in pos_performance.items():
+     doc["performance"][k] = v
+
+ with open(meta_name, "w") as f:
+     json.dump(doc, f, indent=4)
+
+ # re-enable all pipeline components in the exported config
+ clines = []
+ config_name = dname + "/config.cfg"
+ with open(config_name) as f:
+     for l in f:
+         line = l.rstrip()
+         if "disabled" in line:
+             line = "disabled = []"
+         clines.append(line)
+
+ with open(config_name, "w") as f:
+     print("\n".join(clines), file=f)
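The metadata merge performed by changemeta.py can be sketched on a toy example; the score names and values below are hypothetical, chosen only for illustration:

```python
import json

# hypothetical metadata for a POS model and a NER model
pos_meta = {"performance": {"tag_acc": 0.95, "dep_las": 0.88}}
ner_meta = {"name": "ner_only", "performance": {"ents_f": 0.81}}

# merge as changemeta.py does: rename the model and copy the
# POS model's scores into the NER model's metadata
ner_meta["name"] = "core_web_md"
for k, v in pos_meta["performance"].items():
    ner_meta["performance"][k] = v

print(json.dumps(ner_meta, sort_keys=True))
```

The result carries the NER score alongside the copied POS scores, so the packaged model's meta.json reports the performance of every component.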
spacy-skmodel/clean.sh ADDED
@@ -0,0 +1,4 @@
+ rm -rf traindir
+ rm -rf posparser
+ rm -rf nerposparser
+ rm -rf dist
spacy-skmodel/config-ner.cfg ADDED
@@ -0,0 +1,160 @@
+ [paths]
+ train = "input/train-ner.spacy"
+ dev = "input/test-ner.spacy"
+ vectors = "input/vectors"
+ init_tok2vec = null
+
+ [system]
+ gpu_allocator = null
+ seed = 0
+
+ [nlp]
+ lang = "sk"
+ pipeline = ["tok2vec","parser","tagger","ner"]
+ batch_size = 1000
+ #disabled = ["parser","tagger"]
+ before_creation = null
+ after_creation = null
+ after_pipeline_creation = null
+ tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
+
+ [components]
+
+ [components.ner]
+ factory = "ner"
+ moves = null
+ update_with_oracle_cut_size = 100
+
+ [components.ner.model]
+ @architectures = "spacy.TransitionBasedParser.v2"
+ state_type = "ner"
+ extra_state_tokens = false
+ hidden_width = 64
+ maxout_pieces = 2
+ use_upper = true
+ nO = null
+
+ [components.ner.model.tok2vec]
+ @architectures = "spacy.HashEmbedCNN.v2"
+ pretrained_vectors = null
+ width = 96
+ depth = 4
+ embed_size = 2000
+ window_size = 1
+ maxout_pieces = 3
+ subword_features = true
+
+ [components.parser]
+ source = "sk_pipeline"
+ replace_listeners = ["model.tok2vec"]
+
+ [components.tagger]
+ source = "sk_pipeline"
+ replace_listeners = ["model.tok2vec"]
+
+ [components.tok2vec]
+ factory = "tok2vec"
+
+ [components.tok2vec.model]
+ @architectures = "spacy.Tok2Vec.v2"
+
+ [components.tok2vec.model.embed]
+ @architectures = "spacy.MultiHashEmbed.v2"
+ width = ${components.tok2vec.model.encode.width}
+ attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
+ rows = [5000,2500,2500,2500]
+ include_static_vectors = true
+
+ [components.tok2vec.model.encode]
+ @architectures = "spacy.MaxoutWindowEncoder.v2"
+ width = 96
+ depth = 4
+ window_size = 1
+ maxout_pieces = 3
+
+ [corpora]
+
+ [corpora.dev]
+ @readers = "spacy.Corpus.v1"
+ path = ${paths.dev}
+ max_length = 0
+ gold_preproc = false
+ limit = 0
+ augmenter = null
+
+ [corpora.train]
+ @readers = "spacy.Corpus.v1"
+ path = ${paths.train}
+ max_length = 2000
+ gold_preproc = false
+ limit = 0
+ augmenter = null
+
+ [training]
+ dev_corpus = "corpora.dev"
+ train_corpus = "corpora.train"
+ seed = ${system.seed}
+ gpu_allocator = ${system.gpu_allocator}
+ dropout = 0.1
+ accumulate_gradient = 1
+ patience = 1600
+ max_epochs = 0
+ max_steps = 20000
+ eval_frequency = 200
+ frozen_components = ["tagger","parser"]
+ before_to_disk = null
+
+ [training.batcher]
+ @batchers = "spacy.batch_by_words.v1"
+ discard_oversize = false
+ tolerance = 0.2
+ get_length = null
+
+ [training.batcher.size]
+ @schedules = "compounding.v1"
+ start = 100
+ stop = 1000
+ compound = 1.001
+ t = 0.0
+
+ [training.logger]
+ @loggers = "spacy.ConsoleLogger.v1"
+ progress_bar = false
+
+ [training.optimizer]
+ @optimizers = "Adam.v1"
+ beta1 = 0.9
+ beta2 = 0.999
+ L2_is_weight_decay = true
+ L2 = 0.01
+ grad_clip = 1.0
+ use_averages = false
+ eps = 0.00000001
+ learn_rate = 0.001
+
+ [training.score_weights]
+ dep_las_per_type = null
+ sents_p = null
+ sents_r = null
+ ents_per_type = null
+ dep_uas = 0.17
+ dep_las = 0.17
+ sents_f = 0.0
+ tag_acc = 0.33
+ ents_f = 0.33
+ ents_p = 0.0
+ ents_r = 0.0
+
+ [pretraining]
+
+ [initialize]
+ vectors = ${paths.vectors}
+ init_tok2vec = ${paths.init_tok2vec}
+ vocab_data = null
+ lookups = null
+ before_init = null
+ after_init = null
+
+ [initialize.components]
+
+ [initialize.tokenizer]
spacy-skmodel/config-transformer-ner.cfg ADDED
@@ -0,0 +1,165 @@
+ [paths]
+ train = "input/train-ner.spacy"
+ dev = "input/test-ner.spacy"
+ vectors = "input/vectors"
+ init_tok2vec = null
+
+ [system]
+ gpu_allocator = "pytorch"
+ seed = 0
+
+ [nlp]
+ lang = "sk"
+ pipeline = ["transformer","tagger","morphologizer","trainable_lemmatizer","parser","ner"]
+ batch_size = 128
+ disabled = []
+ before_creation = null
+ after_creation = null
+ after_pipeline_creation = null
+ tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
+
+ [components]
+
+ [components.ner]
+ factory = "ner"
+
+ [components.ner.model]
+ @architectures = "spacy.TransitionBasedParser.v2"
+ state_type = "ner"
+ extra_state_tokens = false
+ hidden_width = 64
+ maxout_pieces = 2
+ use_upper = false
+ nO = null
+
+ [components.ner.model.tok2vec]
+ @architectures = "spacy-transformers.TransformerListener.v1"
+ grad_factor = 1.0
+
+ [components.ner.model.tok2vec.pooling]
+ @layers = "reduce_mean.v1"
+
+ [components.morphologizer]
+ source = "sk_dep_web_md"
+ replace_listeners = ["model.tok2vec"]
+
+ [components.parser]
+ source = "sk_dep_web_md"
+ replace_listeners = ["model.tok2vec"]
+
+ [components.tagger]
+ source = "sk_dep_web_md"
+ replace_listeners = ["model.tok2vec"]
+
+ [components.trainable_lemmatizer]
+ source = "sk_dep_web_md"
+ replace_listeners = ["model.tok2vec"]
+
+ [components.transformer]
+ factory = "transformer"
+ max_batch_items = 4096
+ set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}
+
+ [components.transformer.model]
+ @architectures = "spacy-transformers.TransformerModel.v3"
+ name = "gerulata/slovakbert"
+ mixed_precision = false
+
+ [components.transformer.model.get_spans]
+ @span_getters = "spacy-transformers.strided_spans.v1"
+ window = 128
+ stride = 96
+
+ [components.transformer.model.grad_scaler_config]
+
+ [components.transformer.model.tokenizer_config]
+ use_fast = true
+
+ [components.transformer.model.transformer_config]
+
+ [corpora]
+
+ [corpora.dev]
+ @readers = "spacy.Corpus.v1"
+ path = ${paths.dev}
+ max_length = 0
+ gold_preproc = false
+ limit = 0
+ augmenter = null
+
+ [corpora.train]
+ @readers = "spacy.Corpus.v1"
+ path = ${paths.train}
+ max_length = 0
+ gold_preproc = false
+ limit = 0
+ augmenter = null
+
+ [training]
+ accumulate_gradient = 3
+ dev_corpus = "corpora.dev"
+ train_corpus = "corpora.train"
+ seed = ${system.seed}
+ gpu_allocator = ${system.gpu_allocator}
+ dropout = 0.1
+ patience = 1600
+ max_epochs = 0
+ max_steps = 20000
+ eval_frequency = 200
+ frozen_components = ["tagger","morphologizer","trainable_lemmatizer","parser"]
+ annotating_components = []
+ before_to_disk = null
+
+ [training.batcher]
+ @batchers = "spacy.batch_by_padded.v1"
+ discard_oversize = true
+ size = 2000
+ buffer = 256
+ get_length = null
+
+ [training.logger]
+ @loggers = "spacy.ConsoleLogger.v1"
+ progress_bar = false
+
+ [training.optimizer]
+ @optimizers = "Adam.v1"
+ beta1 = 0.9
+ beta2 = 0.999
+ L2_is_weight_decay = true
+ L2 = 0.01
+ grad_clip = 1.0
+ use_averages = false
+ eps = 0.00000001
+
+ [training.optimizer.learn_rate]
+ @schedules = "warmup_linear.v1"
+ warmup_steps = 250
+ total_steps = 20000
+ initial_rate = 0.00005
+
+ [training.score_weights]
+ tag_acc = 0.26
+ pos_acc = 0.12
+ morph_acc = 0.12
+ morph_per_feat = null
+ lemma_acc = 0.26
+ dep_uas = 0.12
+ dep_las = 0.12
+ dep_las_per_type = null
+ sents_p = null
+ sents_r = null
+ sents_f = 0.0
+
+ [pretraining]
+
+ [initialize]
+ vectors = ${paths.vectors}
+ init_tok2vec = ${paths.init_tok2vec}
+ vocab_data = null
+ lookups = null
+ before_init = null
+ after_init = null
+
+ [initialize.components]
+
+ [initialize.tokenizer]
spacy-skmodel/config-transformer.cfg ADDED
@@ -0,0 +1,200 @@
+ [paths]
+ train = "input/sk_snk-ud-train.spacy"
+ dev = "input/sk_snk-ud-test.spacy"
+ vectors = "input/vectors"
+ init_tok2vec = null
+
+ [system]
+ gpu_allocator = "pytorch"
+ seed = 0
+
+ [nlp]
+ lang = "sk"
+ pipeline = ["transformer","tagger","morphologizer","trainable_lemmatizer","parser"]
+ batch_size = 128
+ disabled = []
+ before_creation = null
+ after_creation = null
+ after_pipeline_creation = null
+ tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
+
+ [components]
+
+ [components.morphologizer]
+ factory = "morphologizer"
+ extend = false
+ overwrite = true
+ scorer = {"@scorers":"spacy.morphologizer_scorer.v1"}
+
+ [components.morphologizer.model]
+ @architectures = "spacy.Tagger.v2"
+ nO = null
+ normalize = false
+
+ [components.morphologizer.model.tok2vec]
+ @architectures = "spacy-transformers.TransformerListener.v1"
+ grad_factor = 1.0
+ pooling = {"@layers":"reduce_mean.v1"}
+ upstream = "*"
+
+ [components.parser]
+ factory = "parser"
+ learn_tokens = false
+ min_action_freq = 30
+ moves = null
+ scorer = {"@scorers":"spacy.parser_scorer.v1"}
+ update_with_oracle_cut_size = 100
+
+ [components.parser.model]
+ @architectures = "spacy.TransitionBasedParser.v2"
+ state_type = "parser"
+ extra_state_tokens = false
+ hidden_width = 128
+ maxout_pieces = 3
+ use_upper = false
+ nO = null
+
+ [components.parser.model.tok2vec]
+ @architectures = "spacy-transformers.TransformerListener.v1"
+ grad_factor = 1.0
+ pooling = {"@layers":"reduce_mean.v1"}
+ upstream = "*"
+
+ [components.tagger]
+ factory = "tagger"
+ neg_prefix = "!"
+ overwrite = false
+ scorer = {"@scorers":"spacy.tagger_scorer.v1"}
+
+ [components.tagger.model]
+ @architectures = "spacy.Tagger.v2"
+ nO = null
+ normalize = false
+
+ [components.tagger.model.tok2vec]
+ @architectures = "spacy-transformers.TransformerListener.v1"
+ grad_factor = 1.0
+ pooling = {"@layers":"reduce_mean.v1"}
+ upstream = "*"
+
+ [components.trainable_lemmatizer]
+ factory = "trainable_lemmatizer"
+ backoff = "orth"
+ min_tree_freq = 3
+ overwrite = false
+ scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
+ top_k = 1
+
+ [components.trainable_lemmatizer.model]
+ @architectures = "spacy.Tagger.v2"
+ nO = null
+ normalize = false
+
+ [components.trainable_lemmatizer.model.tok2vec]
+ @architectures = "spacy-transformers.TransformerListener.v1"
+ grad_factor = 1.0
+ pooling = {"@layers":"reduce_mean.v1"}
+ upstream = "*"
+
+ [components.transformer]
+ factory = "transformer"
+ max_batch_items = 4096
+ set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}
+
+ [components.transformer.model]
+ @architectures = "spacy-transformers.TransformerModel.v3"
+ name = "gerulata/slovakbert"
+ mixed_precision = false
+
+ [components.transformer.model.get_spans]
+ @span_getters = "spacy-transformers.strided_spans.v1"
+ window = 128
+ stride = 96
+
+ [components.transformer.model.grad_scaler_config]
+
+ [components.transformer.model.tokenizer_config]
+ use_fast = true
+
+ [components.transformer.model.transformer_config]
+
+ [corpora]
+
+ [corpora.dev]
+ @readers = "spacy.Corpus.v1"
+ path = ${paths.dev}
+ max_length = 0
+ gold_preproc = false
+ limit = 0
+ augmenter = null
+
+ [corpora.train]
+ @readers = "spacy.Corpus.v1"
+ path = ${paths.train}
+ max_length = 0
+ gold_preproc = false
+ limit = 0
+ augmenter = null
+
+ [training]
+ accumulate_gradient = 3
+ dev_corpus = "corpora.dev"
+ train_corpus = "corpora.train"
+ seed = ${system.seed}
+ gpu_allocator = ${system.gpu_allocator}
+ dropout = 0.1
+ patience = 1600
+ max_epochs = 0
+ max_steps = 20000
+ eval_frequency = 200
+ frozen_components = []
+ annotating_components = []
+ before_to_disk = null
+
+ [training.batcher]
+ @batchers = "spacy.batch_by_padded.v1"
+ discard_oversize = true
+ size = 2000
+ buffer = 256
+
+ [training.logger]
+ @loggers = "spacy.ConsoleLogger.v1"
+ progress_bar = false
+
+ [training.optimizer]
+ @optimizers = "Adam.v1"
+ beta1 = 0.9
+ beta2 = 0.999
+ L2_is_weight_decay = true
+ L2 = 0.01
+ grad_clip = 1.0
+ use_averages = false
+ eps = 0.00000001
+
+ [training.optimizer.learn_rate]
+ @schedules = "warmup_linear.v1"
+ warmup_steps = 250
+ total_steps = 20000
+ initial_rate = 0.00005
+
+ [training.score_weights]
+ tag_acc = 0.26
+ pos_acc = 0.12
+ morph_acc = 0.12
+ morph_per_feat = null
+ lemma_acc = 0.26
+ dep_uas = 0.12
+ dep_las = 0.12
+ dep_las_per_type = null
+ sents_p = null
+ sents_r = null
+ sents_f = 0.0
+
+ [pretraining]
+
+ [initialize]
+ vectors = ${paths.vectors}
+
+ [initialize.components]
+
+ [initialize.tokenizer]
spacy-skmodel/meta.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "lang": "sk",
+   "name": "dep_web_md",
+   "version": "3.4.1",
+   "description": "Slovak model with part-of-speech and parsing",
+   "author": "Daniel Hládek",
+   "email": "daniel.hladek@tuke.sk",
+   "url": "https://nlp.kemt.fei.tuke.sk",
+   "license": "BSD"
+ }
spacy-skmodel/prepare-env.sh ADDED
@@ -0,0 +1,3 @@
+ conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
+ pip install -U spacy[cuda113,transformers,lookups]==3.4
+ rm -r ./input/*
spacy-skmodel/skner2json.py ADDED
@@ -0,0 +1,103 @@
+ import sys
+ import json
+ # JSON training format: https://spacy.io/api/data-formats#training
+
+ def bio2biluo(ners):
+     """Convert a sentence's BIO tags to BILUO tags."""
+     ners1 = []
+     # repair I- tags that open an entity: promote them to B-
+     for i, ner in enumerate(ners):
+         ners1.append(list(ner))
+         if i > 0 and ners[i-1][0] != "B" and ners[i-1][0] != "I" and ner[0] == "I":
+             ners1[i][0] = "B"
+             print("fixed")
+     ners = ners1
+     ners1 = []
+     # single-token entities become U-, entity-final I- tags become L-
+     for i, ner in enumerate(ners):
+         ners1.append(ner)
+         if i > 0 and ners[i-1][0] == "B" and ner[0] != "I":
+             ners1[i-1][0] = "U"
+         if i > 1 and (ners[i-2][0] == "I" or ners[i-2][0] == "B") and ners[i-1][0] == "I" and ners[i][0] != "I":
+             ners1[i-1][0] = "L"
+     if len(ners) == 1 and ners[0][0] == "I":
+         ners1[0][0] = "U"
+     if len(ners) > 0 and ners[-1][0] == "B":
+         ners1[-1][0] = "U"
+     if len(ners) > 0 and ners[-1][0] == "I":
+         ners1[-1][0] = "L"
+     return ["".join(nerlist) for nerlist in ners1]
+
+ def save_sentences(sentences, filename):
+     """Write sentences in spaCy's JSON training format."""
+     paragraphs = []
+     for id, sentence in enumerate(sentences):
+         tokens = []
+         words = []
+         for word, tag in sentence:
+             words.append(word)
+             tokens.append({"orth": word, "ner": tag})
+         paragraphs.append({"id": id, "paragraphs": [{"raw": " ".join(words), "sentences": [{"tokens": tokens}]}]})
+     with open(filename, "w") as f:
+         json.dump(paragraphs, f)
+
+ def strippunct(word):
+     """Replace a leading or trailing non-alphabetic character with 'x'."""
+     chars = list(word)
+     if not word[0].isalpha():
+         chars[0] = "x"
+     if not word[-1].isalpha():
+         chars[-1] = "x"
+     return "".join(chars)
+
+ def process_data(filename):
+     with open(filename) as f:
+         sentences = []
+         words = []
+         ners = []
+         for l in f:
+             line = l.strip()
+             if len(line) > 0:
+                 tokens = l.split()
+                 word = tokens[0].strip()
+                 ner = tokens[-1].strip()
+                 if len(ner) > 1 and ner[1] == "-":
+                     word = strippunct(word)
+                 if len(word) == 0:
+                     continue
+                 words.append(word)
+                 ners.append(ner)
+             else:
+                 # a blank line ends a sentence
+                 ners = bio2biluo(ners)
+                 sentence = list(zip(words, ners))
+                 sentences.append(sentence)
+                 del ners[:]
+                 del words[:]
+     testset = []
+     trainset = []
+     # every tenth sentence goes to the test set
+     for i, sentence in enumerate(sentences):
+         if i % 10 == 0:
+             testset.append(sentence)
+         else:
+             trainset.append(sentence)
+
+     save_sentences(trainset, sys.argv[2])
+     save_sentences(testset, sys.argv[3])
+
+ process_data(sys.argv[1])
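The BIO-to-BILUO conversion that the script performs can be illustrated with a simplified, self-contained sketch (it does not repair malformed sequences that open with an I- tag, which the script handles separately; spaCy itself also ships `spacy.training.iob_to_biluo` for this task):

```python
def bio_to_biluo(tags):
    """Convert BIO tags to BILUO: single-token entities become U-,
    entity-final I- tags become L-."""
    out = list(tags)
    n = len(tags)
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        prefix, ent = tag.split("-", 1)
        # look at the next tag to decide whether the entity continues
        nxt = tags[i + 1] if i + 1 < n else "O"
        continues = nxt.startswith("I-") and nxt.split("-", 1)[1] == ent
        if prefix == "B":
            out[i] = ("B-" if continues else "U-") + ent
        else:  # "I"
            out[i] = ("I-" if continues else "L-") + ent
    return out

print(bio_to_biluo(["B-PER", "I-PER", "O", "B-LOC"]))
# → ['B-PER', 'L-PER', 'O', 'U-LOC']
```

The explicit L- and U- tags give the transition-based NER model an unambiguous marker for the last token of each entity, which is why the script converts the WikiAnn BIO data before `spacy convert`.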
spacy-skmodel/small-config.cfg ADDED
@@ -0,0 +1,189 @@
+ [paths]
+ train = "input/sk_snk-ud-train.spacy"
+ dev = "input/sk_snk-ud-test.spacy"
+ vectors = null
+ init_tok2vec = null
+
+ [system]
+ gpu_allocator = "pytorch"
+ seed = 0
+
+ [nlp]
+ lang = "sk"
+ pipeline = ["tok2vec","tagger","morphologizer","trainable_lemmatizer","parser"]
+ batch_size = 1000
+ disabled = []
+ before_creation = null
+ after_creation = null
+ after_pipeline_creation = null
+ tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
+
+ [components]
+
+ [components.morphologizer]
+ factory = "morphologizer"
+
+ [components.morphologizer.model]
+ @architectures = "spacy.Tagger.v2"
+ nO = null
+
+ [components.morphologizer.model.tok2vec]
+ @architectures = "spacy.Tok2VecListener.v1"
+ width = ${components.tok2vec.model.encode.width}
+
+ [components.parser]
+ factory = "parser"
+
+ [components.parser.model]
+ @architectures = "spacy.TransitionBasedParser.v2"
+ state_type = "parser"
+ extra_state_tokens = false
+ hidden_width = 128
+ maxout_pieces = 3
+ use_upper = true
+ nO = null
+
+ [components.parser.model.tok2vec]
+ @architectures = "spacy.Tok2VecListener.v1"
+ width = ${components.tok2vec.model.encode.width}
+
+ [components.tagger]
+ factory = "tagger"
+
+ [components.tagger.model]
+ @architectures = "spacy.Tagger.v2"
+ nO = null
+
+ [components.tagger.model.tok2vec]
+ @architectures = "spacy.Tok2VecListener.v1"
+ width = ${components.tok2vec.model.encode.width}
+
+ [components.tok2vec]
+ factory = "tok2vec"
+
+ [components.tok2vec.model]
+ @architectures = "spacy.Tok2Vec.v2"
+
+ [components.tok2vec.model.embed]
+ @architectures = "spacy.MultiHashEmbed.v2"
+ width = ${components.tok2vec.model.encode.width}
+ attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
+ rows = [5000,2500,2500,2500]
+ include_static_vectors = false
+
+ [components.tok2vec.model.encode]
+ @architectures = "spacy.MaxoutWindowEncoder.v2"
+ width = 96
+ depth = 4
+ window_size = 1
+ maxout_pieces = 3
+
+ [components.trainable_lemmatizer]
+ factory = "trainable_lemmatizer"
+ backoff = "orth"
+ min_tree_freq = 3
+ overwrite = false
+ scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
+ top_k = 1
+
+ [components.trainable_lemmatizer.model]
+ @architectures = "spacy.Tagger.v2"
+ nO = null
+ normalize = false
+
+ [components.trainable_lemmatizer.model.tok2vec]
+ @architectures = "spacy.Tok2VecListener.v1"
+ width = ${components.tok2vec.model.encode.width}
+
+ [corpora]
+
+ [corpora.dev]
+ @readers = "spacy.Corpus.v1"
+ path = ${paths.dev}
+ max_length = 0
+ gold_preproc = false
+ limit = 0
+ augmenter = null
+
+ [corpora.train]
+ @readers = "spacy.Corpus.v1"
+ path = ${paths.train}
+ max_length = 2000
+ gold_preproc = false
+ limit = 0
+ augmenter = null
+
+ [training]
+ dev_corpus = "corpora.dev"
+ train_corpus = "corpora.train"
+ seed = ${system.seed}
+ gpu_allocator = ${system.gpu_allocator}
+ dropout = 0.25
+ accumulate_gradient = 1
+ patience = 1600
+ max_epochs = 25
+ max_steps = 20000
+ eval_frequency = 200
+ frozen_components = []
+ before_to_disk = null
+ annotating_components = []
+
+ [training.batcher]
+ @batchers = "spacy.batch_by_words.v1"
+ discard_oversize = false
+ tolerance = 0.2
+ get_length = null
+
+ [training.batcher.size]
+ @schedules = "compounding.v1"
+ start = 100
+ stop = 1000
+ compound = 1.001
+ t = 0.0
+
+ [training.logger]
+ @loggers = "spacy.ConsoleLogger.v1"
+ progress_bar = false
+
+ [training.optimizer]
+ @optimizers = "Adam.v1"
+ beta1 = 0.9
+ beta2 = 0.999
+ L2_is_weight_decay = true
+ L2 = 0.01
+ grad_clip = 1.0
+ use_averages = false
+ eps = 0.00000001
+ learn_rate = 0.001
+
+ [training.score_weights]
+ tag_acc = 0.17
+ pos_acc = 0.17
+ morph_acc = 0.17
+ morph_per_feat = null
+ lemma_acc = 0.33
+ dep_uas = 0.08
+ dep_las = 0.08
+ dep_las_per_type = null
+ sents_p = null
+ sents_r = null
+ sents_f = 0.0
+
+ [pretraining]
+
+ [initialize]
+ vectors = ${paths.vectors}
+ init_tok2vec = ${paths.init_tok2vec}
+ vocab_data = null
+ lookups = null
+ before_init = null
+ after_init = null
+
+ [initialize.components]
+
+ [initialize.components.tagger]
+
+ [initialize.tokenizer]
spacy-skmodel/sources/skner/README.txt ADDED
@@ -0,0 +1,12 @@
+ Silver-standard Name Annotations From Wikipedia Markups
+ Xiaoman Pan
+ panx2@rpi.edu
+
+ FORMAT:
+ [TOKEN] [ADDITIONAL INFORMATION] [TAG]
+
+ ADDITIONAL INFORMATION FORMAT:
+ [Wikipedia title] [name mention] [entity type] [entity type confidence] [English Wikipedia title]
+
+ If you would like to cite this work, please cite the following publication:
+ Cross-lingual Name Tagging and Linking for 282 Languages
spacy-skmodel/sources/skner/wikiann-sk.bio ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e6c3ff1eb8ea5bf2de7a19f44d36df45e61a3f52d3e415b81dfeadeffc61ee4e
+ size 13898246
spacy-skmodel/sources/slovak-treebank/stb.conll ADDED
The diff for this file is too large to render. See raw diff
 
spacy-skmodel/sources/ud-artificial-gapping/README.txt ADDED
@@ -0,0 +1,29 @@
+ Artificial dependency trees in the Universal Dependencies v2 style, focused
+ on gapping (the 'orphan' relation in UD). For motivation and description of
+ the data, see the paper cited below. Please cite the paper if you use the data
+ in your academic work.
+
+ @inproceedings{droganova2018,
+   title = {Parse Me if You Can: Artificial Treebanks for Parsing Experiments on Elliptical Constructions},
+   author = {Kira Droganova and Daniel Zeman and Jenna Kanerva and Filip Ginter},
+   year = {2018},
+   booktitle = {Proceedings of the 11th International Conference on Language Resources and Evaluation ({LREC} 2018)},
+   publisher = {European Language Resources Association},
+   organization = {European Language Resource Association},
+   address = {Paris, France},
+   location = {Miyazaki, Japan},
+   venue = {Phoenix Seagaia Conference Center}
+ }
+
+ Permanent URI of the dataset:
+ http://hdl.handle.net/11234/1-2616
+
+ *-crawled-* data are crawled from the web, parsed by two parsers, filtered so
+ that only those trees survive where the two parsers agree, then processed
+ to create artificial gapping.
+ *-{train,dev,test}-* data are based on Universal Dependency treebanks release
+ 2.1 (November 2017).
+ English and Finnish data were manually checked and modified after gapping
+ structures had been automatically drafted.
+ Czech, Slovak and Russian data were processed only automatically.
spacy-skmodel/sources/ud-artificial-gapping/sk-ud-crawled-orphan.conllu ADDED
The diff for this file is too large to render. See raw diff
 
spacy-skmodel/testmodel.py ADDED
@@ -0,0 +1,13 @@
+ import spacy
+ import sys
+
+ nlp = spacy.load(sys.argv[1])
+ nlp.enable_pipe("tagger")
+ nlp.enable_pipe("parser")
+ nlp.enable_pipe("ner")
+ lines = []
+ for line in sys.stdin:
+     lines.append(line.rstrip())
+ doc = nlp("\n".join(lines))
+ for token in doc:
+     print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop, token.ent_type_)
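Note that the script collects all of stdin into a single document before running the pipeline (and `Token` has no `ner_` attribute in spaCy; the entity type is `ent_type_`). The input-collection step can be illustrated without a loaded model, using `io.StringIO` as a stand-in for `sys.stdin`:

```python
import io

# Stand-in for sys.stdin, illustrating how testmodel.py gathers its input:
# trailing whitespace is stripped per line, then the lines are re-joined
# with "\n" and passed to nlp() as one document.
fake_stdin = io.StringIO("Prvá veta.\nDruhá veta.\n")
lines = [line.rstrip() for line in fake_stdin]
text = "\n".join(lines)
print(repr(text))
```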
spacy-skmodel/train-small.sh ADDED
@@ -0,0 +1,35 @@
+ set -e # fail on error
+
+ #make # prepare data
+
+ export CUDA_VISIBLE_DEVICES=0
+ # cleanup old results
+ #rm -rf dist
+ mkdir -p dist
+ mkdir -p train
+ TRAINDIR=train/smposparser
+ NERTRAINDIR=train/smnerposparser
+ VER=3.3.0
+ MODELDIR=dist/sk_dep_web_sm-$VER
+ NERMODELDIR=dist/sk_core_web_sm-$VER
+ mkdir -p $TRAINDIR
+ # Train POS and dependencies
+ spacy train small-config.cfg -o $TRAINDIR -g 0 > $TRAINDIR/train.log 2> $TRAINDIR/train.err.log
+ # Package POS
+ spacy package -m small-meta.json -F $TRAINDIR/model-best dist
+ cd $MODELDIR
+ python ./setup.py sdist
+ # install to include POS and dependencies in the new model
+ # name must be the same as in meta.json
+ pip install $MODELDIR.tar.gz
+ cd ../../
+ mkdir -p $NERTRAINDIR
+ # Train NER, copy POS and dep from the old model
+ spacy train small-ner.cfg -o $NERTRAINDIR -g 0 > $NERTRAINDIR/train.log 2> $NERTRAINDIR/train.err.log
+ # Correct meta
+ cp $NERTRAINDIR/model-best/meta.json $NERTRAINDIR/model-best/meta-ner.json
+ python changemeta.py $TRAINDIR/model-best $NERTRAINDIR/model-best
+ # Package result
+ spacy package --version $VER $NERTRAINDIR/model-best dist
+ cd $NERMODELDIR
+ python ./setup.py sdist
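`changemeta.py` is not part of this upload, so its exact behaviour is unknown; judging by the comments ("Correct meta") and by `v2/assemble.py`, it presumably copies tagger/parser labels and scores from the POS model's `meta.json` into the NER model's. A hedged in-memory sketch of that kind of merge (the field names here are assumptions, not the real script):

```python
import copy
import json

def merge_meta(pos_meta, ner_meta):
    """Hypothetical sketch of a changemeta.py-style step: copy tagger labels
    and selected parser/tagger scores from the POS model's meta into a copy
    of the NER model's meta, leaving the inputs untouched."""
    merged = copy.deepcopy(ner_meta)
    merged.setdefault("labels", {})["tagger"] = pos_meta["labels"]["tagger"]
    merged.setdefault("performance", {}).update(
        {k: v for k, v in pos_meta.get("performance", {}).items()
         if k in ("tag_acc", "dep_uas", "dep_las")})
    return merged

pos = {"labels": {"tagger": ["NOUN", "VERB"]}, "performance": {"tag_acc": 0.95}}
ner = {"labels": {"ner": ["PER", "LOC"]}, "performance": {"ents_f": 0.80}}
merged = merge_meta(pos, ner)
print(json.dumps(merged, indent=2))
```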
spacy-skmodel/train.sh ADDED
@@ -0,0 +1,31 @@
+ set -e # fail on error
+
+ make # prepare data
+
+ export CUDA_VISIBLE_DEVICES=0
+ VERSION=3.4.1
+ # cleanup old results
+ #rm -rf dist
+ mkdir -p dist
+ mkdir -p train
+ mkdir -p train/sposparser
+ # Train POS and dependencies
+ spacy train config-transformer.cfg -o ./train/sposparser -g 0 > ./train/sposparser/train.log 2> ./train/sposparser/train.err.log
+ # Package POS
+ spacy package -m meta.json -F train/sposparser/model-best dist
+ cd dist/sk_dep_web_md-$VERSION
+ python ./setup.py sdist
+ # install to include POS and dependencies in the new model
+ # name must be the same as in meta.json
+ #pip install dist/sk_dep_web_md-$VERSION.tar.gz
+ #cd ../../
+ #mkdir -p train/snerposparser
+ # Train NER, copy POS and dep from the old model
+ #spacy train config-transformer-ner.cfg -o ./train/snerposparser -g 0 > ./train/snerposparser/train.log 2> ./train/snerposparser/train.err.log
+ # Correct meta
+ #cp ./train/snerposparser/model-best/meta.json ./train/snerposparser/model-best/meta-ner.json
+ #python changemeta.py ./train/sposparser/model-best ./train/snerposparser/model-best
+ # Package result
+ #spacy package --version $VERSION train/snerposparser/model-best dist
+ #cd dist/sk_core_web_md-$VERSION
+ #python ./setup.py sdist
spacy-skmodel/treebank2json.py ADDED
@@ -0,0 +1,110 @@
+ import sys
+ import json
+ # https://spacy.io/api/data-formats#training
+ #from spacy.gold import offsets_from_biluo_tags
+ #from spacy.gold import iob_to_biluo
+
+ depmap = {
+     "case": "AuxP",
+     "root": "Pred",  # / Pred_M
+     "punct": "AuxK",
+     "nsubj": "Sb",
+     "obj": "Obj",
+     "conj": "Sb",
+     "cc": "Coord",
+     "orphan": "Obj",
+     "advmod": "Adv",
+     "amod": "Atr",
+     "nmod": "Atr",
+     "mark": "AuxC",
+     "aux": "AuxV",
+     "det": "Atr",
+     "obl": "Atr",
+     "expl:pv": "AuxT",
+ }
+
+ def save_data(filename, dataset):
+     sentences = []
+     words = []
+     docs = []
+     for i, item in enumerate(dataset):
+         bad = False
+         for token in item:
+             words.append(token["orth"])
+             h = token["head"] + token["id"]
+             #print(h, len(item))
+             if h < 0 or h >= len(item):
+                 print(item)
+                 bad = True
+                 break
+         if bad:
+             continue
+         sentences.append({"tokens": item})
+         if len(sentences) > 4:
+             doc = {
+                 "id": i,
+                 "paragraphs": [{
+                     "raw": " ".join(words),
+                     "sentences": list(sentences)
+                 }]
+             }
+             docs.append(doc)
+             del words[:]
+             del sentences[:]
+
+     if len(docs) > 0 and len(sentences) > 0:
+         doc = {
+             "id": docs[-1]["id"] + 1,
+             "paragraphs": [{
+                 "raw": " ".join(words),
+                 "sentences": list(sentences)
+             }]
+         }
+         docs.append(doc)
+     with open(filename, "w") as f:
+         json.dump(docs, f)
+
+
+ def process_data(trainname, testname):
+     dataset = []
+     sentence = []
+     for l in sys.stdin:
+         if l[0] == "#":
+             continue
+         tokens = l.split()
+         #print(tokens)
+         if len(tokens) < 2:
+             if len(sentence) > 0:
+                 dataset.append(list(sentence))
+                 del sentence[:]
+             continue
+         head = int(tokens[6])
+         id = int(tokens[0]) - 1
+         #print(head, id)
+         h = 0
+         if head != 0:
+             h = head - id - 1
+         dep = tokens[7]
+         if dep in depmap:
+             dep = depmap[dep]
+         #print(h)
+         token = {
+             "id": id,
+             "orth": tokens[1],
+             "tag": tokens[4],
+             # "ner":
+             "head": h,
+             "dep": dep,
+         }
+         sentence.append(token)
+     trainset = []
+     testset = []
+     for i, item in enumerate(dataset):
+         if i % 10 == 0:
+             testset.append(item)
+         else:
+             trainset.append(item)
+     save_data(trainname, trainset)
+     save_data(testname, testset)
+
+ process_data(sys.argv[1], sys.argv[2])
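The head arithmetic in `process_data` converts CoNLL-U's absolute, 1-based head indices (0 = root) into the relative offsets used by spaCy's v2 JSON training format, which `save_data` then validates with `h + id` bounds checks. The conversion in isolation:

```python
def relative_head(head, token_id_zero_based):
    """CoNLL-U heads are 1-based with 0 meaning root; spaCy's v2 JSON
    format stores the head as an offset relative to the token itself,
    with 0 meaning the token is its own head (the root)."""
    if head == 0:
        return 0
    return head - token_id_zero_based - 1

# Three-token sentence "Peter spí ." — spí (id 1) is root,
# Peter (id 0) and "." (id 2) both attach to it.
assert relative_head(2, 0) == 1   # Peter -> spí, one token to the right
assert relative_head(0, 1) == 0   # spí is the root
assert relative_head(2, 2) == -1  # "." -> spí, one token to the left
```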
spacy-skmodel/v2/01.prepare.sh ADDED
@@ -0,0 +1,16 @@
+ # conda install spacy=2.3.5 cupy cudatoolkit=9.2
+ mkdir -p input
+ # Prepare Treebank
+ mkdir -p input/slovak-treebank
+ spacy convert ./sources/slovak-treebank/stb.conll ./input/slovak-treebank
+ # UDAG used as evaluation
+ mkdir -p input/ud-artificial-gapping
+ spacy convert ./sources/ud-artificial-gapping/sk-ud-crawled-orphan.conllu ./input/ud-artificial-gapping
+ # Prepare skner
+ mkdir -p input/skner
+ cd input/skner
+ python ../../skner2json.py ../../sources/skner/wikiann-sk.bio
+ cd ../..
+
+ wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.sk.300.vec.gz
+ mv cc.sk.300.vec.gz ./input
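`skner2json.py` is referenced here but not included in this upload. Assuming `wikiann-sk.bio` follows the usual WikiAnn BIO layout (one token and tag per line, blank line between sentences — an assumption, not confirmed by the diff), the sentence-grouping step it must perform can be sketched as:

```python
def read_bio_sentences(lines):
    """Group token/tag lines of a BIO-tagged file into sentences.
    Assumed layout: whitespace-separated columns with the token first
    and the BIO tag last; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        parts = line.split()
        if not parts:          # blank line = sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = parts[0], parts[-1]
        current.append((token, tag))
    if current:                # flush a trailing sentence with no blank line
        sentences.append(current)
    return sentences

sample = ["Peter B-PER", "Sagan I-PER", "vyhral O", "", "Košice B-LOC"]
print(read_bio_sentences(sample))
```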
spacy-skmodel/v2/assemble.py ADDED
@@ -0,0 +1,36 @@
+ import sys
+ import json
+
+ base = sys.argv[1]
+ ner = sys.argv[2]
+ posparser = sys.argv[3]
+ outmeta = sys.argv[4]
+
+ meta = None
+ with open(base, "r") as f:
+     meta = json.load(f)
+ meta["labels"] = {}
+ meta["accuracy"] = {}
+
+ ner_meta = None
+ with open(ner, "r") as f:
+     ner_meta = json.load(f)
+ meta["spacy_version"] = ner_meta["spacy_version"]
+ meta["labels"]["ner"] = ner_meta["labels"]["ner"]
+ meta["accuracy"]["ents_p"] = ner_meta["accuracy"]["ents_p"]
+ meta["accuracy"]["ents_r"] = ner_meta["accuracy"]["ents_r"]
+ meta["accuracy"]["ents_f"] = ner_meta["accuracy"]["ents_f"]
+ meta["accuracy"]["ents_per_type"] = ner_meta["accuracy"]["ents_per_type"]
+
+ posparser_meta = None
+ with open(posparser, "r") as f:
+     posparser_meta = json.load(f)
+ meta["vectors"] = posparser_meta["vectors"]
+ meta["accuracy"]["tags_acc"] = posparser_meta["accuracy"]["tags_acc"]
+ meta["accuracy"]["uas"] = posparser_meta["accuracy"]["uas"]
+ meta["accuracy"]["las"] = posparser_meta["accuracy"]["las"]
+ meta["accuracy"]["las_per_type"] = posparser_meta["accuracy"]["las_per_type"]
+ meta["labels"]["tagger"] = posparser_meta["labels"]["tagger"]
+
+ with open(outmeta, "w") as f:
+     json.dump(meta, f, indent=6)
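The merge above can be exercised without files; the same logic restated as a function over in-memory dicts, using the identical spaCy v2 `meta.json` keys, makes the data flow easy to verify:

```python
def assemble_meta(base, ner_meta, posparser_meta):
    """Restatement of assemble.py's merge: start from the base meta, take
    NER labels/scores from the NER model and tagger/parser labels, scores
    and vector info from the POS/parser model."""
    meta = dict(base)
    meta["labels"] = {}
    meta["accuracy"] = {}
    meta["spacy_version"] = ner_meta["spacy_version"]
    meta["labels"]["ner"] = ner_meta["labels"]["ner"]
    for k in ("ents_p", "ents_r", "ents_f", "ents_per_type"):
        meta["accuracy"][k] = ner_meta["accuracy"][k]
    meta["vectors"] = posparser_meta["vectors"]
    for k in ("tags_acc", "uas", "las", "las_per_type"):
        meta["accuracy"][k] = posparser_meta["accuracy"][k]
    meta["labels"]["tagger"] = posparser_meta["labels"]["tagger"]
    return meta

base = {"lang": "sk", "name": "sk_core_web_md"}
ner = {"spacy_version": ">=2.3.0", "labels": {"ner": ["PER"]},
       "accuracy": {"ents_p": 80.0, "ents_r": 78.0, "ents_f": 79.0,
                    "ents_per_type": {}}}
pos = {"vectors": {"vectors": 0}, "labels": {"tagger": ["AFS1"]},
       "accuracy": {"tags_acc": 92.0, "uas": 80.0, "las": 74.0,
                    "las_per_type": {}}}
meta = assemble_meta(base, ner, pos)
```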
spacy-skmodel/v2/meta-ccv2.json ADDED
@@ -0,0 +1,11 @@
+ {
+     "lang": "sk",
+     "name": "sk_core_web_lg",
+     "version": "2.3.1",
+     "description": "Basic Slovak model with fastText word vectors trained on public data",
+     "author": "Daniel Hládek",
+     "email": "dhladek@gmail.com",
+     "url": "https://nlp.kemt.fei.tuke.sk",
+     "license": "CC BY-SA 3.0",
+     "pipeline": ["tagger", "parser", "ner"]
+ }
spacy-skmodel/v2/meta-v2.json ADDED
@@ -0,0 +1,11 @@
+ {
+     "lang": "sk",
+     "name": "sk_core_web_md",
+     "version": "2.3.1",
+     "description": "Basic Slovak model without word vectors trained on public data",
+     "author": "Daniel Hládek",
+     "email": "dhladek@gmail.com",
+     "url": "https://nlp.kemt.fei.tuke.sk",
+     "license": "CC BY-SA 3.0",
+     "pipeline": ["tagger", "parser", "ner"]
+ }
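`spacy package` reads these meta files, so a missing field surfaces only at packaging time. A small pre-flight check can catch that earlier; the required-key list below is an assumption for illustration, not spaCy's exact validation:

```python
import json

# Assumed-required meta.json fields (illustrative, not spaCy's own schema).
REQUIRED_META_KEYS = ("lang", "name", "version", "description",
                      "author", "email", "license", "pipeline")

def missing_meta_keys(meta):
    """Return the assumed-required meta.json keys that are absent."""
    return [k for k in REQUIRED_META_KEYS if k not in meta]

meta = json.loads("""{
  "lang": "sk", "name": "sk_core_web_md", "version": "2.3.1",
  "description": "Basic Slovak model without word vectors trained on public data",
  "author": "Daniel Hládek", "email": "dhladek@gmail.com",
  "license": "CC BY-SA 3.0", "pipeline": ["tagger", "parser", "ner"]
}""")
print(missing_meta_keys(meta))
```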
spacy-skmodel/v2/train-v2.sh ADDED
@@ -0,0 +1,21 @@
+ FLAGS="--n-iter 10"
+ OUTDIR=outv2
+ rm -r $OUTDIR
+ mkdir -p $OUTDIR
+ # Train dependency and POS
+ spacy train sk $OUTDIR/posparser input/slovak-treebank input/ud-artificial-gapping -p tagger,parser $FLAGS
+ # Train NER
+ spacy train sk $OUTDIR/ner input/skner/train.json input/skner/test.json -p ner -R $FLAGS
+
+ ## Assemble model
+ mkdir -p $OUTDIR/nerposparser
+ cp -r $OUTDIR/posparser/model-final/* $OUTDIR/nerposparser
+ cp -r $OUTDIR/ner/model-final/ner $OUTDIR/nerposparser
+ python ./assemble.py v2/meta-v2.json $OUTDIR/ner/model-final/meta.json $OUTDIR/posparser/model-final/meta.json $OUTDIR/nerposparser/meta.json
+
+ # Make python package
+ mkdir -p $OUTDIR/dist
+ spacy package $OUTDIR/nerposparser $OUTDIR/dist
+ DNAME=`ls $OUTDIR/dist`
+ cd $OUTDIR/dist/$DNAME
+ python ./setup.py sdist --dist-dir ../
spacy-skmodel/v2/train-v2cc.sh ADDED
@@ -0,0 +1,23 @@
+ FLAGS="-g 0 --n-iter 10"
+ OUTDIR=outccv2
+ rm -r $OUTDIR
+ mkdir -p $OUTDIR
+ spacy init-model sk $OUTDIR/basic -v ./input/cc.sk.300.vec.gz -V 600000
+
+ # Train dependency and POS
+ spacy train sk $OUTDIR/posparser input/slovak-treebank input/ud-artificial-gapping -p tagger,parser -b $OUTDIR/basic $FLAGS
+
+ spacy train sk $OUTDIR/ner input/skner/train.json input/skner/test.json -p ner -R -b $OUTDIR/basic $FLAGS
+
+ ## Assemble model
+ mkdir -p $OUTDIR/nerposparser
+ cp -r $OUTDIR/posparser/model-final/* $OUTDIR/nerposparser
+ cp -r $OUTDIR/ner/model-final/ner $OUTDIR/nerposparser
+ python ./assemble.py v2/meta-ccv2.json $OUTDIR/ner/model-final/meta.json $OUTDIR/posparser/model-final/meta.json $OUTDIR/nerposparser/meta.json
+
+ # Make python package
+ mkdir -p $OUTDIR/dist
+ spacy package $OUTDIR/nerposparser $OUTDIR/dist
+ DNAME=`ls $OUTDIR/dist`
+ cd $OUTDIR/dist/$DNAME
+ python ./setup.py sdist --dist-dir ../