Upload 27 files
- .gitattributes +1 -0
- spacy-skmodel/.gitignore +6 -0
- spacy-skmodel/Makefile +30 -0
- spacy-skmodel/README.md +70 -0
- spacy-skmodel/changemeta.py +35 -0
- spacy-skmodel/clean.sh +4 -0
- spacy-skmodel/config-ner.cfg +160 -0
- spacy-skmodel/config-transformer-ner.cfg +165 -0
- spacy-skmodel/config-transformer.cfg +200 -0
- spacy-skmodel/meta.json +10 -0
- spacy-skmodel/prepare-env.sh +3 -0
- spacy-skmodel/skner2json.py +103 -0
- spacy-skmodel/small-config.cfg +189 -0
- spacy-skmodel/sources/skner/README.txt +12 -0
- spacy-skmodel/sources/skner/wikiann-sk.bio +3 -0
- spacy-skmodel/sources/slovak-treebank/stb.conll +0 -0
- spacy-skmodel/sources/ud-artificial-gapping/README.txt +29 -0
- spacy-skmodel/sources/ud-artificial-gapping/sk-ud-crawled-orphan.conllu +0 -0
- spacy-skmodel/testmodel.py +13 -0
- spacy-skmodel/train-small.sh +35 -0
- spacy-skmodel/train.sh +31 -0
- spacy-skmodel/treebank2json.py +111 -0
- spacy-skmodel/v2/01.prepare.sh +15 -0
- spacy-skmodel/v2/assemble.py +39 -0
- spacy-skmodel/v2/meta-ccv2.json +12 -0
- spacy-skmodel/v2/meta-v2.json +12 -0
- spacy-skmodel/v2/train-v2.sh +21 -0
- spacy-skmodel/v2/train-v2cc.sh +23 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+spacy-skmodel/sources/skner/wikiann-sk.bio filter=lfs diff=lfs merge=lfs -text
spacy-skmodel/.gitignore
ADDED
@@ -0,0 +1,6 @@
+build
+venv
+dist
+input
+posparser
+nerposparser
spacy-skmodel/Makefile
ADDED
@@ -0,0 +1,30 @@
+all: input/sk_snk-ud-test.spacy input/sk_snk-ud-train.spacy input/train-ner.spacy input/vectors/config.cfg
+
+sources/slovak-treebank/sk_snk-ud-test.conllu:
+	mkdir -p sources/slovak-treebank
+	cd sources && wget -P slovak-treebank https://raw.githubusercontent.com/UniversalDependencies/UD_Slovak-SNK/master/sk_snk-ud-test.conllu
+
+sources/slovak-treebank/sk_snk-ud-train.conllu:
+	mkdir -p sources/slovak-treebank
+	cd sources && wget -P slovak-treebank https://raw.githubusercontent.com/UniversalDependencies/UD_Slovak-SNK/master/sk_snk-ud-train.conllu
+
+sources/floret/vectors.floret.gz:
+	mkdir -p sources/floret
+	cd sources && wget -P floret https://files.kemt.fei.tuke.sk/models/fasttext/sk-fastext-floretvec-skweb2021/vectors.floret.gz --no-check-certificate
+
+input/sk_snk-ud-test.spacy: sources/slovak-treebank/sk_snk-ud-test.conllu
+	mkdir -p input
+	spacy convert -n 10 sources/slovak-treebank/sk_snk-ud-test.conllu input
+
+input/sk_snk-ud-train.spacy: sources/slovak-treebank/sk_snk-ud-train.conllu
+	mkdir -p input
+	spacy convert -n 10 sources/slovak-treebank/sk_snk-ud-train.conllu input
+
+input/train-ner.spacy: sources/skner/wikiann-sk.bio
+	python skner2json.py ./sources/skner/wikiann-sk.bio input/train-ner.json input/test-ner.json
+	spacy convert input/train-ner.json input
+	spacy convert input/test-ner.json input
+
+input/vectors/config.cfg: sources/floret/vectors.floret.gz
+	mkdir -p input/vectors
+	spacy init vectors sk sources/floret/vectors.floret.gz input/vectors -V -m floret
spacy-skmodel/README.md
ADDED
@@ -0,0 +1,70 @@
+# Slovak Spacy Model
+
+This is a Slovak spaCy model.
+
+## Features
+
+- Requires spaCy 3.x.
+- Contains floret word vectors.
+- The tagger uses the Slovak National Corpus tagset.
+- The morphological analyzer uses the Universal Dependencies tagset and is trained on the Slovak Dependency Treebank.
+- The lemmatizer is trained on the Slovak Dependency Treebank.
+- The named entity recognizer is trained separately on the WikiAnn database.
+
+## Downloads
+
+### Version 3.4
+
+- [Spacy 3.4, Dependencies](https://files.kemt.fei.tuke.sk/models/spacy/sk_dep_web_md-3.4.1.tar.gz)
+  - Trained for lemmatization, POS tagging and dependency relations.
+  - Contains floret word vectors trained on our web corpus.
+  - Should be free of license issues.
+- [Spacy 3.4, NER + Dependencies](https://files.kemt.fei.tuke.sk/models/spacy/sk_core_web_md-3.4.1.tar.gz)
+  - Includes the dependencies model.
+  - Uses a separate fine-tuned model for NER.
+
+### Version 3.3
+
+- [Spacy 3.3, Dependencies](https://files.kemt.fei.tuke.sk/models/spacy/sk_dep_web_md-3.3.0.tar.gz). Trained for lemmatization, POS tagging and dependency relations.
+- [Spacy 3.3, NER + Dependencies](https://files.kemt.fei.tuke.sk/models/spacy/sk_core_web_md-3.3.0.tar.gz). Uses a separate fine-tuned model for NER.
+
+These models do not have word vectors.
+
+## Training
+
+Requirements for training:
+
+- Anaconda virtual environment
+- Spacy 3
+- make
+- bash
+
+Usage:
+
+1. Install dependencies into the Conda environment:
+
+       ./prepare-env.sh
+
+2. Download and prepare the data:
+
+       make
+
+3. Train the models:
+
+       ./train.sh
+
+## Credits
+
+Author:
+
+Daniel Hládek daniel.hladek@tuke.sk and Technical University of Košice
+
+Sources:
+
+- The model uses spacy-transformers and [SlovakBERT](https://huggingface.co/gerulata/slovakbert).
+- [Part of Speech and Dependency relations](https://github.com/UniversalDependencies/UD_Slovak-SNK):
+  the Slovak UD treebank, licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
+- [Semi-automatic named entities](https://huggingface.co/datasets/wikiann) - unspecified license
spacy-skmodel/changemeta.py
ADDED
@@ -0,0 +1,35 @@
+import json
+import sys
+
+pos_dname = sys.argv[1]
+with open(pos_dname + "/meta.json") as f:
+    pos_meta = json.load(f)
+    pos_performance = pos_meta["performance"]
+
+
+dname = sys.argv[2]
+meta_name = dname + "/meta.json"
+with open(meta_name) as f:
+    doc = json.load(f)
+    doc["name"] = "core_web_md"
+    if "disabled" in doc:
+        del doc["disabled"]
+    doc["pipeline"] = ["transformer","tagger","morphologizer","trainable_lemmatizer","parser","ner"]
+    for k,v in pos_performance.items():
+        doc["performance"][k] = v
+
+with open(meta_name,"w") as f:
+    json.dump(doc,f,indent=4)
+
+clines = []
+config_name = dname + "/config.cfg"
+with open(config_name) as f:
+    for l in f:
+        line = l.rstrip()
+        if "disabled" in line:
+            line = "disabled: []"
+        clines.append(line)
+
+
+with open(config_name,"w") as f:
+    print("\n".join(clines),file=f)
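The script above copies the POS pipeline's metrics into the assembled pipeline's meta.json, renames the package, and drops the `disabled` list. A minimal sketch of that merge on in-memory dicts (the metric values here are invented for illustration, not taken from the actual models):

```python
import json

# Hypothetical stand-ins for posparser/meta.json and the assembled
# pipeline's meta.json handled by changemeta.py.
pos_meta = {"performance": {"tag_acc": 0.93, "dep_uas": 0.88}}
ner_meta = {
    "name": "pipeline",
    "disabled": ["tagger"],
    "performance": {"ents_f": 0.81},
}

ner_meta["name"] = "core_web_md"      # rename the package
ner_meta.pop("disabled", None)        # nothing stays disabled
# copy every POS metric into the merged performance block
ner_meta["performance"].update(pos_meta["performance"])

print(json.dumps(ner_meta, sort_keys=True))
```

The merged `performance` block then reports POS and NER metrics side by side, which is what the published model's metadata shows.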
spacy-skmodel/clean.sh
ADDED
@@ -0,0 +1,4 @@
+rm -rf traindir
+rm -rf posparser
+rm -rf nerposparser
+rm -rf dist
spacy-skmodel/config-ner.cfg
ADDED
@@ -0,0 +1,160 @@
+[paths]
+train = "input/train-ner.spacy"
+dev = "input/test-ner.spacy"
+vectors = "input/vectors"
+init_tok2vec = null
+
+[system]
+gpu_allocator = null
+seed = 0
+
+[nlp]
+lang = "sk"
+pipeline = ["tok2vec","parser","tagger","ner"]
+batch_size = 1000
+#disabled = ["parser","tagger"]
+before_creation = null
+after_creation = null
+after_pipeline_creation = null
+tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
+
+[components]
+
+[components.ner]
+factory = "ner"
+moves = null
+update_with_oracle_cut_size = 100
+
+[components.ner.model]
+@architectures = "spacy.TransitionBasedParser.v2"
+state_type = "ner"
+extra_state_tokens = false
+hidden_width = 64
+maxout_pieces = 2
+use_upper = true
+nO = null
+
+[components.ner.model.tok2vec]
+@architectures = "spacy.HashEmbedCNN.v2"
+pretrained_vectors = null
+width = 96
+depth = 4
+embed_size = 2000
+window_size = 1
+maxout_pieces = 3
+subword_features = true
+
+[components.parser]
+source = "sk_pipeline"
+replace_listeners = ["model.tok2vec"]
+
+[components.tagger]
+source = "sk_pipeline"
+replace_listeners = ["model.tok2vec"]
+
+[components.tok2vec]
+factory = "tok2vec"
+
+[components.tok2vec.model]
+@architectures = "spacy.Tok2Vec.v2"
+
+[components.tok2vec.model.embed]
+@architectures = "spacy.MultiHashEmbed.v2"
+width = ${components.tok2vec.model.encode.width}
+attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
+rows = [5000,2500,2500,2500]
+include_static_vectors = true
+
+[components.tok2vec.model.encode]
+@architectures = "spacy.MaxoutWindowEncoder.v2"
+width = 96
+depth = 4
+window_size = 1
+maxout_pieces = 3
+
+[corpora]
+
+[corpora.dev]
+@readers = "spacy.Corpus.v1"
+path = ${paths.dev}
+max_length = 0
+gold_preproc = false
+limit = 0
+augmenter = null
+
+[corpora.train]
+@readers = "spacy.Corpus.v1"
+path = ${paths.train}
+max_length = 2000
+gold_preproc = false
+limit = 0
+augmenter = null
+
+[training]
+dev_corpus = "corpora.dev"
+train_corpus = "corpora.train"
+seed = ${system.seed}
+gpu_allocator = ${system.gpu_allocator}
+dropout = 0.1
+accumulate_gradient = 1
+patience = 1600
+max_epochs = 0
+max_steps = 20000
+eval_frequency = 200
+frozen_components = ["tagger","parser"]
+before_to_disk = null
+
+[training.batcher]
+@batchers = "spacy.batch_by_words.v1"
+discard_oversize = false
+tolerance = 0.2
+get_length = null
+
+[training.batcher.size]
+@schedules = "compounding.v1"
+start = 100
+stop = 1000
+compound = 1.001
+t = 0.0
+
+[training.logger]
+@loggers = "spacy.ConsoleLogger.v1"
+progress_bar = false
+
+[training.optimizer]
+@optimizers = "Adam.v1"
+beta1 = 0.9
+beta2 = 0.999
+L2_is_weight_decay = true
+L2 = 0.01
+grad_clip = 1.0
+use_averages = false
+eps = 0.00000001
+learn_rate = 0.001
+
+[training.score_weights]
+dep_las_per_type = null
+sents_p = null
+sents_r = null
+ents_per_type = null
+dep_uas = 0.17
+dep_las = 0.17
+sents_f = 0.0
+tag_acc = 0.33
+ents_f = 0.33
+ents_p = 0.0
+ents_r = 0.0
+
+[pretraining]
+
+[initialize]
+vectors = ${paths.vectors}
+init_tok2vec = ${paths.init_tok2vec}
+vocab_data = null
+lookups = null
+before_init = null
+after_init = null
+
+[initialize.components]
+
+[initialize.tokenizer]
spacy-skmodel/config-transformer-ner.cfg
ADDED
@@ -0,0 +1,165 @@
+[paths]
+train = "input/train-ner.spacy"
+dev = "input/test-ner.spacy"
+vectors = "input/vectors"
+init_tok2vec = null
+
+[system]
+gpu_allocator = "pytorch"
+seed = 0
+
+[nlp]
+lang = "sk"
+pipeline = ["transformer","tagger","morphologizer","trainable_lemmatizer","parser","ner"]
+batch_size = 128
+disabled = []
+before_creation = null
+after_creation = null
+after_pipeline_creation = null
+tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
+
+[components]
+
+[components.ner]
+factory = "ner"
+
+[components.ner.model]
+@architectures = "spacy.TransitionBasedParser.v2"
+state_type = "ner"
+extra_state_tokens = false
+hidden_width = 64
+maxout_pieces = 2
+use_upper = false
+nO = null
+
+[components.ner.model.tok2vec]
+@architectures = "spacy-transformers.TransformerListener.v1"
+grad_factor = 1.0
+
+[components.ner.model.tok2vec.pooling]
+@layers = "reduce_mean.v1"
+
+[components.morphologizer]
+source = "sk_dep_web_md"
+replace_listeners = ["model.tok2vec"]
+
+[components.parser]
+source = "sk_dep_web_md"
+replace_listeners = ["model.tok2vec"]
+
+[components.tagger]
+source = "sk_dep_web_md"
+replace_listeners = ["model.tok2vec"]
+
+[components.trainable_lemmatizer]
+source = "sk_dep_web_md"
+replace_listeners = ["model.tok2vec"]
+
+[components.transformer]
+factory = "transformer"
+max_batch_items = 4096
+set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}
+
+[components.transformer.model]
+@architectures = "spacy-transformers.TransformerModel.v3"
+name = "gerulata/slovakbert"
+mixed_precision = false
+
+[components.transformer.model.get_spans]
+@span_getters = "spacy-transformers.strided_spans.v1"
+window = 128
+stride = 96
+
+[components.transformer.model.grad_scaler_config]
+
+[components.transformer.model.tokenizer_config]
+use_fast = true
+
+[components.transformer.model.transformer_config]
+
+[corpora]
+
+[corpora.dev]
+@readers = "spacy.Corpus.v1"
+path = ${paths.dev}
+max_length = 0
+gold_preproc = false
+limit = 0
+augmenter = null
+
+[corpora.train]
+@readers = "spacy.Corpus.v1"
+path = ${paths.train}
+max_length = 0
+gold_preproc = false
+limit = 0
+augmenter = null
+
+[training]
+accumulate_gradient = 3
+dev_corpus = "corpora.dev"
+train_corpus = "corpora.train"
+seed = ${system.seed}
+gpu_allocator = ${system.gpu_allocator}
+dropout = 0.1
+patience = 1600
+max_epochs = 0
+max_steps = 20000
+eval_frequency = 200
+frozen_components = ["tagger","morphologizer","trainable_lemmatizer","parser"]
+annotating_components = []
+before_to_disk = null
+
+[training.batcher]
+@batchers = "spacy.batch_by_padded.v1"
+discard_oversize = true
+size = 2000
+buffer = 256
+get_length = null
+
+[training.logger]
+@loggers = "spacy.ConsoleLogger.v1"
+progress_bar = false
+
+[training.optimizer]
+@optimizers = "Adam.v1"
+beta1 = 0.9
+beta2 = 0.999
+L2_is_weight_decay = true
+L2 = 0.01
+grad_clip = 1.0
+use_averages = false
+eps = 0.00000001
+
+[training.optimizer.learn_rate]
+@schedules = "warmup_linear.v1"
+warmup_steps = 250
+total_steps = 20000
+initial_rate = 0.00005
+
+[training.score_weights]
+tag_acc = 0.26
+pos_acc = 0.12
+morph_acc = 0.12
+morph_per_feat = null
+lemma_acc = 0.26
+dep_uas = 0.12
+dep_las = 0.12
+dep_las_per_type = null
+sents_p = null
+sents_r = null
+sents_f = 0.0
+
+[pretraining]
+
+[initialize]
+vectors = ${paths.vectors}
+init_tok2vec = ${paths.init_tok2vec}
+vocab_data = null
+lookups = null
+before_init = null
+after_init = null
+
+[initialize.components]
+
+[initialize.tokenizer]
spacy-skmodel/config-transformer.cfg
ADDED
@@ -0,0 +1,200 @@
+[paths]
+train = "input/sk_snk-ud-train.spacy"
+dev = "input/sk_snk-ud-test.spacy"
+vectors = "input/vectors"
+init_tok2vec = null
+
+[system]
+gpu_allocator = "pytorch"
+seed = 0
+
+[nlp]
+lang = "sk"
+pipeline = ["transformer","tagger","morphologizer","trainable_lemmatizer","parser"]
+batch_size = 128
+disabled = []
+before_creation = null
+after_creation = null
+after_pipeline_creation = null
+tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
+
+[components]
+
+[components.morphologizer]
+factory = "morphologizer"
+extend = false
+overwrite = true
+scorer = {"@scorers":"spacy.morphologizer_scorer.v1"}
+
+[components.morphologizer.model]
+@architectures = "spacy.Tagger.v2"
+nO = null
+normalize = false
+
+[components.morphologizer.model.tok2vec]
+@architectures = "spacy-transformers.TransformerListener.v1"
+grad_factor = 1.0
+pooling = {"@layers":"reduce_mean.v1"}
+upstream = "*"
+
+[components.parser]
+factory = "parser"
+learn_tokens = false
+min_action_freq = 30
+moves = null
+scorer = {"@scorers":"spacy.parser_scorer.v1"}
+update_with_oracle_cut_size = 100
+
+[components.parser.model]
+@architectures = "spacy.TransitionBasedParser.v2"
+state_type = "parser"
+extra_state_tokens = false
+hidden_width = 128
+maxout_pieces = 3
+use_upper = false
+nO = null
+
+[components.parser.model.tok2vec]
+@architectures = "spacy-transformers.TransformerListener.v1"
+grad_factor = 1.0
+pooling = {"@layers":"reduce_mean.v1"}
+upstream = "*"
+
+[components.tagger]
+factory = "tagger"
+neg_prefix = "!"
+overwrite = false
+scorer = {"@scorers":"spacy.tagger_scorer.v1"}
+
+[components.tagger.model]
+@architectures = "spacy.Tagger.v2"
+nO = null
+normalize = false
+
+[components.tagger.model.tok2vec]
+@architectures = "spacy-transformers.TransformerListener.v1"
+grad_factor = 1.0
+pooling = {"@layers":"reduce_mean.v1"}
+upstream = "*"
+
+[components.trainable_lemmatizer]
+factory = "trainable_lemmatizer"
+backoff = "orth"
+min_tree_freq = 3
+overwrite = false
+scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
+top_k = 1
+
+[components.trainable_lemmatizer.model]
+@architectures = "spacy.Tagger.v2"
+nO = null
+normalize = false
+
+[components.trainable_lemmatizer.model.tok2vec]
+@architectures = "spacy-transformers.TransformerListener.v1"
+grad_factor = 1.0
+pooling = {"@layers":"reduce_mean.v1"}
+upstream = "*"
+
+[components.transformer]
+factory = "transformer"
+max_batch_items = 4096
+set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}
+
+[components.transformer.model]
+@architectures = "spacy-transformers.TransformerModel.v3"
+name = "gerulata/slovakbert"
+mixed_precision = false
+
+[components.transformer.model.get_spans]
+@span_getters = "spacy-transformers.strided_spans.v1"
+window = 128
+stride = 96
+
+[components.transformer.model.grad_scaler_config]
+
+[components.transformer.model.tokenizer_config]
+use_fast = true
+
+[components.transformer.model.transformer_config]
+
+[corpora]
+
+[corpora.dev]
+@readers = "spacy.Corpus.v1"
+path = ${paths.dev}
+max_length = 0
+gold_preproc = false
+limit = 0
+augmenter = null
+
+[corpora.train]
+@readers = "spacy.Corpus.v1"
+path = ${paths.train}
+max_length = 0
+gold_preproc = false
+limit = 0
+augmenter = null
+
+[training]
+accumulate_gradient = 3
+dev_corpus = "corpora.dev"
+train_corpus = "corpora.train"
+seed = ${system.seed}
+gpu_allocator = ${system.gpu_allocator}
+dropout = 0.1
+patience = 1600
+max_epochs = 0
+max_steps = 20000
+eval_frequency = 200
+frozen_components = []
+annotating_components = []
+before_to_disk = null
+
+[training.batcher]
+@batchers = "spacy.batch_by_padded.v1"
+discard_oversize = true
+size = 2000
+buffer = 256
+
+[training.logger]
+@loggers = "spacy.ConsoleLogger.v1"
+progress_bar = false
+
+[training.optimizer]
+@optimizers = "Adam.v1"
+beta1 = 0.9
+beta2 = 0.999
+L2_is_weight_decay = true
+L2 = 0.01
+grad_clip = 1.0
+use_averages = false
+eps = 0.00000001
+
+[training.optimizer.learn_rate]
+@schedules = "warmup_linear.v1"
+warmup_steps = 250
+total_steps = 20000
+initial_rate = 0.00005
+
+[training.score_weights]
+tag_acc = 0.26
+pos_acc = 0.12
+morph_acc = 0.12
+morph_per_feat = null
+lemma_acc = 0.26
+dep_uas = 0.12
+dep_las = 0.12
+dep_las_per_type = null
+sents_p = null
+sents_r = null
+sents_f = 0.0
+
+[pretraining]
+
+[initialize]
+vectors = ${paths.vectors}
+
+[initialize.components]
+
+[initialize.tokenizer]
spacy-skmodel/meta.json
ADDED
@@ -0,0 +1,10 @@
+{
+    "lang":"sk",
+    "name":"dep_web_md",
+    "version":"3.4.1",
+    "description":"Slovak model with part-of-speech and parsing",
+    "author":"Daniel Hládek",
+    "email":"daniel.hladek@tuke.sk",
+    "url":"https://nlp.kemt.fei.tuke.sk",
+    "license":"BSD"
+}
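A note on naming: spaCy builds the installable package name by joining `lang` and `name` from meta.json, so this file corresponds to the `sk_dep_web_md` package referenced in the README downloads. A quick sanity check with the relevant fields inlined:

```python
import json

# The identifying fields of the meta.json above, inlined for illustration.
meta = json.loads("""
{
  "lang": "sk",
  "name": "dep_web_md",
  "version": "3.4.1"
}
""")

# spaCy names the package <lang>_<name>; spacy.load() resolves this name.
package = f"{meta['lang']}_{meta['name']}"
print(package, meta["version"])  # sk_dep_web_md 3.4.1
```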
spacy-skmodel/prepare-env.sh
ADDED
@@ -0,0 +1,3 @@
+conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
+pip install -U spacy[cuda113,transformers,lookups]==3.4
+rm -r ./input/*
spacy-skmodel/skner2json.py
ADDED
|
@@ -0,0 +1,103 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import sys
|
| 2 |
+
import json
|
| 3 |
+
# https://spacy.io/api/data-formats#training
#from spacy.gold import offsets_from_biluo_tags
#from spacy.gold import iob_to_biluo

def bio2biluo(ners):
    """Convert one sentence's BIO entity tags to spaCy's BILUO scheme."""
    ners1 = []
    # Repair dangling I- tags (an I- that does not follow B- or I- becomes B-)
    for i, ner in enumerate(ners):
        ners1.append(list(ner))
        if i > 0 and ners[i-1][0] != "B" and ners[i-1][0] != "I" and ner[0] == "I":
            ners1[i][0] = "B"
            print("fixed")
    ners = ners1
    ners1 = []
    for i, ner in enumerate(ners):
        ners1.append(ner)
        # A lone B- not followed by I- is a single-token entity: U-
        if i > 0 and ners[i-1][0] == "B" and ner[0] != "I":
            ners1[i-1][0] = "U"
        # The final I- of a multi-token entity becomes L-
        if i > 1 and (ners[i-2][0] == "I" or ners[i-2][0] == "B") and ners[i-1][0] == "I" and ners[i][0] != "I":
            ners1[i-1][0] = "L"
    # Fix the last token of the sentence
    if len(ners) == 1 and ners[0][0] == "I":
        ners1[0][0] = "U"
    if len(ners) > 1 and ners[-1][0] == "B":
        ners1[-1][0] = "U"
    if len(ners) > 0 and ners[-1][0] == "I":
        ners1[-1][0] = "L"
    ners2 = []
    for nerlist in ners1:
        ners2.append("".join(nerlist))
    return ners2

def save_sentences(sentences, filename):
    paragraphs = []
    for id, sentence in enumerate(sentences):
        tokens = []
        words = []
        for word, tag in sentence:
            words.append(word)
            tokens.append({"orth": word, "ner": tag})
        paragraphs.append({"id": id, "paragraphs": [{"raw": " ".join(words), "sentences": [{"tokens": tokens}]}]})
    with open(filename, "w") as f:
        json.dump(paragraphs, f)


def strippunct(word):
    # Replace non-alphabetic leading/trailing characters with a placeholder
    chars = list(word)
    if not word[0].isalpha():
        chars[0] = "x"
    if not word[-1].isalpha():
        chars[-1] = "x"
    return "".join(chars)

def process_data(filename):
    with open(filename) as f:
        sentences = []
        words = []
        ners = []
        for l in f:
            line = l.strip()
            if len(line) > 0:
                tokens = l.split()
                word = tokens[0].strip()
                ner = tokens[-1].strip()
                if len(ner) > 1 and ner[1] == "-":
                    word = strippunct(word)
                if len(word) == 0:
                    continue
                words.append(word)
                ners.append(ner)
            else:
                # Sentence boundary: convert the tags and store the sentence
                ners = bio2biluo(ners)
                sentence = []
                for word, tag in zip(words, ners):
                    sentence.append((word, tag))
                sentences.append(sentence)
                del ners[:]
                del words[:]
    # Every tenth sentence goes into the test set
    testset = []
    trainset = []
    for i, sentence in enumerate(sentences):
        if i % 10 == 0:
            testset.append(sentence)
        else:
            trainset.append(sentence)

    save_sentences(trainset, sys.argv[2])
    save_sentences(testset, sys.argv[3])

process_data(sys.argv[1])
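For reference, the BIO-to-BILUO conversion above can be sanity-checked against a compact, self-contained reimplementation. This is an independent sketch of the scheme (including the dangling-I repair the script performs), not the project's own code:

```python
def bio_to_biluo(tags):
    """Convert BIO entity tags to BILUO (Begin, In, Last, Unit, Out)."""
    out = []
    prev = "O"
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        # Repair a dangling I- (one that does not follow B- or I-)
        if tag.startswith("I-") and not (prev.startswith("B-") or prev.startswith("I-")):
            tag = "B" + tag[1:]
        if tag.startswith("B-"):
            # B- stays B- only when an I- continues the span, else it is a unit
            out.append(tag if nxt.startswith("I-") else "U" + tag[1:])
        elif tag.startswith("I-"):
            # I- stays I- only when another I- follows, else it is the last token
            out.append(tag if nxt.startswith("I-") else "L" + tag[1:])
        else:
            out.append(tag)
        prev = tag
    return out

print(bio_to_biluo(["B-LOC", "I-LOC", "I-LOC", "O", "B-PER"]))
# ['B-LOC', 'I-LOC', 'L-LOC', 'O', 'U-PER']
```

Note that this sketch does not check that the entity type is consistent across a span, which well-formed BIO data guarantees anyway.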
spacy-skmodel/small-config.cfg
ADDED
|
@@ -0,0 +1,189 @@
[paths]
train = "input/sk_snk-ud-train.spacy"
dev = "input/sk_snk-ud-test.spacy"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "sk"
pipeline = ["tok2vec","tagger","morphologizer","trainable_lemmatizer","parser"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.morphologizer]
factory = "morphologizer"

[components.morphologizer.model]
@architectures = "spacy.Tagger.v2"
nO = null

[components.morphologizer.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[components.parser]
factory = "parser"

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = true
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[components.tagger]
factory = "tagger"

[components.tagger.model]
@architectures = "spacy.Tagger.v2"
nO = null

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.trainable_lemmatizer]
factory = "trainable_lemmatizer"
backoff = "orth"
min_tree_freq = 3
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
top_k = 1

[components.trainable_lemmatizer.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.trainable_lemmatizer.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.25
accumulate_gradient = 1
patience = 1600
max_epochs = 25
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null
annotating_components = []

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
tag_acc = 0.17
pos_acc = 0.17
morph_acc = 0.17
morph_per_feat = null
lemma_acc = 0.33
dep_uas = 0.08
dep_las = 0.08
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.components.tagger]

[initialize.tokenizer]
spacy-skmodel/sources/skner/README.txt
ADDED
|
@@ -0,0 +1,12 @@
Silver-standard Name Annotations From Wikipedia Markups
Xiaoman Pan
panx2@rpi.edu

FORMAT:
[TOKEN] [ADDITIONAL INFORMATION] [TAG]

ADDITIONAL INFORMATION FORMAT:
[Wikipedia title] [name mention] [entity type] [entity type confidence] [English Wikipedia title]

If you would like to cite this work, please cite the following publication:
Cross-lingual Name Tagging and Linking for 282 Languages
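Given the format above, each line can be split into the token (first whitespace-separated field) and the BIO tag (last field), with the additional information in between. A minimal sketch; the example line is made up to match the documented layout:

```python
def parse_bio_line(line):
    """Return (token, bio_tag) from a line of the documented format.

    The token is the first whitespace-separated field and the tag is
    the last; the fields in between are the additional information.
    """
    fields = line.split()
    return fields[0], fields[-1]

# Hypothetical line following the documented layout
token, tag = parse_bio_line("Bratislava sk:Bratislava Bratislava LOC 0.95 en:Bratislava B-LOC")
print(token, tag)  # Bratislava B-LOC
```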
spacy-skmodel/sources/skner/wikiann-sk.bio
ADDED
|
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e6c3ff1eb8ea5bf2de7a19f44d36df45e61a3f52d3e415b81dfeadeffc61ee4e
size 13898246
spacy-skmodel/sources/slovak-treebank/stb.conll
ADDED
|
The diff for this file is too large to render.
See raw diff
spacy-skmodel/sources/ud-artificial-gapping/README.txt
ADDED
|
@@ -0,0 +1,29 @@
Artificial dependency trees in the Universal Dependencies v2 style, focused
on gapping (the 'orphan' relation in UD). For motivation and description of
the data, see the paper cited below. Please cite the paper if you use the data
in your academic work.

@inproceedings{droganova2018,
  title = {Parse Me if You Can: Artificial Treebanks for Parsing Experiments on Elliptical Constructions},
  author = {Kira Droganova and Daniel Zeman and Jenna Kanerva and Filip Ginter},
  year = {2018},
  booktitle = {Proceedings of the 11th International Conference on Language Resources and Evaluation ({LREC} 2018)},
  publisher = {European Language Resources Association},
  organization = {European Language Resource Association},
  address = {Paris, France},
  location = {Miyazaki, Japan},
  venue = {Phoenix Seagaia Conference Center}
}

Permanent URI of the dataset:
http://hdl.handle.net/11234/1-2616

*-crawled-* data are crawled from the web, parsed by two parsers, filtered so
that only those trees survive where the two parsers agree, then processed
to create artificial gapping.
*-{train,dev,test}-* data are based on Universal Dependency treebanks release
2.1 (November 2017).
English and Finnish data were manually checked and modified after gapping
structures had been automatically drafted.
Czech, Slovak and Russian data were processed only automatically.
spacy-skmodel/sources/ud-artificial-gapping/sk-ud-crawled-orphan.conllu
ADDED
|
The diff for this file is too large to render.
See raw diff
spacy-skmodel/testmodel.py
ADDED
|
@@ -0,0 +1,13 @@
import spacy
import sys

nlp = spacy.load(sys.argv[1])
nlp.enable_pipe("tagger")
nlp.enable_pipe("parser")
nlp.enable_pipe("ner")
lines = []
for line in sys.stdin:
    lines.append(line.rstrip())
doc = nlp("\n".join(lines))
for token in doc:
    # Token has no `ner_` attribute; the entity label lives in `ent_type_`
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop, token.ent_type_)
spacy-skmodel/train-small.sh
ADDED
|
@@ -0,0 +1,35 @@
set -e # fail on error

#make # prepare data

export CUDA_VISIBLE_DEVICES=0
# cleanup old results
#rm -rf dist
mkdir -p dist
mkdir -p train
TRAINDIR=train/smposparser
NERTRAINDIR=train/smnerposparser
VER=3.3.0
MODELDIR=dist/sk_dep_web_sm-$VER
NERMODELDIR=dist/sk_core_web_sm-$VER
mkdir -p $TRAINDIR
# Train POS and dependencies
spacy train small-config.cfg -o $TRAINDIR -g 0 > $TRAINDIR/train.log 2> $TRAINDIR/train.err.log
# Package POS
spacy package -m small-meta.json -F $TRAINDIR/model-best dist
cd $MODELDIR
python ./setup.py sdist
# install to include pos and dependencies in new model
# name must be the same as in meta.json
pip install $MODELDIR.tar.gz
cd ../../
mkdir -p $NERTRAINDIR
# Train NER, copy POS and dep from old model
spacy train small-ner.cfg -o $NERTRAINDIR -g 0 > $NERTRAINDIR/train.log 2> $NERTRAINDIR/train.err.log
# Correct meta
cp $NERTRAINDIR/model-best/meta.json $NERTRAINDIR/model-best/meta-ner.json
python changemeta.py $TRAINDIR/model-best $NERTRAINDIR/model-best
# Package result
spacy package --version $VER $NERTRAINDIR/model-best dist
cd $NERMODELDIR
python ./setup.py sdist
spacy-skmodel/train.sh
ADDED
|
@@ -0,0 +1,31 @@
set -e # fail on error

make # prepare data

export CUDA_VISIBLE_DEVICES=0
VERSION=3.4.1
# cleanup old results
#rm -rf dist
mkdir -p dist
mkdir -p train
mkdir -p train/sposparser
# Train POS and dependencies
spacy train config-transformer.cfg -o ./train/sposparser -g 0 > ./train/sposparser/train.log 2> ./train/sposparser/train.err.log
# Package POS
spacy package -m meta.json -F train/sposparser/model-best dist
cd dist/sk_dep_web_md-$VERSION
python ./setup.py sdist
# install to include pos and dependencies in new model
# name must be the same as in meta.json
#pip install dist/sk_dep_web_md-$VERSION.tar.gz
#cd ../../
#mkdir -p train/snerposparser
# Train NER, copy POS and dep from old model
#spacy train config-transformer-ner.cfg -o ./train/snerposparser -g 0 > ./train/snerposparser/train.log 2> ./train/snerposparser/train.err.log
# Correct meta
#cp ./train/snerposparser/model-best/meta.json ./train/snerposparser/model-best/meta-ner.json
#python changemeta.py ./train/sposparser/model-best ./train/snerposparser/model-best
# Package result
#spacy package --version $VERSION train/snerposparser/model-best dist
#cd dist/sk_core_web_md-$VERSION
#python ./setup.py sdist
spacy-skmodel/treebank2json.py
ADDED
|
@@ -0,0 +1,111 @@
import sys
import json
# https://spacy.io/api/data-formats#training
#from spacy.gold import offsets_from_biluo_tags
#from spacy.gold import iob_to_biluo

# Map Universal Dependencies relations to Slovak treebank (PDT-style) labels
depmap = {
    "case": "AuxP",
    "root": "Pred",  # / Pred_M
    "punct": "AuxK",
    "nsubj": "Sb",
    "obj": "Obj",
    "conj": "Sb",
    "cc": "Coord",
    "orphan": "Obj",
    "advmod": "Adv",
    "amod": "Atr",
    "nmod": "Atr",
    "mark": "AuxC",
    "aux": "AuxV",
    "det": "Atr",
    "obl": "Atr",
    "expl:pv": "AuxT",
}

def save_data(filename, dataset):
    sentences = []
    words = []
    docs = []
    for i, item in enumerate(dataset):
        # Skip sentences whose head offsets point outside the sentence
        bad = False
        for token in item:
            words.append(token["orth"])
            h = token["head"] + token["id"]
            #print(h,len(item))
            if h < 0 or h >= len(item):
                print(item)
                bad = True
                break
        if bad:
            continue
        sentences.append({"tokens": item})
        # Group five sentences into one document
        if len(sentences) > 4:
            doc = {
                "id": i,
                "paragraphs": [{
                    "raw": " ".join(words),
                    "sentences": list(sentences)
                }]
            }
            docs.append(doc)
            del words[:]
            del sentences[:]

    # Flush the remaining sentences into one last document
    if len(docs) > 0 and len(sentences) > 0:
        doc = {
            "id": docs[-1]["id"] + 1,
            "paragraphs": [{
                "raw": " ".join(words),
                "sentences": list(sentences)
            }]
        }
        docs.append(doc)
    with open(filename, "w") as f:
        json.dump(docs, f)


def process_data(trainname, testname):
    dataset = []
    sentence = []
    for l in sys.stdin:
        if l[0] == "#":
            continue
        tokens = l.split()
        #print(tokens)
        if len(tokens) < 2:
            # Sentence boundary
            if len(sentence) > 0:
                dataset.append(list(sentence))
                del sentence[:]
            continue
        head = int(tokens[6])
        id = int(tokens[0]) - 1
        # Heads are stored as offsets relative to the token; 0 marks the root
        h = 0
        if head != 0:
            h = head - id - 1
        dep = tokens[7]
        if dep in depmap:
            dep = depmap[dep]
        #print(h)
        token = {
            "id": id,
            "orth": tokens[1],
            "tag": tokens[4],
            # "ner":
            "head": h,
            "dep": dep,
        }
        sentence.append(token)
    # Every tenth sentence goes into the test set
    trainset = []
    testset = []
    for i, item in enumerate(dataset):
        if i % 10 == 0:
            testset.append(item)
        else:
            trainset.append(item)
    save_data(trainname, trainset)
    save_data(testname, testset)

process_data(sys.argv[1], sys.argv[2])
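The head arithmetic in the converter turns a 1-based CoNLL head index into an offset relative to the token itself, with 0 marking the sentence root. A small sketch of the same computation, written independently for illustration:

```python
def relative_head(token_id_1based, head_1based):
    """CoNLL head index -> offset-from-token encoding.

    0 means the token is the sentence root; otherwise the value is the
    signed distance from the token's 0-based position to its head's
    0-based position.
    """
    token_index = token_id_1based - 1          # 0-based position of the token
    if head_1based == 0:                       # root attaches to itself
        return 0
    return head_1based - token_index - 1       # == head_index_0based - token_index

# token 2 attached to token 1: head is one position to the left
print(relative_head(2, 1))  # -1
# token 1 attached to token 3: head is two positions to the right
print(relative_head(1, 3))  # 2
```

Recovering the absolute head is then `token_index + offset`, which is exactly the bounds check `save_data` performs with `token["head"] + token["id"]`.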
spacy-skmodel/v2/01.prepare.sh
ADDED
|
@@ -0,0 +1,15 @@
# conda install spacy=2.3.5 cupy cudatoolkit=9.2
mkdir -p input
# Prepare Treebank
mkdir -p input/slovak-treebank
spacy convert ./sources/slovak-treebank/stb.conll ./input/slovak-treebank
# UDAG used as evaluation
mkdir -p input/ud-artificial-gapping
spacy convert ./sources/ud-artificial-gapping/sk-ud-crawled-orphan.conllu ./input/ud-artificial-gapping
# Prepare skner
mkdir -p input/skner
cd input/skner
python ../../skner2json.py ../../sources/skner/wikiann-sk.bio
cd ../../

wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.sk.300.vec.gz
mv cc.sk.300.vec.gz ./input
spacy-skmodel/v2/assemble.py
ADDED
|
@@ -0,0 +1,39 @@
import sys
import json

# Merge NER and POS/parser model metadata into a single meta.json
base = sys.argv[1]
ner = sys.argv[2]
posparser = sys.argv[3]
outmeta = sys.argv[4]

meta = None
with open(base, "rb") as f:
    meta = json.load(f)
meta["labels"] = {}
meta["accuracy"] = {}

ner_meta = None
with open(ner, "rb") as f:
    ner_meta = json.load(f)
meta["spacy_version"] = ner_meta["spacy_version"]
meta["labels"]["ner"] = ner_meta["labels"]["ner"]
meta["accuracy"]["ents_p"] = ner_meta["accuracy"]["ents_p"]
meta["accuracy"]["ents_r"] = ner_meta["accuracy"]["ents_r"]
meta["accuracy"]["ents_f"] = ner_meta["accuracy"]["ents_f"]
meta["accuracy"]["ents_per_type"] = ner_meta["accuracy"]["ents_per_type"]

posparser_meta = None
with open(posparser, "rb") as f:
    posparser_meta = json.load(f)
meta["vectors"] = posparser_meta["vectors"]
meta["accuracy"]["tags_acc"] = posparser_meta["accuracy"]["tags_acc"]
meta["accuracy"]["uas"] = posparser_meta["accuracy"]["uas"]
meta["accuracy"]["las"] = posparser_meta["accuracy"]["las"]
meta["accuracy"]["las_per_type"] = posparser_meta["accuracy"]["las_per_type"]
meta["labels"]["tagger"] = posparser_meta["labels"]["tagger"]

with open(outmeta, "w") as f:
    json.dump(meta, f, indent=6)
spacy-skmodel/v2/meta-ccv2.json
ADDED
|
@@ -0,0 +1,12 @@
{
    "lang": "sk",
    "name": "sk_core_web_lg",
    "version": "2.3.1",
    "description": "Basic Slovak model with fastText word vectors trained on public data",
    "author": "Daniel Hládek",
    "email": "dhladek@gmail.com",
    "url": "https://nlp.kemt.fei.tuke.sk",
    "license": "CC BY-SA 3.0",
    "pipeline": ["tagger", "parser", "ner"]
}
spacy-skmodel/v2/meta-v2.json
ADDED
|
@@ -0,0 +1,12 @@
{
    "lang": "sk",
    "name": "sk_core_web_md",
    "version": "2.3.1",
    "description": "Basic Slovak model without word vectors trained on public data",
    "author": "Daniel Hládek",
    "email": "dhladek@gmail.com",
    "url": "https://nlp.kemt.fei.tuke.sk",
    "license": "CC BY-SA 3.0",
    "pipeline": ["tagger", "parser", "ner"]
}
spacy-skmodel/v2/train-v2.sh
ADDED
|
@@ -0,0 +1,21 @@
FLAGS="--n-iter 10"
OUTDIR=outv2
rm -r $OUTDIR
mkdir -p $OUTDIR
# Train dependency and POS
spacy train sk $OUTDIR/posparser input/slovak-treebank input/ud-artificial-gapping -p tagger,parser $FLAGS
# Train NER
spacy train sk $OUTDIR/ner input/skner/train.json input/skner/test.json -p ner -R $FLAGS

## Assemble model
mkdir -p $OUTDIR/nerposparser
cp -r $OUTDIR/posparser/model-final/* $OUTDIR/nerposparser
cp -r $OUTDIR/ner/model-final/ner $OUTDIR/nerposparser
python ./assemble.py v2/meta-v2.json $OUTDIR/ner/model-final/meta.json $OUTDIR/posparser/model-final/meta.json $OUTDIR/nerposparser/meta.json

# Make python package
mkdir -p $OUTDIR/dist
spacy package $OUTDIR/nerposparser $OUTDIR/dist
DNAME=`ls $OUTDIR/dist`
cd $OUTDIR/dist/$DNAME
python ./setup.py sdist --dist-dir ../
spacy-skmodel/v2/train-v2cc.sh
ADDED
|
@@ -0,0 +1,23 @@
FLAGS="-g 0 --n-iter 10"
OUTDIR=outccv2
rm -r $OUTDIR
mkdir -p $OUTDIR
spacy init-model sk $OUTDIR/basic -v ./input/cc.sk.300.vec.gz -V 600000

# Train dependency and POS
spacy train sk $OUTDIR/posparser input/slovak-treebank input/ud-artificial-gapping -p tagger,parser -b $OUTDIR/basic $FLAGS

spacy train sk $OUTDIR/ner input/skner/train.json input/skner/test.json -p ner -R -b $OUTDIR/basic $FLAGS

## Assemble model
mkdir -p $OUTDIR/nerposparser
cp -r $OUTDIR/posparser/model-final/* $OUTDIR/nerposparser
cp -r $OUTDIR/ner/model-final/ner $OUTDIR/nerposparser
python ./assemble.py v2/meta-ccv2.json $OUTDIR/ner/model-final/meta.json $OUTDIR/posparser/model-final/meta.json $OUTDIR/nerposparser/meta.json

# Make python package
mkdir -p $OUTDIR/dist
spacy package $OUTDIR/nerposparser $OUTDIR/dist
DNAME=`ls $OUTDIR/dist`
cd $OUTDIR/dist/$DNAME
python ./setup.py sdist --dist-dir ../