# Retraining Stanza to optimize depparse on a diachronic Swedish corpus
This repository contains code forked from the official Stanza GitHub repository, along with scripts that prepare for and train models on different combinations of treebanks relevant to historical Swedish.
## Guide
The dev/test split for all models is 10/90 of our human-validated gold sentences (https://github.com/alanev52/Diachronic_Treebanks_DigPhil/tree/main/parsed_data/validated).
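The 10/90 split can be sketched roughly as follows; this is a minimal illustration with toy sentence IDs, not the actual logic of `prepare-train-val-test.py`, and the function name and seed are assumptions:

```python
# Minimal sketch of a deterministic 10/90 dev/test split over gold sentences.
# The real repository script (prepare-train-val-test.py) may differ in detail.
import random

def dev_test_split(sentences, dev_fraction=0.10, seed=42):
    """Shuffle reproducibly, then carve off dev_fraction for dev; the rest is test."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

dev, test = dev_test_split([f"sent_{i}" for i in range(100)])
```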
Example workflow: training a model with pretrained vectors from the kubhist2 1880 incremental embeddings, using training data from the Swedish and Bokmål treebanks as well as our own machine-parsed diachronic corpus:
```
python prepare-train-val-test.py sv diachron bm
source scripts/config_alvis.sh
python -m stanza.utils.datasets.prepare_depparse_treebank UD_Swedish-diachronic --wordvec_pretrain_file /cephyr/users/cleland/Alvis/stanza_resources/sv/pretrain/diachronic.pt
python -m stanza.utils.training.run_depparse UD_Swedish-diachronic --wordvec_pretrain_file /cephyr/users/cleland/Alvis/stanza_resources/sv/pretrain/diachronic.pt --batch_size 32 --dropout 0.33
```
All of the above can be done with a single command:
```
./make_new_model.sh {vectors} {language codes}
```
which for the example above becomes:
```
./make_new_model.sh diachronic.pt sv diachron bm
```
## Pretrained vectors
We tried the incremental vectors up to 1880 from Hengchen & Tahmasebi (2021).
I first converted the kubhist2 vectors from gensim's fastText `.ft` format to an ordinary text file with the gensim Python package, then used Stanza's `.pt` converter:
```
from stanza.models.common.pretrain import Pretrain

# Reads new_vectors.txt and writes the binary pretrain file foo.pt on load()
pt = Pretrain("foo.pt", "new_vectors.txt")
pt.load()
```
The result is included, compressed, as `diachronic.pt.xz`. In our tests, the default CoNLL-17 vectors work better even for our diachronic corpus.
## Results
Training data: DigPhil, UD-sv, bm, dk

| UAS | LAS | CLAS | MLAS | BLEX |
|-------|-------|-------|-------|-------|
| 65.45 | 55.84 | 50.19 | 46.49 | 50.19 |
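As a reminder of what the first two columns mean: UAS is the fraction of tokens whose predicted head is correct, and LAS additionally requires the dependency label to match. A minimal sketch with toy data (not the official CoNLL evaluation script):

```python
# Sketch: compute UAS and LAS from aligned gold and predicted (head, deprel) pairs.
def attachment_scores(gold, pred):
    """gold/pred: lists of (head_index, deprel) tuples, one per token."""
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
uas, las = attachment_scores(gold, pred)  # UAS 0.75, LAS 0.5
```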
## References
**Hengchen, Simon & Tahmasebi, Nina. (2021).**
*A collection of Swedish diachronic word embedding models trained on historical newspaper data.*
**Journal of Open Humanities Data**, 7(2), 1–7.
https://doi.org/10.5334/johd.22