# Retraining Stanza to optimize depparse on a diachronic Swedish corpus
This repository contains code forked from the official Stanza GitHub repository, along with scripts that help prepare for and train models on different combinations of treebanks relevant to historical Swedish.
## Guide
Dev/test for all models is a 10/90 split of our human-validated gold sentences (https://github.com/alanev52/Diachronic_Treebanks_DigPhil/tree/main/parsed_data/validated).
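Such a 10/90 dev/test split can be sketched as below; note this is an illustrative helper, not the repository's actual splitting script, and `split_gold` is a hypothetical name:

```python
import random

def split_gold(conllu_text, dev_fraction=0.10, seed=42):
    """Split gold CoNLL-U sentences into dev/test at a 10/90 ratio."""
    # CoNLL-U sentences are separated by blank lines
    sentences = [s for s in conllu_text.strip().split("\n\n") if s.strip()]
    random.Random(seed).shuffle(sentences)
    n_dev = max(1, round(len(sentences) * dev_fraction))
    return sentences[:n_dev], sentences[n_dev:]
```

The shuffle uses a fixed seed so the same dev/test partition is reproduced on every run.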
Example workflow: training a model with pretrained vectors from the kubhist2 1880 incremental embeddings, using training data from the Swedish and Bokmål treebanks as well as our own machine-parsed diachronic corpus:
```
python prepare-train-val-test.py sv diachron bm
source scripts/config_alvis.sh
python -m stanza.utils.datasets.prepare_depparse_treebank UD_Swedish-diachronic --wordvec_pretrain_file /cephyr/users/cleland/Alvis/stanza_resources/sv/pretrain/diachronic.pt
python -m stanza.utils.training.run_depparse UD_Swedish-diachronic --wordvec_pretrain_file /cephyr/users/cleland/Alvis/stanza_resources/sv/pretrain/diachronic.pt --batch_size 32 --dropout 0.33
```
All of the above can be done with a single command:
```
./make_new_model.sh {vectors} {language codes}
```
which for the example becomes:
```
./make_new_model.sh diachronic.pt sv diachron bm
```
## Pretrained vectors
We tried the incremental vectors up to 1880 from Hengchen & Tahmasebi (2021).
I first converted the kubhist2 vectors from gensim's fastText `.ft` format to a plain text file with gensim's Python package, then used Stanza's `.pt` converter:
```
from stanza.models.common.pretrain import Pretrain
# Reads new_vectors.txt and caches the embeddings as foo.pt
pt = Pretrain("foo.pt", "new_vectors.txt")
pt.load()
```
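For reference, the intermediate text file uses the standard word2vec text format (a header line with vocabulary size and dimension, then one word followed by its vector per line), which is what gensim's `save_word2vec_format` writes. A toy sketch of that format, with made-up words and values:

```python
# Write a toy embeddings file in word2vec text format:
# first line "vocab_size dim", then "word v1 v2 ... vdim" per line.
vectors = {
    "hus": [0.1, 0.2, 0.3],
    "gata": [0.4, 0.5, 0.6],
}
dim = 3
with open("new_vectors.txt", "w", encoding="utf-8") as f:
    f.write(f"{len(vectors)} {dim}\n")
    for word, vec in vectors.items():
        f.write(word + " " + " ".join(f"{v:.6f}" for v in vec) + "\n")
```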
The result is included in compressed form as `diachronic.pt.xz`. In our tests, however, the default conllu17 vectors performed better, even on our diachronic corpus.
## Results
| Training data | UAS | LAS | CLAS | MLAS | BLEX |
|---|---|---|---|---|---|
| DigPhil, UD-sv, bm, dk | 65.45 | 55.84 | 50.19 | 46.49 | 50.19 |
## References
**Hengchen, Simon & Tahmasebi, Nina. (2021).**
*A collection of Swedish diachronic word embedding models trained on historical newspaper data.*
**Journal of Open Humanities Data**, 7(2), 1–7.
https://doi.org/10.5334/johd.22