
# Retraining Stanza to optimize depparse on a diachronic Swedish corpus

This repository contains code forked from the official Stanza GitHub repository, together with scripts that prepare for and train models on different combinations of treebanks relevant to historical Swedish.

## Guide

The dev/test split for all models is 10/90 of our human-validated gold sentences (https://github.com/alanev52/Diachronic_Treebanks_DigPhil/tree/main/parsed_data/validated).
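A split like this could be produced with a short script. This is only a sketch of the idea: the 10/90 proportions come from the description above, while the function name, seed, and placeholder sentence IDs are hypothetical.

```python
import random

def split_gold(sentences, dev_fraction=0.10, seed=42):
    """Shuffle gold sentences and split them 10/90 into dev and test."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

# Example with placeholder sentence IDs
dev, test = split_gold([f"sent_{i}" for i in range(100)])
print(len(dev), len(test))  # 10 90
```

Fixing the random seed keeps the split reproducible across training runs, so all models are evaluated on the same held-out sentences.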

Example workflow: training a model with pretrained vectors from kubhist2 (1880, incremental) and training data from the Swedish and Bokmål treebanks, as well as our own machine-parsed diachronic corpus:

```
python prepare-train-val-test.py sv diachron bm

source scripts/config_alvis.sh

python -m stanza.utils.datasets.prepare_depparse_treebank UD_Swedish-diachronic --wordvec_pretrain_file /cephyr/users/cleland/Alvis/stanza_resources/sv/pretrain/diachronic.pt

python -m stanza.utils.training.run_depparse UD_Swedish-diachronic --wordvec_pretrain_file /cephyr/users/cleland/Alvis/stanza_resources/sv/pretrain/diachronic.pt --batch_size 32 --dropout 0.33
```

All of the above can be done with a single command:

```
./make_new_model.sh {vectors} {language codes}
```

which for the example becomes:

```
./make_new_model.sh diachronic.pt sv diachron bm
```

## Pretrained vectors

We tried the incremental vectors up until 1880 from Hengchen & Tahmasebi (2021).

I first converted the kubhist2 vectors from gensim's FastText .ft format to an ordinary text file with the gensim Python package, then used Stanza's .pt converter:

```python
from stanza.models.common.pretrain import Pretrain

# Read the text-format vectors and write them out as a Stanza .pt pretrain file
pt = Pretrain("foo.pt", "new_vectors.txt")
pt.load()
```
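For reference, the initial .ft-to-text conversion could be sketched roughly as follows. This is an assumption about the exact gensim calls used, and the file names are placeholders:

```python
def convert_ft_to_text(ft_path, txt_path):
    """Load a gensim FastText model and dump its vectors as a
    word2vec-style text file that Stanza's converter can read."""
    # gensim is imported lazily so the function can be defined
    # even where gensim is not installed
    from gensim.models.fasttext import FastText

    model = FastText.load(ft_path)
    # binary=False gives the plain-text format: one vector per line
    model.wv.save_word2vec_format(txt_path, binary=False)

# Hypothetical usage:
# convert_ft_to_text("kubhist2_1880.ft", "new_vectors.txt")
```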

The result is included, compressed, as diachronic.pt.xz. In our tests, however, the default conllu17 vectors performed better even on our diachronic corpus.

## Results

| Training data | UAS | LAS | CLAS | MLAS | BLEX |
| --- | --- | --- | --- | --- | --- |
| DigPhil, UD-sv, bm, dk | 65.45 | 55.84 | 50.19 | 46.49 | 50.19 |

## References

Hengchen, Simon & Tahmasebi, Nina. (2021).
A collection of Swedish diachronic word embedding models trained on historical newspaper data.
Journal of Open Humanities Data, 7(2), 1–7.
https://doi.org/10.5334/johd.22