# Retraining Stanza to optimize depparse on a diachronic Swedish corpus
This repository contains code forked from the official Stanza GitHub repository, along with scripts that help prepare for and train models on different combinations of treebanks relevant to historical Swedish.
## Guide
Dev/test for all models is a 10/90 split of our human-validated gold sentences (https://github.com/alanev52/Diachronic_Treebanks_DigPhil/tree/main/parsed_data/validated).
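Such a 10/90 dev/test split can be sketched as below; note this is an illustrative helper, not the repository's actual splitting script, and `split_gold` is a hypothetical name:

```python
import random

def split_gold(conllu_text, dev_fraction=0.10, seed=42):
    """Split gold CoNLL-U sentences into dev/test at a 10/90 ratio."""
    # CoNLL-U sentences are separated by blank lines
    sentences = [s for s in conllu_text.strip().split("\n\n") if s.strip()]
    random.Random(seed).shuffle(sentences)
    n_dev = max(1, round(len(sentences) * dev_fraction))
    return sentences[:n_dev], sentences[n_dev:]
```

The shuffle uses a fixed seed so the same dev/test partition is reproduced on every run.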
Example workflow: training a model with pretrained vectors from the kubhist2 1880 incremental embeddings, using training data from the Swedish and Bokmål treebanks as well as our own machine-parsed diachronic corpus:
```
python prepare-train-val-test.py sv diachron bm
source scripts/config_alvis.sh
python -m stanza.utils.datasets.prepare_depparse_treebank UD_Swedish-diachronic --wordvec_pretrain_file /cephyr/users/cleland/Alvis/stanza_resources/sv/pretrain/diachronic.pt
python -m stanza.utils.training.run_depparse UD_Swedish-diachronic --wordvec_pretrain_file /cephyr/users/cleland/Alvis/stanza_resources/sv/pretrain/diachronic.pt --batch_size 32 --dropout 0.33
```
All of the above can be done with a single command:
```
./make_new_model.sh {vectors} {language codes}
```
which for the example becomes:
```
./make_new_model.sh diachronic.pt sv diachron bm
```
## Pretrained vectors
We tried the incremental vectors up to 1880 from Hengchen & Tahmasebi (2021).
I first converted the kubhist2 vectors from gensim's fastText `.ft` format to a plain text file with gensim's Python package, then used Stanza's `.pt` converter:
```
from stanza.models.common.pretrain import Pretrain
# Reads new_vectors.txt and caches the embeddings as foo.pt
pt = Pretrain("foo.pt", "new_vectors.txt")
pt.load()
```
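For reference, the intermediate text file uses the standard word2vec text format (a header line with vocabulary size and dimension, then one word followed by its vector per line), which is what gensim's `save_word2vec_format` writes. A toy sketch of that format, with made-up words and values:

```python
# Write a toy embeddings file in word2vec text format:
# first line "vocab_size dim", then "word v1 v2 ... vdim" per line.
vectors = {
    "hus": [0.1, 0.2, 0.3],
    "gata": [0.4, 0.5, 0.6],
}
dim = 3
with open("new_vectors.txt", "w", encoding="utf-8") as f:
    f.write(f"{len(vectors)} {dim}\n")
    for word, vec in vectors.items():
        f.write(word + " " + " ".join(f"{v:.6f}" for v in vec) + "\n")
```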
The result is included in compressed form as `diachronic.pt.xz`. In our tests, however, the default conllu17 vectors performed better, even on our diachronic corpus.
## Results
| Training data | UAS | LAS | CLAS | MLAS | BLEX |
|---|---|---|---|---|---|
| DigPhil, UD-sv, bm, dk | 65.45 | 55.84 | 50.19 | 46.49 | 50.19 |
## References
**Hengchen, Simon & Tahmasebi, Nina. (2021).**
*A collection of Swedish diachronic word embedding models trained on historical newspaper data.*
**Journal of Open Humanities Data**, 7(2), 1–7.
https://doi.org/10.5334/johd.22