Added files from Clarin: http://hdl.handle.net/20.500.12537/302

Instead of adding the dependencies (ELECTRA model, Tokenizer) directly,
reference them as git submodules.

- The Tokenizer submodule points to https://github.com/icelandic-lt/Tokenizer
- The ELECTRA submodule points to https://huggingface.co/Icelandic-lt/electra-base-igc-is

Signed-off-by: Daniel Schnell <dschnell@grammatek.com>

Files changed (11)

.gitignore +1 -0
.gitmodules +6 -0
.python-version +1 -0
README.md +53 -3
Tokenizer +1 -0
diaparser-is-combined-v211/diaparser.model +3 -0
diaparser-is-combined-v211/diaparser.train.log +0 -0
parse_file.py +22 -0
requirements.txt +2 -0
test_file.txt +3 -0
transformer_models/electra-base-igc-is +1 -0

.gitignore ADDED

	@@ -0,0 +1 @@

+venv*/

.gitmodules ADDED

	@@ -0,0 +1,6 @@

+[submodule "Tokenizer"]
+	path = Tokenizer
+	url = https://github.com/icelandic-lt/Tokenizer
+[submodule "transformer_models/electra-base-igc-is"]
+	path = transformer_models/electra-base-igc-is
+	url = git@hf.co:Icelandic-lt/electra-base-igc-is
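
To check which repositories the two submodules point at without running git, the entries above can be read with Python's standard `configparser`, since `.gitmodules` uses an INI-like syntax. This is a minimal sketch, assuming the file content shown in this diff; the helper name `list_submodules` is hypothetical and not part of this commit. Note that the ELECTRA entry uses an SSH-style URL (`git@hf.co:...`), so fetching it presumably requires an SSH key registered with Hugging Face, whereas the commit message gives the HTTPS address.

```python
import configparser

# Content of .gitmodules as added in this commit (indentation is irrelevant here).
GITMODULES = """\
[submodule "Tokenizer"]
    path = Tokenizer
    url = https://github.com/icelandic-lt/Tokenizer
[submodule "transformer_models/electra-base-igc-is"]
    path = transformer_models/electra-base-igc-is
    url = git@hf.co:Icelandic-lt/electra-base-igc-is
"""


def list_submodules(text):
    """Return {path: url} for every [submodule "..."] section (hypothetical helper)."""
    # Git indents keys; strip each line so configparser never mistakes
    # an indented key for a continuation of the previous value.
    cleaned = "\n".join(line.strip() for line in text.splitlines())
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(cleaned)
    return {
        config[section]["path"]: config[section]["url"]
        for section in config.sections()
        if section.startswith("submodule ")
    }


if __name__ == "__main__":
    for path, url in list_submodules(GITMODULES).items():
        # Classify the transport so SSH-only entries stand out.
        transport = "ssh" if url.startswith("git@") else "https"
        print(f"{path}: {url} ({transport})")
```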

.python-version ADDED

	@@ -0,0 +1 @@

+3.8

README.md CHANGED

@@ -1,3 +1,53 @@
----
-license: apache-2.0
----

+## UD-þáttari sem nýtir sér upplýsingar úr Transformer-mállíkani
+Mælt er með því að þáttarinn sé keyrður með Python3.8 í sýndarumhverfi.
+Hægt er að nota conda fyrir sýndarumhverfi: https://conda.io/projects/conda/en/latest/user-guide/getting-started.html
+Einnig er hægt að nota venv fyrir sýndarumhverfi: https://docs.python.org/3/library/venv.html
+Til þess að keyra þáttarann þarf að setja upp nauðsynlega pakka, eftir að sýndarumhverfi hefur verið virkjað: python3 -m pip install -r requirements.txt
+Tokenizer-mappan er klónuð gagnahirsla [tókarans frá Miðeind](https://github.com/mideind/Tokenizer).
+Hægt er að keyra þáttarann svona: ~~~python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt~~~
+~~~transformer_models/~~~ inniheldur forþjálfað transformer-líkan, electra-base-igc-is, sem tókarinn sækir samhengisháðar orðgreypingar og athygli í. Það var þjálfað af Jóni Friðriki Daðasyni.
+Skor:
+Metric     | Precision |    Recall |  F1 Score | AligndAcc
+-----------+-----------+-----------+-----------+-----------
+Tokens     |     99.70 |     99.77 |     99.73 |
+Sentences  |    100.00 |    100.00 |    100.00 |
+Words      |     99.62 |     99.61 |     99.61 |
+UAS        |     89.58 |     89.57 |     89.58 |     89.92
+LAS        |     86.46 |     86.45 |     86.46 |     86.79
+CLAS       |     82.30 |     81.81 |     82.05 |     82.24
+## A Universal Dependency parser built on top of a Transformer language model
+Python 3.8 and a virtual environment are recommended.
+You can use conda for a virtual environment: https://conda.io/projects/conda/en/latest/user-guide/getting-started.html
+You can also use venv for a virtual environment: https://docs.python.org/3/library/venv.html
+To run this package, after having activated your virtual environment, you need to install the requirements: python3 -m pip install -r requirements.txt.
+The Tokenizer directory is a clone of [Miðeind's tokenizer](https://github.com/mideind/Tokenizer). It is included because one of Diaparser's modules is named tokenizer.
+The parser can be run as follows: ~~~python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt~~~
+~~~transformer_models/~~~ contains a pretrained Transformer model, electra-base-igc-is, trained by Jón Friðrik Daðason, which supplies the parser with contextual embeddings and attention.
+The parser scores as follows:
+Metric     | Precision |    Recall |  F1 Score | AligndAcc
+-----------+-----------+-----------+-----------+-----------
+Tokens     |     99.70 |     99.77 |     99.73 |
+Sentences  |    100.00 |    100.00 |    100.00 |
+Words      |     99.62 |     99.61 |     99.61 |
+UAS        |     89.58 |     89.57 |     89.58 |     89.92
+LAS        |     86.46 |     86.45 |     86.46 |     86.79
+CLAS       |     82.30 |     81.81 |     82.05 |     82.24
+### License
+https://opensource.org/licenses/Apache-2.0

Tokenizer ADDED

	@@ -0,0 +1 @@

+Subproject commit be8ee4de465ecf0dbf008d986b99df43210f27bf

diaparser-is-combined-v211/diaparser.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:56c56fcf918fa29e7cd1034546ee64205aba16a0492c8364ca724d3bf895c4e0
+size 497675177

diaparser-is-combined-v211/diaparser.train.log ADDED

The diff for this file is too large to render.

parse_file.py ADDED

	@@ -0,0 +1,22 @@

+from argparse import ArgumentParser
+from diaparser.parsers import Parser
+from Tokenizer.src.tokenizer import split_into_sentences
+parser = ArgumentParser()
+parser.add_argument('--parser')
+parser.add_argument('--infile')
+args = parser.parse_args()
+PARSER = Parser.load(args.parser)
+def read_test_file(file):
+    with open(file, 'r', encoding='utf-8') as infile:
+        for line in infile:
+            yield [tok for tok in ' '.join(split_into_sentences(line)).split()]
+test_file = list(read_test_file(args.infile))
+dataset = PARSER.predict(test_file, prob=True)
+for i in dataset.sentences:
+    print(i)
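
The script above hands `PARSER.predict` a list of token lists, one per input line: each line is sentence-split, the sentences are joined back together, and the result is split on whitespace. The following is a minimal sketch of just that preprocessing step, with the sentence splitter passed in as a parameter so that it runs without the Tokenizer package installed. The names `tokens_per_line` and `whitespace_splitter` are hypothetical, and the stand-in splitter does not separate punctuation from words the way the real `split_into_sentences` is expected to. Skipping lines that yield no tokens is also an assumption, not behaviour of the committed script.

```python
from typing import Callable, Iterable, Iterator, List


def tokens_per_line(
    lines: Iterable[str],
    split_into_sentences: Callable[[str], Iterable[str]],
) -> Iterator[List[str]]:
    """Yield one list of tokens per non-empty input line."""
    for line in lines:
        # Same expression as in parse_file.py: all sentences found in a
        # line are merged into a single token list.
        tokens = " ".join(split_into_sentences(line)).split()
        if tokens:  # assumption: empty lines are not passed to the parser
            yield tokens


def whitespace_splitter(text: str) -> List[str]:
    """Stand-in splitter: treats the whole line as one pre-tokenized sentence."""
    return [text.strip()]


if __name__ == "__main__":
    sample = ["Komið þið sæl .", "", "Njótið dagsins ."]
    for tokens in tokens_per_line(sample, whitespace_splitter):
        print(tokens)
```

With the real package, `split_into_sentences` imported from `Tokenizer.src.tokenizer` would be passed in place of the stand-in.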

requirements.txt ADDED

	@@ -0,0 +1,2 @@

+diaparser==1.1.2
+Tokenizer==3.4.2

test_file.txt ADDED

	@@ -0,0 +1,3 @@

+Komið þið sæl.
+Þetta skjal er ætlað til að sýna hvernig þáttarinn virkar.
+Njótið dagsins.

transformer_models/electra-base-igc-is ADDED

	@@ -0,0 +1 @@

+Subproject commit e2921de06b441e2a3066da485d6fa31cf5c816a8