danielschnell committed
Commit 695687f · 1 Parent(s): 79368a9

Added files from Clarin: http://hdl.handle.net/20.500.12537/302

Instead of directly adding the dependencies (electra model, Tokenizer),
use git submodules to reference them.

- Tokenizer submodule is using https://github.com/icelandic-lt/Tokenizer
- Electra submodule is using https://huggingface.co/Icelandic-lt/electra-base-igc-is


Signed-off-by: Daniel Schnell <dschnell@grammatek.com>

.gitignore ADDED
@@ -0,0 +1 @@
+ venv*/
.gitmodules ADDED
@@ -0,0 +1,6 @@
+ [submodule "Tokenizer"]
+ path = Tokenizer
+ url = https://github.com/icelandic-lt/Tokenizer
+ [submodule "transformer_models/electra-base-igc-is"]
+ path = transformer_models/electra-base-igc-is
+ url = git@hf.co:Icelandic-lt/electra-base-igc-is
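The entries above only record where each dependency lives. A fresh clone still has empty `Tokenizer/` and `transformer_models/electra-base-igc-is/` directories until the submodules are initialised, for example with `git submodule update --init`. Note also that the Electra entry uses the SSH form `git@hf.co:...`, which presumably requires an SSH key registered with Hugging Face, while the commit message cites the HTTPS URL. The following is a minimal sketch, not part of this commit, that reads the `.gitmodules` text with Python's standard `configparser` and reports which submodule paths are still missing or empty. The function names `read_submodules` and `missing_submodules` are hypothetical.

```python
import configparser
from pathlib import Path
from typing import Dict, List

# The .gitmodules content as it appears in this commit.
GITMODULES = """\
[submodule "Tokenizer"]
path = Tokenizer
url = https://github.com/icelandic-lt/Tokenizer
[submodule "transformer_models/electra-base-igc-is"]
path = transformer_models/electra-base-igc-is
url = git@hf.co:Icelandic-lt/electra-base-igc-is
"""


def read_submodules(gitmodules_text: str) -> Dict[str, Dict[str, str]]:
    """Map each submodule name to its settings (path, url)."""
    # Only "=" is a delimiter here, so the ":" inside URLs is left alone.
    config = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    config.read_string(gitmodules_text)
    submodules = {}
    for section in config.sections():
        # Section headers look like: submodule "Tokenizer"
        if section.startswith("submodule "):
            name = section[len("submodule "):].strip('"')
            submodules[name] = dict(config[section])
    return submodules


def missing_submodules(repo_root: Path, gitmodules_text: str) -> List[str]:
    """Return submodule paths that are absent or empty under repo_root."""
    missing = []
    for settings in read_submodules(gitmodules_text).values():
        target = repo_root / settings["path"]
        if not target.is_dir() or not any(target.iterdir()):
            missing.append(settings["path"])
    return missing
```

Calling `missing_submodules(Path("."), Path(".gitmodules").read_text())` from the repository root would list the paths that still need to be fetched before `parse_file.py` can import the tokenizer or load the model.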
.python-version ADDED
@@ -0,0 +1 @@
+ 3.8
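The `.python-version` file pins the interpreter to 3.8, matching the README's recommendation; version managers such as pyenv read this file. As a small sketch under that assumption, a script could compare the pinned value with the running interpreter before loading the parser. The helper name `matches_pinned_version` is hypothetical and not part of this commit.

```python
import sys
from typing import Sequence


def matches_pinned_version(pinned: str, current: Sequence[int] = sys.version_info) -> bool:
    """True if `current` starts with the version components listed in `pinned`."""
    # "3.8\n" -> [3, 8]; only the components that are pinned get compared.
    wanted = [int(part) for part in pinned.strip().split(".")]
    return list(current[: len(wanted)]) == wanted
```

For example, `matches_pinned_version("3.8", (3, 8, 10))` is true, while the same pin checked against `(3, 9, 0)` is false.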
README.md CHANGED
@@ -1,3 +1,53 @@
- ---
- license: apache-2.0
- ---
+ ## UD-þáttari sem nýtir sér upplýsingar úr Transformer-mállíkani
+
+ Mælt er með því að þáttarinn sé keyrður með Python3.8 í sýndarumhverfi.
+ Hægt er að nota conda fyrir sýndarumhverfi: https://conda.io/projects/conda/en/latest/user-guide/getting-started.html
+ Einnig er hægt að nota venv fyrir sýndarumhverfi: https://docs.python.org/3/library/venv.html
+
+ Til þess að keyra þáttarann þarf að setja upp nauðsynlega pakka, eftir að sýndarumhverfi hefur verið virkjað: python3 -m pip install -r requirements.txt
+ Tokenizer-mappan er klónuð gagnahirsla [tókarans frá Miðeind](https://github.com/mideind/Tokenizer).
+
+ Hægt er að keyra þáttarann svona: ~~~python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt~~~
+ ~~~transformer_models/~~~ inniheldur forþjálfað transformer-líkan, electra-base-igc-is, sem tókarinn sækir samhengisháðar orðgreypingar og athygli í. Það var þjálfað af Jóni Friðriki Daðasyni.
+
+ Skor:
+
+ Metric     | Precision |    Recall |  F1 Score | AligndAcc
+ -----------+-----------+-----------+-----------+-----------
+ Tokens     |     99.70 |     99.77 |     99.73 |
+ Sentences  |    100.00 |    100.00 |    100.00 |
+ Words      |     99.62 |     99.61 |     99.61 |
+ UAS        |     89.58 |     89.57 |     89.58 |     89.92
+ LAS        |     86.46 |     86.45 |     86.46 |     86.79
+ CLAS       |     82.30 |     81.81 |     82.05 |     82.24
+
+
+
+ ## A Universal Dependency parser built on top of a Transformer language model
+
+ Python3.8 recommended, as well as a virtual environment.
+
+ You can use conda for a virtual environment: https://conda.io/projects/conda/en/latest/user-guide/getting-started.html
+ You can also use venv for a virtual environment: https://docs.python.org/3/library/venv.html
+
+ To run this package, after having activated your virtual environment, you need to install the requirements: python3 -m pip install -r requirements.txt.
+
+ The Tokenizer directory is a clone of [Miðeind's tokenizer](https://github.com/mideind/Tokenizer). It is included because one of Diaparser's modules is named tokenizer.
+
+ The parser can be run as follows: ~~~python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt~~~
+
+ ~~~transformer_models/~~~ contains a pretrained model, electra-base-igc-is, which supplies the parser with contextual embeddings and attention, trained by Jón Friðrik Daðason.
+
+ The parser scores as follows:
+
+ Metric     | Precision |    Recall |  F1 Score | AligndAcc
+ -----------+-----------+-----------+-----------+-----------
+ Tokens     |     99.70 |     99.77 |     99.73 |
+ Sentences  |    100.00 |    100.00 |    100.00 |
+ Words      |     99.62 |     99.61 |     99.61 |
+ UAS        |     89.58 |     89.57 |     89.58 |     89.92
+ LAS        |     86.46 |     86.45 |     86.46 |     86.79
+ CLAS       |     82.30 |     81.81 |     82.05 |     82.24
+
+ ### License
+ https://opensource.org/licenses/Apache-2.0
Tokenizer ADDED
@@ -0,0 +1 @@
+ Subproject commit be8ee4de465ecf0dbf008d986b99df43210f27bf
diaparser-is-combined-v211/diaparser.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:56c56fcf918fa29e7cd1034546ee64205aba16a0492c8364ca724d3bf895c4e0
+ size 497675177
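The three lines above are a Git LFS pointer, not the model itself: the roughly 498 MB `diaparser.model` is stored out of band and identified by its SHA-256 digest and byte size. If Git LFS is not set up when cloning, the checkout will likely contain only this small pointer, and loading it as a model would presumably fail. The sketch below, which is not part of this commit, parses the pointer text and checks a local file against it. The names `parse_lfs_pointer` and `matches_pointer` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Pointer content as recorded in this commit.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:56c56fcf918fa29e7cd1034546ee64205aba16a0492c8364ca724d3bf895c4e0
size 497675177
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    return dict(line.split(" ", 1) for line in text.splitlines() if line.strip())


def matches_pointer(path: Path, pointer_text: str) -> bool:
    """Check that the file at `path` has the size and digest named by the pointer."""
    fields = parse_lfs_pointer(pointer_text)
    algorithm, _, expected_digest = fields["oid"].partition(":")
    # Cheap check first: a pointer file left in place will fail on size alone.
    if path.stat().st_size != int(fields["size"]):
        return False
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # Hash in 1 MiB chunks so the whole model never sits in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_digest
```

Running `matches_pointer(Path("diaparser-is-combined-v211/diaparser.model"), POINTER)` after cloning would be one way to confirm that the real model file, rather than the pointer, is on disk.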
diaparser-is-combined-v211/diaparser.train.log ADDED
The diff for this file is too large to render. See raw diff
 
parse_file.py ADDED
@@ -0,0 +1,22 @@
+ from argparse import ArgumentParser
+ from diaparser.parsers import Parser
+ from Tokenizer.src.tokenizer import split_into_sentences
+
+ parser = ArgumentParser()
+ parser.add_argument('--parser')
+ parser.add_argument('--infile')
+ args = parser.parse_args()
+ PARSER = Parser.load(args.parser)
+
+
+ def read_test_file(file):
+     with open(file, 'r', encoding='utf-8') as infile:
+         for line in infile:
+             yield [tok for tok in ' '.join(split_into_sentences(line)).split()]
+
+ test_file = list(read_test_file(args.infile))
+
+
+ dataset = PARSER.predict(test_file, prob=True)
+ for i in dataset.sentences:
+     print(i)
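In `read_test_file` above, all sentences found on one input line are joined into a single token list, and a blank line yields an empty list; whether `PARSER.predict` tolerates an empty list is not confirmed here. The sketch below shows one possible variant that emits one token list per sentence and skips empty results. It is an illustration only, not the author's method: `naive_split_into_sentences` is a hypothetical regex stand-in for Tokenizer's `split_into_sentences`, assumed only to share its output shape (one string of space-separated tokens per sentence), and `lines_to_token_lists` is a hypothetical helper name.

```python
import re
from typing import Callable, Iterable, Iterator, List


def naive_split_into_sentences(text: str) -> Iterator[str]:
    """Hypothetical stand-in for Tokenizer's split_into_sentences.

    Yields each sentence as one string of space-separated tokens.
    """
    for sentence in re.findall(r"[^.!?]+[.!?]?", text):
        # Words and single punctuation marks become separate tokens.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        if tokens:
            yield " ".join(tokens)


def lines_to_token_lists(
    lines: Iterable[str],
    splitter: Callable[[str], Iterable[str]] = naive_split_into_sentences,
) -> Iterator[List[str]]:
    """Yield one token list per sentence, skipping blank lines."""
    for line in lines:
        for sentence in splitter(line):
            tokens = sentence.split()
            if tokens:
                yield tokens
```

In the real script, the `splitter` argument would be the imported `split_into_sentences`, and the resulting lists could be collected with `list(...)` before being passed to `PARSER.predict`.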
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ diaparser==1.1.2
+ Tokenizer==3.4.2
test_file.txt ADDED
@@ -0,0 +1,3 @@
+ Komið þið sæl.
+ Þetta skjal er ætlað til að sýna hvernig þáttarinn virkar.
+ Njótið dagsins.
transformer_models/electra-base-igc-is ADDED
@@ -0,0 +1 @@
+ Subproject commit e2921de06b441e2a3066da485d6fa31cf5c816a8