Commit 695687f
1 Parent(s): 79368a9
Added files from Clarin: http://hdl.handle.net/20.500.12537/302

Instead of directly adding the dependencies (electra model, Tokenizer),
use git submodules to reference them.
- Tokenizer submodule is using https://github.com/icelandic-lt/Tokenizer
- Electra submodule is using https://huggingface.co/Icelandic-lt/electra-base-igc-is
Signed-off-by: Daniel Schnell <dschnell@grammatek.com>
- .gitignore +1 -0
- .gitmodules +6 -0
- .python-version +1 -0
- README.md +53 -3
- Tokenizer +1 -0
- diaparser-is-combined-v211/diaparser.model +3 -0
- diaparser-is-combined-v211/diaparser.train.log +0 -0
- parse_file.py +22 -0
- requirements.txt +2 -0
- test_file.txt +3 -0
- transformer_models/electra-base-igc-is +1 -0
.gitignore
ADDED
@@ -0,0 +1 @@
+venv*/
.gitmodules
ADDED
@@ -0,0 +1,6 @@
+[submodule "Tokenizer"]
+path = Tokenizer
+url = https://github.com/icelandic-lt/Tokenizer
+[submodule "transformer_models/electra-base-igc-is"]
+path = transformer_models/electra-base-igc-is
+url = git@hf.co:Icelandic-lt/electra-base-igc-is
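The `.gitmodules` entries above map each submodule path to the repository it is fetched from. As a minimal sketch, assuming the file uses the usual INI-like git config layout, the mapping can be read with Python's standard `configparser`. The helper name `list_submodules` is hypothetical and not part of this repository. Note that the electra entry uses an SSH-style URL, so fetching it will presumably need an SSH key registered with Hugging Face, whereas the commit message quotes the HTTPS address.

```python
import configparser

# Text of the .gitmodules file as added in this commit.
GITMODULES = """\
[submodule "Tokenizer"]
path = Tokenizer
url = https://github.com/icelandic-lt/Tokenizer
[submodule "transformer_models/electra-base-igc-is"]
path = transformer_models/electra-base-igc-is
url = git@hf.co:Icelandic-lt/electra-base-igc-is
"""


def list_submodules(text):
    """Return a {path: url} mapping for every [submodule "..."] section.

    Hypothetical helper for illustration; git itself does not use it.
    """
    config = configparser.ConfigParser()
    config.read_string(text)
    return {
        config[section]["path"]: config[section]["url"]
        for section in config.sections()
        if section.startswith("submodule ")
    }


if __name__ == "__main__":
    for path, url in list_submodules(GITMODULES).items():
        print(f"{path} -> {url}")
```

A listing like this is a quick way to check, before running the parser, that both directories referenced by the README are expected to be populated by the submodule checkout.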
.python-version
ADDED
@@ -0,0 +1 @@
+3.8
README.md
CHANGED
@@ -1,3 +1,53 @@
-
-
-
+## UD-þáttari sem nýtir sér upplýsingar úr Transformer-mállíkani
+
+Mælt er með því að þáttarinn sé keyrður með Python3.8 í sýndarumhverfi.
+Hægt er að nota conda fyrir sýndarumhverfi: https://conda.io/projects/conda/en/latest/user-guide/getting-started.html
+Einnig er hægt að nota venv fyrir sýndarumhverfi: https://docs.python.org/3/library/venv.html
+
+Til þess að keyra þáttarann þarf að setja upp nauðsynlega pakka, eftir að sýndarumhverfi hefur verið virkjað: python3 -m pip install -r requirements.txt
+Tokenizer-mappan er klónuð gagnahirsla [tókarans frá Miðeind](https://github.com/mideind/Tokenizer).
+
+Hægt er að keyra þáttarann svona: ~~~python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt~~~
+~~~transformer_models/~~~ inniheldur forþjálfað transformer-líkan, electra-base-igc-is, sem tókarinn sækir samhengisháðar orðgreypingar og athygli í. Það var þjálfað af Jóni Friðriki Daðasyni.
+
+Skor:
+
+Metric     | Precision |    Recall |  F1 Score | AligndAcc
+-----------+-----------+-----------+-----------+-----------
+Tokens     |     99.70 |     99.77 |     99.73 |
+Sentences  |    100.00 |    100.00 |    100.00 |
+Words      |     99.62 |     99.61 |     99.61 |
+UAS        |     89.58 |     89.57 |     89.58 |     89.92
+LAS        |     86.46 |     86.45 |     86.46 |     86.79
+CLAS       |     82.30 |     81.81 |     82.05 |     82.24
+
+
+
+## A Universal Dependency parser built on top of a Transformer language model
+
+Python3.8 recommended, as well as a virtual environment.
+
+You can use conda for a virtual environment: https://conda.io/projects/conda/en/latest/user-guide/getting-started.html
+You can also use venv for a virtual environment: https://docs.python.org/3/library/venv.html
+
+To run this package, after having activated your virtual environment, you need to install the requirements: python3 -m pip install -r requirements.txt.
+
+The Tokenizer directory is a clone of [Miðeind's tokenizer](https://github.com/mideind/Tokenizer). It is included because one of Diaparser's modules is named tokenizer.
+
+The parser can be run as follows: ~~~python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt~~~
+
+~~~transformer_models/~~~ contains a pretrained model, electra-base-igc-is, which supplies the parser with contextual embeddings and attention, trained by Jón Friðrik Daðason.
+
+The parser scores as follows:
+
+Metric     | Precision |    Recall |  F1 Score | AligndAcc
+-----------+-----------+-----------+-----------+-----------
+Tokens     |     99.70 |     99.77 |     99.73 |
+Sentences  |    100.00 |    100.00 |    100.00 |
+Words      |     99.62 |     99.61 |     99.61 |
+UAS        |     89.58 |     89.57 |     89.58 |     89.92
+LAS        |     86.46 |     86.45 |     86.46 |     86.79
+CLAS       |     82.30 |     81.81 |     82.05 |     82.24
+
+### License
+https://opensource.org/licenses/Apache-2.0
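The score table in the README reports UAS and LAS with precision, recall and F1. As a rough illustration of what those columns mean, the sketch below computes unlabeled and labeled attachment scores for a single sentence, under the simplifying assumption that gold and predicted tokenization are identical (in which case precision, recall and F1 coincide). The function names and the four-word example are hypothetical and are not taken from the evaluation script that produced the table.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent for two aligned lists of (head, deprel).

    UAS counts words whose predicted head matches the gold head;
    LAS additionally requires the dependency label to match.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one word")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    label_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * label_hits / total


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical four-word sentence: (head index, dependency label) per word.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]

if __name__ == "__main__":
    uas, las = attachment_scores(gold, pred)
    print(f"UAS={uas:.2f} LAS={las:.2f}")
    # The CLAS row of the table lists precision 82.30 and recall 81.81.
    print(f"F1={f1_score(82.30, 81.81):.2f}")
```

When the predicted tokenization differs from the gold one, precision and recall are computed against different word counts, which is why the table lists them separately.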
Tokenizer
ADDED
@@ -0,0 +1 @@
+Subproject commit be8ee4de465ecf0dbf008d986b99df43210f27bf
diaparser-is-combined-v211/diaparser.model
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:56c56fcf918fa29e7cd1034546ee64205aba16a0492c8364ca724d3bf895c4e0
+size 497675177
diaparser-is-combined-v211/diaparser.train.log
ADDED
The diff for this file is too large to render.
parse_file.py
ADDED
@@ -0,0 +1,22 @@
+from argparse import ArgumentParser
+from diaparser.parsers import Parser
+from Tokenizer.src.tokenizer import split_into_sentences
+
+parser = ArgumentParser()
+parser.add_argument('--parser')
+parser.add_argument('--infile')
+args = parser.parse_args()
+PARSER = Parser.load(args.parser)
+
+
+def read_test_file(file):
+    with open(file, 'r', encoding='utf-8') as infile:
+        for line in infile:
+            yield [tok for tok in ' '.join(split_into_sentences(line)).split()]
+
+test_file = list(read_test_file(args.infile))
+
+
+dataset = PARSER.predict(test_file, prob=True)
+for i in dataset.sentences:
+    print(i)
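The script above prints each predicted sentence to standard output. As a minimal sketch for post-processing that output, assuming the printed sentences follow the 10-column, tab-separated CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), the helper below extracts the word form, head index and dependency label from each word line. The function name `read_conllu` and the sample analysis are hypothetical; the sample is illustrative and is not actual output of the model.

```python
def read_conllu(block):
    """Yield (id, form, head, deprel) for each plain word line in a sentence.

    Comment lines, blank lines, multiword-token ranges (e.g. "1-2") and
    empty nodes (e.g. "1.1") are skipped.
    """
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            continue
        yield int(cols[0]), cols[1], int(cols[6]), cols[7]


# Hypothetical analysis of the last line of test_file.txt, built column by
# column so that the tab separators are explicit.
SAMPLE = "\n".join(
    "\t".join(row)
    for row in [
        ["1", "Njótið", "_", "_", "_", "_", "0", "root", "_", "_"],
        ["2", "dagsins", "_", "_", "_", "_", "1", "obj", "_", "_"],
        ["3", ".", "_", "_", "_", "_", "1", "punct", "_", "_"],
    ]
)

if __name__ == "__main__":
    for word_id, form, head, deprel in read_conllu(SAMPLE):
        print(f"{word_id}\t{form}\t{head}\t{deprel}")
```

Keeping only these four fields is enough to compare predicted heads and labels against a gold treebank; the remaining columns can be read the same way if needed.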
requirements.txt
ADDED
@@ -0,0 +1,2 @@
+diaparser==1.1.2
+Tokenizer==3.4.2
test_file.txt
ADDED
@@ -0,0 +1,3 @@
+Komið þið sæl.
+Þetta skjal er ætlað til að sýna hvernig þáttarinn virkar.
+Njótið dagsins.
transformer_models/electra-base-igc-is
ADDED
@@ -0,0 +1 @@
+Subproject commit e2921de06b441e2a3066da485d6fa31cf5c816a8