Ahmet Yildirim committed
Commit 7d46aa7 · Parent: fab0b3a

Initial commit

Files changed:
- README.md (+138, -3)
- __init__.py (+0, -0)
- config.json (+11, -0)
- configuration_humit_tagger.py (+12, -0)
- lemma_rule.py (+190, -0)
- model.safetensors (+3, -0)
- modeling_humit_tagger.py (+969, -0)
- special_tokens_map.json (+1, -0)
- tagger_config.json (+0, -0)
- tokenizer.json (+0, -0)
- tokenizer_config.json (+3, -0)
README.md
CHANGED
---
license: mit
language: 'no'
base_model: ltg/norbert3-base
tags:
- norsk
- nynorsk
- bokmål
- språkidentifikasjon
- morfologisk_tagging
- setningsgrensedeteksjon
---

# Humit-Tagger Base

The official release of the Norwegian morphology tagger, Humit-Tagger, as a Hugging Face model.

This specific version of the tagger is based on Norbert3-base.

The aim of this model is to make Humit-Tagger available as a Hugging Face model, including all functionality that the [original code](https://github.com/humit-oslo/humit-tagger) supports.
In addition to morphological tagging, this model supports Nynorsk/Bokmål language identification provided by this [repository](https://github.com/humit-oslo/humit-sprakidentifikator).

This model adds four classification layers on top of the base model.
These layers perform language identification, morphological classification, lemmatization classification, and sentence boundary detection.

Overall, the large version scores about 1% higher in accuracy than the base version.
Depending on available CPU/GPU power, one of the following sizes can be used:

## Humit-tagger sizes

The humit-tagger sizes follow the sizes of [Norbert3](https://huggingface.co/ltg/norbert3-base).

- [humit-tagger-xs (15M)](https://huggingface.co/Humit-Oslo/humit-tagger-xs)
- [humit-tagger-small (40M)](https://huggingface.co/Humit-Oslo/humit-tagger-small)
- [humit-tagger-base (123M)](https://huggingface.co/Humit-Oslo/humit-tagger-base)
- [humit-tagger-large (323M)](https://huggingface.co/Humit-Oslo/humit-tagger-large)

## Loading the model

This model implements custom functionality, such as the `tag` and `identify_language` functions and the helper functions they rely on.
To provide this functionality, the model uses a custom wrapper.
Therefore, the model must be loaded with `trust_remote_code=True`.

The model can be loaded as follows:

```python
from transformers import AutoModel

humit_tagger = AutoModel.from_pretrained("Humit-Oslo/humit-tagger-base", trust_remote_code=True)
```

## Functions and parameters

The model provides two functions: `tag` and `identify_language`.
The `tag` function performs morphological tagging of the input.
The `identify_language` function identifies the language of the input, returning "nn" for Nynorsk and "bm" for Bokmål.
Both functions accept similar parameters.

| parameter | `.tag` supports | `.identify_language` supports | options | default | description |
| :--- | :- | :- | :- | :- | :- |
| `inp` | yes | yes | | None | The input. The parameter name can be omitted when the input is passed as the first positional argument. |
| `lang` | yes | no | "nn", "bm", "au" | "au" | The language of the tags. "au" tries to identify the language automatically from the input. |
| `input_directory` | yes | yes | | None | Apply the function recursively to the files in input_directory. |
| `output_directory` | yes | yes | | None | Write the output recursively into output_directory. The written files get the extension ".tagged" or ".lang", depending on the function called. |
| `one_sentence_per_line` | yes | yes | True / False | False | Skip sentence boundary detection and treat each line of the input or the input file(s) as one sentence. |
| `lang_per_sentence` | yes | no | True / False | False | Identify the language per sentence and output the tags according to the language identified for that sentence. If this is not set and lang is "au", the whole input (or a whole file, if input_directory is used) is used to identify the language. |
| `write_output_to` | yes | yes | a file path, a file handle, or "list" | sys.stdout | Where to write the output. A file path is overwritten; a file handle is written to directly; if "list" is given, the function returns a Python list. |
| `output_tsv` | yes | yes | True / False | False | The output format. The default is JSON; if multiple sentences exist, each line is a single valid JSON object, but the output as a whole is not. This option cannot be combined with write_output_to="list". |
| `lang_per_item` | no | yes | True / False | False | Treat each item of an input list as a separate input for language identification. |
| `fast_mode` | no | yes | True / False | False | Identify the languages of the files in the input directory in fast mode, which uses only the beginning of each file. This is much faster for many files, but not as accurate as leaving it set to False. |
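Since the default JSON output emits one valid JSON object per line when multiple sentences exist (see the output_tsv row above), a consumer can parse it line by line. A minimal sketch of that pattern, where `raw_output` is a hypothetical stand-in for tagger output, not the actual output schema:

```python
import json

# Hypothetical multi-sentence output: each LINE is valid JSON,
# but the string as a whole is not a single JSON document.
raw_output = '{"sentence": 1, "tokens": []}\n{"sentence": 2, "tokens": []}'

# Parse line by line (JSON Lines style), skipping blank lines.
sentences = [json.loads(line) for line in raw_output.splitlines() if line.strip()]
```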
## Several example use cases

### Tag one sentence
```python
humit_tagger.tag("Dette er en norsk setning.")
```

### Tag a list of sentences
```python
humit_tagger.tag(["Dette er en norsk setning.", "Dette er en annen norsk setning."])
```

### Tag a file
```python
with open("path/to/file", "r") as f:
    humit_tagger.tag(f.read())
```

### Tag all files recursively in a directory

Here, input_directory and output_directory must be given as parameters.
The files that can be read in text mode will be tagged, and the output will be written to output_directory with the same directory and sub-directory structure.
The output files keep their original names, with ".tagged" appended.
Any existing files will be overwritten.

```python
humit_tagger.tag(input_directory="path/to/input/directory", output_directory="path/to/output/directory")
```

### Language identification
```python
humit_tagger.identify_language("Eg elskar snø.")
```

### Language identification of multiple sentences
```python
humit_tagger.identify_language(["Jeg elsker snø.", "Eg elskar snø."])
```

### Recursive language identification of all files in a directory
```python
humit_tagger.identify_language(input_directory="../inp")
```

## Cite us

```bibtex
@inproceedings{haug-etal-2023-integrating,
    title = "Rules and neural nets for morphological tagging of {N}orwegian - Results and challenges",
    author = "Haug, Dag  and
      Yildirim, Ahmet  and
      Hagen, Kristin  and
      N{\o}klestad, Anders",
    editor = {Alum{\"a}e, Tanel  and
      Fishel, Mark},
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.43/",
    pages = "425--435",
    abstract = "This paper reports on efforts to improve the Oslo-Bergen Tagger for Norwegian morphological tagging. We train two deep neural network-based taggers using the recently introduced Norwegian pre-trained encoder (a BERT model for Norwegian). The first network is a sequence-to-sequence encoder-decoder and the second is a sequence classifier. We test both these configurations in a hybrid system where they combine with the existing rule-based system, and on their own. The sequence-to-sequence system performs better in the hybrid configuration, but the classifier system performs so well that combining it with the rules is actually slightly detrimental to performance."
}
```
__init__.py
ADDED

File without changes
config.json
ADDED

```json
{
  "architectures": [
    "HumitTaggerModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_humit_tagger.HumitTaggerConfig",
    "AutoModel": "modeling_humit_tagger.HumitTaggerModel"
  },
  "humit_tagger_configuration": "tagger_config.json",
  "lemma_rules_py_file": "lemma_rule.py"
}
```
configuration_humit_tagger.py
ADDED

```python
from transformers.configuration_utils import PretrainedConfig


class HumitTaggerConfig(PretrainedConfig):
    """Configuration class to store the configuration of a `HumitTaggerModel`."""

    def __init__(
        self,
        **kwargs,
    ):
        super().__init__(**kwargs)
```
lemma_rule.py
ADDED

```python
# Script that implements word-lemma conversions and rule extraction.
# Most of the code has been taken from: https://github.com/hplt-project/HPLT-WP4/blob/main/evaluation/ud/lemma_rule.py
# This is a class with static members

import pickle


class LemmaHandling:
    lemma_dict = dict()
    lemma_list = list()
    lemma_list_inverted = dict()
    word_classes = dict()

    def __init__(self):
        pass

    def min_edit_script(source, target, allow_copy):
        a = [[(len(source) + len(target) + 1, None)] * (len(target) + 1) for _ in range(len(source) + 1)]
        for i in range(0, len(source) + 1):
            for j in range(0, len(target) + 1):
                if i == 0 and j == 0:
                    a[i][j] = (0, "")
                else:
                    if allow_copy and i and j and source[i - 1] == target[j - 1] and a[i - 1][j - 1][0] < a[i][j][0]:
                        a[i][j] = (a[i - 1][j - 1][0], a[i - 1][j - 1][1] + "→")
                    if i and a[i - 1][j][0] < a[i][j][0]:
                        a[i][j] = (a[i - 1][j][0] + 1, a[i - 1][j][1] + "-")
                    if j and a[i][j - 1][0] < a[i][j][0]:
                        a[i][j] = (a[i][j - 1][0] + 1, a[i][j - 1][1] + "+" + target[j - 1])
        return a[-1][-1][1]

    def gen_lemma_rule(form, lemma, allow_copy):
        best, best_form, best_lemma = 0, 0, 0
        for l in range(len(lemma)):
            for f in range(len(form)):
                cpl = 0
                while f + cpl < len(form) and l + cpl < len(lemma) and form[f + cpl].lower() == lemma[l + cpl].lower():
                    cpl += 1
                if cpl > best:
                    best = cpl
                    best_form = f
                    best_lemma = l

        if not best:
            return {"case": None, "prefix": None, "suffix": None, "absolute": "a" + lemma}

        prefix_rule = LemmaHandling.min_edit_script(form[:best_form].lower(), lemma[:best_lemma].lower(), allow_copy)
        suffix_rule = LemmaHandling.min_edit_script(form[best_form + best:].lower(), lemma[best_lemma + best:].lower(), allow_copy)

        if lemma.islower():
            return {"case": "lower", "prefix": prefix_rule, "suffix": suffix_rule, "absolute": "relative"}

        generated_lemma = LemmaHandling.apply_lemma_rule(form, {"case": "lower", "prefix": prefix_rule, "suffix": suffix_rule, "absolute": "relative"}, apply_casing=False)
        if generated_lemma == lemma:
            return {"case": "keep", "prefix": prefix_rule, "suffix": suffix_rule, "absolute": "relative"}

        previous_case = -1
        lemma_casing = ""
        for i, c in enumerate(lemma):
            case = "↑" if c.lower() != c else "↓"
            if case != previous_case:
                lemma_casing += "{}{}{}".format("¦" if lemma_casing else "", case, i if i <= len(lemma) // 2 else i - len(lemma))
            previous_case = case

        return {"case": lemma_casing, "prefix": prefix_rule, "suffix": suffix_rule, "absolute": "relative"}

    def apply_lemma_rule(form, lemma_rule, apply_casing=True):
        if lemma_rule["absolute"].startswith("a"):
            return lemma_rule["absolute"][1:]

        if any(rule is None for rule in lemma_rule.values()):
            return form

        rules, rule_sources = (lemma_rule["prefix"], lemma_rule["suffix"]), []
        for rule in rules:
            source, i = 0, 0
            while i < len(rule):
                if rule[i] == "→" or rule[i] == "-":
                    source += 1
                else:
                    assert rule[i] == "+"
                    i += 1
                i += 1
            rule_sources.append(source)

        try:
            lemma, form_offset = "", 0
            for i in range(2):
                j, offset = 0, (0 if i == 0 else len(form) - rule_sources[1])
                while j < len(rules[i]):
                    if rules[i][j] == "→":
                        lemma += form[offset]
                        offset += 1
                    elif rules[i][j] == "-":
                        offset += 1
                    else:
                        assert rules[i][j] == "+"
                        lemma += rules[i][j + 1]
                        j += 1
                    j += 1
                if i == 0:
                    lemma += form[rule_sources[0] : len(form) - rule_sources[1]]
        except:
            lemma = form

        if not apply_casing:
            return lemma

        if lemma_rule["case"] == "lower":
            return lemma.lower()
        elif lemma_rule["case"] == "keep":
            return lemma

        lemma = lemma.lower()
        for rule in lemma_rule["case"].split("¦"):
            if rule == "↓0": continue  # The lemma is lowercased initially
            if not rule: continue  # Empty lemma might generate empty casing rule
            case, offset = rule[0], int(rule[1:])
            lemma = lemma[:offset] + (lemma[offset:].upper() if case == "↑" else lemma[offset:].lower())

        return lemma

    # Extracts the lemma rule for a word/lemma pair and adds it to the lemma rules dictionary if it does not exist
    def add_lemma_rule_to_dict(word, lemma, word_class=None):
        r = LemmaHandling.gen_lemma_rule(word, lemma, True)
        st = [r['case'], r['prefix'], r['suffix'], r['absolute']]
        st = ";".join(["§" if i is None else i for i in st])
        if st not in LemmaHandling.lemma_dict:
            LemmaHandling.lemma_dict[st] = r
        if word_class is None:
            word_class = "ukjent"
        if st not in LemmaHandling.word_classes:
            LemmaHandling.word_classes[st] = [word_class]
        else:
            LemmaHandling.word_classes[st].append(word_class)
            LemmaHandling.word_classes[st] = sorted(list(set(LemmaHandling.word_classes[st])))

    # This function initializes the lemma rule dictionary and lists
    def start_lemma_rule_extraction():
        LemmaHandling.lemma_list = []
        LemmaHandling.lemma_list_inverted = {}
        LemmaHandling.lemma_dict = {}

    # This function extracts lemma_list (and its inverted index) using the lemma_dict
    def done_lemma_list_extraction():
        LemmaHandling.lemma_list = ["[NONE]"] + list(LemmaHandling.lemma_dict.keys())
        LemmaHandling.lemma_list_inverted = {j: i for i, j in enumerate(LemmaHandling.lemma_list)}

    # This saves lemma rules to a file
    def save_lemma_rules(file_name):
        with open(file_name, "wb") as fil:
            pickle.dump([LemmaHandling.lemma_dict, LemmaHandling.lemma_list, LemmaHandling.word_classes], fil)

    # This function loads an already saved rules file
    def load_lemma_rules(dict_file):
        with open(dict_file, 'rb') as fil:
            LemmaHandling.lemma_dict, LemmaHandling.lemma_list, LemmaHandling.word_classes = pickle.load(fil)
        LemmaHandling.lemma_list_inverted = {j: i for i, j in enumerate(LemmaHandling.lemma_list)}

    # This function loads lemma rules from an object
    def load_lemma_rules_from_obj(obj):
        LemmaHandling.lemma_dict, LemmaHandling.lemma_list, LemmaHandling.word_classes = obj
        LemmaHandling.lemma_list_inverted = {j: i for i, j in enumerate(LemmaHandling.lemma_list)}

    # This returns the lemma given the word and its rule index.
    # If the index is not found, returns the word as lemma.
    def get_lemma_and_word_classes_given_word_and_lemma_list_index(word, lemma_list_index):
        if lemma_list_index >= len(LemmaHandling.lemma_dict):
            return word
        st = LemmaHandling.lemma_list[lemma_list_index]
        return LemmaHandling.apply_lemma_rule(word, LemmaHandling.lemma_dict[st], apply_casing=True), LemmaHandling.word_classes[st]

    # Same as above, without word classes
    def get_lemma_given_word_and_lemma_list_index(word, lemma_list_index):
        if lemma_list_index >= len(LemmaHandling.lemma_dict) or lemma_list_index == 0:
            return word
        return LemmaHandling.apply_lemma_rule(word, LemmaHandling.lemma_dict[LemmaHandling.lemma_list[lemma_list_index]], apply_casing=True)

    # This function returns the lemma_rule index given word and lemma
    def get_lemma_rule_index(word, lemma):
        r = LemmaHandling.gen_lemma_rule(word, lemma, True)
        st = [r['case'], r['prefix'], r['suffix'], r['absolute']]
        st = ";".join(["§" if i is None else i for i in st])
        if st not in LemmaHandling.lemma_dict:
            return 0
        return LemmaHandling.lemma_list_inverted[st]
```
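To illustrate the idea behind lemma_rule.py: rather than predicting lemmas directly, the tagger predicts a transformation rule that maps a word form to its lemma, so one rule covers many word/lemma pairs. A deliberately simplified sketch of that idea, restricted to suffix replacement (the real code above uses minimum-edit scripts over both prefix and suffix, plus casing rules):

```python
def gen_suffix_rule(form, lemma):
    """Encode a form->lemma mapping as (chars to strip, suffix to append)."""
    i = 0
    # Length of the longest common prefix of form and lemma.
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return (len(form) - i, lemma[i:])

def apply_suffix_rule(form, rule):
    """Apply a (strip, suffix) rule to any word form."""
    strip, suffix = rule
    return form[: len(form) - strip] + suffix

# "setninger" -> "setning": strip 2 characters, append nothing.
rule = gen_suffix_rule("setninger", "setning")
assert apply_suffix_rule("setninger", rule) == "setning"
```

Because many forms share the same rule (e.g. plural "-er" nouns), the classifier only has to choose among a closed set of rule indices instead of an open vocabulary of lemmas.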
model.safetensors
ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:0e5ea6c6918d8598ecec750d5f71d6da616db0cc4e8dda0409ed56098463edd9
size 496451528
```
modeling_humit_tagger.py
ADDED
|
@@ -0,0 +1,969 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from transformers import (
|
| 2 |
+
AutoModel,
|
| 3 |
+
AutoTokenizer
|
| 4 |
+
)
|
| 5 |
+
import torch
|
| 6 |
+
from huggingface_hub import hf_hub_download
|
| 7 |
+
import os
|
| 8 |
+
import importlib.util
|
| 9 |
+
import sys
|
| 10 |
+
import shutil
|
| 11 |
+
from safetensors.torch import load_model
|
| 12 |
+
import json
|
| 13 |
+
import re
|
| 14 |
+
import copy
|
| 15 |
+
|
| 16 |
+
class HumitTaggerModel(torch.nn.Module):
|
| 17 |
+
|
| 18 |
+
# We do not need to do anything to register our class as this class will only be used
|
| 19 |
+
# for easily getting humit-tagger worki
|
| 20 |
+
def register_for_auto_class(auto_class):
|
| 21 |
+
pass
|
| 22 |
+
return
|
| 23 |
+
|
| 24 |
+
# Define our own from-pretrained to load the weights and other files needed for the tagger to work
|
| 25 |
+
def from_pretrained(repo_name, **kwargs):
|
| 26 |
+
|
| 27 |
+
# Download this model's config:
|
| 28 |
+
this_model_config_path = hf_hub_download(repo_id=repo_name, filename=kwargs["config"].humit_tagger_configuration)
|
| 29 |
+
|
| 30 |
+
# load this model's config
|
| 31 |
+
with open(this_model_config_path,"r") as js:
|
| 32 |
+
kwargs["this_model_config"]=json.load(js)
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
# Download this model's config:
|
| 36 |
+
lemma_rules_path = hf_hub_download(repo_id=repo_name, filename=kwargs["config"].lemma_rules_py_file)
|
| 37 |
+
|
| 38 |
+
# load lemma rules class
|
| 39 |
+
sys.path.append(os.path.dirname(lemma_rules_path))
|
| 40 |
+
spec = importlib.util.spec_from_file_location("lemma_rules", lemma_rules_path)
|
| 41 |
+
lemma_rules = importlib.util.module_from_spec(spec)
|
| 42 |
+
sys.modules["lemma_rules"] = lemma_rules
|
| 43 |
+
spec.loader.exec_module(lemma_rules)
|
| 44 |
+
|
| 45 |
+
# Download base_model files into cache
|
| 46 |
+
base_config_file = hf_hub_download(repo_id=kwargs["this_model_config"]["base_model"], filename=kwargs["this_model_config"]["base_model_config_file"])
|
| 47 |
+
base_model_file = hf_hub_download(repo_id=kwargs["this_model_config"]["base_model"], filename=kwargs["this_model_config"]["base_model_model_file"])
|
| 48 |
+
base_model_config_json_file = hf_hub_download(repo_id=kwargs["this_model_config"]["base_model"], filename=kwargs["this_model_config"]["base_model_config_json_file"])
|
| 49 |
+
|
| 50 |
+
# Copy base model's configuration python file into our working directory
|
| 51 |
+
config_file_path = os.path.join(os.path.dirname(os.path.abspath(__file__)) , os.path.basename(base_config_file))
|
| 52 |
+
shutil.copyfile(base_config_file, config_file_path)
|
| 53 |
+
|
| 54 |
+
# HACK: Modify base model main file since __init.py__ has already been read and the new file must not contain relative imports
|
| 55 |
+
base_model_file_path = os.path.join(os.path.dirname(os.path.abspath(__file__)) , os.path.basename(base_model_file))
|
| 56 |
+
with open(base_model_file, 'r') as file:
|
| 57 |
+
file_content = file.read().replace("from .", "from ")
|
| 58 |
+
with open(base_model_file_path, 'w') as file:
|
| 59 |
+
file.write(file_content)
|
| 60 |
+
|
| 61 |
+
# Register the new files:
|
| 62 |
+
# First register the base model config file
|
| 63 |
+
sys.path.append(os.path.dirname(config_file_path))
|
| 64 |
+
spec = importlib.util.spec_from_file_location("base_config", config_file_path)
|
| 65 |
+
base_config = importlib.util.module_from_spec(spec)
|
| 66 |
+
sys.modules["base_config"] = base_config
|
| 67 |
+
spec.loader.exec_module(base_config)
|
| 68 |
+
# Then register the base model file
|
| 69 |
+
sys.path.append(os.path.dirname(base_model_file_path))
|
| 70 |
+
spec = importlib.util.spec_from_file_location("base_model", base_model_file_path)
|
| 71 |
+
base_model = importlib.util.module_from_spec(spec)
|
| 72 |
+
sys.modules["base_model"] = base_model
|
| 73 |
+
spec.loader.exec_module(base_model)
|
| 74 |
+
|
| 75 |
+
# Download model weights
|
| 76 |
+
model_weights_path = hf_hub_download(repo_id=repo_name, filename=kwargs["this_model_config"]["model_weights"])
|
| 77 |
+
|
| 78 |
+
# load base model config
|
| 79 |
+
with open(base_model_config_json_file,"r") as js:
|
| 80 |
+
kwargs["base_model_json_cfg"] = json.load(js)
|
| 81 |
+
|
| 82 |
+
kwargs["model_weights_path"] = model_weights_path
|
| 83 |
+
kwargs["repo_name"] = repo_name
|
| 84 |
+
return HumitTaggerModel(**kwargs)
|
| 85 |
+
|
| 86 |
+
    def __init__(self, **kwargs):
        super(HumitTaggerModel, self).__init__()
        json_cfg = kwargs["base_model_json_cfg"]
        self.config = kwargs["this_model_config"]
        self.LemmaHandling = sys.modules["lemma_rules"].LemmaHandling
        self.LemmaHandling.load_lemma_rules_from_obj(self.config["lemma_rules"])
        cfg = sys.modules["base_config"].NorbertConfig(**json_cfg)
        self.bert = sys.modules["base_model"].NorbertModel(cfg, pooling_type="CLS")
        self.dropout = torch.nn.Dropout(self.bert.config.hidden_dropout_prob)
        self.classifier1 = torch.nn.Linear(self.bert.config.hidden_size, self.config["num_labels1"])
        self.classifier2 = torch.nn.Linear(self.bert.config.hidden_size, self.config["num_labels2"])
        self.classifier3 = torch.nn.Linear(self.bert.config.hidden_size, self.config["num_labels3"])
        self.seq_classifier = torch.nn.Linear(self.bert.config.hidden_size, self.config["num_labels_seq"])
        self.ignore_index = self.config["ignore_index"]
        load_model(self, kwargs["model_weights_path"])
        self.tokenizer = AutoTokenizer.from_pretrained(kwargs["repo_name"])
        if "batch_size" in kwargs:
            self.batch_size = kwargs["batch_size"]
        else:
            self.batch_size = 8

        if "device" in kwargs:
            self.device = torch.device(kwargs["device"])
        else:
            self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        self.MAX_LENGTH_WITHOUT_CLS = self.bert.config.max_position_embeddings - 1
        self.tags = self.config["tags"]
        self.tags_str = [[" ".join(i) for i in self.config["tags"][0]], [" ".join(i) for i in self.config["tags"][1]]]
        self.to(self.device)
        self.REPLACE_DICT = self.config["replace_dict"]
        self.REPLACE_PATTERN = '|'.join(sorted(re.escape(k) for k in self.REPLACE_DICT))
        self.MAX_LENGTH = self.bert.config.max_position_embeddings

    def forward(self, input_ids=None, attention_mask=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
        sequence_output = self.dropout(outputs.last_hidden_state)
        logits1 = self.classifier1(sequence_output)
        logits2 = self.classifier2(sequence_output)
        logits3 = self.classifier3(sequence_output)
        seq_logits = self.seq_classifier(sequence_output)
        return {
            "logits1": logits1,
            "logits2": logits2,
            "logits3": logits3,
            "seq_logits": seq_logits,
        }

    def _preprocess_text(self, text):
        # Apply the replacement rules repeatedly until the text stops changing.
        new_text = re.sub(self.REPLACE_PATTERN, lambda m: self.REPLACE_DICT.get(m.group(0).upper()), text)
        while new_text != text:
            text = new_text
            new_text = re.sub(self.REPLACE_PATTERN, lambda m: self.REPLACE_DICT.get(m.group(0).upper()), text)
        return new_text

    def _batchify(self, lst):

        # Create batches
        batched_sentences = []
        my_batch = []
        for sentence in lst:
            sentence.append(self.tokenizer.sep_token_id)
            my_batch.append(sentence)
            if len(my_batch) == self.batch_size:
                max_len = len(max(my_batch, key=len))
                if max_len > self.MAX_LENGTH:
                    max_len = self.MAX_LENGTH
                my_attentions = torch.LongTensor([[1] * len(i[0:max_len]) + [0] * (max_len - len(i[0:max_len])) for i in my_batch]).to("cpu")
                my_batch = [i[0:max_len] + [0] * (max_len - len(i[0:max_len])) for i in my_batch]
                to_append = {
                    "input_ids": torch.LongTensor(my_batch).to("cpu"),
                    "attention_mask": my_attentions,
                }
                batched_sentences.append(to_append)
                my_batch = []
        if len(my_batch) > 0:
            max_len = len(max(my_batch, key=len))
            if max_len > self.MAX_LENGTH:
                max_len = self.MAX_LENGTH
            my_attentions = torch.LongTensor([[1] * len(i[0:max_len]) + [0] * (max_len - len(i[0:max_len])) for i in my_batch]).to("cpu")
            my_batch = [i[0:max_len] + [0] * (max_len - len(i[0:max_len])) for i in my_batch]
            to_append = {
                "input_ids": torch.LongTensor(my_batch).to("cpu"),
                "attention_mask": my_attentions,
            }
            batched_sentences.append(to_append)

        torch.cuda.empty_cache()

        return batched_sentences

    def _split_sentences(self, inp):

        # Tokenize the whole text at once.
        encodings = self.tokenizer(inp, add_special_tokens=False, return_tensors="pt").to(self.device)

        # Save a copy of the tokenization
        original_encodings = copy.deepcopy(encodings)
        original_encodings = original_encodings.to("cpu")
        torch.cuda.empty_cache()

        # Pad to the full size (model max_length - 1, leaving room for CLS)
        old_size = encodings["input_ids"][0].size()[0]

        # Pad size
        pad_size = self.MAX_LENGTH_WITHOUT_CLS - old_size % self.MAX_LENGTH_WITHOUT_CLS

        # Number of rows
        row_count = int(old_size / self.MAX_LENGTH_WITHOUT_CLS) + 1

        # Pad with pad_id up to the pad_size we calculated.
        encodings["input_ids"] = torch.nn.functional.pad(input=encodings["input_ids"], pad=(0, pad_size), mode="constant", value=self.tokenizer.pad_token_id)

        # Set the last token as SENTENCE END (SEP)
        encodings["input_ids"][0][old_size] = self.tokenizer.sep_token_id

        # Chunk into max_length items
        encodings["input_ids"] = torch.reshape(encodings["input_ids"], (row_count, self.MAX_LENGTH_WITHOUT_CLS))

        # Add CLS to each item
        encodings["input_ids"] = torch.cat((torch.full((row_count, 1), self.tokenizer.cls_token_id, device=self.device), encodings["input_ids"]), dim=1)

        # Create the attention mask
        encodings["attention_mask"] = torch.ones_like(encodings["input_ids"], device=self.device)

        # Create batches
        input_ids_batched = torch.split(encodings["input_ids"], self.batch_size)
        attention_mask_batched = torch.split(encodings["attention_mask"], self.batch_size)

        # Set the last chunk's attention mask according to its size
        attention_mask_batched[-1][-1][pad_size + 1:] = 0

        encodings = encodings.to("cpu")

        # Now pass all chunks through the model and get the labels.
        # While passing, we count the number of bokmål and nynorsk markers.
        labels_output = []

        # First move the batches back to the CPU to free space on the GPU
        input_ids_batched = [i.to("cpu") for i in input_ids_batched]
        attention_mask_batched = [i.to("cpu") for i in attention_mask_batched]
        torch.cuda.empty_cache()

        for input_ids, attention_masks in zip(input_ids_batched, attention_mask_batched):
            current_batch = {"input_ids": input_ids.to(self.device).long(), "attention_mask": attention_masks.to(self.device).long()}
            outputs = self(**current_batch)
            del current_batch
            torch.cuda.empty_cache()

            label_data = outputs["logits1"].argmax(-1)
            labels_output.extend(label_data)

        # Serialize back
        labels_output = torch.stack(labels_output, dim=0)
        labels_output = labels_output[:, range(1, self.MAX_LENGTH)]
        labels_output = torch.reshape(labels_output, (1, row_count * self.MAX_LENGTH_WITHOUT_CLS))
        torch.cuda.empty_cache()

        # The data is now split into sentences.
        # Build the sentence data as a list so that it can be used
        # in torch operations and fed to the models.
        sentence_list = []
        this_sentence = [self.tokenizer.cls_token_id]
        for token, label in zip(original_encodings["input_ids"][0].tolist(), labels_output[0].tolist()):
            if label == 0:
                this_sentence.append(token)
            else:
                this_sentence.append(token)
                sentence_list.append(this_sentence)
                this_sentence = [self.tokenizer.cls_token_id]

        if len(this_sentence) > 1:
            sentence_list.append(this_sentence)
        del original_encodings
        del labels_output
        del attention_mask_batched
        del input_ids_batched
        del encodings
        del old_size
        del inp
        del outputs
        torch.cuda.empty_cache()

        return sentence_list

    def _matcher(self, o):
        return o.group(0)[0] + "\n\n" + o.group(0)[2]

    def split_sentences(self, inp, **tag_config):
        inp = [i.replace("\n", " ") for i in re.sub(r"[^.!\?](\n)([^a-z,æ,ø,å,\\ ])", self._matcher, inp).split("\n\n")]
        sentences = []
        for i in inp:
            sentences.extend(self._split_sentences(i.strip()))
        return sentences

    def tag_sentence_list(self, lst, **tag_config):

        # If the sentences are not tokenized, tokenize while batching:
        tokenized_batches = []
        if type(lst[0]) == str:
            tokenized_batches = []
            for i in range(0, len(lst), self.batch_size):
                batch_texts = lst[i:i + self.batch_size]
                encoded_batch = self.tokenizer(batch_texts, padding=True, truncation=True, max_length=self.MAX_LENGTH, return_tensors="pt", return_token_type_ids=False)
                encoded_batch["input_ids"].to("cpu")
                encoded_batch["attention_mask"].to("cpu")
                tokenized_batches.append(encoded_batch)

        # The sentences are already tokenized, so batchify them:
        else:
            tokenized_batches = self._batchify(lst)

        # If the language will be identified per sentence
        if tag_config["lang_per_sentence"]:
            id_to_lang = self.config["id_to_lang"]
            # If the output will be returned as a Python list
            if tag_config["write_output_to"] == None:
                all_tagged_sentences = []
                for batch in tokenized_batches:
                    all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
                    batch_tags = torch.argmax(all_out["logits2"], dim=-1)
                    batch_lemmas = torch.argmax(all_out["logits3"], dim=-1)
                    batch_langs = torch.argmax(all_out["seq_logits"], dim=-1)
                    batch["input_ids"].to("cpu")
                    batch["attention_mask"].to("cpu")

                    for input_ids, tags, lemmas, lang in zip(batch["input_ids"].tolist(), batch_tags.tolist(),
                                                             batch_lemmas.tolist(), batch_langs[:, 0].tolist()):
                        this_sentence = []
                        for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
                            if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
                                break
                            if lemma == 0:  # No lemma here means we have not yet reached the end of the word
                                if len(this_sentence) > 0:
                                    this_sentence[-1]["w"] += self.tokenizer.decode(inps)
                                else:
                                    this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
                            else:
                                this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
                        all_tagged_sentences.append({"lang": id_to_lang[lang], "sent": [{"w": i["w"], "t": self.tags[lang][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]})

                return all_tagged_sentences

            # If the output is in TSV format to a pipe (stdout or a file handle)
            elif tag_config["output_tsv"]:
                for batch in tokenized_batches:
                    all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
                    batch_tags = torch.argmax(all_out["logits2"], dim=-1)
                    batch_lemmas = torch.argmax(all_out["logits3"], dim=-1)
                    batch_langs = torch.argmax(all_out["seq_logits"], dim=-1)
                    batch["input_ids"].to("cpu")
                    batch["attention_mask"].to("cpu")

                    for input_ids, tags, lemmas, lang in zip(batch["input_ids"].tolist(), batch_tags.tolist(),
                                                             batch_lemmas.tolist(), batch_langs[:, 0].tolist()):
                        this_sentence = []
                        for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
                            if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
                                break
                            if lemma == 0:  # No lemma here means we have not yet reached the end of the word
                                if len(this_sentence) > 0:
                                    this_sentence[-1]["w"] += self.tokenizer.decode(inps)
                                else:
                                    this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
                            else:
                                this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
                        this_sentence = [{"w": i["w"], "t": self.tags_str[lang][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]
                        tag_config["write_output_to"].write(id_to_lang[lang])
                        for lin in this_sentence:
                            tag_config["write_output_to"].write("\t")
                            tag_config["write_output_to"].write(lin["w"])
                            tag_config["write_output_to"].write("\t")
                            tag_config["write_output_to"].write(lin["l"])
                            tag_config["write_output_to"].write("\t")
                            tag_config["write_output_to"].write(lin["t"])
                            tag_config["write_output_to"].write("\n")
                        tag_config["write_output_to"].write("\n")

            # If the output format will be JSON to a pipe (stdout or a file handle)
            else:
                for batch in tokenized_batches:
                    all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
                    batch_tags = torch.argmax(all_out["logits2"], dim=-1)
                    batch_lemmas = torch.argmax(all_out["logits3"], dim=-1)
                    batch_langs = torch.argmax(all_out["seq_logits"], dim=-1)
                    batch["input_ids"].to("cpu")
                    batch["attention_mask"].to("cpu")

                    for input_ids, tags, lemmas, lang in zip(batch["input_ids"].tolist(), batch_tags.tolist(),
                                                             batch_lemmas.tolist(), batch_langs[:, 0].tolist()):
                        this_sentence = []
                        for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
                            if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
                                break
                            if lemma == 0:  # No lemma here means we have not yet reached the end of the word
                                if len(this_sentence) > 0:
                                    this_sentence[-1]["w"] += self.tokenizer.decode(inps)
                                else:
                                    this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
                            else:
                                this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})

                        json.dump({"lang": id_to_lang[lang], "sent": [{"w": i["w"], "t": self.tags[lang][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]}, tag_config["write_output_to"])
                        tag_config["write_output_to"].write("\n")

        # If the language is set as a parameter
        elif tag_config["lang"] != -1:
            LANG = tag_config["lang"]
            LANG_STR = self.config["id_to_lang"][LANG]
            # If the output will be returned as a Python list
            if tag_config["write_output_to"] == None:
                all_tagged_sentences = []
                for batch in tokenized_batches:
                    all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
                    batch_tags = torch.argmax(all_out["logits2"], dim=-1)
                    batch_lemmas = torch.argmax(all_out["logits3"], dim=-1)
                    batch["input_ids"].to("cpu")
                    batch["attention_mask"].to("cpu")

                    for input_ids, tags, lemmas in zip(batch["input_ids"].tolist(), batch_tags.tolist(),
                                                       batch_lemmas.tolist()):
                        this_sentence = []
                        for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
                            if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
                                break
                            if lemma == 0:  # No lemma here means we have not yet reached the end of the word
                                if len(this_sentence) > 0:
                                    this_sentence[-1]["w"] += self.tokenizer.decode(inps)
                                else:
                                    this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
                            else:
                                this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
                        all_tagged_sentences.append({"lang": LANG_STR, "sent": [{"w": i["w"], "t": self.tags[LANG][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]})

                return all_tagged_sentences

            # If the output is in TSV format to a pipe (stdout or a file handle)
            elif tag_config["output_tsv"]:
                for batch in tokenized_batches:
                    all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
                    batch_tags = torch.argmax(all_out["logits2"], dim=-1)
                    batch_lemmas = torch.argmax(all_out["logits3"], dim=-1)
                    batch["input_ids"].to("cpu")
                    batch["attention_mask"].to("cpu")

                    for input_ids, tags, lemmas in zip(batch["input_ids"].tolist(), batch_tags.tolist(),
                                                       batch_lemmas.tolist()):
                        this_sentence = []
                        for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
                            if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
                                break
                            if lemma == 0:  # No lemma here means we have not yet reached the end of the word
                                if len(this_sentence) > 0:
                                    this_sentence[-1]["w"] += self.tokenizer.decode(inps)
                                else:
                                    this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
                            else:
                                this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
                        this_sentence = [{"w": i["w"], "t": self.tags_str[LANG][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]
                        tag_config["write_output_to"].write(LANG_STR)
                        for lin in this_sentence:
                            tag_config["write_output_to"].write("\t")
                            tag_config["write_output_to"].write(lin["w"])
                            tag_config["write_output_to"].write("\t")
                            tag_config["write_output_to"].write(lin["l"])
                            tag_config["write_output_to"].write("\t")
                            tag_config["write_output_to"].write(lin["t"])
                            tag_config["write_output_to"].write("\n")
                        tag_config["write_output_to"].write("\n")

            # If the output format will be JSON to a pipe (stdout or a file handle)
            else:
                for batch in tokenized_batches:
                    all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
                    batch_tags = torch.argmax(all_out["logits2"], dim=-1)
                    batch_lemmas = torch.argmax(all_out["logits3"], dim=-1)
                    batch["input_ids"].to("cpu")
                    batch["attention_mask"].to("cpu")

                    for input_ids, tags, lemmas in zip(batch["input_ids"].tolist(), batch_tags.tolist(),
                                                       batch_lemmas.tolist()):
                        this_sentence = []
                        for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
                            if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
                                break
                            if lemma == 0:  # No lemma here means we have not yet reached the end of the word
                                if len(this_sentence) > 0:
                                    this_sentence[-1]["w"] += self.tokenizer.decode(inps)
                                else:
                                    this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
                            else:
                                this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})

                        json.dump({"lang": LANG_STR, "sent": [{"w": i["w"], "t": self.tags[LANG][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]}, tag_config["write_output_to"])
                        tag_config["write_output_to"].write("\n")

        # If the language will be identified from the majority of all sentences:
        else:
            all_tags = []
            all_lemmas = []
            all_langs = []
            all_input_ids = []
            # Go over all batches and each sentence in each batch
            for batch in tokenized_batches:
                all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
                batch_tags = torch.argmax(all_out["logits2"], dim=-1)
                batch_lemmas = torch.argmax(all_out["logits3"], dim=-1)
                batch_langs = torch.argmax(all_out["seq_logits"], dim=-1)
                all_input_ids.extend(batch["input_ids"].tolist())
                batch["input_ids"].to("cpu")
                batch["attention_mask"].to("cpu")
                all_langs.extend(batch_langs[:, 0].tolist())
                all_tags.extend(batch_tags.tolist())
                all_lemmas.extend(batch_lemmas.tolist())

            # Identify the language
            tag_config["lang"] = 1 if sum(all_langs) / len(all_langs) >= 0.5 else 0
            LANG = tag_config["lang"]
            LANG_STR = self.config["id_to_lang"][LANG]

            # If the output will be returned as a Python list:
            if tag_config["write_output_to"] == None:
                all_tagged_sentences = []
                for input_ids, tags, lemmas in zip(all_input_ids, all_tags, all_lemmas):
                    this_sentence = []
                    for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
                        if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
                            break
                        if lemma == 0:  # No lemma here means we have not yet reached the end of the word
                            if len(this_sentence) > 0:
                                this_sentence[-1]["w"] += self.tokenizer.decode(inps)
                            else:
                                this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
                        else:
                            this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
                    all_tagged_sentences.append({"lang": LANG_STR, "sent": [{"w": i["w"], "t": self.tags[LANG][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]})
                return all_tagged_sentences

            # If the output is in TSV format
            elif tag_config["output_tsv"]:
                for input_ids, tags, lemmas in zip(all_input_ids, all_tags, all_lemmas):
                    this_sentence = []
                    for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
                        if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
                            break
                        if lemma == 0:  # No lemma here means we have not yet reached the end of the word
                            if len(this_sentence) > 0:
                                this_sentence[-1]["w"] += self.tokenizer.decode(inps)
                            else:
                                this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
                        else:
                            this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
                    this_sentence = [{"w": i["w"], "t": self.tags_str[LANG][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]
                    tag_config["write_output_to"].write(LANG_STR)
                    for lin in this_sentence:
                        tag_config["write_output_to"].write("\t")
                        tag_config["write_output_to"].write(lin["w"])
                        tag_config["write_output_to"].write("\t")
                        tag_config["write_output_to"].write(lin["l"])
                        tag_config["write_output_to"].write("\t")
                        tag_config["write_output_to"].write(lin["t"])
                        tag_config["write_output_to"].write("\n")
                    tag_config["write_output_to"].write("\n")

            # If the output format will be JSON
            else:
                for input_ids, tags, lemmas in zip(all_input_ids, all_tags, all_lemmas):
                    this_sentence = []
                    for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
                        if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
                            break
                        if lemma == 0:  # No lemma here means we have not yet reached the end of the word
                            if len(this_sentence) > 0:
                                this_sentence[-1]["w"] += self.tokenizer.decode(inps)
                            else:
                                this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
                        else:
                            this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})

                    json.dump({"lang": LANG_STR, "sent": [{"w": i["w"], "t": self.tags[LANG][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]}, tag_config["write_output_to"])
                    tag_config["write_output_to"].write("\n")

    def _check_if_text_file_and_return_content(self, filepath):
        try:
            with open(filepath, 'r') as f:
                return f.read()
        except Exception:
            return False

    @torch.no_grad()
    def tag(self, inp=None, **tag_config):
        self.eval()
        if "one_sentence_per_line" not in tag_config:
            tag_config["one_sentence_per_line"] = False

        if "lang" not in tag_config:
            tag_config["lang"] = -1
        else:
            if tag_config["lang"] in self.config["lang_to_id"]:
                tag_config["lang"] = self.config["lang_to_id"][tag_config["lang"]]
            else:
                tag_config["lang"] = -1
        if "output_tsv" not in tag_config:
            tag_config["output_tsv"] = False

        if "lang_per_sentence" not in tag_config:
            tag_config["lang_per_sentence"] = False

        elif tag_config["lang_per_sentence"]:
            tag_config["lang_per_sentence"] = True

        if tag_config["lang"] != -1 and tag_config["lang_per_sentence"]:
            raise ValueError("The lang_per_sentence and lang parameters cannot be set at the same time.")

        if "input_directory" in tag_config:
            if not "output_directory" in tag_config:
                raise ValueError("output_directory must be defined if input_directory is defined.")
            if "write_output_to" in tag_config and tag_config["write_output_to"] != None:
                raise ValueError("If an input and output directory are given, write_output_to cannot be used, as the output will be written to files in output_directory.")

            write_to = sys.stderr if not sys.stderr.closed else sys.stdout if not sys.stdout.closed else open("tag.log", "w")

            # Process the directory
            for dir_path, _, files in os.walk(tag_config["input_directory"]):
                for f in files:
                    input_path = os.path.join(dir_path, f)
                    out_path = os.path.join(tag_config["output_directory"], os.path.relpath(dir_path, tag_config["input_directory"]), f + ".tagged")

                    file_content = self._check_if_text_file_and_return_content(input_path)

                    if type(file_content) == str:
                        file_content = self._preprocess_text(file_content)
                        print(f"Tagging {input_path} to {out_path}.")
                        os.makedirs(os.path.dirname(out_path), exist_ok=True)
                        if tag_config["one_sentence_per_line"]:
                            inp = [i for i in file_content.split("\n") if i != ""]
                            inp = [i for i in inp if i != ""]
                            with open(out_path, "w") as opened_file:
                                tag_config["write_output_to"] = opened_file
                                self.tag_sentence_list(inp, **tag_config)
                        else:
                            inp = self.split_sentences(file_content, **tag_config)
                            with open(out_path, "w") as opened_file:
                                tag_config["write_output_to"] = opened_file
                                self.tag_sentence_list(inp, **tag_config)
                    else:
                        print(f"Could not properly open and read {input_path}.")

            write_to.close()
            return

        else:
            if "write_output_to" not in tag_config or "write_output_to" in tag_config and tag_config["write_output_to"] == None:
                tag_config["write_output_to"] = sys.stdout
            elif type(tag_config["write_output_to"]) == str and tag_config["write_output_to"] == "list":
                tag_config["write_output_to"] = None
            elif type(tag_config["write_output_to"]) == str:
                tag_config["write_output_to"] = open(tag_config["write_output_to"], "w")

            if inp == None:
                pass
            elif type(inp) == str:

                # Tag one sentence per line in a string
                if tag_config["one_sentence_per_line"]:
                    inp = [i for i in inp.split("\n") if i != ""]
                    inp = [self._preprocess_text(i) for i in inp if i != ""]
                    return self.tag_sentence_list(inp, **tag_config)

                # Identify the sentences
                inp = self.split_sentences(inp, **tag_config)
                return self.tag_sentence_list(inp, **tag_config)

            # Tag one sentence per list item
            elif type(inp) == list:
                inp = [i.strip() for i in inp]
                inp = [self._preprocess_text(i) for i in inp if i != ""]
                return self.tag_sentence_list(inp, **tag_config)

    def identify_language_sentence_list(self, lst, **tag_config):

        # If the sentences are not tokenized yet, tokenize them while batching:
        if type(lst[0]) == str:
            tokenized_batches = []
            for i in range(0, len(lst), self.batch_size):
                batch_texts = lst[i:i + self.batch_size]
                encoded_batch = self.tokenizer(batch_texts, padding=True, truncation=True,
                                               max_length=self.MAX_LENGTH, return_tensors="pt",
                                               return_token_type_ids=False)
                tokenized_batches.append(encoded_batch)

        # The sentences are already tokenized, so only batchify them:
        else:
            tokenized_batches = self._batchify(lst)

        all_tagged_sentences = []

        # Run the model over every batch and collect the language id of each sentence
        for batch in tokenized_batches:
            all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
            batch_langs = torch.argmax(all_out["seq_logits"], dim=-1)
            all_tagged_sentences.extend(batch_langs[:, 0].tolist())

        # If the language is identified per item:
        if tag_config["lang_per_item"]:
            return [self.config["id_to_lang"][i] for i in all_tagged_sentences]

        # Otherwise pick the majority language over all sentences:
        else:
            LANG = 1 if sum(all_tagged_sentences) / len(all_tagged_sentences) >= 0.5 else 0
            LANG_STR = self.config["id_to_lang"][LANG]
            return [LANG_STR] * len(lst)

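The majority vote at the end of `identify_language_sentence_list` can be sketched in isolation: the whole input is labelled with language id 1 when at least half of the per-sentence predictions are 1, and ties go to id 1. A minimal standalone sketch (the `id_to_lang` mapping below is illustrative, not the model's actual configuration):

```python
# Standalone sketch of the majority-vote step; the label mapping is illustrative.
def majority_language(sentence_lang_ids, id_to_lang):
    """Assign every sentence the language chosen by majority vote (ties go to id 1)."""
    lang_id = 1 if sum(sentence_lang_ids) / len(sentence_lang_ids) >= 0.5 else 0
    return [id_to_lang[lang_id]] * len(sentence_lang_ids)

id_to_lang = {0: "nob", 1: "nno"}  # assumed id order, for illustration only
print(majority_language([1, 1, 0], id_to_lang))  # majority of the sentences have id 1
```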
    @torch.no_grad()
    def identify_language(self, inp=None, **tag_config):
        self.eval()

        if "one_sentence_per_line" not in tag_config:
            tag_config["one_sentence_per_line"] = False

        # The language is what is being predicted here, so a caller-supplied lang is ignored
        if "lang" in tag_config:
            del tag_config["lang"]

        if "output_tsv" not in tag_config:
            tag_config["output_tsv"] = False

        if "lang_per_sentence" not in tag_config:
            tag_config["lang_per_sentence"] = False

        if "input_directory" in tag_config and "output_directory" in tag_config \
                and tag_config.get("write_output_to") is not None:
            raise ValueError("If an input and an output directory are given, write_output_to cannot be used, "
                             "as the output will be written as files in output_directory.")

        if tag_config.get("write_output_to") is None:
            tag_config["write_output_to"] = sys.stdout
        elif tag_config["write_output_to"] == "list":
            if tag_config["output_tsv"]:
                raise ValueError("write_output_to cannot be set to list if output_tsv is set.")
            if tag_config.get("output_directory"):
                raise ValueError("write_output_to cannot be set to list if output_directory is set.")
            tag_config["write_output_to"] = None
        elif type(tag_config["write_output_to"]) == str:
            tag_config["write_output_to"] = open(tag_config["write_output_to"], "w")

        if "output_directory" in tag_config:
            tag_config["write_output_to"] = None

        if "split_sentences" not in tag_config:
            tag_config["split_sentences"] = False

        if "lang_per_item" not in tag_config:
            tag_config["lang_per_item"] = False

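The option handling above repeats the pattern "if the key is missing, set a default". The same normalization can be written compactly with `dict.setdefault`; a sketch with the defaults mirroring the ones set above:

```python
def apply_defaults(tag_config):
    """Fill in missing tagger options without overwriting caller-supplied values."""
    defaults = {
        "one_sentence_per_line": False,
        "output_tsv": False,
        "lang_per_sentence": False,
        "split_sentences": False,
        "lang_per_item": False,
    }
    for key, value in defaults.items():
        tag_config.setdefault(key, value)
    return tag_config

print(apply_defaults({"output_tsv": True}))
```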
        # fast_mode: identify one language per file from its first 3000 characters, without sentence splitting
        if "fast_mode" in tag_config:

            if "input_directory" not in tag_config:
                raise ValueError("input_directory must be defined if fast_mode is set.")

            if tag_config["split_sentences"]:
                raise ValueError("fast_mode does not split sentences, so split_sentences cannot be set in this mode.")

            if tag_config["lang_per_item"]:
                raise ValueError("fast_mode does not identify the language of each line or sentence in a file, so lang_per_item cannot be set in this mode.")

            if tag_config["lang_per_sentence"]:
                raise ValueError("fast_mode does not identify the language of each sentence in a file, so lang_per_sentence cannot be set in this mode.")

            general_output = []
            file_names = []
            contents = []

            # Process the directory
            for dir_path, _, files in os.walk(tag_config["input_directory"]):
                for f in files:
                    input_path = os.path.join(dir_path, f)

                    # When a full batch has accumulated, tag it and emit the results
                    if len(file_names) == self.batch_size:
                        batch = self.tokenizer(contents, padding=True, truncation=True, max_length=self.MAX_LENGTH,
                                               return_tensors="pt", return_token_type_ids=False)
                        langs = torch.argmax(self(batch["input_ids"].to(self.device),
                                                  batch["attention_mask"].to(self.device))["seq_logits"],
                                             dim=-1)[:, 0].tolist()
                        del batch
                        torch.cuda.empty_cache()

                        if tag_config["write_output_to"] is None:
                            general_output.extend([{"f": fil, "l": self.config["id_to_lang"][lan]}
                                                   for fil, lan in zip(file_names, langs)])
                        elif tag_config["output_tsv"]:
                            for fil, lan in zip(file_names, langs):
                                tag_config["write_output_to"].write(fil)
                                tag_config["write_output_to"].write("\t")
                                tag_config["write_output_to"].write(self.config["id_to_lang"][lan])
                                tag_config["write_output_to"].write("\n")
                        else:
                            for fil, lan in zip(file_names, langs):
                                json.dump({"f": fil, "l": self.config["id_to_lang"][lan]}, tag_config["write_output_to"])
                                tag_config["write_output_to"].write("\n")
                        file_names = []
                        contents = []

                    # Read at most the first 3000 characters of the current file
                    content = None
                    try:
                        with open(input_path, "r") as ff:
                            content = ff.read(3000).replace("\n", " ").replace("\r", "")
                    except (OSError, UnicodeDecodeError):
                        pass
                    if content is not None:
                        file_names.append(input_path)
                        contents.append(content)

            # Tag the last, possibly partial batch
            if len(file_names) > 0:
                batch = self.tokenizer(contents, padding=True, truncation=True, max_length=self.MAX_LENGTH,
                                       return_tensors="pt", return_token_type_ids=False)
                langs = torch.argmax(self(batch["input_ids"].to(self.device),
                                          batch["attention_mask"].to(self.device))["seq_logits"],
                                     dim=-1)[:, 0].tolist()
                del batch
                torch.cuda.empty_cache()

                if tag_config["write_output_to"] is None:
                    general_output.extend([{"f": fil, "l": self.config["id_to_lang"][lan]}
                                           for fil, lan in zip(file_names, langs)])
                elif tag_config["output_tsv"]:
                    for fil, lan in zip(file_names, langs):
                        tag_config["write_output_to"].write(fil)
                        tag_config["write_output_to"].write("\t")
                        tag_config["write_output_to"].write(self.config["id_to_lang"][lan])
                        tag_config["write_output_to"].write("\n")
                else:
                    for fil, lan in zip(file_names, langs):
                        json.dump({"f": fil, "l": self.config["id_to_lang"][lan]}, tag_config["write_output_to"])
                        tag_config["write_output_to"].write("\n")

            return general_output if len(general_output) > 0 else None

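fast_mode accumulates file contents until `batch_size` is reached and then tags the whole batch at once. The batching itself is independent of the model; a standalone sketch of the same chunking logic:

```python
def batchify(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

print(batchify(["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"], 2))
# → [['a.txt', 'b.txt'], ['c.txt', 'd.txt'], ['e.txt']]
```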
        # Identify languages for all text files under input_directory
        if "input_directory" in tag_config:
            general_output = []

            for dir_path, _, files in os.walk(tag_config["input_directory"]):
                for f in files:
                    input_path = os.path.join(dir_path, f)

                    file_content = self._check_if_text_file_and_return_content(input_path)

                    if type(file_content) == str:
                        file_content = self._preprocess_text(file_content)
                        new_inp = None
                        if tag_config["one_sentence_per_line"]:
                            inp = [i for i in file_content.split("\n") if i != ""]
                            out = self.identify_language_sentence_list(inp, **tag_config)
                        else:
                            inp = self.split_sentences(file_content, **tag_config)
                            out = self.identify_language_sentence_list(inp, **tag_config)
                            new_inp = [self.tokenizer.decode(i[1:]).split("[SEP]")[0].strip() for i in inp]

                        if new_inp is not None:
                            inp = new_inp

                        # If no output pipe is available, write to output_directory
                        if tag_config["write_output_to"] is None:
                            if "output_directory" in tag_config:
                                out_path = os.path.join(tag_config["output_directory"],
                                                        os.path.relpath(dir_path, tag_config["input_directory"]),
                                                        f + ".lang")
                                os.makedirs(os.path.dirname(out_path), exist_ok=True)
                                with open(out_path, "w") as opened_file:
                                    if tag_config["lang_per_sentence"]:
                                        if tag_config["output_tsv"]:
                                            for sen, lan in zip(inp, out):
                                                opened_file.write(sen)
                                                opened_file.write("\t")
                                                opened_file.write(lan)
                                                opened_file.write("\n")
                                        else:
                                            json.dump([{"s": sen, "l": lan} for sen, lan in zip(inp, out)], opened_file)
                                    else:
                                        if tag_config["output_tsv"]:
                                            opened_file.write(out[0])
                                        else:
                                            json.dump({"l": out[0]}, opened_file)
                            else:
                                if tag_config["lang_per_sentence"]:
                                    general_output.extend([{"s": sen, "l": lan} for sen, lan in zip(inp, out)])
                                else:
                                    general_output.append({"f": input_path, "l": out[0]})

                        # If there is an already opened pipe
                        else:
                            if tag_config["lang_per_sentence"]:
                                if tag_config["output_tsv"]:
                                    for sen, lan in zip(inp, out):
                                        tag_config["write_output_to"].write(sen)
                                        tag_config["write_output_to"].write("\t")
                                        tag_config["write_output_to"].write(lan)
                                        tag_config["write_output_to"].write("\n")
                                    tag_config["write_output_to"].write("\n")
                                else:
                                    json.dump([{"s": sen, "l": lan} for sen, lan in zip(inp, out)], tag_config["write_output_to"])
                                    tag_config["write_output_to"].write("\n")
                            else:
                                if tag_config["output_tsv"]:
                                    tag_config["write_output_to"].write(input_path)
                                    tag_config["write_output_to"].write("\t")
                                    tag_config["write_output_to"].write(out[0])
                                    tag_config["write_output_to"].write("\n")
                                else:
                                    json.dump({"f": input_path, "l": out[0]}, tag_config["write_output_to"])
                                    tag_config["write_output_to"].write("\n")

                    # The file could not be read as text: report an error entry instead
                    else:
                        if tag_config["write_output_to"] is None:
                            general_output.append({"f": input_path, "l": "err"})
                        elif tag_config["output_tsv"]:
                            tag_config["write_output_to"].write(input_path)
                            tag_config["write_output_to"].write("\t")
                            tag_config["write_output_to"].write("err")
                            tag_config["write_output_to"].write("\n")
                        else:
                            json.dump({"f": input_path, "l": "err"}, tag_config["write_output_to"])
                            tag_config["write_output_to"].write("\n")

            if tag_config["write_output_to"] and tag_config["write_output_to"] != sys.stdout \
                    and tag_config["write_output_to"] != sys.stderr:
                tag_config["write_output_to"].close()

            return general_output if len(general_output) > 0 else None

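When both input_directory and output_directory are given, each result file mirrors the source file's position in the input tree, with a `.lang` suffix appended. The path mapping rests on `os.path.relpath`; a standalone sketch of that mapping:

```python
import os

def mirrored_output_path(input_dir, dir_path, file_name, output_dir, suffix=".lang"):
    """Map a file found under input_dir to the corresponding path under output_dir."""
    rel = os.path.relpath(dir_path, input_dir)
    return os.path.join(output_dir, rel, file_name + suffix)

print(mirrored_output_path("/data/in", "/data/in/a/b", "doc.txt", "/data/out"))
```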
        if inp is None:
            pass

        elif type(inp) == str:
            new_inp = None

            # If split_sentences is set, split the string into sentences first
            if tag_config["split_sentences"]:
                inp = self._preprocess_text(inp)
                inp = self.split_sentences(inp, **tag_config)
                new_inp = [self.tokenizer.decode(i[1:]).strip() for i in inp]
                if tag_config["lang_per_sentence"]:
                    tag_config["lang_per_item"] = True

            # If the string contains one sentence per line
            elif tag_config["one_sentence_per_line"]:
                inp = [i for i in inp.split("\n") if i != ""]
                inp = [self._preprocess_text(i) for i in inp if i != ""]
                if tag_config["lang_per_sentence"]:
                    tag_config["lang_per_item"] = True

            # Otherwise identify the language of the input string as a whole
            else:
                inp = [self._preprocess_text(inp)]

            # Identify the language
            out = self.identify_language_sentence_list(inp, **tag_config)

            if new_inp is not None:
                inp = new_inp

            # If the result is returned as a list
            if tag_config["write_output_to"] is None:
                return [{"s": sen, "l": lan} for sen, lan in zip(inp, out)]

            if tag_config["output_tsv"]:
                for sen, lan in zip(inp, out):
                    tag_config["write_output_to"].write(sen)
                    tag_config["write_output_to"].write("\t")
                    tag_config["write_output_to"].write(lan)
                    tag_config["write_output_to"].write("\n")
            else:
                json.dump([{"s": sen, "l": lan} for sen, lan in zip(inp, out)], tag_config["write_output_to"])

            return

        # Identify the language of one sentence per list item
        elif type(inp) == list:
            inp = [i.strip() for i in inp]
            inp = [self._preprocess_text(i) for i in inp if i != ""]
            out = self.identify_language_sentence_list(inp, **tag_config)

            # If the result is returned as a list
            if tag_config["write_output_to"] is None:
                return [{"s": sen, "l": lan} for sen, lan in zip(inp, out)]

            if tag_config["output_tsv"]:
                for sen, lan in zip(inp, out):
                    tag_config["write_output_to"].write(sen)
                    tag_config["write_output_to"].write("\t")
                    tag_config["write_output_to"].write(lan)
                    tag_config["write_output_to"].write("\n")
            else:
                json.dump([{"s": sen, "l": lan} for sen, lan in zip(inp, out)], tag_config["write_output_to"])

            return

special_tokens_map.json
ADDED

{"bos_token": "[BOS]", "eos_token": "[EOS]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tagger_config.json
ADDED

The diff for this file is too large to render. See raw diff.

tokenizer.json
ADDED

The diff for this file is too large to render. See raw diff.
tokenizer_config.json
ADDED

{
    "tokenizer_class": "PreTrainedTokenizerFast"
}