Commit 7d46aa7 by Ahmet Yildirim · Parent: fab0b3a

Initial commit
README.md CHANGED
Removed the previous contents:

```
---
license: mit
---
```
---
license: mit
language: 'no'
base_model: ltg/norbert3-base
tags:
- norsk
- nynorsk
- bokmål
- språkidentifikasjon
- morfologisk_tagging
- setningsgrensedeteksjon
---

# Humit-Tagger Base

The official release of the Norwegian morphology tagger Humit-Tagger as a Hugging Face model.

This specific version of the tagger is based on norbert3-base.

The aim of this model is to make Humit-Tagger available as a Hugging Face model with all the functionality that the [original code](https://github.com/humit-oslo/humit-tagger) supports.
In addition to morphological tagging, this model supports Nynorsk/Bokmål language identification as provided by this [repository](https://github.com/humit-oslo/humit-sprakidentifikator).

This model adds four classification layers on top of the base model.
These layers perform sentence boundary detection, morphological classification, lemmatization classification, and language identification.

The large version scores overall about 1% higher in accuracy than the base version.
Depending on the available CPU/GPU power, one of the following sizes can be used:

## Humit-Tagger sizes

The Humit-Tagger sizes follow the sizes of [Norbert3](https://huggingface.co/ltg/norbert3-base).

- [humit-tagger-xs (15M)](https://huggingface.co/Humit-Oslo/humit-tagger-xs)
- [humit-tagger-small (40M)](https://huggingface.co/Humit-Oslo/humit-tagger-small)
- [humit-tagger-base (123M)](https://huggingface.co/Humit-Oslo/humit-tagger-base)
- [humit-tagger-large (323M)](https://huggingface.co/Humit-Oslo/humit-tagger-large)

## Loading Model

This model implements custom functionality such as the `tag` and `identify_language` functions, along with the helper functions they rely on.
To provide this functionality, the model uses a custom wrapper.
Therefore, the model must be loaded with `trust_remote_code=True`.

The model can be loaded as follows:

```python
from transformers import AutoModel

humit_tagger = AutoModel.from_pretrained("Humit-Oslo/humit-tagger-base", trust_remote_code=True)
```

## Functions and parameters

The model provides two functions: `tag` and `identify_language`.
The `tag` function performs morphological tagging of the input.
The `identify_language` function identifies the language of the input as "nn" for Nynorsk or "bm" for Bokmål.
These functions accept similar parameters.

| parameter | `.tag` supports | `.identify_language` supports | options | default | description |
| :--- | :- | :- | :- | :- | :- |
| inp | yes | yes | | None | The input. The parameter name can be omitted if it is passed as the first positional argument. |
| lang | yes | no | "nn", "bm", "au" | "au" | The language of the tags. "au" tries to identify the language automatically from the input. |
| input_directory | yes | yes | | None | Apply the function recursively to the files in input_directory. |
| output_directory | yes | yes | | None | Write the output recursively into output_directory. The written files get the extension ".tagged" or ".lang", depending on the function called. |
| one_sentence_per_line | yes | yes | True / False | False | Skip sentence boundary detection and treat each line of the input or the input file(s) as one sentence. |
| lang_per_sentence | yes | no | True / False | False | Identify the language per sentence and output the tags according to the language identified for that sentence. If this is not set and lang is "au", the whole input (or a whole file if input_directory is used) is used to identify the language. |
| write_output_to | yes | yes | a file path, a file handle, or "list" | sys.stdout | Where to write the output. If a file path is provided, the output is written to that file, overwriting it. If a file handle is provided, the output is written there. If "list" is given, the function returns a Python list. |
| output_tsv | yes | yes | True / False | False | The output format. The default is JSON; with multiple sentences, each line is a single valid JSON document, but the output as a whole is not. This option cannot be combined with write_output_to="list". |
| lang_per_item | no | yes | True / False | False | Treat each item of an input list as a separate input for language identification. |
| fast_mode | no | yes | True / False | False | Identify the languages of the files in the input directory in fast mode, which only uses the beginning of each file. This is much faster for many files, but less accurate than with this parameter set to False. |
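The `write_output_to` behaviour described in the table can be sketched as follows. This is a simplified illustration only; `dispatch_output` and `rows` are hypothetical names, not part of the tagger's actual API:

```python
import sys

def dispatch_output(rows, write_output_to=sys.stdout):
    """Write output rows to a path, a file handle, or return them as a list.

    Sketch of the dispatch described in the table above, not the
    tagger's real implementation.
    """
    if write_output_to == "list":
        # return the output as a Python list
        return list(rows)
    if isinstance(write_output_to, str):
        # a file path: the file is overwritten
        with open(write_output_to, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(row + "\n")
        return None
    # otherwise assume a file-like handle (e.g. sys.stdout)
    for row in rows:
        write_output_to.write(row + "\n")
    return None
```

Note that under this scheme the "list" case is the only one with a return value, which is why `output_tsv` cannot be combined with `write_output_to="list"`.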

## Several example use cases

### Tag one sentence
```python
humit_tagger.tag("Dette er en norsk setning.")
```

### Tag a list of sentences
```python
humit_tagger.tag(["Dette er en norsk setning.", "Dette er en annen norsk setning."])
```

### Tag a file
```python
with open("path/to/file", "r") as f:
    humit_tagger.tag(f.read())
```

### Tag all files recursively in a directory

Here, input_directory and output_directory must be given as parameters.
Every file that can be read in text mode is tagged, and the output is written to output_directory with the same directory and sub-directory structure.
The output files keep the same names with ".tagged" appended.
Any existing files are overwritten.

```python
humit_tagger.tag(input_directory="path/to/input/directory", output_directory="path/to/output/directory")
```
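The recursive behaviour described above can be sketched roughly like this. It is a minimal stand-in that only mirrors the directory and extension conventions described in the text; `tag_directory` and `tag_text` are hypothetical names, not the model's actual implementation:

```python
import os

def tag_directory(tag_text, input_directory, output_directory):
    """Recreate input_directory's structure under output_directory,
    writing each readable text file as "<name>.tagged".

    tag_text is any callable mapping text to tagged text; this sketch
    is not the tagger's real code.
    """
    for root, _dirs, files in os.walk(input_directory):
        rel = os.path.relpath(root, input_directory)
        out_root = os.path.join(output_directory, rel)
        os.makedirs(out_root, exist_ok=True)
        for name in files:
            in_path = os.path.join(root, name)
            try:
                with open(in_path, "r", encoding="utf-8") as f:
                    text = f.read()
            except (UnicodeDecodeError, OSError):
                continue  # skip files that cannot be read in text mode
            out_path = os.path.join(out_root, name + ".tagged")
            with open(out_path, "w", encoding="utf-8") as f:  # overwrites
                f.write(tag_text(text))
```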

### Language identification
```python
humit_tagger.identify_language("Eg elskar snø.")
```

### Language identification of multiple sentences
```python
humit_tagger.identify_language(["Jeg elsker snø.", "Eg elskar snø."])
```

### Recursive language identification of all files in a directory
```python
humit_tagger.identify_language(input_directory="../inp")
```

## Cite us

```bibtex
@inproceedings{haug-etal-2023-integrating,
    title = "Rules and neural nets for morphological tagging of {N}orwegian - Results and challenges",
    author = "Haug, Dag and
      Yildirim, Ahmet and
      Hagen, Kristin and
      N{\o}klestad, Anders",
    editor = {Alum{\"a}e, Tanel and
      Fishel, Mark},
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.43/",
    pages = "425--435",
    abstract = "This paper reports on efforts to improve the Oslo-Bergen Tagger for Norwegian morphological tagging. We train two deep neural network-based taggers using the recently introduced Norwegian pre-trained encoder (a BERT model for Norwegian). The first network is a sequence-to-sequence encoder-decoder and the second is a sequence classifier. We test both these configurations in a hybrid system where they combine with the existing rule-based system, and on their own. The sequence-to-sequence system performs better in the hybrid configuration, but the classifier system performs so well that combining it with the rules is actually slightly detrimental to performance."
}
```
__init__.py ADDED
File without changes
config.json ADDED

```json
{
    "architectures": [
        "HumitTaggerModel"
    ],
    "auto_map": {
        "AutoConfig": "configuration_humit_tagger.HumitTaggerConfig",
        "AutoModel": "modeling_humit_tagger.HumitTaggerModel"
    },
    "humit_tagger_configuration": "tagger_config.json",
    "lemma_rules_py_file": "lemma_rule.py"
}
```
configuration_humit_tagger.py ADDED

```python
from transformers.configuration_utils import PretrainedConfig


class HumitTaggerConfig(PretrainedConfig):
    """Configuration class to store the configuration of a `HumitTaggerModel`."""

    def __init__(
        self,
        **kwargs,
    ):
        super().__init__(**kwargs)
```
lemma_rule.py ADDED

```python
# Script that implements word-lemma conversions and rule extraction.
# Most of the code has been taken from: https://github.com/hplt-project/HPLT-WP4/blob/main/evaluation/ud/lemma_rule.py
# This is a class with static members

import pickle

class LemmaHandling:
    lemma_dict = dict()
    lemma_list = list()
    lemma_list_inverted = dict()
    word_classes = dict()

    def __init__(self):
        pass

    def min_edit_script(source, target, allow_copy):
        a = [[(len(source) + len(target) + 1, None)] * (len(target) + 1) for _ in range(len(source) + 1)]
        for i in range(0, len(source) + 1):
            for j in range(0, len(target) + 1):
                if i == 0 and j == 0:
                    a[i][j] = (0, "")
                else:
                    if allow_copy and i and j and source[i - 1] == target[j - 1] and a[i-1][j-1][0] < a[i][j][0]:
                        a[i][j] = (a[i-1][j-1][0], a[i-1][j-1][1] + "→")
                    if i and a[i-1][j][0] < a[i][j][0]:
                        a[i][j] = (a[i-1][j][0] + 1, a[i-1][j][1] + "-")
                    if j and a[i][j-1][0] < a[i][j][0]:
                        a[i][j] = (a[i][j-1][0] + 1, a[i][j-1][1] + "+" + target[j - 1])
        return a[-1][-1][1]

    def gen_lemma_rule(form, lemma, allow_copy):
        best, best_form, best_lemma = 0, 0, 0
        for l in range(len(lemma)):
            for f in range(len(form)):
                cpl = 0
                while f + cpl < len(form) and l + cpl < len(lemma) and form[f + cpl].lower() == lemma[l + cpl].lower():
                    cpl += 1
                if cpl > best:
                    best = cpl
                    best_form = f
                    best_lemma = l

        if not best:
            return {"case": None, "prefix": None, "suffix": None, "absolute": "a" + lemma}

        prefix_rule = LemmaHandling.min_edit_script(form[:best_form].lower(), lemma[:best_lemma].lower(), allow_copy)
        suffix_rule = LemmaHandling.min_edit_script(form[best_form + best:].lower(), lemma[best_lemma + best:].lower(), allow_copy)

        if lemma.islower():
            return {"case": "lower", "prefix": prefix_rule, "suffix": suffix_rule, "absolute": "relative"}

        generated_lemma = LemmaHandling.apply_lemma_rule(form, {"case": "lower", "prefix": prefix_rule, "suffix": suffix_rule, "absolute": "relative"}, apply_casing=False)
        if generated_lemma == lemma:
            return {"case": "keep", "prefix": prefix_rule, "suffix": suffix_rule, "absolute": "relative"}

        previous_case = -1
        lemma_casing = ""
        for i, c in enumerate(lemma):
            case = "↑" if c.lower() != c else "↓"
            if case != previous_case:
                lemma_casing += "{}{}{}".format("¦" if lemma_casing else "", case, i if i <= len(lemma) // 2 else i - len(lemma))
            previous_case = case

        return {"case": lemma_casing, "prefix": prefix_rule, "suffix": suffix_rule, "absolute": "relative"}

    def apply_lemma_rule(form, lemma_rule, apply_casing=True):
        if lemma_rule["absolute"].startswith("a"):
            return lemma_rule["absolute"][1:]

        if any(rule is None for rule in lemma_rule.values()):
            return form

        rules, rule_sources = (lemma_rule["prefix"], lemma_rule["suffix"]), []
        for rule in rules:
            source, i = 0, 0
            while i < len(rule):
                if rule[i] == "→" or rule[i] == "-":
                    source += 1
                else:
                    assert rule[i] == "+"
                    i += 1
                i += 1
            rule_sources.append(source)

        try:
            lemma, form_offset = "", 0
            for i in range(2):
                j, offset = 0, (0 if i == 0 else len(form) - rule_sources[1])
                while j < len(rules[i]):
                    if rules[i][j] == "→":
                        lemma += form[offset]
                        offset += 1
                    elif rules[i][j] == "-":
                        offset += 1
                    else:
                        assert rules[i][j] == "+"
                        lemma += rules[i][j + 1]
                        j += 1
                    j += 1
                if i == 0:
                    lemma += form[rule_sources[0] : len(form) - rule_sources[1]]
        except:
            lemma = form

        if not apply_casing:
            return lemma

        if lemma_rule["case"] == "lower":
            return lemma.lower()
        elif lemma_rule["case"] == "keep":
            return lemma

        lemma = lemma.lower()
        for rule in lemma_rule["case"].split("¦"):
            if rule == "↓0": continue  # The lemma is lowercased initially
            if not rule: continue  # An empty lemma might generate an empty casing rule
            case, offset = rule[0], int(rule[1:])
            lemma = lemma[:offset] + (lemma[offset:].upper() if case == "↑" else lemma[offset:].lower())

        return lemma

    # Extracts the lemma rule for a word and its lemma, and adds the rule to the lemma rules dictionary if it does not exist yet
    def add_lemma_rule_to_dict(word, lemma, word_class=None):
        r = LemmaHandling.gen_lemma_rule(word, lemma, True)
        st = [r['case'], r['prefix'], r['suffix'], r['absolute']]
        st = ";".join(["§" if i is None else i for i in st])
        if st not in LemmaHandling.lemma_dict:
            LemmaHandling.lemma_dict[st] = r
        if word_class is None:
            word_class = "ukjent"
        if st not in LemmaHandling.word_classes:
            LemmaHandling.word_classes[st] = [word_class]
        else:
            LemmaHandling.word_classes[st].append(word_class)
            LemmaHandling.word_classes[st] = sorted(list(set(LemmaHandling.word_classes[st])))

    # This function initializes the lemma rule dictionary and lists
    def start_lemma_rule_extraction():
        LemmaHandling.lemma_list = []
        LemmaHandling.lemma_list_inverted = {}
        LemmaHandling.lemma_dict = {}

    # This function builds lemma_list from lemma_dict
    def done_lemma_list_extraction():
        LemmaHandling.lemma_list = ["[NONE]"] + list(LemmaHandling.lemma_dict.keys())
        LemmaHandling.lemma_list_inverted = {j: i for i, j in enumerate(LemmaHandling.lemma_list)}

    # This saves the lemma rules to a file
    def save_lemma_rules(file_name):
        with open(file_name, "wb") as fil:
            pickle.dump([LemmaHandling.lemma_dict, LemmaHandling.lemma_list, LemmaHandling.word_classes], fil)

    # This function loads an already saved rules file
    def load_lemma_rules(dict_file):
        with open(dict_file, 'rb') as fil:
            LemmaHandling.lemma_dict, LemmaHandling.lemma_list, LemmaHandling.word_classes = pickle.load(fil)
        LemmaHandling.lemma_list_inverted = {j: i for i, j in enumerate(LemmaHandling.lemma_list)}

    # This function loads lemma rules from an object
    def load_lemma_rules_from_obj(obj):
        LemmaHandling.lemma_dict, LemmaHandling.lemma_list, LemmaHandling.word_classes = obj
        LemmaHandling.lemma_list_inverted = {j: i for i, j in enumerate(LemmaHandling.lemma_list)}

    # This returns the lemma given the word and its rule index
    # If the index is not found, returns the word as the lemma
    def get_lemma_and_word_classes_given_word_and_lemma_list_index(word, lemma_list_index):
        if lemma_list_index >= len(LemmaHandling.lemma_dict):
            return word
        st = LemmaHandling.lemma_list[lemma_list_index]
        return LemmaHandling.apply_lemma_rule(word, LemmaHandling.lemma_dict[st], apply_casing=True), LemmaHandling.word_classes[st]

    # Same as above, but without word classes
    def get_lemma_given_word_and_lemma_list_index(word, lemma_list_index):
        if lemma_list_index >= len(LemmaHandling.lemma_dict) or lemma_list_index == 0:
            return word
        return LemmaHandling.apply_lemma_rule(word, LemmaHandling.lemma_dict[LemmaHandling.lemma_list[lemma_list_index]], apply_casing=True)

    # This function returns the lemma rule index given a word and a lemma
    def get_lemma_rule_index(word, lemma):
        r = LemmaHandling.gen_lemma_rule(word, lemma, True)
        st = [r['case'], r['prefix'], r['suffix'], r['absolute']]
        st = ";".join(["§" if i is None else i for i in st])
        if st not in LemmaHandling.lemma_dict:
            return 0
        return LemmaHandling.lemma_list_inverted[st]
```
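The module above encodes each lemma as an edit script relative to the word form and indexes the resulting rule strings, so one rule can be shared by every word pair with the same inflection pattern. A much-simplified, self-contained illustration of the same idea follows; it uses only a suffix rule, whereas the real module's rules are richer, with prefix edits and casing information:

```python
def gen_suffix_rule(form, lemma):
    """Encode a lemma as (number of trailing characters to cut from the
    word form, suffix to append).

    A deliberately simplified sketch of the "relative" lemma-rule idea
    used in lemma_rule.py, not the module's actual rule format.
    """
    common = 0
    while common < min(len(form), len(lemma)) and form[common] == lemma[common]:
        common += 1
    return (len(form) - common, lemma[common:])

def apply_suffix_rule(form, rule):
    # cut the given number of characters, then append the suffix
    cut, suffix = rule
    return form[:len(form) - cut] + suffix

# One rule covers every word pair with the same inflection pattern:
rule = gen_suffix_rule("setningar", "setning")  # (2, "")
```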
model.safetensors ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:0e5ea6c6918d8598ecec750d5f71d6da616db0cc4e8dda0409ed56098463edd9
size 496451528
```
modeling_humit_tagger.py ADDED
@@ -0,0 +1,969 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from transformers import (
2
+ AutoModel,
3
+ AutoTokenizer
4
+ )
5
+ import torch
6
+ from huggingface_hub import hf_hub_download
7
+ import os
8
+ import importlib.util
9
+ import sys
10
+ import shutil
11
+ from safetensors.torch import load_model
12
+ import json
13
+ import re
14
+ import copy
15
+
16
+ class HumitTaggerModel(torch.nn.Module):
17
+
18
+ # We do not need to do anything to register our class as this class will only be used
19
+ # for easily getting humit-tagger worki
20
+ def register_for_auto_class(auto_class):
21
+ pass
22
+ return
23
+
24
+ # Define our own from-pretrained to load the weights and other files needed for the tagger to work
25
+ def from_pretrained(repo_name, **kwargs):
26
+
27
+ # Download this model's config:
28
+ this_model_config_path = hf_hub_download(repo_id=repo_name, filename=kwargs["config"].humit_tagger_configuration)
29
+
30
+ # load this model's config
31
+ with open(this_model_config_path,"r") as js:
32
+ kwargs["this_model_config"]=json.load(js)
33
+
34
+
35
+ # Download this model's config:
36
+ lemma_rules_path = hf_hub_download(repo_id=repo_name, filename=kwargs["config"].lemma_rules_py_file)
37
+
38
+ # load lemma rules class
39
+ sys.path.append(os.path.dirname(lemma_rules_path))
40
+ spec = importlib.util.spec_from_file_location("lemma_rules", lemma_rules_path)
41
+ lemma_rules = importlib.util.module_from_spec(spec)
42
+ sys.modules["lemma_rules"] = lemma_rules
43
+ spec.loader.exec_module(lemma_rules)
44
+
45
+ # Download base_model files into cache
46
+ base_config_file = hf_hub_download(repo_id=kwargs["this_model_config"]["base_model"], filename=kwargs["this_model_config"]["base_model_config_file"])
47
+ base_model_file = hf_hub_download(repo_id=kwargs["this_model_config"]["base_model"], filename=kwargs["this_model_config"]["base_model_model_file"])
48
+ base_model_config_json_file = hf_hub_download(repo_id=kwargs["this_model_config"]["base_model"], filename=kwargs["this_model_config"]["base_model_config_json_file"])
49
+
50
+ # Copy base model's configuration python file into our working directory
51
+ config_file_path = os.path.join(os.path.dirname(os.path.abspath(__file__)) , os.path.basename(base_config_file))
52
+ shutil.copyfile(base_config_file, config_file_path)
53
+
54
+ # HACK: Modify base model main file since __init.py__ has already been read and the new file must not contain relative imports
55
+ base_model_file_path = os.path.join(os.path.dirname(os.path.abspath(__file__)) , os.path.basename(base_model_file))
56
+ with open(base_model_file, 'r') as file:
57
+ file_content = file.read().replace("from .", "from ")
58
+ with open(base_model_file_path, 'w') as file:
59
+ file.write(file_content)
60
+
61
+ # Register the new files:
62
+ # First register the base model config file
63
+ sys.path.append(os.path.dirname(config_file_path))
64
+ spec = importlib.util.spec_from_file_location("base_config", config_file_path)
65
+ base_config = importlib.util.module_from_spec(spec)
66
+ sys.modules["base_config"] = base_config
67
+ spec.loader.exec_module(base_config)
68
+ # Then register the base model file
69
+ sys.path.append(os.path.dirname(base_model_file_path))
70
+ spec = importlib.util.spec_from_file_location("base_model", base_model_file_path)
71
+ base_model = importlib.util.module_from_spec(spec)
72
+ sys.modules["base_model"] = base_model
73
+ spec.loader.exec_module(base_model)
74
+
75
+ # Download model weights
76
+ model_weights_path = hf_hub_download(repo_id=repo_name, filename=kwargs["this_model_config"]["model_weights"])
77
+
78
+ # load base model config
79
+ with open(base_model_config_json_file,"r") as js:
80
+ kwargs["base_model_json_cfg"] = json.load(js)
81
+
82
+ kwargs["model_weights_path"] = model_weights_path
83
+ kwargs["repo_name"] = repo_name
84
+ return HumitTaggerModel(**kwargs)
85
+
86
+ def __init__(self, **kwargs ):
87
+ super(HumitTaggerModel, self).__init__()
88
+ json_cfg = kwargs["base_model_json_cfg"]
89
+ self.config=kwargs["this_model_config"]
90
+ self.LemmaHandling = sys.modules["lemma_rules"].LemmaHandling
91
+ self.LemmaHandling.load_lemma_rules_from_obj(self.config["lemma_rules"])
92
+ cfg=sys.modules["base_config"].NorbertConfig(**json_cfg)
93
+ self.bert=sys.modules["base_model"].NorbertModel(cfg, pooling_type="CLS")
94
+ self.dropout = torch.nn.Dropout(self.bert.config.hidden_dropout_prob)
95
+ self.classifier1 = torch.nn.Linear(self.bert.config.hidden_size, self.config["num_labels1"])
96
+ self.classifier2 = torch.nn.Linear(self.bert.config.hidden_size, self.config["num_labels2"])
97
+ self.classifier3 = torch.nn.Linear(self.bert.config.hidden_size, self.config["num_labels3"])
98
+ self.seq_classifier = torch.nn.Linear(self.bert.config.hidden_size, self.config["num_labels_seq"])
99
+ self.ignore_index = self.config["ignore_index"]
100
+ load_model(self, kwargs["model_weights_path"])
101
+ self.tokenizer=AutoTokenizer.from_pretrained(kwargs["repo_name"])
102
+ if "batch_size" in kwargs:
103
+ self.batch_size=kwargs["batch_size"]
104
+ else:
105
+ self.batch_size=8
106
+
107
+ if "device" in kwargs:
108
+ self.device = torch.device(kwargs["device"])
109
+ else:
110
+ self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
111
+
112
+ self.MAX_LENGTH_WITHOUT_CLS = self.bert.config.max_position_embeddings -1
113
+ self.tags=self.config["tags"]
114
+ self.tags_str=[[" ".join(i) for i in self.config["tags"][0]], [" ".join(i) for i in self.config["tags"][1]]]
115
+ self.to(self.device)
116
+ self.REPLACE_DICT = self.config["replace_dict"]
117
+ self.REPLACE_PATTERN = '|'.join(sorted(re.escape(k) for k in self.REPLACE_DICT))
118
+ self.MAX_LENGTH = self.bert.config.max_position_embeddings
119
+
120
+ def forward(self, input_ids=None, attention_mask=None ):
121
+ outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, return_dict=True )
122
+ sequence_output = self.dropout(outputs.last_hidden_state)
123
+ logits1 = self.classifier1(sequence_output)
124
+ logits2 = self.classifier2(sequence_output)
125
+ logits3 = self.classifier3(sequence_output)
126
+ seq_logits = self.seq_classifier(sequence_output)
127
+ total_loss = 0
128
+ return {
129
+ "logits1": logits1,
130
+ "logits2": logits2,
131
+ "logits3": logits3,
132
+ "seq_logits": seq_logits,
133
+ }
134
+
135
+ def _preprocess_text(self,text):
136
+ new_text = re.sub(self.REPLACE_PATTERN, lambda m: self.REPLACE_DICT.get(m.group(0).upper()), text)
137
+ while new_text != text:
138
+ text = new_text
139
+ new_text = re.sub(self.REPLACE_PATTERN, lambda m: self.REPLACE_DICT.get(m.group(0).upper()), text)
140
+ return new_text
141
+
142
+ def _batchify(self, lst):
143
+
144
+ # Create batches
145
+ batched_sentences=[]
146
+ my_batch=[]
147
+ for sentence in lst:
148
+ sentence.append(self.tokenizer.sep_token_id)
149
+ my_batch.append(sentence)
150
+ if len(my_batch)==self.batch_size:
151
+ max_len=len(max(my_batch, key=len))
152
+ if max_len > self.MAX_LENGTH:
153
+ max_len = self.MAX_LENGTH
154
+ my_attentions=torch.LongTensor([[1] * len(i[0:max_len]) + [0]*(max_len-len(i[0:max_len])) for i in my_batch]).to("cpu")
155
+ my_batch=[i[0:max_len] + [0]*(max_len-len(i[0:max_len])) for i in my_batch]
156
+ to_append={
157
+ "input_ids": torch.LongTensor(my_batch).to("cpu"),
158
+ "attention_mask": my_attentions,
159
+ }
160
+ batched_sentences.append(to_append)
161
+ my_batch=[]
162
+ if len(my_batch)>0:
163
+ max_len=len(max(my_batch, key=len))
164
+ if max_len > self.MAX_LENGTH:
165
+ max_len = self.MAX_LENGTH
166
+ my_attentions=torch.LongTensor([[1] * len(i[0:max_len]) + [0]*(max_len-len(i[0:max_len])) for i in my_batch]).to("cpu")
167
+ my_batch=[i[0:max_len] + [0]*(max_len-len(i[0:max_len])) for i in my_batch]
168
+ to_append={
169
+ "input_ids": torch.LongTensor(my_batch).to("cpu"),
170
+ "attention_mask": my_attentions,
171
+ }
172
+ batched_sentences.append(to_append)
173
+
174
+ torch.cuda.empty_cache()
175
+
176
+ return batched_sentences
177
+
178
+ def _split_sentences(self, inp):
179
+
180
+ # Here we get the whole text tokenized.
181
+ encodings = self.tokenizer(inp,add_special_tokens=False, return_tensors="pt").to(self.device)
182
+
183
+ # Save a copy of the tokenization
184
+ original_encodings=copy.deepcopy(encodings)
185
+ original_encodings=original_encodings.to("cpu")
186
+ torch.cuda.empty_cache()
187
+
188
+ # Pad to the complete size (model max_size -1 (-1 to add CLS))
189
+ old_size=encodings["input_ids"][0].size()[0]
190
+
191
+ # Pad size
192
+ pad_size=self.MAX_LENGTH_WITHOUT_CLS - old_size % self.MAX_LENGTH_WITHOUT_CLS
193
+
194
+ # Number of rows
195
+ row_count=int(old_size/self.MAX_LENGTH_WITHOUT_CLS) + 1
196
+
197
+ # Do padding with pad_id to the pad_size that we have calculated.
198
+ encodings["input_ids"] = torch.nn.functional.pad(input=encodings["input_ids"], pad=(0, pad_size), mode="constant", value=self.tokenizer.pad_token_id)
199
+
200
+ # Set the last token as SENTENCE END (SEP)
201
+ encodings["input_ids"][0][old_size]=self.tokenizer.sep_token_id
202
+
203
+ # Chunk into max_length items
204
+ encodings["input_ids"]=torch.reshape(encodings["input_ids"],(row_count,self.MAX_LENGTH_WITHOUT_CLS))
205
+
206
+ # Add CLS to each item
207
+ encodings["input_ids"]=torch.cat(( torch.full((row_count,1), self.tokenizer.cls_token_id, device=self.device) ,encodings["input_ids"]),dim=1)
208
+
209
+ # Create attention mask
210
+ encodings["attention_mask"]=torch.ones_like(encodings["input_ids"], device=self.device)
211
+
212
+ # Create batches
213
+ input_ids_batched=torch.split(encodings["input_ids"], self.batch_size)
214
+ attention_mask_batched=torch.split(encodings["attention_mask"], self.batch_size)
215
+
216
+ # Set the last chunk's attention mask according to its size
217
+ attention_mask_batched[-1][-1][pad_size +1:] = 0
218
+
219
+ encodings=encodings.to("cpu")
220
+
221
+ # Now pass all chunks through the model and get the labels
222
+ # While passing, we count the number of bokmal and nynorsk markers
223
+ labels_output=[]
224
+
225
+ # First get them back to CPU to open space on GPU
226
+ input_ids_batched=[i.to("cpu") for i in input_ids_batched]
227
+ attention_mask_batched=[i.to("cpu") for i in attention_mask_batched]
228
+ torch.cuda.empty_cache()
229
+
230
+ for input_ids, attention_masks in zip(input_ids_batched, attention_mask_batched):
231
+ current_batch={"input_ids":input_ids.to(self.device).long(), "attention_mask":attention_masks.to(self.device).long()}
232
+ outputs = self(**current_batch)
233
+ del current_batch
234
+ torch.cuda.empty_cache()
235
+
236
+ label_data=outputs["logits1"].argmax(-1)
237
+ labels_output.extend(label_data)
238
+
239
+ # Serialize back
240
+ labels_output=torch.stack(labels_output ,dim=0)
241
+ labels_output=labels_output[:, range(1,self.MAX_LENGTH)]
242
+ labels_output=torch.reshape(labels_output,(1,row_count * self.MAX_LENGTH_WITHOUT_CLS))
243
+ torch.cuda.empty_cache()
244
+
245
+ # Now the data is split into sentences
246
+ # So, now create sentence data as list so that this could be used
247
+ # in torch operations and can be input to the models
248
+ sentence_list=[]
249
+ this_sentence=[self.tokenizer.cls_token_id]
250
+ for token, label in zip(original_encodings["input_ids"][0].tolist(), labels_output[0].tolist()):
251
+ if label==0:
252
+ this_sentence.append(token)
253
+ else:
254
+ this_sentence.append(token)
255
+ sentence_list.append(this_sentence)
256
+ this_sentence=[self.tokenizer.cls_token_id]
257
+
258
+ if len(this_sentence)>1:
259
+ sentence_list.append(this_sentence)
260
+ del original_encodings
261
+ del labels_output
262
+ del attention_mask_batched
263
+ del input_ids_batched
264
+ del encodings
265
+ del old_size
266
+ del inp
267
+ del outputs
268
+ torch.cuda.empty_cache()
269
+
270
+ return sentence_list
271
+
272
+ def _matcher(self, o):
273
+ return o.group(0)[0] + "\n\n" + o.group(0)[2]
274
+
275
+ def split_sentences(self, inp, **tag_config):
276
+ inp = [i.replace("\n"," ") for i in re.sub(r"[^.!\?](\n)([^a-z,æ,ø,å,\\ ])", self._matcher, inp).split("\n\n")]
277
+ sentences = []
278
+ for i in inp:
279
+ sentences.extend(self._split_sentences(i.strip()))
280
+ return sentences
281
+
282
+     def tag_sentence_list(self, lst, **tag_config):
+ 
+         # If the sentences are not tokenized yet, tokenize while batching:
+         tokenized_batches = []
+         if isinstance(lst[0], str):
+             for i in range(0, len(lst), self.batch_size):
+                 batch_texts = lst[i:i + self.batch_size]
+                 encoded_batch = self.tokenizer(batch_texts, padding=True, truncation=True, max_length=self.MAX_LENGTH, return_tensors="pt", return_token_type_ids=False)
+                 encoded_batch["input_ids"].to("cpu")
+                 encoded_batch["attention_mask"].to("cpu")
+                 tokenized_batches.append(encoded_batch)
+         # The sentences are already tokenized, so just batch them:
+         else:
+             tokenized_batches = self._batchify(lst)
+ 
+         # If the language is identified per sentence
+         if tag_config["lang_per_sentence"]:
+             id_to_lang = self.config["id_to_lang"]
+             # If the output goes to a Python list
+             if tag_config["write_output_to"] is None:
+                 all_tagged_sentences = []
+                 for batch in tokenized_batches:
+                     all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
+                     batch_tags = torch.argmax(all_out["logits2"], dim=-1)
+                     batch_lemmas = torch.argmax(all_out["logits3"], dim=-1)
+                     batch_langs = torch.argmax(all_out["seq_logits"], dim=-1)
+                     batch["input_ids"].to("cpu")
+                     batch["attention_mask"].to("cpu")
+ 
+                     for input_ids, tags, lemmas, lang in zip(batch["input_ids"].tolist(), batch_tags.tolist(),
+                                                              batch_lemmas.tolist(), batch_langs[:, 0].tolist()):
+                         this_sentence = []
+                         for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
+                             if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
+                                 break
+                             if lemma == 0:  # No lemma yet: the end of the word has not been reached
+                                 if len(this_sentence) > 0:
+                                     this_sentence[-1]["w"] += self.tokenizer.decode(inps)
+                                 else:
+                                     this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
+                             else:
+                                 this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
+                         all_tagged_sentences.append({"lang": id_to_lang[lang], "sent": [{"w": i["w"], "t": self.tags[lang][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]})
+ 
+                 return all_tagged_sentences
+ 
+             # If the output is TSV written to a pipe (stdout or a file handle)
+             elif tag_config["output_tsv"]:
+                 for batch in tokenized_batches:
+                     all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
+                     batch_tags = torch.argmax(all_out["logits2"], dim=-1)
+                     batch_lemmas = torch.argmax(all_out["logits3"], dim=-1)
+                     batch_langs = torch.argmax(all_out["seq_logits"], dim=-1)
+                     batch["input_ids"].to("cpu")
+                     batch["attention_mask"].to("cpu")
+ 
+                     for input_ids, tags, lemmas, lang in zip(batch["input_ids"].tolist(), batch_tags.tolist(),
+                                                              batch_lemmas.tolist(), batch_langs[:, 0].tolist()):
+                         this_sentence = []
+                         for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
+                             if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
+                                 break
+                             if lemma == 0:  # No lemma yet: the end of the word has not been reached
+                                 if len(this_sentence) > 0:
+                                     this_sentence[-1]["w"] += self.tokenizer.decode(inps)
+                                 else:
+                                     this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
+                             else:
+                                 this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
+                         this_sentence = [{"w": i["w"], "t": self.tags_str[lang][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]
+                         tag_config["write_output_to"].write(id_to_lang[lang])
+                         for lin in this_sentence:
+                             tag_config["write_output_to"].write("\t")
+                             tag_config["write_output_to"].write(lin["w"])
+                             tag_config["write_output_to"].write("\t")
+                             tag_config["write_output_to"].write(lin["l"])
+                             tag_config["write_output_to"].write("\t")
+                             tag_config["write_output_to"].write(lin["t"])
+                         tag_config["write_output_to"].write("\n")
+                         tag_config["write_output_to"].write("\n")
+ 
+             # If the output is JSON written to a pipe (stdout or a file handle)
+             else:
+                 for batch in tokenized_batches:
+                     all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
+                     batch_tags = torch.argmax(all_out["logits2"], dim=-1)
+                     batch_lemmas = torch.argmax(all_out["logits3"], dim=-1)
+                     batch_langs = torch.argmax(all_out["seq_logits"], dim=-1)
+                     batch["input_ids"].to("cpu")
+                     batch["attention_mask"].to("cpu")
+ 
+                     for input_ids, tags, lemmas, lang in zip(batch["input_ids"].tolist(), batch_tags.tolist(),
+                                                              batch_lemmas.tolist(), batch_langs[:, 0].tolist()):
+                         this_sentence = []
+                         for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
+                             if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
+                                 break
+                             if lemma == 0:  # No lemma yet: the end of the word has not been reached
+                                 if len(this_sentence) > 0:
+                                     this_sentence[-1]["w"] += self.tokenizer.decode(inps)
+                                 else:
+                                     this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
+                             else:
+                                 this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
+ 
+                         json.dump({"lang": id_to_lang[lang], "sent": [{"w": i["w"], "t": self.tags[lang][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]}, tag_config["write_output_to"])
+                         tag_config["write_output_to"].write("\n")
+ 
+         # If the language is given as a parameter
+         elif tag_config["lang"] != -1:
+             LANG = tag_config["lang"]
+             LANG_STR = self.config["id_to_lang"][LANG]
+             # If the output goes to a Python list
+             if tag_config["write_output_to"] is None:
+                 all_tagged_sentences = []
+                 for batch in tokenized_batches:
+                     all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
+                     batch_tags = torch.argmax(all_out["logits2"], dim=-1)
+                     batch_lemmas = torch.argmax(all_out["logits3"], dim=-1)
+                     batch["input_ids"].to("cpu")
+                     batch["attention_mask"].to("cpu")
+ 
+                     for input_ids, tags, lemmas in zip(batch["input_ids"].tolist(), batch_tags.tolist(),
+                                                        batch_lemmas.tolist()):
+                         this_sentence = []
+                         for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
+                             if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
+                                 break
+                             if lemma == 0:  # No lemma yet: the end of the word has not been reached
+                                 if len(this_sentence) > 0:
+                                     this_sentence[-1]["w"] += self.tokenizer.decode(inps)
+                                 else:
+                                     this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
+                             else:
+                                 this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
+                         all_tagged_sentences.append({"lang": LANG_STR, "sent": [{"w": i["w"], "t": self.tags[LANG][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]})
+ 
+                 return all_tagged_sentences
+ 
+             # If the output is TSV written to a pipe (stdout or a file handle)
+             elif tag_config["output_tsv"]:
+                 for batch in tokenized_batches:
+                     all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
+                     batch_tags = torch.argmax(all_out["logits2"], dim=-1)
+                     batch_lemmas = torch.argmax(all_out["logits3"], dim=-1)
+                     batch["input_ids"].to("cpu")
+                     batch["attention_mask"].to("cpu")
+ 
+                     for input_ids, tags, lemmas in zip(batch["input_ids"].tolist(), batch_tags.tolist(),
+                                                        batch_lemmas.tolist()):
+                         this_sentence = []
+                         for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
+                             if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
+                                 break
+                             if lemma == 0:  # No lemma yet: the end of the word has not been reached
+                                 if len(this_sentence) > 0:
+                                     this_sentence[-1]["w"] += self.tokenizer.decode(inps)
+                                 else:
+                                     this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
+                             else:
+                                 this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
+                         this_sentence = [{"w": i["w"], "t": self.tags_str[LANG][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]
+                         tag_config["write_output_to"].write(LANG_STR)
+                         for lin in this_sentence:
+                             tag_config["write_output_to"].write("\t")
+                             tag_config["write_output_to"].write(lin["w"])
+                             tag_config["write_output_to"].write("\t")
+                             tag_config["write_output_to"].write(lin["l"])
+                             tag_config["write_output_to"].write("\t")
+                             tag_config["write_output_to"].write(lin["t"])
+                         tag_config["write_output_to"].write("\n")
+                         tag_config["write_output_to"].write("\n")
+ 
+             # If the output is JSON written to a pipe (stdout or a file handle)
+             else:
+                 for batch in tokenized_batches:
+                     all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
+                     batch_tags = torch.argmax(all_out["logits2"], dim=-1)
+                     batch_lemmas = torch.argmax(all_out["logits3"], dim=-1)
+                     batch["input_ids"].to("cpu")
+                     batch["attention_mask"].to("cpu")
+ 
+                     for input_ids, tags, lemmas in zip(batch["input_ids"].tolist(), batch_tags.tolist(),
+                                                        batch_lemmas.tolist()):
+                         this_sentence = []
+                         for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
+                             if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
+                                 break
+                             if lemma == 0:  # No lemma yet: the end of the word has not been reached
+                                 if len(this_sentence) > 0:
+                                     this_sentence[-1]["w"] += self.tokenizer.decode(inps)
+                                 else:
+                                     this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
+                             else:
+                                 this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
+ 
+                         json.dump({"lang": LANG_STR, "sent": [{"w": i["w"], "t": self.tags[LANG][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]}, tag_config["write_output_to"])
+                         tag_config["write_output_to"].write("\n")
+ 
+         # If the language is identified by majority vote over all sentences:
+         else:
+             all_tags = []
+             all_lemmas = []
+             all_langs = []
+             all_input_ids = []
+             # Go over every batch and every sentence in each batch
+             for batch in tokenized_batches:
+                 all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
+                 batch_tags = torch.argmax(all_out["logits2"], dim=-1)
+                 batch_lemmas = torch.argmax(all_out["logits3"], dim=-1)
+                 batch_langs = torch.argmax(all_out["seq_logits"], dim=-1)
+                 all_input_ids.extend(batch["input_ids"].tolist())
+                 batch["input_ids"].to("cpu")
+                 batch["attention_mask"].to("cpu")
+                 all_langs.extend(batch_langs[:, 0].tolist())
+                 all_tags.extend(batch_tags.tolist())
+                 all_lemmas.extend(batch_lemmas.tolist())
+ 
+             # Identify the language
+             tag_config["lang"] = 1 if sum(all_langs) / len(all_langs) >= 0.5 else 0
+             LANG = tag_config["lang"]
+             LANG_STR = self.config["id_to_lang"][LANG]
+ 
+             # If the output is returned as a Python list:
+             if tag_config["write_output_to"] is None:
+                 all_tagged_sentences = []
+                 for input_ids, tags, lemmas in zip(all_input_ids, all_tags, all_lemmas):
+                     this_sentence = []
+                     for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
+                         if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
+                             break
+                         if lemma == 0:  # No lemma yet: the end of the word has not been reached
+                             if len(this_sentence) > 0:
+                                 this_sentence[-1]["w"] += self.tokenizer.decode(inps)
+                             else:
+                                 this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
+                         else:
+                             this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
+                     all_tagged_sentences.append({"lang": LANG_STR, "sent": [{"w": i["w"], "t": self.tags[LANG][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]})
+                 return all_tagged_sentences
+ 
+             # If the output is in TSV format
+             elif tag_config["output_tsv"]:
+                 for input_ids, tags, lemmas in zip(all_input_ids, all_tags, all_lemmas):
+                     this_sentence = []
+                     for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
+                         if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
+                             break
+                         if lemma == 0:  # No lemma yet: the end of the word has not been reached
+                             if len(this_sentence) > 0:
+                                 this_sentence[-1]["w"] += self.tokenizer.decode(inps)
+                             else:
+                                 this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
+                         else:
+                             this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
+                     this_sentence = [{"w": i["w"], "t": self.tags_str[LANG][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]
+                     tag_config["write_output_to"].write(LANG_STR)
+                     for lin in this_sentence:
+                         tag_config["write_output_to"].write("\t")
+                         tag_config["write_output_to"].write(lin["w"])
+                         tag_config["write_output_to"].write("\t")
+                         tag_config["write_output_to"].write(lin["l"])
+                         tag_config["write_output_to"].write("\t")
+                         tag_config["write_output_to"].write(lin["t"])
+                     tag_config["write_output_to"].write("\n")
+                     tag_config["write_output_to"].write("\n")
+ 
+             # If the output is in JSON format
+             else:
+                 for input_ids, tags, lemmas in zip(all_input_ids, all_tags, all_lemmas):
+                     this_sentence = []
+                     for inps, tag, lemma in zip(input_ids[1:], tags[1:], lemmas[1:]):
+                         if inps == self.tokenizer.sep_token_id or inps == self.tokenizer.pad_token_id:
+                             break
+                         if lemma == 0:  # No lemma yet: the end of the word has not been reached
+                             if len(this_sentence) > 0:
+                                 this_sentence[-1]["w"] += self.tokenizer.decode(inps)
+                             else:
+                                 this_sentence.append({"w": self.tokenizer.decode(inps), "t": tag, "l": lemma})
+                         else:
+                             this_sentence.append({"w": self.tokenizer.decode(inps).strip(), "t": tag, "l": lemma})
+ 
+                     json.dump({"lang": LANG_STR, "sent": [{"w": i["w"], "t": self.tags[LANG][i["t"]], "l": self.LemmaHandling.get_lemma_given_word_and_lemma_list_index(i["w"], i["l"])} for i in this_sentence]}, tag_config["write_output_to"])
+                     tag_config["write_output_to"].write("\n")
+ 
+     def _check_if_text_file_and_return_content(self, filepath):
+         try:
+             with open(filepath, "r") as f:
+                 return f.read()
+         except Exception:
+             return False
+ 
+     @torch.no_grad()
+     def tag(self, inp=None, **tag_config):
+         self.eval()
+         if "one_sentence_per_line" not in tag_config:
+             tag_config["one_sentence_per_line"] = False
+ 
+         if "lang" not in tag_config:
+             tag_config["lang"] = -1
+         else:
+             if tag_config["lang"] in self.config["lang_to_id"]:
+                 tag_config["lang"] = self.config["lang_to_id"][tag_config["lang"]]
+             else:
+                 tag_config["lang"] = -1
+         if "output_tsv" not in tag_config:
+             tag_config["output_tsv"] = False
+ 
+         if "lang_per_sentence" not in tag_config:
+             tag_config["lang_per_sentence"] = False
+         elif tag_config["lang_per_sentence"]:
+             tag_config["lang_per_sentence"] = True
+ 
+         if tag_config["lang"] != -1 and tag_config["lang_per_sentence"]:
+             raise ValueError("The lang_per_sentence and lang parameters cannot be set at the same time.")
+ 
+         if "input_directory" in tag_config:
+             if "output_directory" not in tag_config:
+                 raise ValueError("output_directory must be defined if input_directory is defined.")
+             if "write_output_to" in tag_config and tag_config["write_output_to"] is not None:
+                 raise ValueError("If an input and an output directory are given, write_output_to cannot be used, since the output will be written as files in output_directory.")
+ 
+             write_to = sys.stderr if not sys.stderr.closed else sys.stdout if not sys.stdout.closed else open("tag.log", "w")
+ 
+             # Process the directory
+             for dir_path, _, files in os.walk(tag_config["input_directory"]):
+                 for f in files:
+                     input_path = os.path.join(dir_path, f)
+                     out_path = os.path.join(tag_config["output_directory"], os.path.relpath(dir_path, tag_config["input_directory"]), f + ".tagged")
+ 
+                     file_content = self._check_if_text_file_and_return_content(input_path)
+ 
+                     if isinstance(file_content, str):
+                         file_content = self._preprocess_text(file_content)
+                         print(f"Tagging {input_path} to {out_path}.")
+                         os.makedirs(os.path.dirname(out_path), exist_ok=True)
+                         if tag_config["one_sentence_per_line"]:
+                             inp = [i for i in file_content.split("\n") if i != ""]
+                             with open(out_path, "w") as opened_file:
+                                 tag_config["write_output_to"] = opened_file
+                                 self.tag_sentence_list(inp, **tag_config)
+                         else:
+                             inp = self.split_sentences(file_content, **tag_config)
+                             with open(out_path, "w") as opened_file:
+                                 tag_config["write_output_to"] = opened_file
+                                 self.tag_sentence_list(inp, **tag_config)
+                     else:
+                         print(f"Could not properly open and read {input_path}.")
+ 
+             write_to.close()
+             return
+ 
+         else:
+             if "write_output_to" not in tag_config or tag_config["write_output_to"] is None:
+                 tag_config["write_output_to"] = sys.stdout
+             elif isinstance(tag_config["write_output_to"], str) and tag_config["write_output_to"] == "list":
+                 tag_config["write_output_to"] = None
+             elif isinstance(tag_config["write_output_to"], str):
+                 tag_config["write_output_to"] = open(tag_config["write_output_to"], "w")
+ 
+             if inp is None:
+                 pass
+             elif isinstance(inp, str):
+ 
+                 # Tag one sentence per line in a string
+                 if tag_config["one_sentence_per_line"]:
+                     inp = [self._preprocess_text(i) for i in inp.split("\n") if i != ""]
+                     return self.tag_sentence_list(inp, **tag_config)
+ 
+                 # Otherwise identify sentence boundaries first
+                 inp = self.split_sentences(inp, **tag_config)
+                 return self.tag_sentence_list(inp, **tag_config)
+ 
+             # Tag one sentence per list item
+             elif isinstance(inp, list):
+                 inp = [i.strip() for i in inp]
+                 inp = [self._preprocess_text(i) for i in inp if i != ""]
+                 return self.tag_sentence_list(inp, **tag_config)
+ 
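The option handling at the top of `tag` can be isolated as a small helper. The sketch below mirrors that defaulting logic under an assumed `lang_to_id` mapping; the `{"nob": 0, "nno": 1}` names are illustrative placeholders, the real mapping lives in `self.config["lang_to_id"]`:

```python
def resolve_tag_config(lang_to_id, **tag_config):
    """Mirror of the keyword-argument defaulting done in tag() above.

    The lang_to_id mapping is an assumption for illustration; the model
    reads the real one from its config.
    """
    tag_config.setdefault("one_sentence_per_line", False)
    tag_config.setdefault("output_tsv", False)
    tag_config.setdefault("lang_per_sentence", False)
    # Map a language name to its id; unknown or missing names fall back
    # to -1, which means "detect the language".
    lang = tag_config.get("lang", -1)
    tag_config["lang"] = lang_to_id.get(lang, -1)
    if tag_config["lang"] != -1 and tag_config["lang_per_sentence"]:
        raise ValueError("lang_per_sentence and lang cannot be set at the same time.")
    return tag_config

cfg = resolve_tag_config({"nob": 0, "nno": 1}, lang="nno")
# cfg["lang"] == 1, per-sentence detection off
```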
+     def identify_language_sentence_list(self, lst, **tag_config):
+ 
+         # If the sentences are not tokenized yet, tokenize while batching:
+         tokenized_batches = []
+         if isinstance(lst[0], str):
+             for i in range(0, len(lst), self.batch_size):
+                 batch_texts = lst[i:i + self.batch_size]
+                 encoded_batch = self.tokenizer(batch_texts, padding=True, truncation=True, max_length=self.MAX_LENGTH, return_tensors="pt", return_token_type_ids=False)
+                 encoded_batch["input_ids"].to("cpu")
+                 encoded_batch["attention_mask"].to("cpu")
+                 tokenized_batches.append(encoded_batch)
+         # The sentences are already tokenized, so just batch them:
+         else:
+             tokenized_batches = self._batchify(lst)
+ 
+         all_tagged_sentences = []
+ 
+         # Go over every batch and every sentence in each batch
+         for batch in tokenized_batches:
+             all_out = self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))
+             batch_langs = torch.argmax(all_out["seq_logits"], dim=-1)
+             batch["input_ids"].to("cpu")
+             batch["attention_mask"].to("cpu")
+             all_tagged_sentences.extend(batch_langs[:, 0].tolist())
+ 
+         # If the language is identified per item
+         if tag_config["lang_per_item"]:
+             return [self.config["id_to_lang"][i] for i in all_tagged_sentences]
+ 
+         # If the language is identified by majority vote over all sentences:
+         else:
+             LANG = 1 if sum(all_tagged_sentences) / len(all_tagged_sentences) >= 0.5 else 0
+             LANG_STR = self.config["id_to_lang"][LANG]
+             return [LANG_STR] * len(lst)
+ 
+     @torch.no_grad()
+     def identify_language(self, inp=None, **tag_config):
+         self.eval()
+         if "one_sentence_per_line" not in tag_config:
+             tag_config["one_sentence_per_line"] = False
+         if "lang" in tag_config:
+             del tag_config["lang"]
+ 
+         if "output_tsv" not in tag_config:
+             tag_config["output_tsv"] = False
+ 
+         if "lang_per_sentence" not in tag_config:
+             tag_config["lang_per_sentence"] = False
+         elif tag_config["lang_per_sentence"]:
+             tag_config["lang_per_sentence"] = True
+ 
+         if "input_directory" in tag_config and "output_directory" in tag_config and "write_output_to" in tag_config and tag_config["write_output_to"] is not None:
+             raise ValueError("If an input and an output directory are given, write_output_to cannot be used, since the output will be written as files in output_directory.")
+ 
+         if "write_output_to" not in tag_config or tag_config["write_output_to"] is None:
+             tag_config["write_output_to"] = sys.stdout
+ 
+         elif isinstance(tag_config["write_output_to"], str) and tag_config["write_output_to"] == "list":
+             if tag_config["output_tsv"]:
+                 raise ValueError("write_output_to cannot be set to list if output_tsv is set.")
+             if "output_directory" in tag_config and tag_config["output_directory"]:
+                 raise ValueError("write_output_to cannot be set to list if output_directory is set.")
+             tag_config["write_output_to"] = None
+ 
+         elif isinstance(tag_config["write_output_to"], str):
+             tag_config["write_output_to"] = open(tag_config["write_output_to"], "w")
+ 
+         if "output_directory" in tag_config:
+             tag_config["write_output_to"] = None
+ 
+         if "split_sentences" not in tag_config:
+             tag_config["split_sentences"] = False
+ 
+         if "lang_per_item" not in tag_config:
+             tag_config["lang_per_item"] = False
+ 
+         if "fast_mode" in tag_config:
+ 
+             if "input_directory" not in tag_config:
+                 raise ValueError("input_directory must be defined if fast_mode is set.")
+ 
+             if tag_config["split_sentences"]:
+                 raise ValueError("fast_mode does not split sentences, so split_sentences cannot be set in this mode.")
+ 
+             if tag_config["lang_per_item"]:
+                 raise ValueError("fast_mode does not identify the language of each line or sentence in a file, so lang_per_item cannot be set in this mode.")
+ 
+             if tag_config["lang_per_sentence"]:
+                 raise ValueError("fast_mode does not identify the language of each sentence in a file, so lang_per_sentence cannot be set in this mode.")
+ 
+             general_output = []
+             file_names = []
+             contents = []
+             # Process the directory
+             for dir_path, _, files in os.walk(tag_config["input_directory"]):
+                 for f in files:
+                     input_path = os.path.join(dir_path, f)
+                     # Flush a full batch before reading the next file
+                     if len(file_names) == self.batch_size:
+                         batch = self.tokenizer(contents, padding=True, truncation=True, max_length=self.MAX_LENGTH, return_tensors="pt", return_token_type_ids=False)
+                         langs = torch.argmax(self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))["seq_logits"], dim=-1)[:, 0].tolist()
+                         del batch
+                         torch.cuda.empty_cache()
+ 
+                         if tag_config["write_output_to"] is None:
+                             general_output.extend([{"f": i[0], "l": self.config["id_to_lang"][i[1]]} for i in zip(file_names, langs)])
+                         elif tag_config["output_tsv"]:
+                             for fil, lan in zip(file_names, langs):
+                                 tag_config["write_output_to"].write(fil)
+                                 tag_config["write_output_to"].write("\t")
+                                 tag_config["write_output_to"].write(self.config["id_to_lang"][lan])
+                                 tag_config["write_output_to"].write("\n")
+                         else:
+                             for fil, lan in zip(file_names, langs):
+                                 json.dump({"f": fil, "l": self.config["id_to_lang"][lan]}, tag_config["write_output_to"])
+                         file_names = []
+                         contents = []
+ 
+                     content = None
+                     try:
+                         with open(input_path, "r") as ff:
+                             content = ff.read(3000).replace("\n", " ").replace("\r", "")
+                     except Exception:
+                         pass
+                     if content is not None:
+                         file_names.append(input_path)
+                         contents.append(content)
+ 
+             # Flush the remaining files
+             if len(file_names) > 0:
+                 batch = self.tokenizer(contents, padding=True, truncation=True, max_length=self.MAX_LENGTH, return_tensors="pt", return_token_type_ids=False)
+                 langs = torch.argmax(self(batch["input_ids"].to(self.device), batch["attention_mask"].to(self.device))["seq_logits"], dim=-1)[:, 0].tolist()
+                 del batch
+                 torch.cuda.empty_cache()
+ 
+                 if tag_config["write_output_to"] is None:
+                     general_output.extend([{"f": i[0], "l": self.config["id_to_lang"][i[1]]} for i in zip(file_names, langs)])
+                 elif tag_config["output_tsv"]:
+                     for fil, lan in zip(file_names, langs):
+                         tag_config["write_output_to"].write(fil)
+                         tag_config["write_output_to"].write("\t")
+                         tag_config["write_output_to"].write(self.config["id_to_lang"][lan])
+                         tag_config["write_output_to"].write("\n")
+                 else:
+                     for fil, lan in zip(file_names, langs):
+                         json.dump({"f": fil, "l": self.config["id_to_lang"][lan]}, tag_config["write_output_to"])
+ 
+             return general_output if len(general_output) > 0 else None
+ 
+ if "input_directory" in tag_config:
818
+ general_output=[]
819
+ # Process directory
820
+ for dir_path, _, files in os.walk(tag_config["input_directory"]):
821
+ for f in files:
822
+ input_path = os.path.join(dir_path, f)
823
+
824
+ file_content=self._check_if_text_file_and_return_content(input_path)
825
+
826
+ if type(file_content)==str:
827
+ file_content=self._preprocess_text(file_content)
828
+ new_inp=None
829
+ if tag_config["one_sentence_per_line"]:
830
+ inp = [i for i in file_content.split("\n") if i!=""]
831
+ inp = [i for i in inp if i!=""]
832
+ out = self.identify_language_sentence_list(inp, **tag_config)
833
+ else:
834
+ inp = self.split_sentences(file_content, **tag_config)
835
+ out = self.identify_language_sentence_list(inp, **tag_config)
836
+ new_inp=[self.tokenizer.decode(i[1:]).split("[SEP]")[0].strip() for i in inp]
837
+
838
+ if new_inp!=None:
839
+ inp=new_inp
840
+
841
+ # If no output pipe is available than write to
842
+ if tag_config["write_output_to"]==None:
843
+ if "output_directory" in tag_config:
844
+ out_path = os.path.join(tag_config["output_directory"], os.path.relpath(dir_path, tag_config["input_directory"]), f+".lang")
845
+ os.makedirs(os.path.dirname(out_path), exist_ok=True)
846
+ with open(out_path, "w") as opened_file:
847
+ if tag_config["lang_per_sentence"]:
848
+ if tag_config["output_tsv"]:
849
+ for sen,lan in zip(inp, out):
850
+ opened_file.write(sen)
851
+ opened_file.write("\t")
852
+ opened_file.write(lan)
853
+ opened_file.write("\n")
854
+ else:
855
+ json.dump([{"s":sen, "l":lan} for sen,lan in zip(inp, out) ] , opened_file)
856
+ else:
857
+ if tag_config["output_tsv"]:
858
+ opened_file.write(out[0])
859
+ else:
860
+ json.dump({"l":out[0]} , opened_file)
861
+ else:
862
+ if tag_config["lang_per_sentence"]:
863
+ general_output.extend([{"s":sen, "l":lan} for sen,lan in zip(inp, out) ])
864
+ else:
865
+ general_output.append({"f":input_path, "l":out[0]})
866
+
867
+ # If there is an opened pipe already
868
+ else:
869
+ if tag_config["lang_per_sentence"]:
870
+ if tag_config["output_tsv"]:
871
+ for sen,lan in zip(inp, out):
872
+ tag_config["write_output_to"].write(sen)
873
+ tag_config["write_output_to"].write("\t")
874
+ tag_config["write_output_to"].write(lan)
875
+ tag_config["write_output_to"].write("\n")
876
+ tag_config["write_output_to"].write("\n")
877
+ else:
878
+ json.dump([{"s":sen, "l":lan} for sen,lan in zip(inp, out) ] , tag_config["write_output_to"])
879
+ tag_config["write_output_to"].write("\n")
880
+ else:
881
+ if tag_config["output_tsv"]:
882
+ tag_config["write_output_to"].write(input_path)
883
+ tag_config["write_output_to"].write("\t")
884
+ tag_config["write_output_to"].write(out[0])
885
+ tag_config["write_output_to"].write("\n")
886
+ else:
887
+ json.dump({"f":input_path, "l":out[0]} , tag_config["write_output_to"])
888
+ tag_config["write_output_to"].write("\n")
889
+
890
+ else:
891
+ if tag_config["output_tsv"]:
892
+ tag_config["write_output_to"].write(input_path)
893
+ tag_config["write_output_to"].write("\t")
894
+ tag_config["write_output_to"].write("err")
895
+ tag_config["write_output_to"].write("\n")
896
+ else:
897
+ json.dump({"f":input_path, "l":"err"} , tag_config["write_output_to"])
898
+ tag_config["write_output_to"].write("\n")
899
+
900
+ if tag_config["write_output_to"] and tag_config["write_output_to"]!=sys.stdout and tag_config["write_output_to"]!=sys.stderr:
901
+ tag_config["write_output_to"].close()
902
+
903
+ return general_output if len(general_output)>0 else None
904
+
905
+         if inp is None:
+             pass
+         elif isinstance(inp, str):
+             new_inp = None
+             # If sentence splitting is requested
+             if tag_config["split_sentences"]:
+                 inp = self._preprocess_text(inp)
+                 inp = self.split_sentences(inp, **tag_config)
+                 new_inp = [self.tokenizer.decode(i[1:]).strip() for i in inp]
+                 if tag_config["lang_per_sentence"]:
+                     tag_config["lang_per_item"] = True
+ 
+             # If each line of the string is one sentence
+             elif tag_config["one_sentence_per_line"]:
+                 inp = [self._preprocess_text(i) for i in inp.split("\n") if i != ""]
+                 if tag_config["lang_per_sentence"]:
+                     tag_config["lang_per_item"] = True
+ 
+             # Otherwise identify the language of the input string as a whole
+             else:
+                 inp = [self._preprocess_text(inp)]
+ 
+             # Identify the language
+             out = self.identify_language_sentence_list(inp, **tag_config)
+ 
+             if new_inp is not None:
+                 inp = new_inp
+ 
+             # If the result is returned as a list
+             if tag_config["write_output_to"] is None:
+                 return [{"s": i[0], "l": i[1]} for i in zip(inp, out)]
+ 
+             if tag_config["output_tsv"]:
+                 for sen, lan in zip(inp, out):
+                     tag_config["write_output_to"].write(sen)
+                     tag_config["write_output_to"].write("\t")
+                     tag_config["write_output_to"].write(lan)
+                     tag_config["write_output_to"].write("\n")
+             else:
+                 json.dump([{"s": sen, "l": lan} for sen, lan in zip(inp, out)], tag_config["write_output_to"])
+ 
+             return
+ 
+         # Identify the language of one sentence per list item
+         elif isinstance(inp, list):
+             inp = [i.strip() for i in inp]
+             inp = [self._preprocess_text(i) for i in inp if i != ""]
+             out = self.identify_language_sentence_list(inp, **tag_config)
+ 
+             # If the result is returned as a list
+             if tag_config["write_output_to"] is None:
+                 return [{"s": i[0], "l": i[1]} for i in zip(inp, out)]
+ 
+             if tag_config["output_tsv"]:
+                 for sen, lan in zip(inp, out):
+                     tag_config["write_output_to"].write(sen)
+                     tag_config["write_output_to"].write("\t")
+                     tag_config["write_output_to"].write(lan)
+                     tag_config["write_output_to"].write("\n")
+             else:
+                 json.dump([{"s": sen, "l": lan} for sen, lan in zip(inp, out)], tag_config["write_output_to"])
+ 
+             return
+ 
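Both `tag_sentence_list` and `identify_language_sentence_list` above fall back to a document-level majority vote over the per-sentence language ids (0 or 1). That decision rule in isolation (the id-to-name mapping below is illustrative; the real one comes from the model config):

```python
def majority_language(sentence_lang_ids, id_to_lang):
    """Pick the document language by majority vote over 0/1 sentence ids.

    A tie (exactly half and half) goes to language id 1, matching the
    `>= 0.5` threshold used in the methods above.
    """
    lang_id = 1 if sum(sentence_lang_ids) / len(sentence_lang_ids) >= 0.5 else 0
    return id_to_lang[lang_id]

majority_language([0, 1, 1], {0: "bokmål", 1: "nynorsk"})  # "nynorsk"
```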
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "[BOS]", "eos_token": "[EOS]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tagger_config.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
+ {
+     "tokenizer_class": "PreTrainedTokenizerFast"
+ }