djulian13 committed on
Commit a9f6b6c · verified · 1 Parent(s): ca8e9df

Update README.md

Files changed (1)
  1. README.md +88 -37
README.md CHANGED
@@ -1,3 +1,18 @@
1
  # be-tiny-bart
2
 
3
  A model for lemmatisation of Belarusian, trained on the [Belarusian-HSE](https://github.com/UniversalDependencies/UD_Belarusian-HSE/tree/master) dataset.
@@ -32,7 +47,6 @@ Downstream use and further fine-tuning (for instance, for text-to-SQL transforma
32
 
33
  The model is fine-tuned only for Modern Standard Belarusian on the rather small Belarusian-HSE dataset. Use its results only after a manual check.
34
 
35
- [More Information Needed]
36
 
37
  ### Recommendations
38
 
@@ -41,9 +55,71 @@ Use this model only for lemmatisation of Modern Standard Belarusian if you aspir
41
 
42
  ## How to Get Started with the Model
43
 
44
- Use the code below to get started with the model.
45
 
46
- [More Information Needed]
47
 
48
  ## Training Details
49
 
@@ -153,49 +229,38 @@ The training took around 2.5 hrs on 4 GB GPU (NVIDIA GeForce RTX 3050).
153
 
154
  ## Evaluation
155
 
156
- During the training, no implementation procedures were introduced.
157
 
158
  ### Testing Data, Factors & Metrics
159
 
160
  #### Testing Data
161
 
162
- <!-- This should link to a Dataset Card if possible. -->
163
-
164
- [More Information Needed]
165
 
166
  #### Factors
167
 
168
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
169
-
170
- [More Information Needed]
171
 
172
  #### Metrics
173
 
174
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
175
-
176
- [More Information Needed]
177
 
178
  ### Results
179
 
180
- [More Information Needed]
181
 
182
  #### Summary
183
 
 
184
 
185
 
186
- ## Model Examination [optional]
187
-
188
- <!-- Relevant interpretability work for the model goes here -->
189
-
190
- [More Information Needed]
191
-
192
  ## Environmental Impact
193
 
194
  - **Hardware Type:** Personal laptop (Xiaomi Mi Notebook Pro X 15)
195
  - **Hours used:** 4h
196
  - **Carbon emitted:** approx. 0.1 kg.
197
 
198
- ## Technical Specifications [optional]
199
 
200
  ### Model Architecture and Objective
201
 
@@ -226,24 +291,10 @@ TBP
226
  TBP
227
 
228
 
229
- ## Model Card Authors [optional]
230
 
231
  Ilia Afanasev
232
 
233
  ## Model Card Contact
234
 
235
- ilia.afanasev.1997@gmail.com
236
-
237
- ---
238
- license: mpl-2.0
239
- language:
240
- - be
241
- metrics:
242
- - accuracy
243
- base_model:
244
- - sshleifer/bart-tiny-random
245
- pipeline_tag: translation
246
- tags:
247
- - seq2seq
248
- - lemmatisation
249
- ---
 
1
+ ---
2
+ license: mpl-2.0
3
+ language:
4
+ - be
5
+ metrics:
6
+ - accuracy
7
+ base_model:
8
+ - sshleifer/bart-tiny-random
9
+ pipeline_tag: translation
10
+ tags:
11
+ - seq2seq
12
+ - lemmatisation
13
+ library_name: transformers
14
+ ---
15
+
16
  # be-tiny-bart
17
 
18
  A model for lemmatisation of Belarusian, trained on the [Belarusian-HSE](https://github.com/UniversalDependencies/UD_Belarusian-HSE/tree/master) dataset.
 
47
 
48
  The model is fine-tuned only for Modern Standard Belarusian on the rather small Belarusian-HSE dataset. Use its results only after a manual check.
49
 
 
50
 
51
  ### Recommendations
52
 
 
55
 
56
  ## How to Get Started with the Model
57
 
58
+ Use the code below to get started with the model. You will need your data in CoNLL-U format.
59
+
+ ```
+ # Notebook cell: the leading "!" runs pip through the shell (Jupyter/Colab).
+ !pip install simpletransformers
+ !pip install pyjarowinkler
+ !pip install Levenshtein
+
+ import logging
+ import pandas as pd
+ from simpletransformers.seq2seq import Seq2SeqModel
+ import torch
+ import Levenshtein
+ from pyjarowinkler import distance as jw
+ import numpy as np
+ from itertools import cycle
+ import json
+
+ def load_conllu_dataset(datafile):
+     # Build (input_text, target_text) pairs from the token lines of a CoNLL-U file.
+     arr = []
+     with open(datafile, encoding='utf-8') as inp:
+         strings = inp.readlines()
+     for s in strings:
+         # Skip comment lines and the blank lines that separate sentences.
+         if (s[0] != "#" and s.strip()):
+             split_string = s.split('\t')
+             # Input is "FORM UPOS FEATS" (columns 1, 3, 5); target is LEMMA (column 2).
+             arr.append([split_string[1] + " " + split_string[3] + " " + split_string[5], split_string[2]])
+     return pd.DataFrame(arr, columns=["input_text", "target_text"])
+
+
+ MODEL_NAME = "djulian13/be-tiny-bart"
+
+ logging.basicConfig(level=logging.INFO)
+ transformers_logger = logging.getLogger("transformers")
+ transformers_logger.setLevel(logging.WARNING)
+
+
+ model = Seq2SeqModel(
+     encoder_decoder_type="bart",
+     encoder_decoder_name=MODEL_NAME,
+     use_cuda=torch.cuda.is_available()
+ )
+
+ DATA_PRED_NAME = "test.conllu"
+
+ predictions = load_conllu_dataset(DATA_PRED_NAME)
+
+ pred_data = predictions["input_text"].tolist()

+ # One predicted lemma per token line, in file order.
+ predictions = model.predict(pred_data)
+
+ predictions = cycle(predictions)
+ with open(DATA_PRED_NAME, encoding='utf-8') as inp:
+     strings = inp.readlines()
+ predicted = []
+ for s in strings:
+     if (s[0] != "#" and s.strip()):
+         # Token line: overwrite the LEMMA column with the prediction.
+         split_string = s.split('\t')
+         split_string[2] = next(predictions)
+         joined_string = '\t'.join(split_string)
+         predicted.append(joined_string)
+         continue
+     # Comment and blank lines are copied through unchanged.
+     predicted.append(s)
+ with open("result.conllu", 'w', encoding='utf-8') as out:
+     out.write(''.join(predicted))
+
+ ```
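To make the input format concrete, here is a minimal sketch of how a single CoNLL-U token line maps to the model's input and target strings, following the column choice in `load_conllu_dataset` above (FORM, UPOS and FEATS as input, LEMMA as target). The helper name `conllu_line_to_pair` and the sample line are hypothetical and are not taken from the Belarusian-HSE data.

```python
def conllu_line_to_pair(line):
    """Map one CoNLL-U token line to an (input_text, target_text) pair."""
    cols = line.rstrip("\n").split("\t")
    # CoNLL-U columns: 0 ID, 1 FORM, 2 LEMMA, 3 UPOS, 4 XPOS, 5 FEATS, ...
    form, lemma, upos, feats = cols[1], cols[2], cols[3], cols[5]
    return f"{form} {upos} {feats}", lemma


# Hypothetical token line, written by hand for illustration only.
example = "1\tкнігі\tкніга\tNOUN\t_\tCase=Gen|Gender=Fem|Number=Sing\t0\troot\t_\t_"

input_text, target_text = conllu_line_to_pair(example)
print(input_text)   # кнігі NOUN Case=Gen|Gender=Fem|Number=Sing
print(target_text)  # кніга
```

At prediction time only `input_text` is passed to the model, so the UPOS and FEATS columns of the input file presumably need to be filled in beforehand.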
123
 
124
  ## Training Details
125
 
 
229
 
230
  ## Evaluation
231
 
232
+ No evaluation procedures were applied during training.
233
 
234
  ### Testing Data, Factors & Metrics
235
 
236
  #### Testing Data
237
 
238
+ [YABC](https://github.com/poritski/YABC), a freely downloadable corpus of ≈7.5M words of Belarusian newspaper articles and fiction. For a more detailed description of the dataset, see its page on [Zenodo](https://zenodo.org/records/19349899).
 
 
239
 
240
  #### Factors
241
 
242
+ Genre differences: newspaper articles vs. fiction.
 
 
243
 
244
  #### Metrics
245
 
246
+ The evaluation used the accuracy score, chosen for ease of comparison, alongside a qualitative analysis of examples.
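As a rough illustration of the accuracy metric, the sketch below computes the share of tokens whose predicted lemma exactly matches the gold lemma. The card does not state how accuracy was computed, so exact, case-sensitive matching per token is an assumption, and the helper names `read_lemmas` and `lemma_accuracy` are hypothetical.

```python
def read_lemmas(path):
    """Collect the LEMMA column (index 2) from the token lines of a CoNLL-U file."""
    lemmas = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Skip comments and blank sentence separators.
            if line.startswith("#") or not line.strip():
                continue
            lemmas.append(line.rstrip("\n").split("\t")[2])
    return lemmas


def lemma_accuracy(gold, predicted):
    """Fraction of positions where the predicted lemma equals the gold lemma."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of tokens")
    if not gold:
        return 0.0
    # Assumption: exact, case-sensitive string match counts as correct.
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)
```

With a gold file and a file of predictions that share the same tokenisation, this could be called as `lemma_accuracy(read_lemmas("test.conllu"), read_lemmas("result.conllu"))`.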
 
 
247
 
248
  ### Results
249
 
250
+ When tested out-of-domain, the model often struggles to generate the correct lemma.
251
 
252
  #### Summary
253
 
254
+ Generally, this model can be used for preliminary tagging of Belarusian. However, if better options are available (for instance, disambiguating multiple existing taggings with LLMs), prefer them.
255
 
256
 
 
 
 
 
 
 
257
  ## Environmental Impact
258
 
259
  - **Hardware Type:** Personal laptop (Xiaomi Mi Notebook Pro X 15)
260
  - **Hours used:** 4h
261
  - **Carbon emitted:** approx. 0.1 kg.
262
 
263
+ ## Technical Specifications
264
 
265
  ### Model Architecture and Objective
266
 
 
291
  TBP
292
 
293
 
294
+ ## Model Card Authors
295
 
296
  Ilia Afanasev
297
 
298
  ## Model Card Contact
299
 
300
+ ilia.afanasev.1997@gmail.com