Create model description

#1
by djulian13 - opened
Files changed (1)
  1. README.md +248 -0
README.md ADDED
@@ -0,0 +1,248 @@

---
license: mpl-2.0
language:
- be
metrics:
- accuracy
base_model:
- sshleifer/bart-tiny-random
pipeline_tag: translation
tags:
- seq2seq
- lemmatisation
---

# be-tiny-bart

A model for lemmatisation of Belarusian, trained on the [Belarusian-HSE](https://github.com/UniversalDependencies/UD_Belarusian-HSE/tree/master) dataset.

## Model Details

### Model Description

- **Developed by:** Ilia Afanasev
- **Model type:** BART
- **Language(s) (NLP):** Belarusian
- **License:** mpl-2.0
- **Finetuned from model:** sshleifer/bart-tiny-random

### Model Sources

- **Paper:** TBP

## Uses

Sequence-to-sequence transformation.

### Direct Use

The system was fine-tuned for lemmatisation of Modern Standard Belarusian.

### Out-of-Scope Use

Downstream use and further fine-tuning (for instance, for text-to-SQL transformation) seem to be out of scope for this small, task-specific model.

## Bias, Risks, and Limitations

The model is fine-tuned only for Modern Standard Belarusian, on the rather small Belarusian-HSE dataset. Use its output only after a manual check.

### Recommendations

Use this model only for lemmatisation of Modern Standard Belarusian, and only where silver-standard results are acceptable. Any regional, territorial, or social variation in the input is likely to degrade the results substantially.

## How to Get Started with the Model

Use the code below to get started with the model.
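
This is a minimal inference sketch with `simpletransformers`, the library used for training (see Training Procedure below). The hub id `djulian13/be-tiny-bart` and the example token are assumptions for illustration; inputs follow the `FORM UPOS FEATS` pattern that the training script builds from CoNLL-U.

```python
import torch
from simpletransformers.seq2seq import Seq2SeqModel

# Hypothetical checkpoint id; substitute the actual path or hub id of this model.
model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name="djulian13/be-tiny-bart",
    use_cuda=torch.cuda.is_available(),
)

# Input mirrors the training format: "FORM UPOS FEATS" (CoNLL-U columns 2, 4, 6).
# The token here is illustrative; the model generates the lemma as plain text.
print(model.predict(["сабакамі NOUN Animacy=Anim|Case=Ins|Number=Plur"]))
```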

## Training Details

### Training Data

The [Belarusian-HSE](https://github.com/UniversalDependencies/UD_Belarusian-HSE/tree/master) treebank from Universal Dependencies.

### Training Procedure

Virtual environment:

- Python 3.10.12
- transformers==4.34.0
- sentence-splitter==1.4
- simpletransformers==0.64.3
- stanza==1.8.1
- torch==2.1.0

The training script:

```python
import argparse
import random

import pandas as pd
import torch
from simpletransformers.seq2seq import Seq2SeqModel


def load_conllu_dataset(datafile):
    # Turn every non-comment CoNLL-U token line into a training pair:
    # input_text = "FORM UPOS FEATS", target_text = LEMMA.
    arr = []
    with open(datafile, encoding='utf-8') as inp:
        strings = inp.readlines()
        for s in strings:
            if s[0] != "#" and s.strip():
                split_string = s.split('\t')
                arr.append([split_string[1] + " " + split_string[3] + " " + split_string[5],
                            split_string[2]])
    return pd.DataFrame(arr, columns=["input_text", "target_text"])


def count_matches(labels, preds):
    # Exact-match accuracy between gold lemmata and generated ones.
    return sum(1 if label == pred else 0 for label, pred in zip(labels, preds))


def main(args):
    train_df = load_conllu_dataset(args.train_data)
    args.fraction = float(args.fraction)
    print(f'Loading training dataset of {train_df.shape[0]} tokens')
    eval_df = load_conllu_dataset(args.dev_data)
    random.seed(int(args.seed))
    print(f'Setting seed to {args.seed}')
    if 0.0 < args.fraction < 1.0:
        remainder = int(args.fraction * len(train_df))
        train_df = train_df.sample(remainder)
        print(f'Subsampling training dataset to {train_df.shape[0]} tokens')
    model_args = {
        "reprocess_input_data": True,
        "overwrite_output_dir": True,
        # Rough caps: longest raw string in the data (character count, not tokens).
        "max_seq_length": max(len(token) for token in train_df["target_text"].tolist()),
        "train_batch_size": int(args.batch),
        "num_train_epochs": int(args.epochs),
        "save_eval_checkpoints": False,
        "save_model_every_epoch": False,
        "evaluate_generated_text": False,
        "evaluate_during_training": False,
        "evaluate_during_training_verbose": False,
        "use_multiprocessing": False,
        "use_multiprocessing_for_evaluation": False,
        "save_best_model": False,
        "max_length": max(len(token) for token in train_df["input_text"].tolist()),
        "save_steps": -1,
    }
    model = Seq2SeqModel(
        encoder_decoder_type=args.model_type,
        encoder_decoder_name=args.model,
        args=model_args,
        use_cuda=torch.cuda.is_available(),
    )
    model.train_model(train_df, eval_data=eval_df, matches=count_matches)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data')
    parser.add_argument('--dev_data')
    parser.add_argument('--model_type', default="bart")
    parser.add_argument('--model', default="tiny-bart")
    parser.add_argument('--epochs', default="2")
    parser.add_argument('--batch', default="4")
    parser.add_argument('--fraction', help="Fraction of data", default=1.0)
    parser.add_argument('--seed', help="random seed", default=1590)
    args = parser.parse_args()
    main(args)
```
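
For illustration, here is how one CoNLL-U token line maps to a training pair in `load_conllu_dataset`; the word form and feature values are invented for the example, not taken from the treebank:

```python
# Tab-separated CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
row = "5\tсабакамі\tсабака\tNOUN\t_\tAnimacy=Anim|Case=Ins|Number=Plur\t2\tobl\t_\t_"
cols = row.split("\t")

pair = {
    "input_text": cols[1] + " " + cols[3] + " " + cols[5],  # "сабакамі NOUN Animacy=Anim|Case=Ins|Number=Plur"
    "target_text": cols[2],                                  # "сабака"
}
```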

#### Training Hyperparameters

- **Training regime:** fp32
- **Epochs:** 2
- **Batch size:** 7
- **Seed:** 1590

#### Speeds, Sizes, Times

Training took around 2.5 hours on a 4 GB GPU (NVIDIA GeForce RTX 3050).

## Evaluation

No evaluation procedures were run during training (`evaluate_during_training` is disabled in the training script); a post-hoc check on the test split is sketched below.
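
The sketch reuses `load_conllu_dataset` and `count_matches` from the training script above; the test-file path and the `outputs` directory (simpletransformers' default `output_dir`) are assumptions:

```python
import torch
from simpletransformers.seq2seq import Seq2SeqModel

# Assumes load_conllu_dataset and count_matches from the training script are in scope.
test_df = load_conllu_dataset("be_hse-ud-test.conllu")  # assumed UD test-split path

model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name="outputs",  # simpletransformers' default output_dir
    use_cuda=torch.cuda.is_available(),
)

preds = model.predict(test_df["input_text"].tolist())
accuracy = count_matches(test_df["target_text"].tolist(), preds) / len(test_df)
print(f"Exact-match lemma accuracy: {accuracy:.3f}")
```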

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

[More Information Needed]

## Environmental Impact

- **Hardware Type:** Personal laptop (Xiaomi Mi Notebook Pro X 15)
- **Hours used:** 4
- **Carbon emitted:** approx. 0.1 kg CO2eq

## Technical Specifications [optional]

### Model Architecture and Objective

- Architecture: BART
- Objective: sequence-to-sequence transformation

### Compute Infrastructure

Personal laptop.

#### Hardware

- Xiaomi Mi Notebook Pro X 15

#### Software

- VS Code

## Citation

**BibTeX:**

TBP

**APA:**

TBP

## Model Card Authors [optional]

Ilia Afanasev

## Model Card Contact

ilia.afanasev.1997@gmail.com