@@ -1,3 +1,18 @@
 # be-tiny-bart
 A model for lemmatisation of Belarusian, trained on the [Belarusian-HSE](https://github.com/UniversalDependencies/UD_Belarusian-HSE/tree/master) dataset.
@@ -32,7 +47,6 @@ Downstream use and further fine-tuning (for instance, for text-to-SQL transforma
 The model is fine-tuned only for Modern Standard Belarusian on the rather small Belarusian-HSE dataset. Use its results only after a manual check.
-[More Information Needed]
 ### Recommendations
@@ -41,9 +55,71 @@ Use this model only for lemmatisation of Modern Standard Belarusian if you aspir
 ## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
 ## Training Details
@@ -153,49 +229,38 @@ The training took around 2.5 hrs on 4 GB GPU (NVIDIA GeForce RTX 3050).
 ## Evaluation
-During the training, no implementation procedures were introduced.
 ### Testing Data, Factors & Metrics
 #### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
 #### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
 #### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
 ### Results
-[More Information Needed]
 #### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
 ## Environmental Impact
 - **Hardware Type:** Personal laptop (Xiaomi Mi Notebook Pro X 15)
 - **Hours used:** 4h
 - **Carbon emitted:** approx. 0.1 kg.
-## Technical Specifications [optional]
 ### Model Architecture and Objective
@@ -226,24 +291,10 @@ TBP
 TBP
-## Model Card Authors [optional]
 Ilia Afanasev
 ## Model Card Contact
-ilia.afanasev.1997@gmail.com
----
-license: mpl-2.0
-language:
-- be
-metrics:
-- accuracy
-base_model:
-- sshleifer/bart-tiny-random
-pipeline_tag: translation
-tags:
-- seq2seq
-- lemmatisation
----

+---
+license: mpl-2.0
+language:
+- be
+metrics:
+- accuracy
+base_model:
+- sshleifer/bart-tiny-random
+pipeline_tag: translation
+tags:
+- seq2seq
+- lemmatisation
+library_name: transformers
+---
 # be-tiny-bart
 A model for lemmatisation of Belarusian, trained on the [Belarusian-HSE](https://github.com/UniversalDependencies/UD_Belarusian-HSE/tree/master) dataset.
 The model is fine-tuned only for Modern Standard Belarusian on the rather small Belarusian-HSE dataset. Use its results only after a manual check.
 ### Recommendations
 ## How to Get Started with the Model
+Use the code below to get started with the model. You will need your data in CoNLL-U format.
+```
+# Notebook syntax; in a shell, run these commands without the leading "!".
+!pip install simpletransformers
+!pip install pyjarowinkler
+!pip install Levenshtein
+import logging
+import pandas as pd
+from simpletransformers.seq2seq import Seq2SeqModel
+import torch
+import Levenshtein
+from pyjarowinkler import distance as jw
+import numpy as np
+from itertools import cycle
+import json
+def load_conllu_dataset(datafile):
+    # Build (input, target) pairs: "FORM UPOS FEATS" -> LEMMA for every token line.
+    arr = []
+    with open(datafile, encoding='utf-8') as inp:
+        strings = inp.readlines()
+    for s in strings:
+        # Skip comment lines and the blank lines that separate sentences.
+        if (s[0] != "#" and s.strip()):
+            split_string = s.split('\t')
+            arr.append([split_string[1] + " " + split_string[3] + " " + split_string[5], split_string[2]])
+    return pd.DataFrame(arr, columns=["input_text", "target_text"])
+MODEL_NAME = "djulian13/be-tiny-bart"
+logging.basicConfig(level=logging.INFO)
+transformers_logger = logging.getLogger("transformers")
+transformers_logger.setLevel(logging.WARNING)
+model = Seq2SeqModel(
+    encoder_decoder_type="bart",
+    encoder_decoder_name=MODEL_NAME,
+    use_cuda = torch.cuda.is_available()
+)
+DATA_PRED_NAME = "test.conllu"
+predictions = load_conllu_dataset(DATA_PRED_NAME)
+pred_data = predictions["input_text"].tolist()
+predictions = model.predict(pred_data)
+predictions = cycle(predictions)
+# Write the predicted lemmas back into the LEMMA column (index 2) of each token line.
+with open(DATA_PRED_NAME, encoding='utf-8') as inp:
+    strings = inp.readlines()
+predicted = []
+for s in strings:
+    if (s[0] != "#" and s.strip()):
+        split_string = s.split('\t')
+        split_string[2] = next(predictions)
+        joined_string = '\t'.join(split_string)
+        predicted.append(joined_string)
+        continue
+    # Comment lines and blank lines are copied through unchanged.
+    predicted.append(s)
+with open("result.conllu", 'w', encoding='utf-8') as out:
+    out.write(''.join(predicted))
+```
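To make the input format concrete, here is a minimal sketch of how a single CoNLL-U token line is turned into the string the model sees. It mirrors the column choice in `load_conllu_dataset` (FORM, UPOS, FEATS as input; LEMMA as target). The helper name `conllu_line_to_pair` and the sample line are illustrative assumptions, not part of the model card or the Belarusian-HSE data.

```python
# Hypothetical helper for illustration; it mirrors the columns used by load_conllu_dataset.
def conllu_line_to_pair(line):
    """Return (input_text, target_lemma) for one CoNLL-U token line, or None."""
    if not line.strip() or line.startswith("#"):
        return None  # comment line or blank sentence separator
    cols = line.rstrip("\n").split("\t")
    # CoNLL-U columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    form, lemma, upos, feats = cols[1], cols[2], cols[3], cols[5]
    return f"{form} {upos} {feats}", lemma


# Made-up sample line (an assumption, not taken from the dataset).
sample = "1\tкнігі\tкніга\tNOUN\t_\tCase=Nom|Number=Plur\t0\troot\t_\t_\n"
pair = conllu_line_to_pair(sample)
print(pair)
```

At prediction time only the first element of the pair is passed to the model; the LEMMA column of the input file is overwritten with the model output.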
 ## Training Details
 ## Evaluation
+No evaluation procedures were used during training.
 ### Testing Data, Factors & Metrics
 #### Testing Data
+[YABC](https://github.com/poritski/YABC), a freely downloadable corpus of ≈7.5M words of Belarusian newspaper articles and fiction. For a more detailed description of the dataset, see its page on [Zenodo](https://zenodo.org/records/19349899).
 #### Factors
+Genre differences: newspaper articles vs. fiction.
 #### Metrics
+The evaluation used the accuracy score, chosen for ease of comparison, alongside a qualitative analysis of examples.
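As a rough illustration of the accuracy metric, the sketch below computes the share of tokens whose predicted lemma matches the gold lemma. It assumes exact, case-sensitive string matching; the model card does not state the matching rule actually used, and the function name `lemma_accuracy` is hypothetical.

```python
def lemma_accuracy(gold, predicted):
    """Fraction of positions where the predicted lemma equals the gold lemma.

    Assumption: exact, case-sensitive string comparison.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Placeholder lemmas, not real evaluation data.
score = lemma_accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"])
print(score)
```

To compare genres, the same function can be called separately on the newspaper and fiction subsets.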
 ### Results
+When tested out-of-domain, the model often struggles to generate the correct lemma.
 #### Summary
+Generally, the model can be used for preliminary tagging of Belarusian. However, if better options are available (for instance, disambiguating multiple existing taggings with LLMs), prefer them.
 ## Environmental Impact
 - **Hardware Type:** Personal laptop (Xiaomi Mi Notebook Pro X 15)
 - **Hours used:** 4h
 - **Carbon emitted:** approx. 0.1 kg.
+## Technical Specifications
 ### Model Architecture and Objective
 TBP
+## Model Card Authors
 Ilia Afanasev
 ## Model Card Contact
+ilia.afanasev.1997@gmail.com