Translation
Transformers
PyTorch
Safetensors
Belarusian
bart
text2text-generation
seq2seq
lemmatisation
Update README.md
README.md
CHANGED
|
@@ -1,3 +1,18 @@
|
| 1 |
# be-tiny-bart
|
| 2 |
|
| 3 |
A model for lemmatisation of Belarusian, trained on the [Belarusian-HSE](https://github.com/UniversalDependencies/UD_Belarusian-HSE/tree/master) dataset.
|
|
@@ -32,7 +47,6 @@ Downstream use and further fine-tuning (for instance, for text-to-SQL transforma
|
|
| 32 |
|
| 33 |
The model is fine-tuned only for Modern Standard Belarusian, on the rather small Belarusian-HSE dataset. Use its output only after manual checking.
|
| 34 |
|
| 35 |
-
[More Information Needed]
|
| 36 |
|
| 37 |
### Recommendations
|
| 38 |
|
|
@@ -41,9 +55,71 @@ Use this model only for lemmatisation of Modern Standard Belarusian if you aspir
|
|
| 41 |
|
| 42 |
## How to Get Started with the Model
|
| 43 |
|
| 44 |
-
Use the code below to get started with the model.
|
| 45 |
|
| 46 |
-
|
| 47 |
|
| 48 |
## Training Details
|
| 49 |
|
|
@@ -153,49 +229,38 @@ The training took around 2.5 hrs on 4 GB GPU (NVIDIA GeForce RTX 3050).
|
|
| 153 |
|
| 154 |
## Evaluation
|
| 155 |
|
| 156 |
-
During the training, no
|
| 157 |
|
| 158 |
### Testing Data, Factors & Metrics
|
| 159 |
|
| 160 |
#### Testing Data
|
| 161 |
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
[More Information Needed]
|
| 165 |
|
| 166 |
#### Factors
|
| 167 |
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
[More Information Needed]
|
| 171 |
|
| 172 |
#### Metrics
|
| 173 |
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
[More Information Needed]
|
| 177 |
|
| 178 |
### Results
|
| 179 |
|
| 180 |
-
|
| 181 |
|
| 182 |
#### Summary
|
| 183 |
|
|
|
|
| 184 |
|
| 185 |
|
| 186 |
-
## Model Examination [optional]
|
| 187 |
-
|
| 188 |
-
<!-- Relevant interpretability work for the model goes here -->
|
| 189 |
-
|
| 190 |
-
[More Information Needed]
|
| 191 |
-
|
| 192 |
## Environmental Impact
|
| 193 |
|
| 194 |
- **Hardware Type:** Personal laptop (Xiaomi Mi Notebook Pro X 15)
|
| 195 |
- **Hours used:** 4h
|
| 196 |
- **Carbon emitted:** approx. 0.1 kg.
|
| 197 |
|
| 198 |
-
## Technical Specifications
|
| 199 |
|
| 200 |
### Model Architecture and Objective
|
| 201 |
|
|
@@ -226,24 +291,10 @@ TBP
|
|
| 226 |
TBP
|
| 227 |
|
| 228 |
|
| 229 |
-
## Model Card Authors
|
| 230 |
|
| 231 |
Ilia Afanasev
|
| 232 |
|
| 233 |
## Model Card Contact
|
| 234 |
|
| 235 |
-
ilia.afanasev.1997@gmail.com
|
| 236 |
-
|
| 237 |
-
---
|
| 238 |
-
license: mpl-2.0
|
| 239 |
-
language:
|
| 240 |
-
- be
|
| 241 |
-
metrics:
|
| 242 |
-
- accuracy
|
| 243 |
-
base_model:
|
| 244 |
-
- sshleifer/bart-tiny-random
|
| 245 |
-
pipeline_tag: translation
|
| 246 |
-
tags:
|
| 247 |
-
- seq2seq
|
| 248 |
-
- lemmatisation
|
| 249 |
-
---
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: mpl-2.0
|
| 3 |
+
language:
|
| 4 |
+
- be
|
| 5 |
+
metrics:
|
| 6 |
+
- accuracy
|
| 7 |
+
base_model:
|
| 8 |
+
- sshleifer/bart-tiny-random
|
| 9 |
+
pipeline_tag: translation
|
| 10 |
+
tags:
|
| 11 |
+
- seq2seq
|
| 12 |
+
- lemmatisation
|
| 13 |
+
library_name: transformers
|
| 14 |
+
---
|
| 15 |
+
|
| 16 |
# be-tiny-bart
|
| 17 |
|
| 18 |
A model for lemmatisation of Belarusian, trained on the [Belarusian-HSE](https://github.com/UniversalDependencies/UD_Belarusian-HSE/tree/master) dataset.
|
|
|
|
| 47 |
|
| 48 |
The model is fine-tuned only for Modern Standard Belarusian, on the rather small Belarusian-HSE dataset. Use its output only after manual checking.
|
| 49 |
|
|
|
|
| 50 |
|
| 51 |
### Recommendations
|
| 52 |
|
|
|
|
| 55 |
|
| 56 |
## How to Get Started with the Model
|
| 57 |
|
| 58 |
+
Use the code below to get started with the model. You will need your data in CoNLL-U format.
|
| 59 |
+
|
| 60 |
+
```
# Install the dependency first, e.g. in a notebook cell:
#   !pip install simpletransformers

import logging

import pandas as pd
import torch
from simpletransformers.seq2seq import Seq2SeqModel


def load_conllu_dataset(datafile):
    """Read a CoNLL-U file into (input_text, target_text) pairs, one per token line."""
    arr = []
    with open(datafile, encoding='utf-8') as inp:
        strings = inp.readlines()
    for s in strings:
        # Skip comment lines and the blank lines that separate sentences.
        if s[0] != "#" and s.strip():
            split_string = s.split('\t')
            # Input: FORM, UPOS and FEATS (columns 2, 4 and 6); target: LEMMA (column 3).
            arr.append([split_string[1] + " " + split_string[3] + " " + split_string[5], split_string[2]])
    return pd.DataFrame(arr, columns=["input_text", "target_text"])


MODEL_NAME = "djulian13/be-tiny-bart"

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name=MODEL_NAME,
    use_cuda=torch.cuda.is_available(),
)

DATA_PRED_NAME = "test.conllu"

pred_data = load_conllu_dataset(DATA_PRED_NAME)["input_text"].tolist()

# One predicted lemma per token line, in file order.
predictions = iter(model.predict(pred_data))

with open(DATA_PRED_NAME, encoding='utf-8') as inp:
    strings = inp.readlines()

predicted = []
for s in strings:
    if s[0] != "#" and s.strip():
        split_string = s.split('\t')
        # Overwrite the LEMMA column with the model's prediction.
        split_string[2] = next(predictions)
        predicted.append('\t'.join(split_string))
        continue
    # Comment and blank lines are copied through unchanged.
    predicted.append(s)

with open("result.conllu", 'w', encoding='utf-8') as out:
    out.write(''.join(predicted))
```
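
The loader above builds each model input from the FORM, UPOS and FEATS columns of a CoNLL-U token line and takes the LEMMA column as the target. A minimal sketch of that mapping on a single line may make the expected input easier to see. The helper name `conllu_line_to_pair` and the sample token line are invented for illustration and are not part of the model card's code.

```python
def conllu_line_to_pair(line):
    """Map one CoNLL-U token line to an (input_text, target_text) pair.

    Hypothetical helper mirroring the column choice in load_conllu_dataset:
    input = FORM + UPOS + FEATS, target = LEMMA.
    """
    cols = line.rstrip("\n").split("\t")
    form, lemma, upos, feats = cols[1], cols[2], cols[3], cols[5]
    return f"{form} {upos} {feats}", lemma


# Illustrative token line (made up for this example, not taken from the corpus).
sample = "1\tкнігі\tкніга\tNOUN\t_\tCase=Nom|Number=Plur\t0\troot\t_\t_\n"
input_text, target_text = conllu_line_to_pair(sample)
print(input_text)   # кнігі NOUN Case=Nom|Number=Plur
print(target_text)  # кніга
```

Because the input string carries the part of speech and the morphological features, the CoNLL-U file passed to the model presumably needs those columns filled in, not just the word forms.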
|
| 123 |
|
| 124 |
## Training Details
|
| 125 |
|
|
|
|
| 229 |
|
| 230 |
## Evaluation
|
| 231 |
|
| 232 |
+
No evaluation was carried out during training.
|
| 233 |
|
| 234 |
### Testing Data, Factors & Metrics
|
| 235 |
|
| 236 |
#### Testing Data
|
| 237 |
|
| 238 |
+
[YABC](https://github.com/poritski/YABC), a freely downloadable corpus of ≈7.5M words of Belarusian newspaper articles and fiction. For a more detailed description of the dataset, see its page on [Zenodo](https://zenodo.org/records/19349899).
|
| 239 |
|
| 240 |
#### Factors
|
| 241 |
|
| 242 |
+
Genre differences: newspaper articles vs. fiction.
|
| 243 |
|
| 244 |
#### Metrics
|
| 245 |
|
| 246 |
+
The evaluation used the accuracy score, for the best possible comparability, alongside a qualitative analysis of examples.
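
As a rough illustration of the accuracy metric, the sketch below scores predicted lemmas against gold lemmas read from CoNLL-U lines. It is not the author's evaluation script. The function names, the exact-string-match criterion and the sample lines are all assumptions made for this example.

```python
def read_lemmas(conllu_lines):
    """Collect the LEMMA column from CoNLL-U token lines, skipping comments and blanks."""
    return [
        line.split("\t")[2]
        for line in conllu_lines
        if line.strip() and not line.startswith("#")
    ]


def lemma_accuracy(gold_lemmas, predicted_lemmas):
    """Share of tokens whose predicted lemma equals the gold lemma exactly (assumed criterion)."""
    if len(gold_lemmas) != len(predicted_lemmas):
        raise ValueError("gold and predicted lemma lists differ in length")
    if not gold_lemmas:
        raise ValueError("no tokens to score")
    correct = sum(g == p for g, p in zip(gold_lemmas, predicted_lemmas))
    return correct / len(gold_lemmas)


# Invented two-token sentence: the second predicted lemma is deliberately wrong.
gold_lines = [
    "# sent_id = 1\n",
    "1\tкнігі\tкніга\tNOUN\t_\t_\t_\t_\t_\t_\n",
    "2\tляжаць\tляжаць\tVERB\t_\t_\t_\t_\t_\t_\n",
    "\n",
]
pred_lines = [
    "# sent_id = 1\n",
    "1\tкнігі\tкніга\tNOUN\t_\t_\t_\t_\t_\t_\n",
    "2\tляжаць\tляжыць\tVERB\t_\t_\t_\t_\t_\t_\n",
    "\n",
]

print(lemma_accuracy(read_lemmas(gold_lines), read_lemmas(pred_lines)))  # 0.5
```

Running the same function separately on the newspaper and fiction portions of a test set would give the per-genre comparison mentioned under Factors.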
|
| 247 |
|
| 248 |
### Results
|
| 249 |
|
| 250 |
+
When tested out-of-domain, the model often struggles to generate the correct lemma.
|
| 251 |
|
| 252 |
#### Summary
|
| 253 |
|
| 254 |
+
In general, this model can be used for preliminary tagging of Belarusian. However, where better options are available (for instance, using LLMs to disambiguate between multiple existing taggings), it is better to go with them.
|
| 255 |
|
| 256 |
|
| 257 |
## Environmental Impact
|
| 258 |
|
| 259 |
- **Hardware Type:** Personal laptop (Xiaomi Mi Notebook Pro X 15)
|
| 260 |
- **Hours used:** 4h
|
| 261 |
- **Carbon emitted:** approx. 0.1 kg.
|
| 262 |
|
| 263 |
+
## Technical Specifications
|
| 264 |
|
| 265 |
### Model Architecture and Objective
|
| 266 |
|
|
|
|
| 291 |
TBP
|
| 292 |
|
| 293 |
|
| 294 |
+
## Model Card Authors
|
| 295 |
|
| 296 |
Ilia Afanasev
|
| 297 |
|
| 298 |
## Model Card Contact
|
| 299 |
|
| 300 |
+
ilia.afanasev.1997@gmail.com
|