Raychani1/slovakbert-ner-v2

 ---
 license: mit
+language:
+- sk
+tags:
+- generated_from_trainer
+datasets:
+- NBS_sentence
+metrics:
+- precision
+- recall
+- f1
+- accuracy
+inference: false
+model-index:
+- name: slovakbert-ner-v2
+  results:
+  - task:
+      name: Token Classification
+      type: token-classification
+    metrics:
+    - name: Precision
+      type: precision
+      value: 0.9715
+    - name: Recall
+      type: recall
+      value: 0.9433
+    - name: F1
+      type: f1
+      value: 0.9547
+    - name: Accuracy
+      type: accuracy
+      value: 0.9897
 ---
+# SlovakBERT-based Named Entity Recognition
+A deep learning model for Named Entity Recognition (NER) in Slovak. It is based on [**Gerulata/SlovakBERT**](https://huggingface.co/gerulata/slovakbert) and fine-tuned on web-scraped Slovak news articles. The model supports the following IOB-tagged entity categories: **PERSON**, **ORGANIZATION**, **LOCATION**, **DATE**, **TIME**, **MONEY**, and **PERCENTAGE**.
+### **Related Work**
+[![Thesis][Thesis]][Thesis-url]
+## Model usage
+### Simple Named Entity Recognition (NER)
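+```python
+from transformers import pipeline
+
+# Load the fine-tuned token-classification model from the Hugging Face Hub.
+ner_pipeline = pipeline(task='ner', model='Raychani1/slovakbert-ner-v2')
+input_sentence = 'Hoci podľa ostatných údajov NBS pre Bratislavský kraj je aktuálna priemerná cena nehnuteľností na úrovni 2 072 eur za štvorcový meter, ceny bytov v hlavnom meste sú podstatne vyššie.'
+
+# The result is a list of dicts, one per tagged token (label, score, word, character offsets).
+classifications = ner_pipeline(input_sentence)
+print(classifications)
+```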
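The pipeline call above returns one prediction per token, so multi-token entities arrive in pieces. The following is a minimal sketch of merging IOB-tagged tokens into entity spans using their character offsets. The helper name `group_entities`, the sample sentence, the sample predictions, and the label spellings are all illustrative assumptions, not actual model output.

```python
def group_entities(tokens, text):
    """Merge token-level IOB predictions into (label, surface text) spans."""
    spans = []
    current = None  # span currently being built
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current["label"] != label)
        )
        if starts_new:
            if current is not None:
                spans.append(current)
            current = {"label": label, "start": tok["start"], "end": tok["end"]}
        elif prefix == "I":
            # Continuation of the current entity: extend its end offset.
            current["end"] = tok["end"]
        else:
            # Outside tag: close any open span.
            if current is not None:
                spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return [(s["label"], text[s["start"]:s["end"]]) for s in spans]


# Hypothetical predictions in the shape the 'ner' pipeline returns.
sample_text = "NBS sídli v Bratislave."
sample_tokens = [
    {"entity": "B-Organization", "start": 0, "end": 3},
    {"entity": "B-Location", "start": 12, "end": 18},
    {"entity": "I-Location", "start": 18, "end": 22},
]
entities = group_entities(sample_tokens, sample_text)
print(entities)
```

If outside-tagged tokens are filtered out before this step, two adjacent entities of the same type that lack a `B-` tag would be merged, so treat the result as approximate.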
+### Named Entity Recognition (NER) with Visualization
+For a visualization example, see this [Gist](https://gist.github.com/Raychani1/7d4455491f0aa681ed8ea99d8b1d8279).
+### Model Prediction Output Example
+![prediction_output](https://github.com/Raychani1/Text_Parsing_Methods_Using_NLP/assets/45550552/723ab7f1-4efb-4d03-87d6-b9ac1e40990f)
+## Model Training
+### Training Hyperparameters
+|        **Parameter**        | **Value** |
+|:---------------------------:|:---------:|
+| per_device_train_batch_size |     4     |
+|  per_device_eval_batch_size |     4     |
+|        learning_rate        |   5e-05   |
+|          adam_beta1         |    0.9    |
+|          adam_beta2         |   0.999   |
+|         adam_epsilon        |   1e-08   |
+|       num_train_epochs      |     15    |
+|      lr_scheduler_type      |   linear  |
+|             seed            |     42    |
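As a rough sketch, the values in the table can be collected into keyword arguments for `transformers.TrainingArguments`. The value 0.999 is taken here to be `adam_beta2`, and `output_dir` in the comment is a hypothetical placeholder. The block also works out the training-set size implied by 70 optimizer steps per epoch, assuming a single device, no gradient accumulation, and a kept final partial batch; none of these is stated in the source.

```python
# Hyperparameters from the table, keyed by TrainingArguments parameter names.
training_kwargs = {
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "learning_rate": 5e-05,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,  # assumed: the second beta value in the table
    "adam_epsilon": 1e-08,
    "num_train_epochs": 15,
    "lr_scheduler_type": "linear",
    "seed": 42,
}

# Possible use (requires transformers; output_dir is a placeholder):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="out", **training_kwargs)

# Implied training-set size under the assumptions named above.
steps_per_epoch = 70
batch_size = training_kwargs["per_device_train_batch_size"]
max_examples = steps_per_epoch * batch_size
min_examples = max_examples - batch_size + 1
print(f"between {min_examples} and {max_examples} training examples")
```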
+### Training results
+The best results were reached in the 8th training epoch.
+| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
+|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
+| 0.6721        | 1.0   | 70   | 0.2214          | 0.6972    | 0.7308 | 0.7136 | 0.9324   |
+| 0.1849        | 2.0   | 140  | 0.1697          | 0.8056    | 0.8365 | 0.8208 | 0.952    |
+| 0.0968        | 3.0   | 210  | 0.1213          | 0.882     | 0.8622 | 0.872  | 0.9728   |
+| 0.0468        | 4.0   | 280  | 0.1107          | 0.8372    | 0.907  | 0.8708 | 0.9684   |
+| 0.0415        | 5.0   | 350  | 0.1644          | 0.8059    | 0.8782 | 0.8405 | 0.9615   |
+| 0.0233        | 6.0   | 420  | 0.1255          | 0.8576    | 0.8878 | 0.8724 | 0.9716   |
+| 0.0198        | 7.0   | 490  | 0.1383          | 0.8545    | 0.8846 | 0.8693 | 0.9703   |
+| 0.0133        | 8.0   | 560  | 0.1241          | 0.884     | 0.9038 | 0.8938 | 0.9735   |
+## Model Evaluation
+### Evaluation Dataset Distribution
+|    **NER Tag**    | **Number of Tokens** |
+|:-----------------:|:--------------------:|
+|       **O**       |         6568         |
+|    **B-Person**   |          96          |
+|    **I-Person**   |          83          |
+| **B-Organization** |          583         |
+| **I-Organization** |          585         |
+|   **B-Location**  |          59          |
+|   **I-Location**  |          15          |
+|     **B-Date**    |          113         |
+|     **I-Date**    |          87          |
+|      **Time**     |           5          |
+|    **B-Money**    |          44          |
+|    **I-Money**    |          74          |
+|  **B-Percentage** |          57          |
+|  **I-Percentage** |          54          |
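The distribution above is heavily skewed toward non-entity tokens. A short sketch of the arithmetic, using the counts from the table: the key `"O"` stands for the first row (the outside-entity tag), and the "majority baseline" is simply the accuracy a tagger would get by labeling every token as non-entity.

```python
# Token counts per tag, copied from the evaluation distribution table.
token_counts = {
    "O": 6568,  # first row of the table: tokens outside any entity
    "B-Person": 96, "I-Person": 83,
    "B-Organization": 583, "I-Organization": 585,
    "B-Location": 59, "I-Location": 15,
    "B-Date": 113, "I-Date": 87,
    "Time": 5,
    "B-Money": 44, "I-Money": 74,
    "B-Percentage": 57, "I-Percentage": 54,
}

total_tokens = sum(token_counts.values())
entity_tokens = total_tokens - token_counts["O"]

# Accuracy of a trivial tagger that predicts the outside tag everywhere.
majority_share = token_counts["O"] / total_tokens

print(total_tokens, entity_tokens, round(majority_share, 4))
```

Because roughly 78% of tokens are non-entity, token-level accuracy alone says little about entity tags, which is one reason to read the macro-averaged scores alongside it.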
+### Evaluation Confusion Matrix
+![image](https://github.com/Raychani1/Text_Parsing_Methods_Using_NLP/assets/45550552/e6d1a1c6-e02f-4de9-9684-5882a405d31f)
+### Evaluation Model Metrics
+| **Precision** | **Macro-Precision** | **Recall** | **Macro-Recall** | **F1** | **Macro-F1** | **Accuracy** |
+|:-------------:|:-------------------:|:----------:|:----------------:|:------:|:------------:|:------------:|
+|     0.9897    |        0.9715       |   0.9897   |      0.9433      | 0.9895 |    0.9547    |    0.9897    |
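The table lists each metric twice, once with a `Macro-` prefix and once without. The source does not say how the unprefixed columns are averaged. The sketch below assumes they are support-weighted and uses made-up per-class scores and supports only to show why a macro average can be lower on imbalanced data. The function name `macro_and_weighted` is illustrative.

```python
def macro_and_weighted(per_class_scores, supports):
    """Return (macro average, support-weighted average) of per-class scores."""
    macro = sum(per_class_scores) / len(per_class_scores)
    weighted = sum(
        score * n for score, n in zip(per_class_scores, supports)
    ) / sum(supports)
    return macro, weighted


# Hypothetical two-class case: a frequent class scored well, a rare one less so.
scores = [0.99, 0.80]
supports = [900, 100]
macro, weighted = macro_and_weighted(scores, supports)
print(round(macro, 3), round(weighted, 3))
```

The macro average treats every tag equally, so rare tags pull it down more than they pull down a support-weighted figure.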
+## Framework Versions
+- Transformers 4.26.1
+- PyTorch 1.13.1
+- Tokenizers 0.13.2
+<!-- Variables -->
+[Thesis]: https://img.shields.io/badge/%F0%9F%93%9C-Masters%20Thesis-blue?style=for-the-badge
+[Thesis-url]: https://opac.crzp.sk/?fn=detailBiblioForm&sid=C0DEB8E07572332BA2230915805F