mrtineu committed
Commit f30f4b4 (verified) · Parent: 6f59be4

Update README.md

Files changed (1): README.md (+122 -3)

---
language:
- sk
license: mit
base_model: gerulata/slovakbert
tags:
- token-classification
- diacritic-restoration
- slovak
datasets:
- wikipedia
metrics:
- accuracy
model-index:
- name: mrtineu/fix-diacritic
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      name: Slovak Wikipedia Dump
      type: wikipedia
    metrics:
    - type: accuracy
      value: 97.5
      name: Accuracy
---

# fix-diacritic

## Model Details

### Model Description

The **fix-diacritic** model is a fine-tuned token classification model designed to automatically restore missing diacritics (*mäkčene, dĺžne*) in Slovak text. It takes raw Slovak sentences written without diacritics and predicts the character-level transformations needed to restore correct spelling and grammar. The model was developed as part of a submission for the Slovak AI Olympics 2025/26.

- **Developed by:** Martin Šenkýř (mrtineu)
- **Model type:** Token Classification (Transformer)
- **Language(s) (NLP):** Slovak (`sk`)
- **License:** MIT
- **Finetuned from model:** `gerulata/slovakbert`

## Uses

### Direct Use

The model is intended to be used directly to restore diacritics in Slovak text. Use cases include:

- Restoring diacritics in informal messages, chats, or emails written without them.
- Pre-processing text for downstream NLP tasks that require grammatically correct Slovak.
50
+
51
+ ### Out-of-Scope Use
52
+
53
+ The model was trained on standard sentence lengths. It is not designed for, and may struggle with:
54
+ - **Extremely long sentences** or large uncut text blocks.
55
+ - Non-Slovak text or highly specialized/archaic dialects not present in modern Wikipedia dumps.
56
+
## Bias, Risks, and Limitations

Since the model is trained on a Wikipedia dataset, it inherits any biases present in the Slovak Wikipedia. Furthermore, because it relies on token-level operators (i.e., predicting an explicit string change at specific character indices), malformed inputs, exotic Unicode characters, or exceptionally long texts might yield unexpected outputs or fail to align properly.

## Training Details

### Training Data

The model was fine-tuned on a custom dataset of approximately 30,000 sentences (nearly 4 million characters) extracted from a Slovak Wikipedia dump. The data was cleaned with custom regex parsing (without standard NLP pipeline tools, per competition constraints) and then degraded by stripping diacritics to create input-target pairs.
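
The degradation step can be sketched with Unicode NFD decomposition. This is an illustrative sketch only — the card does not specify exactly how diacritics were stripped:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove diacritics by NFD-decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Building an input-target training pair:
target = "dážď padá na mäkkú zem"
source = strip_diacritics(target)
print(source)  # dazd pada na makku zem
```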
66
+
67
+ ### Training Procedure
68
+
69
+ Instead of a Seq2Seq translation approach, the training framed diacritic restoration as a **Token Classification** task. The foundation model (`gerulata/slovakbert`) learned token string operators (e.g., classifying a token `dazd` to apply the transformation `"1:á,3:ď"` to result in `dážď`). This decision drastically optimized training and inference speed.
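
Applying such an operator label amounts to simple character substitution. The sketch below assumes the comma-separated `index:char` format with 0-based indices and a `KEEP` no-op label; the released checkpoint's exact label vocabulary is an assumption here:

```python
def apply_operator(token: str, label: str) -> str:
    """Apply an operator label like '1:á,2:ž,3:ď' to a plain token."""
    if label == "KEEP":          # no-op label for already-correct tokens
        return token
    chars = list(token)
    for op in label.split(","):
        idx, ch = op.split(":")
        chars[int(idx)] = ch     # replace the character at this 0-based index
    return "".join(chars)

print(apply_operator("dazd", "1:á,2:ž,3:ď"))  # dážď
print(apply_operator("pes", "KEEP"))          # pes
```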
70
+
71
+ #### Training Hyperparameters
72
+
73
+ - **Epochs:** 2
74
+ - **Hardware:** Google Colab T4 GPU
75
+ - **Architecture:** Token Classification over SlovakBERT
76
+
## Evaluation

### Testing Data, Factors & Metrics

The model was evaluated on a held-out validation set of about 3,000 sentences drawn from the same Wikipedia distribution as the training data.

#### Metrics

- **Accuracy:** the primary evaluation metric was prediction accuracy (exact match of token restoration).

### Results

The fine-tuned token classification model reached high accuracy at a fraction of the runtime of zero-shot baselines:

- **Accuracy:** **97.5%**
- **Inference time (3,000 sentences):** ~7 minutes, and this figure includes the 2-epoch fine-tuning phase as well.

## Technical Specifications

### Model Architecture and Objective

The architecture uses the masked-language-model backbone of `gerulata/slovakbert` with a custom token classification head. The objective predicts string-operation labels (e.g., `KEEP`, `REPLACE:[token]`, or the `[index]:[char]` operator format) that map the plain ASCII-like characters back to their diacritic forms.
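
Going the other way — deriving a label from a (stripped, original) token pair during dataset construction — can be sketched as below. `make_label` is a hypothetical helper, not the author's code; it assumes the `KEEP` / `index:char` scheme and that stripping preserves token length (true for Slovak, where each diacritic letter maps to a single base letter):

```python
def make_label(stripped: str, original: str) -> str:
    """Derive an operator label mapping a stripped token to its original form."""
    assert len(stripped) == len(original), "stripping should preserve length"
    ops = [f"{i}:{o}" for i, (s, o) in enumerate(zip(stripped, original)) if s != o]
    return ",".join(ops) if ops else "KEEP"

print(make_label("dazd", "dážď"))  # 1:á,2:ž,3:ď
print(make_label("pes", "pes"))    # KEEP
```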
98
+
99
+ ### Compute Infrastructure
100
+
101
+ - **Hardware:** 1x NVIDIA T4 GPU
102
+ - **Compute Environment:** Google Colab
103
+
## Citation

**Repository:** [https://github.com/mrtineu/fix-diacritic](https://github.com/mrtineu/fix-diacritic)

**BibTeX:**

```bibtex
@misc{mrtineu2026fixdiacritic,
  author    = {Šenkýř, Martin},
  title     = {fix-diacritic: Slovak Diacritic Restoration Model},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/mrtineu/fix-diacritic},
  note      = {GitHub: https://github.com/mrtineu/fix-diacritic}
}
```

## Model Card Authors

Martin Šenkýř (mrtineu)