mrtineu committed
Commit f30f4b4 (verified) · Parent: 6f59be4

Update README.md

Files changed (1): README.md (+122 -3)

---
language:
- sk
license: mit
base_model: gerulata/slovakbert
tags:
- token-classification
- diacritic-restoration
- slovak
datasets:
- wikipedia
metrics:
- accuracy
model-index:
- name: mrtineu/fix-diacritic
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      name: Slovak Wikipedia Dump
      type: wikipedia
    metrics:
    - type: accuracy
      value: 97.5
      name: Accuracy
---

# fix-diacritic

## Model Details

### Model Description

The **fix-diacritic** model is a fine-tuned token classification model designed to automatically restore missing diacritics (*mäkčene, dĺžne*) in Slovak text. It takes raw Slovak sentences written without diacritics and predicts the character-level transformations needed to restore correct spelling and grammar. The model was developed as part of a submission for the Slovak AI Olympics 2025/26.

- **Developed by:** Martin Šenkýř (mrtineu)
- **Model type:** Token Classification (Transformer)
- **Language(s) (NLP):** Slovak (`sk`)
- **License:** MIT
- **Finetuned from model:** `gerulata/slovakbert`

## Uses

### Direct Use

The model is intended to be used directly to restore diacritics in Slovak text. Use cases include:

- Restoring diacritics in informal messages, chats, or emails written without them.
- Pre-processing text for downstream NLP tasks that require grammatically correct Slovak.
50
+
51
+ ### Out-of-Scope Use
52
+
53
+ The model was trained on standard sentence lengths. It is not designed for, and may struggle with:
54
+ - **Extremely long sentences** or large uncut text blocks.
55
+ - Non-Slovak text or highly specialized/archaic dialects not present in modern Wikipedia dumps.
56
+
## Bias, Risks, and Limitations

Since the model is trained on a Wikipedia dataset, it inherits any biases present in the Slovak Wikipedia. Furthermore, because it relies on token-level operators (i.e., predicting an explicit string change at specific character indices), malformed inputs, exotic Unicode characters, or exceptionally long texts might yield unexpected outputs or fail to align properly.

## Training Details

### Training Data

The model was fine-tuned on a custom dataset of approximately 30,000 sentences (nearly 4 million characters) extracted from a Slovak Wikipedia dump. The data was cleaned with custom regex parsing (without standard NLP pipeline tools, per competition constraints) and then degraded by stripping diacritics to create input-target pairs.
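
The degradation step can be sketched with Unicode NFD decomposition. This is an illustrative sketch only — the card does not specify exactly how diacritics were stripped:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove diacritics by NFD-decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Building an input-target training pair:
target = "dážď padá na mäkkú zem"
source = strip_diacritics(target)
print(source)  # dazd pada na makku zem
```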
66
+
67
+ ### Training Procedure
68
+
69
+ Instead of a Seq2Seq translation approach, the training framed diacritic restoration as a **Token Classification** task. The foundation model (`gerulata/slovakbert`) learned token string operators (e.g., classifying a token `dazd` to apply the transformation `"1:á,3:ď"` to result in `dážď`). This decision drastically optimized training and inference speed.
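
Applying such an operator label amounts to simple character substitution. The sketch below assumes the comma-separated `index:char` format with 0-based indices and a `KEEP` no-op label; the released checkpoint's exact label vocabulary is an assumption here:

```python
def apply_operator(token: str, label: str) -> str:
    """Apply an operator label like '1:á,2:ž,3:ď' to a plain token."""
    if label == "KEEP":          # no-op label for already-correct tokens
        return token
    chars = list(token)
    for op in label.split(","):
        idx, ch = op.split(":")
        chars[int(idx)] = ch     # replace the character at this 0-based index
    return "".join(chars)

print(apply_operator("dazd", "1:á,2:ž,3:ď"))  # dážď
print(apply_operator("pes", "KEEP"))          # pes
```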
70
+
71
+ #### Training Hyperparameters
72
+
73
+ - **Epochs:** 2
74
+ - **Hardware:** Google Colab T4 GPU
75
+ - **Architecture:** Token Classification over SlovakBERT
76
+
## Evaluation

### Testing Data, Factors & Metrics

The model was evaluated on a held-out validation set of about 3,000 sentences drawn from the same Wikipedia distribution as the training data.

#### Metrics

- **Accuracy:** the primary evaluation metric was prediction accuracy (exact match of token restoration).

### Results

The fine-tuned token classification model reached high accuracy at a fraction of the runtime of zero-shot baselines:

- **Accuracy:** **97.5%**
- **Inference time (3,000 sentences):** ~7 minutes, and this figure includes the 2-epoch fine-tuning phase as well.

## Technical Specifications

### Model Architecture and Objective

The architecture uses the masked-language-model backbone of `gerulata/slovakbert` with a custom token classification head. The objective predicts string-operation labels (e.g., `KEEP`, `REPLACE:[token]`, or the `[index]:[char]` operator format) that map the plain ASCII-like characters back to their diacritic forms.
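
Going the other way — deriving a label from a (stripped, original) token pair during dataset construction — can be sketched as below. `make_label` is a hypothetical helper, not the author's code; it assumes the `KEEP` / `index:char` scheme and that stripping preserves token length (true for Slovak, where each diacritic letter maps to a single base letter):

```python
def make_label(stripped: str, original: str) -> str:
    """Derive an operator label mapping a stripped token to its original form."""
    assert len(stripped) == len(original), "stripping should preserve length"
    ops = [f"{i}:{o}" for i, (s, o) in enumerate(zip(stripped, original)) if s != o]
    return ",".join(ops) if ops else "KEEP"

print(make_label("dazd", "dážď"))  # 1:á,2:ž,3:ď
print(make_label("pes", "pes"))    # KEEP
```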
98
+
99
+ ### Compute Infrastructure
100
+
101
+ - **Hardware:** 1x NVIDIA T4 GPU
102
+ - **Compute Environment:** Google Colab
103
+
## Citation

**Repository:** [https://github.com/mrtineu/fix-diacritic](https://github.com/mrtineu/fix-diacritic)

**BibTeX:**

```bibtex
@misc{mrtineu2026fixdiacritic,
  author    = {Šenkýř, Martin},
  title     = {fix-diacritic: Slovak Diacritic Restoration Model},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/mrtineu/fix-diacritic},
  note      = {GitHub: https://github.com/mrtineu/fix-diacritic}
}
```

## Model Card Authors

Martin Šenkýř (mrtineu)