fix-diacritic
Model Details
Model Description
The fix-diacritic model is a fine-tuned token classification model designed to automatically restore missing diacritics (mäkčene, dĺžne) in Slovak text. It takes raw Slovak sentences written without diacritics and predicts the per-token character edits needed to restore correct spelling and grammar. The model was developed as part of a submission for the Slovak AI Olympics 2025/26.
- Developed by: Martin Šenkýř (mrtineu)
- Model type: Token Classification (Transformer)
- Language(s) (NLP): Slovak (sk)
- License: MIT
- Finetuned from model: gerulata/slovakbert
Uses
Direct Use
The model is intended to be used directly to restore diacritics in Slovak text. Use cases include:
- Restoring diacritics in informal messages, chats, or emails written without them.
- Pre-processing text for downstream NLP tasks that require grammatically correct Slovak.
Out-of-Scope Use
The model was trained on standard sentence lengths. It is not designed for, and may struggle with:
- Extremely long sentences or large uncut text blocks.
- Non-Slovak text or highly specialized/archaic dialects not present in modern Wikipedia dumps.
Bias, Risks, and Limitations
Since the model is trained on a Wikipedia dataset, it inherits any biases present in the Slovak Wikipedia. Furthermore, because it relies on token-level operators (e.g., predicting an explicit string change at specific character indices), malformed inputs, exotic unicode characters, or exceptionally long texts might yield unexpected outputs or fail to align properly.
Training Details
Training Data
The model was fine-tuned on a custom dataset consisting of approximately 30,000 sentences (nearly 4 million characters) extracted from a Slovak Wikipedia data dump. The data was cleaned via custom Regex parsing (without standard NLP pipelining tools, per competition constraints) and then degraded by stripping diacritics to create input-target pairs.
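The degradation step described above can be sketched with Python's standard unicodedata module: decomposing each character (NFD) and dropping the combining marks removes mäkčene and dĺžne while leaving the base letters intact. The function name below is illustrative, not taken from the project's code.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose characters (NFD), drop combining marks (the diacritics),
    # then recompose (NFC). "dážď" becomes "dazd".
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

# An input-target pair as used for training:
target = "dážď"
source = strip_diacritics(target)  # "dazd"
```

Because stripping is deterministic and length-preserving for Slovak diacritics, every Wikipedia sentence yields a perfectly aligned input-target pair for free.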
Training Procedure
Instead of a Seq2Seq translation approach, the training framed diacritic restoration as a token classification task. The foundation model (gerulata/slovakbert) learned token string operators (e.g., classifying the token dazd with the label "1:á,2:ž,3:ď", which restores dážď). This framing substantially reduced both training and inference time.
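The operator labels above can be derived and applied with a few lines of pure Python; this is a minimal sketch of the idea, assuming (as the example suggests) that stripping preserves token length and that unchanged tokens get a KEEP label. The helper names are hypothetical.

```python
def make_label(src: str, tgt: str) -> str:
    # src: diacritic-stripped token, tgt: original token.
    # Emits "index:char" edits for every differing position.
    assert len(src) == len(tgt), "stripping is assumed length-preserving"
    ops = [f"{i}:{t}" for i, (s, t) in enumerate(zip(src, tgt)) if s != t]
    return ",".join(ops) if ops else "KEEP"

def apply_label(src: str, label: str) -> str:
    # Inverse operation: apply the predicted edits to the raw token.
    if label == "KEEP":
        return src
    chars = list(src)
    for op in label.split(","):
        i, ch = op.split(":")
        chars[int(i)] = ch
    return "".join(chars)
```

Framing restoration this way keeps the label set finite (one class per operator string seen in training), which is what makes a fast classification head viable instead of autoregressive generation.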
Training Hyperparameters
- Epochs: 2
- Hardware: Google Colab T4 GPU
- Architecture: Token Classification over SlovakBERT
Evaluation
Testing Data, Factors & Metrics
The model was evaluated against a validation set of about 3,000 sentences drawn from the same Wikipedia distribution as the training data.
Metrics
- Accuracy: The primary evaluation metric used was prediction accuracy (exact match of token restoration).
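As a rough sketch of how exact-match token accuracy could be computed (the function name and exact scoring granularity are assumptions, not the project's evaluation code):

```python
def token_accuracy(predicted: list[str], gold: list[str]) -> float:
    # Exact-match accuracy: a token counts as correct only if the
    # restored string equals the gold token exactly.
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```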
Results
The fine-tuned token classification model outperformed zero-shot baselines in both accuracy and speed:
- Accuracy: 97.5%
- Inference Time (3,000 sentences): ~7 minutes (this figure also includes the 2-epoch fine-tuning phase).
Technical Specifications
Model Architecture and Objective
The architecture uses the masked language model backbone of gerulata/slovakbert with a custom token classification head. The classification objective predicts string-operation labels (e.g., KEEP, REPLACE:[token], or edits in the format [index]:[char]) that map the diacritic-stripped, ASCII-like characters back to their diacritic UTF-8 forms.
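Decoding a sentence from the head's per-token predictions then reduces to applying each token's label in order. The sketch below handles the KEEP and [index]:[char] label forms described above (the REPLACE:[token] form is omitted for brevity); the function name is illustrative.

```python
def restore(tokens: list[str], labels: list[str]) -> str:
    # Apply each predicted operator label to its token and rejoin.
    out = []
    for tok, lab in zip(tokens, labels):
        if lab == "KEEP":
            out.append(tok)
        else:
            chars = list(tok)
            for op in lab.split(","):
                i, ch = op.split(":")
                chars[int(i)] = ch
            out.append("".join(chars))
    return " ".join(out)
```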
Compute Infrastructure
- Hardware: 1x NVIDIA T4 GPU
- Compute Environment: Google Colab
Citation
Repository: https://github.com/mrtineu/fix-diacritic
BibTeX:
@misc{mrtineu2026fixdiacritic,
author = {Šenkýř, Martin},
title = {fix-diacritic: Slovak Diacritic Restoration Model},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/mrtineu/fix-diacritic},
note = {GitHub: https://github.com/mrtineu/fix-diacritic}
}
Model Card Authors
Martin Šenkýř (mrtineu)