fix-diacritic

Model Details

Model Description

The fix-diacritic model is a fine-tuned token classification model designed to automatically restore missing diacritics (mäkčene, dĺžne) in Slovak text. It takes raw Slovak sentences written without diacritics and predicts the character-level transformations needed to restore correct spelling. The model was developed as part of a submission for the Slovak AI Olympics 2025/26.

  • Developed by: Martin Šenkýř (mrtineu)
  • Model type: Token Classification (Transformer)
  • Language(s) (NLP): Slovak (sk)
  • License: MIT
  • Finetuned from model: gerulata/slovakbert

Uses

Direct Use

The model is intended to be used directly to restore diacritics in Slovak text. Use cases include:

  • Restoring diacritics in informal messages, chats, or emails written without them.
  • Pre-processing text for downstream NLP tasks that require grammatically correct Slovak.

Out-of-Scope Use

The model was trained on standard sentence lengths. It is not designed for, and may struggle with:

  • Extremely long sentences or large uncut text blocks.
  • Non-Slovak text or highly specialized/archaic dialects not present in modern Wikipedia dumps.

Bias, Risks, and Limitations

Since the model is trained on a Wikipedia dataset, it inherits any biases present in the Slovak Wikipedia. Furthermore, because it relies on token-level operators (i.e., predicting an explicit string change at specific character indices), malformed inputs, exotic Unicode characters, or exceptionally long texts may yield unexpected outputs or fail to align properly.

Training Details

Training Data

The model was fine-tuned on a custom dataset consisting of approximately 30,000 sentences (nearly 4 million characters) extracted from a Slovak Wikipedia data dump. The data was cleaned via custom Regex parsing (without standard NLP pipelining tools, per competition constraints) and then degraded by stripping diacritics to create input-target pairs.
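The card does not publish the exact degradation code, but the stripping step it describes can be sketched with Unicode NFD decomposition, which separates base letters from their combining marks (the function name below is illustrative, not from the repo):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks (mäkčene, dĺžne) via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Degrading a target sentence yields the model's input side of a pair:
print(strip_diacritics("dážď bude silný"))  # -> dazd bude silny
```

Applied over the cleaned Wikipedia sentences, this produces aligned (stripped input, original target) pairs without any external NLP tooling.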

Training Procedure

Instead of a Seq2Seq translation approach, the training framed diacritic restoration as a Token Classification task. The foundation model (gerulata/slovakbert) learned token string operators (e.g., classifying the token dazd with the label "1:á,2:ž,3:ď", which restores dážď). This decision drastically reduced training and inference time.
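Decoding such an operator label is a straightforward character substitution. A minimal sketch, assuming the comma-separated "[index]:[char]" format described in this card (the function name and the KEEP convention for unchanged tokens are assumptions, not confirmed repo internals):

```python
def apply_ops(token: str, ops: str) -> str:
    """Apply comma-separated "index:char" operators to a stripped token.

    An empty label or "KEEP" leaves the token unchanged; otherwise each
    "i:c" entry replaces the character at 0-based index i with c.
    """
    if not ops or ops == "KEEP":
        return token
    chars = list(token)
    for op in ops.split(","):
        idx, char = op.split(":")
        chars[int(idx)] = char
    return "".join(chars)

print(apply_ops("dazd", "1:á,2:ž,3:ď"))  # -> dážď
```

Because diacritic restoration never changes token length, index-based replacement is sufficient and keeps the label space small compared to generating full output strings.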

Training Hyperparameters

  • Epochs: 2
  • Hardware: Google Colab T4 GPU
  • Architecture: Token Classification over SlovakBERT

Evaluation

Testing Data, Factors & Metrics

The model was evaluated against a validation set of about 3,000 sentences drawn from the same Wikipedia distribution as the training data.

Metrics

  • Accuracy: The primary evaluation metric used was prediction accuracy (exact match of token restoration).
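The card does not specify whether accuracy is computed per token or per sentence; a sketch of token-level exact match, which is the natural metric for this label scheme (function name is illustrative):

```python
def exact_match_accuracy(predicted: list[str], reference: list[str]) -> float:
    """Share of tokens whose restored form exactly matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("prediction/reference length mismatch")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)

# Three of four restored tokens match the reference:
exact_match_accuracy(["dážď", "bude", "velmi", "silný"],
                     ["dážď", "bude", "veľmi", "silný"])  # -> 0.75
```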

Results

The fine-tuned Token Classification model achieved high accuracy in a fraction of the time required by zero-shot baselines:

  • Accuracy: 97.5%
  • Inference Time (3,000 sentences): ~7 minutes (this figure also includes the 2-epoch fine-tuning phase).

Technical Specifications

Model Architecture and Objective

The architecture uses the Masked Language Model backbone of gerulata/slovakbert with a custom Token Classification head. The head predicts string-operation labels (e.g., KEEP, REPLACE:[token], or the format [index]:[char]) that map diacritic-stripped, ASCII-like characters back to their accented UTF-8 forms.
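The inverse direction, building the training label from an aligned (stripped, original) token pair, follows directly from the label format above. A sketch under the assumption that unchanged tokens are labeled KEEP (the function name is illustrative, not from the repo):

```python
def derive_label(stripped: str, original: str) -> str:
    """Build the operator label that restores `original` from `stripped`.

    Tokens are length-aligned by construction, since stripping diacritics
    never changes the character count.
    """
    if stripped == original:
        return "KEEP"
    diffs = [f"{i}:{o}"
             for i, (s, o) in enumerate(zip(stripped, original))
             if s != o]
    return ",".join(diffs)

print(derive_label("dazd", "dážď"))  # -> 1:á,2:ž,3:ď
```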

Compute Infrastructure

  • Hardware: 1x NVIDIA T4 GPU
  • Compute Environment: Google Colab

Citation

Repository: https://github.com/mrtineu/fix-diacritic

BibTeX:

@misc{mrtineu2026fixdiacritic,
  author = {Šenkýř, Martin},
  title = {fix-diacritic: Slovak Diacritic Restoration Model},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/mrtineu/fix-diacritic},
  note = {GitHub: https://github.com/mrtineu/fix-diacritic}
}

Model Card Authors

Martin Šenkýř (mrtineu)

Model size: 0.1B parameters (Safetensors, F32)