# mDeBERTa-v3-base fine-tuned for Kazakh Missing Word Position Prediction
This model is a fine-tuned version of microsoft/mdeberta-v3-base for the task of missing word position prediction in Kazakh text.
The model takes a sentence with one omitted word and predicts the most likely gap position where the missing word should be inserted.
## Task
Given a sentence with one missing word:
- Input: sentence with one word removed
- Output: index of the most likely insertion position
Example:
- Original: `Мен кеше мектепке бардым`
- Corrupted input: `Мен кеше мектепке`
- Target output: `3`
The model predicts the correct insertion position among all possible gaps in the sentence.
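For a sentence with *n* remaining words there are *n + 1* candidate gaps (before each word plus one at the end). A minimal helper (illustrative, not part of the released code) makes this explicit:

```python
def insertion_gaps(sentence: str) -> list[int]:
    """Enumerate candidate insertion positions for a missing word.

    Gap i means the missing word would be inserted before the i-th word;
    gap len(words) means insertion at the very end.
    """
    words = sentence.split()
    return list(range(len(words) + 1))

# The corrupted example has 3 words, so there are 4 candidate gaps;
# the correct answer for "Мен кеше мектепке" is gap 3 (sentence-final).
gaps = insertion_gaps("Мен кеше мектепке")  # [0, 1, 2, 3]
```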
## Model Details
- Base model: `microsoft/mdeberta-v3-base`
- Architecture: Transformer encoder + custom gap classification head
- Framework: PyTorch / Hugging Face Transformers
- Languages: Primarily Kazakh
- Task type: Sequence-level gap position classification
## Why mDeBERTa
mDeBERTa-v3-base was chosen because:
- it provides strong multilingual contextual representations
- it handles morphologically rich languages better than simpler baselines
- DeBERTa’s disentangled attention helps model both content and position
- it offers a good balance between quality and compute efficiency
## Training Data
The model was trained on Kazakh sentence data using a self-supervised synthetic labeling strategy:
- take a clean sentence
- remove one random word
- use the removed word’s original location as the target label
This turns raw text into a supervised gap prediction dataset without manual annotation.
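The corruption step above can be sketched as follows (a minimal illustration; the actual preprocessing code may differ):

```python
import random

def make_example(sentence: str, rng: random.Random) -> tuple[str, int]:
    """Remove one random word from a clean sentence.

    Returns (corrupted_sentence, label), where the label is the index
    of the removed word, i.e. the gap where it should be re-inserted.
    """
    words = sentence.split()
    idx = rng.randrange(len(words))          # position of the word to drop
    corrupted = words[:idx] + words[idx + 1:]
    return " ".join(corrupted), idx

rng = random.Random(0)
corrupted, label = make_example("Мен кеше мектепке бардым", rng)
```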
## Training setup
- Training rows used: up to ~140k sampled sentences
- Validation split: held-out subset from training data
- Epochs: depends on final run/checkpoint
- Loss: Cross-entropy
- Optimizer: AdamW
- Regularization: dropout, weight decay, gradient clipping
- Precision: mixed precision training (FP16) with stable FP32 parameter updates
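One training step consistent with the listed setup (cross-entropy, AdamW created by the caller, gradient clipping, FP16 autocast with FP32 master updates via a gradient scaler) could be sketched as below; function and argument names are illustrative, not the project's actual code:

```python
import torch

def train_step(model, batch, optimizer, scaler,
               device_type="cuda", dtype=torch.float16, max_norm=1.0):
    """One mixed-precision training step: forward in reduced precision,
    scaled backward pass, gradient clipping, FP32 parameter update."""
    optimizer.zero_grad()
    with torch.autocast(device_type=device_type, dtype=dtype):
        logits = model(**batch["inputs"])                 # (batch, num_gaps)
        loss = torch.nn.functional.cross_entropy(logits, batch["labels"])
    scaler.scale(loss).backward()       # scale loss to avoid FP16 underflow
    scaler.unscale_(optimizer)          # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)              # FP32 master-weight update
    scaler.update()
    return loss.item()
```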
*Note: update this section with the exact final-run values before publishing so that the card matches your checkpoint.*
## Inference
The core model predicts probabilities for all valid insertion positions.
In the full competition pipeline, final predictions may also include:
- top-k candidate selection
- optional n-gram reranking
- confidence-based adjustment
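The combination of top-k selection and external reranking can be sketched as follows. This is a hypothetical illustration: the real pipeline's reranker and score interpolation are not included in this repository, and `rerank_fn` here is a stand-in for, e.g., an n-gram scorer.

```python
def topk_with_reranking(gap_logprobs, k=3, rerank_fn=None, alpha=0.5):
    """Pick the best insertion gap from neural scores, optionally
    blending in an external score for the top-k candidates.

    gap_logprobs: per-gap scores from the neural model.
    rerank_fn:    optional callable gap_index -> external score.
    alpha:        interpolation weight between neural and external scores.
    """
    ranked = sorted(range(len(gap_logprobs)),
                    key=lambda i: gap_logprobs[i], reverse=True)
    candidates = ranked[:k]
    if rerank_fn is None:
        return candidates[0]                       # pure neural prediction
    scored = [(alpha * gap_logprobs[i] + (1 - alpha) * rerank_fn(i), i)
              for i in candidates]
    return max(scored)[1]
```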
This repository contains only the fine-tuned neural model weights; the Hugging Face checkpoint therefore corresponds to the neural prediction stage of the pipeline, without the post-processing steps above.
## Intended Use
This model is intended for:
- Kazakh missing-word detection research
- ASR post-processing
- OCR text correction pipelines
- educational and linguistic error-analysis tools
- research on gap prediction in low-resource languages
## Limitations
- trained for position prediction, not full word generation
- performance depends on sentence quality and domain match
- may degrade on noisy, dialectal, code-switched, or heavily corrupted text
- leaderboard score may rely on additional post-processing not included in raw model weights
- not intended for high-stakes decision-making
## Evaluation
Reported project results:
- Raw neural model: accuracy in the mid-0.64 range
- Best competition pipeline with reranking: leaderboard score of about 0.65
If this Hugging Face repo contains only the model checkpoint, raw standalone performance may differ slightly from the final competition submission pipeline.
## Usage

### Load model

```python
import torch
from transformers import AutoTokenizer

repo_id = "YOUR_USERNAME/YOUR_MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# The checkpoint stores the full model object (encoder + custom gap head),
# so it is loaded with torch.load rather than AutoModel.from_pretrained.
# On PyTorch >= 2.6, weights_only=False is required to unpickle a full model.
model = torch.load("model.pt", map_location="cpu", weights_only=False)
model.eval()
```
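Once loaded, inference can be sketched as below. This assumes a hypothetical interface in which calling the model on tokenized input yields one logit per candidate gap; the exact head output depends on the checkpoint.

```python
import torch

def predict_gap(model, tokenizer, corrupted_sentence: str) -> int:
    """Return the index of the most likely insertion position.

    Assumption (hypothetical interface): model(**inputs) returns a
    (batch, num_gaps) tensor of gap logits.
    """
    inputs = tokenizer(corrupted_sentence, return_tensors="pt")
    with torch.no_grad():
        gap_logits = model(**inputs)
    return int(gap_logits.argmax(dim=-1).item())
```

For the example above, a correct prediction for `"Мен кеше мектепке"` would be gap `3` (sentence-final).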