# mDeBERTa-v3-base fine-tuned for Kazakh Missing Word Position Prediction
This model is a fine-tuned version of microsoft/mdeberta-v3-base for the task of missing word position prediction in Kazakh text.
The model takes a sentence with one omitted word and predicts the most likely gap position where the missing word should be inserted.
## Task
Given a sentence with one missing word:
- Input: sentence with one word removed
- Output: index of the most likely insertion position
Example:
- Original: `Мен кеше мектепке бардым`
- Corrupted input: `Мен кеше мектепке`
- Target output: `3`
The model predicts the correct insertion position among all possible gaps in the sentence.
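For a sentence with *n* remaining words there are *n + 1* candidate gaps (before each word plus one at the end). A minimal helper (illustrative, not part of the released code) makes this explicit:

```python
def insertion_gaps(sentence: str) -> list[int]:
    """Enumerate candidate insertion positions for a missing word.

    Gap i means the missing word would be inserted before the i-th word;
    gap len(words) means insertion at the very end.
    """
    words = sentence.split()
    return list(range(len(words) + 1))

# The corrupted example has 3 words, so there are 4 candidate gaps;
# the correct answer for "Мен кеше мектепке" is gap 3 (sentence-final).
gaps = insertion_gaps("Мен кеше мектепке")  # [0, 1, 2, 3]
```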
## Model Details
- Base model: `microsoft/mdeberta-v3-base`
- Architecture: Transformer encoder + custom gap classification head
- Framework: PyTorch / Hugging Face Transformers
- Languages: Primarily Kazakh
- Task type: Sequence-level gap position classification
## Why mDeBERTa
mDeBERTa-v3-base was chosen because:
- it provides strong multilingual contextual representations
- it handles morphologically rich languages better than simpler baselines
- DeBERTa’s disentangled attention helps model both content and position
- it offers a good balance between quality and compute efficiency
## Training Data
The model was trained on Kazakh sentence data using a self-supervised synthetic labeling strategy:
- take a clean sentence
- remove one random word
- use the removed word’s original location as the target label
This turns raw text into a supervised gap prediction dataset without manual annotation.
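The corruption step above can be sketched as follows (a minimal illustration; the actual preprocessing code may differ):

```python
import random

def make_example(sentence: str, rng: random.Random) -> tuple[str, int]:
    """Remove one random word from a clean sentence.

    Returns (corrupted_sentence, label), where the label is the index
    of the removed word, i.e. the gap where it should be re-inserted.
    """
    words = sentence.split()
    idx = rng.randrange(len(words))          # position of the word to drop
    corrupted = words[:idx] + words[idx + 1:]
    return " ".join(corrupted), idx

rng = random.Random(0)
corrupted, label = make_example("Мен кеше мектепке бардым", rng)
```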
## Training setup
- Training rows used: up to ~140k sampled sentences
- Validation split: held-out subset from training data
- Epochs: depends on final run/checkpoint
- Loss: Cross-entropy
- Optimizer: AdamW
- Regularization: dropout, weight decay, gradient clipping
- Precision: mixed precision training (FP16) with stable FP32 parameter updates
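One training step consistent with the listed setup (cross-entropy, AdamW created by the caller, gradient clipping, FP16 autocast with FP32 master updates via a gradient scaler) could be sketched as below; function and argument names are illustrative, not the project's actual code:

```python
import torch

def train_step(model, batch, optimizer, scaler,
               device_type="cuda", dtype=torch.float16, max_norm=1.0):
    """One mixed-precision training step: forward in reduced precision,
    scaled backward pass, gradient clipping, FP32 parameter update."""
    optimizer.zero_grad()
    with torch.autocast(device_type=device_type, dtype=dtype):
        logits = model(**batch["inputs"])                 # (batch, num_gaps)
        loss = torch.nn.functional.cross_entropy(logits, batch["labels"])
    scaler.scale(loss).backward()       # scale loss to avoid FP16 underflow
    scaler.unscale_(optimizer)          # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)              # FP32 master-weight update
    scaler.update()
    return loss.item()
```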
*Note: update this section with the exact final-run values before publishing so that the card matches your checkpoint.*
## Inference
The core model predicts probabilities for all valid insertion positions.
In the full competition pipeline, final predictions may also include:
- top-k candidate selection
- optional n-gram reranking
- confidence-based adjustment
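The combination of top-k selection and external reranking can be sketched as follows. This is a hypothetical illustration: the real pipeline's reranker and score interpolation are not included in this repository, and `rerank_fn` here is a stand-in for, e.g., an n-gram scorer.

```python
def topk_with_reranking(gap_logprobs, k=3, rerank_fn=None, alpha=0.5):
    """Pick the best insertion gap from neural scores, optionally
    blending in an external score for the top-k candidates.

    gap_logprobs: per-gap scores from the neural model.
    rerank_fn:    optional callable gap_index -> external score.
    alpha:        interpolation weight between neural and external scores.
    """
    ranked = sorted(range(len(gap_logprobs)),
                    key=lambda i: gap_logprobs[i], reverse=True)
    candidates = ranked[:k]
    if rerank_fn is None:
        return candidates[0]                       # pure neural prediction
    scored = [(alpha * gap_logprobs[i] + (1 - alpha) * rerank_fn(i), i)
              for i in candidates]
    return max(scored)[1]
```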
This repository contains only the fine-tuned neural model weights; the Hugging Face checkpoint therefore corresponds to the neural prediction stage of the pipeline, without the post-processing steps above.
## Intended Use
This model is intended for:
- Kazakh missing-word detection research
- ASR post-processing
- OCR text correction pipelines
- educational and linguistic error-analysis tools
- research on gap prediction in low-resource languages
## Limitations
- trained for position prediction, not full word generation
- performance depends on sentence quality and domain match
- may degrade on noisy, dialectal, code-switched, or heavily corrupted text
- leaderboard score may rely on additional post-processing not included in raw model weights
- not intended for high-stakes decision-making
## Evaluation
Reported project results:
- Raw neural model: accuracy in the mid-0.64 range
- Best competition pipeline with reranking: leaderboard score of about 0.65
If this Hugging Face repo contains only the model checkpoint, raw standalone performance may differ slightly from the final competition submission pipeline.
## Usage

### Load model

```python
import torch
from transformers import AutoTokenizer

repo_id = "YOUR_USERNAME/YOUR_MODEL_NAME"
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# The checkpoint stores the full model object (encoder + custom gap head),
# so it is loaded with torch.load rather than AutoModel.from_pretrained.
# On PyTorch >= 2.6, weights_only=False is required to unpickle a full model.
model = torch.load("model.pt", map_location="cpu", weights_only=False)
model.eval()
```
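Once loaded, inference can be sketched as below. This assumes a hypothetical interface in which calling the model on tokenized input yields one logit per candidate gap; the exact head output depends on the checkpoint.

```python
import torch

def predict_gap(model, tokenizer, corrupted_sentence: str) -> int:
    """Return the index of the most likely insertion position.

    Assumption (hypothetical interface): model(**inputs) returns a
    (batch, num_gaps) tensor of gap logits.
    """
    inputs = tokenizer(corrupted_sentence, return_tensors="pt")
    with torch.no_grad():
        gap_logits = model(**inputs)
    return int(gap_logits.argmax(dim=-1).item())
```

For the example above, a correct prediction for `"Мен кеше мектепке"` would be gap `3` (sentence-final).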