# Kazakh Question Answering Model
A question answering model for Kazakh text, fine-tuned on the KazQAD dataset.
## Model Description
This model is based on google-bert/bert-base-multilingual-cased and fine-tuned for extractive question answering on Kazakh text. The model can answer questions based on a given context by predicting the start and end positions of the answer span.
BERT (Bidirectional Encoder Representations from Transformers) is well-suited for extractive question answering tasks, as it can effectively understand the relationship between questions and context through its bidirectional attention mechanism.
## Direct model usage
```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

model_name = "r3iwan/kazakh-question-answering"

# Fall back to the base tokenizer if the fine-tuned one cannot be loaded
try:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
except OSError:
    tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-multilingual-cased")

model = AutoModelForQuestionAnswering.from_pretrained(model_name)
model.eval()

context = "Қазақстан - Орталық Азиядағы мемлекет. Оның астанасы Астана қаласы."
question = "Қазақстанның астанасы қайда?"

inputs = tokenizer(
    question,
    context,
    return_tensors="pt",
    padding=True,
    truncation="only_second",  # truncate only the context, never the question
    max_length=512,
)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

start_scores = outputs.start_logits[0]
end_scores = outputs.end_logits[0]

num_candidates = 20
max_span_length = 30

# Get top start and end candidates
start_candidates = torch.topk(start_scores, num_candidates)
end_candidates = torch.topk(end_scores, num_candidates)

# Pick the highest-scoring valid span (start <= end, bounded length)
best_score = float("-inf")
best_start = 0
best_end = 0
for start_idx in start_candidates.indices:
    for end_idx in end_candidates.indices:
        if start_idx <= end_idx and (end_idx - start_idx) <= max_span_length:
            score = start_scores[start_idx].item() + end_scores[end_idx].item()
            if score > best_score:
                best_score = score
                best_start = start_idx.item()
                best_end = end_idx.item()

# Extract and decode the answer tokens
answer_tokens = inputs["input_ids"][0][best_start : best_end + 1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

print(f"Question: {question}")
print(f"Answer: {answer}")
```
## Training
The model was trained on the issai/kazqad dataset with the following parameters:
- Base Model: google-bert/bert-base-multilingual-cased
- Epochs: 3
- Batch Size: 8 (per device)
- Learning Rate: 2e-5 (with linear decay)
- Weight Decay: 0.01
- Max Sequence Length: 512
- Dataset Splits:
- Train: 3,163 examples
- Validation: 764 examples
- Test: 2,713 examples
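
Given these settings, the per-step behaviour of the linear learning-rate decay can be sketched in plain Python. This is a simplified illustration assuming no warmup; the `linear_decay_lr` helper and the step count derived from the split sizes are illustrative, not the exact trainer internals:

```python
import math

def linear_decay_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Linear schedule: optional warmup, then decay from base_lr to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# 3 epochs over 3,163 examples at batch size 8 -> ceil(3163 / 8) * 3 steps
total_steps = math.ceil(3163 / 8) * 3  # 1188 optimizer steps
print(total_steps)
print(linear_decay_lr(0, total_steps))            # full 2e-5 at the start
print(linear_decay_lr(total_steps, total_steps))  # decayed to 0.0 at the end
```

With gradient accumulation or multiple devices the effective step count would differ, but the shape of the schedule stays the same.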
### Training Progress
| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | - | 2.760520 |
| 2 | 2.372000 | 2.854473 |
| 3 | 1.347700 | 3.108089 |
The training loss decreased significantly from 2.37 to 1.35, indicating the model fit the training data well; the validation loss, however, rose from 2.76 to 3.11 over the same epochs, suggesting some overfitting by the final epoch.
## Evaluation
The model was evaluated on the test set with the following results:
- Evaluation Loss: 3.108
- Evaluation Runtime: 76.97 seconds
- Evaluation Samples per Second: 35.25
- Evaluation Steps per Second: 4.42
Note: The evaluation loss represents the cross-entropy loss on the test set. For more detailed metrics such as Exact Match (EM) and F1 score, additional evaluation with a question-answering metric computation function would be required.
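
As a sketch of what such a computation looks like, SQuAD-style Exact Match and token-level F1 can be implemented along these lines (simplified: basic lowercasing and punctuation stripping, no Kazakh-specific normalization; the helper names are illustrative):

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Token-overlap F1 between predicted and gold answer spans."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Астана қаласы", "Астана қаласы."))  # 1.0
print(round(f1_score("Астана қаласы", "Астана"), 2))   # 0.67
```

In practice the official SQuAD metric also takes the maximum score over multiple gold answers per question.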
## Model Experiments
During development, two different base models were tested:
### 1. mT5-small (Initial Experiment)
- Base Model: google/mt5-small
- Final Validation Loss: 5.826
- Result: Poor performance - model failed to learn the task effectively
- Issues: mT5 is an encoder-decoder model designed for generative tasks, making it suboptimal for extractive QA. The validation loss remained high (~5.8-6.0) and answers were often extracted from the question rather than the context.
### 2. BERT-base-multilingual-cased (Final Model) ✅
- Base Model: google-bert/bert-base-multilingual-cased
- Final Validation Loss: 3.108
- Result: Significant improvement - model learned the task successfully
- Advantages: BERT's bidirectional encoder architecture is well-suited for extractive QA. The training loss decreased from 2.37 to 1.35, and the model produces correct answers from the context.
The BERT-based model shows much better performance and was selected as the final model for deployment.
## Requirements
The following packages are required to use this model:
```text
torch>=2.1.0
transformers==5.0.0rc1
datasets>=2.14.0,<3.0.0
accelerate>=0.24.0
tiktoken>=0.1.0
sentencepiece>=0.1.99
numpy>=1.25.2,<2.0.0
pandas>=2.0.0
tqdm>=4.66.0
```
Install them with:
```bash
pip install torch transformers datasets accelerate tiktoken sentencepiece numpy pandas tqdm
```
Or install from the project's requirements.txt:
```bash
pip install -r requirements.txt
```
## Limitations
The model is trained on a limited dataset of 3,163 training examples and may perform better on texts similar to the training data. For other domains or question types, additional fine-tuning may be required.
The model uses a maximum sequence length of 512 tokens, so longer contexts will be truncated.
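
One common workaround for this limit is to split a long context into overlapping windows, run the model on each window, and keep the best-scoring span overall. The windowing itself is a simple stride over the token sequence, sketched here with stand-in token ids (the `sliding_windows` helper is illustrative, independent of any tokenizer):

```python
def sliding_windows(tokens, max_len=512, stride=128):
    """Yield overlapping windows so an answer near a boundary appears whole in some window."""
    step = max_len - stride  # advance by max_len minus the overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window already reaches the end of the sequence
    return windows

tokens = list(range(1000))  # stand-in for real token ids
windows = sliding_windows(tokens, max_len=512, stride=128)
print(len(windows))   # 3 windows
print(windows[1][0])  # second window starts at token 384
```

The Hugging Face tokenizers expose the same idea natively via `return_overflowing_tokens=True` with a `stride` argument, which also keeps offset mappings for recovering the answer text.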
For detailed performance metrics (Exact Match and F1 score), additional evaluation with a question-answering metric computation function would be required.
## Citations
If you use this model, please cite the KazQAD dataset:
```bibtex
@inproceedings{yeshpanov-etal-2024-kazqad,
    title = "{K}az{QAD}: {K}azakh Open-Domain Question Answering Dataset",
    author = "Yeshpanov, Rustem and
      Efimov, Pavel and
      Boytsov, Leonid and
      Shalkarbayuli, Ardak and
      Braslavski, Pavel",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.843",
    pages = "9645--9656"
}
```
## Author
R3iwan
## License
The model itself is released under Apache 2.0 license. However, please note that this model was fine-tuned on the KazQAD dataset, which is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). When using this model, you should comply with both licenses.