Kazakh Question Answering Model

A question answering model for Kazakh text, fine-tuned on the KazQAD dataset.

Model Description

This model is based on google-bert/bert-base-multilingual-cased and fine-tuned for extractive question answering on Kazakh text. The model can answer questions based on a given context by predicting the start and end positions of the answer span.

BERT (Bidirectional Encoder Representations from Transformers) is well-suited for extractive question answering tasks, as it can effectively understand the relationship between questions and context through its bidirectional attention mechanism.
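The span-prediction mechanics can be illustrated without the model itself. The QA head emits one start logit and one end logit per token, and the predicted answer is the valid span (start before end) with the highest combined score. A toy sketch with made-up logits:

```python
# Toy illustration of extractive QA span selection (not real model outputs).

def best_span(start_logits, end_logits, max_span_length=30):
    """Return (start, end, score) for the highest-scoring valid span."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_logits):
        for j, e in enumerate(end_logits):
            if i <= j <= i + max_span_length and s + e > best[2]:
                best = (i, j, s + e)
    return best

# Hypothetical logits for a 6-token input: the model "points" at tokens 3..4.
start_logits = [0.1, 0.2, 0.0, 5.0, 0.3, 0.1]
end_logits   = [0.0, 0.1, 0.2, 0.4, 4.8, 0.2]

print(best_span(start_logits, end_logits))  # best span covers tokens 3..4
```

The full usage example below applies the same idea to real logits, restricted to the top-k start and end candidates for efficiency.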

Direct Model Usage

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

model_name = "r3iwan/kazakh-question-answering"
try:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
except Exception:
    # Fall back to the base model's tokenizer if the fine-tuned one is unavailable
    tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-multilingual-cased")

model = AutoModelForQuestionAnswering.from_pretrained(model_name)
model.eval()

# Context: "Kazakhstan is a state in Central Asia. Its capital is the city of Astana."
context = "Қазақстан - Орталық Азиядағы мемлекет. Оның астанасы Астана қаласы."
# Question: "Where is the capital of Kazakhstan?"
question = "Қазақстанның астанасы қайда?"

inputs = tokenizer(
    question, 
    context, 
    return_tensors="pt", 
    padding=True, 
    truncation="only_second", 
    max_length=512
)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    start_scores = outputs.start_logits[0]
    end_scores = outputs.end_logits[0]

num_candidates = 20
max_span_length = 30

# Get top start and end candidates
start_candidates = torch.topk(start_scores, num_candidates)
end_candidates = torch.topk(end_scores, num_candidates)

best_score = float('-inf')
best_start = 0
best_end = 0

for start_idx in start_candidates.indices:
    for end_idx in end_candidates.indices:
        if start_idx <= end_idx and (end_idx - start_idx) <= max_span_length:
            score = start_scores[start_idx].item() + end_scores[end_idx].item()
            if score > best_score:
                best_score = score
                best_start = start_idx.item()
                best_end = end_idx.item()

# Extract answer tokens
answer_tokens = inputs["input_ids"][0][best_start:best_end+1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

print(f"Question: {question}")
print(f"Answer: {answer}")

Training

The model was trained on the issai/kazqad dataset with the following parameters:

  • Base Model: google-bert/bert-base-multilingual-cased
  • Epochs: 3
  • Batch Size: 8 (per device)
  • Learning Rate: 2e-5 (with linear decay)
  • Weight Decay: 0.01
  • Max Sequence Length: 512
  • Dataset Splits:
    • Train: 3,163 examples
    • Validation: 764 examples
    • Test: 2,713 examples
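The learning-rate schedule listed above can be sketched as plain arithmetic. This is a minimal illustration assuming no warmup steps (the actual training configuration may have used one); `linear_decay_lr` is a hypothetical helper, not part of any library:

```python
# Linear decay from a peak learning rate of 2e-5 down to zero.

def linear_decay_lr(step, total_steps, peak_lr=2e-5):
    """Learning rate at a given optimizer step under linear decay."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return peak_lr * remaining

# 3 epochs x (3,163 examples / batch size 8) ~= 1,185 optimizer steps.
total_steps = 3 * (3163 // 8)
print(linear_decay_lr(0, total_steps))            # peak rate at the start
print(linear_decay_lr(total_steps, total_steps))  # 0.0 at the end
```

In practice the same schedule is produced by the default linear scheduler in the transformers `Trainer`.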

Training Progress

Epoch   Training Loss   Validation Loss
1       -               2.760520
2       2.372000        2.854473
3       1.347700        3.108089

The training loss decreased from 2.37 to 1.35, showing that the model fit the training data. The validation loss, however, rose from 2.76 to 3.11 over the same epochs, a sign of overfitting on the small training set; the epoch-1 checkpoint had the lowest validation loss.

Evaluation

The model was evaluated on the test set with the following results:

  • Evaluation Loss: 3.108
  • Evaluation Runtime: 76.97 seconds
  • Evaluation Samples per Second: 35.25
  • Evaluation Steps per Second: 4.42

Note: The evaluation loss is the cross-entropy loss on the test set. Exact Match (EM) and F1 scores were not computed; obtaining them would require an additional SQuAD-style metric pass over the model's predicted answer spans.
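For reference, the two standard QA metrics are simple to compute once predicted answer strings are available. The sketch below uses a simplified normalization (lowercasing and whitespace splitting); the official SQuAD evaluation script additionally strips punctuation and articles:

```python
# Sketch of SQuAD-style Exact Match and token-level F1.
from collections import Counter

def normalize(text):
    return text.lower().split()

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    pred, ref = normalize(prediction), normalize(reference)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Астана қаласы", "Астана қаласы"))  # 1.0
print(f1_score("Астана", "Астана қаласы"))            # partial credit, ~0.67
```

The `evaluate` library's `squad` metric implements the full official computation.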

Model Experiments

During development, two different base models were tested:

1. mT5-small (Initial Experiment)

  • Base Model: google/mt5-small
  • Final Validation Loss: 5.826
  • Result: Poor performance - model failed to learn the task effectively
  • Issues: mT5 is an encoder-decoder model designed for generative tasks, making it suboptimal for extractive QA. The validation loss remained high (~5.8-6.0) and answers were often extracted from the question rather than the context.

2. BERT-base-multilingual-cased (Final Model) ✅

  • Base Model: google-bert/bert-base-multilingual-cased
  • Final Validation Loss: 3.108
  • Result: Significant improvement - model learned the task successfully
  • Advantages: BERT's bidirectional encoder architecture is well-suited for extractive QA. The training loss decreased from 2.37 to 1.35, and the model produces correct answers from the context.

The BERT-based model shows much better performance and was selected as the final model for deployment.

Requirements

The following packages are required to use this model:

torch>=2.1.0
transformers==5.0.0rc1
datasets>=2.14.0,<3.0.0
accelerate>=0.24.0
tiktoken>=0.1.0
sentencepiece>=0.1.99
numpy>=1.25.2,<2.0.0
pandas>=2.0.0
tqdm>=4.66.0

Install them with:

pip install torch transformers datasets accelerate tiktoken sentencepiece numpy pandas tqdm

Or install from the project's requirements.txt:

pip install -r requirements.txt

Limitations

The model was trained on a limited dataset of 3,163 examples and performs best on text similar to its training data. For other domains or question types, additional fine-tuning may be required.

The model uses a maximum sequence length of 512 tokens, so longer contexts will be truncated.
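One common way around the truncation limit is a sliding window: split the context into overlapping chunks, score each chunk separately, and keep the best span across chunks. The helper below is an illustrative sketch over raw token-id lists (`sliding_windows` is hypothetical); in practice the transformers tokenizer offers the same behavior via its `stride` and `return_overflowing_tokens` options:

```python
# Sketch of sliding-window chunking for contexts longer than 512 tokens.

def sliding_windows(context_tokens, question_len, max_length=512, stride=128):
    """Yield overlapping windows of context token ids that fit the model."""
    # Room left for the context after the question and 3 special tokens
    # ([CLS] question [SEP] context [SEP]).
    window = max_length - question_len - 3
    step = window - stride  # consecutive windows overlap by `stride` tokens
    for start in range(0, len(context_tokens), step):
        yield context_tokens[start:start + window]
        if start + window >= len(context_tokens):
            break

# Toy token ids standing in for a 1,000-token context and a 20-token question.
chunks = list(sliding_windows(list(range(1000)), question_len=20))
print(len(chunks))  # number of overlapping windows covering the context
```

The overlap ensures an answer falling on a chunk boundary is still fully contained in at least one window.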

For detailed performance metrics (Exact Match and F1 score), additional evaluation with a question-answering metric computation function would be required.

Citations

If you use this model, please cite the KazQAD dataset:

@inproceedings{yeshpanov-etal-2024-kazqad,
    title = "{K}az{QAD}: {K}azakh Open-Domain Question Answering Dataset",
    author = "Yeshpanov, Rustem  and
      Efimov, Pavel  and
      Boytsov, Leonid  and
      Shalkarbayuli, Ardak  and
      Braslavski, Pavel",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.843",
    pages = "9645--9656"
}

Author

R3iwan

License

The model itself is released under Apache 2.0 license. However, please note that this model was fine-tuned on the KazQAD dataset, which is licensed under Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0). When using this model, you should comply with both licenses.

Model Size

Approximately 0.2B parameters (F32, stored as Safetensors).