# Model Card for SwahBERT-KenSwQuAD-baseline
## Model Summary

This model is a SwahBERT-base-cased model fine-tuned on the KenSwQuAD dataset for extractive question answering in Swahili.

Unlike sequence-to-sequence approaches, this model uses a span-extraction architecture (BERT-style): it predicts the start and end positions of the answer within the context.
## Model Details

| Property | Value |
|----------|-------|
| Developed by | Benjamin (kikwaib) |
| Model Type | BERT-based Encoder (Extractive QA) |
| Base Model | pranaydeeps/SwahBERT-base-cased |
| Language(s) | Swahili (sw) |
| Task | Extractive Question Answering |
| License | Apache 2.0 |
## Intended Use

### Primary Use Cases

- **Swahili Question Answering**: extract answer spans from Swahili text given a question
- **Research**: a baseline for Swahili extractive QA experiments
- **Comparison**: a benchmark against generative QA approaches (mT5-based models)
## How to Use

```python
from transformers import pipeline

qa_pipeline = pipeline("question-answering", model="kikwaib/SwahBERT-KenSwQuAD-baseline")

context = """
Mji wa Dar es Salaam ni mji mkubwa na wenye watu wengi nchini Tanzania.
Ni bandari muhimu na kitovu cha kiuchumi cha nchi.
Lugha kuu zinazozungumzwa ni Kiswahili na Kiingereza.
"""

# "Which are the main languages spoken?"
question = "Lugha kuu zinazozungumzwa ni zipi?"

result = qa_pipeline(question=question, context=context)
print(f"Answer: {result['answer']}")
print(f"Score: {result['score']:.4f}")
```
### Alternative Usage (Manual)

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "kikwaib/SwahBERT-KenSwQuAD-baseline"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# "Kenya is a country in East Africa. Nairobi is the capital of Kenya."
context = "Kenya ni nchi ya Afrika Mashariki. Nairobi ni mji mkuu wa Kenya."
# "What is the capital of Kenya?"
question = "Mji mkuu wa Kenya ni upi?"

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start and end token positions for the answer span
answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits) + 1
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end])
)
print(f"Answer: {answer}")
```
## Limitations

- **Extractive only**: can only return text that appears verbatim in the context
- **Answer must be contiguous**: cannot combine information from multiple locations
- **Context length limit**: a maximum of 384 tokens per window, with a 128-token stride for long documents
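The context length limit is handled at preprocessing time by splitting long contexts into overlapping windows. The sliding-window logic can be sketched on plain token lists as follows (a simplified illustration; the actual pipeline uses the tokenizer's overflow handling, with the same max length of 384 and stride of 128):

```python
def sliding_windows(tokens, max_len=384, stride=128):
    """Split a token list into overlapping chunks so every token
    appears in at least one window of at most max_len tokens."""
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        # Advance by (max_len - stride) so consecutive windows
        # overlap by exactly `stride` tokens.
        start += max_len - stride
    return windows

# A 600-token document becomes two overlapping windows
chunks = sliding_windows(list(range(600)))
print([len(c) for c in chunks])  # [384, 344]
```

Because each window is scored independently, an answer that straddles a window boundary is still fully contained in at least one window as long as it is shorter than the stride.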
## Training Data

### Dataset: KenSwQuAD

The model was fine-tuned on the full KenSwQuAD (Kenya Swahili Question Answering Dataset).

| Statistic | Value |
|-----------|-------|
| Source File | `KenSwQuAD_final_7526_QA_pairs_csv.csv` |
| Context Files | `kenswquad_utf8/*.txt` |
| Train/Validation Split | 90% / 10% |
| Random Seed | 42 |
### Data Processing

- Questions and answers extracted from the CSV
- Contexts loaded from the corresponding `.txt` files
- Answer start positions computed by finding the answer text in the context
- QA pairs whose answer was not found in the context were skipped
- Long contexts handled with a sliding window (stride = 128 tokens)
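The answer-alignment step above can be sketched as follows (function and field names are illustrative, not the actual training script; the output layout mirrors the SQuAD-style `answers` format):

```python
def align_qa_pair(context, question, answer):
    """Locate the answer span in the context; return None when the
    answer text does not occur verbatim (such pairs are skipped)."""
    start = context.find(answer)
    if start == -1:
        return None  # answer not found in context -> drop this pair
    return {
        "question": question,
        "context": context,
        "answers": {"text": [answer], "answer_start": [start]},
    }

ctx = "Nairobi ni mji mkuu wa Kenya."
print(align_qa_pair(ctx, "Mji mkuu wa Kenya ni upi?", "Nairobi"))
print(align_qa_pair(ctx, "Rais ni nani?", "Uhuru"))  # None -> skipped
```

Note that a plain `str.find` keeps only the first occurrence of the answer, which can mislabel spans when the answer string repeats in the context; this is a known simplification of this style of alignment.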
## Training Procedure

### Hardware

| Component | Specification |
|-----------|---------------|
| Platform | Google Colab |
| GPU | Tesla T4 / P100 (variable) |
### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 2e-5 |
| Train Batch Size | 8 |
| Eval Batch Size | 8 |
| Gradient Accumulation Steps | 2 |
| Effective Batch Size | 16 |
| Epochs | 3 |
| Weight Decay | 0.01 |
| FP16 | Enabled |
| Max Sequence Length | 384 tokens |
| Doc Stride | 128 tokens |
| Optimizer | AdamW |
### Training Strategy

- **Evaluation Strategy**: per epoch
- **Save Strategy**: per epoch
- **Best Model Selection**: based on validation loss
- **Hub Push**: enabled during training
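Taken together, the hyperparameters and strategy above correspond roughly to the following `transformers` configuration. This is a sketch, not the exact training script; parameter names follow recent `transformers` releases (older releases use `evaluation_strategy` instead of `eval_strategy`), and the output directory name is illustrative:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="SwahBERT-KenSwQuAD-baseline",  # illustrative name
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size: 8 * 2 = 16
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # best checkpoint by validation loss
    push_to_hub=True,
)
```

The max sequence length (384) and doc stride (128) are applied on the tokenizer side during preprocessing rather than through `TrainingArguments`.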
## Evaluation Results

The model was evaluated using the official SQuAD metrics:

| Metric | Score |
|--------|-------|
| Exact Match | 15.62% |
| F1 Score | 16.90% |
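For reference, the per-example computation behind these two metrics reduces to the following. This is a simplified sketch of the official SQuAD scoring: the real script also strips punctuation and English articles during normalization, which is less relevant for Swahili:

```python
from collections import Counter

def normalize(text):
    """Lowercase and collapse whitespace (simplified SQuAD normalization)."""
    return " ".join(text.lower().split())

def exact_match(prediction, reference):
    """1.0 when the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-level F1: harmonic mean of precision and recall over
    the multiset of shared tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("mji mkuu", "Mji Mkuu"))        # 1.0
print(f1_score("mji mkuu wa Kenya", "mji mkuu"))  # ≈ 0.667
```

Corpus-level Exact Match and F1 are the averages of these per-example scores over the evaluation set.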
## Comparison with Generative Approaches

This model represents a baseline extractive approach for Swahili QA. For comparison with generative (seq2seq) approaches, see the Related Models section below.
### Key Differences

| Aspect | SwahBERT (Extractive) | mT5 (Generative) |
|--------|-----------------------|------------------|
| Architecture | Encoder-only | Encoder-Decoder |
| Output | Span indices | Generated text |
| Answer source | Must be in context | Can paraphrase |
| Training signal | Start/end positions | Token generation |
| Inference speed | Faster | Slower |
## Framework Versions

| Library | Version |
|---------|---------|
| Transformers | Latest |
| Datasets | Latest |
| Evaluate | Latest |
| Accelerate | Latest |
## Citation

If you use this model, please cite the KenSwQuAD research:

## Related Models

**Training Date:** December 2025