Model Card for SwahBERT-KenSwQuAD-baseline

Model Summary

This model is a SwahBERT-base-cased model fine-tuned on the KenSwQuAD dataset for extractive question answering in Swahili.

Unlike sequence-to-sequence approaches, this model uses a span-extraction architecture (BERT-style): it predicts the start and end positions of the answer within the context.

Model Details

Property Value
Developed by Benjamin (kikwaib)
Model Type BERT-based Encoder (Extractive QA)
Base Model pranaydeeps/SwahBERT-base-cased
Language(s) Swahili (sw)
Task Extractive Question Answering
License Apache 2.0

Intended Use

Primary Use Cases

  • Swahili Question Answering: Extract answer spans from Swahili text given a question
  • Research: Baseline for Swahili extractive QA experiments
  • Comparison: Benchmark against generative QA approaches (mT5-based models)

How to Use

from transformers import pipeline

# Load the QA pipeline
qa_pipeline = pipeline("question-answering", model="kikwaib/SwahBERT-KenSwQuAD-baseline")

# Example usage
context = """
Mji wa Dar es Salaam ni mji mkubwa na wenye watu wengi nchini Tanzania.
Ni bandari muhimu na kitovu cha kiuchumi cha nchi.
Lugha kuu zinazozungumzwa ni Kiswahili na Kiingereza.
"""

question = "Lugha kuu zinazozungumzwa ni zipi?"

result = qa_pipeline(question=question, context=context)
print(f"Answer: {result['answer']}")
print(f"Score: {result['score']:.4f}")
# Expected Output: "Kiswahili na Kiingereza"

Alternative Usage (Manual)

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

model_name = "kikwaib/SwahBERT-KenSwQuAD-baseline"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

context = "Kenya ni nchi ya Afrika Mashariki. Nairobi ni mji mkuu wa Kenya."
question = "Mji mkuu wa Kenya ni upi?"

inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)

# Get the answer span (naive: argmax of start and end logits taken
# independently, so nothing prevents end < start on hard examples)
answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits) + 1
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end])
)
print(f"Answer: {answer}")
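The independent argmax in the snippet above can return an end position before the start position. A more robust decoding step scores every valid (start, end) pair instead; the sketch below (with a hypothetical helper `best_span`) operates on plain Python lists, so in practice you would pass it e.g. `outputs.start_logits[0].tolist()`:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick the (start, end) pair maximizing start_logits[s] + end_logits[e],
    subject to s <= e < s + max_answer_len. A naive independent argmax over
    each list can return end < start."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

# Toy logits where independent argmax would give start=3, end=1 (invalid).
start_logits = [0.1, 0.2, 0.0, 5.0, 0.3]
end_logits = [0.0, 4.0, 0.1, 0.2, 3.0]
print(best_span(start_logits, end_logits))  # (3, 4)
```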

Limitations

  • Extractive only: Can only return text that appears verbatim in the context
  • Answer must be contiguous: Cannot combine information from multiple locations
  • Context length limit: inputs are truncated to 384 tokens; longer documents are split into overlapping windows with a 128-token stride
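The sliding-window behaviour in the last bullet can be illustrated with a small sketch. This is a simplification: the real tokenizer (via `return_overflowing_tokens=True` and `stride=128`) also reserves room in each window for the question and special tokens, but the overlap pattern is the same:

```python
def sliding_windows(num_tokens, max_length=384, stride=128):
    """Yield (start, end) token windows: each window holds up to max_length
    tokens and overlaps the previous window by `stride` tokens."""
    step = max_length - stride
    start = 0
    while True:
        end = min(start + max_length, num_tokens)
        yield (start, end)
        if end == num_tokens:
            break
        start += step

# A 900-token document is split into four overlapping windows.
print(list(sliding_windows(900)))
# [(0, 384), (256, 640), (512, 896), (768, 900)]
```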

Training Data

Dataset: KenSwQuAD

The model was fine-tuned on the full KenSwQuAD (Kenya Swahili Question Answering Dataset).

Statistic Value
Source File KenSwQuAD_final_7526_QA_pairs_csv.csv
Context Files kenswquad_utf8/*.txt
Train/Validation Split 90% / 10%
Random Seed 42

Data Processing

  • Questions and answers extracted from CSV
  • Context loaded from corresponding .txt files
  • Answer start positions computed by finding answer text in context
  • QA pairs where answer not found in context were skipped
  • Long contexts handled with sliding window (stride=128 tokens)
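The answer-position and skip steps above can be sketched as follows (a minimal illustration; the function name and the exact record layout are assumptions, not the training script):

```python
def add_answer_positions(examples):
    """For each QA pair, locate the answer string in its context with
    str.find; drop pairs whose answer never appears verbatim, since an
    extractive model cannot be supervised on them."""
    kept = []
    for ex in examples:
        start = ex["context"].find(ex["answer"])
        if start == -1:
            continue  # answer not found verbatim in context: skip the pair
        kept.append({**ex, "answer_start": start})
    return kept

examples = [
    {"context": "Nairobi ni mji mkuu wa Kenya.", "question": "Mji mkuu?", "answer": "Nairobi"},
    {"context": "Kenya iko Afrika Mashariki.", "question": "Wapi?", "answer": "Ulaya"},
]
print(add_answer_positions(examples))
# Only the first pair survives, with answer_start=0.
```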

Training Procedure

Hardware

Component Specification
Platform Google Colab
GPU Tesla T4 / P100 (variable)

Hyperparameters

Parameter Value
Learning Rate 2e-5
Train Batch Size 8
Eval Batch Size 8
Gradient Accumulation Steps 2
Effective Batch Size 16
Epochs 3
Weight Decay 0.01
FP16 Enabled
Max Sequence Length 384 tokens
Doc Stride 128 tokens
Optimizer AdamW

Training Strategy

  • Evaluation Strategy: Per epoch
  • Save Strategy: Per epoch
  • Best Model Selection: Based on validation loss
  • Hub Push: Enabled during training
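The hyperparameters and training strategy above map onto a `transformers.TrainingArguments` configuration roughly as follows. This is a reconstruction from the tables, not the actual training script; argument names follow recent transformers releases (older releases spell `eval_strategy` as `evaluation_strategy`):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="SwahBERT-KenSwQuAD-baseline",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,         # lower validation loss is better
    push_to_hub=True,
)
```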

Evaluation Results

The model was evaluated using the official SQuAD metrics:

Metric Score
Exact Match 15.62%
F1 Score 16.90%
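For intuition, the two SQuAD metrics can be re-implemented in a simplified form as below (the reported numbers were presumably computed with the official SQuAD evaluation; note the official script also drops English articles during normalization, which is irrelevant for Swahili):

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-level F1: harmonic mean of precision and recall
    over the multiset of overlapping tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Kiswahili na Kiingereza", "Kiswahili na Kiingereza."))  # 1.0
print(f1_score("lugha ya Kiswahili", "Kiswahili na Kiingereza"))  # 0.333...
```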

Comparison with Generative Approaches

This model represents a baseline extractive approach for Swahili QA. For comparison with generative (seq2seq) approaches, see:

Model Approach Metric Score
SwahBERT-KenSwQuAD-baseline (This) Extractive F1 16.90%
mt5-base-kenswquad-extractive Generative BLEU 48.99

Note that F1 (token overlap with the gold span) and BLEU (n-gram precision against the reference) measure different things, so the two scores are not directly comparable.

Key Differences

Aspect SwahBERT (Extractive) mT5 (Generative)
Architecture Encoder-only Encoder-Decoder
Output Span indices Generated text
Answer source Must be in context Can paraphrase
Training signal Start/end positions Token generation
Inference speed Faster Slower

Framework Versions

Transformers, Datasets, Evaluate, and Accelerate (exact versions not pinned; the then-current releases were used).

Citation

If you use this model, please cite the KenSwQuAD research.

Related Models

Model Description
pranaydeeps/SwahBERT-base-cased Base pretrained model
kikwaib/mt5-base-kenswquad-extractive Generative QA (Stage 2)
kikwaib/mt5-base-squad-transfer English SQuAD transfer (Stage 1)

Training Date: December 2025

Model Size: ~0.1B parameters (F32, safetensors)