# Model Card for SwahBERT-KenSwQuAD-baseline
## Model Summary

This model is a SwahBERT-base-cased model fine-tuned on the KenSwQuAD dataset for extractive question answering in Swahili.

Unlike sequence-to-sequence approaches, this model uses a span-extraction architecture (BERT-style): it predicts the start and end positions of the answer within the context.
## Model Details

| Property | Value |
|----------|-------|
| Developed by | Benjamin (kikwaib) |
| Model Type | BERT-based Encoder (Extractive QA) |
| Base Model | pranaydeeps/SwahBERT-base-cased |
| Language(s) | Swahili (sw) |
| Task | Extractive Question Answering |
| License | Apache 2.0 |
## Intended Use

### Primary Use Cases

- **Swahili Question Answering**: extract answer spans from Swahili text given a question
- **Research**: a baseline for Swahili extractive QA experiments
- **Comparison**: a benchmark against generative QA approaches (mT5-based models)
## How to Use

```python
from transformers import pipeline

qa_pipeline = pipeline("question-answering", model="kikwaib/SwahBERT-KenSwQuAD-baseline")

context = """
Mji wa Dar es Salaam ni mji mkubwa na wenye watu wengi nchini Tanzania.
Ni bandari muhimu na kitovu cha kiuchumi cha nchi.
Lugha kuu zinazozungumzwa ni Kiswahili na Kiingereza.
"""

# "Which are the main languages spoken?"
question = "Lugha kuu zinazozungumzwa ni zipi?"

result = qa_pipeline(question=question, context=context)
print(f"Answer: {result['answer']}")
print(f"Score: {result['score']:.4f}")
```
### Alternative Usage (Manual)

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "kikwaib/SwahBERT-KenSwQuAD-baseline"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# "Kenya is a country in East Africa. Nairobi is the capital of Kenya."
context = "Kenya ni nchi ya Afrika Mashariki. Nairobi ni mji mkuu wa Kenya."
# "What is the capital of Kenya?"
question = "Mji mkuu wa Kenya ni upi?"

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start and end token positions for the answer span
answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits) + 1
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][answer_start:answer_end])
)
print(f"Answer: {answer}")
```
## Limitations

- **Extractive only**: can only return text that appears verbatim in the context
- **Answer must be contiguous**: cannot combine information from multiple locations
- **Context length limit**: a maximum of 384 tokens per window, with a 128-token stride for long documents
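The context length limit is handled at preprocessing time by splitting long contexts into overlapping windows. The sliding-window logic can be sketched on plain token lists as follows (a simplified illustration; the actual pipeline uses the tokenizer's overflow handling, with the same max length of 384 and stride of 128):

```python
def sliding_windows(tokens, max_len=384, stride=128):
    """Split a token list into overlapping chunks so every token
    appears in at least one window of at most max_len tokens."""
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        # Advance by (max_len - stride) so consecutive windows
        # overlap by exactly `stride` tokens.
        start += max_len - stride
    return windows

# A 600-token document becomes two overlapping windows
chunks = sliding_windows(list(range(600)))
print([len(c) for c in chunks])  # [384, 344]
```

Because each window is scored independently, an answer that straddles a window boundary is still fully contained in at least one window as long as it is shorter than the stride.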
## Training Data

### Dataset: KenSwQuAD

The model was fine-tuned on the full KenSwQuAD (Kenya Swahili Question Answering Dataset).

| Statistic | Value |
|-----------|-------|
| Source File | `KenSwQuAD_final_7526_QA_pairs_csv.csv` |
| Context Files | `kenswquad_utf8/*.txt` |
| Train/Validation Split | 90% / 10% |
| Random Seed | 42 |
### Data Processing

- Questions and answers extracted from the CSV
- Contexts loaded from the corresponding `.txt` files
- Answer start positions computed by finding the answer text in the context
- QA pairs whose answer was not found in the context were skipped
- Long contexts handled with a sliding window (stride = 128 tokens)
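The answer-alignment step above can be sketched as follows (function and field names are illustrative, not the actual training script; the output layout mirrors the SQuAD-style `answers` format):

```python
def align_qa_pair(context, question, answer):
    """Locate the answer span in the context; return None when the
    answer text does not occur verbatim (such pairs are skipped)."""
    start = context.find(answer)
    if start == -1:
        return None  # answer not found in context -> drop this pair
    return {
        "question": question,
        "context": context,
        "answers": {"text": [answer], "answer_start": [start]},
    }

ctx = "Nairobi ni mji mkuu wa Kenya."
print(align_qa_pair(ctx, "Mji mkuu wa Kenya ni upi?", "Nairobi"))
print(align_qa_pair(ctx, "Rais ni nani?", "Uhuru"))  # None -> skipped
```

Note that a plain `str.find` keeps only the first occurrence of the answer, which can mislabel spans when the answer string repeats in the context; this is a known simplification of this style of alignment.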
## Training Procedure

### Hardware

| Component | Specification |
|-----------|---------------|
| Platform | Google Colab |
| GPU | Tesla T4 / P100 (variable) |
### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 2e-5 |
| Train Batch Size | 8 |
| Eval Batch Size | 8 |
| Gradient Accumulation Steps | 2 |
| Effective Batch Size | 16 |
| Epochs | 3 |
| Weight Decay | 0.01 |
| FP16 | Enabled |
| Max Sequence Length | 384 tokens |
| Doc Stride | 128 tokens |
| Optimizer | AdamW |
### Training Strategy

- **Evaluation Strategy**: per epoch
- **Save Strategy**: per epoch
- **Best Model Selection**: based on validation loss
- **Hub Push**: enabled during training
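Taken together, the hyperparameters and strategy above correspond roughly to the following `transformers` configuration. This is a sketch, not the exact training script; parameter names follow recent `transformers` releases (older releases use `evaluation_strategy` instead of `eval_strategy`), and the output directory name is illustrative:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="SwahBERT-KenSwQuAD-baseline",  # illustrative name
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size: 8 * 2 = 16
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # best checkpoint by validation loss
    push_to_hub=True,
)
```

The max sequence length (384) and doc stride (128) are applied on the tokenizer side during preprocessing rather than through `TrainingArguments`.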
## Evaluation Results

The model was evaluated using the official SQuAD metrics:

| Metric | Score |
|--------|-------|
| Exact Match | 15.62% |
| F1 Score | 16.90% |
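For reference, the per-example computation behind these two metrics reduces to the following. This is a simplified sketch of the official SQuAD scoring: the real script also strips punctuation and English articles during normalization, which is less relevant for Swahili:

```python
from collections import Counter

def normalize(text):
    """Lowercase and collapse whitespace (simplified SQuAD normalization)."""
    return " ".join(text.lower().split())

def exact_match(prediction, reference):
    """1.0 when the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-level F1: harmonic mean of precision and recall over
    the multiset of shared tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("mji mkuu", "Mji Mkuu"))        # 1.0
print(f1_score("mji mkuu wa Kenya", "mji mkuu"))  # ≈ 0.667
```

Corpus-level Exact Match and F1 are the averages of these per-example scores over the evaluation set.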
## Comparison with Generative Approaches

This model represents a baseline extractive approach for Swahili QA. For comparison with generative (seq2seq) approaches, see the Related Models section below.
### Key Differences

| Aspect | SwahBERT (Extractive) | mT5 (Generative) |
|--------|-----------------------|------------------|
| Architecture | Encoder-only | Encoder-Decoder |
| Output | Span indices | Generated text |
| Answer source | Must be in context | Can paraphrase |
| Training signal | Start/end positions | Token generation |
| Inference speed | Faster | Slower |
## Framework Versions

| Library | Version |
|---------|---------|
| Transformers | Latest |
| Datasets | Latest |
| Evaluate | Latest |
| Accelerate | Latest |
## Citation

If you use this model, please cite the KenSwQuAD research:

## Related Models

**Training Date:** December 2025