
Kinyarwanda BERT and ColBERT models

In Rwanda, many farmers struggle to access timely, personalized agricultural information. Traditional channels such as radio, TV, and online sources offer limited reach and interactivity, while extension services and a national call center, staffed by only two agents for over two million farmers, face capacity constraints. To address these gaps, we developed a 24/7 AI-enabled Interactive Voice Response (IVR) tool. Accessible via a Kinyarwanda-speaking hotline, the tool provides advisory on topics such as pest and disease diagnosis and agro-climatic practices, as well as information on MINAGRI's support programs for farmers, e.g. crop insurance. By combining AI and IVR technology, this project makes agricultural advisories more accessible, timely, and responsive to farmers' needs. For more information, please reach out to C4IR.

Implemented by: C4IR Rwanda & KiNLP; Supported by: GIZ; Financed by: BMZ.

Introduction

This repository provides pre-trained foundational models for Kinyarwanda passage retrieval/ranking. Running these models requires the DeepKIN-AgAI package.

Example uses:

1. Fine-tuning a pretrained KinyaBERT model into a KinyaColBERT retrieval model

The following example uses a pre-trained KinyaBERT base model (107M parameters).

The training data for agricultural retrieval (i.e. "C4IR-RW/kinya-ag-retrieval" on Hugging Face) has already been morphologically parsed; for other datasets, MorphoKIN parsing must be performed first.
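The `--qa_train_qpn_triples` file below is read elsewhere in this card with `line.split('\t')`, taking columns 0 and 1 as query id and positive passage id. Assuming "qpn" stands for query, positive passage, negative passage and that rows are tab-separated in that order (an inference from the evaluation scripts, not a documented spec), a row can be parsed like this:

```python
def parse_qpn_line(line: str):
    """Split one qpn triplet row into (query_id, positive_passage_id,
    negative_passage_id). Column order is inferred from how the
    evaluation scripts index the file, not from a published schema."""
    query_id, pos_id, neg_id = line.rstrip('\n').split('\t')[:3]
    return query_id, pos_id, neg_id

# Hypothetical row for illustration; real ids come from the dataset files.
print(parse_qpn_line('q_001\tp_042\tp_777'))  # ('q_001', 'p_042', 'p_777')
```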


# 1. Copy "kinya-ag-retrieval" dataset from Hugging Face into a local directory, e.g. /home/ubuntu/DATA/kinya-ag-retrieval/

# 2. Copy "kinyabert_base_pretrained.pt" model into a local directory, e.g. /home/ubuntu/DATA/kinyabert_base_pretrained.pt

# 3. Run the following training script from DeepKIN-AgAI package:

python3 DeepKIN-AgAI/deepkin/train/flex_trainer.py  \
    --model_variant="kinya_colbert:base" \
    --colbert_embedding_dim=512 \
    --gpus=1 \
    --batch_size=12  \
    --accumulation_steps=10  \
    --dataloader_num_workers=4  \
    --dataloader_persistent_workers=True  \
    --dataloader_pin_memory=True  \
    --use_ddp=False \
    --use_mtl_optimizer=False \
    --warmup_iter=2000 \
    --peak_lr=1e-5  \
    --lr_decay_style="cosine" \
    --num_iters=152630  \
    --dataset_max_seq_len=512  \
    --use_iterable_dataset=False  \
    --train_log_steps=1  \
    --checkpoint_steps=1000 \
    --pretrained_bert_model_file="/home/ubuntu/DATA/kinyabert_base_pretrained.pt" \
    --qa_train_query_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_query_id.txt" \
    --qa_train_query_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_query_text.txt" \
    --qa_train_passage_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_passage_id.txt" \
    --qa_train_passage_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_passage_text.txt" \
    --qa_train_qpn_triples="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_all.tsv" \
    --qa_dev_query_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_query_id.txt" \
    --qa_dev_query_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_query_text.txt" \
    --qa_dev_passage_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_passage_id.txt" \
    --qa_dev_passage_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_passage_text.txt" \
    --qa_dev_qpn_triples="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_dev.tsv" \
    --load_saved_model=True  \
    --model_save_path="/home/ubuntu/DATA/kinya_colbert_base_rw_ag_retrieval_new.pt"
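
The flags `--warmup_iter=2000`, `--peak_lr=1e-5`, `--lr_decay_style="cosine"`, and `--num_iters=152630` describe a standard warmup-then-cosine-decay learning-rate schedule. The actual schedule is defined inside DeepKIN-AgAI; the sketch below only illustrates the common interpretation of these flags (linear warmup to the peak, then cosine decay to zero):

```python
import math

def lr_at(step, peak_lr=1e-5, warmup_iter=2000, num_iters=152630):
    """Illustrative warmup + cosine-decay schedule; an assumption about
    what the trainer flags mean, not DeepKIN-AgAI's own code."""
    if step < warmup_iter:
        return peak_lr * step / warmup_iter                      # linear warmup
    progress = (step - warmup_iter) / (num_iters - warmup_iter)  # 0 -> 1
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay

print(lr_at(1000))    # halfway through warmup: 5e-06
print(lr_at(152630))  # end of training: 0.0
```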
  

2. Running an API server for KinyaColBERT agricultural retrieval

  1. First, run the MorphoKIN server on a Unix domain socket:

# Launch a daemon container

docker run -d -v /home/ubuntu/MORPHODATA:/MORPHODATA \
  --gpus all morphokin:latest morphokin \
  --morphokin_working_dir /MORPHODATA \
  --morphokin_config_file /MORPHODATA/data/analysis_config_file.conf  \
  --task RMS \
  --kinlp_license /MORPHODATA/licenses/KINLP_LICENSE_FILE.dat  \
  --ca_roots_pem_file /MORPHODATA/data/roots.pem \
  --morpho_socket /MORPHODATA/run/morpho.sock

  2. Wait for the MorphoKIN socket server to be ready by monitoring the container logs.

docker container ls

docker logs -f <CONTAINER ID>

# MorphoKIN server is ready once you see a message like this: MorphoKin server listening on UNIX SOCKET: /MORPHODATA/run/morpho.sock

  3. Then, run the retrieval API server:

mkdir -p /home/ubuntu/DATA/agai_index

python3 DeepKIN-AgAI/deepkin/production/agai_backend.py

3. Evaluating KinyaColBERT pre-trained model on "C4IR-RW/kinya-ag-retrieval"

import progressbar
import torch
import torch.nn.functional as F

from deepkin.clib.libkinlp.kinlpy import ParsedFlexSentence
from deepkin.data.morpho_qa_triple_data import DOCUMENT_TYPE_ID, QUESTION_TYPE_ID
from deepkin.models.kinyabert import KinyaColBERT
from deepkin.utils.misc_functions import read_lines

DATA_DIR = '/home/ubuntu/DATA'
rank = 0
pretrained_model_file = f'{DATA_DIR}/kinya_colbert_large_rw_ag_retrieval_finetuned_512D.pt'
keyword = 'kinya_colbert_large'

qa_query_id = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_query_id.txt'
qa_query_text = f'{DATA_DIR}/kinya-ag-retrieval/parsed_rw_ag_retrieval_query_text.txt'
qa_passage_id = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_passage_id.txt'
qa_passage_text = f'{DATA_DIR}/kinya-ag-retrieval/parsed_rw_ag_retrieval_passage_text.txt'

all_queries = {idx: ParsedFlexSentence(txt) for idx, txt in zip(read_lines(qa_query_id), read_lines(qa_query_text))}
all_passages = {idx: ParsedFlexSentence(txt) for idx, txt in zip(read_lines(qa_passage_id), read_lines(qa_passage_text))}

print(f'Got: {len(all_queries)} queries, {len(all_passages)} passages', flush=True)

device = torch.device('cuda:%d' % rank)

model, args = KinyaColBERT.from_pretrained(device, pretrained_model_file, ret_args=True)
model.float()
model.eval()

passage_embeddings = dict()
DocPool = None
QueryPool = None
with torch.no_grad():
    print(f'{keyword} Embedding passages ...', flush=True)
    with progressbar.ProgressBar(max_value=len(all_passages), redirect_stdout=True) as bar:
        for itr, (passage_id, passage) in enumerate(all_passages.items()):
            if (itr % 100) == 0:
                bar.update(itr)
            passage.trim(508)
            D = model.get_colbert_embeddings([passage], DOCUMENT_TYPE_ID)
            DocPool = D.view(-1,D.size(-1)) if DocPool is None else torch.cat((DocPool, D.view(-1,D.size(-1))))
            passage_embeddings[passage_id] = D

    query_embeddings = dict()
    Doc_Mean = DocPool.mean(dim=0)
    Doc_Stdev = DocPool.std(dim=0)
    del DocPool
    print(f'{keyword} Embedding queries ...', flush=True)
    with progressbar.ProgressBar(max_value=len(all_queries), redirect_stdout=True) as bar:
        for itr, (query_id, query) in enumerate(all_queries.items()):
            if (itr % 1000) == 0:
                bar.update(itr)
            query.trim(508)
            Q = model.get_colbert_embeddings([query], QUESTION_TYPE_ID)
            QueryPool = Q.view(-1, Q.size(-1)) if QueryPool is None else torch.cat((QueryPool, Q.view(-1, Q.size(-1))))
            query_embeddings[query_id] = Q

    Query_Mean = QueryPool.mean(dim=0)
    Query_Stdev = QueryPool.std(dim=0)
    del QueryPool

    dev_triples = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_dev.tsv'
    test_triples = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_test.tsv'

    EVAL_SETS = [('DEV', dev_triples),
                 ('TEST', test_triples)]

for eval_set_name, eval_qpn_triples in EVAL_SETS:
    eval_query_to_passage_ids = {(line.split('\t')[0]): (line.split('\t')[1]) for line in read_lines(eval_qpn_triples)}
    Top = [1, 5, 10, 20, 30]
    TopAcc = [0.0 for _ in Top]
    MTop = [5, 10, 20, 30]
    MRR = [0.0 for _ in MTop]
    Total = 0.0
    for itr,(query_id,target_doc_id) in enumerate(eval_query_to_passage_ids.items()):
        query = all_queries[query_id]
        with torch.no_grad():
            Q = model.get_colbert_embeddings([query], QUESTION_TYPE_ID)
        Q = (Q - Query_Mean) / Query_Stdev
        Q = F.normalize(Q, p=2, dim=2)
        results = []
        for doc_id,D in passage_embeddings.items():
            D = (D - Doc_Mean) / Doc_Stdev
            D = F.normalize(D, p=2, dim=2)
            with torch.no_grad():
                score = model.pairwise_score(Q,D).squeeze().item()
            score = score / Q.size(1)
            results.append((score, doc_id))
        Total += 1.0
        results = sorted(results, key=lambda x: x[0], reverse=True)
        for i, t in enumerate(Top):
            TopAcc[i] += (1.0 if (target_doc_id in {idx for sc, idx in results[:t]}) else 0.0)
        for i, t in enumerate(MTop):
            top_rr = [(1 / (i + 1)) for i, (sc, idx) in enumerate(results[:t]) if idx == target_doc_id]
            MRR[i] += (top_rr[0] if (len(top_rr) > 0) else 0.0)
    print(f'-------------------------------------------------------------------------------------------------')
    for i, t in enumerate(Top):
        print(f'@{eval_set_name} Final {keyword}-{args.colbert_embedding_dim} kinya-ag-retrieval {eval_set_name} Set Top#{t} Accuracy:',
              f'{(100.0 * TopAcc[i] / Total): .1f}% ({TopAcc[i]:.0f} / {Total:.0f})')
    for i, t in enumerate(MTop):
        print(f'@{eval_set_name} Final {keyword}-{args.colbert_embedding_dim} kinya-ag-retrieval {eval_set_name} Set MRR@{t}:',
              f'{(100.0 * MRR[i] / Total): .1f}% ({MRR[i]:.0f} / {Total:.0f})')
    print(f'-------------------------------------------------------------------------------------------------', flush=True)
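
The `pairwise_score` call above implements ColBERT-style late interaction (MaxSim) scoring. As an illustration only, not the DeepKIN-AgAI implementation, MaxSim over L2-normalized token embeddings can be sketched in plain PyTorch: for each query token, take the maximum cosine similarity over all document tokens, then sum across query tokens.

```python
import torch
import torch.nn.functional as F

def maxsim_score(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """ColBERT late-interaction score (illustrative sketch).
    Q: (num_query_tokens, dim), D: (num_doc_tokens, dim)."""
    Q = F.normalize(Q, p=2, dim=-1)
    D = F.normalize(D, p=2, dim=-1)
    sim = Q @ D.T                       # cosine similarity matrix
    return sim.max(dim=-1).values.sum() # best doc token per query token

# Toy check: a "document" containing the exact query vectors should
# outscore a purely random one.
torch.manual_seed(0)
Q = torch.randn(4, 8)
D_match = torch.cat([Q, torch.randn(6, 8)])  # contains all query vectors
D_rand = torch.randn(10, 8)
print(maxsim_score(Q, D_match) > maxsim_score(Q, D_rand))
```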

4. Evaluating pre-trained RAGatouille ColBERT model on "C4IR-RW/kinya-ag-retrieval"

from deepkin.utils.misc_functions import read_lines
from ragatouille import RAGPretrainedModel

keyword = 'agai-colbert-10000'
print(f'Evaluating {keyword} ...', flush=True)
qa_query_id = 'kinya-ag-retrieval/rw_ag_retrieval_query_id.txt'
qa_query_text = 'kinya-ag-retrieval/rw_ag_retrieval_query_text.txt'

all_queries = {idx: txt for idx, txt in zip(read_lines(qa_query_id), read_lines(qa_query_text))}

print(f'Got: {len(all_queries)} queries', flush=True)

RAG = RAGPretrainedModel.from_index(f'ragatouille-kinya-colbert/indexes/agai-colbert-10000/')

dev_triples = 'kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_dev.tsv'
test_triples = 'kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_test.tsv'

EVAL_SETS = [('DEV', dev_triples),
             ('TEST', test_triples)]

for eval_set_name, eval_qpn_triples in EVAL_SETS:
    eval_query_to_passage_ids = {(line.split('\t')[0]): (line.split('\t')[1]) for line in
                                 read_lines(eval_qpn_triples)}
    Top = [1, 5, 10, 20, 30]
    TopAcc = [0.0 for _ in Top]
    MTop = [5, 10, 20, 30]
    MRR = [0.0 for _ in MTop]
    Total = 0.0
    for itr, (query_id, target_doc_id) in enumerate(eval_query_to_passage_ids.items()):
        query = all_queries[query_id]
        results = RAG.search(query=query, k=max(50, max(Top), max(MTop)))
        results = [(d['score'],d['document_id']) for d in results]
        Total += 1.0
        results = sorted(results, key=lambda x: x[0], reverse=True)
        for i, t in enumerate(Top):
            TopAcc[i] += (1.0 if (target_doc_id in {idx for sc, idx in results[:t]}) else 0.0)
        for i, t in enumerate(MTop):
            top_rr = [(1 / (i + 1)) for i, (sc, idx) in enumerate(results[:t]) if idx == target_doc_id]
            MRR[i] += (top_rr[0] if (len(top_rr) > 0) else 0.0)
    print(f'-------------------------------------------------------------------------------------------------')
    for i, t in enumerate(Top):
        print(f'@{eval_set_name} Final {keyword} kinya-ag-retrieval {eval_set_name} Set Top#{t} Accuracy:',
              f'{(100.0 * TopAcc[i] / Total): .1f}% ({TopAcc[i]:.0f} / {Total:.0f})')
    for i, t in enumerate(MTop):
        print(f'@{eval_set_name} Final {keyword} kinya-ag-retrieval {eval_set_name} Set MRR@{t}:',
              f'{(100.0 * MRR[i] / Total): .1f}% ({MRR[i]:.0f} / {Total:.0f})')
    print(f'-------------------------------------------------------------------------------------------------',
          flush=True)
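
The Top#k accuracy and MRR@k bookkeeping is identical in both evaluation scripts and can be factored into a small helper. This sketch assumes `ranked_ids` is already sorted by descending score, as in the loops above:

```python
def topk_and_mrr(ranked_ids, target_id, top_ks=(1, 5, 10, 20, 30), mrr_ks=(5, 10, 20, 30)):
    """Per-query contributions to Top#k accuracy and MRR@k, given passage
    ids ranked by descending score and the gold target passage id."""
    hits = {k: (1.0 if target_id in ranked_ids[:k] else 0.0) for k in top_ks}
    rank = (ranked_ids.index(target_id) + 1) if target_id in ranked_ids else None
    mrrs = {k: ((1.0 / rank) if (rank is not None and rank <= k) else 0.0) for k in mrr_ks}
    return hits, mrrs

# Hypothetical ids for illustration: target ranked 3rd out of 4.
hits, mrrs = topk_and_mrr(['p9', 'p4', 'p7', 'p1'], 'p7')
print(hits[1], hits[5], mrrs[5])  # 0.0 1.0 0.3333333333333333
```

Averaging these per-query values over all evaluation queries reproduces the `TopAcc[i] / Total` and `MRR[i] / Total` figures printed by the scripts.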

References

[1] Antoine Nzeyimana and Andre Niyongabo Rubungo. 2022. KinyaBERT: a Morphology-aware Kinyarwanda Language Model. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5347–5363, Dublin, Ireland. Association for Computational Linguistics.

[2] Antoine Nzeyimana and Andre Niyongabo Rubungo. 2025. KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation. arXiv preprint arXiv:2507.03241.

License

This model is licensed under the Creative Commons Attribution 4.0 International License (CC-BY 4.0).

Attribution: Please attribute this work to C4IR Rwanda and KiNLP.
