---
datasets:
- C4IR-RW/kinya-ag-retrieval
language:
- rw
metrics:
- accuracy
tags:
- kinyarwanda
- kinyabert
- bert
- colbert
- rag
- retrieval
license: cc-by-4.0
---

# Kinyarwanda BERT and ColBERT models

In Rwanda, many farmers struggle to access timely, personalized agricultural information. Traditional channels - like radio, TV, and online sources - offer limited reach and interactivity, while extension services and a national call center, staffed by only two agents for over two million farmers, face capacity constraints. To address these gaps, we developed a 24/7 AI-enabled Interactive Voice Response (IVR) tool. Accessible via a Kinyarwanda-speaking hotline, this tool provides advisory on topics such as pest and disease diagnosis and agro-climatic practices, as well as information on MINAGRI's support programs for farmers, e.g. crop insurance. By utilizing AI and IVR technology, this project will make agricultural advisories more accessible, timely, and responsive to farmers' needs. For more information, please reach out to [C4IR](https://c4ir.rw/).

Implemented by: [C4IR Rwanda](https://c4ir.rw/) & [KiNLP](https://kinlp.com/); Supported by: [GIZ](https://www.giz.de/); Financed by: [BMZ](https://www.bmz.de/en).

## Introduction

This repository provides pre-trained foundational models for Kinyarwanda passage retrieval/ranking. Running these models requires the [DeepKIN-AgAI](https://github.com/c4ir-rw/ac-ai-models/tree/main/DeepKIN-AgAI) package.

## Example uses

### 1. Fine-tuning a pretrained KinyaBERT model into a KinyaColBERT retrieval model

The following example uses a pre-trained KinyaBERT base model (107M parameters). The training data for agricultural retrieval (i.e. ["C4IR-RW/kinya-ag-retrieval"](https://huggingface.co/datasets/C4IR-RW/kinya-ag-retrieval) on Hugging Face) has already been morphologically parsed; for other datasets, [MorphoKIN](https://github.com/anzeyimana/morphokin) parsing must be performed first.

```shell
# 1. Copy the "kinya-ag-retrieval" dataset from Hugging Face into a local directory, e.g. /home/ubuntu/DATA/kinya-ag-retrieval/
# 2. Copy the "kinyabert_base_pretrained.pt" model into a local directory, e.g. /home/ubuntu/DATA/kinyabert_base_pretrained.pt
# 3. Run the following training script from the DeepKIN-AgAI package:
python3 DeepKIN-AgAI/deepkin/train/flex_trainer.py \
  --model_variant="kinya_colbert:base" \
  --colbert_embedding_dim=512 \
  --gpus=1 \
  --batch_size=12 \
  --accumulation_steps=10 \
  --dataloader_num_workers=4 \
  --dataloader_persistent_workers=True \
  --dataloader_pin_memory=True \
  --use_ddp=False \
  --use_mtl_optimizer=False \
  --warmup_iter=2000 \
  --peak_lr=1e-5 \
  --lr_decay_style="cosine" \
  --num_iters=152630 \
  --dataset_max_seq_len=512 \
  --use_iterable_dataset=False \
  --train_log_steps=1 \
  --checkpoint_steps=1000 \
  --pretrained_bert_model_file="/home/ubuntu/DATA/kinyabert_base_pretrained.pt" \
  --qa_train_query_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_query_id.txt" \
  --qa_train_query_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_query_text.txt" \
  --qa_train_passage_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_passage_id.txt" \
  --qa_train_passage_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_passage_text.txt" \
  --qa_train_qpn_triples="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_all.tsv" \
  --qa_dev_query_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_query_id.txt" \
  --qa_dev_query_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_query_text.txt" \
  --qa_dev_passage_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_passage_id.txt" \
  --qa_dev_passage_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_passage_text.txt" \
  --qa_dev_qpn_triples="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_dev.tsv" \
  --load_saved_model=True \
  --model_save_path="/home/ubuntu/DATA/kinya_colbert_base_rw_ag_retrieval_new.pt"
```
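The flags above request a linear warmup over 2,000 iterations to a peak learning rate of 1e-5, followed by cosine decay over the 152,630 total iterations. As an illustration of what such a schedule looks like (a sketch only; the exact shape implemented inside `flex_trainer.py` may differ, e.g. in its minimum learning rate):

```python
import math

def lr_at(step: int, peak_lr: float = 1e-5, warmup_iter: int = 2000,
          num_iters: int = 152630) -> float:
    """Linear warmup to peak_lr, then cosine decay to 0 at num_iters.

    Illustrative sketch of the --warmup_iter/--peak_lr/--lr_decay_style
    flags; not DeepKIN-AgAI's exact implementation.
    """
    if step < warmup_iter:
        return peak_lr * step / warmup_iter
    progress = (step - warmup_iter) / (num_iters - warmup_iter)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))

print(lr_at(0))       # 0.0 at the start of warmup
print(lr_at(2000))    # 1e-05 at the peak
print(lr_at(152630))  # decays to ~0 by the end of training
```

Note also that with `--batch_size=12` and `--accumulation_steps=10`, the effective batch size per update is 120 examples.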
### 2. Running an API server for KinyaColBERT agricultural retrieval

1. First, run the [MorphoKIN](https://github.com/anzeyimana/morphokin) server on a Unix domain socket:

   ```shell
   # Launch a daemon container
   docker run -d -v /home/ubuntu/MORPHODATA:/MORPHODATA \
     --gpus all morphokin:latest morphokin \
     --morphokin_working_dir /MORPHODATA \
     --morphokin_config_file /MORPHODATA/data/analysis_config_file.conf \
     --task RMS \
     --kinlp_license /MORPHODATA/licenses/KINLP_LICENSE_FILE.dat \
     --ca_roots_pem_file /MORPHODATA/data/roots.pem \
     --morpho_socket /MORPHODATA/run/morpho.sock
   ```

2. Wait for the MorphoKIN socket server to be ready by monitoring the container logs:

   ```shell
   docker container ls          # find the MorphoKIN container id
   docker logs -f <CONTAINER_ID>
   # The MorphoKIN server is ready once you see a message like this:
   #   MorphoKin server listening on UNIX SOCKET: /MORPHODATA/run/morpho.sock
   ```

3. Then, run the retrieval API server:

   ```shell
   mkdir -p /home/ubuntu/DATA/agai_index
   python3 DeepKIN-AgAI/deepkin/production/agai_backend.py
   ```

### 3. Evaluating a KinyaColBERT pre-trained model on ["C4IR-RW/kinya-ag-retrieval"](https://huggingface.co/datasets/C4IR-RW/kinya-ag-retrieval)

```python
import progressbar
import torch
import torch.nn.functional as F

from deepkin.clib.libkinlp.kinlpy import ParsedFlexSentence
from deepkin.data.morpho_qa_triple_data import DOCUMENT_TYPE_ID, QUESTION_TYPE_ID
from deepkin.models.kinyabert import KinyaColBERT
from deepkin.utils.misc_functions import read_lines

DATA_DIR = '/home/ubuntu/DATA'
rank = 0
pretrained_model_file = f'{DATA_DIR}/kinya_colbert_large_rw_ag_retrieval_finetuned_512D.pt'
keyword = 'kinya_colbert_large'

qa_query_id = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_query_id.txt'
qa_query_text = f'{DATA_DIR}/kinya-ag-retrieval/parsed_rw_ag_retrieval_query_text.txt'
qa_passage_id = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_passage_id.txt'
qa_passage_text = f'{DATA_DIR}/kinya-ag-retrieval/parsed_rw_ag_retrieval_passage_text.txt'

all_queries = {idx: ParsedFlexSentence(txt) for idx, txt in
               zip(read_lines(qa_query_id), read_lines(qa_query_text))}
all_passages = {idx: ParsedFlexSentence(txt) for idx, txt in
                zip(read_lines(qa_passage_id), read_lines(qa_passage_text))}
print(f'Got: {len(all_queries)} queries, {len(all_passages)} passages', flush=True)

device = torch.device('cuda:%d' % rank)
model, args = KinyaColBERT.from_pretrained(device, pretrained_model_file, ret_args=True)
model.float()
model.eval()

# Embed all passages; pool token embeddings to compute normalization statistics
passage_embeddings = dict()
DocPool = None
QueryPool = None
with torch.no_grad():
    print(f'{keyword} Embedding passages ...', flush=True)
    with progressbar.ProgressBar(max_value=len(all_passages), redirect_stdout=True) as bar:
        for itr, (passage_id, passage) in enumerate(all_passages.items()):
            if (itr % 100) == 0:
                bar.update(itr)
            passage.trim(508)
            D = model.get_colbert_embeddings([passage], DOCUMENT_TYPE_ID)
            DocPool = D.view(-1, D.size(-1)) if DocPool is None else torch.cat((DocPool, D.view(-1, D.size(-1))))
            passage_embeddings[passage_id] = D

query_embeddings = dict()
Doc_Mean = DocPool.mean(dim=0)
Doc_Stdev = DocPool.std(dim=0)
del DocPool

# Embed all queries; pool token embeddings to compute normalization statistics
print(f'{keyword} Embedding queries ...', flush=True)
with torch.no_grad():
    with progressbar.ProgressBar(max_value=len(all_queries), redirect_stdout=True) as bar:
        for itr, (query_id, query) in enumerate(all_queries.items()):
            if (itr % 1000) == 0:
                bar.update(itr)
            query.trim(508)
            Q = model.get_colbert_embeddings([query], QUESTION_TYPE_ID)
            QueryPool = Q.view(-1, Q.size(-1)) if QueryPool is None else torch.cat((QueryPool, Q.view(-1, Q.size(-1))))
            query_embeddings[query_id] = Q

Query_Mean = QueryPool.mean(dim=0)
Query_Stdev = QueryPool.std(dim=0)
del QueryPool

dev_triples = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_dev.tsv'
test_triples = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_test.tsv'
EVAL_SETS = [('DEV', dev_triples), ('TEST', test_triples)]
for eval_set_name, eval_qpn_triples in EVAL_SETS:
    eval_query_to_passage_ids = {line.split('\t')[0]: line.split('\t')[1]
                                 for line in read_lines(eval_qpn_triples)}
    Top = [1, 5, 10, 20, 30]
    TopAcc = [0.0 for _ in Top]
    MTop = [5, 10, 20, 30]
    MRR = [0.0 for _ in MTop]
    Total = 0.0
    for itr, (query_id, target_doc_id) in enumerate(eval_query_to_passage_ids.items()):
        query = all_queries[query_id]
        with torch.no_grad():
            Q = model.get_colbert_embeddings([query], QUESTION_TYPE_ID)
        Q = (Q - Query_Mean) / Query_Stdev
        Q = F.normalize(Q, p=2, dim=2)
        results = []
        for doc_id, D in passage_embeddings.items():
            D = (D - Doc_Mean) / Doc_Stdev
            D = F.normalize(D, p=2, dim=2)
            with torch.no_grad():
                score = model.pairwise_score(Q, D).squeeze().item()
            score = score / Q.size(1)
            results.append((score, doc_id))
        Total += 1.0
        results = sorted(results, key=lambda x: x[0], reverse=True)
        for i, t in enumerate(Top):
            TopAcc[i] += (1.0 if (target_doc_id in {idx for sc, idx in results[:t]}) else 0.0)
        for i, t in enumerate(MTop):
            top_rr = [(1 / (r + 1)) for r, (sc, idx) in enumerate(results[:t]) if idx == target_doc_id]
            MRR[i] += (top_rr[0] if (len(top_rr) > 0) else 0.0)
    print('-------------------------------------------------------------------------------------------------')
    for i, t in enumerate(Top):
        print(f'@{eval_set_name} Final {keyword}-{args.colbert_embedding_dim} kinya-ag-retrieval {eval_set_name} Set Top#{t} Accuracy:',
              f'{(100.0 * TopAcc[i] / Total): .1f}% ({TopAcc[i]:.0f} / {Total:.0f})')
    for i, t in enumerate(MTop):
        print(f'@{eval_set_name} Final {keyword}-{args.colbert_embedding_dim} kinya-ag-retrieval {eval_set_name} Set MRR@{t}:',
              f'{(100.0 * MRR[i] / Total): .1f}% ({MRR[i]:.0f} / {Total:.0f})')
    print('-------------------------------------------------------------------------------------------------', flush=True)
```
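Both evaluation scripts report top-k retrieval accuracy and MRR@k over a ranked result list. A minimal, self-contained sketch of these two per-query metrics, using hypothetical passage ids:

```python
def topk_accuracy(ranked_ids, target_id, k):
    """1.0 if the gold passage appears among the top-k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, target_id, k):
    """1/rank of the gold passage within the top-k results, 0.0 if absent.

    Averaging this value over all queries gives MRR@k.
    """
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == target_id:
            return 1.0 / rank
    return 0.0

# Hypothetical ranked result list (best score first) and gold passage id 'p9'
ranked = ['p7', 'p2', 'p9', 'p4', 'p1']
print(topk_accuracy(ranked, 'p9', 1))    # 0.0: not the top result
print(topk_accuracy(ranked, 'p9', 5))    # 1.0: within the top 5
print(reciprocal_rank(ranked, 'p9', 5))  # 0.333...: ranked third
```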
### 4. Evaluating a pre-trained RAGatouille ColBERT model on ["C4IR-RW/kinya-ag-retrieval"](https://huggingface.co/datasets/C4IR-RW/kinya-ag-retrieval)

```python
from deepkin.utils.misc_functions import read_lines
from ragatouille import RAGPretrainedModel

keyword = 'agai-colbert-10000'
print(f'Evaluating {keyword} ...', flush=True)

qa_query_id = 'kinya-ag-retrieval/rw_ag_retrieval_query_id.txt'
qa_query_text = 'kinya-ag-retrieval/rw_ag_retrieval_query_text.txt'
all_queries = {idx: txt for idx, txt in zip(read_lines(qa_query_id), read_lines(qa_query_text))}
print(f'Got: {len(all_queries)} queries', flush=True)

RAG = RAGPretrainedModel.from_index('ragatouille-kinya-colbert/indexes/agai-colbert-10000/')

dev_triples = 'kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_dev.tsv'
test_triples = 'kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_test.tsv'
EVAL_SETS = [('DEV', dev_triples), ('TEST', test_triples)]
for eval_set_name, eval_qpn_triples in EVAL_SETS:
    eval_query_to_passage_ids = {line.split('\t')[0]: line.split('\t')[1]
                                 for line in read_lines(eval_qpn_triples)}
    Top = [1, 5, 10, 20, 30]
    TopAcc = [0.0 for _ in Top]
    MTop = [5, 10, 20, 30]
    MRR = [0.0 for _ in MTop]
    Total = 0.0
    for itr, (query_id, target_doc_id) in enumerate(eval_query_to_passage_ids.items()):
        query = all_queries[query_id]
        results = RAG.search(query=query, k=max(50, max(Top), max(MTop)))
        results = [(d['score'], d['document_id']) for d in results]
        Total += 1.0
        results = sorted(results, key=lambda x: x[0], reverse=True)
        for i, t in enumerate(Top):
            TopAcc[i] += (1.0 if (target_doc_id in {idx for sc, idx in results[:t]}) else 0.0)
        for i, t in enumerate(MTop):
            top_rr = [(1 / (r + 1)) for r, (sc, idx) in enumerate(results[:t]) if idx == target_doc_id]
            MRR[i] += (top_rr[0] if (len(top_rr) > 0) else 0.0)
    print('-------------------------------------------------------------------------------------------------')
    for i, t in enumerate(Top):
        print(f'@{eval_set_name} Final {keyword} kinya-ag-retrieval {eval_set_name} Set Top#{t} Accuracy:',
              f'{(100.0 * TopAcc[i] / Total): .1f}% ({TopAcc[i]:.0f} / {Total:.0f})')
    for i, t in enumerate(MTop):
        print(f'@{eval_set_name} Final {keyword} kinya-ag-retrieval {eval_set_name} Set MRR@{t}:',
              f'{(100.0 * MRR[i] / Total): .1f}% ({MRR[i]:.0f} / {Total:.0f})')
    print('-------------------------------------------------------------------------------------------------', flush=True)
```

## References

[1] Antoine Nzeyimana and Andre Niyongabo Rubungo. 2022. [KinyaBERT: a Morphology-aware Kinyarwanda Language Model](https://aclanthology.org/2022.acl-long.367/). In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5347–5363, Dublin, Ireland. Association for Computational Linguistics.

[2] Antoine Nzeyimana and Andre Niyongabo Rubungo. 2025. [KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation](https://arxiv.org/abs/2507.03241). arXiv preprint arXiv:2507.03241.

## License

This model is licensed under the [Creative Commons Attribution 4.0 International License (CC-BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

**Attribution:** Please attribute this work to C4IR Rwanda and KiNLP.