---
datasets:
- C4IR-RW/kinya-ag-retrieval
language:
- rw
metrics:
- accuracy
tags:
- kinyarwanda
- kinyabert
- bert
- colbert
- rag
- retrieval
license: cc-by-4.0
---

# Kinyarwanda BERT and ColBERT models

In Rwanda, many farmers struggle to access timely, personalized agricultural information. Traditional channels (like radio, TV, and online sources) offer limited reach and interactivity, while extension services and a national call center, staffed by only two agents for over two million farmers, face capacity constraints. To address these gaps, we developed a 24/7 AI-enabled Interactive Voice Response (IVR) tool. Accessible via a Kinyarwanda-speaking hotline, this tool provides advisories on topics such as pest and disease diagnosis and agro-climatic practices, as well as information on MINAGRI’s support programs for farmers, e.g., crop insurance. By combining AI and IVR technology, this project makes agricultural advisories more accessible, timely, and responsive to farmers’ needs. For more information, please reach out to [C4IR](https://c4ir.rw/).

Implemented by: [C4IR Rwanda](https://c4ir.rw/) & [KiNLP](https://kinlp.com/); Supported by: [GIZ](https://www.giz.de/); Financed by: [BMZ](https://www.bmz.de/en).

## Introduction

This repository provides pre-trained foundational models for Kinyarwanda passage retrieval/ranking. Running these models requires the [DeepKIN-AgAI](https://github.com/c4ir-rw/ac-ai-models/tree/main/DeepKIN-AgAI) package.

## Example uses

### 1. Fine-tuning a pretrained KinyaBERT model into a KinyaColBERT retrieval model

The following example uses a pre-trained KinyaBERT base model (107M parameters).

The training data for agricultural retrieval (i.e. ["C4IR-RW/kinya-ag-retrieval"](https://huggingface.co/datasets/C4IR-RW/kinya-ag-retrieval) on Hugging Face) has already been morphologically parsed; for other datasets, [MorphoKIN](https://github.com/anzeyimana/morphokin) parsing must be performed first.

```shell
# 1. Copy the "kinya-ag-retrieval" dataset from Hugging Face into a local directory, e.g. /home/ubuntu/DATA/kinya-ag-retrieval/

# 2. Copy the "kinyabert_base_pretrained.pt" model into a local directory, e.g. /home/ubuntu/DATA/kinyabert_base_pretrained.pt

# 3. Run the following training script from the DeepKIN-AgAI package:

python3 DeepKIN-AgAI/deepkin/train/flex_trainer.py \
  --model_variant="kinya_colbert:base" \
  --colbert_embedding_dim=512 \
  --gpus=1 \
  --batch_size=12 \
  --accumulation_steps=10 \
  --dataloader_num_workers=4 \
  --dataloader_persistent_workers=True \
  --dataloader_pin_memory=True \
  --use_ddp=False \
  --use_mtl_optimizer=False \
  --warmup_iter=2000 \
  --peak_lr=1e-5 \
  --lr_decay_style="cosine" \
  --num_iters=152630 \
  --dataset_max_seq_len=512 \
  --use_iterable_dataset=False \
  --train_log_steps=1 \
  --checkpoint_steps=1000 \
  --pretrained_bert_model_file="/home/ubuntu/DATA/kinyabert_base_pretrained.pt" \
  --qa_train_query_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_query_id.txt" \
  --qa_train_query_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_query_text.txt" \
  --qa_train_passage_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_passage_id.txt" \
  --qa_train_passage_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_passage_text.txt" \
  --qa_train_qpn_triples="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_all.tsv" \
  --qa_dev_query_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_query_id.txt" \
  --qa_dev_query_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_query_text.txt" \
  --qa_dev_passage_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_passage_id.txt" \
  --qa_dev_passage_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_passage_text.txt" \
  --qa_dev_qpn_triples="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_dev.tsv" \
  --load_saved_model=True \
  --model_save_path="/home/ubuntu/DATA/kinya_colbert_base_rw_ag_retrieval_new.pt"
```
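With `--batch_size=12` and `--accumulation_steps=10`, each optimizer step sees an effective batch of 120 training triples. The learning-rate schedule implied by `--warmup_iter`, `--peak_lr`, `--lr_decay_style="cosine"`, and `--num_iters` can be sketched as below. This is a generic warmup-plus-cosine shape for illustration only, not the exact `flex_trainer.py` implementation (which may, for example, decay to a nonzero floor):

```python
import math

def cosine_lr(step: int, peak_lr: float = 1e-5, warmup_iter: int = 2000,
              num_iters: int = 152630) -> float:
    """Linear warmup to peak_lr, then cosine decay toward zero by num_iters.

    Generic sketch of the schedule suggested by the training flags above;
    the actual DeepKIN-AgAI schedule may differ in details.
    """
    if step < warmup_iter:
        # Linear ramp from 0 at step 0 to peak_lr at warmup_iter.
        return peak_lr * step / warmup_iter
    # Cosine decay from peak_lr down to 0 over the remaining iterations.
    progress = (step - warmup_iter) / (num_iters - warmup_iter)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

For example, under these defaults the rate is half of `peak_lr` midway through warmup (step 1000) and reaches zero at iteration 152630.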

### 2. Running an API server for KinyaColBERT agricultural retrieval

1. First, run the [MorphoKIN](https://github.com/anzeyimana/morphokin) server on a Unix domain socket:

```shell
# Launch a daemon container

docker run -d -v /home/ubuntu/MORPHODATA:/MORPHODATA \
  --gpus all morphokin:latest morphokin \
  --morphokin_working_dir /MORPHODATA \
  --morphokin_config_file /MORPHODATA/data/analysis_config_file.conf \
  --task RMS \
  --kinlp_license /MORPHODATA/licenses/KINLP_LICENSE_FILE.dat \
  --ca_roots_pem_file /MORPHODATA/data/roots.pem \
  --morpho_socket /MORPHODATA/run/morpho.sock
```

2. Wait for the MorphoKIN socket server to become ready by monitoring the container logs:

```shell
docker container ls

docker logs -f <CONTAINER ID>

# The MorphoKIN server is ready once you see a message like this: MorphoKin server listening on UNIX SOCKET: /MORPHODATA/run/morpho.sock
```
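Instead of watching the logs manually, readiness can also be checked programmatically by polling the socket file. The following is an optional convenience sketch using only the Python standard library; the path is the host-side location of the socket passed via `--morpho_socket` in this example:

```python
import socket
import time

def wait_for_unix_socket(path: str, timeout: float = 120.0, interval: float = 1.0) -> bool:
    """Return True once a connection to the Unix domain socket at `path`
    succeeds, or False if it is still unreachable after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)  # succeeds only once the server is listening
            return True
        except OSError:
            time.sleep(interval)  # socket missing or not yet accepting; retry
        finally:
            sock.close()
    return False

# e.g. wait_for_unix_socket('/home/ubuntu/MORPHODATA/run/morpho.sock')
```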

3. Then, run the retrieval API server:

```shell
mkdir -p /home/ubuntu/DATA/agai_index

python3 DeepKIN-AgAI/deepkin/production/agai_backend.py
```

### 3. Evaluating a pre-trained KinyaColBERT model on ["C4IR-RW/kinya-ag-retrieval"](https://huggingface.co/datasets/C4IR-RW/kinya-ag-retrieval)

```python
import progressbar
import torch
import torch.nn.functional as F

from deepkin.clib.libkinlp.kinlpy import ParsedFlexSentence
from deepkin.data.morpho_qa_triple_data import DOCUMENT_TYPE_ID, QUESTION_TYPE_ID
from deepkin.models.kinyabert import KinyaColBERT
from deepkin.utils.misc_functions import read_lines

DATA_DIR = '/home/ubuntu/DATA'
rank = 0
pretrained_model_file = f'{DATA_DIR}/kinya_colbert_large_rw_ag_retrieval_finetuned_512D.pt'
keyword = 'kinya_colbert_large'

qa_query_id = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_query_id.txt'
qa_query_text = f'{DATA_DIR}/kinya-ag-retrieval/parsed_rw_ag_retrieval_query_text.txt'
qa_passage_id = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_passage_id.txt'
qa_passage_text = f'{DATA_DIR}/kinya-ag-retrieval/parsed_rw_ag_retrieval_passage_text.txt'

all_queries = {idx: ParsedFlexSentence(txt) for idx, txt in zip(read_lines(qa_query_id), read_lines(qa_query_text))}
all_passages = {idx: ParsedFlexSentence(txt) for idx, txt in zip(read_lines(qa_passage_id), read_lines(qa_passage_text))}

print(f'Got: {len(all_queries)} queries, {len(all_passages)} passages', flush=True)

device = torch.device('cuda:%d' % rank)

model, args = KinyaColBERT.from_pretrained(device, pretrained_model_file, ret_args=True)
model.float()
model.eval()

# Embed all passages once, keeping token-level embeddings for scoring.
passage_embeddings = dict()
DocPool = None
QueryPool = None
with torch.no_grad():
    print(f'{keyword} Embedding passages ...', flush=True)
    with progressbar.ProgressBar(max_value=len(all_passages), redirect_stdout=True) as bar:
        for itr, (passage_id, passage) in enumerate(all_passages.items()):
            if (itr % 100) == 0:
                bar.update(itr)
            passage.trim(508)
            D = model.get_colbert_embeddings([passage], DOCUMENT_TYPE_ID)
            DocPool = D.view(-1, D.size(-1)) if DocPool is None else torch.cat((DocPool, D.view(-1, D.size(-1))))
            passage_embeddings[passage_id] = D

# Pool statistics used to standardize embeddings before scoring.
query_embeddings = dict()
Doc_Mean = DocPool.mean(dim=0)
Doc_Stdev = DocPool.std(dim=0)
del DocPool
print(f'{keyword} Embedding queries ...', flush=True)
with torch.no_grad():
    with progressbar.ProgressBar(max_value=len(all_queries), redirect_stdout=True) as bar:
        for itr, (query_id, query) in enumerate(all_queries.items()):
            if (itr % 1000) == 0:
                bar.update(itr)
            query.trim(508)
            Q = model.get_colbert_embeddings([query], QUESTION_TYPE_ID)
            QueryPool = Q.view(-1, Q.size(-1)) if QueryPool is None else torch.cat((QueryPool, Q.view(-1, Q.size(-1))))
            query_embeddings[query_id] = Q

Query_Mean = QueryPool.mean(dim=0)
Query_Stdev = QueryPool.std(dim=0)
del QueryPool

dev_triples = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_dev.tsv'
test_triples = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_test.tsv'

EVAL_SETS = [('DEV', dev_triples),
             ('TEST', test_triples)]

for eval_set_name, eval_qpn_triples in EVAL_SETS:
    eval_query_to_passage_ids = {line.split('\t')[0]: line.split('\t')[1] for line in read_lines(eval_qpn_triples)}
    Top = [1, 5, 10, 20, 30]
    TopAcc = [0.0 for _ in Top]
    MTop = [5, 10, 20, 30]
    MRR = [0.0 for _ in MTop]
    Total = 0.0
    for itr, (query_id, target_doc_id) in enumerate(eval_query_to_passage_ids.items()):
        query = all_queries[query_id]
        with torch.no_grad():
            Q = model.get_colbert_embeddings([query], QUESTION_TYPE_ID)
        Q = (Q - Query_Mean) / Query_Stdev
        Q = F.normalize(Q, p=2, dim=2)
        results = []
        for doc_id, D in passage_embeddings.items():
            D = (D - Doc_Mean) / Doc_Stdev
            D = F.normalize(D, p=2, dim=2)
            with torch.no_grad():
                score = model.pairwise_score(Q, D).squeeze().item()
            score = score / Q.size(1)
            results.append((score, doc_id))
        Total += 1.0
        results = sorted(results, key=lambda x: x[0], reverse=True)
        for i, t in enumerate(Top):
            TopAcc[i] += (1.0 if (target_doc_id in {idx for sc, idx in results[:t]}) else 0.0)
        for i, t in enumerate(MTop):
            top_rr = [(1 / (r + 1)) for r, (sc, idx) in enumerate(results[:t]) if idx == target_doc_id]
            MRR[i] += (top_rr[0] if (len(top_rr) > 0) else 0.0)
    print('-------------------------------------------------------------------------------------------------')
    for i, t in enumerate(Top):
        print(f'@{eval_set_name} Final {keyword}-{args.colbert_embedding_dim} kinya-ag-retrieval {eval_set_name} Set Top#{t} Accuracy:',
              f'{(100.0 * TopAcc[i] / Total): .1f}% ({TopAcc[i]:.0f} / {Total:.0f})')
    for i, t in enumerate(MTop):
        print(f'@{eval_set_name} Final {keyword}-{args.colbert_embedding_dim} kinya-ag-retrieval {eval_set_name} Set MRR@{t}:',
              f'{(100.0 * MRR[i] / Total): .1f}% ({MRR[i]:.0f} / {Total:.0f})')
    print('-------------------------------------------------------------------------------------------------', flush=True)
```
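The `model.pairwise_score(Q, D)` call above performs ColBERT-style late interaction: each query-token embedding is matched against its best document-token embedding, the maxima are summed, and the division by `Q.size(1)` averages over query tokens. The core MaxSim operation can be illustrated in a dependency-free sketch (this is a conceptual illustration, not the DeepKIN-AgAI implementation itself):

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT MaxSim: sum, over query tokens, of the maximum dot product
    with any document token. Assumes both inputs are lists of equal-length
    (ideally L2-normalized) float vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # For each query token, keep only its best-matching document token.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

For unit vectors the per-token maxima are cosine similarities: two query tokens whose best document-token matches score 1.0 and 0.7 give a total of 1.7.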

### 4. Evaluating a pre-trained RAGatouille ColBERT model on ["C4IR-RW/kinya-ag-retrieval"](https://huggingface.co/datasets/C4IR-RW/kinya-ag-retrieval)

```python
from deepkin.utils.misc_functions import read_lines
from ragatouille import RAGPretrainedModel

keyword = 'agai-colbert-10000'
print(f'Evaluating {keyword} ...', flush=True)
qa_query_id = 'kinya-ag-retrieval/rw_ag_retrieval_query_id.txt'
qa_query_text = 'kinya-ag-retrieval/rw_ag_retrieval_query_text.txt'

all_queries = {idx: txt for idx, txt in zip(read_lines(qa_query_id), read_lines(qa_query_text))}

print(f'Got: {len(all_queries)} queries', flush=True)

RAG = RAGPretrainedModel.from_index('ragatouille-kinya-colbert/indexes/agai-colbert-10000/')

dev_triples = 'kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_dev.tsv'
test_triples = 'kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_test.tsv'

EVAL_SETS = [('DEV', dev_triples),
             ('TEST', test_triples)]

for eval_set_name, eval_qpn_triples in EVAL_SETS:
    eval_query_to_passage_ids = {line.split('\t')[0]: line.split('\t')[1] for line in read_lines(eval_qpn_triples)}
    Top = [1, 5, 10, 20, 30]
    TopAcc = [0.0 for _ in Top]
    MTop = [5, 10, 20, 30]
    MRR = [0.0 for _ in MTop]
    Total = 0.0
    for itr, (query_id, target_doc_id) in enumerate(eval_query_to_passage_ids.items()):
        query = all_queries[query_id]
        results = RAG.search(query=query, k=max(50, max(Top), max(MTop)))
        results = [(d['score'], d['document_id']) for d in results]
        Total += 1.0
        results = sorted(results, key=lambda x: x[0], reverse=True)
        for i, t in enumerate(Top):
            TopAcc[i] += (1.0 if (target_doc_id in {idx for sc, idx in results[:t]}) else 0.0)
        for i, t in enumerate(MTop):
            top_rr = [(1 / (r + 1)) for r, (sc, idx) in enumerate(results[:t]) if idx == target_doc_id]
            MRR[i] += (top_rr[0] if (len(top_rr) > 0) else 0.0)
    print('-------------------------------------------------------------------------------------------------')
    for i, t in enumerate(Top):
        print(f'@{eval_set_name} Final {keyword} kinya-ag-retrieval {eval_set_name} Set Top#{t} Accuracy:',
              f'{(100.0 * TopAcc[i] / Total): .1f}% ({TopAcc[i]:.0f} / {Total:.0f})')
    for i, t in enumerate(MTop):
        print(f'@{eval_set_name} Final {keyword} kinya-ag-retrieval {eval_set_name} Set MRR@{t}:',
              f'{(100.0 * MRR[i] / Total): .1f}% ({MRR[i]:.0f} / {Total:.0f})')
    print('-------------------------------------------------------------------------------------------------', flush=True)
```
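Both evaluation scripts report the same two metrics: Top-k accuracy (is the gold passage among the top k results?) and MRR@k (the reciprocal rank of the gold passage, or 0 if it is absent from the top k). Isolated here for clarity, with hypothetical document IDs:

```python
def topk_accuracy(ranked_ids, target_id, k):
    """1.0 if target_id appears among the first k ranked IDs, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, target_id, k):
    """1/rank of target_id within the first k ranked IDs, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == target_id:
            return 1.0 / rank
    return 0.0
```

For a ranking `['d3', 'd7', 'd1', 'd9']` with gold passage `'d7'`, Top-1 accuracy is 0, Top-5 accuracy is 1, and MRR@10 contributes 1/2; the scripts average these per-query values over all evaluation queries.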

## References

[1] Antoine Nzeyimana and Andre Niyongabo Rubungo. 2022. [KinyaBERT: a Morphology-aware Kinyarwanda Language Model](https://aclanthology.org/2022.acl-long.367/). In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5347–5363, Dublin, Ireland. Association for Computational Linguistics.

[2] Antoine Nzeyimana and Andre Niyongabo Rubungo. 2025. [KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation](https://arxiv.org/abs/2507.03241). arXiv preprint arXiv:2507.03241.

## License

This model is licensed under the [Creative Commons Attribution 4.0 International License (CC-BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

**Attribution:** Please attribute this work to C4IR Rwanda and KiNLP.