---
datasets:
- C4IR-RW/kinya-ag-retrieval
language:
- rw
metrics:
- accuracy
tags:
- kinyarwanda
- kinyabert
- bert
- colbert
- rag
- retrieval
license: cc-by-4.0
---

# Kinyarwanda BERT and ColBERT models

In Rwanda, many farmers struggle to access timely, personalized agricultural information. Traditional channels (like radio, TV, and online sources) offer limited reach and interactivity, while extension services and a national call center, staffed by only two agents for over two million farmers, face capacity constraints. To address these gaps, we developed a 24/7 AI-enabled Interactive Voice Response (IVR) tool. Accessible via a Kinyarwanda-speaking hotline, this tool provides advisories on topics such as pest and disease diagnosis and agro-climatic practices, as well as information on MINAGRI’s support programs for farmers, e.g., crop insurance. By combining AI and IVR technology, this project makes agricultural advisories more accessible, timely, and responsive to farmers’ needs. For more information, please reach out to [C4IR](https://c4ir.rw/).

Implemented by: [C4IR Rwanda](https://c4ir.rw/) & [KiNLP](https://kinlp.com/); Supported by: [GIZ](https://www.giz.de/); Financed by: [BMZ](https://www.bmz.de/en).

## Introduction

This repository provides pre-trained foundational models for Kinyarwanda passage retrieval/ranking. Running these models requires the [DeepKIN-AgAI](https://github.com/c4ir-rw/ac-ai-models/tree/main/DeepKIN-AgAI) package.

## Example uses

### 1. Fine-tuning a pretrained KinyaBERT model into a KinyaColBERT retrieval model

The following example uses a pre-trained KinyaBERT base model (107M parameters).

The training data for agricultural retrieval (i.e. ["C4IR-RW/kinya-ag-retrieval"](https://huggingface.co/datasets/C4IR-RW/kinya-ag-retrieval) on Hugging Face) has already been morphologically parsed; for other datasets, [MorphoKIN](https://github.com/anzeyimana/morphokin) parsing must be performed first.

```shell
# 1. Copy the "kinya-ag-retrieval" dataset from Hugging Face into a local directory, e.g. /home/ubuntu/DATA/kinya-ag-retrieval/

# 2. Copy the "kinyabert_base_pretrained.pt" model into a local directory, e.g. /home/ubuntu/DATA/kinyabert_base_pretrained.pt

# 3. Run the following training script from the DeepKIN-AgAI package:

python3 DeepKIN-AgAI/deepkin/train/flex_trainer.py \
  --model_variant="kinya_colbert:base" \
  --colbert_embedding_dim=512 \
  --gpus=1 \
  --batch_size=12 \
  --accumulation_steps=10 \
  --dataloader_num_workers=4 \
  --dataloader_persistent_workers=True \
  --dataloader_pin_memory=True \
  --use_ddp=False \
  --use_mtl_optimizer=False \
  --warmup_iter=2000 \
  --peak_lr=1e-5 \
  --lr_decay_style="cosine" \
  --num_iters=152630 \
  --dataset_max_seq_len=512 \
  --use_iterable_dataset=False \
  --train_log_steps=1 \
  --checkpoint_steps=1000 \
  --pretrained_bert_model_file="/home/ubuntu/DATA/kinyabert_base_pretrained.pt" \
  --qa_train_query_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_query_id.txt" \
  --qa_train_query_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_query_text.txt" \
  --qa_train_passage_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_passage_id.txt" \
  --qa_train_passage_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_passage_text.txt" \
  --qa_train_qpn_triples="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_all.tsv" \
  --qa_dev_query_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_query_id.txt" \
  --qa_dev_query_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_query_text.txt" \
  --qa_dev_passage_id="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_passage_id.txt" \
  --qa_dev_passage_text="/home/ubuntu/DATA/kinya-ag-retrieval/parsed_rw_ag_retrieval_passage_text.txt" \
  --qa_dev_qpn_triples="/home/ubuntu/DATA/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_dev.tsv" \
  --load_saved_model=True \
  --model_save_path="/home/ubuntu/DATA/kinya_colbert_base_rw_ag_retrieval_new.pt"
```
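With `--batch_size=12` and `--accumulation_steps=10`, each optimizer step sees an effective batch of 120 training triples. The learning-rate schedule implied by `--warmup_iter`, `--peak_lr`, `--lr_decay_style="cosine"`, and `--num_iters` can be sketched as below. This is a generic warmup-plus-cosine shape for illustration only, not the exact `flex_trainer.py` implementation (which may, for example, decay to a nonzero floor):

```python
import math

def cosine_lr(step: int, peak_lr: float = 1e-5, warmup_iter: int = 2000,
              num_iters: int = 152630) -> float:
    """Linear warmup to peak_lr, then cosine decay toward zero by num_iters.

    Generic sketch of the schedule suggested by the training flags above;
    the actual DeepKIN-AgAI schedule may differ in details.
    """
    if step < warmup_iter:
        # Linear ramp from 0 at step 0 to peak_lr at warmup_iter.
        return peak_lr * step / warmup_iter
    # Cosine decay from peak_lr down to 0 over the remaining iterations.
    progress = (step - warmup_iter) / (num_iters - warmup_iter)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

For example, under these defaults the rate is half of `peak_lr` midway through warmup (step 1000) and reaches zero at iteration 152630.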

### 2. Running an API server for KinyaColBERT agricultural retrieval

1. First, run the [MorphoKIN](https://github.com/anzeyimana/morphokin) server on a Unix domain socket:

```shell
# Launch a daemon container

docker run -d -v /home/ubuntu/MORPHODATA:/MORPHODATA \
  --gpus all morphokin:latest morphokin \
  --morphokin_working_dir /MORPHODATA \
  --morphokin_config_file /MORPHODATA/data/analysis_config_file.conf \
  --task RMS \
  --kinlp_license /MORPHODATA/licenses/KINLP_LICENSE_FILE.dat \
  --ca_roots_pem_file /MORPHODATA/data/roots.pem \
  --morpho_socket /MORPHODATA/run/morpho.sock
```

2. Wait for the MorphoKIN socket server to become ready by monitoring the container logs:

```shell
docker container ls

docker logs -f <CONTAINER ID>

# The MorphoKIN server is ready once you see a message like this: MorphoKin server listening on UNIX SOCKET: /MORPHODATA/run/morpho.sock
```
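Instead of watching the logs manually, readiness can also be checked programmatically by polling the socket file. The following is an optional convenience sketch using only the Python standard library; the path is the host-side location of the socket passed via `--morpho_socket` in this example:

```python
import socket
import time

def wait_for_unix_socket(path: str, timeout: float = 120.0, interval: float = 1.0) -> bool:
    """Return True once a connection to the Unix domain socket at `path`
    succeeds, or False if it is still unreachable after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(path)  # succeeds only once the server is listening
            return True
        except OSError:
            time.sleep(interval)  # socket missing or not yet accepting; retry
        finally:
            sock.close()
    return False

# e.g. wait_for_unix_socket('/home/ubuntu/MORPHODATA/run/morpho.sock')
```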

3. Then, run the retrieval API server:

```shell
mkdir -p /home/ubuntu/DATA/agai_index

python3 DeepKIN-AgAI/deepkin/production/agai_backend.py
```

### 3. Evaluating a pre-trained KinyaColBERT model on ["C4IR-RW/kinya-ag-retrieval"](https://huggingface.co/datasets/C4IR-RW/kinya-ag-retrieval)

```python
import progressbar
import torch
import torch.nn.functional as F

from deepkin.clib.libkinlp.kinlpy import ParsedFlexSentence
from deepkin.data.morpho_qa_triple_data import DOCUMENT_TYPE_ID, QUESTION_TYPE_ID
from deepkin.models.kinyabert import KinyaColBERT
from deepkin.utils.misc_functions import read_lines

DATA_DIR = '/home/ubuntu/DATA'
rank = 0
pretrained_model_file = f'{DATA_DIR}/kinya_colbert_large_rw_ag_retrieval_finetuned_512D.pt'
keyword = 'kinya_colbert_large'

qa_query_id = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_query_id.txt'
qa_query_text = f'{DATA_DIR}/kinya-ag-retrieval/parsed_rw_ag_retrieval_query_text.txt'
qa_passage_id = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_passage_id.txt'
qa_passage_text = f'{DATA_DIR}/kinya-ag-retrieval/parsed_rw_ag_retrieval_passage_text.txt'

all_queries = {idx: ParsedFlexSentence(txt) for idx, txt in zip(read_lines(qa_query_id), read_lines(qa_query_text))}
all_passages = {idx: ParsedFlexSentence(txt) for idx, txt in zip(read_lines(qa_passage_id), read_lines(qa_passage_text))}

print(f'Got: {len(all_queries)} queries, {len(all_passages)} passages', flush=True)

device = torch.device('cuda:%d' % rank)

model, args = KinyaColBERT.from_pretrained(device, pretrained_model_file, ret_args=True)
model.float()
model.eval()

# Embed all passages once, keeping token-level embeddings for scoring.
passage_embeddings = dict()
DocPool = None
QueryPool = None
with torch.no_grad():
    print(f'{keyword} Embedding passages ...', flush=True)
    with progressbar.ProgressBar(max_value=len(all_passages), redirect_stdout=True) as bar:
        for itr, (passage_id, passage) in enumerate(all_passages.items()):
            if (itr % 100) == 0:
                bar.update(itr)
            passage.trim(508)
            D = model.get_colbert_embeddings([passage], DOCUMENT_TYPE_ID)
            DocPool = D.view(-1, D.size(-1)) if DocPool is None else torch.cat((DocPool, D.view(-1, D.size(-1))))
            passage_embeddings[passage_id] = D

# Pool statistics used to standardize embeddings before scoring.
query_embeddings = dict()
Doc_Mean = DocPool.mean(dim=0)
Doc_Stdev = DocPool.std(dim=0)
del DocPool
print(f'{keyword} Embedding queries ...', flush=True)
with torch.no_grad():
    with progressbar.ProgressBar(max_value=len(all_queries), redirect_stdout=True) as bar:
        for itr, (query_id, query) in enumerate(all_queries.items()):
            if (itr % 1000) == 0:
                bar.update(itr)
            query.trim(508)
            Q = model.get_colbert_embeddings([query], QUESTION_TYPE_ID)
            QueryPool = Q.view(-1, Q.size(-1)) if QueryPool is None else torch.cat((QueryPool, Q.view(-1, Q.size(-1))))
            query_embeddings[query_id] = Q

Query_Mean = QueryPool.mean(dim=0)
Query_Stdev = QueryPool.std(dim=0)
del QueryPool

dev_triples = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_dev.tsv'
test_triples = f'{DATA_DIR}/kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_test.tsv'

EVAL_SETS = [('DEV', dev_triples),
             ('TEST', test_triples)]

for eval_set_name, eval_qpn_triples in EVAL_SETS:
    eval_query_to_passage_ids = {line.split('\t')[0]: line.split('\t')[1] for line in read_lines(eval_qpn_triples)}
    Top = [1, 5, 10, 20, 30]
    TopAcc = [0.0 for _ in Top]
    MTop = [5, 10, 20, 30]
    MRR = [0.0 for _ in MTop]
    Total = 0.0
    for itr, (query_id, target_doc_id) in enumerate(eval_query_to_passage_ids.items()):
        query = all_queries[query_id]
        with torch.no_grad():
            Q = model.get_colbert_embeddings([query], QUESTION_TYPE_ID)
        Q = (Q - Query_Mean) / Query_Stdev
        Q = F.normalize(Q, p=2, dim=2)
        results = []
        for doc_id, D in passage_embeddings.items():
            D = (D - Doc_Mean) / Doc_Stdev
            D = F.normalize(D, p=2, dim=2)
            with torch.no_grad():
                score = model.pairwise_score(Q, D).squeeze().item()
            score = score / Q.size(1)
            results.append((score, doc_id))
        Total += 1.0
        results = sorted(results, key=lambda x: x[0], reverse=True)
        for i, t in enumerate(Top):
            TopAcc[i] += (1.0 if (target_doc_id in {idx for sc, idx in results[:t]}) else 0.0)
        for i, t in enumerate(MTop):
            top_rr = [(1 / (r + 1)) for r, (sc, idx) in enumerate(results[:t]) if idx == target_doc_id]
            MRR[i] += (top_rr[0] if (len(top_rr) > 0) else 0.0)
    print('-------------------------------------------------------------------------------------------------')
    for i, t in enumerate(Top):
        print(f'@{eval_set_name} Final {keyword}-{args.colbert_embedding_dim} kinya-ag-retrieval {eval_set_name} Set Top#{t} Accuracy:',
              f'{(100.0 * TopAcc[i] / Total): .1f}% ({TopAcc[i]:.0f} / {Total:.0f})')
    for i, t in enumerate(MTop):
        print(f'@{eval_set_name} Final {keyword}-{args.colbert_embedding_dim} kinya-ag-retrieval {eval_set_name} Set MRR@{t}:',
              f'{(100.0 * MRR[i] / Total): .1f}% ({MRR[i]:.0f} / {Total:.0f})')
    print('-------------------------------------------------------------------------------------------------', flush=True)
```
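The `model.pairwise_score(Q, D)` call above performs ColBERT-style late interaction: each query-token embedding is matched against its best document-token embedding, the maxima are summed, and the division by `Q.size(1)` averages over query tokens. The core MaxSim operation can be illustrated in a dependency-free sketch (this is a conceptual illustration, not the DeepKIN-AgAI implementation itself):

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT MaxSim: sum, over query tokens, of the maximum dot product
    with any document token. Assumes both inputs are lists of equal-length
    (ideally L2-normalized) float vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # For each query token, keep only its best-matching document token.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

For unit vectors the per-token maxima are cosine similarities: two query tokens whose best document-token matches score 1.0 and 0.7 give a total of 1.7.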

### 4. Evaluating a pre-trained RAGatouille ColBERT model on ["C4IR-RW/kinya-ag-retrieval"](https://huggingface.co/datasets/C4IR-RW/kinya-ag-retrieval)

```python
from deepkin.utils.misc_functions import read_lines
from ragatouille import RAGPretrainedModel

keyword = 'agai-colbert-10000'
print(f'Evaluating {keyword} ...', flush=True)
qa_query_id = 'kinya-ag-retrieval/rw_ag_retrieval_query_id.txt'
qa_query_text = 'kinya-ag-retrieval/rw_ag_retrieval_query_text.txt'

all_queries = {idx: txt for idx, txt in zip(read_lines(qa_query_id), read_lines(qa_query_text))}

print(f'Got: {len(all_queries)} queries', flush=True)

RAG = RAGPretrainedModel.from_index('ragatouille-kinya-colbert/indexes/agai-colbert-10000/')

dev_triples = 'kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_dev.tsv'
test_triples = 'kinya-ag-retrieval/rw_ag_retrieval_qpntriplets_test.tsv'

EVAL_SETS = [('DEV', dev_triples),
             ('TEST', test_triples)]

for eval_set_name, eval_qpn_triples in EVAL_SETS:
    eval_query_to_passage_ids = {line.split('\t')[0]: line.split('\t')[1] for line in read_lines(eval_qpn_triples)}
    Top = [1, 5, 10, 20, 30]
    TopAcc = [0.0 for _ in Top]
    MTop = [5, 10, 20, 30]
    MRR = [0.0 for _ in MTop]
    Total = 0.0
    for itr, (query_id, target_doc_id) in enumerate(eval_query_to_passage_ids.items()):
        query = all_queries[query_id]
        results = RAG.search(query=query, k=max(50, max(Top), max(MTop)))
        results = [(d['score'], d['document_id']) for d in results]
        Total += 1.0
        results = sorted(results, key=lambda x: x[0], reverse=True)
        for i, t in enumerate(Top):
            TopAcc[i] += (1.0 if (target_doc_id in {idx for sc, idx in results[:t]}) else 0.0)
        for i, t in enumerate(MTop):
            top_rr = [(1 / (r + 1)) for r, (sc, idx) in enumerate(results[:t]) if idx == target_doc_id]
            MRR[i] += (top_rr[0] if (len(top_rr) > 0) else 0.0)
    print('-------------------------------------------------------------------------------------------------')
    for i, t in enumerate(Top):
        print(f'@{eval_set_name} Final {keyword} kinya-ag-retrieval {eval_set_name} Set Top#{t} Accuracy:',
              f'{(100.0 * TopAcc[i] / Total): .1f}% ({TopAcc[i]:.0f} / {Total:.0f})')
    for i, t in enumerate(MTop):
        print(f'@{eval_set_name} Final {keyword} kinya-ag-retrieval {eval_set_name} Set MRR@{t}:',
              f'{(100.0 * MRR[i] / Total): .1f}% ({MRR[i]:.0f} / {Total:.0f})')
    print('-------------------------------------------------------------------------------------------------', flush=True)
```
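Both evaluation scripts report the same two metrics: Top-k accuracy (is the gold passage among the top k results?) and MRR@k (the reciprocal rank of the gold passage, or 0 if it is absent from the top k). Isolated here for clarity, with hypothetical document IDs:

```python
def topk_accuracy(ranked_ids, target_id, k):
    """1.0 if target_id appears among the first k ranked IDs, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, target_id, k):
    """1/rank of target_id within the first k ranked IDs, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == target_id:
            return 1.0 / rank
    return 0.0
```

For a ranking `['d3', 'd7', 'd1', 'd9']` with gold passage `'d7'`, Top-1 accuracy is 0, Top-5 accuracy is 1, and MRR@10 contributes 1/2; the scripts average these per-query values over all evaluation queries.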

## References

[1] Antoine Nzeyimana and Andre Niyongabo Rubungo. 2022. [KinyaBERT: a Morphology-aware Kinyarwanda Language Model](https://aclanthology.org/2022.acl-long.367/). In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5347–5363, Dublin, Ireland. Association for Computational Linguistics.

[2] Antoine Nzeyimana and Andre Niyongabo Rubungo. 2025. [KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation](https://arxiv.org/abs/2507.03241). arXiv preprint arXiv:2507.03241.

## License

This model is licensed under the [Creative Commons Attribution 4.0 International License (CC-BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

**Attribution:** Please attribute this work to C4IR Rwanda and KiNLP.