---
inference: false
datasets:
- akhooli/arabic-triplets-1m-curated-sims-len
pipeline_tag: sentence-similarity
tags:
- ColBERT
base_model:
- aubmindlab/bert-base-arabertv02
license: mit
library_name: RAGatouille
---
# Arabic-ColBERT-100k
This is the first version of Arabic ColBERT; better models are now available (see the 250k and 711k versions).
This model was trained on 100K filtered triplets from [akhooli/arabic-triplets-1m-curated-sims-len](https://huggingface.co/datasets/akhooli/arabic-triplets-1m-curated-sims-len),
a dataset of 1 million (translated) Arabic triplets curated from different sources and enriched with similarity scores.
More details on the dataset are available in the data card.
Training used the [RAGatouille library](https://github.com/bclavie/RAGatouille/blob/main/examples/02-basic_training.ipynb) on
a 2-GPU Kaggle account, with [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) as the base model.
If you downloaded the model before July 27th, 8 pm (Jerusalem time), please switch to the current version.
Use the [RAGatouille examples](https://github.com/bclavie/RAGatouille/blob/main/examples/01-basic_indexing_and_search.ipynb) to learn more:
just replace the pretrained model name, and make sure you use Arabic text and split documents for best results, as in the sketch below.
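Here is a minimal, hedged sketch of indexing and searching with this model through RAGatouille; the document collection, index name, and query below are placeholders:
```python
from ragatouille import RAGPretrainedModel

# Load the published checkpoint from the Hugging Face Hub.
RAG = RAGPretrainedModel.from_pretrained("akhooli/Arabic-ColBERT-100k")

# Index a placeholder Arabic collection. split_documents=True chunks long
# documents, which matches the short doc_maxlen this model was trained with.
RAG.index(
    collection=[
        "المستند الأول: نص عربي قصير للتجربة.",
        "المستند الثاني: نص عربي آخر للتجربة.",
    ],
    index_name="arabic_demo",  # placeholder index name
    max_document_length=256,
    split_documents=True,
)

# Retrieve the top 3 passages for a sample Arabic query.
results = RAG.search(query="ما هو البحث الدلالي؟", k=3)
print(results)
```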
You can train a better model if you have access to adequate compute, for example by finetuning this model on more data (seed 42 was used to pick the 100K sample).
## Training script
```python
from datasets import load_dataset
from ragatouille import RAGTrainer
sample_size = 100_000
ds = load_dataset('akhooli/arabic-triplets-1m-curated-sims-len', split="train", trust_remote_code=True, streaming=True)
# Some processing is not in this script: the data was filtered based on
# similarity scores, and 100K triplets were selected at random (see the sketch below).
sds = ds.shuffle(seed=42, buffer_size=10_000)
triplets = []
for item in sds:
    triplets.append((item["query"], item["positive"], item["negative"]))
    if len(triplets) >= sample_size:  # stop once the 100K sample is collected
        break
trainer = RAGTrainer(
    model_name="Arabic-ColBERT-100k",
    pretrained_model_name="aubmindlab/bert-base-arabertv02",
    language_code="ar",
)
trainer.prepare_training_data(raw_data=triplets, mine_hard_negatives=False)
trainer.train(
    batch_size=32,
    nbits=4,                  # how many bits the trained model uses when compressing indexes
    maxsteps=3125,            # maximum steps, hard stop
    use_ib_negatives=True,    # use in-batch negatives to calculate loss
    dim=128,                  # dimensions per embedding; 128 is the default and works well
    learning_rate=1e-5,       # small values ([3e-6, 3e-5]) work best for BERT-like base models; 5e-6 is often the sweet spot
    doc_maxlen=256,           # maximum document length; because of how ColBERT works, smaller chunks (128-256) work very well
    use_relu=False,           # disable ReLU -- doesn't improve performance
    warmup_steps="auto",      # defaults to 10%
)
```
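The similarity-score filtering mentioned in the script's comment was done separately. Continuing from `sds` and `sample_size` above, a hedged sketch of what that step could look like, assuming a `sim` score column and an illustrative 0.8 threshold (check the data card for the actual schema and cutoff):
```python
from itertools import islice

# Hypothetical filtering step: keep only triplets whose similarity score
# clears a threshold. The "sim" column name and 0.8 cutoff are assumptions.
filtered = sds.filter(lambda row: row["sim"] >= 0.8)

# Take the first sample_size triplets from the shuffled, filtered stream.
sample = list(islice(filtered, sample_size))
```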
Install `datasets` and `ragatouille` first. The last checkpoint is saved under `.ragatouille/..../colbert`.
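To sanity-check the freshly trained weights before publishing, you can point RAGatouille at that local checkpoint; the path below is a placeholder, not the exact run directory:
```python
from ragatouille import RAGPretrainedModel

# Placeholder path: substitute the actual run directory RAGatouille wrote
# under .ragatouille/ (the final component is "colbert").
checkpoint_path = "/path/to/.ragatouille/run/checkpoints/colbert"
RAG = RAGPretrainedModel.from_pretrained(checkpoint_path)
```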
Model first announced (July 14, 2024): https://www.linkedin.com/posts/akhooli_this-is-probably-the-first-arabic-colbert-activity-7217969205197848576-l8Cy
Dataset published and model updated (July 27, 2024): https://www.linkedin.com/posts/akhooli_arabic-1-million-curated-triplets-dataset-activity-7222951839774699521-PZcw
## Citation
```bibtex
@online{AbedKhooli,
  author    = {Abed Khooli},
  title     = {Arabic ColBERT 100K},
  publisher = {Hugging Face},
  month     = jul,
  year      = {2024},
  url       = {https://huggingface.co/akhooli/Arabic-ColBERT-100k},
}
```