---
inference: false
datasets:
- akhooli/arabic-triplets-1m-curated-sims-len
pipeline_tag: sentence-similarity
tags:
- ColBERT
base_model:
- aubmindlab/bert-base-arabertv02
license: mit
library_name: RAGatouille
---

# Arabic-ColBERT-100k

First version of Arabic ColBERT (better models are available now: see the 250k and 711k versions).
This model was trained on 100K filtered triplets from the [akhooli/arabic-triplets-1m-curated-sims-len](https://huggingface.co/datasets/akhooli/arabic-triplets-1m-curated-sims-len)
dataset, which contains 1 million (translated) Arabic triplets curated from different sources and enriched with similarity scores.
More details are available in the dataset card.
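
For a quick look at the data format, here is a minimal sketch that streams a single triplet (the field names `query`, `positive`, and `negative` match the training script below):

```python
from datasets import load_dataset

# Stream one row to inspect the triplet fields without downloading the full dataset.
ds = load_dataset("akhooli/arabic-triplets-1m-curated-sims-len", split="train", streaming=True)
row = next(iter(ds))
print(row["query"], row["positive"], row["negative"], sep="\n")
```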

Training used the [RAGatouille library](https://github.com/bclavie/RAGatouille/blob/main/examples/02-basic_training.ipynb) on
a 2-GPU Kaggle account, with [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) as the base model.

If you downloaded the model before July 27th, 8 pm (Jerusalem time), please try the current version.
Use the [RAGatouille examples](https://github.com/bclavie/RAGatouille/blob/main/examples/01-basic_indexing_and_search.ipynb) to learn more:
just replace the pretrained model name, and make sure you use Arabic text and split documents for best results, as in the sketch below.
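
A minimal indexing-and-search sketch (the documents, query, and index name here are placeholders, not from the original card):

```python
from ragatouille import RAGPretrainedModel

# Load this model from the Hugging Face Hub.
RAG = RAGPretrainedModel.from_pretrained("akhooli/Arabic-ColBERT-100k")

# Placeholder Arabic documents; split_documents=True chunks long texts,
# which suits ColBERT's preference for smaller passages.
docs = [
    "النص العربي الأول ...",
    "النص العربي الثاني ...",
]
RAG.index(collection=docs, index_name="arabic_index", split_documents=True)

# Search with a placeholder Arabic query and return the top 3 results.
results = RAG.search(query="ما هو كولبيرت؟", k=3)
print(results)
```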

You can train a better model if you have access to adequate compute (for example, by fine-tuning this model on more data; seed 42 was used to pick the 100K sample).

## Training script

```python
from datasets import load_dataset
from ragatouille import RAGTrainer

sample_size = 100_000

# Stream the curated triplets dataset.
ds = load_dataset("akhooli/arabic-triplets-1m-curated-sims-len", split="train", trust_remote_code=True, streaming=True)

# Some data processing is not shown in this script (rows were filtered based on
# similarity scores); the 100K sample was selected at random with seed 42.
sds = ds.shuffle(seed=42, buffer_size=10_000)
dsf = sds.take(sample_size)

triplets = []
for item in dsf:
    triplets.append((item["query"], item["positive"], item["negative"]))

trainer = RAGTrainer(
    model_name="Arabic-ColBERT-100k",
    pretrained_model_name="aubmindlab/bert-base-arabertv02",
    language_code="ar",
)
trainer.prepare_training_data(raw_data=triplets, mine_hard_negatives=False)

trainer.train(
    batch_size=32,
    nbits=4,                 # How many bits the trained model will use when compressing indexes
    maxsteps=3125,           # Maximum steps (hard stop)
    use_ib_negatives=True,   # Use in-batch negatives to calculate loss
    dim=128,                 # Dimensions per embedding; 128 is the default and works well
    learning_rate=1e-5,      # Small values ([3e-6, 3e-5]) work best for BERT-like base models; 5e-6 is often the sweet spot
    doc_maxlen=256,          # Maximum document length; because of how ColBERT works, smaller chunks (128-256) work very well
    use_relu=False,          # Disable ReLU -- doesn't improve performance
    warmup_steps="auto",     # Defaults to 10% of total steps
)
```

Install `datasets` and `ragatouille` first. The last checkpoint is saved under `.ragatouille/..../colbert`.
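
To load that checkpoint locally, a minimal sketch (the path below is a hypothetical example; use the actual directory your training run created under `.ragatouille/`):

```python
from ragatouille import RAGPretrainedModel

# Hypothetical checkpoint path -- substitute the real directory from your run.
checkpoint_path = ".ragatouille/colbert/none/2024-07/27/12.00.00/checkpoints/colbert"
RAG = RAGPretrainedModel.from_pretrained(checkpoint_path)
```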

Model first announced (July 14, 2024): https://www.linkedin.com/posts/akhooli_this-is-probably-the-first-arabic-colbert-activity-7217969205197848576-l8Cy

Dataset published and model updated (July 27, 2024): https://www.linkedin.com/posts/akhooli_arabic-1-million-curated-triplets-dataset-activity-7222951839774699521-PZcw

## Citation

```bibtex
@online{AbedKhooli,
  author    = {Abed Khooli},
  title     = {Arabic ColBERT 100K},
  publisher = {Hugging Face},
  month     = jul,
  year      = {2024},
  url       = {https://huggingface.co/akhooli/Arabic-ColBERT-100k},
}
```