Update README.md
Browse files
README.md
CHANGED
|
@@ -27,5 +27,31 @@ just replace the pretrained model name and make sure you use Arabic text and spl
|
|
| 27 |
|
| 28 |
You can train a better model if you have access to adequate compute (you can fine-tune this model on more data; seed 42 was used to pick the 100K sample).
|
| 29 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 30 |
|
| 31 |
Model first announced: https://www.linkedin.com/posts/akhooli_this-is-probably-the-first-arabic-colbert-activity-7217969205197848576-l8Cy
|
|
|
|
| 27 |
|
| 28 |
You can train a better model if you have access to adequate compute (you can fine-tune this model on more data; seed 42 was used to pick the 100K sample).
|
| 29 |
|
| 30 |
+
# Training script
|
| 31 |
+
```
|
| 32 |
+
# Fine-tune an Arabic ColBERT model on a 100K sample of the mMARCO (Arabic)
# triplet dataset using RAGatouille's RAGTrainer.
from datasets import load_dataset
from ragatouille import RAGTrainer

sample_size = 100000

# Stream the corpus (no full download), shuffle with a fixed seed for
# reproducibility, and keep only the first `sample_size` examples.
stream = load_dataset('unicamp-dl/mmarco', 'arabic', split="train", trust_remote_code=True, streaming=True)
shuffled = stream.shuffle(seed=42, buffer_size=10_000)
sample = shuffled.take(sample_size)

# Each training example is a (query, positive passage, negative passage) triplet.
triplets = [(row["query"], row["positive"], row["negative"]) for row in sample]

trainer = RAGTrainer(model_name="Arabic-ColBERT-100k", pretrained_model_name="aubmindlab/bert-base-arabertv02", language_code="ar",)
# Triplets already contain negatives, so skip hard-negative mining.
trainer.prepare_training_data(raw_data=triplets, mine_hard_negatives=False)

trainer.train(
    batch_size=32,
    nbits=2,  # How many bits will the trained model use when compressing indexes
    maxsteps=100000,  # Maximum steps hard stop
    use_ib_negatives=True,  # Use in-batch negative to calculate loss
    dim=128,  # How many dimensions per embedding. 128 is the default and works well.
    learning_rate=5e-6,  # Learning rate, small values ([3e-6,3e-5] work best if the base model is BERT-like, 5e-6 is often the sweet spot)
    doc_maxlen=256,  # Maximum document length. Because of how ColBERT works, smaller chunks (128-256) work very well.
    use_relu=False,  # Disable ReLU -- doesn't improve performance
    warmup_steps="auto",  # Defaults to 10%
)
|
| 54 |
+
|
| 55 |
+
```
|
| 56 |
|
| 57 |
Model first announced: https://www.linkedin.com/posts/akhooli_this-is-probably-the-first-arabic-colbert-activity-7217969205197848576-l8Cy
|