mohamed2811 committed · Commit 5fb7f46 · verified · 1 Parent(s): a8fae2b

Update README.md

Files changed (1): README.md (+87 −1)
README.md CHANGED
tags:
- Sentence Similarity
- sentence-transformers
---

# 🧠 Muffakir: Fine-tuned Arabic Model for RAG & Dense Retrieval

[Muffakir](https://huggingface.co/mohamed2811/Muffakir_Embedding_V2) is a **state-of-the-art Arabic bi-encoder embedding model** fine-tuned from [`sayed0am/arabic-english-bge-m3`](https://huggingface.co/sayed0am/arabic-english-bge-m3).
It is optimized for use in **retrieval-augmented generation (RAG)** and dense passage retrieval pipelines. 🚀

---
## 🔍 Model Overview

* 🧬 **Base model**: [`sayed0am/arabic-english-bge-m3`](https://huggingface.co/sayed0am/arabic-english-bge-m3)
* 📚 **Fine-tuning dataset**: ~70,000 Arabic sentence pairs from various topics
  * 🏫 **20K** curated from Egyptian legal books
  * 🌐 **50K** collected from Hugging Face datasets (multi-domain)
* 🏋️ **Training epochs**: 3
* 📏 **Embedding dimension**: 1024
* 🔗 **Loss functions**:
  * [`MultipleNegativesRankingLoss`](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss)
  * [`MatryoshkaLoss`](https://huggingface.co/blog/matryoshka-representations) for multi-resolution embeddings

---
## 🌟 Key Features

* 🥇 **SOTA performance** in **Arabic RAG** and dense retrieval tasks
* 🎯 **Multi-resolution embeddings** via Matryoshka (dims: `1024 → 64`)
* 🌍 Supports **cross-lingual (Arabic-English)** encoding
* 📦 Ready for use in real-world search, Q&A, and AI agent systems
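
The multi-resolution idea above can be sketched with plain NumPy: keep only the first `k` Matryoshka dimensions of an embedding and re-normalize before computing cosine similarity. This is a minimal illustration, not the model's API; the random matrix below is a stand-in for real `model.encode(...)` output.

```python
import numpy as np

def truncate_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-normalize to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Stand-in for model.encode(...) output: 4 unit-normalized 1024-dim embeddings
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1024))
full /= np.linalg.norm(full, axis=1, keepdims=True)

# The same vectors can be compared at any of the trained resolutions
for dim in (1024, 256, 64):
    small = truncate_normalize(full, dim)
    scores = small @ small.T  # cosine similarities at this resolution
    print(dim, scores.shape)
```

Lower dimensions trade a little accuracy for much smaller vector indexes, which is the usual motivation for Matryoshka-style truncation.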

---
## ⚙️ Training Details

* 🧾 **Dataset size**: 70K examples
* 🗂️ **Topics**: Multi-domain (educational, legal, general knowledge, etc.)
* 🔁 **Epochs**: 3
* 🧪 **Batch size**: 8 (gradient accumulation enabled)
* 🚀 **Learning rate**: 2e-5
* 🧰 **Framework**: [sentence-transformers](https://www.sbert.net)

---
## 📀 Model Specs

* 🔢 Embedding size: `1024`
* 🔄 Supports Matryoshka-style dimension truncation
* 🧠 Bi-encoder setup, ideal for fast and scalable retrieval tasks
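
Because a bi-encoder embeds passages independently of the query, a corpus can be encoded once and searched with a single matrix product. A minimal sketch of that index-and-search pattern, assuming pre-normalized embeddings (the random matrix stands in for pre-computed `model.encode(...)` output):

```python
import numpy as np

def top_k(query_emb: np.ndarray, corpus_emb: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k best-scoring passages (cosine; embeddings pre-normalized)."""
    scores = corpus_emb @ query_emb  # one dot product per passage
    # argpartition avoids a full sort of the corpus; only the top k get sorted
    idx = np.argpartition(-scores, k - 1)[:k]
    return idx[np.argsort(-scores[idx])]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 1024))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Make passage 42 deliberately similar to the query
query = corpus[42] + 0.05 * rng.normal(size=1024)
query /= np.linalg.norm(query)

print(top_k(query, corpus))  # passage 42 should rank first
```

For large corpora the same pattern is usually backed by an approximate nearest-neighbor index rather than a dense matmul, but the scoring logic is identical.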

---
## 🧪 Example Usage

```python
from sentence_transformers import SentenceTransformer
import torch

# Load the fine-tuned Muffakir model
model = SentenceTransformer("mohamed2811/Muffakir_Embedding_V2")

# Example query and candidate passages
query = "ما هي شروط صحة العقد؟"  # "What are the conditions for a valid contract?"
passages = [
    "يشترط التراضي لصحة العقد.",  # "Mutual consent is required for a valid contract."
    "ينقسم القانون إلى عام وخاص.",  # "Law is divided into public and private."
    "العقد شريعة المتعاقدين.",  # "The contract is the law of the contracting parties."
    "تنتهي الولاية القانونية ببلوغ سن الرشد.",  # "Legal capacity vests at the age of majority."
]

# Encode the query and passages into unit-normalized embeddings
embedding_query = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
embedding_passages = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

# With normalized embeddings, the dot product equals cosine similarity
cosine_scores = torch.matmul(embedding_query, embedding_passages.T)

# Pick the best-matching passage
best_idx = cosine_scores.argmax().item()
best_passage = passages[best_idx]

print(f"🔍 Best matching passage: {best_passage}")
```

---