--- |
|
|
tags: |
|
|
- sentence-transformers |
|
|
- sentence-similarity |
|
|
- feature-extraction |
|
|
- dense |
|
|
- retrieval |
|
|
- rag |
|
|
- generated_from_trainer |
|
|
- dataset_size:4038 |
|
|
- loss:TripletLoss |
|
|
- loss:MultipleNegativesRankingLoss |
|
|
base_model: Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B |
|
|
pipeline_tag: sentence-similarity |
|
|
library_name: sentence-transformers |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- Omartificial-Intelligence-Space/Saudi-Semantic-Chunks |
|
|
language: |
|
|
- ar |
|
|
--- |
|
|
|
|
|
# SA-Retrieval-Embeddings-0.2B |
|
|
### Saudi Arabic Retrieval-Optimized Sentence Embeddings |
|
|
|
|
|
|
|
|
|
|
|
This model is a **retrieval-optimized SentenceTransformer**, fine-tuned from **Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B**, and specifically designed for: |
|
|
|
|
|
- **Semantic retrieval** |
|
|
- **RAG (Retrieval-Augmented Generation)** |
|
|
- **Paragraph-level semantic search** |
|
|
- **Chunk-based document retrieval** |
|
|
- **Saudi Arabic dialect understanding** |
|
|
|
|
|
Unlike general semantic similarity models, this model is **explicitly trained to rank the correct semantic chunk at the top**, even among closely related alternatives. |
|
|
|
|
|
--- |
|
|
|
|
|
## 🔍 What makes this model different? |
|
|
|
|
|
Most Arabic embedding models are trained on **pairwise similarity only**. |
|
|
This model goes further by incorporating: |
|
|
|
|
|
- **Summary → Chunk retrieval supervision** |
|
|
- **Hard negatives from semantic chunk boundaries** |
|
|
- **Triplet-based discrimination** |
|
|
- **In-batch negatives via MNRL (MultipleNegativesRankingLoss)**
|
|
|
|
|
As a result, it excels in **real-world retrieval scenarios**, not just sentence similarity. |
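The in-batch-negatives idea behind MNRL can be illustrated without the model: within a batch of (query, positive chunk) pairs, every other pair's chunk serves as a negative for a given query. A minimal sketch with toy 2-d vectors (illustrative only, not the library's implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy batch of (query, positive chunk) embedding pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs    = [[0.9, 0.1], [0.2, 0.8]]

# Score every query against every chunk in the batch:
# diagonal entries are positives, off-diagonal entries are in-batch negatives.
scores = [[cosine(q, d) for d in docs] for q in queries]
for i, row in enumerate(scores):
    assert row[i] == max(row)  # each query ranks its own chunk first
```

MNRL pushes each diagonal score above the rest of its row, which is exactly the ranking behavior retrieval needs.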
|
|
|
|
|
--- |
|
|
|
|
|
## 🧠 Training Overview |
|
|
|
|
|
- **Base Model:** SA-STS-Embeddings-0.2B |
|
|
- **Training Objective:** |
|
|
- MultipleNegativesRankingLoss (primary) |
|
|
- TripletLoss with hard negatives (boundary-based) |
|
|
- **Embedding Dimension:** 768 |
|
|
- **Pooling Strategy:** Mean pooling |
|
|
- **Max Sequence Length:** 512 tokens |
|
|
- **Training Samples:** 4,038 supervised retrieval examples
|
|
- **Precision:** FP16 |
|
|
|
|
|
### Training Data |
|
|
The model was trained using **Saudi Semantic Chunking data**, where: |
|
|
- Each document is split into **3–5 semantic chunks** |
|
|
- Each chunk has a **human-written summary** |
|
|
- **Retrieval task:** *summary → correct chunk* among the other chunks from the same document
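That supervision can be expressed as triplets, where each summary is anchored to its own chunk and contrasted against the document's other chunks. A hypothetical sketch (field names are illustrative, not the actual dataset schema):

```python
def build_triplets(document):
    """Build (anchor, positive, negative) triplets from one document's
    semantic chunks: each summary must retrieve its own chunk, with the
    document's remaining chunks acting as hard negatives."""
    triplets = []
    for i, item in enumerate(document):
        for j, other in enumerate(document):
            if i != j:
                triplets.append((item["summary"], item["chunk"], other["chunk"]))
    return triplets

doc = [
    {"summary": "s1", "chunk": "c1"},
    {"summary": "s2", "chunk": "c2"},
    {"summary": "s3", "chunk": "c3"},
]
triplets = build_triplets(doc)  # 3 chunks -> 6 triplets
```

Because the negatives come from the same document, they are semantically close to the positive, which is what makes them "hard".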
|
|
|
|
|
Dataset: |
|
|
👉 **Omartificial-Intelligence-Space/Saudi-Semantic-Chunks** |
|
|
|
|
|
--- |
|
|
|
|
|
## 📊 Evaluation Results |
|
|
|
|
|
The model was evaluated on a **hard retrieval benchmark** consisting of **1,515 retrieval cases** across **24 Saudi domains**, using chunk-level negatives.
|
|
|
|
|
### 🏆 Leaderboard Comparison |
|
|
|
|
|
|
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
|
|
|
### Key Takeaways |
|
|
- **Best Top-1 Accuracy** → correct chunk ranked first ~88% of the time |
|
|
- **Best MRR** → correct chunk appears very early in ranking |
|
|
- **Excellent Recall@5 (99.2%)** → ideal for RAG pipelines |
|
|
- **Highest FinalScore** → best overall balance of retrieval + discourse awareness |
|
|
|
|
|
--- |
|
|
|
|
|
## 📐 Metric Definitions |
|
|
|
|
|
- **Top-1:** Correct chunk ranked first |
|
|
- **MRR:** Mean Reciprocal Rank |
|
|
- **Recall@k:** Correct chunk appears in top-k |
|
|
- **nDCG:** Ranking quality with position discount |
|
|
- **Contrast:** (Intra-chunk similarity − Inter-chunk similarity) |
|
|
- **FinalScore:** 0.4 × Top-1 + 0.3 × MRR + 0.2 × Contrast + 0.1 × nDCG |
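As a worked example, the FinalScore formula above applied to illustrative metric values (not the model's reported numbers):

```python
def final_score(top1, mrr, contrast, ndcg):
    """Weighted combination defined above:
    0.4 * Top-1 + 0.3 * MRR + 0.2 * Contrast + 0.1 * nDCG."""
    return 0.4 * top1 + 0.3 * mrr + 0.2 * contrast + 0.1 * ndcg

# Illustrative inputs only.
score = final_score(top1=0.88, mrr=0.93, contrast=0.35, ndcg=0.95)
print(round(score, 3))  # 0.796
```

The weighting favors Top-1 and MRR, so a model that ranks the correct chunk first dominates one that merely places it somewhere in the top-k.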
|
|
|
|
|
--- |
|
|
|
|
|
## 🧪 Usage |
|
|
|
|
|
### Install |
|
|
```bash |
|
|
pip install -U sentence-transformers |
|
|
``` |
|
|
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer(
    "Omartificial-Intelligence-Space/SA-Retrieval-Embeddings-0.2B"
)

# Encode sentences into normalized 768-dim embeddings.
sentences = [
    "أفضل وقت لزيارة العلا في الشتاء",
    "العلا تكون أجمل في الشتاء والجو معتدل",
    "زحمة الرياض اليوم غير طبيعية"
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Rank candidate chunks against a query.
query = "أفضل وقت لزيارة أبها"
chunks = [
    "أبها تتميز بأجواء معتدلة في الصيف.",
    "الرياض مدينة مزدحمة.",
    "مطاعم جدة متنوعة."
]

q_emb = model.encode(query, normalize_embeddings=True)
c_embs = model.encode(chunks, normalize_embeddings=True)

scores = cosine_similarity([q_emb], c_embs)[0]
for s, c in sorted(zip(scores, chunks), reverse=True):
    print(round(s, 3), c)
```
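For RAG pipelines, a simple top-k retrieval helper can be layered on top. A sketch with toy vectors standing in for `model.encode(...)` output (with `normalize_embeddings=True`, a dot product equals cosine similarity):

```python
import numpy as np

def top_k_chunks(query_emb, chunk_embs, chunks, k=5):
    """Return the k best-scoring chunks for a normalized query embedding.
    On normalized vectors, the dot product is the cosine similarity."""
    scores = chunk_embs @ query_emb
    order = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), chunks[i]) for i in order]

# Toy normalized vectors standing in for real embeddings.
q = np.array([1.0, 0.0])
c = np.array([[0.8, 0.6], [0.6, 0.8], [1.0, 0.0]])
chunks = ["chunk A", "chunk B", "chunk C"]
print(top_k_chunks(q, c, chunks, k=2))  # [(1.0, 'chunk C'), (0.8, 'chunk A')]
```

Given the model's Recall@5, passing the top 5 chunks to the generator is a reasonable default for RAG.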
|
|
|
|
|
### 🎯 Intended Use |
|
|
|
|
|
- RAG systems |
|
|
- Semantic search engines |
|
|
- Knowledge base retrieval |
|
|
- Document chunk retrieval |
|
|
- Saudi dialect applications |
|
|
- Government & enterprise search |
|
|
|
|
|
|
|
|
### ⚠️ Limitations |
|
|
|
|
|
- Optimized for Saudi Arabic (dialect + MSA) |
|
|
- Not trained for cross-lingual retrieval |
|
|
- Not intended for generative tasks |
|
|
- Best performance when text is chunked semantically |
|
|
|
|
|
---

### 📚 Citation

```bibtex
|
|
@misc{sa_retrieval_embeddings_2025, |
|
|
title = {SA-Retrieval-Embeddings-0.2B: Retrieval-Optimized Saudi Arabic Sentence Embeddings}, |
|
|
author = {Omer Nacar}, |
|
|
year = {2025}, |
|
|
publisher = {HuggingFace} |
|
|
} |
|
|
``` |