---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- retrieval
- rag
- generated_from_trainer
- dataset_size:4038
- loss:TripletLoss
- loss:MultipleNegativesRankingLoss
base_model: Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: apache-2.0
datasets:
- Omartificial-Intelligence-Space/Saudi-Semantic-Chunks
language:
- ar
---

# SA-Retrieval-Embeddings-0.2B

### Saudi Arabic Retrieval-Optimized Sentence Embeddings

This model is a **retrieval-optimized SentenceTransformer**, fine-tuned from **Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B** and designed specifically for:

- **Semantic retrieval**
- **RAG (Retrieval-Augmented Generation)**
- **Paragraph-level semantic search**
- **Chunk-based document retrieval**
- **Saudi Arabic dialect understanding**

Unlike general semantic-similarity models, this model is **explicitly trained to rank the correct semantic chunk at the top**, even among closely related alternatives.

---

## 🔍 What makes this model different?

Most Arabic embedding models are trained on **pairwise similarity only**. This model goes further by incorporating:

- **Summary → chunk retrieval supervision**
- **Hard negatives drawn from semantic chunk boundaries**
- **Triplet-based discrimination**
- **In-batch negatives via MultipleNegativesRankingLoss (MNRL)**

As a result, it excels in **real-world retrieval scenarios**, not just sentence similarity.
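The in-batch-negative objective (MNRL) can be illustrated with a small self-contained sketch. This is not the model's training code; it only shows the loss computation itself: given a batch of anchor–positive similarity scores, each anchor's own chunk sits on the diagonal and every other chunk in the batch acts as a negative, and the loss is a scaled softmax cross-entropy over each row. The similarity matrix and scale value below are illustrative toy data.

```python
import numpy as np

def mnrl_loss(sim: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives ranking loss on a (batch, batch) cosine-similarity
    matrix: softmax cross-entropy per row, where the diagonal entry is the
    anchor's true chunk and the off-diagonal entries are in-batch negatives."""
    logits = scale * sim
    # Row-wise log-softmax, then keep only the positive (diagonal) term.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch: each summary is most similar to its own chunk,
# so the loss is near zero.
sim = np.array([
    [0.9, 0.2, 0.1],
    [0.1, 0.8, 0.3],
    [0.2, 0.1, 0.7],
])
print(mnrl_loss(sim))
```

When the model cannot distinguish positives from negatives (a uniform similarity matrix), the loss collapses to `log(batch_size)`, which is why larger batches give a stronger training signal with this objective.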
---

## 🧠 Training Overview

- **Base Model:** SA-STS-Embeddings-0.2B
- **Training Objectives:**
  - MultipleNegativesRankingLoss (primary)
  - TripletLoss with hard negatives (boundary-based)
- **Embedding Dimension:** 768
- **Pooling Strategy:** Mean pooling
- **Max Sequence Length:** 512 tokens
- **Training Samples:** 4,038+ supervised retrieval examples
- **Precision:** FP16

### Training Data

The model was trained on **Saudi Semantic Chunking data**, where:

- Each document is split into **3–5 semantic chunks**
- Each chunk has a **human-written summary**
- The retrieval task is *summary → correct chunk*, distinguished from the other chunks of the same document

Dataset: 👉 **Omartificial-Intelligence-Space/Saudi-Semantic-Chunks**

---

## 📊 Evaluation Results

The model was evaluated on a **hard retrieval benchmark** of **1,515 retrieval cases** across **24 Saudi domains**, using chunk-level negatives.

### 🏆 Leaderboard Comparison

![leaderboard_comparison-20251223T111700](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/h3J-dMAjtvdHj1_rVTJIM.png)

### Key Takeaways

- **Best Top-1 Accuracy** → the correct chunk is ranked first ~88% of the time
- **Best MRR** → the correct chunk appears very early in the ranking
- **Excellent Recall@5 (99.2%)** → well suited to RAG pipelines
- **Highest FinalScore** → best overall balance of retrieval and discourse awareness

---

## 📐 Metric Definitions

- **Top-1:** Correct chunk ranked first
- **MRR:** Mean Reciprocal Rank
- **Recall@k:** Correct chunk appears in the top k
- **nDCG:** Ranking quality with a position discount
- **Contrast:** Intra-chunk similarity − inter-chunk similarity
- **FinalScore:** 0.4 × Top-1 + 0.3 × MRR + 0.2 × Contrast + 0.1 × nDCG

---

## 🧪 Usage

### Install

```bash
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer(
    "Omartificial-Intelligence-Space/SA-Retrieval-Embeddings-0.2B"
)

# Sentence embeddings
sentences = [
    "أفضل وقت لزيارة العلا في الشتاء",        # "The best time to visit AlUla is in winter"
    "العلا تكون أجمل في الشتاء والجو معتدل",  # "AlUla is most beautiful in winter, with mild weather"
    "زحمة الرياض اليوم غير طبيعية",           # "Riyadh's traffic today is unusually heavy"
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Query → chunk retrieval
query = "أفضل وقت لزيارة أبها"  # "The best time to visit Abha"
chunks = [
    "أبها تتميز بأجواء معتدلة في الصيف.",  # "Abha enjoys mild weather in summer."
    "الرياض مدينة مزدحمة.",                # "Riyadh is a crowded city."
    "مطاعم جدة متنوعة.",                   # "Jeddah's restaurants are varied."
]

q_emb = model.encode(query, normalize_embeddings=True)
c_embs = model.encode(chunks, normalize_embeddings=True)

# Rank chunks by cosine similarity to the query
scores = cosine_similarity([q_emb], c_embs)[0]
for s, c in sorted(zip(scores, chunks), reverse=True):
    print(round(s, 3), c)
```

### 🎯 Intended Use

- RAG systems
- Semantic search engines
- Knowledge-base retrieval
- Document chunk retrieval
- Saudi dialect applications
- Government & enterprise search

### ⚠️ Limitations

- Optimized for Saudi Arabic (dialect + MSA)
- Not trained for cross-lingual retrieval
- Not intended for generative tasks
- Performs best when text is chunked semantically

## 📖 Citation

```bibtex
@misc{sa_retrieval_embeddings_2025,
  title     = {SA-Retrieval-Embeddings-0.2B: Retrieval-Optimized Saudi Arabic Sentence Embeddings},
  author    = {Omer Nacar},
  year      = {2025},
  publisher = {HuggingFace}
}
```
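---

The ranking metrics defined above can be reproduced with a short self-contained sketch. The `ranks` list below is toy data (the 1-based position of the correct chunk for each query), not the benchmark results; Contrast and nDCG are omitted since they require the embeddings themselves.

```python
def retrieval_metrics(ranks: list[int], k: int = 5) -> dict[str, float]:
    """Compute Top-1, MRR, and Recall@k from the 1-based rank of the
    correct chunk for each query."""
    n = len(ranks)
    return {
        "top1": sum(r == 1 for r in ranks) / n,       # correct chunk ranked first
        "mrr": sum(1.0 / r for r in ranks) / n,       # mean reciprocal rank
        f"recall@{k}": sum(r <= k for r in ranks) / n,  # correct chunk in top k
    }

ranks = [1, 2, 1, 2]  # toy example: four queries
print(retrieval_metrics(ranks))  # → {'top1': 0.5, 'mrr': 0.75, 'recall@5': 1.0}
```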