# Arabic Semantic Highlighter
A sentence-level semantic highlighting model for Arabic text, designed for RAG (Retrieval-Augmented Generation) systems.
## Model Description
This model identifies and highlights sentences in Arabic text that are relevant to a given query. It was fine-tuned on the HeshamHaroon/arabic-semantic-relevance dataset using span annotations.
## Model Details
- Base Model: BAAI/bge-reranker-base
- Task: Sentence-level semantic relevance classification
- Language: Arabic (العربية)
- Training Data: ~66,000 query-sentence pairs extracted from span annotations
## Performance Metrics
| Metric | Score |
|---|---|
| Accuracy | 93.13% |
| F1 Score | 94.58% |
| Precision | 94.85% |
| Recall | 94.30% |
| AUC-ROC | 98.24% |
## Usage

```python
import torch
import numpy as np
import re
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class ArabicSemanticHighlighter:
    def __init__(self, model_path):
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_path,
            num_labels=1,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)
        self.model.eval()

    def _split_sentences(self, text, language="ar"):
        if language == "ar":
            sentences = re.split(r'[.؟!。\n]', text)
        else:
            sentences = re.split(r'[.?!\n]', text)
        return [s.strip() for s in sentences if s.strip() and len(s.strip()) > 5]

    def _score_sentence(self, question, sentence):
        inputs = self.tokenizer(
            question, sentence,
            truncation=True,
            max_length=256,
            padding='max_length',
            return_tensors='pt'
        ).to(self.device)
        with torch.no_grad():
            logit = self.model(**inputs).logits.squeeze().item()
        # Single-logit regression head, so apply a sigmoid to get a probability
        return 1 / (1 + np.exp(-logit))

    def process(self, question, context, threshold=0.5, language="auto", return_sentence_metrics=False):
        """
        Highlight relevant sentences in context based on the question.

        Args:
            question: Query string
            context: Text to search for relevant sentences
            threshold: Minimum probability for relevance (default: 0.5)
            language: "ar", "en", or "auto"
            return_sentence_metrics: Include probability scores

        Returns:
            dict with highlighted_sentences, all_sentences, and optionally
            sentence_probabilities
        """
        if language == "auto":
            # Treat the text as Arabic if more than 30% of its characters
            # fall in the Arabic Unicode block
            arabic_chars = len(re.findall(r'[\u0600-\u06FF]', context))
            language = "ar" if arabic_chars > len(context) * 0.3 else "en"
        sentences = self._split_sentences(context, language)
        probabilities = []
        highlighted = []
        for sentence in sentences:
            prob = self._score_sentence(question, sentence)
            probabilities.append(prob)
            if prob >= threshold:
                highlighted.append(sentence)
        result = {
            "highlighted_sentences": highlighted,
            "all_sentences": sentences,
        }
        if return_sentence_metrics:
            result["sentence_probabilities"] = probabilities
        return result


# Load model
highlighter = ArabicSemanticHighlighter("path/to/model")

# Example usage
# Question: "What are the benefits of artificial intelligence in education?"
question = "ما هي فوائد الذكاء الاصطناعي في التعليم؟"
# Context: two sentences about AI in education, plus an unrelated one
# about the weather ("The weather today is sunny and warm.")
context = """الذكاء الاصطناعي يحدث ثورة في قطاع التعليم.
يساعد الذكاء الاصطناعي المعلمين في تخصيص المحتوى التعليمي لكل طالب.
الطقس اليوم مشمس ودافئ."""

result = highlighter.process(
    question=question,
    context=context,
    threshold=0.5,
    return_sentence_metrics=True
)

print("Highlighted sentences:", result["highlighted_sentences"])
# Output: the sentences about AI in education (the weather sentence is excluded)
```
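`process` runs one forward pass per sentence. For long documents it is usually faster to score the query-sentence pairs in batches; the helper below is a sketch of that idea (`score_batch` is not part of the released class), mirroring the tokenization settings of `_score_sentence` but with dynamic padding:

```python
import torch

def score_batch(model, tokenizer, question, sentences, device, batch_size=32):
    """Score all (question, sentence) pairs in batches instead of one per forward pass."""
    probs = []
    for i in range(0, len(sentences), batch_size):
        chunk = sentences[i:i + batch_size]
        inputs = tokenizer(
            [question] * len(chunk), chunk,
            truncation=True,
            max_length=256,
            padding=True,  # pad to the longest pair in the batch, not to max_length
            return_tensors='pt',
        ).to(device)
        with torch.no_grad():
            # Single-logit head: shape (batch, 1) -> (batch,)
            logits = model(**inputs).logits.squeeze(-1)
        probs.extend(torch.sigmoid(logits).tolist())
    return probs
```

The returned list lines up index-for-index with `sentences`, so the same `threshold` filter from `process` applies unchanged.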
## Training Details
- Epochs: 3
- Batch Size: 8
- Learning Rate: 2e-5
- Max Sequence Length: 256
- Gradient Accumulation Steps: 4 (effective batch size 32)
- Optimizer: AdamW with weight decay 0.01
- Training Time: ~73 minutes on NVIDIA RTX 5060
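As a rough guide, the hyperparameters above map onto a `transformers` `TrainingArguments` configuration like the following. This is a sketch only, not the actual training script; the `output_dir` name is illustrative, and dataset preprocessing and the single-logit head setup are not shown:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="arabic-semantic-highlighter",  # illustrative name
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size 8 * 4 = 32
    learning_rate=2e-5,
    weight_decay=0.01,               # Trainer's default optimizer is AdamW
)
# Max sequence length (256) is applied at tokenization time, not here.
```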
## Use Cases
- RAG Systems: Highlight relevant passages for LLM context
- Search Results: Show users which parts of documents match their query
- Document QA: Identify answer-containing sentences
- Content Filtering: Extract relevant information from long documents
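For the RAG use case, the highlighter can prune retrieved passages before they enter the LLM prompt. A minimal sketch, assuming a `highlighter` instance as built in the Usage section (`build_context` is illustrative, not part of the release):

```python
def build_context(highlighter, question, passages, threshold=0.5):
    """Keep only the sentences relevant to the question across all retrieved passages."""
    kept = []
    for passage in passages:
        result = highlighter.process(question, passage, threshold=threshold)
        kept.extend(result["highlighted_sentences"])
    # Join the surviving sentences into a compact context for the LLM prompt
    return " ".join(kept)
```

Raising `threshold` trades recall for a shorter prompt, which matters when the retrieved passages would otherwise overflow the LLM's context window.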
## Limitations
- Optimized for Arabic text; may work on other languages but not tested
- Best performance on sentences 10-200 characters in length
- Requires GPU for efficient inference on large documents
## Citation
If you use this model, please cite:
```bibtex
@misc{arabic-semantic-highlighter,
  author       = {Hesham Haroon},
  title        = {Arabic Semantic Highlighter},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/HeshamHaroon/arabic-semantic-highlighter}}
}
```
## License
This model is released under a Non-Commercial License. See LICENSE for details.