Arabic Semantic Highlighter

A sentence-level semantic highlighting model for Arabic text, designed for RAG (Retrieval-Augmented Generation) systems.

Model Description

This model identifies and highlights sentences in Arabic text that are relevant to a given query. It was fine-tuned on the HeshamHaroon/arabic-semantic-relevance dataset using span annotations.

Model Details

  • Base Model: BAAI/bge-reranker-base
  • Task: Sentence-level semantic relevance classification
  • Language: Arabic (العربية)
  • Training Data: ~66,000 query-sentence pairs extracted from span annotations

Performance Metrics

Metric      Score
Accuracy    93.13%
F1 Score    94.58%
Precision   94.85%
Recall      94.30%
AUC-ROC     98.24%

Usage

import torch
import numpy as np
import re
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class ArabicSemanticHighlighter:
    def __init__(self, model_path):
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_path,
            num_labels=1,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)
        self.model.eval()

    def _split_sentences(self, text, language="ar"):
        # Split on sentence-ending punctuation (including the Arabic
        # question mark ؟) and discard fragments of 5 characters or fewer.
        if language == "ar":
            sentences = re.split(r'[.؟!。\n]', text)
        else:
            sentences = re.split(r'[.?!\n]', text)
        return [s.strip() for s in sentences if s.strip() and len(s.strip()) > 5]

    def _score_sentence(self, question, sentence):
        # Encode the (question, sentence) pair as a cross-encoder input.
        inputs = self.tokenizer(
            question, sentence,
            truncation=True,
            max_length=256,
            padding='max_length',
            return_tensors='pt'
        ).to(self.device)

        with torch.no_grad():
            logit = self.model(**inputs).logits.squeeze().item()
            # Sigmoid maps the single relevance logit to a probability in [0, 1].
            return 1 / (1 + np.exp(-logit))

    def process(self, question, context, threshold=0.5, language="auto", return_sentence_metrics=False):
        """
        Highlight relevant sentences in context based on the question.

        Args:
            question: Query string
            context: Text to search for relevant sentences
            threshold: Minimum probability for relevance (default: 0.5)
            language: "ar", "en", or "auto"
            return_sentence_metrics: Include probability scores

        Returns:
            dict with highlighted_sentences, all_sentences, and optionally sentence_probabilities
        """
        if language == "auto":
            # Treat the text as Arabic if more than 30% of its characters
            # fall in the Arabic Unicode block (U+0600-U+06FF).
            arabic_chars = len(re.findall(r'[\u0600-\u06FF]', context))
            language = "ar" if arabic_chars > len(context) * 0.3 else "en"

        sentences = self._split_sentences(context, language)
        probabilities = []
        highlighted = []

        for sentence in sentences:
            prob = self._score_sentence(question, sentence)
            probabilities.append(prob)
            if prob >= threshold:
                highlighted.append(sentence)

        result = {
            "highlighted_sentences": highlighted,
            "all_sentences": sentences,
        }

        if return_sentence_metrics:
            result["sentence_probabilities"] = probabilities

        return result

# Load the model (local path or Hub ID)
highlighter = ArabicSemanticHighlighter("HeshamHaroon/arabic-semantic-highlighter")

# Example usage
# question: "What are the benefits of artificial intelligence in education?"
# context: two sentences about AI in education, plus one off-topic sentence
# ("The weather today is sunny and warm.")
question = "ما هي فوائد الذكاء الاصطناعي في التعليم؟"
context = """الذكاء الاصطناعي يحدث ثورة في قطاع التعليم.
يساعد الذكاء الاصطناعي المعلمين في تخصيص المحتوى التعليمي لكل طالب.
الطقس اليوم مشمس ودافئ."""

result = highlighter.process(
    question=question,
    context=context,
    threshold=0.5,
    return_sentence_metrics=True
)

print("Highlighted sentences:", result["highlighted_sentences"])
# Output: Relevant sentences about AI in education (excludes weather sentence)
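In RAG pipelines it is often more useful to keep only the top-k highest-scoring sentences rather than everything above a fixed threshold. The sketch below shows one way to do that selection; `select_relevant` is a hypothetical helper (not part of the model's API) that operates on raw logits like the one computed inside `_score_sentence`.

```python
import math

def select_relevant(sentences, logits, threshold=0.5, top_k=None):
    """Rank sentences by sigmoid(logit) and keep those above `threshold`,
    optionally capped at the `top_k` highest-scoring ones."""
    probs = [1.0 / (1.0 + math.exp(-l)) for l in logits]
    ranked = sorted(zip(sentences, probs), key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    return [s for s, p in ranked if p >= threshold]
```

Capping with `top_k` keeps the LLM context bounded even when many sentences clear the threshold.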

Training Details

  • Epochs: 3
  • Batch Size: 8
  • Learning Rate: 2e-5
  • Max Sequence Length: 256
  • Gradient Accumulation Steps: 4
  • Optimizer: AdamW with weight decay 0.01
  • Training Time: ~73 minutes on NVIDIA RTX 5060
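The hyperparameters above could be expressed with Hugging Face `TrainingArguments` roughly as follows. This is a hypothetical reconstruction; the actual training script is not published, so treat it as a sketch, not the exact configuration used.

```python
from transformers import TrainingArguments

# Assumed mapping of the listed hyperparameters onto TrainingArguments.
args = TrainingArguments(
    output_dir="arabic-semantic-highlighter",  # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size: 8 * 4 = 32
    learning_rate=2e-5,
    weight_decay=0.01,               # AdamW is the Trainer's default optimizer
)
```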

Use Cases

  • RAG Systems: Highlight relevant passages for LLM context
  • Search Results: Show users which parts of documents match their query
  • Document QA: Identify answer-containing sentences
  • Content Filtering: Extract relevant information from long documents
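For the RAG use case, the output of `process()` can be assembled into a bounded context string before prompting an LLM. A minimal sketch, where `build_rag_context` is a hypothetical helper name:

```python
def build_rag_context(result, max_chars=2000):
    """Join highlighted sentences into a context string for an LLM prompt,
    stopping before the sentence-character budget would be exceeded.
    `result` is the dict returned by ArabicSemanticHighlighter.process();
    newline separators are not counted against the budget."""
    picked, total = [], 0
    for sentence in result["highlighted_sentences"]:
        if total + len(sentence) > max_chars:
            break
        picked.append(sentence)
        total += len(sentence)
    return "\n".join(picked)
```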

Limitations

  • Optimized for Arabic text; it may work on other languages, but this has not been tested
  • Performs best on sentences of 10-200 characters
  • A GPU is recommended for efficient inference on large documents

Citation

If you use this model, please cite:

@misc{arabic-semantic-highlighter,
  author = {Hesham Haroon},
  title = {Arabic Semantic Highlighter},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/HeshamHaroon/arabic-semantic-highlighter}}
}

License

This model is released under a Non-Commercial License. See LICENSE for details.

Model size: 0.3B parameters (F32, Safetensors format)