---
language:
- fa
- ar
- multilingual
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- multilingual
- persian
- arabic
- qa
- information-retrieval
- hadith
pipeline_tag: sentence-similarity
base_model:
- intfloat/multilingual-e5-large-instruct
---
# hamtaai/e5-large-instruct-hadith
This is a fine-tuned version of `intfloat/multilingual-e5-large-instruct`, optimized for Persian and Arabic semantic search and question-answering tasks.
## Model Description
This model has been fine-tuned on a curated dataset of Persian and Arabic religious texts, centered on Hadith collections (see Training Data below).
The model is particularly effective for:
- Semantic search in Persian and Arabic texts
- Question-answering tasks
- Information retrieval
- Cross-lingual understanding between Persian and Arabic
## Training Configuration
- **Base Model**: intfloat/multilingual-e5-large-instruct
- **Epochs**: 5
- **Batch Size**: 72
- **Learning Rate**: 2e-05
- **Warmup Steps Ratio**: 0.1
- **Evaluation Steps Ratio**: 0.5
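The card does not state the loss function or training framework. As a hedged illustration only, here is a minimal sketch of how these hyperparameters might be wired up with the `sentence-transformers` trainer, assuming in-batch-negatives training with `MultipleNegativesRankingLoss` and a hypothetical query-passage dataset (none of these choices are confirmed by the card):
```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

# Hypothetical placeholder pairs; the actual training data is not published
train_dataset = Dataset.from_dict({
    "anchor": ["query: سوال نمونه"],      # "query: sample question"
    "positive": ["passage: پاسخ نمونه"],  # "passage: sample answer"
})

args = SentenceTransformerTrainingArguments(
    output_dir="e5-large-instruct-hadith",  # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=72,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    # With an eval_dataset, eval_strategy="steps" and eval_steps=0.5
    # (a ratio of total steps) would match the card's evaluation steps ratio.
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()
```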
## Usage
### Using Sentence-Transformers
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer('hamtaai/e5-large-instruct-hadith')

# Prefix queries and passages as the model expects
query = "query: سوال شما اینجا"      # "query: your question here"
passage = "passage: متن پاسخ اینجا"  # "passage: your answer text here"

# Encode texts
query_embedding = model.encode(query)
passage_embedding = model.encode(passage)

# Calculate cosine similarity
similarity = cos_sim(query_embedding, passage_embedding)
```
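For semantic search over more than one passage, the same encoder can rank a small corpus by cosine similarity. A minimal sketch with hypothetical placeholder texts (real passages would be Hadith excerpts):
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('hamtaai/e5-large-instruct-hadith')

# Hypothetical placeholder corpus
passages = [
    "passage: متن اول",   # "passage: first text"
    "passage: متن دوم",   # "passage: second text"
    "passage: متن سوم",   # "passage: third text"
]
query = "query: سوال شما اینجا"  # "query: your question here"

passage_embeddings = model.encode(passages)
query_embedding = model.encode(query)

# Rank passages by cosine similarity, highest first
scores = cos_sim(query_embedding, passage_embeddings)[0]
for idx in scores.argsort(descending=True):
    print(f"{float(scores[idx]):.3f}", passages[int(idx)])
```
For larger corpora, precompute and cache the passage embeddings once and serve them from a vector index; only the query needs to be encoded at search time.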
### Using Hugging Face Transformers
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('hamtaai/e5-large-instruct-hadith')
model = AutoModel.from_pretrained('hamtaai/e5-large-instruct-hadith')

# Tokenize and encode ("your text here")
inputs = tokenizer("متن شما", return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool over tokens (masking out padding), then L2-normalize,
# as is usual for E5-family embedding models
mask = inputs['attention_mask'].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = F.normalize(embeddings, p=2, dim=1)
```
## Performance
Compared with the base model, this fine-tuned version shows improved performance on:
- Semantic similarity tasks
- Question-answering accuracy
- Cross-lingual retrieval
- Religious text understanding
## Training Data
The model was trained on a curated dataset of Persian and Arabic religious texts, including:
- Hadith collections
- Quranic commentaries (Tafsir)
- Religious question-answer pairs
- Contextual information for better understanding
## Limitations
- Primarily optimized for Persian and Arabic texts
- Performance may vary on other languages
- Best results are achieved with proper text normalization (a sketch follows this list)
- Requires appropriate prefixes for instruct-based models
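The card recommends text normalization but does not specify a scheme. A minimal sketch of common Persian-oriented normalization steps (unifying Arabic character variants and stripping diacritics), offered as an assumption rather than the pipeline used in training:
```python
import re

# Map Arabic character variants to their Persian forms
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic Yeh  -> Farsi Yeh
    "\u0643": "\u06A9",  # Arabic Kaf  -> Keheh
    "\u0629": "\u0647",  # Teh Marbuta -> Heh (a common but lossy choice)
})

# Arabic diacritics (harakat) and tatweel
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

def normalize(text: str) -> str:
    text = text.translate(CHAR_MAP)
    text = DIACRITICS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("الكتاب"))  # -> "الکتاب"
```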
## Citation
If you use this model, please cite the original base model and mention this fine-tuned version:
```bibtex
@misc{hamtaai_e5_large_instruct_hadith,
  title={hamtaai/e5-large-instruct-hadith: Fine-tuned Multilingual E5 Model for Persian and Arabic Text Processing},
  author={hamtaai},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/hamtaai/e5-large-instruct-hadith}}
}
```
## License
This model is released under the Apache 2.0 License.