---
language:
- fa
- ar
- multilingual
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- multilingual
- persian
- arabic
- qa
- information-retrieval
- hadith
pipeline_tag: sentence-similarity
base_model:
- intfloat/multilingual-e5-large-instruct
---
# hamtaai/e5-large-instruct-hadith
This is a fine-tuned version of `intfloat/multilingual-e5-large-instruct` specifically optimized for Persian and Arabic text processing and question-answering tasks.
## Model Description
This model has been fine-tuned on a comprehensive dataset of Persian and Arabic religious texts, including Hadith collections.
The model is particularly effective for:
- Semantic search in Persian and Arabic texts
- Question-answering tasks
- Information retrieval
- Cross-lingual understanding between Persian and Arabic
## Training Configuration
- **Base Model**: intfloat/multilingual-e5-large-instruct
- **Epochs**: 5
- **Batch Size**: 72
- **Learning Rate**: 2e-05
- **Warmup Steps Ratio**: 0.1
- **Evaluation Steps Ratio**: 0.5
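The ratio-based settings above only translate into absolute step counts once the dataset size is fixed. A small illustrative calculation (the dataset size below is a made-up placeholder, not the actual training set):

```python
# Illustrative only: derive warmup/eval step counts from the ratios above.
num_examples = 100_000  # placeholder dataset size, NOT the real one
batch_size = 72
epochs = 5

steps_per_epoch = -(-num_examples // batch_size)  # ceiling division
total_steps = steps_per_epoch * epochs
warmup_steps = int(0.1 * total_steps)        # warmup ratio 0.1
eval_steps = int(0.5 * steps_per_epoch)      # evaluate twice per epoch

print(steps_per_epoch, total_steps, warmup_steps, eval_steps)
```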
## Usage
### Using Sentence-Transformers
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer('hamtaai/e5-large-instruct-hadith')

# E5-style models expect prefixes on their inputs
query = "query: سوال شما اینجا"      # "your question here"
passage = "passage: متن پاسخ اینجا"  # "your answer text here"

# Encode texts
query_embedding = model.encode(query)
passage_embedding = model.encode(passage)

# Calculate cosine similarity
similarity = cos_sim(query_embedding, passage_embedding)
```
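Once embeddings are computed, retrieval reduces to ranking passage embeddings by cosine similarity against the query embedding. A toy illustration of that ranking step with hand-made vectors (no model download required):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for a query embedding and two passage embeddings
query_vec = np.array([1.0, 0.0, 1.0])
passages = {
    "p1": np.array([1.0, 0.0, 0.9]),  # nearly parallel to the query
    "p2": np.array([0.0, 1.0, 0.0]),  # orthogonal to the query
}

# Rank passage IDs by similarity to the query, best first
ranked = sorted(passages, key=lambda k: cosine(query_vec, passages[k]), reverse=True)
print(ranked)  # -> ['p1', 'p2']
```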
### Using Hugging Face Transformers
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('hamtaai/e5-large-instruct-hadith')
model = AutoModel.from_pretrained('hamtaai/e5-large-instruct-hadith')

# Tokenize and encode ("متن شما" = "your text")
inputs = tokenizer("متن شما", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Average-pool only over real tokens (mask out padding), then L2-normalize,
# matching the E5 family's pooling convention
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = F.normalize(embeddings, p=2, dim=1)
```
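The masked average pooling above can be checked in isolation. A minimal NumPy sketch of the same rule, small enough to verify by hand:

```python
import numpy as np

def masked_mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mean over the sequence axis, counting only positions where mask == 1.

    hidden: (seq_len, dim) token vectors; mask: (seq_len,) of 0/1.
    """
    m = mask[:, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=0) / np.clip(m.sum(), 1e-9, None)

# Two real tokens plus one padding token that must not affect the mean
hidden = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [9.0, 9.0]])  # padding row, masked out
mask = np.array([1, 1, 0])

print(masked_mean_pool(hidden, mask))  # -> [2. 3.]
```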
## Performance
Relative to the base model, this fine-tuned version shows improved performance on:
- Semantic similarity tasks
- Question-answering accuracy
- Cross-lingual retrieval
- Religious text understanding
## Training Data
The model was trained on a curated dataset of Persian and Arabic religious texts, including:
- Hadith collections
- Quranic commentaries (Tafsir)
- Religious question-answer pairs
- Contextual information for better understanding
## Limitations
- Primarily optimized for Persian and Arabic texts
- Performance may vary on other languages
- Best results achieved with proper text normalization
- Requires appropriate prefixes for instruct-based models
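On the normalization point: Persian corpora often mix visually similar Arabic and Persian codepoints (yeh, kaf) and tatweel. A hypothetical pre-processing step is sketched below; the exact normalization this model expects is not documented here, so treat the mapping as an assumption and adapt it to your corpus:

```python
# Hypothetical normalizer (an assumption, not the model's documented
# pre-processing): map common Arabic codepoints to their Persian forms.
ARABIC_TO_PERSIAN = {
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> KEHEH
    "\u0640": "",        # drop tatweel (kashida)
}

def normalize_fa(text: str) -> str:
    """Replace Arabic-script variants with their Persian equivalents."""
    for src, dst in ARABIC_TO_PERSIAN.items():
        text = text.replace(src, dst)
    return text

print(normalize_fa("علي"))  # -> "علی" (Arabic yeh replaced by Farsi yeh)
```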
## Citation
If you use this model, please cite the original base model and mention this fine-tuned version:
```bibtex
@misc{hamtaai_e5_large_instruct_hadith,
  title        = {hamtaai/e5-large-instruct-hadith: Fine-tuned Multilingual E5 Model for Persian and Arabic Text Processing},
  author       = {hamtaai},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/hamtaai/e5-large-instruct-hadith}}
}
```
## License
This model is released under the Apache 2.0 License.