---
language:
- fa
- ar
- multilingual
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- multilingual
- persian
- arabic
- qa
- information-retrieval
- hadith
pipeline_tag: sentence-similarity
base_model:
- intfloat/multilingual-e5-large-instruct
---

# hamtaai/e5-large-instruct-hadith

This is a fine-tuned version of `intfloat/multilingual-e5-large-instruct` optimized for Persian and Arabic text processing and question-answering tasks.

## Model Description

This model has been fine-tuned on a comprehensive dataset of Persian and Arabic religious texts, including Hadith collections.

The model is particularly effective for:

- Semantic search in Persian and Arabic texts
- Question-answering tasks
- Information retrieval
- Cross-lingual understanding between Persian and Arabic

## Training Configuration

- **Base Model**: intfloat/multilingual-e5-large-instruct
- **Epochs**: 5
- **Batch Size**: 72
- **Learning Rate**: 2e-05
- **Warmup Steps Ratio**: 0.1
- **Evaluation Steps Ratio**: 0.5
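
The step ratios above become absolute step counts once the dataset size is fixed; a minimal sketch of the usual conversion (the 100,000-pair dataset size below is a hypothetical illustration, not the actual training-set size):

```python
import math

def schedule_steps(num_examples, batch_size, epochs, warmup_ratio, eval_ratio):
    """Convert step ratios into absolute step counts, as trainers typically do."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = int(total_steps * warmup_ratio)
    # eval_ratio is applied per epoch, so 0.5 means evaluating twice per epoch
    eval_steps = int(steps_per_epoch * eval_ratio)
    return total_steps, warmup_steps, eval_steps

# Hyperparameters from the table above; 100_000 examples is hypothetical
total, warmup, evals = schedule_steps(100_000, batch_size=72, epochs=5,
                                      warmup_ratio=0.1, eval_ratio=0.5)
print(total, warmup, evals)
```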

## Usage

### Using Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer('hamtaai/e5-large-instruct-hadith')

# Use the "query:"/"passage:" prefixes the model expects
query = "query: سوال شما اینجا"      # "query: your question here"
passage = "passage: متن پاسخ اینجا"  # "passage: your answer text here"

# Encode texts
query_embedding = model.encode(query)
passage_embedding = model.encode(passage)

# Calculate cosine similarity
similarity = cos_sim(query_embedding, passage_embedding)
```
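
At retrieval time, candidate passages are ranked by cosine similarity to the query embedding. A minimal numpy sketch with toy 4-dimensional vectors standing in for `model.encode` output:

```python
import numpy as np

def rank_passages(query_emb, passage_embs):
    """Return passage indices sorted by descending cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q
    return np.argsort(-scores), scores

# Toy embeddings; real ones come from model.encode on prefixed texts
query = np.array([1.0, 0.0, 1.0, 0.0])
passages = np.array([
    [1.0, 0.1, 0.9, 0.0],   # close to the query
    [0.0, 1.0, 0.0, 1.0],   # orthogonal to the query
])
order, scores = rank_passages(query, passages)
print(order)  # the first passage ranks highest
```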

### Using Hugging Face Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('hamtaai/e5-large-instruct-hadith')
model = AutoModel.from_pretrained('hamtaai/e5-large-instruct-hadith')

# Tokenize and encode ("متن شما" = "your text")
inputs = tokenizer("passage: متن شما", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool over real tokens only, using the attention mask
mask = inputs['attention_mask'].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```
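
The mask-aware pooling above matters once inputs are batched with padding: a plain mean would average zero vectors from padding positions into the embedding. A small numpy illustration with synthetic token vectors (not real model output):

```python
import numpy as np

# Batch of 1 sequence: 2 real tokens followed by 2 padding tokens
hidden = np.array([[[2.0, 2.0],
                    [4.0, 4.0],
                    [0.0, 0.0],    # padding
                    [0.0, 0.0]]])  # padding
attention_mask = np.array([[1, 1, 0, 0]])

# Naive mean averages over padding too, diluting the embedding
naive = hidden.mean(axis=1)

# Mask-aware mean averages only over real tokens
mask = attention_mask[..., None].astype(float)
masked = (hidden * mask).sum(axis=1) / mask.sum(axis=1)

print(naive)   # [[1.5 1.5]]
print(masked)  # [[3. 3.]]
```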

## Performance

This model has been optimized for Persian and Arabic text processing and shows improved performance on:

- Semantic similarity tasks
- Question-answering accuracy
- Cross-lingual retrieval
- Religious text understanding

## Training Data

The model was trained on a curated dataset of Persian and Arabic religious texts, including:

- Hadith collections
- Quranic commentaries (Tafsir)
- Religious question-answer pairs
- Contextual information for better understanding

## Limitations

- Primarily optimized for Persian and Arabic texts
- Performance may vary on other languages
- Best results achieved with proper text normalization
- Requires appropriate prefixes for instruct-based models
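
A typical normalization pass for Persian/Arabic input unifies character variants and strips diacritics. The exact rules used during training are not documented here, so treat this as an illustrative sketch:

```python
import re

# Map common Arabic character variants to their Persian forms
CHAR_MAP = str.maketrans({
    'ي': 'ی',  # Arabic Yeh  -> Persian Yeh
    'ك': 'ک',  # Arabic Kaf  -> Persian Kaf
    'ة': 'ه',  # Teh Marbuta -> Heh
})
# Tashkeel marks and superscript alef
DIACRITICS = re.compile(r'[\u064B-\u065F\u0670]')

def normalize(text):
    text = text.translate(CHAR_MAP)
    text = DIACRITICS.sub('', text)
    return re.sub(r'\s+', ' ', text).strip()

print(normalize('عليكم'))  # -> 'علیکم'
```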

## Citation

If you use this model, please cite the original base model and mention this fine-tuned version:

```bibtex
@misc{hamtaai_e5_large_instruct_hadith,
  title={hamtaai/e5-large-instruct-hadith: Fine-tuned Multilingual E5 Model for Persian and Arabic Text Processing},
  author={Your Name},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/hamtaai/e5-large-instruct-hadith}}
}
```

## License

This model is released under the Apache 2.0 License.