---
language:
- fa
- ar
- multilingual
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- multilingual
- persian
- arabic
- qa
- information-retrieval
- hadith
pipeline_tag: sentence-similarity
base_model:
- intfloat/multilingual-e5-large-instruct
---
# hamtaai/e5-large-instruct-hadith
This is a fine-tuned version of `intfloat/multilingual-e5-large-instruct`, optimized for Persian and Arabic semantic search and question-answering tasks.
## Model Description
This model has been fine-tuned on a curated dataset of Persian and Arabic religious texts, centered on Hadith collections (see Training Data below).
The model is particularly effective for:
- Semantic search in Persian and Arabic texts
- Question-answering tasks
- Information retrieval
- Cross-lingual understanding between Persian and Arabic
## Training Configuration
- **Base Model**: intfloat/multilingual-e5-large-instruct
- **Epochs**: 5
- **Batch Size**: 72
- **Learning Rate**: 2e-05
- **Warmup Steps Ratio**: 0.1
- **Evaluation Steps Ratio**: 0.5
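The card does not state the loss function or training framework. As a hedged illustration only, here is a minimal sketch of how these hyperparameters might be wired up with the `sentence-transformers` trainer, assuming in-batch-negatives training with `MultipleNegativesRankingLoss` and a hypothetical query-passage dataset (none of these choices are confirmed by the card):
```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

# Hypothetical placeholder pairs; the actual training data is not published
train_dataset = Dataset.from_dict({
    "anchor": ["query: سوال نمونه"],      # "query: sample question"
    "positive": ["passage: پاسخ نمونه"],  # "passage: sample answer"
})

args = SentenceTransformerTrainingArguments(
    output_dir="e5-large-instruct-hadith",  # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=72,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    # With an eval_dataset, eval_strategy="steps" and eval_steps=0.5
    # (a ratio of total steps) would match the card's evaluation steps ratio.
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()
```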
## Usage
### Using Sentence-Transformers
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer('hamtaai/e5-large-instruct-hadith')

# Prefix queries and passages as the model expects
query = "query: سوال شما اینجا"      # "query: your question here"
passage = "passage: متن پاسخ اینجا"  # "passage: your answer text here"

# Encode texts
query_embedding = model.encode(query)
passage_embedding = model.encode(passage)

# Calculate cosine similarity
similarity = cos_sim(query_embedding, passage_embedding)
```
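For semantic search over more than one passage, the same encoder can rank a small corpus by cosine similarity. A minimal sketch with hypothetical placeholder texts (real passages would be Hadith excerpts):
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('hamtaai/e5-large-instruct-hadith')

# Hypothetical placeholder corpus
passages = [
    "passage: متن اول",   # "passage: first text"
    "passage: متن دوم",   # "passage: second text"
    "passage: متن سوم",   # "passage: third text"
]
query = "query: سوال شما اینجا"  # "query: your question here"

passage_embeddings = model.encode(passages)
query_embedding = model.encode(query)

# Rank passages by cosine similarity, highest first
scores = cos_sim(query_embedding, passage_embeddings)[0]
for idx in scores.argsort(descending=True):
    print(f"{float(scores[idx]):.3f}", passages[int(idx)])
```
For larger corpora, precompute and cache the passage embeddings once and serve them from a vector index; only the query needs to be encoded at search time.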
### Using Hugging Face Transformers
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('hamtaai/e5-large-instruct-hadith')
model = AutoModel.from_pretrained('hamtaai/e5-large-instruct-hadith')

# Tokenize and encode ("your text here")
inputs = tokenizer("متن شما", return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool over tokens (masking out padding), then L2-normalize,
# as is usual for E5-family embedding models
mask = inputs['attention_mask'].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = F.normalize(embeddings, p=2, dim=1)
```
## Performance
Compared with the base model, this fine-tuned version shows improved performance on:
- Semantic similarity tasks
- Question-answering accuracy
- Cross-lingual retrieval
- Religious text understanding
## Training Data
The model was trained on a curated dataset of Persian and Arabic religious texts, including:
- Hadith collections
- Quranic commentaries (Tafsir)
- Religious question-answer pairs
- Contextual information for better understanding
## Limitations
- Primarily optimized for Persian and Arabic texts
- Performance may vary on other languages
- Best results are achieved with proper text normalization (a sketch follows this list)
- Requires appropriate prefixes for instruct-based models
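The card recommends text normalization but does not specify a scheme. A minimal sketch of common Persian-oriented normalization steps (unifying Arabic character variants and stripping diacritics), offered as an assumption rather than the pipeline used in training:
```python
import re

# Map Arabic character variants to their Persian forms
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic Yeh  -> Farsi Yeh
    "\u0643": "\u06A9",  # Arabic Kaf  -> Keheh
    "\u0629": "\u0647",  # Teh Marbuta -> Heh (a common but lossy choice)
})

# Arabic diacritics (harakat) and tatweel
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

def normalize(text: str) -> str:
    text = text.translate(CHAR_MAP)
    text = DIACRITICS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("الكتاب"))  # -> "الکتاب"
```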
## Citation
If you use this model, please cite the original base model and mention this fine-tuned version:
```bibtex
@misc{hamtaai_e5_large_instruct_hadith,
  title={hamtaai/e5-large-instruct-hadith: Fine-tuned Multilingual E5 Model for Persian and Arabic Text Processing},
  author={hamtaai},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/hamtaai/e5-large-instruct-hadith}}
}
```
## License
This model is released under the Apache 2.0 License.