---
language:
- fa
- ar
- multilingual
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- multilingual
- persian
- arabic
- qa
- information-retrieval
- hadith
pipeline_tag: sentence-similarity
base_model:
- intfloat/multilingual-e5-large-instruct
---

# hamtaai/e5-large-instruct-hadith

This is a fine-tuned version of `intfloat/multilingual-e5-large-instruct`, optimized for Persian and Arabic text processing and question-answering tasks.

## Model Description

This model has been fine-tuned on a dataset of Persian and Arabic religious texts, including Hadith collections.

The model is particularly effective for:

- Semantic search in Persian and Arabic texts
- Question-answering tasks
- Information retrieval
- Cross-lingual understanding between Persian and Arabic

## Training Configuration

- **Base Model**: intfloat/multilingual-e5-large-instruct
- **Epochs**: 5
- **Batch Size**: 72
- **Learning Rate**: 2e-05
- **Warmup Steps Ratio**: 0.1
- **Evaluation Steps Ratio**: 0.5

## Usage

### Using Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model
model = SentenceTransformer('hamtaai/e5-large-instruct-hadith')

# For instruct models, use proper prefixes
query = "query: سوال شما اینجا"      # "your question here"
passage = "passage: متن پاسخ اینجا"  # "answer text here"

# Encode texts
query_embedding = model.encode(query)
passage_embedding = model.encode(passage)

# Calculate similarity
similarity = cos_sim(query_embedding, passage_embedding)
```

### Using Hugging Face Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('hamtaai/e5-large-instruct-hadith')
model = AutoModel.from_pretrained('hamtaai/e5-large-instruct-hadith')

# Tokenize and encode ("متن شما" means "your text"); add the "query: " or
# "passage: " prefix as in the Sentence-Transformers example above
inputs = tokenizer("متن شما", return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool over real tokens only, so padding does not skew batched inputs
mask = inputs["attention_mask"].unsqueeze(-1).type_as(outputs.last_hidden_state)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```

## Performance

This model has been optimized for Persian and Arabic text processing and shows improved performance on:

- Semantic similarity tasks
- Question-answering accuracy
- Cross-lingual retrieval
- Religious text understanding

## Training Data

The model was trained on a curated dataset of Persian and Arabic religious texts, including:

- Hadith collections
- Quranic commentaries (Tafsir)
- Religious question-answer pairs
- Contextual information for better understanding

## Limitations

- Primarily optimized for Persian and Arabic texts
- Performance may vary on other languages
- Best results are achieved with proper text normalization
- Inputs require the appropriate prefixes (`query: ` and `passage: `, as shown under Usage)

## Citation

If you use this model, please cite the original base model and mention this fine-tuned version:

```bibtex
@misc{hamtaai/e5_large_instruct_hadith,
  title={hamtaai/e5-large-instruct-hadith: Fine-tuned Multilingual E5 Model for Persian and Arabic Text Processing},
  author={Your Name},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/hamtaai/e5-large-instruct-hadith}}
}
```

## License

This model is released under the Apache 2.0 License.
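
## Retrieval Example

The semantic search use case listed in this card can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper names `add_prefix`, `cosine`, `rank_passages`, and `search` are hypothetical, and the `query: ` / `passage: ` prefixes are taken from the usage example in this card. The ranking helpers use only the standard library so they can be checked without downloading the model; `search` loads the model lazily.

```python
import math
from typing import List, Sequence, Tuple


def add_prefix(texts: Sequence[str], prefix: str) -> List[str]:
    """Prepend a role prefix ("query: " or "passage: ") to each text."""
    return [f"{prefix}{t}" for t in texts]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs, top_k: int = 3) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs, best match first."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def search(query: str, passages: Sequence[str], top_k: int = 3):
    """Embed the query and passages with the model, then rank the passages.

    Assumes sentence-transformers is installed; the model is downloaded
    on the first call.
    """
    from sentence_transformers import SentenceTransformer  # lazy, heavy import

    model = SentenceTransformer("hamtaai/e5-large-instruct-hadith")
    q_vec = model.encode(add_prefix([query], "query: "))[0]
    p_vecs = model.encode(add_prefix(passages, "passage: "))
    return [(passages[i], score) for i, score in rank_passages(q_vec, p_vecs, top_k)]
```

Calling `search(question, passages)` with a Persian or Arabic question and a list of candidate passages returns the `top_k` passages paired with their cosine scores. For large collections, a vectorized similarity (for example `sentence_transformers.util.cos_sim`) would likely be faster than the pure-Python loop used here.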