---
language:
- vi
- en
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- embedding
- math
- vietnamese
- multilingual
- e5
base_model: intfloat/multilingual-e5-base
---

# E5-Base-Math: Fine-tuned Vietnamese Math Embedding Model

## Model Description

This model is a fine-tuned version of `intfloat/multilingual-e5-base` optimized for Vietnamese mathematics content. It is trained to embed mathematical concepts, definitions, and problem-solving content in Vietnamese.

## Training Details

### Base Model

- **Base model**: `intfloat/multilingual-e5-base`
- **Fine-tuning objective**: Information Retrieval / Sentence Embedding
- **Training date**: 2025-06-24

### Training Configuration

- **Batch size**: 4
- **Learning rate**: 2e-05
- **Epochs**: 3
- **Max sequence length**: 256
- **Warmup steps**: 100

### Training Data

- **Domain**: Vietnamese Mathematics
- **Training examples**: 2055
- **Validation examples**: 229

## Usage

### Using SentenceTransformers

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('ThanhLe0125/e5-base-math')

# Encode queries (the "query: " prefix improves E5 performance)
queries = ["query: Định nghĩa hàm số đồng biến là gì?"]
query_embeddings = model.encode(queries)

# Encode passages/documents (use the "passage: " prefix)
passages = ["passage: Hàm số đồng biến trên khoảng (a;b) là hàm số mà với mọi x1 < x2 thì f(x1) < f(x2)"]
passage_embeddings = model.encode(passages)

# Calculate cosine similarity between query and passage embeddings
similarity = cosine_similarity(query_embeddings, passage_embeddings)
```

### For RAG Applications

```python
# Recommended usage for RAG: always add the E5 prefixes
def encode_query(query_text):
    return model.encode([f"query: {query_text}"])

def encode_passage(passage_text):
    return model.encode([f"passage: {passage_text}"])

# Example usage
query_emb = encode_query("Định nghĩa hàm số đồng biến")
passage_emb = encode_passage("Hàm số đồng biến là...")

# Calculate similarity
similarity = cosine_similarity(query_emb, passage_emb)[0][0]
print(f"Similarity: {similarity:.4f}")
```

## Applications

- **Information Retrieval**: finding relevant mathematical content
- **RAG Systems**: retrieval-augmented generation for math Q&A
- **Semantic Search**: searching through mathematical documents
- **Content Recommendation**: suggesting related mathematical concepts

## Performance

This model has been fine-tuned specifically on Vietnamese mathematical content and is expected to outperform the base model on math-related queries in Vietnamese.

## Languages

- Vietnamese (primary)
- English (inherited from the base model)

## License

This model inherits the license of the base model, `intfloat/multilingual-e5-base`.

## Citation

If you use this model, please cite:

```bibtex
@misc{e5-base-math,
  author = {ThanhLe},
  title = {E5-Base-Math: Fine-tuned Vietnamese Math Embedding Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ThanhLe0125/e5-base-math}}
}
```

## Contact

For questions or issues, please use the repository's Discussions tab.
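
## Appendix: Ranking Passages

The usage examples above compare one query against one passage; in retrieval and RAG, the same cosine-similarity computation is used to rank many candidate passages. The sketch below shows that ranking step in plain NumPy, with small placeholder vectors standing in for real `model.encode(...)` output (the `rank_passages` helper and the toy embeddings are illustrative, not part of this model's API):

```python
import numpy as np

def rank_passages(query_emb, passage_embs):
    """Rank passages by cosine similarity to a query embedding.

    query_emb: 1-D array (one query embedding)
    passage_embs: 2-D array, one row per passage embedding
    Returns (indices sorted most-to-least similar, similarity scores).
    """
    # Normalize so the dot product equals cosine similarity
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    sims = p @ q
    return np.argsort(-sims), sims

# Placeholder embeddings standing in for model.encode(...) output
query_emb = np.array([1.0, 0.0, 0.0])
passage_embs = np.array([
    [0.9, 0.1, 0.0],   # very similar to the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.5, 0.5, 0.0],   # partially similar
])

order, sims = rank_passages(query_emb, passage_embs)
print(order)  # → [0 2 1], most similar passage first
```

With real embeddings, `passage_embs` would hold one `encode_passage(...)` row per document in the corpus, and the top-ranked indices feed the generation step of a RAG pipeline.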