---
language:
- vi
- en
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- embedding
- math
- vietnamese
- multilingual
- e5
base_model: intfloat/multilingual-e5-base
---

# E5-Base-Math: Fine-tuned Vietnamese Math Embedding Model

## Model Description

This is a fine-tuned version of `intfloat/multilingual-e5-base` optimized for Vietnamese mathematics content. The model is specifically trained for embedding mathematical concepts, definitions, and problem-solving content in Vietnamese.

## Training Details

### Base Model
- **Base model**: `intfloat/multilingual-e5-base`
- **Fine-tuning objective**: Information Retrieval / Sentence Embedding
- **Training date**: 2025-06-24

### Training Configuration
- **Batch size**: 4
- **Learning rate**: 2e-05
- **Epochs**: 3
- **Max sequence length**: 256
- **Warmup steps**: 100

### Training Data
- **Domain**: Vietnamese Mathematics
- **Training examples**: 2055
- **Validation examples**: 229

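For reference, these figures imply roughly the following optimizer-step budget. This is a back-of-the-envelope calculation, assuming no gradient accumulation and that each training example is seen once per epoch (neither is stated on this card):

```python
import math

train_examples = 2055
batch_size = 4
epochs = 3
warmup_steps = 100

# Steps per epoch, keeping the last (partial) batch
steps_per_epoch = math.ceil(train_examples / batch_size)  # 514
total_steps = steps_per_epoch * epochs                    # 1542

# Warmup covers only a small fraction of training
warmup_fraction = warmup_steps / total_steps
print(steps_per_epoch, total_steps, round(warmup_fraction, 3))  # 514 1542 0.065
```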
## Usage

### Using SentenceTransformers
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('ThanhLe0125/e5-base-math')

# Encode queries (the "query: " prefix improves performance for E5 models)
# "Định nghĩa hàm số đồng biến là gì?" = "What is the definition of an increasing function?"
queries = ["query: Định nghĩa hàm số đồng biến là gì?"]
query_embeddings = model.encode(queries)

# Encode passages/documents with the "passage: " prefix
# The passage says: an increasing function on (a;b) satisfies f(x1) < f(x2) for all x1 < x2
passages = ["passage: Hàm số đồng biến trên khoảng (a;b) là hàm số mà với mọi x1 < x2 thì f(x1) < f(x2)"]
passage_embeddings = model.encode(passages)

# Calculate cosine similarity between query and passage embeddings
similarity = cosine_similarity(query_embeddings, passage_embeddings)
```

### For RAG Applications
```python
# Recommended helpers for RAG pipelines: prepend the E5 prefixes consistently.
# Reuses `model` and `cosine_similarity` from the snippet above.
def encode_query(query_text):
    return model.encode([f"query: {query_text}"])

def encode_passage(passage_text):
    return model.encode([f"passage: {passage_text}"])

# Example usage ("Định nghĩa hàm số đồng biến" = "Definition of an increasing function")
query_emb = encode_query("Định nghĩa hàm số đồng biến")
passage_emb = encode_passage("Hàm số đồng biến là...")

# Calculate similarity
similarity = cosine_similarity(query_emb, passage_emb)[0][0]
print(f"Similarity: {similarity:.4f}")
```

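To illustrate the retrieval step that follows in a RAG pipeline, here is a minimal ranking sketch. It uses synthetic NumPy vectors in place of real `model.encode` output (768 dimensions matches the base model's embedding size; the vectors themselves are stand-ins):

```python
import numpy as np

def cosine_scores(query_vec, passage_mat):
    # Cosine similarity between one query vector and a matrix of passage vectors
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_mat / np.linalg.norm(passage_mat, axis=1, keepdims=True)
    return p @ q

rng = np.random.default_rng(0)
query = rng.normal(size=768)                      # stand-in for encode_query(...)
passages = rng.normal(size=(3, 768))              # stand-ins for encode_passage(...)
passages[2] = query + 0.1 * rng.normal(size=768)  # make passage 2 nearly match the query

scores = cosine_scores(query, passages)
best = int(np.argmax(scores))
print(best)  # passage 2 ranks first
```

In a real pipeline, the top-ranked passages would be passed to the generator as context.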
## Applications
- **Information Retrieval**: Finding relevant mathematical content
- **RAG Systems**: Retrieval-Augmented Generation for math Q&A
- **Semantic Search**: Searching through mathematical documents
- **Content Recommendation**: Suggesting related mathematical concepts

## Performance
This model has been fine-tuned specifically for Vietnamese mathematical content and should perform better than the base model for math-related queries in Vietnamese.

## Languages
- Vietnamese (primary)
- English (inherited from base model)

## License
This model inherits the license from the base model `intfloat/multilingual-e5-base`.

## Citation
If you use this model, please cite:
```bibtex
@misc{e5-base-math,
  author       = {ThanhLe},
  title        = {E5-Base-Math: Fine-tuned Vietnamese Math Embedding Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ThanhLe0125/e5-base-math}}
}
```

## Contact
For questions or issues, please contact via the repository discussions.