---
language: 
- vi
- en
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- embedding
- math
- vietnamese
- multilingual
- e5
base_model: intfloat/multilingual-e5-base
---

# E5-Base-Math: Fine-tuned Vietnamese Math Embedding Model

## Model Description

This is a fine-tuned version of `intfloat/multilingual-e5-base` optimized for Vietnamese mathematics content. The model is trained to embed mathematical concepts, definitions, and problem-solving text in Vietnamese.

## Training Details

### Base Model
- **Base model**: `intfloat/multilingual-e5-base`
- **Fine-tuning objective**: Information Retrieval / Sentence Embedding
- **Training date**: 2025-06-24

### Training Configuration
- **Batch size**: 4
- **Learning rate**: 2e-05
- **Epochs**: 3
- **Max sequence length**: 256
- **Warmup steps**: 100
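With these settings the learning rate ramps up linearly over the first 100 optimizer steps. As a rough illustration, assuming the common linear warmup-then-linear-decay schedule (e.g. `transformers.get_linear_schedule_with_warmup`; the card does not state which scheduler was actually used), 2,055 examples at batch size 4 over 3 epochs give about ⌈2055/4⌉ × 3 = 1,542 total steps:

```python
def lr_at_step(step, base_lr=2e-5, warmup_steps=100, total_steps=1542):
    """Linear warmup then linear decay (illustrative only; the card
    does not state which schedule was actually used in training)."""
    if step < warmup_steps:
        # ramp from 0 up to base_lr over the warmup steps
        return base_lr * step / warmup_steps
    # decay linearly from base_lr down to 0 at total_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

for s in (0, 50, 100, 821, 1542):
    print(f"step {s:5d}: lr = {lr_at_step(s):.2e}")
```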

### Training Data
- **Domain**: Vietnamese Mathematics
- **Training examples**: 2055
- **Validation examples**: 229

## Usage

### Using SentenceTransformers
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('ThanhLe0125/e5-base-math')

# Encode queries (E5 models expect a "query: " prefix)
queries = ["query: Định nghĩa hàm số đồng biến là gì?"]
query_embeddings = model.encode(queries)

# Encode passages/documents (prefixed with "passage: ")
passages = ["passage: Hàm số đồng biến trên khoảng (a;b) là hàm số mà với mọi x1 < x2 thì f(x1) < f(x2)"]
passage_embeddings = model.encode(passages)

# Calculate cosine similarity between query and passage embeddings
similarity = cosine_similarity(query_embeddings, passage_embeddings)
```

### For RAG Applications
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('ThanhLe0125/e5-base-math')

# Recommended helpers for RAG: always apply the E5 role prefixes
def encode_query(query_text):
    return model.encode([f"query: {query_text}"])

def encode_passage(passage_text):
    return model.encode([f"passage: {passage_text}"])

# Example usage
query_emb = encode_query("Định nghĩa hàm số đồng biến")
passage_emb = encode_passage("Hàm số đồng biến là...")

# Calculate similarity
similarity = cosine_similarity(query_emb, passage_emb)[0][0]
print(f"Similarity: {similarity:.4f}")
```

## Applications
- **Information Retrieval**: Finding relevant mathematical content
- **RAG Systems**: Retrieval-Augmented Generation for math Q&A
- **Semantic Search**: Searching through mathematical documents
- **Content Recommendation**: Suggesting related mathematical concepts
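For the retrieval and semantic-search uses above, ranking reduces to sorting passage embeddings by cosine similarity against the query embedding. A minimal NumPy sketch of a top-k helper, operating on embeddings as produced by `model.encode` (the toy 3-d vectors below merely stand in for real embeddings):

```python
import numpy as np

def top_k(query_emb, passage_embs, k=3):
    """Return indices of the k passages most similar to the query
    (by cosine similarity), best first, plus all similarity scores."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q  # cosine similarity of each passage to the query
    return np.argsort(-scores)[:k], scores

# Toy demonstration: hand-made 3-d vectors in place of model embeddings
query = np.array([1.0, 0.0, 0.0])
passages = np.array([
    [0.9, 0.1, 0.0],  # very similar to the query
    [0.0, 1.0, 0.0],  # orthogonal
    [0.7, 0.7, 0.0],  # somewhat similar
])
idx, scores = top_k(query, passages, k=2)
print(idx)
```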

## Performance
This model has been fine-tuned specifically for Vietnamese mathematical content and should perform better than the base model for math-related queries in Vietnamese.

## Languages
- Vietnamese (primary)
- English (inherited from base model)

## License
This model inherits the license from the base model `intfloat/multilingual-e5-base`.

## Citation
If you use this model, please cite:
```bibtex
@misc{e5-base-math,
  author = {ThanhLe},
  title = {E5-Base-Math: Fine-tuned Vietnamese Math Embedding Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ThanhLe0125/e5-base-math}}
}
```

## Contact
For questions or issues, please contact via the repository discussions.