---
language: 
- vi
- en
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- mathematics
- vietnamese
- binary-classification
- hard-negatives
- loss-based-early-stopping
- e5-base
- exact-retrieval
base_model: intfloat/multilingual-e5-base
metrics:
- mean_reciprocal_rank
- hit_rate
- accuracy
datasets:
- custom-vietnamese-math
---

# E5-Math-Vietnamese-Binary: Hard Negatives + Loss-based Early Stopping

## Model Overview

Fine-tuned E5-base model optimized for **exact chunk retrieval** in Vietnamese mathematics using:
- **🎯 Binary Classification**: Correct vs Incorrect (instead of 3-level hierarchy)
- **💪 Hard Negatives**: Related chunks as hard negatives for better discrimination  
- **⏰ Loss-based Early Stopping**: Stops when validation loss stops improving
- **📊 Comprehensive Evaluation**: Hit@K, Accuracy@1, MRR metrics

## Performance Summary

### Training Results
- **Best Validation Loss**: N/A
- **Training Epochs**: 10
- **Early Stopping**: ❌ Not triggered
- **Training Time**: ~4,661 seconds (≈78 minutes)

### Test Performance 🌟 EXCELLENT
Outstanding performance, with the correct chunk consistently ranked at or near the top.

| Metric | Base E5 | Fine-tuned | Improvement |
|--------|---------|------------|-------------|
| **MRR** | 0.7740 | 0.8505 | +0.0765 |
| **Accuracy@1** | 0.6129 | 0.7634 | +0.1505 |
| **Hit@1** | 0.6129 | 0.7634 | +0.1505 |
| **Hit@3** | 0.9462 | 0.9247 | -0.0215 |
| **Hit@5** | 1.0000 | 0.9785 | -0.0215 |

**Total Test Queries**: 93

## Key Innovations

### 🎯 Binary Classification Approach
Instead of traditional 3-level hierarchy (correct/related/irrelevant), this model uses:
- **Correct chunks**: Score 1.0 (positive examples)
- **Incorrect chunks**: Score 0.0 (includes both related and irrelevant)
- **Hard negatives**: Related chunks serve as challenging negative examples

### 💪 Hard Negatives Strategy
```python
# Training strategy
positive = correct_chunk           # Score: 1.0
hard_negative = related_chunk      # Score: 0.0 (but semantically close)
easy_negative = irrelevant_chunk   # Score: 0.0 (semantically distant)

# This forces model to learn fine-grained distinctions
```
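The pairing above can be sketched as a small triplet builder. This is an illustrative sketch, not the actual training pipeline; the field names (`query`, `correct_chunk`, `related_chunks`) are assumptions:

```python
def build_triplets(examples):
    """Turn labeled examples into (anchor, positive, hard_negative) triplets
    with the E5 prefixes the model expects."""
    triplets = []
    for ex in examples:
        anchor = f"query: {ex['query']}"
        positive = f"passage: {ex['correct_chunk']}"
        # Related chunks are semantically close but wrong: they become hard negatives
        for neg in ex["related_chunks"]:
            triplets.append((anchor, positive, f"passage: {neg}"))
    return triplets
```

Each triplet pushes the anchor toward its positive and away from a semantically close negative, which is what forces the fine-grained distinctions described above.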

### ⏰ Loss-based Early Stopping
- Monitors **validation loss** instead of MRR
- Stops when loss stops decreasing (patience=3)
- Prevents overfitting and saves training time
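The stopping rule can be sketched as a small helper; this is a minimal sketch of the idea, not the actual training callback:

```python
def should_stop(loss_history, patience=3, min_delta=0.0):
    """Return True when validation loss has not improved for `patience` epochs."""
    if len(loss_history) <= patience:
        return False
    best = min(loss_history[:-patience])  # best loss before the last `patience` epochs
    recent = loss_history[-patience:]
    # Stop only if none of the recent epochs beat the earlier best
    return all(loss >= best - min_delta for loss in recent)
```

With `patience=3`, training continues as long as at least one of the last three epochs improved on the earlier best validation loss.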

## Usage

### Basic Usage
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load model
model = SentenceTransformer('ThanhLe0125/ebd-math')

# ⚠️ CRITICAL: Must use E5 prefixes
query = "query: Định nghĩa hàm số đồng biến là gì?"
chunks = [
    "passage: Hàm số đồng biến trên khoảng (a;b) là hàm số mà...",  # Should rank #1
    "passage: Ví dụ về hàm số đồng biến: f(x) = 2x + 1...",         # Related (trained as hard negative)
    "passage: Phương trình bậc hai có dạng ax² + bx + c = 0"        # Irrelevant
]

# Encode and rank
query_emb = model.encode([query])
chunk_embs = model.encode(chunks)
similarities = cosine_similarity(query_emb, chunk_embs)[0]

# Get rankings
ranked_indices = similarities.argsort()[::-1]
for rank, idx in enumerate(ranked_indices, 1):
    print(f"Rank {rank}: Score {similarities[idx]:.4f} - {chunks[idx][:50]}...")
```

### Advanced Usage with Multiple Queries
```python
def find_best_chunks(queries, chunk_pool, top_k=3):
    """Rank chunks for multiple queries (assumes `model` and `cosine_similarity`
    from the Basic Usage example above)."""
    # Encode the chunk pool once, not once per query
    formatted_chunks = [chunk if chunk.startswith("passage:") else f"passage: {chunk}"
                        for chunk in chunk_pool]
    chunk_embs = model.encode(formatted_chunks)

    results = []
    for query in queries:
        # Ensure E5 format
        formatted_query = query if query.startswith("query:") else f"query: {query}"
        query_emb = model.encode([formatted_query])
        similarities = cosine_similarity(query_emb, chunk_embs)[0]

        # Get top K
        top_indices = similarities.argsort()[::-1][:top_k]
        top_chunks = [
            {
                'chunk': chunk_pool[i],
                'similarity': float(similarities[i]),
                'rank': rank + 1
            }
            for rank, i in enumerate(top_indices)
        ]
        results.append({
            'query': query,
            'top_chunks': top_chunks
        })

    return results

# Example
queries = [
    "Công thức tính đạo hàm của hàm hợp",
    "Cách giải phương trình bậc hai", 
    "Định nghĩa giới hạn của hàm số"
]

chunk_pool = [
    "Đạo hàm của hàm hợp: (f(g(x)))' = f'(g(x)) × g'(x)",
    "Giải phương trình bậc hai bằng công thức nghiệm",
    "Giới hạn của hàm số tại một điểm",
    # ... more chunks
]

results = find_best_chunks(queries, chunk_pool, top_k=3)
```

## Training Details

### Dataset
- **Domain**: Vietnamese mathematics education
- **Split**: Train/Validation/Test with proper separation
- **Hard Negatives**: Related mathematical concepts as challenging negatives
- **Easy Negatives**: Unrelated mathematical concepts

### Training Configuration
```python
class Config:
    base_model = "intfloat/multilingual-e5-base"
    train_batch_size = 4
    learning_rate = 2e-5
    max_epochs = 10
    early_stopping_patience = 3
    loss_function = "MultipleNegativesRankingLoss"
    evaluation_metric = "validation_loss"
```

### Evaluation Methodology
1. **Training**: Binary classification with hard negatives
2. **Validation**: Loss-based monitoring for early stopping  
3. **Testing**: Comprehensive evaluation with restored 3-level hierarchy
4. **Metrics**: Hit@K, Accuracy@1, MRR comparison vs base model
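The reported metrics can all be computed from the 1-based rank of the correct chunk for each test query; a minimal sketch (note that Accuracy@1 coincides with Hit@1 here):

```python
def retrieval_metrics(ranks, ks=(1, 3, 5)):
    """Compute MRR and Hit@K from the 1-based rank of the correct chunk per query."""
    n = len(ranks)
    # MRR: mean of 1/rank over all queries
    mrr = sum(1.0 / r for r in ranks) / n
    # Hit@K: fraction of queries whose correct chunk lands in the top K
    hits = {f"hit@{k}": sum(r <= k for r in ranks) / n for k in ks}
    return {"mrr": mrr, **hits}
```

For example, ranks `[1, 2, 1, 4]` give MRR = (1 + 0.5 + 1 + 0.25) / 4 = 0.6875 and Hit@1 = 0.5.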

## Model Architecture
- **Base**: intfloat/multilingual-e5-base
- **Max Sequence Length**: 256 tokens
- **Output Dimension**: 768
- **Similarity**: Cosine similarity
- **Training Loss**: MultipleNegativesRankingLoss

## Use Cases
- **Educational Q&A**: Find exact mathematical definitions and explanations
- **Content Retrieval**: Precise chunk retrieval for Vietnamese math content
- **Tutoring Systems**: Quick and accurate answer finding
- **Knowledge Base Search**: Efficient mathematical concept lookup

## Performance Interpretation
- **Hit@1 ≥ 0.7**: 🌟 Excellent - Correct answer usually at #1
- **Hit@3 ≥ 0.8**: 🎯 Very Good - Correct answer in top 3
- **MRR ≥ 0.7**: 👍 Good - Low average rank for correct answers
- **Accuracy@1 ≥ 0.6**: ✅ Solid - Good precision for top result

## Limitations
- **Vietnamese-specific**: Optimized for Vietnamese mathematical terminology
- **Domain-specific**: Best performance on educational math content
- **Sequence length**: Limited to 256 tokens
- **E5 format required**: Must use "query:" and "passage:" prefixes

## Citation
```bibtex
@misc{e5-math-vietnamese-binary,
  title={E5-Math-Vietnamese-Binary: Hard Negatives Fine-tuning for Mathematical Retrieval},
  author={ThanhLe0125},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ThanhLe0125/ebd-math}
}
```

---
*Trained on July 01, 2025 using hard negatives and loss-based early stopping for optimal retrieval performance.*