---
language:
- vi
- en
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- mathematics
- vietnamese
- binary-classification
- hard-negatives
- loss-based-early-stopping
- e5-base
- exact-retrieval
base_model: intfloat/multilingual-e5-base
metrics:
- mean_reciprocal_rank
- hit_rate
- accuracy
datasets:
- custom-vietnamese-math
---
# E5-Math-Vietnamese-Binary: Hard Negatives + Loss-based Early Stopping
## Model Overview
Fine-tuned E5-base model optimized for **exact chunk retrieval** in Vietnamese mathematics using:
- **🎯 Binary Classification**: Correct vs Incorrect (instead of 3-level hierarchy)
- **💪 Hard Negatives**: Related chunks as hard negatives for better discrimination
- **⏰ Loss-based Early Stopping**: Stops when validation loss stops improving
- **📊 Comprehensive Evaluation**: Hit@K, Accuracy@1, MRR metrics
## Performance Summary
### Training Results
- **Best Validation Loss**: N/A
- **Training Epochs**: 10
- **Early Stopping**: ❌ Not triggered
- **Training Time**: 4661.23 seconds (~78 minutes)
### Test Performance 🌟 EXCELLENT
Outstanding performance, with correct chunks consistently ranked at the top positions.
| Metric | Base E5 | Fine-tuned | Improvement |
|--------|---------|------------|-------------|
| **MRR** | 0.7740 | 0.8505 | +0.0765 |
| **Accuracy@1** | 0.6129 | 0.7634 | +0.1505 |
| **Hit@1** | 0.6129 | 0.7634 | +0.1505 |
| **Hit@3** | 0.9462 | 0.9247 | -0.0215 |
| **Hit@5** | 1.0000 | 0.9785 | -0.0215 |
**Total Test Queries**: 93
## Key Innovations
### 🎯 Binary Classification Approach
Instead of traditional 3-level hierarchy (correct/related/irrelevant), this model uses:
- **Correct chunks**: Score 1.0 (positive examples)
- **Incorrect chunks**: Score 0.0 (includes both related and irrelevant)
- **Hard negatives**: Related chunks serve as challenging negative examples
### 💪 Hard Negatives Strategy
```python
from sentence_transformers import InputExample

# Each training triplet pairs a query with its correct chunk (positive)
# and a related chunk as a hard negative: labeled 0.0 like irrelevant
# chunks (the easy negatives), but semantically close to the positive.
# This forces the model to learn fine-grained distinctions.
example = InputExample(texts=[
    "query: ...",              # anchor query
    "passage: correct chunk",  # positive (score 1.0)
    "passage: related chunk",  # hard negative (score 0.0, semantically close)
])
```
### ⏰ Loss-based Early Stopping
- Monitors **validation loss** instead of MRR
- Stops when loss stops decreasing (patience=3)
- Prevents overfitting and saves training time
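The stopping rule above can be sketched as follows. This is a minimal illustration of the idea, not the actual training script; the `should_stop` helper and the loss-history representation are assumptions.

```python
def should_stop(val_losses, patience=3):
    """Loss-based early stopping: return True when validation loss has not
    improved for `patience` consecutive evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    # Stop if none of the last `patience` losses beat the earlier best
    return all(loss >= best_before for loss in val_losses[-patience:])
```

With `patience=3`, this matches the configuration reported in the training details below.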
## Usage
### Basic Usage
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Load model
model = SentenceTransformer('ThanhLe0125/ebd-math')
# ⚠️ CRITICAL: Must use E5 prefixes
query = "query: Định nghĩa hàm số đồng biến là gì?"
chunks = [
    "passage: Hàm số đồng biến trên khoảng (a;b) là hàm số mà...",  # Should rank #1
    "passage: Ví dụ về hàm số đồng biến: f(x) = 2x + 1...",         # Related (trained as hard negative)
    "passage: Phương trình bậc hai có dạng ax² + bx + c = 0"        # Irrelevant
]
# Encode and rank
query_emb = model.encode([query])
chunk_embs = model.encode(chunks)
similarities = cosine_similarity(query_emb, chunk_embs)[0]
# Get rankings
ranked_indices = similarities.argsort()[::-1]
for rank, idx in enumerate(ranked_indices, 1):
    print(f"Rank {rank}: Score {similarities[idx]:.4f} - {chunks[idx][:50]}...")
```
### Advanced Usage with Multiple Queries
```python
def find_best_chunks(queries, chunk_pool, top_k=3):
    """Find best chunks for multiple queries."""
    results = []
    for query in queries:
        # Ensure E5 format
        formatted_query = f"query: {query}" if not query.startswith("query:") else query
        formatted_chunks = [f"passage: {chunk}" if not chunk.startswith("passage:") else chunk
                            for chunk in chunk_pool]
        # Encode
        query_emb = model.encode([formatted_query])
        chunk_embs = model.encode(formatted_chunks)
        similarities = cosine_similarity(query_emb, chunk_embs)[0]
        # Get top K
        top_indices = similarities.argsort()[::-1][:top_k]
        top_chunks = [
            {
                'chunk': chunk_pool[i],
                'similarity': similarities[i],
                'rank': rank + 1
            }
            for rank, i in enumerate(top_indices)
        ]
        results.append({
            'query': query,
            'top_chunks': top_chunks
        })
    return results
# Example
queries = [
    "Công thức tính đạo hàm của hàm hợp",
    "Cách giải phương trình bậc hai",
    "Định nghĩa giới hạn của hàm số"
]
chunk_pool = [
    "Đạo hàm của hàm hợp: (f(g(x)))' = f'(g(x)) × g'(x)",
    "Giải phương trình bậc hai bằng công thức nghiệm",
    "Giới hạn của hàm số tại một điểm",
    # ... more chunks
]
results = find_best_chunks(queries, chunk_pool, top_k=3)
```
## Training Details
### Dataset
- **Domain**: Vietnamese mathematics education
- **Split**: Train/Validation/Test with proper separation
- **Hard Negatives**: Related mathematical concepts as challenging negatives
- **Easy Negatives**: Unrelated mathematical concepts
### Training Configuration
```python
# Training configuration
base_model = "intfloat/multilingual-e5-base"
train_batch_size = 4
learning_rate = 2e-5
max_epochs = 10
early_stopping_patience = 3
loss_function = "MultipleNegativesRankingLoss"
evaluation_metric = "validation_loss"
```
### Evaluation Methodology
1. **Training**: Binary classification with hard negatives
2. **Validation**: Loss-based monitoring for early stopping
3. **Testing**: Comprehensive evaluation with restored 3-level hierarchy
4. **Metrics**: Hit@K, Accuracy@1, MRR comparison vs base model
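The reported metrics can be computed per query roughly as follows. This is an illustrative helper under the assumption of one correct chunk per query, not the evaluation script actually used:

```python
def rank_metrics(ranked_chunk_ids, correct_id, ks=(1, 3, 5)):
    """Reciprocal rank and Hit@K for a single query, given chunk ids
    sorted by descending similarity to the query."""
    rank = ranked_chunk_ids.index(correct_id) + 1  # 1-based position
    reciprocal_rank = 1.0 / rank
    hits = {k: rank <= k for k in ks}
    return reciprocal_rank, hits
```

MRR is then the mean of reciprocal ranks over all test queries; with a single correct chunk per query, Accuracy@1 coincides with Hit@1.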
## Model Architecture
- **Base**: intfloat/multilingual-e5-base
- **Max Sequence Length**: 256 tokens
- **Output Dimension**: 768
- **Similarity**: Cosine similarity
- **Training Loss**: MultipleNegativesRankingLoss
## Use Cases
- ✅ **Educational Q&A**: Find exact mathematical definitions and explanations
- ✅ **Content Retrieval**: Precise chunk retrieval for Vietnamese math content
- ✅ **Tutoring Systems**: Quick and accurate answer finding
- ✅ **Knowledge Base Search**: Efficient mathematical concept lookup
## Performance Interpretation
- **Hit@1 ≥ 0.7**: 🌟 Excellent - Correct answer usually at #1
- **Hit@3 ≥ 0.8**: 🎯 Very Good - Correct answer in top 3
- **MRR ≥ 0.7**: 👍 Good - Low average rank for correct answers
- **Accuracy@1 ≥ 0.6**: ✅ Solid - Good precision for top result
## Limitations
- **Vietnamese-specific**: Optimized for Vietnamese mathematical terminology
- **Domain-specific**: Best performance on educational math content
- **Sequence length**: Limited to 256 tokens
- **E5 format required**: Must use "query:" and "passage:" prefixes
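Because missing prefixes silently degrade retrieval quality, a small helper can enforce the required format. The function below is illustrative and not shipped with the model:

```python
def with_e5_prefix(text, kind="query"):
    """Prepend the E5 'query: ' or 'passage: ' prefix if it is missing."""
    prefix = f"{kind}: "
    return text if text.startswith(prefix) else prefix + text
```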
## Citation
```bibtex
@misc{e5-math-vietnamese-binary,
  title={E5-Math-Vietnamese-Binary: Hard Negatives Fine-tuning for Mathematical Retrieval},
  author={ThanhLe0125},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ThanhLe0125/ebd-math}
}
```
---
*Trained on July 01, 2025 using hard negatives and loss-based early stopping for optimal retrieval performance.*