---
language:
- vi
- en
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- mathematics
- vietnamese
- binary-classification
- hard-negatives
- loss-based-early-stopping
- e5-base
- exact-retrieval
base_model: intfloat/multilingual-e5-base
metrics:
- mean_reciprocal_rank
- hit_rate
- accuracy
datasets:
- custom-vietnamese-math
---
# E5-Math-Vietnamese-Binary: Hard Negatives + Loss-based Early Stopping
## Model Overview
Fine-tuned E5-base model optimized for **exact chunk retrieval** in Vietnamese mathematics using:
- **🎯 Binary Classification**: Correct vs Incorrect (instead of 3-level hierarchy)
- **💪 Hard Negatives**: Related chunks as hard negatives for better discrimination
- **⏰ Loss-based Early Stopping**: Stops when validation loss stops improving
- **📊 Comprehensive Evaluation**: Hit@K, Accuracy@1, MRR metrics
## Performance Summary
### Training Results
- **Best Validation Loss**: N/A
- **Training Epochs**: 10
- **Early Stopping**: ❌ Not triggered
- **Training Time**: 4661.2 seconds (~1.3 hours)
### Test Performance 🌟 EXCELLENT
Strong results, with the correct chunk consistently ranked at or near the top position.
| Metric | Base E5 | Fine-tuned | Improvement |
|--------|---------|------------|-------------|
| **MRR** | 0.7740 | 0.8505 | +0.0765 |
| **Accuracy@1** | 0.6129 | 0.7634 | +0.1505 |
| **Hit@1** | 0.6129 | 0.7634 | +0.1505 |
| **Hit@3** | 0.9462 | 0.9247 | -0.0215 |
| **Hit@5** | 1.0000 | 0.9785 | -0.0215 |
**Total Test Queries**: 93
## Key Innovations
### 🎯 Binary Classification Approach
Instead of traditional 3-level hierarchy (correct/related/irrelevant), this model uses:
- **Correct chunks**: Score 1.0 (positive examples)
- **Incorrect chunks**: Score 0.0 (includes both related and irrelevant)
- **Hard negatives**: Related chunks serve as challenging negative examples
### 💪 Hard Negatives Strategy
```python
# Training strategy
positive = correct_chunk          # score 1.0
hard_negative = related_chunk     # score 0.0, but semantically close
easy_negative = irrelevant_chunk  # score 0.0, semantically distant
# This forces the model to learn fine-grained distinctions
```
### ⏰ Loss-based Early Stopping
- Monitors **validation loss** instead of MRR
- Stops when loss stops decreasing (patience=3)
- Prevents overfitting and saves training time
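The loop described above can be sketched as follows; `train_epoch_fn` and `val_loss_fn` are hypothetical placeholders for the actual training and validation steps, not part of this repository:

```python
def train_with_early_stopping(model, train_epoch_fn, val_loss_fn,
                              max_epochs=10, patience=3):
    """Stop when validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_epoch_fn(model, epoch)   # one pass over the training set
        val_loss = val_loss_fn(model)  # loss on the held-out validation split
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0  # improvement: reset the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Early stopping at epoch {epoch + 1}")
                break
    return best_loss
```

With `patience=3`, training halts three epochs after the last improvement; since this run completed all 10 epochs, the stop condition was never reached.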
## Usage
### Basic Usage
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Load model
model = SentenceTransformer('ThanhLe0125/ebd-math')
# ⚠️ CRITICAL: Must use E5 prefixes
query = "query: Định nghĩa hàm số đồng biến là gì?"
chunks = [
    "passage: Hàm số đồng biến trên khoảng (a;b) là hàm số mà...",  # should rank #1
    "passage: Ví dụ về hàm số đồng biến: f(x) = 2x + 1...",         # related (trained as hard negative)
    "passage: Phương trình bậc hai có dạng ax² + bx + c = 0"        # irrelevant
]

# Encode and rank
query_emb = model.encode([query])
chunk_embs = model.encode(chunks)
similarities = cosine_similarity(query_emb, chunk_embs)[0]

# Get rankings (highest similarity first)
ranked_indices = similarities.argsort()[::-1]
for rank, idx in enumerate(ranked_indices, 1):
    print(f"Rank {rank}: Score {similarities[idx]:.4f} - {chunks[idx][:50]}...")
```
### Advanced Usage with Multiple Queries
```python
def find_best_chunks(queries, chunk_pool, top_k=3):
    """Find the best chunks for multiple queries."""
    results = []
    for query in queries:
        # Ensure E5 format
        formatted_query = query if query.startswith("query:") else f"query: {query}"
        formatted_chunks = [chunk if chunk.startswith("passage:") else f"passage: {chunk}"
                            for chunk in chunk_pool]

        # Encode
        query_emb = model.encode([formatted_query])
        chunk_embs = model.encode(formatted_chunks)
        similarities = cosine_similarity(query_emb, chunk_embs)[0]

        # Get top K
        top_indices = similarities.argsort()[::-1][:top_k]
        top_chunks = [
            {
                'chunk': chunk_pool[i],
                'similarity': float(similarities[i]),
                'rank': rank + 1
            }
            for rank, i in enumerate(top_indices)
        ]
        results.append({
            'query': query,
            'top_chunks': top_chunks
        })
    return results
# Example
queries = [
"Công thức tính đạo hàm của hàm hợp",
"Cách giải phương trình bậc hai",
"Định nghĩa giới hạn của hàm số"
]
chunk_pool = [
"Đạo hàm của hàm hợp: (f(g(x)))' = f'(g(x)) × g'(x)",
"Giải phương trình bậc hai bằng công thức nghiệm",
"Giới hạn của hàm số tại một điểm",
# ... more chunks
]
results = find_best_chunks(queries, chunk_pool, top_k=3)
```
## Training Details
### Dataset
- **Domain**: Vietnamese mathematics education
- **Split**: Train/Validation/Test with proper separation
- **Hard Negatives**: Related mathematical concepts as challenging negatives
- **Easy Negatives**: Unrelated mathematical concepts
### Training Configuration
```python
config = {
    "base_model": "intfloat/multilingual-e5-base",
    "train_batch_size": 4,
    "learning_rate": 2e-5,
    "max_epochs": 10,
    "early_stopping_patience": 3,
    "loss_function": "MultipleNegativesRankingLoss",
    "evaluation_metric": "validation_loss",
}
```
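For intuition, the in-batch objective behind `MultipleNegativesRankingLoss` can be written out in plain Python. This is a dependency-free sketch, not the library's implementation: each query's positive is assumed to sit at the same batch index, and `scale=20.0` matches the sentence-transformers default.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

def mnr_loss(query_embs, passage_embs, scale=20.0):
    """In-batch softmax cross-entropy: query i's positive is passage i;
    every other passage in the batch serves as a negative."""
    total = 0.0
    for i, q in enumerate(query_embs):
        logits = [scale * cosine(q, p) for p in passage_embs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        total += -(logits[i] - log_denom)  # negative log-likelihood of the positive
    return total / len(query_embs)
```

Explicitly mined hard negatives simply join the batch as extra passages, which is why related chunks sharpen the decision boundary more than random ones.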
### Evaluation Methodology
1. **Training**: Binary classification with hard negatives
2. **Validation**: Loss-based monitoring for early stopping
3. **Testing**: Comprehensive evaluation with restored 3-level hierarchy
4. **Metrics**: Hit@K, Accuracy@1, MRR comparison vs base model
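The metrics above can be computed from per-query rankings in a few lines; this is an illustrative sketch, not the project's actual evaluation script. Note that with a single correct chunk per query, Accuracy@1 and Hit@1 coincide, which is why the table reports identical values for them.

```python
def evaluate_ranking(ranked_ids_per_query, correct_id_per_query, ks=(1, 3, 5)):
    """Compute MRR and Hit@K given, for each query, the chunk ids
    ranked by similarity and the id of the single correct chunk."""
    n = len(ranked_ids_per_query)
    mrr = 0.0
    hits = {k: 0 for k in ks}
    for ranked, correct in zip(ranked_ids_per_query, correct_id_per_query):
        rank = ranked.index(correct) + 1  # 1-based rank of the correct chunk
        mrr += 1.0 / rank                 # reciprocal rank contribution
        for k in ks:
            if rank <= k:
                hits[k] += 1
    return {"MRR": mrr / n, **{f"Hit@{k}": hits[k] / n for k in ks}}
```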
## Model Architecture
- **Base**: intfloat/multilingual-e5-base
- **Max Sequence Length**: 256 tokens
- **Output Dimension**: 768
- **Similarity**: Cosine similarity
- **Training Loss**: MultipleNegativesRankingLoss
## Use Cases
- **Educational Q&A**: Find exact mathematical definitions and explanations
- **Content Retrieval**: Precise chunk retrieval for Vietnamese math content
- **Tutoring Systems**: Quick and accurate answer finding
- **Knowledge Base Search**: Efficient mathematical concept lookup
## Performance Interpretation
- **Hit@1 ≥ 0.7**: 🌟 Excellent - Correct answer usually at #1
- **Hit@3 ≥ 0.8**: 🎯 Very Good - Correct answer in top 3
- **MRR ≥ 0.7**: 👍 Good - Low average rank for correct answers
- **Accuracy@1 ≥ 0.6**: ✅ Solid - Good precision for top result
## Limitations
- **Vietnamese-specific**: Optimized for Vietnamese mathematical terminology
- **Domain-specific**: Best performance on educational math content
- **Sequence length**: Limited to 256 tokens
- **E5 format required**: Must use "query:" and "passage:" prefixes
## Citation
```bibtex
@misc{e5-math-vietnamese-binary,
  title={E5-Math-Vietnamese-Binary: Hard Negatives Fine-tuning for Mathematical Retrieval},
  author={ThanhLe0125},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ThanhLe0125/ebd-math}
}
```
---
*Trained on July 01, 2025 using hard negatives and loss-based early stopping for optimal retrieval performance.*