---
language:
- vi
- en
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- mathematics
- vietnamese
- binary-classification
- hard-negatives
- loss-based-early-stopping
- e5-base
- exact-retrieval
base_model: intfloat/multilingual-e5-base
metrics:
- mean_reciprocal_rank
- hit_rate
- accuracy
datasets:
- custom-vietnamese-math
---
# E5-Math-Vietnamese-Binary: Hard Negatives + Loss-based Early Stopping
## Model Overview
Fine-tuned E5-base model optimized for **exact chunk retrieval** in Vietnamese mathematics using:
- **🎯 Binary Classification**: Correct vs Incorrect (instead of 3-level hierarchy)
- **💪 Hard Negatives**: Related chunks as hard negatives for better discrimination
- **⏰ Loss-based Early Stopping**: Stops when validation loss stops improving
- **📊 Comprehensive Evaluation**: Hit@K, Accuracy@1, MRR metrics
## Performance Summary
### Training Results
- **Best Validation Loss**: N/A
- **Training Epochs**: 10
- **Early Stopping**: ❌ Not triggered
- **Training Time**: 4661.23 seconds (~78 minutes)
### Test Performance 🌟 EXCELLENT
Outstanding performance, with correct chunks consistently ranked at the top positions.
| Metric | Base E5 | Fine-tuned | Improvement |
|--------|---------|------------|-------------|
| **MRR** | 0.7740 | 0.8505 | +0.0765 |
| **Accuracy@1** | 0.6129 | 0.7634 | +0.1505 |
| **Hit@1** | 0.6129 | 0.7634 | +0.1505 |
| **Hit@3** | 0.9462 | 0.9247 | -0.0215 |
| **Hit@5** | 1.0000 | 0.9785 | -0.0215 |
**Total Test Queries**: 93
## Key Innovations
### 🎯 Binary Classification Approach
Instead of traditional 3-level hierarchy (correct/related/irrelevant), this model uses:
- **Correct chunks**: Score 1.0 (positive examples)
- **Incorrect chunks**: Score 0.0 (includes both related and irrelevant)
- **Hard negatives**: Related chunks serve as challenging negative examples
### 💪 Hard Negatives Strategy
```python
from sentence_transformers import InputExample

# Each training triplet pairs a query with its correct chunk (positive)
# and a related chunk as a hard negative: labeled 0.0 like irrelevant
# chunks (the easy negatives), but semantically close to the positive.
# This forces the model to learn fine-grained distinctions.
example = InputExample(texts=[
    "query: ...",              # anchor query
    "passage: correct chunk",  # positive (score 1.0)
    "passage: related chunk",  # hard negative (score 0.0, semantically close)
])
```
### ⏰ Loss-based Early Stopping
- Monitors **validation loss** instead of MRR
- Stops when loss stops decreasing (patience=3)
- Prevents overfitting and saves training time
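The stopping rule above can be sketched as follows. This is a minimal illustration of the idea, not the actual training script; the `should_stop` helper and the loss-history representation are assumptions.

```python
def should_stop(val_losses, patience=3):
    """Loss-based early stopping: return True when validation loss has not
    improved for `patience` consecutive evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    # Stop if none of the last `patience` losses beat the earlier best
    return all(loss >= best_before for loss in val_losses[-patience:])
```

With `patience=3`, this matches the configuration reported in the training details below.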
## Usage
### Basic Usage
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Load model
model = SentenceTransformer('ThanhLe0125/ebd-math')
# ⚠️ CRITICAL: Must use E5 prefixes
query = "query: Định nghĩa hàm số đồng biến là gì?"
chunks = [
    "passage: Hàm số đồng biến trên khoảng (a;b) là hàm số mà...",  # Should rank #1
    "passage: Ví dụ về hàm số đồng biến: f(x) = 2x + 1...",         # Related (trained as hard negative)
    "passage: Phương trình bậc hai có dạng ax² + bx + c = 0"        # Irrelevant
]
# Encode and rank
query_emb = model.encode([query])
chunk_embs = model.encode(chunks)
similarities = cosine_similarity(query_emb, chunk_embs)[0]
# Get rankings
ranked_indices = similarities.argsort()[::-1]
for rank, idx in enumerate(ranked_indices, 1):
    print(f"Rank {rank}: Score {similarities[idx]:.4f} - {chunks[idx][:50]}...")
```
### Advanced Usage with Multiple Queries
```python
def find_best_chunks(queries, chunk_pool, top_k=3):
    """Find best chunks for multiple queries."""
    results = []
    for query in queries:
        # Ensure E5 format
        formatted_query = f"query: {query}" if not query.startswith("query:") else query
        formatted_chunks = [f"passage: {chunk}" if not chunk.startswith("passage:") else chunk
                            for chunk in chunk_pool]
        # Encode
        query_emb = model.encode([formatted_query])
        chunk_embs = model.encode(formatted_chunks)
        similarities = cosine_similarity(query_emb, chunk_embs)[0]
        # Get top K
        top_indices = similarities.argsort()[::-1][:top_k]
        top_chunks = [
            {
                'chunk': chunk_pool[i],
                'similarity': similarities[i],
                'rank': rank + 1
            }
            for rank, i in enumerate(top_indices)
        ]
        results.append({
            'query': query,
            'top_chunks': top_chunks
        })
    return results
# Example
queries = [
    "Công thức tính đạo hàm của hàm hợp",
    "Cách giải phương trình bậc hai",
    "Định nghĩa giới hạn của hàm số"
]
chunk_pool = [
    "Đạo hàm của hàm hợp: (f(g(x)))' = f'(g(x)) × g'(x)",
    "Giải phương trình bậc hai bằng công thức nghiệm",
    "Giới hạn của hàm số tại một điểm",
    # ... more chunks
]
results = find_best_chunks(queries, chunk_pool, top_k=3)
```
## Training Details
### Dataset
- **Domain**: Vietnamese mathematics education
- **Split**: Train/Validation/Test with proper separation
- **Hard Negatives**: Related mathematical concepts as challenging negatives
- **Easy Negatives**: Unrelated mathematical concepts
### Training Configuration
```python
# Training configuration
base_model = "intfloat/multilingual-e5-base"
train_batch_size = 4
learning_rate = 2e-5
max_epochs = 10
early_stopping_patience = 3
loss_function = "MultipleNegativesRankingLoss"
evaluation_metric = "validation_loss"
```
### Evaluation Methodology
1. **Training**: Binary classification with hard negatives
2. **Validation**: Loss-based monitoring for early stopping
3. **Testing**: Comprehensive evaluation with restored 3-level hierarchy
4. **Metrics**: Hit@K, Accuracy@1, MRR comparison vs base model
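The reported metrics can be computed per query roughly as follows. This is an illustrative helper under the assumption of one correct chunk per query, not the evaluation script actually used:

```python
def rank_metrics(ranked_chunk_ids, correct_id, ks=(1, 3, 5)):
    """Reciprocal rank and Hit@K for a single query, given chunk ids
    sorted by descending similarity to the query."""
    rank = ranked_chunk_ids.index(correct_id) + 1  # 1-based position
    reciprocal_rank = 1.0 / rank
    hits = {k: rank <= k for k in ks}
    return reciprocal_rank, hits
```

MRR is then the mean of reciprocal ranks over all test queries; with a single correct chunk per query, Accuracy@1 coincides with Hit@1.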
## Model Architecture
- **Base**: intfloat/multilingual-e5-base
- **Max Sequence Length**: 256 tokens
- **Output Dimension**: 768
- **Similarity**: Cosine similarity
- **Training Loss**: MultipleNegativesRankingLoss
## Use Cases
- ✅ **Educational Q&A**: Find exact mathematical definitions and explanations
- ✅ **Content Retrieval**: Precise chunk retrieval for Vietnamese math content
- ✅ **Tutoring Systems**: Quick and accurate answer finding
- ✅ **Knowledge Base Search**: Efficient mathematical concept lookup
## Performance Interpretation
- **Hit@1 ≥ 0.7**: 🌟 Excellent - Correct answer usually at #1
- **Hit@3 ≥ 0.8**: 🎯 Very Good - Correct answer in top 3
- **MRR ≥ 0.7**: 👍 Good - Low average rank for correct answers
- **Accuracy@1 ≥ 0.6**: ✅ Solid - Good precision for top result
## Limitations
- **Vietnamese-specific**: Optimized for Vietnamese mathematical terminology
- **Domain-specific**: Best performance on educational math content
- **Sequence length**: Limited to 256 tokens
- **E5 format required**: Must use "query:" and "passage:" prefixes
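Because missing prefixes silently degrade retrieval quality, a small helper can enforce the required format. The function below is illustrative and not shipped with the model:

```python
def with_e5_prefix(text, kind="query"):
    """Prepend the E5 'query: ' or 'passage: ' prefix if it is missing."""
    prefix = f"{kind}: "
    return text if text.startswith(prefix) else prefix + text
```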
## Citation
```bibtex
@misc{e5-math-vietnamese-binary,
  title={E5-Math-Vietnamese-Binary: Hard Negatives Fine-tuning for Mathematical Retrieval},
  author={ThanhLe0125},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ThanhLe0125/ebd-math}
}
```
---
*Trained on July 01, 2025 using hard negatives and loss-based early stopping for optimal retrieval performance.*