|
|
--- |
|
|
language: |
|
|
- vi |
|
|
- en |
|
|
library_name: sentence-transformers |
|
|
pipeline_tag: sentence-similarity |
|
|
tags: |
|
|
- sentence-transformers |
|
|
- mathematics |
|
|
- vietnamese |
|
|
- binary-classification |
|
|
- hard-negatives |
|
|
- loss-based-early-stopping |
|
|
- e5-base |
|
|
- exact-retrieval |
|
|
base_model: intfloat/multilingual-e5-base |
|
|
metrics: |
|
|
- mean_reciprocal_rank |
|
|
- hit_rate |
|
|
- accuracy |
|
|
datasets: |
|
|
- custom-vietnamese-math |
|
|
--- |
|
|
|
|
|
# E5-Math-Vietnamese-Binary: Hard Negatives + Loss-based Early Stopping |
|
|
|
|
|
## Model Overview |
|
|
|
|
|
A fine-tuned multilingual E5-base model optimized for **exact chunk retrieval** over Vietnamese mathematics content, using:
|
|
- **🎯 Binary Classification**: Correct vs Incorrect (instead of 3-level hierarchy) |
|
|
- **💪 Hard Negatives**: Related chunks as hard negatives for better discrimination |
|
|
- **⏰ Loss-based Early Stopping**: Stops when validation loss stops improving |
|
|
- **📊 Comprehensive Evaluation**: Hit@K, Accuracy@1, MRR metrics |
|
|
|
|
|
## Performance Summary |
|
|
|
|
|
### Training Results |
|
|
- **Best Validation Loss**: N/A |
|
|
- **Training Epochs**: 10 |
|
|
- **Early Stopping**: ❌ Not triggered |
|
|
- **Training Time**: 4661.2 seconds (~1.3 hours)
|
|
|
|
|
### Test Performance 🌟 EXCELLENT |
|
|
The fine-tuned model consistently places the correct chunk at or near the top position: Hit@3 and Hit@5 dip slightly relative to the base model, in exchange for substantially stronger top-1 accuracy.
|
|
|
|
|
| Metric | Base E5 | Fine-tuned | Improvement | |
|
|
|--------|---------|------------|-------------| |
|
|
| **MRR** | 0.7740 | 0.8505 | +0.0765 | |
|
|
| **Accuracy@1** | 0.6129 | 0.7634 | +0.1505 | |
|
|
| **Hit@1** | 0.6129 | 0.7634 | +0.1505 | |
|
|
| **Hit@3** | 0.9462 | 0.9247 | -0.0215 | |
|
|
| **Hit@5** | 1.0000 | 0.9785 | -0.0215 | |
|
|
|
|
|
**Total Test Queries**: 93 |
|
|
|
|
|
## Key Innovations |
|
|
|
|
|
### 🎯 Binary Classification Approach |
|
|
Instead of the traditional 3-level relevance hierarchy (correct/related/irrelevant), this model uses a binary scheme:
|
|
- **Correct chunks**: Score 1.0 (positive examples) |
|
|
- **Incorrect chunks**: Score 0.0 (includes both related and irrelevant) |
|
|
- **Hard negatives**: Related chunks serve as challenging negative examples |
|
|
|
|
|
### 💪 Hard Negatives Strategy |
|
|
```python |
|
|
# Training strategy |
|
|
positive = correct_chunk # Score: 1.0 |
|
|
hard_negative = related_chunk # Score: 0.0 (but semantically close) |
|
|
easy_negative = irrelevant_chunk # Score: 0.0 (semantically distant) |
|
|
|
|
|
# This forces model to learn fine-grained distinctions |
|
|
``` |
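Concretely, the strategy above amounts to assembling (query, positive, hard-negative) triplets; in-batch positives from other queries additionally act as easy negatives under MultipleNegativesRankingLoss. A minimal sketch (the helper `build_triplets` is illustrative, not the actual training code):

```python
# Illustrative sketch: pair every correct chunk with every related chunk
# as an explicit hard negative. E5 prefixes are applied here as well.

def build_triplets(query, correct_chunks, related_chunks):
    """Return (anchor, positive, hard_negative) text triplets."""
    triplets = []
    for pos in correct_chunks:
        for hard_neg in related_chunks:
            triplets.append((
                f"query: {query}",        # anchor
                f"passage: {pos}",        # positive (score 1.0)
                f"passage: {hard_neg}",   # hard negative (score 0.0)
            ))
    return triplets

triplets = build_triplets(
    "Định nghĩa hàm số đồng biến là gì?",
    ["Hàm số đồng biến trên khoảng (a;b) là hàm số mà..."],
    ["Ví dụ về hàm số đồng biến: f(x) = 2x + 1..."],
)
print(len(triplets))  # 1: one positive × one hard negative
```

Each triplet can then be wrapped in a `sentence_transformers.InputExample` for training.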
|
|
|
|
|
### ⏰ Loss-based Early Stopping |
|
|
- Monitors **validation loss** instead of MRR |
|
|
- Stops when loss stops decreasing (patience=3) |
|
|
- Prevents overfitting and saves training time |
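The check above can be sketched as a small tracker (illustrative; the card does not include the actual training loop, and `LossEarlyStopper` is a hypothetical name):

```python
# Minimal patience-based early stopping on validation loss.

class LossEarlyStopper:
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = LossEarlyStopper(patience=3)
losses = [0.52, 0.41, 0.40, 0.43, 0.44, 0.45]  # illustrative values
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)  # 5: three epochs without improvement after epoch 2
```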
|
|
|
|
|
## Usage |
|
|
|
|
|
### Basic Usage |
|
|
```python |
|
|
from sentence_transformers import SentenceTransformer |
|
|
from sklearn.metrics.pairwise import cosine_similarity |
|
|
|
|
|
# Load model |
|
|
model = SentenceTransformer('ThanhLe0125/ebd-math') |
|
|
|
|
|
# ⚠️ CRITICAL: Must use E5 prefixes |
|
|
query = "query: Định nghĩa hàm số đồng biến là gì?" |
|
|
chunks = [ |
|
|
"passage: Hàm số đồng biến trên khoảng (a;b) là hàm số mà...", # Should rank #1 |
|
|
"passage: Ví dụ về hàm số đồng biến: f(x) = 2x + 1...", # Related (trained as hard negative) |
|
|
"passage: Phương trình bậc hai có dạng ax² + bx + c = 0" # Irrelevant |
|
|
] |
|
|
|
|
|
# Encode and rank |
|
|
query_emb = model.encode([query]) |
|
|
chunk_embs = model.encode(chunks) |
|
|
similarities = cosine_similarity(query_emb, chunk_embs)[0] |
|
|
|
|
|
# Get rankings |
|
|
ranked_indices = similarities.argsort()[::-1] |
|
|
for rank, idx in enumerate(ranked_indices, 1): |
|
|
print(f"Rank {rank}: Score {similarities[idx]:.4f} - {chunks[idx][:50]}...") |
|
|
``` |
|
|
|
|
|
### Advanced Usage with Multiple Queries |
|
|
```python |
|
|
def find_best_chunks(queries, chunk_pool, top_k=3): |
|
|
"""Find best chunks for multiple queries""" |
|
|
results = [] |
|
|
|
|
|
for query in queries: |
|
|
# Ensure E5 format |
|
|
formatted_query = f"query: {query}" if not query.startswith("query:") else query |
|
|
formatted_chunks = [f"passage: {chunk}" if not chunk.startswith("passage:") else chunk |
|
|
for chunk in chunk_pool] |
|
|
|
|
|
# Encode |
|
|
query_emb = model.encode([formatted_query]) |
|
|
chunk_embs = model.encode(formatted_chunks) |
|
|
similarities = cosine_similarity(query_emb, chunk_embs)[0] |
|
|
|
|
|
# Get top K |
|
|
top_indices = similarities.argsort()[::-1][:top_k] |
|
|
top_chunks = [ |
|
|
{ |
|
|
'chunk': chunk_pool[i], |
|
|
'similarity': float(similarities[i]),  # cast numpy scalar to plain float
|
|
'rank': rank + 1 |
|
|
} |
|
|
for rank, i in enumerate(top_indices) |
|
|
] |
|
|
|
|
|
results.append({ |
|
|
'query': query, |
|
|
'top_chunks': top_chunks |
|
|
}) |
|
|
|
|
|
return results |
|
|
|
|
|
# Example |
|
|
queries = [ |
|
|
"Công thức tính đạo hàm của hàm hợp", |
|
|
"Cách giải phương trình bậc hai", |
|
|
"Định nghĩa giới hạn của hàm số" |
|
|
] |
|
|
|
|
|
chunk_pool = [ |
|
|
"Đạo hàm của hàm hợp: (f(g(x)))' = f'(g(x)) × g'(x)", |
|
|
"Giải phương trình bậc hai bằng công thức nghiệm", |
|
|
"Giới hạn của hàm số tại một điểm", |
|
|
# ... more chunks |
|
|
] |
|
|
|
|
|
results = find_best_chunks(queries, chunk_pool, top_k=3) |
|
|
``` |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Dataset |
|
|
- **Domain**: Vietnamese mathematics education |
|
|
- **Split**: Train/Validation/Test with proper separation |
|
|
- **Hard Negatives**: Related mathematical concepts as challenging negatives |
|
|
- **Easy Negatives**: Unrelated mathematical concepts |
|
|
|
|
|
### Training Configuration |
|
|
```python |
|
|
# Training configuration
base_model = "intfloat/multilingual-e5-base"
train_batch_size = 4
learning_rate = 2e-5
max_epochs = 10
early_stopping_patience = 3
loss_function = "MultipleNegativesRankingLoss"
evaluation_metric = "validation_loss"
|
|
``` |
|
|
|
|
|
### Evaluation Methodology |
|
|
1. **Training**: Binary classification with hard negatives |
|
|
2. **Validation**: Loss-based monitoring for early stopping |
|
|
3. **Testing**: Comprehensive evaluation with restored 3-level hierarchy |
|
|
4. **Metrics**: Hit@K, Accuracy@1, MRR comparison vs base model |
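The reported metrics can be computed directly from the rank of the correct chunk for each query (rank 1 = top position); a minimal sketch with illustrative data:

```python
# MRR and Hit@K from per-query ranks of the correct chunk.

def mrr(ranks):
    """Mean reciprocal rank: average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hit_at_k(ranks, k):
    """Fraction of queries whose correct chunk ranks within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 2, 1, 4, 1]  # illustrative ranks for five queries
print(round(mrr(ranks), 3))  # 0.75  -> (1 + 0.5 + 1 + 0.25 + 1) / 5
print(hit_at_k(ranks, 1))    # 0.6   -> identical to Accuracy@1
print(hit_at_k(ranks, 3))    # 0.8
```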
|
|
|
|
|
## Model Architecture |
|
|
- **Base**: intfloat/multilingual-e5-base |
|
|
- **Max Sequence Length**: 256 tokens |
|
|
- **Output Dimension**: 768 |
|
|
- **Similarity**: Cosine similarity |
|
|
- **Training Loss**: MultipleNegativesRankingLoss |
|
|
|
|
|
## Use Cases |
|
|
- ✅ **Educational Q&A**: Find exact mathematical definitions and explanations |
|
|
- ✅ **Content Retrieval**: Precise chunk retrieval for Vietnamese math content |
|
|
- ✅ **Tutoring Systems**: Quick and accurate answer finding |
|
|
- ✅ **Knowledge Base Search**: Efficient mathematical concept lookup |
|
|
|
|
|
## Performance Interpretation |
|
|
- **Hit@1 ≥ 0.7**: 🌟 Excellent - Correct answer usually at #1 |
|
|
- **Hit@3 ≥ 0.8**: 🎯 Very Good - Correct answer in top 3 |
|
|
- **MRR ≥ 0.7**: 👍 Good - Low average rank for correct answers |
|
|
- **Accuracy@1 ≥ 0.6**: ✅ Solid - Good precision for top result |
|
|
|
|
|
## Limitations |
|
|
- **Vietnamese-specific**: Optimized for Vietnamese mathematical terminology |
|
|
- **Domain-specific**: Best performance on educational math content |
|
|
- **Sequence length**: Limited to 256 tokens |
|
|
- **E5 format required**: Must use "query:" and "passage:" prefixes |
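Because a missing prefix silently degrades retrieval quality, a small guard can enforce it before encoding (a sketch; `with_prefix` is a hypothetical helper, not part of the library):

```python
# Ensure the E5 "query:"/"passage:" prefix is present exactly once.

def with_prefix(text, kind):
    prefix = f"{kind}: "
    return text if text.startswith(prefix) else prefix + text

q = with_prefix("Cách giải phương trình bậc hai", "query")
p = with_prefix("passage: Giải phương trình bậc hai bằng công thức nghiệm",
                "passage")
print(q)  # query: Cách giải phương trình bậc hai
```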
|
|
|
|
|
## Citation |
|
|
```bibtex |
|
|
@misc{e5-math-vietnamese-binary,
|
|
title={E5-Math-Vietnamese-Binary: Hard Negatives Fine-tuning for Mathematical Retrieval}, |
|
|
author={ThanhLe0125}, |
|
|
year={2025}, |
|
|
publisher={Hugging Face}, |
|
|
url={https://huggingface.co/ThanhLe0125/ebd-math} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
*Trained on July 01, 2025 using hard negatives and loss-based early stopping for optimal retrieval performance.* |
|
|
|