|
|
--- |
|
|
language: |
|
|
- vi |
|
|
- en |
|
|
library_name: sentence-transformers |
|
|
pipeline_tag: sentence-similarity |
|
|
tags: |
|
|
- sentence-transformers |
|
|
- mathematics |
|
|
- vietnamese |
|
|
- binary-classification |
|
|
- hard-negatives |
|
|
- loss-based-early-stopping |
|
|
- e5-base |
|
|
- exact-retrieval |
|
|
base_model: intfloat/multilingual-e5-base |
|
|
metrics: |
|
|
- mean_reciprocal_rank |
|
|
- hit_rate |
|
|
- accuracy |
|
|
datasets: |
|
|
- custom-vietnamese-math |
|
|
--- |
|
|
|
|
|
# E5-Math-Vietnamese-Binary: Hard Negatives + Loss-based Early Stopping |
|
|
|
|
|
## Model Overview |
|
|
|
|
|
A fine-tuned multilingual E5-base model optimized for **exact chunk retrieval** over Vietnamese mathematics content, using:
|
|
- **🎯 Binary Classification**: Correct vs Incorrect (instead of 3-level hierarchy) |
|
|
- **💪 Hard Negatives**: Related chunks as hard negatives for better discrimination |
|
|
- **⏰ Loss-based Early Stopping**: Stops when validation loss stops improving |
|
|
- **📊 Comprehensive Evaluation**: Hit@K, Accuracy@1, MRR metrics |
|
|
|
|
|
## Performance Summary |
|
|
|
|
|
### Training Results |
|
|
- **Best Validation Loss**: N/A |
|
|
- **Training Epochs**: 10 |
|
|
- **Early Stopping**: ❌ Not triggered |
|
|
- **Training Time**: 4661.2 seconds (~1.3 hours)
|
|
|
|
|
### Test Performance 🌟 EXCELLENT |
|
|
The fine-tuned model consistently places the correct chunk at or near the top position: Hit@3 and Hit@5 dip slightly relative to the base model, in exchange for substantially stronger top-1 accuracy.
|
|
|
|
|
| Metric | Base E5 | Fine-tuned | Improvement | |
|
|
|--------|---------|------------|-------------| |
|
|
| **MRR** | 0.7740 | 0.8505 | +0.0765 | |
|
|
| **Accuracy@1** | 0.6129 | 0.7634 | +0.1505 | |
|
|
| **Hit@1** | 0.6129 | 0.7634 | +0.1505 | |
|
|
| **Hit@3** | 0.9462 | 0.9247 | -0.0215 | |
|
|
| **Hit@5** | 1.0000 | 0.9785 | -0.0215 | |
|
|
|
|
|
**Total Test Queries**: 93 |
|
|
|
|
|
## Key Innovations |
|
|
|
|
|
### 🎯 Binary Classification Approach |
|
|
Instead of the traditional 3-level relevance hierarchy (correct/related/irrelevant), this model uses a binary scheme:
|
|
- **Correct chunks**: Score 1.0 (positive examples) |
|
|
- **Incorrect chunks**: Score 0.0 (includes both related and irrelevant) |
|
|
- **Hard negatives**: Related chunks serve as challenging negative examples |
|
|
|
|
|
### 💪 Hard Negatives Strategy |
|
|
```python |
|
|
# Training strategy |
|
|
positive = correct_chunk # Score: 1.0 |
|
|
hard_negative = related_chunk # Score: 0.0 (but semantically close) |
|
|
easy_negative = irrelevant_chunk # Score: 0.0 (semantically distant) |
|
|
|
|
|
# This forces model to learn fine-grained distinctions |
|
|
``` |
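Concretely, the strategy above amounts to assembling (query, positive, hard-negative) triplets; in-batch positives from other queries additionally act as easy negatives under MultipleNegativesRankingLoss. A minimal sketch (the helper `build_triplets` is illustrative, not the actual training code):

```python
# Illustrative sketch: pair every correct chunk with every related chunk
# as an explicit hard negative. E5 prefixes are applied here as well.

def build_triplets(query, correct_chunks, related_chunks):
    """Return (anchor, positive, hard_negative) text triplets."""
    triplets = []
    for pos in correct_chunks:
        for hard_neg in related_chunks:
            triplets.append((
                f"query: {query}",        # anchor
                f"passage: {pos}",        # positive (score 1.0)
                f"passage: {hard_neg}",   # hard negative (score 0.0)
            ))
    return triplets

triplets = build_triplets(
    "Định nghĩa hàm số đồng biến là gì?",
    ["Hàm số đồng biến trên khoảng (a;b) là hàm số mà..."],
    ["Ví dụ về hàm số đồng biến: f(x) = 2x + 1..."],
)
print(len(triplets))  # 1: one positive × one hard negative
```

Each triplet can then be wrapped in a `sentence_transformers.InputExample` for training.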
|
|
|
|
|
### ⏰ Loss-based Early Stopping |
|
|
- Monitors **validation loss** instead of MRR |
|
|
- Stops when loss stops decreasing (patience=3) |
|
|
- Prevents overfitting and saves training time |
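The check above can be sketched as a small tracker (illustrative; the card does not include the actual training loop, and `LossEarlyStopper` is a hypothetical name):

```python
# Minimal patience-based early stopping on validation loss.

class LossEarlyStopper:
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = LossEarlyStopper(patience=3)
losses = [0.52, 0.41, 0.40, 0.43, 0.44, 0.45]  # illustrative values
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)  # 5: three epochs without improvement after epoch 2
```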
|
|
|
|
|
## Usage |
|
|
|
|
|
### Basic Usage |
|
|
```python |
|
|
from sentence_transformers import SentenceTransformer |
|
|
from sklearn.metrics.pairwise import cosine_similarity |
|
|
|
|
|
# Load model |
|
|
model = SentenceTransformer('ThanhLe0125/ebd-math') |
|
|
|
|
|
# ⚠️ CRITICAL: Must use E5 prefixes |
|
|
query = "query: Định nghĩa hàm số đồng biến là gì?" |
|
|
chunks = [ |
|
|
"passage: Hàm số đồng biến trên khoảng (a;b) là hàm số mà...", # Should rank #1 |
|
|
"passage: Ví dụ về hàm số đồng biến: f(x) = 2x + 1...", # Related (trained as hard negative) |
|
|
"passage: Phương trình bậc hai có dạng ax² + bx + c = 0" # Irrelevant |
|
|
] |
|
|
|
|
|
# Encode and rank |
|
|
query_emb = model.encode([query]) |
|
|
chunk_embs = model.encode(chunks) |
|
|
similarities = cosine_similarity(query_emb, chunk_embs)[0] |
|
|
|
|
|
# Get rankings |
|
|
ranked_indices = similarities.argsort()[::-1] |
|
|
for rank, idx in enumerate(ranked_indices, 1): |
|
|
print(f"Rank {rank}: Score {similarities[idx]:.4f} - {chunks[idx][:50]}...") |
|
|
``` |
|
|
|
|
|
### Advanced Usage with Multiple Queries |
|
|
```python |
|
|
def find_best_chunks(queries, chunk_pool, top_k=3): |
|
|
"""Find best chunks for multiple queries""" |
|
|
results = [] |
|
|
|
|
|
for query in queries: |
|
|
# Ensure E5 format |
|
|
formatted_query = f"query: {query}" if not query.startswith("query:") else query |
|
|
formatted_chunks = [f"passage: {chunk}" if not chunk.startswith("passage:") else chunk |
|
|
for chunk in chunk_pool] |
|
|
|
|
|
# Encode |
|
|
query_emb = model.encode([formatted_query]) |
|
|
chunk_embs = model.encode(formatted_chunks) |
|
|
similarities = cosine_similarity(query_emb, chunk_embs)[0] |
|
|
|
|
|
# Get top K |
|
|
top_indices = similarities.argsort()[::-1][:top_k] |
|
|
top_chunks = [ |
|
|
{ |
|
|
'chunk': chunk_pool[i], |
|
|
'similarity': float(similarities[i]),  # cast numpy scalar to plain float
|
|
'rank': rank + 1 |
|
|
} |
|
|
for rank, i in enumerate(top_indices) |
|
|
] |
|
|
|
|
|
results.append({ |
|
|
'query': query, |
|
|
'top_chunks': top_chunks |
|
|
}) |
|
|
|
|
|
return results |
|
|
|
|
|
# Example |
|
|
queries = [ |
|
|
"Công thức tính đạo hàm của hàm hợp", |
|
|
"Cách giải phương trình bậc hai", |
|
|
"Định nghĩa giới hạn của hàm số" |
|
|
] |
|
|
|
|
|
chunk_pool = [ |
|
|
"Đạo hàm của hàm hợp: (f(g(x)))' = f'(g(x)) × g'(x)", |
|
|
"Giải phương trình bậc hai bằng công thức nghiệm", |
|
|
"Giới hạn của hàm số tại một điểm", |
|
|
# ... more chunks |
|
|
] |
|
|
|
|
|
results = find_best_chunks(queries, chunk_pool, top_k=3) |
|
|
``` |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Dataset |
|
|
- **Domain**: Vietnamese mathematics education |
|
|
- **Split**: Train/Validation/Test with proper separation |
|
|
- **Hard Negatives**: Related mathematical concepts as challenging negatives |
|
|
- **Easy Negatives**: Unrelated mathematical concepts |
|
|
|
|
|
### Training Configuration |
|
|
```python |
|
|
# Training configuration
base_model = "intfloat/multilingual-e5-base"
train_batch_size = 4
learning_rate = 2e-5
max_epochs = 10
early_stopping_patience = 3
loss_function = "MultipleNegativesRankingLoss"
evaluation_metric = "validation_loss"
|
|
``` |
|
|
|
|
|
### Evaluation Methodology |
|
|
1. **Training**: Binary classification with hard negatives |
|
|
2. **Validation**: Loss-based monitoring for early stopping |
|
|
3. **Testing**: Comprehensive evaluation with restored 3-level hierarchy |
|
|
4. **Metrics**: Hit@K, Accuracy@1, MRR comparison vs base model |
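The reported metrics can be computed directly from the rank of the correct chunk for each query (rank 1 = top position); a minimal sketch with illustrative data:

```python
# MRR and Hit@K from per-query ranks of the correct chunk.

def mrr(ranks):
    """Mean reciprocal rank: average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hit_at_k(ranks, k):
    """Fraction of queries whose correct chunk ranks within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 2, 1, 4, 1]  # illustrative ranks for five queries
print(round(mrr(ranks), 3))  # 0.75  -> (1 + 0.5 + 1 + 0.25 + 1) / 5
print(hit_at_k(ranks, 1))    # 0.6   -> identical to Accuracy@1
print(hit_at_k(ranks, 3))    # 0.8
```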
|
|
|
|
|
## Model Architecture |
|
|
- **Base**: intfloat/multilingual-e5-base |
|
|
- **Max Sequence Length**: 256 tokens |
|
|
- **Output Dimension**: 768 |
|
|
- **Similarity**: Cosine similarity |
|
|
- **Training Loss**: MultipleNegativesRankingLoss |
|
|
|
|
|
## Use Cases |
|
|
- ✅ **Educational Q&A**: Find exact mathematical definitions and explanations |
|
|
- ✅ **Content Retrieval**: Precise chunk retrieval for Vietnamese math content |
|
|
- ✅ **Tutoring Systems**: Quick and accurate answer finding |
|
|
- ✅ **Knowledge Base Search**: Efficient mathematical concept lookup |
|
|
|
|
|
## Performance Interpretation |
|
|
- **Hit@1 ≥ 0.7**: 🌟 Excellent - Correct answer usually at #1 |
|
|
- **Hit@3 ≥ 0.8**: 🎯 Very Good - Correct answer in top 3 |
|
|
- **MRR ≥ 0.7**: 👍 Good - Low average rank for correct answers |
|
|
- **Accuracy@1 ≥ 0.6**: ✅ Solid - Good precision for top result |
|
|
|
|
|
## Limitations |
|
|
- **Vietnamese-specific**: Optimized for Vietnamese mathematical terminology |
|
|
- **Domain-specific**: Best performance on educational math content |
|
|
- **Sequence length**: Limited to 256 tokens |
|
|
- **E5 format required**: Must use "query:" and "passage:" prefixes |
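Because a missing prefix silently degrades retrieval quality, a small guard can enforce it before encoding (a sketch; `with_prefix` is a hypothetical helper, not part of the library):

```python
# Ensure the E5 "query:"/"passage:" prefix is present exactly once.

def with_prefix(text, kind):
    prefix = f"{kind}: "
    return text if text.startswith(prefix) else prefix + text

q = with_prefix("Cách giải phương trình bậc hai", "query")
p = with_prefix("passage: Giải phương trình bậc hai bằng công thức nghiệm",
                "passage")
print(q)  # query: Cách giải phương trình bậc hai
```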
|
|
|
|
|
## Citation |
|
|
```bibtex |
|
|
@misc{e5-math-vietnamese-binary,
|
|
title={E5-Math-Vietnamese-Binary: Hard Negatives Fine-tuning for Mathematical Retrieval}, |
|
|
author={ThanhLe0125}, |
|
|
year={2025}, |
|
|
publisher={Hugging Face}, |
|
|
url={https://huggingface.co/ThanhLe0125/ebd-math} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
*Trained on July 01, 2025 using hard negatives and loss-based early stopping for optimal retrieval performance.* |
|
|
|