# 🧠 Text Similarity Model using Sentence-BERT

This project fine-tunes a Sentence-BERT model (`paraphrase-MiniLM-L6-v2`) on the **STS Benchmark** English dataset (`stsb_multi_mt`) to perform **semantic similarity scoring** between two text inputs.

---
## 🚀 Features

- 🔁 Fine-tunes `sentence-transformers/paraphrase-MiniLM-L6-v2`
- 🔧 Trained on the `stsb_multi_mt` dataset (English config)
- 🧪 Predicts cosine similarity between sentence pairs (0 to 1)
- ⚙️ Uses a custom PyTorch model and manual training loop
- 💾 Model is saved as `similarity_model.pt`
- 🧠 Supports inference on custom sentence pairs

---
## 📦 Dependencies

Install the required libraries:

```bash
pip install -q --upgrade transformers datasets sentence-transformers evaluate
```
## 📊 Dataset

- Dataset: `stsb_multi_mt` (config: `en`, split: `train`)
- Purpose: provides sentence pairs with gold similarity scores ranging from 0 to 5, which are normalized to 0–1 for training.
```python
from datasets import load_dataset

dataset = load_dataset("stsb_multi_mt", name="en", split="train")
dataset = dataset.shuffle(seed=42).select(range(10000))  # sample a subset for faster training
```
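The 0–5 gold scores must be rescaled to 0–1 before they can serve as MSE targets. A minimal sketch, assuming the dataset's score column is named `similarity_score` (as in `stsb_multi_mt`) and the normalized value is stored under a hypothetical `label` key:

```python
# Normalize STS-B gold scores from the 0-5 range down to 0-1,
# so they match the cosine-similarity output used as the prediction.
def normalize_score(example):
    example["label"] = example["similarity_score"] / 5.0
    return example

# Applied over the whole dataset with:
# dataset = dataset.map(normalize_score)
```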
## 🏗️ Model Architecture

### ✅ Base Model

- `sentence-transformers/paraphrase-MiniLM-L6-v2` (from Hugging Face)

### ✅ Fine-Tuning

- Cosine similarity computed between the CLS token embeddings of the two inputs
- Loss: Mean Squared Error (MSE) between the predicted similarity and the true score
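The objective above can be sketched with plain PyTorch. This is illustrative only: random tensors stand in for the CLS embeddings that the transformer would produce (MiniLM-L6's hidden size is 384), and `gold` stands in for the normalized 0–1 targets:

```python
import torch
import torch.nn.functional as F

# Stand-in CLS embeddings for a batch of 4 sentence pairs
# (in the real model these come from the encoder's [CLS] position).
emb_a = torch.randn(4, 384)
emb_b = torch.randn(4, 384)
gold = torch.rand(4)  # normalized similarity targets in [0, 1]

# Predicted similarity per pair, then MSE against the gold scores.
pred = F.cosine_similarity(emb_a, emb_b, dim=1)
loss = F.mse_loss(pred, gold)
```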
## 🧠 Training

- Epochs: 3
- Optimizer: Adam
- Loss: `MSELoss`
- Manual training loop in PyTorch
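The shape of the manual loop looks roughly as follows. This is a skeleton, not the project's actual script: a tiny `nn.Linear` stands in for the fine-tuned MiniLM encoder, random tensors stand in for tokenized batches, and the learning rate is an assumed typical fine-tuning value:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(16, 8)  # placeholder for the MiniLM encoder
optimizer = torch.optim.Adam(encoder.parameters(), lr=2e-5)
loss_fn = nn.MSELoss()

for epoch in range(3):  # 3 epochs, per the README
    # One stand-in batch of 32 sentence pairs with 0-1 targets.
    x1, x2 = torch.randn(32, 16), torch.randn(32, 16)
    target = torch.rand(32)

    optimizer.zero_grad()
    pred = F.cosine_similarity(encoder(x1), encoder(x2), dim=1)
    loss = loss_fn(pred, target)
    loss.backward()
    optimizer.step()

# Saved under the artifact name the README mentions.
torch.save(encoder.state_dict(), "similarity_model.pt")
```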
## 📁 Files and Structure

```
📦 text-similarity-project
 ┣ 📜 similarity_model.pt   # Trained PyTorch model weights
 ┣ 📜 training_script.py    # Full training and inference script
 ┗ 📜 README.md             # Documentation
```