---
license: mit
datasets:
- mteb/sts12-sts
metrics:
- accuracy
base_model:
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
library_name: transformers
---

# Model Description

This model is a fine-tuned version of sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 for sentence similarity. It was trained on the mteb/stsbenchmark-sts dataset to predict a similarity score for a pair of sentences.

- Model Type: Sequence Classification (Regression)
- Pre-trained Model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- Fine-Tuning Dataset: mteb/stsbenchmark-sts
- Task: Sentence similarity (regression)

# Training Details

- Training Objective: To predict the similarity score between pairs of sentences.
- Training Data: mteb/stsbenchmark-sts, which contains sentence pairs with similarity scores.
- Number of Labels: 1 (regression)
- Epochs: 2
- Batch Size: 8
- Learning Rate: 2e-5
- Weight Decay: 0.01

# Evaluation

The model was evaluated using Pearson correlation on the validation set of the mteb/stsbenchmark-sts dataset. The correlation measures how closely the predicted scores track the annotated similarity scores of the sentence pairs.

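
As a rough illustration of the metric, the sketch below computes Pearson correlation in plain Python. The function name `pearson_correlation` and the sample scores are made up for this example; they are not this model's evaluation code or results.

```python
import math


def pearson_correlation(predictions, labels):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(predictions) != len(labels) or len(predictions) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_p = sum(predictions) / len(predictions)
    mean_l = sum(labels) / len(labels)
    # Covariance and variances (the common 1/n factor cancels in the ratio)
    cov = sum((p - mean_p) * (l - mean_l) for p, l in zip(predictions, labels))
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    var_l = sum((l - mean_l) ** 2 for l in labels)
    if var_p == 0 or var_l == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_l)


# Hypothetical predicted scores and gold labels, for illustration only
predicted = [0.9, 0.2, 0.6, 0.5]
gold = [4.5, 1.0, 3.0, 2.0]
print(f"Pearson correlation: {pearson_correlation(predicted, gold):.3f}")
```

Pearson correlation is unchanged by linear rescaling of either input, so predictions and labels do not need to share the same numeric range for this comparison.
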
# Usage

To score the similarity of two sentences, load the model and tokenizer, encode the pair, and run inference:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and its tokenizer
model = AutoModelForSequenceClassification.from_pretrained("./paraphraser_model")
tokenizer = AutoTokenizer.from_pretrained("./paraphraser_model")

sentences = ["The quick brown fox jumps over the lazy dog.", "A fast dark-colored fox leaps over a sleeping dog."]

# Encode the two sentences together as a single pair
encoded_input = tokenizer(sentences[0], sentences[1], return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Compute the similarity score (inference only, so gradients are disabled)
with torch.no_grad():
    model_output = model(**encoded_input)
    logits = model_output.logits
    similarity_score = torch.sigmoid(logits).item()

print(f"Similarity score between the two sentences: {similarity_score}")
```

# Mean Pooling Function

If you use the model to generate sentence embeddings, you can apply the following mean pooling function. It expects the output of the base encoder (for example, the checkpoint loaded with `AutoModel`), whose first element holds the token embeddings; the sequence classification model used above returns logits in that position instead.

```python
import torch

def mean_pooling(model_output, attention_mask):
    # First element of model_output contains the token embeddings
    token_embeddings = model_output[0]
    # Shape (batch, seq_len, 1), so the mask broadcasts over the hidden dimension
    input_mask_expanded = attention_mask.unsqueeze(-1).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
    # Number of real tokens per sentence, clamped to avoid division by zero
    sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
    return sum_embeddings / sum_mask
```

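
To make the arithmetic concrete, here is a small dependency-free sketch of the same masked average for a single sentence, using plain Python lists instead of tensors. The helper name `masked_mean` and the toy numbers are illustrative assumptions, not part of the model's code.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the embedding vectors whose mask entry is 1, ignoring padding."""
    hidden_size = len(token_embeddings[0])
    sums = [0.0] * hidden_size
    for vector, keep in zip(token_embeddings, attention_mask):
        for i, value in enumerate(vector):
            sums[i] += value * keep
    # Same guard as the clamp in mean_pooling: avoid dividing by zero
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in sums]


# Three token positions with 2-dimensional embeddings; the last one is padding
embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean(embeddings, mask))  # [2.0, 3.0]
```

The padding position contributes nothing to the result, which is why the mask matters when inputs are padded to a fixed length.
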
# Limitations

- Domain Specificity: The model is fine-tuned on the mteb/stsbenchmark-sts dataset and may perform differently on other types of text or datasets.
- Biases: As with any model trained on human language data, it may inherit and reflect biases present in the training data.

# Future Work

Potential improvements include fine-tuning on additional datasets, experimenting with different architectures or hyperparameters, and incorporating additional training techniques to improve performance and robustness.

# Citation

If you use this model in your research, please cite it as follows:

```bibtex
@inproceedings{your_paper,
  title={Fine-Tuned Paraphrase-Multilingual-MiniLM-L12-v2 for Sentence Similarity},
  author={Your Name},
  year={2024},
  publisher={Your Institution}
}
```

# License

This model is licensed under the MIT License. See the LICENSE file for more information.