jaesani/paraphrase_model


	---
	license: mit
	datasets:
	- mteb/sts12-sts
	metrics:
	- accuracy
	base_model:
	- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
	library_name: transformers
	---
	# Model Description
	This model is a fine-tuned version of sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 for sentence similarity. It was trained on the mteb/stsbenchmark-sts dataset to predict a similarity score for a pair of sentences.

	- Model Type: Sequence Classification (Regression)
	- Pre-trained Model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
	- Fine-Tuning Dataset: mteb/stsbenchmark-sts
	- Task: Sentence similarity (regression)

	# Training Details
	- Training Objective: To predict the similarity score between pairs of sentences.
	- Training Data: mteb/stsbenchmark-sts, which contains sentence pairs with similarity scores.
	- Number of Labels: 1 (regression)
	- Epochs: 2
	- Batch Size: 8
	- Learning Rate: 2e-5
	- Weight Decay: 0.01
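
The hyperparameters above can be collected in a small config object. This is a minimal sketch, not the author's training script: `FineTuneConfig`, `total_steps`, `to_training_arguments`, the `output_dir` default, and the example dataset size are hypothetical and introduced only for illustration. Only the hyperparameter values and the single regression label come from this card.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters listed in this model card (class name is illustrative)."""

    num_labels: int = 1  # a single output, i.e. a regression head
    epochs: int = 2
    batch_size: int = 8
    learning_rate: float = 2e-5
    weight_decay: float = 0.01

    def total_steps(self, num_train_examples: int) -> int:
        """Optimizer steps on one device, assuming no gradient accumulation."""
        return math.ceil(num_train_examples / self.batch_size) * self.epochs

    def to_training_arguments(self, output_dir: str = "./paraphraser_model"):
        """Map the config onto transformers.TrainingArguments (imported lazily)."""
        from transformers import TrainingArguments

        return TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=self.epochs,
            per_device_train_batch_size=self.batch_size,
            learning_rate=self.learning_rate,
            weight_decay=self.weight_decay,
        )


config = FineTuneConfig()
# 1000 is an arbitrary example size, not the size of the training set.
print(config.total_steps(num_train_examples=1000))  # 125 batches per epoch * 2 epochs = 250
```

The lazy import keeps the config usable without `transformers` installed; calling `to_training_arguments()` requires it.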

	# Evaluation
	The model was evaluated using Pearson correlation on the validation set of the mteb/stsbenchmark-sts dataset. Pearson correlation measures how closely the predicted similarity scores follow the gold scores in a linear sense: 1 is a perfect positive linear relationship and 0 is none.
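
For reference, Pearson correlation can be computed as in the sketch below. The function name `pearson_correlation` and the scores are made up for illustration and are not results from this model. Because the metric is invariant to linear rescaling, predictions on a 0 to 1 scale can be compared directly with gold scores on a different scale.

```python
import math


def pearson_correlation(predicted, gold):
    """Pearson r between two equal-length sequences of numbers."""
    if len(predicted) != len(gold) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / len(predicted)
    mean_g = sum(gold) / len(gold)
    # Covariance and variances, all left unnormalized (the 1/n factors cancel).
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(predicted, gold))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0 or var_g == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_g)


# Made-up scores for illustration only, not results from this model.
predicted_scores = [0.9, 0.2, 0.6, 0.4]
gold_scores = [4.5, 1.0, 3.0, 2.0]
print(round(pearson_correlation(predicted_scores, gold_scores), 3))
```

In this toy example the gold scores are an exact multiple of the predictions, so the correlation is 1 up to floating-point rounding.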

	# Usage
	To use this model for sentence similarity, follow these steps:

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	# Load the fine-tuned model and tokenizer from a local directory
	model = AutoModelForSequenceClassification.from_pretrained("./paraphraser_model")
	tokenizer = AutoTokenizer.from_pretrained("./paraphraser_model")
	model.eval()

	sentences = ["The quick brown fox jumps over the lazy dog.", "A fast dark-colored fox leaps over a sleeping dog."]
	# Encode the two sentences together as a single pair input
	encoded_input = tokenizer(sentences[0], sentences[1], return_tensors="pt", truncation=True, padding="max_length", max_length=128)

	# Compute the similarity score (inference only, so no gradients are needed)
	with torch.no_grad():
	    model_output = model(**encoded_input)

	logits = model_output.logits
	# Squash the single regression logit into the range (0, 1)
	similarity_score = torch.sigmoid(logits).item()

	print(f"Similarity score between the two sentences: {similarity_score}")
	```

	# Mean Pooling Function

	If using the model for generating sentence embeddings, you can use the following mean pooling function:

	```python
	import torch

	def mean_pooling(model_output, attention_mask):
	    # Expects output from an encoder whose first element holds the token embeddings
	    # (for example, a model loaded with AutoModel); a sequence classification model
	    # returns logits in that position instead.
	    token_embeddings = model_output[0]
	    # Expand the mask so padding tokens contribute zero to the sum
	    input_mask_expanded = attention_mask.unsqueeze(-1).float()
	    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, dim=1)
	    # Clamp the token count to avoid division by zero
	    sum_mask = torch.clamp(input_mask_expanded.sum(dim=1), min=1e-9)
	    return sum_embeddings / sum_mask
	```
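
To make the effect of the attention mask concrete, here is a dependency-free sketch of the same masked average on a single toy sequence. The helper `masked_mean` and the numbers are hypothetical, chosen for illustration; it mirrors the arithmetic of the mean pooling function without using tensors.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                sums[i] += value
    # Same guard against division by zero as the clamp in the tensor version.
    denominator = max(count, 1e-9)
    return [s / denominator for s in sums]


# Two real tokens followed by one padding token (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(masked_mean(tokens, mask))  # [2.0, 3.0]
```

The padding vector `[9.0, 9.0]` has no influence on the result, which is the reason for pooling with the mask instead of averaging over all positions.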

	# Limitations
	- Domain Specificity: The model is fine-tuned on the mteb/stsbenchmark-sts dataset and may perform differently on other types of text or datasets.
	- Biases: As with any model trained on human language data, it may inherit and reflect biases present in the training data.

	# Future Work
	Potential improvements include fine-tuning on additional datasets, experimenting with different architectures or hyperparameters, and incorporating additional training techniques to improve performance and robustness.

	# Citation
	If you use this model in your research, please cite it as follows:

	```bibtex
	@inproceedings{your_paper,
	  title={Fine-Tuned Paraphrase-Multilingual-MiniLM-L12-v2 for Sentence Similarity},
	  author={Your Name},
	  year={2024},
	  publisher={Your Institution}
	}
	```


	# License
	This model is licensed under the MIT License. See the LICENSE file for more information.