# Paraphrase Detection Pipeline using Transformers

This repository provides a complete pipeline for fine-tuning a transformer model for **Paraphrase Detection** on the PAWS dataset.

---

## Steps

### 1. Load Dataset

Load the PAWS dataset, which contains sentence pairs labeled as paraphrases or non-paraphrases.

```python
from datasets import load_dataset

dataset = load_dataset("paws", "labeled_final")
```

### 2. Preprocess and Tokenize

Tokenize each sentence pair, truncating and padding to a fixed length of 128 tokens.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")

def preprocess_function(examples):
    # Encode both sentences together as a single pair input
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )

tokenized_datasets = dataset.map(preprocess_function, batched=True)
```

### 3. Load Model

Load the pre-trained checkpoint with a two-label sequence classification head. The checkpoint ships without a classification head, so the head is newly initialized and is trained during fine-tuning.

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "sentence-transformers/paraphrase-MiniLM-L6-v2",
    num_labels=2,
)
```

### 4. Fine-tune the Model

Set up the training arguments and fine-tune the model with the Trainer API.

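The run is scored by accuracy, which amounts to taking the argmax over the two class logits and comparing it with the labels. The sketch here shows that computation in isolation with NumPy, as an illustration only: the logits and labels are made up, and `accuracy_from_logits` is a hypothetical helper that is not part of the pipeline.

```python
import numpy as np


def accuracy_from_logits(logits: np.ndarray, labels: np.ndarray) -> float:
    """Hypothetical helper: fraction of rows whose argmax matches the label."""
    predictions = logits.argmax(axis=-1)  # higher-scoring class for each pair
    return float((predictions == labels).mean())


# Made-up logits for four sentence pairs; columns are [not paraphrase, paraphrase]
logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.3], [-0.5, 0.2]])
labels = np.array([0, 1, 1, 1])

print(accuracy_from_logits(logits, labels))  # 3 of 4 predictions match -> 0.75
```

The Trainer setup itself follows.
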
```python
from transformers import TrainingArguments, Trainer
import evaluate

training_args = TrainingArguments(
    output_dir="./paraphrase-detector",
    evaluation_strategy="epoch",  # named `eval_strategy` in newer transformers releases
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    # Predicted class is the index of the larger logit
    predictions = logits.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("paraphrase-detector")
```

### 5. Evaluate

Evaluate the fine-tuned model on the validation split.

```python
eval_results = trainer.evaluate()
print(eval_results)
```

### 6. Inference

Load the saved model into a text-classification pipeline and classify a sentence pair.

```python
from transformers import pipeline

paraphrase_pipeline = pipeline(
    "text-classification",
    model="paraphrase-detector",
    tokenizer=tokenizer,
)

example = paraphrase_pipeline({
    "text": "How old are you?",
    "text_pair": "What is your age?",
})
print(example)
```

---

## Requirements

- `datasets`
- `transformers`
- `evaluate`

Install dependencies with:

```bash
pip install datasets transformers evaluate
```

The `Trainer` API also needs a PyTorch installation (`torch`), and recent `transformers` releases additionally require `accelerate` for it.

---

## Author

Your Name - your.email@example.com

---

## License

MIT License