# Paraphrase Detection Pipeline using Transformers

This repository provides a complete pipeline to fine-tune a transformer model for **Paraphrase Detection** using the PAWS dataset.

---

## Steps

### 1. Load Dataset
Load the PAWS dataset, which contains sentence pairs labeled as paraphrases (`1`) or non-paraphrases (`0`).

```python
from datasets import load_dataset

# "labeled_final" is the human-labeled PAWS configuration
dataset = load_dataset("paws", "labeled_final")
```
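
Before tokenizing, it can help to check how balanced the two classes are, since accuracy is the only metric used later in the pipeline. The following is a minimal sketch, assuming rows shaped like PAWS examples (`sentence1`, `sentence2`, `label`). The `rows` list and the `label_counts` helper are hypothetical stand-ins so the snippet runs without downloading the dataset.

```python
from collections import Counter

def label_counts(rows):
    """Count how many examples carry each label value."""
    return Counter(row["label"] for row in rows)

# Hypothetical rows mimicking the PAWS schema; these are not real dataset entries.
rows = [
    {"sentence1": "A flies to B.", "sentence2": "B flies to A.", "label": 0},
    {"sentence1": "He is tall.", "sentence2": "He is a tall man.", "label": 1},
    {"sentence1": "Cats chase dogs.", "sentence2": "Dogs chase cats.", "label": 0},
]

counts = label_counts(rows)
total = sum(counts.values())
for label, n in sorted(counts.items()):
    print(f"label {label}: {n} of {total}")
```

With the real data, the same helper can be applied to `dataset["train"]`, which iterates over dictionaries with these fields.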

### 2. Preprocess and Tokenize
Tokenize each sentence pair as a single input, truncating and padding every example to 128 tokens.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")

def preprocess_function(examples):
    # Encode both sentences together so the model sees them as one paired input
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )

tokenized_datasets = dataset.map(preprocess_function, batched=True)
```
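
To make the effect of `truncation`, `padding="max_length"`, and `max_length` concrete, the toy sketch below builds a BERT-style pair layout (`[CLS] A [SEP] B [SEP]`) by hand. It is an illustration only: it splits on whitespace instead of using the tokenizer's subword vocabulary, and `encode_pair` is a hypothetical helper, not the tokenizer's implementation.

```python
def encode_pair(a, b, max_length=16):
    """Toy BERT-style pair encoding: [CLS] a [SEP] b [SEP], truncated and padded."""
    if max_length < 3:
        raise ValueError("max_length must leave room for [CLS] and two [SEP] tokens")
    ta, tb = a.split(), b.split()
    budget = max_length - 3  # room left after the three special tokens
    # Trim the longer side one token at a time until the pair fits.
    while len(ta) + len(tb) > budget:
        if len(ta) > len(tb):
            ta.pop()
        else:
            tb.pop()
    tokens = ["[CLS]"] + ta + ["[SEP]"] + tb + ["[SEP]"]
    # Segment 0 covers [CLS] + a + [SEP]; segment 1 covers b + [SEP].
    token_type_ids = [0] * (len(ta) + 2) + [1] * (len(tb) + 1)
    attention_mask = [1] * len(tokens)
    pad = max_length - len(tokens)
    return {
        "tokens": tokens + ["[PAD]"] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,  # 0 marks padding
    }

padded = encode_pair("How old are you ?", "What is your age ?", max_length=16)
truncated = encode_pair("How old are you ?", "What is your age ?", max_length=10)
print(padded["tokens"])
print(truncated["tokens"])
```

Every encoded example has exactly `max_length` positions, which is why the pipeline above produces fixed-size inputs of 128 tokens.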

### 3. Load Model
Load the pre-trained encoder with a sequence classification head. The checkpoint is a sentence-embedding model, so the two-class head is newly initialized and is learned during fine-tuning.

```python
from transformers import AutoModelForSequenceClassification

# num_labels=2: paraphrase vs. not a paraphrase
model = AutoModelForSequenceClassification.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2", num_labels=2)
```

### 4. Fine-tune the Model
Set up the training arguments and fine-tune the model with the Trainer API.

```python
from transformers import TrainingArguments, Trainer
import evaluate

training_args = TrainingArguments(
    output_dir="./paraphrase-detector",
    evaluation_strategy="epoch",  # named `eval_strategy` in recent transformers releases
    save_strategy="epoch",        # must match the evaluation strategy for load_best_model_at_end
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_preds):
    # eval_preds unpacks into raw logits and the reference labels
    logits, labels = eval_preds
    predictions = logits.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,  # newer releases take `processing_class=tokenizer` instead
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("paraphrase-detector")
```
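
The `compute_metrics` function above reduces each row of logits to a class index with `argmax` and compares the result against the reference labels. The sketch below reproduces that logic in plain Python and also computes F1 for the paraphrase class, which is an addition and not part of the pipeline above. The logits are made-up values for illustration, and the helper names are hypothetical.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy_score(preds, labels):
    """Fraction of predictions that match the reference labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def f1_positive(preds, labels, positive=1):
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up logits: one row per example, one column per class.
logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 2.5]]
labels = [0, 1, 1, 1]

preds = argmax_rows(logits)
acc = accuracy_score(preds, labels)
f1 = f1_positive(preds, labels)
print(f"accuracy={acc:.2f} f1={f1:.2f}")  # accuracy=0.75 f1=0.80
```

If a metric other than accuracy is returned from `compute_metrics`, `metric_for_best_model` would need to name that metric for checkpoint selection to use it.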

### 5. Evaluate
Evaluate the fine-tuned model on the validation split.

```python
eval_results = trainer.evaluate()
print(eval_results)
```
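
`trainer.evaluate()` returns a plain dictionary whose metric keys carry an `eval_` prefix, so the value returned by `compute_metrics` appears as `eval_accuracy`. Here is a small sketch for pulling out the headline numbers; `summarize_eval` is a hypothetical helper, and the values in the example are placeholders, not results from this pipeline.

```python
def summarize_eval(results, prefix="eval_"):
    """Keep only prefixed metric keys and strip the prefix from their names."""
    return {
        key[len(prefix):]: value
        for key, value in results.items()
        if key.startswith(prefix)
    }

# Placeholder dictionary in the shape Trainer.evaluate() returns; values are invented.
placeholder_results = {"eval_loss": 0.5, "eval_accuracy": 0.75, "epoch": 3.0}

summary = summarize_eval(placeholder_results)
print(summary)
```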

### 6. Inference
Use the fine-tuned model to classify a new sentence pair.

```python
from transformers import pipeline

# Load the model saved by trainer.save_model() in step 4
paraphrase_pipeline = pipeline("text-classification", model="paraphrase-detector", tokenizer=tokenizer)

# Pass the pair as "text" and "text_pair" so both sentences are encoded together
example = paraphrase_pipeline({
    "text": "How old are you?",
    "text_pair": "What is your age?"
})

print(example)
```
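
Because no `id2label` mapping is set on the model, the pipeline reports generic names such as `LABEL_0` and `LABEL_1`. Below is a minimal sketch of mapping these back to readable verdicts, assuming the PAWS convention that label `1` means paraphrase. `interpret` and the sample prediction are hypothetical, and the score is a placeholder rather than real model output.

```python
# Assumed mapping: PAWS uses 1 for paraphrase and 0 for non-paraphrase.
LABEL_NAMES = {"LABEL_0": "not a paraphrase", "LABEL_1": "paraphrase"}

def interpret(prediction):
    """Turn a pipeline-style {"label", "score"} dict into a readable verdict."""
    name = LABEL_NAMES.get(prediction["label"], prediction["label"])
    return f"{name} (score={prediction['score']:.2f})"

# Placeholder prediction in the shape the text-classification pipeline returns.
sample = {"label": "LABEL_1", "score": 0.91}
print(interpret(sample))  # paraphrase (score=0.91)
```

An alternative is to pass `id2label` and `label2id` to `from_pretrained` when loading the model in step 3, so that the readable names are stored in the saved configuration.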
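
---

## Requirements
- `datasets`
- `transformers`
- `evaluate`
- `torch` and `accelerate` (needed by the `Trainer` API)
- `scikit-learn` (needed by the `accuracy` metric in `evaluate`)

Install dependencies with:

```bash
pip install datasets transformers evaluate torch accelerate scikit-learn
```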

---

## Author
Your Name - your.email@example.com

---

## License
MIT License