cja5553/deberta-Twitter-spam-classification

@@ -15,6 +15,87 @@ This model classifies Tweets from X (formerly known as Twitter) into 'Spam' (1)
 This was fine-tuned on the [UtkMl's Twitter Spam Detection dataset](https://www.kaggle.com/c/twitter-spam/overview) with [`microsoft/deberta-v3-large`](https://huggingface.co/microsoft/deberta-v3-large) serving as the base model.
 ## Metrics
 Based on an 80-10-10 train-val-test split, the following results were obtained on the test set:
@@ -23,5 +104,6 @@ Based on a 80-10-10 train-val-test split, the following results were obtained on
 - Recall: 0.9779
 - F1-Score: 0.9779
 ## Questions?
 Contact me at alba@wustl.edu

 This was fine-tuned on the [UtkMl's Twitter Spam Detection dataset](https://www.kaggle.com/c/twitter-spam/overview) with [`microsoft/deberta-v3-large`](https://huggingface.co/microsoft/deberta-v3-large) serving as the base model.
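+## How to use the model
+The following code is a starting point for classifying Tweets as 'Spam' or 'Quality' with this model. It relies on the `torch`, `transformers`, `datasets`, `pandas`, and `tqdm` packages.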
+```python
+import pandas as pd
+import torch
+from datasets import Dataset
+from torch.utils.data import DataLoader
+from tqdm.auto import tqdm
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+
+def classify_texts(df, text_col, model_path="cja5553/deberta-Twitter-spam-classification", batch_size=24):
+    '''
+    Classifies texts as either "Quality" or "Spam" using a pre-trained sequence classification model.
+    Parameters:
+    -----------
+    df : pandas.DataFrame
+        DataFrame containing the texts to classify.
+    text_col : str
+        Name of the column in `df` that contains the text data to be classified.
+    model_path : str, default="cja5553/deberta-Twitter-spam-classification"
+        Path to the pre-trained model for sequence classification.
+    batch_size : int, optional, default=24
+        Batch size for loading and processing data in batches. Adjust based on available GPU memory.
+    Returns:
+    --------
+    pandas.DataFrame
+        The original DataFrame with an additional column `spam_prediction`, containing the predicted labels ("Quality" or "Spam") for each text.
+    '''
+    # Use a GPU when one is available, otherwise fall back to the CPU
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    # Load the tokenizer and model
+    tokenizer = AutoTokenizer.from_pretrained(model_path)
+    model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
+    model.eval()  # Set model to evaluation mode
+    # Build a Hugging Face Dataset from the text column only, cast to strings,
+    # so that no existing column of `df` is overwritten
+    text_dataset = Dataset.from_dict({"text": df[text_col].astype(str).tolist()})
+
+    def tokenize_function(example):
+        return tokenizer(
+            example["text"],
+            padding="max_length",
+            truncation=True,
+            max_length=512
+        )
+
+    text_dataset = text_dataset.map(tokenize_function, batched=True)
+    text_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
+    # DataLoader for the text data
+    text_loader = DataLoader(text_dataset, batch_size=batch_size)
+    # Make predictions
+    predictions = []
+    with torch.no_grad():
+        for batch in tqdm(text_loader):
+            input_ids = batch['input_ids'].to(device)
+            attention_mask = batch['attention_mask'].to(device)
+            # Forward pass
+            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
+            logits = outputs.logits
+            preds = torch.argmax(logits, dim=-1).cpu().tolist()  # Get predicted label ids
+            predictions.extend(preds)
+    # Map predictions to labels
+    id2label = {0: "Quality", 1: "Spam"}
+    predicted_labels = [id2label[pred] for pred in predictions]
+    # Add predictions to the original DataFrame
+    df["spam_prediction"] = predicted_labels
+    return df
+
+
+# Example usage. The DataFrame and its "tweet" column are illustrative placeholders;
+# replace them with your own data and column name.
+df = pd.DataFrame({"tweet": ["Win a FREE phone now!!! Click here", "Heading to the lab, see you at noon"]})
+spam_df_classification = classify_texts(df, "tweet")
+print(spam_df_classification)
+```
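The function above keeps only the arg-max label for each text. When a confidence score or a custom decision threshold is useful, the two logits can be turned into probabilities with a softmax. The following is a minimal, dependency-free sketch: `softmax`, `label_with_threshold`, and the 0.5 default threshold are illustrative names and values that do not come from the model card, and the label order (0 = Quality, 1 = Spam) is taken from the `id2label` mapping above.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_threshold(logits, threshold=0.5):
    """Return (label, spam_probability) for one [quality_logit, spam_logit] pair.

    The threshold is an assumed, tunable value; with 0.5 the result matches
    the arg-max decision for a two-class model.
    """
    spam_prob = softmax(logits)[1]
    label = "Spam" if spam_prob > threshold else "Quality"
    return label, spam_prob


# Made-up logits for illustration only, not real model output
label, prob = label_with_threshold([-1.2, 2.3])
print(label, round(prob, 3))
```

Raising the threshold trades recall on spam for fewer false alarms on quality Tweets; a suitable value would need to be chosen on held-out data.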
 ## Metrics
 Based on an 80-10-10 train-val-test split, the following results were obtained on the test set:
 - Recall: 0.9779
 - F1-Score: 0.9779
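For readers who want to run a similar evaluation on their own labelled Tweets, here is a minimal sketch of an 80-10-10 split and of precision, recall, and F1 for the Spam class. The function names, the fixed seed, and the choice of Spam (1) as the positive class are assumptions; the card does not state how the split was drawn or which averaging produced the reported figures.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle items and return (train, val, test) lists in an 80-10-10 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded shuffle for reproducibility
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy indices and labels for illustration only, not the model's actual data
train, val, test = split_80_10_10(range(100))
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(len(train), len(val), len(test))
print(precision, recall, f1)
```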
 ## Questions?
 Contact me at alba@wustl.edu