---
license: mit
language:
- en
library_name: transformers
tags:
- spam detection
- Twitter
base_model: microsoft/deberta-v3-large
---
# Spam detection of Tweets
This model classifies Tweets from X (formerly known as Twitter) as either 'Spam' (1) or 'Quality' (0).

## Training Dataset

This was fine-tuned on the [UtkMl's Twitter Spam Detection dataset](https://www.kaggle.com/c/twitter-spam/overview) with [`microsoft/deberta-v3-large`](https://huggingface.co/microsoft/deberta-v3-large) serving as the base model.
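As a minimal illustration of preparing such a dataset for this model's label convention, the sketch below encodes string labels as integers. The column names (`Tweet`, `Type`) and example rows are hypothetical placeholders, not the actual Kaggle schema:

```python
import pandas as pd

# Hypothetical schema: a "Tweet" text column and a "Type" column holding
# "Quality"/"Spam" string labels (placeholder names, not the real dataset columns).
raw = pd.DataFrame({
    "Tweet": ["nice weather today", "WIN a FREE iPhone, click here!!!"],
    "Type": ["Quality", "Spam"],
})

# Encode labels to match the model's convention: Quality -> 0, Spam -> 1
label2id = {"Quality": 0, "Spam": 1}
raw["label"] = raw["Type"].map(label2id)
print(raw[["Tweet", "label"]])
```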

## How to use the model

Here is some sample code to get you started with using the model to classify Tweets.

```python
import torch
from datasets import Dataset
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def classify_texts(df, text_col, model_path="cja5553/deberta-Twitter-spam-classification", batch_size=24):
    '''
    Classifies texts as either "Quality" or "Spam" using a pre-trained sequence classification model.

    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing the texts to classify.

    text_col : str
        Name of the column that contains the text data to be classified.

    model_path : str, default="cja5553/deberta-Twitter-spam-classification"
        Path to the pre-trained model for sequence classification.

    batch_size : int, optional, default=24
        Batch size for loading and processing data in batches. Adjust based on available GPU memory.

    Returns:
    --------
    pandas.DataFrame
        The original DataFrame with an additional column `spam_prediction`, containing the predicted labels ("Quality" or "Spam") for each text.
    '''
    # Use a GPU if one is available, otherwise fall back to CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
    model.eval()  # Set model to evaluation mode

    # Prepare the text data for classification
    df["text"] = df[text_col].astype(str)  # Ensure text is in string format

    # Convert the data to a Hugging Face Dataset and tokenize
    text_dataset = Dataset.from_pandas(df)

    def tokenize_function(example):
        return tokenizer(
            example["text"],
            padding="max_length",
            truncation=True,
            max_length=512
        )

    text_dataset = text_dataset.map(tokenize_function, batched=True)
    text_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

    # DataLoader for the text data
    text_loader = DataLoader(text_dataset, batch_size=batch_size)

    # Make predictions
    predictions = []
    with torch.no_grad():
        for batch in tqdm(text_loader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1).cpu().numpy()  # Get predicted labels
            predictions.extend(preds)

    # Map predictions to labels
    id2label = {0: "Quality", 1: "Spam"}
    predicted_labels = [id2label[pred] for pred in predictions]

    # Add predictions to the original DataFrame
    df["spam_prediction"] = predicted_labels

    return df

# df is a pandas DataFrame with the Tweets to classify in a column named "text_col"
spam_df_classification = classify_texts(df, "text_col")
print(spam_df_classification)
```

## Metrics

Based on an 80-10-10 train-validation-test split, the following results were obtained on the test set:
- Accuracy: 0.9779
- Precision: 0.9781
- Recall: 0.9779
- F1-Score: 0.9779
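Metrics like these can be computed from gold labels and model predictions with scikit-learn. The sketch below is illustrative only: the labels are placeholders (not the actual test set), and the `weighted` averaging scheme is an assumption, not a statement of how the reported numbers were produced.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder labels for illustration only; not the actual test set.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
# average="weighted" is an assumed choice; it weights per-class scores by support
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, "
      f"Recall: {recall:.4f}, F1-Score: {f1:.4f}")
```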

## Code 

The code used to train this model is available on GitHub at [github.com/cja5553/Twitter_spam_detection](https://github.com/cja5553/Twitter_spam_detection).

## Questions?
Contact me at alba@wustl.edu