This model was fine-tuned on [UtkMl's Twitter Spam Detection dataset](https://www.kaggle.com/c/twitter-spam/overview) with [`microsoft/deberta-v3-large`](https://huggingface.co/microsoft/deberta-v3-large) serving as the base model.
## How to use the model

Here is some example code to get you started on using the model to classify spam Tweets:

```python
import torch
from datasets import Dataset
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def classify_texts(df, text_col, model_path="cja5553/deberta-Twitter-spam-classification", batch_size=24):
    '''
    Classifies texts as either "Quality" or "Spam" using a pre-trained sequence classification model.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing the texts to classify.
    text_col : str
        Name of the column that contains the text data to be classified.
    model_path : str, default="cja5553/deberta-Twitter-spam-classification"
        Path to the pre-trained model for sequence classification.
    batch_size : int, optional, default=24
        Batch size for loading and processing data in batches. Adjust based on available GPU memory.

    Returns
    -------
    pandas.DataFrame
        The original DataFrame with an additional column `spam_prediction`, containing the
        predicted labels ("Quality" or "Spam") for each text.
    '''
    # Use the GPU when available, otherwise fall back to CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
    model.eval()  # Set model to evaluation mode

    # Prepare the text data for classification
    df["text"] = df[text_col].astype(str)  # Ensure text is in string format

    # Convert the data to a Hugging Face Dataset and tokenize
    text_dataset = Dataset.from_pandas(df)

    def tokenize_function(example):
        return tokenizer(
            example["text"],
            padding="max_length",
            truncation=True,
            max_length=512,
        )

    text_dataset = text_dataset.map(tokenize_function, batched=True)
    text_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

    # DataLoader for the text data
    text_loader = DataLoader(text_dataset, batch_size=batch_size)

    # Make predictions
    predictions = []
    with torch.no_grad():
        for batch in tqdm(text_loader):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1).cpu().numpy()  # Get predicted labels
            predictions.extend(preds)

    # Map predictions to labels
    id2label = {0: "Quality", 1: "Spam"}
    predicted_labels = [id2label[int(pred)] for pred in predictions]

    # Add predictions to the original DataFrame
    df["spam_prediction"] = predicted_labels
    return df


spam_df_classification = classify_texts(df, "text_col")
print(spam_df_classification)
```
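The final mapping step in the code above (argmax over the logits, then an `id2label` lookup) can be sketched on made-up scores, with no model download or GPU needed. The dummy logit values below are illustrative only; the label ordering (0 = "Quality", 1 = "Spam") follows the snippet above:

```python
# Plain-Python sketch of the label-mapping step, using made-up logits
# in place of real model output.
id2label = {0: "Quality", 1: "Spam"}

# One row of scores per text: [quality_score, spam_score]
dummy_logits = [
    [2.1, -1.3],   # quality outscores spam -> "Quality"
    [-0.4, 3.0],   # spam outscores quality -> "Spam"
]

# Equivalent of torch.argmax(logits, dim=-1) for plain lists
preds = [max(range(len(row)), key=lambda i: row[i]) for row in dummy_logits]
labels = [id2label[p] for p in preds]
print(labels)  # ['Quality', 'Spam']
```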
## Metrics

Based on an 80-10-10 train-val-test split, the following results were obtained on the test set:

- Recall: 0.9779
- F1-Score: 0.9779

## Questions?
Contact me at alba@wustl.edu