kenbaker-gif committed on
Commit 7b939d4 · verified · 1 Parent(s): 244f7cb

Update README.md

  1. README.md +159 -18
README.md CHANGED
@@ -1,6 +1,23 @@
1
  ---
2
  library_name: transformers
3
- tags: []
4
  ---
5
 
6
  # Model Card for Model ID
@@ -17,51 +34,117 @@ tags: []
17
 
18
  This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
19
 
20
- - **Developed by:** [Ainebyona Abubaker]
21
- - **Funded by [optional]:** [More Information Needed]
22
- - **Shared by [optional]:** [More Information Needed]
23
- - **Model type:** [More Information Needed]
24
- - **Language(s) (NLP):** [More Information Needed]
25
- - **License:** [More Information Needed]
26
- - **Finetuned from model [optional]:** [More Information Needed]
27
 
28
  ### Model Sources [optional]
29
 
30
  <!-- Provide the basic links for the model. -->
31
 
32
- - **Repository:** [More Information Needed]
33
- - **Paper [optional]:** [More Information Needed]
34
- - **Demo [optional]:** [More Information Needed]
35
 
36
  ## Uses
37
 
38
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
 
40
  ### Direct Use
41
 
42
  <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
 
44
- [More Information Needed]
45
 
46
- ### Downstream Use [optional]
47
 
48
  <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
 
50
- [More Information Needed]
51
 
52
  ### Out-of-Scope Use
53
 
54
  <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
 
56
- [More Information Needed]
57
 
58
  ## Bias, Risks, and Limitations
59
 
60
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
 
62
  [More Information Needed]
63
 
64
  ### Recommendations
65
 
66
  <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
 
@@ -71,18 +154,76 @@ Users (both direct and downstream) should be made aware of the risks, biases and
71
 
72
  Use the code below to get started with the model.
73
 
74
- [More Information Needed]
75
 
76
  ## Training Details
77
 
78
  ### Training Data
 
79
 
80
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
 
82
- [More Information Needed]
83
 
84
  ### Training Procedure
85
 
 
86
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
 
88
  #### Preprocessing [optional]
 
1
  ---
2
  library_name: transformers
3
+ tags:
4
+ - text-generation-inference
5
+ - spam-detection
6
+ - nlp
7
+ - binary-classification
8
+ license: apache-2.0
9
+ datasets:
10
+ - bvk/SMS-spam
11
+ language:
12
+ - en
13
+ metrics:
14
+ - accuracy
15
+ - f1
16
+ - precision
17
+ - recall
18
+ base_model:
19
+ - distilbert/distilbert-base-uncased
20
+ pipeline_tag: text-classification
21
  ---
22
 
23
  # Model Card for Model ID
 
34
 
35
  This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
36
 
37
+ - **Developed by:** Ainebyona Abubaker
38
+ - **Funded by:** This model was developed independently by Ainebyona Abubaker with no external funding.
39
+ - **Shared by:** Ainebyona Abubaker
40
+ - **Model type:** DistilBERT
41
+ - **Language(s) (NLP):** English
42
+ - **License:** Apache 2.0 License
43
+ - **Finetuned from model:** distilbert-base-uncased
44
 
45
  ### Model Sources [optional]
46
 
47
  <!-- Provide the basic links for the model. -->
48
 
49
+ - **Repository:** https://huggingface.co/kenbaker-gif/Email_Spam_Classifier
 
 
50
 
51
  ## Uses
52
 
53
+ This model can be used for:
54
+
55
+ - Detecting spam messages in SMS or short text messages
56
+
57
+ - Educational purposes in NLP and machine learning
58
+
59
+ - Research and development of spam detection systems
60
+
61
  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
62
 
63
  ### Direct Use
64
 
65
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
66
+
67
+ # Load the model and tokenizer
68
+ model_name = "kenbaker-gif/Email_Spam_Classifier"
69
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
70
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
71
+
72
+ # Create a text-classification pipeline
73
+ classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
74
+
75
+ # Example usage
76
+ result = classifier("Congratulations! You've won a $500 gift card.")
77
+ print(result)
78
+ # Output: [{'label': 'SPAM', 'score': 0.99}]
79
+
80
+
81
  <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
82
 
 
83
 
84
+
85
+ ### Downstream Use
86
+ - Email spam detection – fine-tune on email datasets for spam classification
87
+
88
+ - Chat moderation – detecting unwanted or spammy messages in chat apps
89
+
90
+ - SMS analytics – analyzing messaging patterns for marketing or user studies
91
+
92
+ - Text classification pipelines – can be incorporated into larger NLP workflows
93
 
94
  <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
95
 
96
+
97
 
98
  ### Out-of-Scope Use
99
+ - Not recommended for high-stakes decisions (legal, financial, or medical) without further validation
100
+
101
+ - Performance on languages other than English is not guaranteed
102
+
103
+ - Not tested on long-form text or other messaging platforms (email, social media)
104
 
105
  <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
106
 
107
+
108
 
109
  ## Bias, Risks, and Limitations
110
+ Biases:
111
+
112
+ - The model is trained on English SMS messages, so it may underperform on messages in other languages or dialects.
113
+
114
+ - It may be biased toward patterns in the training data, such as certain spam phrases or formatting, which can lead to false positives or false negatives.
115
+
116
+ - Minority or unusual types of spam may not be well recognized.
117
+
118
+ Risks:
119
+
120
+ - Misclassifying messages could lead to important messages being ignored or spam being delivered.
121
+
122
+ - Using the model in high-stakes applications (legal, financial, medical) without proper validation could have serious consequences.
123
+
124
+ Limitations:
125
+
126
+ - Only trained for binary classification: HAM (not spam) vs SPAM.
127
+
128
+ - Performance may degrade on longer texts, emails, or social media messages.
129
+
130
+ - The model may need fine-tuning for datasets outside SMS messages to maintain accuracy.
131
 
132
  <!-- This section is meant to convey both technical and sociotechnical limitations. -->
133
 
134
  [More Information Needed]
135
 
136
  ### Recommendations
137
+ - This model is recommended for detecting spam in short English text messages (SMS).
138
+
139
+ - Suitable for educational, research, and prototype applications in NLP and text classification.
140
+
141
+ - Not recommended for high-stakes environments (legal, financial, or medical) without further testing and validation.
142
+
143
+ - Users are encouraged to fine-tune the model if applying it to new datasets, different languages, or longer text formats.
144
+
145
+ - Always review model predictions before acting on them, especially in critical applications.
146
+
147
+ 💡 Tip:
148
 
149
  <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
150
 
 
154
 
155
  Use the code below to get started with the model.
156
 
157
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
158
+
159
+ # Load model and tokenizer
160
+ model_name = "kenbaker-gif/Email_Spam_Classifier"
161
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
162
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
163
+
164
+ # Create pipeline
165
+ classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
166
+
167
+ # Example usage
168
+ result = classifier("Congratulations! You've won a $500 Amazon gift card.")
169
+ print(result)
170
+ # Output: [{'label': 'SPAM', 'score': 0.99}]
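
The pipeline returns a list of dictionaries with a `label` and a `score`, as in the example output shown in this card. A minimal sketch of acting on that output only when the model is confident, assuming the labels are the strings `SPAM` and `HAM` (the model's config may emit different names such as `LABEL_1`); the `is_spam` helper and the 0.9 threshold are hypothetical and not part of the commit:

```python
def is_spam(prediction, threshold=0.9):
    """Return True only for a SPAM label whose score meets the threshold."""
    return prediction["label"].upper() == "SPAM" and prediction["score"] >= threshold


# Hypothetical predictions in the same shape the pipeline returns.
results = [
    {"label": "SPAM", "score": 0.99},
    {"label": "SPAM", "score": 0.62},  # low confidence: left for manual review
    {"label": "HAM", "score": 0.97},
]

flags = [is_spam(r) for r in results]
print(flags)
```

A higher threshold lets more spam through but lowers the chance that a legitimate message is dropped, which is the riskier of the two failure modes listed under Risks.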
171
+
172
 
173
  ## Training Details
174
+ - Base Model: distilbert-base-uncased (DistilBERT)
175
+
176
+ - Task: Binary SMS spam classification (HAM / SPAM)
177
+
178
+ - Dataset: SMS Spam Collection (80% train, 20% eval)
179
+
180
+ - Preprocessing: Tokenized with padding & truncation
181
+
182
+ - Training: 3 epochs, batch size 16, learning rate 2e-5, AdamW optimizer
183
+
184
+ - Metrics: Accuracy, Weighted F1-score
185
+
186
+ - Trained for short English SMS messages; fine-tuning may be needed for other text types or languages.
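
As a rough worked example of what these hyperparameters imply, the number of optimizer steps can be estimated from the figures in this card. This is a sketch that assumes exactly 5,500 messages (the card says "~5,500"), a single device, and no gradient accumulation; the variable names are illustrative:

```python
import math

# Values stated in the card; 5,500 stands in for the approximate dataset size.
NUM_MESSAGES = 5_500
TRAIN_PERCENT = 80
BATCH_SIZE = 16
EPOCHS = 3

# Integer arithmetic avoids floating-point rounding in the split size.
train_examples = NUM_MESSAGES * TRAIN_PERCENT // 100
eval_examples = NUM_MESSAGES - train_examples

# The last batch of an epoch may be partial, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(train_examples, eval_examples, steps_per_epoch, total_steps)
```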
187
 
188
  ### Training Data
189
+ - Primary Dataset: SMS Spam Collection Dataset
190
 
191
+ - Content: English SMS messages labeled as HAM (not spam) or SPAM
192
 
193
+ - Size: ~5,500 messages
194
+
195
+ - Preprocessing: Text tokenized with padding and truncation; labels mapped to 0 (HAM) and 1 (SPAM)
196
+
197
+ - Additional Datasets: Optional; can be combined with other SMS/spam datasets to improve generalization
198
+
199
+ - The model is optimized for short English SMS messages; performance on other text types or languages may vary.
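
A minimal sketch of the label mapping and the 80/20 split described in this card, using only the standard library. The sample rows, the `LABEL2ID` name, the `encode_labels` and `split_dataset` helpers, and the seed are illustrative assumptions; the actual training script is not part of this commit:

```python
import random

# Label mapping stated in the card: HAM -> 0, SPAM -> 1.
LABEL2ID = {"HAM": 0, "SPAM": 1}


def encode_labels(rows):
    """Turn (label, text) pairs into dicts with an integer label."""
    return [{"text": text, "label": LABEL2ID[label.upper()]} for label, text in rows]


def split_dataset(examples, eval_fraction=0.2, seed=42):
    """Shuffle deterministically and split into (train, eval) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Hypothetical sample rows, not taken from the real dataset.
rows = [("ham", f"see you at {i}") for i in range(8)]
rows += [("spam", "WIN a prize now"), ("spam", "Free entry, text YES")]

train, evaluation = split_dataset(encode_labels(rows))
print(len(train), len(evaluation))
```

Because spam is the minority class, a stratified split may be preferable in practice so that both parts keep a similar HAM/SPAM ratio; the card does not say which was used.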
200
+
201
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
202
 
203
  ### Training Procedure
204
+ 1. Data Preparation:
205
+ - Loaded the SMS Spam Collection dataset
206
+ - Tokenized messages using AutoTokenizer with padding and truncation
207
+ - Split dataset: 80% train, 20% evaluation
208
+
209
+ 2. Model Setup:
210
+ - Base model: distilbert-base-uncased
211
+ - Task: Binary classification (HAM vs SPAM)
212
+
213
+ 3. Training:
214
+ - Optimizer: AdamW
215
+ - Learning rate: 2e-5
216
+ - Batch size: 16 (train & eval)
217
+
218
+ 4. Number of epochs: 3
219
+
220
+ 5. Evaluation and checkpointing performed at each epoch.
221
+
222
+ 6. Metrics Monitored:
223
+ - Accuracy
224
+ - Weighted F1-score
225
 
226
+ Training focused on short English SMS messages; additional fine-tuning may be needed for other datasets or text types.
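
The two monitored metrics can be illustrated without any dependencies. This is a sketch of accuracy and support-weighted F1 computed from scratch; it is not the evaluation code used for this model, and the example labels are hypothetical:

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)


# Hypothetical labels: 0 = HAM, 1 = SPAM.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

With an imbalanced dataset, weighted F1 is dominated by the majority HAM class, so reporting precision and recall for the SPAM class alongside it gives a clearer picture.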
227
  <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
228
 
229
  #### Preprocessing [optional]