---
license: mit
datasets:
- codesignal/sms-spam-collection
language:
- en
library_name: transformers
pipeline_tag: text-classification
---

## **Model Overview**

This model is a fine-tuned version of BERT designed to classify SMS messages as either spam or not spam. It was developed as part of a technical test for the startup **IntiGo**.

### **Model Details**

- **Model Name:** BERT Fine-Tuned for SMS Spam Classification
- **Library:** [Transformers](https://huggingface.co/transformers/)
- **Language:** English
- **Pipeline Tag:** `text-classification`

### **License**

This model is released under the [MIT License](https://opensource.org/licenses/MIT).

## **Datasets**

- **Training Dataset:** [codesignal/sms-spam-collection](https://huggingface.co/datasets/codesignal/sms-spam-collection)

## **Fine-Tuning Procedure**

This model was fine-tuned on the SMS Spam Collection dataset, which consists of SMS messages labeled as either "spam" or "ham" (not spam).

### **Metrics**

- **Precision:** 0.99
- **Recall:** 0.81
- **F1 Score:** 0.96

These metrics were computed on the validation set. Precision is high, so messages flagged as spam are almost always spam. Recall is lower, so some spam messages are still classified as ham.

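As a quick arithmetic check, F1 is the harmonic mean of precision and recall. The sketch below assumes the reported precision and recall refer to the same class and the same averaging scheme, which the card does not state. Under that assumption the implied F1 is lower than the reported 0.96, so the reported figures may have been computed with different averaging (for example, per-class versus weighted).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the Metrics list of this card.
reported_precision = 0.99
reported_recall = 0.81

implied_f1 = f1_score(reported_precision, reported_recall)
print(f"F1 implied by precision and recall: {implied_f1:.2f}")
```
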
### **Usage**

Use this model to classify SMS messages as spam or ham. Raw text is tokenized and passed to the model, which returns one logit per class; the class with the highest logit is the prediction.

#### Example:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the model and tokenizer
model_name = "Amenallah2001/intigo-technical-test"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example input
text = "Congratulations! You've won a free ticket to Bahamas. Call now!"

# Tokenize (truncating to the model's maximum length) and classify
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = logits.argmax(dim=-1).item()

# Output prediction
label_map = {0: "ham", 1: "spam"}
print(f"Prediction: {label_map[predicted_class]}")
```

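The example above returns only the most likely label. When a confidence score is useful, the logits can be converted to probabilities with a softmax. The following is a minimal, dependency-free sketch: the `classify` helper, the `spam_threshold` parameter, and the example logits are hypothetical, and it assumes index 1 is the spam class, as in the `label_map` above.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, spam_threshold=0.5):
    """Return (label, spam_probability), assuming index 1 is the spam class."""
    spam_prob = softmax(logits)[1]
    label = "spam" if spam_prob >= spam_threshold else "ham"
    return label, spam_prob


# Hypothetical logits for one message, e.g. from outputs.logits[0].tolist()
example_logits = [-1.2, 2.3]
label, prob = classify(example_logits)
print(label, round(prob, 3))
```

Raising `spam_threshold` flags fewer messages as spam, which trades recall for precision; lowering it does the opposite.
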
### **Intended Use**

This model is intended for detecting spam in SMS messages. It can be integrated into systems that require spam detection, such as messaging apps or SMS gateways.

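One way such an integration could look is a small filter that routes each message according to a label function. This is only a sketch: `filter_messages` and `keyword_stub` are hypothetical names, and the stub stands in for a function that wraps the fine-tuned model and returns `"spam"` or `"ham"`.

```python
from typing import Callable, Iterable, List, Tuple


def filter_messages(
    messages: Iterable[str],
    classify_text: Callable[[str], str],
) -> Tuple[List[str], List[str]]:
    """Split messages into (delivered, quarantined) using a label function."""
    delivered, quarantined = [], []
    for message in messages:
        if classify_text(message) == "spam":
            quarantined.append(message)
        else:
            delivered.append(message)
    return delivered, quarantined


def keyword_stub(text: str) -> str:
    """Stand-in classifier for demonstration only; not the fine-tuned model."""
    return "spam" if "free ticket" in text.lower() else "ham"


inbox, spam_folder = filter_messages(
    ["See you at 6pm?", "You've won a free ticket! Call now!"],
    keyword_stub,
)
print(inbox)
print(spam_folder)
```

Passing the classifier in as a callable keeps the routing logic independent of how the model is loaded or served.
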
### **Limitations**

- **Data Imbalance:** The training dataset is imbalanced, so performance may differ in real-world settings where the proportion of spam to non-spam messages is different.
- **Language Support:** This model was fine-tuned on English text only and may not perform well on SMS messages in other languages.

### **Ethical Considerations**

SMS messages are user-generated content and may contain personal information. When deploying this model, handle message data with privacy in mind and ensure that the deployment complies with the regulations that apply to it.