---
language: en
tags:
- clickbait-detection
- text-classification
- sklearn
- random-forest
- tfidf
license: mit
datasets:
- clickbait-dataset
metrics:
- accuracy
- precision
- recall
- f1-score
---
# Clickbait Detector
This model is a machine learning classifier trained to detect clickbait headlines. It uses a Random Forest algorithm with TF-IDF vectorization to classify news headlines as either "clickbait" or "real".
## Model Details
### Model Description
- **Model type:** Random Forest Classifier
- **Task:** Text Classification (Clickbait Detection)
- **Input:** News headlines (text strings)
- **Output:** Binary classification ("clickbait" or "real")
- **Language(s) covered:** English
- **License:** MIT
### Model Sources
- **Repository:** [Devishetty100/clickbait-detector](https://huggingface.co/Devishetty100/clickbait-detector)
- **Paper or resources:** N/A
- **Demo:** N/A
## Uses
### Direct Use
This model classifies news headlines and flags potentially misleading or sensationalized ones. It can be integrated into content moderation systems, news aggregators, or educational tools to help users distinguish genuine news from clickbait.
### Downstream Use
- Content filtering and moderation
- Journalism education
- Social media analysis
- Research on media manipulation
### Out-of-Scope Use
This model should not be used for:
- Automated content removal without human oversight
- Making decisions that affect individuals' livelihoods or rights
- Classifying content in languages other than English
## Bias, Risks, and Limitations
### Recommendations
Users should be aware that:
- The model may have biases based on the training data
- Performance may vary across different domains or writing styles
- False positives/negatives can occur
- The model is trained on English text only
### Known Limitations
- Trained on a specific dataset which may not represent all types of clickbait or real news
- May not perform well on very short or very long headlines
- Does not consider context beyond the headline text itself
- Binary classification may not capture nuanced cases
## Training Details
### Training Data
The model was trained on the [Clickbait Dataset](https://www.kaggle.com/datasets/amananandrai/clickbait-dataset) from Kaggle, which contains news headlines labeled as clickbait or real.
- **Dataset size:** 32,000 samples (16,000 clickbait, 16,000 real)
- **Data preprocessing:** Text cleaning, then TF-IDF vectorization with English stop-word removal, capped at 5,000 features
- **Train/test split:** 80/20 stratified split (25,600 train, 6,400 test)
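
The split sizes listed above follow directly from an 80/20 stratified split of a balanced 32,000-headline set. The sketch below reproduces that arithmetic. It assumes scikit-learn's `train_test_split` was used, which the card does not state explicitly; the placeholder headlines and the reuse of `random_state=42` for the split are also assumptions.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real dataset: 16,000 headlines per class.
headlines = [f"headline {i}" for i in range(32_000)]
labels = ["clickbait"] * 16_000 + ["real"] * 16_000

# stratify=labels keeps the 50/50 class balance in both partitions.
# random_state=42 is an assumption; the card only names it as a model hyperparameter.
X_train, X_test, y_train, y_test = train_test_split(
    headlines, labels, test_size=0.2, stratify=labels, random_state=42
)

split_sizes = (len(X_train), len(X_test))
test_class_counts = Counter(y_test)
print(split_sizes)         # (25600, 6400)
print(test_class_counts)   # 3200 headlines per class in the test set
```

The per-class test count of 3,200 matches the `support` column in the classification report.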
### Training Procedure
- **Architecture:** Random Forest with 200 estimators
- **Hyperparameters:** Default parameters except n_estimators=200, random_state=42
- **Training time:** [Not specified]
- **Hardware:** [Not specified]
- **Software:** scikit-learn, pandas, numpy
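
The procedure above might look roughly like the following. This is a minimal sketch rather than the original training script: the six toy headlines are invented for illustration, and only the settings named in this card (`stop_words="english"`, `max_features=5000`, `n_estimators=200`, `random_state=42`) come from the source.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy headlines for illustration only; the real training set is the Kaggle dataset.
train_headlines = [
    "You won't believe these shocking celebrity secrets",
    "10 tricks doctors hate that melt fat overnight",
    "This one weird habit will change your life forever",
    "Parliament passes annual budget after lengthy debate",
    "Central bank holds interest rates steady in March",
    "Researchers publish study on coastal erosion trends",
]
train_labels = ["clickbait"] * 3 + ["real"] * 3

# Settings taken from the card: English stop words, at most 5,000 features.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_train = vectorizer.fit_transform(train_headlines)

# scikit-learn defaults except for the two hyperparameters the card names.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, train_labels)

print(model.classes_)  # class labels, stored in sorted order
```

Fitting the vectorizer on the training split only, and reusing it unchanged at test and inference time, keeps test-set vocabulary from leaking into the features.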
## Evaluation
### Metrics
The model achieves the following performance on the test set:
- **Accuracy:** 91.45%
- **Precision:** 0.92 (macro avg)
- **Recall:** 0.91 (macro avg)
- **F1-Score:** 0.91 (macro avg)
### Testing Data, Factors & Metrics
#### Testing Data
- Held-out 20% test split (6,400 headlines) of the same dataset used for training
- Stratified sampling to maintain class balance
#### Factors
- Headline length and complexity
- Use of sensational language
- Topic domain
#### Metrics
- Accuracy, Precision, Recall, F1-Score
- Confusion Matrix
### Results
```
              precision    recall  f1-score   support

   clickbait       0.89      0.95      0.92      3200
        real       0.94      0.88      0.91      3200

    accuracy                           0.91      6400
   macro avg       0.92      0.91      0.91      6400
weighted avg       0.92      0.91      0.91      6400
```
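
As a sanity check, approximate confusion-matrix counts can be recovered from the per-class recall and support in the report. This is only a rough reconstruction: the report rounds to two decimals, so the counts (and the accuracy derived from them) can differ slightly from the exact figures, such as the 91.45% accuracy quoted earlier.

```python
# Per-class recall and support as printed in the classification report.
support = {"clickbait": 3200, "real": 3200}
recall = {"clickbait": 0.95, "real": 0.88}

# Correctly classified headlines per class: recall * support (approximate).
correct = {label: round(recall[label] * support[label]) for label in support}
# Misclassified headlines per class are whatever remains of the support.
missed = {label: support[label] - correct[label] for label in support}

accuracy = sum(correct.values()) / sum(support.values())

print(correct)  # approximate correctly classified count per class
print(missed)   # approximate misclassified count per class
print(f"accuracy ~ {accuracy:.3f}")
```

Under these rounded inputs, more real headlines are mislabeled as clickbait than the reverse, which is consistent with the lower recall on the `real` class.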
## Environmental Impact
- **Estimated Emissions:** Not calculated
- **Hardware Type:** Standard CPU training
- **Hours used:** [Not specified]
## Technical Specifications
### Model Architecture and Objective
- **Architecture:** Ensemble of decision trees (Random Forest)
- **Objective:** Binary classification using TF-IDF features
- **Input preprocessing:** TF-IDF vectorization
- **Output postprocessing:** Class prediction
### Compute Infrastructure
- **Hardware:** CPU-based training
- **Software:** Python, scikit-learn
## How to Use
### Loading the Model
```python
from huggingface_hub import hf_hub_download
import joblib
# Download model and vectorizer
model_path = hf_hub_download(repo_id="Devishetty100/clickbait-detector", filename="clickbait_detector.pkl")
vectorizer_path = hf_hub_download(repo_id="Devishetty100/clickbait-detector", filename="tfidf_vectorizer.pkl")
# Load
model = joblib.load(model_path)
vectorizer = joblib.load(vectorizer_path)
```
### Making Predictions
```python
# Example headline
headline = "You won't believe what happened next!"
# Transform and predict
features = vectorizer.transform([headline])
prediction = model.predict(features)[0]
print(f"Prediction: {prediction}") # Output: 'clickbait' or 'real'
```
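
For more than one headline, or when a confidence score is useful (for example, to route uncertain cases to a human reviewer), a small helper can wrap the same two calls. This is a sketch under assumptions: `classify_headlines` is a hypothetical helper, not part of this repository, and it relies on the loaded model exposing `predict_proba`, as scikit-learn's `RandomForestClassifier` does. The toy vectorizer and model exist only so the example runs offline.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer


def classify_headlines(headlines, vectorizer, model):
    """Return (headline, predicted label, confidence) for each headline."""
    features = vectorizer.transform(headlines)
    probabilities = model.predict_proba(features)
    results = []
    for headline, row in zip(headlines, probabilities):
        best = row.argmax()  # index of the most probable class
        results.append((headline, model.classes_[best], float(row[best])))
    return results


# Stand-in vectorizer and model fitted on toy data so the sketch runs offline;
# in practice, pass the objects loaded with joblib instead.
toy_headlines = [
    "You won't believe these shocking secrets",
    "This weird trick will change your life",
    "Parliament passes annual budget after debate",
    "Central bank holds interest rates steady",
]
toy_labels = ["clickbait", "clickbait", "real", "real"]
toy_vectorizer = TfidfVectorizer().fit(toy_headlines)
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(
    toy_vectorizer.transform(toy_headlines), toy_labels
)

for headline, label, confidence in classify_headlines(
    ["Shocking secrets you won't believe", "Budget debate continues"],
    toy_vectorizer,
    toy_model,
):
    print(f"{label} ({confidence:.2f}): {headline}")
```

The confidence here is the fraction of the forest's averaged tree votes for the winning class, so it is a rough score rather than a calibrated probability.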
### Requirements
- Python 3.6+
- scikit-learn
- joblib
- huggingface_hub
## Citation
If you use this model, please cite:
```
@misc{clickbait-detector,
title={Clickbait Detector},
author={Devishetty100},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/Devishetty100/clickbait-detector}
}
```
## Contact
For questions or issues, please open an issue on the [repository](https://huggingface.co/Devishetty100/clickbait-detector).