Devishetty100 committed
Commit 6936cc5 · verified · 1 Parent(s): 2ba4f61

Create README.md

Files changed (1)
  1. README.md +209 -0
README.md ADDED
@@ -0,0 +1,209 @@
+ ---
+ language: en
+ tags:
+ - clickbait-detection
+ - text-classification
+ - sklearn
+ - random-forest
+ - tfidf
+ license: mit
+ datasets:
+ - clickbait-dataset
+ metrics:
+ - accuracy
+ - precision
+ - recall
+ - f1-score
+ ---
+
+ # Clickbait Detector
+
+ This model is a Random Forest classifier trained on TF-IDF features to detect clickbait. It labels each English news headline as either "clickbait" or "real".
+
+ ## Model Details
+
+ ### Model Description
+
+ - **Model type:** Random Forest Classifier
+ - **Task:** Text Classification (Clickbait Detection)
+ - **Input:** News headlines (text strings)
+ - **Output:** Binary classification ("clickbait" or "real")
+ - **Language(s) covered:** English
+ - **License:** MIT
+
+ ### Model Sources
+
+ - **Repository:** [Devishetty100/clickbait-detector](https://huggingface.co/Devishetty100/clickbait-detector)
+ - **Paper or resources:** N/A
+ - **Demo:** N/A
+
+ ## Uses
+
+ ### Direct Use
+
+ This model classifies news headlines to flag potentially misleading or sensationalized content. It can be integrated into content moderation systems, news aggregators, or educational tools to help users distinguish genuine news from clickbait.
+
+ ### Downstream Use
+
+ - Content filtering and moderation
+ - Journalism education
+ - Social media analysis
+ - Research on media manipulation
+
+ ### Out-of-Scope Use
+
+ This model should not be used for:
+ - Automated content removal without human oversight
+ - Making decisions that affect individuals' livelihoods or rights
+ - Classifying content in languages other than English
+
+ ## Bias, Risks, and Limitations
+
+ ### Recommendations
+
+ Users should be aware that:
+ - The model may reflect biases present in its training data
+ - Performance may vary across domains and writing styles
+ - False positives and false negatives can occur
+ - The model is trained on English text only
+
+ ### Known Limitations
+
+ - Trained on a single dataset, which may not represent all types of clickbait or real news
+ - May not perform well on very short or very long headlines
+ - Does not consider context beyond the headline text itself
+ - Binary classification may not capture nuanced cases
+
+ ## Training Details
+
+ ### Training Data
+
+ The model was trained on the [Clickbait Dataset](https://www.kaggle.com/datasets/amananandrai/clickbait-dataset) from Kaggle, which contains news headlines labeled as clickbait or real.
+
+ - **Dataset size:** 32,000 samples (16,000 clickbait, 16,000 real)
+ - **Data preprocessing:** Text cleaning, then TF-IDF vectorization with English stop-word removal and at most 5,000 features
+ - **Train/test split:** 80/20 stratified split (25,600 train, 6,400 test)
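As a minimal sketch of the preprocessing listed above, the snippet below performs a stratified 80/20 split and fits a `TfidfVectorizer` with English stop words and `max_features=5000` using scikit-learn. The toy headlines, the variable names, `random_state=42` for the split, and fitting the vectorizer on the training portion only are assumptions for illustration; the original training script is not part of this card.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the Kaggle dataset (hypothetical examples, 5 per class).
headlines = [
    "You won't believe what this dog did next",
    "10 tricks doctors don't want you to know",
    "This one weird habit will change your life",
    "What happened next will shock you",
    "Only geniuses can solve this puzzle",
    "Parliament passes annual budget bill",
    "Central bank raises interest rates by a quarter point",
    "Earthquake strikes coastal region, hundreds evacuated",
    "Company reports third quarter earnings",
    "City council approves new transit plan",
]
labels = ["clickbait"] * 5 + ["real"] * 5

# Stratified 80/20 split keeps the class ratio equal in both parts.
train_texts, test_texts, y_train, y_test = train_test_split(
    headlines, labels, test_size=0.2, stratify=labels, random_state=42
)

# English stop-word removal and a 5000-term vocabulary cap, as listed above.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary on train only
X_test = vectorizer.transform(test_texts)        # reuse the same vocabulary

print(X_train.shape, X_test.shape)
```

On the full 32,000-headline dataset, the same 80/20 split yields the 25,600 training and 6,400 test rows reported above.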
+
+ ### Training Procedure
+
+ - **Architecture:** Random Forest with 200 estimators
+ - **Hyperparameters:** scikit-learn defaults except `n_estimators=200` and `random_state=42`
+ - **Training time:** [Not specified]
+ - **Hardware:** [Not specified]
+ - **Software:** scikit-learn, pandas, numpy
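The configuration above could be reproduced roughly as follows. This is a sketch under stated assumptions, not the author's training script: the six toy headlines and all variable names are hypothetical, and only `n_estimators=200` and `random_state=42` come from this card.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy training set; the real model used the Kaggle dataset.
train_texts = [
    "You won't believe what this dog did next",
    "10 tricks doctors don't want you to know",
    "What happened next will shock you",
    "Parliament passes annual budget bill",
    "Central bank raises interest rates",
    "City council approves new transit plan",
]
y_train = ["clickbait", "clickbait", "clickbait", "real", "real", "real"]

# TF-IDF settings as described under Training Data.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_train = vectorizer.fit_transform(train_texts)

# Only n_estimators and random_state differ from scikit-learn's defaults.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Classify an unseen headline with the fitted vectorizer and model.
new_features = vectorizer.transform(["Senate votes on budget bill"])
print(model.predict(new_features)[0])
```

Fixing `random_state` makes the forest reproducible across runs on the same data and library version.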
+
+ ## Evaluation
+
+ ### Metrics
+
+ The model achieves the following performance on the held-out test set:
+
+ - **Accuracy:** 91.45%
+ - **Precision:** 0.92 (macro avg)
+ - **Recall:** 0.91 (macro avg)
+ - **F1-Score:** 0.91 (macro avg)
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+ - Held-out test set (6,400 headlines) drawn from the same dataset as the training data
+ - Stratified sampling to maintain class balance
+
+ #### Factors
+ - Headline length and complexity
+ - Use of sensational language
+ - Topic domain
+
+ #### Metrics
+ - Accuracy, Precision, Recall, F1-Score
+ - Confusion Matrix
+
+ ### Results
+
+ ```
+               precision    recall  f1-score   support
+
+    clickbait       0.89      0.95      0.92      3200
+         real       0.94      0.88      0.91      3200
+
+     accuracy                           0.91      6400
+    macro avg       0.92      0.91      0.91      6400
+ weighted avg       0.92      0.91      0.91      6400
+ ```
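The F1 column and the macro-average row follow from the per-class precision and recall. As a quick consistency check, this sketch recomputes them from the rounded values in the report above. Because the inputs are already rounded to two decimals, the results only approximate what scikit-learn computed from the raw predictions.

```python
# Per-class (precision, recall) copied from the report above.
report = {
    "clickbait": (0.89, 0.95),
    "real": (0.94, 0.88),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_scores = {label: f1(p, r) for label, (p, r) in report.items()}

# Macro average: unweighted mean over the classes.
macro_precision = sum(p for p, _ in report.values()) / len(report)
macro_recall = sum(r for _, r in report.values()) / len(report)

for label, score in f1_scores.items():
    print(f"{label}: F1 = {score:.3f}")
print(f"macro precision = {macro_precision:.3f}, macro recall = {macro_recall:.3f}")
```

The recomputed F1 values round to 0.92 and 0.91, matching the report. Both classes have the same support (3,200), so the macro and weighted averages coincide.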
+
+ ## Environmental Impact
+
+ **Estimated Emissions:** Not calculated
+
+ **Hardware Type:** Standard CPU training
+
+ **Hours used:** [Not specified]
+
+ ## Technical Specifications
+
+ ### Model Architecture and Objective
+
+ - **Architecture:** Ensemble of decision trees (Random Forest)
+ - **Objective:** Binary classification using TF-IDF features
+ - **Input preprocessing:** TF-IDF vectorization
+ - **Output postprocessing:** Class prediction
+
+ ### Compute Infrastructure
+
+ - **Hardware:** CPU-based training
+ - **Software:** Python, scikit-learn
+
+ ## How to Use
+
+ ### Loading the Model
+
+ ```python
+ from huggingface_hub import hf_hub_download
+ import joblib
+
+ # Download the trained classifier and its fitted TF-IDF vectorizer from the Hub
+ model_path = hf_hub_download(repo_id="Devishetty100/clickbait-detector", filename="clickbait_detector.pkl")
+ vectorizer_path = hf_hub_download(repo_id="Devishetty100/clickbait-detector", filename="tfidf_vectorizer.pkl")
+
+ # Deserialize both objects from the downloaded files
+ model = joblib.load(model_path)
+ vectorizer = joblib.load(vectorizer_path)
+ ```
+
+ ### Making Predictions
+
+ ```python
+ # Example headline
+ headline = "You won't believe what happened next!"
+
+ # Transform the headline with the fitted vectorizer, then predict its label
+ features = vectorizer.transform([headline])
+ prediction = model.predict(features)[0]
+
+ print(f"Prediction: {prediction}")  # Output: 'clickbait' or 'real'
+ ```
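For more than one headline, or when a confidence score is useful (for example, to route uncertain cases to a human reviewer), a small helper can wrap the two calls. This is a sketch, not part of the repository: `classify_headlines` is a hypothetical name, and it assumes the loaded model exposes `predict_proba` and `classes_`, as scikit-learn's `RandomForestClassifier` does.

```python
def classify_headlines(model, vectorizer, headlines):
    """Return (headline, predicted label, probability) for each headline."""
    features = vectorizer.transform(headlines)
    probabilities = model.predict_proba(features)  # one row per headline
    results = []
    for headline, row in zip(headlines, probabilities):
        # Index of the most probable class in this row.
        best = max(range(len(row)), key=lambda i: row[i])
        results.append((headline, str(model.classes_[best]), float(row[best])))
    return results


# Usage with the `model` and `vectorizer` loaded above:
# for headline, label, confidence in classify_headlines(model, vectorizer, [
#     "You won't believe what happened next!",
#     "Parliament passes annual budget bill",
# ]):
#     print(f"{label} ({confidence:.2f}): {headline}")
```

For a Random Forest, the reported probability is the average of the per-tree class probabilities, so it is a rough confidence signal rather than a calibrated one.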
+
+ ### Requirements
+
+ - Python 3.6+
+ - scikit-learn
+ - joblib
+ - huggingface_hub
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```
+ @misc{clickbait-detector,
+   title={Clickbait Detector},
+   author={Devishetty100},
+   year={2024},
+   publisher={Hugging Face},
+   url={https://huggingface.co/Devishetty100/clickbait-detector}
+ }
+ ```
+
+ ## Contact
+
+ For questions or issues, please open an issue on the [repository](https://huggingface.co/Devishetty100/clickbait-detector).