lawlevisan committed on
Commit 9c228b0 · verified · 1 Parent(s): 6c20bf5

Upload 2 files

evaluation_results/classification_report.csv ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7726c5b7899421298a3732569702f85b7584dd4e9b89229a46b8433c556ee026
+ size 400
evaluation_results/evaluation_results.txt ADDED
@@ -0,0 +1,144 @@
+ ✅ Loaded 1953 rows from 7 CSV files.
+ Columns detected in CSVs: ['tweet_id', 'datetime', 'username', 'user_display_name', 'user_followers', 'user_following', 'user_verified', 'content', 'user_location', 'user_description', 'hashtags', 'mentions', 'phone_numbers', 'tweet_url', 'retweet_count', 'like_count', 'reply_count', 'kar_score', 'drug_score', 'crime_score', 'sentiment', 'sentiment_compound', 'is_drug_related', 'is_crime_related', 'has_contact_info', 'risk_level', 'content_hash', 'date_parsed']
+
+ === General Stats ===
+ Columns: ['tweet_id', 'datetime', 'username', 'user_display_name', 'user_followers', 'user_following', 'user_verified', 'content', 'user_location', 'user_description', 'hashtags', 'mentions', 'phone_numbers', 'tweet_url', 'retweet_count', 'like_count', 'reply_count', 'kar_score', 'drug_score', 'crime_score', 'sentiment', 'sentiment_compound', 'is_drug_related', 'is_crime_related', 'has_contact_info', 'risk_level', 'content_hash', 'date_parsed']
+ Total rows: 1953
+ Missing values per column:
+ tweet_id                 0
+ datetime                 0
+ username                 0
+ user_display_name        0
+ user_followers           0
+ user_following           0
+ user_verified            0
+ content                  0
+ user_location          362
+ user_description      1953
+ hashtags               978
+ mentions              1349
+ phone_numbers         1939
+ tweet_url                0
+ retweet_count            0
+ like_count               0
+ reply_count              0
+ kar_score                0
+ drug_score               0
+ crime_score              0
+ sentiment                0
+ sentiment_compound       0
+ is_drug_related          0
+ is_crime_related         0
+ has_contact_info         0
+ risk_level               0
+ content_hash             0
+ date_parsed              0
+ dtype: int64
+ Duplicate rows: 1148
+
+ Sample rows with missing values:
+               tweet_id             datetime         username  ...  risk_level                      content_hash          date_parsed
+ 0  1959959601048420576  25-08-2025 18:12:37     NewsMeter_In  ...        HIGH  8abf18065977e4493f16f4b165624ef2  2025-08-25 18:12:37
+ 1  1966399847793345021  12-09-2025 12:43:52       idencies05  ...      MEDIUM  d555008fafd825e05aca84ccffb13a77  2025-09-12 12:43:52
+ 2  1969293025831530539  20-09-2025 12:20:19       idencies05  ...    CRITICAL  9884c305ea49804875088cd5b72c1781  2025-09-20 12:20:19
+ 3  1963241947126018435  03-09-2025 19:35:30  Prathikthethith  ...    CRITICAL  f2de5dc092522d4e1af0e49442e66937  2025-09-03 19:35:30
+ 4  1969293025831530539  20-09-2025 12:20:19       idencies05  ...    CRITICAL  9884c305ea49804875088cd5b72c1781  2025-09-20 12:20:19
+
+ [5 rows x 28 columns]
+
+ Sample duplicate rows:
+               tweet_id             datetime         username  ...  risk_level                      content_hash          date_parsed
+ 0  1959959601048420576  25-08-2025 18:12:37     NewsMeter_In  ...        HIGH  8abf18065977e4493f16f4b165624ef2  2025-08-25 18:12:37
+ 1  1966399847793345021  12-09-2025 12:43:52       idencies05  ...      MEDIUM  d555008fafd825e05aca84ccffb13a77  2025-09-12 12:43:52
+ 2  1969293025831530539  20-09-2025 12:20:19       idencies05  ...    CRITICAL  9884c305ea49804875088cd5b72c1781  2025-09-20 12:20:19
+ 3  1963241947126018435  03-09-2025 19:35:30  Prathikthethith  ...    CRITICAL  f2de5dc092522d4e1af0e49442e66937  2025-09-03 19:35:30
+ 5  1978329389067641222  15-10-2025 10:47:36  XpressBengaluru  ...    CRITICAL  eef36dce5060c43923727565aa695b93  2025-10-15 10:47:36
+
+ [5 rows x 28 columns]
+
+ === is_drug_related Distribution ===
+ is_drug_related
+ True     1473
+ False     480
+ Name: count, dtype: int64
+ Proportion:
+ is_drug_related
+ True     0.7542
+ False    0.2458
+ Name: proportion, dtype: float64
+
+ === is_crime_related Distribution ===
+ is_crime_related
+ True     1576
+ False     377
+ Name: count, dtype: int64
+ Proportion:
+ is_crime_related
+ True     0.807
+ False    0.193
+ Name: proportion, dtype: float64
+
+ === risk_level Distribution ===
+ risk_level
+ MEDIUM      1523
+ LOW          290
+ HIGH         127
+ CRITICAL      13
+ Name: count, dtype: int64
+ Proportion:
+ risk_level
+ MEDIUM      0.7798
+ LOW         0.1485
+ HIGH        0.0650
+ CRITICAL    0.0067
+ Name: proportion, dtype: float64
+
+ === Date Range ===
+ Earliest: 2025-03-14 16:42:38
+ Latest: 2025-10-17 00:36:41
+
+ === Daily Counts of Posts ===
+ date
+ 2025-03-14      2
+ 2025-07-19    112
+ 2025-07-20     35
+ 2025-07-21     23
+ 2025-07-22     27
+ ...
+ 2025-10-13     43
+ 2025-10-14     27
+ 2025-10-15     20
+ 2025-10-16     10
+ 2025-10-17      4
+ Length: 91, dtype: int64
+
+ === User Analysis ===
+ Total unique users: 554
+ Top 10 users by post count:
+ username
+ Newskarnataka      118
+ grok                57
+ KannadaRepublic     30
+ XpressBengaluru     25
+ path2shah           23
+ bangalore_22532     21
+ ndtv                20
+ ians_india          18
+ UvEnglish           17
+ wegro_app           17
+ Name: count, dtype: int64
+
+ === Scraper Evaluation Metrics ===
+ Completeness (all columns filled): 88.38%
+ Duplicate rows rate: 58.78%
+ is_drug_related relevance rate: 75.42%
+ is_crime_related relevance rate: 80.7%
+ Time coverage ratio (active days / total days): 41.94%
+
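Most of the scraper metrics in this log are simple ratios of counts that the log itself prints, so they can be rechecked by hand. The Python sketch below does that. The helper name `percent` is hypothetical, and the 217-day denominator is an assumption: it is the calendar-date difference between the earliest and latest posts, chosen because it reproduces the reported 41.94%. Completeness is left out because the log does not show how it was computed.

```python
from datetime import date

# Counts copied from the evaluation log.
TOTAL_ROWS = 1953
DUPLICATE_ROWS = 1148
DRUG_RELATED = 1473
CRIME_RELATED = 1576
ACTIVE_DAYS = 91
EARLIEST = date(2025, 3, 14)
LATEST = date(2025, 10, 17)


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Assumption: "total days" is the calendar-date difference between the
# earliest and latest posts. The original script may define it differently.
total_days = (LATEST - EARLIEST).days

metrics = {
    "duplicate_rows_rate": percent(DUPLICATE_ROWS, TOTAL_ROWS),
    "is_drug_related_relevance_rate": percent(DRUG_RELATED, TOTAL_ROWS),
    "is_crime_related_relevance_rate": percent(CRIME_RELATED, TOTAL_ROWS),
    "time_coverage_ratio": percent(ACTIVE_DAYS, total_days),
}

for name, value in metrics.items():
    print(f"{name}: {value}%")
```

Under these assumptions the four ratios agree with the figures printed in the log.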
+ === Classification Metrics (is_drug_related vs is_crime_related) ===
+ Accuracy: 0.6994
+ Precision: 0.8357
+ Recall: 0.7811
+ F1-score: 0.8075
+
+ Classification report saved as 'classification_report.csv'
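The four classification figures can be cross-checked from a single confusion matrix. The counts in this sketch are not printed in the log. They are inferred from the label totals (1473 and 1576 positives out of 1953 rows) and the reported precision, under the assumption that is_crime_related served as the reference label and is_drug_related as the prediction. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Inferred counts (an assumption, not values from the log):
#   tp + fp = 1473  rows flagged is_drug_related (treated as the prediction)
#   tp + fn = 1576  rows flagged is_crime_related (treated as the reference)
#   tp = 1231       the only integer consistent with precision 0.8357
TP, FP, FN, TN = 1231, 242, 345, 135

results = binary_metrics(TP, FP, FN, TN)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

With these counts the sketch gives the same four values as the log to four decimals. Because precision is higher than recall, the column with fewer positives (is_drug_related) must be the predicted side under the standard definitions. Swapping the two columns would swap precision and recall.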