Nishitha03 committed on
Commit
dd99def
·
verified ·
1 Parent(s): 169e806

Upload 15 files

src/README.md ADDED
@@ -0,0 +1,120 @@
1
+ # Indian News Scraper
2
+
3
+ A collection of web scrapers for various Indian news websites that can extract articles based on specific topics.
4
+
5
+ ## Features
6
+
7
+ - Scrapes articles from major Indian news sources:
8
+ - Times of India (TOI)
9
+ - NDTV
10
+ - WION
11
+ - Scroll.in
12
+ - Command-line interface for easy use
13
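+ - Multithreaded scraping (4 worker threads by default, configurable with `--workers`)
14
+ - Automatic progress saving (every 300 seconds by default, configurable with `--interval`) to limit data loss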
15
+ - CSV output format for easy analysis
16
+
17
+ ## Requirements
18
+
19
+ - Python 3.7+
20
+ - Chrome browser
21
+ - ChromeDriver (compatible with your Chrome version)
22
+
23
+ ## Installation
24
+
25
+ 1. Clone this repository:
26
+ ```bash
27
+ git clone https://github.com/yourusername/indian-news-scraper.git
28
+ cd indian-news-scraper
29
+ ```
30
+
31
+ 2. Install the required dependencies:
32
+ ```bash
33
+ pip install -r requirements.txt
34
+ ```
35
+
36
+ 3. Make sure you have Chrome and ChromeDriver installed:
37
+ - Install Chrome: [https://www.google.com/chrome/](https://www.google.com/chrome/)
38
+ - Download ChromeDriver: [https://chromedriver.chromium.org/downloads](https://chromedriver.chromium.org/downloads) (for Chrome 115 and newer, drivers are published through Chrome for Testing instead)
39
+ - Make sure ChromeDriver is in your PATH
40
+
41
+ ## Usage
42
+
43
+ Run the main script with the desired news source and topic:
44
+
45
+ ```bash
46
+ python run_scraper.py --source toi --topic "Climate Change"
47
+ ```
48
+
49
+ ### Available News Sources
50
+
51
+ - `toi` - Times of India
52
+ - `ndtv` - NDTV
53
+ - `wion` - WION News
54
+ - `scroll` - Scroll.in
55
+
56
+ ### Command Line Options
57
+
58
+ ```
59
+ usage: run_scraper.py [-h] --source {toi,ndtv,wion,scroll} --topic TOPIC [--workers WORKERS] [--interval INTERVAL]
60
+
61
+ Scrape news articles from Indian news websites
62
+
63
+ optional arguments:
64
+ -h, --help show this help message and exit
65
+ --source {toi,ndtv,wion,scroll}, -s {toi,ndtv,wion,scroll}
66
+ News source to scrape from
67
+ --topic TOPIC, -t TOPIC
68
+ Topic to search for (e.g., "Climate Change", "Politics")
69
+ --workers WORKERS, -w WORKERS
70
+ Number of worker threads (default: 4)
71
+ --interval INTERVAL, -i INTERVAL
72
+ Auto-save interval in seconds (default: 300)
73
+ ```
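The help text above corresponds to an `argparse` parser along the following lines. This is a minimal sketch reconstructed from the usage message only, not the actual contents of `run_scraper.py`; the helper name `build_parser` is an assumption.

```python
import argparse


def build_parser():
    """Build a parser matching the documented usage message (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="run_scraper.py",
        description="Scrape news articles from Indian news websites",
    )
    parser.add_argument("--source", "-s", required=True,
                        choices=["toi", "ndtv", "wion", "scroll"],
                        help="News source to scrape from")
    parser.add_argument("--topic", "-t", required=True,
                        help='Topic to search for (e.g., "Climate Change", "Politics")')
    parser.add_argument("--workers", "-w", type=int, default=4,
                        help="Number of worker threads (default: 4)")
    parser.add_argument("--interval", "-i", type=int, default=300,
                        help="Auto-save interval in seconds (default: 300)")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--source", "toi", "--topic", "Climate Change"])
print(args.source, args.topic, args.workers, args.interval)
```

Both `--source` and `--topic` are marked required here because the usage line shows them without square brackets.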
74
+
75
+ ### Examples
76
+
77
+ Scrape articles about "COVID" from Times of India:
78
+ ```bash
79
+ python run_scraper.py --source toi --topic COVID
80
+ ```
81
+
82
+ Scrape articles about "Elections" from NDTV with 8 worker threads:
83
+ ```bash
84
+ python run_scraper.py --source ndtv --topic Elections --workers 8
85
+ ```
86
+
87
+ Scrape articles about "Climate Change" from Scroll.in with auto-save every minute:
88
+ ```bash
89
+ python run_scraper.py --source scroll --topic "Climate Change" --interval 60
90
+ ```
91
+
92
+ ## Output
93
+
94
+ The scraped articles are saved in CSV format in the `output` directory with filenames in the following format:
95
+ ```
96
+ {source}_{topic}articles_{timestamp}_{status}.csv
97
+ ```
98
+
99
+ For example:
100
+ ```
101
+ output/toi_COVIDarticles_20250407_121530_final.csv
102
+ ```
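The naming pattern above can be reproduced with a small helper. This is a sketch under assumptions: the function name `build_output_name` is hypothetical, and the timestamp layout `YYYYMMDD_HHMMSS` is inferred from the single example filename rather than confirmed by the scraper code.

```python
from datetime import datetime


def build_output_name(source, topic, status, when):
    """Return a path following {source}_{topic}articles_{timestamp}_{status}.csv."""
    # Timestamp layout (YYYYMMDD_HHMMSS) is inferred from the README example.
    timestamp = when.strftime("%Y%m%d_%H%M%S")
    return f"output/{source}_{topic}articles_{timestamp}_{status}.csv"


print(build_output_name("toi", "COVID", "final", datetime(2025, 4, 7, 12, 15, 30)))
```

How the scraper treats topics that contain spaces (for example "Climate Change") is not stated in the README, so the sketch passes the topic through unchanged.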
103
+
104
+ ## Contributing
105
+
106
+ Contributions are welcome! Please feel free to submit a Pull Request.
107
+
108
+ 1. Fork the repository
109
+ 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
110
+ 3. Commit your changes (`git commit -m 'Add some amazing feature'`)
111
+ 4. Push to the branch (`git push origin feature/amazing-feature`)
112
+ 5. Open a Pull Request
113
+
114
+ ## License
115
+
116
+ This project is licensed under the MIT License - see the LICENSE file for details.
117
+
118
+ ## Disclaimer
119
+
120
+ This tool is meant for research and educational purposes only. Please respect the terms of service of the websites you scrape and use the data responsibly.
src/app.py ADDED
@@ -0,0 +1,1050 @@
1
+ import streamlit as st
2
+ import pandas as pd
3
+ import matplotlib.pyplot as plt
4
+ import numpy as np
5
+ import plotly.express as px
6
+ import plotly.graph_objects as go
7
+ from matplotlib.ticker import MaxNLocator
8
+ import os
9
+ import time
10
+ import json
11
+ import requests
12
+ import spacy
13
+ from tqdm import tqdm
14
+ import warnings
17
+
18
+ # Suppress warnings for cleaner output
19
+ warnings.filterwarnings('ignore')
20
+
21
+ # Set page configuration
22
+ st.set_page_config(
23
+ page_title="Sentiment Analysis of RSS Articles",
24
+ page_icon="📰",
25
+ layout="wide",
26
+ initial_sidebar_state="expanded"
27
+ )
28
+
29
+ # Custom CSS for styling
30
+ def load_css():
31
+ st.markdown("""
32
+ <style>
33
+ .main-header {
34
+ font-size: 3rem !important;
35
+ font-weight: 700 !important;
36
+ text-align: center !important;
37
+ padding: 2rem 0 !important;
38
+ }
39
+ .sub-header {
40
+ font-size: 2rem !important;
41
+ font-weight: 600 !important;
42
+ padding: 1rem 0 !important;
43
+ }
44
+ .newspaper-card {
45
+ background-color: #f8f9fa;
46
+ border-radius: 10px;
47
+ padding: 20px;
48
+ box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
49
+ text-align: center;
50
+ margin-bottom: 20px;
51
+ }
52
+ .newspaper-title {
53
+ font-size: 1.5rem;
54
+ font-weight: 600;
55
+ margin-bottom: 10px;
56
+ }
57
+ .entry-page {
58
+ display: flex;
59
+ flex-direction: column;
60
+ justify-content: center;
61
+ align-items: center;
62
+ height: 100vh;
63
+ position: fixed;
64
+ top: 0;
65
+ left: 0;
66
+ right: 0;
67
+ bottom: 0;
68
+ }
69
+ .entry-container {
70
+ text-align: center;
71
+ background-color: #f8f9fa;
72
+ padding: 3rem;
73
+ border-radius: 20px;
74
+ box-shadow: 0 6px 12px rgba(0, 0, 0, 0.15);
75
+ max-width: 800px;
76
+ }
77
+ .button-container {
78
+ margin-top: 2rem;
79
+ }
80
+ .footer {
81
+ text-align: center;
82
+ padding: 1rem;
83
+ color: #6c757d;
84
+ margin-top: 2rem;
85
+ border-top: 1px solid #dee2e6;
86
+ }
87
+ </style>
88
+ """, unsafe_allow_html=True)
89
+
90
+ # Constants
91
+ INDIA_GEOJSON_URL = 'https://raw.githubusercontent.com/geohacker/india/master/state/india_state.geojson'
92
+
93
+ # India GeoJSON loading function
94
+ @st.cache_data
95
+ def load_india_geojson():
96
+ """Load India GeoJSON data for mapping"""
97
+ try:
98
+ response = requests.get(INDIA_GEOJSON_URL, timeout=30)
99
+ return json.loads(response.text)
100
+ except Exception as e:
101
+ st.error(f"Failed to load GeoJSON: {e}")
102
+ st.info("Trying fallback method...")
103
+ try:
104
+ # Fallback: pip install geopandas
105
+ import geopandas as gpd
106
+ india = gpd.read_file(INDIA_GEOJSON_URL)
107
+ return json.loads(india.to_json())
108
+ except Exception:
109
+ st.error("Error: Could not load India GeoJSON. Please ensure internet connection.")
110
+ return None
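Once the GeoJSON has loaded, it can be worth checking that the state names used elsewhere in the app actually appear in it, since a choropleth silently leaves unmatched regions blank. The sketch below is illustrative only: the helper names are hypothetical, the `NAME_1` property is the one the map code refers to, and the sample data is a toy stand-in for the downloaded file.

```python
def geojson_state_names(geojson, prop="NAME_1"):
    """Collect the distinct values of one property across all GeoJSON features."""
    names = set()
    for feature in geojson.get("features", []):
        value = feature.get("properties", {}).get(prop)
        if value is not None:
            names.add(value)
    return names


def unmatched_states(expected_names, geojson, prop="NAME_1"):
    """Return expected names that no GeoJSON feature carries, sorted for display."""
    return sorted(set(expected_names) - geojson_state_names(geojson, prop))


# Toy FeatureCollection standing in for the downloaded file (names are made up
# for the example and say nothing about the real file's spelling).
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"NAME_1": "Kerala"}, "geometry": None},
        {"type": "Feature", "properties": {"NAME_1": "Orissa"}, "geometry": None},
    ],
}
print(unmatched_states(["Kerala", "Odisha"], sample))
```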
111
+
112
+ # Load spaCy model (with caching)
113
+ @st.cache_resource
114
+ def load_spacy_model():
115
+ try:
116
+ return spacy.load("en_core_web_sm")
117
+ except OSError:
118
+ st.info("Downloading spaCy model... This may take a moment.")
119
+ import subprocess, sys
120
+ subprocess.call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
121
+ return spacy.load("en_core_web_sm")
122
+
123
+ # State mapping dictionary
124
+ def get_state_mapping():
125
+ return {
126
+ # Standard state names
127
+ 'andhra pradesh': 'Andhra Pradesh',
128
+ 'arunachal pradesh': 'Arunachal Pradesh',
129
+ 'assam': 'Assam',
130
+ 'bihar': 'Bihar',
131
+ 'chhattisgarh': 'Chhattisgarh',
132
+ 'goa': 'Goa',
133
+ 'gujarat': 'Gujarat',
134
+ 'haryana': 'Haryana',
135
+ 'himachal pradesh': 'Himachal Pradesh',
136
+ 'jharkhand': 'Jharkhand',
137
+ 'karnataka': 'Karnataka',
138
+ 'kerala': 'Kerala',
139
+ 'madhya pradesh': 'Madhya Pradesh',
140
+ 'maharashtra': 'Maharashtra',
141
+ 'manipur': 'Manipur',
142
+ 'meghalaya': 'Meghalaya',
143
+ 'mizoram': 'Mizoram',
144
+ 'nagaland': 'Nagaland',
145
+ 'odisha': 'Odisha',
146
+ 'punjab': 'Punjab',
147
+ 'rajasthan': 'Rajasthan',
148
+ 'sikkim': 'Sikkim',
149
+ 'tamil nadu': 'Tamil Nadu',
150
+ 'telangana': 'Telangana',
151
+ 'tripura': 'Tripura',
152
+ 'uttar pradesh': 'Uttar Pradesh',
153
+ 'uttarakhand': 'Uttarakhand',
154
+ 'west bengal': 'West Bengal',
155
+ # Union Territories
156
+ 'delhi': 'Delhi',
157
+ 'new delhi': 'Delhi',
158
+ 'jammu and kashmir': 'Jammu and Kashmir',
159
+ 'j&k': 'Jammu and Kashmir',
160
+ 'ladakh': 'Ladakh',
161
+ 'chandigarh': 'Chandigarh',
162
+ 'puducherry': 'Puducherry',
163
+ 'pondicherry': 'Puducherry',
164
+ 'andaman and nicobar': 'Andaman and Nicobar Islands',
165
+ 'dadra and nagar haveli': 'Dadra and Nagar Haveli and Daman and Diu',
166
+ 'daman and diu': 'Dadra and Nagar Haveli and Daman and Diu',
167
+ 'lakshadweep': 'Lakshadweep',
168
+ # Major cities mapped to their states
169
+ 'mumbai': 'Maharashtra',
170
+ 'kolkata': 'West Bengal',
171
+ 'chennai': 'Tamil Nadu',
172
+ 'bangalore': 'Karnataka',
173
+ 'bengaluru': 'Karnataka',
174
+ 'hyderabad': 'Telangana',
175
+ 'ahmedabad': 'Gujarat',
176
+ 'lucknow': 'Uttar Pradesh',
177
+ 'jaipur': 'Rajasthan',
178
+ 'srinagar': 'Jammu and Kashmir',
179
+ 'varanasi': 'Uttar Pradesh',
180
+ 'kochi': 'Kerala',
181
+ 'pune': 'Maharashtra',
182
+ 'agra': 'Uttar Pradesh',
183
+ 'bhopal': 'Madhya Pradesh',
184
+ 'patna': 'Bihar',
185
+ }
186
+
187
+ # Function to extract locations from descriptions
188
+ @st.cache_data
189
+ def extract_locations_from_descriptions(df, description_column='desc'):
190
+ """
191
+ Extract state names from description column using spaCy
192
+ """
193
+ with st.spinner("Extracting location data from articles..."):
194
+ # Load spaCy model
195
+ nlp = load_spacy_model()
196
+
197
+ # Get state mapping dictionary
198
+ state_mapping = get_state_mapping()
199
+
200
+ # Process descriptions to extract locations
201
+ locations = []
202
+
203
+ # Use a progress bar if processing a large dataset
204
+ progress_text = "Extracting locations..."
205
+ progress_bar = st.progress(0)
206
+
207
+ for idx, row in enumerate(df.iterrows()):
208
+ # Update progress every 100 rows
209
+ if idx % 100 == 0:
210
+ progress_bar.progress(min(idx / len(df), 1.0))
211
+
212
+ row = row[1] # Get the actual row data (second element of the tuple)
213
+
214
+ if pd.isna(row[description_column]):
215
+ locations.append(None)
216
+ continue
217
+
218
+ description = str(row[description_column]).lower()
219
+ doc = nlp(description)
220
+
221
+ # Extract location entities
222
+ found_locations = []
223
+ for ent in doc.ents:
224
+ if ent.label_ in ["GPE", "LOC"]:
225
+ loc_name = ent.text.lower()
226
+ if loc_name in state_mapping:
227
+ found_locations.append(state_mapping[loc_name])
228
+
229
+ # Direct string matching for state names
230
+ for state_var, standard_name in state_mapping.items():
231
+ if state_var in description and standard_name not in found_locations:
232
+ found_locations.append(standard_name)
233
+
234
+ # Store the first found location, or None if none found
235
+ locations.append(found_locations[0] if found_locations else None)
236
+
237
+ # Complete progress
238
+ progress_bar.progress(1.0)
239
+
240
+ # Add locations to dataframe
241
+ df = df.copy() # Create a copy to avoid modifying the original
242
+ df['extracted_location'] = locations
243
+
244
+ st.success(f"Locations extracted. Found locations in {df['extracted_location'].notna().sum()} of {len(df)} articles.")
245
+ return df
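One caveat with the direct string matching in this function: a plain `in` test matches substrings, so a short key such as `goa` would also match inside "goal". A possible alternative, sketched here with a hypothetical helper `find_states`, is to require that each key is not surrounded by word characters. This is a suggestion, not what the committed code does.

```python
import re


def find_states(text, state_mapping):
    """Return standard state names whose variants occur as whole words in text."""
    text = text.lower()
    found = []
    for variant, standard in state_mapping.items():
        # Lookarounds instead of \b so that keys containing '&' (e.g. 'j&k') work too.
        pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
        if re.search(pattern, text) and standard not in found:
            found.append(standard)
    return found


mapping = {"goa": "Goa", "mumbai": "Maharashtra", "j&k": "Jammu and Kashmir"}
print(find_states("He scored a goal in Mumbai", mapping))
print(find_states("Floods in Goa and J&K", mapping))
```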
246
+
247
+ # Function to analyze sentiment by state
248
+ def analyze_sentiment_by_state(df, sentiment_column='sentiment_score'):
249
+ """
250
+ Analyze sentiment by state and prepare data for visualization
251
+ """
252
+ # Filter to only rows with extracted locations and valid sentiment
253
+ df_with_locations = df.dropna(subset=['extracted_location', sentiment_column])
254
+
255
+ if len(df_with_locations) == 0:
256
+ st.warning("No locations found with valid sentiment values. Cannot create map.")
257
+ return None
258
+
259
+ # Group by state and calculate average sentiment
260
+ sentiment_by_state = df_with_locations.groupby('extracted_location')[sentiment_column].agg(
261
+ avg_sentiment='mean',
261
+ count='count'
263
+ ).reset_index()
264
+
265
+ return sentiment_by_state
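To make the aggregation concrete, here is the same computation written in plain Python without pandas: rows missing either a location or a score are dropped, then each state gets a mean score and an article count. The function name and the sample rows are made up for illustration.

```python
from collections import defaultdict


def average_sentiment_by_state(rows):
    """rows: iterable of (state, score) pairs; pairs with a missing value are skipped."""
    totals = defaultdict(lambda: [0.0, 0])  # state -> [sum of scores, article count]
    for state, score in rows:
        if state is None or score is None:
            continue
        totals[state][0] += score
        totals[state][1] += 1
    return {
        state: {"avg_sentiment": total / count, "count": count}
        for state, (total, count) in totals.items()
    }


rows = [("Kerala", 1.0), ("Kerala", -1.0), ("Goa", 1.0), (None, 1.0), ("Goa", None)]
print(average_sentiment_by_state(rows))
```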
266
+
267
+ # Function to create India sentiment map
268
+ def create_india_sentiment_map(sentiment_data, geojson_data, newspaper_name):
269
+ """
270
+ Create a choropleth map of India showing sentiment by state
271
+ """
272
+ # Ensure state names match between GeoJSON and our data
273
+ state_property = 'NAME_1' # This is the property name in the GeoJSON
274
+
275
+ # Determine color scale range based on data
276
+ min_sentiment = sentiment_data['avg_sentiment'].min()
277
+ max_sentiment = sentiment_data['avg_sentiment'].max()
278
+
279
+ # Use symmetrical range if sentiment ranges from negative to positive
280
+ if min_sentiment < 0 and max_sentiment > 0:
281
+ abs_max = max(abs(min_sentiment), abs(max_sentiment))
282
+ color_range = [-abs_max, abs_max]
283
+ else:
284
+ # Add small buffer to range
285
+ color_range = [min_sentiment - 0.1, max_sentiment + 0.1]
286
+
287
+ # Create the choropleth map
288
+ fig = px.choropleth_mapbox(
289
+ sentiment_data,
290
+ geojson=geojson_data,
291
+ locations='extracted_location',
292
+ featureidkey=f"properties.{state_property}",
293
+ color='avg_sentiment',
294
+ color_continuous_scale="RdBu",
295
+ range_color=color_range,
296
+ mapbox_style="carto-positron",
297
+ zoom=3.5,
298
+ center={"lat": 20.5937, "lon": 78.9629},
299
+ opacity=0.7,
300
+ hover_data=['count'],
301
+ labels={
302
+ 'avg_sentiment': 'Average Sentiment',
303
+ 'extracted_location': 'State',
304
+ 'count': 'Article Count'
305
+ }
306
+ )
307
+
308
+ # Customize the layout
309
+ fig.update_layout(
310
+ title=dict(
311
+ text=f'{newspaper_name} - Sentiment Analysis by Indian States',
312
+ font=dict(size=24, color='#2c3e50'),
313
+ x=0.5,
314
+ y=0.95
315
+ ),
316
+ height=800,
317
+ margin={"r":0,"t":50,"l":0,"b":0}
318
+ )
319
+
320
+ # Add text annotation explaining the color scale
321
+ fig.add_annotation(
322
+ x=0.5, y=0.02,
323
+ xref="paper", yref="paper",
324
+ text="Color scale: Red (Negative) to Blue (Positive)",
325
+ showarrow=False,
326
+ font=dict(size=14)
327
+ )
328
+
329
+ return fig
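The colour-range rule used in this function can be isolated into a small pure function, which makes it easy to check by hand. The helper name `sentiment_color_range` is hypothetical; the logic mirrors the branch above. A symmetric range matters for a diverging scale such as `RdBu` because it places a sentiment of zero at the scale's midpoint.

```python
def sentiment_color_range(min_sentiment, max_sentiment, buffer=0.1):
    """Return [low, high] for the colour scale.

    If the data spans both signs, use a range symmetric around zero;
    otherwise pad the observed range by a small buffer on each side.
    """
    if min_sentiment < 0 and max_sentiment > 0:
        abs_max = max(abs(min_sentiment), abs(max_sentiment))
        return [-abs_max, abs_max]
    return [min_sentiment - buffer, max_sentiment + buffer]


print(sentiment_color_range(-0.2, 0.5))
print(sentiment_color_range(0.25, 0.75))
```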
330
+
331
+ # Function to plot sentiment trends by year (from original code)
332
+ def plot_sentiment_trends_by_year(df, newspaper_name):
333
+ # Set the style to a clean, modern look
334
+ plt.style.use('seaborn-v0_8-whitegrid')
335
+
336
+ # Custom font settings
337
+ plt.rcParams['font.family'] = 'sans-serif'
338
+ plt.rcParams['font.sans-serif'] = ['Arial', 'Helvetica', 'DejaVu Sans']
339
+ plt.rcParams['font.size'] = 11
340
+ plt.rcParams['axes.titlesize'] = 16
341
+ plt.rcParams['axes.labelsize'] = 12
342
+
343
+ # Convert date to datetime and extract year
344
+ df['year'] = pd.to_datetime(df['date'], errors='coerce').dt.year
345
+
346
+ # Ensure only known sentiments are used
347
+ valid_sentiments = {"positive", "negative", "neutral"}
348
+ df['sentiment'] = df['sentiment_value'].apply(lambda x: x.lower() if isinstance(x, str) and x.lower() in valid_sentiments else "neutral")
349
+
350
+ # Count the number of articles per sentiment per year
351
+ sentiment_counts = df.groupby(['year', 'sentiment']).size().reset_index(name='count')
352
+
353
+ # Calculate total articles per year
354
+ year_totals = sentiment_counts.groupby('year')['count'].sum().reset_index(name='total')
355
+
356
+ # Merge the counts with totals to calculate percentages
357
+ sentiment_counts = sentiment_counts.merge(year_totals, on='year')
358
+ sentiment_counts['percentage'] = sentiment_counts['count'] / sentiment_counts['total'] * 100
359
+
360
+ # Pivot the data for easier plotting
361
+ sentiment_pivot = sentiment_counts.pivot(index='year', columns='sentiment', values='percentage').fillna(0)
362
+
363
+ # Ensure all sentiment columns exist
364
+ for sentiment in ['negative', 'neutral', 'positive']:
365
+ if sentiment not in sentiment_pivot.columns:
366
+ sentiment_pivot[sentiment] = 0
367
+
368
+ # Sort by year (ascending for timeline)
369
+ sentiment_pivot = sentiment_pivot.sort_index()
370
+
371
+ # Create the figure and axis
372
+ fig, ax = plt.subplots(figsize=(12, 7))
373
+
374
+ # Define custom colors
375
+ colors = {
376
+ 'negative': '#5D3FD3', # rich purple
377
+ 'neutral': '#9D4EDD', # lavender purple
378
+ 'positive': '#00897B' # teal green
379
+ }
380
+
381
+ # Plot lines for each sentiment
382
+ for sentiment in ['negative', 'neutral', 'positive']:
383
+ ax.plot(
384
+ sentiment_pivot.index,
385
+ sentiment_pivot[sentiment],
386
+ marker='o',
387
+ linewidth=2.5,
388
+ label=sentiment.capitalize(),
389
+ color=colors[sentiment],
390
+ markersize=8,
391
+ markeredgecolor='white',
392
+ markeredgewidth=1.5
393
+ )
394
+
395
+ # Add article counts as annotations
396
+ for year in sentiment_pivot.index:
397
+ total = year_totals.loc[year_totals['year'] == year, 'total'].values[0]
398
+ ax.annotate(
399
+ f"{total:,}",
400
+ xy=(year, sentiment_pivot.loc[year, 'negative'] - 5),
401
+ xytext=(0, -25),
402
+ textcoords='offset points',
403
+ ha='center',
404
+ fontsize=9,
405
+ color='gray'
406
+ )
407
+
408
+ # Add a text indicating what the numbers represent
409
+ ax.text(
410
+ sentiment_pivot.index[0],
411
+ -12,
412
+ "Article Count",
413
+ fontsize=9,
414
+ color='gray',
415
+ ha='center'
416
+ )
417
+
418
+ # Set x-axis to only show years (integers)
419
+ ax.xaxis.set_major_locator(MaxNLocator(integer=True))
420
+
421
+ # Set y-axis limits and labels
422
+ ax.set_ylim(0, max(100, sentiment_pivot.max().max() * 1.1))
423
+ ax.set_ylabel('Percentage (%)', fontweight='bold')
424
+ ax.set_xlabel('Year', fontweight='bold')
425
+
426
+ # Add title
427
+ ax.set_title(f'{newspaper_name} - Sentiment Trends by Year', fontweight='bold', pad=20)
428
+
429
+ # Customize legend
430
+ legend = ax.legend(
431
+ loc='upper right',
432
+ frameon=True,
433
+ framealpha=0.95,
434
+ edgecolor='lightgray',
435
+ title='Sentiment'
436
+ )
437
+ legend.get_title().set_fontweight('bold')
438
+
439
+ # Remove spines for cleaner look
440
+ ax.spines['top'].set_visible(False)
441
+ ax.spines['right'].set_visible(False)
442
+ ax.spines['left'].set_linewidth(0.5)
443
+ ax.spines['bottom'].set_linewidth(0.5)
444
+
445
+ # Add grid lines
446
+ ax.grid(axis='y', linestyle='--', alpha=0.3, color='gray')
447
+
448
+ # Add subtle background color
449
+ fig.patch.set_facecolor('#F8F9FA')
450
+ ax.set_facecolor('#F8F9FA')
451
+
452
+ # Add percentage labels at the end of each line
453
+ last_year = sentiment_pivot.index[-1]
454
+ for sentiment in ['negative', 'neutral', 'positive']:
455
+ if last_year in sentiment_pivot.index: # Check if the last_year exists in the index
456
+ last_value = sentiment_pivot.loc[last_year, sentiment]
457
+ ax.annotate(
458
+ f"{last_value:.1f}%",
459
+ xy=(last_year, last_value),
460
+ xytext=(5, 0),
461
+ textcoords='offset points',
462
+ fontweight='bold',
463
+ color=colors[sentiment]
464
+ )
465
+
466
+ # Add a data source footer
467
+ plt.figtext(
468
+ 0.01, 0.01,
469
+ f"Data source: Analysis of {df.shape[0]:,} articles",
470
+ fontsize=8,
471
+ color='gray'
472
+ )
473
+
474
+ # Add horizontal line at 50% for reference
475
+ ax.axhline(y=50, color='gray', linestyle='-', alpha=0.2)
476
+ ax.text(sentiment_pivot.index[0], 51, "50%", fontsize=8, color='gray')
477
+
478
+ # Adjust layout
479
+ plt.tight_layout(pad=2.0)
480
+
481
+ return fig
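The data preparation inside this plotting function (normalise labels, count per year, convert to percentages) can be followed more easily in isolation. The sketch below redoes those steps in plain Python on made-up records; the function names are hypothetical and the real code uses pandas `groupby`, `merge`, and `pivot` instead.

```python
from collections import Counter

VALID = ("negative", "neutral", "positive")


def normalise(label):
    """Lower-case known labels; anything else (including non-strings) becomes 'neutral'."""
    if isinstance(label, str) and label.lower() in VALID:
        return label.lower()
    return "neutral"


def yearly_percentages(records):
    """records: iterable of (year, label). Returns {year: {sentiment: percent}}."""
    counts = Counter((year, normalise(label)) for year, label in records)
    totals = Counter()
    for (year, _), n in counts.items():
        totals[year] += n
    # Missing (year, sentiment) pairs read as 0 from the Counter, like fillna(0).
    return {
        year: {s: 100.0 * counts[(year, s)] / totals[year] for s in VALID}
        for year in sorted(totals)
    }


records = [(2020, "Positive"), (2020, "negative"), (2020, None),
           (2020, "mixed"), (2021, "positive")]
print(yearly_percentages(records))
```

Note that unknown labels are folded into "neutral", so a dataset with many unlabelled rows will show an inflated neutral share.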
482
+
483
+ # Function to plot article volume by year (from original code)
484
+ def plot_article_volume_by_year(df, newspaper_name):
485
+ # Set the style to a clean, modern look
486
+ plt.style.use('seaborn-v0_8-whitegrid')
487
+
488
+ # Custom font settings
489
+ plt.rcParams['font.family'] = 'sans-serif'
490
+ plt.rcParams['font.sans-serif'] = ['Arial', 'Helvetica', 'DejaVu Sans']
491
+
492
+ # Convert date to datetime and extract year
493
+ df['year'] = pd.to_datetime(df['date'], errors='coerce').dt.year
494
+
495
+ # Count articles per year
496
+ article_counts = df.groupby('year').size().reset_index(name='count')
497
+
498
+ # Create the figure and axis
499
+ fig, ax = plt.subplots(figsize=(12, 5))
500
+
501
+ # Plot line for article count
502
+ ax.plot(
503
+ article_counts['year'],
504
+ article_counts['count'],
505
+ marker='o',
506
+ linewidth=2.5,
507
+ color='#3949AB',
508
+ markersize=8,
509
+ markeredgecolor='white',
510
+ markeredgewidth=1.5
511
+ )
512
+
513
+ # Fill area under the line
514
+ ax.fill_between(
515
+ article_counts['year'],
516
+ article_counts['count'],
517
+ alpha=0.2,
518
+ color='#3949AB'
519
+ )
520
+
521
+ # Set x-axis to only show years (integers)
522
+ ax.xaxis.set_major_locator(MaxNLocator(integer=True))
523
+
524
+ # Add count labels above each point
525
+ for year, count in zip(article_counts['year'], article_counts['count']):
526
+ ax.annotate(
527
+ f"{count:,}",
528
+ xy=(year, count),
529
+ xytext=(0, 10),
530
+ textcoords='offset points',
531
+ ha='center',
532
+ fontweight='bold',
533
+ fontsize=10
534
+ )
535
+
536
+ # Set axis labels
537
+ ax.set_ylabel('Number of Articles', fontweight='bold')
538
+ ax.set_xlabel('Year', fontweight='bold')
539
+
540
+ # Add title
541
+ ax.set_title(f'{newspaper_name} - Article Volume by Year', fontweight='bold', pad=20)
542
+
543
+ # Remove spines for cleaner look
544
+ ax.spines['top'].set_visible(False)
545
+ ax.spines['right'].set_visible(False)
546
+
547
+ # Add grid lines
548
+ ax.grid(axis='y', linestyle='--', alpha=0.3, color='gray')
549
+
550
+ # Add subtle background color
551
+ fig.patch.set_facecolor('#F8F9FA')
552
+ ax.set_facecolor('#F8F9FA')
553
+
554
+ # Adjust layout
555
+ plt.tight_layout()
556
+
557
+ return fig
558
+
559
+ # Function to create a comparison bar chart of newspapers
560
+ def create_newspaper_comparison(dataframes, newspaper_names):
561
+ # Prepare data for the comparison
562
+ comparison_data = []
563
+
564
+ for i, df in enumerate(dataframes):
565
+ if df is not None:
566
+ # Ensure sentiment column exists and is properly formatted
567
+ if 'sentiment_value' in df.columns:
568
+ df['sentiment'] = df['sentiment_value'].apply(
569
+ lambda x: x.lower() if isinstance(x, str) and x.lower() in ["positive", "negative", "neutral"] else "neutral"
570
+ )
571
+
572
+ # Count articles by sentiment
573
+ sentiment_counts = df['sentiment'].value_counts().to_dict()
574
+
575
+ # Add counts to comparison data
576
+ for sentiment in ['positive', 'negative', 'neutral']:
577
+ comparison_data.append({
578
+ 'Newspaper': newspaper_names[i],
579
+ 'Sentiment': sentiment.capitalize(),
580
+ 'Count': sentiment_counts.get(sentiment, 0)
581
+ })
582
+
583
+ # Create DataFrame from comparison data
584
+ comparison_df = pd.DataFrame(comparison_data)
585
+
586
+ # Create grouped bar chart
587
+ fig = px.bar(
588
+ comparison_df,
589
+ x='Newspaper',
590
+ y='Count',
591
+ color='Sentiment',
592
+ barmode='group',
593
+ title='Sentiment Distribution Across Newspapers',
594
+ color_discrete_map={
595
+ 'Positive': '#00897B',
596
+ 'Neutral': '#9D4EDD',
597
+ 'Negative': '#5D3FD3'
598
+ }
599
+ )
600
+
601
+ fig.update_layout(
602
+ height=500,
603
+ legend_title='Sentiment',
604
+ xaxis_title='',
605
+ yaxis_title='Number of Articles'
606
+ )
607
+
608
+ return fig
609
+
610
+ # Function to create a top locations bar chart
611
+ def create_top_locations_chart(df, newspaper_name):
612
+ """Create a bar chart of the top mentioned locations"""
613
+ if 'extracted_location' not in df.columns or df['extracted_location'].isna().all():
614
+ # Return an empty figure
615
+ fig = go.Figure()
616
+ fig.add_annotation(
617
+ text="No location data available",
618
+ showarrow=False,
619
+ font=dict(size=20)
620
+ )
621
+ fig.update_layout(height=400)
622
+ return fig
623
+
624
+ # Count articles by location
625
+ location_counts = df['extracted_location'].value_counts().reset_index()
626
+ location_counts.columns = ['Location', 'Article Count']
627
+
628
+ # Get top 15 locations
629
+ top_locations = location_counts.head(15)
630
+
631
+ # Create bar chart
632
+ fig = px.bar(
633
+ top_locations,
634
+ y='Location',
635
+ x='Article Count',
636
+ title=f'Top 15 Locations Mentioned in {newspaper_name} Articles',
637
+ orientation='h',
638
+ color='Article Count',
639
+ color_continuous_scale='Viridis'
640
+ )
641
+
642
+ fig.update_layout(
643
+ height=500,
644
+ yaxis={'categoryorder':'total ascending'}
645
+ )
646
+
647
+ return fig
648
+
649
+ def create_top_politicians_chart(df, newspaper_name):
650
+ """Create a bar chart of the most-mentioned politicians"""
651
+
652
+ if 'Politician' not in df.columns or df['Politician'].isna().all():
654
+ # Return an empty figure
655
+ fig = go.Figure()
656
+ fig.add_annotation(
657
+ text="No politician data available",
658
+ showarrow=False,
659
+ font=dict(size=20)
660
+ )
661
+ fig.update_layout(height=400)
662
+ return fig
663
+
664
+ # Plot the mention counts as provided (no top-N filtering is applied here)
665
+ top_locations = df
666
+
667
+ # Create bar chart
668
+ fig = px.bar(
669
+ top_locations,
670
+ y='Politician',
671
+ x='Mentions',
672
+ title=f'Top 10 Politicians Mentioned in {newspaper_name} Articles',
673
+ orientation='h',
674
+ color='Mentions',
675
+ color_continuous_scale='Viridis'
676
+ )
677
+
678
+ fig.update_layout(
679
+ height=500,
680
+ yaxis={'categoryorder':'total ascending'}
681
+ )
682
+
683
+ return fig
684
+
685
+ # Function to load data
686
+ @st.cache_data
687
+ def load_data(newspaper_name):
688
+ try:
689
+ # Try to load CSV file for the newspaper
690
+ file_path = f"data/{newspaper_name.lower().replace(' ', '_')}_articles.csv"
691
+ df = pd.read_csv(file_path)
692
+
693
+ # Check if required columns exist
694
+ required_columns = ['date', 'sentiment_value']
695
+ for col in required_columns:
696
+ if col not in df.columns:
697
+ st.error(f"Required column '{col}' not found in {file_path}")
698
+ return None
699
+
700
+ # Add sentiment score column if not exists
701
+ if 'sentiment_score' not in df.columns:
702
+ # Create a numeric sentiment score based on sentiment_value
703
+ sentiment_map = {
704
+ 'positive': 1.0,
705
+ 'negative': -1.0,
706
+ 'neutral': 0.0
707
+ }
708
+ df['sentiment_score'] = df['sentiment_value'].str.lower().map(sentiment_map).fillna(0)
709
+
710
+ return df
711
+
712
+ except Exception as e:
713
+ st.error(f"Error loading data for {newspaper_name}: {str(e)}")
714
+ return None
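Two conventions in `load_data` are easy to miss: the CSV path is derived from the newspaper name, and text labels are mapped to numeric scores with unknown values falling back to 0. The following sketch restates both as small pure functions; the names `data_path` and `sentiment_score` are hypothetical, while the path pattern and the score values are taken from the function above.

```python
SENTIMENT_SCORES = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}


def data_path(newspaper_name):
    """Derive the CSV path: lower-case the name and replace spaces with underscores."""
    return f"data/{newspaper_name.lower().replace(' ', '_')}_articles.csv"


def sentiment_score(label):
    """Map a sentiment label to a score; unknown or non-string labels give 0.0."""
    if not isinstance(label, str):
        return 0.0
    return SENTIMENT_SCORES.get(label.lower(), 0.0)


print(data_path("NDTV"))
print([sentiment_score(x) for x in ["Positive", "negative", "mixed", None]])
```

Following this pattern, the dashboard's four newspapers would be read from `data/print_articles.csv`, `data/scroll_articles.csv`, `data/sentinel_articles.csv`, and `data/ndtv_articles.csv`.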
715
+
716
+ # Entry page
717
+ def show_entry_page():
718
+ st.markdown('<div class="entry-page">', unsafe_allow_html=True)
719
+ st.markdown('<div class="entry-container">', unsafe_allow_html=True)
720
+
721
+ st.markdown('<h1 class="main-header">Sentiment Analysis of RSS Articles</h1>', unsafe_allow_html=True)
722
+ st.markdown("""
723
+ <p style="font-size: 1.2rem; margin-bottom: 2rem;">
724
+ Analyze the sentiments of news articles from various RSS feeds across different newspapers.
725
+ Discover trends, patterns, and insights through interactive visualizations.
726
+ </p>
727
+ """, unsafe_allow_html=True)
728
+
729
+ st.markdown('<div class="button-container">', unsafe_allow_html=True)
730
+ if st.button("Explore Analysis", key="entry_explore", use_container_width=True):
731
+ st.session_state.show_entry = False
732
+ st.markdown('</div>', unsafe_allow_html=True)
733
+
734
+ st.markdown('</div>', unsafe_allow_html=True)
735
+ st.markdown('</div>', unsafe_allow_html=True)
736
+
737
+ # Home page with newspaper cards
738
+ def show_home_page():
739
+ st.markdown('<h1 class="main-header">RSS Articles Sentiment Analysis Dashboard</h1>', unsafe_allow_html=True)
740
+
741
+ # List of newspapers
742
+ newspapers = ["Print", "Scroll", "Sentinel", "NDTV"]
743
+
744
+ # Load data for all newspapers
745
+ dataframes = []
746
+ for newspaper in newspapers:
747
+ df = load_data(newspaper)
748
+ dataframes.append(df)
749
+
750
+ # Show comparison chart of all newspapers
751
+ st.markdown('<h2 class="sub-header">Newspaper Sentiment Comparison</h2>', unsafe_allow_html=True)
752
+ comparison_fig = create_newspaper_comparison(dataframes, newspapers)
753
+ st.plotly_chart(comparison_fig, use_container_width=True)
754
+
755
+ # Create a 2x2 grid for newspaper cards
756
+ col1, col2 = st.columns(2)
757
+ col3, col4 = st.columns(2)
758
+ cols = [col1, col2, col3, col4]
759
+
760
+ # Create a card for each newspaper
761
+ for i, newspaper in enumerate(newspapers):
762
+ df = dataframes[i]
763
+ with cols[i]:
764
+ st.markdown('<div class="newspaper-card">', unsafe_allow_html=True)
765
+ st.markdown(f'<div class="newspaper-title">{newspaper}</div>', unsafe_allow_html=True)
766
+
767
+ # Only show counts if data is available
768
+ if df is not None:
769
+ # Count articles by sentiment
770
+ if 'sentiment_value' in df.columns:
771
+ sentiment_counts = df['sentiment_value'].str.lower().value_counts()
772
+
773
+ # Create three columns for sentiment counts
774
+ pos_col, neu_col, neg_col = st.columns(3)
775
+ with pos_col:
776
+ st.metric("Positive", sentiment_counts.get('positive', 0))
777
+ with neu_col:
778
+ st.metric("Neutral", sentiment_counts.get('neutral', 0))
779
+ with neg_col:
780
+ st.metric("Negative", sentiment_counts.get('negative', 0))
781
+ else:
782
+ st.write("Sentiment data not available")
783
+ else:
784
+ st.write("Data not available")
785
+
786
+ # Add button to view detailed analysis
787
+ if st.button("View Analysis", key=f"view_{newspaper}"):
788
+ st.session_state.current_newspaper = newspaper
789
+ st.session_state.show_newspaper_analysis = True
790
+ st.rerun()
791
+
792
+ st.markdown('</div>', unsafe_allow_html=True)
793
+
794
+ # Function to process all newspapers with location extraction
795
+ @st.cache_data
796
+ def preprocess_newspapers_with_locations(newspapers):
797
+ # Load GeoJSON for India map
798
+ india_geojson = load_india_geojson()
799
+ if india_geojson is None:
800
+ st.error("Could not load India GeoJSON. Please check your internet connection.")
801
+ return {}
802
+
803
+ processed_data = {}
804
+
805
+ for newspaper in newspapers:
806
+ # Load the raw data
807
+ df = load_data(newspaper)
808
+
809
+ if df is not None and 'desc' in df.columns:
810
+ # Extract locations if not already done
811
+ if 'extracted_location' not in df.columns:
812
+ df = extract_locations_from_descriptions(df, 'desc')
813
+
814
+ # Analyze sentiment by state
815
+ sentiment_by_state = analyze_sentiment_by_state(df)
816
+
817
+ processed_data[newspaper] = {
818
+ 'df': df,
819
+ 'sentiment_by_state': sentiment_by_state,
820
+ 'india_geojson': india_geojson
821
+ }
822
+ else:
823
+ if df is not None:
824
+ processed_data[newspaper] = {
825
+ 'df': df,
826
+ 'error': "Description column 'desc' not found"
827
+ }
828
+ else:
829
+ processed_data[newspaper] = {
830
+ 'error': f"Could not load data for {newspaper}"
831
+ }
832
+
833
+ return processed_data
834
+
835
+ # Newspaper analysis page
837
+ def show_newspaper_analysis():
838
+ # Add back button
839
+ if st.button("← Back to Home"):
840
+ st.session_state.show_newspaper_analysis = False
841
+ st.rerun()
842
+
843
+ newspaper = st.session_state.current_newspaper
844
+ st.markdown(f'<h1 class="main-header">{newspaper} - Sentiment Analysis</h1>', unsafe_allow_html=True)
845
+
846
+ # Load data for this newspaper
847
+ df = load_data(newspaper)
848
+
849
+ if df is not None:
850
+ # Get or preprocess location data
851
+ if 'processed_data' not in st.session_state:
852
+ with st.spinner("Processing newspaper data..."):
853
+ st.session_state.processed_data = preprocess_newspapers_with_locations(["Print", "Scroll", "Sentinel", "NDTV"])
854
+
855
+ processed_data = st.session_state.processed_data.get(newspaper, {})
856
+
857
+ # Display article count and date range
858
+ article_count = len(df)
859
+
860
+ # Convert date column to datetime to get min and max dates
861
+ df['date'] = pd.to_datetime(df['date'], errors='coerce')
862
+ min_date = df['date'].min().strftime('%d %b, %Y') if not pd.isna(df['date'].min()) else "Unknown"
863
+ max_date = df['date'].max().strftime('%d %b, %Y') if not pd.isna(df['date'].max()) else "Unknown"
864
+
865
+ # Create metrics row
866
+ col1, col2, col3 = st.columns(3)
867
+ with col1:
868
+ st.metric("Total Articles", f"{article_count:,}")
869
+ with col2:
870
+ st.metric("First Article", min_date)
871
+ with col3:
872
+ st.metric("Latest Article", max_date)
873
+
874
+ # Show sentiment trends by year
875
+ st.markdown('<h2 class="sub-header">Sentiment Trends Over Time</h2>', unsafe_allow_html=True)
876
+ try:
877
+ sentiment_trend_fig = plot_sentiment_trends_by_year(df, newspaper)
878
+ st.pyplot(sentiment_trend_fig)
879
+ except Exception as e:
880
+ st.error(f"Error generating sentiment trends chart: {str(e)}")
881
+
882
+ # Show article volume by year
883
+ st.markdown('<h2 class="sub-header">Article Volume by Year</h2>', unsafe_allow_html=True)
884
+ try:
885
+ volume_fig = plot_article_volume_by_year(df, newspaper)
886
+ st.pyplot(volume_fig)
887
+ except Exception as e:
888
+ st.error(f"Error generating article volume chart: {str(e)}")
889
+
890
+ # Create two columns for location analysis
891
+ col1, col2 = st.columns(2)
892
+
893
+ with col1:
894
+ # Top mentioned locations
895
+ st.markdown('<h2 class="sub-header">Top Mentioned Locations</h2>', unsafe_allow_html=True)
896
+
897
+ if 'extracted_location' in df.columns:
898
+ top_locations_fig = create_top_locations_chart(df, newspaper)
899
+ st.plotly_chart(top_locations_fig, use_container_width=True)
900
+ else:
901
+ if 'desc' in df.columns:
902
+ st.info("Location data not yet extracted. Click the button below to extract locations.")
903
+ if st.button("Extract Locations", key=f"extract_{newspaper}"):
904
+ with st.spinner("Extracting locations..."):
905
+ df = extract_locations_from_descriptions(df)
906
+ # Update the processed data
907
+ processed_data['df'] = df
908
+ sentiment_by_state = analyze_sentiment_by_state(df)
909
+ processed_data['sentiment_by_state'] = sentiment_by_state
910
+ st.session_state.processed_data[newspaper] = processed_data
911
+ st.rerun()
912
+ else:
913
+ st.warning("Description column not found. Cannot extract locations.")
914
+
915
+ # Top mentioned politicians - Now placed below the locations graph in the same column
916
+ st.markdown('<h2 class="sub-header">Top Mentioned Politicians</h2>', unsafe_allow_html=True)
917
+
918
+ if 'desc' in df.columns:
919
+ # Check if rss_personalities is defined, if not you'll need to define it
920
+ if 'rss_personalities' not in locals() and 'rss_personalities' not in globals():
921
+ # Define your list of politicians here or import it
922
+ rss_personalities = ["Narendra Modi", "Amit Shah", "Rajnath Singh", "Mohan Bhagwat", "Yogi Adityanath", "Nitin Gadkari"]
923
+
924
+ top_politicians = pussy.count_politicians_in_descriptions(df, rss_personalities).head(10)
925
+ top_politicians_fig = create_top_politicians_chart(top_politicians, newspaper)
926
+ st.plotly_chart(top_politicians_fig, use_container_width=True)
927
+ else:
928
+ st.warning("Description column not found. Cannot analyze politicians.")
929
+
930
+ with col2:
931
+ # Sentiment by state map
932
+ st.markdown('<h2 class="sub-header">Sentiment by State</h2>', unsafe_allow_html=True)
933
+
934
+ sentiment_by_state = processed_data.get('sentiment_by_state')
935
+ india_geojson = processed_data.get('india_geojson')
936
+
937
+ if sentiment_by_state is not None and india_geojson is not None and not sentiment_by_state.empty:
938
+ try:
939
+ map_fig = create_india_sentiment_map(sentiment_by_state, india_geojson, newspaper)
940
+ st.plotly_chart(map_fig, use_container_width=True)
941
+ except Exception as e:
942
+ st.error(f"Error creating sentiment map: {str(e)}")
943
+ else:
944
+ if 'error' in processed_data:
945
+ st.warning(processed_data['error'])
946
+ else:
947
+ st.info("Sentiment data not available. Extract locations first.")
948
+
949
+ # Add section for detailed article analysis
950
+ st.markdown('<h2 class="sub-header">Article Analysis</h2>', unsafe_allow_html=True)
951
+
952
+ # Add filters for article display
953
+ col1, col2, col3 = st.columns(3)
954
+
955
+ with col1:
956
+ # Sentiment filter
957
+ sentiment_options = ["All"] + sorted(df['sentiment_value'].dropna().unique().tolist())
958
+ selected_sentiment = st.selectbox("Filter by Sentiment", sentiment_options)
959
+
960
+ with col2:
961
+ # Year filter
962
+ year_options = ["All"] + sorted(df['date'].dt.year.dropna().unique().astype(int).tolist())
963
+ selected_year = st.selectbox("Filter by Year", year_options)
964
+
965
+ with col3:
966
+ # Location filter (if available)
967
+ location_options = ["All"]
968
+ if 'extracted_location' in df.columns:
969
+ location_options += sorted(df['extracted_location'].dropna().unique().tolist())
970
+ selected_location = st.selectbox("Filter by Location", location_options)
971
+
972
+ # Apply filters
973
+ filtered_df = df.copy()
974
+
975
+ if selected_sentiment != "All":
976
+ filtered_df = filtered_df[filtered_df['sentiment_value'] == selected_sentiment]
977
+
978
+ if selected_year != "All":
979
+ filtered_df = filtered_df[filtered_df['date'].dt.year == selected_year]
980
+
981
+ if selected_location != "All" and 'extracted_location' in filtered_df.columns:
982
+ filtered_df = filtered_df[filtered_df['extracted_location'] == selected_location]
983
+
984
+ # Show article count after filtering
985
+ st.write(f"Displaying {len(filtered_df)} articles based on your filters.")
986
+
987
+ # Display articles in an expandable format
988
+ if not filtered_df.empty:
989
+ for index, row in filtered_df.head(50).iterrows():
990
+ title = row.get('title', 'Untitled')
991
+ date = row['date'].strftime('%d %b, %Y') if pd.notna(row['date']) else 'Unknown date'
992
+ sentiment = row.get('sentiment_value', 'Unknown sentiment')
993
+ description = row.get('desc', 'No description available')
994
+ link = row.get('link', 'No link available')
995
+
996
+ # Format sentiment with color
997
+ sentiment_color = {
998
+ 'positive': 'green',
999
+ 'neutral': 'gray',
1000
+ 'negative': 'red'
1001
+ }.get(sentiment.lower(), 'gray')
1002
+
1003
+ # Create expandable card for each article
1004
+ with st.expander(f"{title} - {date}"):
1005
+ st.markdown(f"**Sentiment:** <span style='color:{sentiment_color}'>{sentiment.capitalize()}</span>", unsafe_allow_html=True)
1006
+
1007
+ if 'extracted_location' in row and pd.notna(row['extracted_location']):
1008
+ st.markdown(f"**Location:** {row['extracted_location']}")
1009
+
1010
+ st.markdown("**Description:**")
1011
+ st.markdown(f"{description}")
1012
+ st.markdown(f"**Link:** {link}")
1013
+
1014
+ if len(filtered_df) > 50:
1015
+ st.info(f"Showing 50 out of {len(filtered_df)} articles. Apply more filters to narrow down results.")
1016
+ else:
1017
+ st.info("No articles match your selected filters.")
1018
+ else:
1019
+ st.error(f"Could not load data for {newspaper}. Please check if the data file exists.")
1020
+
1021
+ # Main function
1022
+ def main():
1023
+ # Load CSS
1024
+ load_css()
1025
+
1026
+ # Initialize session state variables if not exists
1027
+ if 'show_entry' not in st.session_state:
1028
+ st.session_state.show_entry = True
1029
+
1030
+ if 'show_newspaper_analysis' not in st.session_state:
1031
+ st.session_state.show_newspaper_analysis = False
1032
+
1033
+ if 'current_newspaper' not in st.session_state:
1034
+ st.session_state.current_newspaper = None
1035
+
1036
+ # Display appropriate page based on session state
1037
+ if st.session_state.show_entry:
1038
+ show_entry_page()
1039
+ elif st.session_state.show_newspaper_analysis:
1040
+ show_newspaper_analysis()
1041
+ else:
1042
+ show_home_page()
1043
+
1044
+ # Footer
1045
+ st.markdown('<div class="footer">', unsafe_allow_html=True)
1046
+ st.markdown('RSS Sentiment Analysis Dashboard - Developed with Streamlit', unsafe_allow_html=True)
1047
+ st.markdown('</div>', unsafe_allow_html=True)
1048
+
1049
+ if __name__ == "__main__":
1050
+ main()
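Review note (a sketch, not part of the commit): the home page lower-cases `sentiment_value` before counting, while the article filter on the analysis page compares raw labels. Assuming labels can differ in case or be missing, one way to normalise them with only the standard library is shown below. `count_sentiments` is a hypothetical helper and does not exist in this app.

```python
from collections import Counter


def count_sentiments(labels):
    """Count sentiment labels case-insensitively, skipping missing values."""
    counts = Counter()
    for label in labels:
        # Non-string entries (None, NaN floats) are treated as missing.
        if isinstance(label, str) and label.strip():
            counts[label.strip().lower()] += 1
    return counts


if __name__ == "__main__":
    sample = ["Positive", "positive ", "NEGATIVE", None]
    counts = count_sentiments(sample)
    # Missing categories fall back to 0, as the dashboard metrics do.
    print(counts.get("positive", 0), counts.get("neutral", 0), counts.get("negative", 0))
```

Applying the same normalisation before both counting and filtering would keep the metric cards and the article list consistent with each other.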
src/requirements.txt ADDED
@@ -0,0 +1,6 @@
1
+ selenium
2
+ beautifulsoup4
3
+ requests
4
+ webdriver-manager
5
+ pandas
6
+ tqdm
src/run_scraper.py ADDED
@@ -0,0 +1,110 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Main entry point for running the news scrapers.
4
+ This script acts as a unified interface to run any of the available scrapers.
5
+ """
6
+
7
+ import os
8
+ import sys
9
+ import argparse
10
+ import importlib
11
+ from datetime import datetime
12
+
13
+ # Make sure the output directory exists
14
+ os.makedirs('output', exist_ok=True)
15
+
16
+ def parse_arguments():
17
+ """Parse command line arguments"""
18
+ parser = argparse.ArgumentParser(description='Scrape news articles from Indian news websites')
19
+
20
+ # Required arguments
21
+ parser.add_argument('--source', '-s', type=str, required=True,
22
+ choices=['toi', 'ndtv', 'wion', 'scroll'],
23
+ help='News source to scrape from')
24
+ parser.add_argument('--topic', '-t', type=str, required=True,
25
+ help='Topic to search for (e.g., "Climate Change", "Politics")')
26
+
27
+ # Optional arguments
28
+ parser.add_argument('--workers', '-w', type=int, default=4,
29
+ help='Number of worker threads (default: 4)')
30
+ parser.add_argument('--interval', '-i', type=int, default=300,
31
+ help='Auto-save interval in seconds (default: 300)')
32
+
33
+ return parser.parse_args()
34
+
35
+ def get_scraper_class(source):
36
+ """Dynamically import the appropriate scraper class based on the source"""
37
+ source_map = {
38
+ 'toi': ('scrapers.toi_scraper', 'TOIArticleScraper'),
39
+ 'ndtv': ('scrapers.ndtv_scraper', 'NDTVArticleScraper'),
40
+ 'wion': ('scrapers.wion_scraper', 'WIONArticleScraper'),
41
+ 'scroll': ('scrapers.scroll_scraper', 'ScrollArticleScraper')
42
+ }
43
+
44
+ if source not in source_map:
45
+ raise ValueError(f"Unsupported news source: {source}")
46
+
47
+ module_name, class_name = source_map[source]
48
+ module = importlib.import_module(module_name)
49
+ return getattr(module, class_name)
50
+
51
+ def main():
52
+ """Main function to run the scraper based on command line arguments"""
53
+ args = parse_arguments()
54
+
55
+ try:
56
+ print(f"\n--- Indian News Scraper ---")
57
+ print(f"Source: {args.source}")
58
+ print(f"Topic: {args.topic}")
59
+ print(f"Workers: {args.workers}")
60
+ print(f"Auto-save interval: {args.interval} seconds")
61
+ print("---------------------------\n")
62
+
63
+ # Get the appropriate scraper class
64
+ ScrapeClass = get_scraper_class(args.source)
65
+
66
+ # Initialize the scraper
67
+ scraper = ScrapeClass(max_workers=args.workers)
68
+ scraper.save_interval = args.interval
69
+
70
+ # Configure output directory
71
+ os.chdir('output')
72
+
73
+ print(f"Starting to scrape {args.topic}-related articles from {args.source.upper()}...")
74
+ print("Press Ctrl+C at any time to save progress and exit.")
75
+
76
+ # Run the scraper (accounting for different parameter names in different scrapers)
77
+ if args.source == 'toi':
78
+ topic_url = f"{scraper.base_url}/topic/{args.topic}/news"
79
+ final_csv = scraper.scrape_topic(topic_url, args.topic)
80
+ elif args.source == 'ndtv':
81
+ final_csv = scraper.scrape_topic(args.topic)
82
+ elif args.source == 'wion' or args.source == 'scroll':
83
+ final_csv = scraper.scrape_topic(args.topic.lower(), args.topic)
84
+
85
+ # Print results
86
+ if final_csv:
87
+ article_count = len(scraper.articles) if hasattr(scraper, 'articles') else len(scraper.scraped_articles)
88
+ print(f"\nArticles have been saved to: {final_csv}")
89
+ print(f"Total articles scraped: {article_count}")
90
+ else:
91
+ print("\nError saving to final CSV file")
92
+
93
+ except KeyboardInterrupt:
94
+ print("\nProcess interrupted by user. Saving progress...")
95
+ articles = (getattr(scraper, 'articles', None) or getattr(scraper, 'scraped_articles', [])) if 'scraper' in locals() else []
96
+ if articles:
97
+ scraper.save_progress(args.topic, force=True)
98
+ print("Saved progress and exiting.")
99
+ except Exception as e:
100
+ print(f"\nAn error occurred: {str(e)}")
101
+ articles = (getattr(scraper, 'articles', None) or getattr(scraper, 'scraped_articles', [])) if 'scraper' in locals() else []
102
+ if articles:
103
+ scraper.save_progress(args.topic, force=True)
104
+ print("Saved progress despite error.")
105
+ return 1
106
+
107
+ return 0
108
+
109
+ if __name__ == "__main__":
110
+ sys.exit(main())
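Review note (a sketch, not part of the commit): the TOI branch interpolates `args.topic` straight into the URL path, so a topic such as "Climate Change" puts a raw space in the URL. Assuming the site accepts percent-encoded topics (not verified here), the path segment could be encoded first. `build_topic_url` is a hypothetical helper and the base URL below is a placeholder.

```python
from urllib.parse import quote


def build_topic_url(base_url, topic):
    """Return a '<base>/topic/<topic>/news' URL with the topic percent-encoded."""
    # safe='' also encodes '/', so a topic can never add extra path segments.
    encoded_topic = quote(topic, safe="")
    return f"{base_url.rstrip('/')}/topic/{encoded_topic}/news"


if __name__ == "__main__":
    # Placeholder base URL; the real value comes from the scraper's base_url.
    print(build_topic_url("https://example.com/", "Climate Change"))
```

Stripping the trailing slash from the base keeps the result stable whether or not `base_url` ends in `/`.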
src/scrapers/__init__.py ADDED
@@ -0,0 +1,6 @@
1
+ """
2
+ News scraper package for Indian news websites.
3
+ Contains scrapers for TOI, NDTV, WION, and Scroll.in.
4
+ """
5
+
6
+ __version__ = '1.0.0'
src/scrapers/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (307 Bytes)
src/scrapers/__pycache__/toi_scraper.cpython-312.pyc ADDED
Binary file (17.4 kB)
src/scrapers/ndtv_scraper.py ADDED
@@ -0,0 +1,338 @@
1
+ import os
2
+ import time
3
+ import logging
4
+ import csv
5
+ import signal
6
+ import sys
7
+ import argparse
8
+ from datetime import datetime
9
+ from concurrent.futures import ThreadPoolExecutor, as_completed
10
+ from selenium import webdriver
11
+ from selenium.webdriver.chrome.service import Service
12
+ from selenium.webdriver.common.by import By
13
+ from selenium.webdriver.support.ui import WebDriverWait
14
+ from selenium.webdriver.support import expected_conditions as EC
15
+ from selenium.common.exceptions import (
16
+ TimeoutException, NoSuchElementException, ElementClickInterceptedException
17
+ )
18
+ from bs4 import BeautifulSoup
19
+ from utils.webdriver_utils import create_chrome_driver, wait_for_page_load, scroll_to_element
20
+
21
+ class NDTVArticleScraper:
22
+ def __init__(self, max_workers=4):
23
+ self.max_workers = max_workers
24
+ self.base_url = "https://www.ndtv.com"
25
+ self.fetched_articles = set()
26
+ self.scraped_articles = []
27
+ self.last_save_time = time.time()
28
+ self.save_interval = 300 # Save every 5 minutes
29
+ self.is_interrupted = False
30
+
31
+ logging.basicConfig(
32
+ level=logging.INFO,
33
+ format='%(asctime)s - %(levelname)s - %(message)s'
34
+ )
35
+ self.logger = logging.getLogger(__name__)
36
+
37
+ # Set up signal handlers
38
+ signal.signal(signal.SIGINT, self.signal_handler)
39
+ signal.signal(signal.SIGTERM, self.signal_handler)
40
+
41
+ def signal_handler(self, signum, frame):
42
+ """Handle interrupt signals"""
43
+ print("\nReceived interrupt signal. Saving progress and shutting down...")
44
+ self.is_interrupted = True
45
+ if self.scraped_articles:
46
+ self.save_progress("interrupted", force=True)
47
+ sys.exit(0)
48
+
49
+ # def create_driver(self):
50
+ # """Create and return a new Chrome driver instance"""
51
+ # chrome_options = webdriver.ChromeOptions()
52
+ # chrome_options.add_argument('--headless')
53
+ # chrome_options.add_argument('--no-sandbox')
54
+ # chrome_options.add_argument('--disable-dev-shm-usage')
55
+ # chrome_options.add_argument('--disable-extensions')
56
+ # chrome_options.page_load_strategy = 'eager'
57
+ # return webdriver.Chrome(options=chrome_options)
58
+
59
+ def create_driver(self):
60
+ """Create and return a new Chrome driver instance"""
61
+ return create_chrome_driver(headless=True, load_images=False)
62
+
63
+ def scroll_to_element(self, driver, element):
64
+ try:
65
+ driver.execute_script("arguments[0].scrollIntoView(true);", element)
66
+ driver.execute_script("window.scrollBy(0, -100);")
67
+ except Exception as e:
68
+ self.logger.error(f"Error scrolling to element: {str(e)}")
69
+
70
+ def extract_visible_articles(self, driver):
71
+ """Extract articles currently visible on the page"""
72
+ soup = BeautifulSoup(driver.page_source, 'html.parser')
73
+ article_containers = soup.find_all('div', class_='SrchLstPg-a')
74
+
75
+ new_articles = []
76
+ for container in article_containers:
77
+ link = container.find('a', href=True)
78
+ date = self.extract_date(container)
79
+
80
+ if link:
81
+ full_link = link['href'] if link['href'].startswith('http') else self.base_url + link['href']
82
+ if full_link not in self.fetched_articles:
83
+ self.fetched_articles.add(full_link)
84
+ new_articles.append({'link': full_link, 'date': date})
85
+
86
+ self.logger.info(f"Found {len(new_articles)} new articles")
87
+ return new_articles
88
+
89
+ def save_progress(self, topic, force=False):
90
+ """Save current progress to CSV"""
91
+ current_time = time.time()
92
+ if force or (current_time - self.last_save_time >= self.save_interval and self.scraped_articles):
93
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
94
+ filename = f"ndtv_{topic}articles_{timestamp}_partial.csv"
95
+ try:
96
+ with open(filename, 'w', newline='', encoding='utf-8') as file:
97
+ writer = csv.DictWriter(file, fieldnames=['title', 'desc', 'date', 'link'])
98
+ writer.writeheader()
99
+ writer.writerows(self.scraped_articles)
100
+ self.last_save_time = current_time
101
+ print(f"\nProgress saved to {filename}: {len(self.scraped_articles)} articles")
102
+ self.logger.info(f"Progress saved to {filename}: {len(self.scraped_articles)} articles")
103
+ except Exception as e:
104
+ self.logger.error(f"Error saving progress: {str(e)}")
105
+
106
+ def get_article_links(self, topic_url, topic):
107
+ driver = self.create_driver()
108
+ all_articles = []
109
+
110
+ try:
111
+ driver.get(topic_url)
112
+ time.sleep(2)
113
+
114
+ # First scrape visible articles
115
+ initial_articles = self.extract_visible_articles(driver)
116
+ all_articles.extend(initial_articles)
117
+ self.logger.info(f"Initially found {len(initial_articles)} visible articles")
118
+
119
+ # Process these articles first
120
+ if initial_articles:
121
+ self.process_articles_batch(initial_articles)
122
+ self.save_progress(topic) # Save after initial batch
123
+
124
+ # Then continue with "Load More" and scrape new articles
125
+ attempt = 0
126
+ while not self.is_interrupted:
127
+ try:
128
+ parent_div = WebDriverWait(driver, 5).until(
129
+ EC.presence_of_element_located((By.CLASS_NAME, "btn_bm-cntr"))
130
+ )
131
+ load_more_button = parent_div.find_element(By.TAG_NAME, "a")
132
+ self.scroll_to_element(driver, load_more_button)
133
+ driver.execute_script("arguments[0].click();", load_more_button)
134
+ attempt += 1
135
+ print(f"Clicked 'Load More' (attempt {attempt})")
136
+ time.sleep(2)
137
+
138
+ # Extract newly loaded articles
139
+ new_articles = self.extract_visible_articles(driver)
140
+ new_count = len(new_articles)
141
+ print(f"Found {new_count} new articles after clicking load more")
142
+
143
+ new_articles = [article for article in new_articles if article['link'] not in {a['link'] for a in all_articles}]
144
+ all_articles.extend(new_articles)
145
+
146
+ # Process new batch of articles
147
+ if new_articles:
148
+ self.process_articles_batch(new_articles)
149
+ self.save_progress(topic) # Save after each batch
150
+
151
+ except (NoSuchElementException, TimeoutException, ElementClickInterceptedException):
152
+ self.logger.info("No more 'Load More' button found or it is no longer clickable.")
153
+ break
154
+ except Exception as e:
155
+ self.logger.error(f"Error during load more operation: {str(e)}")
156
+ self.save_progress(topic, force=True)
157
+ break
158
+
159
+ return all_articles
160
+
161
+ except Exception as e:
162
+ self.logger.error(f"Error getting article links: {str(e)}")
163
+ self.save_progress(topic, force=True)
164
+ return all_articles
165
+ finally:
166
+ driver.quit()
167
+
168
+ def process_articles_batch(self, articles):
169
+ """Process a batch of articles immediately"""
170
+ self.logger.info(f"Processing batch of {len(articles)} articles")
171
+ with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
172
+ futures = [executor.submit(self.scrape_article_parallel, article) for article in articles]
173
+ for future in as_completed(futures):
174
+ try:
175
+ article = future.result()
176
+ if article:
177
+ self.scraped_articles.append(article)
178
+ self.logger.info(f"Successfully scraped: {article['title'][:50]}...")
179
+ except Exception as e:
180
+ self.logger.error(f"Error processing article: {str(e)}")
181
+
182
+ def scrape_article_parallel(self, article_data):
183
+ """Scrape a single article with its own driver instance"""
184
+ driver = self.create_driver()
185
+ try:
186
+ url = article_data['link']
187
+ driver.get(url)
188
+
189
+ try:
190
+ title_element = WebDriverWait(driver, 5).until(
191
+ EC.presence_of_element_located((By.TAG_NAME, "h1"))
192
+ )
193
+ except TimeoutException:
194
+ pass
195
+
196
+ soup = BeautifulSoup(driver.page_source, 'html.parser')
197
+
198
+ # Extract title
199
+ title = None
200
+ title_element = soup.find('h1')
201
+ if title_element:
202
+ title = title_element.get_text().strip()
203
+ desc = None
204
+ desc_element = soup.find('div', class_='Art-exp_wr')
205
+ if desc_element:
206
+
207
+ paragraphs = desc_element.find_all('p')
208
+ if paragraphs:
209
+ desc = ' '.join([p.get_text().strip() for p in paragraphs])
210
+ else:
211
+ desc = desc_element.get_text().strip()
212
+
213
+ # Clean up the description
214
+ if desc:
215
+ desc = desc.replace("<br>", "").replace("<!-- -->", "").replace("<div>", "").strip()
216
+ # desc = desc_element.get_text().strip("<br>").strip("<!-- -->").strip('<div>').strip("\n")
217
+
218
+ if not title:
219
+ meta_title = soup.find('meta', property='og:title')
220
+ if meta_title:
221
+ title = meta_title.get('content', '').strip()
222
+
223
+ return {
224
+ 'title': title or 'Title not found',
225
+ 'desc': desc or 'Description not found',
226
+ 'date': article_data['date'] or 'Date not found',
227
+ 'link': url
228
+ }
229
+
230
+ except Exception as e:
231
+ self.logger.error(f"Error scraping article {url}: {str(e)}")
232
+ return None
233
+ finally:
234
+ driver.quit()
235
+
236
+ def extract_date(self, container):
237
+ try:
238
+ date_div = container.find('span', class_='pst-by_lnk')
239
+ if date_div:
240
+ contents = list(date_div.descendants)[0].split(" ")[1:]
241
+ date = ''
242
+ for content in contents:
243
+ date += content + ' '
244
+ return date.rstrip(' ')
245
+ return None
246
+ except Exception as e:
247
+ self.logger.error(f"Error extracting date: {str(e)}")
248
+ return None
249
+
250
+ def scrape_topic(self, topic):
251
+ try:
252
+ topic_url = f"{self.base_url}/search?searchtext={topic}"
253
+ self.logger.info(f"Scraping topic: {topic_url}")
254
+
255
+ articles = self.get_article_links(topic_url, topic)
256
+ self.logger.info(f"Found total of {len(articles)} articles")
257
+
258
+ # Final save with complete data
259
+ if self.scraped_articles:
260
+ return self.save_to_csv(self.scraped_articles, topic, final=True)
261
+ return None
262
+
263
+ except Exception as e:
264
+ self.logger.error(f"Error in scrape_topic: {str(e)}")
265
+ if self.scraped_articles:
266
+ return self.save_to_csv(self.scraped_articles, topic, final=True)
267
+ return None
268
+
269
+ def save_to_csv(self, articles, topic, final=False):
270
+ if not articles:
271
+ self.logger.error("No articles to save")
272
+ return None
273
+
274
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
275
+ filename = f"ndtv_{topic}articles_{timestamp}_{'final' if final else 'partial'}.csv"
276
+
277
+ try:
278
+ with open(filename, 'w', newline='', encoding='utf-8') as file:
279
+ writer = csv.DictWriter(file, fieldnames=['title', 'desc', 'date', 'link'])
280
+ writer.writeheader()
281
+ writer.writerows(articles)
282
+ self.logger.info(f"Successfully saved {len(articles)} articles to: {filename}")
283
+ return filename
284
+ except Exception as e:
285
+ self.logger.error(f"Error saving to CSV: {str(e)}")
286
+ return None
287
+
288
+ def parse_arguments():
289
+ """Parse command line arguments"""
290
+ parser = argparse.ArgumentParser(description='Scrape articles from NDTV by topic')
291
+
292
+ parser.add_argument('topic', type=str,
293
+ help='Topic to scrape (e.g., "RSS", "Covid", "India")')
294
+
295
+ parser.add_argument('-w', '--workers', type=int, default=4,
296
+ help='Number of worker threads (default: 4)')
297
+
298
+ parser.add_argument('-i', '--interval', type=int, default=300,
299
+ help='Auto-save interval in seconds (default: 300)')
300
+
301
+ return parser.parse_args()
302
+
303
+ def main():
304
+ try:
305
+ # Parse command line arguments
306
+ args = parse_arguments()
307
+
308
+ # Initialize the scraper with command line arguments
309
+ scraper = NDTVArticleScraper(max_workers=args.workers)
310
+ scraper.save_interval = args.interval
311
+
312
+ # Get topic from command line argument
313
+ topic = args.topic
314
+
315
+ print(f"\nScraping {topic}-related articles from NDTV...")
316
+ print("Press Ctrl+C at any time to save progress and exit.")
317
+
318
+ final_csv = scraper.scrape_topic(topic)
319
+
320
+ if final_csv:
321
+ print(f"\nArticles have been saved to: {final_csv}")
322
+ print(f"Total articles scraped: {len(scraper.scraped_articles)}")
323
+ else:
324
+ print("\nError saving to final CSV file")
325
+
326
+ except KeyboardInterrupt:
327
+ print("\nProcess interrupted by user. Saving progress...")
328
+ if scraper.scraped_articles:
329
+ scraper.save_progress(topic, force=True)
330
+ print("Saved progress and exiting.")
331
+ except Exception as e:
332
+ print(f"\nAn error occurred: {str(e)}")
333
+ if 'scraper' in locals() and scraper.scraped_articles:
334
+ scraper.save_progress(topic, force=True)
335
+ print("Saved progress despite error.")
336
+
337
+ if __name__ == "__main__":
338
+ main()
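Review note (a sketch, not part of the commit): building the search URL and absolute article links by string concatenation depends on how `base_url` and each `href` handle slashes, and the topic is not URL-encoded. The standard library's `urljoin` and `urlencode` cover both cases. `build_search_url` and `absolute_link` are hypothetical helpers. The `searchtext` parameter name is taken from the scraper above.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.ndtv.com/"


def build_search_url(topic):
    """Return the search URL for a topic, with the query string encoded."""
    return urljoin(BASE_URL, "search") + "?" + urlencode({"searchtext": topic})


def absolute_link(href):
    """Resolve a possibly relative href against the site root."""
    # urljoin leaves absolute URLs untouched and avoids doubled slashes.
    return urljoin(BASE_URL, href)


if __name__ == "__main__":
    print(build_search_url("Climate Change"))
    print(absolute_link("/india/some-story"))
```

Because `urljoin` returns absolute links unchanged, the explicit `startswith('http')` check would no longer be needed.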
src/scrapers/scroll_scraper.py ADDED
@@ -0,0 +1,314 @@
1
+ import os
2
+ import time
3
+ import logging
4
+ import csv
5
+ import signal
6
+ import sys
7
+ import argparse
8
+ from datetime import datetime
9
+ from concurrent.futures import ThreadPoolExecutor, as_completed
10
+ from selenium import webdriver
11
+ from selenium.webdriver.chrome.service import Service
12
+ from selenium.webdriver.common.by import By
13
+ from selenium.webdriver.support.ui import WebDriverWait
14
+ from selenium.webdriver.support import expected_conditions as EC
15
+ from selenium.common.exceptions import (
16
+ TimeoutException, NoSuchElementException, ElementClickInterceptedException
17
+ )
18
+ from bs4 import BeautifulSoup
19
+ from utils.webdriver_utils import create_chrome_driver, scroll_to_element
20
+
21
+ class ScrollArticleScraper:
22
+ def __init__(self, max_workers=4, articles_per_page=10):
23
+ self.max_workers = max_workers
24
+ self.articles_per_page = articles_per_page
25
+ self.base_url = "https://scroll.in/search"
26
+ self.fetched_articles = set()
27
+ self.articles = []
28
+ self.is_interrupted = False
29
+ self.last_save_time = time.time()
30
+ self.save_interval = 300 # Save every 5 minutes
31
+
32
+ logging.basicConfig(
33
+ level=logging.INFO,
34
+ format='%(asctime)s - %(levelname)s - %(message)s'
35
+ )
36
+ self.logger = logging.getLogger(__name__)
37
+
38
+ # Set up signal handlers
39
+ signal.signal(signal.SIGINT, self.signal_handler)
40
+ signal.signal(signal.SIGTERM, self.signal_handler)
41
+
42
+ def signal_handler(self, signum, frame):
43
+ """Handle interrupt signals"""
44
+ print("\nReceived interrupt signal. Saving progress and shutting down...")
45
+ self.is_interrupted = True
46
+ if self.articles:
47
+ self.save_progress("interrupted", force=True)
48
+ sys.exit(0)
49
+
50
+ # def create_driver(self):
51
+ # """Create and return a new Chrome driver instance"""
52
+ # chrome_options = webdriver.ChromeOptions()
53
+ # chrome_options.add_argument('--headless')
54
+ # chrome_options.add_argument('--no-sandbox')
55
+ # chrome_options.add_argument('--disable-dev-shm-usage')
56
+ # chrome_options.add_argument('--disable-extensions')
57
+ # chrome_options.page_load_strategy = 'eager'
58
+ # return webdriver.Chrome(options=chrome_options)
59
+ def create_driver(self):
60
+ """Create and return a new Chrome driver instance"""
61
+ return create_chrome_driver(headless=True, load_images=False)
62
+
63
+ def get_total_pages(self, driver, search_term):
64
+ """Get the total number of pages for the search term"""
65
+ try:
66
+ driver.get(f"{self.base_url}?q={search_term}&page=1")
67
+ time.sleep(2)
68
+ soup = BeautifulSoup(driver.page_source, 'html.parser')
69
+
70
+ # Scroll.in might have a different pagination structure
71
+ pagination = soup.find('div', class_='page-numbers')
72
+ if pagination:
73
+ pages = pagination.find_all('a', class_='page-number')
74
+ if pages:
75
+ # Get the last page number
76
+ last_page = max(int(page.text.strip()) for page in pages if page.text.strip().isdigit())
77
+ return last_page
78
+
79
+ # Fallback to 144 pages if pagination not found
80
+ return 144
81
+ except Exception as e:
82
+ self.logger.error(f"Error getting total pages: {str(e)}")
83
+ return 144 # Default to 144 pages
84
+ def extract_visible_articles(self, driver):
85
+ soup = BeautifulSoup(driver.page_source, 'html.parser')
86
+ article_containers = soup.find_all('li', itemscope=True, itemtype="https://schema.org/NewsArticle")
87
+
88
+ new_articles = []
89
+ for container in article_containers:
90
+ link_element = container.find('a', href=True)
91
+ date_element = container.find('time')
92
+
93
+ if link_element and link_element['href']:
94
+ full_link = link_element['href'] if link_element['href'].startswith('http') else 'https://scroll.in' + link_element['href']
95
+
96
+ # Extract date
97
+ date = None
98
+ if date_element:
99
+ date_text = date_element.get_text(strip=True)
100
+ date_text = date_text.replace('Â', '').replace('·', '').strip()
101
+ try:
102
+ # Try parsing the date
103
+ parsed_date = datetime.strptime(date_text, '%b %d, %Y')
104
+ date = parsed_date.strftime('%Y-%m-%d')
105
+ except ValueError:
106
+ try:
107
+ # Alternative date format
108
+ parsed_date = datetime.strptime(date_text, '%B %d, %Y')
109
+ date = parsed_date.strftime('%Y-%m-%d')
110
+ except ValueError:
111
+ date = date_text
112
+
113
+ if full_link not in self.fetched_articles:
114
+ self.fetched_articles.add(full_link)
115
+ new_articles.append({'link': full_link, 'date': date})
116
+
117
+ return new_articles
118
+
119
+ def scrape_topic(self, search_term, topic):
120
+ """Scrape articles from all pages for a given search term"""
121
+ driver = self.create_driver()
122
+ try:
123
+ total_pages = self.get_total_pages(driver, search_term)
124
+ total_expected_articles = total_pages * self.articles_per_page
125
+ self.logger.info(f"Found {total_pages} pages to scrape (approximately {total_expected_articles} articles)")
126
+
127
+ for page in range(1, total_pages + 1):
128
+ if self.is_interrupted:
129
+ break
130
+
131
+ page_url = f"{self.base_url}?q={search_term}&page={page}"
132
+ articles_scraped = len(self.articles)
133
+ progress_percentage = (articles_scraped / total_expected_articles) * 100 if total_expected_articles else 0.0
134
+
135
+ self.logger.info(f"Scraping page {page}/{total_pages} - Progress: {articles_scraped}/{total_expected_articles} articles ({progress_percentage:.1f}%)")
136
+
137
+ try:
138
+ driver.get(page_url)
139
+ time.sleep(2) # Allow page to load
140
+
141
+ new_articles = self.extract_visible_articles(driver)
142
+ if new_articles:
143
+ self.process_articles_batch(new_articles)
144
+ self.logger.info(f"Scraped {len(new_articles)}/{self.articles_per_page} articles from page {page}")
145
+ self.save_progress(topic)
146
+ else:
147
+ self.logger.warning(f"No articles found on page {page}")
148
+
149
+ except Exception as e:
150
+ self.logger.error(f"Error scraping page {page}: {str(e)}")
151
+ continue
152
+
153
+ if self.articles:
154
+ return self.save_to_csv(self.articles, topic, final=True)
155
+ return None
156
+
157
+ except Exception as e:
158
+ self.logger.error(f"Error in scrape_topic: {str(e)}")
159
+ if self.articles:
160
+ return self.save_to_csv(self.articles, topic, final=True)
161
+ return None
162
+ finally:
163
+ driver.quit()
164
+
165
+ def scrape_article_parallel(self, article_data):
166
+ """Scrape a single article with its own driver instance"""
167
+ driver = self.create_driver()
168
+ try:
169
+ url = article_data['link']
170
+ driver.get(url)
171
+ time.sleep(2)
172
+
173
+ soup = BeautifulSoup(driver.page_source, 'html.parser')
174
+
175
+ # Extract title
176
+ title = None
177
+ title_element = soup.find('h1') # Simple h1 search
178
+ if title_element:
179
+ title = title_element.get_text().strip()
180
+
181
+ # Extract description
182
+ desc = None
183
+ article_body = soup.find('section', class_='article-content')
184
+ if article_body:
185
+ paragraphs = article_body.find_all('p')
186
+ desc = ' '.join([p.get_text().strip() for p in paragraphs])
187
+
188
+ return {
189
+ 'title': title or 'Title not found',
190
+ 'desc': desc or 'Description not found',
191
+ 'date': article_data['date'] or 'Date not found',
192
+ 'link': url
193
+ }
194
+
195
+ except Exception as e:
196
+ self.logger.error(f"Error scraping article {url}: {str(e)}")
197
+ return None
198
+ finally:
199
+ driver.quit()
200
+
201
+ def process_articles_batch(self, article_batch):
202
+ """Process a batch of articles in parallel"""
203
+ with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
204
+ futures = [executor.submit(self.scrape_article_parallel, article_data)
205
+ for article_data in article_batch]
206
+
207
+ successful_articles = 0
208
+ for future in as_completed(futures):
209
+ try:
210
+ article = future.result()
211
+ if article:
212
+ self.articles.append(article)
213
+ successful_articles += 1
214
+ self.logger.info(f"Successfully scraped: {article['title'][:50]}...")
215
+ except Exception as e:
216
+ self.logger.error(f"Error processing article: {str(e)}")
217
+
218
+ if successful_articles < len(article_batch):
219
+ self.logger.warning(f"Only processed {successful_articles}/{len(article_batch)} articles in this batch")
220
+
221
+ def save_progress(self, topic, force=False):
222
+ """Save current progress to CSV"""
223
+ current_time = time.time()
224
+ if force or (current_time - self.last_save_time >= self.save_interval and self.articles):
225
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
226
+ filename = f"scroll_{topic}articles_{timestamp}_partial.csv"
227
+ try:
228
+ with open(filename, 'w', newline='', encoding='utf-8') as file:
229
+ writer = csv.DictWriter(file, fieldnames=['title', 'desc', 'date', 'link'])
230
+ writer.writeheader()
231
+ writer.writerows(self.articles)
232
+ self.last_save_time = current_time
233
+ print(f"\nProgress saved to {filename}: {len(self.articles)} articles")
234
+ self.logger.info(f"Progress saved to {filename}: {len(self.articles)} articles")
235
+ except Exception as e:
236
+ self.logger.error(f"Error saving progress: {str(e)}")
237
+
238
+ def save_to_csv(self, articles, topic, final=False):
239
+ """Save articles to CSV file"""
240
+ if not articles:
241
+ self.logger.error("No articles to save")
242
+ return None
243
+
244
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
245
+ filename = f"scroll_{topic}articles_{timestamp}_{'final' if final else 'partial'}.csv"
246
+
247
+ try:
248
+ with open(filename, 'w', newline='', encoding='utf-8') as file:
249
+ writer = csv.DictWriter(file, fieldnames=['title', 'desc', 'date', 'link'])
250
+ writer.writeheader()
251
+ writer.writerows(articles)
252
+ self.logger.info(f"Successfully saved {len(articles)} articles to: {filename}")
253
+ return filename
254
+ except Exception as e:
255
+ self.logger.error(f"Error saving to CSV: {str(e)}")
256
+ return None
257
+
258
+ def parse_arguments():
+     """Parse command line arguments"""
+     parser = argparse.ArgumentParser(description='Scrape articles from Scroll.in by topic')
+
+     parser.add_argument('topic', type=str,
+                         help='Topic to scrape (e.g., "RSS", "Covid", "India")')
+
+     parser.add_argument('-w', '--workers', type=int, default=4,
+                         help='Number of worker threads (default: 4)')
+
+     parser.add_argument('-i', '--interval', type=int, default=300,
+                         help='Auto-save interval in seconds (default: 300)')
+
+     parser.add_argument('-a', '--articles-per-page', type=int, default=10,
+                         help='Expected number of articles per page (default: 10)')
+
+     return parser.parse_args()
275
+
276
+ def main():
277
+ try:
278
+ # Parse command line arguments
279
+ args = parse_arguments()
280
+
281
+ # Initialize the scraper with command line arguments
282
+ scraper = ScrollArticleScraper(
283
+ max_workers=args.workers,
284
+ articles_per_page=args.articles_per_page
285
+ )
286
+ scraper.save_interval = args.interval
287
+
288
+ # Get topic from command line argument
289
+ topic = args.topic
290
+
291
+ print(f"\nScraping {topic}-related articles from Scroll.in...")
292
+ print("Press Ctrl+C at any time to save progress and exit.")
293
+
294
+ final_csv = scraper.scrape_topic(topic.lower(), topic)
295
+
296
+ if final_csv:
297
+ print(f"\nArticles have been saved to: {final_csv}")
298
+ print(f"Total articles scraped: {len(scraper.articles)}")
299
+ else:
300
+ print("\nError saving to final CSV file")
301
+
302
+ except KeyboardInterrupt:
303
+ print("\nProcess interrupted by user. Saving progress...")
304
+ if 'scraper' in locals() and scraper.articles:
305
+ scraper.save_progress(topic, force=True)
306
+ print("Saved progress and exiting.")
307
+ except Exception as e:
308
+ print(f"\nAn error occurred: {str(e)}")
309
+ if 'scraper' in locals() and scraper.articles:
310
+ scraper.save_progress(topic, force=True)
311
+ print("Saved progress despite error.")
312
+
313
+ if __name__ == "__main__":
314
+ main()
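The listing-page date handling in `extract_visible_articles` above tries two `strptime` formats in turn and otherwise keeps the raw text. A minimal standalone sketch of that fallback chain follows. The helper name `normalize_listing_date` and its `formats` parameter are illustrative assumptions rather than part of the committed scraper, and month-name parsing assumes an English locale.

```python
from datetime import datetime

# Formats tried in order; both appear in the scraper's date handling.
DATE_FORMATS = ('%b %d, %Y', '%B %d, %Y')


def normalize_listing_date(raw_text, formats=DATE_FORMATS):
    """Return the date as YYYY-MM-DD, or the cleaned text if no format matches.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    # Drop the separator dot and its mis-decoded companion character,
    # mirroring the replace() calls in the scraper.
    cleaned = raw_text.replace('Â', '').replace('·', '').strip()
    for fmt in formats:
        try:
            return datetime.strptime(cleaned, fmt).strftime('%Y-%m-%d')
        except ValueError:
            continue  # Not this format; try the next one.
    return cleaned


if __name__ == "__main__":
    print(normalize_listing_date('Jan 05, 2024'))
    print(normalize_listing_date('· January 5, 2024'))
    print(normalize_listing_date('2 hours ago'))
```

Catching only `ValueError` keeps unrelated failures (for example a `None` passed in by mistake) visible instead of silently turning them into a "date" string.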
src/scrapers/toi_scraper.py ADDED
@@ -0,0 +1,309 @@
1
+ import os
2
+ import time
3
+ import logging
4
+ import csv
5
+ import signal
6
+ import sys
7
+ import argparse
8
+ from datetime import datetime
9
+ from concurrent.futures import ThreadPoolExecutor, as_completed
10
+ from selenium import webdriver
11
+ from selenium.webdriver.chrome.service import Service
12
+ from selenium.webdriver.common.by import By
13
+ from selenium.webdriver.support.ui import WebDriverWait
14
+ from selenium.webdriver.support import expected_conditions as EC
15
+ from selenium.common.exceptions import (
16
+ TimeoutException, NoSuchElementException, ElementClickInterceptedException
17
+ )
18
+ from bs4 import BeautifulSoup
19
+ from utils.webdriver_utils import create_chrome_driver, scroll_to_element
20
+
21
+ class TOIArticleScraper:
22
+ def __init__(self, max_workers=4):
23
+ self.max_workers = max_workers
24
+ self.base_url = "https://timesofindia.indiatimes.com"
25
+ self.fetched_articles = set()
26
+ self.articles = []
27
+ self.is_interrupted = False
28
+ self.last_save_time = time.time()
29
+ self.save_interval = 300 # Save every 5 minutes
30
+
31
+ logging.basicConfig(
32
+ level=logging.INFO,
33
+ format='%(asctime)s - %(levelname)s - %(message)s'
34
+ )
35
+ self.logger = logging.getLogger(__name__)
36
+
37
+ # Set up signal handlers
38
+ signal.signal(signal.SIGINT, self.signal_handler)
39
+ signal.signal(signal.SIGTERM, self.signal_handler)
40
+
41
+ def create_driver(self):
42
+ """Create and return a new Chrome driver instance"""
43
+ return create_chrome_driver(headless=True, load_images=False)
44
+
45
+ def signal_handler(self, signum, frame):
46
+ """Handle interrupt signals"""
47
+ print("\nReceived interrupt signal. Saving progress and shutting down...")
48
+ self.is_interrupted = True
49
+ if self.articles:
50
+ self.save_progress("interrupted", force=True)
51
+ sys.exit(0)
52
+
53
+ # def create_driver(self):
54
+ # """Create and return a new Chrome driver instance"""
55
+ # chrome_options = webdriver.ChromeOptions()
56
+ # chrome_options.add_argument('--headless')
57
+ # chrome_options.add_argument('--no-sandbox')
58
+ # chrome_options.add_argument('--disable-dev-shm-usage')
59
+ # chrome_options.add_argument('--disable-extensions')
60
+ # chrome_options.page_load_strategy = 'eager'
61
+ # return webdriver.Chrome(options=chrome_options)
62
+
67
+ def scroll_to_element(self, driver, element):
68
+ """Scroll to make the element visible"""
69
+ try:
70
+ driver.execute_script("arguments[0].scrollIntoView(true);", element)
71
+ driver.execute_script("window.scrollBy(0, -100);")
72
+ except Exception as e:
73
+ self.logger.error(f"Error scrolling to element: {str(e)}")
74
+
75
+ def extract_visible_articles(self, driver):
76
+ """Extract currently visible articles from the page"""
77
+ soup = BeautifulSoup(driver.page_source, 'html.parser')
78
+ article_containers = soup.find_all('div', class_='uwU81')
79
+
80
+ new_articles = []
81
+ for container in article_containers:
82
+ link = container.find('a', href=True)
83
+ date = self.extract_date(container)
84
+
85
+ if link and 'articleshow' in link['href']:
86
+ full_link = link['href'] if link['href'].startswith('http') else self.base_url + link['href']
87
+ if full_link not in self.fetched_articles:
88
+ self.fetched_articles.add(full_link)
89
+ new_articles.append({'link': full_link, 'date': date})
90
+
91
+ return new_articles
92
+
93
+ def save_progress(self, topic, force=False):
94
+ """Save current progress to CSV"""
95
+ current_time = time.time()
96
+ if force or (current_time - self.last_save_time >= self.save_interval and self.articles):
97
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
98
+ filename = f"toi_{topic}articles_{timestamp}_partial.csv"
99
+ try:
100
+ with open(filename, 'w', newline='', encoding='utf-8') as file:
101
+ writer = csv.DictWriter(file, fieldnames=['title', 'desc', 'date', 'link'])
102
+ writer.writeheader()
103
+ writer.writerows(self.articles)
104
+ self.last_save_time = current_time
105
+ print(f"\nProgress saved to {filename}: {len(self.articles)} articles")
106
+ self.logger.info(f"Progress saved to {filename}: {len(self.articles)} articles")
107
+ except Exception as e:
108
+ self.logger.error(f"Error saving progress: {str(e)}")
109
+
110
+ def scrape_topic(self, topic_url, topic):
111
+ """Incrementally scrape articles while loading more content"""
112
+ driver = self.create_driver()
113
+ try:
114
+ driver.get(topic_url)
115
+ time.sleep(1)
116
+
117
+ while not self.is_interrupted:
118
+ # Scrape currently visible articles
119
+ new_articles = self.extract_visible_articles(driver)
120
+ if new_articles:
121
+ # Process new articles in parallel
122
+ self.process_articles_batch(new_articles)
123
+ self.logger.info(f"Scraped {len(new_articles)} new articles. Total: {len(self.articles)}")
124
+ # Save progress after each batch
125
+ self.save_progress(topic)
126
+
127
+ # Try to load more articles
128
+ try:
129
+ parent_div = driver.find_element(By.CLASS_NAME, "IVNry")
130
+ load_more_button = parent_div.find_element(By.TAG_NAME, "button")
131
+ self.scroll_to_element(driver, load_more_button)
132
+ driver.execute_script("arguments[0].click();", load_more_button)
133
+ time.sleep(1)
134
+ except NoSuchElementException:
135
+ self.logger.info("No more articles to load")
136
+ break
137
+ except Exception as e:
138
+ self.logger.error(f"Error clicking load more: {str(e)}")
139
+ break
140
+
141
+ # Save final results
142
+ if self.articles:
143
+ return self.save_to_csv(self.articles, topic, final=True)
144
+ return None
145
+
146
+ except Exception as e:
147
+ self.logger.error(f"Error in scrape_topic: {str(e)}")
148
+ if self.articles:
149
+ return self.save_to_csv(self.articles, topic, final=True)
150
+ return None
151
+ finally:
152
+ driver.quit()
153
+
154
+ def process_articles_batch(self, article_batch):
155
+ """Process a batch of articles in parallel"""
156
+ with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
157
+ futures = [executor.submit(self.scrape_article_parallel, article_data)
158
+ for article_data in article_batch]
159
+
160
+ for future in as_completed(futures):
161
+ try:
162
+ article = future.result()
163
+ if article:
164
+ self.articles.append(article)
165
+ self.logger.info(f"Successfully scraped: {article['title'][:50]}...")
166
+ except Exception as e:
167
+ self.logger.error(f"Error processing article: {str(e)}")
168
+
169
+ def scrape_article_parallel(self, article_data):
170
+ """Scrape a single article with its own driver instance"""
171
+ driver = self.create_driver()
172
+ try:
173
+ url = article_data['link']
174
+ driver.get(url)
175
+
176
+ try:
177
+ title_element = WebDriverWait(driver, 5).until(
178
+ EC.presence_of_element_located((By.TAG_NAME, "h1"))
179
+ )
180
+ except TimeoutException:
181
+ pass
182
+
183
+ soup = BeautifulSoup(driver.page_source, 'html.parser')
184
+
185
+ # Extract title
186
+ title = None
187
+ title_element = soup.find('h1')
188
+ if title_element:
189
+ title = title_element.get_text().strip()
190
+
191
+ if not title:
192
+ title_elements = soup.find_all(['h1', 'div'], class_=['_23498', 'articletitle', 'headline', 'title'])
193
+ for element in title_elements:
194
+ if element.get_text().strip():
195
+ title = element.get_text().strip()
196
+ break
197
+
198
+ # Extract description
199
+ desc = None
200
+ desc_element = soup.find('div', class_='_s30J clearfix')
201
+ if desc_element:
202
+ desc = desc_element.get_text(separator=' ', strip=True)  # get_text() already drops tags; str.strip() takes a character set, not a substring
203
+
204
+ if not title:
205
+ meta_title = soup.find('meta', property='og:title')
206
+ if meta_title:
207
+ title = meta_title.get('content', '').strip()
208
+
209
+ return {
210
+ 'title': title or 'Title not found',
211
+ 'desc': desc or 'Description not found',
212
+ 'date': article_data['date'] or 'Date not found',
213
+ 'link': url
214
+ }
215
+
216
+ except Exception as e:
217
+ self.logger.error(f"Error scraping article {url}: {str(e)}")
218
+ return None
219
+ finally:
220
+ driver.quit()
221
+
222
+ def extract_date(self, container):
223
+ """Extract the date from the article container"""
224
+ try:
225
+ date_div = container.find('div', class_='ZxBIG')
226
+ if date_div:
227
+ contents = list(date_div.descendants)
228
+ for content in contents:
229
+ if hasattr(content, 'strip') and 'IST' in str(content):
230
+ text = str(content).strip()
231
+ if '(IST)' in text:
232
+ return text.strip()
233
+ return None
234
+ except Exception as e:
235
+ self.logger.error(f"Error extracting date: {str(e)}")
236
+ return None
237
+
238
+ def save_to_csv(self, articles, topic, final=False):
239
+ """Save articles to CSV file"""
240
+ if not articles:
241
+ self.logger.error("No articles to save")
242
+ return None
243
+
244
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
245
+ filename = f"toi_{topic}articles_{timestamp}_{'final' if final else 'partial'}.csv"
246
+
247
+ try:
248
+ with open(filename, 'w', newline='', encoding='utf-8') as file:
249
+ writer = csv.DictWriter(file, fieldnames=['title', 'desc', 'date', 'link'])
250
+ writer.writeheader()
251
+ writer.writerows(articles)
252
+ self.logger.info(f"Successfully saved {len(articles)} articles to: {filename}")
253
+ return filename
254
+ except Exception as e:
255
+ self.logger.error(f"Error saving to CSV: {str(e)}")
256
+ return None
257
+
258
+ def parse_arguments():
259
+ """Parse command line arguments"""
260
+ parser = argparse.ArgumentParser(description='Scrape articles from Times of India by topic')
261
+
262
+ parser.add_argument('topic', type=str,
263
+ help='Topic to scrape (e.g., "RSS", "Covid", "India")')
264
+
265
+ parser.add_argument('-w', '--workers', type=int, default=4,
266
+ help='Number of worker threads (default: 4)')
267
+
268
+ parser.add_argument('-i', '--interval', type=int, default=300,
269
+ help='Auto-save interval in seconds (default: 300)')
270
+
271
+ return parser.parse_args()
272
+
273
+ def main():
274
+ try:
275
+ # Parse command line arguments
276
+ args = parse_arguments()
277
+
278
+ # Initialize the scraper with command line arguments
279
+ scraper = TOIArticleScraper(max_workers=args.workers)
280
+ scraper.save_interval = args.interval
281
+
282
+ # Get topic from command line argument
283
+ topic = args.topic
284
+ topic_url = f"{scraper.base_url}/topic/{topic}/news"
285
+
286
+ print(f"\nScraping {topic}-related articles from Times of India...")
287
+ print("Press Ctrl+C at any time to save progress and exit.")
288
+
289
+ final_csv = scraper.scrape_topic(topic_url, topic)
290
+
291
+ if final_csv:
292
+ print(f"\nArticles have been saved to: {final_csv}")
293
+ print(f"Total articles scraped: {len(scraper.articles)}")
294
+ else:
295
+ print("\nError saving to final CSV file")
296
+
297
+ except KeyboardInterrupt:
298
+ print("\nProcess interrupted by user. Saving progress...")
299
+ if 'scraper' in locals() and scraper.articles:
300
+ scraper.save_progress(topic, force=True)
301
+ print("Saved progress and exiting.")
302
+ except Exception as e:
303
+ print(f"\nAn error occurred: {str(e)}")
304
+ if 'scraper' in locals() and scraper.articles:
305
+ scraper.save_progress(topic, force=True)
306
+ print("Saved progress despite error.")
307
+
308
+ if __name__ == "__main__":
309
+ main()
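`main()` above interpolates the raw topic into the URL path, and the search-based scrapers in this commit do the same with query strings. A topic such as "Climate Change" contains a space, so one option is to percent-encode it before building the URL. The sketch below uses only `urllib.parse`. The helper names `build_topic_url` and `build_search_url` are hypothetical, and whether each site prefers `%20`, `+`, or a hyphenated slug is an assumption that should be checked against the live site.

```python
from urllib.parse import quote, urlencode


def build_topic_url(base_url, topic):
    """Build a path-style topic URL with the topic percent-encoded.

    Hypothetical helper mirroring the f"{base_url}/topic/{topic}/news" pattern.
    """
    # safe='' also encodes '/', so a topic cannot add extra path segments.
    return f"{base_url}/topic/{quote(topic, safe='')}/news"


def build_search_url(base_url, **params):
    """Build a query-style search URL; urlencode escapes each value."""
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_topic_url("https://timesofindia.indiatimes.com", "Climate Change"))
    print(build_search_url("https://scroll.in/search", q="R&D policy", page=2))
```

Encoding at the point where the URL is built keeps the human-readable topic available for log messages and output filenames.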
src/scrapers/wion_scraper.py ADDED
@@ -0,0 +1,379 @@
1
+ import os
2
+ import time
3
+ import logging
4
+ import csv
5
+ import signal
6
+ import sys
7
+ import argparse
8
+ from datetime import datetime
9
+ from concurrent.futures import ThreadPoolExecutor, as_completed
10
+ from selenium import webdriver
11
+ from selenium.webdriver.chrome.service import Service
12
+ from selenium.webdriver.common.by import By
13
+ from selenium.webdriver.support.ui import WebDriverWait
14
+ from selenium.webdriver.support import expected_conditions as EC
15
+ from selenium.common.exceptions import (
16
+ TimeoutException, NoSuchElementException, ElementClickInterceptedException,
17
+ StaleElementReferenceException, WebDriverException
18
+ )
19
+ from bs4 import BeautifulSoup
20
+ from urllib3.exceptions import ProtocolError
21
+
22
+ from utils.webdriver_utils import create_chrome_driver, scroll_to_element
23
+
24
+ class WIONArticleScraper:
25
+ def __init__(self, max_workers=4, articles_per_page=10, max_retries=3):
26
+ self.max_workers = max_workers
27
+ self.articles_per_page = articles_per_page
28
+ self.max_retries = max_retries
29
+ self.base_url = "https://www.wionews.com/search"
30
+ self.fetched_articles = set()
31
+ self.articles = []
32
+ self.is_interrupted = False
33
+ self.last_save_time = time.time()
34
+ self.save_interval = 300 # Save every 5 minutes
35
+
36
+ logging.basicConfig(
37
+ level=logging.INFO,
38
+ format='%(asctime)s - %(levelname)s - %(message)s'
39
+ )
40
+ self.logger = logging.getLogger(__name__)
41
+
42
+ signal.signal(signal.SIGINT, self.signal_handler)
43
+ signal.signal(signal.SIGTERM, self.signal_handler)
44
+
45
+ def signal_handler(self, signum, frame):
46
+ print("\nReceived interrupt signal. Saving progress and shutting down...")
47
+ self.is_interrupted = True
48
+ if self.articles:
49
+ self.save_progress("interrupted", force=True)
50
+ sys.exit(0)
51
+
52
+ # def create_driver(self):
53
+ # chrome_options = webdriver.ChromeOptions()
54
+ # chrome_options.add_argument('--headless')
55
+ # chrome_options.add_argument('--no-sandbox')
56
+ # chrome_options.add_argument('--disable-dev-shm-usage')
57
+ # chrome_options.add_argument('--disable-extensions')
58
+ # chrome_options.add_argument('--disable-gpu')
59
+ # chrome_options.add_argument('--disable-infobars')
60
+ # chrome_options.add_argument('--disable-notifications')
61
+ # chrome_options.add_argument('--blink-settings=imagesEnabled=false') # Disable images
62
+ # chrome_options.page_load_strategy = 'eager'
63
+
64
+ # # Add performance preferences
65
+ # chrome_options.add_experimental_option('prefs', {
66
+ # 'profile.default_content_setting_values.notifications': 2,
67
+ # 'profile.managed_default_content_settings.images': 2,
68
+ # 'disk-cache-size': 4096
69
+ # })
70
+
71
+ # return webdriver.Chrome(options=chrome_options)
72
+
73
+ def create_driver(self):
74
+ """Create and return a new Chrome driver instance"""
75
+ return create_chrome_driver(headless=True, load_images=False)
76
+
77
+ def wait_for_page_load(self, driver, url, timeout=10, retries=3):
78
+ """Wait for page load with retries"""
79
+ for attempt in range(retries):
80
+ try:
81
+ driver.set_page_load_timeout(timeout)
82
+ driver.get(url)
83
+
84
+ # Wait for DOM to be ready
85
+ WebDriverWait(driver, timeout).until(
86
+ lambda d: d.execute_script('return document.readyState') == 'complete'
87
+ )
88
+
89
+ return True
90
+ except (TimeoutException, WebDriverException, ProtocolError) as e:
91
+ if attempt == retries - 1:
92
+ self.logger.warning(f"Failed to load {url} after {retries} attempts: {str(e)}")
93
+ return False
94
+ else:
95
+ self.logger.info(f"Retrying page load for {url} (attempt {attempt + 2}/{retries})")
96
+ time.sleep(2 * (attempt + 1)) # Linear backoff: 2s, 4s, ...
97
+ continue
98
+ except Exception as e:
99
+ self.logger.error(f"Unexpected error loading {url}: {str(e)}")
100
+ return False
101
+ return False
102
+
103
+ def get_total_pages(self, driver, search_term):
104
+ try:
105
+ if not self.wait_for_page_load(driver, f"{self.base_url}?page=1&title={search_term}"):
106
+ return 20
107
+
108
+ soup = BeautifulSoup(driver.page_source, 'html.parser')
109
+ pagination = soup.find('ul', class_='pagination')
110
+ if pagination:
111
+ pages = pagination.find_all('li')
112
+ if pages:
113
+ last_page = pages[-2].text.strip()
114
+ return int(last_page)
115
+ return 20
116
+ except Exception as e:
117
+ self.logger.error(f"Error getting total pages: {str(e)}")
118
+ return 20
119
+
120
+ def extract_visible_articles(self, driver):
121
+ """Extract currently visible articles from the page"""
122
+ try:
123
+ # Wait for article containers to be present
124
+ WebDriverWait(driver, 10).until(
125
+ EC.presence_of_element_located((By.CLASS_NAME, 'gh-archive-page-post-content-main'))
126
+ )
127
+
128
+ soup = BeautifulSoup(driver.page_source, 'html.parser')
129
+ article_containers = soup.find_all('div', class_='gh-archive-page-post-content-main')
130
+
131
+ new_articles = []
132
+ for container in article_containers:
133
+ try:
134
+ link_element = container.find('a', href=True)
135
+ date = self.extract_date(container)
136
+
137
+ if link_element and link_element['href']:
138
+ full_link = link_element['href'] if link_element['href'].startswith('http') else 'https://www.wionews.com' + link_element['href']
139
+ if full_link not in self.fetched_articles:
140
+ self.fetched_articles.add(full_link)
141
+ new_articles.append({'link': full_link, 'date': date})
142
+ except Exception as e:
143
+ self.logger.error(f"Error processing article container: {str(e)}")
144
+ continue
145
+
146
+ return new_articles
147
+ except Exception as e:
148
+ self.logger.error(f"Error extracting visible articles: {str(e)}")
149
+ return []
150
+
151
+ def extract_date(self, container):
152
+ try:
153
+ time_element = container.find('time', class_='gh-post-info__date')
154
+
155
+ if time_element:
156
+ date = time_element.get('datetime', time_element.get_text().strip())
157
+ return date
158
+
159
+ post_info = container.find('div', class_='gh-post-info')
160
+ if post_info:
161
+ time_element = post_info.find('time')
162
+ if time_element:
163
+ date = time_element.get('datetime', time_element.get_text().strip())
164
+ return date
165
+
166
+ return None
167
+ except Exception as e:
168
+ self.logger.error(f"Error extracting date: {str(e)}")
169
+ return None
170
+
171
+ def scrape_topic(self, search_term, topic):
172
+ driver = self.create_driver()
173
+ try:
174
+ total_pages = self.get_total_pages(driver, search_term)
175
+ total_expected_articles = total_pages * self.articles_per_page
176
+ self.logger.info(f"Found {total_pages} pages to scrape (approximately {total_expected_articles} articles)")
177
+
178
+ for page in range(1, total_pages + 1):
179
+ if self.is_interrupted:
180
+ break
181
+
182
+ page_url = f"{self.base_url}?page={page}&title={search_term}"
183
+ articles_scraped = len(self.articles)
184
+ progress_percentage = (articles_scraped / total_expected_articles) * 100 if total_expected_articles else 0.0
185
+
186
+ self.logger.info(f"Scraping page {page}/{total_pages} - Progress: {articles_scraped}/{total_expected_articles} articles ({progress_percentage:.1f}%)")
187
+
188
+ # Try loading the page with retries
189
+ if not self.wait_for_page_load(driver, page_url):
190
+ self.logger.warning(f"Skipping page {page} due to load failure")
191
+ continue
192
+
193
+ new_articles = self.extract_visible_articles(driver)
194
+ if new_articles:
195
+ self.process_articles_batch(new_articles)
196
+ self.logger.info(f"Scraped {len(new_articles)}/{self.articles_per_page} articles from page {page}")
197
+ self.save_progress(topic)
198
+ else:
199
+ self.logger.warning(f"No articles found on page {page}")
200
+
201
+ if self.articles:
202
+ return self.save_to_csv(self.articles, topic, final=True)
203
+ return None
204
+
205
+ except Exception as e:
206
+ self.logger.error(f"Error in scrape_topic: {str(e)}")
207
+ if self.articles:
208
+ return self.save_to_csv(self.articles, topic, final=True)
209
+ return None
210
+ finally:
211
+ driver.quit()
212
+
213
+     def scrape_article_parallel(self, article_data):
+         """Scrape a single article with its own driver, retrying on failure"""
+         driver = self.create_driver()
+         url = article_data['link']
+
+         try:
+             for attempt in range(self.max_retries):
+                 try:
+                     if not self.wait_for_page_load(driver, url):
+                         if attempt == self.max_retries - 1:
+                             raise TimeoutException(f"Failed to load article after {self.max_retries} attempts")
+                         continue
+
+                     # Wait for main content to be present
+                     WebDriverWait(driver, 10).until(
+                         EC.presence_of_element_located((By.TAG_NAME, "article"))
+                     )
+
+                     soup = BeautifulSoup(driver.page_source, 'html.parser')
+
+                     # Extract title with multiple fallbacks
+                     title = None
+                     for selector in ['h1.article-main-title', 'h1.headline', 'h1']:
+                         title_element = soup.select_one(selector)
+                         if title_element:
+                             title = title_element.get_text().strip()
+                             break
+
+                     # Extract description with multiple fallbacks
+                     desc = None
+                     for class_name in ['post-content', 'article-main-content', 'post-page-main-content']:
+                         content_div = soup.find('div', class_=class_name)
+                         if content_div:
+                             p_tags = content_div.find_all('p')
+                             if p_tags:
+                                 desc = ' '.join([p.get_text().strip() for p in p_tags])
+                                 break
+
+                     return {
+                         'title': title or 'Title not found',
+                         'desc': desc or 'Description not found',
+                         'date': article_data['date'] or 'Date not found',
+                         'link': url
+                     }
+
+                 except Exception as e:
+                     if attempt == self.max_retries - 1:
+                         self.logger.error(f"Error scraping article {url}: {str(e)}")
+                         return None
+                     time.sleep(2 * (attempt + 1))  # Linear backoff: 2s, 4s, ...
+             return None
+         finally:
+             # Always release the browser, including on the success path;
+             # quitting only on the last attempt leaked a Chrome process per article.
+             driver.quit()
264
+
265
+ def process_articles_batch(self, article_batch):
266
+ with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
267
+ futures = [executor.submit(self.scrape_article_parallel, article_data)
268
+ for article_data in article_batch]
269
+
270
+ successful_articles = 0
271
+ for future in as_completed(futures):
272
+ try:
273
+ article = future.result()
274
+ if article:
275
+ self.articles.append(article)
276
+ successful_articles += 1
277
+ self.logger.info(f"Successfully scraped: {article['title'][:50]}...")
278
+ except Exception as e:
279
+ self.logger.error(f"Error processing article: {str(e)}")
280
+
281
+ if successful_articles < len(article_batch):
282
+ self.logger.warning(f"Only processed {successful_articles}/{len(article_batch)} articles in this batch")
283
+
284
+ def save_progress(self, topic, force=False):
285
+ current_time = time.time()
286
+ if force or (current_time - self.last_save_time >= self.save_interval and self.articles):
287
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
288
+ filename = f"wion_{topic}articles_{timestamp}_partial.csv"
289
+ try:
290
+ with open(filename, 'w', newline='', encoding='utf-8') as file:
291
+ writer = csv.DictWriter(file, fieldnames=['title', 'desc', 'date', 'link'])
292
+ writer.writeheader()
293
+ writer.writerows(self.articles)
294
+ self.last_save_time = current_time
295
+ print(f"\nProgress saved to {filename}: {len(self.articles)} articles")
296
+ self.logger.info(f"Progress saved to {filename}: {len(self.articles)} articles")
297
+ except Exception as e:
298
+ self.logger.error(f"Error saving progress: {str(e)}")
299
+
300
+ def save_to_csv(self, articles, topic, final=False):
301
+ if not articles:
302
+ self.logger.error("No articles to save")
303
+ return None
304
+
305
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
306
+ filename = f"wion_{topic}articles_{timestamp}_{'final' if final else 'partial'}.csv"
307
+
308
+ try:
309
+ with open(filename, 'w', newline='', encoding='utf-8') as file:
310
+ writer = csv.DictWriter(file, fieldnames=['title', 'desc', 'date', 'link'])
311
+ writer.writeheader()
312
+ writer.writerows(articles)
313
+ self.logger.info(f"Successfully saved {len(articles)} articles to: {filename}")
314
+ return filename
315
+ except Exception as e:
316
+ self.logger.error(f"Error saving to CSV: {str(e)}")
317
+ return None
318
+
319
+ def parse_arguments():
320
+ """Parse command line arguments"""
321
+ parser = argparse.ArgumentParser(description='Scrape articles from WION News by topic')
322
+
323
+ parser.add_argument('topic', type=str,
324
+ help='Topic to scrape (e.g., "RSS", "Covid", "India")')
325
+
326
+ parser.add_argument('-w', '--workers', type=int, default=4,
327
+ help='Number of worker threads (default: 4)')
328
+
329
+ parser.add_argument('-i', '--interval', type=int, default=300,
330
+ help='Auto-save interval in seconds (default: 300)')
331
+
332
+ parser.add_argument('-a', '--articles-per-page', type=int, default=10,
333
+ help='Expected number of articles per page (default: 10)')
334
+
335
+ parser.add_argument('-r', '--retries', type=int, default=3,
336
+ help='Maximum retries for failed requests (default: 3)')
337
+
338
+ return parser.parse_args()
339
+
340
+ def main():
341
+ try:
342
+ # Parse command line arguments
343
+ args = parse_arguments()
344
+
345
+ # Initialize the scraper with command line arguments
346
+ scraper = WIONArticleScraper(
347
+ max_workers=args.workers,
348
+ articles_per_page=args.articles_per_page,
349
+ max_retries=args.retries
350
+ )
351
+ scraper.save_interval = args.interval
352
+
353
+ # Get topic from command line argument
354
+ topic = args.topic
355
+
356
+ print(f"\nScraping {topic}-related articles from WION News...")
357
+ print("Press Ctrl+C at any time to save progress and exit.")
358
+
359
+ final_csv = scraper.scrape_topic(topic.lower(), topic)
360
+
361
+ if final_csv:
362
+ print(f"\nArticles have been saved to: {final_csv}")
363
+ print(f"Total articles scraped: {len(scraper.articles)}")
364
+ else:
365
+ print("\nError saving to final CSV file")
366
+
367
+ except KeyboardInterrupt:
368
+ print("\nProcess interrupted by user. Saving progress...")
369
+ if 'scraper' in locals() and scraper.articles:
370
+ scraper.save_progress(topic, force=True)
371
+ print("Saved progress and exiting.")
372
+ except Exception as e:
373
+ print(f"\nAn error occurred: {str(e)}")
374
+ if 'scraper' in locals() and scraper.articles:
375
+ scraper.save_progress(topic, force=True)
376
+ print("Saved progress despite error.")
377
+
378
+ if __name__ == "__main__":
379
+ main()
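Note on `scrape_article_parallel` above: its `finally` block calls `driver.quit()` only on the last attempt, so an article that is scraped successfully on an earlier attempt appears to leave its browser open. One possible way to restructure this is sketched below, with the retry loop wrapped in an outer `try`/`finally`. `FakeDriver`, `scrape_with_cleanup`, `create_driver`, `fetch` and `base_delay` are hypothetical names used only for illustration; they are not part of this commit.

```python
import time


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver; records whether quit() ran."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


def scrape_with_cleanup(create_driver, fetch, max_retries=3, base_delay=0.0):
    """Call fetch(driver) up to max_retries times and always quit the driver."""
    driver = create_driver()
    try:
        for attempt in range(max_retries):
            try:
                # A successful fetch returns immediately; the outer finally
                # still runs, so the driver is closed on this path too.
                return fetch(driver)
            except Exception:
                if attempt == max_retries - 1:
                    return None
                # Linearly increasing delay, as in the scraper above.
                time.sleep(base_delay * (attempt + 1))
        return None
    finally:
        driver.quit()
```

With this shape the driver is released on success, on the final failure, and on any unexpected exception, without checking the attempt number in the cleanup path.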
src/utils/__init__.py ADDED
File without changes
src/utils/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (154 Bytes).
 
src/utils/__pycache__/webdriver_utils.cpython-312.pyc ADDED
Binary file (6.04 kB).
 
src/utils/webdriver_utils.py ADDED
@@ -0,0 +1,145 @@
1
+ """
2
+ Utilities for creating and managing Selenium WebDriver instances.
3
+ This module provides reusable functions for browser automation.
4
+ """
5
+
6
+ import time
7
+ import logging
8
+ from selenium import webdriver
9
+ from selenium.webdriver.chrome.service import Service
10
+ from selenium.webdriver.support.ui import WebDriverWait
11
+ from selenium.webdriver.support import expected_conditions as EC
12
+ from selenium.common.exceptions import TimeoutException, WebDriverException
13
+ from urllib3.exceptions import ProtocolError
14
+
15
+ logger = logging.getLogger(__name__)
16
+
17
+ def create_chrome_driver(headless=True, load_images=False, page_load_strategy='eager'):
18
+ """
19
+ Create and configure a Chrome WebDriver instance with optimized settings.
20
+
21
+ Args:
22
+ headless (bool): Whether to run Chrome in headless mode
23
+ load_images (bool): Whether to load images
24
+ page_load_strategy (str): Page load strategy ('normal', 'eager', or 'none')
25
+
26
+ Returns:
27
+ webdriver.Chrome: Configured Chrome WebDriver instance
28
+ """
29
+ chrome_options = webdriver.ChromeOptions()
30
+
31
+ if headless:
32
+ chrome_options.add_argument('--headless')
33
+
34
+ # Common performance optimizations
35
+ chrome_options.add_argument('--no-sandbox')
36
+ chrome_options.add_argument('--disable-dev-shm-usage')
37
+ chrome_options.add_argument('--disable-extensions')
38
+ chrome_options.add_argument('--disable-gpu')
39
+ chrome_options.add_argument('--disable-infobars')
40
+ chrome_options.add_argument('--disable-notifications')
41
+
42
+ if not load_images:
43
+ chrome_options.add_argument('--blink-settings=imagesEnabled=false')
44
+
45
+ chrome_options.page_load_strategy = page_load_strategy
46
+
47
+ # Performance preferences
48
+ chrome_options.add_experimental_option('prefs', {
49
+ 'profile.default_content_setting_values.notifications': 2,
50
+ 'profile.managed_default_content_settings.images': 2 if not load_images else 0,
51
+ 'disk-cache-size': 4096
52
+ })
53
+
54
+ return webdriver.Chrome(options=chrome_options)
55
+
56
+ def wait_for_page_load(driver, url, timeout=10, retries=3, backoff_factor=2):
57
+ """
58
+ Load a URL with retries and a linearly increasing backoff between attempts.
59
+
60
+ Args:
61
+ driver (webdriver.Chrome): WebDriver instance
62
+ url (str): URL to load
63
+ timeout (int): Page load timeout in seconds
64
+ retries (int): Number of retry attempts
65
+ backoff_factor (int): Base wait in seconds; multiplied by the attempt number before each retry
66
+
67
+ Returns:
68
+ bool: Whether page load was successful
69
+ """
70
+ for attempt in range(retries):
71
+ try:
72
+ driver.set_page_load_timeout(timeout)
73
+ driver.get(url)
74
+
75
+ # Wait for DOM to be ready
76
+ WebDriverWait(driver, timeout).until(
77
+ lambda d: d.execute_script('return document.readyState') == 'complete'
78
+ )
79
+
80
+ return True
81
+
82
+ except (TimeoutException, WebDriverException, ProtocolError) as e:
83
+ if attempt == retries - 1:
84
+ logger.warning(f"Failed to load {url} after {retries} attempts: {str(e)}")
85
+ return False
86
+ else:
87
+ wait_time = backoff_factor * (attempt + 1)
88
+ logger.info(f"Retrying page load for {url} (attempt {attempt + 2}/{retries}) in {wait_time}s")
89
+ time.sleep(wait_time)
90
+ continue
91
+
92
+ except Exception as e:
93
+ logger.error(f"Unexpected error loading {url}: {str(e)}")
94
+ return False
95
+
96
+ return False
97
+
98
+ def scroll_to_element(driver, element):
99
+ """
100
+ Scroll the page to make an element visible.
101
+
102
+ Args:
103
+ driver (webdriver.Chrome): WebDriver instance
104
+ element: WebElement to scroll to
105
+ """
106
+ try:
107
+ driver.execute_script("arguments[0].scrollIntoView(true);", element)
108
+ driver.execute_script("window.scrollBy(0, -100);") # Adjust to avoid navbar overlay
109
+ except Exception as e:
110
+ logger.error(f"Error scrolling to element: {str(e)}")
111
+
112
+ def scroll_to_bottom(driver, scroll_pause_time=1.0, num_scrolls=None):
113
+ """
114
+ Scroll to the bottom of the page incrementally.
115
+
116
+ Args:
117
+ driver (webdriver.Chrome): WebDriver instance
118
+ scroll_pause_time (float): Time to pause between scrolls
119
+ num_scrolls (int, optional): Maximum number of scrolls to perform
120
+ """
121
+ # Get scroll height
122
+ last_height = driver.execute_script("return document.body.scrollHeight")
123
+
124
+ scrolls_performed = 0
125
+
126
+ while True:
127
+ # Check if we've reached the scroll limit
128
+ if num_scrolls is not None and scrolls_performed >= num_scrolls:
129
+ break
130
+
131
+ # Scroll down to bottom
132
+ driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
133
+
134
+ # Wait to load page
135
+ time.sleep(scroll_pause_time)
136
+
137
+ # Calculate new scroll height and compare with last scroll height
138
+ new_height = driver.execute_script("return document.body.scrollHeight")
139
+ if new_height == last_height:
140
+ break
141
+
142
+ last_height = new_height
143
+ scrolls_performed += 1
144
+
145
+ return scrolls_performed
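Note on the retry delays in `wait_for_page_load`: the wait is computed as `backoff_factor * (attempt + 1)`, which grows linearly. The sketch below compares that schedule with a true exponential one. Both helper names (`linear_delays`, `exponential_delays`) are hypothetical and exist only for this illustration; no sleep happens after the final attempt, so each list has `retries - 1` entries.

```python
def linear_delays(retries, backoff_factor=2):
    """Delays matching backoff_factor * (attempt + 1), as used in the module above."""
    return [backoff_factor * (attempt + 1) for attempt in range(retries - 1)]


def exponential_delays(retries, backoff_factor=2):
    """Delays for a true exponential schedule: backoff_factor ** (attempt + 1)."""
    return [backoff_factor ** (attempt + 1) for attempt in range(retries - 1)]


if __name__ == "__main__":
    print(linear_delays(5))       # [2, 4, 6, 8]
    print(exponential_delays(5))  # [2, 4, 8, 16]
```

With the default `retries=3` and `backoff_factor=2` the two schedules coincide (2s, then 4s), so the difference only matters when more retries or a different factor are configured.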