# News Scraper — Multi-Language CLI

## Overview

A command-line tool for scraping news from ABP Live in English and Hindi. Each language has its own `LanguageConfig` and `Scraper` subclass. Adding a new language requires only a config entry and a scraper class — no changes to the CLI, file management, or upload logic.

| Flag | Source | Language |
|------|--------|----------|
| `--english` | `news.abplive.com` | English |
| `--hindi` | `www.abplive.com` | Hindi |

---

## Setup & Installation

**Python 3.10 is required.**

```bash
py -3.10 -m venv venv
.\venv\Scripts\activate        # Windows
source venv/bin/activate       # Linux / macOS

# Install uv, then install deps for your language
pip install uv
uv pip install -r requirements-ci-english.txt   # for English
uv pip install -r requirements-ci-hindi.txt     # for Hindi
```

---

## Usage

A language flag (`--english` or `--hindi`) is **always required**.

### List Available Categories
```bash
python backend/web-scraping/news-scrape.py --english --list
python backend/web-scraping/news-scrape.py --hindi   --list
```

### Scrape a Category
```bash
python backend/web-scraping/news-scrape.py --english --category sports
python backend/web-scraping/news-scrape.py --english --category technology
python backend/web-scraping/news-scrape.py --hindi   --category politics
python backend/web-scraping/news-scrape.py --hindi   --category latest
```

### Search
```bash
python backend/web-scraping/news-scrape.py --english --search "climate change"
python backend/web-scraping/news-scrape.py --english --search "stock market"
python backend/web-scraping/news-scrape.py --english --search "pune" --pages 3
python backend/web-scraping/news-scrape.py --hindi   --search "पुणे"
python backend/web-scraping/news-scrape.py --hindi   --search "पुणे" --pages 3
```

`--pages` (alias `--page`) is optional and defaults to `1`.

---

## Available Categories

### English (`--english`)

| Key | Display Name |
|-----|-------------|
| `top` | Top News |
| `business` | Business |
| `entertainment` | Entertainment |
| `sports` | Sports |
| `lifestyle` | Lifestyle |
| `technology` | Technology |
| `elections` | Elections |

### Hindi (`--hindi`)

| Key | Display Name |
|-----|-------------|
| `top` | Top News |
| `entertainment` | Entertainment |
| `sports` | Sports |
| `politics` | Politics |
| `latest` | Latest News |
| `technology` | Technology |
| `lifestyle` | Lifestyle |
| `business` | Business |
| `world` | World News |
| `crime` | Crime |

---

## How It Works

### Scraping Pipeline

```
1. LINK DISCOVERY
   |
   +-- Category mode: fetch category page, extract article URLs via regex
   +-- Search mode:   fetch page 1 by default, or up to N pages via --pages
   |                  page 1: /search?s=query
   |                  page 2+: /search/page-2?s=query, /search/page-3?s=query
   |                  stops early if a page returns no article links
   |
   Concurrency: up to SCRAPING_MAX_WORKERS parallel workers (default: 10)
   |
2. CONTENT EXTRACTION
   |
   +-- English (EnglishScraper): finds <div class="abp-story-article"> or "article-content"
   +-- Hindi   (HindiScraper):  finds <div class="abp-story-detail"> or "story-detail"
   |
   Extracts: id, title, content (<p> tags), author, published_date, url
   Skips: /photo-gallery/ and /videos/ pages (no plain text)
   |
3. SAVE & UPLOAD
   |
   +-- JSON saved to articles/{language}/categories/{category}/{timestamp}.json
   +-- Search JSON saved to articles/{language}/search_queries/{query}/{timestamp}.json
   +-- Uploaded to Cloudinary as resource_type="raw"
```
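The concurrency step above can be sketched with a `ThreadPoolExecutor`. This is a minimal illustration, not the actual implementation: `fetch_article` and `urls` are hypothetical placeholders, and the worker count mirrors the documented `SCRAPING_MAX_WORKERS` default.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = int(os.environ.get("SCRAPING_MAX_WORKERS", "10"))

def scrape_all(urls, fetch_article):
    """Fetch every article URL in parallel; failed fetches return None and are skipped."""
    articles = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(fetch_article, u): u for u in urls}
        for fut in as_completed(futures):
            article = fut.result()
            if article is not None:
                articles.append(article)
    return articles
```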

### Article Link Patterns

- **English:** URLs matching `abplive.com/...-{numeric_id}` or ending in `.html`
- **Hindi:** URLs matching `abplive.com/.+-{6+ digit numeric ID}`, excluding photo-gallery and video paths
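The two patterns can be approximated with regexes like the following. These are a sketch reconstructed from the descriptions above; the actual expressions in `news-scrape.py` may differ.

```python
import re

# English: path ending in a numeric ID or in .html
ENGLISH_LINK_RE = re.compile(r"abplive\.com/.+(?:-\d+|\.html)$")

# Hindi: path ending in a 6+ digit numeric ID,
# with photo-gallery and video paths excluded via a negative lookahead
HINDI_LINK_RE = re.compile(
    r"abplive\.com/(?!.*(?:photo-gallery|videos)/).+-\d{6,}$"
)

def is_article_link(url: str, language: str) -> bool:
    """Rough check of whether a URL looks like a scrapeable article."""
    pattern = ENGLISH_LINK_RE if language == "english" else HINDI_LINK_RE
    return bool(pattern.search(url))
```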

---

## Output Structure

### Category Scraping
```
articles/
+-- english/
|   +-- categories/
|       +-- {category}/
|           +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
    +-- categories/
        +-- {category}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json
```

### Search Queries
```
articles/
+-- english/
    +-- search_queries/
        +-- {sanitized_query}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
    +-- search_queries/
        +-- {sanitized_query}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json
```

**Timestamp examples:** `1_feb_2_30_pm` · `15_jan_9_45_am`
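The filename scheme can be reproduced with `datetime`. A sketch under assumptions inferred from the examples above: lowercase month abbreviation, unpadded day and hour, and (an assumption — the examples don't show it) a zero-padded minute.

```python
from datetime import datetime
from pathlib import Path

def timestamp_name(now: datetime) -> str:
    """e.g. 1_feb_2_30_pm — unpadded day and 12-hour time, lowercase month and am/pm."""
    hour12 = now.hour % 12 or 12
    ampm = "am" if now.hour < 12 else "pm"
    month = now.strftime("%b").lower()
    return f"{now.day}_{month}_{hour12}_{now.minute:02d}_{ampm}"

def category_output_path(language: str, category: str, now: datetime) -> Path:
    """Build the articles/{language}/categories/{category}/{timestamp}.json path."""
    return Path("articles") / language / "categories" / category / f"{timestamp_name(now)}.json"
```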

---

## Article Data Format

Each scraped article:

```json
{
  "id": "1827329",
  "language": "english",
  "category": "Sports",
  "title": "Article Title Here",
  "author": "Author Name",
  "published_date": "2026-02-01",
  "url": "https://news.abplive.com/sports/...",
  "content": "Full article text paragraph by paragraph...",
  "scraped_at": "2026-02-01T14:30:00.000000+00:00"
}
```

The `language` field (`"english"` or `"hindi"`) flows through every pipeline stage and is stored in Supabase for filtering.
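The record maps naturally onto a typed structure. A hypothetical sketch (the codebase may use plain dicts rather than a `TypedDict`):

```python
from typing import TypedDict

class Article(TypedDict):
    id: str
    language: str        # "english" or "hindi"; used for Supabase filtering
    category: str
    title: str
    author: str
    published_date: str  # site-reported date, e.g. "2026-02-01"
    url: str
    content: str
    scraped_at: str      # ISO 8601 UTC timestamp added at scrape time

REQUIRED_KEYS = set(Article.__annotations__)

def is_complete(record: dict) -> bool:
    """True if the record carries every expected field."""
    return REQUIRED_KEYS <= record.keys()
```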

---

## Configuration

| Setting | Environment Variable | Default |
|---------|---------------------|---------|
| Concurrent workers | `SCRAPING_MAX_WORKERS` | `10` |
| Request timeout | `SCRAPING_TIMEOUT` | `30` (seconds) |
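Both settings follow the same environment-with-default pattern; a minimal sketch:

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    return int(os.environ.get(name, default))

SCRAPING_MAX_WORKERS = env_int("SCRAPING_MAX_WORKERS", 10)
SCRAPING_TIMEOUT = env_int("SCRAPING_TIMEOUT", 30)  # seconds
```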

---

## Error Handling

- Network timeouts are caught per article; failed articles are counted but do not abort the run
- HTTP non-200 responses are logged and skipped
- Articles with no extractable `<h1>` title or content `<p>` tags are skipped
- Hindi photo-gallery and video URLs are excluded during link discovery
- Duplicate URLs within a single run are deduplicated with a `set`
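Taken together, these rules amount to a per-URL guard loop. A minimal sketch, with `parse_article` as a hypothetical callable that raises on network errors and returns `None` when no title or content is found:

```python
def scrape_safely(urls, parse_article):
    """Apply the error policy above: failures are counted but never abort the run."""
    seen = set()              # duplicate URLs within a run are dropped
    results, failed = [], 0
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        try:
            article = parse_article(url)   # may raise on timeout / HTTP error
        except Exception:
            failed += 1                    # counted, run continues
            continue
        if article is None:                # no extractable title/content
            failed += 1
            continue
        results.append(article)
    return results, failed
```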

---

## Adding a New Language (Developer Guide)

### Step 1: Add a `LanguageConfig` in `news-scrape.py`

```python
_MR_BASE = "https://marathi.abplive.com"

MARATHI_CONFIG = LanguageConfig(
    base_url=_MR_BASE,
    categories={
        "top":    {"name": "Top News", "url": f"{_MR_BASE}/news"},
        "sports": {"name": "Sports",   "url": f"{_MR_BASE}/sports"},
    },
    search_url_tpl=None,                 # set if the site supports search
    scraper_class_name="MarathiScraper",
    output_subfolder="marathi",
)
```

### Step 2: Register it in `LANGUAGE_CONFIGS`

```python
LANGUAGE_CONFIGS: Dict[str, LanguageConfig] = {
    "english": ENGLISH_CONFIG,
    "hindi":   HINDI_CONFIG,
    "marathi": MARATHI_CONFIG,    # <- add here
}
```

### Step 3: Write a `Scraper` subclass

```python
class MarathiScraper(BaseScraper):
    def _extract_links(self, soup, src_url, is_search=False):
        # return Set[str] of article URLs
        ...

    def parse_article(self, link: str, category: str):
        # return Dict with keys: id, language, category, title,
        #   author, published_date, url, content, scraped_at
        # or return None on failure
        ...
```

Register it in the factory dict at the bottom of `news-scrape.py`:
```python
_SCRAPER_CLASSES = {
    "EnglishScraper": EnglishScraper,
    "HindiScraper":   HindiScraper,
    "MarathiScraper": MarathiScraper,   # <- add here
}
```

The `--marathi` CLI flag, output paths, Cloudinary upload, and all downstream pipeline steps work automatically.
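The "flag appears automatically" behavior suggests the CLI flags are generated from the registry rather than hard-coded. A hypothetical sketch of that wiring (the configs themselves are elided here):

```python
import argparse

# Stand-ins; the real registry maps names to LanguageConfig instances
LANGUAGE_CONFIGS = {"english": None, "hindi": None, "marathi": None}

def build_parser() -> argparse.ArgumentParser:
    """One mutually exclusive --{language} flag per registered config."""
    parser = argparse.ArgumentParser(description="Multi-language news scraper")
    group = parser.add_mutually_exclusive_group(required=True)
    for lang in LANGUAGE_CONFIGS:
        group.add_argument(f"--{lang}", dest="language",
                           action="store_const", const=lang)
    return parser
```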