# News Scraper — Multi-Language CLI
## Overview
A command-line tool for scraping news from ABP Live in English or Hindi. Each language has its own `LanguageConfig` and `Scraper` subclass, so adding a new language requires only a config entry and a scraper class; the CLI, file management, and upload logic need no changes.
| Flag | Source | Language |
|------|--------|----------|
| `--english` | `news.abplive.com` | English |
| `--hindi` | `www.abplive.com` | Hindi |
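The per-language wiring can be pictured as a small dataclass. This is a hedged sketch whose field names follow the developer guide at the bottom of this README; the real definition lives in `news-scrape.py`, and the English URLs shown here are illustrative values, not verified ones:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class LanguageConfig:
    base_url: str
    categories: Dict[str, Dict[str, str]]   # key -> {"name": ..., "url": ...}
    search_url_tpl: Optional[str]           # None if the site has no search
    scraper_class_name: str                 # looked up in _SCRAPER_CLASSES
    output_subfolder: str                   # articles/{output_subfolder}/...

# Illustrative English entry (URL template is an assumption for the sketch):
ENGLISH_CONFIG = LanguageConfig(
    base_url="https://news.abplive.com",
    categories={"sports": {"name": "Sports",
                           "url": "https://news.abplive.com/sports"}},
    search_url_tpl="https://news.abplive.com/search?s={query}",
    scraper_class_name="EnglishScraper",
    output_subfolder="english",
)
```

With this shape, everything the rest of the pipeline needs (where to fetch, how to search, where to save) hangs off one object per language.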
---
## Setup & Installation
**Python 3.10 is required.**
```bash
py -3.10 -m venv venv
.\venv\Scripts\activate # Windows
source venv/bin/activate # Linux / macOS
# Install uv, then install deps for your language
pip install uv
uv pip install -r requirements-ci-english.txt # for English
uv pip install -r requirements-ci-hindi.txt # for Hindi
```
---
## Usage
A language flag (`--english` or `--hindi`) is **always required**.
### List Available Categories
```bash
python backend/web-scraping/news-scrape.py --english --list
python backend/web-scraping/news-scrape.py --hindi --list
```
### Scrape a Category
```bash
python backend/web-scraping/news-scrape.py --english --category sports
python backend/web-scraping/news-scrape.py --english --category technology
python backend/web-scraping/news-scrape.py --hindi --category politics
python backend/web-scraping/news-scrape.py --hindi --category latest
```
### Search
```bash
python backend/web-scraping/news-scrape.py --english --search "climate change"
python backend/web-scraping/news-scrape.py --english --search "stock market"
python backend/web-scraping/news-scrape.py --english --search "pune" --pages 3
python backend/web-scraping/news-scrape.py --hindi --search "पुणे"
python backend/web-scraping/news-scrape.py --hindi --search "पुणे" --pages 3
```
`--pages` / `--page` is optional and defaults to `1`.
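How `--pages` expands into URLs is described under "How It Works" below; as an illustration, a minimal builder for the paginated search URLs might look like this (the function name and the query encoding are assumptions for the sketch):

```python
from urllib.parse import quote_plus

def search_urls(base_url: str, query: str, pages: int = 1) -> list:
    """Page 1 is /search?s=query; page N (N >= 2) is /search/page-N?s=query."""
    q = quote_plus(query)
    urls = [f"{base_url}/search?s={q}"]
    for n in range(2, pages + 1):
        urls.append(f"{base_url}/search/page-{n}?s={q}")
    return urls
```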
## Available Categories
### English (`--english`)
| Key | Display Name |
|-----|-------------|
| `top` | Top News |
| `business` | Business |
| `entertainment` | Entertainment |
| `sports` | Sports |
| `lifestyle` | Lifestyle |
| `technology` | Technology |
| `elections` | Elections |
### Hindi (`--hindi`)
| Key | Display Name |
|-----|-------------|
| `top` | Top News |
| `entertainment` | Entertainment |
| `sports` | Sports |
| `politics` | Politics |
| `latest` | Latest News |
| `technology` | Technology |
| `lifestyle` | Lifestyle |
| `business` | Business |
| `world` | World News |
| `crime` | Crime |
---
## How It Works
### Scraping Pipeline
```
1. LINK DISCOVERY
|
+-- Category mode: fetch category page, extract article URLs via regex
+-- Search mode: fetch page 1 by default, or up to N pages via --pages
| page 1: /search?s=query
| page 2+: /search/page-2?s=query, /search/page-3?s=query
| stops early if a page returns no article links
|
Concurrency: up to SCRAPING_MAX_WORKERS parallel workers (default: 10)
|
2. CONTENT EXTRACTION
|
+-- English (EnglishScraper): finds <div class="abp-story-article"> or "article-content"
+-- Hindi (HindiScraper): finds <div class="abp-story-detail"> or "story-detail"
|
Extracts: id, title, content (<p> tags), author, published_date, url
Skips: /photo-gallery/ and /videos/ pages (no plain text)
|
3. SAVE & UPLOAD
|
+-- JSON saved to articles/{language}/categories/{category}/{timestamp}.json
+-- Search JSON saved to articles/{language}/search_queries/{query}/{timestamp}.json
+-- Uploaded to Cloudinary as resource_type="raw"
```
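The parallel fan-out above can be sketched with a thread pool. This is an illustrative shape only (the helper name `scrape_all` is invented); it shows how failed articles are counted without aborting the run, matching the error-handling behavior described later:

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = int(os.getenv("SCRAPING_MAX_WORKERS", "10"))

def scrape_all(links, parse_article, category):
    """Fetch and parse article links in parallel.

    parse_article(link, category) returns a dict, or None on failure;
    failures are counted, never raised, so one bad page can't kill the run.
    """
    articles, failed = [], 0
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(parse_article, link, category): link
                   for link in links}
        for fut in as_completed(futures):
            result = fut.result()
            if result is None:
                failed += 1
            else:
                articles.append(result)
    return articles, failed
```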
### Article Link Patterns
- **English:** URLs matching `abplive.com/...-{numeric_id}` or ending in `.html`
- **Hindi:** URLs matching `abplive.com/.+-{6+ digit numeric ID}`, excluding photo-gallery and video paths
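Expressed as regular expressions, the two filters look roughly like this. These are approximations reconstructed from the descriptions above; the authoritative patterns live in `news-scrape.py`:

```python
import re

# English: path ends in a numeric id or in .html
EN_LINK = re.compile(r"abplive\.com/.+(?:-\d+|\.html)$")

# Hindi: 6+ digit numeric id, with photo-gallery and video paths excluded
HI_LINK = re.compile(r"abplive\.com/(?!.*(?:photo-gallery|videos)).+-\d{6,}$")
```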
---
## Output Structure
### Category Scraping
```
articles/
+-- english/
| +-- categories/
| +-- {category}/
| +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
+-- categories/
+-- {category}/
+-- {day}_{month}_{hour}_{minute}_{ampm}.json
```
### Search Queries
```
articles/
+-- english/
+-- search_queries/
+-- {sanitized_query}/
+-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
+-- search_queries/
+-- {sanitized_query}/
+-- {day}_{month}_{hour}_{minute}_{ampm}.json
```
**Timestamp examples:** `1_feb_2_30_pm` · `15_jan_9_45_am`
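A slug matching these examples can be produced as follows. This is a hedged reconstruction from the two examples (day and 12-hour clock without leading zeros, minutes zero-padded, lowercase month abbreviation); the actual helper in `news-scrape.py` may differ:

```python
from datetime import datetime

_MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
           "jul", "aug", "sep", "oct", "nov", "dec"]

def timestamp_slug(dt: datetime) -> str:
    """Build a filename slug like '1_feb_2_30_pm'."""
    hour12 = dt.hour % 12 or 12            # 0 and 12 both display as 12
    ampm = "pm" if dt.hour >= 12 else "am"
    return f"{dt.day}_{_MONTHS[dt.month - 1]}_{hour12}_{dt.minute:02d}_{ampm}"
```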
---
## Article Data Format
Each scraped article:
```json
{
"id": "1827329",
"language": "english",
"category": "Sports",
"title": "Article Title Here",
"author": "Author Name",
"published_date": "2026-02-01",
"url": "https://news.abplive.com/sports/...",
"content": "Full article text paragraph by paragraph...",
"scraped_at": "2026-02-01T14:30:00.000000+00:00"
}
```
The `language` field (`"english"` or `"hindi"`) flows through every pipeline stage and is stored in Supabase for filtering.
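For reference, the record above can be typed as a `TypedDict`. This is a sketch; `Article` and `is_complete` are illustrative names, not identifiers from `news-scrape.py`:

```python
from typing import TypedDict

class Article(TypedDict):
    id: str
    language: str        # "english" or "hindi"
    category: str
    title: str
    author: str
    published_date: str
    url: str
    content: str
    scraped_at: str      # ISO-8601 UTC timestamp

REQUIRED_KEYS = frozenset(Article.__annotations__)

def is_complete(article: dict) -> bool:
    """True when every documented field is present and non-empty."""
    return all(article.get(k) for k in REQUIRED_KEYS)
```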
---
## Configuration
| Setting | Environment Variable | Default |
|---------|---------------------|---------|
| Concurrent workers | `SCRAPING_MAX_WORKERS` | `10` |
| Request timeout | `SCRAPING_TIMEOUT` | `30` (seconds) |
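Both settings are read from the environment with the documented defaults; a minimal equivalent (the function name is illustrative):

```python
import os

def scraping_settings(env=os.environ):
    """Read tuning knobs from the environment, falling back to defaults."""
    return {
        "max_workers": int(env.get("SCRAPING_MAX_WORKERS", "10")),
        "timeout": float(env.get("SCRAPING_TIMEOUT", "30")),
    }
```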
---
## Error Handling
- Network timeouts are caught per article; failed articles are counted but do not abort the run
- HTTP non-200 responses are logged and skipped
- Articles with no extractable `<h1>` title or content `<p>` tags are skipped
- Hindi photo-gallery and video URLs are excluded during link discovery
- Duplicate URLs within a single run are filtered out using a `set`
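The first two bullets amount to a "return `None` instead of raising" fetch wrapper. A stdlib sketch of that shape (the real script's HTTP client and headers may differ):

```python
from urllib.request import urlopen, Request
from urllib.error import URLError, HTTPError

def fetch_html(url: str, timeout: float = 30.0):
    """Return the page body, or None on timeout, bad URL, or non-200 status."""
    try:
        req = Request(url, headers={"User-Agent": "news-scrape"})
        with urlopen(req, timeout=timeout) as resp:
            if resp.status != 200:          # non-200: log-and-skip territory
                return None
            return resp.read().decode("utf-8", errors="replace")
    except (HTTPError, URLError, TimeoutError, ValueError):
        return None                         # caught per article, never fatal
```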
---
## Adding a New Language (Developer Guide)
### Step 1: Add a `LanguageConfig` in `news-scrape.py`
```python
_MR_BASE = "https://marathi.abplive.com"
MARATHI_CONFIG = LanguageConfig(
base_url=_MR_BASE,
categories={
"top": {"name": "Top News", "url": f"{_MR_BASE}/news"},
"sports": {"name": "Sports", "url": f"{_MR_BASE}/sports"},
},
search_url_tpl=None, # set if the site supports search
scraper_class_name="MarathiScraper",
output_subfolder="marathi",
)
```
### Step 2: Register it in `LANGUAGE_CONFIGS`
```python
LANGUAGE_CONFIGS: Dict[str, LanguageConfig] = {
"english": ENGLISH_CONFIG,
"hindi": HINDI_CONFIG,
"marathi": MARATHI_CONFIG, # <- add here
}
```
### Step 3: Write a `Scraper` subclass
```python
class MarathiScraper(BaseScraper):
def _extract_links(self, soup, src_url, is_search=False):
# return Set[str] of article URLs
...
def parse_article(self, link: str, category: str):
# return Dict with keys: id, language, category, title,
# author, published_date, url, content, scraped_at
# or return None on failure
...
```
Register in the factory dict at the bottom of `news-scrape.py`:
```python
_SCRAPER_CLASSES = {
"EnglishScraper": EnglishScraper,
"HindiScraper": HindiScraper,
"MarathiScraper": MarathiScraper, # <- add here
}
```
The `--marathi` CLI flag, output paths, Cloudinary upload, and all downstream pipeline steps work automatically.