# News Scraper — Multi-Language CLI
## Overview
A command-line tool for scraping news from ABP Live in English and Hindi. Each language has its own `LanguageConfig` and `Scraper` subclass. Adding a new language requires only a config entry and a scraper class — no changes to the CLI, file management, or upload logic.

| Flag | Source | Language |
|------|--------|----------|
| `--english` | `news.abplive.com` | English |
| `--hindi` | `www.abplive.com` | Hindi |
---
## Setup & Installation
**Python 3.10 is required.**
```bash
py -3.10 -m venv venv
.\venv\Scripts\activate    # Windows
source venv/bin/activate   # Linux / macOS
# Install uv, then install deps for your language
pip install uv
uv pip install -r requirements-ci-english.txt  # for English
uv pip install -r requirements-ci-hindi.txt    # for Hindi
```
---
## Usage
A language flag (`--english` or `--hindi`) is **always required**.
### List Available Categories
```bash
python backend/web-scraping/news-scrape.py --english --list
python backend/web-scraping/news-scrape.py --hindi --list
```
### Scrape a Category
```bash
python backend/web-scraping/news-scrape.py --english --category sports
python backend/web-scraping/news-scrape.py --english --category technology
python backend/web-scraping/news-scrape.py --hindi --category politics
python backend/web-scraping/news-scrape.py --hindi --category latest
```
### Search
```bash
python backend/web-scraping/news-scrape.py --english --search "climate change"
python backend/web-scraping/news-scrape.py --english --search "stock market"
python backend/web-scraping/news-scrape.py --english --search "pune" --pages 3
python backend/web-scraping/news-scrape.py --hindi --search "पुणे"
python backend/web-scraping/news-scrape.py --hindi --search "पुणे" --pages 3
```
`--pages` / `--page` is optional and defaults to `1`.
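As a hedged sketch of how a `--pages` value expands into request URLs (the page-URL pattern is described under How It Works; the helper name and `quote_plus` encoding are assumptions, not taken from the script):

```python
from urllib.parse import quote_plus

def search_page_url(base_url: str, query: str, page: int) -> str:
    """Build a search URL: page 1 is /search?s=query, later pages
    are /search/page-N?s=query (pattern per the pipeline notes)."""
    q = quote_plus(query)
    if page == 1:
        return f"{base_url}/search?s={q}"
    return f"{base_url}/search/page-{page}?s={q}"
```

With `--pages 3`, the scraper would request pages 1 through 3, stopping early if a page yields no article links.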
## Available Categories
### English (`--english`)
| Key | Display Name |
|-----|--------------|
| `top` | Top News |
| `business` | Business |
| `entertainment` | Entertainment |
| `sports` | Sports |
| `lifestyle` | Lifestyle |
| `technology` | Technology |
| `elections` | Elections |
### Hindi (`--hindi`)
| Key | Display Name |
|-----|--------------|
| `top` | Top News |
| `entertainment` | Entertainment |
| `sports` | Sports |
| `politics` | Politics |
| `latest` | Latest News |
| `technology` | Technology |
| `lifestyle` | Lifestyle |
| `business` | Business |
| `world` | World News |
| `crime` | Crime |
---
## How It Works
### Scraping Pipeline
```
1. LINK DISCOVERY
   |
   +-- Category mode: fetch category page, extract article URLs via regex
   +-- Search mode: fetch page 1 by default, or up to N pages via --pages
   |       page 1:  /search?s=query
   |       page 2+: /search/page-2?s=query, /search/page-3?s=query
   |       stops early if a page returns no article links
   |
   Concurrency: up to SCRAPING_MAX_WORKERS parallel workers (default: 10)
   |
2. CONTENT EXTRACTION
   |
   +-- English (EnglishScraper): finds <div class="abp-story-article"> or "article-content"
   +-- Hindi (HindiScraper): finds <div class="abp-story-detail"> or "story-detail"
   |
   Extracts: id, title, content (<p> tags), author, published_date, url
   Skips: /photo-gallery/ and /videos/ pages (no plain text)
   |
3. SAVE & UPLOAD
   |
   +-- JSON saved to articles/{language}/categories/{category}/{timestamp}.json
   +-- Search JSON saved to articles/{language}/search_queries/{query}/{timestamp}.json
   +-- Uploaded to Cloudinary as resource_type="raw"
```
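The parallel fetch in step 1 can be sketched with `concurrent.futures` (a minimal sketch, assuming the scraper exposes a `parse_article(url, category)` callable that returns a dict or `None`; the real worker-pool internals may differ):

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

# Worker count mirrors the SCRAPING_MAX_WORKERS setting described above.
MAX_WORKERS = int(os.environ.get("SCRAPING_MAX_WORKERS", "10"))

def scrape_all(links, parse_article, category):
    """Fetch every discovered link in parallel; drop failed articles (None)."""
    articles = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(parse_article, url, category): url for url in links}
        for fut in as_completed(futures):
            result = fut.result()
            if result is not None:
                articles.append(result)
    return articles
```

A thread pool fits here because the work is I/O-bound (HTTP fetches), so the GIL is not a bottleneck.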
### Article Link Patterns
- **English:** URLs matching `abplive.com/...-{numeric_id}` or ending in `.html`
- **Hindi:** URLs matching `abplive.com/.+-{6+ digit numeric ID}`, excluding photo-gallery and video paths
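The patterns above roughly correspond to regexes like the following (illustrative approximations; the exact expressions live in the scraper classes and may differ):

```python
import re

# Approximations of the link patterns described above.
ENGLISH_LINK_RE = re.compile(r"abplive\.com/.+(?:-\d+|\.html)$")
HINDI_LINK_RE = re.compile(r"abplive\.com/.+-\d{6,}$")
EXCLUDED = ("/photo-gallery/", "/videos/")

def is_hindi_article(url: str) -> bool:
    """Match Hindi article URLs while excluding gallery/video paths."""
    if any(part in url for part in EXCLUDED):
        return False
    return bool(HINDI_LINK_RE.search(url))
```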
---
## Output Structure
### Category Scraping
```
articles/
+-- english/
|   +-- categories/
|       +-- {category}/
|           +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
    +-- categories/
        +-- {category}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json
```
### Search Queries
```
articles/
+-- english/
|   +-- search_queries/
|       +-- {sanitized_query}/
|           +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
    +-- search_queries/
        +-- {sanitized_query}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json
```
**Timestamp examples:** `1_feb_2_30_pm` · `15_jan_9_45_am`
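The timestamp stem can be reproduced roughly like this (an assumed reconstruction: lowercase English month abbreviation, 12-hour clock, unpadded day and hour; whether minutes below 10 are zero-padded is not shown by the examples):

```python
from datetime import datetime

def timestamp_name(dt: datetime) -> str:
    """Build a filename stem like '1_feb_2_30_pm' from a datetime."""
    hour12 = dt.hour % 12 or 12          # 12-hour clock, 0 -> 12
    ampm = "am" if dt.hour < 12 else "pm"
    month = dt.strftime("%b").lower()    # 'Feb' -> 'feb' (C/English locale)
    return f"{dt.day}_{month}_{hour12}_{dt.minute:02d}_{ampm}"
```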
---
## Article Data Format
Each scraped article:
```json
{
  "id": "1827329",
  "language": "english",
  "category": "Sports",
  "title": "Article Title Here",
  "author": "Author Name",
  "published_date": "2026-02-01",
  "url": "https://news.abplive.com/sports/...",
  "content": "Full article text paragraph by paragraph...",
  "scraped_at": "2026-02-01T14:30:00.000000+00:00"
}
```
The `language` field (`"english"` or `"hindi"`) flows through every pipeline stage and is stored in Supabase for filtering.
---
## Configuration
| Setting | Environment Variable | Default |
|---------|----------------------|---------|
| Concurrent workers | `SCRAPING_MAX_WORKERS` | `10` |
| Request timeout | `SCRAPING_TIMEOUT` | `30` (seconds) |
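Both settings are presumably read from the environment with fallbacks, along these lines (a sketch; the actual parsing in `news-scrape.py` may differ):

```python
import os

# Fallbacks match the defaults in the table above.
MAX_WORKERS = int(os.environ.get("SCRAPING_MAX_WORKERS", "10"))
TIMEOUT = int(os.environ.get("SCRAPING_TIMEOUT", "30"))  # seconds
```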
---
## Error Handling
- Network timeouts are caught per article; failed articles are counted but do not abort the run
- Non-200 HTTP responses are logged and skipped
- Articles with no extractable `<h1>` title or content `<p>` tags are skipped
- Hindi photo-gallery and video URLs are excluded during link discovery
- Duplicate URLs within a single run are deduplicated with a `set`
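Taken together, the rules above amount to per-article failure isolation plus `set`-based deduplication, sketched here (a hypothetical helper; `parse_article` is assumed to raise on network errors and return `None` for skipped pages):

```python
def scrape_safely(links, parse_article, category):
    """Scrape each unique link, counting failures instead of aborting."""
    articles, failed = [], 0
    seen = set()  # duplicate URLs within a run are dropped
    for url in links:
        if url in seen:
            continue
        seen.add(url)
        try:
            article = parse_article(url, category)
        except Exception as exc:  # timeouts, connection errors, parse bugs
            print(f"failed: {url}: {exc}")
            failed += 1
            continue
        if article is None:  # non-200, missing <h1>/<p>, gallery/video page
            failed += 1
            continue
        articles.append(article)
    return articles, failed
```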
---
## Adding a New Language (Developer Guide)
### Step 1: Add a `LanguageConfig` in `news-scrape.py`
```python
_MR_BASE = "https://marathi.abplive.com"
MARATHI_CONFIG = LanguageConfig(
    base_url=_MR_BASE,
    categories={
        "top": {"name": "Top News", "url": f"{_MR_BASE}/news"},
        "sports": {"name": "Sports", "url": f"{_MR_BASE}/sports"},
    },
    search_url_tpl=None,  # set if the site supports search
    scraper_class_name="MarathiScraper",
    output_subfolder="marathi",
)
```
### Step 2: Register it in `LANGUAGE_CONFIGS`
```python
LANGUAGE_CONFIGS: Dict[str, LanguageConfig] = {
    "english": ENGLISH_CONFIG,
    "hindi": HINDI_CONFIG,
    "marathi": MARATHI_CONFIG,  # <- add here
}
```
### Step 3: Write a `Scraper` subclass
```python
class MarathiScraper(BaseScraper):
    def _extract_links(self, soup, src_url, is_search=False):
        # return Set[str] of article URLs
        ...

    def parse_article(self, link: str, category: str):
        # return a dict with keys: id, language, category, title,
        # author, published_date, url, content, scraped_at
        # or return None on failure
        ...
```
Register it in the factory dict at the bottom of `news-scrape.py`:
```python
_SCRAPER_CLASSES = {
    "EnglishScraper": EnglishScraper,
    "HindiScraper": HindiScraper,
    "MarathiScraper": MarathiScraper,  # <- add here
}
```
The `--marathi` CLI flag, output paths, Cloudinary upload, and all downstream pipeline steps then work automatically.