
Web Scraping Module

Overview

Multi-language news scraper for ABP Live. Scrapes articles from English (news.abplive.com) and Hindi (www.abplive.com) editions. Supports category-based scraping and keyword search with pagination.

Usage

# List available categories
python backend/web_scraping/news_scrape.py --english --list
python backend/web_scraping/news_scrape.py --hindi   --list

# Scrape by category
python backend/web_scraping/news_scrape.py --english --category sports
python backend/web_scraping/news_scrape.py --hindi   --category politics

# Search with pagination
python backend/web_scraping/news_scrape.py --english --search "climate change"
python backend/web_scraping/news_scrape.py --hindi   --search "पुणे" --pages 3

Output Structure

articles/
├── english/
│   ├── categories/{category}/{timestamp}.json
│   └── search_queries/{query}/{timestamp}.json
└── hindi/
    ├── categories/{category}/{timestamp}.json
    └── search_queries/{query}/{timestamp}.json

Timestamp format: {day}_{month}_{hour}_{minute}_{ampm} (e.g., 24_mar_10_37_pm)
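
The timestamp and path layout above can be sketched in a few lines of Python. This is a minimal illustration, not the module's actual code: the helper names `format_timestamp` and `category_output_path` are hypothetical, and the zero-padding and 12-hour clock are assumptions inferred from the single documented example (24_mar_10_37_pm).

```python
from datetime import datetime
from pathlib import Path

# Month abbreviations are spelled out so the result does not depend on the locale.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def format_timestamp(moment: datetime) -> str:
    """Return {day}_{month}_{hour}_{minute}_{ampm} using a 12-hour clock."""
    hour_12 = moment.hour % 12 or 12  # 0 -> 12, 13 -> 1, 22 -> 10
    ampm = "am" if moment.hour < 12 else "pm"
    month = MONTHS[moment.month - 1]
    # Zero-padding is an assumption; the documented example has no single-digit field.
    return f"{moment.day:02d}_{month}_{hour_12:02d}_{moment.minute:02d}_{ampm}"


def category_output_path(language: str, category: str, moment: datetime) -> Path:
    """Build articles/{language}/categories/{category}/{timestamp}.json."""
    filename = f"{format_timestamp(moment)}.json"
    return Path("articles") / language / "categories" / category / filename


if __name__ == "__main__":
    when = datetime(2026, 3, 24, 22, 37)
    print(format_timestamp(when))
    print(category_output_path("english", "sports", when).as_posix())
```

Search results would follow the same pattern with search_queries/{query} in place of categories/{category}.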

Article JSON Schema

{
    "id": "1827329",
    "language": "english",
    "category": "Sports",
    "title": "Article Title",
    "author": "Author Name",
    "published_date": "2026-02-01",
    "url": "https://news.abplive.com/...",
    "content": "Full article text...",
    "scraped_at": "2026-02-01T14:30:00+00:00"
}
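
A consumer of these files might check each record against the schema before using it. The sketch below is one way to do that with the standard library only; the `Article` dataclass and `parse_article` helper are hypothetical names, and treating every field as a required string is an assumption based on the sample above.

```python
from dataclasses import dataclass, fields
from datetime import date, datetime


@dataclass(frozen=True)
class Article:
    # Field names mirror the keys of the documented JSON sample.
    id: str
    language: str
    category: str
    title: str
    author: str
    published_date: str
    url: str
    content: str
    scraped_at: str


def parse_article(raw: dict) -> Article:
    """Validate one article record and return it as an Article."""
    expected = {f.name for f in fields(Article)}
    missing = expected - set(raw)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Both date fields in the sample are ISO 8601; these calls raise ValueError otherwise.
    date.fromisoformat(raw["published_date"])
    datetime.fromisoformat(raw["scraped_at"])
    return Article(**{name: raw[name] for name in expected})


if __name__ == "__main__":
    sample = {
        "id": "1827329",
        "language": "english",
        "category": "Sports",
        "title": "Article Title",
        "author": "Author Name",
        "published_date": "2026-02-01",
        "url": "https://news.abplive.com/...",
        "content": "Full article text...",
        "scraped_at": "2026-02-01T14:30:00+00:00",
    }
    print(parse_article(sample).title)
```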

Architecture

  • LanguageConfig: holds URLs, categories, and output paths per language
  • BaseScraper: shared logic for link fetching and search pagination
  • EnglishScraper / HindiScraper: language-specific article parsers

Adding a new language: create a LanguageConfig and a scraper subclass, then register both. No CLI changes are needed.
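
The registration pattern described above might look roughly like the following. Everything here is a hypothetical sketch: only the class names LanguageConfig and BaseScraper come from this README, while the field names, the `REGISTRY` dictionary, the `register` and `get_scraper` helpers, the "marathi" language, and the example.com URL are assumptions made for illustration. The real module may be organised differently.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Type


@dataclass(frozen=True)
class LanguageConfig:
    # Assumed fields; the README only says the config holds URLs, categories, output paths.
    name: str
    base_url: str
    categories: Dict[str, str]  # category name -> URL path segment
    output_dir: str


class BaseScraper:
    """Shared behaviour; subclasses supply the language-specific parser."""

    def __init__(self, config: LanguageConfig) -> None:
        self.config = config

    def category_url(self, category: str) -> str:
        return f"{self.config.base_url}/{self.config.categories[category]}"

    def parse_article(self, html: str) -> dict:
        raise NotImplementedError


# Hypothetical registry: a CLI could look languages up here instead of hard-coding them.
REGISTRY: Dict[str, Tuple[LanguageConfig, Type[BaseScraper]]] = {}


def register(config: LanguageConfig, scraper_cls: Type[BaseScraper]) -> None:
    REGISTRY[config.name] = (config, scraper_cls)


def get_scraper(language: str) -> BaseScraper:
    config, scraper_cls = REGISTRY[language]
    return scraper_cls(config)


class MarathiScraper(BaseScraper):
    def parse_article(self, html: str) -> dict:
        # Placeholder parser; a real one would extract title, author, and body.
        return {"language": self.config.name, "content": html.strip()}


register(
    LanguageConfig(
        name="marathi",
        base_url="https://marathi.example.com",  # placeholder domain
        categories={"sports": "sports"},
        output_dir="articles/marathi",
    ),
    MarathiScraper,
)

if __name__ == "__main__":
    print(get_scraper("marathi").category_url("sports"))
```

Because the lookup goes through a registry keyed by language name, a new language only needs its own config and subclass, which matches the claim that the CLI stays unchanged.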

Configuration

Variable               Default   Description
SCRAPING_MAX_WORKERS   10        Parallel HTTP workers
SCRAPING_TIMEOUT       30        Request timeout (seconds)
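
Assuming these settings are read from environment variables (the README does not say how they are supplied), a minimal sketch of loading them with the documented defaults could look like this. The helper name `read_int_setting` is hypothetical.

```python
import os


def read_int_setting(name: str, default: int) -> int:
    """Read a positive integer from the environment, falling back to a default."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {value}")
    return value


# Defaults match the table above.
MAX_WORKERS = read_int_setting("SCRAPING_MAX_WORKERS", 10)
TIMEOUT_SECONDS = read_int_setting("SCRAPING_TIMEOUT", 30)

if __name__ == "__main__":
    print(MAX_WORKERS, TIMEOUT_SECONDS)
```

Under that assumption, a run with fewer workers and a longer timeout would be started with SCRAPING_MAX_WORKERS=4 and SCRAPING_TIMEOUT=60 set in the shell before invoking the scraper.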