news-whisper-api / backend /web_scraping /scrape.md

Devang1290

feat: deploy News Whisper on-demand search API (FastAPI + Docker)

2cb327c 11 days ago

News Scraper — Multi-Language CLI

Overview

A command-line tool for scraping news from ABP Live in English and Hindi; the language is selected per run with a flag. Each language has its own LanguageConfig and Scraper subclass, so adding a new language requires only a config entry and a scraper class, with no changes to the CLI, file management, or upload logic.

| Flag | Source | Language |
| --- | --- | --- |
| `--english` | `news.abplive.com` | English |
| `--hindi` | `www.abplive.com` | Hindi |

Setup & Installation

Python 3.10 is required.

py -3.10 -m venv venv          # Windows (py launcher)
python3.10 -m venv venv        # Linux / macOS
.\venv\Scripts\activate        # Windows
source venv/bin/activate       # Linux / macOS

# Install uv, then install deps for your language
pip install uv
uv pip install -r requirements-ci-english.txt   # for English
uv pip install -r requirements-ci-hindi.txt     # for Hindi

Usage

A language flag (--english or --hindi) is always required.

List Available Categories

python backend/web-scraping/news-scrape.py --english --list
python backend/web-scraping/news-scrape.py --hindi   --list

Scrape a Category

python backend/web-scraping/news-scrape.py --english --category sports
python backend/web-scraping/news-scrape.py --english --category technology
python backend/web-scraping/news-scrape.py --hindi   --category politics
python backend/web-scraping/news-scrape.py --hindi   --category latest

Search

python backend/web-scraping/news-scrape.py --english --search "climate change"
python backend/web-scraping/news-scrape.py --english --search "stock market"
python backend/web-scraping/news-scrape.py --english --search "pune" --pages 3
python backend/web-scraping/news-scrape.py --hindi   --search "पुणे"
python backend/web-scraping/news-scrape.py --hindi   --search "पुणे" --pages 3

--pages (alias --page) is optional and defaults to 1. It sets how many search result pages are fetched.
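
The flags above could be wired with argparse roughly as follows. This is a minimal sketch, not the code in news-scrape.py: `build_parser` is a hypothetical helper, and treating `--list`, `--category`, and `--search` as mutually exclusive modes is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented above."""
    parser = argparse.ArgumentParser(prog="news-scrape.py")

    # Exactly one language flag is required.
    lang = parser.add_mutually_exclusive_group(required=True)
    lang.add_argument("--english", dest="language", action="store_const", const="english")
    lang.add_argument("--hindi", dest="language", action="store_const", const="hindi")

    # Assumption: the three modes exclude each other.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--list", action="store_true", help="list available categories")
    mode.add_argument("--category", help="category key, e.g. sports")
    mode.add_argument("--search", help="search query")

    # --page is accepted as an alias of --pages; the default is 1.
    parser.add_argument("--pages", "--page", dest="pages", type=int, default=1)
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--english", "--search", "pune", "--pages", "3"])
print(args.language, args.search, args.pages)
```

Using `store_const` with a shared `dest` keeps the chosen language in a single attribute, which is convenient when more languages are added later.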

Available Categories

English (`--english`)

| Key | Display Name |
| --- | --- |
| `top` | Top News |
| `business` | Business |
| `entertainment` | Entertainment |
| `sports` | Sports |
| `lifestyle` | Lifestyle |
| `technology` | Technology |
| `elections` | Elections |

Hindi (`--hindi`)

| Key | Display Name |
| --- | --- |
| `top` | Top News |
| `entertainment` | Entertainment |
| `sports` | Sports |
| `politics` | Politics |
| `latest` | Latest News |
| `technology` | Technology |
| `lifestyle` | Lifestyle |
| `business` | Business |
| `world` | World News |
| `crime` | Crime |

How It Works

Scraping Pipeline

1. LINK DISCOVERY
   |
   +-- Category mode: fetch category page, extract article URLs via regex
   +-- Search mode:   fetch page 1 by default, or up to N pages via --pages
   |                  page 1: /search?s=query
   |                  page 2+: /search/page-2?s=query, /search/page-3?s=query
   |                  stops early if a page returns no article links
   |
   Concurrency: up to SCRAPING_MAX_WORKERS parallel workers (default: 10)
   |
2. CONTENT EXTRACTION
   |
   +-- English (EnglishScraper): finds <div class="abp-story-article"> or "article-content"
   +-- Hindi   (HindiScraper):  finds <div class="abp-story-detail"> or "story-detail"
   |
   Extracts: id, title, content (<p> tags), author, published_date, url
   Skips: /photo-gallery/ and /videos/ pages (no plain text)
   |
3. SAVE & UPLOAD
   |
   +-- JSON saved to articles/{language}/categories/{category}/{timestamp}.json
   +-- Search JSON saved to articles/{language}/search_queries/{query}/{timestamp}.json
   +-- Uploaded to Cloudinary as resource_type="raw"
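
The search-mode pagination in step 1 can be sketched as below. The URL shapes come from the diagram above; the function names, the use of `quote_plus` for query encoding, and the `fetch_links` callback are assumptions made for illustration.

```python
from typing import Callable, List, Set
from urllib.parse import quote_plus


def search_page_urls(base_url: str, query: str, pages: int = 1) -> List[str]:
    """Page 1 is /search?s=..., later pages are /search/page-N?s=..."""
    encoded = quote_plus(query)  # assumption: query is form-encoded
    urls = []
    for page in range(1, max(pages, 1) + 1):
        path = "/search" if page == 1 else f"/search/page-{page}"
        urls.append(f"{base_url}{path}?s={encoded}")
    return urls


def discover_links(urls: List[str], fetch_links: Callable[[str], Set[str]]) -> Set[str]:
    """Collect article links page by page, stopping at the first empty page."""
    found: Set[str] = set()
    for url in urls:
        links = fetch_links(url)
        if not links:
            break  # a page without article links ends the search early
        found.update(links)
    return found


for url in search_page_urls("https://news.abplive.com", "climate change", pages=3):
    print(url)
```

Passing the fetch step in as a callback keeps the pagination logic testable without network access.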

Article Link Patterns

- English: URLs matching `abplive.com/...-{numeric_id}` or ending in `.html`
- Hindi: URLs matching `abplive.com/.+-{6+ digit numeric ID}`, excluding photo-gallery and video paths
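
As an illustration of these patterns, the following sketch reconstructs plausible regular expressions from the prose description. The exact expressions used by EnglishScraper and HindiScraper are not shown in this document, so the patterns and helper names here are assumptions.

```python
import re

# Assumed patterns, reconstructed from the description above.
ENGLISH_LINK = re.compile(r"abplive\.com/\S+(?:-\d+|\.html)$")
HINDI_LINK = re.compile(r"abplive\.com/.+-\d{6,}$")
HINDI_EXCLUDED = ("/photo-gallery/", "/videos/")


def is_english_article(url: str) -> bool:
    """True for URLs ending in -<numeric id> or .html."""
    return bool(ENGLISH_LINK.search(url))


def is_hindi_article(url: str) -> bool:
    """True for URLs ending in a 6+ digit id, outside gallery and video paths."""
    if any(part in url for part in HINDI_EXCLUDED):
        return False
    return bool(HINDI_LINK.search(url))


print(is_english_article("https://news.abplive.com/sports/some-story-1827329"))
print(is_hindi_article("https://www.abplive.com/photo-gallery/some-story-2871234"))
```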

Output Structure

Category Scraping

articles/
+-- english/
|   +-- categories/
|       +-- {category}/
|           +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
    +-- categories/
        +-- {category}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json

Search Queries

articles/
+-- english/
|   +-- search_queries/
|       +-- {sanitized_query}/
|           +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
    +-- search_queries/
        +-- {sanitized_query}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json

Timestamp examples: 1_feb_2_30_pm · 15_jan_9_45_am
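
The file naming above can be reproduced with a small helper. This is a sketch under assumptions: `timestamp_name` and `category_path` are hypothetical names, and zero-padding the minute (for example `2_05_pm`) is a guess, since both documented examples have two-digit minutes.

```python
from datetime import datetime
from pathlib import Path

# Fixed month names avoid locale-dependent strftime output.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def timestamp_name(now: datetime) -> str:
    """Format a datetime as {day}_{month}_{hour}_{minute}_{ampm}."""
    hour12 = now.hour % 12 or 12           # 0 and 12 both display as 12
    ampm = "am" if now.hour < 12 else "pm"
    month = MONTHS[now.month - 1]
    return f"{now.day}_{month}_{hour12}_{now.minute:02d}_{ampm}"


def category_path(language: str, category: str, now: datetime) -> Path:
    """Build articles/{language}/categories/{category}/{timestamp}.json."""
    return Path("articles") / language / "categories" / category / f"{timestamp_name(now)}.json"


print(category_path("english", "sports", datetime(2026, 2, 1, 14, 30)).as_posix())
```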

Article Data Format

Each scraped article:

{
  "id": "1827329",
  "language": "english",
  "category": "Sports",
  "title": "Article Title Here",
  "author": "Author Name",
  "published_date": "2026-02-01",
  "url": "https://news.abplive.com/sports/...",
  "content": "Full article text paragraph by paragraph...",
  "scraped_at": "2026-02-01T14:30:00.000000+00:00"
}

The language field ("english" or "hindi") flows through every pipeline stage and is stored in Supabase for filtering.
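
For reference, a record with this shape could be typed and built as follows. The field names come from the JSON example above; the `Article` type and `make_article` helper are hypothetical, and formatting `scraped_at` with microsecond precision in UTC is inferred from the sample value.

```python
from datetime import datetime, timezone
from typing import Optional, TypedDict


class Article(TypedDict):
    id: str
    language: str
    category: str
    title: str
    author: str
    published_date: str
    url: str
    content: str
    scraped_at: str


def make_article(id: str, language: str, category: str, title: str, author: str,
                 published_date: str, url: str, content: str,
                 now: Optional[datetime] = None) -> Article:
    """Assemble one article record; scraped_at is an ISO 8601 UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    return Article(
        id=id, language=language, category=category, title=title, author=author,
        published_date=published_date, url=url, content=content,
        scraped_at=now.isoformat(timespec="microseconds"),
    )


article = make_article(
    "1827329", "english", "Sports", "Article Title Here", "Author Name",
    "2026-02-01", "https://news.abplive.com/sports/...", "Full article text...",
    now=datetime(2026, 2, 1, 14, 30, tzinfo=timezone.utc),
)
print(article["scraped_at"])
```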

Configuration

| Setting | Environment Variable | Default |
| --- | --- | --- |
| Concurrent workers | `SCRAPING_MAX_WORKERS` | `10` |
| Request timeout | `SCRAPING_TIMEOUT` | `30` (seconds) |
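
A sketch of how these two settings might be read and applied is shown below. The variable names and defaults are from the table; `load_settings`, `run_parallel`, and the choice of a thread pool are assumptions, since the document only says "parallel workers".

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Mapping, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Read scraper settings from the environment, falling back to the defaults."""
    return {
        "max_workers": int(env.get("SCRAPING_MAX_WORKERS", "10")),
        "timeout": int(env.get("SCRAPING_TIMEOUT", "30")),  # seconds
    }


def run_parallel(func: Callable[[T], R], items: Iterable[T], max_workers: int) -> List[R]:
    """Apply func to each item using at most max_workers threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, items))


settings = load_settings({"SCRAPING_MAX_WORKERS": "4"})
print(settings, run_parallel(len, ["a", "bb", "ccc"], settings["max_workers"]))
```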

Error Handling

- Network timeouts are caught per article; failed articles are counted but do not abort the run.
- HTTP non-200 responses are logged and skipped.
- Articles with no extractable <h1> title or no content <p> tags are skipped.
- Hindi photo-gallery and video URLs are excluded during link discovery.
- Duplicate URLs within a single run are filtered out with a set.
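
The behaviour described above could look roughly like this loop. It is a simplified, sequential sketch: `scrape_all` is a hypothetical name, the built-in `TimeoutError` stands in for whatever timeout exception the HTTP client raises, and a `None` result represents a skipped article (non-200 response or missing title or content).

```python
from typing import Callable, Iterable, List, Optional, Tuple


def scrape_all(links: Iterable[str],
               parse: Callable[[str], Optional[dict]]) -> Tuple[List[dict], int, int]:
    """Return (articles, failed_count, skipped_count) without aborting on errors."""
    seen = set()                 # filters duplicate URLs within one run
    articles: List[dict] = []
    failed = skipped = 0
    for link in links:
        if link in seen:
            continue
        seen.add(link)
        try:
            article = parse(link)
        except TimeoutError:     # stand-in for the client's timeout exception
            failed += 1          # counted, but the run continues
            continue
        if article is None:      # non-200 response or nothing extractable
            skipped += 1
            continue
        articles.append(article)
    return articles, failed, skipped


def fake_parse(link: str) -> Optional[dict]:
    """Test double: one link times out, one is skipped, the rest succeed."""
    if link == "slow":
        raise TimeoutError
    return None if link == "empty" else {"url": link}


print(scrape_all(["a", "slow", "a", "empty", "b"], fake_parse))
```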

Adding a New Language (Developer Guide)

Step 1: Add a `LanguageConfig` in `news-scrape.py`

_MR_BASE = "https://marathi.abplive.com"

MARATHI_CONFIG = LanguageConfig(
    base_url=_MR_BASE,
    categories={
        "top":    {"name": "Top News", "url": f"{_MR_BASE}/news"},
        "sports": {"name": "Sports",   "url": f"{_MR_BASE}/sports"},
    },
    search_url_tpl=None,                 # set if the site supports search
    scraper_class_name="MarathiScraper",
    output_subfolder="marathi",
)
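
For orientation, the keyword arguments used in Step 1 suggest a config type along these lines. The field names are taken from the call above, but the dataclass itself and its type annotations are a hypothetical reconstruction, not the definition in news-scrape.py.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LanguageConfig:
    """Hypothetical shape, inferred from the keyword arguments in Step 1."""
    base_url: str
    categories: Dict[str, Dict[str, str]]   # key -> {"name": ..., "url": ...}
    search_url_tpl: Optional[str]           # None when search is unsupported
    scraper_class_name: str                 # looked up in the scraper registry
    output_subfolder: str                   # folder name under articles/


_MR_BASE = "https://marathi.abplive.com"

MARATHI_CONFIG = LanguageConfig(
    base_url=_MR_BASE,
    categories={
        "top":    {"name": "Top News", "url": f"{_MR_BASE}/news"},
        "sports": {"name": "Sports",   "url": f"{_MR_BASE}/sports"},
    },
    search_url_tpl=None,
    scraper_class_name="MarathiScraper",
    output_subfolder="marathi",
)

print(MARATHI_CONFIG.categories["sports"]["url"])
```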

Step 2: Register it in `LANGUAGE_CONFIGS`

LANGUAGE_CONFIGS: Dict[str, LanguageConfig] = {
    "english": ENGLISH_CONFIG,
    "hindi":   HINDI_CONFIG,
    "marathi": MARATHI_CONFIG,    # <- add here
}

Step 3: Write a `Scraper` subclass

class MarathiScraper(BaseScraper):
    def _extract_links(self, soup, src_url, is_search=False):
        # return Set[str] of article URLs
        ...

    def parse_article(self, link: str, category: str):
        # return Dict with keys: id, language, category, title,
        #   author, published_date, url, content, scraped_at
        # or return None on failure
        ...

# Register the class so that scraper_class_name in the config resolves to it
_SCRAPER_CLASSES = {
    "EnglishScraper": EnglishScraper,
    "HindiScraper":   HindiScraper,
    "MarathiScraper": MarathiScraper,   # <- add here
}

The --marathi CLI flag, output paths, Cloudinary upload, and all downstream pipeline steps work automatically.
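
One way such automatic behaviour can be achieved is to derive the CLI flags and the scraper instance from the two registries. The sketch below uses stand-in classes and plain dicts in place of the real LanguageConfig objects; `build_language_parser` and `make_scraper` are hypothetical names, and the actual mechanism in news-scrape.py may differ.

```python
import argparse
from typing import Dict


class BaseScraper:
    """Stand-in for the real base class."""
    def __init__(self, config: dict) -> None:
        self.config = config


class EnglishScraper(BaseScraper):
    pass


class MarathiScraper(BaseScraper):
    pass


# Simplified registries: plain dicts instead of LanguageConfig instances.
LANGUAGE_CONFIGS: Dict[str, dict] = {
    "english": {"scraper_class_name": "EnglishScraper", "output_subfolder": "english"},
    "marathi": {"scraper_class_name": "MarathiScraper", "output_subfolder": "marathi"},
}
_SCRAPER_CLASSES = {"EnglishScraper": EnglishScraper, "MarathiScraper": MarathiScraper}


def build_language_parser(configs: Dict[str, dict]) -> argparse.ArgumentParser:
    """Create one --<language> flag per registered config."""
    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group(required=True)
    for name in configs:
        group.add_argument(f"--{name}", dest="language", action="store_const", const=name)
    return parser


def make_scraper(language: str) -> BaseScraper:
    """Resolve the scraper class by the name stored in the language config."""
    config = LANGUAGE_CONFIGS[language]
    return _SCRAPER_CLASSES[config["scraper_class_name"]](config)


args = build_language_parser(LANGUAGE_CONFIGS).parse_args(["--marathi"])
print(type(make_scraper(args.language)).__name__)
```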