Web Scraping Module
Overview
Multi-language news scraper for ABP Live. Scrapes articles from English (news.abplive.com) and Hindi (www.abplive.com) editions. Supports category-based scraping and keyword search with pagination.
Usage
# List available categories
python backend/web_scraping/news_scrape.py --english --list
python backend/web_scraping/news_scrape.py --hindi --list
# Scrape by category
python backend/web_scraping/news_scrape.py --english --category sports
python backend/web_scraping/news_scrape.py --hindi --category politics
# Search with pagination
python backend/web_scraping/news_scrape.py --english --search "climate change"
python backend/web_scraping/news_scrape.py --hindi --search "पुणे" --pages 3
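The flags above could be parsed along these lines. This is a minimal sketch, not the module's actual parser: it uses only the flags shown in the commands, and the mutual-exclusion rules and the default of one page for `--pages` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="news_scrape.py")

    # Exactly one edition per run; both flags write to the same attribute.
    edition = parser.add_mutually_exclusive_group(required=True)
    edition.add_argument("--english", dest="language",
                         action="store_const", const="english")
    edition.add_argument("--hindi", dest="language",
                         action="store_const", const="hindi")

    # Assumption: a run either lists categories, scrapes one, or searches.
    action = parser.add_mutually_exclusive_group(required=True)
    action.add_argument("--list", action="store_true",
                        help="list available categories")
    action.add_argument("--category", help="category to scrape")
    action.add_argument("--search", help="keyword query")

    # Assumption: one page of search results unless told otherwise.
    parser.add_argument("--pages", type=int, default=1,
                        help="number of search result pages")
    return parser


args = build_parser().parse_args(["--hindi", "--search", "cricket", "--pages", "3"])
print(args.language, args.search, args.pages)
```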
Output Structure
articles/
├── english/
│   ├── categories/{category}/{timestamp}.json
│   └── search_queries/{query}/{timestamp}.json
└── hindi/
    ├── categories/{category}/{timestamp}.json
    └── search_queries/{query}/{timestamp}.json
Timestamp format: {day}_{month}_{hour}_{minute}_{ampm} (e.g., 24_mar_10_37_pm)
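As an illustration, an output path in this layout could be built as follows. The helper names `format_timestamp` and `output_path` are hypothetical, and the padding rules (day and hour unpadded, minutes two digits) are assumptions, since the single example does not settle them. How a search query is turned into a directory name is not covered here.

```python
from datetime import datetime
from pathlib import PurePosixPath

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def format_timestamp(moment: datetime) -> str:
    """Render {day}_{month}_{hour}_{minute}_{ampm} on a 12-hour clock."""
    hour12 = moment.hour % 12 or 12  # hour 0 becomes 12 am, hour 13 becomes 1 pm
    ampm = "am" if moment.hour < 12 else "pm"
    # Assumption: day and hour are not zero-padded, minutes are.
    return (f"{moment.day}_{MONTHS[moment.month - 1]}_"
            f"{hour12}_{moment.minute:02d}_{ampm}")


def output_path(language: str, kind: str, name: str,
                moment: datetime) -> PurePosixPath:
    """kind is 'categories' or 'search_queries'; name is the category or query."""
    filename = f"{format_timestamp(moment)}.json"
    return PurePosixPath("articles") / language / kind / name / filename


path = output_path("english", "categories", "sports",
                   datetime(2026, 3, 24, 22, 37))
print(path)  # articles/english/categories/sports/24_mar_10_37_pm.json
```

Month names are taken from a fixed list rather than `strftime("%b")` so the result does not depend on the system locale.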
Article JSON Schema
{
  "id": "1827329",
  "language": "english",
  "category": "Sports",
  "title": "Article Title",
  "author": "Author Name",
  "published_date": "2026-02-01",
  "url": "https://news.abplive.com/...",
  "content": "Full article text...",
  "scraped_at": "2026-02-01T14:30:00+00:00"
}
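Downstream code that reads these records may want to check them against the schema. The sketch below is one way to do that; the `Article` dataclass and `parse_article` helper are hypothetical and not part of the module. It assumes every field is a required string and that both date fields are ISO 8601, as in the example.

```python
import json
from dataclasses import dataclass, fields
from datetime import date, datetime


@dataclass(frozen=True)
class Article:
    id: str
    language: str
    category: str
    title: str
    author: str
    published_date: str  # ISO date, e.g. 2026-02-01
    url: str
    content: str
    scraped_at: str      # ISO 8601 timestamp with UTC offset


def parse_article(raw: str) -> Article:
    """Load one article record and reject incomplete or malformed ones."""
    data = json.loads(raw)
    expected = {f.name for f in fields(Article)}
    missing = expected - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    article = Article(**{name: data[name] for name in expected})
    # Both calls raise ValueError on malformed input.
    date.fromisoformat(article.published_date)
    datetime.fromisoformat(article.scraped_at)
    return article


sample = """{
  "id": "1827329", "language": "english", "category": "Sports",
  "title": "Article Title", "author": "Author Name",
  "published_date": "2026-02-01", "url": "https://news.abplive.com/...",
  "content": "Full article text...",
  "scraped_at": "2026-02-01T14:30:00+00:00"
}"""
article = parse_article(sample)
print(article.id, article.category)
```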
Architecture
- LanguageConfig: holds URLs, categories, and output paths per language
- BaseScraper: shared logic for link fetching and search pagination
- EnglishScraper / HindiScraper: language-specific article parsers
Adding a new language: create a LanguageConfig and a Scraper subclass, then register both. No CLI changes are needed.
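The pattern described here could look roughly like the sketch below. Only the names `LanguageConfig` and `BaseScraper` come from this README. The field names, the `REGISTRY` dictionary, the `register` and `get_scraper` helpers, and the `DemoScraper` class with its example.com URL are all hypothetical stand-ins, and the real classes may be organized differently.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LanguageConfig:
    name: str
    base_url: str
    output_dir: str
    # Assumed shape: category name mapped to a URL path.
    categories: dict[str, str] = field(default_factory=dict)


class BaseScraper(ABC):
    def __init__(self, config: LanguageConfig) -> None:
        self.config = config

    def category_url(self, category: str) -> str:
        """Shared logic: resolve a category name to a listing URL."""
        path = self.config.categories[category]
        return self.config.base_url.rstrip("/") + "/" + path.lstrip("/")

    @abstractmethod
    def parse_article(self, html: str) -> dict:
        """Language-specific: turn one article page into a record."""


# Hypothetical registry: language name -> (config, scraper class).
REGISTRY: dict[str, tuple[LanguageConfig, type[BaseScraper]]] = {}


def register(config: LanguageConfig, scraper_cls: type[BaseScraper]) -> None:
    REGISTRY[config.name] = (config, scraper_cls)


def get_scraper(language: str) -> BaseScraper:
    config, scraper_cls = REGISTRY[language]
    return scraper_cls(config)


class DemoScraper(BaseScraper):
    def parse_article(self, html: str) -> dict:
        # Placeholder; a real subclass would extract title, author, and so on.
        return {"language": self.config.name, "content": html.strip()}


register(LanguageConfig("demo", "https://example.com", "articles/demo",
                        {"sports": "/sports"}), DemoScraper)
scraper = get_scraper("demo")
print(scraper.category_url("sports"))  # https://example.com/sports
```

Because a CLI built this way would look languages up by name, registering a new pair is enough to make it available.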
Configuration
| Variable | Default | Description |
|---|---|---|
| SCRAPING_MAX_WORKERS | 10 | Parallel HTTP workers |
| SCRAPING_TIMEOUT | 30 | Request timeout (seconds) |
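Assuming these variables are read from the environment (the table does not say where they are set), the defaults could be applied as in this sketch. `ScrapingSettings` and `load_settings` are hypothetical names.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapingSettings:
    max_workers: int = 10  # SCRAPING_MAX_WORKERS
    timeout: int = 30      # SCRAPING_TIMEOUT, in seconds


def load_settings(env: Mapping[str, str] = os.environ) -> ScrapingSettings:
    """Read both variables, falling back to the documented defaults."""
    return ScrapingSettings(
        max_workers=int(env.get("SCRAPING_MAX_WORKERS", "10")),
        timeout=int(env.get("SCRAPING_TIMEOUT", "30")),
    )


print(load_settings({}))
print(load_settings({"SCRAPING_MAX_WORKERS": "4", "SCRAPING_TIMEOUT": "15"}))
```

Passing the mapping as a parameter keeps the function easy to test without touching the real environment.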