News Scraper — Multi-Language CLI
Overview
A command-line tool for scraping news from ABP Live in English and Hindi simultaneously. Each language has its own LanguageConfig and Scraper subclass. Adding a new language requires only a config entry and a scraper class — no changes to the CLI, file management, or upload logic.
| Flag | Source | Language |
|---|---|---|
| --english | news.abplive.com | English |
| --hindi | www.abplive.com | Hindi |
Setup & Installation
Python 3.10 is required.
py -3.10 -m venv venv
.\venv\Scripts\activate # Windows
source venv/bin/activate # Linux / macOS
# Install uv, then install deps for your language
pip install uv
uv pip install -r requirements-ci-english.txt # for English
uv pip install -r requirements-ci-hindi.txt # for Hindi
Usage
A language flag (--english or --hindi) is always required.
List Available Categories
python backend/web-scraping/news-scrape.py --english --list
python backend/web-scraping/news-scrape.py --hindi --list
Scrape a Category
python backend/web-scraping/news-scrape.py --english --category sports
python backend/web-scraping/news-scrape.py --english --category technology
python backend/web-scraping/news-scrape.py --hindi --category politics
python backend/web-scraping/news-scrape.py --hindi --category latest
Search
python backend/web-scraping/news-scrape.py --english --search "climate change"
python backend/web-scraping/news-scrape.py --english --search "stock market"
python backend/web-scraping/news-scrape.py --english --search "pune" --pages 3
python backend/web-scraping/news-scrape.py --hindi --search "पुणे"
python backend/web-scraping/news-scrape.py --hindi --search "पुणे" --pages 3
--pages / --page is optional and defaults to 1.
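The flag surface described above can be sketched with argparse. This is an illustration of the documented CLI (flag names from this README), not the tool's actual parser; `build_parser` is a hypothetical helper.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of the documented CLI surface.
    parser = argparse.ArgumentParser(description="ABP Live news scraper")
    # Exactly one language flag is always required.
    lang = parser.add_mutually_exclusive_group(required=True)
    lang.add_argument("--english", action="store_true", help="Scrape news.abplive.com")
    lang.add_argument("--hindi", action="store_true", help="Scrape www.abplive.com")
    # Exactly one mode: list categories, scrape a category, or search.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--list", action="store_true", help="List available categories")
    mode.add_argument("--category", help="Category key to scrape")
    mode.add_argument("--search", help="Search query")
    parser.add_argument("--pages", type=int, default=1,
                        help="Search pages to fetch (default: 1)")
    return parser

args = build_parser().parse_args(["--hindi", "--search", "पुणे", "--pages", "3"])
print(args.hindi, args.search, args.pages)  # True पुणे 3
```

Mutually exclusive groups give free validation: passing both `--english` and `--hindi`, or no language at all, exits with a usage error.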
Available Categories
English (--english)
| Key | Display Name |
|---|---|
| top | Top News |
| business | Business |
| entertainment | Entertainment |
| sports | Sports |
| lifestyle | Lifestyle |
| technology | Technology |
| elections | Elections |
Hindi (--hindi)
| Key | Display Name |
|---|---|
| top | Top News |
| entertainment | Entertainment |
| sports | Sports |
| politics | Politics |
| latest | Latest News |
| technology | Technology |
| lifestyle | Lifestyle |
| business | Business |
| world | World News |
| crime | Crime |
How It Works
Scraping Pipeline
1. LINK DISCOVERY
|
+-- Category mode: fetch category page, extract article URLs via regex
+-- Search mode: fetch page 1 by default, or up to N pages via --pages
| page 1: /search?s=query
| page 2+: /search/page-2?s=query, /search/page-3?s=query
| stops early if a page returns no article links
|
Concurrency: up to SCRAPING_MAX_WORKERS parallel workers (default: 10)
|
2. CONTENT EXTRACTION
|
+-- English (EnglishScraper): finds <div class="abp-story-article"> or "article-content"
+-- Hindi (HindiScraper): finds <div class="abp-story-detail"> or "story-detail"
|
Extracts: id, title, content (<p> tags), author, published_date, url
Skips: /photo-gallery/ and /videos/ pages (no plain text)
|
3. SAVE & UPLOAD
|
+-- JSON saved to articles/{language}/categories/{category}/{timestamp}.json
+-- Search JSON saved to articles/{language}/search_queries/{query}/{timestamp}.json
+-- Uploaded to Cloudinary as resource_type="raw"
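The parallel link fetch in step 1 might look like the following sketch using `ThreadPoolExecutor`. The `fetch_html` callable and the wiring of `SCRAPING_MAX_WORKERS` are assumptions for illustration, not the tool's actual code.

```python
import os
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = int(os.environ.get("SCRAPING_MAX_WORKERS", "10"))

def fetch_all(urls, fetch_html):
    # Fetch every URL in parallel, preserving input order.
    # A fetcher that fails returns None; those results are dropped.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        results = list(pool.map(fetch_html, urls))
    return [r for r in results if r is not None]

# Example with a stubbed fetcher that "fails" on u2:
html = fetch_all(["u1", "u2", "u3"],
                 lambda u: None if u == "u2" else f"<html>{u}</html>")
print(html)  # ['<html>u1</html>', '<html>u3</html>']
```

`pool.map` keeps results in input order, which makes deduplication and logging downstream deterministic.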
Article Link Patterns
- English: URLs matching abplive.com/...-{numeric_id} or ending in .html
- Hindi: URLs matching abplive.com/.+-{6+ digit numeric ID}, excluding photo-gallery and video paths
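A minimal sketch of the Hindi link filter described above. The regex and exclusion list here approximate the documented pattern; the exact expressions in news-scrape.py may differ.

```python
import re

# Assumed approximation of the Hindi pattern: a trailing numeric ID of
# six or more digits, with photo-gallery and video paths excluded
# (those pages carry no plain article text).
HINDI_LINK_RE = re.compile(r"abplive\.com/.+-\d{6,}$")
EXCLUDED_PATHS = ("/photo-gallery/", "/videos/")

def is_hindi_article(url: str) -> bool:
    if any(part in url for part in EXCLUDED_PATHS):
        return False
    return bool(HINDI_LINK_RE.search(url))

print(is_hindi_article("https://www.abplive.com/news/politics/some-story-2654321"))  # True
print(is_hindi_article("https://www.abplive.com/photo-gallery/some-story-2654321"))  # False
```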
Output Structure
Category Scraping
articles/
+-- english/
| +-- categories/
| +-- {category}/
| +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
+-- categories/
+-- {category}/
+-- {day}_{month}_{hour}_{minute}_{ampm}.json
Search Queries
articles/
+-- english/
+-- search_queries/
+-- {sanitized_query}/
+-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
+-- search_queries/
+-- {sanitized_query}/
+-- {day}_{month}_{hour}_{minute}_{ampm}.json
Timestamp examples: 1_feb_2_30_pm · 15_jan_9_45_am
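The timestamp names above could be produced as follows. strftime's zero-padding suppression flags are platform-specific (`%-I` on Linux/macOS, `%#I` on Windows), so this sketch formats the fields manually; `timestamp_name` and `output_path` are hypothetical helpers, not the tool's actual functions.

```python
from datetime import datetime
from pathlib import Path

def timestamp_name(dt: datetime) -> str:
    # Builds names like 1_feb_2_30_pm without platform-specific strftime flags.
    hour12 = dt.hour % 12 or 12
    ampm = "pm" if dt.hour >= 12 else "am"
    month = dt.strftime("%b").lower()
    return f"{dt.day}_{month}_{hour12}_{dt.minute:02d}_{ampm}"

def output_path(language: str, category: str, dt: datetime) -> Path:
    # Mirrors the articles/{language}/categories/{category}/ layout above.
    return Path("articles") / language / "categories" / category / f"{timestamp_name(dt)}.json"

print(timestamp_name(datetime(2026, 2, 1, 14, 30)))  # 1_feb_2_30_pm
print(timestamp_name(datetime(2026, 1, 15, 9, 45)))  # 15_jan_9_45_am
```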
Article Data Format
Each scraped article:
{
"id": "1827329",
"language": "english",
"category": "Sports",
"title": "Article Title Here",
"author": "Author Name",
"published_date": "2026-02-01",
"url": "https://news.abplive.com/sports/...",
"content": "Full article text paragraph by paragraph...",
"scraped_at": "2026-02-01T14:30:00.000000+00:00"
}
The language field ("english" or "hindi") flows through every pipeline stage and is stored in Supabase for filtering.
Configuration
| Setting | Environment Variable | Default |
|---|---|---|
| Concurrent workers | SCRAPING_MAX_WORKERS | 10 |
| Request timeout | SCRAPING_TIMEOUT | 30 (seconds) |
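Both settings can be read with the stdlib alone; a sketch assuming plain integer environment variables, with `env_int` as an illustrative helper.

```python
import os

def env_int(name: str, default: int) -> int:
    # Read an integer setting from the environment, falling back to the default.
    raw = os.environ.get(name)
    return int(raw) if raw else default

MAX_WORKERS = env_int("SCRAPING_MAX_WORKERS", 10)
TIMEOUT_SECONDS = env_int("SCRAPING_TIMEOUT", 30)
```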
Error Handling
- Network timeouts are caught per article; failed articles are counted but do not abort the run
- HTTP non-200 responses are logged and skipped
- Articles with no extractable <h1> title or content <p> tags are skipped
- Hindi photo-gallery and video URLs are excluded during link discovery
- Duplicate URLs within a single run are filtered using a set
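The containment rules above might look like this sketch: a per-link try/except keeps one bad article from aborting the run, and a set handles within-run deduplication. `scrape_all` and its failure counter are illustrative, not the tool's actual code.

```python
def scrape_all(links, parse_article):
    # Deduplicate within the run, contain per-article failures, keep counts.
    seen, articles, failed = set(), [], 0
    for link in links:
        if link in seen:
            continue
        seen.add(link)
        try:
            article = parse_article(link)
        except Exception:
            failed += 1          # network timeout, HTTP error, etc.
            continue
        if article is None:      # no extractable title or content
            failed += 1
            continue
        articles.append(article)
    return articles, failed

# Stub parser: the duplicate "a" is skipped, "c" counts as one failure.
arts, failed = scrape_all(["a", "a", "b", "c"],
                          lambda l: None if l == "c" else {"url": l})
print(len(arts), failed)  # 2 1
```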
Adding a New Language (Developer Guide)
Step 1: Add a LanguageConfig in news-scrape.py
_MR_BASE = "https://marathi.abplive.com"
MARATHI_CONFIG = LanguageConfig(
base_url=_MR_BASE,
categories={
"top": {"name": "Top News", "url": f"{_MR_BASE}/news"},
"sports": {"name": "Sports", "url": f"{_MR_BASE}/sports"},
},
search_url_tpl=None, # set if the site supports search
scraper_class_name="MarathiScraper",
output_subfolder="marathi",
)
Step 2: Register it in LANGUAGE_CONFIGS
LANGUAGE_CONFIGS: Dict[str, LanguageConfig] = {
"english": ENGLISH_CONFIG,
"hindi": HINDI_CONFIG,
"marathi": MARATHI_CONFIG, # <- add here
}
Step 3: Write a Scraper subclass
class MarathiScraper(BaseScraper):
def _extract_links(self, soup, src_url, is_search=False):
# return Set[str] of article URLs
...
def parse_article(self, link: str, category: str):
# return Dict with keys: id, language, category, title,
# author, published_date, url, content, scraped_at
# or return None on failure
...
Register in the factory dict at the bottom of news-scrape.py:
_SCRAPER_CLASSES = {
"EnglishScraper": EnglishScraper,
"HindiScraper": HindiScraper,
"MarathiScraper": MarathiScraper, # <- add here
}
The --marathi CLI flag, output paths, Cloudinary upload, and all downstream pipeline steps work automatically.