# News Scraper: Multi-Language CLI

## Overview

A command-line tool for scraping news from ABP Live in English and Hindi simultaneously. Each language has its own `LanguageConfig` and `Scraper` subclass. Adding a new language requires only a config entry and a scraper class, with no changes to the CLI, file management, or upload logic.

| Flag | Source | Language |
|------|--------|----------|
| `--english` | `news.abplive.com` | English |
| `--hindi` | `www.abplive.com` | Hindi |

---

## Setup & Installation

**Python 3.10 is required.**

```bash
py -3.10 -m venv venv
.\venv\Scripts\activate    # Windows
source venv/bin/activate   # Linux / macOS

# Install uv, then install deps for your language
pip install uv
uv pip install -r requirements-ci-english.txt   # for English
uv pip install -r requirements-ci-hindi.txt     # for Hindi
```

---

## Usage

A language flag (`--english` or `--hindi`) is **always required**.

### List Available Categories

```bash
python backend/web-scraping/news-scrape.py --english --list
python backend/web-scraping/news-scrape.py --hindi --list
```

### Scrape a Category

```bash
python backend/web-scraping/news-scrape.py --english --category sports
python backend/web-scraping/news-scrape.py --english --category technology
python backend/web-scraping/news-scrape.py --hindi --category politics
python backend/web-scraping/news-scrape.py --hindi --category latest
```

### Search

```bash
python backend/web-scraping/news-scrape.py --english --search "climate change"
python backend/web-scraping/news-scrape.py --english --search "stock market"
python backend/web-scraping/news-scrape.py --english --search "pune" --pages 3
python backend/web-scraping/news-scrape.py --hindi --search "पुणे"
python backend/web-scraping/news-scrape.py --hindi --search "पुणे" --pages 3
```

`--pages` / `--page` is optional and defaults to `1`.

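To illustrate what `--pages` controls, here is a minimal sketch of how the paginated search URLs might be built, following the URL pattern given under How It Works (`/search?s=query` for page 1, `/search/page-N?s=query` after that). The function name `search_page_urls` is hypothetical and the query encoding via `quote_plus` is an assumption; the real scraper may encode queries differently, and it also stops early when a page returns no article links, which this sketch does not model.

```python
from urllib.parse import quote_plus


def search_page_urls(base_url: str, query: str, pages: int = 1) -> list[str]:
    """Return the search result URLs for pages 1..pages (hypothetical helper)."""
    if pages < 1:
        raise ValueError("pages must be at least 1")
    # Assumption: the query is form-encoded (spaces become "+").
    encoded = quote_plus(query)
    urls = []
    for page in range(1, pages + 1):
        # Page 1 uses /search; later pages use /search/page-N.
        path = "/search" if page == 1 else f"/search/page-{page}"
        urls.append(f"{base_url}{path}?s={encoded}")
    return urls


if __name__ == "__main__":
    for url in search_page_urls("https://news.abplive.com", "pune", pages=3):
        print(url)
```

With `pages=3` this yields one URL per result page, which corresponds to running the CLI with `--search "pune" --pages 3`.
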
## Available Categories

### English (`--english`)

| Key | Display Name |
|-----|-------------|
| `top` | Top News |
| `business` | Business |
| `entertainment` | Entertainment |
| `sports` | Sports |
| `lifestyle` | Lifestyle |
| `technology` | Technology |
| `elections` | Elections |

### Hindi (`--hindi`)

| Key | Display Name |
|-----|-------------|
| `top` | Top News |
| `entertainment` | Entertainment |
| `sports` | Sports |
| `politics` | Politics |
| `latest` | Latest News |
| `technology` | Technology |
| `lifestyle` | Lifestyle |
| `business` | Business |
| `world` | World News |
| `crime` | Crime |

---

## How It Works

### Scraping Pipeline

```
1. LINK DISCOVERY
   |
   +-- Category mode: fetch category page, extract article URLs via regex
   +-- Search mode:   fetch page 1 by default, or up to N pages via --pages
   |     page 1:  /search?s=query
   |     page 2+: /search/page-2?s=query, /search/page-3?s=query
   |     stops early if a page returns no article links
   |
   Concurrency: up to SCRAPING_MAX_WORKERS parallel workers (default: 10)

2. CONTENT EXTRACTION
   |
   +-- English (EnglishScraper): finds the article body container or "article-content"
   +-- Hindi   (HindiScraper):   finds the article body container or "story-detail"
   |
   Extracts: id, title, content (<p> tags), author, published_date, url
   Skips:    /photo-gallery/ and /videos/ pages (no plain text)

3. SAVE & UPLOAD
   |
   +-- JSON saved to articles/{language}/categories/{category}/{timestamp}.json
   +-- Search JSON saved to articles/{language}/search_queries/{query}/{timestamp}.json
   +-- Uploaded to Cloudinary as resource_type="raw"
```

### Article Link Patterns

- **English:** URLs matching `abplive.com/...-{numeric_id}` or ending in `.html`
- **Hindi:** URLs matching `abplive.com/.+-{6+ digit numeric ID}`, excluding photo-gallery and video paths

---

## Output Structure

### Category Scraping

```
articles/
+-- english/
|   +-- categories/
|       +-- {category}/
|           +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
    +-- categories/
        +-- {category}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json
```

### Search Queries

```
articles/
+-- english/
|   +-- search_queries/
|       +-- {sanitized_query}/
|           +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
    +-- search_queries/
        +-- {sanitized_query}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json
```

**Timestamp examples:** `1_feb_2_30_pm` · `15_jan_9_45_am`

---

## Article Data Format

Each scraped article:

```json
{
  "id": "1827329",
  "language": "english",
  "category": "Sports",
  "title": "Article Title Here",
  "author": "Author Name",
  "published_date": "2026-02-01",
  "url": "https://news.abplive.com/sports/...",
  "content": "Full article text paragraph by paragraph...",
  "scraped_at": "2026-02-01T14:30:00.000000+00:00"
}
```

The `language` field (`"english"` or `"hindi"`) flows through every pipeline stage and is stored in Supabase for filtering.

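The timestamped file names described under Output Structure can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the tool's actual code: the helper names `timestamp_name` and `category_output_path` are hypothetical, and zero-padding the minutes and using the category key as the folder name are both assumptions, since the documented examples do not settle either point.

```python
from datetime import datetime
from pathlib import PurePosixPath

# Lowercase month abbreviations, kept explicit to avoid locale-dependent strftime.
_MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
           "jul", "aug", "sep", "oct", "nov", "dec")


def timestamp_name(when: datetime) -> str:
    """Format a datetime as {day}_{month}_{hour}_{minute}_{ampm} (hypothetical helper)."""
    hour12 = when.hour % 12 or 12          # 0 -> 12, 13 -> 1, ...
    ampm = "am" if when.hour < 12 else "pm"
    # Assumption: minutes are zero-padded to two digits.
    return f"{when.day}_{_MONTHS[when.month - 1]}_{hour12}_{when.minute:02d}_{ampm}"


def category_output_path(language: str, category: str, when: datetime) -> PurePosixPath:
    """Build articles/{language}/categories/{category}/{timestamp}.json."""
    # Assumption: {category} is the category key, e.g. "sports".
    return PurePosixPath("articles", language, "categories", category,
                         f"{timestamp_name(when)}.json")


if __name__ == "__main__":
    print(category_output_path("english", "sports", datetime(2026, 2, 1, 14, 30)))
```

Search results would follow the same shape with `search_queries/{sanitized_query}` in place of `categories/{category}`; how the query is sanitized is not specified here, so it is left out of the sketch.
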
---

## Configuration

| Setting | Environment Variable | Default |
|---------|---------------------|---------|
| Concurrent workers | `SCRAPING_MAX_WORKERS` | `10` |
| Request timeout | `SCRAPING_TIMEOUT` | `30` (seconds) |

---

## Error Handling

- Network timeouts are caught per article; failed articles are counted but do not abort the run
- HTTP non-200 responses are logged and skipped
- Articles with no extractable title or no content paragraphs (`<p>` tags) are skipped
- Hindi photo-gallery and video URLs are excluded during link discovery
- Duplicate URLs within a single run are filtered out using a `set`

---

## Adding a New Language (Developer Guide)

### Step 1: Add a `LanguageConfig` in `news-scrape.py`

```python
_MR_BASE = "https://marathi.abplive.com"

MARATHI_CONFIG = LanguageConfig(
    base_url=_MR_BASE,
    categories={
        "top":    {"name": "Top News", "url": f"{_MR_BASE}/news"},
        "sports": {"name": "Sports",   "url": f"{_MR_BASE}/sports"},
    },
    search_url_tpl=None,  # set if the site supports search
    scraper_class_name="MarathiScraper",
    output_subfolder="marathi",
)
```

### Step 2: Register it in `LANGUAGE_CONFIGS`

```python
LANGUAGE_CONFIGS: Dict[str, LanguageConfig] = {
    "english": ENGLISH_CONFIG,
    "hindi": HINDI_CONFIG,
    "marathi": MARATHI_CONFIG,  # <- add here
}
```

### Step 3: Write a `Scraper` subclass

```python
class MarathiScraper(BaseScraper):
    def _extract_links(self, soup, src_url, is_search=False):
        # return Set[str] of article URLs
        ...

    def parse_article(self, link: str, category: str):
        # return Dict with keys: id, language, category, title,
        # author, published_date, url, content, scraped_at
        # or return None on failure
        ...
```

Register the class in the factory dict at the bottom of `news-scrape.py`:

```python
_SCRAPER_CLASSES = {
    "EnglishScraper": EnglishScraper,
    "HindiScraper": HindiScraper,
    "MarathiScraper": MarathiScraper,  # <- add here
}
```

Once registered, the `--marathi` CLI flag, output paths, Cloudinary upload, and all downstream pipeline steps work automatically.
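
As a starting point for the link-filtering part of `_extract_links` in a new scraper, the sketch below applies the documented Hindi link pattern (a URL ending in a hyphen and a numeric ID of six or more digits, with photo-gallery and video paths excluded) to a list of `href` values and deduplicates them with a `set`. The function name `filter_article_links` and the sample URLs are made up for illustration, and whether a new site such as the Marathi one follows the same URL pattern is an assumption to verify against real pages.

```python
import re
from typing import Iterable, Set

# Modeled on the documented Hindi pattern: abplive.com/.+-{6+ digit numeric ID}
_ARTICLE_RE = re.compile(r"abplive\.com/.+-\d{6,}$")

# Paths skipped because they carry no plain article text.
_EXCLUDED_PARTS = ("/photo-gallery/", "/videos/")


def filter_article_links(hrefs: Iterable[str]) -> Set[str]:
    """Keep only article-like URLs; the set removes duplicates (hypothetical helper)."""
    links: Set[str] = set()
    for href in hrefs:
        if any(part in href for part in _EXCLUDED_PARTS):
            continue  # photo galleries and videos are excluded
        if _ARTICLE_RE.search(href):
            links.add(href)
    return links


if __name__ == "__main__":
    sample = [
        "https://www.abplive.com/sports/cricket/sample-story-2891234",
        "https://www.abplive.com/sports/cricket/sample-story-2891234",  # duplicate
        "https://www.abplive.com/photo-gallery/sports/sample-gallery-2891235",
        "https://www.abplive.com/sports",  # category page, no article ID
    ]
    print(filter_article_links(sample))
```

Inside a real `_extract_links`, the `href` values would come from the anchors in the parsed page passed in as `soup`; that extraction step is omitted here to keep the sketch free of third-party imports.
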