# News Scraper — Multi-Language CLI
## Overview
A command-line tool for scraping news from ABP Live in English and Hindi. Each language has its own `LanguageConfig` and `Scraper` subclass. Adding a new language requires only a config entry and a scraper class — no changes to the CLI, file management, or upload logic.

| Flag | Source | Language |
|------|--------|----------|
| `--english` | `news.abplive.com` | English |
| `--hindi` | `www.abplive.com` | Hindi |
---
## Setup & Installation
**Python 3.10 is required.**
```bash
py -3.10 -m venv venv
.\venv\Scripts\activate    # Windows
source venv/bin/activate   # Linux / macOS
# Install uv, then install deps for your language
pip install uv
uv pip install -r requirements-ci-english.txt  # for English
uv pip install -r requirements-ci-hindi.txt    # for Hindi
```
---
## Usage
A language flag (`--english` or `--hindi`) is **always required**.
### List Available Categories
```bash
python backend/web-scraping/news-scrape.py --english --list
python backend/web-scraping/news-scrape.py --hindi --list
```
### Scrape a Category
```bash
python backend/web-scraping/news-scrape.py --english --category sports
python backend/web-scraping/news-scrape.py --english --category technology
python backend/web-scraping/news-scrape.py --hindi --category politics
python backend/web-scraping/news-scrape.py --hindi --category latest
```
### Search
```bash
python backend/web-scraping/news-scrape.py --english --search "climate change"
python backend/web-scraping/news-scrape.py --english --search "stock market"
python backend/web-scraping/news-scrape.py --english --search "pune" --pages 3
python backend/web-scraping/news-scrape.py --hindi --search "पुणे"
python backend/web-scraping/news-scrape.py --hindi --search "पुणे" --pages 3
```
`--pages` / `--page` is optional and defaults to `1`.
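As a hedged sketch of how a `--pages` value expands into request URLs (the page-URL pattern is described under How It Works; the helper name and `quote_plus` encoding are assumptions, not taken from the script):

```python
from urllib.parse import quote_plus

def search_page_url(base_url: str, query: str, page: int) -> str:
    """Build a search URL: page 1 is /search?s=query, later pages
    are /search/page-N?s=query (pattern per the pipeline notes)."""
    q = quote_plus(query)
    if page == 1:
        return f"{base_url}/search?s={q}"
    return f"{base_url}/search/page-{page}?s={q}"
```

With `--pages 3`, the scraper would request pages 1 through 3, stopping early if a page yields no article links.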
## Available Categories
### English (`--english`)
| Key | Display Name |
|-----|--------------|
| `top` | Top News |
| `business` | Business |
| `entertainment` | Entertainment |
| `sports` | Sports |
| `lifestyle` | Lifestyle |
| `technology` | Technology |
| `elections` | Elections |
### Hindi (`--hindi`)
| Key | Display Name |
|-----|--------------|
| `top` | Top News |
| `entertainment` | Entertainment |
| `sports` | Sports |
| `politics` | Politics |
| `latest` | Latest News |
| `technology` | Technology |
| `lifestyle` | Lifestyle |
| `business` | Business |
| `world` | World News |
| `crime` | Crime |
---
## How It Works
### Scraping Pipeline
```
1. LINK DISCOVERY
   |
   +-- Category mode: fetch category page, extract article URLs via regex
   +-- Search mode: fetch page 1 by default, or up to N pages via --pages
   |       page 1:  /search?s=query
   |       page 2+: /search/page-2?s=query, /search/page-3?s=query
   |       stops early if a page returns no article links
   |
   Concurrency: up to SCRAPING_MAX_WORKERS parallel workers (default: 10)
   |
2. CONTENT EXTRACTION
   |
   +-- English (EnglishScraper): finds <div class="abp-story-article"> or "article-content"
   +-- Hindi (HindiScraper): finds <div class="abp-story-detail"> or "story-detail"
   |
   Extracts: id, title, content (<p> tags), author, published_date, url
   Skips: /photo-gallery/ and /videos/ pages (no plain text)
   |
3. SAVE & UPLOAD
   |
   +-- JSON saved to articles/{language}/categories/{category}/{timestamp}.json
   +-- Search JSON saved to articles/{language}/search_queries/{query}/{timestamp}.json
   +-- Uploaded to Cloudinary as resource_type="raw"
```
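The parallel fetch in step 1 can be sketched with `concurrent.futures` (a minimal sketch, assuming the scraper exposes a `parse_article(url, category)` callable that returns a dict or `None`; the real worker-pool internals may differ):

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

# Worker count mirrors the SCRAPING_MAX_WORKERS setting described above.
MAX_WORKERS = int(os.environ.get("SCRAPING_MAX_WORKERS", "10"))

def scrape_all(links, parse_article, category):
    """Fetch every discovered link in parallel; drop failed articles (None)."""
    articles = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(parse_article, url, category): url for url in links}
        for fut in as_completed(futures):
            result = fut.result()
            if result is not None:
                articles.append(result)
    return articles
```

A thread pool fits here because the work is I/O-bound (HTTP fetches), so the GIL is not a bottleneck.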
### Article Link Patterns
- **English:** URLs matching `abplive.com/...-{numeric_id}` or ending in `.html`
- **Hindi:** URLs matching `abplive.com/.+-{6+ digit numeric ID}`, excluding photo-gallery and video paths
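The patterns above roughly correspond to regexes like the following (illustrative approximations; the exact expressions live in the scraper classes and may differ):

```python
import re

# Approximations of the link patterns described above.
ENGLISH_LINK_RE = re.compile(r"abplive\.com/.+(?:-\d+|\.html)$")
HINDI_LINK_RE = re.compile(r"abplive\.com/.+-\d{6,}$")
EXCLUDED = ("/photo-gallery/", "/videos/")

def is_hindi_article(url: str) -> bool:
    """Match Hindi article URLs while excluding gallery/video paths."""
    if any(part in url for part in EXCLUDED):
        return False
    return bool(HINDI_LINK_RE.search(url))
```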
---
## Output Structure
### Category Scraping
```
articles/
+-- english/
|   +-- categories/
|       +-- {category}/
|           +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
    +-- categories/
        +-- {category}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json
```
### Search Queries
```
articles/
+-- english/
|   +-- search_queries/
|       +-- {sanitized_query}/
|           +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
    +-- search_queries/
        +-- {sanitized_query}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json
```
**Timestamp examples:** `1_feb_2_30_pm` · `15_jan_9_45_am`
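The timestamp stem can be reproduced roughly like this (an assumed reconstruction: lowercase English month abbreviation, 12-hour clock, unpadded day and hour; whether minutes below 10 are zero-padded is not shown by the examples):

```python
from datetime import datetime

def timestamp_name(dt: datetime) -> str:
    """Build a filename stem like '1_feb_2_30_pm' from a datetime."""
    hour12 = dt.hour % 12 or 12          # 12-hour clock, 0 -> 12
    ampm = "am" if dt.hour < 12 else "pm"
    month = dt.strftime("%b").lower()    # 'Feb' -> 'feb' (C/English locale)
    return f"{dt.day}_{month}_{hour12}_{dt.minute:02d}_{ampm}"
```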
---
## Article Data Format
Each scraped article:
```json
{
  "id": "1827329",
  "language": "english",
  "category": "Sports",
  "title": "Article Title Here",
  "author": "Author Name",
  "published_date": "2026-02-01",
  "url": "https://news.abplive.com/sports/...",
  "content": "Full article text paragraph by paragraph...",
  "scraped_at": "2026-02-01T14:30:00.000000+00:00"
}
```
The `language` field (`"english"` or `"hindi"`) flows through every pipeline stage and is stored in Supabase for filtering.
---
## Configuration
| Setting | Environment Variable | Default |
|---------|----------------------|---------|
| Concurrent workers | `SCRAPING_MAX_WORKERS` | `10` |
| Request timeout | `SCRAPING_TIMEOUT` | `30` (seconds) |
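Both settings are presumably read from the environment with fallbacks, along these lines (a sketch; the actual parsing in `news-scrape.py` may differ):

```python
import os

# Fallbacks match the defaults in the table above.
MAX_WORKERS = int(os.environ.get("SCRAPING_MAX_WORKERS", "10"))
TIMEOUT = int(os.environ.get("SCRAPING_TIMEOUT", "30"))  # seconds
```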
---
## Error Handling
- Network timeouts are caught per article; failed articles are counted but do not abort the run
- Non-200 HTTP responses are logged and skipped
- Articles with no extractable `<h1>` title or content `<p>` tags are skipped
- Hindi photo-gallery and video URLs are excluded during link discovery
- Duplicate URLs within a single run are deduplicated with a `set`
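Taken together, the rules above amount to per-article failure isolation plus `set`-based deduplication, sketched here (a hypothetical helper; `parse_article` is assumed to raise on network errors and return `None` for skipped pages):

```python
def scrape_safely(links, parse_article, category):
    """Scrape each unique link, counting failures instead of aborting."""
    articles, failed = [], 0
    seen = set()  # duplicate URLs within a run are dropped
    for url in links:
        if url in seen:
            continue
        seen.add(url)
        try:
            article = parse_article(url, category)
        except Exception as exc:  # timeouts, connection errors, parse bugs
            print(f"failed: {url}: {exc}")
            failed += 1
            continue
        if article is None:  # non-200, missing <h1>/<p>, gallery/video page
            failed += 1
            continue
        articles.append(article)
    return articles, failed
```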
---
## Adding a New Language (Developer Guide)
### Step 1: Add a `LanguageConfig` in `news-scrape.py`
```python
_MR_BASE = "https://marathi.abplive.com"
MARATHI_CONFIG = LanguageConfig(
    base_url=_MR_BASE,
    categories={
        "top": {"name": "Top News", "url": f"{_MR_BASE}/news"},
        "sports": {"name": "Sports", "url": f"{_MR_BASE}/sports"},
    },
    search_url_tpl=None,  # set if the site supports search
    scraper_class_name="MarathiScraper",
    output_subfolder="marathi",
)
```
### Step 2: Register it in `LANGUAGE_CONFIGS`
```python
LANGUAGE_CONFIGS: Dict[str, LanguageConfig] = {
    "english": ENGLISH_CONFIG,
    "hindi": HINDI_CONFIG,
    "marathi": MARATHI_CONFIG,  # <- add here
}
```
### Step 3: Write a `Scraper` subclass
```python
class MarathiScraper(BaseScraper):
    def _extract_links(self, soup, src_url, is_search=False):
        # return Set[str] of article URLs
        ...

    def parse_article(self, link: str, category: str):
        # return a dict with keys: id, language, category, title,
        # author, published_date, url, content, scraped_at
        # or return None on failure
        ...
```
Register it in the factory dict at the bottom of `news-scrape.py`:
```python
_SCRAPER_CLASSES = {
    "EnglishScraper": EnglishScraper,
    "HindiScraper": HindiScraper,
    "MarathiScraper": MarathiScraper,  # <- add here
}
```
The `--marathi` CLI flag, output paths, Cloudinary upload, and all downstream pipeline steps then work automatically.