       or "article-content"
   +-- Hindi (HindiScraper): finds
       or "story-detail"
   |
   Extracts: id, title, content (<p> tags), author, published_date, url
   Skips:    /photo-gallery/ and /videos/ pages (no plain text)
   |
3. SAVE & UPLOAD
   |
   +-- JSON saved to articles/{language}/categories/{category}/{timestamp}.json
   +-- Search JSON saved to articles/{language}/search_queries/{query}/{timestamp}.json
   +-- Uploaded to Cloudinary as resource_type="raw"
```

### Article Link Patterns

- **English:** URLs matching `abplive.com/...-{numeric_id}` or ending in `.html`
- **Hindi:** URLs matching `abplive.com/.+-{6+ digit numeric ID}`, excluding photo-gallery and video paths

---

## Output Structure

### Category Scraping

```
articles/
+-- english/
|   +-- categories/
|       +-- {category}/
|           +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
    +-- categories/
        +-- {category}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json
```

### Search Queries

```
articles/
+-- english/
|   +-- search_queries/
|       +-- {sanitized_query}/
|           +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
    +-- search_queries/
        +-- {sanitized_query}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json
```

**Timestamp examples:** `1_feb_2_30_pm` · `15_jan_9_45_am`

---

## Article Data Format

Each scraped article:

```json
{
  "id": "1827329",
  "language": "english",
  "category": "Sports",
  "title": "Article Title Here",
  "author": "Author Name",
  "published_date": "2026-02-01",
  "url": "https://news.abplive.com/sports/...",
  "content": "Full article text paragraph by paragraph...",
  "scraped_at": "2026-02-01T14:30:00.000000+00:00"
}
```

The `language` field (`"english"` or `"hindi"`) flows through every pipeline stage and is stored in Supabase for filtering.

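
The output path convention described under Output Structure can be sketched as follows. This is a minimal illustration, not the code in `news-scrape.py`: the helper names `timestamp_slug` and `category_output_path` are hypothetical, and zero-padding the minute is an assumption, since both documented examples use two-digit minutes.

```python
from datetime import datetime
from pathlib import Path

# Fixed abbreviations keep file names independent of the system locale.
_MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
           "jul", "aug", "sep", "oct", "nov", "dec")


def timestamp_slug(now: datetime) -> str:
    """Build a {day}_{month}_{hour}_{minute}_{ampm} slug (hypothetical helper)."""
    hour12 = now.hour % 12 or 12          # map 0 -> 12 and 13 -> 1
    ampm = "am" if now.hour < 12 else "pm"
    month = _MONTHS[now.month - 1]
    # Assumption: minutes are zero-padded to two digits.
    return f"{now.day}_{month}_{hour12}_{now.minute:02d}_{ampm}"


def category_output_path(language: str, category: str, now: datetime,
                         root: Path = Path("articles")) -> Path:
    """Return articles/{language}/categories/{category}/{timestamp}.json."""
    return root / language / "categories" / category / f"{timestamp_slug(now)}.json"


if __name__ == "__main__":
    print(category_output_path("english", "sports", datetime(2026, 2, 1, 14, 30)))
```

With a run at 2:30 pm on 1 February, the slug matches the documented example `1_feb_2_30_pm`. Search-query paths would follow the same shape with `search_queries/{sanitized_query}` in place of `categories/{category}`; the sanitization rule itself is not specified here.
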
---

## Configuration

| Setting | Environment Variable | Default |
|---------|---------------------|---------|
| Concurrent workers | `SCRAPING_MAX_WORKERS` | `10` |
| Request timeout | `SCRAPING_TIMEOUT` | `30` (seconds) |

---

## Error Handling

- Network timeouts are caught per article; failed articles are counted but do not abort the run
- Non-200 HTTP responses are logged and skipped
- Articles with no extractable title or content `<p>` tags are skipped
- Hindi photo-gallery and video URLs are excluded during link discovery
- Duplicate URLs within a single run are filtered out with a `set`

---

## Adding a New Language (Developer Guide)

### Step 1: Add a `LanguageConfig` in `news-scrape.py`

```python
_MR_BASE = "https://marathi.abplive.com"

MARATHI_CONFIG = LanguageConfig(
    base_url=_MR_BASE,
    categories={
        "top": {"name": "Top News", "url": f"{_MR_BASE}/news"},
        "sports": {"name": "Sports", "url": f"{_MR_BASE}/sports"},
    },
    search_url_tpl=None,  # set if the site supports search
    scraper_class_name="MarathiScraper",
    output_subfolder="marathi",
)
```

### Step 2: Register it in `LANGUAGE_CONFIGS`

```python
LANGUAGE_CONFIGS: Dict[str, LanguageConfig] = {
    "english": ENGLISH_CONFIG,
    "hindi": HINDI_CONFIG,
    "marathi": MARATHI_CONFIG,  # <- add here
}
```

### Step 3: Write a `BaseScraper` subclass

```python
class MarathiScraper(BaseScraper):
    def _extract_links(self, soup, src_url, is_search=False):
        # return Set[str] of article URLs
        ...

    def parse_article(self, link: str, category: str):
        # return Dict with keys: id, language, category, title,
        # author, published_date, url, content, scraped_at
        # or return None on failure
        ...
```

Then register the class in the factory dict at the bottom of `news-scrape.py`:

```python
_SCRAPER_CLASSES = {
    "EnglishScraper": EnglishScraper,
    "HindiScraper": HindiScraper,
    "MarathiScraper": MarathiScraper,  # <- add here
}
```

Once these three steps are done, the `--marathi` CLI flag, output paths, Cloudinary upload, and all downstream pipeline steps work automatically.
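
To show why the `scraper_class_name` string and the factory dict must agree, here is a self-contained sketch of a name-based lookup. It is an illustration under assumptions, not the project's code: the `LanguageConfig` and `BaseScraper` definitions below are simplified stand-ins, and `make_scraper` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Type


@dataclass(frozen=True)
class LanguageConfig:
    """Simplified stand-in for the real LanguageConfig (fields as used in Step 1)."""
    base_url: str
    categories: Dict[str, Dict[str, str]]
    search_url_tpl: Optional[str]
    scraper_class_name: str
    output_subfolder: str


class BaseScraper:
    """Stand-in base class; the real one also implements fetching and saving."""

    def __init__(self, config: LanguageConfig) -> None:
        self.config = config


class MarathiScraper(BaseScraper):
    pass


_SCRAPER_CLASSES: Dict[str, Type[BaseScraper]] = {
    "MarathiScraper": MarathiScraper,
}


def make_scraper(config: LanguageConfig) -> BaseScraper:
    """Hypothetical factory: resolve the class by name and instantiate it."""
    try:
        scraper_cls = _SCRAPER_CLASSES[config.scraper_class_name]
    except KeyError:
        # A config whose class was never registered fails here, not mid-scrape.
        raise ValueError(
            f"No scraper registered as {config.scraper_class_name!r}"
        ) from None
    return scraper_cls(config)


_MR_BASE = "https://marathi.abplive.com"
MARATHI_CONFIG = LanguageConfig(
    base_url=_MR_BASE,
    categories={"top": {"name": "Top News", "url": f"{_MR_BASE}/news"}},
    search_url_tpl=None,
    scraper_class_name="MarathiScraper",
    output_subfolder="marathi",
)

if __name__ == "__main__":
    scraper = make_scraper(MARATHI_CONFIG)
    print(type(scraper).__name__, scraper.config.output_subfolder)
```

In a lookup of this kind, a typo in `scraper_class_name` or a missing entry in `_SCRAPER_CLASSES` surfaces as an immediate error, which is why the registration at the end of Step 3 should not be skipped.
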