# News Scraper — Multi-Language CLI

## Overview

A command-line tool for scraping news from ABP Live in English and Hindi. Each language has its own `LanguageConfig` and `Scraper` subclass. Adding a new language requires only a config entry and a scraper class — no changes to the CLI, file management, or upload logic.

| Flag | Source | Language |
|------|--------|----------|
| `--english` | `news.abplive.com` | English |
| `--hindi` | `www.abplive.com` | Hindi |

---

## Setup & Installation

**Python 3.10 is required.**

```bash
py -3.10 -m venv venv
.\venv\Scripts\activate        # Windows
source venv/bin/activate       # Linux / macOS

# Install uv, then install deps for your language
pip install uv
uv pip install -r requirements-ci-english.txt   # for English
uv pip install -r requirements-ci-hindi.txt     # for Hindi
```

---

## Usage

A language flag (`--english` or `--hindi`) is **always required**.

### List Available Categories
```bash
python backend/web-scraping/news-scrape.py --english --list
python backend/web-scraping/news-scrape.py --hindi   --list
```

### Scrape a Category
```bash
python backend/web-scraping/news-scrape.py --english --category sports
python backend/web-scraping/news-scrape.py --english --category technology
python backend/web-scraping/news-scrape.py --hindi   --category politics
python backend/web-scraping/news-scrape.py --hindi   --category latest
```

### Search
```bash
python backend/web-scraping/news-scrape.py --english --search "climate change"
python backend/web-scraping/news-scrape.py --english --search "stock market"
python backend/web-scraping/news-scrape.py --english --search "pune" --pages 3
python backend/web-scraping/news-scrape.py --hindi   --search "पुणे"
python backend/web-scraping/news-scrape.py --hindi   --search "पुणे" --pages 3
```

`--pages` (alias `--page`) is optional and defaults to `1`.

---

## Available Categories

### English (`--english`)

| Key | Display Name |
|-----|-------------|
| `top` | Top News |
| `business` | Business |
| `entertainment` | Entertainment |
| `sports` | Sports |
| `lifestyle` | Lifestyle |
| `technology` | Technology |
| `elections` | Elections |

### Hindi (`--hindi`)

| Key | Display Name |
|-----|-------------|
| `top` | Top News |
| `entertainment` | Entertainment |
| `sports` | Sports |
| `politics` | Politics |
| `latest` | Latest News |
| `technology` | Technology |
| `lifestyle` | Lifestyle |
| `business` | Business |
| `world` | World News |
| `crime` | Crime |

---

## How It Works

### Scraping Pipeline

```
1. LINK DISCOVERY
   |
   +-- Category mode: fetch category page, extract article URLs via regex
   +-- Search mode:   fetch page 1 by default, or up to N pages via --pages
   |                  page 1: /search?s=query
   |                  page 2+: /search/page-2?s=query, /search/page-3?s=query
   |                  stops early if a page returns no article links
   |
   Concurrency: up to SCRAPING_MAX_WORKERS parallel workers (default: 10)
   |
2. CONTENT EXTRACTION
   |
   +-- English (EnglishScraper): finds <div class="abp-story-article"> or "article-content"
   +-- Hindi   (HindiScraper):  finds <div class="abp-story-detail"> or "story-detail"
   |
   Extracts: id, title, content (<p> tags), author, published_date, url
   Skips: /photo-gallery/ and /videos/ pages (no plain text)
   |
3. SAVE & UPLOAD
   |
   +-- JSON saved to articles/{language}/categories/{category}/{timestamp}.json
   +-- Search JSON saved to articles/{language}/search_queries/{query}/{timestamp}.json
   +-- Uploaded to Cloudinary as resource_type="raw"
```
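The concurrency step above can be sketched with a `ThreadPoolExecutor`. This is a minimal illustration, not the actual implementation: `fetch_article` and `urls` are hypothetical placeholders, and the worker count mirrors the documented `SCRAPING_MAX_WORKERS` default.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = int(os.environ.get("SCRAPING_MAX_WORKERS", "10"))

def scrape_all(urls, fetch_article):
    """Fetch every article URL in parallel; failed fetches return None and are skipped."""
    articles = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(fetch_article, u): u for u in urls}
        for fut in as_completed(futures):
            article = fut.result()
            if article is not None:
                articles.append(article)
    return articles
```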

### Article Link Patterns

- **English:** URLs matching `abplive.com/...-{numeric_id}` or ending in `.html`
- **Hindi:** URLs matching `abplive.com/.+-{6+ digit numeric ID}`, excluding photo-gallery and video paths
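The two patterns can be approximated with regexes like the following. These are a sketch reconstructed from the descriptions above; the actual expressions in `news-scrape.py` may differ.

```python
import re

# English: path ending in a numeric ID or in .html
ENGLISH_LINK_RE = re.compile(r"abplive\.com/.+(?:-\d+|\.html)$")

# Hindi: path ending in a 6+ digit numeric ID,
# with photo-gallery and video paths excluded via a negative lookahead
HINDI_LINK_RE = re.compile(
    r"abplive\.com/(?!.*(?:photo-gallery|videos)/).+-\d{6,}$"
)

def is_article_link(url: str, language: str) -> bool:
    """Rough check of whether a URL looks like a scrapeable article."""
    pattern = ENGLISH_LINK_RE if language == "english" else HINDI_LINK_RE
    return bool(pattern.search(url))
```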

---

## Output Structure

### Category Scraping
```
articles/
+-- english/
|   +-- categories/
|       +-- {category}/
|           +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
    +-- categories/
        +-- {category}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json
```

### Search Queries
```
articles/
+-- english/
    +-- search_queries/
        +-- {sanitized_query}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json
+-- hindi/
    +-- search_queries/
        +-- {sanitized_query}/
            +-- {day}_{month}_{hour}_{minute}_{ampm}.json
```

**Timestamp examples:** `1_feb_2_30_pm` · `15_jan_9_45_am`
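The filename scheme can be reproduced with `datetime`. A sketch under assumptions inferred from the examples above: lowercase month abbreviation, unpadded day and hour, and (an assumption — the examples don't show it) a zero-padded minute.

```python
from datetime import datetime
from pathlib import Path

def timestamp_name(now: datetime) -> str:
    """e.g. 1_feb_2_30_pm — unpadded day and 12-hour time, lowercase month and am/pm."""
    hour12 = now.hour % 12 or 12
    ampm = "am" if now.hour < 12 else "pm"
    month = now.strftime("%b").lower()
    return f"{now.day}_{month}_{hour12}_{now.minute:02d}_{ampm}"

def category_output_path(language: str, category: str, now: datetime) -> Path:
    """Build the articles/{language}/categories/{category}/{timestamp}.json path."""
    return Path("articles") / language / "categories" / category / f"{timestamp_name(now)}.json"
```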

---

## Article Data Format

Each scraped article:

```json
{
  "id": "1827329",
  "language": "english",
  "category": "Sports",
  "title": "Article Title Here",
  "author": "Author Name",
  "published_date": "2026-02-01",
  "url": "https://news.abplive.com/sports/...",
  "content": "Full article text paragraph by paragraph...",
  "scraped_at": "2026-02-01T14:30:00.000000+00:00"
}
```

The `language` field (`"english"` or `"hindi"`) flows through every pipeline stage and is stored in Supabase for filtering.
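The record maps naturally onto a typed structure. A hypothetical sketch (the codebase may use plain dicts rather than a `TypedDict`):

```python
from typing import TypedDict

class Article(TypedDict):
    id: str
    language: str        # "english" or "hindi"; used for Supabase filtering
    category: str
    title: str
    author: str
    published_date: str  # site-reported date, e.g. "2026-02-01"
    url: str
    content: str
    scraped_at: str      # ISO 8601 UTC timestamp added at scrape time

REQUIRED_KEYS = set(Article.__annotations__)

def is_complete(record: dict) -> bool:
    """True if the record carries every expected field."""
    return REQUIRED_KEYS <= record.keys()
```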

---

## Configuration

| Setting | Environment Variable | Default |
|---------|---------------------|---------|
| Concurrent workers | `SCRAPING_MAX_WORKERS` | `10` |
| Request timeout | `SCRAPING_TIMEOUT` | `30` (seconds) |
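Both settings follow the same environment-with-default pattern; a minimal sketch:

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    return int(os.environ.get(name, default))

SCRAPING_MAX_WORKERS = env_int("SCRAPING_MAX_WORKERS", 10)
SCRAPING_TIMEOUT = env_int("SCRAPING_TIMEOUT", 30)  # seconds
```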

---

## Error Handling

- Network timeouts are caught per article; failed articles are counted but do not abort the run
- HTTP non-200 responses are logged and skipped
- Articles with no extractable `<h1>` title or content `<p>` tags are skipped
- Hindi photo-gallery and video URLs are excluded during link discovery
- Duplicate URLs within a single run are deduplicated with a `set`
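Taken together, these rules amount to a per-URL guard loop. A minimal sketch, with `parse_article` as a hypothetical callable that raises on network errors and returns `None` when no title or content is found:

```python
def scrape_safely(urls, parse_article):
    """Apply the error policy above: failures are counted but never abort the run."""
    seen = set()              # duplicate URLs within a run are dropped
    results, failed = [], 0
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        try:
            article = parse_article(url)   # may raise on timeout / HTTP error
        except Exception:
            failed += 1                    # counted, run continues
            continue
        if article is None:                # no extractable title/content
            failed += 1
            continue
        results.append(article)
    return results, failed
```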

---

## Adding a New Language (Developer Guide)

### Step 1: Add a `LanguageConfig` in `news-scrape.py`

```python
_MR_BASE = "https://marathi.abplive.com"

MARATHI_CONFIG = LanguageConfig(
    base_url=_MR_BASE,
    categories={
        "top":    {"name": "Top News", "url": f"{_MR_BASE}/news"},
        "sports": {"name": "Sports",   "url": f"{_MR_BASE}/sports"},
    },
    search_url_tpl=None,                 # set if the site supports search
    scraper_class_name="MarathiScraper",
    output_subfolder="marathi",
)
```

### Step 2: Register it in `LANGUAGE_CONFIGS`

```python
LANGUAGE_CONFIGS: Dict[str, LanguageConfig] = {
    "english": ENGLISH_CONFIG,
    "hindi":   HINDI_CONFIG,
    "marathi": MARATHI_CONFIG,    # <- add here
}
```

### Step 3: Write a `Scraper` subclass

```python
class MarathiScraper(BaseScraper):
    def _extract_links(self, soup, src_url, is_search=False):
        # return Set[str] of article URLs
        ...

    def parse_article(self, link: str, category: str):
        # return Dict with keys: id, language, category, title,
        #   author, published_date, url, content, scraped_at
        # or return None on failure
        ...
```

Register it in the factory dict at the bottom of `news-scrape.py`:
```python
_SCRAPER_CLASSES = {
    "EnglishScraper": EnglishScraper,
    "HindiScraper":   HindiScraper,
    "MarathiScraper": MarathiScraper,   # <- add here
}
```

The `--marathi` CLI flag, output paths, Cloudinary upload, and all downstream pipeline steps work automatically.
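The "flag appears automatically" behavior suggests the CLI flags are generated from the registry rather than hard-coded. A hypothetical sketch of that wiring (the configs themselves are elided here):

```python
import argparse

# Stand-ins; the real registry maps names to LanguageConfig instances
LANGUAGE_CONFIGS = {"english": None, "hindi": None, "marathi": None}

def build_parser() -> argparse.ArgumentParser:
    """One mutually exclusive --{language} flag per registered config."""
    parser = argparse.ArgumentParser(description="Multi-language news scraper")
    group = parser.add_mutually_exclusive_group(required=True)
    for lang in LANGUAGE_CONFIGS:
        group.add_argument(f"--{lang}", dest="language",
                           action="store_const", const=lang)
    return parser
```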