Indian News Scraper

A collection of web scrapers for several Indian news websites. Each scraper extracts articles matching a topic you specify.

Features

  • Scrapes articles from major Indian news sources:
    • Times of India (TOI)
    • NDTV
    • WION
    • Scroll.in
  • Command-line interface for easy use
  • Multithreaded scraping with a configurable number of worker threads
  • Automatic progress saving at a configurable interval to limit data loss
  • CSV output format for easy analysis
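The multithreading and auto-save features listed above could be combined roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: fetch_article, save_rows, and scrape_all are hypothetical names, the CSV columns are invented for illustration, and fetch_article is a stub standing in for a real browser-based scraper.

```python
import csv
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_article(url):
    """Hypothetical stand-in for a real scraper; returns one row per URL."""
    return {"url": url, "title": f"Title for {url}"}


def save_rows(rows, path):
    """Write all rows collected so far to a CSV file, overwriting it."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "title"])
        writer.writeheader()
        writer.writerows(rows)


def scrape_all(urls, path, workers=4, interval=300):
    """Fetch URLs on a thread pool, saving progress every `interval` seconds."""
    rows = []
    last_save = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_article, u) for u in urls]
        for future in as_completed(futures):
            try:
                rows.append(future.result())
            except Exception as exc:
                # One failed article should not stop the whole run.
                print(f"skipped: {exc}")
            if time.monotonic() - last_save >= interval:
                save_rows(rows, path)  # partial save
                last_save = time.monotonic()
    save_rows(rows, path)  # final save
    return rows
```

Results are appended only in the main thread, as futures complete, so the shared list needs no lock. The defaults of 4 workers and 300 seconds mirror the defaults of the --workers and --interval options.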

Requirements

  • Python 3.7+
  • Chrome browser
  • ChromeDriver (compatible with your Chrome version)

Installation

  1. Clone this repository:

    git clone https://github.com/yourusername/indian-news-scraper.git
    cd indian-news-scraper
    
  2. Install the required dependencies:

    pip install -r requirements.txt
    
  3. Make sure Chrome is installed and that your ChromeDriver version is compatible with it.
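One way to check the browser prerequisites before a run is a small script like the one below. It is only a sketch: find_tool and tool_version are hypothetical helpers, the list of Chrome executable names is an assumption that varies by platform, and on some systems (macOS, for instance) Chrome is typically not on PATH at all, so a "not found" result is not conclusive.

```python
import shutil
import subprocess


def find_tool(candidates):
    """Return the path of the first candidate executable found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def tool_version(path):
    """Run `<path> --version` and return its output, or None on failure."""
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    # Executable names differ across platforms; this list is an assumption.
    chrome = find_tool(["google-chrome", "chrome", "chromium", "chromium-browser"])
    driver = find_tool(["chromedriver"])
    print("Chrome:", chrome or "not found on PATH")
    print("ChromeDriver:", (tool_version(driver) if driver else None) or "not found on PATH")
```

Compare the major version that ChromeDriver reports with the version shown in Chrome's "About" page; the two need to be compatible.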

Usage

Run the main script with the desired news source and topic:

python run_scraper.py --source toi --topic "Climate Change"

Available News Sources

  • toi - Times of India
  • ndtv - NDTV
  • wion - WION News
  • scroll - Scroll.in

Command Line Options

usage: run_scraper.py [-h] --source {toi,ndtv,wion,scroll} --topic TOPIC [--workers WORKERS] [--interval INTERVAL]

Scrape news articles from Indian news websites

optional arguments:
  -h, --help            show this help message and exit
  --source {toi,ndtv,wion,scroll}, -s {toi,ndtv,wion,scroll}
                        News source to scrape from
  --topic TOPIC, -t TOPIC
                        Topic to search for (e.g., "Climate Change", "Politics")
  --workers WORKERS, -w WORKERS
                        Number of worker threads (default: 4)
  --interval INTERVAL, -i INTERVAL
                        Auto-save interval in seconds (default: 300)
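For readers who want to see how such an interface is typically declared, the help text above is consistent with an argparse parser along these lines. This is a reconstruction from the help output, not the project's source: build_parser is a hypothetical name, and the integer types for --workers and --interval are assumptions.

```python
import argparse


def build_parser():
    """Build a parser matching the documented options (a reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="run_scraper.py",
        description="Scrape news articles from Indian news websites",
    )
    parser.add_argument(
        "--source", "-s", required=True,
        choices=["toi", "ndtv", "wion", "scroll"],
        help="News source to scrape from",
    )
    parser.add_argument(
        "--topic", "-t", required=True,
        help='Topic to search for (e.g., "Climate Change", "Politics")',
    )
    # Assumption: both numeric options are parsed as integers.
    parser.add_argument(
        "--workers", "-w", type=int, default=4,
        help="Number of worker threads (default: 4)",
    )
    parser.add_argument(
        "--interval", "-i", type=int, default=300,
        help="Auto-save interval in seconds (default: 300)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.source, args.topic, args.workers, args.interval)
```

Note that --source and --topic are required even though older Python versions list them under "optional arguments" in the help output.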

Examples

Scrape articles about "COVID" from Times of India:

python run_scraper.py --source toi --topic COVID

Scrape articles about "Elections" from NDTV with 8 worker threads:

python run_scraper.py --source ndtv --topic Elections --workers 8

Scrape articles about "Climate Change" from Scroll.in with auto-save every minute:

python run_scraper.py --source scroll --topic "Climate Change" --interval 60
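To collect the same topic from every source, the commands above can be driven from a short wrapper script. The sketch below assumes it is run from the repository root, next to run_scraper.py; build_command and run_all are hypothetical helper names, not part of the project.

```python
import subprocess
import sys

SOURCES = ["toi", "ndtv", "wion", "scroll"]


def build_command(source, topic, workers=4, interval=300):
    """Return the argument list for one run_scraper.py invocation."""
    return [
        sys.executable, "run_scraper.py",
        "--source", source,
        "--topic", topic,
        "--workers", str(workers),
        "--interval", str(interval),
    ]


def run_all(topic, sources=SOURCES):
    """Run the scraper once per source, one after another."""
    for source in sources:
        # check=False: a failure on one site does not abort the remaining runs.
        subprocess.run(build_command(source, topic), check=False)


if __name__ == "__main__":
    run_all("Climate Change")
```

Passing the arguments as a list avoids shell quoting problems with multi-word topics such as "Climate Change".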

Output

Scraped articles are saved as CSV files in the output directory. Filenames follow this pattern:

{source}_{topic}articles_{timestamp}_{status}.csv

For example:

output/toi_COVIDarticles_20250407_121530_final.csv
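The naming pattern can be reproduced in a few lines, which is handy when locating or generating output files from other scripts. In this sketch, output_path is a hypothetical helper, and the timestamp format (%Y%m%d_%H%M%S) is inferred from the example filename rather than taken from the project's code. How the scraper treats topics containing spaces is not documented here, so the helper leaves the topic unchanged.

```python
from datetime import datetime
from pathlib import Path


def output_path(source, topic, status, when=None, out_dir="output"):
    """Build an output path following the documented filename pattern."""
    when = when or datetime.now()
    # Assumption: timestamp layout inferred from the example filename.
    timestamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(out_dir) / f"{source}_{topic}articles_{timestamp}_{status}.csv"


if __name__ == "__main__":
    print(output_path("toi", "COVID", "final", datetime(2025, 4, 7, 12, 15, 30)).as_posix())
```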

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

This tool is meant for research and educational purposes only. Please respect the terms of service of the websites you scrape and use the data responsibly.