# Indian News Scraper

A collection of web scrapers for major Indian news websites that extract articles on a given topic.
## Features

- Scrapes articles from major Indian news sources:
  - Times of India (TOI)
  - NDTV
  - WION
  - Scroll.in
- Command-line interface for easy use
- Multithreaded scraping for fast performance
- Automatic progress saving to prevent data loss
- CSV output format for easy analysis
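The multithreaded scraping with periodic progress saving can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `fetch_article` is a hypothetical stand-in for the per-URL scraping logic, and progress is saved here every few articles rather than on a timer.

```python
import csv
import threading
from concurrent.futures import ThreadPoolExecutor

def fetch_article(url):
    # Hypothetical stand-in for the real per-URL scraping logic.
    return {"url": url, "title": f"Title for {url}"}

def scrape(urls, out_path, workers=4, save_every=2):
    rows = []
    lock = threading.Lock()

    def save():
        # Write everything collected so far, so a crash loses little work.
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["url", "title"])
            writer.writeheader()
            writer.writerows(rows)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for article in pool.map(fetch_article, urls):
            with lock:
                rows.append(article)
                if len(rows) % save_every == 0:
                    save()  # periodic progress save
    save()  # final save
    return rows
```

The lock serializes access to the shared results list while the worker threads fetch pages concurrently; only the fetching itself runs in parallel.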
## Requirements

- Python 3.7+
- Chrome browser
- ChromeDriver (compatible with your Chrome version)
## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/yourusername/indian-news-scraper.git
   cd indian-news-scraper
   ```

2. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Make sure you have Chrome and ChromeDriver installed:

   - Install Chrome: https://www.google.com/chrome/
   - Download ChromeDriver: https://chromedriver.chromium.org/downloads
   - Make sure ChromeDriver is in your PATH
## Usage

Run the main script with the desired news source and topic:

```bash
python run_scraper.py --source toi --topic "Climate Change"
```
### Available News Sources

- `toi` - Times of India
- `ndtv` - NDTV
- `wion` - WION News
- `scroll` - Scroll.in
### Command Line Options

```
usage: run_scraper.py [-h] --source {toi,ndtv,wion,scroll} --topic TOPIC [--workers WORKERS] [--interval INTERVAL]

Scrape news articles from Indian news websites

optional arguments:
  -h, --help            show this help message and exit
  --source {toi,ndtv,wion,scroll}, -s {toi,ndtv,wion,scroll}
                        News source to scrape from
  --topic TOPIC, -t TOPIC
                        Topic to search for (e.g., "Climate Change", "Politics")
  --workers WORKERS, -w WORKERS
                        Number of worker threads (default: 4)
  --interval INTERVAL, -i INTERVAL
                        Auto-save interval in seconds (default: 300)
```
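The options above correspond to an `argparse` setup along these lines. This is a sketch reconstructed from the usage text, not the actual contents of `run_scraper.py`:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Scrape news articles from Indian news websites")
    parser.add_argument("--source", "-s", required=True,
                        choices=["toi", "ndtv", "wion", "scroll"],
                        help="News source to scrape from")
    parser.add_argument("--topic", "-t", required=True,
                        help='Topic to search for (e.g., "Climate Change", "Politics")')
    parser.add_argument("--workers", "-w", type=int, default=4,
                        help="Number of worker threads (default: 4)")
    parser.add_argument("--interval", "-i", type=int, default=300,
                        help="Auto-save interval in seconds (default: 300)")
    return parser
```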
## Examples

Scrape articles about "COVID" from Times of India:

```bash
python run_scraper.py --source toi --topic COVID
```

Scrape articles about "Elections" from NDTV with 8 worker threads:

```bash
python run_scraper.py --source ndtv --topic Elections --workers 8
```

Scrape articles about "Climate Change" from Scroll.in with auto-save every minute:

```bash
python run_scraper.py --source scroll --topic "Climate Change" --interval 60
```
## Output

The scraped articles are saved in CSV format in the `output` directory with filenames in the following format:

```
{source}_{topic}articles_{timestamp}_{status}.csv
```

For example:

```
output/toi_COVIDarticles_20250407_121530_final.csv
```
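The naming scheme can be expressed as a small helper. This is a hypothetical sketch inferred from the format string above; the real script may differ in details such as how spaces in the topic are handled:

```python
from datetime import datetime

def output_filename(source, topic, status="final", when=None):
    """Build an output path like output/toi_COVIDarticles_20250407_121530_final.csv."""
    when = when or datetime.now()
    timestamp = when.strftime("%Y%m%d_%H%M%S")
    # Assumption: spaces are stripped so the topic forms one token in the filename.
    topic = topic.replace(" ", "")
    return f"output/{source}_{topic}articles_{timestamp}_{status}.csv"
```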
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Disclaimer
This tool is meant for research and educational purposes only. Please respect the terms of service of the websites you scrape and use the data responsibly.