# Indian News Scraper

A collection of web scrapers for major Indian news websites that extract articles on a given topic.
## Features

- Scrapes articles from major Indian news sources:
  - Times of India (TOI)
  - NDTV
  - WION
  - Scroll.in
- Command-line interface for easy use
- Multithreaded scraping for fast performance
- Automatic progress saving to prevent data loss
- CSV output format for easy analysis
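The multithreaded scraping with periodic progress saving can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `fetch_article` is a hypothetical stand-in for the per-URL scraping logic, and progress is saved here every few articles rather than on a timer.

```python
import csv
import threading
from concurrent.futures import ThreadPoolExecutor

def fetch_article(url):
    # Hypothetical stand-in for the real per-URL scraping logic.
    return {"url": url, "title": f"Title for {url}"}

def scrape(urls, out_path, workers=4, save_every=2):
    rows = []
    lock = threading.Lock()

    def save():
        # Write everything collected so far, so a crash loses little work.
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["url", "title"])
            writer.writeheader()
            writer.writerows(rows)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for article in pool.map(fetch_article, urls):
            with lock:
                rows.append(article)
                if len(rows) % save_every == 0:
                    save()  # periodic progress save
    save()  # final save
    return rows
```

The lock serializes access to the shared results list while the worker threads fetch pages concurrently; only the fetching itself runs in parallel.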
## Requirements

- Python 3.7+
- Chrome browser
- ChromeDriver (compatible with your Chrome version)
## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/yourusername/indian-news-scraper.git
   cd indian-news-scraper
   ```

2. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Make sure you have Chrome and ChromeDriver installed:

   - Install Chrome: https://www.google.com/chrome/
   - Download ChromeDriver: https://chromedriver.chromium.org/downloads
   - Make sure ChromeDriver is in your PATH
## Usage

Run the main script with the desired news source and topic:

```bash
python run_scraper.py --source toi --topic "Climate Change"
```
### Available News Sources

- `toi` - Times of India
- `ndtv` - NDTV
- `wion` - WION News
- `scroll` - Scroll.in
### Command Line Options

```
usage: run_scraper.py [-h] --source {toi,ndtv,wion,scroll} --topic TOPIC [--workers WORKERS] [--interval INTERVAL]

Scrape news articles from Indian news websites

optional arguments:
  -h, --help            show this help message and exit
  --source {toi,ndtv,wion,scroll}, -s {toi,ndtv,wion,scroll}
                        News source to scrape from
  --topic TOPIC, -t TOPIC
                        Topic to search for (e.g., "Climate Change", "Politics")
  --workers WORKERS, -w WORKERS
                        Number of worker threads (default: 4)
  --interval INTERVAL, -i INTERVAL
                        Auto-save interval in seconds (default: 300)
```
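The options above correspond to an `argparse` setup along these lines. This is a sketch reconstructed from the usage text, not the actual contents of `run_scraper.py`:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Scrape news articles from Indian news websites")
    parser.add_argument("--source", "-s", required=True,
                        choices=["toi", "ndtv", "wion", "scroll"],
                        help="News source to scrape from")
    parser.add_argument("--topic", "-t", required=True,
                        help='Topic to search for (e.g., "Climate Change", "Politics")')
    parser.add_argument("--workers", "-w", type=int, default=4,
                        help="Number of worker threads (default: 4)")
    parser.add_argument("--interval", "-i", type=int, default=300,
                        help="Auto-save interval in seconds (default: 300)")
    return parser
```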
## Examples

Scrape articles about "COVID" from Times of India:

```bash
python run_scraper.py --source toi --topic COVID
```

Scrape articles about "Elections" from NDTV with 8 worker threads:

```bash
python run_scraper.py --source ndtv --topic Elections --workers 8
```

Scrape articles about "Climate Change" from Scroll.in with auto-save every minute:

```bash
python run_scraper.py --source scroll --topic "Climate Change" --interval 60
```
## Output

The scraped articles are saved in CSV format in the `output` directory with filenames in the following format:

```
{source}_{topic}articles_{timestamp}_{status}.csv
```

For example:

```
output/toi_COVIDarticles_20250407_121530_final.csv
```
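The naming scheme can be expressed as a small helper. This is a hypothetical sketch inferred from the format string above; the real script may differ in details such as how spaces in the topic are handled:

```python
from datetime import datetime

def output_filename(source, topic, status="final", when=None):
    """Build an output path like output/toi_COVIDarticles_20250407_121530_final.csv."""
    when = when or datetime.now()
    timestamp = when.strftime("%Y%m%d_%H%M%S")
    # Assumption: spaces are stripped so the topic forms one token in the filename.
    topic = topic.replace(" ", "")
    return f"output/{source}_{topic}articles_{timestamp}_{status}.csv"
```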
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Disclaimer
This tool is meant for research and educational purposes only. Please respect the terms of service of the websites you scrape and use the data responsibly.