# Indian News Scraper

A collection of web scrapers for various Indian news websites that can extract articles based on specific topics.

## Features

- Scrapes articles from major Indian news sources:
  - Times of India (TOI)
  - NDTV
  - WION
  - Scroll.in
- Command-line interface for easy use
- Multithreaded scraping for fast performance
- Automatic progress saving to prevent data loss
- CSV output format for easy analysis

## Requirements

- Python 3.7+
- Chrome browser
- ChromeDriver (compatible with your Chrome version)

## Installation

1. Clone this repository:
   ```bash
   git clone https://github.com/yourusername/indian-news-scraper.git
   cd indian-news-scraper
   ```

2. Install the required dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Make sure you have Chrome and ChromeDriver installed:
   - Install Chrome: [https://www.google.com/chrome/](https://www.google.com/chrome/)
   - Download ChromeDriver: [https://chromedriver.chromium.org/downloads](https://chromedriver.chromium.org/downloads)
   - Make sure ChromeDriver is in your PATH

## Usage

Run the main script with the desired news source and topic:

```bash
python run_scraper.py --source toi --topic "Climate Change"
```

### Available News Sources

- `toi` - Times of India
- `ndtv` - NDTV
- `wion` - WION News
- `scroll` - Scroll.in

### Command Line Options

```
usage: run_scraper.py [-h] --source {toi,ndtv,wion,scroll} --topic TOPIC
                      [--workers WORKERS] [--interval INTERVAL]

Scrape news articles from Indian news websites

optional arguments:
  -h, --help            show this help message and exit
  --source {toi,ndtv,wion,scroll}, -s {toi,ndtv,wion,scroll}
                        News source to scrape from
  --topic TOPIC, -t TOPIC
                        Topic to search for (e.g., "Climate Change", "Politics")
  --workers WORKERS, -w WORKERS
                        Number of worker threads (default: 4)
  --interval INTERVAL, -i INTERVAL
                        Auto-save interval in seconds (default: 300)
```

### Examples

Scrape articles about "COVID" from Times of India:

```bash
python run_scraper.py --source toi --topic COVID
```

Scrape articles about "Elections" from NDTV with 8 worker threads:

```bash
python run_scraper.py --source ndtv --topic Elections --workers 8
```

Scrape articles about "Climate Change" from Scroll.in with auto-save every minute:

```bash
python run_scraper.py --source scroll --topic "Climate Change" --interval 60
```

## Output

The scraped articles are saved in CSV format in the `output` directory with filenames in the following format:

```
{source}_{topic}articles_{timestamp}_{status}.csv
```

For example:

```
output/toi_COVIDarticles_20250407_121530_final.csv
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Disclaimer

This tool is meant for research and educational purposes only. Please respect the terms of service of the websites you scrape and use the data responsibly.
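
For readers who want to locate or generate files that follow the naming pattern described under Output, the snippet below is a minimal sketch of how such a filename could be assembled. The function name `build_output_path` is hypothetical and not part of this repository. The timestamp layout (`%Y%m%d_%H%M%S`) is inferred from the example filename, and the handling of spaces in topics is an assumption, so the scraper's actual behavior may differ.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(source, topic, status, when=None, output_dir="output"):
    """Build a path following {source}_{topic}articles_{timestamp}_{status}.csv.

    Hypothetical helper for illustration only; not taken from the project code.
    """
    when = when or datetime.now()
    # Timestamp layout inferred from the README example (20250407_121530).
    timestamp = when.strftime("%Y%m%d_%H%M%S")
    # Assumption: spaces in the topic are replaced with underscores.
    # The real scraper may treat multi-word topics differently.
    safe_topic = topic.replace(" ", "_")
    filename = f"{source}_{safe_topic}articles_{timestamp}_{status}.csv"
    return Path(output_dir) / filename


if __name__ == "__main__":
    # Reproduces the filename from the README example.
    path = build_output_path("toi", "COVID", "final", datetime(2025, 4, 7, 12, 15, 30))
    print(path.name)
```

Passing an explicit `when` value keeps the result reproducible. Omitting it uses the current time, which is presumably closer to what a running scraper would do.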