# Indian News Scraper

A collection of web scrapers that extract articles on a given topic from several Indian news websites.

## Features

- Scrapes articles from major Indian news sources:
  - Times of India (TOI)
  - NDTV
  - WION
  - Scroll.in
- Command-line interface for easy use
- Multithreaded scraping with a configurable number of worker threads
- Automatic progress saving at a configurable interval to prevent data loss
- CSV output format for easy analysis

## Requirements

- Python 3.7+
- Chrome browser
- ChromeDriver (compatible with your Chrome version)

## Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/yourusername/indian-news-scraper.git
   cd indian-news-scraper
   ```

2. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Make sure you have Chrome and ChromeDriver installed:
   - Install Chrome: [https://www.google.com/chrome/](https://www.google.com/chrome/)
   - Download ChromeDriver: [https://chromedriver.chromium.org/downloads](https://chromedriver.chromium.org/downloads)
   - Make sure ChromeDriver is in your PATH
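
To confirm the last step before running the scraper, a small check like the following can help. It is a minimal sketch that is not part of this repository; the function name `find_chromedriver` is hypothetical, and it assumes the executable is named `chromedriver`.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the executable if it is on PATH, else None.

    `find_chromedriver` is a hypothetical helper, not part of this project.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_chromedriver()
    if path:
        print(f"ChromeDriver found at {path}")
    else:
        print("ChromeDriver not found on PATH")
```

If the check reports that ChromeDriver is missing, add the directory containing the executable to your PATH and run it again.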

## Usage

Run the main script with the desired news source and topic:

```bash
python run_scraper.py --source toi --topic "Climate Change"
```

### Available News Sources

- `toi` - Times of India
- `ndtv` - NDTV
- `wion` - WION News
- `scroll` - Scroll.in

### Command Line Options

```
usage: run_scraper.py [-h] --source {toi,ndtv,wion,scroll} --topic TOPIC [--workers WORKERS] [--interval INTERVAL]

Scrape news articles from Indian news websites

optional arguments:
  -h, --help            show this help message and exit
  --source {toi,ndtv,wion,scroll}, -s {toi,ndtv,wion,scroll}
                        News source to scrape from
  --topic TOPIC, -t TOPIC
                        Topic to search for (e.g., "Climate Change", "Politics")
  --workers WORKERS, -w WORKERS
                        Number of worker threads (default: 4)
  --interval INTERVAL, -i INTERVAL
                        Auto-save interval in seconds (default: 300)
```
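
For readers who want to see how an interface like this is typically declared, here is a sketch of an `argparse` parser that accepts the same options. It is reconstructed from the help text above and is not necessarily how `run_scraper.py` defines it; the function name `build_parser` and the `int` types for `--workers` and `--interval` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="run_scraper.py",
        description="Scrape news articles from Indian news websites",
    )
    parser.add_argument(
        "--source", "-s", required=True,
        choices=["toi", "ndtv", "wion", "scroll"],
        help="News source to scrape from",
    )
    parser.add_argument(
        "--topic", "-t", required=True,
        help='Topic to search for (e.g., "Climate Change", "Politics")',
    )
    # Assumption: both numeric options are parsed as integers.
    parser.add_argument(
        "--workers", "-w", type=int, default=4,
        help="Number of worker threads (default: 4)",
    )
    parser.add_argument(
        "--interval", "-i", type=int, default=300,
        help="Auto-save interval in seconds (default: 300)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--source", "toi", "--topic", "Climate Change"])
    print(args.source, args.topic, args.workers, args.interval)
```

Because `--source` and `--topic` are marked as required, omitting either one makes `argparse` print the usage line and exit with an error.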

### Examples

Scrape articles about "COVID" from Times of India:

```bash
python run_scraper.py --source toi --topic COVID
```

Scrape articles about "Elections" from NDTV with 8 worker threads:

```bash
python run_scraper.py --source ndtv --topic Elections --workers 8
```

Scrape articles about "Climate Change" from Scroll.in with auto-save every minute:

```bash
python run_scraper.py --source scroll --topic "Climate Change" --interval 60
```
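
The `--workers` and `--interval` options suggest a pattern of scraping in a thread pool and writing partial results to disk periodically. The sketch below illustrates that general pattern only; it is not this project's implementation. `fetch_article`, `save_rows`, and `scrape_all` are hypothetical names, and `fetch_article` returns placeholder data instead of loading a real page.

```python
import csv
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_article(url):
    """Hypothetical stand-in for a real per-article scraper."""
    return {"url": url, "title": f"Title for {url}"}


def save_rows(rows, path):
    """Write all rows collected so far to a CSV file, overwriting it."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "title"])
        writer.writeheader()
        writer.writerows(rows)


def scrape_all(urls, path, workers=4, interval=300):
    """Fetch URLs in a thread pool, saving progress every `interval` seconds."""
    rows = []
    last_save = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_article, url) for url in urls]
        # Results are collected in the main thread, so `rows` needs no lock.
        for future in as_completed(futures):
            rows.append(future.result())
            if time.monotonic() - last_save >= interval:
                save_rows(rows, path)
                last_save = time.monotonic()
    save_rows(rows, path)  # final save once every task has finished
    return rows


if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "demo_progress.csv")
    rows = scrape_all(["u1", "u2", "u3"], out, workers=2, interval=0)
    print(len(rows), "rows written to", out)
```

In this sketch a shorter interval means less work is lost if the process is interrupted, at the cost of rewriting the file more often.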

## Output

The scraped articles are saved in CSV format in the `output` directory with filenames in the following format:

```
{source}_{topic}articles_{timestamp}_{status}.csv
```

For example:

```
output/toi_COVIDarticles_20250407_121530_final.csv
```
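
The naming pattern can be reproduced with a few lines of Python, which may be useful when locating result files from another script. This is a sketch: `build_output_path` is a hypothetical helper, and the timestamp format `%Y%m%d_%H%M%S` is inferred from the example filename above rather than confirmed by the project.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(source, topic, status, when, output_dir="output"):
    """Assemble a path following the documented filename pattern.

    Hypothetical helper; the timestamp format is an assumption inferred
    from the example filename in this README.
    """
    timestamp = when.strftime("%Y%m%d_%H%M%S")
    filename = f"{source}_{topic}articles_{timestamp}_{status}.csv"
    return Path(output_dir) / filename


example = build_output_path("toi", "COVID", "final", datetime(2025, 4, 7, 12, 15, 30))
print(example.as_posix())  # output/toi_COVIDarticles_20250407_121530_final.csv
```

How topics containing spaces (such as "Climate Change") appear in the filename is not documented here, so the sketch passes the topic through unchanged.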

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Disclaimer

This tool is meant for research and educational purposes only. Please respect the terms of service of the websites you scrape and use the data responsibly.