# Indian News Scraper
A collection of web scrapers for various Indian news websites that can extract articles based on specific topics.
## Features
- Scrapes articles from major Indian news sources:
  - Times of India (TOI)
  - NDTV
  - WION
  - Scroll.in
- Command-line interface for easy use
- Multithreaded scraping with a configurable number of worker threads (default: 4)
- Automatic progress saving at a configurable interval (default: every 300 seconds) to limit data loss
- CSV output format for easy analysis
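The multithreading and auto-save features could be combined roughly as in the sketch below. It is only an illustration and not the project's actual code: `scrape_article`, `save_rows`, and `scrape_all` are hypothetical names, the scraper is a stand-in that makes no network requests, and the two CSV columns are assumed.

```python
import csv
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_article(url):
    """Hypothetical stand-in for a real per-article scraper (no network I/O)."""
    return {"url": url, "title": f"Title for {url}"}


def save_rows(rows, path):
    """Rewrite the CSV file with every row collected so far."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["url", "title"])
        writer.writeheader()
        writer.writerows(rows)


def scrape_all(urls, path, workers=4, interval=300):
    """Scrape URLs on a thread pool, saving progress every `interval` seconds."""
    rows = []
    last_save = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(scrape_article, url) for url in urls]
        # Results are collected in the main thread, so `rows` needs no lock.
        for future in as_completed(futures):
            rows.append(future.result())
            if time.monotonic() - last_save >= interval:
                save_rows(rows, path)
                last_save = time.monotonic()
    save_rows(rows, path)  # final save once every task has finished
    return rows
```

Saving from the collecting thread keeps the file writes sequential, so an interrupted run loses at most the articles scraped since the last save.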
## Requirements
- Python 3.7+
- Chrome browser
- ChromeDriver (compatible with your Chrome version)
## Installation
1. Clone this repository:
```bash
git clone https://github.com/yourusername/indian-news-scraper.git
cd indian-news-scraper
```
2. Install the required dependencies:
```bash
pip install -r requirements.txt
```
3. Make sure you have Chrome and ChromeDriver installed:
   - Install Chrome: [https://www.google.com/chrome/](https://www.google.com/chrome/)
   - Download ChromeDriver: [https://chromedriver.chromium.org/downloads](https://chromedriver.chromium.org/downloads)
   - Make sure ChromeDriver is in your PATH
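
One quick way to confirm that ChromeDriver is reachable on your PATH is a small standard-library check like the one below. The helper name `find_chromedriver` is hypothetical, and the executable is assumed to be called `chromedriver`.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_chromedriver()
    if path is None:
        print("chromedriver not found on PATH")
    else:
        print(f"chromedriver found at {path}")
```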
## Usage
Run the main script with the desired news source and topic:
```bash
python run_scraper.py --source toi --topic "Climate Change"
```
### Available News Sources
- `toi` - Times of India
- `ndtv` - NDTV
- `wion` - WION News
- `scroll` - Scroll.in
### Command Line Options
```
usage: run_scraper.py [-h] --source {toi,ndtv,wion,scroll} --topic TOPIC [--workers WORKERS] [--interval INTERVAL]

Scrape news articles from Indian news websites

optional arguments:
  -h, --help            show this help message and exit
  --source {toi,ndtv,wion,scroll}, -s {toi,ndtv,wion,scroll}
                        News source to scrape from
  --topic TOPIC, -t TOPIC
                        Topic to search for (e.g., "Climate Change", "Politics")
  --workers WORKERS, -w WORKERS
                        Number of worker threads (default: 4)
  --interval INTERVAL, -i INTERVAL
                        Auto-save interval in seconds (default: 300)
```
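
For readers extending the CLI, an `argparse` parser along the following lines would produce the same set of options. This is a sketch reconstructed from the help text above, not the actual contents of `run_scraper.py`. The function name `build_parser` and the integer types for `--workers` and `--interval` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (reconstruction, not the original)."""
    parser = argparse.ArgumentParser(
        prog="run_scraper.py",
        description="Scrape news articles from Indian news websites",
    )
    parser.add_argument(
        "--source", "-s", required=True,
        choices=["toi", "ndtv", "wion", "scroll"],
        help="News source to scrape from",
    )
    parser.add_argument(
        "--topic", "-t", required=True,
        help='Topic to search for (e.g., "Climate Change", "Politics")',
    )
    parser.add_argument(
        "--workers", "-w", type=int, default=4,  # int type is assumed
        help="Number of worker threads (default: 4)",
    )
    parser.add_argument(
        "--interval", "-i", type=int, default=300,  # int type is assumed
        help="Auto-save interval in seconds (default: 300)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.source, args.topic, args.workers, args.interval)
```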
### Examples
Scrape articles about "COVID" from Times of India:
```bash
python run_scraper.py --source toi --topic COVID
```
Scrape articles about "Elections" from NDTV with 8 worker threads:
```bash
python run_scraper.py --source ndtv --topic Elections --workers 8
```
Scrape articles about "Climate Change" from Scroll.in with auto-save every minute:
```bash
python run_scraper.py --source scroll --topic "Climate Change" --interval 60
```
## Output
The scraped articles are saved in CSV format in the `output` directory with filenames in the following format:
```
{source}_{topic}articles_{timestamp}_{status}.csv
```
For example:
```
output/toi_COVIDarticles_20250407_121530_final.csv
```
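
The naming pattern can be reproduced with a helper like the one below, which may be handy for locating or generating output files in downstream scripts. The function name `output_path` is hypothetical, and the timestamp format `%Y%m%d_%H%M%S` is inferred from the example filename rather than confirmed by the source. How topics containing spaces are encoded is not documented here, so the sketch inserts the topic unchanged.

```python
from datetime import datetime
from pathlib import Path


def output_path(source, topic, status, when=None, output_dir="output"):
    """Build a path following {source}_{topic}articles_{timestamp}_{status}.csv."""
    when = when or datetime.now()
    # Timestamp layout inferred from the example "20250407_121530".
    timestamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"{source}_{topic}articles_{timestamp}_{status}.csv"


if __name__ == "__main__":
    print(output_path("toi", "COVID", "final", datetime(2025, 4, 7, 12, 15, 30)).as_posix())
```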
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Disclaimer
This tool is meant for research and educational purposes only. Please respect the terms of service of the websites you scrape and use the data responsibly.