# Indian News Scraper

A collection of web scrapers that extract articles on a given topic from several Indian news websites.

## Features

- Scrapes articles from major Indian news sources:
  - Times of India (TOI)
  - NDTV
  - WION
  - Scroll.in
- Command-line interface for easy use
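- Multithreaded scraping with a configurable number of worker threads (default: 4)
- Automatic progress saving at a configurable interval (default: every 300 seconds) to limit data loss if a run is interrupted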
- CSV output format for easy analysis
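
The multithreading and auto-save features can be pictured with a minimal sketch like the one below. It is not the project's actual code: `scrape_all`, `write_csv`, the `fetch_article` callback, and the `url`/`title` columns are hypothetical names chosen for illustration. Only the defaults (4 workers, 300 seconds) come from the options documented in this README.

```python
import csv
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def write_csv(rows, path):
    """Write the collected rows (dicts with 'url' and 'title' keys) to `path`."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "title"])
        writer.writeheader()
        writer.writerows(rows)


def scrape_all(urls, fetch_article, save_path, workers=4, interval=300):
    """Fetch every URL on a thread pool, rewriting the CSV every `interval` seconds.

    `fetch_article` is a hypothetical callback that takes a URL and returns
    a dict with 'url' and 'title' keys.
    """
    rows = []
    last_save = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_article, url) for url in urls]
        for future in as_completed(futures):
            try:
                rows.append(future.result())
            except Exception as exc:
                # One failed article should not stop the whole run.
                print(f"skipped an article: {exc}")
            if time.monotonic() - last_save >= interval:
                write_csv(rows, save_path)  # periodic progress save
                last_save = time.monotonic()
    write_csv(rows, save_path)  # final save once all workers finish
    return rows
```

Results are appended only in the main thread, so the shared list needs no lock; the worker threads just fetch and return.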

## Requirements

- Python 3.7+
- Chrome browser
- ChromeDriver (compatible with your Chrome version)

## Installation

1. Clone this repository:
   ```bash
   git clone https://github.com/yourusername/indian-news-scraper.git
   cd indian-news-scraper
   ```

2. Install the required dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Make sure you have Chrome and ChromeDriver installed:
   - Install Chrome: [https://www.google.com/chrome/](https://www.google.com/chrome/)
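   - Download ChromeDriver: [https://chromedriver.chromium.org/downloads](https://chromedriver.chromium.org/downloads) (for Chrome 115 and newer, that page points to the Chrome for Testing dashboard for matching driver builds)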
   - Make sure ChromeDriver is in your PATH
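
To confirm the last step, a quick check along these lines may help. It is a small standalone sketch, not part of the repository; `find_executable` is a hypothetical helper that wraps the standard library's `shutil.which`.

```python
import shutil


def find_executable(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable("chromedriver")
    if path is None:
        print("chromedriver was not found on PATH")
    else:
        print(f"chromedriver found at {path}")
```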

## Usage

Run the main script with the desired news source and topic:

```bash
python run_scraper.py --source toi --topic "Climate Change"
```

### Available News Sources

- `toi` - Times of India
- `ndtv` - NDTV
- `wion` - WION News
- `scroll` - Scroll.in

### Command Line Options

```
usage: run_scraper.py [-h] --source {toi,ndtv,wion,scroll} --topic TOPIC [--workers WORKERS] [--interval INTERVAL]

Scrape news articles from Indian news websites

optional arguments:
  -h, --help            show this help message and exit
  --source {toi,ndtv,wion,scroll}, -s {toi,ndtv,wion,scroll}
                        News source to scrape from
  --topic TOPIC, -t TOPIC
                        Topic to search for (e.g., "Climate Change", "Politics")
  --workers WORKERS, -w WORKERS
                        Number of worker threads (default: 4)
  --interval INTERVAL, -i INTERVAL
                        Auto-save interval in seconds (default: 300)
```
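
For readers who want to see how such an interface is typically declared, here is an `argparse` sketch that matches the flags and defaults in the usage message above. It is an illustration rather than the contents of `run_scraper.py`; `build_parser` is a hypothetical name, and treating `--workers` and `--interval` as integers is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in the usage message."""
    parser = argparse.ArgumentParser(
        prog="run_scraper.py",
        description="Scrape news articles from Indian news websites",
    )
    parser.add_argument(
        "--source", "-s", required=True,
        choices=["toi", "ndtv", "wion", "scroll"],
        help="News source to scrape from",
    )
    parser.add_argument(
        "--topic", "-t", required=True,
        help='Topic to search for (e.g., "Climate Change", "Politics")',
    )
    # Assumption: both numeric options are parsed as integers.
    parser.add_argument(
        "--workers", "-w", type=int, default=4,
        help="Number of worker threads (default: 4)",
    )
    parser.add_argument(
        "--interval", "-i", type=int, default=300,
        help="Auto-save interval in seconds (default: 300)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.source, args.topic, args.workers, args.interval)
```

Note that `--source` and `--topic` are required even though `argparse` lists every flag-style option under the same heading in the help text.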

### Examples

Scrape articles about "COVID" from Times of India:
```bash
python run_scraper.py --source toi --topic COVID
```

Scrape articles about "Elections" from NDTV with 8 worker threads:
```bash
python run_scraper.py --source ndtv --topic Elections --workers 8
```

Scrape articles about "Climate Change" from Scroll.in with auto-save every minute:
```bash
python run_scraper.py --source scroll --topic "Climate Change" --interval 60
```

## Output

Scraped articles are saved as CSV files in the `output` directory, with filenames of the form:
```
{source}_{topic}articles_{timestamp}_{status}.csv
```

For example:
```
output/toi_COVIDarticles_20250407_121530_final.csv
```
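
A filename in this pattern could be assembled roughly as follows. This is a sketch, not the project's code: `output_path` is a hypothetical helper, and the `YYYYMMDD_HHMMSS` timestamp layout is inferred from the example above. How topics containing spaces are written into the filename is not specified here, so the sketch passes the topic through unchanged.

```python
from datetime import datetime
from pathlib import Path


def output_path(source, topic, status, when=None, directory="output"):
    """Build a path of the form {source}_{topic}articles_{timestamp}_{status}.csv."""
    when = when or datetime.now()
    # Assumed timestamp layout, inferred from the example filename.
    timestamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(directory) / f"{source}_{topic}articles_{timestamp}_{status}.csv"


if __name__ == "__main__":
    print(output_path("toi", "COVID", "final", datetime(2025, 4, 7, 12, 15, 30)).as_posix())
```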

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Disclaimer

This tool is meant for research and educational purposes only. Please respect the terms of service of the websites you scrape and use the data responsibly.