	# Indian News Scraper

	A collection of web scrapers for various Indian news websites that can extract articles based on specific topics.

	## Features

	- Scrapes articles from major Indian news sources:
	  - Times of India (TOI)
	  - NDTV
	  - WION
	  - Scroll.in
	- Command-line interface for easy use
	- Multithreaded scraping with a configurable number of worker threads (default: 4)
	- Automatic progress saving at a configurable interval (default: every 300 seconds) to limit data loss
	- CSV output format for easy analysis

	## Requirements

	- Python 3.7+
	- Chrome browser
	- ChromeDriver (compatible with your Chrome version)

	## Installation

	1. Clone this repository:
	```bash
	git clone https://github.com/yourusername/indian-news-scraper.git
	cd indian-news-scraper
	```

	2. Install the required dependencies:
	```bash
	pip install -r requirements.txt
	```

	3. Make sure you have Chrome and ChromeDriver installed:
	   - Install Chrome: [https://www.google.com/chrome/](https://www.google.com/chrome/)
	   - Download ChromeDriver: [https://chromedriver.chromium.org/downloads](https://chromedriver.chromium.org/downloads)
	   - Make sure ChromeDriver is in your PATH
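
To confirm the last step before starting a long run, a quick check along these lines may help. This is a minimal sketch and not part of the repository: the helper name `find_executable` is hypothetical, and it only tests whether an executable is discoverable on PATH, not whether its version matches your Chrome.

```python
import shutil


def find_executable(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable("chromedriver")
    if path:
        print(f"ChromeDriver found at {path}")
    else:
        print("ChromeDriver not found on PATH")
```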

	## Usage

	Run the main script with the desired news source and topic:

	```bash
	python run_scraper.py --source toi --topic "Climate Change"
	```

	### Available News Sources

	- `toi` - Times of India
	- `ndtv` - NDTV
	- `wion` - WION News
	- `scroll` - Scroll.in

	### Command Line Options

	```
	usage: run_scraper.py [-h] --source {toi,ndtv,wion,scroll} --topic TOPIC
	                      [--workers WORKERS] [--interval INTERVAL]

	Scrape news articles from Indian news websites

	optional arguments:
	  -h, --help            show this help message and exit
	  --source {toi,ndtv,wion,scroll}, -s {toi,ndtv,wion,scroll}
	                        News source to scrape from
	  --topic TOPIC, -t TOPIC
	                        Topic to search for (e.g., "Climate Change", "Politics")
	  --workers WORKERS, -w WORKERS
	                        Number of worker threads (default: 4)
	  --interval INTERVAL, -i INTERVAL
	                        Auto-save interval in seconds (default: 300)
	```
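
For readers who want to see how an interface like this is typically declared, the help text above can be reproduced with `argparse` roughly as follows. This is a reconstruction from the help output, not the actual contents of `run_scraper.py`. The function name `build_parser` is hypothetical, and treating `--workers` and `--interval` as integers is an assumption.

```python
import argparse


def build_parser():
    """Build a parser matching the documented options (sketch, not the real script)."""
    parser = argparse.ArgumentParser(
        prog="run_scraper.py",
        description="Scrape news articles from Indian news websites",
    )
    parser.add_argument(
        "--source", "-s", required=True,
        choices=["toi", "ndtv", "wion", "scroll"],
        help="News source to scrape from",
    )
    parser.add_argument(
        "--topic", "-t", required=True,
        help='Topic to search for (e.g., "Climate Change", "Politics")',
    )
    # Assumption: both numeric options are parsed as integers.
    parser.add_argument(
        "--workers", "-w", type=int, default=4,
        help="Number of worker threads (default: 4)",
    )
    parser.add_argument(
        "--interval", "-i", type=int, default=300,
        help="Auto-save interval in seconds (default: 300)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--source", "ndtv", "--topic", "Elections", "--workers", "8"]
    )
    print(args.source, args.topic, args.workers, args.interval)
    # prints: ndtv Elections 8 300
```

Because `--source` and `--topic` are declared with `required=True`, `argparse` still lists them under "optional arguments" on older Python versions, which matches the help text shown above.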

	### Examples

	Scrape articles about "COVID" from Times of India:
	```bash
	python run_scraper.py --source toi --topic COVID
	```

	Scrape articles about "Elections" from NDTV with 8 worker threads:
	```bash
	python run_scraper.py --source ndtv --topic Elections --workers 8
	```

	Scrape articles about "Climate Change" from Scroll.in with auto-save every minute:
	```bash
	python run_scraper.py --source scroll --topic "Climate Change" --interval 60
	```
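
To illustrate what `--workers` and `--interval` control, here is one way a thread pool and a periodic save could fit together. It is a sketch under stated assumptions, not the scrapers' actual code: `scrape_all`, `fetch`, and `save` are hypothetical names, and the `"partial"` status label is invented for illustration (only `final` appears in this README).

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(urls, fetch, save, workers=4, interval=300):
    """Fetch every URL on a thread pool and call save(rows, status) periodically.

    fetch(url) returns one row; save(rows, status) persists what has been
    collected so far. Both are supplied by the caller (hypothetical hooks).
    """
    rows = []
    last_save = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, url) for url in urls]
        # Results are gathered in the main thread, so `rows` needs no lock.
        for future in as_completed(futures):
            try:
                rows.append(future.result())
            except Exception as exc:  # one failed article should not abort the run
                print(f"skipped an article: {exc}")
            if time.monotonic() - last_save >= interval:
                save(rows, "partial")
                last_save = time.monotonic()
    save(rows, "final")
    return rows


if __name__ == "__main__":
    def fake_fetch(url):  # stand-in for a real page scraper
        return {"url": url, "title": f"Title for {url}"}

    def print_save(rows, status):
        print(f"{status}: {len(rows)} rows")

    scrape_all(["a", "b", "c"], fake_fetch, print_save, workers=2, interval=60)
```

In this arrangement a shorter interval costs more writes but loses less work if the run is interrupted.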

	## Output

	Scraped articles are saved as CSV files in the `output` directory. Filenames follow this pattern:
	```
	{source}_{topic}articles_{timestamp}_{status}.csv
	```

	For example:
	```
	output/toi_COVIDarticles_20250407_121530_final.csv
	```
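
The pattern can be expressed in a few lines of Python, which is handy when locating or generating result files from another script. This is a sketch: the function name `output_path` is hypothetical, and the `YYYYMMDD_HHMMSS` timestamp layout is inferred from the example filename rather than confirmed by the code. How topics containing spaces are written into the filename is not specified here, so the sketch passes the topic through unchanged.

```python
from datetime import datetime
from pathlib import Path


def output_path(source, topic, status, when=None, out_dir="output"):
    """Build a path following {source}_{topic}articles_{timestamp}_{status}.csv."""
    when = when or datetime.now()
    # Assumption: timestamp layout inferred from the README example.
    timestamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(out_dir) / f"{source}_{topic}articles_{timestamp}_{status}.csv"


if __name__ == "__main__":
    path = output_path("toi", "COVID", "final", datetime(2025, 4, 7, 12, 15, 30))
    print(path.as_posix())
    # prints: output/toi_COVIDarticles_20250407_121530_final.csv
```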

	## Contributing

	Contributions are welcome! Please feel free to submit a Pull Request.

	1. Fork the repository
	2. Create your feature branch (`git checkout -b feature/amazing-feature`)
	3. Commit your changes (`git commit -m 'Add some amazing feature'`)
	4. Push to the branch (`git push origin feature/amazing-feature`)
	5. Open a Pull Request

	## License

	This project is licensed under the MIT License - see the LICENSE file for details.

	## Disclaimer

	This tool is meant for research and educational purposes only. Please respect the terms of service of the websites you scrape and use the data responsibly.