Spaces:

LovnishVerma
/

webscrapingexample

Sleeping

App Files Files Community

webscrapingexample / README.md

LovnishVerma

Update README.md

102d6ed verified 7 days ago

preview code

raw

history blame contribute delete

4.53 kB

	---
	title: Webscrapingexample
	emoji: 🐨
	colorFrom: pink
	colorTo: purple
	sdk: docker
	pinned: false
	---

	# Flask Web Scraper

	Live Demo: https://lovnishverma-webscrapingexample.hf.space/

	Live Demo: https://webscaraping-simplified.onrender.com

	GitHub Repo Link: https://github.com/lovnishverma/webscaraping-simplified

	## Overview
	This is a simple Flask-based web scraping application that allows users to enter a URL and an HTML tag to extract and display content from that webpage.

	![image](https://github.com/user-attachments/assets/13838a50-71e5-411d-ac7c-034e5caac405)


	## Features
	✅ User-friendly Web Interface: Enter URL and tag to scrape data.

	✅ Web Scraping with BeautifulSoup: Extracts text from specified HTML tags.

	✅ Error Handling: Displays an error if URL or tag is missing.

	✅ Minimal and Lightweight: Uses Flask for the backend.

	## Requirements
	Ensure you have the following installed before running the project:

	- 🐍 Python
	- 🌐 Flask
	- 🔗 Requests
	- 🏗️ BeautifulSoup4 (bs4)
	- 🦄 Gunicorn (for production/deployment)

	You can install dependencies locally using:
	```sh
	pip install -r requirements.txt

	```

	## Project Structure

	📂 Project Directory:

	```
	/your_project_directory
	│── app.py # 🏗️ Main Flask application
	│── Dockerfile # 🐳 Docker configuration for Hugging Face
	│── requirements.txt # 📦 Project dependencies
	│── templates/
	│ │── index.html # 📄 Home page with form
	│ │── result.html # 📄 Page to display scraped data
	│── README.md # 📖 Project Documentation

	```

	## Usage & Deployment (Hugging Face Spaces)

	This project is configured to easily deploy on Hugging Face Spaces using the Docker SDK.

	🚀 Deploying to Hugging Face:

	1. Create a new Space on [Hugging Face](https://huggingface.co/spaces).
	2. Set the Space SDK to Docker and choose the Blank template.
	3. Upload your project files (`app.py`, `Dockerfile`, `requirements.txt`, and the `templates` folder) to the Space.
	4. The Space will automatically build the container and start the app on port `7860` using `gunicorn`.

	🌍 Access the Web App:
	Once the build is complete, your app will be live at your Hugging Face Space URL:

	```
	[https://yourusername-yourspacename.hf.space/](https://yourusername-yourspacename.hf.space/)

	```

	📝 Enter URL and Tag:

	1. Provide a valid URL.
	2. Specify an HTML tag (e.g., `p`, `h1`, `div`).
	3. Click submit to fetch and display the data.

	## Code

	```python
	from flask import Flask, render_template, request # Flask is used to create a web app
	import requests # To send HTTP requests
	from bs4 import BeautifulSoup # BeautifulSoup is used for web scraping

	app = Flask(__name__) # Creating a Flask app instance

	# Home route - Displays the form
	@app.route("/")
	def index():
	return render_template("index.html") # Renders the index.html template

	# Scraping route - Scrapes data based on user input
	@app.route("/scrape", methods=["POST"])
	def scrape():
	url, tag = request.form.get("url"), request.form.get("tag") # Get URL and tag from the form
	if not url or not tag: # If any value is missing, return an error
	return render_template("result.html", error="Both URL and Tag are required.")

	# Send an HTTP request to fetch the webpage content
	response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
	response.raise_for_status() # Raise an error if the request fails

	# Parse the HTML content of the page
	soup = BeautifulSoup(response.text, "html.parser")

	# Extract text from all occurrences of the given tag
	elements = [e.get_text() for e in soup.find_all(tag)]

	# Render the result page with extracted data
	return render_template("result.html", tag=tag, url=url, title=soup.title.string or "No Title", elements=elements)

	# Run the Flask server
	if __name__ == "__main__":
	app.run(debug=True) # Debug mode is enabled to show errors in the console


	```

	## Notes

	⚠️ Important Considerations:

	* Works only with publicly accessible websites.
	* Some websites may block requests (Use user-agent headers to avoid 403 errors).
	* Handles missing input errors but does not handle all exceptions.

	## Future Improvements

	✨ Potential Enhancements:

	* 🔄 Add support for multiple tags.
	* 📜 Implement pagination for large data sets.
	* 💾 Store scraped data in a database.

	---

	🎉 Happy Coding! 🚀