---
title: Webscrapingexample
emoji: 🐨
colorFrom: pink
colorTo: purple
sdk: docker
pinned: false
---

# Flask Web Scraper

Live Demo (Hugging Face Spaces): https://lovnishverma-webscrapingexample.hf.space/

Live Demo (Render): https://webscaraping-simplified.onrender.com

GitHub Repository: https://github.com/lovnishverma/webscaraping-simplified

## Overview
This is a simple **Flask-based web scraping application**. Users enter a URL and an HTML tag name, and the app displays the text of every matching element on that page.

![image](https://github.com/user-attachments/assets/13838a50-71e5-411d-ac7c-034e5caac405)


## Features
✅ **User-friendly Web Interface:** Enter URL and tag to scrape data.

✅ **Web Scraping with BeautifulSoup:** Extracts text from specified HTML tags.

✅ **Error Handling:** Displays an error if URL or tag is missing.

✅ **Minimal and Lightweight:** Uses Flask for the backend.

## Requirements
Ensure you have the following installed before running the project:

- 🐍 Python
- 🌐 Flask
- 🔗 Requests
- 🏗️ BeautifulSoup4 (bs4)
- 🦄 Gunicorn (for production/deployment)

You can install the dependencies locally with:
```sh
pip install -r requirements.txt
```

## Project Structure

📂 **Project Directory:**

```
/your_project_directory
│── app.py               # 🏗️ Main Flask application
│── Dockerfile           # 🐳 Docker configuration for Hugging Face
│── requirements.txt     # 📦 Project dependencies
│── templates/
│   │── index.html       # 📄 Home page with form
│   │── result.html      # 📄 Page to display scraped data
│── README.md            # 📖 Project Documentation
```

## Usage & Deployment (Hugging Face Spaces)

This project is configured for deployment on **Hugging Face Spaces** using the Docker SDK.

🚀 **Deploying to Hugging Face:**

1. Create a new Space on [Hugging Face](https://huggingface.co/spaces).
2. Set the **Space SDK** to **Docker** and choose the **Blank** template.
3. Upload your project files (`app.py`, `Dockerfile`, `requirements.txt`, and the `templates` folder) to the Space.
4. The Space will automatically build the container and start the app on port `7860` using `gunicorn`.
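
The Dockerfile itself is not reproduced in this README, so the exact start command is not shown. As a rough sketch only, one way to have `gunicorn` listen on port `7860` is a small Python config file. The file name `gunicorn.conf.py` and the `workers` value are assumptions for illustration, not taken from this project.

```python
# gunicorn.conf.py (hypothetical file; this project's Dockerfile may pass
# the same settings as command-line flags instead)
bind = "0.0.0.0:7860"  # listen on all interfaces on the port mentioned above
workers = 2            # assumed value; tune for the container's CPU and memory
timeout = 30           # seconds before an unresponsive worker is restarted
```

With such a file next to `app.py`, the container could start the app with `gunicorn -c gunicorn.conf.py app:app`, where `app:app` refers to the `app` object defined in `app.py`.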

🌍 **Access the Web App:**
Once the build is complete, your app will be live at your Hugging Face Space URL:

```
https://yourusername-yourspacename.hf.space/
```

📝 **Enter URL and Tag:**

1. Provide a valid URL.
2. Specify an HTML tag (e.g., `p`, `h1`, `div`).
3. Click submit to fetch and display the data.
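
Step 1 asks for a valid URL and step 2 for a tag name. The app shown in this README only checks that both fields are non-empty, so the following is a minimal sketch of a stricter check that could run before any request is sent. The helper names `is_valid_url` and `is_valid_tag` are hypothetical and use only the standard library.

```python
import re
from urllib.parse import urlparse

# Hypothetical helpers; they are not part of the app shown in this README.
# A tag name starts with a letter, followed by letters, digits, or hyphens.
TAG_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9-]*$")


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host name."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def is_valid_tag(tag: str) -> bool:
    """Accept a plain tag name without angle brackets or attributes."""
    return bool(TAG_PATTERN.match(tag.strip()))


print(is_valid_url("https://example.com/page"))  # True
print(is_valid_url("example.com"))               # False (no scheme)
print(is_valid_tag("h1"))                        # True
print(is_valid_tag("<p>"))                       # False (angle brackets)
```

Restricting the scheme to `http` and `https` is a design choice for this sketch; it rejects inputs such as `ftp://` or `file://` URLs that the scraper is not meant to fetch.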

## Code

```python
from flask import Flask, render_template, request  # Flask is used to create a web app
import requests  # To send HTTP requests
from bs4 import BeautifulSoup  # BeautifulSoup is used for web scraping

app = Flask(__name__)  # Creating a Flask app instance

# Home route - Displays the form
@app.route("/")
def index():
    return render_template("index.html")  # Renders the index.html template

# Scraping route - Scrapes data based on user input
@app.route("/scrape", methods=["POST"])
def scrape():
    url, tag = request.form.get("url"), request.form.get("tag")  # Get URL and tag from the form
    if not url or not tag:  # If any value is missing, return an error
        return render_template("result.html", error="Both URL and Tag are required.")

    # Send an HTTP request to fetch the webpage content (the timeout prevents hanging on slow sites)
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()  # Raise an HTTPError for 4xx/5xx responses

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract text from all occurrences of the given tag
    elements = [e.get_text() for e in soup.find_all(tag)]

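    # soup.title is None when the page has no <title>, so guard before reading it
    title = soup.title.string if soup.title and soup.title.string else "No Title"

    # Render the result page with extracted data
    return render_template("result.html", tag=tag, url=url, title=title, elements=elements)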

# Run the Flask development server (local use only; the deployed Space uses gunicorn)
if __name__ == "__main__":
    app.run(debug=True)  # Debug mode shows detailed errors; do not enable it in production
```
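
The parsing step inside `scrape()` can be tried on its own, without Flask or a network connection. The sketch below assumes a hypothetical helper named `extract` that takes raw HTML and returns the page title and the text of the matching elements. It also shows that `find_all` accepts a list of names, which is one possible way to query several tags at once.

```python
from bs4 import BeautifulSoup


def extract(html, tags):
    """Return (title, texts) for a tag name or a list of tag names.

    Hypothetical helper for illustration; the app does this inline in scrape().
    """
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the page has no <title> element
    title = soup.title.string if soup.title and soup.title.string else "No Title"
    # strip=True trims surrounding whitespace from each element's text
    texts = [e.get_text(strip=True) for e in soup.find_all(tags)]
    # Drop elements that contain no text at all
    return title, [t for t in texts if t]


sample = "<html><body><h1>Hello</h1><p> First </p><p></p></body></html>"
title, texts = extract(sample, ["h1", "p"])
print(title)  # No Title
print(texts)  # ['Hello', 'First']
```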

## Notes

⚠️ **Important Considerations:**

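* Works only with publicly accessible pages. Content rendered in the browser by JavaScript is not captured, because the app parses only the HTML returned by `requests`.
* Some websites block automated requests. The app sends a browser-like `User-Agent` header, which avoids some 403 errors but not all.
* Missing inputs are reported to the user, but request failures (invalid URLs, connection errors, 4xx/5xx responses raised by `raise_for_status()`) are not caught and surface as a server error.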
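
The last note could be addressed by catching the exceptions that `requests` raises and turning them into a message for `result.html`. The following is a minimal sketch under that assumption; `fetch_html` is a hypothetical helper, and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, timeout=10):
    """Return (html, error); on failure html is None and error is a message.

    Hypothetical helper for illustration; not part of the app shown above.
    """
    try:
        response = requests.get(
            url, headers={"User-Agent": "Mozilla/5.0"}, timeout=timeout
        )
        response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    except requests.exceptions.Timeout:
        return None, "The request timed out."
    except requests.exceptions.HTTPError as exc:
        return None, f"The site returned HTTP {exc.response.status_code}."
    except requests.exceptions.RequestException as exc:
        # Base class: covers invalid URLs, DNS failures, refused connections, etc.
        return None, f"Could not fetch the page: {exc}"
    return response.text, None
```

Inside `scrape()`, a non-empty `error` could then be passed to `render_template("result.html", error=error)`, the same way the missing-input message is already shown.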

## Future Improvements

✨ **Potential Enhancements:**

* 🔄 Add support for multiple tags.
* 📜 Implement pagination for large data sets.
* 💾 Store scraped data in a database.
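
As a rough illustration of how the pagination and storage ideas could fit together, the sketch below saves scraped text in SQLite (from the standard library) and reads it back one page at a time with `LIMIT` and `OFFSET`. The table name `scraped` and the helpers `save_elements` and `get_page` are hypothetical choices for this example.

```python
import sqlite3


def save_elements(conn, url, tag, elements):
    """Store one row per scraped element (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scraped ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT, tag TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO scraped (url, tag, content) VALUES (?, ?, ?)",
        [(url, tag, text) for text in elements],
    )
    conn.commit()


def get_page(conn, url, page, per_page=20):
    """Return the contents for one 1-based page, in insertion order."""
    offset = (max(page, 1) - 1) * per_page
    rows = conn.execute(
        "SELECT content FROM scraped WHERE url = ? ORDER BY id LIMIT ? OFFSET ?",
        (url, per_page, offset),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")  # in-memory database, used here for the demo only
save_elements(conn, "https://example.com", "p", [f"paragraph {i}" for i in range(1, 6)])
print(get_page(conn, "https://example.com", page=2, per_page=2))
# ['paragraph 3', 'paragraph 4']
```

A file path in place of `":memory:"` would make the data persist between runs.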

---

🎉 **Happy Coding! 🚀**