metadata

title: Webscrapingexample
emoji: 🐨
colorFrom: pink
colorTo: purple
sdk: docker
pinned: false

Flask Web Scraper

Live Demo (Hugging Face Spaces): https://lovnishverma-webscrapingexample.hf.space/

Live Demo (Render): https://webscaraping-simplified.onrender.com

GitHub Repo Link: https://github.com/lovnishverma/webscaraping-simplified

Overview

This is a simple Flask-based web scraping application. Users enter a URL and an HTML tag name, and the app fetches the page and displays the text of every element that matches the tag.
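The extraction step can be tried on its own, without Flask or a network request. The sketch below is illustrative only: the helper name `extract_tag_text` and the sample HTML are assumptions, not part of the project.

```python
from bs4 import BeautifulSoup


def extract_tag_text(html, tag):
    """Return the text of every <tag> element found in the HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [element.get_text() for element in soup.find_all(tag)]


# Hypothetical sample page used only for this illustration
sample_html = "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"

print(extract_tag_text(sample_html, "p"))  # ['First', 'Second']
```

A tag that does not occur in the page simply yields an empty list.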

Features

✅ User-friendly Web Interface: Enter URL and tag to scrape data.

✅ Web Scraping with BeautifulSoup: Extracts text from specified HTML tags.

✅ Error Handling: Displays an error message if the URL or the tag is missing.

✅ Minimal and Lightweight: Uses Flask for the backend.

Requirements

Ensure you have the following installed before running the project:

🐍 Python
🌐 Flask
🔗 Requests
🏗️ BeautifulSoup4 (bs4)
🦄 Gunicorn (for production/deployment)

You can install dependencies locally using:

pip install -r requirements.txt
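After installing, a quick check can confirm that the packages are importable. This is a minimal sketch: the function name `missing_modules` is hypothetical, and the import names are assumed from the requirements list above (BeautifulSoup4 is imported as `bs4`).

```python
from importlib.util import find_spec

# Import names assumed for the packages listed under Requirements
REQUIRED_MODULES = ["flask", "requests", "bs4", "gunicorn"]


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```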

Project Structure

📂 Project Directory:

/your_project_directory
│── app.py               # 🏗️ Main Flask application
│── Dockerfile           # 🐳 Docker configuration for Hugging Face
│── requirements.txt     # 📦 Project dependencies
│── templates/
│   │── index.html       # 📄 Home page with form
│   │── result.html      # 📄 Page to display scraped data
│── README.md            # 📖 Project Documentation

Usage & Deployment (Hugging Face Spaces)

This project is configured to easily deploy on Hugging Face Spaces using the Docker SDK.

🚀 Deploying to Hugging Face:

Create a new Space on Hugging Face.
Set the Space SDK to Docker and choose the Blank template.
Upload your project files (app.py, Dockerfile, requirements.txt, and the templates folder) to the Space.
The Space will automatically build the container and start the app on port 7860 using gunicorn.
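The project's Dockerfile is not reproduced here. One possible way to pin gunicorn to port 7860 is a Python config file, sketched below. The file name `gunicorn.conf.py` and the worker and timeout values are assumptions, not taken from the project.

```python
# gunicorn.conf.py (hypothetical file, not part of the listed project structure)

# Listen on all interfaces on port 7860, the port named in the steps above
bind = "0.0.0.0:7860"

# Assumed values for a small demo app; tune for the actual workload
workers = 2
timeout = 30
```

With such a file in place, the container's start command could be `gunicorn -c gunicorn.conf.py app:app`, where `app:app` refers to the `app` object in `app.py`.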

🌍 Access the Web App: Once the build is complete, your app will be live at your Hugging Face Space URL:

https://yourusername-yourspacename.hf.space/

📝 Enter URL and Tag:

Provide a valid URL.
Specify an HTML tag (e.g., p, h1, div).
Click submit to fetch and display the data.
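The app itself only checks that both fields are filled in. If stricter checks are wanted, something along these lines could run before the request is sent. This is a sketch: both helper names are hypothetical, and the tag pattern is a simplifying assumption.

```python
import re
from urllib.parse import urlparse


def is_valid_http_url(url):
    """Accept only absolute http(s) URLs that include a host."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def is_valid_tag_name(tag):
    """Accept bare tag names such as p, h1 or div (no angle brackets)."""
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9-]*", tag.strip()) is not None


print(is_valid_http_url("https://example.com/page"))  # True
print(is_valid_http_url("example.com"))               # False
print(is_valid_tag_name("h1"))                        # True
print(is_valid_tag_name("<p>"))                       # False
```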

Code

from flask import Flask, render_template, request  # Flask is used to create a web app
import requests  # To send HTTP requests
from bs4 import BeautifulSoup  # BeautifulSoup is used for web scraping

app = Flask(__name__)  # Creating a Flask app instance

# Home route - Displays the form
@app.route("/")
def index():
    return render_template("index.html")  # Renders the index.html template

# Scraping route - Scrapes data based on user input
@app.route("/scrape", methods=["POST"])
def scrape():
    url, tag = request.form.get("url"), request.form.get("tag")  # Get URL and tag from the form
    if not url or not tag:  # If any value is missing, return an error
        return render_template("result.html", error="Both URL and Tag are required.")

    # Send an HTTP request to fetch the webpage content (give up after 10 seconds)
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()  # Raise an HTTPError for 4xx/5xx responses

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract text from all occurrences of the given tag
    elements = [e.get_text() for e in soup.find_all(tag)]

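    # Render the result page; fall back to "No Title" when <title> is missing or empty
    title = soup.title.string if soup.title is not None and soup.title.string else "No Title"
    return render_template("result.html", tag=tag, url=url, title=title, elements=elements)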

# Run the Flask server
if __name__ == "__main__":
    app.run(debug=True)  # Debug mode enables auto-reload and in-browser tracebacks; use it for local development only

Notes

⚠️ Important Considerations:

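Works only with publicly accessible websites.
Some websites block automated requests. The app sends a browser-like User-Agent header, which avoids some 403 errors but not all.
Handles missing input, but not every exception: a network failure or an HTTP error status from the target site is not caught and results in a server error.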
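One way to handle failed requests would be to catch `requests.RequestException` and pass the message to the result template through its `error` variable. The helper below is a sketch of that idea, not the project's code; the name `fetch_html` and the 10-second timeout are assumptions.

```python
import requests


def fetch_html(url, timeout=10):
    """Return (html, error); exactly one of the two is None."""
    try:
        response = requests.get(
            url, headers={"User-Agent": "Mozilla/5.0"}, timeout=timeout
        )
        response.raise_for_status()  # Treat 4xx/5xx responses as failures
    except requests.RequestException as exc:
        # Covers malformed URLs, connection errors, timeouts and HTTP errors
        return None, f"Could not fetch {url}: {exc}"
    return response.text, None


html, error = fetch_html("not-a-url")  # Malformed URL fails before any network call
print(error is not None)  # True
```

Inside the scrape route, a non-empty `error` could then be rendered the same way as the missing-input message.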

Future Improvements

✨ Potential Enhancements:

🔄 Add support for multiple tags.
📜 Implement pagination for large data sets.
💾 Store scraped data in a database.
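As a starting point for the first two items, `find_all` accepts a list of tag names, and results can be sliced into pages. This is a rough sketch only: `extract_many`, `paginate`, and the sample HTML are hypothetical and not part of the project.

```python
from bs4 import BeautifulSoup


def extract_many(html, tags):
    """Return (tag name, text) pairs for every element matching any tag in `tags`."""
    soup = BeautifulSoup(html, "html.parser")
    return [(element.name, element.get_text()) for element in soup.find_all(tags)]


def paginate(items, page, per_page=10):
    """Return the slice of `items` for a 1-based page number."""
    start = (max(page, 1) - 1) * per_page
    return items[start:start + per_page]


# Hypothetical sample page used only for this illustration
sample = "<h1>Title</h1><p>One</p><p>Two</p>"
rows = extract_many(sample, ["h1", "p"])

print(paginate(rows, page=1, per_page=2))  # [('h1', 'Title'), ('p', 'One')]
print(paginate(rows, page=2, per_page=2))  # [('p', 'Two')]
```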

🎉 Happy Coding! 🚀