---
title: Webscrapingexample
emoji: 🎨
colorFrom: pink
colorTo: purple
sdk: docker
pinned: false
---

# Flask Web Scraper

Live Demo (Hugging Face Spaces): https://lovnishverma-webscrapingexample.hf.space/

Live Demo (Render): https://webscaraping-simplified.onrender.com

GitHub Repo Link: https://github.com/lovnishverma/webscaraping-simplified

## Overview

This is a simple **Flask-based web scraping application**: the user enters a URL and an HTML tag, and the app fetches the page and displays the text of every matching element.

## Features

- **User-friendly Web Interface:** Enter a URL and a tag to scrape data.
- **Web Scraping with BeautifulSoup:** Extracts text from every occurrence of the specified HTML tag.
- **Error Handling:** Displays an error if the URL or tag is missing.
- **Minimal and Lightweight:** Uses Flask for the backend.

## Requirements

Ensure you have the following installed before running the project:

- Python
- Flask
- Requests
- BeautifulSoup4 (bs4)
- Gunicorn (for production/deployment)

You can install dependencies locally using:

```sh
pip install -r requirements.txt
```

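
After installing, a quick check can confirm that the packages listed above are importable. The sketch below is only an illustration: the helper name `missing_modules` and the list `REQUIRED_MODULES` are made up for this example and are not part of the project.

```python
from importlib.util import find_spec

# Import names for the packages listed above; the PyPI package
# "beautifulsoup4" is imported under the name "bs4".
REQUIRED_MODULES = ["flask", "requests", "bs4", "gunicorn"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```
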
## Project Structure

**Project Directory:**

```
/your_project_directory
├── app.py              # Main Flask application
├── Dockerfile          # Docker configuration for Hugging Face
├── requirements.txt    # Project dependencies
├── templates/
│   ├── index.html      # Home page with form
│   └── result.html     # Page to display scraped data
└── README.md           # Project documentation
```

## Usage & Deployment (Hugging Face Spaces)

This project is configured for deployment on **Hugging Face Spaces** using the Docker SDK.

**Deploying to Hugging Face:**

1. Create a new Space on [Hugging Face](https://huggingface.co/spaces).
2. Set the **Space SDK** to **Docker** and choose the **Blank** template.
3. Upload your project files (`app.py`, `Dockerfile`, `requirements.txt`, and the `templates` folder) to the Space.
4. The Space automatically builds the container and starts the app on port `7860` using `gunicorn`.

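
The project's actual `Dockerfile` is not shown here, so the exact `gunicorn` command is unknown. As one possible setup, the settings from step 4 could live in a Gunicorn config file, which is plain Python. The file name `gunicorn.conf.py`, the `PORT` environment variable, the worker count, and the timeout below are all assumptions for illustration.

```python
# gunicorn.conf.py (hypothetical file, not part of the original project)
import os

# Default to the port named in step 4; reading PORT from the environment
# (an assumption) lets another host override it.
port = int(os.environ.get("PORT", "7860"))

# Listen on all interfaces so the container's port is reachable from outside.
bind = f"0.0.0.0:{port}"

# A small fixed worker count; an assumed value, tune it for the hardware.
workers = 2

# Seconds before an unresponsive worker is restarted; slow target sites may need more.
timeout = 60
```

With such a file in place, the container's start command could be `gunicorn -c gunicorn.conf.py app:app`, where `app:app` refers to the `app` object in `app.py`.
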
**Access the Web App:**

Once the build is complete, your app will be live at your Hugging Face Space URL:

```
https://yourusername-yourspacename.hf.space/
```

**Enter URL and Tag:**

1. Provide a valid URL, including the `http://` or `https://` scheme.
2. Specify an HTML tag (e.g., `p`, `h1`, `div`).
3. Click submit to fetch and display the data.

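
The app itself only checks that both fields are non-empty. As a sketch of stricter validation for the two inputs above, the hypothetical helpers `is_http_url` and `is_tag_name` (not part of the project) use only the standard library. The tag check is deliberately simple and would reject custom elements with hyphens in their names.

```python
from urllib.parse import urlparse


def is_http_url(value):
    """Return True if `value` looks like an absolute http(s) URL with a host."""
    if not value:
        return False
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def is_tag_name(value):
    """Return True if `value` is a plausible HTML tag name such as p, h1, or div."""
    value = (value or "").strip()
    # Letters and digits only, starting with a letter.
    return value.isascii() and value.isalnum() and value[0].isalpha()
```

A route could call both helpers before sending any request and render the error message when either returns `False`.
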
## Code

```python
from flask import Flask, render_template, request  # Flask is used to create a web app
import requests  # To send HTTP requests
from bs4 import BeautifulSoup  # BeautifulSoup is used for web scraping

app = Flask(__name__)  # Creating a Flask app instance


# Home route - Displays the form
@app.route("/")
def index():
    return render_template("index.html")  # Renders the index.html template


# Scraping route - Scrapes data based on user input
@app.route("/scrape", methods=["POST"])
def scrape():
    url, tag = request.form.get("url"), request.form.get("tag")  # Get URL and tag from the form
    if not url or not tag:  # If any value is missing, return an error
        return render_template("result.html", error="Both URL and Tag are required.")

    # Send an HTTP request to fetch the webpage content (the timeout avoids hanging forever)
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()  # Raise an error if the request fails

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract text from all occurrences of the given tag
    elements = [e.get_text() for e in soup.find_all(tag)]

    # Fall back to "No Title" when the page has no <title> element or it is empty
    title = soup.title.string if soup.title and soup.title.string else "No Title"

    # Render the result page with extracted data
    return render_template("result.html", tag=tag, url=url, title=title, elements=elements)


# Run the Flask server
if __name__ == "__main__":
    app.run(debug=True)  # Debug mode enables the reloader and debugger; use it for local development only
```

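
The parsing step can be tried without a network connection by feeding BeautifulSoup an inline HTML string. The following is a minimal sketch, not the app's code: `SAMPLE_HTML`, `extract_text`, and `page_title` are names invented for this example. Unlike the app, it also strips surrounding whitespace from each result.

```python
from bs4 import BeautifulSoup

# A small inline document standing in for a fetched page (made up for this example).
SAMPLE_HTML = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Heading</h1>
    <p>First paragraph.</p>
    <p>Second <b>paragraph</b>.</p>
  </body>
</html>
"""


def extract_text(html, tag):
    """Return the whitespace-trimmed text of every `tag` element in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() includes the text of nested elements such as <b>.
    return [element.get_text().strip() for element in soup.find_all(tag)]


def page_title(html):
    """Return the page title, or "No Title" if it is missing or empty."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is not None and soup.title.string:
        return soup.title.string.strip()
    return "No Title"
```

For example, `extract_text(SAMPLE_HTML, "p")` yields the text of both paragraphs, and a tag that does not occur in the page yields an empty list.
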
## Notes

⚠️ **Important Considerations:**

* Works only with publicly accessible websites.
* Some websites block automated requests; the app sends a browser-like `User-Agent` header to reduce 403 errors, but this does not work for every site.
* Handles missing input, but not other failures: a network error or an HTTP error status raises an unhandled exception and results in a server error.

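
One way to close the gap described in the last point is to catch the exceptions that Requests raises and turn them into a message for the result page. This is a sketch under assumptions, not the project's implementation: the helper `fetch_html`, the constant `HEADERS`, the 10-second timeout, and the message wording are all illustrative.

```python
import requests

# Browser-like User-Agent, matching the header the app already sends.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, timeout=10):
    """Fetch `url` and return (html, error); exactly one of the two is None."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        response.raise_for_status()  # Turn 4xx/5xx responses into HTTPError
    except requests.exceptions.Timeout:
        return None, "The request timed out."
    except requests.exceptions.HTTPError as exc:
        return None, f"The server returned HTTP {exc.response.status_code}."
    except requests.exceptions.RequestException as exc:
        # Covers invalid URLs, DNS failures, refused connections, and similar.
        return None, f"Could not fetch the page: {exc}"
    return response.text, None
```

In the scrape route, `html, error = fetch_html(url)` could be followed by `render_template("result.html", error=error)` whenever `error` is set, reusing the `error` variable the template already receives for missing input.
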
## Future Improvements

✨ **Potential Enhancements:**

* Add support for multiple tags.
* Implement pagination for large data sets.
* Store scraped data in a database.

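
As a rough idea of how the first two enhancements might start, the sketch below parses a comma-separated tag field and slices a result list into pages. Both helpers, `parse_tags` and `paginate`, are hypothetical, and the page size of 20 is an arbitrary default.

```python
def parse_tags(raw):
    """Split comma-separated input such as "p, h1, div" into unique tag names."""
    tags = []
    for part in raw.split(","):
        name = part.strip().lower()
        if name and name not in tags:  # Skip blanks and duplicates, keep order
            tags.append(name)
    return tags


def paginate(items, page, per_page=20):
    """Return the slice of `items` for a 1-based `page` and the total page count."""
    total_pages = max(1, -(-len(items) // per_page))  # Ceiling division
    page = min(max(page, 1), total_pages)  # Clamp out-of-range page numbers
    start = (page - 1) * per_page
    return items[start:start + per_page], total_pages
```

Since BeautifulSoup's `find_all` accepts a list of tag names, the output of `parse_tags` could be passed to it directly.
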

---

**Happy Coding!**