---
title: Webscrapingexample
emoji: 🐨
colorFrom: pink
colorTo: purple
sdk: docker
pinned: false
---

# Flask Web Scraper

Live Demo (Hugging Face Spaces): https://lovnishverma-webscrapingexample.hf.space/

Live Demo (Render): https://webscaraping-simplified.onrender.com

GitHub Repo Link: https://github.com/lovnishverma/webscaraping-simplified

## Overview
This is a simple **Flask-based web scraping application** that allows users to enter a URL and an HTML tag to extract and display content from that webpage.

![image](https://github.com/user-attachments/assets/13838a50-71e5-411d-ac7c-034e5caac405)


## Features
✅ **User-friendly Web Interface:** Enter URL and tag to scrape data.

✅ **Web Scraping with BeautifulSoup:** Extracts text from specified HTML tags.

✅ **Error Handling:** Displays an error if URL or tag is missing.

✅ **Minimal and Lightweight:** Uses Flask for the backend.

## Requirements
Ensure you have the following installed before running the project:

- 🐍 Python
- 🌐 Flask
- 🔗 Requests
- 🏗️ BeautifulSoup4 (bs4)
- 🦄 Gunicorn (for production/deployment)

You can install dependencies locally using:
```sh
pip install -r requirements.txt
```

## Project Structure

📂 **Project Directory:**

```
/your_project_directory
│── app.py               # 🏗️ Main Flask application
│── Dockerfile           # 🐳 Docker configuration for Hugging Face
│── requirements.txt     # 📦 Project dependencies
│── templates/
│   │── index.html       # 📄 Home page with form
│   │── result.html      # 📄 Page to display scraped data
│── README.md            # 📖 Project Documentation
```

## Usage & Deployment (Hugging Face Spaces)

This project is configured to easily deploy on **Hugging Face Spaces** using the Docker SDK.

🚀 **Deploying to Hugging Face:**

1. Create a new Space on [Hugging Face](https://huggingface.co/spaces).
2. Set the **Space SDK** to **Docker** and choose the **Blank** template.
3. Upload your project files (`app.py`, `Dockerfile`, `requirements.txt`, and the `templates` folder) to the Space.
4. The Space will automatically build the container and start the app on port `7860` using `gunicorn`.
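
The project's `Dockerfile` is not reproduced in this README, so the following is only a sketch of how step 4 might be wired up. It assumes a hypothetical `gunicorn.conf.py` (not one of the files listed above) that binds gunicorn to port `7860`; the worker count and timeout are assumed values.

```python
# gunicorn.conf.py (hypothetical file, not part of the project listing above)
# Gunicorn can load its settings from a Python file passed with -c / --config.

# Listen on all interfaces on the port the Space serves the app from.
bind = "0.0.0.0:7860"

# Assumed values; tune them for your own Space.
workers = 2    # number of worker processes
timeout = 30   # seconds before an unresponsive worker is restarted
```

With such a file in place, the container's start command could be `gunicorn -c gunicorn.conf.py app:app`, where `app:app` refers to the `app` object defined in `app.py`.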

🌍 **Access the Web App:**
Once the build is complete, your app will be live at your Hugging Face Space URL:

```
https://yourusername-yourspacename.hf.space/
```

📝 **Enter URL and Tag:**

1. Provide a valid URL.
2. Specify an HTML tag (e.g., `p`, `h1`, `div`).
3. Click submit to fetch and display the data.
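
What counts as a "valid URL" in step 1 is not defined by the app itself. As a rough sketch, a check like the one below could reject obviously malformed input before any request is sent. The helper name `is_valid_url` is hypothetical and is not part of `app.py`; it uses only the standard library.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Return True if url is an absolute http(s) URL that includes a host."""
    try:
        parts = urlparse(url.strip())
    except ValueError:
        # urlparse raises ValueError for some malformed inputs (e.g. bad IPv6 literals)
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

For example, `is_valid_url("https://example.com/page")` passes, while `is_valid_url("example.com")` fails because the scheme is missing.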

## Code

```python
from flask import Flask, render_template, request  # Flask is used to create a web app
import requests  # To send HTTP requests
from bs4 import BeautifulSoup  # BeautifulSoup is used for web scraping

app = Flask(__name__)  # Creating a Flask app instance

# Home route - Displays the form
@app.route("/")
def index():
    return render_template("index.html")  # Renders the index.html template

# Scraping route - Scrapes data based on user input
@app.route("/scrape", methods=["POST"])
def scrape():
    url, tag = request.form.get("url"), request.form.get("tag")  # Get URL and tag from the form
    if not url or not tag:  # If any value is missing, return an error
        return render_template("result.html", error="Both URL and Tag are required.")

    # Send an HTTP request to fetch the webpage content
    # (the timeout keeps the request from hanging on an unresponsive site)
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()  # Raise an exception for 4xx/5xx responses

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract text from all occurrences of the given tag
    elements = [e.get_text() for e in soup.find_all(tag)]

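    # Use the page title if present; soup.title is None when the page has no <title>
    title = soup.title.string if soup.title and soup.title.string else "No Title"

    # Render the result page with extracted data
    return render_template("result.html", tag=tag, url=url, title=title, elements=elements)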

# Run the Flask server
if __name__ == "__main__":
    app.run(debug=True)  # Debug mode: local development only (auto-reload, in-browser debugger)
```

## Notes

⚠️ **Important Considerations:**

* Works only with publicly accessible websites.
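* Some websites block automated requests. The app sends a browser-like `User-Agent` header to reduce 403 errors, but this does not work for every site.
* Only missing form input is handled. Invalid URLs, network failures, and HTTP error responses raise unhandled exceptions.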
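
One possible way to close the exception-handling gap noted above is to wrap the request in a small helper that returns an error message instead of raising. This is a sketch, not the project's current behavior: the helper name `fetch_html` and the 10-second default timeout are assumptions.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0):
    """Return (html, error); exactly one of the two is None."""
    try:
        response = requests.get(
            url, headers={"User-Agent": "Mozilla/5.0"}, timeout=timeout
        )
        # Turn 4xx/5xx responses into exceptions so they are handled below
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Covers invalid URLs, connection errors, timeouts, and HTTP errors
        return None, f"Could not fetch {url}: {exc}"
    return response.text, None
```

Inside `scrape()`, the result could then be used as `html, error = fetch_html(url)`, rendering `result.html` with `error=error` when an error is returned.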

## Future Improvements

✨ **Potential Enhancements:**

* 🔄 Add support for multiple tags.
* 📜 Implement pagination for large data sets.
* 💾 Store scraped data in a database.
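
As an illustration of the first enhancement, multiple tags could be accepted as a comma-separated form value and extracted one by one. This is only a sketch of one approach: the function name `extract_by_tags` and the comma-separated input format are assumptions, not part of the current app.

```python
from bs4 import BeautifulSoup


def extract_by_tags(html, tags_field):
    """Map each requested tag name to the list of text found in those tags."""
    # Split "h1, p" into ["h1", "p"], ignoring empty entries
    tags = [t.strip().lower() for t in tags_field.split(",") if t.strip()]
    soup = BeautifulSoup(html, "html.parser")
    return {
        tag: [e.get_text(strip=True) for e in soup.find_all(tag)]
        for tag in tags
    }
```

Calling `extract_by_tags("<h1>Title</h1><p>One</p><p>Two</p>", "h1, p")` returns `{"h1": ["Title"], "p": ["One", "Two"]}`.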

---

🎉 **Happy Coding! 🚀**