ZainabEman committed on
Commit e146eba · verified · 1 Parent(s): cf9f587

Upload 2 files

Files changed (2):
  1. README.md +59 -3
  2. main.py +73 -0
README.md CHANGED
@@ -1,3 +1,59 @@
- ---
- license: mit
- ---
# Customizable Web Scraper

## Overview
The **Customizable Web Scraper** is a lightweight Python tool that extracts specific elements from any webpage through a simple graphical interface. Built with **Streamlit**, **BeautifulSoup**, and **Pandas**, it lets users analyze a page's HTML structure, select relevant tags, and download the extracted data in CSV or JSON format.

## Features
- ✅ **User-friendly Streamlit interface**
- 🔍 **Automatic detection of available HTML tags**
- 📌 **Custom tag selection** (`h1`, `h2`, `p`, `a`, `img`, `ul`, etc.)
- 📊 **Displays scraped data in a structured table**
- 📥 **Download extracted data as a CSV or JSON file**
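The tag-detection feature can be sketched outside the Streamlit UI; a minimal example using BeautifulSoup directly (the sample HTML string here is hypothetical — the real tool fetches the page with `requests`):

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the app fetches this with requests.get(url)
html = "<h1>Title</h1><p>Intro</p><a href='/x'>link</a><div>wrapper</div>"

# Only these tags are offered to the user for selection
USEFUL_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "a", "img", "ul", "ol", "li"}

soup = BeautifulSoup(html, "html.parser")
# find_all(True) yields every tag on the page; keep only the useful ones
available = sorted({tag.name for tag in soup.find_all(True) if tag.name in USEFUL_TAGS})
print(available)  # ['a', 'h1', 'p']
```

Note that the `<div>` wrapper is filtered out: structural containers rarely carry content worth exporting on their own.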
## Installation

### Prerequisites
Ensure you have **Python 3.x** installed on your system.

### Steps
1. Clone this repository or download the script:
   ```sh
   git clone https://github.com/your-repository/Customizable-Scraper.git
   cd Customizable-Scraper
   ```
2. Install the required dependencies:
   ```sh
   pip install streamlit requests beautifulsoup4 pandas
   ```
3. Run the Streamlit app:
   ```sh
   streamlit run main.py
   ```

## Usage

1. **Enter a URL**: Provide the link of the webpage you want to scrape.
2. **Analyze the page**: The scraper identifies the HTML tags available on the page.
3. **Select tags**: Choose which elements (headings, paragraphs, links, images, lists, etc.) to extract.
4. **Scrape Data**: Click the **"Scrape Data"** button to fetch and display the extracted content.
5. **Download**: Export the scraped data as a CSV or JSON file for offline use.
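The steps above boil down to a fetch–parse–tabulate pipeline. A minimal sketch of that pipeline, using a hypothetical in-memory page instead of a live URL so it runs without network access:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical page content; the app fetches this from the user-supplied URL
html = """
<h1>Example Domain</h1>
<p>This domain is for illustrative examples.</p>
<a href="https://example.com/more">More information</a>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tag in ["h1", "p", "a"]:  # stands in for the tags selected in the UI
    for item in soup.find_all(tag):
        if tag == "a":
            rows.append({"Type": tag, "URL": item.get("href", ""), "Text": item.get_text(strip=True)})
        else:
            rows.append({"Type": tag, "Content": item.get_text(strip=True)})

df = pd.DataFrame(rows)
# Same bytes the "Download as CSV" button serves in the app
csv_bytes = df.to_csv(index=False).encode("utf-8")
print(df)
```

Rows for different tag types carry different columns (`Content` vs. `URL`/`Text`), so the resulting table has empty cells where a column does not apply — matching what the app displays.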
## Technologies Used
- **Streamlit** – Interactive UI for user-friendly operation
- **Requests** – Fetching webpage content
- **BeautifulSoup4** – Parsing and extracting HTML elements
- **Pandas** – Structuring and exporting scraped data

## Limitations
⚠️ This scraper **cannot**:
- Extract **JavaScript-rendered content**
- Access **login-restricted** or otherwise **protected** pages
- Scrape sites that disallow crawling in **robots.txt**
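To respect the robots.txt limitation, a site's rules can be checked up front with the standard library's `urllib.robotparser`. A sketch using hypothetical inline rules rather than a fetched file (in practice you would call `set_url(...)` and `read()` against the site's real `robots.txt`):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real check would fetch the site's own file
rules = [
    "User-agent: *",
    "Disallow: /private/",
]
rp = RobotFileParser()
rp.parse(rules)

# Paths under /private/ are disallowed for all user agents; everything else is allowed
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```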
## License
This project is **open-source** and available for personal and educational use.

## Contributions
🔹 Contributions are welcome!
If you'd like to improve this project, feel free to fork the repository, make enhancements, and submit a **Pull Request**.
main.py ADDED
@@ -0,0 +1,73 @@
import streamlit as st
import requests
from bs4 import BeautifulSoup
import pandas as pd

def analyze_page(url):
    """Fetch the page and return the useful tags it contains, or None on failure."""
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        useful_tags = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'a', 'img', 'ul', 'ol', 'li'}
        available_tags = {tag.name for tag in soup.find_all(True) if tag.name in useful_tags}
        return list(available_tags)
    else:
        return None

def scrape_data(url, selected_tags):
    """Extract the selected tags from the page into a DataFrame, or None on failure."""
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        data = []
        for tag in selected_tags:
            for item in soup.find_all(tag):
                if tag == 'img':
                    data.append({'Type': tag, 'Src': item.get('src', ''), 'Alt Text': item.get('alt', '')})
                elif tag == 'a':
                    data.append({'Type': tag, 'URL': item.get('href', ''), 'Text': item.get_text(strip=True)})
                else:
                    data.append({'Type': tag, 'Content': item.get_text(strip=True)})
        return pd.DataFrame(data)
    else:
        return None

def main():
    st.title("Customizable Web Scraper")
    url = st.text_input("Enter the URL to scrape:")

    if url:
        available_tags = analyze_page(url)
        if available_tags:
            selected_tags = st.multiselect("Select tags to scrape:", available_tags)
            if st.button("Scrape Data"):
                df = scrape_data(url, selected_tags)
                if df is not None and not df.empty:
                    st.write("### Scraped Data:")
                    st.dataframe(df)

                    # CSV download
                    csv = df.to_csv(index=False).encode('utf-8')
                    st.download_button(
                        label="Download as CSV",
                        data=csv,
                        file_name="scraped_data.csv",
                        mime="text/csv",
                        key="csv_download"
                    )

                    # JSON download
                    json_data = df.to_json(orient='records')
                    st.download_button(
                        label="Download as JSON",
                        data=json_data,
                        file_name="scraped_data.json",
                        mime="application/json",
                        key="json_download"
                    )
                else:
                    st.warning("No data found for the selected tags.")
        else:
            st.error("Failed to analyze the page. Check the URL.")

if __name__ == "__main__":
    main()