ZainabEman/Customizable-Web_Scrapper

ZainabEman committed on Feb 21, 2025

Commit e146eba (verified) · 1 Parent(s): cf9f587

Upload 2 files

Files changed (2)

README.md +59 -3
main.py +73 -0

README.md CHANGED

@@ -1,3 +1,59 @@
----
-license: mit
----

+# Customizable Web Scraper
+## Overview
+The **Customizable Web Scraper** is a lightweight Python tool for extracting specific elements from static webpages through a simple graphical interface. Built with **Streamlit**, **Requests**, **BeautifulSoup**, and **Pandas**, it lets users analyze a page's HTML structure, select relevant tags, and download the extracted data in CSV or JSON format.
+## Features
+- ✅ **User-friendly Streamlit interface**
+- 🔍 **Automatic detection of available HTML tags**
+- 📌 **Custom tag selection** (`h1`, `h2`, `p`, `a`, `img`, `ul`, etc.)
+- 📊 **Displays scraped data in a structured table**
+- 📥 **Download extracted data as a CSV or JSON file**
+## Installation
+### Prerequisites
+Ensure you have **Python 3.x** installed on your system.
+### Steps
+1. Clone this repository or download the script:
+   ```sh
+   git clone https://github.com/your-repository/Customizable-Scraper.git
+   cd Customizable-Scraper
+   ```
+2. Install the required dependencies:
+   ```sh
+   pip install streamlit requests beautifulsoup4 pandas
+   ```
+3. Run the Streamlit app:
+   ```sh
+   streamlit run main.py
+   ```
+## Usage
+1. **Enter a URL**: Provide the webpage link you want to scrape.
+2. **Analyze the page**: The scraper will identify available HTML tags.
+3. **Select tags**: Choose which elements (headings, paragraphs, links, images, lists, etc.) to extract.
+4. **Scrape Data**: Click the **"Scrape Data"** button to fetch and display the extracted content.
+5. **Download the results**: Export the scraped data as a CSV or JSON file for offline use.
+## Technologies Used
+- **Streamlit** – Interactive UI for user-friendly operation
+- **Requests** – Fetching webpage content
+- **BeautifulSoup4** – Parsing and extracting HTML elements
+- **Pandas** – Structuring and exporting scraped data
+## Limitations
+⚠️ This scraper **cannot**:
+- Extract data from **JavaScript-rendered content**
+- Access **login-restricted** or **protected** pages
+- Scrape sites that block automated requests (note that `main.py` does not itself check **robots.txt**)
+## License
+This project is **open-source** and available for personal and educational use.
+## Contributions
+🔹 Contributions are welcome!
+If you’d like to improve this project, feel free to fork the repository, make enhancements, and submit a **Pull Request**.
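
The Limitations list mentions robots.txt, while `main.py` in this commit never reads that file. A minimal sketch of such a check, using the standard library's `urllib.robotparser`, might look like the following. The helper name `is_allowed` and the sample rules are assumptions made for illustration and are not part of the commit; the rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper for illustration; main.py does not define it.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no HTTP request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules standing in for a site's real robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/page.html"))  # False
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post.html"))     # True
```

In the app, the robots.txt text would first have to be downloaded from the target site, and a check like this could run before `analyze_page` is called.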

main.py ADDED

@@ -0,0 +1,73 @@

+import streamlit as st
+import requests
+from bs4 import BeautifulSoup
+import pandas as pd
+import json
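+def analyze_page(url):
+    """Return the sorted useful tag names found at url, or None if the fetch fails."""
+    try:
+        # A timeout keeps the app from hanging on an unresponsive server.
+        response = requests.get(url, timeout=10)
+    except requests.RequestException:
+        # Covers malformed URLs, connection errors, and timeouts.
+        return None
+    if response.status_code == 200:
+        soup = BeautifulSoup(response.content, 'html.parser')
+        useful_tags = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'a', 'img', 'ul', 'ol', 'li'}
+        available_tags = {tag.name for tag in soup.find_all(True) if tag.name in useful_tags}
+        # Sort so the multiselect options appear in a stable order.
+        return sorted(available_tags)
+    else:
+        return None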
+def scrape_data(url, selected_tags):
+    """Return a DataFrame with one row per matched element, or None if the fetch fails."""
+    try:
+        response = requests.get(url, timeout=10)
+    except requests.RequestException:
+        # Covers malformed URLs, connection errors, and timeouts.
+        return None
+    if response.status_code == 200:
+        soup = BeautifulSoup(response.content, 'html.parser')
+        data = []
+        for tag in selected_tags:
+            for item in soup.find_all(tag):
+                if tag == 'img':
+                    # Images carry their information in attributes, not text.
+                    data.append({'Type': tag, 'Src': item.get('src', ''), 'Alt Text': item.get('alt', '')})
+                elif tag == 'a':
+                    data.append({'Type': tag, 'URL': item.get('href', ''), 'Text': item.get_text(strip=True)})
+                else:
+                    data.append({'Type': tag, 'Content': item.get_text(strip=True)})
+        return pd.DataFrame(data)
+    else:
+        return None
+def main():
+    st.title("Customizable Web Scraper")
+    url = st.text_input("Enter the URL to scrape:")
+    if url:
+        available_tags = analyze_page(url)
+        if available_tags:
+            selected_tags = st.multiselect("Select tags to scrape:", available_tags)
+            if st.button("Scrape Data"):
+                df = scrape_data(url, selected_tags)
+                if df is not None and not df.empty:
+                    st.write("### Scraped Data:")
+                    st.dataframe(df)
+                    # CSV Download
+                    csv = df.to_csv(index=False).encode('utf-8')
+                    st.download_button(
+                        label="Download as CSV",
+                        data=csv,
+                        file_name="scraped_data.csv",
+                        mime="text/csv",
+                        key="csv_download"
+                    )
+                    # JSON Download
+                    json_data = df.to_json(orient='records')
+                    st.download_button(
+                        label="Download as JSON",
+                        data=json_data,
+                        file_name="scraped_data.json",
+                        mime="application/json",
+                        key="json_download"
+                    )
+                else:
+                    st.warning("No data found for the selected tags.")
+        else:
+            st.error("Failed to analyze the page. Check the URL.")
+if __name__ == "__main__":
+    main()
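
`main()` passes the text-box value straight to `requests.get`, so input without a scheme (for example `example.com`) cannot be fetched. One possible way to tidy the input first is sketched below with the standard library's `urllib.parse`. The helper name `normalize_url` and the choice to default to HTTPS are assumptions made for illustration, not part of the commit.

```python
from typing import Optional
from urllib.parse import urlparse


def normalize_url(raw: str) -> Optional[str]:
    """Return a fetchable http(s) URL, or None if the input cannot be one.

    Hypothetical helper for illustration; main.py does not define it.
    """
    candidate = raw.strip()
    if not candidate:
        return None
    # Assumption: a bare host such as "example.com" is meant as HTTPS.
    if "://" not in candidate:
        candidate = "https://" + candidate
    parsed = urlparse(candidate)
    # Reject other schemes (ftp, file, ...) and inputs with no host part.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    return candidate


print(normalize_url("example.com"))          # https://example.com
print(normalize_url("ftp://example.com"))    # None
```

In the app, the result could be checked right after `st.text_input`, with an error shown when it is `None`, before `analyze_page` is called.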