---
title: WebScraper Pro
emoji: 🕸️
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# 🕷️ WebScraper.pro: Premium Web Scraping Platform

[![Python Version](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/)
[![Flask Version](https://img.shields.io/badge/flask-3.0.3-green.svg)](https://flask.palletsprojects.com/)
[![Playwright](https://img.shields.io/badge/playwright-1.44.0-orange.svg)](https://playwright.dev/python/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**WebScraper.pro** is a professional, secure, and production-ready web scraping platform built on Python and Flask. It provides a visual dashboard to configure, manage, and schedule scraping jobs targeting both static pages and modern dynamic (JavaScript-rendered) websites, exporting cleanly to JSON, CSV, or Excel.

The interface is a **premium dark-mode dashboard** with glassmorphism elements, dynamic micro-animations, and live job status polling.

---

## 🌟 Key Features

* **⚡ Dual Scraping Engines:**
  * **Static Engine:** Fast, lightweight HTTP requests combined with BeautifulSoup4 parsing.
  * **Dynamic Engine:** Headless Playwright integration for complex, client-side, JavaScript-heavy SPAs.
* **📊 Premium Real-Time Dashboard:**
  * Live-updating analytics cards (success rates, items scraped, running jobs).
  * Auto-refreshing status progress bars and real-time console log stream.
* **🔎 Granular Visual Extractors:**
  * Select data via **CSS Selectors**, **XPath Expressions**, **HTML Tags**, or **HTML Attributes**.
  * Custom configurations for **Table Data**, **JSON-LD Schema**, and **Full HTML** extractions.
* **⚙️ Enterprise Crawl Controls:**
  * Auto-rotating random User Agents per request.
  * Adaptive request throttling (custom delays) and automatic retry policies.
  * Infinite scroll triggers (custom scroll depth) and standard pagination crawling.
  * Full support for custom HTTP request headers (JSON format).
* **🔒 Production-Grade Security:**
  * Request rate limiting powered by `Flask-Limiter`.
  * Cross-Site Request Forgery (CSRF) protection via `CSRFProtect`.
  * Strict security headers (CSP, X-Frame-Options, X-Content-Type-Options) and sanitization filters (`bleach`).
* **📥 Multi-Format Exporters:** Export scraped datasets on-demand to structured **JSON**, **CSV**, or Microsoft **Excel** (`.xlsx`).
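
The crawl controls listed above (User-Agent rotation, request delays, and retries) can be sketched roughly as follows. This is a minimal illustration, not the platform's actual implementation: `USER_AGENTS`, `build_headers`, and `fetch_with_retry` are hypothetical names, the User-Agent strings are placeholders, and the `fetch` callable stands in for whatever HTTP client the static engine uses.

```python
import random
import time
from typing import Callable, Optional

# Illustrative pool (assumption); a real deployment would keep a larger, current list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def build_headers(custom: Optional[dict] = None) -> dict:
    """Pick a random User-Agent, then layer any custom headers on top."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    headers.update(custom or {})
    return headers


def fetch_with_retry(
    fetch: Callable[[str, dict], str],
    url: str,
    retries: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url, headers) up to `retries` times, waiting longer after each failure."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, retries + 1):
        try:
            # A fresh random User-Agent is chosen for every attempt.
            return fetch(url, build_headers())
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                sleep(delay * attempt)  # linear back-off: delay, 2*delay, ...
    raise last_error
```

Passing `fetch` and `sleep` in as parameters keeps the retry logic independent of any particular HTTP library and makes it easy to exercise without network access.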

---

## 🏗️ Modular Architecture

The platform has been re-architected from a flat layout into a highly maintainable, standardized **Application Factory Blueprint** structure:

```
webscraping/
├── app/
│   ├── __init__.py           # Application Factory initialization
│   ├── config/               # Settings & Environment configurations
│   ├── models/               # SQLAlchemy ORM schemas (BaseModel, User, ScrapeJob, etc.)
│   ├── scrapers/             # Core Scraping Engine (Static & Playwright)
│   ├── services/             # Job Execution, Excel/CSV Exports, Statistics calculations
│   ├── middleware/           # Rate limiting and Security headers middleware
│   ├── routes/               # Modular controller Blueprints (Main, Jobs)
│   ├── utils/                # Input validation & Logging configurators
│   ├── templates/            # HTML templates (dashboard, job details, forms)
│   └── static/               # Premium custom CSS, JS assets
├── run.py                    # Production-ready entry point
├── .env                      # Environment secrets
└── requirements.txt          # Package dependencies
```
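
For readers unfamiliar with the pattern, an application factory with a Blueprint might look roughly like the sketch below. It is an assumption-laden illustration rather than the contents of `app/__init__.py`: the `main_bp` blueprint, the `/health` route, and the `create_app` signature are hypothetical.

```python
from flask import Blueprint, Flask

# Hypothetical blueprint; the real route modules live under app/routes/.
main_bp = Blueprint("main", __name__)


@main_bp.route("/health")
def health():
    # Flask serializes a returned dict to a JSON response.
    return {"status": "ok"}


def create_app(config=None):
    """Build a fresh Flask app, apply configuration, and register blueprints."""
    app = Flask(__name__)
    app.config.update(config or {})
    app.register_blueprint(main_bp)
    return app
```

Because every call to `create_app` returns a new, independently configured app, tests and different deployment environments can each build their own instance instead of sharing a module-level global.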

---

## 🚀 Quick Start

### 📋 Prerequisites
- Python **3.10** or higher
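- Playwright browser binaries (installed in step 3). The Python `playwright` package bundles its own driver, so a separate Node.js install is normally not needed.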

### 1. Clone & Set Up Directory
```bash
git clone https://github.com/your-username/webscraper-pro.git
cd webscraper-pro
```

### 2. Configure Virtual Environment
```bash
python -m venv venv
# On Windows (PowerShell)
.\venv\Scripts\Activate.ps1
# On macOS/Linux
source venv/bin/activate
```

### 3. Install Dependencies
```bash
pip install -r requirements.txt
playwright install chromium
```

### 4. Setup Environment Variables
Create a `.env` file in the root directory. You can copy the example file:
```bash
cp env.example .env
```

Open `.env` and set up your application environment details:
```ini
# Core
FLASK_APP=app
FLASK_ENV=development
SECRET_KEY=generate-a-strong-random-key-here
DEBUG=True

# Database (Leave commented out to use default SQLite)
# DATABASE_URL=sqlite:///database/scraper.db
```
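
One way to produce a value for `SECRET_KEY` is Python's standard `secrets` module. The helper name `generate_secret_key` below is illustrative only and is not part of the project.

```python
import secrets


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return a random hex string (two hex characters per byte)."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Paste the printed line into your .env file.
    print(f"SECRET_KEY={generate_secret_key()}")
```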

### 5. Launch the Server
```bash
python run.py
```
The application will start, seed database configurations automatically, and be accessible at **`http://127.0.0.1:5000`**.

---

## 🛠️ Configuration Settings (`.env`)

| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `FLASK_ENV` | String | `production` | Environment type (`development` or `production`) |
| `SECRET_KEY` | String | *Required* | Secret key for session signing & CSRF tokens |
| `DEBUG` | Boolean | `False` | Enables or disables interactive debug tools |
| `DATABASE_URL` | String | `sqlite:///database/scraper.db` | Target database connection URI |
| `RATELIMIT_DEFAULT` | String | `100 per minute` | Default global API rate limit constraint |
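
As a rough sketch of how these variables and defaults could be read at startup, assuming plain environment lookups (the project's actual loader under `app/config/` may differ, and `load_settings` is a hypothetical name):

```python
import os
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings such as 'True', '1', or 'yes'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read settings from the environment, applying the defaults from the table."""
    env = os.environ if env is None else env
    secret = env.get("SECRET_KEY")
    if not secret:
        # SECRET_KEY is marked as required, so fail fast when it is missing.
        raise RuntimeError("SECRET_KEY must be set")
    return {
        "FLASK_ENV": env.get("FLASK_ENV", "production"),
        "SECRET_KEY": secret,
        "DEBUG": _as_bool(env.get("DEBUG", "False")),
        "DATABASE_URL": env.get("DATABASE_URL", "sqlite:///database/scraper.db"),
        "RATELIMIT_DEFAULT": env.get("RATELIMIT_DEFAULT", "100 per minute"),
    }
```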

---

## 🔒 Security Compliance

This platform enforces secure engineering best practices:
- **Input Sanitization:** Uses `bleach` and custom content validators to strip dangerous JavaScript/HTML injection payloads before they are committed to the database.
- **Robust Rate-Limiting:** Limits abusive or denial-of-service traffic with `Flask-Limiter`, using in-memory rate-limit windows.
- **Strict Headers:** Employs security middleware to restrict frame injection, prevent content sniffing, and enforce standard CORS controls.
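
To make the header policy concrete, the sketch below shows the kind of response headers such middleware typically sets. The specific values are assumptions chosen for illustration, and `security_headers` and `apply_security_headers` are hypothetical names; the project's middleware may use a different policy.

```python
def security_headers() -> dict:
    """Return an illustrative set of strict response headers (values are assumptions)."""
    return {
        "Content-Security-Policy": "default-src 'self'",  # restrict resource origins
        "X-Frame-Options": "DENY",                        # block framing of the page
        "X-Content-Type-Options": "nosniff",              # disable MIME-type sniffing
    }


def apply_security_headers(existing: dict) -> dict:
    """Merge the security headers into a response's headers without overwriting them."""
    merged = dict(existing)
    for name, value in security_headers().items():
        merged.setdefault(name, value)
    return merged
```

In a Flask app, a function like this would usually be called from an `after_request` hook so that every response carries the headers.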

---

## 📄 License
Distributed under the MIT License. See [LICENSE](LICENSE) for more details.