---
title: WebScraper Pro
emoji: 🕸️
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# 🕷️ WebScraper.pro - Premium Web Scraping Platform
[Python](https://www.python.org/) | [Flask](https://flask.palletsprojects.com/) | [Playwright](https://playwright.dev/python/) | [MIT License](https://opensource.org/licenses/MIT)
**WebScraper.pro** is a professional, secure, and production-ready web scraping platform built on Python and Flask. It provides a visual dashboard to configure, manage, and schedule scraping jobs targeting both static pages and modern dynamic (JavaScript-rendered) websites, exporting cleanly to JSON, CSV, or Excel.
The interface is a **premium dark-mode dashboard** with glassmorphism elements, dynamic micro-animations, and live job status polling.
---
## Live Demo
[https://lovnishverma-webscraper-pro.hf.space](https://lovnishverma-webscraper-pro.hf.space)
<img width="1920" height="1080" alt="webscraping" src="https://github.com/user-attachments/assets/79bc49e6-c919-42c6-8085-1a6bce599ee4" />
## Key Features
* **⚡ Dual Scraping Engines:**
  * **Static Engine:** Fast, lightweight HTTP requests combined with BeautifulSoup4 parsing.
  * **Dynamic Engine:** Headless Playwright integration for JavaScript-heavy, client-side rendered single-page applications (SPAs).
* **Premium Real-Time Dashboard:**
  * Live-updating analytics cards (success rates, items scraped, running jobs).
  * Auto-refreshing progress bars and a real-time console log stream.
* **Granular Visual Extractors:**
  * Select data via **CSS Selectors**, **XPath Expressions**, **HTML Tags**, or **HTML Attributes**.
  * Custom configurations for **Table Data**, **JSON-LD Schema**, and **Full HTML** extractions.
* **Enterprise Crawl Controls:**
  * Auto-rotating random User Agents per request.
  * Adaptive request throttling (custom delays) and automatic retry policies.
  * Infinite scroll triggers (custom scroll depth) and standard pagination crawling.
  * Full support for custom HTTP request headers (JSON format).
* **Production-Grade Security:**
  * Request rate limiting powered by `Flask-Limiter`.
  * Cross-Site Request Forgery (CSRF) protection via `CSRFProtect`.
  * Strict security headers (CSP, X-Frame-Options, X-Content-Type-Options) and input sanitization filters (`bleach`).
* **Multi-Format Exporters:** Export scraped datasets on demand to structured **JSON**, **CSV**, or Microsoft **Excel** (`.xlsx`).
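The crawl controls listed above (a random User Agent per request, a fixed delay, and automatic retries) can be sketched as follows. This is a minimal illustration and not the platform's actual code: `fetch_with_retries`, the `USER_AGENTS` pool, and the injected `fetch` callable are all hypothetical names chosen for the example.

```python
import random
import time
from typing import Callable, Sequence

# Hypothetical pool; a real deployment would keep a larger, current list.
USER_AGENTS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
)


def fetch_with_retries(
    fetch: Callable[[str, dict], str],
    url: str,
    retries: int = 3,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    agents: Sequence[str] = USER_AGENTS,
) -> str:
    """Call fetch(url, headers) up to `retries` times, pausing `delay` seconds between attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        # Pick a fresh User-Agent for every attempt.
        headers = {"User-Agent": random.choice(agents)}
        try:
            return fetch(url, headers)
        except OSError as exc:  # network-level failures only
            last_error = exc
            if attempt < retries:
                sleep(delay)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

Passing `fetch` and `sleep` in as arguments keeps the retry logic independent of any particular HTTP client and makes it easy to exercise without real network calls.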
---
## Modular Architecture
The platform has been re-architected from a flat layout into a highly maintainable, standardized **Application Factory Blueprint** structure:
```
webscraping/
├── app/
│   ├── __init__.py         # Application Factory initialization
│   ├── config/             # Settings & Environment configurations
│   ├── models/             # SQLAlchemy ORM schemas (BaseModel, User, ScrapeJob, etc.)
│   ├── scrapers/           # Core Scraping Engine (Static & Playwright)
│   ├── services/           # Job Execution, Excel/CSV Exports, Statistics calculations
│   ├── middleware/         # Rate limiting and Security headers middleware
│   ├── routes/             # Modular controller Blueprints (Main, Jobs)
│   ├── utils/              # Input validation & Logging configurators
│   ├── templates/          # HTML templates (dashboard, job details, forms)
│   └── static/             # Premium custom CSS, JS assets
├── run.py                  # Production-ready entry point
├── .env                    # Environment secrets
└── requirements.txt        # Package dependencies
```
---
## Quick Start
### Prerequisites
- Python **3.10** or higher
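- Chromium for Playwright (installed in step 3). The Python `playwright` package bundles its own Node.js driver, so a separate Node.js install should not normally be required.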
### 1. Clone & Set Up Directory
```bash
git clone https://github.com/your-username/webscraper-pro.git
cd webscraper-pro
```
### 2. Configure Virtual Environment
```bash
python -m venv venv
# On Windows (PowerShell)
.\venv\Scripts\Activate.ps1
# On macOS/Linux
source venv/bin/activate
```
### 3. Install Dependencies
```bash
pip install -r requirements.txt
playwright install chromium
```
### 4. Setup Environment Variables
Create a `.env` file in the root directory. You can copy the example file:
```bash
cp env.example .env
```
Open `.env` and set up your application environment details:
```ini
# Core
FLASK_APP=app
FLASK_ENV=development
SECRET_KEY=generate-a-strong-random-key-here
DEBUG=True
# Database (Leave commented out to use default SQLite)
# DATABASE_URL=sqlite:///database/scraper.db
```
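One way to produce a value for `SECRET_KEY` is Python's standard `secrets` module. The helper name `generate_secret_key` below is an assumption made for this sketch; the project may ship its own tooling for this.

```python
import secrets


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return a random hex string suitable for pasting into .env as SECRET_KEY."""
    # token_hex(n) yields 2 * n hexadecimal characters from a CSPRNG.
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    print(f"SECRET_KEY={generate_secret_key()}")
```

Run it once and copy the printed line into `.env`; avoid committing the resulting file to version control.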
### 5. Launch the Server
```bash
python run.py
```
The application will start, seed database configurations automatically, and be accessible at **`http://127.0.0.1:5000`**.
---
## 🛠️ Configuration Settings (`.env`)
| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `FLASK_ENV` | String | `production` | Environment type (`development` or `production`) |
| `SECRET_KEY` | String | *Required* | Secret key for session signing & CSRF tokens |
| `DEBUG` | Boolean | `False` | Enables or disables interactive debug tools |
| `DATABASE_URL` | String | `sqlite:///database/scraper.db` | Target database connection URI |
| `RATELIMIT_DEFAULT` | String | `100 per minute` | Default global API rate limit constraint |
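The defaults in the table can be read as the following loading logic. This is a hedged sketch: `load_settings` is a hypothetical function, and the way `DEBUG` strings are parsed into booleans is an assumption, since the README does not show how `app/config/` handles them.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Build a settings dict from environment variables, applying the table's defaults."""
    secret_key = env.get("SECRET_KEY")
    if not secret_key:
        # SECRET_KEY is the only variable marked as required.
        raise RuntimeError("SECRET_KEY is required")
    return {
        "FLASK_ENV": env.get("FLASK_ENV", "production"),
        "SECRET_KEY": secret_key,
        # Assumption: common truthy spellings enable debug mode.
        "DEBUG": env.get("DEBUG", "False").strip().lower() in {"1", "true", "yes"},
        "DATABASE_URL": env.get("DATABASE_URL", "sqlite:///database/scraper.db"),
        "RATELIMIT_DEFAULT": env.get("RATELIMIT_DEFAULT", "100 per minute"),
    }
```

Accepting any mapping, rather than reading `os.environ` directly inside the function, lets the same code be checked against a plain dictionary.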
---
## Security Compliance
This platform enforces secure engineering best practices:
- **Input Sanitization:** Uses `bleach` and custom content validators to strip dangerous JavaScript/HTML injection payloads before they are committed to the database.
- **Robust Rate-Limiting:** Defends against scraping abuse and DoS using local in-memory rate-limit windows.
- **Strict Headers:** Employs security middleware to restrict frame injection, prevent content sniffing, and enforce standard CORS controls.
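To illustrate what an in-memory rate-limit window means, here is a minimal fixed-window counter. It is a teaching sketch only: the platform delegates rate limiting to `Flask-Limiter`, and the class name `FixedWindowLimiter` and its parameters are hypothetical. The default of 100 hits per 60 seconds mirrors the `RATELIMIT_DEFAULT` value.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` hits per client in each `window_seconds` window."""

    def __init__(self, limit=100, window_seconds=60, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        # (client, window index) -> hits; old windows are never purged in this sketch.
        self._counts = defaultdict(int)

    def allow(self, client: str) -> bool:
        # Every timestamp inside the same window maps to the same integer index.
        window = int(self.clock() // self.window_seconds)
        key = (client, window)
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True
```

Because the counters live in process memory, limits reset on restart and are not shared between worker processes.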
---
## License
Distributed under the MIT License. See [LICENSE](LICENSE) for more details.
# Test
A good test target is **Aaj Tak (Hindi News)**. News websites are excellent targets for scraping because they are rich in constantly updating structured data.
Here are four useful things you can scrape from its home page, along with how to fill out the job configuration form for each.
### Option 1: The Easiest & Cleanest Data (JSON-LD Metadata)
News sites embed hidden structured data for Google. This page has a massive `ItemList` JSON-LD block containing the top 20+ trending headlines and their exact URLs. This is the cleanest way to get the top stories without dealing with messy HTML tags.
* **Job Name:** AajTak Top Stories (JSON)
* **Target URL:** [https://www.aajtak.in/](https://www.aajtak.in/)
* **Scrape Type:** Static (Requests + BS4)
* **Extraction Type:** JSON-LD Schema
* **CSS Selector:** *(Leave blank)*
* **Delay Between Requests (s):** 2
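For readers curious what a JSON-LD extraction involves, the sketch below pulls `ItemList` entries out of `<script type="application/ld+json">` blocks. It uses only the standard library so it runs without dependencies, whereas the platform's static engine uses BeautifulSoup4. `JsonLdParser` and `item_list_entries` are hypothetical names, and the assumption that each list item carries `name` and `url` keys may not hold for every site.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of every application/ld+json script block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def item_list_entries(html: str) -> list:
    """Return (name, url) pairs from every ItemList block found in the page."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    entries = []
    for block in parser.blocks:
        if isinstance(block, dict) and block.get("@type") == "ItemList":
            for item in block.get("itemListElement", []):
                if isinstance(item, dict):
                    entries.append((item.get("name"), item.get("url")))
    return entries
```

Missing keys come back as `None` rather than raising, which keeps a partially populated list usable.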
### Option 2: Scraping All Article Headlines (Text)
Use this configuration to pull the visible text of every news headline on the page (Top stories, Sports, Entertainment, Tech, etc.).
* **Job Name:** AajTak All Headlines
* **Target URL:** [https://www.aajtak.in/](https://www.aajtak.in/)
* **Scrape Type:** Static (Requests + BS4)
* **Extraction Type:** Text Content
* **CSS Selector:** `.title h3, .fv-cap h3, .sstitle-listing h3, .title-big h3`
* **Delay Between Requests (s):** 2
### Option 3: Scraping Article URLs (Links)
If your goal is to build a crawler that finds news articles to scrape their full text later, you need the URLs of the articles.
* **Job Name:** AajTak Article Links
* **Target URL:** [https://www.aajtak.in/](https://www.aajtak.in/)
* **Scrape Type:** Static (Requests + BS4)
* **Extraction Type:** HTML Attributes
* **CSS Selector:** `li[data-tb-region-item] a`
* **Attribute Name:** `href`
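Scraped `href` values are often relative or duplicated, so a follow-up crawler usually normalizes them first. The sketch below shows one way to do that with the standard library; `normalize_links` is a hypothetical helper, and the article paths in the usage example are made up for illustration.

```python
from urllib.parse import urldefrag, urljoin


def normalize_links(base_url: str, hrefs: list) -> list:
    """Resolve hrefs against base_url, drop fragments, and remove duplicates in order."""
    seen = set()
    result = []
    for href in hrefs:
        href = (href or "").strip()
        # Skip empty values and links that do not point to a fetchable page.
        if not href or href.startswith(("javascript:", "mailto:", "#")):
            continue
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

For example, `normalize_links("https://www.aajtak.in/", ["/india/story/x", "https://www.aajtak.in/india/story/x#top"])` yields a single absolute URL, because the second entry differs only by its fragment.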
### Option 4: Scraping Thumbnail Images
*Note: Because this site uses "lazy loading" for performance, the actual image URL isn't in the standard `src` attribute; it's hidden in `data-src`.*
* **Job Name:** AajTak Thumbnails
* **Target URL:** [https://www.aajtak.in/](https://www.aajtak.in/)
* **Scrape Type:** Static (Requests + BS4)
* **Extraction Type:** HTML Attributes
* **CSS Selector:** `img.lazyload`
* **Attribute Name:** `data-src`
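The lazy-loading behavior described in the note can be handled by preferring `data-src` and falling back to `src`. This is a standard-library sketch under that assumption; `LazyImageParser` and `lazy_image_urls` are hypothetical names, and the platform's own extractor may work differently.

```python
from html.parser import HTMLParser


class LazyImageParser(HTMLParser):
    """Collect image URLs from <img> tags that carry the `lazyload` class."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "lazyload" not in classes:
            return
        # Lazy-loaded images keep the real URL in data-src; src may be a placeholder.
        url = attr.get("data-src") or attr.get("src")
        if url:
            self.urls.append(url)


def lazy_image_urls(html: str) -> list:
    parser = LazyImageParser()
    parser.feed(html)
    parser.close()
    return parser.urls
```

The fallback to `src` covers images that have already been swapped in or were never lazy-loaded despite carrying the class.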
---