metadata
title: UniversalScrap
emoji: π
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
Universal Web Scraper
A web scraping tool for both static and JavaScript-heavy websites, including single-page applications (SPAs). Built with Playwright and designed to run on Hugging Face Spaces.
Features
- Universal Compatibility: Scrapes both static HTML and JavaScript-rendered content
- Recursive Crawling: Automatically follows and scrapes internal links
- Smart Content Extraction: Converts HTML to clean, readable text
- Multiple Export Options: Individual TXT files plus a ZIP download
- Failure-Resistant: Falls back to alternative fetch methods when the primary one fails
- Optimized Performance: Rate limiting and timeout handling
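The export feature (one TXT file per page, bundled into a ZIP) could look roughly like the sketch below. It is not taken from app.py: the function names `filename_for` and `build_zip`, the file naming scheme, and the `Source:` header line are all assumptions made for illustration. Only the Python standard library is used.

```python
import io
import re
import zipfile
from urllib.parse import urlparse


def filename_for(url: str) -> str:
    """Derive a filesystem-safe TXT filename from a URL (hypothetical scheme)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/") or "index"
    # Replace anything that is not a letter, digit, dot, underscore or hyphen.
    return re.sub(r"[^A-Za-z0-9._-]+", "_", raw) + ".txt"


def build_zip(pages: dict[str, str]) -> bytes:
    """Pack {url: extracted_text} into an in-memory ZIP, one TXT file per page."""
    buffer = io.BytesIO()
    used: set[str] = set()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for url, text in pages.items():
            name = filename_for(url)
            stem, counter = name[:-4], 1
            # URLs that differ only by query string map to the same name,
            # so append a numeric suffix until the name is unique.
            while name in used:
                counter += 1
                name = f"{stem}_{counter}.txt"
            used.add(name)
            archive.writestr(name, f"Source: {url}\n\n{text}")
    return buffer.getvalue()
```

The returned bytes can be written to a temporary file and handed to a download component; keeping the archive in memory avoids leaving partial files behind if a scrape is interrupted.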
Perfect For
- Documentation websites
- E-commerce sites
- News portals
- Blogs and content sites
- Single-page applications (React, Vue, Angular)
- Any website with dynamic content
How It Works
- Primary Method: Uses Playwright to handle JavaScript-heavy sites
- Fallback Method: Uses aiohttp for static content if Playwright fails
- Content Processing: Extracts clean text and all internal links
- Recursive Discovery: Follows internal links up to the specified depth
- File Generation: Creates individual TXT files for each page
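The steps above can be sketched as a small breadth-first crawler. This is a minimal illustration under stated assumptions, not the code in app.py: the fetchers are passed in as plain callables so the sketch runs without a browser or network (in the real tool the first would presumably wrap Playwright and the second aiohttp), and `extract`, `fetch_with_fallback`, and `crawl` are hypothetical names. Text extraction here uses the standard library's `html.parser`, which is likely simpler than what the app does.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Sequence
from urllib.parse import urldefrag, urljoin, urlparse

Fetcher = Callable[[str], str]  # takes a URL, returns HTML, raises on failure


class _PageParser(HTMLParser):
    """Collects visible text and href values from one HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: list[str] = []
        self.hrefs: list[str] = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html: str, base_url: str) -> tuple[str, list[str]]:
    """Content processing: return (clean text, internal links) for one page."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    host = urlparse(base_url).netloc
    links: list[str] = []
    for href in parser.hrefs:
        absolute = urldefrag(urljoin(base_url, href)).url  # drop #fragments
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return "\n".join(parser.text_parts), links


def fetch_with_fallback(url: str, fetchers: Sequence[Fetcher]) -> str:
    """Try each fetcher in order; the first one that succeeds wins."""
    last_error = None
    for fetch in fetchers:
        try:
            return fetch(url)
        except Exception as exc:  # any failure moves on to the next method
            last_error = exc
    raise RuntimeError(f"all fetchers failed for {url}") from last_error


def crawl(start_url: str, fetchers: Sequence[Fetcher], max_depth: int = 1) -> dict[str, str]:
    """Recursive discovery: breadth-first over internal links up to max_depth."""
    pages: dict[str, str] = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch_with_fallback(url, fetchers)
        except RuntimeError:
            continue  # skip pages that no method could fetch
        text, links = extract(html, url)
        pages[url] = text
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages
```

The returned mapping of URL to extracted text is what the file generation step would then write out, one TXT file per entry. Injecting the fetchers also makes it easy to add rate limiting or timeouts inside each callable without touching the crawl loop.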
Built with ❤️ for the web scraping community.