---
title: UniversalScrap
emoji: π
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
---
# Universal Web Scraper
A powerful web scraping tool that can handle **ANY** website, including JavaScript-heavy single-page applications (SPAs). Built with Playwright and designed for Hugging Face Spaces.
## ✨ Features
- **🎯 Universal Compatibility**: Scrapes both static HTML and JavaScript-rendered content
- **Recursive Crawling**: Automatically follows and scrapes internal links up to a specified depth
- **Smart Content Extraction**: Converts HTML to clean, readable text
- **💾 Multiple Export Options**: Individual TXT files plus a ZIP download
- **🛡️ Failure-Resistant**: Falls back to a plain HTTP fetch when browser rendering fails
- **⚡ Optimized Performance**: Rate limiting and timeout handling
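
The export feature above can be sketched with the standard library alone. This is a minimal illustration and not the code in `app.py`: the names `url_to_filename` and `build_zip` and the file-naming scheme are assumptions made for the example.

```python
import io
import re
import zipfile
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Turn a URL into a filesystem-safe .txt name (hypothetical naming scheme)."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Collapse every run of unsafe characters into a single underscore.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")
    return (safe or "index") + ".txt"


def build_zip(pages: dict[str, str]) -> bytes:
    """Pack {url: text} into an in-memory ZIP with one TXT file per page."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for url, text in pages.items():
            archive.writestr(url_to_filename(url), text)
    return buffer.getvalue()
```

Building the archive in memory suits a hosted app, where the bytes can be handed to a download widget without managing temporary files. Two URLs that differ only in their query string would map to the same name under this scheme, so a real exporter would likely add a counter or hash.
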
## Perfect For
- Documentation websites
- E-commerce sites
- News portals
- Blogs and content sites
- Single-page applications (React, Vue, Angular)
- Any website with dynamic content
## 🛠️ How It Works
1. **Primary Method**: Uses Playwright to render JavaScript-heavy sites
2. **Fallback Method**: Uses aiohttp to fetch static content if Playwright fails
3. **Content Processing**: Extracts clean text and all internal links
4. **Recursive Discovery**: Follows links up to a specified depth
5. **File Generation**: Creates an individual TXT file for each page
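
Steps 1 to 4 can be sketched as a small pipeline. To keep the example runnable without a browser, it takes the fetchers as plain callables and uses only the standard library; in the real app the first fetcher would presumably wrap Playwright and the second aiohttp. All names here (`PageParser`, `fetch_with_fallback`, `crawl`) are hypothetical and are not taken from `app.py`.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse

Fetcher = Callable[[str], str]  # takes a URL, returns HTML, raises on failure


class PageParser(HTMLParser):
    """Collects visible text and href values from one HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def fetch_with_fallback(url: str, fetchers: Iterable[Fetcher]) -> str:
    """Try each fetcher in order and return the first HTML that comes back."""
    last_error = None
    for fetch in fetchers:
        try:
            return fetch(url)
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
    raise RuntimeError(f"all fetchers failed for {url}") from last_error


def crawl(start_url: str, fetchers: list, max_depth: int = 1) -> dict:
    """Breadth-first crawl of same-host pages, returning {url: extracted text}."""
    host = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    pages = {}
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch_with_fallback(url, fetchers)
        except RuntimeError:
            continue  # skip pages that no fetcher could load
        parser = PageParser()
        parser.feed(html)
        pages[url] = "\n".join(parser.text_parts)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

Treating "internal" as "same host as the start URL" is one reasonable reading of step 3; a stricter or looser rule (for example, allowing subdomains) would only change the `netloc` comparison. The returned dictionary maps each URL to its text, which is the input step 5 needs to write one TXT file per page.
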

Built with ❤️ for the web scraping community.