---
title: UniversalScrap
emoji: 👀
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
---

# 🚀 Universal Web Scraper

A web scraping tool that handles both static HTML and JavaScript-heavy sites, including single-page applications (SPAs). Built with Playwright, with an aiohttp fallback for static content, and designed for Hugging Face Spaces.

## ✨ Features

- **🎯 Universal Compatibility**: Scrapes static HTML and JavaScript-rendered content
- **πŸ”„ Recursive Crawling**: Automatically follows and scrapes all internal links
- **πŸ“Š Smart Content Extraction**: Converts HTML to clean, readable text
- **πŸ’Ύ Multiple Export Options**: Individual TXT files + ZIP download
- **🛡️ Failure-Resistant**: Falls back to a plain HTTP fetch (aiohttp) when Playwright fails
- **⚑ Optimized Performance**: Rate limiting and timeout handling

## 🚀 Perfect For

- Documentation websites
- E-commerce sites
- News portals
- Blogs and content sites
- Single-page applications (React, Vue, Angular)
- Any website with dynamic content

## 🛠️ How It Works

1. **Primary Method**: Uses Playwright to handle JavaScript-heavy sites
2. **Fallback Method**: Uses aiohttp for static content if Playwright fails
3. **Content Processing**: Extracts clean text and all internal links
4. **Recursive Discovery**: Follows internal links up to the specified depth
5. **File Generation**: Creates one TXT file per page, plus a ZIP archive of all files
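
The steps above can be sketched roughly as follows. This is a simplified, synchronous illustration, not the actual implementation: `fetch_with_fallback`, `crawl`, and `save_pages` are hypothetical names, the `fetchers` list stands in for a Playwright-based fetcher followed by an aiohttp-based one, and `extract_page` is assumed to be any function that returns `(text, internal_links)` for a page.

```python
import re
import zipfile
from collections import deque
from pathlib import Path


def fetch_with_fallback(url, fetchers):
    """Try each fetcher in order; return the first non-empty HTML, else None."""
    for fetch in fetchers:
        try:
            html = fetch(url)
        except Exception:
            continue  # this method failed; try the next one
        if html:
            return html
    return None


def crawl(start_url, fetchers, extract_page, max_depth=1, max_pages=50):
    """Breadth-first crawl; returns {url: text} for every page fetched."""
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch_with_fallback(url, fetchers)
        if html is None:
            continue  # every method failed for this URL; skip it
        text, links = extract_page(html, url)
        pages[url] = text
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages


def save_pages(pages, out_dir):
    """Write one TXT file per page and bundle them into pages.zip."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    zip_path = out / "pages.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, (url, text) in enumerate(pages.items(), start=1):
            slug = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")[:80]
            name = f"{index:03d}_{slug}.txt"
            (out / name).write_text(text, encoding="utf-8")
            archive.write(out / name, arcname=name)
    return zip_path
```

Passing the fetchers in as plain callables keeps the crawl loop independent of any particular browser or HTTP library, so the fallback order can be changed without touching the traversal logic.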

Built with ❤️ for the web scraping community.