	---
	title: UniversalScrap
	emoji: 👀
	colorFrom: blue
	colorTo: green
	sdk: gradio
	sdk_version: 5.34.2
	app_file: app.py
	pinned: false
	---

	# 🚀 Universal Web Scraper

	A web scraping tool that handles both static HTML pages and JavaScript-heavy single-page applications (SPAs). Built with Playwright and designed for Hugging Face Spaces.

	## ✨ Features

	- 🎯 Universal Compatibility: Scrapes static HTML and JavaScript-rendered content
	- 🔄 Recursive Crawling: Automatically follows and scrapes internal links up to a specified depth
	- 📊 Smart Content Extraction: Converts HTML to clean, readable text
	- 💾 Multiple Export Options: Individual TXT files + ZIP download
	- 🛡️ Failure-Resistant: Falls back to a plain aiohttp request when Playwright fails
	- ⚡ Optimized Performance: Rate limiting and timeout handling
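
To make the content extraction and export features more concrete, here is a minimal sketch using only the Python standard library. It is not the code from `app.py`: the names `TextExtractor`, `html_to_text`, `url_to_filename`, and `build_zip` are hypothetical, and the actual app may use a different HTML-to-text approach.

```python
import io
import re
import zipfile
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping script, style, and noscript contents."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def url_to_filename(url: str) -> str:
    """Turn a URL into a filesystem-safe TXT file name (assumed naming scheme)."""
    stem = re.sub(r"^https?://", "", url)
    stem = re.sub(r"[^A-Za-z0-9]+", "_", stem).strip("_")
    return (stem or "index") + ".txt"


def build_zip(pages: dict) -> bytes:
    """Pack {url: html} into an in-memory ZIP with one TXT file per page."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for url, html in pages.items():
            archive.writestr(url_to_filename(url), html_to_text(html))
    return buffer.getvalue()
```

Calling `build_zip({"https://example.com/docs/intro": html})` yields ZIP bytes containing `example_com_docs_intro.txt`. Two URLs that map to the same file name would collide in this sketch, so a real implementation would need a disambiguation step.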

	## 🚀 Perfect For

	- Documentation websites
	- E-commerce sites
	- News portals
	- Blogs and content sites
	- Single-page applications (React, Vue, Angular)
	- Any website with dynamic content

	## 🛠️ How It Works

	1. Primary Method: Uses Playwright to handle JavaScript-heavy sites
	2. Fallback Method: Uses aiohttp for static content if Playwright fails
	3. Content Processing: Extracts clean text and all internal links
	4. Recursive Discovery: Follows links up to a specified depth
	5. File Generation: Creates individual TXT files for each page
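
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in `app.py`: the fetchers are plain callables that take a URL and return HTML (in the real app they would presumably wrap Playwright and aiohttp), the code is synchronous for brevity, and the names `internal_links`, `fetch_with_fallback`, and `crawl` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(page_url, html):
    """Return absolute, de-duplicated links that stay on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop #fragments before comparing.
        absolute = urldefrag(urljoin(page_url, href)).url
        parsed = urlparse(absolute)
        same_site = parsed.scheme in ("http", "https") and parsed.netloc == host
        if same_site and absolute not in links:
            links.append(absolute)
    return links


def fetch_with_fallback(url, fetchers):
    """Try each fetcher in order and return the first HTML that comes back."""
    last_error = None
    for fetch in fetchers:
        try:
            return fetch(url)
        except Exception as error:  # a real app would catch narrower errors
            last_error = error
    raise RuntimeError(f"all fetchers failed for {url}") from last_error


def crawl(start_url, fetchers, max_depth=1):
    """Breadth-first crawl returning {url: html} for pages within max_depth."""
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch_with_fallback(url, fetchers)
        except RuntimeError:
            continue  # skip pages that no fetcher could load
        pages[url] = html
        if depth < max_depth:
            for link in internal_links(url, html):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages
```

With `fetchers=[render_with_browser, fetch_static]` (both hypothetical), the second callable is used only when the first raises, which mirrors the primary and fallback methods described above. The sketch omits rate limiting and timeouts.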

	Built with ❤️ for the web scraping community.