metadata
title: UniversalScrap
emoji: 👀
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false

🚀 Universal Web Scraper

A web scraping tool that handles both static HTML sites and JavaScript-heavy single-page applications (SPAs). Built with Playwright and designed for Hugging Face Spaces.

✨ Features

  • 🎯 Universal Compatibility: Scrapes static HTML and JavaScript-rendered content
  • 🔄 Recursive Crawling: Automatically follows and scrapes internal links up to a specified depth
  • 📊 Smart Content Extraction: Converts HTML to clean, readable text
  • 💾 Multiple Export Options: Individual TXT files + ZIP download
  • 🛡️ Failure-Resistant: Multiple fallback methods ensure success
  • ⚡ Optimized Performance: Rate limiting and timeout handling
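
To make the content extraction and export features above more concrete, here is a minimal sketch using only the Python standard library. It is not taken from app.py: the names `html_to_text`, `url_to_filename`, and `build_zip` are hypothetical, and the actual Space may use a different HTML-to-text approach.

```python
import io
import re
import zipfile
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script>, <style>, <noscript> contents."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and drop empty fragments.
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.parts.append(text)


def html_to_text(html):
    """Return the visible text of an HTML document, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def url_to_filename(url):
    """Derive a filesystem-safe TXT filename from a URL (hypothetical scheme)."""
    name = re.sub(r"^https?://", "", url)
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")
    return (name or "index") + ".txt"


def build_zip(pages):
    """Pack {url: html} into an in-memory ZIP with one TXT file per page."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for url, html in pages.items():
            archive.writestr(url_to_filename(url), html_to_text(html))
    return buffer.getvalue()
```

The returned bytes can be written to disk or handed to a download component. Note that this naming scheme can map two different URLs to the same filename, so a real implementation would likely need to de-duplicate names.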

🚀 Perfect For

  • Documentation websites
  • E-commerce sites
  • News portals
  • Blogs and content sites
  • Single-page applications (React, Vue, Angular)
  • Any website with dynamic content

🛠️ How It Works

  1. Primary Method: Uses Playwright to handle JavaScript-heavy sites
  2. Fallback Method: Uses aiohttp for static content if Playwright fails
  3. Content Processing: Extracts clean text and all internal links
  4. Recursive Discovery: Follows links up to a specified depth
  5. File Generation: Creates individual TXT files for each page
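
The fetch-with-fallback and recursive discovery steps above could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in app.py: the function names (`fetch_with_playwright`, `fetch_with_aiohttp`, `fetch_html`, `internal_links`, `crawl`), the 30-second timeouts, the `networkidle` wait condition, and the `delay` parameter are all hypothetical choices. "Internal" is assumed to mean "same host as the start URL".

```python
import asyncio
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


async def fetch_with_playwright(url, timeout_ms=30000):
    """Primary method: render the page in headless Chromium."""
    from playwright.async_api import async_playwright  # third-party

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return await page.content()
        finally:
            await browser.close()


async def fetch_with_aiohttp(url, timeout_s=30):
    """Fallback method: plain HTTP GET, no JavaScript execution."""
    import aiohttp  # third-party

    timeout = aiohttp.ClientTimeout(total=timeout_s)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as response:
            response.raise_for_status()
            return await response.text()


async def fetch_html(url, primary=fetch_with_playwright, fallback=fetch_with_aiohttp):
    """Try the primary fetcher; on any error, use the fallback."""
    try:
        return await primary(url)
    except Exception:
        return await fallback(url)


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(html, base_url):
    """Return absolute, fragment-free http(s) links on the same host as base_url."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urldefrag(urljoin(base_url, href)).url
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            if absolute not in links:
                links.append(absolute)
    return links


async def crawl(start_url, max_depth=1, fetch=fetch_html, delay=0.0):
    """Breadth-first crawl of internal links, returning {url: html}."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        try:
            html = await fetch(url)
        except Exception:
            continue  # both fetch methods failed; skip this page
        pages[url] = html
        if depth < max_depth:
            for link in internal_links(html, url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        await asyncio.sleep(delay)  # simple rate limiting between requests
    return pages
```

A call such as `asyncio.run(crawl("https://example.com/", max_depth=2, delay=1.0))` would return the raw HTML per URL, which can then be converted to text and written out as one TXT file per page. Passing the fetcher as a parameter keeps the crawl logic testable without a browser or network access.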

Built with ❤️ for the web scraping community.