---
title: Web Scraper
emoji: 🕷️
colorFrom: yellow
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Web Scraper & Sitemap Generator

A Python Gradio application that scrapes websites, converts content to markdown, and generates sitemaps from page links. Available both as a web interface and as an MCP (Model Context Protocol) server for AI integration.

## Features

- 🕷️ **Web Scraping**: Extract text content from any website
- 📝 **Markdown Conversion**: Convert scraped HTML content to clean markdown format
- 🗺️ **Sitemap Generation**: Create an organized sitemap from all links found on the page
- 🌐 **User-Friendly Interface**: Easy-to-use Gradio web interface
- 🔗 **Link Organization**: Separate internal and external links for better navigation
- 🤖 **MCP Server**: Expose scraping tools to AI assistants and LLMs
## Installation

1. Install the Python dependencies:

```bash
pip install -r requirements.txt
```
## Usage

### Web Interface

1. Run the web application:

```bash
python app.py
```

2. Open your browser and navigate to `http://localhost:7861`
3. Enter a URL in the input field and click "Scrape Website"
4. View the results:
   - **Status**: Shows success/error messages
   - **Scraped Content**: Website content converted to markdown
   - **Sitemap**: Organized list of all links found on the page
### MCP Server

1. Run the MCP server:

```bash
python mcp_server.py
```

2. The server will be available at `http://localhost:7862`
3. **MCP Endpoint**: `http://localhost:7862/gradio_api/mcp/sse`

#### Available MCP Tools

- **scrape_content**: Extract and format website content as markdown
- **generate_sitemap**: Generate a sitemap of all links found on a webpage
- **analyze_website**: Complete website analysis with both content and sitemap
#### MCP Client Configuration

To use the server with Claude Desktop or another MCP client, add this to your client configuration:

```json
{
  "mcpServers": {
    "web-scraper": {
      "url": "http://localhost:7862/gradio_api/mcp/sse"
    }
  }
}
```
## Dependencies

- `gradio[mcp]`: Web interface framework with MCP support
- `requests`: HTTP library for making web requests
- `beautifulsoup4`: HTML parsing library
- `markdownify`: HTML-to-markdown conversion
- `lxml`: XML and HTML parser
## Project Structure

```
web-scraper/
├── app.py                # Main web interface application
├── mcp_server.py         # MCP server with exposed tools
├── requirements.txt      # Python dependencies
├── README.md             # Project documentation
├── .github/
│   └── copilot-instructions.md
└── .vscode/
    └── tasks.json        # VS Code tasks
```
## Feature Details

### Web Scraping

- Handles both HTTP and HTTPS URLs
- Automatically adds the protocol if it is missing
- Removes unwanted elements (scripts, styles, navigation)
- Focuses on main content areas
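The protocol handling above can be sketched with the standard library. This is an illustrative helper, not the exact code in `app.py`; the name `normalize_url` is hypothetical:

```python
from urllib.parse import urlparse

def normalize_url(url: str) -> str:
    """Prepend https:// when the URL has no scheme (hypothetical helper)."""
    url = url.strip()
    if not urlparse(url).scheme:
        url = "https://" + url
    return url

print(normalize_url("example.com"))        # https://example.com
print(normalize_url("http://python.org"))  # left unchanged
```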
### Markdown Conversion

- Converts HTML to clean markdown
- Preserves the heading structure
- Removes empty links and excessive whitespace
- Adds the page title as the main heading
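To show what heading-preserving conversion means, here is a minimal stdlib-only sketch. The app itself uses the `markdownify` package; this toy parser only handles headings and plain text:

```python
from html.parser import HTMLParser

class HeadingsToMarkdown(HTMLParser):
    """Toy converter: maps <h1>..<h6> to #-style markdown headings."""
    def __init__(self):
        super().__init__()
        self.out = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self._prefix = "#" * int(tag[1]) + " "

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self._prefix + text)
            self._prefix = ""

def to_markdown(html: str) -> str:
    parser = HeadingsToMarkdown()
    parser.feed(html)
    return "\n\n".join(parser.out)

print(to_markdown("<h1>Title</h1><p>Body text.</p>"))
# # Title
#
# Body text.
```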
### Sitemap Generation

- Extracts all links from the page
- Converts relative URLs to absolute URLs
- Organizes links by domain (internal vs. external)
- Limits the displayed links to keep the output manageable
- Filters out unwanted links (anchors, `javascript:` URLs, etc.)
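The resolution and internal/external split described above can be sketched with `urllib.parse`. Function and variable names here are illustrative, not taken from the source:

```python
from urllib.parse import urljoin, urlparse

def organize_links(base_url, hrefs):
    """Resolve relative links against base_url and split them into
    internal vs. external lists, skipping anchors and javascript: links."""
    internal, external = [], []
    base_domain = urlparse(base_url).netloc
    for href in hrefs:
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            continue
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == base_domain:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external

internal, external = organize_links(
    "https://example.com/docs/",
    ["../about", "#top", "https://python.org", "javascript:void(0)"],
)
print(internal)  # ['https://example.com/about']
print(external)  # ['https://python.org']
```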
## Example URLs to Try

- `https://httpbin.org/html` - Simple test page
- `https://example.com` - Basic example site
- `https://python.org` - The official Python website
## Error Handling

The application includes error handling for:

- Invalid URLs
- Network timeouts
- HTTP errors
- Content parsing issues
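A sketch of how these error categories might be caught, using only the standard library for illustration (the app itself uses `requests`; the `fetch` helper is hypothetical):

```python
from urllib.error import HTTPError
from urllib.parse import urlparse
from urllib.request import urlopen

def fetch(url: str, timeout: float = 10.0):
    """Return (ok, payload_or_message), mirroring the error categories above."""
    if urlparse(url).scheme not in ("http", "https"):
        return False, "Error: invalid URL"
    try:
        with urlopen(url, timeout=timeout) as resp:
            return True, resp.read()
    except HTTPError as exc:   # HTTP errors (4xx/5xx)
        return False, f"Error: HTTP {exc.code}"
    except OSError as exc:     # URLError, timeouts, refused connections
        return False, f"Error: {exc}"

print(fetch("not-a-url"))  # (False, 'Error: invalid URL')
```

Catching `HTTPError` before the broader `OSError` matters: `HTTPError` is a subclass, so the more specific status-code message would otherwise never be reached.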
## Customization

You can customize the scraper by modifying:

- The User-Agent string in the `WebScraper` class
- The content extraction selectors
- The markdown formatting rules
- The link filtering criteria