---
title: Web Scraper
emoji: 🕷️
colorFrom: yellow
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Web Scraper & Sitemap Generator

A Python Gradio application that scrapes websites, converts content to markdown, and generates sitemaps from page links. Available both as a web interface and as an MCP (Model Context Protocol) server for AI integration.

## Features

- 🕷️ **Web Scraping**: Extract text content from any website
- 📝 **Markdown Conversion**: Convert scraped HTML content to clean markdown format
- 🗺️ **Sitemap Generation**: Create an organized sitemap from all links found on the page
- 🌐 **User-Friendly Interface**: Easy-to-use Gradio web interface
- 🔗 **Link Organization**: Separate internal and external links for better navigation
- 🤖 **MCP Server**: Expose scraping tools to AI assistants and LLMs
## Installation

1. Install the Python dependencies:

```bash
pip install -r requirements.txt
```
## Usage

### Web Interface

1. Run the web application:

```bash
python app.py
```

2. Open your browser and navigate to `http://localhost:7861`
3. Enter a URL in the input field and click "Scrape Website"
4. View the results:
   - **Status**: Shows success/error messages
   - **Scraped Content**: Website content converted to markdown
   - **Sitemap**: Organized list of all links found on the page
### MCP Server

1. Run the MCP server:

```bash
python mcp_server.py
```

2. The server will be available at `http://localhost:7862`
3. **MCP Endpoint**: `http://localhost:7862/gradio_api/mcp/sse`

#### Available MCP Tools

- **scrape_content**: Extract and format website content as markdown
- **generate_sitemap**: Generate a sitemap of all links found on a webpage
- **analyze_website**: Complete website analysis with both content and sitemap
#### MCP Client Configuration

To use the server with Claude Desktop or another MCP client, add this to your client configuration:

```json
{
  "mcpServers": {
    "web-scraper": {
      "url": "http://localhost:7862/gradio_api/mcp/sse"
    }
  }
}
```
## Dependencies

- `gradio[mcp]`: Web interface framework with MCP support
- `requests`: HTTP library for making web requests
- `beautifulsoup4`: HTML parsing library
- `markdownify`: HTML-to-markdown conversion
- `lxml`: XML and HTML parser
## Project Structure

```
web-scraper/
├── app.py                # Main web interface application
├── mcp_server.py         # MCP server with exposed tools
├── requirements.txt      # Python dependencies
├── README.md             # Project documentation
├── .github/
│   └── copilot-instructions.md
└── .vscode/
    └── tasks.json        # VS Code tasks
```
## Feature Details

### Web Scraping

- Handles both HTTP and HTTPS URLs
- Automatically adds the protocol if it is missing
- Removes unwanted elements (scripts, styles, navigation)
- Focuses on main content areas
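The protocol handling above can be sketched with the standard library. This is an illustrative helper, not the exact code in `app.py`; the name `normalize_url` is hypothetical:

```python
from urllib.parse import urlparse

def normalize_url(url: str) -> str:
    """Prepend https:// when the URL has no scheme (hypothetical helper)."""
    url = url.strip()
    if not urlparse(url).scheme:
        url = "https://" + url
    return url

print(normalize_url("example.com"))        # https://example.com
print(normalize_url("http://python.org"))  # left unchanged
```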
### Markdown Conversion

- Converts HTML to clean markdown
- Preserves the heading structure
- Removes empty links and excessive whitespace
- Adds the page title as the main heading
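To show what heading-preserving conversion means, here is a minimal stdlib-only sketch. The app itself uses the `markdownify` package; this toy parser only handles headings and plain text:

```python
from html.parser import HTMLParser

class HeadingsToMarkdown(HTMLParser):
    """Toy converter: maps <h1>..<h6> to #-style markdown headings."""
    def __init__(self):
        super().__init__()
        self.out = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self._prefix = "#" * int(tag[1]) + " "

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self._prefix + text)
            self._prefix = ""

def to_markdown(html: str) -> str:
    parser = HeadingsToMarkdown()
    parser.feed(html)
    return "\n\n".join(parser.out)

print(to_markdown("<h1>Title</h1><p>Body text.</p>"))
# # Title
#
# Body text.
```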
### Sitemap Generation

- Extracts all links from the page
- Converts relative URLs to absolute URLs
- Organizes links by domain (internal vs. external)
- Limits the displayed links to keep the output manageable
- Filters out unwanted links (anchors, `javascript:` URLs, etc.)
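The resolution and internal/external split described above can be sketched with `urllib.parse`. Function and variable names here are illustrative, not taken from the source:

```python
from urllib.parse import urljoin, urlparse

def organize_links(base_url, hrefs):
    """Resolve relative links against base_url and split them into
    internal vs. external lists, skipping anchors and javascript: links."""
    internal, external = [], []
    base_domain = urlparse(base_url).netloc
    for href in hrefs:
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            continue
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == base_domain:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external

internal, external = organize_links(
    "https://example.com/docs/",
    ["../about", "#top", "https://python.org", "javascript:void(0)"],
)
print(internal)  # ['https://example.com/about']
print(external)  # ['https://python.org']
```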
## Example URLs to Try

- `https://httpbin.org/html` - Simple test page
- `https://example.com` - Basic example site
- `https://python.org` - The official Python website
## Error Handling

The application includes error handling for:

- Invalid URLs
- Network timeouts
- HTTP errors
- Content parsing issues
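A sketch of how these error categories might be caught, using only the standard library for illustration (the app itself uses `requests`; the `fetch` helper is hypothetical):

```python
from urllib.error import HTTPError
from urllib.parse import urlparse
from urllib.request import urlopen

def fetch(url: str, timeout: float = 10.0):
    """Return (ok, payload_or_message), mirroring the error categories above."""
    if urlparse(url).scheme not in ("http", "https"):
        return False, "Error: invalid URL"
    try:
        with urlopen(url, timeout=timeout) as resp:
            return True, resp.read()
    except HTTPError as exc:   # HTTP errors (4xx/5xx)
        return False, f"Error: HTTP {exc.code}"
    except OSError as exc:     # URLError, timeouts, refused connections
        return False, f"Error: {exc}"

print(fetch("not-a-url"))  # (False, 'Error: invalid URL')
```

Catching `HTTPError` before the broader `OSError` matters: `HTTPError` is a subclass, so the more specific status-code message would otherwise never be reached.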
## Customization

You can customize the scraper by modifying:

- The User-Agent string in the `WebScraper` class
- The content extraction selectors
- The markdown formatting rules
- The link filtering criteria