# MCP Server Integration

## Table of Contents

1. [Overview](#overview)
2. [Available MCP Servers](#available-mcp-servers)
3. [Tool Registry & Discovery](#tool-registry--discovery)
4. [HTML Processing MCPs](#html-processing-mcps)
5. [Lazy Loading System](#lazy-loading-system)
6. [MCP Composition](#mcp-composition)
7. [Testing Panel](#testing-panel)
8. [Configuration](#configuration)

---

## Overview

The **Model Context Protocol (MCP)** enables the WebScraper agent to interact with external tools, databases, and services through a standardized interface. MCP servers expose **tools** that the agent can discover and use dynamically.

### Why MCP?

**Without MCP:**

- Agent limited to built-in capabilities
- Cannot access external databases, APIs, or specialized libraries
- Difficult to extend without code changes

**With MCP:**

- ✅ Dynamically discover and use 100+ community tools
- ✅ Access databases (PostgreSQL, MongoDB, etc.)
- ✅ Use specialized libraries (BeautifulSoup, Selenium, Playwright)
- ✅ Integrate with external APIs (Google, GitHub, etc.)
- ✅ Extend agent capabilities without code changes

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      WebScraper Agent                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌────────────────────────────────────────────────────┐     │
│  │                 MCP Tool Registry                  │     │
│  │  - Discovers available tools from all MCP servers  │     │
│  │  - Provides tool metadata to agent                 │     │
│  │  - Routes tool calls to appropriate server         │     │
│  └───────────────────────┬────────────────────────────┘     │
│                          │                                  │
└──────────────────────────┼──────────────────────────────────┘
                           │
         ┌─────────────────┼───────────────┬─────────────┬────────────┐
         │                 │               │             │            │
         ▼                 ▼               ▼             ▼            ▼
┌─────────────────┐ ┌──────────────┐ ┌────────────┐ ┌──────────┐ ┌──────────┐
│ HTML Parser     │ │ Browser      │ │ Database   │ │ File     │ │ Custom   │
│ MCP             │ │ MCP          │ │ MCP        │ │ System   │ │ MCP      │
│                 │ │              │ │            │ │ MCP      │ │          │
│ • BeautifulSoup │ │ • Puppeteer  │ │ • Postgres │ │ • Read   │ │ • Your   │
│ • lxml          │ │ • Playwright │ │ • MongoDB  │ │ • Write  │ │   tools  │
│ • html5lib      │ │ • Selenium   │ │ • Redis    │ │ • Search │ │          │
└─────────────────┘ └──────────────┘ └────────────┘ └──────────┘ └──────────┘
```

---

## Available MCP Servers

### 1. HTML Processing & Parsing

#### **beautifulsoup-mcp**

Advanced HTML parsing and extraction.
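
For intuition about the kind of work an HTML-extraction tool does, here is a minimal sketch that collects visible text while skipping `<script>` and `<style>` contents. It deliberately uses Python's standard-library `html.parser` rather than BeautifulSoup, and the names `TextExtractor` and `extract_visible_text` are hypothetical; it says nothing about how the server itself is built.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0          # > 0 while inside a skipped element
        self.chunks: List[str] = []   # non-empty text fragments, in order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_visible_text(html: str) -> str:
    """Hypothetical helper: return visible text joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A parser built on a full HTML library would also handle malformed markup and CSS selectors, which this sketch does not attempt.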

**Tools:**

- `parse_html(html: str, parser: str = "html.parser")`: Parse HTML into DOM tree
- `find_all(html: str, selector: str)`: CSS selector search
- `extract_text(html: str, selector: str)`: Extract text content
- `extract_attributes(html: str, selector: str, attrs: List[str])`: Get element attributes
- `clean_html(html: str)`: Remove scripts, styles, comments
- `extract_tables(html: str)`: Parse all tables into structured data

**Configuration:**

```json
{
  "mcpServers": {
    "beautifulsoup": {
      "command": "python",
      "args": ["-m", "mcp_beautifulsoup"],
      "enabled": true,
      "autoDownload": true,
      "config": {
        "default_parser": "lxml",
        "encodings": ["utf-8", "latin-1"]
      }
    }
  }
}
```

**Example Usage:**

```python
# Agent action
action = Action(
    action_type="MCP_TOOL_CALL",
    tool_name="beautifulsoup.find_all",
    tool_params={
        "html": observation.page_html,
        "selector": "div.product-card"
    }
)

# Response
{
    "products": [
        {"name": "Widget", "price": "$49.99"},
        {"name": "Gadget", "price": "$39.99"}
    ]
}
```

#### **lxml-mcp**

Fast XML/HTML parsing with XPath support.

**Tools:**

- `xpath_query(html: str, xpath: str)`: XPath extraction
- `css_select(html: str, css: str)`: CSS selector (fast)
- `validate_html(html: str)`: Check well-formedness

#### **html5lib-mcp**

Standards-compliant HTML5 parsing.

**Tools:**

- `parse_html5(html: str)`: Parse like a browser would
- `sanitize_html(html: str, allowed_tags: List[str])`: Safe HTML cleaning

### 2. Browser Automation

#### **playwright-mcp**

Full browser automation with JavaScript rendering.

**Tools:**

- `navigate(url: str, wait_for: str = "networkidle")`: Load page with JS
- `click(selector: str)`: Click element
- `fill_form(selector: str, value: str)`: Fill input
- `screenshot(selector: str = None)`: Capture screenshot
- `wait_for_selector(selector: str, timeout: int = 5000)`: Wait for element
- `execute_script(script: str)`: Run custom JavaScript

**Use Cases:**

- Pages with client-side rendering (React, Vue, Angular)
- Infinite scroll / lazy loading
- Forms and interactions
- Captcha handling

**Configuration:**

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp-server"],
      "enabled": false,
      "autoDownload": true,
      "config": {
        "browser": "chromium",
        "headless": true,
        "viewport": {"width": 1920, "height": 1080}
      }
    }
  }
}
```

This server is heavy, so `enabled` is `false` here; enable it only when needed.

#### **puppeteer-mcp**

Lightweight browser automation over the Chrome DevTools Protocol. Similar to Playwright, with a smaller footprint.

#### **selenium-mcp**

Legacy browser automation (more compatible, slower).

### 3. Database Access

#### **postgresql-mcp**

Access PostgreSQL databases.

**Tools:**

- `query(sql: str, params: List = [])`: Execute SELECT
- `execute(sql: str, params: List = [])`: Execute INSERT/UPDATE/DELETE
- `list_tables()`: Get schema

**Use Case:** Store scraped data directly to production database.

#### **mongodb-mcp**

Access MongoDB collections.

**Tools:**

- `find(collection: str, query: dict)`: Query documents
- `insert(collection: str, document: dict)`: Insert document
- `aggregate(collection: str, pipeline: List)`: Aggregation pipeline

#### **redis-mcp**

Fast cache and pub/sub.

**Tools:**

- `get(key: str)`: Retrieve cached value
- `set(key: str, value: str, ttl: int)`: Cache value
- `publish(channel: str, message: str)`: Pub/sub

**Use Case:** Cache parsed HTML, share state between agents.

### 4. File System

#### **filesystem-mcp**

Read/write local files.

**Tools:**

- `read_file(path: str)`: Read text/binary file
- `write_file(path: str, content: str)`: Write file
- `list_directory(path: str)`: List files
- `search_files(pattern: str)`: Glob search

**Use Case:** Save scraped data to CSV/JSON, read configuration files.

### 5. Search Engines

#### **google-search-mcp**

Google Search API integration.

**Tools:**

- `search(query: str, num: int = 10)`: Google Search results
- `search_images(query: str)`: Image search

**Configuration:**

```json
{
  "mcpServers": {
    "google-search": {
      "command": "python",
      "args": ["-m", "mcp_google_search"],
      "enabled": true,
      "autoDownload": true,
      "config": {
        "api_key": "YOUR_GOOGLE_API_KEY",
        "search_engine_id": "YOUR_SEARCH_ENGINE_ID"
      }
    }
  }
}
```

#### **bing-search-mcp**

Bing Search API.

#### **brave-search-mcp**

Privacy-focused search (Brave Search API).

#### **duckduckgo-mcp**

Free, no-API search.

**Tools:**

- `search(query: str, max_results: int = 10)`: DDG results

### 6. Data Extraction

#### **readability-mcp**

Extract main article content (removes ads, navigation, etc.).

**Tools:**

- `extract_article(html: str)`: Returns clean article text + metadata

**Use Case:** Extract blog posts, news articles, documentation.

#### **trafilatura-mcp**

Advanced web scraping and text extraction.

**Tools:**

- `extract(url: str)`: Extract main content
- `extract_metadata(html: str)`: Get title, author, date, etc.

#### **newspaper-mcp**

News article extraction and NLP.

**Tools:**

- `parse_article(url: str)`: Full article data
- `extract_keywords(text: str)`: Keyword extraction
- `summarize(text: str)`: Auto-summarization

### 7. Data Validation

#### **cerberus-mcp**

Schema validation for extracted data.

**Tools:**

- `validate(data: dict, schema: dict)`: Validate against schema

**Example:**

```python
# Define schema
schema = {
    "product_name": {"type": "string", "required": True, "minlength": 1},
    "price": {"type": "float", "required": True, "min": 0},
    "rating": {"type": "float", "min": 0, "max": 5}
}

# Validate extracted data
result = mcp.call("cerberus.validate", data=extracted_data, schema=schema)
if not result["valid"]:
    print("Validation errors:", result["errors"])
```

#### **pydantic-mcp**

Pydantic model validation.

### 8. Computer Vision

#### **ocr-mcp**

Extract text from images (Tesseract OCR).

**Tools:**

- `extract_text(image_path: str, lang: str = "eng")`: OCR text

**Use Case:** Extract prices from product images, read captchas (if legal).

#### **image-analysis-mcp**

Vision AI (GPT-4 Vision, Claude Vision).

**Tools:**

- `describe_image(image_path: str)`: Natural language description
- `extract_structured(image_path: str, schema: dict)`: Extract structured data from images

### 9. HTTP & Networking

#### **requests-mcp**

HTTP client with retry, session management.

**Tools:**

- `get(url: str, headers: dict = {})`: HTTP GET
- `post(url: str, data: dict = {})`: HTTP POST

#### **proxy-manager-mcp**

Manage proxy rotation, IP reputation.

**Tools:**

- `get_proxy()`: Get next proxy from pool
- `report_dead_proxy(proxy: str)`: Mark proxy as failed

### 10. Utility

#### **regex-mcp**

Advanced regex operations.

**Tools:**

- `find_all(pattern: str, text: str)`: Find all matches
- `replace(pattern: str, replacement: str, text: str)`: Regex replace
- `validate(pattern: str)`: Check if regex is valid

#### **datetime-mcp**

Parse and normalize dates.

**Tools:**

- `parse_date(text: str)`: Parse natural language dates
- `normalize_timezone(date: str, tz: str)`: Convert timezone

#### **currency-mcp**

Currency parsing and conversion.
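
As a rough illustration of what price parsing involves, the sketch below pulls an amount and a currency code out of free text. It is not the server's implementation: the helper name `parse_price`, the `SYMBOLS` table, and the regex are all assumptions made for this example, and only symbol-prefixed prices with `,` as the thousands separator are handled.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

# Hypothetical symbol table for illustration; real-world coverage would be far larger.
SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}

# A currency symbol, optional whitespace, then digits with optional
# ",ddd" thousands groups and an optional decimal part.
PRICE_RE = re.compile(r"([$€£])\s*(\d+(?:,\d{3})*(?:\.\d+)?)")


def parse_price(text: str) -> Optional[Tuple[Decimal, str]]:
    """Return (amount, currency code) for the first price found, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    symbol, digits = match.groups()
    # Decimal avoids the rounding surprises of binary floats for money.
    return Decimal(digits.replace(",", "")), SYMBOLS[symbol]
```

Locale-specific formats such as `1.299,50 €` or ISO codes written out (`USD 20`) would need additional patterns.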

**Tools:**

- `parse_price(text: str)`: Extract price and currency
- `convert(amount: float, from_currency: str, to_currency: str)`: Convert

---

## Tool Registry & Discovery

The **Tool Registry** automatically discovers all available tools from enabled MCP servers.

### Architecture

```python
# Deferred annotation evaluation lets the class reference MCPServer, MCPConfig,
# and Tool before they are defined.
from __future__ import annotations

from typing import Dict, List, Optional


class MCPToolRegistry:
    def __init__(self):
        self.servers: Dict[str, MCPServer] = {}
        self.tools: Dict[str, Tool] = {}  # tool_name → Tool

    def discover_servers(self, config: MCPConfig):
        """Load and connect to all enabled MCP servers."""
        for server_name, server_config in config.mcpServers.items():
            if not server_config.enabled:
                continue

            # Auto-download if needed
            if server_config.autoDownload and not self.is_installed(server_config):
                self.download_and_install(server_name, server_config)

            # Connect to server
            server = self.connect_server(server_name, server_config)
            self.servers[server_name] = server

            # Discover tools
            for tool in server.list_tools():
                full_name = f"{server_name}.{tool.name}"
                self.tools[full_name] = tool

    def get_tool(self, tool_name: str) -> Optional[Tool]:
        """Get tool by fully qualified name (server.tool); None if not registered."""
        return self.tools.get(tool_name)

    def search_tools(self, query: str, category: Optional[str] = None) -> List[Tool]:
        """Search tools by natural language query."""
        # Semantic search using tool descriptions
        candidates = list(self.tools.values())
        if category:
            candidates = [t for t in candidates if t.category == category]

        # Embed query and tools, rank by similarity
        scored = []
        for tool in candidates:
            score = self.semantic_similarity(query, tool.description)
            scored.append((tool, score))

        # Highest similarity first; return the top 10
        scored.sort(key=lambda x: x[1], reverse=True)
        return [tool for tool, score in scored[:10]]
```

### Tool Metadata

Each tool exposes rich metadata:

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel  # assumption: BaseModel is pydantic's


class Tool(BaseModel):
    name: str                       # e.g., "find_all"
    full_name: str                  # e.g., "beautifulsoup.find_all"
    server: str                     # Server name
    description: str                # Human-readable description
    category: str                   # "parsing" | "browser" | "database" | ...
    input_schema: Dict[str, Any]    # JSON Schema for parameters
    output_schema: Dict[str, Any]   # JSON Schema for return value
    examples: List[ToolExample]     # Usage examples
    cost: ToolCost                  # Time/resource cost estimate
    requires_auth: bool             # Needs API keys?
    rate_limit: Optional[RateLimit] # Rate limiting info
```

**Example:**

```python
Tool(
    name="find_all",
    full_name="beautifulsoup.find_all",
    server="beautifulsoup",
    description="Find all HTML elements matching a CSS selector",
    category="parsing",
    input_schema={
        "type": "object",
        "properties": {
            "html": {"type": "string", "description": "HTML content to search"},
            "selector": {"type": "string", "description": "CSS selector"}
        },
        "required": ["html", "selector"]
    },
    output_schema={
        "type": "array",
        "items": {"type": "object"}
    },
    examples=[
        ToolExample(
            input={"html": "