# mcp-server-integration ## table-of-contents 1. [Overview](#overview) 2. [Available MCP Servers](#available-mcp-servers) 3. [Tool Registry & Discovery](#tool-registry--discovery) 4. [HTML Processing MCPs](#html-processing-mcps) 5. [Lazy Loading System](#lazy-loading-system) 6. [MCP Composition](#mcp-composition) 7. [Testing Panel](#testing-panel) 8. [Configuration](#configuration) --- ## overview The **Model Context Protocol (MCP)** enables the WebScraper agent to interact with external tools, databases, and services through a standardized interface. MCP servers expose **tools** that the agent can discover and use dynamically. ### why-mcp **Without MCP:** - Agent limited to built-in capabilities - Cannot access external databases, APIs, or specialized libraries - Difficult to extend without code changes **With MCP:** - Dynamically discover and use 100+ community tools - Access databases (PostgreSQL, MongoDB, etc.) - Use specialized libraries (BeautifulSoup, Selenium, Playwright) - Integrate with external APIs (Google, GitHub, etc.) - Extend agent capabilities without code changes ### architecture ``` ┌─────────────────────────────────────────────────────────────┐ │ WebScraper Agent │ ├─────────────────────────────────────────────────────────────┤ │ │ │ ┌────────────────────────────────────────────────────┐ │ │ │ MCP Tool Registry │ │ │ │ - Discovers available tools from all MCP servers │ │ │ │ - Provides tool metadata to agent │ │ │ │ - Routes tool calls to appropriate server │ │ │ └────────────────┬───────────────────────────────────┘ │ │ │ │ └───────────────────┼──────────────────────────────────────────┘ │ ┌───────────┼───────────┬──────────────┬─────────────┐ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ ┌──────────────┐ ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ HTML Parser │ │Browser │ │ Database │ │ File │ │ Custom │ │ MCP │ │ MCP │ │ MCP │ │ System │ │ MCP │ │ │ │ │ │ │ │ MCP │ │ │ │• BeautifulSoup││• Puppeteer││• Postgres││• Read ││• Your │ │• lxml ││• Playwright││• MongoDB │││• Write ││ tools │ │• html5lib ││• Selenium ││• Redis │││• Search ││ │ └──────────────┘ └─────────┘ └──────────┘ └──────────┘ └──────────┘ ``` --- ## available-mcp-servers ### 1-html-processing-and-parsing #### beautifulsoup-mcp Advanced HTML parsing and extraction. **Tools:** - `parse_html(html: str, parser: str = "html.parser")` → Parse HTML into DOM tree - `find_all(html: str, selector: str)` → CSS selector search - `extract_text(html: str, selector: str)` → Extract text content - `extract_attributes(html: str, selector: str, attrs: List[str])` → Get element attributes - `clean_html(html: str)` → Remove scripts, styles, comments - `extract_tables(html: str)` → Parse all tables into structured data **Configuration:** ```json { "mcpServers": { "beautifulsoup": { "command": "python", "args": ["-m", "mcp_beautifulsoup"], "enabled": true, "autoDownload": true, "config": { "default_parser": "lxml", "encodings": ["utf-8", "latin-1"] } } } } ``` **Example Usage:** ```python # Agent action action = Action( action_type="MCP_TOOL_CALL", tool_name="beautifulsoup.find_all", tool_params={ "html": observation.page_html, "selector": "div.product-card" } ) # Response { "products": [ {"name": "Widget", "price": "$49.99"}, {"name": "Gadget", "price": "$39.99"} ] } ``` #### lxml-mcp Fast XML/HTML parsing with XPath support. **Tools:** - `xpath_query(html: str, xpath: str)` → XPath extraction - `css_select(html: str, css: str)` → CSS selector (fast) - `validate_html(html: str)` → Check well-formedness #### html5lib-mcp Standards-compliant HTML5 parsing. **Tools:** - `parse_html5(html: str)` → Parse like a browser would - `sanitize_html(html: str, allowed_tags: List[str])` → Safe HTML cleaning ### 2-browser-automation #### playwright-mcp Full browser automation with JavaScript rendering. **Tools:** - `navigate(url: str, wait_for: str = "networkidle")` → Load page with JS - `click(selector: str)` → Click element - `fill_form(selector: str, value: str)` → Fill input - `screenshot(selector: str = None)` → Capture screenshot - `wait_for_selector(selector: str, timeout: int = 5000)` → Wait for element - `execute_script(script: str)` → Run custom JavaScript **Use Cases:** - Pages with client-side rendering (React, Vue, Angular) - Infinite scroll / lazy loading - Forms and interactions - Captcha handling **Configuration:** ```json { "mcpServers": { "playwright": { "command": "npx", "args": ["@playwright/mcp-server"], "enabled": false, // Only enable when needed (heavy) "autoDownload": true, "config": { "browser": "chromium", "headless": true, "viewport": {"width": 1920, "height": 1080} } } } } ``` #### puppeteer-mcp Lightweight browser automation (Chrome DevTools Protocol). Similar to Playwright but lighter weight. #### selenium-mcp Legacy browser automation (more compatible, slower). ### 3-database-access #### postgresql-mcp Access PostgreSQL databases. **Tools:** - `query(sql: str, params: List = [])` → Execute SELECT - `execute(sql: str, params: List = [])` → Execute INSERT/UPDATE/DELETE - `list_tables()` → Get schema **Use Case:** Store scraped data directly to production database. #### mongodb-mcp Access MongoDB collections. **Tools:** - `find(collection: str, query: dict)` → Query documents - `insert(collection: str, document: dict)` → Insert document - `aggregate(collection: str, pipeline: List)` → Aggregation pipeline #### redis-mcp Fast cache and pub/sub. **Tools:** - `get(key: str)` → Retrieve cached value - `set(key: str, value: str, ttl: int)` → Cache value - `publish(channel: str, message: str)` → Pub/sub **Use Case:** Cache parsed HTML, share state between agents. ### 4-file-system #### filesystem-mcp Read/write local files. **Tools:** - `read_file(path: str)` → Read text/binary file - `write_file(path: str, content: str)` → Write file - `list_directory(path: str)` → List files - `search_files(pattern: str)` → Glob search **Use Case:** Save scraped data to CSV/JSON, read configuration files. ### 5-search-engines #### google-search-mcp Google Search API integration. **Tools:** - `search(query: str, num: int = 10)` → Google Search results - `search_images(query: str)` → Image search **Configuration:** ```json { "mcpServers": { "google-search": { "command": "python", "args": ["-m", "mcp_google_search"], "enabled": true, "autoDownload": true, "config": { "api_key": "YOUR_GOOGLE_API_KEY", "search_engine_id": "YOUR_SEARCH_ENGINE_ID" } } } } ``` #### bing-search-mcp Bing Search API. #### brave-search-mcp Privacy-focused search (Brave Search API). #### duckduckgo-mcp Free, no-API search. **Tools:** - `search(query: str, max_results: int = 10)` → DDG results ### 6-data-extraction #### readability-mcp Extract main article content (removes ads, navigation, etc.). **Tools:** - `extract_article(html: str)` → Returns clean article text + metadata **Use Case:** Extract blog posts, news articles, documentation. #### trafilatura-mcp Advanced web scraping and text extraction. **Tools:** - `extract(url: str)` → Extract main content - `extract_metadata(html: str)` → Get title, author, date, etc. #### newspaper-mcp News article extraction and NLP. **Tools:** - `parse_article(url: str)` → Full article data - `extract_keywords(text: str)` → Keyword extraction - `summarize(text: str)` → Auto-summarization ### 7-data-validation #### cerberus-mcp Schema validation for extracted data. **Tools:** - `validate(data: dict, schema: dict)` → Validate against schema **Example:** ```python # Define schema schema = { "product_name": {"type": "string", "required": True, "minlength": 1}, "price": {"type": "float", "required": True, "min": 0}, "rating": {"type": "float", "min": 0, "max": 5} } # Validate extracted data result = mcp.call("cerberus.validate", data=extracted_data, schema=schema) if not result["valid"]: print("Validation errors:", result["errors"]) ``` #### pydantic-mcp Pydantic model validation. ### 8-computer-vision #### ocr-mcp Extract text from images (Tesseract OCR). **Tools:** - `extract_text(image_path: str, lang: str = "eng")` → OCR text **Use Case:** Extract prices from product images, read captchas (if legal). #### image-analysis-mcp Vision AI (GPT-4 Vision, Claude Vision). **Tools:** - `describe_image(image_path: str)` → Natural language description - `extract_structured(image_path: str, schema: dict)` → Extract structured data from images ### 9-http-and-networking #### requests-mcp HTTP client with retry, session management. **Tools:** - `get(url: str, headers: dict = {})` → HTTP GET - `post(url: str, data: dict = {})` → HTTP POST #### proxy-manager-mcp Manage proxy rotation, IP reputation. **Tools:** - `get_proxy()` → Get next proxy from pool - `report_dead_proxy(proxy: str)` → Mark proxy as failed ### 10-utility #### regex-mcp Advanced regex operations. **Tools:** - `find_all(pattern: str, text: str)` → Find all matches - `replace(pattern: str, replacement: str, text: str)` → Regex replace - `validate(pattern: str)` → Check if regex is valid #### datetime-mcp Parse and normalize dates. **Tools:** - `parse_date(text: str)` → Parse natural language dates - `normalize_timezone(date: str, tz: str)` → Convert timezone #### currency-mcp Currency parsing and conversion. **Tools:** - `parse_price(text: str)` → Extract price and currency - `convert(amount: float, from_currency: str, to_currency: str)` → Convert --- ## tool-registry-and-discovery The **Tool Registry** automatically discovers all available tools from enabled MCP servers. ### architecture ```python class MCPToolRegistry: def __init__(self): self.servers: Dict[str, MCPServer] = {} self.tools: Dict[str, Tool] = {} # tool_name → Tool def discover_servers(self, config: MCPConfig): """Load and connect to all enabled MCP servers.""" for server_name, server_config in config.mcpServers.items(): if not server_config.enabled: continue # Auto-download if needed if server_config.autoDownload and not self.is_installed(server_config): self.download_and_install(server_name, server_config) # Connect to server server = self.connect_server(server_name, server_config) self.servers[server_name] = server # Discover tools for tool in server.list_tools(): full_name = f"{server_name}.{tool.name}" self.tools[full_name] = tool def get_tool(self, tool_name: str) -> Tool: """Get tool by fully qualified name (server.tool).""" return self.tools.get(tool_name) def search_tools(self, query: str, category: str = None) -> List[Tool]: """Search tools by natural language query.""" # Semantic search using tool descriptions candidates = list(self.tools.values()) if category: candidates = [t for t in candidates if t.category == category] # Embed query and tools, rank by similarity scored = [] for tool in candidates: score = self.semantic_similarity(query, tool.description) scored.append((tool, score)) scored.sort(key=lambda x: x[1], reverse=True) return [tool for tool, score in scored[:10]] ``` ### tool-metadata Each tool exposes rich metadata: ```python class Tool(BaseModel): name: str # e.g., "find_all" full_name: str # e.g., "beautifulsoup.find_all" server: str # Server name description: str # Human-readable description category: str # "parsing" | "browser" | "database" | ... input_schema: Dict[str, Any] # JSON Schema for parameters output_schema: Dict[str, Any] # JSON Schema for return value examples: List[ToolExample] # Usage examples cost: ToolCost # Time/resource cost estimate requires_auth: bool # Needs API keys? rate_limit: Optional[RateLimit] # Rate limiting info ``` **Example:** ```python Tool( name="find_all", full_name="beautifulsoup.find_all", server="beautifulsoup", description="Find all HTML elements matching a CSS selector", category="parsing", input_schema={ "type": "object", "properties": { "html": {"type": "string", "description": "HTML content to search"}, "selector": {"type": "string", "description": "CSS selector"} }, "required": ["html", "selector"] }, output_schema={ "type": "array", "items": {"type": "object"} }, examples=[ ToolExample( input={"html": "
A
", "selector": ".item"}, output=[{"tag": "div", "text": "A", "class": "item"}] ) ], cost=ToolCost(time_ms=10, cpu_intensive=False), requires_auth=False ) ``` ### auto-tool-discovery-by-agent The agent can query the registry to find relevant tools: ```python # Agent needs to parse HTML available_tools = tool_registry.search_tools( query="parse HTML and extract elements by CSS selector", category="parsing" ) # Top result: beautifulsoup.find_all tool = available_tools[0] # Agent calls the tool action = Action( action_type="MCP_TOOL_CALL", tool_name=tool.full_name, tool_params={ "html": observation.page_html, "selector": "div.product-price" } ) ``` --- ## html-processing-mcps ### beautifulsoup-mcp-detailed **Installation:** ```bash pip install mcp-beautifulsoup ``` **Tools:** #### 1-find-all-html-selector-limit-none Find all elements matching CSS selector. ```python result = mcp.call("beautifulsoup.find_all", { "html": "
$10
$20
", "selector": "div.price" }) # Returns: [{"text": "$10"}, {"text": "$20"}] ``` #### 2-find-one-html-selector Find first matching element. ```python result = mcp.call("beautifulsoup.find_one", { "html": obs.page_html, "selector": "h1.product-title" }) # Returns: {"text": "Widget Pro", "tag": "h1"} ``` #### 3-extract-tables-html Parse all `` elements into structured data. ```python result = mcp.call("beautifulsoup.extract_tables", {"html": obs.page_html}) # Returns: [ { "headers": ["Product", "Price", "Stock"], "rows": [ ["Widget", "$49.99", "In Stock"], ["Gadget", "$39.99", "Out of Stock"] ] } ] ``` #### 4-extract-links-html-base-url-none Extract all links from page. ```python result = mcp.call("beautifulsoup.extract_links", { "html": obs.page_html, "base_url": "https://example.com" }) # Returns: [ {"url": "https://example.com/product/123", "text": "View Product"}, {"url": "https://example.com/category/widgets", "text": "Widgets"} ] ``` #### 5-clean-html-html-remove-script-style-noscript Remove unwanted elements. ```python result = mcp.call("beautifulsoup.clean_html", { "html": obs.page_html, "remove": ["script", "style", "footer", "nav"] }) # Returns: Clean HTML without ads, scripts, navigation ``` #### 6-smart-extract-html-field-name Intelligent extraction based on field name. ```python # Agent wants to extract "price" result = mcp.call("beautifulsoup.smart_extract", { "html": obs.page_html, "field_name": "price" }) # MCP searches for: # - Elements with class/id containing "price" # - Text matching price patterns ($X.XX, €X,XX) # - Schema.org markup (itemprop="price") # Returns: {"value": "$49.99", "confidence": 0.92, "selector": "span.product-price"} ``` ### batch-processing-for-long-content When HTML is too large (> 100KB), process in batches: ```python class HTMLBatchProcessor: def __init__(self, mcp_client, chunk_size: int = 50000): self.mcp = mcp_client self.chunk_size = chunk_size def process_large_html(self, html: str, selector: str) -> List[Dict]: """Process large HTML in chunks.""" # Split HTML into meaningful chunks (by sections, not mid-tag) chunks = self.split_html_intelligently(html) results = [] for i, chunk in enumerate(chunks): # Process each chunk chunk_results = self.mcp.call("beautifulsoup.find_all", { "html": chunk, "selector": selector }) # Deduplicate across chunk boundaries results.extend(self.deduplicate(chunk_results, results)) return results def split_html_intelligently(self, html: str) -> List[str]: """Split HTML at section boundaries, not mid-tag.""" soup = BeautifulSoup(html, 'lxml') # Split by major sections (article, section, div.container, etc.) sections = soup.find_all(['article', 'section', 'main']) chunks = [] current_chunk = "" for section in sections: section_html = str(section) if len(current_chunk) + len(section_html) > self.chunk_size: chunks.append(current_chunk) current_chunk = section_html else: current_chunk += section_html if current_chunk: chunks.append(current_chunk) return chunks ``` --- ## lazy-loading-system MCP servers are **NOT downloaded by default**. They are installed on-demand when first used. ### download-on-demand-flow ``` Agent wants to use a tool │ ▼ Is MCP server installed? │ ┌────┴────┐ No Yes │ │ ▼ ▼ Show dialog Execute tool "Download server X?" │ ┌───┴───┐ No Yes │ │ Skip Download & Install │ ▼ Cache for future use │ ▼ Execute tool ``` ### implementation ```python class LazyMCPLoader: def __init__(self): self.installed_servers: Set[str] = set() self.download_queue: Queue[str] = Queue() def ensure_server(self, server_name: str, config: MCPServerConfig) -> bool: """Ensure MCP server is installed, download if needed.""" if server_name in self.installed_servers: return True if not config.autoDownload: # Prompt user if not self.prompt_user_download(server_name): return False # Download and install return self.download_server(server_name, config) def download_server(self, server_name: str, config: MCPServerConfig) -> bool: """Download and install MCP server.""" try: logger.info(f"Downloading MCP server: {server_name}") if config.command == "npx": # NPM package subprocess.run([ "npm", "install", "-g", config.args[1] ], check=True) elif config.command == "python": # Python package package_name = config.args[1].replace("-m ", "") subprocess.run([ "pip", "install", package_name ], check=True) self.installed_servers.add(server_name) logger.info(f" Installed {server_name}") return True except Exception as e: logger.error(f"Failed to install {server_name}: {e}") return False def prompt_user_download(self, server_name: str) -> bool: """Ask user if they want to download the server.""" # In UI, show dialog: # "Tool X requires MCP server Y. Download and install? (50MB) [Yes] [No]" return self.show_download_dialog(server_name) ``` ### ui-dialog ``` ┌──────────────────────────────────────────────────────────┐ │ MCP Server Required │ ├──────────────────────────────────────────────────────────┤ │ │ │ The tool "beautifulsoup.find_all" requires the MCP │ │ server "beautifulsoup" which is not installed. │ │ │ │ Package: mcp-beautifulsoup │ │ Size: ~5 MB │ │ │ │ Would you like to download and install it now? │ │ │ │ [Download & Install] [Skip] │ │ │ │ Remember my choice for this server │ └──────────────────────────────────────────────────────────┘ ``` --- ## mcp-composition Combine multiple MCP tools to create powerful workflows. ### example-1-parse-html-extract-tables-save-to-database ```python # Step 1: Clean HTML cleaned = mcp.call("beautifulsoup.clean_html", { "html": observation.page_html }) # Step 2: Extract tables tables = mcp.call("beautifulsoup.extract_tables", { "html": cleaned["html"] }) # Step 3: Save to PostgreSQL for table in tables: mcp.call("postgresql.execute", { "sql": "INSERT INTO scraped_data (data) VALUES (%s)", "params": [json.dumps(table)] }) ``` ### example-2-search-google-navigate-parse-article-summarize ```python # Step 1: Search results = mcp.call("google-search.search", { "query": "best widgets 2026", "num": 5 }) # Step 2: Navigate to top result mcp.call("playwright.navigate", { "url": results[0]["url"] }) # Step 3: Extract article article = mcp.call("readability.extract_article", { "html": mcp.call("playwright.get_html", {}) }) # Step 4: Summarize summary = mcp.call("llm.summarize", { "text": article["text"], "max_length": 200 }) ``` ### composition-dsl Define reusable workflows: ```python class MCPWorkflow: def __init__(self, name: str, steps: List[WorkflowStep]): self.name = name self.steps = steps async def execute(self, initial_input: Dict) -> Dict: """Execute workflow steps sequentially.""" context = initial_input for step in self.steps: result = await mcp.call(step.tool, step.params(context)) context[step.output_var] = result return context # Define workflow extract_and_save = MCPWorkflow( name="extract_and_save", steps=[ WorkflowStep( tool="beautifulsoup.find_all", params=lambda ctx: {"html": ctx["html"], "selector": ctx["selector"]}, output_var="extracted" ), WorkflowStep( tool="cerberus.validate", params=lambda ctx: {"data": ctx["extracted"], "schema": ctx["schema"]}, output_var="validated" ), WorkflowStep( tool="postgresql.execute", params=lambda ctx: {"sql": "INSERT INTO items ...", "params": ctx["validated"]}, output_var="saved" ) ] ) # Execute result = await extract_and_save.execute({ "html": obs.page_html, "selector": "div.product", "schema": PRODUCT_SCHEMA }) ``` --- ## testing-panel Test MCP tools manually before using them in agent workflows. ### ui ``` ┌─────────────────────────────────────────────────────────────┐ │ MCP Testing Panel │ ├─────────────────────────────────────────────────────────────┤ │ │ │ Server: [beautifulsoup ▼] │ │ Tool: [find_all ▼] │ │ │ │ ┌──────────────────────────────────────────────────────┐ │ │ │ Input Parameters: │ │ │ │ │ │ │ │ html: │ │ │ │ ┌───────────────────────────────────────────────┐ │ │ │ │ │
Item 1
│ │ │ │ │ │
Item 2
│ │ │ │ │ └───────────────────────────────────────────────┘ │ │ │ │ │ │ │ │ selector: [div.item ] │ │ │ │ │ │ │ └──────────────────────────────────────────────────────┘ │ │ │ │ [Execute Tool] [Clear] │ │ │ │ ┌──────────────────────────────────────────────────────┐ │ │ │ Output: │ │ │ │ │ │ │ │ [ │ │ │ │ {"tag": "div", "class": "item", "text": "Item 1"}, │ │ │ │ {"tag": "div", "class": "item", "text": "Item 2"} │ │ │ │ ] │ │ │ │ │ │ │ │ Execution time: 12ms │ │ │ │ Status: Success │ │ │ └──────────────────────────────────────────────────────┘ │ │ │ │ [Save as Example] │ └─────────────────────────────────────────────────────────────┘ ``` --- ## configuration ### full-mcp-configuration-example ```json { "mcpServers": { "beautifulsoup": { "command": "python", "args": ["-m", "mcp_beautifulsoup"], "enabled": true, "autoDownload": true, "config": { "default_parser": "lxml" } }, "playwright": { "command": "npx", "args": ["@playwright/mcp-server"], "enabled": false, "autoDownload": false, "config": { "browser": "chromium", "headless": true } }, "postgresql": { "command": "python", "args": ["-m", "mcp_postgresql"], "enabled": false, "autoDownload": false, "config": { "host": "localhost", "port": 5432, "database": "scraper_db", "user": "postgres", "password": "${PG_PASSWORD}" } }, "google-search": { "command": "python", "args": ["-m", "mcp_google_search"], "enabled": true, "autoDownload": true, "config": { "api_key": "${GOOGLE_API_KEY}", "search_engine_id": "${GOOGLE_SE_ID}" } }, "filesystem": { "command": "npx", "args": ["-y", "@modelcontextprotocol/server-filesystem", "./scraped_data"], "enabled": true, "autoDownload": true } }, "mcpSettings": { "autoDiscoverTools": true, "toolTimeout": 30, "maxConcurrentCalls": 5, "retryFailedCalls": true, "cacheToolResults": true, "cacheTTL": 3600 } } ``` --- **Next:** See [settings.md](./settings.md) for complete dashboard settings. ## related-api-reference | item | value | | --- | --- | | api-reference | `api-reference.md` | ## document-metadata | key | value | | --- | --- | | document | `mcp.md` | | status | active | ## document-flow ```mermaid flowchart TD A[document] --> B[key-sections] B --> C[implementation] B --> D[operations] B --> E[validation] ```