Space: benkassmi/craw4ai-mcp

benkassmi committed on Jan 13

Commit bd67676 (verified) · 1 parent: 707cca1

Upload 2 files

Files changed (2)

README.md +115 -15
app.py +303 -103

README.md CHANGED

@@ -11,22 +11,122 @@ app_port: 7860
 # Crawl4AI MCP Server
-This is a Crawl4AI MCP (Model Context Protocol) server deployed on Hugging Face Spaces.
-## Features
-- Web scraping with Playwright
-- Markdown extraction
-- JavaScript execution
-- Batch URL processing
-- Screenshot capture
-- PDF generation
-## API Endpoints
-- `GET /` - Health check
-- `GET /health` - Detailed health status
-- `POST /md` - Extract markdown from URL
-- `POST /crawl` - Batch process multiple URLs
-- `POST /execute_js` - Execute JavaScript on page
-- `GET /mcp/sse` - MCP Server-Sent Events endpoint

 # Crawl4AI MCP Server
+MCP (Model Context Protocol) server for web scraping with Crawl4AI, compatible with **Microsoft Copilot Studio**.
+## ⚠️ Important - Streamable HTTP Transport
+This server uses the **Streamable HTTP** transport rather than the deprecated SSE transport.
+- **MCP endpoint**: `POST /mcp`
+- **Protocol**: MCP Streamable HTTP 1.0
+## 🛠️ Available tools
+| Tool | Description |
+|------|-------------|
+| `md` | Extract the markdown content of a web page |
+| `html` | Extract the raw HTML of a web page (truncated to 10,000 characters) |
+| `crawl` | Crawl multiple URLs in a batch (max 10) |
+| `execute_js` | Execute JavaScript on a page |
+## 🔗 API endpoints
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/mcp` | POST | **Main MCP endpoint** (Streamable HTTP) |
+| `/mcp` | GET | Returns a 405 error (expected by Copilot Studio) |
+| `/` | GET | Health check and server info |
+| `/health` | GET | Detailed status |
+| `/debug/tools` | GET | List of tools (debug) |
+## 🚀 Setup with Microsoft Copilot Studio
+### Step 1: Check that the server is running
+Open `https://YOUR_SPACE.hf.space/mcp`; you should see:
+```json
+{"jsonrpc":"2.0","error":{"code":-32000,"message":"Method not allowed."},"id":null}
+```
+This is expected: it confirms that the server is configured correctly.
+### Step 2: Create a Custom Connector
+1. Go to [Power Apps Custom Connectors](https://make.preview.powerapps.com/customconnectors)
+2. Click **+ New custom connector** → **Import from GitHub**
+3. Select:
+   - **Connector Type**: `Custom`
+   - **Branch**: `dev`
+   - **Connector**: `MCP-Streamable-HTTP`
+4. Click **Continue**
+5. Edit:
+   - **Connector Name**: `Crawl4AI MCP`
+   - **Host**: `YOUR_SPACE.hf.space` (without `https://`)
+6. Click **Create connector**
+### Step 3: Add it to your agent
+1. Go to [Copilot Studio](https://copilotstudio.preview.microsoft.com/)
+2. Select your agent
+3. Enable **Generative Orchestration**
+4. Go to **Tools** → **Add a tool**
+5. Filter by **Model Context Protocol**
+6. Select **Crawl4AI MCP**
+7. Create a new connection
+8. Click **Add to agent**
+### Step 4: Test
+In the test panel, try:
+```
+Can you extract the content from https://example.com?
+```
+## 📋 Copilot Studio prerequisites
+- An environment with **"Get new features early"** enabled
+- **Generative Orchestration** enabled on the agent
+- A correctly configured Custom Connector
+## 🧪 Local test
+```bash
+# Test the MCP endpoint
+curl -X POST https://YOUR_SPACE.hf.space/mcp \
+  -H "Content-Type: application/json" \
+  -d '{"jsonrpc":"2.0","method":"tools/list","id":1}'
+```
+Expected response:
+```json
+{
+  "jsonrpc": "2.0",
+  "id": 1,
+  "result": {
+    "tools": [
+      {"name": "md", ...},
+      {"name": "html", ...},
+      {"name": "crawl", ...},
+      {"name": "execute_js", ...}
+    ]
+  }
+}
+```
+## 🔧 Local development
+```bash
+# Install the dependencies
+pip install -r requirements.txt
+playwright install chromium
+# Start the server
+python app.py
+```
+## 📝 Notes
+- The SSE endpoints (`/sse` and `/mcp/sse`) are deprecated and return a JSON message pointing to `POST /mcp`
+- The server returns `405 Method Not Allowed` for GET on `/mcp` (expected behavior)
+- Copilot Studio discovers the tools automatically via `tools/list`
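
The curl test in the README can also be scripted. The following is a minimal sketch in Python using only the standard library. The helper names (`build_request`, `tool_names`, `post_json`) are illustrative and not part of the committed code, and `YOUR_SPACE` remains a placeholder for the actual Space host.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional


def build_request(method: str, params: Optional[Dict[str, Any]] = None,
                  request_id: int = 1) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request body like the one sent by the curl test."""
    body: Dict[str, Any] = {"jsonrpc": "2.0", "method": method, "id": request_id}
    if params is not None:
        body["params"] = params
    return body


def tool_names(response: Dict[str, Any]) -> List[str]:
    """Return the tool names from a tools/list response, or [] if there are none."""
    return [tool["name"] for tool in response.get("result", {}).get("tools", [])]


def post_json(url: str, body: Dict[str, Any], timeout: float = 30.0) -> Dict[str, Any]:
    """POST a JSON body and decode the JSON reply (hypothetical helper)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

Against a running server, `tool_names(post_json("https://YOUR_SPACE.hf.space/mcp", build_request("tools/list")))` should list the four tools, assuming the server responds as the README describes. A `tools/call` request would pass `{"name": "md", "arguments": {"url": "https://example.com"}}` as `params`.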

app.py CHANGED

@@ -1,11 +1,16 @@
 #!/usr/bin/env python3
 import asyncio
 import json
 import logging
-from typing import Any
-from fastapi import FastAPI, Request
-from fastapi.responses import StreamingResponse, JSONResponse
-from sse_starlette.sse import EventSourceResponse
 from crawl4ai import AsyncWebCrawler
 import uvicorn
@@ -14,25 +19,127 @@ logger = logging.getLogger(__name__)
 app = FastAPI(title="Crawl4AI MCP Server")
 # === MCP PROTOCOL HANDLERS ===
-async def handle_mcp_request(request_data: dict) -> dict:
     """Handle MCP JSON-RPC 2.0 requests"""
     method = request_data.get("method")
     params = request_data.get("params", {})
     request_id = request_data.get("id")
-    logger.info(f"MCP Request: {method}")
     try:
         if method == "initialize":
             return {
                 "jsonrpc": "2.0",
                 "id": request_id,
                 "result": {
                     "protocolVersion": "2024-11-05",
                     "capabilities": {
-                        "tools": {}
                     },
                     "serverInfo": {
                         "name": "crawl4ai-mcp-server",
@@ -41,66 +148,25 @@ async def handle_mcp_request(request_data: dict) -> dict:
                 }
             }
         elif method == "tools/list":
-            tools = [
-                {
-                    "name": "md",
-                    "description": "Extract markdown content from a webpage",
-                    "inputSchema": {
-                        "type": "object",
-                        "properties": {
-                            "url": {"type": "string", "description": "URL to scrape"},
-                            "filter_mode": {"type": "string", "enum": ["raw", "fit"], "default": "fit"}
-                        },
-                        "required": ["url"]
-                    }
-                },
-                {
-                    "name": "html",
-                    "description": "Extract HTML from a webpage",
-                    "inputSchema": {
-                        "type": "object",
-                        "properties": {
-                            "url": {"type": "string", "description": "URL to scrape"}
-                        },
-                        "required": ["url"]
-                    }
-                },
-                {
-                    "name": "crawl",
-                    "description": "Batch crawl multiple URLs",
-                    "inputSchema": {
-                        "type": "object",
-                        "properties": {
-                            "urls": {"type": "array", "items": {"type": "string"}},
-                            "filter_mode": {"type": "string", "enum": ["raw", "fit"], "default": "fit"}
-                        },
-                        "required": ["urls"]
-                    }
-                },
-                {
-                    "name": "execute_js",
-                    "description": "Execute JavaScript on a page",
-                    "inputSchema": {
-                        "type": "object",
-                        "properties": {
-                            "url": {"type": "string"},
-                            "scripts": {"type": "array", "items": {"type": "string"}}
-                        },
-                        "required": ["url", "scripts"]
-                    }
-                }
-            ]
             return {
                 "jsonrpc": "2.0",
                 "id": request_id,
-                "result": {"tools": tools}
             }
         elif method == "tools/call":
             tool_name = params.get("name")
             tool_args = params.get("arguments", {})
             result = await execute_tool(tool_name, tool_args)
             return {
@@ -112,11 +178,40 @@ async def handle_mcp_request(request_data: dict) -> dict:
                             "type": "text",
                             "text": result
                         }
-                    ]
                 }
             }
         else:
             return {
                 "jsonrpc": "2.0",
                 "id": request_id,
@@ -127,7 +222,7 @@ async def handle_mcp_request(request_data: dict) -> dict:
             }
     except Exception as e:
-        logger.error(f"Error handling MCP request: {str(e)}")
         return {
             "jsonrpc": "2.0",
             "id": request_id,
@@ -145,6 +240,7 @@ async def execute_tool(name: str, args: dict) -> str:
             url = args.get("url")
             filter_mode = args.get("filter_mode", "fit")
             async with AsyncWebCrawler(headless=True, verbose=False) as crawler:
                 result = await crawler.arun(
                     url=url,
@@ -160,11 +256,12 @@ async def execute_tool(name: str, args: dict) -> str:
         elif name == "html":
             url = args.get("url")
             async with AsyncWebCrawler(headless=True, verbose=False) as crawler:
                 result = await crawler.arun(url=url, bypass_cache=True)
                 if result.success:
-                    return result.html[:10000]  # Limiter la taille
                 else:
                     return f"❌ Failed: {result.error_message}"
@@ -172,6 +269,7 @@ async def execute_tool(name: str, args: dict) -> str:
             urls = args.get("urls", [])[:10]
             results_text = f"# Batch Crawl Results ({len(urls)} URLs)\n\n"
             async with AsyncWebCrawler(headless=True, verbose=False) as crawler:
                 for idx, url in enumerate(urls, 1):
                     result = await crawler.arun(url=url, bypass_cache=True)
@@ -186,6 +284,7 @@ async def execute_tool(name: str, args: dict) -> str:
             url = args.get("url")
             scripts = args.get("scripts", [])
             async with AsyncWebCrawler(headless=True, verbose=False) as crawler:
                 result = await crawler.arun(url=url, js_code=scripts, bypass_cache=True)
@@ -198,70 +297,171 @@ async def execute_tool(name: str, args: dict) -> str:
             return f"❌ Unknown tool: {name}"
     except Exception as e:
-        logger.error(f"Error executing tool {name}: {str(e)}")
         return f"❌ Error: {str(e)}"
-# === ENDPOINTS ===
 @app.get("/")
 async def root():
     return {
         "status": "running",
         "server": "crawl4ai-mcp-server",
         "version": "1.0.0",
-        "protocol": "MCP over SSE",
-        "endpoint": "/mcp/sse"
     }
 @app.get("/mcp/sse")
-async def mcp_sse_get():
-    """SSE endpoint for MCP protocol"""
-    async def event_generator():
-        # Send initial connection message
-        yield {
-            "event": "message",
-            "data": json.dumps({
-                "jsonrpc": "2.0",
-                "method": "notifications/initialized",
-                "params": {}
-            })
-        }
-        # Keep connection alive
-        while True:
-            await asyncio.sleep(30)
-            yield {"event": "ping", "data": ""}
-    return EventSourceResponse(event_generator())
-@app.post("/mcp/sse")
-async def mcp_sse_post(request: Request):
-    """Handle MCP requests via POST"""
     try:
         body = await request.json()
-        logger.info(f"Received MCP request: {body}")
-        response = await handle_mcp_request(body)
-        return JSONResponse(content=response)
     except Exception as e:
-        logger.error(f"Error in MCP POST: {str(e)}")
-        return JSONResponse(
-            content={
-                "jsonrpc": "2.0",
-                "error": {
-                    "code": -32700,
-                    "message": f"Parse error: {str(e)}"
-                },
-                "id": None
-            },
-            status_code=500
-        )
 if __name__ == "__main__":
-    logger.info("🚀 Starting Crawl4AI MCP Server on port 7860")
     uvicorn.run(app, host="0.0.0.0", port=7860)

 #!/usr/bin/env python3
+"""
+Crawl4AI MCP Server - Compatible with Microsoft Copilot Studio
+Uses Streamable HTTP transport (not deprecated SSE)
+"""
 import asyncio
 import json
 import logging
+import uuid
+from typing import Any, Dict, Optional
+from fastapi import FastAPI, Request, Response, HTTPException
+from fastapi.responses import JSONResponse, StreamingResponse
+from fastapi.middleware.cors import CORSMiddleware
 from crawl4ai import AsyncWebCrawler
 import uvicorn
 app = FastAPI(title="Crawl4AI MCP Server")
+# Add CORS middleware for cross-origin requests
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+# Session storage for stateful connections
+sessions: Dict[str, Dict] = {}
+# === MCP TOOLS DEFINITION ===
+MCP_TOOLS = [
+    {
+        "name": "md",
+        "description": "Extract markdown content from a webpage. Use this to scrape and convert web pages to clean markdown format.",
+        "inputSchema": {
+            "type": "object",
+            "properties": {
+                "url": {
+                    "type": "string",
+                    "description": "The URL of the webpage to scrape"
+                },
+                "filter_mode": {
+                    "type": "string",
+                    "enum": ["raw", "fit"],
+                    "default": "fit",
+                    "description": "Filter mode: 'fit' for cleaned content, 'raw' for all content"
+                }
+            },
+            "required": ["url"]
+        }
+    },
+    {
+        "name": "html",
+        "description": "Extract raw HTML from a webpage",
+        "inputSchema": {
+            "type": "object",
+            "properties": {
+                "url": {
+                    "type": "string",
+                    "description": "The URL of the webpage to scrape"
+                }
+            },
+            "required": ["url"]
+        }
+    },
+    {
+        "name": "crawl",
+        "description": "Batch crawl multiple URLs and extract markdown content from each",
+        "inputSchema": {
+            "type": "object",
+            "properties": {
+                "urls": {
+                    "type": "array",
+                    "items": {"type": "string"},
+                    "description": "List of URLs to crawl (max 10)"
+                },
+                "filter_mode": {
+                    "type": "string",
+                    "enum": ["raw", "fit"],
+                    "default": "fit",
+                    "description": "Filter mode for content extraction"
+                }
+            },
+            "required": ["urls"]
+        }
+    },
+    {
+        "name": "execute_js",
+        "description": "Execute JavaScript code on a webpage and return the resulting content",
+        "inputSchema": {
+            "type": "object",
+            "properties": {
+                "url": {
+                    "type": "string",
+                    "description": "The URL of the webpage"
+                },
+                "scripts": {
+                    "type": "array",
+                    "items": {"type": "string"},
+                    "description": "List of JavaScript code snippets to execute"
+                }
+            },
+            "required": ["url", "scripts"]
+        }
+    }
+]
 # === MCP PROTOCOL HANDLERS ===
+async def handle_mcp_request(request_data: dict, session_id: Optional[str] = None) -> dict:
     """Handle MCP JSON-RPC 2.0 requests"""
     method = request_data.get("method")
     params = request_data.get("params", {})
     request_id = request_data.get("id")
+    logger.info(f"MCP Request: method={method}, id={request_id}, session={session_id}")
     try:
         if method == "initialize":
+            # Store session info
+            if session_id:
+                sessions[session_id] = {
+                    "initialized": True,
+                    "protocol_version": params.get("protocolVersion", "2024-11-05")
+                }
             return {
                 "jsonrpc": "2.0",
                 "id": request_id,
                 "result": {
                     "protocolVersion": "2024-11-05",
                     "capabilities": {
+                        "tools": {
+                            "listChanged": False
+                        }
                     },
                     "serverInfo": {
                         "name": "crawl4ai-mcp-server",
                 }
             }
+        elif method == "notifications/initialized":
+            # Client acknowledgment - no response needed for notifications
+            return None
         elif method == "tools/list":
+            logger.info(f"Returning {len(MCP_TOOLS)} tools")
             return {
                 "jsonrpc": "2.0",
                 "id": request_id,
+                "result": {
+                    "tools": MCP_TOOLS
+                }
             }
         elif method == "tools/call":
             tool_name = params.get("name")
             tool_args = params.get("arguments", {})
+            logger.info(f"Calling tool: {tool_name} with args: {tool_args}")
             result = await execute_tool(tool_name, tool_args)
             return {
                             "type": "text",
                             "text": result
                         }
+                    ],
+                    "isError": False
+                }
+            }
+        elif method == "ping":
+            return {
+                "jsonrpc": "2.0",
+                "id": request_id,
+                "result": {}
+            }
+        elif method == "resources/list":
+            # We don't have resources, return empty list
+            return {
+                "jsonrpc": "2.0",
+                "id": request_id,
+                "result": {
+                    "resources": []
+                }
+            }
+        elif method == "prompts/list":
+            # We don't have prompts, return empty list
+            return {
+                "jsonrpc": "2.0",
+                "id": request_id,
+                "result": {
+                    "prompts": []
                 }
             }
         else:
+            logger.warning(f"Unknown method: {method}")
             return {
                 "jsonrpc": "2.0",
                 "id": request_id,
             }
     except Exception as e:
+        logger.error(f"Error handling MCP request: {str(e)}", exc_info=True)
         return {
             "jsonrpc": "2.0",
             "id": request_id,
             url = args.get("url")
             filter_mode = args.get("filter_mode", "fit")
+            logger.info(f"Extracting markdown from: {url}")
             async with AsyncWebCrawler(headless=True, verbose=False) as crawler:
                 result = await crawler.arun(
                     url=url,
         elif name == "html":
             url = args.get("url")
+            logger.info(f"Extracting HTML from: {url}")
             async with AsyncWebCrawler(headless=True, verbose=False) as crawler:
                 result = await crawler.arun(url=url, bypass_cache=True)
                 if result.success:
+                    return result.html[:10000]  # Limit size
                 else:
                     return f"❌ Failed: {result.error_message}"
             urls = args.get("urls", [])[:10]
             results_text = f"# Batch Crawl Results ({len(urls)} URLs)\n\n"
+            logger.info(f"Batch crawling {len(urls)} URLs")
             async with AsyncWebCrawler(headless=True, verbose=False) as crawler:
                 for idx, url in enumerate(urls, 1):
                     result = await crawler.arun(url=url, bypass_cache=True)
             url = args.get("url")
             scripts = args.get("scripts", [])
+            logger.info(f"Executing JS on: {url}")
             async with AsyncWebCrawler(headless=True, verbose=False) as crawler:
                 result = await crawler.arun(url=url, js_code=scripts, bypass_cache=True)
             return f"❌ Unknown tool: {name}"
     except Exception as e:
+        logger.error(f"Error executing tool {name}: {str(e)}", exc_info=True)
         return f"❌ Error: {str(e)}"
+# === STREAMABLE HTTP ENDPOINT (for Copilot Studio) ===
+@app.post("/mcp")
+async def mcp_streamable_http(request: Request):
+    """
+    Main MCP endpoint using Streamable HTTP transport.
+    This is what Microsoft Copilot Studio expects.
+    """
+    try:
+        body = await request.json()
+        logger.info(f"MCP POST /mcp: {json.dumps(body)[:500]}")
+        # Get or create session from header
+        session_id = request.headers.get("mcp-session-id", str(uuid.uuid4()))
+        response = await handle_mcp_request(body, session_id)
+        if response is None:
+            # For notifications, return 202 Accepted
+            return Response(status_code=202)
+        # Return JSON response with session header
+        return JSONResponse(
+            content=response,
+            headers={
+                "mcp-session-id": session_id,
+                "Content-Type": "application/json"
+            }
+        )
+    except json.JSONDecodeError as e:
+        logger.error(f"JSON decode error: {str(e)}")
+        return JSONResponse(
+            content={
+                "jsonrpc": "2.0",
+                "error": {
+                    "code": -32700,
+                    "message": f"Parse error: {str(e)}"
+                },
+                "id": None
+            },
+            status_code=400
+        )
+    except Exception as e:
+        logger.error(f"Error in MCP endpoint: {str(e)}", exc_info=True)
+        return JSONResponse(
+            content={
+                "jsonrpc": "2.0",
+                "error": {
+                    "code": -32603,
+                    "message": f"Internal error: {str(e)}"
+                },
+                "id": None
+            },
+            status_code=500
+        )
+@app.get("/mcp")
+async def mcp_get_not_allowed():
+    """
+    GET requests to /mcp are not allowed in Streamable HTTP.
+    This error message is expected by Copilot Studio to validate the server.
+    """
+    return JSONResponse(
+        content={
+            "jsonrpc": "2.0",
+            "error": {
+                "code": -32000,
+                "message": "Method not allowed."
+            },
+            "id": None
+        },
+        status_code=405
+    )
+@app.delete("/mcp")
+async def mcp_delete_session(request: Request):
+    """Handle session termination"""
+    session_id = request.headers.get("mcp-session-id")
+    if session_id and session_id in sessions:
+        del sessions[session_id]
+        logger.info(f"Session deleted: {session_id}")
+    return Response(status_code=204)
+# === HEALTH & INFO ENDPOINTS ===
 @app.get("/")
 async def root():
+    """Root endpoint with server information"""
     return {
         "status": "running",
         "server": "crawl4ai-mcp-server",
         "version": "1.0.0",
+        "protocol": "MCP Streamable HTTP",
+        "mcp_endpoint": "/mcp",
+        "tools_count": len(MCP_TOOLS),
+        "tools": [t["name"] for t in MCP_TOOLS]
+    }
+@app.get("/health")
+async def health():
+    """Health check endpoint"""
+    return {
+        "status": "healthy",
+        "tools_count": len(MCP_TOOLS),
+        "tools": [t["name"] for t in MCP_TOOLS],
+        "active_sessions": len(sessions)
     }
+# === SSE ENDPOINTS (Legacy - for backward compatibility) ===
+@app.get("/sse")
 @app.get("/mcp/sse")
+async def sse_legacy_redirect():
+    """
+    Legacy SSE endpoint - redirects to info about the new endpoint.
+    SSE is deprecated, use Streamable HTTP at /mcp instead.
+    """
+    return JSONResponse(
+        content={
+            "message": "SSE transport is deprecated. Use Streamable HTTP instead.",
+            "mcp_endpoint": "/mcp",
+            "method": "POST",
+            "documentation": "https://modelcontextprotocol.io/specification/basic/transports#streamable-http"
+        },
+        status_code=200
+    )
+# === DEBUG ENDPOINTS ===
+@app.get("/debug/tools")
+async def debug_tools():
+    """Debug endpoint to verify tools configuration"""
+    return {
+        "tools_count": len(MCP_TOOLS),
+        "tools": MCP_TOOLS
+    }
+@app.post("/debug/test-tool")
+async def debug_test_tool(request: Request):
+    """Debug endpoint to test a tool directly"""
     try:
         body = await request.json()
+        tool_name = body.get("name")
+        tool_args = body.get("arguments", {})
+        result = await execute_tool(tool_name, tool_args)
+        return {"result": result}
     except Exception as e:
+        return {"error": str(e)}
 if __name__ == "__main__":
+    logger.info("🚀 Starting Crawl4AI MCP Server (Streamable HTTP)")
+    logger.info(f"📋 Available tools: {[t['name'] for t in MCP_TOOLS]}")
+    logger.info("🔗 MCP Endpoint: POST /mcp")
     uvicorn.run(app, host="0.0.0.0", port=7860)
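
The way the new `POST /mcp` handler maps a handler result to an HTTP status can be isolated for testing. The sketch below assumes JSON-RPC 2.0 semantics, where a notification is a request without an `id` member. The committed code special-cases only `notifications/initialized`, so `is_notification`, `resolve_session_id`, and `status_for` are hypothetical helpers, not functions in app.py.

```python
import uuid
from typing import Any, Dict, Mapping, Optional


def is_notification(message: Mapping[str, Any]) -> bool:
    """A JSON-RPC 2.0 notification is a request that carries no "id" member."""
    return "method" in message and "id" not in message


def resolve_session_id(headers: Mapping[str, str]) -> str:
    """Reuse the client's mcp-session-id header value, or mint a new UUID."""
    # Assumes header names are already lowercased, as in this plain mapping.
    return headers.get("mcp-session-id") or str(uuid.uuid4())


def status_for(response: Optional[Dict[str, Any]]) -> int:
    """Mirror the committed handler: no response body means 202, otherwise 200."""
    return 202 if response is None else 200
```

Checking for a missing `id` rather than one method name would let any notification receive `202 Accepted` instead of falling through to the unknown-method branch. Whether Copilot Studio sends other notifications is not established by this commit.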