victor HF Staff committed on
Commit 58d88bf · 1 Parent(s): d82748f

Refactor README and analytics for improved clarity and functionality; update app.py to enhance search and fetch tools with better error handling and analytics tracking.

Files changed (3)
  1. README.md +78 -150
  2. analytics.py +66 -65
  3. app.py +262 -248
README.md CHANGED
@@ -7,176 +7,104 @@ sdk: gradio
  sdk_version: 5.36.2
  app_file: app.py
  pinned: false
- short_description: Search and extract web content for LLM ingestion
  thumbnail: >-
    https://cdn-uploads.huggingface.co/production/uploads/5f17f0a0925b9863e28ad517/tfYtTMw9FgiWdyyIYz6A6.png
  ---
 
- # Web Search MCP Server
- 
- A Model Context Protocol (MCP) server that provides web search capabilities to LLMs, allowing them to fetch and extract content from web pages and news articles.
- 
- ## Features
- 
- - **Dual search modes**:
-   - **General Search**: Get diverse results from blogs, documentation, articles, and more
-   - **News Search**: Find fresh news articles and breaking stories from news sources
- - **Real-time web search**: Search for any topic with up-to-date results
- - **Content extraction**: Automatically extracts main article content, removing ads and boilerplate
- - **Rate limiting**: Built-in rate limiting (200 requests/hour) to prevent API abuse
- - **Structured output**: Returns formatted content with metadata (title, source, date, URL)
- - **Flexible results**: Control the number of results (1-20)
 
- ## Prerequisites
- 
- 1. **Serper API Key**: Sign up at [serper.dev](https://serper.dev) to get your API key
- 2. **Python 3.8+**: Ensure you have Python installed
- 3. **MCP-compatible LLM client**: Such as Claude Desktop, Cursor, or any MCP-enabled application
- 
- ## Installation
 
- 1. Clone or download this repository
- 2. Install dependencies:
- ```bash
- pip install -r requirements.txt
- ```
- Or install manually:
  ```bash
- pip install "gradio[mcp]" httpx trafilatura python-dateutil limits
  ```
- 
- 3. Set your Serper API key:
  ```bash
- export SERPER_API_KEY="your-api-key-here"
  ```
 
- ## Usage
- 
- ### Starting the MCP Server
 
  ```bash
- python app_mcp.py
- ```
- 
- The server will start on `http://localhost:7860` with the MCP endpoint at:
- ```
- http://localhost:7860/gradio_api/mcp/sse
  ```
- 
- ### Connecting to LLM Clients
- 
- #### Claude Desktop
- Add to your `claude_desktop_config.json`:
- ```json
- {
-   "mcpServers": {
-     "web-search": {
-       "command": "python",
-       "args": ["/path/to/app_mcp.py"],
-       "env": {
-         "SERPER_API_KEY": "your-api-key-here"
        }
      }
    }
- }
- ```
- 
- #### Direct URL Connection
- For clients that support URL-based MCP servers:
- 1. Start the server: `python app_mcp.py`
- 2. Connect to: `http://localhost:7860/gradio_api/mcp/sse`
- 
- ## Tool Documentation
- 
- ### `search_web` Function
- 
- **Purpose**: Search the web for information or fresh news and extract content.
- 
- **Parameters**:
- - `query` (str, **REQUIRED**): The search query
-   - Examples: "OpenAI news", "climate change 2024", "python tutorial"
- 
- - `num_results` (int, **OPTIONAL**): Number of results to fetch
-   - Default: 4
-   - Range: 1-20
-   - More results provide more context but take longer
- 
- - `search_type` (str, **OPTIONAL**): Type of search to perform
-   - Default: "search" (general web search)
-   - Options: "search" or "news"
-   - Use "news" for fresh, time-sensitive news articles
-   - Use "search" for general information, documentation, tutorials
- 
- **Returns**: Formatted text containing:
- - Summary of extraction results
- - For each article:
-   - Title
-   - Source and date
-   - URL
-   - Extracted main content
- 
- **When to use each search type**:
- - **Use "news" mode for**:
-   - Breaking news or very recent events
-   - Time-sensitive information ("today", "this week")
-   - Current affairs and latest developments
-   - Press releases and announcements
- 
- - **Use "search" mode for**:
-   - General information and research
-   - Technical documentation or tutorials
-   - Historical information
-   - Diverse perspectives from various sources
-   - How-to guides and explanations
- 
- **Example Usage in LLM**:
- ```
- # News mode examples
- "Search for breaking news about OpenAI" -> uses news mode
- "Find today's stock market updates" -> uses news mode
- "Get latest climate change developments" -> uses news mode
- 
- # Search mode examples (default)
- "Search for Python programming tutorials" -> uses search mode
- "Find information about machine learning algorithms" -> uses search mode
- "Research historical data about climate change" -> uses search mode
- ```
- 
- ## Error Handling
- 
- The tool handles various error scenarios:
- - Missing API key: Clear error message with setup instructions
- - Rate limiting: Informs when limit is exceeded
- - Failed extractions: Reports which articles couldn't be extracted
- - Network errors: Graceful error messages
- 
- ## Testing
- 
- You can test the server manually:
- 1. Open `http://localhost:7860` in your browser
- 2. Enter a search query
- 3. Adjust the number of results
- 4. Click "Search" to see the extracted content
- 
- ## Tips for LLM Usage
- 
- 1. **Choose the right search type**: Use "news" for fresh, breaking news; use "search" for general information
- 2. **Be specific with queries**: More specific queries yield better results
- 3. **Adjust result count**: Use fewer results for quick searches, more for comprehensive research
- 4. **Check dates**: The tool shows article dates for temporal context
- 5. **Follow up**: Use the extracted content to ask follow-up questions
- 
- ## Limitations
- 
- - Rate limited to 200 requests per hour
- - Extraction quality depends on website structure
- - Some websites may block automated access
- - News mode focuses on recent articles from news sources
- - Search mode provides diverse results but may include older content
 
  ## Troubleshooting
 
- 1. **"SERPER_API_KEY is not set"**: Ensure the environment variable is exported
- 2. **Rate limit errors**: Wait before making more requests
- 3. **No content extracted**: Some websites block scrapers; try different queries
- 4. **Connection errors**: Check your internet connection and firewall settings
  sdk_version: 5.36.2
  app_file: app.py
  pinned: false
+ short_description: Search & fetch the web with per-tool analytics
  thumbnail: >-
    https://cdn-uploads.huggingface.co/production/uploads/5f17f0a0925b9863e28ad517/tfYtTMw9FgiWdyyIYz6A6.png
  ---
 
+ # Web MCP Server
+ 
+ A Model Context Protocol (MCP) server that exposes two composable tools, `search` (Serper metadata) and `fetch` (single-page extraction), alongside a live analytics dashboard that tracks daily usage for each tool. The UI runs on Gradio and can be reached directly or via MCP-compatible clients like Claude Desktop and Cursor.
+ 
+ ## Highlights
+ - Dual MCP tools with shared rate limiting (`360 requests/hour`) and structured JSON responses.
+ - Daily analytics split by tool: the **Analytics** tab renders "Daily Search" (left) and "Daily Fetch" (right) bar charts covering the last 14 days.
+ - Persistent request counters keyed by UTC date and tool: `{"YYYY-MM-DD": {"search": n, "fetch": m}}`, with automatic migration from legacy totals.
+ - Pluggable storage: respects `ANALYTICS_DATA_DIR`, otherwise falls back to `/data` (if writable) or `./data` for local development.
+ - Ready-to-serve Gradio app with MCP endpoints exposed via `gr.api` for direct client consumption.
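The per-tool counter schema and the legacy migration mentioned above can be seen in miniature below. The function mirrors the `_normalize_counts_schema` helper added to `analytics.py` in this commit; the sample dates and counts are made up.

```python
def normalize_counts_schema(data: dict) -> dict:
    """Coerce legacy {date: int} totals into {date: {"search": n, "fetch": m}}."""
    normalized = {}
    for day, value in data.items():
        if isinstance(value, dict):
            normalized[day] = {
                "search": int(value.get("search", 0)),
                "fetch": int(value.get("fetch", 0)),
            }
        else:
            # Legacy schema stored one total per day; attribute it to "search".
            normalized[day] = {"search": int(value or 0), "fetch": 0}
    return normalized

mixed = {"2024-01-01": 7, "2024-01-02": {"search": 3, "fetch": 5}}
print(normalize_counts_schema(mixed))
# {'2024-01-01': {'search': 7, 'fetch': 0}, '2024-01-02': {'search': 3, 'fetch': 5}}
```

Because normalization runs before every write, a pre-existing `request_counts.json` in the old format keeps working without manual migration.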
 
+ ## Requirements
+ - Python 3.8 or newer.
+ - Serper API key (`SERPER_API_KEY`) with access to the Search and News endpoints.
+ - Dependencies listed in `requirements.txt`, including `filelock` and `pandas` for analytics storage.
+ 
+ Install everything with:
+ ```bash
+ pip install -r requirements.txt
+ ```
+ 
+ ## Configuration
+ 1. Export your Serper API key:
  ```bash
+ export SERPER_API_KEY="your-api-key"
  ```
+ 2. (Optional) Override the analytics storage path:
  ```bash
+ export ANALYTICS_DATA_DIR="/path/to/persistent/storage"
  ```
+ If unset, the app automatically prefers `/data` when available, otherwise `./data`.
+ 
+ The request counters live in `<DATA_DIR>/request_counts.json`, guarded by a file lock to support concurrent MCP calls.
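The directory fallback order described above can be sketched as follows. This is a simplified stand-in for the selection logic in `analytics.py` (the exact writability checks there may differ):

```python
import os

def resolve_data_dir() -> str:
    """Pick the analytics directory: env override, then /data if writable, else ./data."""
    override = os.getenv("ANALYTICS_DATA_DIR")
    if override:
        return override
    if os.path.isdir("/data") and os.access("/data", os.W_OK):
        return "/data"
    return "./data"

data_dir = resolve_data_dir()
counts_file = os.path.join(data_dir, "request_counts.json")
```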
 
 
 
+ ## Running Locally
+ Launch the Gradio server (with MCP support enabled) via:
  ```bash
+ python app.py
  ```
+ This starts a local UI at `http://localhost:7860` and exposes the MCP SSE endpoint at `http://localhost:7860/gradio_api/mcp/sse`.
+ 
+ ### Connecting From MCP Clients
+ - **Claude Desktop** – update `claude_desktop_config.json`:
+ ```json
+ {
+   "mcpServers": {
+     "web-search": {
+       "command": "python",
+       "args": ["/absolute/path/to/app.py"],
+       "env": {
+         "SERPER_API_KEY": "your-api-key"
+       }
      }
    }
  }
+ ```
+ - **URL-based MCP clients** – run `python app.py`, then point the client to `http://localhost:7860/gradio_api/mcp/sse`.
+ 
+ ## Tool Reference
+ ### `search`
+ - **Purpose**: Retrieve metadata-only results from Serper (general web or news).
+ - **Inputs**:
+   - `query` *(str, required)* – search terms.
+   - `search_type` *("search" | "news", default "search")* – switch to `news` for recency-focused results.
+   - `num_results` *(int, default 4, range 1–20)* – number of hits to return.
+ - **Output**: JSON containing the query echo, result count, timing, and an array of entries with `position`, `title`, `link`, `domain`, and optional `source`/`date` for news.
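To make the shape concrete, here is a hypothetical `search` response. The per-entry fields follow the list above; the top-level key names (`query`, `result_count`, `duration_s`) and every value are illustrative only, since the exact keys are defined in `app.py`:

```python
# Hypothetical `search` response; values are invented for illustration.
search_response = {
    "query": "python packaging",      # query echo
    "result_count": 2,
    "duration_s": 0.42,               # timing
    "results": [
        {
            "position": 1,
            "title": "An example general result",
            "link": "https://example.com/guide",
            "domain": "example.com",
        },
        {
            "position": 2,
            "title": "An example news result",
            "link": "https://news.example.com/story",
            "domain": "news.example.com",
            "source": "Example News",  # only present for news results
            "date": "2024-05-01",      # only present for news results
        },
    ],
}
```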
+ 
+ ### `fetch`
+ - **Purpose**: Download a single URL and extract the readable article text via Trafilatura.
+ - **Inputs**:
+   - `url` *(str, required)* – must start with `http://` or `https://`.
+   - `timeout` *(int, default 20 seconds)* – client timeout for the HTTP request.
+ - **Output**: JSON with the original and final URL, domain, HTTP status, title, ISO timestamp of the fetch, word count, cleaned `content`, and duration.
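The `domain` field comes from a small helper along these lines, matching the `_domain_from_url` helper this commit adds to `app.py`:

```python
from urllib.parse import urlsplit

def domain_from_url(url: str) -> str:
    """Extract the hostname from a URL and drop a leading "www." prefix."""
    try:
        netloc = urlsplit(url).netloc
        return netloc.replace("www.", "")
    except Exception:
        return ""

print(domain_from_url("https://www.example.com/article"))  # example.com
```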
+ 
+ Both tools increment their respective analytics buckets on every invocation, including validation failures and rate-limit denials, ensuring the dashboard mirrors real traffic.
+ 
+ ## Analytics Dashboard
+ Open the **Analytics** tab in the Gradio UI to inspect daily activity:
+ - **Daily Search Count** (left column) – bar chart for the past 14 days of `search` tool requests.
+ - **Daily Fetch Count** (right column) – bar chart for the past 14 days of `fetch` tool requests.
+ - Tooltips reveal the display label (e.g., `Sep 17`), raw count, and ISO date key.
+ 
+ Data is stored in JSON and can be safely externalized for long-term tracking. Existing totals in the legacy integer-only format are automatically migrated during the first write.
+ 
+ ## Rate Limiting & Error Handling
+ - Global moving-window limit of `360` requests per hour shared across both tools (powered by `limits`).
+ - Standardized error payloads for missing parameters, invalid URLs, Serper issues, HTTP failures, and rate-limit hits, each preserving analytics increments.
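The app enforces this limit with the `limits` package's async `MovingWindowRateLimiter`. As a rough illustration of the moving-window idea only (a toy stand-in, not the actual implementation), a window keeps timestamps of recent hits and rejects a new hit once the window is full:

```python
import time
from collections import deque
from typing import Optional

class MovingWindowLimiter:
    """Toy moving-window limiter: allow at most `limit` hits per `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.hits: deque = deque()

    def hit(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop hits that have aged out of the window.
        while self.hits and now - self.hits[0] >= self.window:
            self.hits.popleft()
        if len(self.hits) >= self.limit:
            return False
        self.hits.append(now)
        return True

limiter = MovingWindowLimiter(limit=3, window=3600.0)
print([limiter.hit(now=t) for t in (0, 1, 2, 3)])  # [True, True, True, False]
```

Unlike a fixed-window counter, the moving window never admits more than `limit` requests in any sliding interval of `window` seconds.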
 
  ## Troubleshooting
+ - **`SERPER_API_KEY is not set`** – export the key in the environment where the server runs.
+ - **`Rate limit exceeded`** – pause requests or reduce client concurrency.
+ - **Empty extraction** – some sites block bots; try another URL.
+ - **Storage permissions** – ensure the chosen data directory is writable; adjust `ANALYTICS_DATA_DIR` if necessary.
+ 
+ ## Licensing & Contributions
+ Feel free to fork and adapt for your MCP workflows. Contributions are welcome: open a PR or issue with proposed analytics enhancements, additional tooling, or documentation tweaks.
 
analytics.py CHANGED
@@ -2,8 +2,8 @@
  import os
  import json
  from datetime import datetime, timedelta, timezone
- from filelock import FileLock  # pip install filelock
- import pandas as pd  # already available in HF images
 
  # Determine data directory based on environment
  # 1. Check for environment variable override
@@ -21,84 +21,85 @@ if not DATA_DIR:
  os.makedirs(DATA_DIR, exist_ok=True)
 
  COUNTS_FILE = os.path.join(DATA_DIR, "request_counts.json")
- TIMES_FILE = os.path.join(DATA_DIR, "request_times.json")
- LOCK_FILE = os.path.join(DATA_DIR, "analytics.lock")
 
- def _load() -> dict:
      if not os.path.exists(COUNTS_FILE):
          return {}
      with open(COUNTS_FILE) as f:
-         return json.load(f)
 
- def _save(data: dict):
      with open(COUNTS_FILE, "w") as f:
          json.dump(data, f)
 
- def _load_times() -> dict:
-     if not os.path.exists(TIMES_FILE):
-         return {}
-     with open(TIMES_FILE) as f:
-         return json.load(f)
 
- def _save_times(data: dict):
-     with open(TIMES_FILE, "w") as f:
-         json.dump(data, f)
 
- async def record_request(duration: float = None, num_results: int = None) -> None:
-     """Increment today's counter (UTC) atomically and optionally record request duration."""
      today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
      with FileLock(LOCK_FILE):
-         # Update counts
-         data = _load()
-         data[today] = data.get(today, 0) + 1
-         _save(data)
- 
-         # Only record times for default requests (num_results=4)
-         if duration is not None and (num_results is None or num_results == 4):
-             times = _load_times()
-             if today not in times:
-                 times[today] = []
-             times[today].append(round(duration, 2))
-             _save_times(times)
- 
- def last_n_days_df(n: int = 30) -> pd.DataFrame:
-     """Return a DataFrame with a row for each of the past *n* days."""
-     now = datetime.now(timezone.utc)
-     with FileLock(LOCK_FILE):
-         data = _load()
-     records = []
-     for i in range(n):
-         day = (now - timedelta(days=n - 1 - i))
-         day_str = day.strftime("%Y-%m-%d")
-         # Format date for display (MMM DD)
-         display_date = day.strftime("%b %d")
-         records.append({
-             "date": display_date,
-             "count": data.get(day_str, 0),
-             "full_date": day_str  # Keep full date for tooltip
-         })
-     return pd.DataFrame(records)
 
- def last_n_days_avg_time_df(n: int = 30) -> pd.DataFrame:
-     """Return a DataFrame with average request time for each of the past *n* days."""
      now = datetime.now(timezone.utc)
      with FileLock(LOCK_FILE):
-         times = _load_times()
      records = []
      for i in range(n):
-         day = (now - timedelta(days=n - 1 - i))
-         day_str = day.strftime("%Y-%m-%d")
-         # Format date for display (MMM DD)
          display_date = day.strftime("%b %d")
- 
-         # Calculate average time for the day
-         day_times = times.get(day_str, [])
-         avg_time = round(sum(day_times) / len(day_times), 2) if day_times else 0
- 
-         records.append({
-             "date": display_date,
-             "avg_time": avg_time,
-             "request_count": len(day_times),
-             "full_date": day_str  # Keep full date for tooltip
-         })
-     return pd.DataFrame(records)
  import os
  import json
  from datetime import datetime, timedelta, timezone
+ from filelock import FileLock  # pip install filelock
+ import pandas as pd  # already available in HF images
 
  # Determine data directory based on environment
  # 1. Check for environment variable override
 
  os.makedirs(DATA_DIR, exist_ok=True)
 
  COUNTS_FILE = os.path.join(DATA_DIR, "request_counts.json")
+ LOCK_FILE = os.path.join(DATA_DIR, "analytics.lock")
 
+ # ──────────────────────────────────────────────────────────────────────────────
+ # Storage helpers
+ # ──────────────────────────────────────────────────────────────────────────────
+ def _load_counts() -> dict:
      if not os.path.exists(COUNTS_FILE):
          return {}
      with open(COUNTS_FILE) as f:
+         try:
+             return json.load(f)
+         except json.JSONDecodeError:
+             return {}
+ 
 
+ def _save_counts(data: dict):
      with open(COUNTS_FILE, "w") as f:
          json.dump(data, f)
 
+ def _normalize_counts_schema(data: dict) -> dict:
+     """
+     Ensure data is {date: {"search": int, "fetch": int}}.
+     Backward compatible with old schema {date: int}.
+     """
+     normalized = {}
+     for day, value in data.items():
+         if isinstance(value, dict):
+             normalized[day] = {
+                 "search": int(value.get("search", 0)),
+                 "fetch": int(value.get("fetch", 0)),
+             }
+         else:
+             # Old schema: total count as int → attribute to "search", keep fetch=0
+             normalized[day] = {"search": int(value or 0), "fetch": 0}
+     return normalized
+ 
+ 
+ # ──────────────────────────────────────────────────────────────────────────────
+ # Public API
+ # ──────────────────────────────────────────────────────────────────────────────
+ async def record_request(tool: str) -> None:
+     """Increment today's counter (UTC) for the given tool: 'search' or 'fetch'."""
+     tool = (tool or "").strip().lower()
+     if tool not in {"search", "fetch"}:
+         # Ignore unknown tool buckets to keep charts clean
+         tool = "search"
      today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
      with FileLock(LOCK_FILE):
+         data = _normalize_counts_schema(_load_counts())
+         if today not in data:
+             data[today] = {"search": 0, "fetch": 0}
+         data[today][tool] = int(data[today].get(tool, 0)) + 1
+         _save_counts(data)
+ 
+ 
+ def last_n_days_count_df(tool: str, n: int = 30) -> pd.DataFrame:
+     """Return DataFrame with a row for each of the past n days for the given tool."""
+     tool = (tool or "").strip().lower()
+     if tool not in {"search", "fetch"}:
+         tool = "search"
      now = datetime.now(timezone.utc)
      with FileLock(LOCK_FILE):
+         data = _normalize_counts_schema(_load_counts())
+ 
      records = []
      for i in range(n):
+         day = now - timedelta(days=n - 1 - i)
+         day_key = day.strftime("%Y-%m-%d")
          display_date = day.strftime("%b %d")
+         counts = data.get(day_key, {"search": 0, "fetch": 0})
+         records.append(
+             {
+                 "date": display_date,
+                 "count": int(counts.get(tool, 0)),
+                 "full_date": day_key,
+             }
+         )
+     return pd.DataFrame(records)
app.py CHANGED
@@ -1,8 +1,11 @@
  import os
- import asyncio
  import time
- from typing import Optional
- from datetime import datetime
  import httpx
  import trafilatura
  import gradio as gr
@@ -10,92 +13,91 @@ from dateutil import parser as dateparser
  from limits import parse
  from limits.aio.storage import MemoryStorage
  from limits.aio.strategies import MovingWindowRateLimiter
- from analytics import record_request, last_n_days_df, last_n_days_avg_time_df
 
  # Configuration
  SERPER_API_KEY = os.getenv("SERPER_API_KEY")
  SERPER_SEARCH_ENDPOINT = "https://google.serper.dev/search"
  SERPER_NEWS_ENDPOINT = "https://google.serper.dev/news"
- HEADERS = {"X-API-KEY": SERPER_API_KEY, "Content-Type": "application/json"}
 
- # Rate limiting
  storage = MemoryStorage()
  limiter = MovingWindowRateLimiter(storage)
- rate_limit = parse("360/hour")
 
 
- async def search_web(
      query: str, search_type: str = "search", num_results: Optional[int] = 4
- ) -> str:
      """
-     Search the web for information or fresh news, returning extracted content.
- 
-     This tool can perform two types of searches:
-     - "search" (default): General web search for diverse, relevant content from various sources
-     - "news": Specifically searches for fresh news articles and breaking stories
- 
-     Use "news" mode when looking for:
-     - Breaking news or very recent events
-     - Time-sensitive information
-     - Current affairs and latest developments
-     - Today's/this week's happenings
- 
-     Use "search" mode (default) for:
-     - General information and research
-     - Technical documentation or guides
-     - Historical information
-     - Diverse perspectives from various sources
- 
-     Args:
-         query (str): The search query. This is REQUIRED. Examples: "apple inc earnings",
-             "climate change 2024", "AI developments"
-         search_type (str): Type of search. This is OPTIONAL. Default is "search".
-             Options: "search" (general web search) or "news" (fresh news articles).
-             Use "news" for time-sensitive, breaking news content.
-         num_results (int): Number of results to fetch. This is OPTIONAL. Default is 4.
-             Range: 1-20. More results = more context but longer response time.
- 
-     Returns:
-         str: Formatted text containing extracted content with metadata (title,
-             source, date, URL, and main text) for each result, separated by dividers.
-             Returns error message if API key is missing or search fails.
- 
-     Examples:
-     - search_web("OpenAI GPT-5", "news") - Get 5 fresh news articles about OpenAI
-     - search_web("python tutorial", "search") - Get 4 general results about Python (default count)
-     - search_web("stock market today", "news", 10) - Get 10 news articles about today's market
-     - search_web("machine learning basics") - Get 4 general search results (all defaults)
      """
      start_time = time.time()
 
-     if not SERPER_API_KEY:
-         await record_request(None, num_results)  # Record even failed requests
-         return "Error: SERPER_API_KEY environment variable is not set. Please set it to use this tool."
 
-     # Validate and constrain num_results
      if num_results is None:
          num_results = 4
-     num_results = max(1, min(30, num_results))
- 
-     # Validate search_type
      if search_type not in ["search", "news"]:
          search_type = "search"
 
      try:
-         # Check rate limit
          if not await limiter.hit(rate_limit, "global"):
-             print(f"[{datetime.now().isoformat()}] Rate limit exceeded")
-             duration = time.time() - start_time
-             await record_request(duration, num_results)
-             return "Error: Rate limit exceeded. Please try again later (limit: 500 requests per hour)."
 
-         # Select endpoint based on search type
         endpoint = (
             SERPER_NEWS_ENDPOINT if search_type == "news" else SERPER_SEARCH_ENDPOINT
         )
- 
-         # Prepare payload
-         payload = {"q": query, "num": num_results}
         if search_type == "news":
             payload["type"] = "news"
             payload["page"] = 1
@@ -104,107 +106,119 @@ async def search_web(
          resp = await client.post(endpoint, headers=HEADERS, json=payload)
 
          if resp.status_code != 200:
-             duration = time.time() - start_time
-             await record_request(duration, num_results)
-             return f"Error: Search API returned status {resp.status_code}. Please check your API key and try again."
- 
-         # Extract results based on search type
-         if search_type == "news":
-             results = resp.json().get("news", [])
-         else:
-             results = resp.json().get("organic", [])
- 
-         if not results:
-             duration = time.time() - start_time
-             await record_request(duration, num_results)
-             return f"No {search_type} results found for query: '{query}'. Try a different search term or search type."
- 
-         # Fetch HTML content concurrently
-         urls = [r["link"] for r in results]
-         async with httpx.AsyncClient(timeout=20, follow_redirects=True) as client:
-             tasks = [client.get(u) for u in urls]
-             responses = await asyncio.gather(*tasks, return_exceptions=True)
- 
-         # Extract and format content
-         chunks = []
-         successful_extractions = 0
- 
-         for meta, response in zip(results, responses):
-             if isinstance(response, Exception):
-                 continue
- 
-             # Extract main text content
-             body = trafilatura.extract(
-                 response.text, include_formatting=False, include_comments=False
-             )
 
-             if not body:
-                 continue
 
-             successful_extractions += 1
-             print(
-                 f"[{datetime.now().isoformat()}] Successfully extracted content from {meta['link']}"
-             )
 
-             # Format the chunk based on search type
-             if search_type == "news":
-                 # News results have date and source
-                 try:
-                     date_str = meta.get("date", "")
-                     if date_str:
-                         date_iso = dateparser.parse(date_str, fuzzy=True).strftime(
-                             "%Y-%m-%d"
-                         )
-                     else:
-                         date_iso = "Unknown"
-                 except Exception:
-                     date_iso = "Unknown"
- 
-                 chunk = (
-                     f"## {meta['title']}\n"
-                     f"**Source:** {meta.get('source', 'Unknown')} "
-                     f"**Date:** {date_iso}\n"
-                     f"**URL:** {meta['link']}\n\n"
-                     f"{body.strip()}\n"
-                 )
-             else:
-                 # Search results don't have date/source but have domain
-                 domain = meta["link"].split("/")[2].replace("www.", "")
- 
-                 chunk = (
-                     f"## {meta['title']}\n"
-                     f"**Domain:** {domain}\n"
-                     f"**URL:** {meta['link']}\n\n"
-                     f"{body.strip()}\n"
-                 )
- 
-             chunks.append(chunk)
- 
-         if not chunks:
-             duration = time.time() - start_time
-             await record_request(duration, num_results)
-             return f"Found {len(results)} {search_type} results for '{query}', but couldn't extract readable content from any of them. The websites might be blocking automated access."
- 
-         result = "\n---\n".join(chunks)
-         summary = f"Successfully extracted content from {successful_extractions} out of {len(results)} {search_type} results for query: '{query}'\n\n---\n\n"
- 
-         print(
-             f"[{datetime.now().isoformat()}] Extraction complete: {successful_extractions}/{len(results)} successful for query '{query}'"
-         )
 
-         # Record successful request with duration
-         duration = time.time() - start_time
-         await record_request(duration, num_results)
 
-         return summary + result
 
      except Exception as e:
-         # Record failed request with duration
-         duration = time.time() - start_time
-         return f"Error occurred while searching: {str(e)}. Please try again or check your query."
 
 
- # Create Gradio interface
  with gr.Blocks(title="Web Search MCP Server") as demo:
      gr.HTML(
          """
@@ -217,141 +231,141 @@ with gr.Blocks(title="Web Search MCP Server") as demo:
      )
 
      gr.Markdown("# 🔍 Web Search MCP Server")
 
      with gr.Tabs():
          with gr.Tab("App"):
-             gr.Markdown(
-                 """
-                 This MCP server provides web search capabilities to LLMs. It can perform general web searches
-                 or specifically search for fresh news articles, extracting the main content from results.
- 
-                 **⚡ Speed-Focused:** Optimized to complete the entire search process - from query to
-                 fully extracted web content - in under 2 seconds. Check out the Analytics tab
-                 to see real-time performance metrics.
- 
-                 **Search Types:**
-                 - **General Search**: Diverse results from various sources (blogs, docs, articles, etc.)
-                 - **News Search**: Fresh news articles and breaking stories from news sources
- 
-                 **Note:** This interface is primarily designed for MCP tool usage by LLMs, but you can
-                 also test it manually below.
-                 """
-             )
- 
-             gr.HTML(
-                 """
-                 <div style="margin-bottom: 24px;">
-                     <a href="https://huggingface.co/spaces/victor/websearch?view=api">
-                         <img src="https://huggingface.co/datasets/huggingface/badges/resolve/main/use-with-mcp-lg-dark.svg"
-                              alt="Use with MCP"
-                              style="height: 36px;">
-                     </a>
-                 </div>
-                 """,
-                 padding=0,
-             )
- 
              with gr.Row():
                  with gr.Column(scale=3):
                      query_input = gr.Textbox(
                          label="Search Query",
-                         placeholder='e.g. "OpenAI news", "climate change 2024", "AI developments"',
-                         info="Required: Enter your search query",
                      )
-                 with gr.Column(scale=1):
                      search_type_input = gr.Radio(
                          choices=["search", "news"],
                          value="search",
                          label="Search Type",
-                         info="Choose search type",
                      )
 
-             with gr.Row():
-                 num_results_input = gr.Slider(
-                     minimum=1,
-                     maximum=20,
-                     value=4,
-                     step=1,
-                     label="Number of Results",
-                     info="Optional: How many results to fetch (default: 4)",
-                 )
- 
-             search_button = gr.Button("Search", variant="primary")
- 
-             output = gr.Textbox(
-                 label="Extracted Content",
-                 lines=25,
-                 max_lines=50,
-                 info="The extracted article content will appear here",
-             )
 
-             # Add examples
-             gr.Examples(
-                 examples=[
-                     ["OpenAI GPT-5 latest developments", "news", 5],
-                     ["React hooks useState", "search", 4],
-                     ["Tesla stock price today", "news", 6],
-                     ["Apple Vision Pro reviews", "search", 4],
-                     ["best Italian restaurants NYC", "search", 4],
-                 ],
                  inputs=[query_input, search_type_input, num_results_input],
-                 outputs=output,
-                 fn=search_web,
-                 cache_examples=False,
              )
 
      with gr.Tab("Analytics"):
          gr.Markdown("## Community Usage Analytics")
-         gr.Markdown(
-             "Track daily request counts and average response times from all community users."
-         )
 
          with gr.Row():
              with gr.Column():
-                 requests_plot = gr.BarPlot(
-                     value=last_n_days_df(
-                         14
-                     ),  # Show only last 14 days for better visibility
                      x="date",
                      y="count",
-                     title="Daily Request Count",
-                     tooltip=["date", "count"],
                      height=350,
-                     x_label_angle=-45,  # Rotate labels to prevent overlap
                      container=False,
                  )
- 
              with gr.Column():
-                 avg_time_plot = gr.BarPlot(
-                     value=last_n_days_avg_time_df(14),  # Show only last 14 days
                      x="date",
-                     y="avg_time",
-                     title="Average Request Time (seconds)",
-                     tooltip=["date", "avg_time", "request_count"],
                      height=350,
                      x_label_angle=-45,
                      container=False,
                  )
 
-     search_button.click(
-         fn=search_web,  # Use search_web directly instead of search_and_log
-         inputs=[query_input, search_type_input, num_results_input],
-         outputs=output,
-         api_name=False,  # Hide this endpoint from API & MCP
-     )
- 
-     # Load fresh analytics data when the page loads or Analytics tab is clicked
      demo.load(
-         fn=lambda: (last_n_days_df(14), last_n_days_avg_time_df(14)),
-         outputs=[requests_plot, avg_time_plot],
          api_name=False,
      )
 
-     # Expose search_web as the only MCP tool
-     gr.api(search_web, api_name="search_web")
 
 
  if __name__ == "__main__":
      # Launch with MCP server enabled
-     # The MCP endpoint will be available at: http://localhost:7860/gradio_api/mcp/sse
      demo.launch(mcp_server=True, show_api=True)
  import os
  import time
+ import re
+ import html
+ from typing import Optional, Dict, Any, List
+ from urllib.parse import urlsplit
+ from datetime import datetime, timezone
+ from dateutil import parser as dateparser  # needed by _iso_date_or_unknown below
+
  import httpx
  import trafilatura
  import gradio as gr

  from limits import parse
  from limits.aio.storage import MemoryStorage
  from limits.aio.strategies import MovingWindowRateLimiter

+ from analytics import record_request, last_n_days_count_df
+
+ # ──────────────────────────────────────────────────────────────────────────────
  # Configuration
+ # ──────────────────────────────────────────────────────────────────────────────
  SERPER_API_KEY = os.getenv("SERPER_API_KEY")
  SERPER_SEARCH_ENDPOINT = "https://google.serper.dev/search"
  SERPER_NEWS_ENDPOINT = "https://google.serper.dev/news"
+ HEADERS = {"X-API-KEY": SERPER_API_KEY or "", "Content-Type": "application/json"}

+ # Rate limiting (shared by both tools)
  storage = MemoryStorage()
  limiter = MovingWindowRateLimiter(storage)
+ rate_limit = parse("360/hour")  # shared global limit across search + fetch
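The moving-window semantics used above can be illustrated without the `limits` library; this is a minimal pure-Python sketch of the same idea (the limit and window below are illustrative, not the server's 360/hour):

```python
import time
from collections import deque
from typing import Optional

class MovingWindowLimiter:
    """Allow at most `limit` hits per `window_s` seconds (moving window)."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self.hits = deque()  # timestamps of accepted hits, oldest first

    def hit(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Evict timestamps that have slid out of the window.
        while self.hits and now - self.hits[0] >= self.window_s:
            self.hits.popleft()
        if len(self.hits) >= self.limit:
            return False
        self.hits.append(now)
        return True

limiter = MovingWindowLimiter(limit=2, window_s=3600.0)
results = [limiter.hit(now=t) for t in (0.0, 1.0, 2.0)]
print(results)  # [True, True, False]
```

Unlike a fixed-window counter, the window slides with each request, so a burst at a window boundary cannot double the effective rate.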
+
+
+ # ──────────────────────────────────────────────────────────────────────────────
+ # Helpers
+ # ──────────────────────────────────────────────────────────────────────────────
+ def _domain_from_url(url: str) -> str:
+     try:
+         netloc = urlsplit(url).netloc
+         return netloc.replace("www.", "")
+     except Exception:
+         return ""
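The helper's behavior can be checked in isolation (restated here so the snippet runs standalone):

```python
from urllib.parse import urlsplit

def domain_from_url(url: str) -> str:
    # Mirrors the server's _domain_from_url: netloc with "www." removed.
    try:
        return urlsplit(url).netloc.replace("www.", "")
    except Exception:
        return ""

print(domain_from_url("https://www.example.com/a/b?q=1"))  # example.com
print(domain_from_url("not a url"))                        # (empty string)
```

Note that `str.replace` removes every `"www."` substring, not just a leading one, so an unusual host like `www.example.www.com` collapses to `example.com`.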
+
+
+ def _iso_date_or_unknown(date_str: Optional[str]) -> Optional[str]:
+     if not date_str:
+         return None
+     try:
+         return dateparser.parse(date_str, fuzzy=True).strftime("%Y-%m-%d")
+     except Exception:
+         return None
+
+
+ def _extract_title_from_html(html_text: str) -> Optional[str]:
+     m = re.search(r"<title[^>]*>(.*?)</title>", html_text, re.IGNORECASE | re.DOTALL)
+     if not m:
+         return None
+     title = re.sub(r"\s+", " ", m.group(1)).strip()
+     return html.unescape(title) if title else None
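The title helper can be exercised on a small HTML snippet (restated here so it runs standalone):

```python
import html
import re

def extract_title(html_text: str) -> "str | None":
    # Same approach as the server's _extract_title_from_html helper:
    # grab the <title> element, collapse whitespace, unescape entities.
    m = re.search(r"<title[^>]*>(.*?)</title>", html_text, re.IGNORECASE | re.DOTALL)
    if not m:
        return None
    title = re.sub(r"\s+", " ", m.group(1)).strip()
    return html.unescape(title) if title else None

page = "<html><head><title>\n  Hello &amp;\n  World\n</title></head></html>"
print(extract_title(page))               # Hello & World
print(extract_title("<p>no title</p>"))  # None
```

The `re.DOTALL` flag matters: without it, a `<title>` split across lines (as above) would not match.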
+
+
+ # ──────────────────────────────────────────────────────────────────────────────
+ # Tool: search (metadata only)
+ # ──────────────────────────────────────────────────────────────────────────────
+ async def search(
      query: str, search_type: str = "search", num_results: Optional[int] = 4
+ ) -> Dict[str, Any]:
      """
+     Perform a web or news search via Serper and return metadata ONLY.
+     Does NOT fetch or extract content from result URLs.
      """
      start_time = time.time()

+     # Validate inputs
+     if not query or not query.strip():
+         await record_request("search")
+         return {"error": "Missing 'query'. Please provide a search query string."}

      if num_results is None:
          num_results = 4
+     num_results = max(1, min(20, int(num_results)))

      if search_type not in ["search", "news"]:
          search_type = "search"

+     # Check API key
+     if not SERPER_API_KEY:
+         await record_request("search")
+         return {
+             "error": "SERPER_API_KEY is not set. Export SERPER_API_KEY and try again."
+         }
+
      try:
+         # Rate limit
          if not await limiter.hit(rate_limit, "global"):
+             await record_request("search")
+             return {"error": "Rate limit exceeded. Limit: 360 requests/hour."}

          endpoint = (
              SERPER_NEWS_ENDPOINT if search_type == "news" else SERPER_SEARCH_ENDPOINT
          )
+         payload: Dict[str, Any] = {"q": query, "num": num_results}

          if search_type == "news":
              payload["type"] = "news"
              payload["page"] = 1

              resp = await client.post(endpoint, headers=HEADERS, json=payload)

          if resp.status_code != 200:
+             await record_request("search")
+             return {
+                 "error": f"Search API returned status {resp.status_code}. Check your API key and query."
+             }
+
+         data = resp.json()
+         raw_results: List[Dict[str, Any]] = (
+             data.get("news", []) if search_type == "news" else data.get("organic", [])
+         )
+         if not raw_results:
+             await record_request("search")
+             return {
+                 "query": query,
+                 "search_type": search_type,
+                 "count": 0,
+                 "results": [],
+                 "message": f"No {search_type} results found.",
+             }
+
+         formatted: List[Dict[str, Any]] = []
+         for idx, r in enumerate(raw_results[:num_results], start=1):
+             item = {
+                 "position": idx,
+                 "title": r.get("title"),
+                 "link": r.get("link"),
+                 "domain": _domain_from_url(r.get("link", "")),
+                 "snippet": r.get("snippet") or r.get("description"),
+             }
+             if search_type == "news":
+                 item["source"] = r.get("source")
+                 item["date"] = _iso_date_or_unknown(r.get("date"))
+             formatted.append(item)
+
+         await record_request("search")
+         return {
+             "query": query,
+             "search_type": search_type,
+             "count": len(formatted),
+             "results": formatted,
+             "duration_s": round(time.time() - start_time, 2),
+         }

+     except Exception as e:
+         await record_request("search")
+         return {"error": f"Search failed: {str(e)}"}
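The metadata-shaping loop at the heart of the tool can be isolated and exercised against a canned Serper-style payload (the sample records below are invented for illustration; news-only fields are omitted):

```python
from typing import Any, Dict, List

def shape_results(raw: List[Dict[str, Any]], num_results: int) -> List[Dict[str, Any]]:
    # Same shaping as the search tool: 1-based position, title/link passthrough,
    # and snippet falling back to "description" when absent.
    formatted: List[Dict[str, Any]] = []
    for idx, r in enumerate(raw[:num_results], start=1):
        formatted.append(
            {
                "position": idx,
                "title": r.get("title"),
                "link": r.get("link"),
                "snippet": r.get("snippet") or r.get("description"),
            }
        )
    return formatted

sample = [
    {"title": "A", "link": "https://a.test", "snippet": "first"},
    {"title": "B", "link": "https://b.test", "description": "second"},
    {"title": "C", "link": "https://c.test"},
]
shaped = shape_results(sample, num_results=2)
print(shaped[1]["snippet"])  # second  (falls back to "description")
```

Slicing with `raw[:num_results]` before enumerating keeps the positions contiguous even when the API returns more rows than requested.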


+ # ──────────────────────────────────────────────────────────────────────────────
+ # Tool: fetch (single URL fetch + extraction)
+ # ──────────────────────────────────────────────────────────────────────────────
+ async def fetch(url: str, timeout: int = 20) -> Dict[str, Any]:
+     """
+     Fetch a single URL and extract the main readable content.
+     """
+     start_time = time.time()

+     if not url or not isinstance(url, str):
+         await record_request("fetch")
+         return {"error": "Missing 'url'. Please provide a valid URL string."}
+     if not url.lower().startswith(("http://", "https://")):
+         await record_request("fetch")
+         return {"error": "URL must start with http:// or https://."}

+     try:
+         # Rate limit
+         if not await limiter.hit(rate_limit, "global"):
+             await record_request("fetch")
+             return {"error": "Rate limit exceeded. Limit: 360 requests/hour."}
+
+         async with httpx.AsyncClient(timeout=timeout, follow_redirects=True) as client:
+             resp = await client.get(url)
+
+         text = resp.text or ""
+         content = (
+             trafilatura.extract(
+                 text,
+                 include_formatting=False,
+                 include_comments=False,
+             )
+             or ""
+         )

+         title = _extract_title_from_html(text) or ""
+         final_url_str = str(resp.url) if hasattr(resp, "url") else url
+         domain = _domain_from_url(final_url_str)
+         word_count = len(content.split()) if content else 0
+
+         result = {
+             "url": url,
+             "final_url": final_url_str,
+             "domain": domain,
+             "status_code": resp.status_code,
+             "title": title,
+             "fetched_at": datetime.now(timezone.utc).isoformat(),
+             "word_count": word_count,
+             "content": content.strip(),
+             "duration_s": round(time.time() - start_time, 2),
+         }
+
+         await record_request("fetch")
+         return result
+
+     except httpx.HTTPError as e:
+         await record_request("fetch")
+         return {"error": f"Network error while fetching: {str(e)}"}
      except Exception as e:
+         await record_request("fetch")
+         return {"error": f"Unexpected error while fetching: {str(e)}"}
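The post-extraction bookkeeping (title, domain, word count) can be sketched with the stdlib alone; here a naive tag-stripper stands in for `trafilatura.extract`, so this is an illustration of the result shape, not the real extractor:

```python
import html
import re
from urllib.parse import urlsplit

def summarize_page(url: str, page_html: str) -> dict:
    # Naive stand-in for trafilatura.extract: drop tags, collapse whitespace.
    body = re.sub(r"<[^>]+>", " ", page_html)
    content = re.sub(r"\s+", " ", html.unescape(body)).strip()
    m = re.search(r"<title[^>]*>(.*?)</title>", page_html, re.IGNORECASE | re.DOTALL)
    title = re.sub(r"\s+", " ", m.group(1)).strip() if m else ""
    return {
        "url": url,
        "domain": urlsplit(url).netloc.replace("www.", ""),
        "title": title,
        "word_count": len(content.split()),
        "content": content,
    }

page = "<html><head><title>Demo</title></head><body><p>two words</p></body></html>"
info = summarize_page("https://www.example.com/x", page)
print(info["domain"], info["title"], info["word_count"])  # example.com Demo 3
```

Note the naive stripper keeps the `<title>` text in the body (hence the count of 3); trafilatura's boilerplate removal avoids that, which is why the server uses it.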
 


+ # ──────────────────────────────────────────────────────────────────────────────
+ # Gradio UI
+ # ──────────────────────────────────────────────────────────────────────────────
  with gr.Blocks(title="Web Search MCP Server") as demo:
      gr.HTML(
          """

      )

      gr.Markdown("# πŸ” Web Search MCP Server")
+     gr.Markdown(
+         "This server provides two composable MCP tools: **search** (metadata only) and **fetch** (single-URL extraction)."
+     )

      with gr.Tabs():
          with gr.Tab("App"):
              with gr.Row():
+                 # ── Search panel ───────────────────────────────────────────────
                  with gr.Column(scale=3):
+                     gr.Markdown("## Search (metadata only)")
                      query_input = gr.Textbox(
                          label="Search Query",
+                         placeholder='e.g. "OpenAI news", "climate change 2024", "React hooks useState"',
+                         info="Required",
                      )
                      search_type_input = gr.Radio(
                          choices=["search", "news"],
                          value="search",
                          label="Search Type",
+                         info="Choose general web search or news",
+                     )
+                     num_results_input = gr.Slider(
+                         minimum=1,
+                         maximum=20,
+                         value=4,
+                         step=1,
+                         label="Number of Results",
+                         info="Optional (default 4)",
+                     )
+                     search_button = gr.Button("Run Search", variant="primary")
+                     search_output = gr.JSON(
+                         label="Search Results (metadata only)",
                      )

+                     gr.Examples(
+                         examples=[
+                             ["OpenAI GPT-5 latest developments", "news", 5],
+                             ["React hooks useState", "search", 4],
+                             ["Apple Vision Pro reviews", "search", 4],
+                             ["Tesla stock price today", "news", 6],
+                         ],
+                         inputs=[query_input, search_type_input, num_results_input],
+                         outputs=search_output,
+                         fn=search,
+                         cache_examples=False,
+                     )

+                 # ── Fetch panel ────────────────────────────────────────────────
+                 with gr.Column(scale=2):
+                     gr.Markdown("## Fetch (single URL β†’ extracted content)")
+                     url_input = gr.Textbox(
+                         label="URL",
+                         placeholder="https://example.com/article",
+                         info="Required: the URL to fetch and extract",
+                     )
+                     timeout_input = gr.Slider(
+                         minimum=5,
+                         maximum=60,
+                         value=20,
+                         step=1,
+                         label="Timeout (seconds)",
+                         info="Optional (default 20)",
+                     )
+                     fetch_button = gr.Button("Fetch & Extract", variant="primary")
+                     fetch_output = gr.JSON(label="Fetched Content (structured)")
+
+                     gr.Examples(
+                         examples=[
+                             ["https://news.ycombinator.com/"],
+                             ["https://www.python.org/dev/peps/pep-0008/"],
+                             ["https://en.wikipedia.org/wiki/Model_Context_Protocol"],
+                         ],
+                         inputs=[url_input],
+                         outputs=fetch_output,
+                         fn=fetch,
+                         cache_examples=False,
+                     )
+
+             # Wire up buttons
+             search_button.click(
+                 fn=search,
                  inputs=[query_input, search_type_input, num_results_input],
+                 outputs=search_output,
+                 api_name=False,
+             )
+             fetch_button.click(
+                 fn=fetch,
+                 inputs=[url_input, timeout_input],
+                 outputs=fetch_output,
+                 api_name=False,
              )

          with gr.Tab("Analytics"):
              gr.Markdown("## Community Usage Analytics")
+             gr.Markdown("Daily request counts (UTC), split by tool.")

              with gr.Row():
                  with gr.Column():
+                     search_plot = gr.BarPlot(
+                         value=last_n_days_count_df("search", 14),
                          x="date",
                          y="count",
+                         title="Daily Search Count",
+                         tooltip=["date", "count", "full_date"],
                          height=350,
+                         x_label_angle=-45,
                          container=False,
                      )
                  with gr.Column():
+                     fetch_plot = gr.BarPlot(
+                         value=last_n_days_count_df("fetch", 14),
                          x="date",
+                         y="count",
+                         title="Daily Fetch Count",
+                         tooltip=["date", "count", "full_date"],
                          height=350,
                          x_label_angle=-45,
                          container=False,
                      )

+     # Refresh analytics on load
      demo.load(
+         fn=lambda: (
+             last_n_days_count_df("search", 14),
+             last_n_days_count_df("fetch", 14),
+         ),
+         outputs=[search_plot, fetch_plot],
          api_name=False,
      )

+ # Expose MCP tools
+ gr.api(search, api_name="search")
+ gr.api(fetch, api_name="fetch")


  if __name__ == "__main__":
      # Launch with MCP server enabled
      demo.launch(mcp_server=True, show_api=True)
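With `mcp_server=True`, the MCP endpoint is served at `http://localhost:7860/gradio_api/mcp/sse`. A stdio-only MCP client can bridge to it; this is an illustrative Claude-Desktop-style config (the `npx mcp-remote` bridge and the server name `web-search` are assumptions, not part of this repo):

```json
{
  "mcpServers": {
    "web-search": {
      "command": "npx",
      "args": ["mcp-remote", "http://localhost:7860/gradio_api/mcp/sse"]
    }
  }
}
```

Once connected, the client should list the two tools exposed via `gr.api`: `search` and `fetch`.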