---
title: HuggingFace EDA MCP Server
short_description: MCP server to explore and analyze HuggingFace datasets
emoji: 📊
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.0
app_file: src/app.py
pinned: false
license: apache-2.0
app_port: 7860
tags:
  - building-mcp-track-enterprise
  - building-mcp-track-consumer
---

# 📊 HuggingFace EDA MCP Server

> 🎉 Submission for the [Gradio MCP 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)

An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.

Whether you're an ML engineer, data scientist, or researcher, dataset exploration is a critical part of the workflow. This server automates the tedious parts, such as fetching metadata, sampling data, and computing statistics, so you can focus on what matters: finding and understanding the right data for your task.

**Use cases:**
- **Dataset discovery**:
  - Inspect metadata, schemas, and samples to evaluate datasets before use
  - Use it in conjunction with HuggingFace MCP `search_dataset` for even more powerful dataset discovery
- **Exploratory data analysis**:
  - Analyze feature distributions, detect missing values, and review statistics
  - Ask your AI assistant to build reports and visualizations
- **Content search**: Find specific examples in datasets using text search

<p align="center">
  <a href="https://www.youtube.com/watch?v=XdP7zGSb81k">
    <img src="https://img.shields.io/badge/▶️_Demo_Video-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="Demo Video">
  </a>
  &nbsp;
  <a href="https://www.linkedin.com/posts/khalil-guetari-00a61415a_mcp-server-for-huggingface-datasets-discovery-activity-7400587711838842880-2K8p">
    <img src="https://img.shields.io/badge/LinkedIn_Post-0A66C2?style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn Post">
  </a>
  &nbsp;
  <a href="https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp">
    <img src="https://img.shields.io/badge/🤗_Try_it_on_HF_Spaces-FFD21E?style=for-the-badge" alt="HF Space">
  </a>
</p>

## MCP Client Configuration

Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.

**Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`

### With URL

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
      "headers": {
        "hf-api-token": "<HF_TOKEN>"
      }
    }
  }
}
```

### With mcp-remote

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
        "--transport",
        "streamable-http",
        "--header",
        "hf-api-token: <HF_TOKEN>"
      ]
    }
  }
}
```
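
To avoid pasting a token into the file by hand, the URL-style configuration can be generated from an environment variable. This is a minimal sketch: `build_client_config` is a hypothetical helper, not part of this project, and it assumes the token is exported as `HF_TOKEN`. The endpoint and the `hf-api-token` header name are taken from the configuration above.

```python
import json
import os

HOSTED_ENDPOINT = "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/"


def build_client_config(token: str, endpoint: str = HOSTED_ENDPOINT) -> dict:
    """Return the URL-style client configuration as a dict (hypothetical helper)."""
    if not token:
        raise ValueError("A HuggingFace token is required")
    return {
        "mcpServers": {
            "hf-eda-mcp": {
                "url": endpoint,
                "headers": {"hf-api-token": token},
            }
        }
    }


# Falls back to the placeholder when HF_TOKEN is not set in the environment.
config = build_client_config(os.environ.get("HF_TOKEN", "<HF_TOKEN>"))
print(json.dumps(config, indent=2))
```

The printed JSON can then be saved wherever your MCP client expects its configuration.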

## Available Tools

### `get_dataset_metadata`

Retrieve comprehensive metadata about a HuggingFace dataset.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier (e.g., `imdb`, `squad`, `glue`) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |

**Returns:** Dataset size, features schema, splits info, configurations, download stats, tags, download size, description and more.

---

### `get_dataset_sample`

Retrieve sample rows from a dataset for quick exploration.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier |
| `split` | string | ❌ | `train` | Dataset split to sample from |
| `num_samples` | int | ❌ | `10` | Number of samples to retrieve (max: 10,000) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
| `streaming` | bool | ❌ | `True` | Use streaming mode for efficient loading |

**Returns:** Sample data rows with schema information and sampling metadata.
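
Streaming-based sampling can be approximated in a few lines. This is a sketch, not the server's actual code: `take_samples` and `sample_dataset` are hypothetical names, and the sketch assumes the `datasets` package is installed and the Hub is reachable when `sample_dataset` is called.

```python
from itertools import islice
from typing import Iterable, Optional


def take_samples(rows: Iterable[dict], num_samples: int = 10) -> list[dict]:
    """Pull at most `num_samples` rows from an iterator, leaving the rest unread."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    return list(islice(rows, num_samples))


def sample_dataset(
    dataset_id: str,
    split: str = "train",
    num_samples: int = 10,
    config_name: Optional[str] = None,
) -> list[dict]:
    """Stream a split from the Hub and return its first rows (needs network access)."""
    # Imported lazily so take_samples stays usable without the `datasets` package.
    from datasets import load_dataset

    stream = load_dataset(dataset_id, name=config_name, split=split, streaming=True)
    return take_samples(stream, num_samples)
```

Because `islice` stops as soon as enough rows have been read, only the beginning of the split is fetched.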

---

### `analyze_dataset_features`

Perform exploratory data analysis on dataset features with automatic optimization.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier |
| `split` | string | ❌ | `train` | Dataset split to analyze |
| `sample_size` | int | ❌ | `1000` | Number of samples for analysis (max: 50,000) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |

**Returns:** Feature types, statistics (mean, std, min, max for numerical), distributions, histograms, and missing value analysis. Supports numerical, categorical, text, image, and audio data types.
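
As an illustration of the numerical statistics listed above, a sample-based summary for one column might look like the following. This is a sketch under assumptions: `summarize_numeric` is a hypothetical helper, and the choice of the sample standard deviation and of treating `None` and NaN as missing is ours, not necessarily what the server does.

```python
import math
import statistics
from typing import Optional


def summarize_numeric(values: list[Optional[float]]) -> dict:
    """Summarize one numerical column from a sample; None and NaN count as missing."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": None,
        "std": None,
        "min": None,
        "max": None,
    }
    if present:
        summary["mean"] = statistics.fmean(present)
        # Sample standard deviation; undefined for a single value, so report 0.0.
        summary["std"] = statistics.stdev(present) if len(present) > 1 else 0.0
        summary["min"] = min(present)
        summary["max"] = max(present)
    return summary
```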

---

### `search_text_in_dataset`

Search for text in dataset columns using the Dataset Viewer API.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | Full dataset identifier (e.g., `stanfordnlp/imdb`) |
| `config_name` | string | ✅ | - | Configuration name |
| `split` | string | ✅ | - | Split name |
| `query` | string | ✅ | - | Search query |
| `offset` | int | ❌ | `0` | Pagination offset |
| `length` | int | ❌ | `10` | Number of results to return |

**Returns:** Matching rows with highlighted search results. Only works on parquet datasets with text columns.
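
The tool parameters map naturally onto the query string of the Dataset Viewer `/search` endpoint. The sketch below only builds the request URL. `build_search_url` is a hypothetical helper, and the assumption that the tool forwards its arguments one-to-one is ours.

```python
from urllib.parse import urlencode

DATASET_VIEWER_BASE = "https://datasets-server.huggingface.co"


def build_search_url(
    dataset_id: str,
    config_name: str,
    split: str,
    query: str,
    offset: int = 0,
    length: int = 10,
) -> str:
    """Build a Dataset Viewer /search URL from the tool's parameters."""
    if offset < 0 or length < 1:
        raise ValueError("offset must be >= 0 and length must be >= 1")
    params = {
        "dataset": dataset_id,
        "config": config_name,
        "split": split,
        "query": query,
        "offset": offset,
        "length": length,
    }
    # urlencode percent-encodes the slash in the dataset id and spaces in the query.
    return f"{DATASET_VIEWER_BASE}/search?{urlencode(params)}"


print(build_search_url("stanfordnlp/imdb", "plain_text", "train", "great movie"))
```

Paging through results then amounts to increasing `offset` by `length` on each call.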

---

## How It Works

### API Integrations

The server leverages multiple HuggingFace APIs:

| API | Used For |
|-----|----------|
| **[Hub API](https://huggingface.co/docs/huggingface_hub/guides/hf_api)** | Dataset metadata, repository info, download stats |
| **[Dataset Viewer API](https://huggingface.co/docs/dataset-viewer)** | Full dataset statistics, text search, parquet row access |
| **[datasets library](https://huggingface.co/docs/datasets)** | Streaming data loading, sample extraction |

### Data Loading Strategy

- **Streaming mode** (default): Uses `datasets.load_dataset(..., streaming=True)` to avoid downloading entire datasets. Samples are taken from an iterator, minimizing memory footprint.
- **Statistics API**: For parquet datasets, `analyze_dataset_features` first attempts to fetch pre-computed statistics from the Dataset Viewer API (`/statistics` endpoint), providing full dataset coverage without sampling.
- **Fallback**: If statistics aren't available, analysis falls back to sample-based computation.
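
The statistics-first strategy with its fallback can be sketched as a small function that takes both code paths as callables. All names here are hypothetical, and treating every exception from the statistics call as "unavailable" is a simplification made for the sketch.

```python
from typing import Callable, Optional


def analyze_with_fallback(
    fetch_statistics: Callable[[], Optional[dict]],
    analyze_sample: Callable[[], dict],
) -> dict:
    """Prefer pre-computed statistics; fall back to sample-based analysis."""
    try:
        stats = fetch_statistics()
    except Exception:
        # Simplification: any failure is treated as "statistics unavailable".
        stats = None
    if stats:
        return {"source": "statistics_api", "coverage": "full", "result": stats}
    return {"source": "sample", "coverage": "partial", "result": analyze_sample()}


# Usage with stand-in callables: the statistics path returns nothing here,
# so the sample-based path is used.
report = analyze_with_fallback(lambda: None, lambda: {"label": {"missing": 0}})
print(report["source"])
```

Recording the source alongside the result lets a caller tell whether the numbers cover the full dataset or only a sample.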

### Caching

Results are cached locally to reduce API calls:

| Cache Type | TTL | Location |
|------------|-----|----------|
| Metadata | 1 hour | `~/.cache/hf_eda_mcp/metadata/` |
| Samples | 1 hour | `~/.cache/hf_eda_mcp/samples/` |
| Statistics | 1 hour | `~/.cache/hf_eda_mcp/statistics/` |
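
A file-based cache with a one-hour TTL could be structured as follows. This is a sketch, not the project's implementation: the class name, the JSON on-disk format, and the hashed file names are all assumptions.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Any, Optional


class TTLFileCache:
    """Store JSON-serializable values on disk and expire them after a TTL."""

    def __init__(self, directory: Path, ttl_seconds: float = 3600.0):
        self.directory = Path(directory)
        self.ttl_seconds = ttl_seconds
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Hash the key so dataset ids containing "/" are safe as file names.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def set(self, key: str, value: Any, now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        entry = {"stored_at": stored_at, "value": value}
        self._path(key).write_text(json.dumps(entry), encoding="utf-8")

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        current = time.time() if now is None else now
        if current - entry["stored_at"] > self.ttl_seconds:
            path.unlink()  # expired: drop the stale file
            return None
        return entry["value"]


# Usage with a throwaway directory instead of ~/.cache/hf_eda_mcp/metadata/
with tempfile.TemporaryDirectory() as tmp:
    cache = TTLFileCache(Path(tmp) / "metadata")
    cache.set("stanfordnlp/imdb", {"splits": ["train", "test"]})
    print(cache.get("stanfordnlp/imdb"))
```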

### Parquet Requirements

Some features require datasets with `builder_name="parquet"`:
- **Text search** (`search_text_in_dataset`): Only parquet datasets are searchable
- **Full statistics**: Pre-computed stats are only available for parquet datasets

### Error Handling

- Automatic retry with exponential backoff for transient network errors
- Graceful fallback from statistics API to sample-based analysis
- Descriptive error messages with suggestions for common issues
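
Retry with exponential backoff, the first item above, can be sketched generically. The function name, the default attempt count and delays, and the choice of which exceptions count as transient are assumptions made for illustration, not the values used in `error_handling.py`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: only these built-in exceptions are treated as transient.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, doubling the wait after each transient failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except TRANSIENT_ERRORS:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Non-transient errors, such as a `ValueError` from bad input, propagate immediately.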


## Project Structure

```
src/hf_eda_mcp/
├── server.py                 # Gradio app with MCP server setup
├── config.py                 # Server configuration (env vars, defaults)
├── validation.py             # Input validation for all tools
├── error_handling.py         # Retry logic, error formatting
├── tools/                    # MCP tools (exposed via Gradio)
│   ├── metadata.py           # get_dataset_metadata
│   ├── sampling.py           # get_dataset_sample
│   ├── analysis.py           # analyze_dataset_features
│   └── search.py             # search_text_in_dataset
├── services/                 # Business logic layer
│   └── dataset_service.py    # Caching, data loading, statistics
└── integrations/
    ├── dataset_viewer_adapter.py  # Dataset Viewer API client
    └── hf_client.py          # HuggingFace Hub API wrapper (HfApi)
```

## Local Development

### Setup

```bash
# Install pdm (Homebrew shown; see the pdm docs for other platforms)
brew install pdm

# Clone the repository
git clone https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp
cd hf-eda-mcp

# Install dependencies
pdm install

# Set your HuggingFace token
export HF_TOKEN=hf_xxx
# or create a .env file with HF_TOKEN=hf_xxx (see config.example.env)

# Run the server
pdm run hf-eda-mcp
```

The server starts at `http://localhost:7860`, with the MCP endpoint at `/gradio_api/mcp/`.

## License

Apache License 2.0