---
title: HuggingFace EDA MCP Server
short_description: MCP server to explore and analyze HuggingFace datasets
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.0
app_file: src/app.py
pinned: false
license: apache-2.0
app_port: 7860
tags:
- building-mcp-track-enterprise
- building-mcp-track-consumer
---
# HuggingFace EDA MCP Server
> Submission for the [Gradio MCP 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)
An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.
Whether you're an ML engineer, data scientist, or researcher, dataset exploration is a critical part of the workflow. This server automates the tedious parts (fetching metadata, sampling data, computing statistics) so you can focus on what matters: finding and understanding the right data for your task.
**Use cases:**
- **Dataset discovery**:
  - Inspect metadata, schemas, and samples to evaluate datasets before use
  - Combine it with the HuggingFace MCP `search_dataset` tool for broader dataset discovery
- **Exploratory data analysis**:
  - Analyze feature distributions, detect missing values, and review statistics
  - Ask your AI assistant to build reports and visualizations
- **Content search**: Find specific examples in datasets using text search
<p align="center">
<a href="https://www.youtube.com/watch?v=XdP7zGSb81k">
<img src="https://img.shields.io/badge/▶️_Demo_Video-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="Demo Video">
</a>
<a href="https://www.linkedin.com/posts/khalil-guetari-00a61415a_mcp-server-for-huggingface-datasets-discovery-activity-7400587711838842880-2K8p">
<img src="https://img.shields.io/badge/LinkedIn_Post-0A66C2?style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn Post">
</a>
<a href="https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp">
<img src="https://img.shields.io/badge/🤗_Try_it_on_HF_Spaces-FFD21E?style=for-the-badge" alt="HF Space">
</a>
</p>
## MCP Client Configuration
Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.
**Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`
### With URL
```json
{
"mcpServers": {
"hf-eda-mcp": {
"url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
"headers": {
"hf-api-token": "<HF_TOKEN>"
}
}
}
}
```
### With mcp-remote
```json
{
"mcpServers": {
"hf-eda-mcp": {
"command": "npx",
"args": [
"mcp-remote",
"https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
"--transport",
"streamable-http",
"--header",
"hf-api-token: <HF_TOKEN>"
]
}
}
}
```
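
To avoid pasting a token into a config file by hand, both variants above can be generated from the `HF_TOKEN` environment variable. The following is a minimal sketch: `build_client_config` is a hypothetical helper written for this README, not something shipped with the server, and it only reproduces the two JSON shapes shown above.

```python
import json
import os

# Hosted endpoint quoted earlier in this README.
MCP_URL = "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/"


def build_client_config(token: str, use_mcp_remote: bool = False) -> dict:
    """Return an `mcpServers` config dict (hypothetical helper, for illustration only)."""
    if use_mcp_remote:
        # Mirrors the "With mcp-remote" variant.
        server = {
            "command": "npx",
            "args": [
                "mcp-remote", MCP_URL,
                "--transport", "streamable-http",
                "--header", f"hf-api-token: {token}",
            ],
        }
    else:
        # Mirrors the "With URL" variant.
        server = {"url": MCP_URL, "headers": {"hf-api-token": token}}
    return {"mcpServers": {"hf-eda-mcp": server}}


if __name__ == "__main__":
    # Keeps the <HF_TOKEN> placeholder when the variable is not set.
    token = os.environ.get("HF_TOKEN", "<HF_TOKEN>")
    print(json.dumps(build_client_config(token), indent=2))
```

Where the resulting JSON should be saved depends on your MCP client, so the sketch only prints it.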
## Available Tools
### `get_dataset_metadata`
Retrieve comprehensive metadata about a HuggingFace dataset.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier (e.g., `imdb`, `squad`, `glue`) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
**Returns:** Dataset size, features schema, splits info, configurations, download stats, tags, download size, description and more.
---
### `get_dataset_sample`
Retrieve sample rows from a dataset for quick exploration.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier |
| `split` | string | ❌ | `train` | Dataset split to sample from |
| `num_samples` | int | ❌ | `10` | Number of samples to retrieve (max: 10,000) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
| `streaming` | bool | ❌ | `True` | Use streaming mode for efficient loading |
**Returns:** Sample data rows with schema information and sampling metadata.
---
### `analyze_dataset_features`
Perform exploratory data analysis on dataset features with automatic optimization.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier |
| `split` | string | ❌ | `train` | Dataset split to analyze |
| `sample_size` | int | ❌ | `1000` | Number of samples for analysis (max: 50,000) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
**Returns:** Feature types, statistics (mean, std, min, max for numerical), distributions, histograms, and missing value analysis. Supports numerical, categorical, text, image, and audio data types.
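
For intuition, the numerical statistics listed above can be computed from a sample along these lines. This is a sketch, not the server's code: `summarize_numeric` is a hypothetical name, missing values are assumed to arrive as `None`, and the use of the population standard deviation is an assumption.

```python
import math
from typing import Dict, Optional, Sequence


def summarize_numeric(values: Sequence[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarize one numeric column from a sample; None counts as missing."""
    present = [v for v in values if v is not None]
    summary: Dict[str, Optional[float]] = {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": None,
        "std": None,
        "min": None,
        "max": None,
    }
    if present:
        mean = sum(present) / len(present)
        # Population standard deviation over the non-missing values (assumed convention).
        variance = sum((v - mean) ** 2 for v in present) / len(present)
        summary.update(mean=mean, std=math.sqrt(variance), min=min(present), max=max(present))
    return summary
```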
---
### `search_text_in_dataset`
Search for text in dataset columns using the Dataset Viewer API.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | Full dataset identifier (e.g., `stanfordnlp/imdb`) |
| `config_name` | string | ✅ | - | Configuration name |
| `split` | string | ✅ | - | Split name |
| `query` | string | ✅ | - | Search query |
| `offset` | int | ❌ | `0` | Pagination offset |
| `length` | int | ❌ | `10` | Number of results to return |
**Returns:** Matching rows with highlighted search results. Only works on parquet datasets with text columns.
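
The tool's parameters line up with the query parameters of the public Dataset Viewer `/search` endpoint. The sketch below only builds the request URL and auth header so the mapping is visible; `build_search_url` and `auth_headers` are hypothetical helpers, and the assumption is that the server calls the public `datasets-server.huggingface.co` host. The config name in the comment is just an example value.

```python
from typing import Dict
from urllib.parse import urlencode

# Public Dataset Viewer API host (assumed to be the one the server talks to).
DATASET_VIEWER_BASE = "https://datasets-server.huggingface.co"


def build_search_url(
    dataset_id: str,
    config_name: str,
    split: str,
    query: str,
    offset: int = 0,
    length: int = 10,
) -> str:
    """Build a /search URL from the same arguments the tool accepts."""
    params = {
        "dataset": dataset_id,
        "config": config_name,
        "split": split,
        "query": query,
        "offset": offset,
        "length": length,
    }
    return f"{DATASET_VIEWER_BASE}/search?{urlencode(params)}"


def auth_headers(token: str) -> Dict[str, str]:
    """Bearer-token header used by HuggingFace HTTP APIs."""
    return {"Authorization": f"Bearer {token}"}


# Example (values are illustrative):
# build_search_url("stanfordnlp/imdb", "plain_text", "train", "great movie")
```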
---
## How It Works
### API Integrations
The server leverages multiple HuggingFace APIs:
| API | Used For |
|-----|----------|
| **[Hub API](https://huggingface.co/docs/huggingface_hub/guides/hf_api)** | Dataset metadata, repository info, download stats |
| **[Dataset Viewer API](https://huggingface.co/docs/dataset-viewer)** | Full dataset statistics, text search, parquet row access |
| **[datasets library](https://huggingface.co/docs/datasets)** | Streaming data loading, sample extraction |
### Data Loading Strategy
- **Streaming mode** (default): Uses `datasets.load_dataset(..., streaming=True)` to avoid downloading entire datasets. Samples are taken from an iterator, minimizing memory footprint.
- **Statistics API**: For parquet datasets, `analyze_dataset_features` first attempts to fetch pre-computed statistics from the Dataset Viewer API (`/statistics` endpoint), providing full dataset coverage without sampling.
- **Fallback**: If statistics aren't available, analysis falls back to sample-based computation.
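
The three points above amount to a "statistics first, sample second" flow. The sketch below shows one way to express it under stated assumptions: `analyze` and `take_sample` are hypothetical names, the data sources are passed in as callables so the flow can be read (and run) without network access, and any exception from the statistics call is treated as "not available".

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, List, Optional


def take_sample(rows: Iterable[dict], sample_size: int) -> List[dict]:
    """Pull at most `sample_size` rows from an iterator without consuming the rest."""
    return list(islice(rows, sample_size))


def analyze(
    fetch_statistics: Callable[[], Optional[dict]],
    open_stream: Callable[[], Iterable[dict]],
    compute_from_sample: Callable[[List[dict]], dict],
    sample_size: int = 1000,
) -> Dict[str, Any]:
    """Prefer pre-computed statistics; fall back to analysing a streamed sample."""
    try:
        stats = fetch_statistics()
    except Exception:
        # Assumption: any failure means statistics are unavailable for this dataset.
        stats = None
    if stats is not None:
        return {"source": "statistics_api", "result": stats}

    # With the datasets library, open_stream could be something like:
    #   lambda: load_dataset(dataset_id, split=split, streaming=True)
    sample = take_sample(open_stream(), sample_size)
    return {
        "source": "sample",
        "sample_size": len(sample),
        "result": compute_from_sample(sample),
    }
```

Because `islice` stops after `sample_size` rows, the streamed iterator is never read to the end, which is what keeps memory use low in streaming mode.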
### Caching
Results are cached locally to reduce API calls:
| Cache Type | TTL | Location |
|------------|-----|----------|
| Metadata | 1 hour | `~/.cache/hf_eda_mcp/metadata/` |
| Samples | 1 hour | `~/.cache/hf_eda_mcp/samples/` |
| Statistics | 1 hour | `~/.cache/hf_eda_mcp/statistics/` |
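
A file-based cache with a one-hour TTL, as in the table above, could look roughly like this. The class name `FileTTLCache`, the JSON on-disk format, and the hashed file names are all assumptions made for illustration; the server's actual cache layout may differ.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Any, Optional

TTL_SECONDS = 3600  # 1 hour, matching the TTL column above


class FileTTLCache:
    """Store JSON-serializable values on disk and expire them after `ttl` seconds."""

    def __init__(self, directory: Path, ttl: float = TTL_SECONDS) -> None:
        self.directory = Path(directory)
        self.ttl = ttl
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Hash the key so dataset ids containing "/" map to a flat file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def set(self, key: str, value: Any, now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        payload = {"stored_at": stored_at, "value": value}
        self._path(key).write_text(json.dumps(payload), encoding="utf-8")

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        path = self._path(key)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        current = time.time() if now is None else now
        if current - entry["stored_at"] > self.ttl:
            path.unlink()  # Expired: drop the file and report a miss.
            return None
        return entry["value"]
```

The optional `now` argument exists only so expiry can be exercised without waiting; in normal use both methods read the system clock.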
### Parquet Requirements
Some features require datasets with `builder_name="parquet"`:
- **Text search** (`search_text_in_dataset`): Only parquet datasets are searchable
- **Full statistics**: Pre-computed stats are only available for parquet datasets
### Error Handling
- Automatic retry with exponential backoff for transient network errors
- Graceful fallback from statistics API to sample-based analysis
- Descriptive error messages with suggestions for common issues
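
The retry behaviour in the first bullet can be sketched as follows. The function name, default delays, and the choice of which exceptions count as transient are assumptions for illustration, not the values used in `error_handling.py`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    transient: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying transient errors with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return func()
        except transient:
            if attempt == retries:
                raise  # Out of retries: surface the last error to the caller.
            # Delay doubles on each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RuntimeError("unreachable")  # Loop always returns or raises.
```

Errors outside the `transient` tuple are not caught, so validation problems such as a bad dataset id fail immediately instead of being retried.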
## Project Structure
```
src/hf_eda_mcp/
├── server.py                      # Gradio app with MCP server setup
├── config.py                      # Server configuration (env vars, defaults)
├── validation.py                  # Input validation for all tools
├── error_handling.py              # Retry logic, error formatting
├── tools/                         # MCP tools (exposed via Gradio)
│   ├── metadata.py                # get_dataset_metadata
│   ├── sampling.py                # get_dataset_sample
│   ├── analysis.py                # analyze_dataset_features
│   └── search.py                  # search_text_in_dataset
├── services/                      # Business logic layer
│   └── dataset_service.py         # Caching, data loading, statistics
└── integrations/
    ├── dataset_viewer_adapter.py  # Dataset Viewer API client
    └── hf_client.py               # HuggingFace Hub API wrapper (HfApi)
```
## Local Development
### Setup
```bash
# Install pdm
brew install pdm
# Clone the repository
git clone https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp
cd hf-eda-mcp
# Install dependencies
pdm install
# Set your HuggingFace token
export HF_TOKEN=hf_xxx
# or create a .env file with HF_TOKEN=hf_xxx (see config.example.env)
# Run the server
pdm run hf-eda-mcp
```
The server starts at `http://localhost:7860`, with the MCP endpoint at `/gradio_api/mcp/`.
## License
Apache License 2.0