---
title: HuggingFace EDA MCP Server
short_description: MCP server to explore and analyze HuggingFace datasets
emoji: 📊
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.0.0
app_file: src/app.py
pinned: false
license: apache-2.0
app_port: 7860
tags:
  - building-mcp-track-enterprise
  - building-mcp-track-consumer
---

# 📊 HuggingFace EDA MCP Server

> 🎉 Submission for the [Gradio MCP 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)

An MCP server that gives AI assistants the ability to explore and analyze any of the 500,000+ datasets on the HuggingFace Hub.

Whether you're an ML engineer, data scientist, or researcher, dataset exploration is a critical part of the workflow. This server automates the tedious parts (fetching metadata, sampling data, and computing statistics) so you can focus on what matters: finding and understanding the right data for your task.

**Use cases:**
- **Dataset discovery**:
  - Inspect metadata, schemas, and samples to evaluate datasets before use
  - Use it in conjunction with HuggingFace MCP `search_dataset` for even more powerful dataset discovery
- **Exploratory data analysis**:
  - Analyze feature distributions, detect missing values, and review statistics
  - Ask your AI assistant to build reports and visualizations
- **Content search**: Find specific examples in datasets using text search

<p align="center">
  <a href="https://www.youtube.com/watch?v=XdP7zGSb81k">
    <img src="https://img.shields.io/badge/▶️_Demo_Video-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="Demo Video">
  </a>
  &nbsp;
  <a href="https://www.linkedin.com/posts/khalil-guetari-00a61415a_mcp-server-for-huggingface-datasets-discovery-activity-7400587711838842880-2K8p">
    <img src="https://img.shields.io/badge/LinkedIn_Post-0A66C2?style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn Post">
  </a>
  &nbsp;
  <a href="https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp">
    <img src="https://img.shields.io/badge/🤗_Try_it_on_HF_Spaces-FFD21E?style=for-the-badge" alt="HF Space">
  </a>
</p>

## MCP Client Configuration

Connect your MCP client to the hosted server. A HuggingFace token is required to access private/gated datasets and to use the Dataset Viewer API.

**Hosted endpoint:** `https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/`

### With URL

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "url": "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
      "headers": {
        "hf-api-token": "<HF_TOKEN>"
      }
    }
  }
}
```

### With mcp-remote

```json
{
  "mcpServers": {
    "hf-eda-mcp": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/",
        "--transport",
        "streamable-http",
        "--header",
        "hf-api-token: <HF_TOKEN>"
      ]
    }
  }
}
```
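
If you prefer to generate the client entry rather than paste a token into a file by hand, the URL-based configuration above can be produced with a few lines of Python. This is only a sketch: `build_client_config` is a hypothetical helper, and reading the token from an `HF_TOKEN` environment variable is an assumption about your setup.

```python
import json
import os

HOSTED_ENDPOINT = "https://mcp-1st-birthday-hf-eda-mcp.hf.space/gradio_api/mcp/"


def build_client_config(token: str, url: str = HOSTED_ENDPOINT) -> dict:
    """Return the URL-based client entry with the token filled in (hypothetical helper)."""
    if not token:
        raise ValueError("A HuggingFace token is required")
    return {
        "mcpServers": {
            "hf-eda-mcp": {
                "url": url,
                "headers": {"hf-api-token": token},
            }
        }
    }


if __name__ == "__main__":
    # Assumption: the token lives in HF_TOKEN; the placeholder is kept otherwise.
    config = build_client_config(os.environ.get("HF_TOKEN", "<HF_TOKEN>"))
    print(json.dumps(config, indent=2))
```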

## Available Tools

### `get_dataset_metadata`

Retrieve comprehensive metadata about a HuggingFace dataset.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier (e.g., `imdb`, `squad`, `glue`) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |

**Returns:** Dataset size, features schema, splits info, configurations, download stats, tags, download size, description and more.

---

### `get_dataset_sample`

Retrieve sample rows from a dataset for quick exploration.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier |
| `split` | string | ❌ | `train` | Dataset split to sample from |
| `num_samples` | int | ❌ | `10` | Number of samples to retrieve (max: 10,000) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |
| `streaming` | bool | ❌ | `True` | Use streaming mode for efficient loading |

**Returns:** Sample data rows with schema information and sampling metadata.
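
To illustrate what streaming-based sampling can look like, here is a minimal sketch. It is not the server's actual code: `take_samples` and `stream_dataset_sample` are hypothetical names, and the only library call assumed is `datasets.load_dataset(..., streaming=True)`, which returns an iterable of row dictionaries.

```python
from itertools import islice
from typing import Any, Iterable, Optional

MAX_SAMPLES = 10_000  # upper bound taken from the parameter table above


def take_samples(rows: Iterable[dict], num_samples: int = 10) -> list:
    """Pull at most num_samples rows from an iterator without consuming the rest."""
    if not 1 <= num_samples <= MAX_SAMPLES:
        raise ValueError(f"num_samples must be between 1 and {MAX_SAMPLES}")
    return list(islice(rows, num_samples))


def stream_dataset_sample(
    dataset_id: str,
    split: str = "train",
    num_samples: int = 10,
    config_name: Optional[str] = None,
) -> list:
    """Hypothetical wrapper: stream a split from the Hub and keep the first rows."""
    # Imported lazily so take_samples stays usable without the datasets package.
    from datasets import load_dataset

    stream = load_dataset(dataset_id, config_name, split=split, streaming=True)
    return take_samples(stream, num_samples)
```

Because `islice` stops as soon as it has enough rows, only the beginning of the split is read, which is the property that keeps memory use low.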

---

### `analyze_dataset_features`

Perform exploratory data analysis on dataset features with automatic optimization.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | HuggingFace dataset identifier |
| `split` | string | ❌ | `train` | Dataset split to analyze |
| `sample_size` | int | ❌ | `1000` | Number of samples for analysis (max: 50,000) |
| `config_name` | string | ❌ | `None` | Configuration name for multi-config datasets |

**Returns:** Feature types, statistics (mean, std, min, max for numerical), distributions, histograms, and missing value analysis. Supports numerical, categorical, text, image, and audio data types.
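
As a rough idea of the numerical part of such an analysis, the sketch below computes the statistics listed above for one column of sampled values. `summarize_numeric` is a hypothetical helper, and the choice of the sample standard deviation is an assumption; the server may compute these differently.

```python
import statistics
from typing import Any, List, Optional


def summarize_numeric(values: List[Optional[float]]) -> dict:
    """Summarize one numerical column; None entries count as missing."""
    present = [v for v in values if v is not None]
    summary: dict = {"count": len(values), "missing": len(values) - len(present)}
    if present:
        summary.update(
            mean=statistics.fmean(present),
            # Sample standard deviation; undefined for a single value, so use 0.0.
            std=statistics.stdev(present) if len(present) > 1 else 0.0,
            min=min(present),
            max=max(present),
        )
    return summary


if __name__ == "__main__":
    print(summarize_numeric([1.0, 2.0, 3.0, None]))
```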

---

### `search_text_in_dataset`

Search for text in dataset columns using the Dataset Viewer API.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `dataset_id` | string | ✅ | - | Full dataset identifier (e.g., `stanfordnlp/imdb`) |
| `config_name` | string | ✅ | - | Configuration name |
| `split` | string | ✅ | - | Split name |
| `query` | string | ✅ | - | Search query |
| `offset` | int | ❌ | `0` | Pagination offset |
| `length` | int | ❌ | `10` | Number of results to return |

**Returns:** Matching rows with highlighted search results. Only works on parquet datasets with text columns.
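
For reference, the parameters above map closely onto the query string of the Dataset Viewer API's `/search` endpoint. The sketch below builds such a request directly. It assumes the public base URL `https://datasets-server.huggingface.co` and bearer-token authentication; `build_search_url` and `fetch_json` are hypothetical helpers, not part of this server.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

DATASET_VIEWER_BASE = "https://datasets-server.huggingface.co"  # assumed public base URL


def build_search_url(
    dataset_id: str,
    config_name: str,
    split: str,
    query: str,
    offset: int = 0,
    length: int = 10,
) -> str:
    """Build a /search URL from the tool's parameters."""
    params = {
        "dataset": dataset_id,
        "config": config_name,
        "split": split,
        "query": query,
        "offset": offset,
        "length": length,
    }
    return f"{DATASET_VIEWER_BASE}/search?{urlencode(params)}"


def fetch_json(url: str, token: Optional[str] = None) -> dict:
    """Perform the GET request; the token is sent as a bearer token if given."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    with urlopen(Request(url, headers=headers), timeout=30) as response:
        return json.load(response)
```

`build_search_url("stanfordnlp/imdb", "plain_text", "train", "great movie")` yields a URL that can be passed to `fetch_json`; the config name `plain_text` is used here purely as an example value.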

---

## How It Works

### API Integrations

The server leverages multiple HuggingFace APIs:

| API | Used For |
|-----|----------|
| **[Hub API](https://huggingface.co/docs/huggingface_hub/guides/hf_api)** | Dataset metadata, repository info, download stats |
| **[Dataset Viewer API](https://huggingface.co/docs/dataset-viewer)** | Full dataset statistics, text search, parquet row access |
| **[datasets library](https://huggingface.co/docs/datasets)** | Streaming data loading, sample extraction |

### Data Loading Strategy

- **Streaming mode** (default): Uses `datasets.load_dataset(..., streaming=True)` to avoid downloading entire datasets. Samples are taken from an iterator, minimizing memory footprint.
- **Statistics API**: For parquet datasets, `analyze_dataset_features` first attempts to fetch pre-computed statistics from the Dataset Viewer API (`/statistics` endpoint), providing full dataset coverage without sampling.
- **Fallback**: If statistics aren't available, analysis falls back to sample-based computation.
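
The statistics-first strategy with a sample-based fallback can be sketched as follows. The three callables are injected so the control flow is visible without network access; all names here are hypothetical, and the real server's error handling is likely more selective than a blanket `except`.

```python
from typing import Callable, Iterable


def analyze_with_fallback(
    fetch_statistics: Callable[[], dict],
    sample_rows: Callable[[int], Iterable[dict]],
    compute: Callable[[Iterable[dict]], dict],
    sample_size: int = 1000,
) -> dict:
    """Prefer pre-computed statistics; fall back to computing on a sample."""
    try:
        stats = fetch_statistics()
        return {"source": "statistics_api", "statistics": stats}
    except Exception:
        # Assumption: any failure (non-parquet dataset, API error) triggers the fallback.
        rows = sample_rows(sample_size)
        return {
            "source": "sample",
            "sample_size": sample_size,
            "statistics": compute(rows),
        }


def count_missing(rows: Iterable[dict]) -> dict:
    """Toy analysis used for illustration: count None values per column."""
    missing: dict = {}
    for row in rows:
        for column, value in row.items():
            missing[column] = missing.get(column, 0) + (1 if value is None else 0)
    return missing
```

Recording the `source` in the result lets the caller tell full-coverage statistics apart from estimates based on a sample.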

### Caching

Results are cached locally to reduce API calls:

| Cache Type | TTL | Location |
|------------|-----|----------|
| Metadata | 1 hour | `~/.cache/hf_eda_mcp/metadata/` |
| Samples | 1 hour | `~/.cache/hf_eda_mcp/samples/` |
| Statistics | 1 hour | `~/.cache/hf_eda_mcp/statistics/` |
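
A file-based cache with a time-to-live can be implemented in many ways; the sketch below shows one minimal approach under stated assumptions. `TTLFileCache`, the JSON file format, and the hashed file names are all hypothetical and may not match what the server writes to `~/.cache/hf_eda_mcp/`.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Any, Optional


class TTLFileCache:
    """Store JSON-serializable values on disk and expire them after ttl_seconds."""

    def __init__(self, directory: Path, ttl_seconds: float = 3600.0):
        self.directory = Path(directory)
        self.ttl_seconds = ttl_seconds
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        # Hash the key so dataset ids containing "/" are safe as file names.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get(self, key: str) -> Optional[Any]:
        path = self._path(key)
        try:
            entry = json.loads(path.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            return None
        if time.time() - entry["stored_at"] > self.ttl_seconds:
            path.unlink(missing_ok=True)  # drop the stale entry
            return None
        return entry["value"]

    def set(self, key: str, value: Any) -> None:
        entry = {"stored_at": time.time(), "value": value}
        self._path(key).write_text(json.dumps(entry), encoding="utf-8")
```

A metadata cache matching the table above would then be created with `TTLFileCache(Path.home() / ".cache" / "hf_eda_mcp" / "metadata", ttl_seconds=3600)`.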

### Parquet Requirements

Some features require datasets with `builder_name="parquet"`:
- **Text search** (`search_text_in_dataset`): Only parquet datasets are searchable
- **Full statistics**: Pre-computed stats are only available for parquet datasets

### Error Handling

- Automatic retry with exponential backoff for transient network errors
- Graceful fallback from statistics API to sample-based analysis
- Descriptive error messages with suggestions for common issues
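
Retry with exponential backoff, the first item above, can be sketched like this. The function name, the default delays, the jitter, and the set of exceptions treated as transient are assumptions made for illustration, not the server's actual settings.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: these built-in exceptions stand in for "transient network errors".
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient failures with delays of 1x, 2x, 4x, ... base_delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TRANSIENT_ERRORS:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so concurrent clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function easy to test, since a test can record the delays instead of actually waiting.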


## Project Structure

```
src/hf_eda_mcp/
├── server.py                 # Gradio app with MCP server setup
├── config.py                 # Server configuration (env vars, defaults)
├── validation.py             # Input validation for all tools
├── error_handling.py         # Retry logic, error formatting
├── tools/                    # MCP tools (exposed via Gradio)
│   ├── metadata.py           # get_dataset_metadata
│   ├── sampling.py           # get_dataset_sample
│   ├── analysis.py           # analyze_dataset_features
│   └── search.py             # search_text_in_dataset
├── services/                 # Business logic layer
│   └── dataset_service.py    # Caching, data loading, statistics
└── integrations/
    ├── dataset_viewer_adapter.py  # Dataset Viewer API client
    └── hf_client.py          # HuggingFace Hub API wrapper (HfApi)
```

## Local Development

### Setup

```bash
# Install pdm (shown for macOS with Homebrew; see the PDM docs for other platforms)
brew install pdm

# Clone the repository
git clone https://huggingface.co/spaces/MCP-1st-Birthday/hf-eda-mcp
cd hf-eda-mcp

# Install dependencies
pdm install

# Set your HuggingFace token
export HF_TOKEN=hf_xxx
# or create a .env file with HF_TOKEN=hf_xxx (see config.example.env)

# Run the server
pdm run hf-eda-mcp
```

The server starts at `http://localhost:7860`, with the MCP endpoint at `/gradio_api/mcp/`.

## License

Apache License 2.0