# MCP Server Usage Guide

## Overview

The HF EDA MCP Server provides four main tools for exploratory data analysis of HuggingFace datasets via the Model Context Protocol (MCP).

## Available MCP Tools

Gradio automatically exposes the following four tools when the app is launched with `mcp_server=True`:

### 1. `get_dataset_metadata`
Retrieve comprehensive metadata for a HuggingFace dataset.

**Parameters:**
- `dataset_id` (string): HuggingFace dataset identifier (e.g., 'imdb', 'squad')
- `config_name` (string, optional): Configuration name for multi-config datasets

**Returns:** JSON object with dataset metadata including size, features, splits, and configuration details.
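
For clients that speak MCP directly, a tool invocation is a JSON-RPC 2.0 `tools/call` request. The sketch below only builds that payload for `get_dataset_metadata`; it does not send it. It assumes the tool is exposed under exactly the name listed above, and `build_tool_call` is a hypothetical helper, not part of this server.

```python
import json
from typing import Any


def build_tool_call(name: str, arguments: dict[str, Any], request_id: int = 1) -> dict[str, Any]:
    """Build an MCP `tools/call` JSON-RPC 2.0 request (hypothetical helper)."""
    # Drop optional arguments that were not supplied (e.g. config_name=None).
    supplied = {key: value for key, value in arguments.items() if value is not None}
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": supplied},
    }


request = build_tool_call(
    "get_dataset_metadata",
    {"dataset_id": "imdb", "config_name": None},
)
print(json.dumps(request, indent=2))
```

The same helper applies to the other tools by changing the name and arguments; most MCP clients construct this message for you.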

### 2. `get_dataset_sample`
Retrieve a sample of rows from a HuggingFace dataset.

**Parameters:**
- `dataset_id` (string): HuggingFace dataset identifier
- `split` (string, default: 'train'): Dataset split to sample from
- `num_samples` (number, default: 10): Number of samples to retrieve (max: 10000)
- `config_name` (string, optional): Configuration name for multi-config datasets

**Returns:** JSON object with sampled data and metadata.

### 3. `analyze_dataset_features`
Perform exploratory analysis on dataset features with automatic optimization.

**Parameters:**
- `dataset_id` (string): HuggingFace dataset identifier
- `split` (string, default: 'train'): Dataset split to analyze
- `sample_size` (number, default: 1000): Number of rows to analyze in the sample-based fallback (max: 50000); ignored when Dataset Viewer statistics are available
- `config_name` (string, optional): Configuration name for multi-config datasets

**Returns:** JSON object with comprehensive feature analysis including:
- Feature types (numerical, categorical, text, image, audio)
- Statistical measures (mean, median, std, histograms)
- Missing value analysis
- Unique value counts
- Sample values

**Analysis Methods:**
- **Primary**: Uses HuggingFace Dataset Viewer API statistics when available (parquet datasets)
  - Analyzes the full dataset without downloading data
  - Provides complete statistics with histograms
  - Faster and more accurate than sampling, because the statistics cover every row
- **Fallback**: Sample-based analysis for non-parquet datasets
  - Downloads and analyzes a sample of the dataset
  - Computes statistics locally
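
The primary/fallback choice described above can be sketched as follows. This is an illustration of the pattern, not the server's actual code: `fetch_viewer_stats` and `analyze_sample` are hypothetical callables standing in for the Dataset Viewer lookup and the local sample analysis, and the `features` key is an assumed response field.

```python
from typing import Any, Callable, Optional


def analyze_with_fallback(
    fetch_viewer_stats: Callable[[], Optional[dict[str, Any]]],
    analyze_sample: Callable[[int], dict[str, Any]],
    sample_size: int = 1000,
) -> dict[str, Any]:
    """Prefer precomputed full-dataset statistics; otherwise analyze a sample."""
    stats = fetch_viewer_stats()  # hypothetical: returns None when unavailable
    if stats is not None:
        return {
            "features": stats,
            "sample_info": {
                "sampling_method": "dataset_viewer_api",
                "represents_full_dataset": True,
            },
        }
    # Fallback path: sample_size only matters here.
    return {
        "features": analyze_sample(sample_size),
        "sample_info": {
            "sampling_method": "sequential_head",
            "represents_full_dataset": False,
        },
    }
```

Passing the two strategies in as callables keeps the decision logic easy to test without network access.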

### 4. `search_text_in_dataset`
Search for text in text columns of a dataset using the Dataset Viewer API.

**Parameters:**
- `dataset_id` (string): HuggingFace dataset identifier
- `config_name` (string): Configuration name (required for search)
- `split` (string): Dataset split to search in
- `query` (string): Search query text
- `offset` (number, default: 0): Offset for pagination
- `length` (number, default: 10): Number of results to return (max: 100)

**Returns:** JSON object with search results including:
- `features`: List of features from the dataset, including column names and data types
- `rows`: List of matching rows with content from each column
- `num_rows_total`: Total number of examples in the split
- `num_rows_per_page`: Number of examples in the current page
- `partial`: Whether the response is partial (true if the dataset is too large to search completely)

**Limitations:**
- Only text columns are searched
- Only parquet datasets are supported (builder_name="parquet")
- Search is performed by the Dataset Viewer API, not locally

**Validation:**
- The tool validates that the dataset is in parquet format before attempting search
- The tool validates that the dataset has at least one text/string column
- If validation fails, a descriptive error message is returned with suggestions
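
Because each call returns at most 100 rows, collecting more results means paging with `offset` and `length`. A minimal sketch of that loop, assuming a hypothetical `search_page(offset, length)` callable that invokes `search_text_in_dataset` through your MCP client and returns the parsed JSON response:

```python
from typing import Any, Callable, Iterator

MAX_LENGTH = 100  # documented per-call maximum for `length`


def iter_search_rows(
    search_page: Callable[[int, int], dict[str, Any]],
    length: int = 10,
) -> Iterator[Any]:
    """Yield matching rows page by page until a short page is returned."""
    if not 1 <= length <= MAX_LENGTH:
        raise ValueError(f"length must be between 1 and {MAX_LENGTH}")
    offset = 0
    while True:
        page = search_page(offset, length)  # hypothetical client call
        rows = page.get("rows", [])
        yield from rows
        if len(rows) < length:
            break  # a short (or empty) page means there is nothing left
        offset += length
```

The loop stops on a short page instead of relying on `num_rows_total`, so it stays correct however that field is counted.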

## MCP Client Configuration

### Using with Claude Desktop

Add this configuration to your MCP settings:

```json
{
  "mcpServers": {
    "hf-eda-mcp-server": {
      "command": "pdm",
      "args": ["run", "hf-eda-mcp"],
      "env": {
        "HF_TOKEN": "your_huggingface_token_here"
      }
    }
  }
}
```

### Using with Hosted Server

If the server is running on a remote host:

```json
{
  "mcpServers": {
    "hf-eda-mcp-server": {
      "url": "https://your-server.com/gradio_api/mcp/sse",
      "headers": {
        "hf-api-token": "your_huggingface_token_here"
      }
    }
  }
}
```

## Starting the Server

### Local Development
```bash
# Start with MCP server enabled (default)
pdm run hf-eda-mcp

# Start on custom port
pdm run hf-eda-mcp --port 8080

# Start with verbose logging
pdm run hf-eda-mcp --verbose

# Start without MCP server functionality
pdm run hf-eda-mcp --no-mcp

# Start with custom host (listen on all interfaces)
pdm run hf-eda-mcp --host 0.0.0.0

# Start with public sharing enabled
pdm run hf-eda-mcp --share

# Start with custom cache directory
pdm run hf-eda-mcp --cache-dir /path/to/cache

# Start with custom maximum sample size
pdm run hf-eda-mcp --max-sample-size 100000
```

### Server Modes

The server provides both a web interface and MCP server functionality in a single application. When MCP is enabled, Gradio automatically exposes the 4 EDA functions as MCP tools while still providing the web interface for direct interaction.

### Environment Variables

The server supports comprehensive configuration via environment variables:

#### Authentication
- `HF_TOKEN`: HuggingFace access token for private datasets (optional)

#### Server Configuration
- `HF_EDA_PORT`: Server port (default: 7860)
- `HF_EDA_HOST`: Server host (default: 127.0.0.1)
- `HF_EDA_MCP_ENABLED`: Enable MCP server functionality (default: true)
- `HF_EDA_SHARE`: Enable public sharing via Gradio (default: false)

#### Logging Configuration
- `HF_EDA_LOG_LEVEL`: Logging level - DEBUG, INFO, WARNING, ERROR (default: INFO)

#### Performance and Caching
- `HF_EDA_CACHE_DIR`: Directory for caching datasets (optional)
- `HF_EDA_MAX_CACHE_SIZE`: Maximum cache size in MB (default: 1000)
- `HF_EDA_MAX_SAMPLE_SIZE`: Maximum sample size for analysis (default: 50000)
- `HF_EDA_MAX_CONCURRENT`: Maximum concurrent requests (default: 10)
- `HF_EDA_REQUEST_TIMEOUT`: Request timeout in seconds (default: 300)
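
To show how these variables and their defaults fit together, here is a minimal sketch of a settings loader. It is not the server's own configuration code: `ServerSettings` and `load_settings` are hypothetical names, and the accepted boolean spellings are an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Assumed spellings; the server may accept a different set.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ServerSettings:
    hf_token: Optional[str]
    host: str
    port: int
    mcp_enabled: bool
    share: bool
    log_level: str
    cache_dir: Optional[str]
    max_cache_size_mb: int
    max_sample_size: int
    max_concurrent: int
    request_timeout_s: int


def load_settings(env: Mapping[str, str] = os.environ) -> ServerSettings:
    """Read the documented variables, falling back to the documented defaults."""
    return ServerSettings(
        hf_token=env.get("HF_TOKEN"),
        host=env.get("HF_EDA_HOST", "127.0.0.1"),
        port=int(env.get("HF_EDA_PORT", "7860")),
        mcp_enabled=_as_bool(env.get("HF_EDA_MCP_ENABLED", "true")),
        share=_as_bool(env.get("HF_EDA_SHARE", "false")),
        log_level=env.get("HF_EDA_LOG_LEVEL", "INFO").upper(),
        cache_dir=env.get("HF_EDA_CACHE_DIR"),
        max_cache_size_mb=int(env.get("HF_EDA_MAX_CACHE_SIZE", "1000")),
        max_sample_size=int(env.get("HF_EDA_MAX_SAMPLE_SIZE", "50000")),
        max_concurrent=int(env.get("HF_EDA_MAX_CONCURRENT", "10")),
        request_timeout_s=int(env.get("HF_EDA_REQUEST_TIMEOUT", "300")),
    )
```

Calling `load_settings({})` returns the defaults listed above, which makes it a quick way to check what a given environment would resolve to.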

### Configuration Examples

#### Production Configuration
```bash
export HF_TOKEN="your_token_here"
export HF_EDA_HOST="0.0.0.0"
export HF_EDA_PORT="8080"
export HF_EDA_LOG_LEVEL="WARNING"
export HF_EDA_CACHE_DIR="/var/cache/hf-eda"
export HF_EDA_MAX_CONCURRENT="20"
pdm run hf-eda-mcp
```

#### Development Configuration
```bash
export HF_TOKEN="your_token_here"
export HF_EDA_LOG_LEVEL="DEBUG"
export HF_EDA_CACHE_DIR="./cache"
pdm run hf-eda-mcp --verbose
```

## Dataset Viewer Statistics Integration

The `analyze_dataset_features` tool automatically uses HuggingFace's Dataset Viewer API when available, providing significant benefits:

### Benefits
- **Full Dataset Analysis**: Analyzes entire datasets instead of samples
- **No Download Required**: Statistics are pre-computed by HuggingFace
- **Richer Statistics**: Includes histograms, frequencies, and multi-modal support
- **Better Performance**: Faster response times with caching

### Supported Datasets
Statistics are available for datasets with `builder_name="parquet"`. The tool automatically:
1. Checks if Dataset Viewer statistics are available
2. Uses full dataset statistics when available
3. Falls back to sample-based analysis for other datasets
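
To check by hand whether statistics exist for a dataset, you can query the public Dataset Viewer API yourself. The sketch below only builds the request URL for its `/statistics` endpoint; that the server uses this exact endpoint internally is an assumption, and `statistics_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Public Dataset Viewer API base URL (assumed to be what the tool queries).
DATASET_VIEWER_BASE = "https://datasets-server.huggingface.co"


def statistics_url(dataset_id: str, config_name: str, split: str) -> str:
    """Build the Dataset Viewer statistics URL for one config and split."""
    query = urlencode({"dataset": dataset_id, "config": config_name, "split": split})
    return f"{DATASET_VIEWER_BASE}/statistics?{query}"


print(statistics_url("glue", "cola", "train"))
```

Opening the printed URL in a browser or with `curl` shows whether precomputed statistics are served for that split.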

### Supported Data Types
The analysis tool provides comprehensive statistics for multiple data types:
- **Numerical** (int, float): min, max, mean, median, std, histograms
- **Categorical** (class_label, string_label): frequencies, unique counts
- **Boolean** (bool): True/False distributions
- **Text** (string_text): character length statistics, histograms
- **Image** (image): dimension statistics, histograms
- **Audio** (audio): duration statistics (seconds), histograms
- **List** (list): length statistics, histograms

### Response Indicators
Check the `sample_info` field in the response:
- `sampling_method: "dataset_viewer_api"` - Using full dataset statistics
- `sampling_method: "sequential_head"` - Using sample-based analysis
- `represents_full_dataset: true/false` - Whether analysis covers the full dataset
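
A client can use these indicators to decide how much to trust the numbers. A minimal sketch, assuming the tool response has already been parsed into a Python dict; `describe_coverage` is a hypothetical helper:

```python
from typing import Any


def describe_coverage(response: dict[str, Any]) -> str:
    """Summarize how an analysis result was produced, based on `sample_info`."""
    info = response.get("sample_info") or {}
    method = info.get("sampling_method", "unknown")
    if info.get("represents_full_dataset"):
        return f"full dataset ({method})"
    return f"sample only ({method})"
```

For example, a response produced by the fallback path would be reported as `sample only (sequential_head)`, a cue to treat the statistics as estimates.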

## Example Usage

Once an MCP client is connected to the server, you can invoke the tools with prompts like these:

```
# Get metadata for the IMDB dataset
Use the get_dataset_metadata tool with dataset_id="imdb"

# Sample 5 rows from the training split
Use the get_dataset_sample tool with dataset_id="imdb", split="train", num_samples=5

# Analyze features of the GLUE dataset (CoLA configuration)
Use the analyze_dataset_features tool with dataset_id="glue", config_name="cola", sample_size=500

# Search for text in the IMDB dataset
Use the search_text_in_dataset tool with dataset_id="imdb", config_name="plain_text", split="train", query="great movie", offset=0, length=10

# Search for a specific term in the SQuAD dataset
Use the search_text_in_dataset tool with dataset_id="squad", config_name="plain_text", split="train", query="president", offset=0, length=5
```

## API Endpoints

When the server is running, the tools are also reachable over HTTP. The URLs below assume the default host and port:

- **MCP Schema**: `http://localhost:7860/gradio_api/mcp/schema`
- **API Documentation**: `http://localhost:7860/?view=api`
- **Web Interface**: `http://localhost:7860`
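
To confirm which tools a running server exposes, you can fetch the MCP schema endpoint listed above. A minimal sketch using only the standard library; it assumes the endpoint returns JSON, and `schema_url` and `fetch_schema` are hypothetical helper names:

```python
import json
from typing import Any
from urllib.request import urlopen


def schema_url(host: str = "localhost", port: int = 7860) -> str:
    """Return the MCP schema URL for a server on the given host and port."""
    return f"http://{host}:{port}/gradio_api/mcp/schema"


def fetch_schema(host: str = "localhost", port: int = 7860, timeout: float = 10.0) -> Any:
    """Download and parse the MCP schema (requires a running server)."""
    with urlopen(schema_url(host, port), timeout=timeout) as response:
        return json.load(response)
```

With the server started locally, `fetch_schema()` returns the parsed schema; pass `port=8080` if you started it with `--port 8080`.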

## Troubleshooting

### Authentication Issues
- Ensure the `HF_TOKEN` environment variable is set when accessing private datasets
- Check that your HuggingFace token has appropriate permissions

### Dataset Not Found
- Verify the dataset ID is correct and exists on HuggingFace Hub
- Check if the dataset requires authentication

### Performance Issues
- Reduce `sample_size` for large datasets
- Use streaming mode (enabled by default) for better memory efficiency

### Search Tool Issues
- **Dataset not in parquet format**: The search tool only works with parquet datasets. If you get a `DatasetNotParquetError`, try a different dataset or check whether the dataset has a parquet configuration.
- **No text columns found**: The search tool requires at least one text/string column. If you get a `NoTextColumnsError`, check the dataset metadata first to confirm that the dataset has text columns.