| # MCP Server Usage Guide | |
| ## Overview | |
| The HF EDA MCP Server provides four main tools for exploratory data analysis of HuggingFace datasets via the Model Context Protocol (MCP). | |
| ## Available MCP Tools | |
| The following four tools are automatically exposed by Gradio when the app is launched with `mcp_server=True`: | |
| ### 1. `get_dataset_metadata` | |
| Retrieve comprehensive metadata for a HuggingFace dataset. | |
| **Parameters:** | |
| - `dataset_id` (string): HuggingFace dataset identifier (e.g., 'imdb', 'squad') | |
| - `config_name` (string, optional): Configuration name for multi-config datasets | |
| **Returns:** JSON object with dataset metadata including size, features, splits, and configuration details. | |
| ### 2. `get_dataset_sample` | |
| Retrieve a sample of rows from a HuggingFace dataset. | |
| **Parameters:** | |
| - `dataset_id` (string): HuggingFace dataset identifier | |
| - `split` (string, default: 'train'): Dataset split to sample from | |
| - `num_samples` (number, default: 10): Number of samples to retrieve (max: 10000) | |
| - `config_name` (string, optional): Configuration name for multi-config datasets | |
| **Returns:** JSON object with sampled data and metadata. | |
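MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for `get_dataset_sample` and checks `num_samples` against the documented limit before sending. The helper name `build_sample_request` is hypothetical, the client-side range check is an assumption about how you might want to fail early, and the tool name the server actually advertises may differ, so confirm it with `tools/list` first.

```python
MAX_NUM_SAMPLES = 10_000  # documented upper bound for num_samples


def build_sample_request(dataset_id, split="train", num_samples=10,
                         config_name=None, request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` message for get_dataset_sample.

    Hypothetical helper: it only assembles the message, it does not send it.
    """
    if not dataset_id:
        raise ValueError("dataset_id is required")
    if not 1 <= num_samples <= MAX_NUM_SAMPLES:
        raise ValueError(f"num_samples must be between 1 and {MAX_NUM_SAMPLES}")

    arguments = {
        "dataset_id": dataset_id,
        "split": split,
        "num_samples": num_samples,
    }
    if config_name is not None:
        # Optional: only needed for multi-config datasets.
        arguments["config_name"] = config_name

    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "get_dataset_sample", "arguments": arguments},
    }
```

Calling `build_sample_request("imdb", num_samples=5)` yields a dictionary that can be serialized with `json.dumps` and sent over whichever MCP transport the client uses.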
| ### 3. `analyze_dataset_features` | |
| Perform exploratory analysis on dataset features with automatic optimization. | |
| **Parameters:** | |
| - `dataset_id` (string): HuggingFace dataset identifier | |
| - `split` (string, default: 'train'): Dataset split to analyze | |
| - `sample_size` (number, default: 1000): Number of samples for analysis (max: 50000; only used by the sample-based fallback) | |
| - `config_name` (string, optional): Configuration name for multi-config datasets | |
| **Returns:** JSON object with comprehensive feature analysis including: | |
| - Feature types (numerical, categorical, text, image, audio) | |
| - Statistical measures (mean, median, std, histograms) | |
| - Missing value analysis | |
| - Unique value counts | |
| - Sample values | |
| **Analysis Methods:** | |
| - **Primary**: Uses HuggingFace Dataset Viewer API statistics when available (parquet datasets) | |
|   - Analyzes the full dataset without downloading data | |
|   - Provides complete statistics with histograms | |
|   - More efficient and accurate | |
| - **Fallback**: Sample-based analysis for non-parquet datasets | |
|   - Downloads and analyzes a sample of the dataset | |
|   - Computes statistics locally | |
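The primary/fallback strategy can be sketched as a small dispatcher. This is an illustration under stated assumptions, not the server's actual code: `fetch_viewer_stats` and `analyze_sample` are hypothetical callables supplied by the caller, and the sketch assumes the first one returns `None` when Dataset Viewer statistics are unavailable. The `sample_info` values mirror the ones documented under "Response Indicators".

```python
def analyze_features(dataset_id, split, sample_size,
                     fetch_viewer_stats, analyze_sample):
    """Prefer full-dataset viewer statistics; fall back to a local sample.

    Both callables are hypothetical stand-ins:
      fetch_viewer_stats(dataset_id, split) -> dict or None
      analyze_sample(dataset_id, split, sample_size) -> dict
    """
    stats = fetch_viewer_stats(dataset_id, split)
    if stats is not None:
        # Primary path: precomputed statistics, sample_size is ignored.
        return {
            "statistics": stats,
            "sample_info": {
                "sampling_method": "dataset_viewer_api",
                "represents_full_dataset": True,
            },
        }

    # Fallback path: statistics computed locally on a sample.
    # Assumption: a sample is treated as not covering the full dataset.
    return {
        "statistics": analyze_sample(dataset_id, split, sample_size),
        "sample_info": {
            "sampling_method": "sequential_head",
            "represents_full_dataset": False,
        },
    }
```

Passing the two data sources in as arguments keeps the decision logic easy to test without network access.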
| ### 4. `search_text_in_dataset` | |
| Search for text in text columns of a dataset using the Dataset Viewer API. | |
| **Parameters:** | |
| - `dataset_id` (string): HuggingFace dataset identifier | |
| - `config_name` (string): Configuration name (required for search) | |
| - `split` (string): Dataset split to search in | |
| - `query` (string): Search query text | |
| - `offset` (number, default: 0): Offset for pagination | |
| - `length` (number, default: 10): Number of results to return (max: 100) | |
| **Returns:** JSON object with search results including: | |
| - `features`: List of features from the dataset, including column names and data types | |
| - `rows`: List of matching rows with content from each column | |
| - `num_rows_total`: Total number of examples in the split | |
| - `num_rows_per_page`: Number of examples in the current page | |
| - `partial`: Whether the response is partial (true if the dataset is too large to search completely) | |
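Because each call returns at most 100 rows, collecting more matches means paging with `offset` and `length`. The generator below is a minimal sketch of that loop. The `search` argument is a hypothetical callable that wraps the `search_text_in_dataset` tool and returns its JSON response as a dictionary. The loop stops on an empty or short page instead of relying on `num_rows_total`.

```python
MAX_PAGE_SIZE = 100  # documented upper bound for `length`


def iter_search_results(search, query, page_size=10, max_rows=None):
    """Yield matching rows page by page.

    `search(query=..., offset=..., length=...)` is a hypothetical wrapper
    around the search_text_in_dataset tool that returns the response dict.
    """
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")

    offset = 0
    yielded = 0
    while True:
        page = search(query=query, offset=offset, length=page_size)
        rows = page.get("rows", [])
        if not rows:
            return  # no more matches
        for row in rows:
            yield row
            yielded += 1
            if max_rows is not None and yielded >= max_rows:
                return  # caller-imposed cap reached
        if len(rows) < page_size:
            return  # a short page means this was the last one
        offset += len(rows)
```

Setting `max_rows` is a cheap guard against accidentally paging through a very large result set.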
| **Limitations:** | |
| - Only text columns are searched | |
| - Only parquet datasets are supported (builder_name="parquet") | |
| - Search is performed by the Dataset Viewer API, not locally | |
| **Validation:** | |
| - The tool validates that the dataset is in parquet format before attempting search | |
| - The tool validates that the dataset has at least one text/string column | |
| - If validation fails, a descriptive error message is returned with suggestions | |
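The two validation checks can be sketched as follows. The exception names match the errors mentioned in the troubleshooting section, but their definitions here are assumptions, as is the shape of `features`: the sketch assumes a list of entries like `{"name": "text", "type": {"dtype": "string"}}`. Treat it as an outline of the checks rather than the server's implementation.

```python
class DatasetNotParquetError(ValueError):
    """Raised when the dataset is not served in parquet format (assumed definition)."""


class NoTextColumnsError(ValueError):
    """Raised when the dataset has no text/string column (assumed definition)."""


def validate_searchable(builder_name, features):
    """Return the names of searchable text columns, or raise a descriptive error.

    Assumes each feature looks like {"name": ..., "type": {"dtype": ...}}.
    """
    if builder_name != "parquet":
        raise DatasetNotParquetError(
            f"Search requires a parquet dataset, got builder_name={builder_name!r}. "
            "Try a different dataset or a parquet configuration."
        )

    text_columns = [
        feature["name"]
        for feature in features
        if feature.get("type", {}).get("dtype") == "string"
    ]
    if not text_columns:
        raise NoTextColumnsError(
            "No text/string columns found. Inspect the dataset metadata to "
            "confirm which columns are available."
        )
    return text_columns
```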
| ## MCP Client Configuration | |
| ### Using with Claude Desktop | |
| Add this configuration to your MCP settings: | |
| ```json | |
| { | |
|   "mcpServers": { | |
|     "hf-eda-mcp-server": { | |
|       "command": "pdm", | |
|       "args": ["run", "hf-eda-mcp"], | |
|       "env": { | |
|         "HF_TOKEN": "your_huggingface_token_here" | |
|       } | |
|     } | |
|   } | |
| } | |
| ``` | |
| ### Using with Hosted Server | |
| If the server is running on a remote host: | |
| ```json | |
| { | |
|   "mcpServers": { | |
|     "hf-eda-mcp-server": { | |
|       "url": "https://your-server.com/gradio_api/mcp/sse", | |
|       "headers": { | |
|         "hf-api-token": "your_huggingface_token_here" | |
|       } | |
|     } | |
|   } | |
| } | |
| ``` | |
| ## Starting the Server | |
| ### Local Development | |
| ```bash | |
| # Start with MCP server enabled (default) | |
| pdm run hf-eda-mcp | |
| # Start on custom port | |
| pdm run hf-eda-mcp --port 8080 | |
| # Start with verbose logging | |
| pdm run hf-eda-mcp --verbose | |
| # Start without MCP server functionality | |
| pdm run hf-eda-mcp --no-mcp | |
| # Start with custom host (listen on all interfaces) | |
| pdm run hf-eda-mcp --host 0.0.0.0 | |
| # Start with public sharing enabled | |
| pdm run hf-eda-mcp --share | |
| # Start with custom cache directory | |
| pdm run hf-eda-mcp --cache-dir /path/to/cache | |
| # Start with custom maximum sample size | |
| pdm run hf-eda-mcp --max-sample-size 100000 | |
| ``` | |
| ### Server Modes | |
| The server provides both a web interface and MCP server functionality in a single application. When MCP is enabled, Gradio automatically exposes the 4 EDA functions as MCP tools while still providing the web interface for direct interaction. | |
| ### Environment Variables | |
| The server can be configured through the following environment variables: | |
| #### Authentication | |
| - `HF_TOKEN`: HuggingFace access token for private datasets (optional) | |
| #### Server Configuration | |
| - `HF_EDA_PORT`: Server port (default: 7860) | |
| - `HF_EDA_HOST`: Server host (default: 127.0.0.1) | |
| - `HF_EDA_MCP_ENABLED`: Enable MCP server functionality (default: true) | |
| - `HF_EDA_SHARE`: Enable public sharing via Gradio (default: false) | |
| #### Logging Configuration | |
| - `HF_EDA_LOG_LEVEL`: Logging level - DEBUG, INFO, WARNING, ERROR (default: INFO) | |
| #### Performance and Caching | |
| - `HF_EDA_CACHE_DIR`: Directory for caching datasets (optional) | |
| - `HF_EDA_MAX_CACHE_SIZE`: Maximum cache size in MB (default: 1000) | |
| - `HF_EDA_MAX_SAMPLE_SIZE`: Maximum sample size for analysis (default: 50000) | |
| - `HF_EDA_MAX_CONCURRENT`: Maximum concurrent requests (default: 10) | |
| - `HF_EDA_REQUEST_TIMEOUT`: Request timeout in seconds (default: 300) | |
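To make the defaults concrete, here is a minimal sketch of reading these variables into a typed configuration object. `ServerConfig` and `load_config` are hypothetical names, and the set of strings accepted as "true" is an assumption. The variable names and default values are the ones listed above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUE_VALUES = {"1", "true", "yes", "on"}  # assumption: accepted spellings


def _as_bool(raw: Optional[str], default: bool) -> bool:
    """Parse a boolean environment value, keeping the default when unset."""
    return default if raw is None else raw.strip().lower() in _TRUE_VALUES


@dataclass(frozen=True)
class ServerConfig:
    hf_token: Optional[str]
    host: str
    port: int
    mcp_enabled: bool
    share: bool
    log_level: str
    cache_dir: Optional[str]
    max_cache_size_mb: int
    max_sample_size: int
    max_concurrent: int
    request_timeout_s: int


def load_config(env: Optional[Mapping[str, str]] = None) -> ServerConfig:
    """Build a ServerConfig from environment variables and documented defaults."""
    env = os.environ if env is None else env
    return ServerConfig(
        hf_token=env.get("HF_TOKEN"),
        host=env.get("HF_EDA_HOST", "127.0.0.1"),
        port=int(env.get("HF_EDA_PORT", "7860")),
        mcp_enabled=_as_bool(env.get("HF_EDA_MCP_ENABLED"), True),
        share=_as_bool(env.get("HF_EDA_SHARE"), False),
        log_level=env.get("HF_EDA_LOG_LEVEL", "INFO").upper(),
        cache_dir=env.get("HF_EDA_CACHE_DIR"),
        max_cache_size_mb=int(env.get("HF_EDA_MAX_CACHE_SIZE", "1000")),
        max_sample_size=int(env.get("HF_EDA_MAX_SAMPLE_SIZE", "50000")),
        max_concurrent=int(env.get("HF_EDA_MAX_CONCURRENT", "10")),
        request_timeout_s=int(env.get("HF_EDA_REQUEST_TIMEOUT", "300")),
    )
```

Accepting an explicit mapping instead of always reading `os.environ` makes the loader straightforward to unit test.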
| ### Configuration Examples | |
| #### Production Configuration | |
| ```bash | |
| export HF_TOKEN="your_token_here" | |
| export HF_EDA_HOST="0.0.0.0" | |
| export HF_EDA_PORT="8080" | |
| export HF_EDA_LOG_LEVEL="WARNING" | |
| export HF_EDA_CACHE_DIR="/var/cache/hf-eda" | |
| export HF_EDA_MAX_CONCURRENT="20" | |
| pdm run hf-eda-mcp | |
| ``` | |
| #### Development Configuration | |
| ```bash | |
| export HF_TOKEN="your_token_here" | |
| export HF_EDA_LOG_LEVEL="DEBUG" | |
| export HF_EDA_CACHE_DIR="./cache" | |
| pdm run hf-eda-mcp --verbose | |
| ``` | |
| ## Dataset Viewer Statistics Integration | |
| The `analyze_dataset_features` tool automatically uses HuggingFace's Dataset Viewer API when available, providing significant benefits: | |
| ### Benefits | |
| - **Full Dataset Analysis**: Analyzes entire datasets instead of samples | |
| - **No Download Required**: Statistics are pre-computed by HuggingFace | |
| - **Richer Statistics**: Includes histograms, frequencies, and multi-modal support | |
| - **Better Performance**: Faster response times with caching | |
| ### Supported Datasets | |
| Statistics are available for datasets with `builder_name="parquet"`. The tool automatically: | |
| 1. Checks if Dataset Viewer statistics are available | |
| 2. Uses full dataset statistics when available | |
| 3. Falls back to sample-based analysis for other datasets | |
| ### Supported Data Types | |
| The analysis tool provides comprehensive statistics for multiple data types: | |
| - **Numerical** (int, float): min, max, mean, median, std, histograms | |
| - **Categorical** (class_label, string_label): frequencies, unique counts | |
| - **Boolean** (bool): True/False distributions | |
| - **Text** (string_text): character length statistics, histograms | |
| - **Image** (image): dimension statistics, histograms | |
| - **Audio** (audio): duration statistics (seconds), histograms | |
| - **List** (list): length statistics, histograms | |
| ### Response Indicators | |
| Check the `sample_info` field in the response: | |
| - `sampling_method: "dataset_viewer_api"` - Using full dataset statistics | |
| - `sampling_method: "sequential_head"` - Using sample-based analysis | |
| - `represents_full_dataset: true/false` - Whether analysis covers the full dataset | |
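A client can use these fields to decide how much to trust the reported statistics. The helper below is a small sketch: `describe_coverage` is a hypothetical name, and describing `sequential_head` as a sample taken from the start of the split is an inference from the method's name rather than documented behavior.

```python
def describe_coverage(response):
    """Summarize how the statistics in an analysis response were produced."""
    info = response.get("sample_info") or {}
    method = info.get("sampling_method", "unknown")
    full = bool(info.get("represents_full_dataset", False))

    if method == "dataset_viewer_api":
        source = "precomputed Dataset Viewer statistics"
    elif method == "sequential_head":
        # Inferred from the name: rows read sequentially from the start.
        source = "a local sample taken from the start of the split"
    else:
        source = f"an unrecognized method ({method!r})"

    scope = "the full dataset" if full else "a subset of the dataset"
    return f"Statistics come from {source} and cover {scope}."
```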
| ## Example Usage | |
| Once connected to an MCP client, you can use the tools like this: | |
| ``` | |
| # Get metadata for the IMDB dataset | |
| Use the get_dataset_metadata tool with dataset_id="imdb" | |
| # Sample 5 rows from the training split | |
| Use the get_dataset_sample tool with dataset_id="imdb", split="train", num_samples=5 | |
| # Analyze features of the GLUE dataset (CoLA configuration) | |
| Use the analyze_dataset_features tool with dataset_id="glue", config_name="cola", sample_size=500 | |
| # Search for text in the IMDB dataset | |
| Use the search_text_in_dataset tool with dataset_id="imdb", config_name="plain_text", split="train", query="great movie", offset=0, length=10 | |
| # Search for a specific term in the SQuAD dataset | |
| Use the search_text_in_dataset tool with dataset_id="squad", config_name="plain_text", split="train", query="president", offset=0, length=5 | |
| ``` | |
| ## API Endpoints | |
| When the server is running, the following HTTP endpoints are also available: | |
| - **MCP Schema**: `http://localhost:7860/gradio_api/mcp/schema` | |
| - **API Documentation**: `http://localhost:7860/?view=api` | |
| - **Web Interface**: `http://localhost:7860` | |
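A quick way to confirm that the MCP tools are exposed is to fetch the schema endpoint. The sketch below uses only the standard library. The function names are hypothetical, the default host and port assume a local server with default settings, and nothing is assumed about the structure of the returned JSON.

```python
import json
import urllib.request


def mcp_schema_url(host="localhost", port=7860):
    """Return the MCP schema URL for a server at the given host and port."""
    return f"http://{host}:{port}/gradio_api/mcp/schema"


def fetch_mcp_schema(host="localhost", port=7860, timeout=10.0):
    """Download and parse the MCP schema; requires a running server."""
    with urllib.request.urlopen(mcp_schema_url(host, port), timeout=timeout) as resp:
        return json.load(resp)
```

With the server running locally, `fetch_mcp_schema()` returns the parsed schema, which can then be inspected for the tools listed in this guide.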
| ## Troubleshooting | |
| ### Authentication Issues | |
| - Ensure the `HF_TOKEN` environment variable is set for private datasets | |
| - Check that your HuggingFace token has appropriate permissions | |
| ### Dataset Not Found | |
| - Verify the dataset ID is correct and exists on HuggingFace Hub | |
| - Check if the dataset requires authentication | |
| ### Performance Issues | |
| - Reduce `sample_size` for large datasets | |
| - Use streaming mode (enabled by default) for better memory efficiency | |
| ### Search Tool Issues | |
| - **Dataset not in parquet format**: The search tool only works with parquet datasets. If you get a `DatasetNotParquetError`, try a different dataset or check whether the dataset has a parquet configuration. | |
| - **No text columns found**: The search tool requires at least one text/string column. If you get a `NoTextColumnsError`, check the dataset metadata first to verify that the dataset has text columns. | |