KhalilGuetari

Add a search text in dataset tool

ca96eb9 18 days ago

	# MCP Server Usage Guide

	## Overview

	The HF EDA MCP Server provides four main tools for exploratory data analysis of HuggingFace datasets via the Model Context Protocol (MCP).

	## Available MCP Tools

	The following 4 tools are automatically exposed by Gradio when `mcp_server=True`:

	### 1. `get_dataset_metadata`
	Retrieve comprehensive metadata for a HuggingFace dataset.

	Parameters:
	- `dataset_id` (string): HuggingFace dataset identifier (e.g., 'imdb', 'squad')
	- `config_name` (string, optional): Configuration name for multi-config datasets

	Returns: JSON object with dataset metadata including size, features, splits, and configuration details.

	### 2. `get_dataset_sample`
	Retrieve a sample of rows from a HuggingFace dataset.

	Parameters:
	- `dataset_id` (string): HuggingFace dataset identifier
	- `split` (string, default: 'train'): Dataset split to sample from
	- `num_samples` (number, default: 10): Number of samples to retrieve (max: 10000)
	- `config_name` (string, optional): Configuration name for multi-config datasets

	Returns: JSON object with sampled data and metadata.

	### 3. `analyze_dataset_features`
	Perform exploratory analysis on dataset features with automatic optimization.

	Parameters:
	- `dataset_id` (string): HuggingFace dataset identifier
	- `split` (string, default: 'train'): Dataset split to analyze
	- `sample_size` (number, default: 1000): Number of samples for analysis (max: 50000, only used for fallback)
	- `config_name` (string, optional): Configuration name for multi-config datasets

	Returns: JSON object with comprehensive feature analysis including:
	- Feature types (numerical, categorical, text, image, audio)
	- Statistical measures (mean, median, std, histograms)
	- Missing value analysis
	- Unique value counts
	- Sample values

	Analysis Methods:
	- Primary: Uses HuggingFace Dataset Viewer API statistics when available (parquet datasets)
	  - Analyzes the full dataset without downloading data
	  - Provides complete statistics with histograms
	  - More efficient and accurate
	- Fallback: Sample-based analysis for non-parquet datasets
	  - Downloads and analyzes a sample of the dataset
	  - Computes statistics locally
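
As a rough illustration of the fallback path, the sketch below computes a few of the statistics listed above for one numeric column of a small sample. The function name `summarize_numeric_column` and the output keys are assumptions made for this example; the server's own implementation may differ.

```python
import statistics
from typing import Any, Sequence


def summarize_numeric_column(rows: Sequence[dict], column: str) -> dict[str, Any]:
    """Compute simple statistics for one numeric column of sampled rows.

    Hypothetical helper: the key names below are illustrative only.
    """
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    summary: dict[str, Any] = {
        "count": len(values),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "mean": None,
        "median": None,
        "std": None,
    }
    if present:
        summary["mean"] = statistics.fmean(present)
        summary["median"] = statistics.median(present)
        # The sample standard deviation needs at least two values.
        summary["std"] = statistics.stdev(present) if len(present) > 1 else 0.0
    return summary


sample = [{"score": 1.0}, {"score": 3.0}, {"score": None}, {"score": 5.0}]
print(summarize_numeric_column(sample, "score"))
```

Because these numbers come from a sample, they only approximate the full-dataset statistics that the Dataset Viewer API returns.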

	### 4. `search_text_in_dataset`
	Search for text in text columns of a dataset using the Dataset Viewer API.

	Parameters:
	- `dataset_id` (string): HuggingFace dataset identifier
	- `config_name` (string): Configuration name (required for search)
	- `split` (string): Dataset split to search in
	- `query` (string): Search query text
	- `offset` (number, default: 0): Offset for pagination
	- `length` (number, default: 10): Number of results to return (max: 100)

	Returns: JSON object with search results including:
	- `features`: List of features from the dataset, including column names and data types
	- `rows`: List of matching rows with content from each column
	- `num_rows_total`: Total number of examples in the split
	- `num_rows_per_page`: Number of examples in the current page
	- `partial`: Whether the response is partial (true if the dataset is too large to search completely)

	Limitations:
	- Only text columns are searched
	- Only parquet datasets are supported (builder_name="parquet")
	- Search is performed by the Dataset Viewer API, not locally

	Validation:
	- The tool validates that the dataset is in parquet format before attempting search
	- The tool validates that the dataset has at least one text/string column
	- If validation fails, a descriptive error message is returned with suggestions
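
The search parameters map closely onto the public Dataset Viewer `/search` endpoint. The sketch below builds such a request URL and computes page offsets. Whether the tool constructs its requests exactly this way is an assumption, and `build_search_url` and `page_offsets` are hypothetical helper names.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://datasets-server.huggingface.co/search"
MAX_LENGTH = 100  # per-request cap documented for the `length` parameter


def build_search_url(dataset_id: str, config_name: str, split: str,
                     query: str, offset: int = 0, length: int = 10) -> str:
    """Build a Dataset Viewer search URL (hypothetical helper)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if not 1 <= length <= MAX_LENGTH:
        raise ValueError(f"length must be between 1 and {MAX_LENGTH}")
    params = {
        "dataset": dataset_id,
        "config": config_name,
        "split": split,
        "query": query,
        "offset": offset,
        "length": length,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def page_offsets(total: int, length: int = 10) -> list[int]:
    """Return the offsets needed to page through `total` results."""
    return list(range(0, total, length))


print(build_search_url("imdb", "plain_text", "train", "great movie"))
print(page_offsets(25, 10))
```

To walk through every page, pass each offset together with the same `length` to successive `search_text_in_dataset` calls.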

	## MCP Client Configuration

	### Using with Claude Desktop

	Add this configuration to your MCP settings:

	```json
	{
	  "mcpServers": {
	    "hf-eda-mcp-server": {
	      "command": "pdm",
	      "args": ["run", "hf-eda-mcp"],
	      "env": {
	        "HF_TOKEN": "your_huggingface_token_here"
	      }
	    }
	  }
	}
	```

	### Using with Hosted Server

	If the server is running on a remote host:

	```json
	{
	  "mcpServers": {
	    "hf-eda-mcp-server": {
	      "url": "https://your-server.com/gradio_api/mcp/sse",
	      "headers": {
	        "hf-api-token": "your_huggingface_token_here"
	      }
	    }
	  }
	}
	```

	## Starting the Server

	### Local Development
	```bash
	# Start with MCP server enabled (default)
	pdm run hf-eda-mcp

	# Start on custom port
	pdm run hf-eda-mcp --port 8080

	# Start with verbose logging
	pdm run hf-eda-mcp --verbose

	# Start without MCP server functionality
	pdm run hf-eda-mcp --no-mcp

	# Start with custom host (listen on all interfaces)
	pdm run hf-eda-mcp --host 0.0.0.0

	# Start with public sharing enabled
	pdm run hf-eda-mcp --share

	# Start with custom cache directory
	pdm run hf-eda-mcp --cache-dir /path/to/cache

	# Start with custom maximum sample size
	pdm run hf-eda-mcp --max-sample-size 100000
	```

	### Server Modes

	The server provides both a web interface and MCP server functionality in a single application. When MCP is enabled, Gradio automatically exposes the 4 EDA functions as MCP tools while still providing the web interface for direct interaction.

	### Environment Variables

	The server supports comprehensive configuration via environment variables:

	#### Authentication
	- `HF_TOKEN`: HuggingFace access token for private datasets (optional)

	#### Server Configuration
	- `HF_EDA_PORT`: Server port (default: 7860)
	- `HF_EDA_HOST`: Server host (default: 127.0.0.1)
	- `HF_EDA_MCP_ENABLED`: Enable MCP server functionality (default: true)
	- `HF_EDA_SHARE`: Enable public sharing via Gradio (default: false)

	#### Logging Configuration
	- `HF_EDA_LOG_LEVEL`: Logging level - DEBUG, INFO, WARNING, ERROR (default: INFO)

	#### Performance and Caching
	- `HF_EDA_CACHE_DIR`: Directory for caching datasets (optional)
	- `HF_EDA_MAX_CACHE_SIZE`: Maximum cache size in MB (default: 1000)
	- `HF_EDA_MAX_SAMPLE_SIZE`: Maximum sample size for analysis (default: 50000)
	- `HF_EDA_MAX_CONCURRENT`: Maximum concurrent requests (default: 10)
	- `HF_EDA_REQUEST_TIMEOUT`: Request timeout in seconds (default: 300)
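
One way to picture how these variables and their defaults fit together is the sketch below, which reads a subset of them into a settings object. The `ServerSettings` class, the `load_settings` function, and the accepted boolean spellings are assumptions for illustration, not the server's actual code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Assumed truthy spellings; the real server may accept a different set.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ServerSettings:
    host: str
    port: int
    mcp_enabled: bool
    share: bool
    log_level: str
    cache_dir: Optional[str]
    max_sample_size: int
    request_timeout: int
    hf_token: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> ServerSettings:
    """Read the documented variables, falling back to the documented defaults."""
    return ServerSettings(
        host=env.get("HF_EDA_HOST", "127.0.0.1"),
        port=int(env.get("HF_EDA_PORT", "7860")),
        mcp_enabled=_as_bool(env.get("HF_EDA_MCP_ENABLED", "true")),
        share=_as_bool(env.get("HF_EDA_SHARE", "false")),
        log_level=env.get("HF_EDA_LOG_LEVEL", "INFO").upper(),
        cache_dir=env.get("HF_EDA_CACHE_DIR"),
        max_sample_size=int(env.get("HF_EDA_MAX_SAMPLE_SIZE", "50000")),
        request_timeout=int(env.get("HF_EDA_REQUEST_TIMEOUT", "300")),
        hf_token=env.get("HF_TOKEN"),
    )


print(load_settings({"HF_EDA_PORT": "8080", "HF_EDA_LOG_LEVEL": "debug"}))
```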

	### Configuration Examples

	#### Production Configuration
	```bash
	export HF_TOKEN="your_token_here"
	export HF_EDA_HOST="0.0.0.0"
	export HF_EDA_PORT="8080"
	export HF_EDA_LOG_LEVEL="WARNING"
	export HF_EDA_CACHE_DIR="/var/cache/hf-eda"
	export HF_EDA_MAX_CONCURRENT="20"
	pdm run hf-eda-mcp
	```

	#### Development Configuration
	```bash
	export HF_TOKEN="your_token_here"
	export HF_EDA_LOG_LEVEL="DEBUG"
	export HF_EDA_CACHE_DIR="./cache"
	pdm run hf-eda-mcp --verbose
	```

	## Dataset Viewer Statistics Integration

	The `analyze_dataset_features` tool automatically uses HuggingFace's Dataset Viewer API when available, providing significant benefits:

	### Benefits
	- Full Dataset Analysis: Analyzes entire datasets instead of samples
	- No Download Required: Statistics are pre-computed by HuggingFace
	- Richer Statistics: Includes histograms, frequencies, and multi-modal support
	- Better Performance: Faster response times with caching

	### Supported Datasets
	Statistics are available for datasets with `builder_name="parquet"`. The tool automatically:
	1. Checks if Dataset Viewer statistics are available
	2. Uses full dataset statistics when available
	3. Falls back to sample-based analysis for other datasets
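
The three steps above can be sketched as a small control flow. The two callables stand in for the Dataset Viewer lookup and the local sample analysis. Their names and the `statistics` key are assumptions, while the `sample_info` values follow the Response Indicators documented in this guide.

```python
from typing import Callable, Optional


def analyze_with_fallback(
    fetch_viewer_stats: Callable[[], Optional[dict]],
    analyze_sample: Callable[[], dict],
) -> dict:
    """Prefer full-dataset statistics; otherwise analyze a sample.

    `fetch_viewer_stats` is assumed to return None when statistics are
    unavailable (for example, for non-parquet datasets).
    """
    stats = fetch_viewer_stats()
    if stats is not None:
        return {
            "statistics": stats,
            "sample_info": {
                "sampling_method": "dataset_viewer_api",
                "represents_full_dataset": True,
            },
        }
    return {
        "statistics": analyze_sample(),
        "sample_info": {
            "sampling_method": "sequential_head",
            "represents_full_dataset": False,
        },
    }


# Stubbed usage: the viewer has no statistics, so the sample path runs.
result = analyze_with_fallback(lambda: None, lambda: {"label": {"unique": 2}})
print(result["sample_info"]["sampling_method"])
```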

	### Supported Data Types
	The analysis tool provides comprehensive statistics for multiple data types:
	- Numerical (int, float): min, max, mean, median, std, histograms
	- Categorical (class_label, string_label): frequencies, unique counts
	- Boolean (bool): True/False distributions
	- Text (string_text): character length statistics, histograms
	- Image (image): dimension statistics, histograms
	- Audio (audio): duration statistics (seconds), histograms
	- List (list): length statistics, histograms

	### Response Indicators
	Check the `sample_info` field in the response:
	- `sampling_method: "dataset_viewer_api"` - Using full dataset statistics
	- `sampling_method: "sequential_head"` - Using sample-based analysis
	- `represents_full_dataset: true/false` - Whether analysis covers the full dataset
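
A client can use these fields to decide how much to trust the reported statistics. The sketch below is a minimal check, assuming the response has already been parsed into a Python dict. The helper name `covers_full_dataset` is hypothetical.

```python
def covers_full_dataset(response: dict) -> bool:
    """Return True when the analysis reflects the whole dataset."""
    info = response.get("sample_info", {})
    if "represents_full_dataset" in info:
        return bool(info["represents_full_dataset"])
    # Fall back to the sampling method when the explicit flag is absent.
    return info.get("sampling_method") == "dataset_viewer_api"


response = {"sample_info": {"sampling_method": "sequential_head",
                            "represents_full_dataset": False}}
if not covers_full_dataset(response):
    print("Statistics are based on a sample; treat them as estimates.")
```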

	## Example Usage

	Once connected to an MCP client, you can use the tools like this:

	```
	# Get metadata for the IMDB dataset
	Use the get_dataset_metadata tool with dataset_id="imdb"

	# Sample 5 rows from the training split
	Use the get_dataset_sample tool with dataset_id="imdb", split="train", num_samples=5

	# Analyze features of the GLUE dataset (CoLA configuration)
	Use the analyze_dataset_features tool with dataset_id="glue", config_name="cola", sample_size=500

	# Search for text in the IMDB dataset
	Use the search_text_in_dataset tool with dataset_id="imdb", config_name="plain_text", split="train", query="great movie", offset=0, length=10

	# Search for a specific term in the SQuAD dataset
	Use the search_text_in_dataset tool with dataset_id="squad", config_name="plain_text", split="train", query="president", offset=0, length=5
	```
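
Behind prompts like these, an MCP client sends a JSON-RPC `tools/call` request. The sketch below builds such a payload for the first example. It assumes the tool names are exposed exactly as written in this guide; some Gradio deployments may prefix them, so check the tool list your client reports. `build_tool_call` is a hypothetical helper.

```python
import json
from itertools import count

_request_ids = count(1)  # simple monotonically increasing request ids


def build_tool_call(name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` JSON-RPC 2.0 request payload."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = build_tool_call("get_dataset_metadata", {"dataset_id": "imdb"})
print(json.dumps(request, indent=2))
```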

	## API Endpoints

	When the server is running, you can also access the tools via HTTP API:

	- MCP Schema: `http://localhost:7860/gradio_api/mcp/schema`
	- API Documentation: `http://localhost:7860/?view=api`
	- Web Interface: `http://localhost:7860`
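
As a quick check that the server is reachable, the sketch below derives these URLs from a host and port and fetches the MCP schema with the standard library. It assumes a locally running server on the default port and makes no assumption about the shape of the schema document. Both function names are hypothetical.

```python
import json
from urllib.request import urlopen


def endpoint_urls(host: str = "localhost", port: int = 7860) -> dict[str, str]:
    """Return the endpoint URLs listed above for a given host and port."""
    base = f"http://{host}:{port}"
    return {
        "schema": f"{base}/gradio_api/mcp/schema",
        "api_docs": f"{base}/?view=api",
        "web": base,
    }


def fetch_mcp_schema(host: str = "localhost", port: int = 7860,
                     timeout: float = 10.0):
    """Download and parse the MCP schema; requires a running server."""
    with urlopen(endpoint_urls(host, port)["schema"], timeout=timeout) as response:
        return json.load(response)


print(endpoint_urls()["schema"])
```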

	## Troubleshooting

	### Authentication Issues
	- Ensure the `HF_TOKEN` environment variable is set when accessing private datasets
	- Check that your HuggingFace token has appropriate permissions

	### Dataset Not Found
	- Verify the dataset ID is correct and exists on HuggingFace Hub
	- Check if the dataset requires authentication

	### Performance Issues
	- Reduce `sample_size` for large datasets
	- Use streaming mode (enabled by default) for better memory efficiency

	### Search Tool Issues
	- Dataset not in parquet format: The search tool only works with parquet datasets. If you get a "DatasetNotParquetError", try using a different dataset or check if the dataset has a parquet configuration
	- No text columns found: The search tool requires at least one text/string column. If you get a "NoTextColumnsError", verify that the dataset has text columns by checking the dataset metadata first