# MCP Server Usage Guide
## Overview
The HF EDA MCP Server provides four main tools for exploratory data analysis of HuggingFace datasets via the Model Context Protocol (MCP).
## Available MCP Tools
The following 4 tools are automatically exposed by Gradio when `mcp_server=True`:
### 1. `get_dataset_metadata`
Retrieve comprehensive metadata for a HuggingFace dataset.
**Parameters:**
- `dataset_id` (string): HuggingFace dataset identifier (e.g., 'imdb', 'squad')
- `config_name` (string, optional): Configuration name for multi-config datasets
**Returns:** JSON object with dataset metadata including size, features, splits, and configuration details.
### 2. `get_dataset_sample`
Retrieve a sample of rows from a HuggingFace dataset.
**Parameters:**
- `dataset_id` (string): HuggingFace dataset identifier
- `split` (string, default: 'train'): Dataset split to sample from
- `num_samples` (number, default: 10): Number of samples to retrieve (max: 10000)
- `config_name` (string, optional): Configuration name for multi-config datasets
**Returns:** JSON object with sampled data and metadata.
### 3. `analyze_dataset_features`
Perform exploratory analysis on dataset features with automatic optimization.
**Parameters:**
- `dataset_id` (string): HuggingFace dataset identifier
- `split` (string, default: 'train'): Dataset split to analyze
- `sample_size` (number, default: 1000): Number of samples for analysis (max: 50000, only used for fallback)
- `config_name` (string, optional): Configuration name for multi-config datasets
**Returns:** JSON object with comprehensive feature analysis including:
- Feature types (numerical, categorical, text, image, audio)
- Statistical measures (mean, median, std, histograms)
- Missing value analysis
- Unique value counts
- Sample values
**Analysis Methods:**
- **Primary**: Uses HuggingFace Dataset Viewer API statistics when available (parquet datasets)
  - Analyzes the full dataset without downloading data
  - Provides complete statistics with histograms
  - More efficient and accurate
- **Fallback**: Sample-based analysis for non-parquet datasets
  - Downloads and analyzes a sample of the dataset
  - Computes statistics locally
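The primary path relies on pre-computed statistics served by the Dataset Viewer API. The sketch below shows how a request URL for its `/statistics` endpoint could be built. It assumes the public `https://datasets-server.huggingface.co` base URL; the helper name `statistics_url` is hypothetical and is not part of this server, which may use a different client internally.

```python
from urllib.parse import urlencode

# Public Dataset Viewer API base URL (assumption: this server may reach it differently).
DATASET_VIEWER_BASE = "https://datasets-server.huggingface.co"


def statistics_url(dataset_id: str, config_name: str, split: str = "train") -> str:
    """Build the URL of the pre-computed statistics for one split (hypothetical helper)."""
    query = urlencode({"dataset": dataset_id, "config": config_name, "split": split})
    return f"{DATASET_VIEWER_BASE}/statistics?{query}"


print(statistics_url("imdb", "plain_text"))
```

Fetching that URL returns statistics only for datasets the Dataset Viewer has processed; for anything else the tool falls back to sampling.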
### 4. `search_text_in_dataset`
Search for text in text columns of a dataset using the Dataset Viewer API.
**Parameters:**
- `dataset_id` (string): HuggingFace dataset identifier
- `config_name` (string): Configuration name (required for search)
- `split` (string): Dataset split to search in
- `query` (string): Search query text
- `offset` (number, default: 0): Offset for pagination
- `length` (number, default: 10): Number of results to return (max: 100)
**Returns:** JSON object with search results including:
- `features`: List of features from the dataset, including column names and data types
- `rows`: List of matching rows with content from each column
- `num_rows_total`: Total number of examples in the split
- `num_rows_per_page`: Number of examples in the current page
- `partial`: Whether the response is partial (true if the dataset is too large to search completely)
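Because results are paginated through `offset` and `length`, a client usually loops over pages. The following is a minimal sketch of such a loop. `search_page` is a hypothetical callable standing in for however your MCP client invokes `search_text_in_dataset`, and `fake_search` only simulates a response with the `rows` and `num_rows_total` fields described above.

```python
from typing import Any, Callable, Dict, Iterator


def iter_search_rows(
    search_page: Callable[[int, int], Dict[str, Any]],
    length: int = 10,
    max_rows: int = 100,
) -> Iterator[Any]:
    """Yield rows page by page until the results or max_rows are exhausted."""
    offset = 0
    yielded = 0
    while yielded < max_rows:
        page = search_page(offset, length)
        rows = page.get("rows", [])
        if not rows:
            break  # an empty page means there is nothing left to read
        for row in rows:
            if yielded >= max_rows:
                return
            yield row
            yielded += 1
        offset += len(rows)
        if offset >= int(page.get("num_rows_total", 0)):
            break


def fake_search(offset: int, length: int) -> Dict[str, Any]:
    """Simulated tool response over 25 rows (illustration only)."""
    data = [{"row_idx": i} for i in range(25)]
    return {"rows": data[offset:offset + length], "num_rows_total": len(data)}


rows = list(iter_search_rows(fake_search, length=10))
print(len(rows))
```

The `max_rows` cap keeps a broad query from paging through a very large split.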
**Limitations:**
- Only text columns are searched
- Only parquet datasets are supported (builder_name="parquet")
- Search is performed by the Dataset Viewer API, not locally
**Validation:**
- The tool validates that the dataset is in parquet format before attempting search
- The tool validates that the dataset has at least one text/string column
- If validation fails, a descriptive error message is returned with suggestions
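The two validation steps can be sketched as below. This is an illustration, not the server's actual code: the exception classes are local stand-ins that reuse the error names mentioned in the troubleshooting section, `validate_searchable` is a hypothetical helper, and the feature layout (`name` plus a `type` dict with `dtype`) is assumed to match the `features` list returned by the Dataset Viewer API.

```python
from typing import Any, Dict, List


class DatasetNotParquetError(ValueError):
    """Stand-in for the error reported for non-parquet datasets."""


class NoTextColumnsError(ValueError):
    """Stand-in for the error reported when no string column exists."""


def text_columns(features: List[Dict[str, Any]]) -> List[str]:
    """Return the names of columns whose type is a plain string value."""
    return [
        feature["name"]
        for feature in features
        if feature.get("type", {}).get("dtype") == "string"
    ]


def validate_searchable(builder_name: str, features: List[Dict[str, Any]]) -> List[str]:
    """Raise a descriptive error if search cannot run; otherwise return the text columns."""
    if builder_name != "parquet":
        raise DatasetNotParquetError(
            f"builder_name is {builder_name!r}; search requires a parquet dataset"
        )
    columns = text_columns(features)
    if not columns:
        raise NoTextColumnsError("dataset has no string columns to search")
    return columns


example_features = [
    {"name": "text", "type": {"dtype": "string", "_type": "Value"}},
    {"name": "label", "type": {"names": ["neg", "pos"], "_type": "ClassLabel"}},
]
print(validate_searchable("parquet", example_features))
```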
## MCP Client Configuration
### Using with Claude Desktop
Add this configuration to your MCP settings:
```json
{
  "mcpServers": {
    "hf-eda-mcp-server": {
      "command": "pdm",
      "args": ["run", "hf-eda-mcp"],
      "env": {
        "HF_TOKEN": "your_huggingface_token_here"
      }
    }
  }
}
```
### Using with Hosted Server
If the server is running on a remote host:
```json
{
  "mcpServers": {
    "hf-eda-mcp-server": {
      "url": "https://your-server.com/gradio_api/mcp/sse",
      "headers": {
        "hf-api-token": "your_huggingface_token_here"
      }
    }
  }
}
```
## Starting the Server
### Local Development
```bash
# Start with MCP server enabled (default)
pdm run hf-eda-mcp
# Start on custom port
pdm run hf-eda-mcp --port 8080
# Start with verbose logging
pdm run hf-eda-mcp --verbose
# Start without MCP server functionality
pdm run hf-eda-mcp --no-mcp
# Start with custom host (listen on all interfaces)
pdm run hf-eda-mcp --host 0.0.0.0
# Start with public sharing enabled
pdm run hf-eda-mcp --share
# Start with custom cache directory
pdm run hf-eda-mcp --cache-dir /path/to/cache
# Start with custom maximum sample size
pdm run hf-eda-mcp --max-sample-size 100000
```
### Server Modes
The server provides both a web interface and MCP server functionality in a single application. When MCP is enabled, Gradio automatically exposes the 4 EDA functions as MCP tools while still providing the web interface for direct interaction.
### Environment Variables
The server supports comprehensive configuration via environment variables:
#### Authentication
- `HF_TOKEN`: HuggingFace access token for private datasets (optional)
#### Server Configuration
- `HF_EDA_PORT`: Server port (default: 7860)
- `HF_EDA_HOST`: Server host (default: 127.0.0.1)
- `HF_EDA_MCP_ENABLED`: Enable MCP server functionality (default: true)
- `HF_EDA_SHARE`: Enable public sharing via Gradio (default: false)
#### Logging Configuration
- `HF_EDA_LOG_LEVEL`: Logging level - DEBUG, INFO, WARNING, ERROR (default: INFO)
#### Performance and Caching
- `HF_EDA_CACHE_DIR`: Directory for caching datasets (optional)
- `HF_EDA_MAX_CACHE_SIZE`: Maximum cache size in MB (default: 1000)
- `HF_EDA_MAX_SAMPLE_SIZE`: Maximum sample size for analysis (default: 50000)
- `HF_EDA_MAX_CONCURRENT`: Maximum concurrent requests (default: 10)
- `HF_EDA_REQUEST_TIMEOUT`: Request timeout in seconds (default: 300)
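To make the defaults concrete, here is a minimal sketch of how these variables could be read in Python. The `ServerSettings` class, `load_settings` function, and the accepted boolean spellings are assumptions for illustration; only the variable names and default values come from the lists above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings (assumed; the server may accept others)."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ServerSettings:
    host: str
    port: int
    mcp_enabled: bool
    share: bool
    log_level: str
    cache_dir: Optional[str]
    max_cache_size_mb: int
    max_sample_size: int
    max_concurrent: int
    request_timeout_s: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> ServerSettings:
    """Read HF_EDA_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return ServerSettings(
        host=env.get("HF_EDA_HOST", "127.0.0.1"),
        port=int(env.get("HF_EDA_PORT", "7860")),
        mcp_enabled=_as_bool(env.get("HF_EDA_MCP_ENABLED", "true")),
        share=_as_bool(env.get("HF_EDA_SHARE", "false")),
        log_level=env.get("HF_EDA_LOG_LEVEL", "INFO").upper(),
        cache_dir=env.get("HF_EDA_CACHE_DIR"),
        max_cache_size_mb=int(env.get("HF_EDA_MAX_CACHE_SIZE", "1000")),
        max_sample_size=int(env.get("HF_EDA_MAX_SAMPLE_SIZE", "50000")),
        max_concurrent=int(env.get("HF_EDA_MAX_CONCURRENT", "10")),
        request_timeout_s=int(env.get("HF_EDA_REQUEST_TIMEOUT", "300")),
    )


settings = load_settings({"HF_EDA_PORT": "8080", "HF_EDA_LOG_LEVEL": "warning"})
print(settings.port, settings.log_level, settings.mcp_enabled)
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the real environment.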
### Configuration Examples
#### Production Configuration
```bash
export HF_TOKEN="your_token_here"
export HF_EDA_HOST="0.0.0.0"
export HF_EDA_PORT="8080"
export HF_EDA_LOG_LEVEL="WARNING"
export HF_EDA_CACHE_DIR="/var/cache/hf-eda"
export HF_EDA_MAX_CONCURRENT="20"
pdm run hf-eda-mcp
```
#### Development Configuration
```bash
export HF_TOKEN="your_token_here"
export HF_EDA_LOG_LEVEL="DEBUG"
export HF_EDA_CACHE_DIR="./cache"
pdm run hf-eda-mcp --verbose
```
## Dataset Viewer Statistics Integration
The `analyze_dataset_features` tool automatically uses HuggingFace's Dataset Viewer API whenever pre-computed statistics are available for the dataset.
### Benefits
- **Full Dataset Analysis**: Analyzes entire datasets instead of samples
- **No Download Required**: Statistics are pre-computed by HuggingFace
- **Richer Statistics**: Includes histograms, frequencies, and multi-modal support
- **Better Performance**: Faster response times with caching
### Supported Datasets
Statistics are available for datasets with `builder_name="parquet"`. The tool automatically:
1. Checks if Dataset Viewer statistics are available
2. Uses full dataset statistics when available
3. Falls back to sample-based analysis for other datasets
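The three steps above can be sketched as a small function. Both callables are hypothetical placeholders for the real lookups, and the `statistics` key is an assumed name; the `sample_info` values mirror the ones documented under Response Indicators.

```python
from typing import Any, Callable, Dict, Optional


def analyze_with_fallback(
    fetch_viewer_stats: Callable[[], Optional[Dict[str, Any]]],
    analyze_sample: Callable[[int], Dict[str, Any]],
    sample_size: int = 1000,
) -> Dict[str, Any]:
    """Prefer full-dataset statistics; otherwise analyze a local sample."""
    stats = fetch_viewer_stats()  # returns None when no statistics are available
    if stats is not None:
        return {
            "statistics": stats,
            "sample_info": {
                "sampling_method": "dataset_viewer_api",
                "represents_full_dataset": True,
            },
        }
    # sample_size only matters on this fallback path
    return {
        "statistics": analyze_sample(sample_size),
        "sample_info": {
            "sampling_method": "sequential_head",
            "represents_full_dataset": False,
        },
    }


result = analyze_with_fallback(lambda: None, lambda n: {"rows_analyzed": n}, sample_size=500)
print(result["sample_info"]["sampling_method"])
```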
### Supported Data Types
The analysis tool provides comprehensive statistics for multiple data types:
- **Numerical** (int, float): min, max, mean, median, std, histograms
- **Categorical** (class_label, string_label): frequencies, unique counts
- **Boolean** (bool): True/False distributions
- **Text** (string_text): character length statistics, histograms
- **Image** (image): dimension statistics, histograms
- **Audio** (audio): duration statistics (seconds), histograms
- **List** (list): length statistics, histograms
### Response Indicators
Check the `sample_info` field in the response:
- `sampling_method: "dataset_viewer_api"` - Using full dataset statistics
- `sampling_method: "sequential_head"` - Using sample-based analysis
- `represents_full_dataset: true/false` - Whether analysis covers the full dataset
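A client can turn these indicators into a short human-readable note. The sketch below assumes the tool response has already been parsed into a dict with `sample_info` at the top level; `describe_coverage` is a hypothetical helper.

```python
from typing import Any, Dict


def describe_coverage(response: Dict[str, Any]) -> str:
    """Summarize whether an analysis covers the full dataset or only a sample."""
    info = response.get("sample_info", {})
    method = info.get("sampling_method", "unknown")
    if info.get("represents_full_dataset"):
        return f"full dataset ({method})"
    return f"sample only ({method})"


print(describe_coverage({
    "sample_info": {"sampling_method": "sequential_head", "represents_full_dataset": False}
}))
```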
## Example Usage
Once connected to an MCP client, you can use the tools like this:
```
# Get metadata for the IMDB dataset
Use the get_dataset_metadata tool with dataset_id="imdb"
# Sample 5 rows from the training split
Use the get_dataset_sample tool with dataset_id="imdb", split="train", num_samples=5
# Analyze features of the GLUE dataset (CoLA configuration)
Use the analyze_dataset_features tool with dataset_id="glue", config_name="cola", sample_size=500
# Search for text in the IMDB dataset
Use the search_text_in_dataset tool with dataset_id="imdb", config_name="plain_text", split="train", query="great movie", offset=0, length=10
# Search for a specific term in the SQuAD dataset
Use the search_text_in_dataset tool with dataset_id="squad", config_name="plain_text", split="train", query="president", offset=0, length=5
```
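Behind prompts like these, an MCP client sends a JSON-RPC `tools/call` request. The sketch below builds such a payload for illustration. The helper `tools_call_request` is hypothetical, and the exact tool names exposed by Gradio are an assumption (they may carry a prefix), so check the MCP schema of your running server before relying on them.

```python
import json
from typing import Any, Dict


def tools_call_request(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# Tool name taken from this guide; the server may expose it under a different name.
payload = tools_call_request(
    1,
    "get_dataset_sample",
    {"dataset_id": "imdb", "split": "train", "num_samples": 5},
)
print(payload)
```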
## API Endpoints
When the server is running, the tools are also reachable over HTTP:
- **MCP Schema**: `http://localhost:7860/gradio_api/mcp/schema`
- **API Documentation**: `http://localhost:7860/?view=api`
- **Web Interface**: `http://localhost:7860`
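For scripted checks, the endpoints above can be derived from the host and port. This is a minimal sketch using only the standard library: `server_urls` and `fetch_mcp_schema` are hypothetical helpers, and the assumption that the schema endpoint returns JSON should be verified against your running server.

```python
import json
from typing import Any, Dict
from urllib.request import urlopen


def server_urls(host: str = "localhost", port: int = 7860) -> Dict[str, str]:
    """Return the endpoints listed above for a given host and port."""
    base = f"http://{host}:{port}"
    return {
        "web": base,
        "api_docs": f"{base}/?view=api",
        "mcp_schema": f"{base}/gradio_api/mcp/schema",
    }


def fetch_mcp_schema(host: str = "localhost", port: int = 7860, timeout: float = 10.0) -> Any:
    """Download and parse the MCP schema (requires a running server)."""
    with urlopen(server_urls(host, port)["mcp_schema"], timeout=timeout) as response:
        return json.load(response)


print(server_urls()["mcp_schema"])
```

`fetch_mcp_schema` is defined but not called here, since it only succeeds while the server is up.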
## Troubleshooting
### Authentication Issues
- Ensure `HF_TOKEN` environment variable is set for private datasets
- Check that your HuggingFace token has appropriate permissions
### Dataset Not Found
- Verify the dataset ID is correct and exists on HuggingFace Hub
- Check if the dataset requires authentication
### Performance Issues
- Reduce `sample_size` for large datasets
- Use streaming mode (enabled by default) for better memory efficiency
### Search Tool Issues
- **Dataset not in parquet format**: The search tool only works with parquet datasets. If you get a `DatasetNotParquetError`, try a different dataset or check whether the dataset has a parquet configuration.
- **No text columns found**: The search tool requires at least one text/string column. If you get a `NoTextColumnsError`, check the dataset metadata first to confirm that the dataset has text columns.