hf-eda-mcp

KhalilGuetari committed 18 days ago

Commit

ca96eb9

1 Parent(s): 3e3178a

Add a search text in dataset tool

Files changed (8)

README.md +21 -16
docs/MCP_USAGE.md +94 -7
docs/STATISTICS_ENDPOINT.md +427 -0
src/hf_eda_mcp/server.py +67 -11
src/hf_eda_mcp/services/dataset_service.py +116 -6
src/hf_eda_mcp/services/dataset_viewer_adapter.py +70 -4
src/hf_eda_mcp/tools/metadata.py +4 -1
src/hf_eda_mcp/tools/search.py +130 -0

README.md CHANGED

@@ -24,25 +24,15 @@ An MCP (Model Context Protocol) server that provides tools for Exploratory Data
   - Automatic fallback to sample-based analysis
   - Supports multiple data types: numerical, categorical, text, image, audio
   - Includes histograms, distributions, and missing value analysis
 ## Usage
 This Space runs as an MCP server that can be accessed by MCP-compatible AI assistants.
-### MCP Client Configuration
-Add this server to your MCP client configuration:
-```json
-{
-  "mcpServers": {
-    "hf-eda-mcp": {
-      "url": "https://YOUR-USERNAME-hf-eda-mcp.hf.space/gradio_api/mcp/sse"
-    }
-  }
-}
-```
 Replace `YOUR-USERNAME` with your HuggingFace username.
 ### Available Tools
@@ -53,15 +43,30 @@ Replace `YOUR-USERNAME` with your HuggingFace username.
    - Automatically uses Dataset Viewer API statistics for parquet datasets (full dataset analysis)
    - Falls back to sample-based analysis for other formats
    - Returns feature types, statistics, histograms, and missing value analysis
-## Authentication
 ## To Do List
 [ ] Security: Do not cache when a dataset is private or gated
-[ ] Complete MCP server configuration and documentation
 ## License

   - Automatic fallback to sample-based analysis
   - Supports multiple data types: numerical, categorical, text, image, audio
   - Includes histograms, distributions, and missing value analysis
+- **Text Search**: Search for text in dataset columns using the Dataset Viewer API
+  - Only text columns are searched
+  - Only parquet datasets are supported
+  - Supports pagination with offset and length parameters
 ## Usage
 This Space runs as an MCP server that can be accessed by MCP-compatible AI assistants.
 Replace `YOUR-USERNAME` with your HuggingFace username.
 ### Available Tools
    - Automatically uses Dataset Viewer API statistics for parquet datasets (full dataset analysis)
    - Falls back to sample-based analysis for other formats
    - Returns feature types, statistics, histograms, and missing value analysis
+4. **search_text_in_dataset**: Search for text in dataset columns
+   - Search text in text columns using the Dataset Viewer API
+   - Only parquet datasets are supported
+   - Supports pagination for large result sets
+## MCP Client Configuration
+Under the hood, the tools use DatasetViewer and HfApi to retrieve dataset information. Both require a HuggingFace token, supplied as `hf-api-token`.
+- **Gradio UI**: on the HF Space, the token comes from the Space's secrets
+- **MCP server**: set your HF token in the MCP configuration headers, as in the following example:
+```json
+"headers": {
+  "hf-api-token": "hf_token_here"
+},
+```
 ## To Do List
 [ ] Security: Do not cache when a dataset is private or gated
+[x] Complete MCP server configuration and documentation
+[x] Add a text search tool https://huggingface.co/docs/dataset-viewer/search
+[ ] Add MCP prompts to guide use cases like reports generation?
 ## License

docs/MCP_USAGE.md CHANGED

@@ -2,11 +2,11 @@
 ## Overview
-The HF EDA MCP Server provides three main tools for exploratory data analysis of HuggingFace datasets via the Model Context Protocol (MCP).
 ## Available MCP Tools
-The following 3 tools are automatically exposed by Gradio when `mcp_server=True`:
 ### 1. `get_dataset_metadata`
 Retrieve comprehensive metadata for a HuggingFace dataset.
@@ -29,15 +29,57 @@ Retrieve a sample of rows from a HuggingFace dataset.
 **Returns:** JSON object with sampled data and metadata.
 ### 3. `analyze_dataset_features`
-Perform basic exploratory analysis on dataset features.
 **Parameters:**
 - `dataset_id` (string): HuggingFace dataset identifier
 - `split` (string, default: 'train'): Dataset split to analyze
-- `sample_size` (number, default: 1000): Number of samples for analysis (max: 50000)
 - `config_name` (string, optional): Configuration name for multi-config datasets
-**Returns:** JSON object with feature analysis results including statistics, missing values, and data quality assessment.
 ## MCP Client Configuration
@@ -68,6 +110,9 @@ If the server is running on a remote host:
   "mcpServers": {
     "hf-eda-mcp-server": {
       "url": "https://your-server.com/gradio_api/mcp/sse"
     }
   }
 }
@@ -104,7 +149,7 @@ pdm run hf-eda-mcp --max-sample-size 100000
 ### Server Modes
-The server provides both a web interface and MCP server functionality in a single application. When MCP is enabled, Gradio automatically exposes the 3 EDA functions as MCP tools while still providing the web interface for direct interaction.
 ### Environment Variables
@@ -150,6 +195,38 @@ export HF_EDA_CACHE_DIR="./cache"
 pdm run hf-eda-mcp --verbose
 ```
 ## Example Usage
 Once connected to an MCP client, you can use the tools like this:
@@ -163,6 +240,12 @@ Use the get_dataset_sample tool with dataset_id="imdb", split="train", num_sampl
 # Analyze features of the GLUE dataset (CoLA configuration)
 Use the analyze_dataset_features tool with dataset_id="glue", config_name="cola", sample_size=500
 ```
 ## API Endpoints
@@ -185,4 +268,8 @@ When the server is running, you can also access the tools via HTTP API:
 ### Performance Issues
 - Reduce `sample_size` for large datasets
-- Use streaming mode (enabled by default) for better memory efficiency

 ## Overview
+The HF EDA MCP Server provides four main tools for exploratory data analysis of HuggingFace datasets via the Model Context Protocol (MCP).
 ## Available MCP Tools
+The following 4 tools are automatically exposed by Gradio when `mcp_server=True`:
 ### 1. `get_dataset_metadata`
 Retrieve comprehensive metadata for a HuggingFace dataset.
 **Returns:** JSON object with sampled data and metadata.
 ### 3. `analyze_dataset_features`
+Perform exploratory analysis on dataset features with automatic optimization.
 **Parameters:**
 - `dataset_id` (string): HuggingFace dataset identifier
 - `split` (string, default: 'train'): Dataset split to analyze
+- `sample_size` (number, default: 1000): Number of samples for analysis (max: 50000, only used for fallback)
 - `config_name` (string, optional): Configuration name for multi-config datasets
+**Returns:** JSON object with comprehensive feature analysis including:
+- Feature types (numerical, categorical, text, image, audio)
+- Statistical measures (mean, median, std, histograms)
+- Missing value analysis
+- Unique value counts
+- Sample values
+**Analysis Methods:**
+- **Primary**: Uses HuggingFace Dataset Viewer API statistics when available (parquet datasets)
+  - Analyzes the full dataset without downloading data
+  - Provides complete statistics with histograms
+  - More efficient and accurate
+- **Fallback**: Sample-based analysis for non-parquet datasets
+  - Downloads and analyzes a sample of the dataset
+  - Computes statistics locally
+### 4. `search_text_in_dataset`
+Search for text in text columns of a dataset using the Dataset Viewer API.
+**Parameters:**
+- `dataset_id` (string): HuggingFace dataset identifier
+- `config_name` (string): Configuration name (required for search)
+- `split` (string): Dataset split to search in
+- `query` (string): Search query text
+- `offset` (number, default: 0): Offset for pagination
+- `length` (number, default: 10): Number of results to return (max: 100)
+**Returns:** JSON object with search results including:
+- `features`: List of features from the dataset, including column names and data types
+- `rows`: List of matching rows with content from each column
+- `num_rows_total`: Total number of examples in the split
+- `num_rows_per_page`: Number of examples in the current page
+- `partial`: Whether the response is partial (true if the dataset is too large to search completely)
+**Limitations:**
+- Only text columns are searched
+- Only parquet datasets are supported (builder_name="parquet")
+- Search is performed by the Dataset Viewer API, not locally
+**Validation:**
+- The tool validates that the dataset is in parquet format before attempting search
+- The tool validates that the dataset has at least one text/string column
+- If validation fails, a descriptive error message is returned with suggestions
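The `offset` and `length` parameters described above suggest a simple paging loop. The following is a minimal sketch, assuming a caller-supplied `fetch_page(offset, length)` function (hypothetical, standing in for a call to `search_text_in_dataset`) that returns a dict with a `rows` list. It stops on the first short page rather than relying on `num_rows_total`.

```python
from typing import Callable, Dict, Iterator, List

MAX_LENGTH = 100  # documented upper bound for `length`


def iter_search_rows(
    fetch_page: Callable[[int, int], Dict],
    length: int = 10,
    max_rows: int = 1000,
) -> Iterator[Dict]:
    """Yield matching rows page by page until a short page or max_rows."""
    # Clamp the page size to the documented range 1..100.
    length = max(1, min(length, MAX_LENGTH))
    offset = 0
    yielded = 0
    while yielded < max_rows:
        page = fetch_page(offset, length)
        rows: List[Dict] = page.get("rows", [])
        for row in rows:
            if yielded >= max_rows:
                return
            yield row
            yielded += 1
        if len(rows) < length:
            # A short (or empty) page means there is nothing left to fetch.
            return
        offset += length
```

The `max_rows` cap is a design choice of this sketch: it keeps a broad query from walking an entire split.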
 ## MCP Client Configuration
   "mcpServers": {
     "hf-eda-mcp-server": {
       "url": "https://your-server.com/gradio_api/mcp/sse",
+      "headers": {
+        "hf-api-token": "your_huggingface_token_here"
+      }
     }
   }
 }
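Because the `headers` entry follows the `url` entry, the two must be separated by a comma for the configuration to parse as JSON. One way to avoid hand-editing mistakes is to build the structure in Python and serialize it. This is a small sketch: the `build_mcp_config` helper and the `HF_TOKEN` environment variable are assumptions of the example, not something the server requires.

```python
import json
import os


def build_mcp_config(url: str, token: str) -> dict:
    """Return an MCP client configuration with the token header set."""
    return {
        "mcpServers": {
            "hf-eda-mcp-server": {
                "url": url,
                "headers": {"hf-api-token": token},
            }
        }
    }


config = build_mcp_config(
    "https://your-server.com/gradio_api/mcp/sse",
    # Fall back to the placeholder from the docs when no token is exported.
    os.environ.get("HF_TOKEN", "your_huggingface_token_here"),
)
print(json.dumps(config, indent=2))
```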
 ### Server Modes
+The server provides both a web interface and MCP server functionality in a single application. When MCP is enabled, Gradio automatically exposes the 4 EDA functions as MCP tools while still providing the web interface for direct interaction.
 ### Environment Variables
 pdm run hf-eda-mcp --verbose
 ```
+## Dataset Viewer Statistics Integration
+The `analyze_dataset_features` tool automatically uses HuggingFace's Dataset Viewer API when available, providing significant benefits:
+### Benefits
+- **Full Dataset Analysis**: Analyzes entire datasets instead of samples
+- **No Download Required**: Statistics are pre-computed by HuggingFace
+- **Richer Statistics**: Includes histograms, frequencies, and multi-modal support
+- **Better Performance**: Faster response times with caching
+### Supported Datasets
+Statistics are available for datasets with `builder_name="parquet"`. The tool automatically:
+1. Checks if Dataset Viewer statistics are available
+2. Uses full dataset statistics when available
+3. Falls back to sample-based analysis for other datasets
+### Supported Data Types
+The analysis tool provides comprehensive statistics for multiple data types:
+- **Numerical** (int, float): min, max, mean, median, std, histograms
+- **Categorical** (class_label, string_label): frequencies, unique counts
+- **Boolean** (bool): True/False distributions
+- **Text** (string_text): character length statistics, histograms
+- **Image** (image): dimension statistics, histograms
+- **Audio** (audio): duration statistics (seconds), histograms
+- **List** (list): length statistics, histograms
+### Response Indicators
+Check the `sample_info` field in the response:
+- `sampling_method: "dataset_viewer_api"` - Using full dataset statistics
+- `sampling_method: "sequential_head"` - Using sample-based analysis
+- `represents_full_dataset: true/false` - Whether analysis covers the full dataset
 ## Example Usage
 Once connected to an MCP client, you can use the tools like this:
 # Analyze features of the GLUE dataset (CoLA configuration)
 Use the analyze_dataset_features tool with dataset_id="glue", config_name="cola", sample_size=500
+# Search for text in the IMDB dataset
+Use the search_text_in_dataset tool with dataset_id="imdb", config_name="plain_text", split="train", query="great movie", offset=0, length=10
+# Search for a specific term in the SQuAD dataset
+Use the search_text_in_dataset tool with dataset_id="squad", config_name="plain_text", split="train", query="president", offset=0, length=5
 ```
 ## API Endpoints
 ### Performance Issues
 - Reduce `sample_size` for large datasets
+- Use streaming mode (enabled by default) for better memory efficiency
+### Search Tool Issues
+- **Dataset not in parquet format**: The search tool only works with parquet datasets. If you get a "DatasetNotParquetError", try using a different dataset or check if the dataset has a parquet configuration
+- **No text columns found**: The search tool requires at least one text/string column. If you get a "NoTextColumnsError", verify that the dataset has text columns by checking the dataset metadata first
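The two errors above correspond to checks that can be reproduced before calling the search tool. The sketch below assumes a metadata dict with a `builder_name` string and a `features` mapping in the `{"dtype": "string", "_type": "Value"}` style. Both the dict layout and the local exception classes are stand-ins for illustration and may differ from the server's actual code.

```python
from typing import Dict, List


class DatasetNotParquetError(Exception):
    """Stand-in for the server error of the same name."""


class NoTextColumnsError(Exception):
    """Stand-in for the server error of the same name."""


def text_columns(features: Dict[str, dict]) -> List[str]:
    """Return names of columns whose declared dtype is a string type."""
    return [
        name
        for name, spec in features.items()
        if isinstance(spec, dict)
        and spec.get("_type") == "Value"
        and spec.get("dtype") in ("string", "large_string")
    ]


def check_searchable(info: dict) -> List[str]:
    """Raise if the dataset cannot be searched, else return its text columns."""
    if info.get("builder_name") != "parquet":
        raise DatasetNotParquetError("search requires builder_name='parquet'")
    columns = text_columns(info.get("features", {}))
    if not columns:
        raise NoTextColumnsError("search requires at least one string column")
    return columns
```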

docs/STATISTICS_ENDPOINT.md ADDED

	@@ -0,0 +1,427 @@

+# Dataset Viewer Statistics Endpoint Integration
+## Overview
+The HuggingFace Dataset Viewer API provides a `/statistics` endpoint that offers comprehensive statistics for datasets with `builder_name="parquet"`. This endpoint is significantly more efficient and complete than sample-based analysis.
+## Key Benefits
+### 1. Full Dataset Coverage
+- **Before**: Analysis based on samples (default 1,000 examples)
+- **After**: Statistics computed on the entire dataset (e.g., 25,000 examples for IMDB train split)
+### 2. No Data Download Required
+- **Before**: Download and process samples from the dataset
+- **After**: Retrieve pre-computed statistics via API call
+### 3. More Complete Statistics
+The endpoint provides detailed statistics for multiple modalities:
+#### Numerical Features (int, float)
+- **Basic statistics**: min, max, mean, median, std
+- **Missing values**: nan_count, nan_proportion
+- **Distribution**: histogram with bin_edges and hist counts
+Example response:
+```json
+{
+  "column_type": "float",
+  "column_statistics": {
+    "nan_count": 0,
+    "nan_proportion": 0,
+    "min": 0,
+    "max": 2,
+    "mean": 1.67206,
+    "median": 1.8,
+    "std": 0.38714,
+    "histogram": {
+      "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
+      "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2]
+    }
+  }
+}
+```
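In the response above, `bin_edges` has one more entry than `hist`, so bin `i` spans `bin_edges[i]` to `bin_edges[i + 1]`. The short sketch below pairs them up and finds the most populated bin, using the example values. The helper names are illustrative.

```python
from typing import List, Tuple, Union

Number = Union[int, float]


def histogram_bins(histogram: dict) -> List[Tuple[Number, Number, int]]:
    """Pair each count with its (lower, upper) bin edges."""
    hist = histogram["hist"]
    edges = histogram["bin_edges"]
    if len(edges) != len(hist) + 1:
        raise ValueError("expected len(bin_edges) == len(hist) + 1")
    return [(edges[i], edges[i + 1], hist[i]) for i in range(len(hist))]


def modal_bin(histogram: dict) -> Tuple[Number, Number, int]:
    """Return the (lower, upper, count) triple with the highest count."""
    return max(histogram_bins(histogram), key=lambda b: b[2])


example = {
    "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
    "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2],
}
low, high, count = modal_bin(example)
print(f"most populated bin: [{low}, {high}] with {count} rows")
```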
+#### Categorical Features (class_label, string_label)
+- **Unique values**: n_unique count
+- **Frequencies**: Complete frequency distribution for all categories
+- **Missing values**: nan_count, nan_proportion
+- **No label tracking**: no_label_count, no_label_proportion (for class_label)
+Example response:
+```json
+{
+  "column_type": "class_label",
+  "column_statistics": {
+    "nan_count": 0,
+    "nan_proportion": 0,
+    "no_label_count": 0,
+    "no_label_proportion": 0,
+    "n_unique": 2,
+    "frequencies": {
+      "unacceptable": 2528,
+      "acceptable": 6023
+    }
+  }
+}
+```
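Frequencies are raw counts, so class balance has to be derived by the caller. The small sketch below uses the counts from the example above. The function name is illustrative.

```python
from typing import Dict


def label_proportions(column_statistics: dict) -> Dict[str, float]:
    """Convert a `frequencies` mapping of counts into proportions."""
    freqs = column_statistics["frequencies"]
    total = sum(freqs.values())
    if total == 0:
        # Avoid dividing by zero for an empty column.
        return {label: 0.0 for label in freqs}
    return {label: count / total for label, count in freqs.items()}


stats = {"n_unique": 2, "frequencies": {"unacceptable": 2528, "acceptable": 6023}}
for label, share in label_proportions(stats).items():
    print(f"{label}: {share:.1%}")
```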
+#### Text Features (string_text)
+- **Length statistics**: min, max, mean, median, std (character count)
+- **Missing values**: nan_count, nan_proportion
+- **Distribution**: histogram of text lengths
+Example response:
+```json
+{
+  "column_type": "string_text",
+  "column_statistics": {
+    "nan_count": 0,
+    "nan_proportion": 0,
+    "min": 6,
+    "max": 231,
+    "mean": 40.70074,
+    "median": 37,
+    "std": 19.14431,
+    "histogram": {
+      "hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1],
+      "bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231]
+    }
+  }
+}
+```
+#### Boolean Features (bool)
+- **Frequencies**: Distribution of True/False values
+- **Missing values**: nan_count, nan_proportion
+Example response:
+```json
+{
+  "column_type": "bool",
+  "column_statistics": {
+    "nan_count": 3,
+    "nan_proportion": 0.15,
+    "frequencies": {
+      "False": 7,
+      "True": 10
+    }
+  }
+}
+```
+#### Image Features (image)
+- **Dimension statistics**: min, max, mean, median, std (for width/height)
+- **Missing values**: nan_count, nan_proportion
+- **Distribution**: histogram of image dimensions
+Example response:
+```json
+{
+  "column_type": "image",
+  "column_statistics": {
+    "nan_count": 0,
+    "nan_proportion": 0.0,
+    "min": 256,
+    "max": 873,
+    "mean": 327.99339,
+    "median": 341.0,
+    "std": 60.07286,
+    "histogram": {
+      "hist": [1734, 1637, 1326, 121, 10, 3, 1, 3, 1, 2],
+      "bin_edges": [256, 318, 380, 442, 504, ...]
+    }
+  }
+}
+```
+#### Audio Features (audio)
+- **Duration statistics**: min, max, mean, median, std (in seconds)
+- **Missing values**: nan_count, nan_proportion
+- **Distribution**: histogram of audio durations
+Example response:
+```json
+{
+  "column_type": "audio",
+  "column_statistics": {
+    "nan_count": 0,
+    "nan_proportion": 0,
+    "min": 1.02,
+    "max": 15,
+    "mean": 13.93042,
+    "median": 14.77,
+    "std": 2.63734,
+    "histogram": {
+      "hist": [32, 25, 18, 24, 22, 17, 18, 19, 55, 1770],
+      "bin_edges": [1.02, 2.418, 3.816, 5.214, 6.612, ...]
+    }
+  }
+}
+```
+#### List Features (list)
+- **Length statistics**: min, max, mean, median, std (list length)
+- **Missing values**: nan_count, nan_proportion
+- **Distribution**: histogram of list lengths
+Example response:
+```json
+{
+  "column_type": "list",
+  "column_statistics": {
+    "nan_count": 0,
+    "nan_proportion": 0.0,
+    "min": 1,
+    "max": 3,
+    "mean": 1.01741,
+    "median": 1.0,
+    "std": 0.13146,
+    "histogram": {
+      "hist": [11177, 196, 1],
+      "bin_edges": [1, 2, 3, 3]
+    }
+  }
+}
+```
+## Implementation
+### Architecture
+```
+analyze_dataset_features()
+    ↓
+    Try: get_dataset_statistics() [Dataset Viewer API]
+    ↓
+    If available (parquet format):
+        → Use full dataset statistics
+        → Cache results
+        → Return converted analysis
+    ↓
+    If not available:
+        → Fall back to sample-based analysis
+        → Load samples via streaming
+        → Compute statistics locally
+```
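The flow in the diagram above can be sketched as a try-then-fall-back wrapper. This is an illustration only: `fetch_statistics` and `analyze_sample` are hypothetical callables standing in for the Dataset Viewer request and the local sample analysis, and the returned dict mirrors only the `sampling_method` and `represents_full_dataset` indicators.

```python
from typing import Callable, Optional


def analyze_with_fallback(
    fetch_statistics: Callable[[], Optional[dict]],
    analyze_sample: Callable[[], dict],
) -> dict:
    """Prefer pre-computed statistics; fall back to sample-based analysis."""
    try:
        stats = fetch_statistics()
    except Exception:
        # Broad catch is deliberate in this sketch: any failure of the
        # primary path should lead to the fallback, not to an error.
        stats = None
    if stats:
        return {
            "sampling_method": "dataset_viewer_api",
            "represents_full_dataset": True,
            "statistics": stats,
        }
    return {
        "sampling_method": "sequential_head",
        "represents_full_dataset": False,
        "statistics": analyze_sample(),
    }
```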
+### Key Components
+#### 1. DatasetViewerAdapter
+- `get_dataset_statistics()`: Fetch statistics from API
+- `check_statistics_availability()`: Check if statistics are available for a dataset
+#### 2. DatasetService
+- `get_dataset_statistics()`: Wrapper with caching and error handling
+- Automatic fallback to sample-based analysis
+- Statistics cache directory: `cache/statistics/`
+#### 3. Analysis Tool
+- `_convert_viewer_statistics_to_analysis()`: Convert API format to our analysis format
+- Seamless integration with existing analysis pipeline
+### Caching Strategy
+Statistics are cached with the same TTL as other metadata (default: 1 hour):
+```
+cache/
+├── metadata/          # Dataset metadata
+├── samples/           # Sample data
+└── statistics/        # Dataset Viewer statistics
+    └── {dataset}_{config}_{split}_stats.json
+```
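A file cache with the layout above and a one-hour TTL could look roughly like the following. This is a sketch under assumptions: the function names are hypothetical, expiry is judged from the file's modification time, and `/` in dataset ids is replaced with `--` so the id fits in a single file name. The actual service may handle any of these differently.

```python
import json
import time
from pathlib import Path
from typing import Optional

CACHE_TTL_SECONDS = 3600  # default TTL of 1 hour


def stats_cache_path(cache_dir: Path, dataset: str, config: str, split: str) -> Path:
    """Build cache/statistics/{dataset}_{config}_{split}_stats.json."""
    safe_dataset = dataset.replace("/", "--")  # assumption: flatten "org/name"
    return cache_dir / "statistics" / f"{safe_dataset}_{config}_{split}_stats.json"


def read_cached_stats(path: Path, ttl: float = CACHE_TTL_SECONDS) -> Optional[dict]:
    """Return cached statistics, or None if missing or older than ttl."""
    if not path.exists():
        return None
    age = time.time() - path.stat().st_mtime
    if age > ttl:
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def write_cached_stats(path: Path, stats: dict) -> None:
    """Write statistics to the cache, creating directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(stats), encoding="utf-8")
```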
+## Usage Examples
+### Automatic Selection
+```python
+from hf_eda_mcp.tools.analysis import analyze_dataset_features
+# Automatically uses Dataset Viewer statistics if available
+result = analyze_dataset_features(
+    dataset_id="stanfordnlp/imdb",
+    split="train"
+)
+# Check which method was used
+print(result['sample_info']['sampling_method'])
+# Output: "dataset_viewer_api" or "sequential_head"
+print(result['sample_info']['represents_full_dataset'])
+# Output: True (full dataset) or False (sample)
+```
+### Check Availability
+```python
+from hf_eda_mcp.services.dataset_viewer_adapter import DatasetViewerAdapter
+adapter = DatasetViewerAdapter(token="your_token")
+availability = adapter.check_statistics_availability("stanfordnlp/imdb")
+print(availability)
+# {
+#   'available': True,
+#   'configs': ['plain_text'],
+#   'reason': 'Statistics available for 1 config(s)'
+# }
+```
+### Direct Statistics Access
+```python
+from hf_eda_mcp.services.dataset_service import DatasetService
+service = DatasetService(token="your_token")
+stats = service.get_dataset_statistics(
+    dataset_id="stanfordnlp/imdb",
+    split="train",
+    config_name="plain_text"
+)
+if stats:
+    print(f"Full dataset: {stats['num_examples']} examples")
+    print(f"Columns: {len(stats['statistics'])}")
+else:
+    print("Statistics not available, use sample-based analysis")
+```
+## Comparison: Before vs After
+### IMDB Dataset Example
+#### Before (Sample-based)
+```python
+{
+  'dataset_info': {
+    'sample_size_used': 1000,
+    'sample_size_requested': 1000,
+  },
+  'sample_info': {
+    'sampling_method': 'sequential_head',
+    'represents_full_dataset': True,  # Only if sample >= requested
+  },
+  'features': {
+    'text': {
+      'feature_type': 'text',
+      'statistics': {
+        'count': 1000,
+        'avg_length': 1311.289,
+        'min_length': 65,
+        'max_length': 6103,
+        # Limited to sample
+      }
+    }
+  },
+  'summary': 'Analyzed 2 features from 1000 samples | Types: 1 categorical, 1 text'
+}
+```
+#### After (Dataset Viewer)
+```python
+{
+  'dataset_info': {
+    'sample_size_used': 25000,  # Full dataset
+    'sample_size_requested': 25000,
+  },
+  'sample_info': {
+    'sampling_method': 'dataset_viewer_api',
+    'represents_full_dataset': True,  # Always true
+    'partial': False
+  },
+  'features': {
+    'text': {
+      'feature_type': 'text',
+      'statistics': {
+        'count': 25000,  # Full dataset
+        'mean_length': 1325.06964,
+        'min_length': 52,
+        'max_length': 13704,
+        'histogram': {
+          'bin_edges': [52, 1418, 2784, ...],
+          'hist': [17426, 5384, 1490, ...]
+        }
+      }
+    }
+  },
+  'summary': 'Analyzed 2 features from 25000 samples | Types: 1 categorical, 1 text'
+}
+```
+## Supported Data Types
+The Dataset Viewer statistics endpoint supports comprehensive analysis for multiple data types:
+| Data Type | Feature Type | Statistics Provided |
+|-----------|--------------|---------------------|
+| `int`, `float` | numerical | min, max, mean, median, std, histogram |
+| `class_label`, `string_label` | categorical | frequencies, n_unique, no_label tracking |
+| `bool` | boolean | True/False frequencies |
+| `string_text` | text | character length stats (min, max, mean, median, std), histogram |
+| `image` | image | dimension statistics, histogram |
+| `audio` | audio | duration statistics (seconds), histogram |
+| `list` | list | length statistics, histogram |
+### Data Type Mapping
+Our analysis tool automatically maps Dataset Viewer types to our internal types:
+```
+Dataset Viewer Type → Our Feature Type
+─────────────────────────────────────
+int, float          → numerical
+class_label         → categorical
+string_label        → categorical
+bool                → boolean
+string_text         → text
+image               → image
+audio               → audio
+list                → list
+```
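Expressed as code, the mapping above is a plain lookup table. The sketch below copies the pairs from the table. Returning `"unknown"` for unlisted types is an assumption of this example, not documented behavior.

```python
VIEWER_TO_FEATURE_TYPE = {
    "int": "numerical",
    "float": "numerical",
    "class_label": "categorical",
    "string_label": "categorical",
    "bool": "boolean",
    "string_text": "text",
    "image": "image",
    "audio": "audio",
    "list": "list",
}


def to_feature_type(column_type: str) -> str:
    """Map a Dataset Viewer column_type to the internal feature type."""
    # Assumption: unlisted column types are reported as "unknown".
    return VIEWER_TO_FEATURE_TYPE.get(column_type, "unknown")
```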
+## Limitations
+### Dataset Requirements
+- Only works for datasets with `builder_name="parquet"`
+- Not all datasets on HuggingFace Hub have this format
+- Automatic fallback to sample-based analysis for other formats
+### API Availability
+- Requires internet connection
+- Subject to HuggingFace API rate limits
+- May fail for private datasets without proper authentication
+## Error Handling
+The implementation includes robust error handling:
+1. **Check availability first**: Verify dataset supports statistics
+2. **Graceful fallback**: Automatically use sample-based analysis if unavailable
+3. **Caching**: Reduce API calls and improve performance
+4. **Logging**: Clear messages about which method is being used
+## Performance Impact
+### API Call Overhead
+- Initial call: ~1-2 seconds
+- Cached calls: <10ms
+- No data download required
+### Sample-based Analysis
+- Download time: Varies by dataset size
+- Processing time: ~1-5 seconds for 1000 samples
+- Network bandwidth: Depends on sample size
+## Future Enhancements
+1. **Parallel requests**: Fetch statistics for multiple splits simultaneously
+2. **Partial statistics**: Support datasets with partial statistics
+3. **Custom aggregations**: Add more statistical measures
+4. **Visualization**: Generate plots from histogram data
+## References
+- [HuggingFace Dataset Viewer Documentation](https://huggingface.co/docs/dataset-viewer/info)
+- [Statistics Endpoint Specification](https://huggingface.co/docs/dataset-viewer/statistics)

src/hf_eda_mcp/server.py CHANGED

@@ -12,6 +12,7 @@ from typing import Optional
 from hf_eda_mcp.tools.metadata import get_dataset_metadata
 from hf_eda_mcp.tools.sampling import get_dataset_sample
 from hf_eda_mcp.tools.analysis import analyze_dataset_features
 from hf_eda_mcp.config import ServerConfig, setup_logging, validate_config, set_config
@@ -25,10 +26,10 @@ def create_gradio_app(config: ServerConfig) -> gr.Blocks:
         gr.Markdown(
             """
             # 🤗 HuggingFace EDA MCP Server
             **Model Context Protocol server for exploratory data analysis of HuggingFace datasets**
-            This server provides three main tools for dataset exploration that are automatically
             exposed as MCP tools when `mcp_server=True` is enabled.
             """
         )
@@ -142,26 +143,80 @@ def create_gradio_app(config: ServerConfig) -> gr.Blocks:
                 ],
             )
         with gr.Tab("ℹ️ About"):
             gr.Markdown(
                 f"""
                 ## About HF EDA MCP Server
-                This server implements the Model Context Protocol (MCP) to provide AI assistants
                 with tools for exploring and analyzing HuggingFace datasets.
                 ### Available MCP Tools
                 1. **get_dataset_metadata**: Retrieve comprehensive dataset information
                 2. **get_dataset_sample**: Sample data from datasets with configurable parameters
                 3. **analyze_dataset_features**: Perform exploratory data analysis
                 ### MCP Server Configuration
                 ### Server Status
-                - **MCP Tools**: 3 tools available
                 - **Authentication**: {"✅ Token configured" if config.hf_token else "⚠️ No token (public datasets only)"}
                 - **MCP Schema**: Available at `/gradio_api/mcp/schema`
                 - **Cache Directory**: {config.cache_dir or "Default system cache"}
@@ -269,6 +324,7 @@ def launch_server(
         logger.info("  - get_dataset_metadata: Retrieve dataset information")
         logger.info("  - get_dataset_sample: Sample data from datasets")
         logger.info("  - analyze_dataset_features: Perform EDA analysis")
         logger.info(
             f"🌐 MCP schema available at: http://{config.host}:{config.port}/gradio_api/mcp/schema"
         )

 from hf_eda_mcp.tools.metadata import get_dataset_metadata
 from hf_eda_mcp.tools.sampling import get_dataset_sample
 from hf_eda_mcp.tools.analysis import analyze_dataset_features
+from hf_eda_mcp.tools.search import search_text_in_dataset
 from hf_eda_mcp.config import ServerConfig, setup_logging, validate_config, set_config
         gr.Markdown(
             """
             # 🤗 HuggingFace EDA MCP Server
             **Model Context Protocol server for exploratory data analysis of HuggingFace datasets**
+            This server provides four main tools for dataset exploration that are automatically
             exposed as MCP tools when `mcp_server=True` is enabled.
             """
         )
                 ],
             )
+        with gr.Tab("🔎 Text Search"):
+            gr.Interface(
+                fn=search_text_in_dataset,
+                inputs=[
+                    gr.Textbox(
+                        label="dataset_id",
+                        placeholder="e.g., imdb, squad, glue",
+                        info="HuggingFace dataset identifier",
+                    ),
+                    gr.Textbox(
+                        label="config_name",
+                        placeholder="e.g., cola, sst2",
+                        info="Configuration name (required for search)",
+                    ),
+                    gr.Dropdown(
+                        choices=["train", "validation", "test", "dev", "val"],
+                        value="train",
+                        label="split",
+                        info="Dataset split to search in",
+                        allow_custom_value=True,
+                    ),
+                    gr.Textbox(
+                        label="query",
+                        placeholder="Enter search query...",
+                        info="Text to search for in the dataset",
+                    ),
+                    gr.Slider(
+                        minimum=0,
+                        maximum=1000,
+                        value=0,
+                        step=10,
+                        label="offset",
+                        info="Offset for pagination",
+                    ),
+                    gr.Slider(
+                        minimum=1,
+                        maximum=100,
+                        value=10,
+                        step=1,
+                        label="length",
+                        info="Number of results to return",
+                    ),
+                ],
+                outputs=gr.JSON(label="Search Results"),
+                title="Search Text in Dataset",
+                description="Search for text in text columns of a dataset. Only text columns are searched and only parquet datasets are supported.",
+                examples=[
+                    ["stanfordnlp/imdb", "plain_text", "train", "great movie", 0, 10],
+                    ["rajpurkar/squad", "plain_text", "train", "president", 0, 5],
+                    ["nyu-mll/glue", "cola", "train", "friends", 0, 10],
+                ],
+            )
         with gr.Tab("ℹ️ About"):
             gr.Markdown(
                 f"""
                 ## About HF EDA MCP Server
+                This server implements the Model Context Protocol (MCP) to provide AI assistants
                 with tools for exploring and analyzing HuggingFace datasets.
                 ### Available MCP Tools
                 1. **get_dataset_metadata**: Retrieve comprehensive dataset information
                 2. **get_dataset_sample**: Sample data from datasets with configurable parameters
                 3. **analyze_dataset_features**: Perform exploratory data analysis
+                4. **search_text_in_dataset**: Search for text in dataset columns
                 ### MCP Server Configuration
                 ### Server Status
+                - **MCP Tools**: 4 tools available
                 - **Authentication**: {"✅ Token configured" if config.hf_token else "⚠️ No token (public datasets only)"}
                 - **MCP Schema**: Available at `/gradio_api/mcp/schema`
                 - **Cache Directory**: {config.cache_dir or "Default system cache"}
         logger.info("  - get_dataset_metadata: Retrieve dataset information")
         logger.info("  - get_dataset_sample: Sample data from datasets")
         logger.info("  - analyze_dataset_features: Perform EDA analysis")
+        logger.info("  - search_text_in_dataset: Search for text in datasets")
         logger.info(
             f"🌐 MCP schema available at: http://{config.host}:{config.port}/gradio_api/mcp/schema"
         )

src/hf_eda_mcp/services/dataset_service.py CHANGED

@@ -44,6 +44,16 @@ class CacheError(DatasetServiceError):
     pass
 class DatasetService:
     """
     Centralized service for dataset operations with caching support.
@@ -806,33 +816,33 @@ class DatasetService:
         return self.hf_client.validate_dataset_access(dataset_id, config_name)
     def _check_statistics_availability(
-        self,
         dataset_name: str,
         config_name: Optional[str] = None
     ) -> dict:
         """
         Check if statistics are available for a dataset.
         Statistics are only available for datasets with builder_name="parquet".
         This method checks the dataset information to determine availability.
         Args:
             dataset_name: HuggingFace dataset identifier
             config_name: Optional configuration name
         Returns:
             Dictionary with availability information:
             - available: Boolean indicating if statistics are available
             - configs: List of configs with statistics support
             - reason: Explanation if statistics are not available
         Raises:
             DatasetViewerError: If the API request fails
         """
         try:
             dataset_info = self.load_dataset_info(dataset_name, config_name)
             full_dataset_id = dataset_info.get('id', dataset_name)
             if len(dataset_info["configs"]) == 1:
                 # Single config format
                 builder_name = dataset_info.get('builder_name', '')
@@ -869,6 +879,106 @@ class DatasetService:
             logger.error(error_msg)
             raise DatasetServiceError(error_msg) from e
 def get_dataset_service(hf_api_token: str) -> DatasetService:
     """Get or create the global dataset service instance using current config."""

     pass
+class DatasetNotParquetError(DatasetServiceError):
+    """Raised when a dataset is not in parquet format but parquet is required."""
+    pass
+class NoTextColumnsError(DatasetServiceError):
+    """Raised when a dataset has no text columns for search."""
+    pass
 class DatasetService:
     """
     Centralized service for dataset operations with caching support.
         return self.hf_client.validate_dataset_access(dataset_id, config_name)
     def _check_statistics_availability(
+        self,
         dataset_name: str,
         config_name: Optional[str] = None
     ) -> dict:
         """
         Check if statistics are available for a dataset.
         Statistics are only available for datasets with builder_name="parquet".
         This method checks the dataset information to determine availability.
         Args:
             dataset_name: HuggingFace dataset identifier
             config_name: Optional configuration name
         Returns:
             Dictionary with availability information:
             - available: Boolean indicating if statistics are available
             - configs: List of configs with statistics support
             - reason: Explanation if statistics are not available
         Raises:
             DatasetViewerError: If the API request fails
         """
         try:
             dataset_info = self.load_dataset_info(dataset_name, config_name)
             full_dataset_id = dataset_info.get('id', dataset_name)
             if len(dataset_info["configs"]) == 1:
                 # Single config format
                 builder_name = dataset_info.get('builder_name', '')
             logger.error(error_msg)
             raise DatasetServiceError(error_msg) from e
+    def search_text_in_dataset(
+        self,
+        dataset_id: str,
+        config_name: str,
+        split_name: str,
+        query: str,
+        offset: int = 0,
+        length: int = 50
+    ) -> Dict[str, Any]:
+        """
+        Search for text in text columns of a dataset using the Dataset Viewer API.
+        This method delegates to the DatasetViewerAdapter to perform the search.
+        Only text columns are searched and only parquet datasets are supported.
+        Args:
+            dataset_id: HuggingFace dataset identifier
+            config_name: Configuration name (required)
+            split_name: Split name (required)
+            query: Search query (required)
+            offset: Offset for pagination (default: 0)
+            length: Number of examples to return (default: 50)
+        Returns:
+            Dictionary containing search results from the Dataset Viewer API
+        Raises:
+            DatasetNotParquetError: If the dataset is not in parquet format
+            NoTextColumnsError: If the dataset has no text columns
+            DatasetServiceError: If the search operation fails
+        """
+        try:
+            # Check if dataset is in parquet format and has text columns
+            dataset_info = self.load_dataset_info(dataset_id, config_name)
+            # Check builder_name for parquet format
+            # Also check tags as a fallback since builder_name might not be available
+            builder_name = dataset_info.get('builder_name', '')
+            tags = dataset_info.get('tags', [])
+            is_parquet = builder_name == 'parquet' or 'format:parquet' in tags
+            if not is_parquet:
+                error_msg = (
+                    f"Search is only supported for parquet datasets. "
+                    f"Dataset '{dataset_id}' has builder_name='{builder_name}' "
+                    f"and tags={tags}. "
+                    f"Please use a dataset in parquet format."
+                )
+                logger.warning(error_msg)
+                raise DatasetNotParquetError(error_msg)
+            # Check if dataset has text columns
+            features = dataset_info.get('features', {})
+            if not features:
+                error_msg = f"No features found for dataset '{dataset_id}'"
+                logger.warning(error_msg)
+                raise DatasetServiceError(error_msg)
+            # Check for text/string columns
+            has_text_columns = False
+            for _, feature_info in features.items():
+                # Check for various text types
+                if isinstance(feature_info, dict):
+                    feature_type = feature_info.get('dtype', '')
+                elif isinstance(feature_info, str):
+                    feature_type = feature_info
+                else:
+                    continue
+                # Check if it's a text column (string, text, or Value with string dtype)
+                if any(text_type in str(feature_type).lower() for text_type in ['string', 'text']):
+                    has_text_columns = True
+                    break
+            if not has_text_columns:
+                error_msg = (
+                    f"No text columns found in dataset '{dataset_id}'. "
+                    f"Search requires at least one text/string column. "
+                    f"Available features: {list(features.keys())}"
+                )
+                logger.warning(error_msg)
+                raise NoTextColumnsError(error_msg)
+            # Perform the search
+            return self.dataset_viewer.search_text_in_dataset(
+                dataset_name=dataset_id,
+                config_name=config_name,
+                split_name=split_name,
+                query=query,
+                offset=offset,
+                length=length
+            )
+        except DatasetServiceError:
+            # Re-raise service errors unwrapped (covers DatasetNotParquetError,
+            # NoTextColumnsError and the "no features" error raised above)
+            raise
+        except Exception as e:
+            error_msg = f"Failed to search in dataset: {str(e)}"
+            logger.error(error_msg)
+            raise DatasetServiceError(error_msg) from e
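The text-column check inside `search_text_in_dataset` can be read in isolation. The following is a minimal sketch of that logic as a standalone function; the name `find_text_columns` and the sample `features` dictionary are hypothetical and not part of this commit. It returns the matching column names rather than a boolean, which is an illustrative variation.

```python
from typing import Any, Dict, List


def find_text_columns(features: Dict[str, Any]) -> List[str]:
    """Return the names of features whose dtype mentions 'string' or 'text'."""
    text_columns = []
    for name, info in features.items():
        # Features may be described as a dict (with a 'dtype' key) or as a plain string
        if isinstance(info, dict):
            dtype = info.get("dtype", "")
        elif isinstance(info, str):
            dtype = info
        else:
            continue
        # Substring match, so 'large_string' also counts as a text column
        if any(t in str(dtype).lower() for t in ("string", "text")):
            text_columns.append(name)
    return text_columns


# Hypothetical feature description, shaped like the dictionaries handled in the diff
features = {
    "text": {"dtype": "string", "_type": "Value"},
    "label": {"dtype": "int64", "_type": "Value"},
    "image": {"_type": "Image"},
}
print(find_text_columns(features))
```

Because the match is a substring test on the dtype, nested features (for example a list of strings described without a top-level `dtype`) would not be detected by this sketch.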
 def get_dataset_service(hf_api_token: str) -> DatasetService:
     """Get or create the global dataset service instance using current config."""

src/hf_eda_mcp/services/dataset_viewer_adapter.py CHANGED Viewed

@@ -25,12 +25,11 @@ class DatasetViewerAdapter():
     ):
         """
         Initialize dataset service with optional caching and authentication.
         Args:
             token: HuggingFace authentication token
         """
-        if token:
-            self.token = token
         self.base_url = "https://datasets-server.huggingface.co/"
     def _api_get(self, route: str, params: dict, extra_headers: Optional[dict] = None) -> dict:
@@ -48,7 +47,9 @@ class DatasetViewerAdapter():
         Raises:
             DatasetViewerError: If request fails after retries
         """
-        headers = {"Authorization": f"Bearer {self.token}"}
         if extra_headers:
             headers.update(extra_headers)
@@ -216,3 +217,68 @@ class DatasetViewerAdapter():
             error_msg = f"Unexpected error fetching dataset statistics: {str(e)}"
             logger.error(error_msg)
             raise DatasetViewerError(error_msg) from e

     ):
         """
         Initialize dataset service with optional caching and authentication.
         Args:
             token: HuggingFace authentication token
         """
+        self.token = token
         self.base_url = "https://datasets-server.huggingface.co/"
     def _api_get(self, route: str, params: dict, extra_headers: Optional[dict] = None) -> dict:
         Raises:
             DatasetViewerError: If request fails after retries
         """
+        headers = {}
+        if self.token:
+            headers["Authorization"] = f"Bearer {self.token}"
         if extra_headers:
             headers.update(extra_headers)
             error_msg = f"Unexpected error fetching dataset statistics: {str(e)}"
             logger.error(error_msg)
             raise DatasetViewerError(error_msg) from e
+    def search_text_in_dataset(
+        self,
+        dataset_name: str,
+        config_name: str,
+        split_name: str,
+        query: str,
+        offset: int = 0,
+        length: int = 50
+    ) -> dict:
+        """
+        Search for text in a dataset split using the Dataset Viewer API.
+        Args:
+            dataset_name: HuggingFace dataset identifier
+            config_name: Configuration name (required)
+            split_name: Split name (required)
+            query: Search query (required)
+            offset: Offset for pagination (default: 0)
+            length: Number of examples to return (default: 50)
+        Returns:
+            Dictionary containing search results including:
+            - features: List of features from the dataset, including column names and data types
+            - rows: List of matching rows in the requested slice, each with the content of every column
+            - num_rows_total: Total number of examples in the split
+            - num_rows_per_page: Number of examples in the current page
+            - partial: Whether the response is partial. If True, it means that the search couldn’t be run on the full dataset because it’s too big.
+        Raises:
+            DatasetViewerError: If the API request fails
+        """
+        params = {
+            "dataset": dataset_name,
+            "config": config_name,
+            "split": split_name,
+            "query": query,
+            "offset": offset,
+            "length": length,
+        }
+        logger.info(f"Searching text {query} in dataset split: {dataset_name}/{config_name}/{split_name}_{offset}-{offset+length}")
+        try:
+            result = self._api_get(
+                route="search",
+                params=params,
+            )
+            # Check for errors in response
+            if result.get('failed'):
+                logger.warning(f"Dataset Viewer API returned failures: {result['failed']}")
+            if result.get('partial'):
+                logger.warning("Dataset Viewer API returned partial data")
+            return result
+        except DatasetViewerError:
+            # Re-raise with context
+            raise
+        except Exception as e:
+            error_msg = f"Unexpected error searching in dataset: {str(e)}"
+            logger.error(error_msg)
+            raise DatasetViewerError(error_msg) from e
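To make the request issued by the adapter concrete, here is a minimal sketch of how the URL and headers for the `search` route could be assembled. The helper `build_search_request` is hypothetical (the commit goes through `_api_get` instead), and the dataset, config and split values are only illustrative. No network call is made.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://datasets-server.huggingface.co/"


def build_search_request(
    dataset: str,
    config: str,
    split: str,
    query: str,
    offset: int = 0,
    length: int = 50,
    token: Optional[str] = None,
) -> Tuple[str, Dict[str, str]]:
    """Build the URL and headers for a Dataset Viewer search request."""
    params = {
        "dataset": dataset,
        "config": config,
        "split": split,
        "query": query,
        "offset": offset,
        "length": length,
    }
    # Only send an Authorization header when a token is provided,
    # mirroring the change made to _api_get in this commit
    headers: Dict[str, str] = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return f"{BASE_URL}search?{urlencode(params)}", headers


url, headers = build_search_request("stanfordnlp/imdb", "plain_text", "train", "great movie")
# Parse the query string back to show what the server would receive
print(parse_qs(urlsplit(url).query))
print(headers)
```

The resulting URL and headers can be passed to any HTTP client; anonymous access should work for public datasets, while gated or private ones need the token.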

src/hf_eda_mcp/tools/metadata.py CHANGED Viewed

@@ -33,7 +33,6 @@ def get_dataset_metadata(dataset_id: str, config_name: Optional[str] = None, hf_
     Args:
         dataset_id: HuggingFace dataset identifier (e.g., 'squad', 'glue', 'imdb')
         config_name: Optional configuration name for multi-config datasets
-        hf_api_token: Header parsed by Gradio when hf_api_token is provided in MCP configuration headers
     Returns:
         Dictionary containing comprehensive dataset metadata:
@@ -43,12 +42,16 @@ def get_dataset_metadata(dataset_id: str, config_name: Optional[str] = None, hf_
         - features: Dictionary of feature names and types
         - splits: Dictionary of split names and their sizes
         - configs: List of available configurations
         - size_bytes: Dataset size in bytes
         - downloads: Number of downloads
         - likes: Number of likes
         - tags: List of dataset tags
         - created_at: Creation timestamp
         - last_modified: Last modification timestamp
     Raises:
         ValueError: If dataset_id is empty or invalid

     Args:
         dataset_id: HuggingFace dataset identifier (e.g., 'squad', 'glue', 'imdb')
         config_name: Optional configuration name for multi-config datasets
     Returns:
         Dictionary containing comprehensive dataset metadata:
         - features: Dictionary of feature names and types
         - splits: Dictionary of split names and their sizes
         - configs: List of available configurations
+        - config_details: List of dictionaries containing detailed information for each config
         - size_bytes: Dataset size in bytes
+        - size_human: Human-readable size of dataset
         - downloads: Number of downloads
         - likes: Number of likes
         - tags: List of dataset tags
         - created_at: Creation timestamp
         - last_modified: Last modification timestamp
+        - summary: Human-readable summary of dataset information
+        - builder_name: Builder name of the dataset. If builder_name is "parquet", other tools such as search_text_in_dataset are available.
     Raises:
         ValueError: If dataset_id is empty or invalid

src/hf_eda_mcp/tools/search.py ADDED Viewed

	@@ -0,0 +1,130 @@

+import logging
+import gradio as gr
+from typing import Dict, Any
+from hf_eda_mcp.services.dataset_service import (
+    DatasetServiceError,
+    DatasetNotParquetError,
+    NoTextColumnsError,
+    get_dataset_service
+)
+from hf_eda_mcp.integrations.hf_client import DatasetNotFoundError, AuthenticationError, NetworkError
+from hf_eda_mcp.validation import (
+    validate_dataset_id,
+    validate_config_name,
+    validate_split_name,
+    ValidationError,
+    format_validation_error,
+)
+from hf_eda_mcp.error_handling import format_error_response, log_error_with_context
+logger = logging.getLogger(__name__)
+def search_text_in_dataset(
+    dataset_id: str,
+    config_name: str,
+    split: str,
+    query: str,
+    offset: int = 0,
+    length: int = 10,
+    hf_api_token: gr.Header = "",
+) -> Dict[str, Any]:
+    """
+    Search for text in text columns of a dataset using the Dataset Viewer API.
+    Only text columns are searched, and only parquet datasets are supported (builder_name="parquet").
+    Useful for finding relevant examples or debugging issues.
+    Args:
+        dataset_id: HuggingFace full dataset identifier (e.g., 'stanfordnlp/imdb', 'rajpurkar/squad', 'nyu-mll/glue')
+        config_name: Configuration name
+        split: Split name
+        query: Search query
+        offset: Offset for pagination (default: 0)
+        length: Number of examples to return (default: 10). Results cover the half-open range [offset, offset + length)
+        hf_api_token: Header parsed by Gradio when hf_api_token is provided in MCP configuration headers
+    Returns:
+        Dictionary containing search results including:
+        - features: List of features from the dataset, including column names and data types
+        - rows: List of matching rows in the requested slice, each with the content of every column
+        - num_rows_total: Total number of examples in the split
+        - num_rows_per_page: Number of examples in the current page
+        - partial: Whether the response is partial. If True, it means that the search couldn’t be run on the full dataset because it’s too big.
+    """
+    # Handle empty strings from Gradio (convert to None)
+    if config_name == "":
+        config_name = None
+    # Input validation using centralized validation
+    try:
+        dataset_id = validate_dataset_id(dataset_id)
+        config_name = validate_config_name(config_name)
+        split = validate_split_name(split)
+    except ValidationError as e:
+        logger.error(f"Validation error: {format_validation_error(e)}")
+        raise ValueError(format_validation_error(e))
+    context = {
+        "dataset_id": dataset_id,
+        "config_name": config_name,
+        "split": split,
+        "query": query,
+        "offset": offset,
+        "length": length,
+        "operation": "search_text_in_dataset"
+    }
+    logger.info(
+        f"Searching text {query} in dataset: {dataset_id}, split: {split}, "
+        f"config: {config_name}, offset: {offset}, length: {length}"
+    )
+    try:
+        # Get dataset service
+        service = get_dataset_service(hf_api_token=hf_api_token)
+        # Search in dataset
+        search_results = service.search_text_in_dataset(
+            dataset_id=dataset_id,
+            config_name=config_name,
+            split_name=split,
+            query=query,
+            offset=offset,
+            length=length
+        )
+        return search_results
+    except DatasetNotParquetError as e:
+        log_error_with_context(e, context, level=logging.WARNING)
+        logger.info(f"Dataset is not in parquet format: {str(e)}")
+        raise ValueError(str(e)) from e
+    except NoTextColumnsError as e:
+        log_error_with_context(e, context, level=logging.WARNING)
+        logger.info(f"Dataset has no text columns: {str(e)}")
+        raise ValueError(str(e)) from e
+    except DatasetNotFoundError as e:
+        log_error_with_context(e, context, level=logging.WARNING)
+        error_response = format_error_response(e, context)
+        logger.info(f"Dataset/split not found suggestions: {error_response.get('suggestions', [])}")
+        raise
+    except AuthenticationError as e:
+        log_error_with_context(e, context, level=logging.WARNING)
+        error_response = format_error_response(e, context)
+        logger.info(f"Authentication error guidance: {error_response.get('suggestions', [])}")
+        raise
+    except NetworkError as e:
+        log_error_with_context(e, context)
+        error_response = format_error_response(e, context)
+        logger.info(f"Network error guidance: {error_response.get('suggestions', [])}")
+        raise
+    except Exception as e:
+        log_error_with_context(e, context)
+        raise DatasetServiceError(f"Failed to search in dataset: {str(e)}") from e
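Since the tool returns one page per call, a client that wants every match has to advance `offset` itself. The sketch below shows one way to do that, assuming the response carries a `rows` list as described in the docstring. `iter_search_rows` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for a real call to `search_text_in_dataset` so the example runs offline.

```python
from typing import Any, Callable, Dict, Iterator


def iter_search_rows(
    fetch_page: Callable[[int, int], Dict[str, Any]],
    page_size: int = 10,
) -> Iterator[Dict[str, Any]]:
    """Yield rows page by page until a short or empty page is returned."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        rows = page.get("rows", [])
        if not rows:
            break
        yield from rows
        # A page shorter than page_size means there is nothing left to fetch
        if len(rows) < page_size:
            break
        offset += len(rows)


def fake_fetch(offset: int, length: int) -> Dict[str, Any]:
    """Stand-in for search_text_in_dataset: serves slices of 25 fake matches."""
    data = [{"row_idx": i, "row": {"text": f"match {i}"}} for i in range(25)]
    return {"rows": data[offset:offset + length]}


rows = list(iter_search_rows(fake_fetch, page_size=10))
print(len(rows))  # 25
```

Stopping on a short page avoids relying on the exact meaning of `num_rows_total`, at the cost of one extra request when the number of matches is an exact multiple of the page size.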