KhalilGuetari committed on
Commit
ca96eb9
·
1 Parent(s): 3e3178a

Add a search text in dataset tool

README.md CHANGED
@@ -24,25 +24,15 @@ An MCP (Model Context Protocol) server that provides tools for Exploratory Data
24
  - Automatic fallback to sample-based analysis
25
  - Supports multiple data types: numerical, categorical, text, image, audio
26
  - Includes histograms, distributions, and missing value analysis
 
 
 
 
27
 
28
  ## Usage
29
 
30
  This Space runs as an MCP server that can be accessed by MCP-compatible AI assistants.
31
 
32
- ### MCP Client Configuration
33
-
34
- Add this server to your MCP client configuration:
35
-
36
- ```json
37
- {
38
- "mcpServers": {
39
- "hf-eda-mcp": {
40
- "url": "https://YOUR-USERNAME-hf-eda-mcp.hf.space/gradio_api/mcp/sse"
41
- }
42
- }
43
- }
44
- ```
45
-
46
  Replace `YOUR-USERNAME` with your HuggingFace username.
47
 
48
  ### Available Tools
@@ -53,15 +43,30 @@ Replace `YOUR-USERNAME` with your HuggingFace username.
53
  - Automatically uses Dataset Viewer API statistics for parquet datasets (full dataset analysis)
54
  - Falls back to sample-based analysis for other formats
55
  - Returns feature types, statistics, histograms, and missing value analysis
 
 
 
 
 
 
56
 
57
- ## Authentication
58
 
 
 
59
 
 
 
 
 
 
60
 
61
  ## To Do List
62
 
63
  [ ] Security: Do not cache when a dataset is private or gated
64
- [ ] Complete MCP server configuration and documentation
 
 
65
 
66
 
67
  ## License
 
24
  - Automatic fallback to sample-based analysis
25
  - Supports multiple data types: numerical, categorical, text, image, audio
26
  - Includes histograms, distributions, and missing value analysis
27
+ - **Text Search**: Search for text in dataset columns using the Dataset Viewer API
28
+ - Only text columns are searched
29
+ - Only parquet datasets are supported
30
+ - Supports pagination with offset and length parameters
31
 
32
  ## Usage
33
 
34
  This Space runs as an MCP server that can be accessed by MCP-compatible AI assistants.
35
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
36
  Replace `YOUR-USERNAME` with your HuggingFace username.
37
 
38
  ### Available Tools
 
43
  - Automatically uses Dataset Viewer API statistics for parquet datasets (full dataset analysis)
44
  - Falls back to sample-based analysis for other formats
45
  - Returns feature types, statistics, histograms, and missing value analysis
46
+ 4. **search_text_in_dataset**: Search for text in dataset columns
47
+ - Search text in text columns using the Dataset Viewer API
48
+ - Only parquet datasets are supported
49
+ - Supports pagination for large result sets
50
+
51
+ ## MCP Client Configuration
52
 
53
+ Under the hood, the tools use DatasetViewer and HfApi to fetch information on datasets. A HuggingFace token, passed as `hf-api-token`, is required to use them.
54
 
55
+ - **Gradio UI**: on the HF Space, the token is read from the Space's secrets
56
+ - **MCP server**: set your HF token in the MCP configuration headers, as in the following example:
57
 
58
+ ```json
59
+ "headers": {
60
+ "hf-api-token": "hf_token_here"
61
+ },
62
+ ```
63
 
64
  ## To Do List
65
 
66
  [ ] Security: Do not cache when a dataset is private or gated
67
+ [x] Complete MCP server configuration and documentation
68
+ [x] Add a text search tool https://huggingface.co/docs/dataset-viewer/search
69
+ [ ] Add MCP prompts to guide use cases like reports generation?
70
 
71
 
72
  ## License
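The README's new MCP Client Configuration section shows only the `headers` fragment. A minimal sketch of a complete client entry, built in Python so that the output is guaranteed to be valid JSON, might look like the following. The server name `hf-eda-mcp` and the URL pattern come from the configuration block this commit removes from the README; the helper name `build_mcp_client_config` is hypothetical, and placing `headers` next to `url` follows the docs/MCP_USAGE.md example rather than anything the README states.

```python
import json


def build_mcp_client_config(username: str, hf_token: str) -> dict:
    """Build an MCP client entry that sends the HF token as a request header.

    Hypothetical helper for illustration; not part of the hf-eda-mcp package.
    """
    return {
        "mcpServers": {
            "hf-eda-mcp": {
                # URL pattern taken from the README block removed in this commit
                "url": f"https://{username}-hf-eda-mcp.hf.space/gradio_api/mcp/sse",
                # Header name taken from the README's new "headers" fragment
                "headers": {"hf-api-token": hf_token},
            }
        }
    }


if __name__ == "__main__":
    config = build_mcp_client_config("YOUR-USERNAME", "hf_token_here")
    print(json.dumps(config, indent=2))
```

Generating the file this way avoids hand-editing mistakes such as a missing comma between `url` and `headers`.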
docs/MCP_USAGE.md CHANGED
@@ -2,11 +2,11 @@
2
 
3
  ## Overview
4
 
5
- The HF EDA MCP Server provides three main tools for exploratory data analysis of HuggingFace datasets via the Model Context Protocol (MCP).
6
 
7
  ## Available MCP Tools
8
 
9
- The following 3 tools are automatically exposed by Gradio when `mcp_server=True`:
10
 
11
  ### 1. `get_dataset_metadata`
12
  Retrieve comprehensive metadata for a HuggingFace dataset.
@@ -29,15 +29,57 @@ Retrieve a sample of rows from a HuggingFace dataset.
29
  **Returns:** JSON object with sampled data and metadata.
30
 
31
  ### 3. `analyze_dataset_features`
32
- Perform basic exploratory analysis on dataset features.
33
 
34
  **Parameters:**
35
  - `dataset_id` (string): HuggingFace dataset identifier
36
  - `split` (string, default: 'train'): Dataset split to analyze
37
- - `sample_size` (number, default: 1000): Number of samples for analysis (max: 50000)
38
  - `config_name` (string, optional): Configuration name for multi-config datasets
39
 
40
- **Returns:** JSON object with feature analysis results including statistics, missing values, and data quality assessment.
41
 
42
  ## MCP Client Configuration
43
 
@@ -68,6 +110,9 @@ If the server is running on a remote host:
68
  "mcpServers": {
69
  "hf-eda-mcp-server": {
70
  "url": "https://your-server.com/gradio_api/mcp/sse"
 
 
 
71
  }
72
  }
73
  }
@@ -104,7 +149,7 @@ pdm run hf-eda-mcp --max-sample-size 100000
104
 
105
  ### Server Modes
106
 
107
- The server provides both a web interface and MCP server functionality in a single application. When MCP is enabled, Gradio automatically exposes the 3 EDA functions as MCP tools while still providing the web interface for direct interaction.
108
 
109
  ### Environment Variables
110
 
@@ -150,6 +195,38 @@ export HF_EDA_CACHE_DIR="./cache"
150
  pdm run hf-eda-mcp --verbose
151
  ```
152
 
153
  ## Example Usage
154
 
155
  Once connected to an MCP client, you can use the tools like this:
@@ -163,6 +240,12 @@ Use the get_dataset_sample tool with dataset_id="imdb", split="train", num_sampl
163
 
164
  # Analyze features of the GLUE dataset (CoLA configuration)
165
  Use the analyze_dataset_features tool with dataset_id="glue", config_name="cola", sample_size=500
 
 
 
 
 
 
166
  ```
167
 
168
  ## API Endpoints
@@ -185,4 +268,8 @@ When the server is running, you can also access the tools via HTTP API:
185
 
186
  ### Performance Issues
187
  - Reduce `sample_size` for large datasets
188
- - Use streaming mode (enabled by default) for better memory efficiency
 
 
 
 
 
2
 
3
  ## Overview
4
 
5
+ The HF EDA MCP Server provides four main tools for exploratory data analysis of HuggingFace datasets via the Model Context Protocol (MCP).
6
 
7
  ## Available MCP Tools
8
 
9
+ The following 4 tools are automatically exposed by Gradio when `mcp_server=True`:
10
 
11
  ### 1. `get_dataset_metadata`
12
  Retrieve comprehensive metadata for a HuggingFace dataset.
 
29
  **Returns:** JSON object with sampled data and metadata.
30
 
31
  ### 3. `analyze_dataset_features`
32
+ Perform exploratory analysis on dataset features with automatic optimization.
33
 
34
  **Parameters:**
35
  - `dataset_id` (string): HuggingFace dataset identifier
36
  - `split` (string, default: 'train'): Dataset split to analyze
37
+ - `sample_size` (number, default: 1000): Number of samples for analysis (max: 50000, only used for fallback)
38
  - `config_name` (string, optional): Configuration name for multi-config datasets
39
 
40
+ **Returns:** JSON object with comprehensive feature analysis including:
41
+ - Feature types (numerical, categorical, text, image, audio)
42
+ - Statistical measures (mean, median, std, histograms)
43
+ - Missing value analysis
44
+ - Unique value counts
45
+ - Sample values
46
+
47
+ **Analysis Methods:**
48
+ - **Primary**: Uses HuggingFace Dataset Viewer API statistics when available (parquet datasets)
49
+ - Analyzes the full dataset without downloading data
50
+ - Provides complete statistics with histograms
51
+ - More efficient and accurate
52
+ - **Fallback**: Sample-based analysis for non-parquet datasets
53
+ - Downloads and analyzes a sample of the dataset
54
+ - Computes statistics locally
55
+
56
+ ### 4. `search_text_in_dataset`
57
+ Search for text in text columns of a dataset using the Dataset Viewer API.
58
+
59
+ **Parameters:**
60
+ - `dataset_id` (string): HuggingFace dataset identifier
61
+ - `config_name` (string): Configuration name (required for search)
62
+ - `split` (string): Dataset split to search in
63
+ - `query` (string): Search query text
64
+ - `offset` (number, default: 0): Offset for pagination
65
+ - `length` (number, default: 10): Number of results to return (max: 100)
66
+
67
+ **Returns:** JSON object with search results including:
68
+ - `features`: List of features from the dataset, including column names and data types
69
+ - `rows`: List of matching rows with content from each column
70
+ - `num_rows_total`: Total number of examples in the split
71
+ - `num_rows_per_page`: Number of examples in the current page
72
+ - `partial`: Whether the response is partial (true if the dataset is too large to search completely)
73
+
74
+ **Limitations:**
75
+ - Only text columns are searched
76
+ - Only parquet datasets are supported (builder_name="parquet")
77
+ - Search is performed by the Dataset Viewer API, not locally
78
+
79
+ **Validation:**
80
+ - The tool validates that the dataset is in parquet format before attempting search
81
+ - The tool validates that the dataset has at least one text/string column
82
+ - If validation fails, a descriptive error message is returned with suggestions
83
 
84
  ## MCP Client Configuration
85
 
 
110
  "mcpServers": {
111
  "hf-eda-mcp-server": {
112
  "url": "https://your-server.com/gradio_api/mcp/sse",
113
+ "headers": {
114
+ "hf-api-token": "your_huggingface_token_here"
115
+ }
116
  }
117
  }
118
  }
 
149
 
150
  ### Server Modes
151
 
152
+ The server provides both a web interface and MCP server functionality in a single application. When MCP is enabled, Gradio automatically exposes the 4 EDA functions as MCP tools while still providing the web interface for direct interaction.
153
 
154
  ### Environment Variables
155
 
 
195
  pdm run hf-eda-mcp --verbose
196
  ```
197
 
198
+ ## Dataset Viewer Statistics Integration
199
+
200
+ The `analyze_dataset_features` tool automatically uses HuggingFace's Dataset Viewer API when available, providing significant benefits:
201
+
202
+ ### Benefits
203
+ - **Full Dataset Analysis**: Analyzes entire datasets instead of samples
204
+ - **No Download Required**: Statistics are pre-computed by HuggingFace
205
+ - **Richer Statistics**: Includes histograms, frequencies, and multi-modal support
206
+ - **Better Performance**: Faster response times with caching
207
+
208
+ ### Supported Datasets
209
+ Statistics are available for datasets with `builder_name="parquet"`. The tool automatically:
210
+ 1. Checks if Dataset Viewer statistics are available
211
+ 2. Uses full dataset statistics when available
212
+ 3. Falls back to sample-based analysis for other datasets
213
+
214
+ ### Supported Data Types
215
+ The analysis tool provides comprehensive statistics for multiple data types:
216
+ - **Numerical** (int, float): min, max, mean, median, std, histograms
217
+ - **Categorical** (class_label, string_label): frequencies, unique counts
218
+ - **Boolean** (bool): True/False distributions
219
+ - **Text** (string_text): character length statistics, histograms
220
+ - **Image** (image): dimension statistics, histograms
221
+ - **Audio** (audio): duration statistics (seconds), histograms
222
+ - **List** (list): length statistics, histograms
223
+
224
+ ### Response Indicators
225
+ Check the `sample_info` field in the response:
226
+ - `sampling_method: "dataset_viewer_api"` - Using full dataset statistics
227
+ - `sampling_method: "sequential_head"` - Using sample-based analysis
228
+ - `represents_full_dataset: true/false` - Whether analysis covers the full dataset
229
+
230
  ## Example Usage
231
 
232
  Once connected to an MCP client, you can use the tools like this:
 
240
 
241
  # Analyze features of the GLUE dataset (CoLA configuration)
242
  Use the analyze_dataset_features tool with dataset_id="glue", config_name="cola", sample_size=500
243
+
244
+ # Search for text in the IMDB dataset
245
+ Use the search_text_in_dataset tool with dataset_id="stanfordnlp/imdb", config_name="plain_text", split="train", query="great movie", offset=0, length=10
246
+
247
+ # Search for a specific term in the SQuAD dataset
248
+ Use the search_text_in_dataset tool with dataset_id="rajpurkar/squad", config_name="plain_text", split="train", query="president", offset=0, length=5
249
  ```
250
 
251
  ## API Endpoints
 
268
 
269
  ### Performance Issues
270
  - Reduce `sample_size` for large datasets
271
+ - Use streaming mode (enabled by default) for better memory efficiency
272
+
273
+ ### Search Tool Issues
274
+ - **Dataset not in parquet format**: The search tool only works with parquet datasets. If you get a "DatasetNotParquetError", try using a different dataset or check if the dataset has a parquet configuration
275
+ - **No text columns found**: The search tool requires at least one text/string column. If you get a "NoTextColumnsError", verify that the dataset has text columns by checking the dataset metadata first
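The `search_text_in_dataset` parameters documented above map directly onto the query string of the Dataset Viewer `/search` endpoint. The sketch below shows one way to validate the pagination arguments and build the request URL. It is not the code added by this commit: the function name `build_search_url` and the exact validation messages are assumptions, and only the `offset` default of 0, the `length` default of 10, and the maximum of 100 are taken from the parameter list.

```python
from urllib.parse import urlencode

# Public Dataset Viewer API endpoint, documented at
# https://huggingface.co/docs/dataset-viewer/search
SEARCH_ENDPOINT = "https://datasets-server.huggingface.co/search"
MAX_LENGTH = 100  # maximum page size stated in the tool's parameter list


def build_search_url(dataset_id, config_name, split, query, offset=0, length=10):
    """Validate the arguments and return the /search request URL.

    Hypothetical helper for illustration; the real tool may validate differently.
    """
    if not query:
        raise ValueError("query must be a non-empty string")
    if offset < 0:
        raise ValueError("offset must be >= 0")
    if not 1 <= length <= MAX_LENGTH:
        raise ValueError(f"length must be between 1 and {MAX_LENGTH}")
    params = {
        "dataset": dataset_id,
        "config": config_name,
        "split": split,
        "query": query,
        "offset": offset,
        "length": length,
    }
    # urlencode percent-encodes "/" in the dataset id and turns spaces into "+"
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("stanfordnlp/imdb", "plain_text", "train", "great movie"))
```

To page through results, call the helper again with `offset` increased by `length` until the returned page is empty.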
docs/STATISTICS_ENDPOINT.md ADDED
@@ -0,0 +1,427 @@
1
+ # Dataset Viewer Statistics Endpoint Integration
2
+
3
+ ## Overview
4
+
5
+ The HuggingFace Dataset Viewer API provides a `/statistics` endpoint that offers comprehensive statistics for datasets with `builder_name="parquet"`. This endpoint is significantly more efficient and complete than sample-based analysis.
6
+
7
+ ## Key Benefits
8
+
9
+ ### 1. Full Dataset Coverage
10
+ - **Before**: Analysis based on samples (default 1,000 examples)
11
+ - **After**: Statistics computed on the entire dataset (e.g., 25,000 examples for IMDB train split)
12
+
13
+ ### 2. No Data Download Required
14
+ - **Before**: Download and process samples from the dataset
15
+ - **After**: Retrieve pre-computed statistics via API call
16
+
17
+ ### 3. More Complete Statistics
18
+ The endpoint provides detailed statistics for multiple modalities:
19
+
20
+ #### Numerical Features (int, float)
21
+ - **Basic statistics**: min, max, mean, median, std
22
+ - **Missing values**: nan_count, nan_proportion
23
+ - **Distribution**: histogram with bin_edges and hist counts
24
+
25
+ Example response:
26
+ ```json
27
+ {
28
+ "column_type": "float",
29
+ "column_statistics": {
30
+ "nan_count": 0,
31
+ "nan_proportion": 0,
32
+ "min": 0,
33
+ "max": 2,
34
+ "mean": 1.67206,
35
+ "median": 1.8,
36
+ "std": 0.38714,
37
+ "histogram": {
38
+ "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
39
+ "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2]
40
+ }
41
+ }
42
+ }
43
+ ```
44
+
45
+ #### Categorical Features (class_label, string_label)
46
+ - **Unique values**: n_unique count
47
+ - **Frequencies**: Complete frequency distribution for all categories
48
+ - **Missing values**: nan_count, nan_proportion
49
+ - **No label tracking**: no_label_count, no_label_proportion (for class_label)
50
+
51
+ Example response:
52
+ ```json
53
+ {
54
+ "column_type": "class_label",
55
+ "column_statistics": {
56
+ "nan_count": 0,
57
+ "nan_proportion": 0,
58
+ "no_label_count": 0,
59
+ "no_label_proportion": 0,
60
+ "n_unique": 2,
61
+ "frequencies": {
62
+ "unacceptable": 2528,
63
+ "acceptable": 6023
64
+ }
65
+ }
66
+ }
67
+ ```
68
+
69
+ #### Text Features (string_text)
70
+ - **Length statistics**: min, max, mean, median, std (character count)
71
+ - **Missing values**: nan_count, nan_proportion
72
+ - **Distribution**: histogram of text lengths
73
+
74
+ Example response:
75
+ ```json
76
+ {
77
+ "column_type": "string_text",
78
+ "column_statistics": {
79
+ "nan_count": 0,
80
+ "nan_proportion": 0,
81
+ "min": 6,
82
+ "max": 231,
83
+ "mean": 40.70074,
84
+ "median": 37,
85
+ "std": 19.14431,
86
+ "histogram": {
87
+ "hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1],
88
+ "bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231]
89
+ }
90
+ }
91
+ }
92
+ ```
93
+
94
+ #### Boolean Features (bool)
95
+ - **Frequencies**: Distribution of True/False values
96
+ - **Missing values**: nan_count, nan_proportion
97
+
98
+ Example response:
99
+ ```json
100
+ {
101
+ "column_type": "bool",
102
+ "column_statistics": {
103
+ "nan_count": 3,
104
+ "nan_proportion": 0.15,
105
+ "frequencies": {
106
+ "False": 7,
107
+ "True": 10
108
+ }
109
+ }
110
+ }
111
+ ```
112
+
113
+ #### Image Features (image)
114
+ - **Dimension statistics**: min, max, mean, median, std (for width/height)
115
+ - **Missing values**: nan_count, nan_proportion
116
+ - **Distribution**: histogram of image dimensions
117
+
118
+ Example response:
119
+ ```json
120
+ {
121
+ "column_type": "image",
122
+ "column_statistics": {
123
+ "nan_count": 0,
124
+ "nan_proportion": 0.0,
125
+ "min": 256,
126
+ "max": 873,
127
+ "mean": 327.99339,
128
+ "median": 341.0,
129
+ "std": 60.07286,
130
+ "histogram": {
131
+ "hist": [1734, 1637, 1326, 121, 10, 3, 1, 3, 1, 2],
132
+ "bin_edges": [256, 318, 380, 442, 504, ...]
133
+ }
134
+ }
135
+ }
136
+ ```
137
+
138
+ #### Audio Features (audio)
139
+ - **Duration statistics**: min, max, mean, median, std (in seconds)
140
+ - **Missing values**: nan_count, nan_proportion
141
+ - **Distribution**: histogram of audio durations
142
+
143
+ Example response:
144
+ ```json
145
+ {
146
+ "column_type": "audio",
147
+ "column_statistics": {
148
+ "nan_count": 0,
149
+ "nan_proportion": 0,
150
+ "min": 1.02,
151
+ "max": 15,
152
+ "mean": 13.93042,
153
+ "median": 14.77,
154
+ "std": 2.63734,
155
+ "histogram": {
156
+ "hist": [32, 25, 18, 24, 22, 17, 18, 19, 55, 1770],
157
+ "bin_edges": [1.02, 2.418, 3.816, 5.214, 6.612, ...]
158
+ }
159
+ }
160
+ }
161
+ ```
162
+
163
+ #### List Features (list)
164
+ - **Length statistics**: min, max, mean, median, std (list length)
165
+ - **Missing values**: nan_count, nan_proportion
166
+ - **Distribution**: histogram of list lengths
167
+
168
+ Example response:
169
+ ```json
170
+ {
171
+ "column_type": "list",
172
+ "column_statistics": {
173
+ "nan_count": 0,
174
+ "nan_proportion": 0.0,
175
+ "min": 1,
176
+ "max": 3,
177
+ "mean": 1.01741,
178
+ "median": 1.0,
179
+ "std": 0.13146,
180
+ "histogram": {
181
+ "hist": [11177, 196, 1],
182
+ "bin_edges": [1, 2, 3, 3]
183
+ }
184
+ }
185
+ }
186
+ ```
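Every histogram in the example responses above pairs `hist` (the count per bin) with `bin_edges` (one more entry than `hist`). As a rough sketch of how a client could turn such a histogram into per-bin proportions, consider the following; the function name `histogram_proportions` and the output layout are assumptions for illustration and do not appear in the commit.

```python
def histogram_proportions(histogram: dict) -> list:
    """Convert a Dataset Viewer style histogram into per-bin proportions.

    Expects {"hist": [...counts...], "bin_edges": [...len(hist) + 1 edges...]}.
    """
    hist = histogram["hist"]
    edges = histogram["bin_edges"]
    if len(edges) != len(hist) + 1:
        raise ValueError("bin_edges must have exactly one more entry than hist")
    total = sum(hist)
    bins = []
    for i, count in enumerate(hist):
        bins.append(
            {
                "low": edges[i],
                "high": edges[i + 1],
                "count": count,
                # Guard against an empty histogram to avoid division by zero
                "proportion": count / total if total else 0.0,
            }
        )
    return bins


if __name__ == "__main__":
    # Values copied from the numerical (float) example response
    example = {
        "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
        "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2],
    }
    for b in histogram_proportions(example):
        print(f"[{b['low']}, {b['high']}): {b['proportion']:.3f}")
```

The length check matters because some of the example responses elide edges with `...`; a truncated histogram should be rejected rather than silently misaligned.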
187
+
188
+ ## Implementation
189
+
190
+ ### Architecture
191
+
192
+ ```
193
+ analyze_dataset_features()
194
+
195
+ Try: get_dataset_statistics() [Dataset Viewer API]
196
+
197
+ If available (parquet format):
198
+ → Use full dataset statistics
199
+ → Cache results
200
+ → Return converted analysis
201
+
202
+ If not available:
203
+ → Fall back to sample-based analysis
204
+ → Load samples via streaming
205
+ → Compute statistics locally
206
+ ```
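The try-then-fall-back flow in the diagram above can be sketched as a small function that takes the two strategies as callables. This is an illustration under stated assumptions, not the body of `analyze_dataset_features`: the name `analyze_with_fallback`, the returned dictionary layout, and the broad `except Exception` are simplifications, and the sample path here always reports `represents_full_dataset` as `False`, which the real tool may not do.

```python
from typing import Any, Callable, Dict, Optional


def analyze_with_fallback(
    fetch_statistics: Callable[[], Optional[Dict[str, Any]]],
    analyze_sample: Callable[[], Dict[str, Any]],
) -> Dict[str, Any]:
    """Prefer full-dataset statistics; fall back to sample-based analysis."""
    try:
        stats = fetch_statistics()
    except Exception:  # simplification: real code would catch narrower errors
        stats = None

    if stats is not None:
        return {
            "sampling_method": "dataset_viewer_api",
            "represents_full_dataset": True,
            "analysis": stats,
        }
    # The sample analysis only runs when the statistics are unavailable
    return {
        "sampling_method": "sequential_head",
        "represents_full_dataset": False,
        "analysis": analyze_sample(),
    }


if __name__ == "__main__":
    result = analyze_with_fallback(lambda: None, lambda: {"count": 1000})
    print(result["sampling_method"])
```

Passing the strategies in as callables keeps the fallback decision testable without network access.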
207
+
208
+ ### Key Components
209
+
210
+ #### 1. DatasetViewerAdapter
211
+ - `get_dataset_statistics()`: Fetch statistics from API
212
+ - `check_statistics_availability()`: Check if statistics are available for a dataset
213
+
214
+ #### 2. DatasetService
215
+ - `get_dataset_statistics()`: Wrapper with caching and error handling
216
+ - Automatic fallback to sample-based analysis
217
+ - Statistics cache directory: `cache/statistics/`
218
+
219
+ #### 3. Analysis Tool
220
+ - `_convert_viewer_statistics_to_analysis()`: Convert API format to our analysis format
221
+ - Seamless integration with existing analysis pipeline
222
+
223
+ ### Caching Strategy
224
+
225
+ Statistics are cached with the same TTL as other metadata (default: 1 hour):
226
+
227
+ ```
228
+ cache/
229
+ ├── metadata/ # Dataset metadata
230
+ ├── samples/ # Sample data
231
+ └── statistics/ # Dataset Viewer statistics
232
+ └── {dataset}_{config}_{split}_stats.json
233
+ ```
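A file-based cache with a time-to-live, laid out as in the tree above, could be sketched as follows. The function names are hypothetical, the one-hour default mirrors the TTL stated above, and replacing `/` in the dataset id with `--` is an assumption made here so that namespaced ids produce a single file name; the commit does not show how the service actually handles this.

```python
import json
import time
from pathlib import Path


def stats_cache_path(cache_dir, dataset_id, config_name, split):
    """Return the cache file path for one dataset/config/split combination."""
    safe_dataset = dataset_id.replace("/", "--")  # assumption: keep ids file-safe
    name = f"{safe_dataset}_{config_name}_{split}_stats.json"
    return Path(cache_dir) / "statistics" / name


def read_cached_stats(path, ttl_seconds=3600, now=None):
    """Return cached statistics, or None if the file is missing or expired."""
    path = Path(path)
    if not path.exists():
        return None
    now = time.time() if now is None else now
    if now - path.stat().st_mtime > ttl_seconds:
        return None  # entry is older than the TTL
    return json.loads(path.read_text(encoding="utf-8"))


def write_cached_stats(path, stats):
    """Write statistics to the cache, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(stats), encoding="utf-8")
```

Using the file's modification time as the timestamp avoids storing a separate expiry field inside the cached JSON.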
234
+
235
+ ## Usage Examples
236
+
237
+ ### Automatic Selection
238
+
239
+ ```python
240
+ from hf_eda_mcp.tools.analysis import analyze_dataset_features
241
+
242
+ # Automatically uses Dataset Viewer statistics if available
243
+ result = analyze_dataset_features(
244
+ dataset_id="stanfordnlp/imdb",
245
+ split="train"
246
+ )
247
+
248
+ # Check which method was used
249
+ print(result['sample_info']['sampling_method'])
250
+ # Output: "dataset_viewer_api" or "sequential_head"
251
+
252
+ print(result['sample_info']['represents_full_dataset'])
253
+ # Output: True (full dataset) or False (sample)
254
+ ```
255
+
256
+ ### Check Availability
257
+
258
+ ```python
259
+ from hf_eda_mcp.services.dataset_viewer_adapter import DatasetViewerAdapter
260
+
261
+ adapter = DatasetViewerAdapter(token="your_token")
262
+ availability = adapter.check_statistics_availability("stanfordnlp/imdb")
263
+
264
+ print(availability)
265
+ # {
266
+ # 'available': True,
267
+ # 'configs': ['plain_text'],
268
+ # 'reason': 'Statistics available for 1 config(s)'
269
+ # }
270
+ ```
271
+
272
+ ### Direct Statistics Access
273
+
274
+ ```python
275
+ from hf_eda_mcp.services.dataset_service import DatasetService
276
+
277
+ service = DatasetService(token="your_token")
278
+ stats = service.get_dataset_statistics(
279
+ dataset_id="stanfordnlp/imdb",
280
+ split="train",
281
+ config_name="plain_text"
282
+ )
283
+
284
+ if stats:
285
+ print(f"Full dataset: {stats['num_examples']} examples")
286
+ print(f"Columns: {len(stats['statistics'])}")
287
+ else:
288
+ print("Statistics not available, use sample-based analysis")
289
+ ```
290
+
291
+ ## Comparison: Before vs After
292
+
293
+ ### IMDB Dataset Example
294
+
295
+ #### Before (Sample-based)
296
+ ```python
297
+ {
298
+ 'dataset_info': {
299
+ 'sample_size_used': 1000,
300
+ 'sample_size_requested': 1000,
301
+ },
302
+ 'sample_info': {
303
+ 'sampling_method': 'sequential_head',
304
+ 'represents_full_dataset': True, # Only if sample >= requested
305
+ },
306
+ 'features': {
307
+ 'text': {
308
+ 'feature_type': 'text',
309
+ 'statistics': {
310
+ 'count': 1000,
311
+ 'avg_length': 1311.289,
312
+ 'min_length': 65,
313
+ 'max_length': 6103,
314
+ # Limited to sample
315
+ }
316
+ }
317
+ },
318
+ 'summary': 'Analyzed 2 features from 1000 samples | Types: 1 categorical, 1 text'
319
+ }
320
+ ```
321
+
322
+ #### After (Dataset Viewer)
323
+ ```python
324
+ {
325
+ 'dataset_info': {
326
+ 'sample_size_used': 25000, # Full dataset
327
+ 'sample_size_requested': 25000,
328
+ },
329
+ 'sample_info': {
330
+ 'sampling_method': 'dataset_viewer_api',
331
+ 'represents_full_dataset': True, # Always true
332
+ 'partial': False
333
+ },
334
+ 'features': {
335
+ 'text': {
336
+ 'feature_type': 'text',
337
+ 'statistics': {
338
+ 'count': 25000, # Full dataset
339
+ 'mean_length': 1325.06964,
340
+ 'min_length': 52,
341
+ 'max_length': 13704,
342
+ 'histogram': {
343
+ 'bin_edges': [52, 1418, 2784, ...],
344
+ 'hist': [17426, 5384, 1490, ...]
345
+ }
346
+ }
347
+ }
348
+ },
349
+ 'summary': 'Analyzed 2 features from 25000 samples | Types: 1 categorical, 1 text'
350
+ }
351
+ ```
352
+
353
+ ## Supported Data Types
354
+
355
+ The Dataset Viewer statistics endpoint supports comprehensive analysis for multiple data types:
356
+
357
+ | Data Type | Feature Type | Statistics Provided |
358
+ |-----------|--------------|---------------------|
359
+ | `int`, `float` | numerical | min, max, mean, median, std, histogram |
360
+ | `class_label`, `string_label` | categorical | frequencies, n_unique, no_label tracking |
361
+ | `bool` | boolean | True/False frequencies |
362
+ | `string_text` | text | character length stats (min, max, mean, median, std), histogram |
363
+ | `image` | image | dimension statistics, histogram |
364
+ | `audio` | audio | duration statistics (seconds), histogram |
365
+ | `list` | list | length statistics, histogram |
366
+
367
+ ### Data Type Mapping
368
+
369
+ Our analysis tool automatically maps Dataset Viewer types to our internal types:
370
+
371
+ ```
372
+ Dataset Viewer Type → Our Feature Type
373
+ ─────────────────────────────────────
374
+ int, float → numerical
375
+ class_label → categorical
376
+ string_label → categorical
377
+ bool → boolean
378
+ string_text → text
379
+ image → image
380
+ audio → audio
381
+ list → list
382
+ ```
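The mapping table above translates naturally into a dictionary lookup. The sketch below is one possible shape for it; the names `VIEWER_TO_FEATURE_TYPE` and `map_column_type` and the `"unknown"` fallback for unlisted types are assumptions, since the commit does not show how `_convert_viewer_statistics_to_analysis()` handles them.

```python
# Dataset Viewer column types mapped to the analysis tool's feature types,
# copied from the mapping table in STATISTICS_ENDPOINT.md
VIEWER_TO_FEATURE_TYPE = {
    "int": "numerical",
    "float": "numerical",
    "class_label": "categorical",
    "string_label": "categorical",
    "bool": "boolean",
    "string_text": "text",
    "image": "image",
    "audio": "audio",
    "list": "list",
}


def map_column_type(column_type: str) -> str:
    """Return the internal feature type for a Dataset Viewer column type."""
    # Assumption: unlisted column types are reported as "unknown"
    return VIEWER_TO_FEATURE_TYPE.get(column_type, "unknown")


if __name__ == "__main__":
    for viewer_type in ("float", "string_text", "datetime"):
        print(viewer_type, "->", map_column_type(viewer_type))
```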
383
+
384
+ ## Limitations
385
+
386
+ ### Dataset Requirements
387
+ - Only works for datasets with `builder_name="parquet"`
388
+ - Not all datasets on HuggingFace Hub have this format
389
+ - Automatic fallback to sample-based analysis for other formats
390
+
391
+ ### API Availability
392
+ - Requires internet connection
393
+ - Subject to HuggingFace API rate limits
394
+ - May fail for private datasets without proper authentication
395
+
396
+ ## Error Handling
397
+
398
+ The implementation includes robust error handling:
399
+
400
+ 1. **Check availability first**: Verify dataset supports statistics
401
+ 2. **Graceful fallback**: Automatically use sample-based analysis if unavailable
402
+ 3. **Caching**: Reduce API calls and improve performance
403
+ 4. **Logging**: Clear messages about which method is being used
404
+
405
+ ## Performance Impact
406
+
407
+ ### API Call Overhead
408
+ - Initial call: ~1-2 seconds
409
+ - Cached calls: <10ms
410
+ - No data download required
411
+
412
+ ### Sample-based Analysis
413
+ - Download time: Varies by dataset size
414
+ - Processing time: ~1-5 seconds for 1000 samples
415
+ - Network bandwidth: Depends on sample size
416
+
417
+ ## Future Enhancements
418
+
419
+ 1. **Parallel requests**: Fetch statistics for multiple splits simultaneously
420
+ 2. **Partial statistics**: Support datasets with partial statistics
421
+ 3. **Custom aggregations**: Add more statistical measures
422
+ 4. **Visualization**: Generate plots from histogram data
423
+
424
+ ## References
425
+
426
+ - [HuggingFace Dataset Viewer Documentation](https://huggingface.co/docs/dataset-viewer/info)
427
+ - [Statistics Endpoint Specification](https://huggingface.co/docs/dataset-viewer/statistics)
src/hf_eda_mcp/server.py CHANGED
@@ -12,6 +12,7 @@ from typing import Optional
12
  from hf_eda_mcp.tools.metadata import get_dataset_metadata
13
  from hf_eda_mcp.tools.sampling import get_dataset_sample
14
  from hf_eda_mcp.tools.analysis import analyze_dataset_features
 
15
  from hf_eda_mcp.config import ServerConfig, setup_logging, validate_config, set_config
16
 
17
 
@@ -25,10 +26,10 @@ def create_gradio_app(config: ServerConfig) -> gr.Blocks:
25
  gr.Markdown(
26
  """
27
  # 🤗 HuggingFace EDA MCP Server
28
-
29
  **Model Context Protocol server for exploratory data analysis of HuggingFace datasets**
30
-
31
- This server provides three main tools for dataset exploration that are automatically
32
  exposed as MCP tools when `mcp_server=True` is enabled.
33
  """
34
  )
@@ -142,26 +143,80 @@ def create_gradio_app(config: ServerConfig) -> gr.Blocks:
142
  ],
143
  )
144
 
145
  with gr.Tab("ℹ️ About"):
146
  gr.Markdown(
147
  f"""
148
  ## About HF EDA MCP Server
149
-
150
- This server implements the Model Context Protocol (MCP) to provide AI assistants
151
  with tools for exploring and analyzing HuggingFace datasets.
152
-
153
  ### Available MCP Tools
154
-
155
  1. **get_dataset_metadata**: Retrieve comprehensive dataset information
156
  2. **get_dataset_sample**: Sample data from datasets with configurable parameters
157
  3. **analyze_dataset_features**: Perform exploratory data analysis
158
-
 
159
  ### MCP Server Configuration
160
 
161
-
162
  ### Server Status
163
-
164
- - **MCP Tools**: 3 tools available
165
  - **Authentication**: {"✅ Token configured" if config.hf_token else "⚠️ No token (public datasets only)"}
166
  - **MCP Schema**: Available at `/gradio_api/mcp/schema`
167
  - **Cache Directory**: {config.cache_dir or "Default system cache"}
@@ -269,6 +324,7 @@ def launch_server(
269
  logger.info(" - get_dataset_metadata: Retrieve dataset information")
270
  logger.info(" - get_dataset_sample: Sample data from datasets")
271
  logger.info(" - analyze_dataset_features: Perform EDA analysis")
 
272
  logger.info(
273
  f"🌐 MCP schema available at: http://{config.host}:{config.port}/gradio_api/mcp/schema"
274
  )
 
12
  from hf_eda_mcp.tools.metadata import get_dataset_metadata
13
  from hf_eda_mcp.tools.sampling import get_dataset_sample
14
  from hf_eda_mcp.tools.analysis import analyze_dataset_features
15
+ from hf_eda_mcp.tools.search import search_text_in_dataset
16
  from hf_eda_mcp.config import ServerConfig, setup_logging, validate_config, set_config
17
 
18
 
 
26
  gr.Markdown(
27
  """
28
  # 🤗 HuggingFace EDA MCP Server
29
+
30
  **Model Context Protocol server for exploratory data analysis of HuggingFace datasets**
31
+
32
+ This server provides four main tools for dataset exploration that are automatically
33
  exposed as MCP tools when `mcp_server=True` is enabled.
34
  """
35
  )
 
143
  ],
144
  )
145
 
146
+ with gr.Tab("🔎 Text Search"):
147
+ gr.Interface(
148
+ fn=search_text_in_dataset,
149
+ inputs=[
150
+ gr.Textbox(
151
+ label="dataset_id",
152
+ placeholder="e.g., imdb, squad, glue",
153
+ info="HuggingFace dataset identifier",
154
+ ),
155
+ gr.Textbox(
156
+ label="config_name",
157
+ placeholder="e.g., cola, sst2",
158
+ info="Configuration name (required for search)",
159
+ ),
160
+ gr.Dropdown(
161
+ choices=["train", "validation", "test", "dev", "val"],
162
+ value="train",
163
+ label="split",
164
+ info="Dataset split to search in",
165
+ allow_custom_value=True,
166
+ ),
167
+ gr.Textbox(
168
+ label="query",
169
+ placeholder="Enter search query...",
170
+ info="Text to search for in the dataset",
171
+ ),
172
+ gr.Slider(
173
+ minimum=0,
174
+ maximum=1000,
175
+ value=0,
176
+ step=10,
177
+ label="offset",
178
+ info="Offset for pagination",
179
+ ),
180
+ gr.Slider(
181
+ minimum=1,
182
+ maximum=100,
183
+ value=10,
184
+ step=1,
185
+ label="length",
186
+ info="Number of results to return",
187
+ ),
188
+ ],
189
+ outputs=gr.JSON(label="Search Results"),
190
+ title="Search Text in Dataset",
191
+ description="Search for text in text columns of a dataset. Only text columns are searched and only parquet datasets are supported.",
192
+ examples=[
193
+ ["stanfordnlp/imdb", "plain_text", "train", "great movie", 0, 10],
194
+ ["rajpurkar/squad", "plain_text", "train", "president", 0, 5],
195
+ ["nyu-mll/glue", "cola", "train", "friends", 0, 10],
196
+ ],
197
+ )
198
+
199
  with gr.Tab("ℹ️ About"):
200
  gr.Markdown(
201
  f"""
202
  ## About HF EDA MCP Server
203
+
204
+ This server implements the Model Context Protocol (MCP) to provide AI assistants
205
  with tools for exploring and analyzing HuggingFace datasets.
206
+
207
  ### Available MCP Tools
208
+
209
  1. **get_dataset_metadata**: Retrieve comprehensive dataset information
210
  2. **get_dataset_sample**: Sample data from datasets with configurable parameters
211
  3. **analyze_dataset_features**: Perform exploratory data analysis
212
+ 4. **search_text_in_dataset**: Search for text in dataset columns
213
+
214
  ### MCP Server Configuration
215
 
216
+
217
  ### Server Status
218
+
219
+ - **MCP Tools**: 4 tools available
220
  - **Authentication**: {"✅ Token configured" if config.hf_token else "⚠️ No token (public datasets only)"}
221
  - **MCP Schema**: Available at `/gradio_api/mcp/schema`
222
  - **Cache Directory**: {config.cache_dir or "Default system cache"}
 
324
  logger.info(" - get_dataset_metadata: Retrieve dataset information")
325
  logger.info(" - get_dataset_sample: Sample data from datasets")
326
  logger.info(" - analyze_dataset_features: Perform EDA analysis")
327
+ logger.info(" - search_text_in_dataset: Search for text in datasets")
328
  logger.info(
329
  f"🌐 MCP schema available at: http://{config.host}:{config.port}/gradio_api/mcp/schema"
330
  )
src/hf_eda_mcp/services/dataset_service.py CHANGED
@@ -44,6 +44,16 @@ class CacheError(DatasetServiceError):
44
  pass
45
 
46
 
47
  class DatasetService:
48
  """
49
  Centralized service for dataset operations with caching support.
@@ -806,33 +816,33 @@ class DatasetService:
806
  return self.hf_client.validate_dataset_access(dataset_id, config_name)
807
 
808
  def _check_statistics_availability(
809
- self,
810
  dataset_name: str,
811
  config_name: Optional[str] = None
812
  ) -> dict:
813
  """
814
  Check if statistics are available for a dataset.
815
-
816
  Statistics are only available for datasets with builder_name="parquet".
817
  This method checks the dataset information to determine availability.
818
-
819
  Args:
820
  dataset_name: HuggingFace dataset identifier
821
  config_name: Optional configuration name
822
-
823
  Returns:
824
  Dictionary with availability information:
825
  - available: Boolean indicating if statistics are available
826
  - configs: List of configs with statistics support
827
  - reason: Explanation if statistics are not available
828
-
829
  Raises:
830
  DatasetViewerError: If the API request fails
831
  """
832
  try:
833
  dataset_info = self.load_dataset_info(dataset_name, config_name)
834
  full_dataset_id = dataset_info.get('id', dataset_name)
835
-
836
  if len(dataset_info["configs"]) == 1:
837
  # Single config format
838
  builder_name = dataset_info.get('builder_name', '')
@@ -869,6 +879,106 @@ class DatasetService:
869
  logger.error(error_msg)
870
  raise DatasetServiceError(error_msg) from e
871
 
872
 
873
  def get_dataset_service(hf_api_token: str) -> DatasetService:
874
  """Get or create the global dataset service instance using current config."""
 
44
  pass
45
 
46
 
47
+ class DatasetNotParquetError(DatasetServiceError):
48
+ """Raised when a dataset is not in parquet format but parquet is required."""
49
+ pass
50
+
51
+
52
+ class NoTextColumnsError(DatasetServiceError):
53
+ """Raised when a dataset has no text columns for search."""
54
+ pass
55
+
56
+
57
  class DatasetService:
58
  """
59
  Centralized service for dataset operations with caching support.
 
816
  return self.hf_client.validate_dataset_access(dataset_id, config_name)
817
 
818
  def _check_statistics_availability(
819
+ self,
820
  dataset_name: str,
821
  config_name: Optional[str] = None
822
  ) -> dict:
823
  """
824
  Check if statistics are available for a dataset.
825
+
826
  Statistics are only available for datasets with builder_name="parquet".
827
  This method checks the dataset information to determine availability.
828
+
829
  Args:
830
  dataset_name: HuggingFace dataset identifier
831
  config_name: Optional configuration name
832
+
833
  Returns:
834
  Dictionary with availability information:
835
  - available: Boolean indicating if statistics are available
836
  - configs: List of configs with statistics support
837
  - reason: Explanation if statistics are not available
838
+
839
  Raises:
840
  DatasetViewerError: If the API request fails
841
  """
842
  try:
843
  dataset_info = self.load_dataset_info(dataset_name, config_name)
844
  full_dataset_id = dataset_info.get('id', dataset_name)
845
+
846
  if len(dataset_info["configs"]) == 1:
847
  # Single config format
848
  builder_name = dataset_info.get('builder_name', '')
 
879
  logger.error(error_msg)
880
  raise DatasetServiceError(error_msg) from e
881
 
882
+ def search_text_in_dataset(
883
+ self,
884
+ dataset_id: str,
885
+ config_name: str,
886
+ split_name: str,
887
+ query: str,
888
+ offset: int = 0,
889
+ length: int = 50
890
+ ) -> Dict[str, Any]:
891
+ """
892
+ Search for text in text columns of a dataset using the Dataset Viewer API.
893
+
894
+ This method delegates to the DatasetViewerAdapter to perform the search.
895
+ Only text columns are searched and only parquet datasets are supported.
896
+
897
+ Args:
898
+ dataset_id: HuggingFace dataset identifier
899
+ config_name: Configuration name (required)
900
+ split_name: Split name (required)
901
+ query: Search query (required)
902
+ offset: Offset for pagination (default: 0)
903
+ length: Number of examples to return (default: 50)
904
+
905
+ Returns:
906
+ Dictionary containing search results from the Dataset Viewer API
907
+
908
+ Raises:
909
+ DatasetNotParquetError: If the dataset is not in parquet format
910
+ NoTextColumnsError: If the dataset has no text columns
911
+ DatasetServiceError: If the search operation fails
912
+ """
913
+ try:
914
+ # Check if dataset is in parquet format and has text columns
915
+ dataset_info = self.load_dataset_info(dataset_id, config_name)
916
+
917
+ # Check builder_name for parquet format
918
+ # Also check tags as a fallback since builder_name might not be available
919
+ builder_name = dataset_info.get('builder_name', '')
920
+ tags = dataset_info.get('tags', [])
921
+ is_parquet = builder_name == 'parquet' or 'format:parquet' in tags
922
+
923
+ if not is_parquet:
924
+ error_msg = (
925
+ f"Search is only supported for parquet datasets. "
926
+ f"Dataset '{dataset_id}' has builder_name='{builder_name}' "
927
+ f"and tags={tags}. "
928
+ f"Please use a dataset in parquet format."
929
+ )
930
+ logger.warning(error_msg)
931
+ raise DatasetNotParquetError(error_msg)
932
+
933
+ # Check if dataset has text columns
934
+ features = dataset_info.get('features', {})
935
+ if not features:
936
+ error_msg = f"No features found for dataset '{dataset_id}'"
937
+ logger.warning(error_msg)
938
+ raise DatasetServiceError(error_msg)
939
+
940
+ # Check for text/string columns
941
+ has_text_columns = False
942
+ for _, feature_info in features.items():
943
+ # Check for various text types
944
+ if isinstance(feature_info, dict):
945
+ feature_type = feature_info.get('dtype', '')
946
+ elif isinstance(feature_info, str):
947
+ feature_type = feature_info
948
+ else:
949
+ continue
950
+
951
+ # Check if it's a text column (string, text, or Value with string dtype)
952
+ if any(text_type in str(feature_type).lower() for text_type in ['string', 'text']):
953
+ has_text_columns = True
954
+ break
955
+
956
+ if not has_text_columns:
957
+ error_msg = (
958
+ f"No text columns found in dataset '{dataset_id}'. "
959
+ f"Search requires at least one text/string column. "
960
+ f"Available features: {list(features.keys())}"
961
+ )
962
+ logger.warning(error_msg)
963
+ raise NoTextColumnsError(error_msg)
964
+
965
+ # Perform the search
966
+ return self.dataset_viewer.search_text_in_dataset(
967
+ dataset_name=dataset_id,
968
+ config_name=config_name,
969
+ split_name=split_name,
970
+ query=query,
971
+ offset=offset,
972
+ length=length
973
+ )
974
+ except (DatasetNotParquetError, NoTextColumnsError):
975
+ # Re-raise our custom exceptions
976
+ raise
977
+ except Exception as e:
978
+ error_msg = f"Failed to search in dataset: {str(e)}"
979
+ logger.error(error_msg)
980
+ raise DatasetServiceError(error_msg) from e
981
+
982
 
983
  def get_dataset_service(hf_api_token: str) -> DatasetService:
984
  """Get or create the global dataset service instance using current config."""
src/hf_eda_mcp/services/dataset_viewer_adapter.py CHANGED
@@ -25,12 +25,11 @@ class DatasetViewerAdapter():
25
  ):
26
  """
27
  Initialize dataset service with optional caching and authentication.
28
-
29
  Args:
30
  token: HuggingFace authentication token
31
  """
32
- if token:
33
- self.token = token
34
  self.base_url = "https://datasets-server.huggingface.co/"
35
 
36
  def _api_get(self, route: str, params: dict, extra_headers: Optional[dict] = None) -> dict:
@@ -48,7 +47,9 @@ class DatasetViewerAdapter():
48
  Raises:
49
  DatasetViewerError: If request fails after retries
50
  """
51
- headers = {"Authorization": f"Bearer {self.token}"}
 
 
52
  if extra_headers:
53
  headers.update(extra_headers)
54
 
@@ -216,3 +217,68 @@ class DatasetViewerAdapter():
216
  error_msg = f"Unexpected error fetching dataset statistics: {str(e)}"
217
  logger.error(error_msg)
218
  raise DatasetViewerError(error_msg) from e
25
  ):
26
  """
27
  Initialize dataset service with optional caching and authentication.
28
+
29
  Args:
30
  token: HuggingFace authentication token
31
  """
32
+ self.token = token
 
33
  self.base_url = "https://datasets-server.huggingface.co/"
34
 
35
  def _api_get(self, route: str, params: dict, extra_headers: Optional[dict] = None) -> dict:
 
47
  Raises:
48
  DatasetViewerError: If request fails after retries
49
  """
50
+ headers = {}
51
+ if self.token:
52
+ headers["Authorization"] = f"Bearer {self.token}"
53
  if extra_headers:
54
  headers.update(extra_headers)
55
 
 
217
  error_msg = f"Unexpected error fetching dataset statistics: {str(e)}"
218
  logger.error(error_msg)
219
  raise DatasetViewerError(error_msg) from e
220
+
221
+ def search_text_in_dataset(
222
+ self,
223
+ dataset_name: str,
224
+ config_name: str,
225
+ split_name: str,
226
+ query: str,
227
+ offset: int = 0,
228
+ length: int = 50
229
+ ) -> dict:
230
+ """
231
+ Search for text in a dataset split using the Dataset Viewer API.
232
+
233
+ Args:
234
+ dataset_name: HuggingFace dataset identifier
235
+ config_name: Configuration name (required)
236
+ split_name: Split name (required)
237
+ query: Search query (required)
238
+ offset: Offset for pagination (default: 0)
239
+ length: Number of examples to return (default: 50)
240
+
241
+ Returns:
242
+ Dictionary containing search results including:
243
+ - features: List of features from the dataset, including column names and data types
244
+ - rows: List of rows in the requested slice, with the content of each column for every row
245
+ - num_rows_total: Total number of examples in the split
246
+ - num_rows_per_page: Number of examples in the current page
247
+ - partial: Whether the response is partial. If True, it means that the search couldn’t be run on the full dataset because it’s too big.
248
+
249
+ Raises:
250
+ DatasetViewerError: If the API request fails
251
+ """
252
+ params = {
253
+ "dataset": dataset_name,
254
+ "config": config_name,
255
+ "split": split_name,
256
+ "query": query,
257
+ "offset": offset,
258
+ "length": length,
259
+ }
260
+
261
+ logger.info(f"Searching text {query} in dataset split: {dataset_name}/{config_name}/{split_name}_{offset}-{offset+length}")
262
+
263
+ try:
264
+ result = self._api_get(
265
+ route="search",
266
+ params=params,
267
+ )
268
+
269
+ # Check for errors in response
270
+ if result.get('failed'):
271
+ logger.warning(f"Dataset Viewer API returned failures: {result['failed']}")
272
+
273
+ if result.get('partial'):
274
+ logger.warning("Dataset Viewer API returned partial data")
275
+
276
+ return result
277
+
278
+ except DatasetViewerError:
279
+ # Re-raise with context
280
+ raise
281
+ except Exception as e:
282
+ error_msg = f"Unexpected error searching in dataset: {str(e)}"
283
+ logger.error(error_msg)
284
+ raise DatasetViewerError(error_msg) from e
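For reference, the request this method issues can be reproduced without the adapter. The sketch below only assembles the URL and headers from the base URL, the `search` route and the parameters shown in the diff. `build_search_url` and `build_headers` are hypothetical helpers, and no request is actually sent.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co/"  # base_url used by the adapter


def build_search_url(dataset: str, config: str, split: str, query: str,
                     offset: int = 0, length: int = 50) -> str:
    """Hypothetical helper: assemble the search URL with the adapter's parameters."""
    params = {
        "dataset": dataset,
        "config": config,
        "split": split,
        "query": query,
        "offset": offset,
        "length": length,
    }
    return BASE_URL + "search?" + urlencode(params)


def build_headers(token: Optional[str]) -> Dict[str, str]:
    """Hypothetical helper: add the Authorization header only when a token is set,
    mirroring the change made to _api_get in this commit."""
    headers: Dict[str, str] = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


url = build_search_url("stanfordnlp/imdb", "plain_text", "train", "great movie", 0, 10)
print(url)
print(build_headers(None))
```

`urlencode` percent-encodes the slash in the dataset identifier and turns the space in the query into `+`, so the query string stays valid for identifiers such as `stanfordnlp/imdb`.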
src/hf_eda_mcp/tools/metadata.py CHANGED
@@ -33,7 +33,6 @@ def get_dataset_metadata(dataset_id: str, config_name: Optional[str] = None, hf_
33
  Args:
34
  dataset_id: HuggingFace dataset identifier (e.g., 'squad', 'glue', 'imdb')
35
  config_name: Optional configuration name for multi-config datasets
36
- hf_api_token: Header parsed by Gradio when hf_api_token is provided in MCP configuration headers
37
 
38
  Returns:
39
  Dictionary containing comprehensive dataset metadata:
@@ -43,12 +42,16 @@ def get_dataset_metadata(dataset_id: str, config_name: Optional[str] = None, hf_
43
  - features: Dictionary of feature names and types
44
  - splits: Dictionary of split names and their sizes
45
  - configs: List of available configurations
 
46
  - size_bytes: Dataset size in bytes
 
47
  - downloads: Number of downloads
48
  - likes: Number of likes
49
  - tags: List of dataset tags
50
  - created_at: Creation timestamp
51
  - last_modified: Last modification timestamp
 
 
52
 
53
  Raises:
54
  ValueError: If dataset_id is empty or invalid
 
33
  Args:
34
  dataset_id: HuggingFace dataset identifier (e.g., 'squad', 'glue', 'imdb')
35
  config_name: Optional configuration name for multi-config datasets
 
36
 
37
  Returns:
38
  Dictionary containing comprehensive dataset metadata:
 
42
  - features: Dictionary of feature names and types
43
  - splits: Dictionary of split names and their sizes
44
  - configs: List of available configurations
45
+ - config_details: List of dictionaries containing detailed information for each config
46
  - size_bytes: Dataset size in bytes
47
+ - size_human: Human-readable size of dataset
48
  - downloads: Number of downloads
49
  - likes: Number of likes
50
  - tags: List of dataset tags
51
  - created_at: Creation timestamp
52
  - last_modified: Last modification timestamp
53
+ - summary: Human-readable summary of dataset information
54
+ - builder_name: Builder name of the dataset. If builder_name is "parquet", other tools such as search_text_in_dataset are available.
55
 
56
  Raises:
57
  ValueError: If dataset_id is empty or invalid
src/hf_eda_mcp/tools/search.py ADDED
@@ -0,0 +1,130 @@
1
+ import logging
2
+ import gradio as gr
3
+ from typing import Dict, Any
4
+ from hf_eda_mcp.services.dataset_service import (
5
+ DatasetServiceError,
6
+ DatasetNotParquetError,
7
+ NoTextColumnsError,
8
+ get_dataset_service
9
+ )
10
+ from hf_eda_mcp.integrations.hf_client import DatasetNotFoundError, AuthenticationError, NetworkError
11
+ from hf_eda_mcp.validation import (
12
+ validate_dataset_id,
13
+ validate_config_name,
14
+ validate_split_name,
15
+ ValidationError,
16
+ format_validation_error,
17
+ )
18
+ from hf_eda_mcp.error_handling import format_error_response, log_error_with_context
19
+
20
+
21
+ logger = logging.getLogger(__name__)
22
+
23
+
24
+ def search_text_in_dataset(
25
+ dataset_id: str,
26
+ config_name: str,
27
+ split: str,
28
+ query: str,
29
+ offset: int = 0,
30
+ length: int = 10,
31
+ hf_api_token: gr.Header = "",
32
+ ) -> Dict[str, Any]:
33
+ """
34
+ Search for text in text columns of a dataset using the Dataset Viewer API.
35
+ Only text columns are searched and only parquet datasets are supported (builder_name="parquet").
36
+
37
+ Useful for finding relevant examples or debugging issues.
38
+
39
+ Args:
40
+ dataset_id: HuggingFace full dataset identifier (e.g., 'stanfordnlp/imdb', 'rajpurkar/squad', 'nyu-mll/glue')
41
+ config_name: Configuration name
42
+ split: Split name
43
+ query: Search query
44
+ offset: Offset for pagination (default: 0)
45
+ length: Number of examples to return (default: 10). Results cover the range [offset, offset + length).
46
+ hf_api_token: Header parsed by Gradio when hf_api_token is provided in MCP configuration headers
47
+
48
+ Returns:
49
+ Dictionary containing search results including:
50
+ - features: List of features from the dataset, including column names and data types
51
+ - rows: List of rows in the requested slice, with the content of each column for every row
52
+ - num_rows_total: Total number of examples in the split
53
+ - num_rows_per_page: Number of examples in the current page
54
+ - partial: Whether the response is partial. If True, it means that the search couldn’t be run on the full dataset because it’s too big.
55
+ """
56
+ # Handle empty strings from Gradio (convert to None)
57
+ if config_name == "":
58
+ config_name = None
59
+
60
+ # Input validation using centralized validation
61
+ try:
62
+ dataset_id = validate_dataset_id(dataset_id)
63
+ config_name = validate_config_name(config_name)
64
+ split = validate_split_name(split)
65
+ except ValidationError as e:
66
+ logger.error(f"Validation error: {format_validation_error(e)}")
67
+ raise ValueError(format_validation_error(e))
68
+
69
+ context = {
70
+ "dataset_id": dataset_id,
71
+ "config_name": config_name,
72
+ "split": split,
73
+ "query": query,
74
+ "offset": offset,
75
+ "length": length,
76
+ "operation": "search_text_in_dataset"
77
+ }
78
+
79
+ logger.info(
80
+ f"Searching text {query} in dataset: {dataset_id}, split: {split}, "
81
+ f"config: {config_name}, offset: {offset}, length: {length}"
82
+ )
83
+
84
+ try:
85
+ # Get dataset service
86
+ service = get_dataset_service(hf_api_token=hf_api_token)
87
+
88
+ # Search in dataset
89
+ search_results = service.search_text_in_dataset(
90
+ dataset_id=dataset_id,
91
+ config_name=config_name,
92
+ split_name=split,
93
+ query=query,
94
+ offset=offset,
95
+ length=length
96
+ )
97
+
98
+ return search_results
99
+
100
+ except DatasetNotParquetError as e:
101
+ log_error_with_context(e, context, level=logging.WARNING)
102
+ logger.info(f"Dataset is not in parquet format: {str(e)}")
103
+ raise ValueError(str(e)) from e
104
+
105
+ except NoTextColumnsError as e:
106
+ log_error_with_context(e, context, level=logging.WARNING)
107
+ logger.info(f"Dataset has no text columns: {str(e)}")
108
+ raise ValueError(str(e)) from e
109
+
110
+ except DatasetNotFoundError as e:
111
+ log_error_with_context(e, context, level=logging.WARNING)
112
+ error_response = format_error_response(e, context)
113
+ logger.info(f"Dataset/split not found suggestions: {error_response.get('suggestions', [])}")
114
+ raise
115
+
116
+ except AuthenticationError as e:
117
+ log_error_with_context(e, context, level=logging.WARNING)
118
+ error_response = format_error_response(e, context)
119
+ logger.info(f"Authentication error guidance: {error_response.get('suggestions', [])}")
120
+ raise
121
+
122
+ except NetworkError as e:
123
+ log_error_with_context(e, context)
124
+ error_response = format_error_response(e, context)
125
+ logger.info(f"Network error guidance: {error_response.get('suggestions', [])}")
126
+ raise
127
+
128
+ except Exception as e:
129
+ log_error_with_context(e, context)
130
+ raise DatasetServiceError(f"Failed to search in dataset: {str(e)}") from e
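Because the tool pages results with `offset` and `length`, a client that wants every match has to call it repeatedly. The sketch below shows one way to do that, assuming a page shorter than the requested `length` marks the end of the results. `iter_search_rows` and `fake_search` are hypothetical: `fake_search` is an in-memory stand-in for `search_text_in_dataset`, and the row shape it returns is illustrative only.

```python
from typing import Any, Callable, Dict, Iterator, List

SearchFn = Callable[..., Dict[str, Any]]


def iter_search_rows(search: SearchFn, page_size: int = 10, **kwargs: Any) -> Iterator[Any]:
    """Yield rows page by page until a page comes back shorter than page_size."""
    offset = 0
    while True:
        page = search(offset=offset, length=page_size, **kwargs)
        rows: List[Any] = page.get("rows", [])
        yield from rows
        if len(rows) < page_size:
            break  # last (possibly empty) page reached
        offset += page_size


# In-memory stand-in for the real tool; 25 fake matches (illustrative data).
_MATCHES = [{"row_idx": i, "row": {"text": f"great movie #{i}"}} for i in range(25)]


def fake_search(dataset_id: str, config_name: str, split: str, query: str,
                offset: int = 0, length: int = 10) -> Dict[str, Any]:
    return {
        "rows": _MATCHES[offset:offset + length],
        "num_rows_total": len(_MATCHES),
        "partial": False,
    }


rows = list(
    iter_search_rows(
        fake_search,
        page_size=10,
        dataset_id="stanfordnlp/imdb",
        config_name="plain_text",
        split="train",
        query="great movie",
    )
)
print(len(rows))
```

Stopping on a short page avoids relying on the exact meaning of `num_rows_total`, at the cost of one extra request when the number of matches is an exact multiple of the page size.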