| # Dataset Viewer Statistics Endpoint Integration | |
| ## Overview | |
| The HuggingFace Dataset Viewer API provides a `/statistics` endpoint that returns pre-computed, per-column statistics for datasets with `builder_name="parquet"`. Because the statistics cover the full split and arrive in a single API response, this path is more complete than sample-based analysis and avoids downloading any data. | |
| ## Key Benefits | |
| ### 1. Full Dataset Coverage | |
| - **Before**: Analysis based on samples (default 1,000 examples) | |
| - **After**: Statistics computed on the entire dataset (e.g., 25,000 examples for IMDB train split) | |
| ### 2. No Data Download Required | |
| - **Before**: Download and process samples from the dataset | |
| - **After**: Retrieve pre-computed statistics via API call | |
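As a rough illustration of what "via API call" means here, the sketch below builds the request URL for the public Dataset Viewer service and fetches the JSON with the standard library only. The helper names `build_statistics_url` and `fetch_statistics` are hypothetical and are not part of this project; the base URL and the `dataset`, `config`, and `split` query parameters follow the Dataset Viewer documentation linked under References.

```python
import json
import urllib.parse
import urllib.request

# Public Dataset Viewer base URL, as given in the Hugging Face documentation.
BASE_URL = "https://datasets-server.huggingface.co"


def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Build the /statistics URL for one dataset split (hypothetical helper)."""
    query = urllib.parse.urlencode(
        {"dataset": dataset, "config": config, "split": split}
    )
    return f"{BASE_URL}/statistics?{query}"


def fetch_statistics(dataset, config, split, token=None, timeout=30.0):
    """Fetch pre-computed statistics; raises urllib.error.HTTPError on failure."""
    request = urllib.request.Request(build_statistics_url(dataset, config, split))
    if token:
        # A token is only needed for private or gated datasets.
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(build_statistics_url("stanfordnlp/imdb", "plain_text", "train"))
```

Only the URL builder runs offline; `fetch_statistics` needs network access and is subject to the rate limits described under Limitations.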
| ### 3. More Complete Statistics | |
| The endpoint provides detailed statistics for multiple modalities: | |
| #### Numerical Features (int, float) | |
| - **Basic statistics**: min, max, mean, median, std | |
| - **Missing values**: nan_count, nan_proportion | |
| - **Distribution**: histogram with bin_edges and hist counts | |
| Example response: | |
| ```json | |
| { | |
| "column_type": "float", | |
| "column_statistics": { | |
| "nan_count": 0, | |
| "nan_proportion": 0, | |
| "min": 0, | |
| "max": 2, | |
| "mean": 1.67206, | |
| "median": 1.8, | |
| "std": 0.38714, | |
| "histogram": { | |
| "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048], | |
| "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2] | |
| } | |
| } | |
| } | |
| ``` | |
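The `hist` and `bin_edges` arrays can be read together: in the example above there is one more edge than there are counts, so bin `i` appears to span `bin_edges[i]` to `bin_edges[i + 1]`. The sketch below relies on that assumption to turn raw counts into per-bin shares; `histogram_proportions` is a hypothetical helper, not part of the analysis tool.

```python
def histogram_proportions(histogram: dict) -> list[tuple[float, float, float]]:
    """Return (left_edge, right_edge, share_of_values) for each bin."""
    counts = histogram["hist"]
    edges = histogram["bin_edges"]
    if len(edges) != len(counts) + 1:
        raise ValueError("expected one more bin edge than bin counts")
    total = sum(counts)
    return [
        (edges[i], edges[i + 1], counts[i] / total if total else 0.0)
        for i in range(len(counts))
    ]


# Histogram copied from the example response above.
histogram = {
    "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
    "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2],
}
bins = histogram_proportions(histogram)
left, right, share = max(bins, key=lambda b: b[2])
print(f"Most populated bin: [{left}, {right}] holds {share:.1%} of values")
```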
| #### Categorical Features (class_label, string_label) | |
| - **Unique values**: n_unique count | |
| - **Frequencies**: Complete frequency distribution for all categories | |
| - **Missing values**: nan_count, nan_proportion | |
| - **No label tracking**: no_label_count, no_label_proportion (for class_label) | |
| Example response: | |
| ```json | |
| { | |
| "column_type": "class_label", | |
| "column_statistics": { | |
| "nan_count": 0, | |
| "nan_proportion": 0, | |
| "no_label_count": 0, | |
| "no_label_proportion": 0, | |
| "n_unique": 2, | |
| "frequencies": { | |
| "unacceptable": 2528, | |
| "acceptable": 6023 | |
| } | |
| } | |
| } | |
| ``` | |
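Because `frequencies` covers every category, class balance can be derived without touching the data. The following is a minimal sketch using the example response above; `class_balance` and the returned keys are illustrative names, not part of the project.

```python
def class_balance(column_statistics: dict) -> dict:
    """Derive per-label shares and a majority/minority ratio from frequencies."""
    frequencies = column_statistics["frequencies"]
    total = sum(frequencies.values())
    if total == 0:
        return {"total": 0, "shares": {}, "imbalance_ratio": None}
    shares = {label: count / total for label, count in frequencies.items()}
    smallest = min(frequencies.values())
    # The ratio is undefined when a listed label never occurs.
    ratio = max(frequencies.values()) / smallest if smallest else None
    return {"total": total, "shares": shares, "imbalance_ratio": ratio}


# Values copied from the example response above.
balance = class_balance(
    {"n_unique": 2, "frequencies": {"unacceptable": 2528, "acceptable": 6023}}
)
print(balance["total"], round(balance["imbalance_ratio"], 2))
```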
| #### Text Features (string_text) | |
| - **Length statistics**: min, max, mean, median, std (character count) | |
| - **Missing values**: nan_count, nan_proportion | |
| - **Distribution**: histogram of text lengths | |
| Example response: | |
| ```json | |
| { | |
| "column_type": "string_text", | |
| "column_statistics": { | |
| "nan_count": 0, | |
| "nan_proportion": 0, | |
| "min": 6, | |
| "max": 231, | |
| "mean": 40.70074, | |
| "median": 37, | |
| "std": 19.14431, | |
| "histogram": { | |
| "hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1], | |
| "bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231] | |
| } | |
| } | |
| } | |
| ``` | |
| #### Boolean Features (bool) | |
| - **Frequencies**: Distribution of True/False values | |
| - **Missing values**: nan_count, nan_proportion | |
| Example response: | |
| ```json | |
| { | |
| "column_type": "bool", | |
| "column_statistics": { | |
| "nan_count": 3, | |
| "nan_proportion": 0.15, | |
| "frequencies": { | |
| "False": 7, | |
| "True": 10 | |
| } | |
| } | |
| } | |
| ``` | |
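The numbers in this example are consistent with `frequencies` counting only non-missing values, so the row total is `nan_count` plus the sum of the frequencies. That reading is an inference from the example rather than a documented guarantee; the short check below works through the arithmetic.

```python
import math

# Values copied from the example response above.
column_statistics = {
    "nan_count": 3,
    "nan_proportion": 0.15,
    "frequencies": {"False": 7, "True": 10},
}

non_missing = sum(column_statistics["frequencies"].values())  # 7 + 10 = 17
total_rows = column_statistics["nan_count"] + non_missing      # 3 + 17 = 20
recomputed = column_statistics["nan_count"] / total_rows       # 3 / 20 = 0.15

assert math.isclose(recomputed, column_statistics["nan_proportion"])
print(f"{total_rows} rows, {recomputed:.0%} missing")
```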
| #### Image Features (image) | |
| - **Dimension statistics**: min, max, mean, median, std (for width/height) | |
| - **Missing values**: nan_count, nan_proportion | |
| - **Distribution**: histogram of image dimensions | |
| Example response: | |
| ```json | |
| { | |
| "column_type": "image", | |
| "column_statistics": { | |
| "nan_count": 0, | |
| "nan_proportion": 0.0, | |
| "min": 256, | |
| "max": 873, | |
| "mean": 327.99339, | |
| "median": 341.0, | |
| "std": 60.07286, | |
| "histogram": { | |
| "hist": [1734, 1637, 1326, 121, 10, 3, 1, 3, 1, 2], | |
| "bin_edges": [256, 318, 380, 442, 504, ...] | |
| } | |
| } | |
| } | |
| ``` | |
| #### Audio Features (audio) | |
| - **Duration statistics**: min, max, mean, median, std (in seconds) | |
| - **Missing values**: nan_count, nan_proportion | |
| - **Distribution**: histogram of audio durations | |
| Example response: | |
| ```json | |
| { | |
| "column_type": "audio", | |
| "column_statistics": { | |
| "nan_count": 0, | |
| "nan_proportion": 0, | |
| "min": 1.02, | |
| "max": 15, | |
| "mean": 13.93042, | |
| "median": 14.77, | |
| "std": 2.63734, | |
| "histogram": { | |
| "hist": [32, 25, 18, 24, 22, 17, 18, 19, 55, 1770], | |
| "bin_edges": [1.02, 2.418, 3.816, 5.214, 6.612, ...] | |
| } | |
| } | |
| } | |
| ``` | |
| #### List Features (list) | |
| - **Length statistics**: min, max, mean, median, std (list length) | |
| - **Missing values**: nan_count, nan_proportion | |
| - **Distribution**: histogram of list lengths | |
| Example response: | |
| ```json | |
| { | |
| "column_type": "list", | |
| "column_statistics": { | |
| "nan_count": 0, | |
| "nan_proportion": 0.0, | |
| "min": 1, | |
| "max": 3, | |
| "mean": 1.01741, | |
| "median": 1.0, | |
| "std": 0.13146, | |
| "histogram": { | |
| "hist": [11177, 196, 1], | |
| "bin_edges": [1, 2, 3, 3] | |
| } | |
| } | |
| } | |
| ``` | |
| ## Implementation | |
| ### Architecture | |
| ``` | |
| analyze_dataset_features() | |
|     ↓ | |
| Try: get_dataset_statistics() [Dataset Viewer API] | |
|     ↓ | |
| If available (parquet format): | |
|     → Use full dataset statistics | |
|     → Cache results | |
|     → Return converted analysis | |
|     ↓ | |
| If not available: | |
|     → Fall back to sample-based analysis | |
|     → Load samples via streaming | |
|     → Compute statistics locally | |
| ``` | |
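The flow above can be sketched as a small function that tries the API path first and falls back on any failure. This is an illustration under stated assumptions, not the project's actual code: `analyze_with_fallback` and its two callable parameters are hypothetical, caching is left out, and only the `sampling_method` values are taken from this document.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def analyze_with_fallback(
    fetch_statistics: Callable[[], Optional[dict]],
    analyze_sample: Callable[[], dict],
) -> dict:
    """Prefer full-dataset statistics; fall back to sample-based analysis."""
    try:
        stats = fetch_statistics()
    except Exception as exc:  # network errors, unsupported format, auth failures
        logger.warning("Dataset Viewer statistics unavailable: %s", exc)
        stats = None

    if stats:
        logger.info("Using Dataset Viewer statistics (full dataset)")
        return {"sampling_method": "dataset_viewer_api", "analysis": stats}

    logger.info("Falling back to sample-based analysis")
    return {"sampling_method": "sequential_head", "analysis": analyze_sample()}


def _unavailable() -> Optional[dict]:
    # Stand-in for a dataset without a parquet export.
    raise RuntimeError("statistics not available for this dataset")


full = analyze_with_fallback(lambda: {"num_examples": 25000}, lambda: {"count": 1000})
sampled = analyze_with_fallback(_unavailable, lambda: {"count": 1000})
print(full["sampling_method"], sampled["sampling_method"])
```

Passing the two strategies as callables keeps the decision logic testable without network access.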
| ### Key Components | |
| #### 1. DatasetViewerAdapter | |
| - `get_dataset_statistics()`: Fetch statistics from API | |
| - `check_statistics_availability()`: Check if statistics are available for a dataset | |
| #### 2. DatasetService | |
| - `get_dataset_statistics()`: Wrapper with caching and error handling | |
| - Automatic fallback to sample-based analysis | |
| - Statistics cache directory: `cache/statistics/` | |
| #### 3. Analysis Tool | |
| - `_convert_viewer_statistics_to_analysis()`: Convert API format to our analysis format | |
| - Seamless integration with existing analysis pipeline | |
| ### Caching Strategy | |
| Statistics are cached with the same TTL as other metadata (default: 1 hour): | |
| ``` | |
| cache/ | |
| ├── metadata/     # Dataset metadata | |
| ├── samples/      # Sample data | |
| └── statistics/   # Dataset Viewer statistics | |
|     └── {dataset}_{config}_{split}_stats.json | |
| ``` | |
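A file cache with a time-to-live can be approximated with the file's modification time. The sketch below assumes the layout shown above and a 1-hour default TTL; the function names, the `--` replacement for `/` in dataset IDs, and the mtime-based expiry are assumptions for illustration, and the project's real cache may work differently.

```python
import json
import time
from pathlib import Path
from typing import Optional

CACHE_TTL_SECONDS = 3600  # 1 hour, matching the documented default


def stats_cache_path(cache_dir, dataset: str, config: str, split: str) -> Path:
    """Return cache/statistics/{dataset}_{config}_{split}_stats.json."""
    safe_dataset = dataset.replace("/", "--")  # assumption: how "/" is sanitized
    name = f"{safe_dataset}_{config}_{split}_stats.json"
    return Path(cache_dir) / "statistics" / name


def read_cached_stats(path: Path, ttl_seconds: float = CACHE_TTL_SECONDS) -> Optional[dict]:
    """Return cached statistics, or None if the entry is missing or expired."""
    if not path.exists():
        return None
    age = time.time() - path.stat().st_mtime
    if age > ttl_seconds:
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def write_cached_stats(path: Path, stats: dict) -> None:
    """Persist statistics as JSON, creating the cache directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(stats), encoding="utf-8")
```

A caller would check `read_cached_stats` first and call the API, followed by `write_cached_stats`, only on a miss.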
| ## Usage Examples | |
| ### Automatic Selection | |
| ```python | |
| from hf_eda_mcp.tools.analysis import analyze_dataset_features | |
| # Automatically uses Dataset Viewer statistics if available | |
| result = analyze_dataset_features( | |
|     dataset_id="stanfordnlp/imdb", | |
|     split="train" | |
| ) | |
| # Check which method was used | |
| print(result['sample_info']['sampling_method']) | |
| # Output: "dataset_viewer_api" or "sequential_head" | |
| print(result['sample_info']['represents_full_dataset']) | |
| # Output: True (full dataset) or False (sample) | |
| ``` | |
| ### Check Availability | |
| ```python | |
| from hf_eda_mcp.services.dataset_viewer_adapter import DatasetViewerAdapter | |
| adapter = DatasetViewerAdapter(token="your_token") | |
| availability = adapter.check_statistics_availability("stanfordnlp/imdb") | |
| print(availability) | |
| # { | |
| # 'available': True, | |
| # 'configs': ['plain_text'], | |
| # 'reason': 'Statistics available for 1 config(s)' | |
| # } | |
| ``` | |
| ### Direct Statistics Access | |
| ```python | |
| from hf_eda_mcp.services.dataset_service import DatasetService | |
| service = DatasetService(token="your_token") | |
| stats = service.get_dataset_statistics( | |
|     dataset_id="stanfordnlp/imdb", | |
|     split="train", | |
|     config_name="plain_text" | |
| ) | |
| if stats: | |
|     print(f"Full dataset: {stats['num_examples']} examples") | |
|     print(f"Columns: {len(stats['statistics'])}") | |
| else: | |
|     print("Statistics not available, use sample-based analysis") | |
| ``` | |
| ## Comparison: Before vs After | |
| ### IMDB Dataset Example | |
| #### Before (Sample-based) | |
| ```python | |
| { | |
| 'dataset_info': { | |
| 'sample_size_used': 1000, | |
| 'sample_size_requested': 1000, | |
| }, | |
| 'sample_info': { | |
| 'sampling_method': 'sequential_head', | |
| 'represents_full_dataset': True, # Set because sample_size_used >= sample_size_requested, even though only 1,000 of the 25,000 rows were read | |
| }, | |
| 'features': { | |
| 'text': { | |
| 'feature_type': 'text', | |
| 'statistics': { | |
| 'count': 1000, | |
| 'avg_length': 1311.289, | |
| 'min_length': 65, | |
| 'max_length': 6103, | |
| # Limited to sample | |
| } | |
| } | |
| }, | |
| 'summary': 'Analyzed 2 features from 1000 samples | Types: 1 categorical, 1 text' | |
| } | |
| ``` | |
| #### After (Dataset Viewer) | |
| ```python | |
| { | |
| 'dataset_info': { | |
| 'sample_size_used': 25000, # Full dataset | |
| 'sample_size_requested': 25000, | |
| }, | |
| 'sample_info': { | |
| 'sampling_method': 'dataset_viewer_api', | |
| 'represents_full_dataset': True, # Always true | |
| 'partial': False | |
| }, | |
| 'features': { | |
| 'text': { | |
| 'feature_type': 'text', | |
| 'statistics': { | |
| 'count': 25000, # Full dataset | |
| 'mean_length': 1325.06964, | |
| 'min_length': 52, | |
| 'max_length': 13704, | |
| 'histogram': { | |
| 'bin_edges': [52, 1418, 2784, ...], | |
| 'hist': [17426, 5384, 1490, ...] | |
| } | |
| } | |
| } | |
| }, | |
| 'summary': 'Analyzed 2 features from 25000 samples | Types: 1 categorical, 1 text' | |
| } | |
| ``` | |
| ## Supported Data Types | |
| The Dataset Viewer statistics endpoint supports comprehensive analysis for multiple data types: | |
| | Data Type | Feature Type | Statistics Provided | | |
| |-----------|--------------|---------------------| | |
| | `int`, `float` | numerical | min, max, mean, median, std, histogram | | |
| | `class_label`, `string_label` | categorical | frequencies, n_unique, no_label tracking | | |
| | `bool` | boolean | True/False frequencies | | |
| | `string_text` | text | character length stats (min, max, mean, median, std), histogram | | |
| | `image` | image | dimension statistics, histogram | | |
| | `audio` | audio | duration statistics (seconds), histogram | | |
| | `list` | list | length statistics, histogram | | |
| ### Data Type Mapping | |
| Our analysis tool automatically maps Dataset Viewer types to our internal types: | |
| ``` | |
| Dataset Viewer Type → Our Feature Type | |
| ───────────────────────────────────── | |
| int, float → numerical | |
| class_label → categorical | |
| string_label → categorical | |
| bool → boolean | |
| string_text → text | |
| image → image | |
| audio → audio | |
| list → list | |
| ``` | |
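In code, this mapping is naturally a dictionary lookup. The sketch below is one possible shape for it; `VIEWER_TO_FEATURE_TYPE`, `map_column_type`, `feature_types`, and the `"unknown"` fallback are illustrative assumptions rather than the tool's actual names. It also assumes each entry in the API's `statistics` list carries a `column_name` next to the `column_type` shown in the example responses.

```python
# Dataset Viewer column_type -> internal feature type, per the table above.
VIEWER_TO_FEATURE_TYPE = {
    "int": "numerical",
    "float": "numerical",
    "class_label": "categorical",
    "string_label": "categorical",
    "bool": "boolean",
    "string_text": "text",
    "image": "image",
    "audio": "audio",
    "list": "list",
}


def map_column_type(column_type: str) -> str:
    """Map a Dataset Viewer column type; unrecognized types become 'unknown'."""
    return VIEWER_TO_FEATURE_TYPE.get(column_type, "unknown")


def feature_types(statistics: list[dict]) -> dict[str, str]:
    """Return {column_name: feature_type} for a list of column entries."""
    return {
        column["column_name"]: map_column_type(column["column_type"])
        for column in statistics
    }


columns = [
    {"column_name": "text", "column_type": "string_text"},
    {"column_name": "label", "column_type": "class_label"},
]
print(feature_types(columns))
```

Falling back to a sentinel value instead of raising keeps the conversion tolerant of column types added to the API later.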
| ## Limitations | |
| ### Dataset Requirements | |
| - Only works for datasets with `builder_name="parquet"` | |
| - Not all datasets on the HuggingFace Hub are available in this format | |
| - Automatic fallback to sample-based analysis for other formats | |
| ### API Availability | |
| - Requires internet connection | |
| - Subject to HuggingFace API rate limits | |
| - May fail for private datasets without proper authentication | |
| ## Error Handling | |
| The implementation includes robust error handling: | |
| 1. **Check availability first**: Verify dataset supports statistics | |
| 2. **Graceful fallback**: Automatically use sample-based analysis if unavailable | |
| 3. **Caching**: Reduce API calls and improve performance | |
| 4. **Logging**: Clear messages about which method is being used | |
| ## Performance Impact | |
| ### API Call Overhead | |
| - Initial call: ~1-2 seconds | |
| - Cached calls: <10ms | |
| - No data download required | |
| ### Sample-based Analysis | |
| - Download time: Varies by dataset size | |
| - Processing time: ~1-5 seconds for 1000 samples | |
| - Network bandwidth: Depends on sample size | |
| ## Future Enhancements | |
| 1. **Parallel requests**: Fetch statistics for multiple splits simultaneously | |
| 2. **Partial statistics**: Support datasets with partial statistics | |
| 3. **Custom aggregations**: Add more statistical measures | |
| 4. **Visualization**: Generate plots from histogram data | |
| ## References | |
| - [HuggingFace Dataset Viewer Documentation](https://huggingface.co/docs/dataset-viewer/info) | |
| - [Statistics Endpoint Specification](https://huggingface.co/docs/dataset-viewer/statistics) | |