# Dataset Viewer Statistics Endpoint Integration

## Overview

The HuggingFace Dataset Viewer API provides a `/statistics` endpoint that returns pre-computed statistics for datasets with `builder_name="parquet"`. Because the statistics cover the full split and require no data download, this endpoint is more efficient and more complete than sample-based analysis.
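
As a rough illustration of what a request to this endpoint looks like, the sketch below builds the URL for one dataset, config, and split. The base URL is the public Dataset Viewer host and is hard-coded here as an assumption. `build_statistics_url` is a hypothetical helper and is not part of hf-eda-mcp.

```python
from urllib.parse import urlencode

# Public Dataset Viewer host (assumed; adjust if you target another deployment).
DATASET_VIEWER_BASE_URL = "https://datasets-server.huggingface.co"


def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Return the /statistics URL for one dataset/config/split triple."""
    # urlencode percent-escapes the "/" in dataset IDs such as "stanfordnlp/imdb".
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{DATASET_VIEWER_BASE_URL}/statistics?{query}"


url = build_statistics_url("stanfordnlp/imdb", "plain_text", "train")
print(url)
```

Fetching this URL with any HTTP client (plus an authorization header for private datasets) should return the JSON payloads shown in the sections that follow.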

## Key Benefits

### 1. Full Dataset Coverage
- **Before**: Analysis based on samples (default 1,000 examples)
- **After**: Statistics computed on the entire dataset (e.g., 25,000 examples for IMDB train split)

### 2. No Data Download Required
- **Before**: Download and process samples from the dataset
- **After**: Retrieve pre-computed statistics via API call

### 3. More Complete Statistics
The endpoint provides detailed statistics for multiple modalities:

#### Numerical Features (int, float)
- **Basic statistics**: min, max, mean, median, std
- **Missing values**: nan_count, nan_proportion
- **Distribution**: histogram with bin_edges and hist counts

Example response:
```json
{
  "column_type": "float",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 0,
    "max": 2,
    "mean": 1.67206,
    "median": 1.8,
    "std": 0.38714,
    "histogram": {
      "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
      "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2]
    }
  }
}
```
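
To make the `hist` / `bin_edges` layout concrete, here is a minimal sketch that pairs each count with its bin boundaries and computes the share of values per bin. `histogram_bins` is a hypothetical helper name. It assumes `bin_edges` always has exactly one more entry than `hist`, as in the example above.

```python
def histogram_bins(histogram: dict) -> list[dict]:
    """Pair each count in `hist` with its low/high edges and its share of all values."""
    counts = histogram["hist"]
    edges = histogram["bin_edges"]
    if len(edges) != len(counts) + 1:
        raise ValueError("expected one more bin edge than histogram counts")
    total = sum(counts)
    return [
        {
            "low": edges[i],
            "high": edges[i + 1],
            "count": count,
            "share": count / total if total else 0.0,
        }
        for i, count in enumerate(counts)
    ]


# Histogram copied from the float example above.
float_histogram = {
    "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
    "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2],
}

bins = histogram_bins(float_histogram)
busiest = max(bins, key=lambda b: b["count"])
print(f"{busiest['low']}-{busiest['high']}: {busiest['share']:.1%} of values")
```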

#### Categorical Features (class_label, string_label)
- **Unique values**: n_unique count
- **Frequencies**: Complete frequency distribution for all categories
- **Missing values**: nan_count, nan_proportion
- **No label tracking**: no_label_count, no_label_proportion (for class_label)

Example response:
```json
{
  "column_type": "class_label",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "no_label_count": 0,
    "no_label_proportion": 0,
    "n_unique": 2,
    "frequencies": {
      "unacceptable": 2528,
      "acceptable": 6023
    }
  }
}
```
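
Because `frequencies` covers every category, class balance can be derived directly from the response. The sketch below is one possible way to do that. `class_balance` and its output keys are hypothetical and are not part of the API or of hf-eda-mcp.

```python
def class_balance(column_statistics: dict) -> dict:
    """Summarize label shares and the majority/minority imbalance ratio."""
    frequencies = column_statistics["frequencies"]
    total = sum(frequencies.values())
    if total == 0:
        raise ValueError("no labeled examples to summarize")
    majority = max(frequencies, key=frequencies.get)
    minority = min(frequencies, key=frequencies.get)
    minority_count = frequencies[minority]
    return {
        "shares": {label: count / total for label, count in frequencies.items()},
        "majority": majority,
        "minority": minority,
        # Guard against a label that is declared but never used.
        "imbalance_ratio": (
            frequencies[majority] / minority_count if minority_count else float("inf")
        ),
    }


# Values copied from the class_label example above.
balance = class_balance(
    {"n_unique": 2, "frequencies": {"unacceptable": 2528, "acceptable": 6023}}
)
print(balance["majority"], round(balance["imbalance_ratio"], 2))
```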

#### Text Features (string_text)
- **Length statistics**: min, max, mean, median, std (character count)
- **Missing values**: nan_count, nan_proportion
- **Distribution**: histogram of text lengths

Example response:
```json
{
  "column_type": "string_text",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 6,
    "max": 231,
    "mean": 40.70074,
    "median": 37,
    "std": 19.14431,
    "histogram": {
      "hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1],
      "bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231]
    }
  }
}
```

#### Boolean Features (bool)
- **Frequencies**: Distribution of True/False values
- **Missing values**: nan_count, nan_proportion

Example response:
```json
{
  "column_type": "bool",
  "column_statistics": {
    "nan_count": 3,
    "nan_proportion": 0.15,
    "frequencies": {
      "False": 7,
      "True": 10
    }
  }
}
```
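
The figures in this example suggest that `frequencies` counts only non-null values and that `nan_proportion` is `nan_count` divided by the total number of rows. Under that assumption, the proportion can be recomputed as a sanity check. `recompute_nan_proportion` is a hypothetical helper.

```python
import math


def recompute_nan_proportion(column_statistics: dict) -> float:
    """Recompute nan_proportion, assuming frequencies exclude missing values."""
    nan_count = column_statistics["nan_count"]
    non_null = sum(column_statistics["frequencies"].values())
    total = nan_count + non_null
    return nan_count / total if total else 0.0


# Values copied from the bool example above: 3 missing, 7 False, 10 True.
bool_stats = {
    "nan_count": 3,
    "nan_proportion": 0.15,
    "frequencies": {"False": 7, "True": 10},
}

recomputed = recompute_nan_proportion(bool_stats)
print(math.isclose(recomputed, bool_stats["nan_proportion"]))
```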

#### Image Features (image)
- **Dimension statistics**: min, max, mean, median, std (for width/height)
- **Missing values**: nan_count, nan_proportion
- **Distribution**: histogram of image dimensions

Example response:
```json
{
  "column_type": "image",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0.0,
    "min": 256,
    "max": 873,
    "mean": 327.99339,
    "median": 341.0,
    "std": 60.07286,
    "histogram": {
      "hist": [1734, 1637, 1326, 121, 10, 3, 1, 3, 1, 2],
      "bin_edges": [256, 318, 380, 442, 504, ...]
    }
  }
}
```

#### Audio Features (audio)
- **Duration statistics**: min, max, mean, median, std (in seconds)
- **Missing values**: nan_count, nan_proportion
- **Distribution**: histogram of audio durations

Example response:
```json
{
  "column_type": "audio",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0,
    "min": 1.02,
    "max": 15,
    "mean": 13.93042,
    "median": 14.77,
    "std": 2.63734,
    "histogram": {
      "hist": [32, 25, 18, 24, 22, 17, 18, 19, 55, 1770],
      "bin_edges": [1.02, 2.418, 3.816, 5.214, 6.612, ...]
    }
  }
}
```

#### List Features (list)
- **Length statistics**: min, max, mean, median, std (list length)
- **Missing values**: nan_count, nan_proportion
- **Distribution**: histogram of list lengths

Example response:
```json
{
  "column_type": "list",
  "column_statistics": {
    "nan_count": 0,
    "nan_proportion": 0.0,
    "min": 1,
    "max": 3,
    "mean": 1.01741,
    "median": 1.0,
    "std": 0.13146,
    "histogram": {
      "hist": [11177, 196, 1],
      "bin_edges": [1, 2, 3, 3]
    }
  }
}
```

## Implementation

### Architecture

```
analyze_dataset_features()
    ↓
    Try: get_dataset_statistics() [Dataset Viewer API]
    ↓
    If available (parquet format):
        → Use full dataset statistics
        → Cache results
        → Return converted analysis
    ↓
    If not available:
        → Fall back to sample-based analysis
        → Load samples via streaming
        → Compute statistics locally
```
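
The flow above can be sketched as a small function that takes the two strategies as callables. This is a simplified illustration, not the actual `analyze_dataset_features()` implementation: `analyze_with_fallback` and both parameters are hypothetical, and a real implementation would catch narrower exception types and also write to the cache.

```python
from typing import Callable, Optional


def analyze_with_fallback(
    fetch_statistics: Callable[[], Optional[dict]],
    analyze_samples: Callable[[], dict],
) -> dict:
    """Prefer Dataset Viewer statistics; fall back to sample-based analysis."""
    try:
        stats = fetch_statistics()
    except Exception:
        # Any API failure is treated the same as "statistics not available".
        stats = None

    if stats:
        return {"sampling_method": "dataset_viewer_api", "statistics": stats}

    # No statistics (e.g. non-parquet dataset): compute them locally from samples.
    return {"sampling_method": "sequential_head", "statistics": analyze_samples()}


def _unavailable() -> Optional[dict]:
    raise RuntimeError("statistics endpoint returned an error")


fallback_result = analyze_with_fallback(_unavailable, lambda: {"count": 1000})
print(fallback_result["sampling_method"])
```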

### Key Components

#### 1. DatasetViewerAdapter
- `get_dataset_statistics()`: Fetch statistics from API
- `check_statistics_availability()`: Check if statistics are available for a dataset

#### 2. DatasetService
- `get_dataset_statistics()`: Wrapper with caching and error handling
- Automatic fallback to sample-based analysis
- Statistics cache directory: `cache/statistics/`

#### 3. Analysis Tool
- `_convert_viewer_statistics_to_analysis()`: Convert API format to our analysis format
- Seamless integration with existing analysis pipeline

### Caching Strategy

Statistics are cached with the same TTL as other metadata (default: 1 hour):

```
cache/
├── metadata/          # Dataset metadata
├── samples/           # Sample data
└── statistics/        # Dataset Viewer statistics
    └── {dataset}_{config}_{split}_stats.json
```
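
A file-based cache with a TTL could look roughly like the sketch below. It is an illustration under stated assumptions rather than the project's actual cache code: all function names are hypothetical, the file's modification time is used as the cache timestamp, and `/` in dataset IDs is replaced with `--` so the ID fits in a single file name.

```python
import json
import time
from pathlib import Path
from typing import Optional

CACHE_TTL_SECONDS = 3600  # 1 hour, matching the default TTL described above


def stats_cache_path(cache_dir: Path, dataset: str, config: str, split: str) -> Path:
    """Build cache/statistics/{dataset}_{config}_{split}_stats.json."""
    safe_dataset = dataset.replace("/", "--")  # assumption: keep the ID file-name safe
    return cache_dir / "statistics" / f"{safe_dataset}_{config}_{split}_stats.json"


def write_cached_stats(path: Path, stats: dict) -> None:
    """Persist statistics as JSON, creating the directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(stats), encoding="utf-8")


def read_cached_stats(path: Path, ttl: float = CACHE_TTL_SECONDS) -> Optional[dict]:
    """Return cached statistics, or None if the file is missing or older than ttl."""
    if not path.exists():
        return None
    age_seconds = time.time() - path.stat().st_mtime
    if age_seconds > ttl:
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

A caller would check `read_cached_stats()` first and only hit the API, followed by `write_cached_stats()`, when it returns `None`.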

## Usage Examples

### Automatic Selection

```python
from hf_eda_mcp.tools.analysis import analyze_dataset_features

# Automatically uses Dataset Viewer statistics if available
result = analyze_dataset_features(
    dataset_id="stanfordnlp/imdb",
    split="train"
)

# Check which method was used
print(result['sample_info']['sampling_method'])
# Output: "dataset_viewer_api" or "sequential_head"

print(result['sample_info']['represents_full_dataset'])
# Output: True (full dataset) or False (sample)
```

### Check Availability

```python
from hf_eda_mcp.services.dataset_viewer_adapter import DatasetViewerAdapter

adapter = DatasetViewerAdapter(token="your_token")
availability = adapter.check_statistics_availability("stanfordnlp/imdb")

print(availability)
# {
#   'available': True,
#   'configs': ['plain_text'],
#   'reason': 'Statistics available for 1 config(s)'
# }
```

### Direct Statistics Access

```python
from hf_eda_mcp.services.dataset_service import DatasetService

service = DatasetService(token="your_token")
stats = service.get_dataset_statistics(
    dataset_id="stanfordnlp/imdb",
    split="train",
    config_name="plain_text"
)

if stats:
    print(f"Full dataset: {stats['num_examples']} examples")
    print(f"Columns: {len(stats['statistics'])}")
else:
    print("Statistics not available, use sample-based analysis")
```

## Comparison: Before vs After

### IMDB Dataset Example

#### Before (Sample-based)
```python
{
  'dataset_info': {
    'sample_size_used': 1000,
    'sample_size_requested': 1000,
  },
  'sample_info': {
    'sampling_method': 'sequential_head',
    'represents_full_dataset': True,  # Only if sample >= requested
  },
  'features': {
    'text': {
      'feature_type': 'text',
      'statistics': {
        'count': 1000,
        'avg_length': 1311.289,
        'min_length': 65,
        'max_length': 6103,
        # Limited to sample
      }
    }
  },
  'summary': 'Analyzed 2 features from 1000 samples | Types: 1 categorical, 1 text'
}
```

#### After (Dataset Viewer)
```python
{
  'dataset_info': {
    'sample_size_used': 25000,  # Full dataset
    'sample_size_requested': 25000,
  },
  'sample_info': {
    'sampling_method': 'dataset_viewer_api',
    'represents_full_dataset': True,  # Always true
    'partial': False
  },
  'features': {
    'text': {
      'feature_type': 'text',
      'statistics': {
        'count': 25000,  # Full dataset
        'mean_length': 1325.06964,
        'min_length': 52,
        'max_length': 13704,
        'histogram': {
          'bin_edges': [52, 1418, 2784, ...],
          'hist': [17426, 5384, 1490, ...]
        }
      }
    }
  },
  'summary': 'Analyzed 2 features from 25000 samples | Types: 1 categorical, 1 text'
}
```
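
Using the figures from the two outputs above, a quick calculation shows how far the sample-based values drift from the full-dataset values. `relative_difference` is a hypothetical helper written for this comparison.

```python
def relative_difference(sample_value: float, full_value: float) -> float:
    """Absolute gap between sample and full-dataset values, relative to the full value."""
    return abs(sample_value - full_value) / abs(full_value)


# Text-length figures copied from the "Before" and "After" outputs above.
mean_gap = relative_difference(1311.289, 1325.06964)
max_gap = relative_difference(6103, 13704)

print(f"mean length gap: {mean_gap:.1%}")
print(f"max length gap:  {max_gap:.1%}")
```

In this example the sample mean is close to the full-dataset mean, while the sample maximum misses more than half of the true maximum length. Extremes are where sample-based analysis tends to be least reliable.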

## Supported Data Types

The Dataset Viewer statistics endpoint supports comprehensive analysis for multiple data types:

| Data Type | Feature Type | Statistics Provided |
|-----------|--------------|---------------------|
| `int`, `float` | numerical | min, max, mean, median, std, histogram |
| `class_label`, `string_label` | categorical | frequencies, n_unique, no_label tracking |
| `bool` | boolean | True/False frequencies |
| `string_text` | text | character length stats (min, max, mean, median, std), histogram |
| `image` | image | dimension statistics, histogram |
| `audio` | audio | duration statistics (seconds), histogram |
| `list` | list | length statistics, histogram |

### Data Type Mapping

Our analysis tool automatically maps Dataset Viewer types to our internal types:

```
Dataset Viewer Type → Our Feature Type
─────────────────────────────────────
int, float          → numerical
class_label         → categorical
string_label        → categorical
bool                → boolean
string_text         → text
image               → image
audio               → audio
list                → list
```
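
In code, the mapping above can be expressed as a plain dictionary lookup. The sketch below is illustrative: the constant and function names are hypothetical, and returning `"unknown"` for unrecognized types is an assumption, not documented behavior.

```python
# Dataset Viewer column_type -> internal feature type, as listed above.
VIEWER_TYPE_TO_FEATURE_TYPE = {
    "int": "numerical",
    "float": "numerical",
    "class_label": "categorical",
    "string_label": "categorical",
    "bool": "boolean",
    "string_text": "text",
    "image": "image",
    "audio": "audio",
    "list": "list",
}


def map_viewer_type(column_type: str) -> str:
    """Translate a Dataset Viewer column_type into an internal feature type."""
    # Assumption: unrecognized types fall back to "unknown" instead of raising.
    return VIEWER_TYPE_TO_FEATURE_TYPE.get(column_type, "unknown")


print(map_viewer_type("string_text"))
```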

## Limitations

### Dataset Requirements
- Only works for datasets with `builder_name="parquet"`
- Not all datasets on HuggingFace Hub have this format
- Automatic fallback to sample-based analysis for other formats

### API Availability
- Requires internet connection
- Subject to HuggingFace API rate limits
- May fail for private datasets without proper authentication

## Error Handling

The implementation handles missing or unreachable statistics as follows:

1. **Check availability first**: Verify dataset supports statistics
2. **Graceful fallback**: Automatically use sample-based analysis if unavailable
3. **Caching**: Reduce API calls and improve performance
4. **Logging**: Clear messages about which method is being used

## Performance Impact

### API Call Overhead
- Initial call: ~1-2 seconds
- Cached calls: <10ms
- No data download required

### Sample-based Analysis
- Download time: Varies by dataset size
- Processing time: ~1-5 seconds for 1000 samples
- Network bandwidth: Depends on sample size

## Future Enhancements

1. **Parallel requests**: Fetch statistics for multiple splits simultaneously
2. **Partial statistics**: Support datasets with partial statistics
3. **Custom aggregations**: Add more statistical measures
4. **Visualization**: Generate plots from histogram data
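
As a starting point for the visualization idea, histogram data can already be rendered as a text bar chart without any plotting library. This is only a sketch of one possible approach. `render_histogram` is a hypothetical helper, and bars are scaled relative to the largest bin.

```python
def render_histogram(histogram: dict, width: int = 40) -> list[str]:
    """Render one text line per bin, with bar length proportional to the count."""
    counts = histogram["hist"]
    edges = histogram["bin_edges"]
    peak = max(counts) if counts else 0
    lines = []
    for i, count in enumerate(counts):
        bar_length = round(width * count / peak) if peak else 0
        lines.append(f"{edges[i]:>6} to {edges[i + 1]:>6} | {'#' * bar_length} {count}")
    return lines


# Histogram copied from the string_text example earlier in this document.
text_length_histogram = {
    "hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1],
    "bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231],
}

chart_lines = render_histogram(text_length_histogram)
print("\n".join(chart_lines))
```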

## References

- [HuggingFace Dataset Viewer Documentation](https://huggingface.co/docs/dataset-viewer/info)
- [Statistics Endpoint Specification](https://huggingface.co/docs/dataset-viewer/statistics)