KhalilGuetari

Add a search text in dataset tool

ca96eb9 18 days ago

	# Dataset Viewer Statistics Endpoint Integration

	## Overview

	The HuggingFace Dataset Viewer API exposes a `/statistics` endpoint that returns pre-computed, per-column statistics for datasets with `builder_name="parquet"`. Because these statistics are computed over the full split on the server, querying the endpoint is faster and more complete than downloading a sample and analyzing it locally.
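As a rough illustration of what a request to the endpoint looks like, the sketch below builds the query URL for one split. It assumes the public Dataset Viewer host `https://datasets-server.huggingface.co` and the `dataset`, `config`, and `split` query parameters; the helper name `build_statistics_url` is hypothetical and is not part of this project.

```python
from urllib.parse import urlencode

# Assumed public host of the Dataset Viewer API.
API_BASE = "https://datasets-server.huggingface.co"


def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Return the /statistics URL for one dataset split (hypothetical helper)."""
    # urlencode percent-encodes the "/" in ids such as "stanfordnlp/imdb".
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{API_BASE}/statistics?{query}"


url = build_statistics_url("stanfordnlp/imdb", "plain_text", "train")
print(url)
```

The resulting URL can be fetched with any HTTP client; the response body is the JSON document whose per-column entries are described in the sections that follow.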

	## Key Benefits

	### 1. Full Dataset Coverage
	- Before: Analysis based on samples (default 1,000 examples)
	- After: Statistics computed on the entire dataset (e.g., 25,000 examples for IMDB train split)

	### 2. No Data Download Required
	- Before: Download and process samples from the dataset
	- After: Retrieve pre-computed statistics via API call

	### 3. More Complete Statistics
	The endpoint provides detailed statistics for multiple modalities:

	#### Numerical Features (int, float)
	- Basic statistics: min, max, mean, median, std
	- Missing values: nan_count, nan_proportion
	- Distribution: histogram with bin_edges and hist counts

	Example response:
	```json
	{
	  "column_type": "float",
	  "column_statistics": {
	    "nan_count": 0,
	    "nan_proportion": 0,
	    "min": 0,
	    "max": 2,
	    "mean": 1.67206,
	    "median": 1.8,
	    "std": 0.38714,
	    "histogram": {
	      "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
	      "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2]
	    }
	  }
	}
	```
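To read a histogram like the one above, each count in `hist` is paired with two consecutive entries of `bin_edges`. The following sketch, using a hypothetical helper `histogram_bins`, does that pairing and adds each bin's share of the counted values. It assumes there is always exactly one more edge than counts, which holds for every example in this document.

```python
def histogram_bins(histogram: dict) -> list[tuple[float, float, int, float]]:
    """Return (lower_edge, upper_edge, count, share) for every bin."""
    counts = histogram["hist"]
    edges = histogram["bin_edges"]
    if len(edges) != len(counts) + 1:
        raise ValueError("expected one more bin edge than counts")
    total = sum(counts)
    return [
        (edges[i], edges[i + 1], count, count / total if total else 0.0)
        for i, count in enumerate(counts)
    ]


# Histogram copied from the numerical example above.
example = {
    "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
    "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2],
}
bins = histogram_bins(example)

# The modal bin is the one holding the most values.
lower, upper, count, share = max(bins, key=lambda b: b[2])
print(f"modal bin [{lower}, {upper}]: {count} values ({share:.1%})")
```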

	#### Categorical Features (class_label, string_label)
	- Unique values: n_unique count
	- Frequencies: Complete frequency distribution for all categories
	- Missing values: nan_count, nan_proportion
	- No label tracking: no_label_count, no_label_proportion (for class_label)

	Example response:
	```json
	{
	  "column_type": "class_label",
	  "column_statistics": {
	    "nan_count": 0,
	    "nan_proportion": 0,
	    "no_label_count": 0,
	    "no_label_proportion": 0,
	    "n_unique": 2,
	    "frequencies": {
	      "unacceptable": 2528,
	      "acceptable": 6023
	    }
	  }
	}
	```
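Since `frequencies` holds the complete label distribution, class balance can be derived directly from the response without touching the data. A minimal sketch, where `class_balance` is a hypothetical helper name:

```python
def class_balance(column_statistics: dict) -> dict[str, float]:
    """Convert absolute label frequencies into proportions."""
    frequencies = column_statistics["frequencies"]
    total = sum(frequencies.values())
    if total == 0:
        return {label: 0.0 for label in frequencies}
    return {label: count / total for label, count in frequencies.items()}


# Frequencies copied from the categorical example above.
stats = {"frequencies": {"unacceptable": 2528, "acceptable": 6023}}
balance = class_balance(stats)

for label, proportion in balance.items():
    print(f"{label}: {proportion:.1%}")
```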

	#### Text Features (string_text)
	- Length statistics: min, max, mean, median, std (character count)
	- Missing values: nan_count, nan_proportion
	- Distribution: histogram of text lengths

	Example response:
	```json
	{
	  "column_type": "string_text",
	  "column_statistics": {
	    "nan_count": 0,
	    "nan_proportion": 0,
	    "min": 6,
	    "max": 231,
	    "mean": 40.70074,
	    "median": 37,
	    "std": 19.14431,
	    "histogram": {
	      "hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1],
	      "bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231]
	    }
	  }
	}
	```

	#### Boolean Features (bool)
	- Frequencies: Distribution of True/False values
	- Missing values: nan_count, nan_proportion

	Example response:
	```json
	{
	  "column_type": "bool",
	  "column_statistics": {
	    "nan_count": 3,
	    "nan_proportion": 0.15,
	    "frequencies": {
	      "False": 7,
	      "True": 10
	    }
	  }
	}
	```
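The numbers in this example are internally consistent if `frequencies` counts only non-null rows and `nan_proportion` is `nan_count` divided by the total row count: 3 / (3 + 7 + 10) = 0.15. The sketch below reproduces that arithmetic. The interpretation is an assumption drawn from this example, and `total_rows` is a hypothetical helper.

```python
import math


def total_rows(column_statistics: dict) -> int:
    """Total rows, assuming `frequencies` excludes null values."""
    return column_statistics["nan_count"] + sum(
        column_statistics["frequencies"].values()
    )


# Values copied from the boolean example above.
stats = {
    "nan_count": 3,
    "nan_proportion": 0.15,
    "frequencies": {"False": 7, "True": 10},
}

rows = total_rows(stats)
recomputed = stats["nan_count"] / rows
consistent = math.isclose(recomputed, stats["nan_proportion"])
print(rows, recomputed, consistent)
```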

	#### Image Features (image)
	- Dimension statistics: min, max, mean, median, std (for width/height)
	- Missing values: nan_count, nan_proportion
	- Distribution: histogram of image dimensions

	Example response:
	```json
	{
	  "column_type": "image",
	  "column_statistics": {
	    "nan_count": 0,
	    "nan_proportion": 0.0,
	    "min": 256,
	    "max": 873,
	    "mean": 327.99339,
	    "median": 341.0,
	    "std": 60.07286,
	    "histogram": {
	      "hist": [1734, 1637, 1326, 121, 10, 3, 1, 3, 1, 2],
	      "bin_edges": [256, 318, 380, 442, 504, ...]
	    }
	  }
	}
	```

	#### Audio Features (audio)
	- Duration statistics: min, max, mean, median, std (in seconds)
	- Missing values: nan_count, nan_proportion
	- Distribution: histogram of audio durations

	Example response:
	```json
	{
	  "column_type": "audio",
	  "column_statistics": {
	    "nan_count": 0,
	    "nan_proportion": 0,
	    "min": 1.02,
	    "max": 15,
	    "mean": 13.93042,
	    "median": 14.77,
	    "std": 2.63734,
	    "histogram": {
	      "hist": [32, 25, 18, 24, 22, 17, 18, 19, 55, 1770],
	      "bin_edges": [1.02, 2.418, 3.816, 5.214, 6.612, ...]
	    }
	  }
	}
	```

	#### List Features (list)
	- Length statistics: min, max, mean, median, std (list length)
	- Missing values: nan_count, nan_proportion
	- Distribution: histogram of list lengths

	Example response:
	```json
	{
	  "column_type": "list",
	  "column_statistics": {
	    "nan_count": 0,
	    "nan_proportion": 0.0,
	    "min": 1,
	    "max": 3,
	    "mean": 1.01741,
	    "median": 1.0,
	    "std": 0.13146,
	    "histogram": {
	      "hist": [11177, 196, 1],
	      "bin_edges": [1, 2, 3, 3]
	    }
	  }
	}
	```

	## Implementation

	### Architecture

	```
	analyze_dataset_features()
	    ↓
	Try: get_dataset_statistics() [Dataset Viewer API]
	    ↓
	If available (parquet format):
	    → Use full dataset statistics
	    → Cache results
	    → Return converted analysis
	    ↓
	If not available:
	    → Fall back to sample-based analysis
	    → Load samples via streaming
	    → Compute statistics locally
	```
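The flow above can be sketched as a small try-then-fall-back function. This is not the project's actual code: `analyze_with_fallback` and its two callable parameters are hypothetical stand-ins for the API fetch and the sample-based path, and the returned keys only mirror the `sample_info` fields shown later in this document.

```python
from typing import Callable, Optional


def analyze_with_fallback(
    fetch_statistics: Callable[[], Optional[dict]],
    analyze_sample: Callable[[], dict],
) -> dict:
    """Prefer full-dataset statistics; fall back to sample-based analysis."""
    try:
        stats = fetch_statistics()
    except Exception:
        # Any failure (network error, rate limit, unsupported dataset)
        # is treated as "statistics not available".
        stats = None

    if stats is not None:
        return {
            "sampling_method": "dataset_viewer_api",
            "represents_full_dataset": True,
            "data": stats,
        }
    return {
        "sampling_method": "sequential_head",
        "represents_full_dataset": False,
        "data": analyze_sample(),
    }


def failing_fetch() -> Optional[dict]:
    raise ConnectionError("Dataset Viewer API unreachable")


from_api = analyze_with_fallback(lambda: {"num_examples": 25000}, lambda: {})
from_sample = analyze_with_fallback(failing_fetch, lambda: {"count": 1000})
print(from_api["sampling_method"], from_sample["sampling_method"])
```

Passing the two paths in as callables keeps the decision logic testable without network access.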

	### Key Components

	#### 1. DatasetViewerAdapter
	- `get_dataset_statistics()`: Fetch statistics from API
	- `check_statistics_availability()`: Check if statistics are available for a dataset

	#### 2. DatasetService
	- `get_dataset_statistics()`: Wrapper with caching and error handling
	- Automatic fallback to sample-based analysis
	- Statistics cache directory: `cache/statistics/`

	#### 3. Analysis Tool
	- `_convert_viewer_statistics_to_analysis()`: Convert API format to our analysis format
	- Seamless integration with existing analysis pipeline

	### Caching Strategy

	Statistics are cached with the same TTL as other metadata (default: 1 hour):

	```
	cache/
	├── metadata/ # Dataset metadata
	├── samples/ # Sample data
	└── statistics/ # Dataset Viewer statistics
	    └── {dataset}_{config}_{split}_stats.json
	```
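A file-based cache with a TTL along these lines could look like the sketch below. It is an illustration under stated assumptions rather than the project's implementation: the helper names are hypothetical, freshness is judged by file modification time, and the `/` in dataset ids is replaced with `--` so the name stays a single file (the original does not say how this is handled).

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Optional

TTL_SECONDS = 3600  # default TTL of 1 hour, as stated above


def stats_cache_path(cache_dir: Path, dataset: str, config: str, split: str) -> Path:
    # Assumption: "/" in the dataset id is replaced to keep a flat file name.
    safe_dataset = dataset.replace("/", "--")
    return cache_dir / "statistics" / f"{safe_dataset}_{config}_{split}_stats.json"


def read_cached_stats(path: Path, ttl: float = TTL_SECONDS) -> Optional[dict]:
    """Return cached statistics, or None if missing or older than `ttl`."""
    if not path.exists():
        return None
    if time.time() - path.stat().st_mtime > ttl:
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def write_cached_stats(path: Path, stats: dict) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(stats), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    path = stats_cache_path(Path(tmp), "stanfordnlp/imdb", "plain_text", "train")
    miss = read_cached_stats(path)               # nothing cached yet
    write_cached_stats(path, {"num_examples": 25000})
    hit = read_cached_stats(path)                # fresh entry
    expired = read_cached_stats(path, ttl=-1)    # negative TTL forces expiry

print(path.name, miss, hit, expired)
```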

	## Usage Examples

	### Automatic Selection

	```python
	from hf_eda_mcp.tools.analysis import analyze_dataset_features

	# Automatically uses Dataset Viewer statistics if available
	result = analyze_dataset_features(
	    dataset_id="stanfordnlp/imdb",
	    split="train"
	)

	# Check which method was used
	print(result['sample_info']['sampling_method'])
	# Output: "dataset_viewer_api" or "sequential_head"

	print(result['sample_info']['represents_full_dataset'])
	# Output: True (full dataset) or False (sample)
	```

	### Check Availability

	```python
	from hf_eda_mcp.services.dataset_viewer_adapter import DatasetViewerAdapter

	adapter = DatasetViewerAdapter(token="your_token")
	availability = adapter.check_statistics_availability("stanfordnlp/imdb")

	print(availability)
	# {
	#     'available': True,
	#     'configs': ['plain_text'],
	#     'reason': 'Statistics available for 1 config(s)'
	# }
	```

	### Direct Statistics Access

	```python
	from hf_eda_mcp.services.dataset_service import DatasetService

	service = DatasetService(token="your_token")
	stats = service.get_dataset_statistics(
	    dataset_id="stanfordnlp/imdb",
	    split="train",
	    config_name="plain_text"
	)

	if stats:
	    print(f"Full dataset: {stats['num_examples']} examples")
	    print(f"Columns: {len(stats['statistics'])}")
	else:
	    print("Statistics not available, use sample-based analysis")
	```

	## Comparison: Before vs After

	### IMDB Dataset Example

	#### Before (Sample-based)
	```python
	{
	    'dataset_info': {
	        'sample_size_used': 1000,
	        'sample_size_requested': 1000,
	    },
	    'sample_info': {
	        'sampling_method': 'sequential_head',
	        'represents_full_dataset': True,  # Only if sample >= requested
	    },
	    'features': {
	        'text': {
	            'feature_type': 'text',
	            'statistics': {
	                'count': 1000,
	                'avg_length': 1311.289,
	                'min_length': 65,
	                'max_length': 6103,
	                # Limited to sample
	            }
	        }
	    },
	    'summary': 'Analyzed 2 features from 1000 samples | Types: 1 categorical, 1 text'
	}
	```

	#### After (Dataset Viewer)
	```python
	{
	    'dataset_info': {
	        'sample_size_used': 25000,  # Full dataset
	        'sample_size_requested': 25000,
	    },
	    'sample_info': {
	        'sampling_method': 'dataset_viewer_api',
	        'represents_full_dataset': True,  # Always true
	        'partial': False
	    },
	    'features': {
	        'text': {
	            'feature_type': 'text',
	            'statistics': {
	                'count': 25000,  # Full dataset
	                'mean_length': 1325.06964,
	                'min_length': 52,
	                'max_length': 13704,
	                'histogram': {
	                    'bin_edges': [52, 1418, 2784, ...],
	                    'hist': [17426, 5384, 1490, ...]
	                }
	            }
	        }
	    },
	    'summary': 'Analyzed 2 features from 25000 samples | Types: 1 categorical, 1 text'
	}
	```

	## Supported Data Types

	The Dataset Viewer statistics endpoint supports comprehensive analysis for multiple data types:

	| Data Type | Feature Type | Statistics Provided |
	|-----------|--------------|---------------------|
	| `int`, `float` | numerical | min, max, mean, median, std, histogram |
	| `class_label`, `string_label` | categorical | frequencies, n_unique, no_label tracking |
	| `bool` | boolean | True/False frequencies |
	| `string_text` | text | character length stats (min, max, mean, median, std), histogram |
	| `image` | image | dimension statistics, histogram |
	| `audio` | audio | duration statistics (seconds), histogram |
	| `list` | list | length statistics, histogram |

	### Data Type Mapping

	Our analysis tool automatically maps Dataset Viewer types to our internal types:

	```
	Dataset Viewer Type → Our Feature Type
	─────────────────────────────────────
	int, float          → numerical
	class_label         → categorical
	string_label        → categorical
	bool                → boolean
	string_text         → text
	image               → image
	audio               → audio
	list                → list
	```
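In code, the mapping above amounts to a lookup table. The sketch below is a simplified illustration, not the body of `_convert_viewer_statistics_to_analysis()`: the names `map_feature_type` and `convert_columns` are hypothetical, the `"unknown"` fallback for unmapped types is an assumption, and each entry of the response's `statistics` list is assumed to carry `column_name`, `column_type`, and `column_statistics`.

```python
# Dataset Viewer column type -> internal feature type (from the table above).
VIEWER_TO_FEATURE_TYPE = {
    "int": "numerical",
    "float": "numerical",
    "class_label": "categorical",
    "string_label": "categorical",
    "bool": "boolean",
    "string_text": "text",
    "image": "image",
    "audio": "audio",
    "list": "list",
}


def map_feature_type(column_type: str) -> str:
    # Assumption: unmapped types are reported as "unknown" instead of failing.
    return VIEWER_TO_FEATURE_TYPE.get(column_type, "unknown")


def convert_columns(viewer_response: dict) -> dict:
    """Key each column's statistics by name and attach the mapped type."""
    features = {}
    for column in viewer_response.get("statistics", []):
        features[column["column_name"]] = {
            "feature_type": map_feature_type(column["column_type"]),
            "statistics": column["column_statistics"],
        }
    return features


# Illustrative response fragment; the values are placeholders.
response = {
    "num_examples": 25000,
    "statistics": [
        {"column_name": "label", "column_type": "class_label",
         "column_statistics": {"n_unique": 2}},
        {"column_name": "text", "column_type": "string_text",
         "column_statistics": {"min": 52, "max": 13704}},
    ],
}
features = convert_columns(response)
print({name: info["feature_type"] for name, info in features.items()})
```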

	## Limitations

	### Dataset Requirements
	- Only works for datasets with `builder_name="parquet"`
	- Not all datasets on HuggingFace Hub have this format
	- Automatic fallback to sample-based analysis for other formats

	### API Availability
	- Requires internet connection
	- Subject to HuggingFace API rate limits
	- May fail for private datasets without proper authentication
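For private or gated datasets, requests to the Hugging Face APIs are typically authenticated with a bearer token in the `Authorization` header. A minimal sketch of building such headers follows; `auth_headers` is a hypothetical helper, and how `DatasetViewerAdapter` passes its `token` internally is not shown in this document.

```python
from typing import Optional


def auth_headers(token: Optional[str]) -> dict[str, str]:
    """Return request headers, adding a bearer token only when one is given."""
    if not token:
        # Anonymous access: sufficient for public datasets.
        return {}
    return {"Authorization": f"Bearer {token}"}


anonymous = auth_headers(None)
authenticated = auth_headers("your_token")
print(anonymous, authenticated)
```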

	## Error Handling

	The implementation handles unavailable or failing statistics requests in four ways:

	1. Check availability first: Verify dataset supports statistics
	2. Graceful fallback: Automatically use sample-based analysis if unavailable
	3. Caching: Reduce API calls and improve performance
	4. Logging: Clear messages about which method is being used

	## Performance Impact

	### API Call Overhead
	- Initial call: ~1-2 seconds
	- Cached calls: <10ms
	- No data download required

	### Sample-based Analysis
	- Download time: Varies by dataset size
	- Processing time: ~1-5 seconds for 1000 samples
	- Network bandwidth: Depends on sample size

	## Future Enhancements

	1. Parallel requests: Fetch statistics for multiple splits simultaneously
	2. Partial statistics: Support datasets with partial statistics
	3. Custom aggregations: Add more statistical measures
	4. Visualization: Generate plots from histogram data
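As an idea of what the visualization item could build on, the `hist` and `bin_edges` arrays are already enough to draw a simple text bar chart. This sketch is purely illustrative and not a planned design; `render_histogram` is a hypothetical name.

```python
def render_histogram(histogram: dict, width: int = 40) -> str:
    """Render one text bar per bin, scaled so the largest bin spans `width`."""
    counts = histogram["hist"]
    edges = histogram["bin_edges"]
    peak = max(counts) if counts else 0
    lines = []
    for i, count in enumerate(counts):
        bar_length = round(width * count / peak) if peak else 0
        lines.append(
            f"[{edges[i]:>6g}, {edges[i + 1]:>6g}] {'#' * bar_length} {count}"
        )
    return "\n".join(lines)


# Text-length histogram copied from the string_text example earlier.
text_lengths = {
    "hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1],
    "bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231],
}
chart = render_histogram(text_lengths)
print(chart)
```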

	## References

	- [HuggingFace Dataset Viewer Documentation](https://huggingface.co/docs/dataset-viewer/info)
	- [Statistics Endpoint Specification](https://huggingface.co/docs/dataset-viewer/statistics)