Dataset Viewer Statistics Endpoint Integration
Overview
The HuggingFace Dataset Viewer API provides a /statistics endpoint that returns pre-computed statistics for datasets with builder_name="parquet". Because these statistics cover the entire dataset and require no data download, the endpoint is more efficient and more complete than sample-based analysis.
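As a rough sketch of what a request to this endpoint looks like, the snippet below only builds the query URL and sends nothing. The base URL and the `dataset`, `config`, and `split` query parameters follow the public Dataset Viewer API; the helper name `build_statistics_url` is hypothetical and is not part of this project.

```python
from urllib.parse import urlencode

# Public Dataset Viewer API base URL (an assumption here, not taken from this project's code).
DATASET_VIEWER_BASE_URL = "https://datasets-server.huggingface.co"


def build_statistics_url(dataset: str, config: str, split: str) -> str:
    """Return the /statistics URL for one dataset/config/split combination."""
    # urlencode percent-encodes the "/" in dataset IDs such as "stanfordnlp/imdb".
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    return f"{DATASET_VIEWER_BASE_URL}/statistics?{query}"


url = build_statistics_url("stanfordnlp/imdb", "plain_text", "train")
print(url)
```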
Key Benefits
1. Full Dataset Coverage
- Before: Analysis based on samples (default 1,000 examples)
- After: Statistics computed on the entire dataset (e.g., 25,000 examples for IMDB train split)
2. No Data Download Required
- Before: Download and process samples from the dataset
- After: Retrieve pre-computed statistics via API call
3. More Complete Statistics
The endpoint provides detailed statistics for multiple modalities:
Numerical Features (int, float)
- Basic statistics: min, max, mean, median, std
- Missing values: nan_count, nan_proportion
- Distribution: histogram with bin_edges and hist counts
Example response:
{
"column_type": "float",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0,
"min": 0,
"max": 2,
"mean": 1.67206,
"median": 1.8,
"std": 0.38714,
"histogram": {
"hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
"bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2]
}
}
}
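To make the histogram layout concrete, here is a minimal sketch that pairs each count in `hist` with its two surrounding `bin_edges`. It assumes a complete response in which `bin_edges` has exactly one more entry than `hist`; the helper name `histogram_bins` is hypothetical.

```python
from typing import Dict, List, Tuple


def histogram_bins(histogram: Dict[str, list]) -> List[Tuple[float, float, int, float]]:
    """Return (low_edge, high_edge, count, share_of_total) for every bin."""
    counts = histogram["hist"]
    edges = histogram["bin_edges"]
    if len(edges) != len(counts) + 1:
        raise ValueError("expected one more bin edge than histogram counts")
    total = sum(counts)
    return [
        (edges[i], edges[i + 1], count, count / total if total else 0.0)
        for i, count in enumerate(counts)
    ]


# Histogram copied from the example response above.
histogram = {
    "hist": [17, 12, 48, 52, 135, 188, 814, 15, 1628, 2048],
    "bin_edges": [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2],
}

for low, high, count, share in histogram_bins(histogram):
    print(f"{low} to {high}: {count} values ({share:.1%})")
```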
Categorical Features (class_label, string_label)
- Unique values: n_unique count
- Frequencies: Complete frequency distribution for all categories
- Missing values: nan_count, nan_proportion
- No label tracking: no_label_count, no_label_proportion (for class_label)
Example response:
{
"column_type": "class_label",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0,
"no_label_count": 0,
"no_label_proportion": 0,
"n_unique": 2,
"frequencies": {
"unacceptable": 2528,
"acceptable": 6023
}
}
}
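One common use of the `frequencies` field is checking class balance. The sketch below is illustrative only: `class_proportions` and `imbalance_ratio` are hypothetical helpers, not functions from this project or from the API.

```python
from typing import Dict


def class_proportions(frequencies: Dict[str, int]) -> Dict[str, float]:
    """Convert raw label counts into shares of all labelled rows."""
    total = sum(frequencies.values())
    if total == 0:
        return {label: 0.0 for label in frequencies}
    return {label: count / total for label, count in frequencies.items()}


def imbalance_ratio(frequencies: Dict[str, int]) -> float:
    """Ratio of the most to the least frequent class (1.0 means balanced)."""
    counts = [count for count in frequencies.values() if count > 0]
    if not counts:
        raise ValueError("no non-empty classes")
    return max(counts) / min(counts)


# Frequencies copied from the example response above.
frequencies = {"unacceptable": 2528, "acceptable": 6023}
proportions = class_proportions(frequencies)
ratio = imbalance_ratio(frequencies)
print(proportions, round(ratio, 2))
```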
Text Features (string_text)
- Length statistics: min, max, mean, median, std (character count)
- Missing values: nan_count, nan_proportion
- Distribution: histogram of text lengths
Example response:
{
"column_type": "string_text",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0,
"min": 6,
"max": 231,
"mean": 40.70074,
"median": 37,
"std": 19.14431,
"histogram": {
"hist": [2260, 4512, 1262, 380, 102, 26, 6, 1, 1, 1],
"bin_edges": [6, 29, 52, 75, 98, 121, 144, 167, 190, 213, 231]
}
}
}
Boolean Features (bool)
- Frequencies: Distribution of True/False values
- Missing values: nan_count, nan_proportion
Example response:
{
"column_type": "bool",
"column_statistics": {
"nan_count": 3,
"nan_proportion": 0.15,
"frequencies": {
"False": 7,
"True": 10
}
}
}
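The numbers in this example are consistent with `nan_proportion` being `nan_count` divided by the total row count, missing values included: 3 / (3 + 7 + 10) = 0.15. The sketch below reproduces that arithmetic; treating the denominator this way is an inference from the example, and `expected_nan_proportion` is a hypothetical helper.

```python
def expected_nan_proportion(column_statistics: dict) -> float:
    """Recompute nan_proportion from nan_count and the value frequencies."""
    nan_count = column_statistics["nan_count"]
    non_null = sum(column_statistics["frequencies"].values())
    total = nan_count + non_null
    return nan_count / total if total else 0.0


# Values copied from the example response above.
column_statistics = {
    "nan_count": 3,
    "nan_proportion": 0.15,
    "frequencies": {"False": 7, "True": 10},
}

recomputed = expected_nan_proportion(column_statistics)
print(recomputed, column_statistics["nan_proportion"])
```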
Image Features (image)
- Dimension statistics: min, max, mean, median, std (for width/height)
- Missing values: nan_count, nan_proportion
- Distribution: histogram of image dimensions
Example response:
{
"column_type": "image",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": 256,
"max": 873,
"mean": 327.99339,
"median": 341.0,
"std": 60.07286,
"histogram": {
"hist": [1734, 1637, 1326, 121, 10, 3, 1, 3, 1, 2],
"bin_edges": [256, 318, 380, 442, 504, ...]
}
}
}
Audio Features (audio)
- Duration statistics: min, max, mean, median, std (in seconds)
- Missing values: nan_count, nan_proportion
- Distribution: histogram of audio durations
Example response:
{
"column_type": "audio",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0,
"min": 1.02,
"max": 15,
"mean": 13.93042,
"median": 14.77,
"std": 2.63734,
"histogram": {
"hist": [32, 25, 18, 24, 22, 17, 18, 19, 55, 1770],
"bin_edges": [1.02, 2.418, 3.816, 5.214, 6.612, ...]
}
}
}
List Features (list)
- Length statistics: min, max, mean, median, std (list length)
- Missing values: nan_count, nan_proportion
- Distribution: histogram of list lengths
Example response:
{
"column_type": "list",
"column_statistics": {
"nan_count": 0,
"nan_proportion": 0.0,
"min": 1,
"max": 3,
"mean": 1.01741,
"median": 1.0,
"std": 0.13146,
"histogram": {
"hist": [11177, 196, 1],
"bin_edges": [1, 2, 3, 3]
}
}
}
Implementation
Architecture
analyze_dataset_features()
    ↓
Try: get_dataset_statistics() [Dataset Viewer API]
    ↓
If available (parquet format):
    → Use full dataset statistics
    → Cache results
    → Return converted analysis
    ↓
If not available:
    → Fall back to sample-based analysis
    → Load samples via streaming
    → Compute statistics locally
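The flow above can be sketched as a small function that tries the Dataset Viewer path first and falls back to sampling. This is an illustration under stated assumptions, not the project's implementation: `analyze_with_fallback` and its two callable parameters are hypothetical, and only the `sampling_method` values are taken from this document.

```python
from typing import Any, Callable, Dict, Optional

Stats = Dict[str, Any]


def analyze_with_fallback(
    fetch_viewer_statistics: Callable[[], Optional[Stats]],
    analyze_samples: Callable[[], Stats],
) -> Stats:
    """Prefer full-dataset statistics; fall back to sample-based analysis."""
    try:
        stats = fetch_viewer_statistics()
    except Exception:
        # Broad catch for brevity; real code would narrow this to network/API errors.
        stats = None

    if stats is not None:
        return {
            "sampling_method": "dataset_viewer_api",
            "represents_full_dataset": True,
            "statistics": stats,
        }
    return {
        "sampling_method": "sequential_head",
        "represents_full_dataset": False,
        "statistics": analyze_samples(),
    }


def viewer_unavailable() -> Optional[Stats]:
    """Stand-in for a dataset that has no Dataset Viewer statistics."""
    return None


result = analyze_with_fallback(viewer_unavailable, lambda: {"count": 1000})
print(result["sampling_method"])  # sequential_head
```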
Key Components
1. DatasetViewerAdapter
- get_dataset_statistics(): Fetch statistics from the API
- check_statistics_availability(): Check if statistics are available for a dataset
2. DatasetService
- get_dataset_statistics(): Wrapper with caching and error handling
- Automatic fallback to sample-based analysis
- Statistics cache directory: cache/statistics/
3. Analysis Tool
- _convert_viewer_statistics_to_analysis(): Convert API format to our analysis format
- Seamless integration with existing analysis pipeline
Caching Strategy
Statistics are cached with the same TTL as other metadata (default: 1 hour):
cache/
├── metadata/      # Dataset metadata
├── samples/       # Sample data
└── statistics/    # Dataset Viewer statistics
    └── {dataset}_{config}_{split}_stats.json
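A file-based cache with a time-to-live can be sketched as below. It is a minimal illustration, not the project's code: the function names are hypothetical, freshness is judged by file modification time, and replacing "/" in dataset IDs with "--" is an assumption made so the ID fits in a single file name.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Any, Optional

DEFAULT_TTL_SECONDS = 3600  # one hour, matching the default TTL described above


def stats_cache_path(cache_root: Path, dataset: str, config: str, split: str) -> Path:
    """Build cache/statistics/{dataset}_{config}_{split}_stats.json under cache_root."""
    safe_dataset = dataset.replace("/", "--")  # assumption: keep the ID in one file name
    return cache_root / "statistics" / f"{safe_dataset}_{config}_{split}_stats.json"


def write_cached_stats(path: Path, stats: Any) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(stats), encoding="utf-8")


def read_cached_stats(path: Path, ttl_seconds: float = DEFAULT_TTL_SECONDS) -> Optional[Any]:
    """Return cached statistics, or None if the file is missing or older than the TTL."""
    if not path.exists():
        return None
    age_seconds = time.time() - path.stat().st_mtime
    if age_seconds > ttl_seconds:
        return None
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    path = stats_cache_path(Path(tmp), "stanfordnlp/imdb", "plain_text", "train")
    write_cached_stats(path, {"num_examples": 25000})
    fresh = read_cached_stats(path)

    # Backdate the file by two hours to simulate an expired entry.
    two_hours_ago = time.time() - 2 * 3600
    os.utime(path, (two_hours_ago, two_hours_ago))
    stale = read_cached_stats(path)

print(fresh, stale)
```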
Usage Examples
Automatic Selection
from hf_eda_mcp.tools.analysis import analyze_dataset_features
# Automatically uses Dataset Viewer statistics if available
result = analyze_dataset_features(
dataset_id="stanfordnlp/imdb",
split="train"
)
# Check which method was used
print(result['sample_info']['sampling_method'])
# Output: "dataset_viewer_api" or "sequential_head"
print(result['sample_info']['represents_full_dataset'])
# Output: True (full dataset) or False (sample)
Check Availability
from hf_eda_mcp.services.dataset_viewer_adapter import DatasetViewerAdapter
adapter = DatasetViewerAdapter(token="your_token")
availability = adapter.check_statistics_availability("stanfordnlp/imdb")
print(availability)
# {
# 'available': True,
# 'configs': ['plain_text'],
# 'reason': 'Statistics available for 1 config(s)'
# }
Direct Statistics Access
from hf_eda_mcp.services.dataset_service import DatasetService
service = DatasetService(token="your_token")
stats = service.get_dataset_statistics(
dataset_id="stanfordnlp/imdb",
split="train",
config_name="plain_text"
)
if stats:
print(f"Full dataset: {stats['num_examples']} examples")
print(f"Columns: {len(stats['statistics'])}")
else:
print("Statistics not available, use sample-based analysis")
Comparison: Before vs After
IMDB Dataset Example
Before (Sample-based)
{
'dataset_info': {
'sample_size_used': 1000,
'sample_size_requested': 1000,
},
'sample_info': {
'sampling_method': 'sequential_head',
'represents_full_dataset': True, # True only because sample_size_used >= sample_size_requested; the sample still covers 1,000 of 25,000 rows
},
'features': {
'text': {
'feature_type': 'text',
'statistics': {
'count': 1000,
'avg_length': 1311.289,
'min_length': 65,
'max_length': 6103,
# Limited to sample
}
}
},
'summary': 'Analyzed 2 features from 1000 samples | Types: 1 categorical, 1 text'
}
After (Dataset Viewer)
{
'dataset_info': {
'sample_size_used': 25000, # Full dataset
'sample_size_requested': 25000,
},
'sample_info': {
'sampling_method': 'dataset_viewer_api',
'represents_full_dataset': True, # Always true
'partial': False
},
'features': {
'text': {
'feature_type': 'text',
'statistics': {
'count': 25000, # Full dataset
'mean_length': 1325.06964,
'min_length': 52,
'max_length': 13704,
'histogram': {
'bin_edges': [52, 1418, 2784, ...],
'hist': [17426, 5384, 1490, ...]
}
}
}
},
'summary': 'Analyzed 2 features from 25000 samples | Types: 1 categorical, 1 text'
}
Supported Data Types
The Dataset Viewer statistics endpoint supports comprehensive analysis for multiple data types:
| Data Type | Feature Type | Statistics Provided |
|---|---|---|
| int, float | numerical | min, max, mean, median, std, histogram |
| class_label, string_label | categorical | frequencies, n_unique, no_label tracking |
| bool | boolean | True/False frequencies |
| string_text | text | character length stats (min, max, mean, median, std), histogram |
| image | image | dimension statistics, histogram |
| audio | audio | duration statistics (seconds), histogram |
| list | list | length statistics, histogram |
Data Type Mapping
Our analysis tool automatically maps Dataset Viewer types to our internal types:
Dataset Viewer Type → Our Feature Type
─────────────────────────────────────
int, float          → numerical
class_label         → categorical
string_label        → categorical
bool                → boolean
string_text         → text
image               → image
audio               → audio
list                → list
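The mapping above amounts to a dictionary lookup. The sketch below is one possible shape for it; the names `VIEWER_TO_FEATURE_TYPE` and `map_column_type` are hypothetical, and returning "unknown" for unlisted types is an assumption rather than documented behavior.

```python
# Dataset Viewer column_type -> internal feature type, as listed above.
VIEWER_TO_FEATURE_TYPE = {
    "int": "numerical",
    "float": "numerical",
    "class_label": "categorical",
    "string_label": "categorical",
    "bool": "boolean",
    "string_text": "text",
    "image": "image",
    "audio": "audio",
    "list": "list",
}


def map_column_type(column_type: str) -> str:
    """Map a Dataset Viewer column_type to an internal feature type."""
    # Assumption: unlisted types are reported as "unknown" instead of raising.
    return VIEWER_TO_FEATURE_TYPE.get(column_type, "unknown")


mapped = [map_column_type(t) for t in ("float", "class_label", "string_text")]
print(mapped)  # ['numerical', 'categorical', 'text']
```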
Limitations
Dataset Requirements
- Only works for datasets with builder_name="parquet"
- Not all datasets on HuggingFace Hub have this format
- Automatic fallback to sample-based analysis for other formats
API Availability
- Requires internet connection
- Subject to HuggingFace API rate limits
- May fail for private datasets without proper authentication
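For private datasets, the token is usually sent as a bearer token in the `Authorization` header. The sketch below shows one way to build such a request with the standard library and to treat network or HTTP failures as "statistics unavailable". Both helper names are hypothetical, the URL is a placeholder, and no request is sent when the snippet runs.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Dict, Optional


def build_request(url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Create a GET request, adding a bearer token when one is provided."""
    headers = {"Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


def fetch_json(request: urllib.request.Request, timeout: float = 10.0) -> Optional[Dict[str, Any]]:
    """Return the decoded JSON body, or None on network, HTTP, or parse errors."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response)
    except (urllib.error.URLError, TimeoutError, ValueError):
        # HTTPError (e.g. 401 or 429) is a subclass of URLError and lands here too.
        return None


# Placeholder URL and token; nothing is sent until fetch_json(request) is called.
request = build_request("https://example.invalid/statistics", token="your_token")
print(request.get_header("Authorization"))
```

Returning None instead of raising keeps the caller's fallback to sample-based analysis simple, at the cost of hiding the specific failure; logging the exception before returning would preserve that detail.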
Error Handling
The implementation includes robust error handling:
- Check availability first: Verify dataset supports statistics
- Graceful fallback: Automatically use sample-based analysis if unavailable
- Caching: Reduce API calls and improve performance
- Logging: Clear messages about which method is being used
Performance Impact
API Call Overhead
- Initial call: ~1-2 seconds
- Cached calls: <10ms
- No data download required
Sample-based Analysis
- Download time: Varies by dataset size
- Processing time: ~1-5 seconds for 1000 samples
- Network bandwidth: Depends on sample size
Future Enhancements
- Parallel requests: Fetch statistics for multiple splits simultaneously
- Partial statistics: Support datasets with partial statistics
- Custom aggregations: Add more statistical measures
- Visualization: Generate plots from histogram data