Dataset Validation & Curation - Implementation Complete
Comprehensive dataset validation, curation, and analysis utilities have been implemented.
✅ Implemented Features
1. Dataset Validation (ylff/utils/dataset_validation.py)
DatasetValidator Class:
- ✅ Data integrity checks (images, poses, metadata)
- ✅ Quality validation (NaN/Inf detection, rotation matrix validity)
- ✅ Statistical analysis (error distributions, image counts)
- ✅ Comprehensive reporting
Functions:
- ✅ validate_dataset_file() - Validate saved dataset files
- ✅ check_dataset_integrity() - Check dataset directory integrity
Validation Checks:
- Image format validation (numpy arrays, tensors, file paths)
- Pose shape and validity checks
- Metadata validation (weights, errors, sequence IDs)
- NaN/Inf detection
- Rotation matrix determinant checks
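The numeric checks are straightforward to express. Below is a minimal, self-contained sketch of the NaN/Inf and rotation-matrix checks listed above; the helper names are illustrative, not the actual DatasetValidator internals.

```python
import numpy as np

def has_nan_or_inf(arr: np.ndarray) -> bool:
    # Any non-finite entry (NaN or +/-Inf) fails the check.
    return not np.isfinite(arr).all()

def is_valid_rotation(R: np.ndarray, tol: float = 1e-6) -> bool:
    # A valid rotation matrix is 3x3, orthonormal (R @ R.T == I),
    # and has determinant +1 (det == -1 would be a reflection).
    if R.shape != (3, 3) or has_nan_or_inf(R):
        return False
    orthonormal = np.allclose(R @ R.T, np.eye(3), atol=tol)
    return orthonormal and np.isclose(np.linalg.det(R), 1.0, atol=tol)
```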
2. Dataset Curation (ylff/utils/dataset_curation.py)
DatasetCurator Class:
- ✅ Quality-based filtering (error, weight, image count thresholds)
- ✅ Outlier removal (percentile-based, statistical IQR method)
- ✅ Dataset balancing (error bins, uniform, weighted strategies)
- ✅ Dataset splitting (train/val/test with stratification)
- ✅ Smart sampling (random, weighted, error-based)
Curation Strategies:
- Filtering: By error range, weight range, image count
- Outlier Removal: Percentile-based or statistical IQR
- Balancing: Error bins, uniform distribution, weighted sampling
- Splitting: Stratified or random train/val/test splits
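As a concrete reference, here is a minimal sketch of the statistical IQR rule (the standard 1.5×IQR fence); the percentile-based variant simply clips at a fixed percentile instead. The function name is illustrative, not DatasetCurator's internals.

```python
import numpy as np

def iqr_keep_mask(errors: np.ndarray, k: float = 1.5) -> np.ndarray:
    # Keep samples whose error lies within [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = np.percentile(errors, [25, 75])
    iqr = q3 - q1
    return (errors >= q1 - k * iqr) & (errors <= q3 + k * iqr)

errors = np.array([1.2, 0.8, 2.5, 40.0, 1.9])
filtered = errors[iqr_keep_mask(errors)]  # drops the 40.0 outlier
```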
3. Dataset Analysis (ylff/utils/dataset_analysis.py)
DatasetAnalyzer Class:
- ✅ Statistical analysis (mean, median, quartiles, percentiles)
- ✅ Distribution computation (histograms, binning)
- ✅ Quality metrics (error ratios, weight diversity, completeness)
- ✅ Correlation analysis
- ✅ Report generation (JSON, text, markdown)
Analysis Features:
- Error statistics (mean, median, Q25/Q75, Q90/Q95/Q99)
- Weight statistics
- Image count statistics
- Sequence statistics (samples per sequence)
- Quality metrics (low/medium/high error ratios)
- Completeness metrics
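All of these reduce to percentile and histogram computations over per-sample arrays. The sketch below shows the core idea on synthetic errors; it is not DatasetAnalyzer's actual code, and the 2.0 threshold for "low error" is an arbitrary example.

```python
import numpy as np

errors = np.random.default_rng(0).gamma(2.0, 2.0, size=1000)  # synthetic errors

stats = {
    "mean": float(np.mean(errors)),
    "median": float(np.median(errors)),
    **{f"q{p}": float(np.percentile(errors, p)) for p in (25, 75, 90, 95, 99)},
}
hist, bin_edges = np.histogram(errors, bins=20)  # distribution for reporting
low_error_ratio = float(np.mean(errors < 2.0))   # example quality metric
```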
API Endpoints
/api/v1/dataset/validate (POST)
Request Model: ValidateDatasetRequest
{
"dataset_path": "data/training/dataset.pkl",
"strict": false,
"check_images": true,
"check_poses": true,
"check_metadata": true
}
Response: DatasetValidationResponse
- validation_passed: Boolean
- statistics: Dataset statistics
- issues: List of validation issues
- summary: Validation summary
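For example, assuming the API is served at localhost:8000 (host and port are illustrative) and the requests package is installed, the endpoint can be called like this:

```python
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/dataset/validate",  # adjust host/port as needed
    json={
        "dataset_path": "data/training/dataset.pkl",
        "strict": False,
        "check_images": True,
        "check_poses": True,
        "check_metadata": True,
    },
)
resp.raise_for_status()
result = resp.json()
print(result["validation_passed"], result["summary"])
```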
/api/v1/dataset/curate (POST)
Request Model: CurateDatasetRequest
{
"dataset_path": "data/training/dataset.pkl",
"output_path": "data/training/dataset_curated.pkl",
"min_error": 0.5,
"max_error": 30.0,
"remove_outliers": true,
"outlier_percentile": 95.0,
"balance": true,
"balance_strategy": "error_bins",
"num_bins": 10
}
Response: JobResponse (async job)
/api/v1/dataset/analyze (POST)
Request Model: AnalyzeDatasetRequest
{
"dataset_path": "data/training/dataset.pkl",
"output_path": "data/training/analysis.json",
"format": "json",
"compute_distributions": true,
"compute_correlations": true
}
Response: DatasetAnalysisResponse
- statistics: Dataset statistics
- quality_metrics: Quality metrics
- report: Human-readable report (for text/markdown formats)
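Calling it follows the same pattern as the validate example above (host and port again illustrative):

```python
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/dataset/analyze",
    json={
        "dataset_path": "data/training/dataset.pkl",
        "format": "json",
        "compute_distributions": True,
        "compute_correlations": True,
    },
)
resp.raise_for_status()
analysis = resp.json()
print(analysis["statistics"], analysis["quality_metrics"])
```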
CLI Commands
ylff dataset validate
ylff dataset validate data/training/dataset.pkl \
--strict \
--check-images \
--check-poses \
--check-metadata \
--output validation_report.json
ylff dataset curate
ylff dataset curate \
data/training/dataset.pkl \
data/training/dataset_curated.pkl \
--min-error 0.5 \
--max-error 30.0 \
--remove-outliers \
--outlier-percentile 95.0 \
--balance \
--balance-strategy error_bins \
--num-bins 10
ylff dataset analyze
ylff dataset analyze data/training/dataset.pkl \
--output analysis_report.json \
--format json \
--compute-distributions \
--compute-correlations
Integration
Data Pipeline Integration
The BADataPipeline.build_training_set() method now automatically:
- ✅ Validates built datasets
- ✅ Analyzes dataset statistics
- ✅ Logs validation and analysis results
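Conceptually, the hook amounts to something like the sketch below. The function name post_build_checks is hypothetical; the validator and analyzer calls follow the usage shown in the next snippet.

```python
import logging
from ylff.utils.dataset_validation import DatasetValidator
from ylff.utils.dataset_analysis import DatasetAnalyzer

logger = logging.getLogger(__name__)

def post_build_checks(samples):
    # Validate the freshly built samples and log the report.
    report = DatasetValidator(strict=False).validate_dataset(samples)
    logger.info("Dataset validation report: %s", report)

    # Compute and log summary statistics for the same samples.
    analysis = DatasetAnalyzer().analyze_dataset(samples)
    logger.info("Dataset analysis: %s", analysis)
    return report, analysis
```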
Usage in Training
from ylff.utils.dataset_validation import DatasetValidator
from ylff.utils.dataset_curation import DatasetCurator
from ylff.utils.dataset_analysis import DatasetAnalyzer
# Validate
validator = DatasetValidator(strict=False)
report = validator.validate_dataset(samples)
# Curate
curator = DatasetCurator()
curated, stats = curator.filter_by_quality(
samples,
min_error=0.5,
max_error=30.0,
)
curated, _ = curator.remove_outliers(curated, error_percentile=95.0)
# Analyze
analyzer = DatasetAnalyzer()
analysis = analyzer.analyze_dataset(curated)
analyzer.generate_report("analysis_report.md", format="markdown")
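The splitting and sampling utilities slot in at the end of the same flow. The method names and signatures below are hypothetical placeholders for the documented split/sample features, not confirmed DatasetCurator API:

```python
# Hypothetical continuation of the snippet above; method names and
# signatures are illustrative, not confirmed DatasetCurator API.
train, val, test = curator.split_dataset(
    curated,
    ratios=(0.8, 0.1, 0.1),
    stratify=True,   # stratified split, e.g. by error bins
)
subset = curator.sample(train, n=1000, strategy="weighted")
```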
Features
Validation Features
- ✅ Image format validation (numpy, tensor, file paths)
- ✅ Pose shape and validity checks
- ✅ Metadata validation
- ✅ NaN/Inf detection
- ✅ Rotation matrix validation
- ✅ File integrity checks
Curation Features
- ✅ Quality filtering (error, weight, image count)
- ✅ Outlier removal (percentile, IQR)
- ✅ Dataset balancing (error bins, uniform, weighted)
- ✅ Train/val/test splitting (stratified, random)
- ✅ Smart sampling strategies
Analysis Features
- ✅ Statistical analysis (mean, median, quartiles)
- ✅ Distribution computation
- ✅ Quality metrics
- ✅ Correlation analysis
- ✅ Report generation (JSON, text, markdown)
Next Steps
- Dataset Versioning - Track dataset versions and metadata
- Visualization - Generate plots for distributions and statistics
- Advanced Filtering - Scene-based, sequence-based filtering
- Data Augmentation - Integration with augmentation strategies
- Dataset Comparison - Compare multiple datasets
All core functionality is implemented and ready to use!