# Dataset Validation & Curation - Implementation Complete

Comprehensive dataset validation, curation, and analysis utilities have been implemented.

## ✅ Implemented Features

### 1. Dataset Validation (`ylff/utils/dataset_validation.py`)

**DatasetValidator Class:**
- ✅ Data integrity checks (images, poses, metadata)
- ✅ Quality validation (NaN/Inf detection, rotation matrix validity)
- ✅ Statistical analysis (error distributions, image counts)
- ✅ Comprehensive reporting

**Functions:**
- ✅ `validate_dataset_file()` - Validate saved dataset files
- ✅ `check_dataset_integrity()` - Check dataset directory integrity

**Validation Checks:**
- Image format validation (numpy arrays, tensors, file paths)
- Pose shape and validity checks
- Metadata validation (weights, errors, sequence IDs)
- NaN/Inf detection
- Rotation matrix determinant checks
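The rotation-matrix check above can be sketched as follows. This is a minimal illustration rather than the `DatasetValidator` internals, assuming poses carry 3×3 rotation blocks: a valid rotation must be finite, orthonormal, and have determinant +1.

```python
import numpy as np

def is_valid_rotation(R: np.ndarray, tol: float = 1e-5) -> bool:
    """Check that a 3x3 matrix is a proper rotation: finite, orthonormal, det = +1."""
    if R.shape != (3, 3) or not np.isfinite(R).all():
        return False
    # Orthonormality: R @ R.T should be the identity matrix.
    if not np.allclose(R @ R.T, np.eye(3), atol=tol):
        return False
    # A proper rotation (no reflection) has determinant +1.
    return bool(abs(np.linalg.det(R) - 1.0) < tol)
```

For example, `np.eye(3)` passes, while `np.diag([1.0, 1.0, -1.0])` (a reflection) fails the determinant check.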
### 2. Dataset Curation (`ylff/utils/dataset_curation.py`)

**DatasetCurator Class:**
- ✅ Quality-based filtering (error, weight, and image-count thresholds)
- ✅ Outlier removal (percentile-based and statistical IQR methods)
- ✅ Dataset balancing (error-bin, uniform, and weighted strategies)
- ✅ Dataset splitting (train/val/test with stratification)
- ✅ Smart sampling (random, weighted, error-based)

**Curation Strategies:**
- **Filtering**: By error range, weight range, image count
- **Outlier Removal**: Percentile-based or statistical IQR
- **Balancing**: Error bins, uniform distribution, weighted sampling
- **Splitting**: Stratified or random train/val/test splits
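The statistical IQR method mentioned above can be sketched as a standalone filter (a simplified illustration, not the `DatasetCurator` implementation):

```python
import numpy as np

def iqr_error_mask(errors, k: float = 1.5) -> np.ndarray:
    """Boolean mask keeping samples whose error lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    e = np.asarray(errors, dtype=float)
    q1, q3 = np.percentile(e, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return (e >= lo) & (e <= hi)
```

With `k=1.5` (the conventional Tukey fence), an error list like `[1, 2, 2, 3, 3, 4, 100]` keeps everything except the 100.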
### 3. Dataset Analysis (`ylff/utils/dataset_analysis.py`)

**DatasetAnalyzer Class:**
- ✅ Statistical analysis (mean, median, quartiles, percentiles)
- ✅ Distribution computation (histograms, binning)
- ✅ Quality metrics (error ratios, weight diversity, completeness)
- ✅ Correlation analysis
- ✅ Report generation (JSON, text, markdown)

**Analysis Features:**
- Error statistics (mean, median, Q25/Q75, Q90/Q95/Q99)
- Weight statistics
- Image count statistics
- Sequence statistics (samples per sequence)
- Quality metrics (low/medium/high error ratios)
- Completeness metrics
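The error statistics listed above amount to a handful of NumPy percentile calls. A minimal sketch, with hypothetical field names rather than the `DatasetAnalyzer` output schema:

```python
import numpy as np

def error_statistics(errors) -> dict:
    """Summarize per-sample errors with the percentiles listed above."""
    e = np.asarray(errors, dtype=float)
    q = np.percentile(e, [25, 50, 75, 90, 95, 99])
    return {
        "mean": float(e.mean()),
        "median": float(q[1]),
        "q25": float(q[0]), "q75": float(q[2]),
        "q90": float(q[3]), "q95": float(q[4]), "q99": float(q[5]),
    }
```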
## API Endpoints

### `/api/v1/dataset/validate` (POST)

**Request Model**: `ValidateDatasetRequest`
```json
{
  "dataset_path": "data/training/dataset.pkl",
  "strict": false,
  "check_images": true,
  "check_poses": true,
  "check_metadata": true
}
```

**Response**: `DatasetValidationResponse`
- `validation_passed`: Boolean
- `statistics`: Dataset statistics
- `issues`: List of validation issues
- `summary`: Validation summary
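A client-side sketch of calling this endpoint using only the standard library; the host and port are assumptions, so adjust them to your deployment:

```python
import json
from urllib import request

API_URL = "http://localhost:8000/api/v1/dataset/validate"  # hypothetical host/port

def build_validate_payload(dataset_path: str, strict: bool = False) -> dict:
    """Assemble a ValidateDatasetRequest body with all checks enabled."""
    return {
        "dataset_path": dataset_path,
        "strict": strict,
        "check_images": True,
        "check_poses": True,
        "check_metadata": True,
    }

def post_validate(payload: dict) -> dict:
    """POST the payload and return the parsed DatasetValidationResponse."""
    req = request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```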
### `/api/v1/dataset/curate` (POST)

**Request Model**: `CurateDatasetRequest`
```json
{
  "dataset_path": "data/training/dataset.pkl",
  "output_path": "data/training/dataset_curated.pkl",
  "min_error": 0.5,
  "max_error": 30.0,
  "remove_outliers": true,
  "outlier_percentile": 95.0,
  "balance": true,
  "balance_strategy": "error_bins",
  "num_bins": 10
}
```

**Response**: `JobResponse` (asynchronous job)
### `/api/v1/dataset/analyze` (POST)

**Request Model**: `AnalyzeDatasetRequest`
```json
{
  "dataset_path": "data/training/dataset.pkl",
  "output_path": "data/training/analysis.json",
  "format": "json",
  "compute_distributions": true,
  "compute_correlations": true
}
```

**Response**: `DatasetAnalysisResponse`
- `statistics`: Dataset statistics
- `quality_metrics`: Quality metrics
- `report`: Human-readable report (if text/markdown)
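Once the analysis has been written to disk, the JSON report can be consumed programmatically. A sketch, assuming the top-level `quality_metrics` key from the response model above:

```python
import json
from pathlib import Path

def load_analysis(path: str) -> dict:
    """Load an analysis report saved with format="json"."""
    return json.loads(Path(path).read_text())

def summarize_quality(report: dict) -> str:
    """Format the quality_metrics block as one 'name: value' line per metric."""
    metrics = report.get("quality_metrics", {})
    return "\n".join(f"{name}: {value}" for name, value in sorted(metrics.items()))
```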
## CLI Commands

### `ylff dataset validate`
```bash
ylff dataset validate data/training/dataset.pkl \
    --strict \
    --check-images \
    --check-poses \
    --check-metadata \
    --output validation_report.json
```

### `ylff dataset curate`
```bash
ylff dataset curate \
    data/training/dataset.pkl \
    data/training/dataset_curated.pkl \
    --min-error 0.5 \
    --max-error 30.0 \
    --remove-outliers \
    --outlier-percentile 95.0 \
    --balance \
    --balance-strategy error_bins \
    --num-bins 10
```

### `ylff dataset analyze`
```bash
ylff dataset analyze data/training/dataset.pkl \
    --output analysis_report.json \
    --format json \
    --compute-distributions \
    --compute-correlations
```
## Integration

### Data Pipeline Integration

The `BADataPipeline.build_training_set()` method now automatically:
- ✅ Validates built datasets
- ✅ Analyzes dataset statistics
- ✅ Logs validation and analysis results

### Usage in Training

```python
from ylff.utils.dataset_validation import DatasetValidator
from ylff.utils.dataset_curation import DatasetCurator
from ylff.utils.dataset_analysis import DatasetAnalyzer

# Validate
validator = DatasetValidator(strict=False)
report = validator.validate_dataset(samples)

# Curate
curator = DatasetCurator()
curated, stats = curator.filter_by_quality(
    samples,
    min_error=0.5,
    max_error=30.0,
)
curated, _ = curator.remove_outliers(curated, error_percentile=95.0)

# Analyze
analyzer = DatasetAnalyzer()
analysis = analyzer.analyze_dataset(curated)
analyzer.generate_report("analysis_report.json", format="json")
```
## Features

### Validation Features
- ✅ Image format validation (numpy, tensor, file paths)
- ✅ Pose shape and validity checks
- ✅ Metadata validation
- ✅ NaN/Inf detection
- ✅ Rotation matrix validation
- ✅ File integrity checks

### Curation Features
- ✅ Quality filtering (error, weight, image count)
- ✅ Outlier removal (percentile, IQR)
- ✅ Dataset balancing (error bins, uniform, weighted)
- ✅ Train/val/test splitting (stratified, random)
- ✅ Smart sampling strategies

### Analysis Features
- ✅ Statistical analysis (mean, median, quartiles)
- ✅ Distribution computation
- ✅ Quality metrics
- ✅ Correlation analysis
- ✅ Report generation (JSON, text, markdown)
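The stratified train/val/test split can be sketched as binning samples by error quantile and splitting each bin proportionally (a simplified illustration, not the `DatasetCurator` implementation):

```python
import numpy as np

def stratified_split(errors, train: float = 0.8, val: float = 0.1, seed: int = 0):
    """Split sample indices train/val/test, stratified by error-quantile bin."""
    rng = np.random.default_rng(seed)
    e = np.asarray(errors, dtype=float)
    # Assign each sample to one of four quartile bins of the error distribution.
    bins = np.digitize(e, np.quantile(e, [0.25, 0.5, 0.75]))
    tr, va, te = [], [], []
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        rng.shuffle(idx)
        n_tr = int(len(idx) * train)
        n_va = int(len(idx) * val)
        tr += idx[:n_tr].tolist()
        va += idx[n_tr:n_tr + n_va].tolist()
        te += idx[n_tr + n_va:].tolist()
    return tr, va, te
```

Because each quartile bin is split with the same ratios, the error distribution of each split mirrors that of the full dataset.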
## Next Steps

1. **Dataset Versioning** - Track dataset versions and metadata
2. **Visualization** - Generate plots for distributions and statistics
3. **Advanced Filtering** - Scene-based and sequence-based filtering
4. **Data Augmentation** - Integration with augmentation strategies
5. **Dataset Comparison** - Compare multiple datasets

All core functionality is implemented and ready to use!