# Dataset Validation & Curation - Implementation Complete
Comprehensive dataset validation, curation, and analysis utilities have been implemented.
## βœ… Implemented Features
### 1. Dataset Validation (`ylff/utils/dataset_validation.py`)
**DatasetValidator Class:**
- βœ… Data integrity checks (images, poses, metadata)
- βœ… Quality validation (NaN/Inf detection, rotation matrix validity)
- βœ… Statistical analysis (error distributions, image counts)
- βœ… Comprehensive reporting
**Functions:**
- βœ… `validate_dataset_file()` - Validate saved dataset files
- βœ… `check_dataset_integrity()` - Check dataset directory integrity
**Validation Checks:**
- Image format validation (numpy arrays, tensors, file paths)
- Pose shape and validity checks
- Metadata validation (weights, errors, sequence IDs)
- NaN/Inf detection
- Rotation matrix determinant checks
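The rotation-matrix check can be sketched as follows. This is a minimal illustrative version (orthonormality plus determinant +1); the actual `DatasetValidator` internals may differ:

```python
import numpy as np

def check_rotation_matrix(R: np.ndarray, tol: float = 1e-4) -> bool:
    """Return True if R is a valid 3x3 rotation matrix.

    A valid rotation is finite, orthonormal (R @ R.T == I), and has
    determinant +1 (determinant -1 indicates a reflection).
    """
    if R.shape != (3, 3) or not np.all(np.isfinite(R)):
        return False
    if not np.allclose(R @ R.T, np.eye(3), atol=tol):
        return False
    return bool(np.isclose(np.linalg.det(R), 1.0, atol=tol))
```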
### 2. Dataset Curation (`ylff/utils/dataset_curation.py`)
**DatasetCurator Class:**
- βœ… Quality-based filtering (error, weight, image count thresholds)
- βœ… Outlier removal (percentile-based, statistical IQR method)
- βœ… Dataset balancing (error bins, uniform, weighted strategies)
- βœ… Dataset splitting (train/val/test with stratification)
- βœ… Smart sampling (random, weighted, error-based)
**Curation Strategies:**
- **Filtering**: By error range, weight range, image count
- **Outlier Removal**: Percentile-based or statistical IQR
- **Balancing**: Error bins, uniform distribution, weighted sampling
- **Splitting**: Stratified or random train/val/test splits
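The statistical IQR strategy is the standard Tukey fence: keep samples whose error falls within `[Q1 - k*IQR, Q3 + k*IQR]`. A minimal sketch (the real `DatasetCurator.remove_outliers` signature may differ):

```python
import numpy as np

def remove_outliers_iqr(errors, k: float = 1.5):
    """Return indices of samples whose error lies within the IQR fence.

    k=1.5 is the conventional Tukey multiplier; larger k keeps more samples.
    """
    arr = np.asarray(errors, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [i for i, e in enumerate(arr) if lo <= e <= hi]
```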
### 3. Dataset Analysis (`ylff/utils/dataset_analysis.py`)
**DatasetAnalyzer Class:**
- βœ… Statistical analysis (mean, median, quartiles, percentiles)
- βœ… Distribution computation (histograms, binning)
- βœ… Quality metrics (error ratios, weight diversity, completeness)
- βœ… Correlation analysis
- βœ… Report generation (JSON, text, markdown)
**Analysis Features:**
- Error statistics (mean, median, Q25/Q75, Q90/Q95/Q99)
- Weight statistics
- Image count statistics
- Sequence statistics (samples per sequence)
- Quality metrics (low/medium/high error ratios)
- Completeness metrics
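The low/medium/high error ratios can be computed as simple band fractions. The thresholds below are illustrative assumptions, not the analyzer's actual cut-offs:

```python
import numpy as np

def error_ratios(errors, low_thresh: float = 2.0, high_thresh: float = 10.0) -> dict:
    """Fraction of samples in low/medium/high error bands (sums to 1.0)."""
    arr = np.asarray(errors, dtype=float)
    n = len(arr)
    return {
        "low": float(np.sum(arr < low_thresh)) / n,
        "medium": float(np.sum((arr >= low_thresh) & (arr < high_thresh))) / n,
        "high": float(np.sum(arr >= high_thresh)) / n,
    }
```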
## πŸ“‹ API Endpoints
### `/api/v1/dataset/validate` (POST)
**Request Model**: `ValidateDatasetRequest`
```json
{
  "dataset_path": "data/training/dataset.pkl",
  "strict": false,
  "check_images": true,
  "check_poses": true,
  "check_metadata": true
}
```
**Response**: `DatasetValidationResponse`
- `validation_passed`: Boolean
- `statistics`: Dataset statistics
- `issues`: List of validation issues
- `summary`: Validation summary
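A minimal client call using only the standard library. The base URL is an assumption; point it at wherever the ylff API is actually served:

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumption: adjust to your deployment

payload = {
    "dataset_path": "data/training/dataset.pkl",
    "strict": False,
    "check_images": True,
    "check_poses": True,
    "check_metadata": True,
}

def validate_dataset(base_url: str = API_URL) -> dict:
    """POST the payload to /api/v1/dataset/validate and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/dataset/validate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())
```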
### `/api/v1/dataset/curate` (POST)
**Request Model**: `CurateDatasetRequest`
```json
{
  "dataset_path": "data/training/dataset.pkl",
  "output_path": "data/training/dataset_curated.pkl",
  "min_error": 0.5,
  "max_error": 30.0,
  "remove_outliers": true,
  "outlier_percentile": 95.0,
  "balance": true,
  "balance_strategy": "error_bins",
  "num_bins": 10
}
```
**Response**: `JobResponse` (async job)
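Because curation runs as an async job, a caller typically polls for completion. The status endpoint path and field names below are assumptions for illustration; check the actual `JobResponse` contract before using them:

```python
import json
import time
import urllib.request

API_URL = "http://localhost:8000"  # assumption: adjust to your deployment

def wait_for_job(job_id: str, poll_seconds: float = 2.0) -> dict:
    """Poll a job until it reaches a terminal state.

    Assumes a status endpoint of the form /api/v1/jobs/{job_id} returning a
    JSON body with a 'status' field -- both are hypothetical names here.
    """
    while True:
        with urllib.request.urlopen(f"{API_URL}/api/v1/jobs/{job_id}", timeout=30) as resp:
            job = json.loads(resp.read())
        if job.get("status") in ("completed", "failed"):
            return job
        time.sleep(poll_seconds)
```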
### `/api/v1/dataset/analyze` (POST)
**Request Model**: `AnalyzeDatasetRequest`
```json
{
  "dataset_path": "data/training/dataset.pkl",
  "output_path": "data/training/analysis.json",
  "format": "json",
  "compute_distributions": true,
  "compute_correlations": true
}
```
**Response**: `DatasetAnalysisResponse`
- `statistics`: Dataset statistics
- `quality_metrics`: Quality metrics
- `report`: Human-readable report (included when `format` is `text` or `markdown`)
## πŸ”§ CLI Commands
### `ylff dataset validate`
```bash
ylff dataset validate data/training/dataset.pkl \
  --strict \
  --check-images \
  --check-poses \
  --check-metadata \
  --output validation_report.json
```
### `ylff dataset curate`
```bash
ylff dataset curate \
  data/training/dataset.pkl \
  data/training/dataset_curated.pkl \
  --min-error 0.5 \
  --max-error 30.0 \
  --remove-outliers \
  --outlier-percentile 95.0 \
  --balance \
  --balance-strategy error_bins \
  --num-bins 10
```
### `ylff dataset analyze`
```bash
ylff dataset analyze data/training/dataset.pkl \
  --output analysis_report.json \
  --format json \
  --compute-distributions \
  --compute-correlations
```
## πŸ”„ Integration
### Data Pipeline Integration
The `BADataPipeline.build_training_set()` method now automatically:
- βœ… Validates built datasets
- βœ… Analyzes dataset statistics
- βœ… Logs validation and analysis results
### Usage in Training
```python
from ylff.utils.dataset_validation import DatasetValidator
from ylff.utils.dataset_curation import DatasetCurator
from ylff.utils.dataset_analysis import DatasetAnalyzer

# Validate
validator = DatasetValidator(strict=False)
report = validator.validate_dataset(samples)

# Curate
curator = DatasetCurator()
curated, stats = curator.filter_by_quality(
    samples,
    min_error=0.5,
    max_error=30.0,
)
curated, _ = curator.remove_outliers(curated, error_percentile=95.0)

# Analyze
analyzer = DatasetAnalyzer()
analysis = analyzer.analyze_dataset(curated)
analyzer.generate_report("analysis_report.md", format="markdown")
```
## πŸ“Š Features
### Validation Features
- βœ… Image format validation (numpy, tensor, file paths)
- βœ… Pose shape and validity checks
- βœ… Metadata validation
- βœ… NaN/Inf detection
- βœ… Rotation matrix validation
- βœ… File integrity checks
### Curation Features
- βœ… Quality filtering (error, weight, image count)
- βœ… Outlier removal (percentile, IQR)
- βœ… Dataset balancing (error bins, uniform, weighted)
- βœ… Train/val/test splitting (stratified, random)
- βœ… Smart sampling strategies
### Analysis Features
- βœ… Statistical analysis (mean, median, quartiles)
- βœ… Distribution computation
- βœ… Quality metrics
- βœ… Correlation analysis
- βœ… Report generation (JSON, text, markdown)
## πŸš€ Next Steps
1. **Dataset Versioning** - Track dataset versions and metadata
2. **Visualization** - Generate plots for distributions and statistics
3. **Advanced Filtering** - Scene-based, sequence-based filtering
4. **Data Augmentation** - Integration with augmentation strategies
5. **Dataset Comparison** - Compare multiple datasets
All core functionality is implemented and ready to use! πŸŽ‰