# Dataset Validation & Curation - Implementation Complete
Comprehensive dataset validation, curation, and analysis utilities have been implemented.
## Implemented Features
### 1. Dataset Validation (`ylff/utils/dataset_validation.py`)
**DatasetValidator Class:**
- Data integrity checks (images, poses, metadata)
- Quality validation (NaN/Inf detection, rotation matrix validity)
- Statistical analysis (error distributions, image counts)
- Comprehensive reporting
**Functions:**
- `validate_dataset_file()` - Validate saved dataset files
- `check_dataset_integrity()` - Check dataset directory integrity
**Validation Checks:**
- Image format validation (numpy arrays, tensors, file paths)
- Pose shape and validity checks
- Metadata validation (weights, errors, sequence IDs)
- NaN/Inf detection
- Rotation matrix determinant checks
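In essence, the NaN/Inf and rotation-matrix checks reduce to small numeric predicates. A minimal stdlib-only sketch (function names are illustrative, not the actual validator API):

```python
import math

def has_nan_or_inf(values):
    """Return True if any value is NaN or infinite."""
    return any(math.isnan(v) or math.isinf(v) for v in values)

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def is_valid_rotation(m, tol=1e-6):
    """A proper rotation matrix has determinant +1 (within tolerance)."""
    return abs(det3(m) - 1.0) < tol
```

A full validator would also check orthonormality (R·Rᵀ ≈ I), but the determinant check alone already catches reflections and degenerate poses.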
### 2. Dataset Curation (`ylff/utils/dataset_curation.py`)
**DatasetCurator Class:**
- Quality-based filtering (error, weight, image count thresholds)
- Outlier removal (percentile-based, statistical IQR method)
- Dataset balancing (error bins, uniform, weighted strategies)
- Dataset splitting (train/val/test with stratification)
- Smart sampling (random, weighted, error-based)
**Curation Strategies:**
- **Filtering**: By error range, weight range, image count
- **Outlier Removal**: Percentile-based or statistical IQR
- **Balancing**: Error bins, uniform distribution, weighted sampling
- **Splitting**: Stratified or random train/val/test splits
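As a concrete example of the statistical IQR method: samples are kept only if their error falls inside the Tukey fences Q1 − 1.5·IQR and Q3 + 1.5·IQR. A stdlib-only sketch (illustrative, not the `DatasetCurator` API):

```python
import statistics

def iqr_bounds(errors, k=1.5):
    """Compute the Tukey fences used for IQR-based outlier removal."""
    q1, _, q3 = statistics.quantiles(errors, n=4)  # quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_iqr_outliers(samples, key=lambda s: s["error"]):
    """Keep only samples whose error lies inside the Tukey fences."""
    lo, hi = iqr_bounds([key(s) for s in samples])
    return [s for s in samples if lo <= key(s) <= hi]
```

The percentile-based variant is simpler still: compute the, say, 95th percentile of the errors and drop everything above it.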
### 3. Dataset Analysis (`ylff/utils/dataset_analysis.py`)
**DatasetAnalyzer Class:**
- Statistical analysis (mean, median, quartiles, percentiles)
- Distribution computation (histograms, binning)
- Quality metrics (error ratios, weight diversity, completeness)
- Correlation analysis
- Report generation (JSON, text, markdown)
**Analysis Features:**
- Error statistics (mean, median, Q25/Q75, Q90/Q95/Q99)
- Weight statistics
- Image count statistics
- Sequence statistics (samples per sequence)
- Quality metrics (low/medium/high error ratios)
- Completeness metrics
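The reported percentiles can all be read off a single quantile computation. A stdlib sketch of the statistics listed above (illustrative, not the `DatasetAnalyzer` API):

```python
import statistics

def error_statistics(errors):
    """Summary statistics of the kind the analyzer reports (sketch)."""
    qs = statistics.quantiles(errors, n=100)  # 99 percentile cut points
    return {
        "mean": statistics.mean(errors),
        "median": statistics.median(errors),
        "q25": qs[24],
        "q75": qs[74],
        "q90": qs[89],
        "q95": qs[94],
        "q99": qs[98],
    }
```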
## API Endpoints
### `/api/v1/dataset/validate` (POST)
**Request Model**: `ValidateDatasetRequest`
```json
{
  "dataset_path": "data/training/dataset.pkl",
  "strict": false,
  "check_images": true,
  "check_poses": true,
  "check_metadata": true
}
```
**Response**: `DatasetValidationResponse`
- `validation_passed`: Boolean
- `statistics`: Dataset statistics
- `issues`: List of validation issues
- `summary`: Validation summary
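A minimal client call might look like the following; only the endpoint path and request body come from the docs above, while the helper names and the base URL are illustrative:

```python
import json
from urllib import request

def build_validate_payload(dataset_path, strict=False):
    """Assemble the request body for the validate endpoint."""
    return {
        "dataset_path": dataset_path,
        "strict": strict,
        "check_images": True,
        "check_poses": True,
        "check_metadata": True,
    }

def validate_dataset(base_url, dataset_path, strict=False):
    """POST the payload and return the parsed DatasetValidationResponse.
    Error handling is left to the caller."""
    data = json.dumps(build_validate_payload(dataset_path, strict)).encode()
    req = request.Request(
        f"{base_url}/api/v1/dataset/validate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```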
### `/api/v1/dataset/curate` (POST)
**Request Model**: `CurateDatasetRequest`
```json
{
  "dataset_path": "data/training/dataset.pkl",
  "output_path": "data/training/dataset_curated.pkl",
  "min_error": 0.5,
  "max_error": 30.0,
  "remove_outliers": true,
  "outlier_percentile": 95.0,
  "balance": true,
  "balance_strategy": "error_bins",
  "num_bins": 10
}
```
**Response**: `JobResponse` (async job)
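Because curation runs as an async job, callers typically poll until the job reaches a terminal state. A generic polling sketch; the terminal status names and how the status is fetched are assumptions, so check the actual `JobResponse` schema:

```python
import time

def wait_for_job(fetch_status, poll_interval=2.0, timeout=600.0):
    """Poll an async job until it reaches a terminal state.

    `fetch_status` is any zero-argument callable returning the job's
    current status string, e.g. a closure around a GET to the job-status
    endpoint (endpoint name omitted here; see the API).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish in time")
```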
### `/api/v1/dataset/analyze` (POST)
**Request Model**: `AnalyzeDatasetRequest`
```json
{
  "dataset_path": "data/training/dataset.pkl",
  "output_path": "data/training/analysis.json",
  "format": "json",
  "compute_distributions": true,
  "compute_correlations": true
}
```
**Response**: `DatasetAnalysisResponse`
- `statistics`: Dataset statistics
- `quality_metrics`: Quality metrics
- `report`: Human-readable report (if text/markdown)
## CLI Commands
### `ylff dataset validate`
```bash
ylff dataset validate data/training/dataset.pkl \
  --strict \
  --check-images \
  --check-poses \
  --check-metadata \
  --output validation_report.json
```
### `ylff dataset curate`
```bash
ylff dataset curate \
  data/training/dataset.pkl \
  data/training/dataset_curated.pkl \
  --min-error 0.5 \
  --max-error 30.0 \
  --remove-outliers \
  --outlier-percentile 95.0 \
  --balance \
  --balance-strategy error_bins \
  --num-bins 10
```
### `ylff dataset analyze`
```bash
ylff dataset analyze data/training/dataset.pkl \
  --output analysis_report.json \
  --format json \
  --compute-distributions \
  --compute-correlations
```
## Integration
### Data Pipeline Integration
The `BADataPipeline.build_training_set()` method now automatically:
- Validates built datasets
- Analyzes dataset statistics
- Logs validation and analysis results
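The flow can be sketched as follows; `finalize_training_set` is a hypothetical helper shown only to illustrate the order of operations, since the real logic lives inside `BADataPipeline.build_training_set()`:

```python
import logging

logger = logging.getLogger(__name__)

def finalize_training_set(samples, validator, analyzer):
    """Post-build hook: validate, analyze, and log the results."""
    report = validator.validate_dataset(samples)
    if not report.get("validation_passed", False):
        logger.warning("Dataset validation reported issues: %s",
                       report.get("issues"))
    analysis = analyzer.analyze_dataset(samples)
    logger.info("Dataset statistics: %s", analysis.get("statistics"))
    return report, analysis
```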
### Usage in Training
```python
from ylff.utils.dataset_validation import DatasetValidator
from ylff.utils.dataset_curation import DatasetCurator
from ylff.utils.dataset_analysis import DatasetAnalyzer
# Validate
validator = DatasetValidator(strict=False)
report = validator.validate_dataset(samples)
# Curate
curator = DatasetCurator()
curated, stats = curator.filter_by_quality(
samples,
min_error=0.5,
max_error=30.0,
)
curated, _ = curator.remove_outliers(curated, error_percentile=95.0)
# Analyze
analyzer = DatasetAnalyzer()
analysis = analyzer.analyze_dataset(curated)
analyzer.generate_report("analysis_report.md", format="markdown")
```
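For completeness, stratified splitting (advertised by `DatasetCurator`) generally works by partitioning each stratum independently so the train/val/test sets preserve per-stratum proportions. A generic sketch, not the `DatasetCurator` API:

```python
import random
from collections import defaultdict

def stratified_split(samples, key, ratios=(0.8, 0.1, 0.1), seed=0):
    """Train/val/test split that preserves per-stratum proportions.

    `key(sample)` names the stratum, e.g. an error bin or sequence ID.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in samples:
        strata[key(s)].append(s)
    train, val, test = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test
```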
## Features
### Validation Features
- Image format validation (numpy, tensor, file paths)
- Pose shape and validity checks
- Metadata validation
- NaN/Inf detection
- Rotation matrix validation
- File integrity checks
### Curation Features
- Quality filtering (error, weight, image count)
- Outlier removal (percentile, IQR)
- Dataset balancing (error bins, uniform, weighted)
- Train/val/test splitting (stratified, random)
- Smart sampling strategies
### Analysis Features
- Statistical analysis (mean, median, quartiles)
- Distribution computation
- Quality metrics
- Correlation analysis
- Report generation (JSON, text, markdown)
## Next Steps
1. **Dataset Versioning** - Track dataset versions and metadata
2. **Visualization** - Generate plots for distributions and statistics
3. **Advanced Filtering** - Scene-based, sequence-based filtering
4. **Data Augmentation** - Integration with augmentation strategies
5. **Dataset Comparison** - Compare multiple datasets
All core functionality is implemented and ready to use!