
Dataset Validation & Curation - Implementation Complete

Dataset validation, curation, and analysis utilities are implemented under ylff/utils/ and exposed through API endpoints and ylff CLI commands.

✅ Implemented Features

1. Dataset Validation (ylff/utils/dataset_validation.py)

DatasetValidator Class:

  • ✅ Data integrity checks (images, poses, metadata)
  • ✅ Quality validation (NaN/Inf detection, rotation matrix validity)
  • ✅ Statistical analysis (error distributions, image counts)
  • ✅ Comprehensive reporting

Functions:

  • ✅ validate_dataset_file() - Validate saved dataset files
  • ✅ check_dataset_integrity() - Check dataset directory integrity

Validation Checks:

  • Image format validation (numpy arrays, tensors, file paths)
  • Pose shape and validity checks
  • Metadata validation (weights, errors, sequence IDs)
  • NaN/Inf detection
  • Rotation matrix determinant checks
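The pose checks listed above can be illustrated with a minimal sketch. This is not the code in ylff/utils/dataset_validation.py; the function name is_valid_rotation and the tolerance are assumptions chosen for illustration. It combines NaN/Inf detection with an orthonormality test and a determinant test on a 3x3 rotation block.

```python
import math


def is_valid_rotation(R, tol=1e-6):
    """Return True if R (3x3 nested list) is finite, orthonormal, and has det +1.

    Hypothetical helper for illustration; not part of the ylff API.
    """
    # Shape check: exactly three rows of three entries
    if len(R) != 3 or any(len(row) != 3 for row in R):
        return False
    # NaN/Inf detection
    if not all(math.isfinite(x) for row in R for x in row):
        return False
    # Orthonormality: R @ R^T should equal the identity
    for i in range(3):
        for j in range(3):
            dot = sum(R[i][k] * R[j][k] for k in range(3))
            if abs(dot - (1.0 if i == j else 0.0)) > tol:
                return False
    # Determinant must be +1 (a value of -1 indicates a reflection)
    det = (
        R[0][0] * (R[1][1] * R[2][2] - R[1][2] * R[2][1])
        - R[0][1] * (R[1][0] * R[2][2] - R[1][2] * R[2][0])
        + R[0][2] * (R[1][0] * R[2][1] - R[1][1] * R[2][0])
    )
    return abs(det - 1.0) <= tol
```

The determinant test alone is not sufficient, since a sheared or scaled matrix can also have determinant 1; that is why the sketch checks orthonormality first.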

2. Dataset Curation (ylff/utils/dataset_curation.py)

DatasetCurator Class:

  • ✅ Quality-based filtering (error, weight, image count thresholds)
  • ✅ Outlier removal (percentile-based, statistical IQR method)
  • ✅ Dataset balancing (error bins, uniform, weighted strategies)
  • ✅ Dataset splitting (train/val/test with stratification)
  • ✅ Smart sampling (random, weighted, error-based)

Curation Strategies:

  • Filtering: By error range, weight range, image count
  • Outlier Removal: Percentile-based or statistical IQR
  • Balancing: Error bins, uniform distribution, weighted sampling
  • Splitting: Stratified or random train/val/test splits
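The two outlier-removal strategies named above could look roughly like the following. This is a sketch, not the DatasetCurator implementation: the function names, the "error" key on each sample, the nearest-rank percentile rule, and the 1.5 IQR multiplier are all assumptions.

```python
import math
import statistics


def remove_outliers_percentile(samples, percentile=95.0, key=lambda s: s["error"]):
    """Keep samples whose error is at or below the nearest-rank percentile."""
    if not samples:
        return []
    ordered = sorted(key(s) for s in samples)
    # Nearest-rank percentile: 1-based rank clamped into [1, n]
    rank = math.ceil(percentile * len(ordered) / 100)
    rank = min(max(rank, 1), len(ordered))
    threshold = ordered[rank - 1]
    return [s for s in samples if key(s) <= threshold]


def remove_outliers_iqr(samples, k=1.5, key=lambda s: s["error"]):
    """Keep samples whose error lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    errors = [key(s) for s in samples]
    if len(errors) < 2:
        # statistics.quantiles needs at least two data points
        return list(samples)
    q1, _, q3 = statistics.quantiles(errors, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [s for s in samples if lower <= key(s) <= upper]
```

The percentile variant always discards a fixed share of the data, while the IQR variant only discards values that sit far from the bulk of the distribution, so the latter may remove nothing on a clean dataset.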

3. Dataset Analysis (ylff/utils/dataset_analysis.py)

DatasetAnalyzer Class:

  • ✅ Statistical analysis (mean, median, quartiles, percentiles)
  • ✅ Distribution computation (histograms, binning)
  • ✅ Quality metrics (error ratios, weight diversity, completeness)
  • ✅ Correlation analysis
  • ✅ Report generation (JSON, text, markdown)

Analysis Features:

  • Error statistics (mean, median, Q25/Q75, Q90/Q95/Q99)
  • Weight statistics
  • Image count statistics
  • Sequence statistics (samples per sequence)
  • Quality metrics (low/medium/high error ratios)
  • Completeness metrics
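As an illustration of the error statistics and error-ratio metrics listed above, here is a small standard-library sketch. The function names, the output keys, and the two thresholds that separate low, medium, and high error are hypothetical; the source does not state which percentile method or thresholds DatasetAnalyzer uses.

```python
import statistics


def error_statistics(errors):
    """Compute mean, median, and selected percentiles of a list of errors."""
    # 99 cut points; index i-1 holds the i-th percentile
    cuts = statistics.quantiles(errors, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(errors),
        "median": statistics.median(errors),
        "q25": cuts[24],
        "q75": cuts[74],
        "q90": cuts[89],
        "q95": cuts[94],
        "q99": cuts[98],
    }


def error_ratios(errors, low_threshold, high_threshold):
    """Fraction of samples with low, medium, and high error (thresholds are assumed)."""
    n = len(errors)
    low = sum(e < low_threshold for e in errors) / n
    high = sum(e >= high_threshold for e in errors) / n
    return {"low": low, "medium": 1.0 - low - high, "high": high}
```

The same error_statistics helper could be reused for weights and image counts, since it only needs a list of numbers.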

📋 API Endpoints

/api/v1/dataset/validate (POST)

Request Model: ValidateDatasetRequest

{
  "dataset_path": "data/training/dataset.pkl",
  "strict": false,
  "check_images": true,
  "check_poses": true,
  "check_metadata": true
}

Response: DatasetValidationResponse

  • validation_passed: Boolean
  • statistics: Dataset statistics
  • issues: List of validation issues
  • summary: Validation summary
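A client call to the validation endpoint might be sketched as follows, using only the standard library. The base URL is an assumption (the source gives only the path), and the helper names are hypothetical; the payload fields and response keys are the ones documented above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: host and port are not given in the source


def build_validate_request(dataset_path, strict=False):
    """Build (but do not send) a POST request for /api/v1/dataset/validate."""
    payload = {
        "dataset_path": dataset_path,
        "strict": strict,
        "check_images": True,
        "check_poses": True,
        "check_metadata": True,
    }
    return urllib.request.Request(
        BASE_URL + "/api/v1/dataset/validate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def validate_dataset(dataset_path, strict=False):
    """Send the request and return the decoded JSON response body."""
    request = build_validate_request(dataset_path, strict)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A caller would then inspect result["validation_passed"] and iterate over result["issues"], assuming the server returns the DatasetValidationResponse fields as top-level JSON keys.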

/api/v1/dataset/curate (POST)

Request Model: CurateDatasetRequest

{
  "dataset_path": "data/training/dataset.pkl",
  "output_path": "data/training/dataset_curated.pkl",
  "min_error": 0.5,
  "max_error": 30.0,
  "remove_outliers": true,
  "outlier_percentile": 95.0,
  "balance": true,
  "balance_strategy": "error_bins",
  "num_bins": 10
}

Response: JobResponse (async job)
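To make the balance_strategy "error_bins" and num_bins fields more concrete, the sketch below shows one plausible reading: split the error range into equal-width bins and downsample every non-empty bin to the size of the smallest one. The actual DatasetCurator behavior is not described in the source, so the binning rule, the function name, the "error" key, and the seed parameter are assumptions.

```python
import random


def balance_by_error_bins(samples, num_bins=10, seed=0, key=lambda s: s["error"]):
    """Downsample each equal-width error bin to the smallest non-empty bin size."""
    if not samples:
        return []
    errors = [key(s) for s in samples]
    lo, hi = min(errors), max(errors)
    # Fall back to a width of 1.0 when all errors are identical
    width = (hi - lo) / num_bins or 1.0
    bins = {}
    for sample, error in zip(samples, errors):
        # Clamp so the maximum error lands in the last bin
        index = min(int((error - lo) / width), num_bins - 1)
        bins.setdefault(index, []).append(sample)
    target = min(len(members) for members in bins.values())
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    balanced = []
    for index in sorted(bins):
        balanced.extend(rng.sample(bins[index], target))
    return balanced
```

Downsampling to the smallest bin can discard a lot of data when one bin is nearly empty, which is one reason to remove outliers before balancing, as the request example does.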

/api/v1/dataset/analyze (POST)

Request Model: AnalyzeDatasetRequest

{
  "dataset_path": "data/training/dataset.pkl",
  "output_path": "data/training/analysis.json",
  "format": "json",
  "compute_distributions": true,
  "compute_correlations": true
}

Response: DatasetAnalysisResponse

  • statistics: Dataset statistics
  • quality_metrics: Quality metrics
  • report: Human-readable report (returned when format is text or markdown)
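How the three output formats might differ can be sketched with a small renderer. This is illustrative only: render_report and its layout are hypothetical and do not reflect the actual DatasetAnalyzer report structure.

```python
import json


def render_report(statistics, quality_metrics, fmt="json"):
    """Render two flat dicts as a JSON, markdown, or plain-text report string."""
    sections = [("Statistics", statistics), ("Quality Metrics", quality_metrics)]
    if fmt == "json":
        return json.dumps(
            {"statistics": statistics, "quality_metrics": quality_metrics},
            indent=2,
            sort_keys=True,
        )
    if fmt == "markdown":
        lines = ["# Dataset Analysis"]
        for title, values in sections:
            lines += ["", f"## {title}", ""]
            lines += [f"- {name}: {value}" for name, value in sorted(values.items())]
        return "\n".join(lines)
    if fmt == "text":
        lines = ["Dataset Analysis"]
        for title, values in sections:
            lines.append(f"{title}:")
            lines += [f"  {name} = {value}" for name, value in sorted(values.items())]
        return "\n".join(lines)
    raise ValueError(f"unsupported format: {fmt}")
```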

🔧 CLI Commands

ylff dataset validate

ylff dataset validate data/training/dataset.pkl \
    --strict \
    --check-images \
    --check-poses \
    --check-metadata \
    --output validation_report.json

ylff dataset curate

ylff dataset curate \
    data/training/dataset.pkl \
    data/training/dataset_curated.pkl \
    --min-error 0.5 \
    --max-error 30.0 \
    --remove-outliers \
    --outlier-percentile 95.0 \
    --balance \
    --balance-strategy error_bins \
    --num-bins 10

ylff dataset analyze

ylff dataset analyze data/training/dataset.pkl \
    --output analysis_report.json \
    --format json \
    --compute-distributions \
    --compute-correlations

🔄 Integration

Data Pipeline Integration

The BADataPipeline.build_training_set() method now automatically:

  • ✅ Validates built datasets
  • ✅ Analyzes dataset statistics
  • ✅ Logs validation and analysis results

Usage in Training

from ylff.utils.dataset_validation import DatasetValidator
from ylff.utils.dataset_curation import DatasetCurator
from ylff.utils.dataset_analysis import DatasetAnalyzer

# Validate
validator = DatasetValidator(strict=False)
report = validator.validate_dataset(samples)

# Curate
curator = DatasetCurator()
curated, stats = curator.filter_by_quality(
    samples,
    min_error=0.5,
    max_error=30.0,
)
curated, _ = curator.remove_outliers(curated, error_percentile=95.0)

# Analyze
analyzer = DatasetAnalyzer()
analysis = analyzer.analyze_dataset(curated)
# Keep the file extension consistent with the requested report format
analyzer.generate_report("analysis_report.md", format="markdown")
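The usage example above stops after analysis. The stratified train/val/test split listed among the curation features could follow it; the sketch below shows the idea without relying on the DatasetCurator API, whose split method is not shown in the source. The function name, the "error" key, the quantile-bin stratification, and the default ratios are assumptions.

```python
import random


def stratified_split(samples, ratios=(0.8, 0.1, 0.1), num_bins=5, seed=0,
                     key=lambda s: s["error"]):
    """Split samples into train/val/test with each error quantile bin represented."""
    rng = random.Random(seed)
    ordered = sorted(samples, key=key)
    n = len(ordered)
    train, val, test = [], [], []
    for b in range(num_bins):
        # Equal-count bins over the sorted errors (quantile bins)
        chunk = ordered[b * n // num_bins:(b + 1) * n // num_bins]
        rng.shuffle(chunk)
        n_train = round(len(chunk) * ratios[0])
        n_val = round(len(chunk) * ratios[1])
        train += chunk[:n_train]
        val += chunk[n_train:n_train + n_val]
        # The remainder of each bin goes to the test split
        test += chunk[n_train + n_val:]
    return train, val, test
```

Splitting within each bin keeps the error distribution of the validation and test sets close to that of the training set, which a purely random split does not guarantee on small datasets.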

📊 Features

Validation Features

  • ✅ Image format validation (numpy, tensor, file paths)
  • ✅ Pose shape and validity checks
  • ✅ Metadata validation
  • ✅ NaN/Inf detection
  • ✅ Rotation matrix validation
  • ✅ File integrity checks

Curation Features

  • ✅ Quality filtering (error, weight, image count)
  • ✅ Outlier removal (percentile, IQR)
  • ✅ Dataset balancing (error bins, uniform, weighted)
  • ✅ Train/val/test splitting (stratified, random)
  • ✅ Smart sampling strategies

Analysis Features

  • ✅ Statistical analysis (mean, median, quartiles)
  • ✅ Distribution computation
  • ✅ Quality metrics
  • ✅ Correlation analysis
  • ✅ Report generation (JSON, text, markdown)

🚀 Next Steps

  1. Dataset Versioning - Track dataset versions and metadata
  2. Visualization - Generate plots for distributions and statistics
  3. Advanced Filtering - Scene-based, sequence-based filtering
  4. Data Augmentation - Integration with augmentation strategies
  5. Dataset Comparison - Compare multiple datasets

All core functionality is implemented and ready to use! 🎉