# End-to-End Training Pipeline Architecture

## 🎯 Overview

The training pipeline is split into **two phases** to handle the computational cost of BA:

1. **Pre-Processing Phase** (offline, expensive) - Compute BA and oracle uncertainty
2. **Training Phase** (online, fast) - Load pre-computed results and train

## 📊 Pipeline Flow

### Phase 1: Pre-Processing (Offline)

**When:** Run once before training (or when data/model changes)

**What it does:**

1. Extract ARKit data (poses, LiDAR) - **FREE**
2. Run DA3 inference (GPU, batchable) - **Moderate cost**
3. Run BA validation (CPU, expensive) - **Only if ARKit quality is poor**
4. Compute oracle uncertainty propagation - **Moderate cost**
5. Save to cache - **Fast disk I/O**

**Time:** ~10-20 minutes per sequence (mostly BA)

**Command:**

```bash
ylff preprocess arkit data/arkit_sequences \
  --output-cache cache/preprocessed \
  --num-workers 8
```

### Phase 2: Training (Online)

**When:** Run repeatedly during training iterations

**What it does:**

1. Load pre-computed results from cache - **Fast (disk I/O)**
2. Run DA3 inference (current model) - **GPU, fast**
3. Compute uncertainty-weighted loss - **GPU, fast**
4. Backprop & update - **Standard training**

**Time:** ~1-3 seconds per sequence

**Command:**

```bash
ylff train pretrain data/arkit_sequences \
  --use-preprocessed \
  --preprocessed-cache-dir cache/preprocessed \
  --epochs 50
```

## 🔄 Complete Workflow

### Step 1: Pre-Process All Sequences

```bash
# Pre-process all ARKit sequences (one-time, can run overnight)
ylff preprocess arkit data/arkit_sequences \
  --output-cache cache/preprocessed \
  --model-name depth-anything/DA3-LARGE \
  --num-workers 8 \
  --use-lidar \
  --prefer-arkit-poses

# This:
# - Extracts ARKit data (free)
# - Runs DA3 inference (GPU)
# - Runs BA only for sequences with poor ARKit tracking
# - Computes oracle uncertainty
# - Saves everything to cache
```

**Output:**

```
cache/preprocessed/
├── sequence_001/
│   ├── oracle_targets.npz       # Best poses/depth (BA or ARKit)
│   ├── uncertainty_results.npz  # Confidence scores, uncertainty
│   ├── arkit_data.npz           # Original ARKit data
│   └── metadata.json            # Sequence info
└── sequence_002/
    └── ...
```

### Step 2: Train Using Pre-Processed Data

```bash
# Train using pre-computed results (fast iteration)
ylff train pretrain data/arkit_sequences \
  --use-preprocessed \
  --preprocessed-cache-dir cache/preprocessed \
  --epochs 50 \
  --lr 1e-4 \
  --batch-size 1
```

**What happens:**

1. Loads pre-computed oracle targets and uncertainty from cache
2. Runs DA3 inference with the current model
3. Computes uncertainty-weighted loss (continuous confidence)
4. Updates model weights

## 🚫 Handling Rejection/Failure

### No Binary Rejection

**Key Principle:** All data contributes, just weighted by confidence.
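Assuming the `.npz`/`metadata.json` cache layout shown under **Output** above, loading one cached sequence could look like the following sketch. The function name and the returned dictionary keys are illustrative, not the actual `PreprocessedARKitDataset` API:

```python
import json
from pathlib import Path

import numpy as np


def load_preprocessed_sequence(cache_dir: str, sequence_id: str) -> dict:
    """Load one pre-processed sequence from the cache layout above."""
    seq_dir = Path(cache_dir) / sequence_id
    return {
        # np.load returns a lazy NpzFile; dict() materializes all arrays
        "oracle_targets": dict(np.load(seq_dir / "oracle_targets.npz")),
        "uncertainty_results": dict(np.load(seq_dir / "uncertainty_results.npz")),
        "arkit_data": dict(np.load(seq_dir / "arkit_data.npz")),
        "metadata": json.loads((seq_dir / "metadata.json").read_text()),
    }
```

Because everything is plain disk I/O, this is the ~0.1-1 second "load cache" step in the training loop; no BA runs here.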
### Continuous Confidence Weighting

**In Loss Function:**

```python
# All pixels/frames contribute, weighted by confidence
loss = confidence * prediction_error

# Low confidence (0.3) → weight = 0.3 (contributes less)
# High confidence (0.9) → weight = 0.9 (contributes more)
# No hard cutoff - smooth weighting
```

### Failure Scenarios

**BA Failure:**
- ✅ Falls back to ARKit poses (if quality good)
- ✅ Lower confidence score (reflects uncertainty)
- ✅ Still used for training (just weighted less)
- ✅ Model learns from ARKit poses with lower confidence

**Missing LiDAR:**
- ✅ Uses BA depth (if available)
- ✅ Or geometric consistency only
- ✅ Lower confidence score
- ✅ Still used for training

**Poor Tracking:**
- ✅ Lower confidence score
- ✅ Still used for training
- ✅ Model learns to handle uncertainty

**Key Insight:** Even "failed" or low-confidence data contributes to training, just with lower weight. This is better than binary rejection because:

- No information loss
- Model learns to handle uncertainty
- Smooth gradient flow (no hard cutoffs)
- Better generalization

## 📈 Performance Comparison

### Without Pre-Processing (Current)

**Per Training Iteration:**
- BA computation: ~5-15 min per sequence (CPU, expensive)
- DA3 inference: ~0.5-2 sec per sequence (GPU)
- Loss computation: ~0.1-0.5 sec per sequence (GPU)
- **Total: ~5-15 min per sequence**

**For 100 sequences:**
- One epoch: ~8-25 hours
- 50 epochs: ~17-52 days

### With Pre-Processing (New)

**Pre-Processing (One-Time):**
- BA computation: ~5-15 min per sequence (CPU, expensive)
- Oracle uncertainty: ~10-30 sec per sequence (CPU)
- **Total: ~10-20 min per sequence** (one-time cost)

**Training (Per Iteration):**
- Load cache: ~0.1-1 sec per sequence (disk I/O)
- DA3 inference: ~0.5-2 sec per sequence (GPU)
- Loss computation: ~0.1-0.5 sec per sequence (GPU)
- **Total: ~1-3 sec per sequence**

**For 100 sequences:**
- Pre-processing: ~17-33 hours (one-time)
- One epoch: ~2-5 minutes
- 50 epochs: ~2-4 hours
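The continuous confidence weighting sketched earlier can be expanded into a complete per-pixel loss. This is a minimal NumPy sketch, not the project's `oracle_uncertainty_ensemble_loss()`; the normalization by the confidence sum is an added design choice that keeps the loss scale comparable across sequences with different confidence distributions:

```python
import numpy as np


def confidence_weighted_loss(pred: np.ndarray,
                             target: np.ndarray,
                             confidence: np.ndarray) -> float:
    """Per-pixel L1 error, weighted by per-pixel confidence in [0, 1].

    Every pixel contributes; low-confidence pixels are down-weighted
    rather than discarded, so there is no hard cutoff in the gradient.
    """
    error = np.abs(pred - target)
    weighted = confidence * error
    # Guard against division by zero if every pixel has zero confidence
    return float(weighted.sum() / np.maximum(confidence.sum(), 1e-8))
```

With uniform confidence this reduces to a plain mean absolute error; as confidence varies, low-confidence pixels simply contribute less, mirroring the "no binary rejection" principle above.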
**Speedup:** 100-1000x faster training iteration!

## 🔧 Implementation Details

### Pre-Processing Service

**File:** `ylff/services/preprocessing.py`

**Function:** `preprocess_arkit_sequence()`

**Steps:**
1. Extract ARKit data (free)
2. Run DA3 inference (GPU)
3. Decide: ARKit poses (if quality good) or BA (if quality poor)
4. Compute oracle uncertainty propagation
5. Save to cache

### Preprocessed Dataset

**File:** `ylff/services/preprocessed_dataset.py`

**Class:** `PreprocessedARKitDataset`

**Features:**
- Loads pre-computed oracle targets
- Loads uncertainty results (confidence, covariance)
- Loads ARKit data (for reference)
- Fast disk I/O (no BA computation)

### Training Integration

**File:** `ylff/services/pretrain.py`

**Changes:**
- Detects preprocessed data (checks for `uncertainty_results` in batch)
- Uses `oracle_uncertainty_ensemble_loss()` when available
- Falls back to standard loss for live data (backward compatibility)

## 📝 Usage Examples

### Full Workflow

```bash
# Step 1: Pre-process (one-time, overnight)
ylff preprocess arkit data/arkit_sequences \
  --output-cache cache/preprocessed \
  --num-workers 8

# Step 2: Train (fast iteration)
ylff train pretrain data/arkit_sequences \
  --use-preprocessed \
  --preprocessed-cache-dir cache/preprocessed \
  --epochs 50

# Step 3: Iterate on training (no re-preprocessing needed)
ylff train pretrain data/arkit_sequences \
  --use-preprocessed \
  --preprocessed-cache-dir cache/preprocessed \
  --epochs 100 \
  --lr 5e-5  # Lower LR for fine-tuning
```

### When to Re-Preprocess

Only needed if:
- ✅ New sequences added
- ✅ Different DA3 model used for initial inference
- ✅ BA parameters changed
- ✅ Oracle uncertainty parameters changed

**Not needed for:**
- ❌ Training hyperparameter changes (LR, batch size, etc.)
- ❌ Model architecture changes (same input/output)
- ❌ Training iteration (epochs, etc.)

## 🎓 Key Benefits

1. **100-1000x faster training iteration** - No BA during training
2. **Continuous confidence weighting** - No binary rejection
3. **All data contributes** - Low confidence = low weight, not zero
4. **Uncertainty propagation** - Covariance estimates available
5. **Parallelizable pre-processing** - Can process multiple sequences simultaneously
6. **Reusable cache** - Pre-process once, train many times

## 📊 Summary

**Pre-Processing:**
- Runs BA and oracle uncertainty computation offline
- Saves results to cache
- One-time cost per dataset

**Training:**
- Loads pre-computed results
- Fast iteration (no BA)
- Uses continuous confidence weighting
- All data contributes (weighted by confidence)

This architecture enables efficient training while using all available oracle sources! 🚀
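The "When to Re-Preprocess" rules could be automated by storing a fingerprint of the settings that affect cached results (DA3 model name, BA parameters, oracle uncertainty parameters) in each sequence's `metadata.json`. A minimal sketch, where `preprocess_fingerprint` is a hypothetical metadata field, not one the pipeline currently writes:

```python
import hashlib
import json
from pathlib import Path


def preprocess_fingerprint(config: dict) -> str:
    """Stable hash of the settings that affect cached results."""
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def needs_repreprocess(seq_dir: Path, current_config: dict) -> bool:
    """True if the cached sequence is missing or was built with
    different preprocessing settings than current_config."""
    meta_path = seq_dir / "metadata.json"
    if not meta_path.exists():
        return True  # new sequence, never pre-processed
    meta = json.loads(meta_path.read_text())
    # 'preprocess_fingerprint' is a hypothetical metadata key (see lead-in)
    return meta.get("preprocess_fingerprint") != preprocess_fingerprint(current_config)
```

Training hyperparameters (LR, batch size, epochs) stay out of the fingerprint by construction, so changing them never invalidates the cache, exactly as the rules above require.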