# End-to-End Training Pipeline Architecture
## 🎯 Overview
The training pipeline is split into **two phases** to handle the computational cost of bundle adjustment (BA):
1. **Pre-Processing Phase** (offline, expensive) - Compute BA and oracle uncertainty
2. **Training Phase** (online, fast) - Load pre-computed results and train
## 📊 Pipeline Flow
### Phase 1: Pre-Processing (Offline)
**When:** Run once before training (or when data/model changes)
**What it does:**
1. Extract ARKit data (poses, LiDAR) - **FREE**
2. Run DA3 inference (GPU, batchable) - **Moderate cost**
3. Run BA validation (CPU, expensive) - **Only if ARKit quality is poor**
4. Compute oracle uncertainty propagation - **Moderate cost**
5. Save to cache - **Fast disk I/O**
**Time:** ~10-20 minutes per sequence (mostly BA)
**Command:**
```bash
ylff preprocess arkit data/arkit_sequences \
--output-cache cache/preprocessed \
--num-workers 8
```
### Phase 2: Training (Online)
**When:** Run repeatedly during training iterations
**What it does:**
1. Load pre-computed results from cache - **Fast (disk I/O)**
2. Run DA3 inference (current model) - **GPU, fast**
3. Compute uncertainty-weighted loss - **GPU, fast**
4. Backprop & update - **Standard training**
**Time:** ~1-3 seconds per sequence
**Command:**
```bash
ylff train pretrain data/arkit_sequences \
--use-preprocessed \
--preprocessed-cache-dir cache/preprocessed \
--epochs 50
```
## 🔄 Complete Workflow
### Step 1: Pre-Process All Sequences
```bash
# Pre-process all ARKit sequences (one-time, can run overnight)
ylff preprocess arkit data/arkit_sequences \
--output-cache cache/preprocessed \
--model-name depth-anything/DA3-LARGE \
--num-workers 8 \
--use-lidar \
--prefer-arkit-poses
# This:
# - Extracts ARKit data (free)
# - Runs DA3 inference (GPU)
# - Runs BA only for sequences with poor ARKit tracking
# - Computes oracle uncertainty
# - Saves everything to cache
```
**Output:**
```
cache/preprocessed/
├── sequence_001/
│   ├── oracle_targets.npz       # Best poses/depth (BA or ARKit)
│   ├── uncertainty_results.npz  # Confidence scores, uncertainty
│   ├── arkit_data.npz           # Original ARKit data
│   └── metadata.json            # Sequence info
└── sequence_002/
    └── ...
```
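The cached arrays can be read back with plain NumPy. A minimal loader sketch for one sequence directory, assuming only the file names shown above (the field names inside each `.npz` are not specified in this document):

```python
import json
from pathlib import Path

import numpy as np

def load_preprocessed(seq_dir):
    """Load one pre-processed sequence directory from the cache layout above."""
    seq_dir = Path(seq_dir)
    return {
        "oracle_targets": dict(np.load(seq_dir / "oracle_targets.npz")),
        "uncertainty_results": dict(np.load(seq_dir / "uncertainty_results.npz")),
        "arkit_data": dict(np.load(seq_dir / "arkit_data.npz")),
        "metadata": json.loads((seq_dir / "metadata.json").read_text()),
    }
```

Because everything is plain `.npz`/JSON, a cached sequence can be inspected or loaded without any of the training stack.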
### Step 2: Train Using Pre-Processed Data
```bash
# Train using pre-computed results (fast iteration)
ylff train pretrain data/arkit_sequences \
--use-preprocessed \
--preprocessed-cache-dir cache/preprocessed \
--epochs 50 \
--lr 1e-4 \
--batch-size 1
```
**What happens:**
1. Loads pre-computed oracle targets and uncertainty from cache
2. Runs DA3 inference with current model
3. Computes uncertainty-weighted loss (continuous confidence)
4. Updates model weights
## 🚫 Handling Rejection/Failure
### No Binary Rejection
**Key Principle:** All data contributes, just weighted by confidence.
### Continuous Confidence Weighting
**In Loss Function:**
```python
import numpy as np

# All pixels/frames contribute, weighted by confidence
def uncertainty_weighted_loss(prediction_error, confidence):
    # Low confidence (0.3) -> weight=0.3 (contributes less)
    # High confidence (0.9) -> weight=0.9 (contributes more)
    # No hard cutoff - smooth weighting
    return np.mean(confidence * prediction_error)
```
### Failure Scenarios
**BA Failure:**
- ✅ Falls back to ARKit poses (if quality good)
- ✅ Lower confidence score (reflects uncertainty)
- ✅ Still used for training (just weighted less)
- ✅ Model learns from ARKit poses with lower confidence
**Missing LiDAR:**
- ✅ Uses BA depth (if available)
- ✅ Or geometric consistency only
- ✅ Lower confidence score
- ✅ Still used for training
**Poor Tracking:**
- ✅ Lower confidence score
- ✅ Still used for training
- ✅ Model learns to handle uncertainty
**Key Insight:** Even "failed" or low-confidence data contributes to training, just with lower weight. This is better than binary rejection because:
- No information loss
- Model learns to handle uncertainty
- Smooth gradient flow (no hard cutoffs)
- Better generalization
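As an illustration, a toy confidence heuristic covering the three scenarios above (the multipliers and the 0.05 floor are invented for this sketch, not the project's actual values):

```python
def frame_confidence(ba_converged, has_lidar, arkit_tracking_quality):
    """Toy continuous confidence score; no scenario drops a frame outright."""
    confidence = 1.0
    if not ba_converged:
        confidence *= 0.7  # BA failure: fall back to ARKit poses, trust less
    if not has_lidar:
        confidence *= 0.8  # missing LiDAR: geometric consistency only
    confidence *= arkit_tracking_quality  # poor tracking scales everything down
    return max(confidence, 0.05)  # floor keeps even "failed" data contributing
```

A frame with converged BA, LiDAR, and perfect tracking keeps weight 1.0; a frame missing all three is down-weighted toward the floor, never discarded.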
## 📈 Performance Comparison
### Without Pre-Processing (Current)
**Per Training Iteration:**
- BA computation: ~5-15 min per sequence (CPU, expensive)
- DA3 inference: ~0.5-2 sec per sequence (GPU)
- Loss computation: ~0.1-0.5 sec per sequence (GPU)
- **Total: ~5-15 min per sequence**
**For 100 sequences:**
- One epoch: ~8-25 hours
- 50 epochs: ~17-52 days
### With Pre-Processing (New)
**Pre-Processing (One-Time):**
- BA computation: ~5-15 min per sequence (CPU, expensive)
- Oracle uncertainty: ~10-30 sec per sequence (CPU)
- **Total: ~10-20 min per sequence** (one-time cost)
**Training (Per Iteration):**
- Load cache: ~0.1-1 sec per sequence (disk I/O)
- DA3 inference: ~0.5-2 sec per sequence (GPU)
- Loss computation: ~0.1-0.5 sec per sequence (GPU)
- **Total: ~1-3 sec per sequence**
**For 100 sequences:**
- Pre-processing: ~17-33 hours (one-time)
- One epoch: ~2-5 minutes
- 50 epochs: ~2-4 hours
**Speedup:** 100-1000x faster training iteration!
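These figures follow directly from the per-sequence costs; a quick sanity check of the arithmetic:

```python
sequences = 100

# Without pre-processing: BA dominates at ~5-15 min per sequence
old_epoch_hours = [sequences * m / 60 for m in (5, 15)]    # ~8.3 to 25 hours

# With pre-processing: ~1-3 s per sequence
new_epoch_minutes = [sequences * s / 60 for s in (1, 3)]   # ~1.7 to 5 minutes

# Per-sequence speedup: (5-15 min) vs (1-3 s)
speedup = (5 * 60 / 3, 15 * 60 / 1)                        # ~100x to ~900x
```

The ~100x to ~900x per-sequence range is what the quoted 100-1000x speedup rounds to.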
## 🔧 Implementation Details
### Pre-Processing Service
**File:** `ylff/services/preprocessing.py`
**Function:** `preprocess_arkit_sequence()`
**Steps:**
1. Extract ARKit data (free)
2. Run DA3 inference (GPU)
3. Decide: ARKit poses (if quality good) or BA (if quality poor)
4. Compute oracle uncertainty propagation
5. Save to cache
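A control-flow sketch of those five steps, with the expensive calls replaced by stubs (everything below except the cache file names is invented for illustration; the real implementation is in `ylff/services/preprocessing.py`):

```python
import json
from pathlib import Path

import numpy as np

def preprocess_arkit_sequence(seq_dir, cache_dir, quality_threshold=0.7):
    # 1. Extract ARKit data (stubbed; free in practice)
    arkit = {"poses": np.eye(4)[None], "quality": 0.9}
    # 2. Run DA3 inference (stubbed; GPU in practice)
    depth = np.ones((1, 4, 4))
    # 3. Keep ARKit poses if tracking quality is good, otherwise run BA
    if arkit["quality"] >= quality_threshold:
        oracle_poses, confidence = arkit["poses"], arkit["quality"]
    else:
        oracle_poses, confidence = arkit["poses"], 0.5  # BA stub
    # 4./5. Propagate uncertainty (here just a scalar) and save to cache
    out = Path(cache_dir) / Path(seq_dir).name
    out.mkdir(parents=True, exist_ok=True)
    np.savez(out / "oracle_targets.npz", poses=oracle_poses, depth=depth)
    np.savez(out / "uncertainty_results.npz", confidence=np.array([confidence]))
    (out / "metadata.json").write_text(json.dumps({"sequence": Path(seq_dir).name}))
    return out
```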
### Preprocessed Dataset
**File:** `ylff/services/preprocessed_dataset.py`
**Class:** `PreprocessedARKitDataset`
**Features:**
- Loads pre-computed oracle targets
- Loads uncertainty results (confidence, covariance)
- Loads ARKit data (for reference)
- Fast disk I/O (no BA computation)
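A minimal sketch of what such a map-style dataset can look like (the class interface and batch field names are assumptions; the real class lives in `ylff/services/preprocessed_dataset.py`):

```python
from pathlib import Path

import numpy as np

class PreprocessedARKitDataset:
    """Map-style dataset over the pre-processed cache; no BA at load time."""

    def __init__(self, cache_dir):
        self.sequence_dirs = sorted(
            p for p in Path(cache_dir).iterdir() if p.is_dir()
        )

    def __len__(self):
        return len(self.sequence_dirs)

    def __getitem__(self, idx):
        seq = self.sequence_dirs[idx]
        return {
            "oracle_targets": dict(np.load(seq / "oracle_targets.npz")),
            "uncertainty_results": dict(np.load(seq / "uncertainty_results.npz")),
            "arkit_data": dict(np.load(seq / "arkit_data.npz")),
        }
```

Each `__getitem__` is pure disk I/O, which is what keeps the per-iteration cost in the seconds range.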
### Training Integration
**File:** `ylff/services/pretrain.py`
**Changes:**
- Detects preprocessed data (checks for `uncertainty_results` in batch)
- Uses `oracle_uncertainty_ensemble_loss()` when available
- Falls back to standard loss for live data (backward compatibility)
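The dispatch can be sketched as follows, with simple mean-absolute-error bodies standing in for `oracle_uncertainty_ensemble_loss()` and the standard loss (batch field names are assumptions):

```python
import numpy as np

def compute_loss(batch, predicted_depth):
    if "uncertainty_results" in batch:  # preprocessed data detected
        conf = batch["uncertainty_results"]["confidence"]
        err = np.abs(predicted_depth - batch["oracle_targets"]["depth"])
        return float(np.mean(conf * err))  # uncertainty-weighted loss
    # live data: standard unweighted loss (backward compatibility)
    return float(np.mean(np.abs(predicted_depth - batch["depth"])))
```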
## 📝 Usage Examples
### Full Workflow
```bash
# Step 1: Pre-process (one-time, overnight)
ylff preprocess arkit data/arkit_sequences \
--output-cache cache/preprocessed \
--num-workers 8
# Step 2: Train (fast iteration)
ylff train pretrain data/arkit_sequences \
--use-preprocessed \
--preprocessed-cache-dir cache/preprocessed \
--epochs 50
# Step 3: Iterate on training (no re-preprocessing needed)
ylff train pretrain data/arkit_sequences \
--use-preprocessed \
--preprocessed-cache-dir cache/preprocessed \
--epochs 100 \
--lr 5e-5 # Lower LR for fine-tuning
```
### When to Re-Preprocess
Only needed if:
- ✅ New sequences added
- ✅ Different DA3 model used for initial inference
- ✅ BA parameters changed
- ✅ Oracle uncertainty parameters changed
**Not needed for:**
- ❌ Training hyperparameter changes (LR, batch size, etc.)
- ❌ Model architecture changes (same input/output)
- ❌ Training iteration (epochs, etc.)
## 🎓 Key Benefits
1. **100-1000x faster training iteration** - No BA during training
2. **Continuous confidence weighting** - No binary rejection
3. **All data contributes** - Low confidence = low weight, not zero
4. **Uncertainty propagation** - Covariance estimates available
5. **Parallelizable pre-processing** - Can process multiple sequences simultaneously
6. **Reusable cache** - Pre-process once, train many times
## 📊 Summary
**Pre-Processing:**
- Runs BA and oracle uncertainty computation offline
- Saves results to cache
- One-time cost per dataset
**Training:**
- Loads pre-computed results
- Fast iteration (no BA)
- Uses continuous confidence weighting
- All data contributes (weighted by confidence)
This architecture enables efficient training while using all available oracle sources! 🚀