# Phase 2 Performance Guide

## TL;DR: The Right Way to Optimize

**DO NOT** blindly reduce computational budgets. Instead:

1. **Profile** to find the real bottleneck
2. **Optimize the bottleneck** (likely CIM vectorization)
3. **Validate any reductions** with ablation studies
4. **Document** empirical justification for paper

**Current status:**
- Baseline: 1000 particles, 8 MC samples
- CI fast mode: 10 trials, 20 steps (infrastructure testing only)
- Bottleneck: CIM forward model called 8000× per measurement (not vectorized)

---

## The Performance Problem

**Symptom:** Single trial takes 30 minutes in CI

**Math:**
- 8 MC samples × 1000 particles × ~100 steps × 256 CIM calls/patch = **204,800,000 forward model evaluations**
- At ~0.01 ms per call (Python loop overhead), that's 34 minutes

**Root cause:** `CIMObservationModel.predicted_conductance_2d()` calls `device.current()` in a nested Python loop (256 times for a 16×16 patch).

---

## The Right Fix: Vectorization

### Current Code (Slow)

```python
# belief.py line 156-158
for i, v2 in enumerate(v2_vals):
    for j, v1 in enumerate(v1_vals):
        patch[i, j] = self.device.current(v1, v2)  # Python loop + function call overhead
```

### Optimized Code (10-50× Faster)

```python
# Compute all voltage points at once with numpy
v1_grid, v2_grid = np.meshgrid(v1_vals, v2_vals)
patch = self.device.current_2d(v1_grid, v2_grid)  # Vectorized in numpy/C
```

**Why this matters:**
- NumPy operations are vectorized in C (SIMD instructions)
- Eliminates 256 Python function call overheads per patch
- Better CPU cache utilization
- Expected speedup: 10-50× for the forward model step

### Implementation Roadmap

1. Add `ConstantInteractionDevice.current_2d()` method (vectorized)
2. Update `CIMObservationModel.predicted_conductance_2d()` to use it
3. Benchmark: measure speedup on single trial
4. Validate: run ablation to confirm no accuracy loss

**This is the proper optimization:** improve efficiency without sacrificing scientific accuracy.
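Before touching the real device class, it can help to check on a toy model that the vectorized patch matches the nested loop element for element. The sketch below is only an illustration: `ToyDevice` and its formula are hypothetical stand-ins for `ConstantInteractionDevice` and do not reproduce the actual CIM physics, and `patch_loop` / `patch_vectorized` are illustrative helper names.

```python
import numpy as np


class ToyDevice:
    """Hypothetical stand-in for ConstantInteractionDevice (not the real physics)."""

    def __init__(self, lever_arm=0.1, kT=0.01):
        self.lever_arm = lever_arm
        self.kT = kT

    def current(self, v1, v2):
        # Scalar path: one Python call per voltage point.
        mu = self.lever_arm * (v1 + 0.5 * v2)
        return float(1.0 / np.cosh(mu / (2.0 * self.kT)) ** 2)

    def current_2d(self, v1_grid, v2_grid):
        # Vectorized path: the same formula applied to whole arrays at once.
        mu = self.lever_arm * (v1_grid + 0.5 * v2_grid)
        return 1.0 / np.cosh(mu / (2.0 * self.kT)) ** 2


def patch_loop(device, v1_vals, v2_vals):
    # Mirrors the nested loop: rows index v2, columns index v1.
    patch = np.empty((len(v2_vals), len(v1_vals)))
    for i, v2 in enumerate(v2_vals):
        for j, v1 in enumerate(v1_vals):
            patch[i, j] = device.current(v1, v2)
    return patch


def patch_vectorized(device, v1_vals, v2_vals):
    # Default indexing="xy" yields grids of shape (len(v2_vals), len(v1_vals)),
    # so patch[i, j] corresponds to (v1_vals[j], v2_vals[i]), as in the loop.
    v1_grid, v2_grid = np.meshgrid(v1_vals, v2_vals)
    return device.current_2d(v1_grid, v2_grid)


device = ToyDevice()
v1_vals = np.linspace(-0.5, 0.5, 16)
v2_vals = np.linspace(-0.3, 0.3, 16)
loop_patch = patch_loop(device, v1_vals, v2_vals)
vec_patch = patch_vectorized(device, v1_vals, v2_vals)
print(np.allclose(loop_patch, vec_patch))
```

The toy formula is deliberately asymmetric in `v1` and `v2`, so an accidental transpose of the grid would show up as a mismatch. The same equivalence check can be reused against the real `current()` and `current_2d()` once the latter exists, and the two functions can be timed with `timeit` to measure the actual speedup.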
---

## Only After Vectorization: Consider Budget Reductions

If vectorization isn't enough, run ablations:

```bash
python experiments/ablation_phase2.py --n-trials 20
```

This tests:
- **baseline:** 1000 particles, 8 MC samples
- **reduced_particles:** 500 particles, 8 MC samples
- **reduced_mc:** 1000 particles, 4 MC samples
- **both_reduced:** 500 particles, 4 MC samples

Output:

```
Config                Success%   Reduction%      Duration(s)     Speedup
----------------------------------------------------------------------
baseline              90.0%      52.3% ± 3.1%    120.5 ± 12.3    -
reduced_particles     88.5%      51.1% ± 3.4%    65.2 ± 8.1      1.85x
reduced_mc            89.0%      50.8% ± 3.5%    68.1 ± 9.2      1.77x
both_reduced          86.5%      48.9% ± 4.1%    35.4 ± 6.7      3.40x

KEY FINDINGS:
  reduced_particles: ✗ Performance differs from baseline (Δsuccess=1.5%, Δreduction=1.2%)
    → Not recommended despite 1.85× speedup
```

**Accept reductions only if:**
- Success rate Δ < 5%
- Measurement reduction Δ < 5%
- Documented in paper methods section

---

## Computational Bottlenecks

Phase 2 introduces several compute-intensive operations:

### 1. Particle Filter (BeliefUpdater)

**Cost:** O(n_particles × n_measurements)

Each measurement update:
- Computes likelihood for each particle (CIM forward model)
- Resamples when effective sample size drops below threshold
- Syncs to `belief.charge_probs` for other components

**Default:** 500 particles

**Trade-off:**
- 100 particles: Fast but coarse uncertainty estimates
- 500 particles: Good balance (CI default)
- 1000 particles: High accuracy for critical experiments
- 2000+ particles: Overkill for most cases

### 2. Active Sensing Monte Carlo (ActiveSensingPolicy)

**Cost:** O(n_mc_samples × n_particles × n_candidate_plans)

Each sensing decision:
- Samples n_mc_samples hypothetical measurements
- For each sample, updates a copy of the particle filter
- Estimates information gain for each candidate plan
- Typically evaluates 3-5 candidate plans per decision

**Default:** 4 MC samples

**Trade-off:**
- 2 samples: Very rough IG estimates, fast
- 4 samples: Reasonable estimates (CI default)
- 8 samples: Good estimates (production)
- 16+ samples: Diminishing returns

**Combined cost:** 4 MC × 500 particles × 4 plans = 8,000 forward model evaluations per measurement selection

### 3. Bayesian Optimization (MultiResBO)

**Cost:** O(n_bo_history²) for GP fitting

Each BO proposal:
- Fits a Gaussian Process on growing BO history
- Optimizes acquisition function (UCB) over voltage space

**Grows over time:** More expensive as experiment progresses

### 4. CIM Forward Model

**Cost:** O(1) per call, but called repeatedly

The CIM simulator computes conductance at each voltage point:
- Chemical potential calculation (depends on charge state)
- Fermi-Dirac statistics
- Tunneling current formula

**Not a bottleneck** for single calls, but becomes significant when multiplied by particle filter and MC sampling.

---

## Profiling

### Quick Profile

```bash
python experiments/benchmark_phase2.py \
    --fast \
    --profile \
    --skip-missing-checkpoints
```

This uses Python's cProfile and prints the top 20 slowest functions.

### Detailed Profile

```python
import cProfile
import pstats

from qdot.agent.executive import ExecutiveAgent

# ... setup state, adapter, etc.

profiler = cProfile.Profile()
profiler.enable()
summary = agent.run()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(50)
```

### Expected Hot Spots

Based on computational complexity:

1. `_ParticleSet.update()` - particle filter updates
2. `ActiveSensingPolicy._estimate_information_gain()` - MC sampling
3. `CIMObservationModel.log_likelihood_2d()` - forward model
4. `GaussianProcess.fit()` - GP kernel matrix inversion
5. `MultiResBO.propose()` - acquisition optimization

---

## Tuning Guidelines

### For CI (Fast Turnaround)

```python
# Already configured in benchmark_phase2.py --fast
# 10 trials, 20 steps, 512 measurement budget
# Runtime: 10-15 minutes
```

### For Development (Moderate Accuracy)

```python
agent = ExecutiveAgent(
    state=state,
    adapter=adapter,
    max_steps=50,
    measurement_budget=1024,
)
# BeliefUpdater uses 500 particles (default)
# ActiveSensingPolicy uses 4 MC samples (default)
# Runtime: ~20-30 minutes for 10 trials
```

### For Production (High Accuracy)

```python
# Create custom components with higher budgets
belief_updater = BeliefUpdater(
    belief=state.belief,
    n_particles=1000,  # 2x particles
)
sensing_policy = ActiveSensingPolicy(
    n_mc_samples=8,  # 2x MC samples
)
# Inject into ExecutiveAgent (Phase 3 feature)
# For Phase 2, edit the defaults in the source files
```

### For Benchmarking

```bash
# Full 100-trial evaluation with trained models
python experiments/benchmark_phase2.py \
    --n-trials 100 \
    --budget 2048 \
    --max-steps 100
# Runtime: 2-4 hours (depends on hardware)
```

---

## Performance Expectations

### Single Trial Timing (Intel i7, 8 cores)

- Bootstrap: ~5 seconds (line scan)
- Coarse Survey: ~30 seconds (coarse 2D scan)
- Charge ID: ~20 seconds (local patch + classification)
- Navigation: ~10 seconds per voltage move (BO + belief update)
- Verification: ~30 seconds (repeated measurements)

**Total per trial:** 1-3 minutes on average (depends on backtracking)

### 100-Trial Benchmark

- **Fast mode (CI):** 10 trials × 20 steps = 10-15 minutes
- **Full mode:** 100 trials × 100 steps = 2-4 hours

### Scaling Factors

- **Particles:** Linear scaling (2x particles = 2x runtime)
- **MC samples:** Linear scaling (2x samples = 2x runtime for sensing)
- **Step count:** Linear scaling (2x steps = 2x runtime)
- **BO history:** Quadratic scaling (2x history = 4x GP fit time)

---

## Optimization Strategies

Any strategy below that lowers a sampling budget (fewer particles or MC samples) should pass the ablation acceptance criteria before it is adopted.

### If BeliefUpdater is the Bottleneck

1. Reduce `n_particles` to 250-300
2. Lower `resample_threshold` so that resampling (triggered when the effective sample size drops below the threshold) happens less often
3. Use a coarser voltage grid for likelihood evaluation
4. Cache CIM forward model results for repeated voltage points

### If ActiveSensingPolicy is the Bottleneck

1. Reduce `n_mc_samples` to 2-3
2. Reduce the number of candidate plans considered
3. Skip active sensing for certain stages (e.g., bootstrap always uses line scan)
4. Use a heuristic policy (e.g., always take coarse 2D in survey stage)

### If BO is the Bottleneck

1. Limit BO history to last N points (e.g., 50 points)
2. Use a sparse GP approximation
3. Use simpler acquisition (e.g., probability of improvement vs UCB)
4. Skip BO optimization and use greedy search

### Parallelization (Future Work)

- Particle filter updates are embarrassingly parallel
- MC sampling can be parallelized across samples
- Multiple trials in benchmark can run in parallel

**Not implemented in Phase 2** - requires careful handling of NumPy random state and PyTorch device placement.

---

## Debugging Slow Runs

### Check if Agent is Stuck

```python
# Add verbose logging to ExecutiveAgent._step()
if self.state.step % 10 == 0:
    print(f"Step {self.state.step}: stage={self.state.stage}, "
          f"measurements={self.state.total_measurements}")
```

### Common Causes of Slowdown

1. **Backtracking loop:** State machine gets stuck retrying failed stages
2. **Low-quality measurements:** DQC repeatedly rejects measurements
3. **Poor BO convergence:** BO proposals don't improve, agent exhausts step budget
4. **HITL blocking:** HITL not in test mode, waiting for human input
5. **Excessive logging:** Governance logger writing large decision objects

### Quick Diagnosis

```bash
# Run with verbose output
python -u experiments/benchmark_phase2.py --fast 2>&1 | tee benchmark.log

# Check for repeated stage names (stuck in backtracking)
grep "stage=" benchmark.log | tail -50

# Check measurement count vs step count (efficiency)
grep "meas" benchmark.log | tail -20
```

---

## When to Profile

**Profile when:**
- CI times out despite --fast mode
- Single trial takes >5 minutes
- Benchmark takes >1 hour for 10 trials
- Memory usage grows unbounded

**Don't profile when:**
- Runs complete successfully in expected time
- Small variance across trials (<2x)
- You just need to trade accuracy for faster turnaround (adjust budgets directly)

---

## Summary

**The bottleneck is particle filter × MC sampling: 4 samples × 500 particles = 2,000 forward model evaluations per candidate plan, or 8,000 per measurement decision with 4 candidate plans.**

**For CI:** Use --fast mode (10 trials, 20 steps, 4 MC, 500 particles) → 10-15 min

**For production:** Use full mode after training Phase 1 models → 2-4 hours

**For profiling:** Add --profile flag and check hot spots in cProfile output
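
The evaluation counts quoted in this guide are plain products of the sampling budgets, so they are easy to recompute when a budget changes. The sketch below is a back-of-the-envelope calculator only: the helper names are hypothetical (not part of the codebase), and the 0.01 ms per-call figure is the rough estimate used in this guide, not a measurement.

```python
def forward_evals_per_decision(n_mc_samples, n_particles, n_plans=1):
    """Forward-model evaluations for one sensing decision (hypothetical helper)."""
    return n_mc_samples * n_particles * n_plans


def trial_minutes(n_mc_samples, n_particles, n_steps, calls_per_patch, ms_per_call):
    """Rough wall-clock estimate for one trial, in minutes (hypothetical helper)."""
    total_calls = n_mc_samples * n_particles * n_steps * calls_per_patch
    return total_calls * ms_per_call / 1000.0 / 60.0


# CI defaults: 4 MC samples, 500 particles
ci_per_plan = forward_evals_per_decision(4, 500)
ci_all_plans = forward_evals_per_decision(4, 500, n_plans=4)

# Baseline: 8 MC samples, 1000 particles, ~100 steps, 16x16 patch, ~0.01 ms per call
baseline_minutes = trial_minutes(8, 1000, 100, 256, 0.01)

print(ci_per_plan, ci_all_plans, round(baseline_minutes, 1))
```

With these inputs the calculator gives 2,000 evaluations per plan, 8,000 across four plans, and roughly 34 minutes for an unvectorized baseline trial, matching the figures used throughout this guide.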