# Phase 2 Performance Guide

## TL;DR — The Right Way to Optimize

**DO NOT** blindly reduce computational budgets. Instead:

1. **Profile** to find the real bottleneck
2. **Optimize the bottleneck** (likely CIM vectorization)
3. **Validate any reductions** with ablation studies
4. **Document** empirical justification for paper

**Current status:**
- Baseline: 1000 particles, 8 MC samples
- CI fast mode: 10 trials, 20 steps (infrastructure testing only)
- Bottleneck: CIM forward model called 8000× per measurement (not vectorized)

---

## The Performance Problem

**Symptom:** Single trial takes 30 minutes in CI

**Math:**
- 8 MC samples × 1000 particles × ~100 steps × 256 CIM calls/patch = **204,800,000 forward model evaluations**
- At ~0.01ms per call (Python loop overhead), that's 34 minutes

**Root cause:** `CIMObservationModel.predicted_conductance_2d()` calls `device.current()` in a nested Python loop (256 times for 16×16 patch).

---

## The Right Fix: Vectorization

### Current Code (Slow)
```python
# belief.py line 156-158
for i, v2 in enumerate(v2_vals):
    for j, v1 in enumerate(v1_vals):
        patch[i, j] = self.device.current(v1, v2)  # Python loop + function call overhead
```

### Optimized Code (10-50× Faster)
```python
# Compute all voltage points at once with numpy
v1_grid, v2_grid = np.meshgrid(v1_vals, v2_vals)
patch = self.device.current_2d(v1_grid, v2_grid)  # Vectorized in numpy/C
```

**Why this matters:**
- Numpy operations are vectorized in C (SIMD instructions)
- Eliminates 256 Python function call overheads per patch
- Better CPU cache utilization
- Expected speedup: 10-50× for the forward model step

### Implementation Roadmap

1. Add `ConstantInteractionDevice.current_2d()` method (vectorized)
2. Update `CIMObservationModel.predicted_conductance_2d()` to use it
3. Benchmark: measure speedup on single trial
4. Validate: run ablation to confirm no accuracy loss

**This is the proper optimization** — improve efficiency without sacrificing scientific accuracy.

---

## Only After Vectorization: Consider Budget Reductions

If vectorization isn't enough, run ablations:

```bash
python experiments/ablation_phase2.py --n-trials 20
```

This tests:
- **baseline:** 1000 particles, 8 MC samples
- **reduced_particles:** 500 particles, 8 MC samples
- **reduced_mc:** 1000 particles, 4 MC samples
- **both_reduced:** 500 particles, 4 MC samples

Output:
```
Config               Success%   Reduction%   Duration(s)  Speedup
----------------------------------------------------------------------
baseline             90.0%      52.3% ± 3.1% 120.5 ± 12.3 -
reduced_particles    88.5%      51.1% ± 3.4%  65.2 ± 8.1  1.85x
reduced_mc           89.0%      50.8% ± 3.5%  68.1 ± 9.2  1.77x
both_reduced         86.5%      48.9% ± 4.1%  35.4 ± 6.7  3.40x

KEY FINDINGS:
reduced_particles:
  ✗ Performance differs from baseline (Δsuccess=1.5%, Δreduction=1.2%)
    → Not recommended despite 1.85× speedup
```

**Accept reductions only if:**
- Success rate Δ < 5%
- Measurement reduction Δ < 5%
- Documented in paper methods section

---

## Computational Bottlenecks

Phase 2 introduces several compute-intensive operations:

### 1. Particle Filter (BeliefUpdater)
**Cost:** O(n_particles × n_measurements)

Each measurement update:
- Computes likelihood for each particle (CIM forward model)
- Resamples when effective sample size drops below threshold
- Syncs to `belief.charge_probs` for other components

**Default:** 500 particles
**Trade-off:** 
- 100 particles: Fast but coarse uncertainty estimates
- 500 particles: Good balance (CI default)
- 1000 particles: High accuracy for critical experiments
- 2000+ particles: Overkill for most cases

### 2. Active Sensing Monte Carlo (ActiveSensingPolicy)
**Cost:** O(n_mc_samples × n_particles × n_candidate_plans)

Each sensing decision:
- Samples n_mc_samples hypothetical measurements
- For each sample, updates a copy of the particle filter
- Estimates information gain for each candidate plan
- Typically evaluates 3-5 candidate plans per decision

**Default:** 4 MC samples
**Trade-off:**
- 2 samples: Very rough IG estimates, fast
- 4 samples: Reasonable estimates (CI default)
- 8 samples: Good estimates (production)
- 16+ samples: Diminishing returns

**Combined cost:** 4 MC × 500 particles × 4 plans = 8,000 forward model evaluations per measurement selection

### 3. Bayesian Optimization (MultiResBO)
**Cost:** O(n_bo_history²) for GP fitting

Each BO proposal:
- Fits a Gaussian Process on growing BO history
- Optimizes acquisition function (UCB) over voltage space

**Grows over time:** More expensive as experiment progresses

### 4. CIM Forward Model
**Cost:** O(1) per call, but called repeatedly

The CIM simulator computes conductance at each voltage point:
- Chemical potential calculation (depends on charge state)
- Fermi-Dirac statistics
- Tunneling current formula

**Not a bottleneck** for single calls, but becomes significant when multiplied by particle filter and MC sampling.

---

## Profiling

### Quick Profile
```bash
python experiments/benchmark_phase2.py \
  --fast \
  --profile \
  --skip-missing-checkpoints
```

This will use Python's cProfile and print the top 20 slowest functions.

### Detailed Profile
```python
import cProfile
import pstats

from qdot.agent.executive import ExecutiveAgent
# ... setup state, adapter, etc.

profiler = cProfile.Profile()
profiler.enable()

summary = agent.run()

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(50)
```

### Expected Hot Spots
Based on computational complexity:
1. `_ParticleSet.update()` - particle filter updates
2. `ActiveSensingPolicy._estimate_information_gain()` - MC sampling
3. `CIMObservationModel.log_likelihood_2d()` - forward model
4. `GaussianProcess.fit()` - GP kernel matrix inversion
5. `MultiResBO.propose()` - acquisition optimization

---

## Tuning Guidelines

### For CI (Fast Turnaround)
```python
# Already configured in benchmark_phase2.py --fast
# 10 trials, 20 steps, 512 measurement budget
# Runtime: 10-15 minutes
```

### For Development (Moderate Accuracy)
```python
agent = ExecutiveAgent(
    state=state,
    adapter=adapter,
    max_steps=50,
    measurement_budget=1024,
)
# BeliefUpdater uses 500 particles (default)
# ActiveSensingPolicy uses 4 MC samples (default)
# Runtime: ~20-30 minutes for 10 trials
```

### For Production (High Accuracy)
```python
# Create custom components with higher budgets
belief_updater = BeliefUpdater(
    belief=state.belief,
    n_particles=1000,  # 2x particles
)
sensing_policy = ActiveSensingPolicy(
    n_mc_samples=8,  # 2x MC samples
)

# Inject into ExecutiveAgent (Phase 3 feature)
# For Phase 2, edit the defaults in the source files
```

### For Benchmarking
```bash
# Full 100-trial evaluation with trained models
python experiments/benchmark_phase2.py \
  --n-trials 100 \
  --budget 2048 \
  --max-steps 100
# Runtime: 2-4 hours (depends on hardware)
```

---

## Performance Expectations

### Single Trial Timing (Intel i7, 8 cores)
- Bootstrap: ~5 seconds (line scan)
- Coarse Survey: ~30 seconds (coarse 2D scan)
- Charge ID: ~20 seconds (local patch + classification)
- Navigation: ~10 seconds per voltage move (BO + belief update)
- Verification: ~30 seconds (repeated measurements)

**Total per trial:** 1-3 minutes on average (depends on backtracking)

### 100-Trial Benchmark
- **Fast mode (CI):** 10 trials × 20 steps = 10-15 minutes
- **Full mode:** 100 trials × 100 steps = 2-4 hours

### Scaling Factors
- **Particles:** Linear scaling (2x particles = 2x runtime)
- **MC samples:** Linear scaling (2x samples = 2x runtime for sensing)
- **Step count:** Linear scaling (2x steps = 2x runtime)
- **BO history:** Quadratic scaling (2x history = 4x GP fit time)

---

## Optimization Strategies

### If BeliefUpdater is the Bottleneck
1. Reduce `n_particles` to 250-300
2. Increase `resample_threshold` to avoid frequent resampling
3. Use a coarser voltage grid for likelihood evaluation
4. Cache CIM forward model results for repeated voltage points

### If ActiveSensingPolicy is the Bottleneck
1. Reduce `n_mc_samples` to 2-3
2. Reduce the number of candidate plans considered
3. Skip active sensing for certain stages (e.g., bootstrap always uses line scan)
4. Use a heuristic policy (e.g., always take coarse 2D in survey stage)

### If BO is the Bottleneck
1. Limit BO history to last N points (e.g., 50 points)
2. Use a sparse GP approximation
3. Use simpler acquisition (e.g., probability of improvement vs UCB)
4. Skip BO optimization and use greedy search

### Parallelization (Future Work)
- Particle filter updates are embarrassingly parallel
- MC sampling can be parallelized across samples
- Multiple trials in benchmark can run in parallel

**Not implemented in Phase 2** - requires careful handling of NumPy random state and PyTorch device placement.

---

## Debugging Slow Runs

### Check if Agent is Stuck
```python
# Add verbose logging to ExecutiveAgent._step()
if self.state.step % 10 == 0:
    print(f"Step {self.state.step}: stage={self.state.stage}, "
          f"measurements={self.state.total_measurements}")
```

### Common Causes of Slowdown
1. **Backtracking loop:** State machine gets stuck retrying failed stages
2. **Low-quality measurements:** DQC repeatedly rejects measurements
3. **Poor BO convergence:** BO proposals don't improve, agent exhausts step budget
4. **HITL blocking:** HITL not in test mode, waiting for human input
5. **Excessive logging:** Governance logger writing large decision objects

### Quick Diagnosis
```bash
# Run with verbose output
python -u experiments/benchmark_phase2.py --fast 2>&1 | tee benchmark.log

# Check for repeated stage names (stuck in backtracking)
grep "stage=" benchmark.log | tail -50

# Check measurement count vs step count (efficiency)
grep "meas" benchmark.log | tail -20
```

---

## When to Profile

**Profile when:**
- CI timeout despite --fast mode
- Single trial takes >5 minutes
- Benchmark takes >1 hour for 10 trials
- Memory usage grows unbounded

**Don't profile when:**
- Runs complete successfully in expected time
- Small variance across trials (<2x)
- Just need to reduce accuracy for faster turnaround (adjust budgets directly)

---

## Summary

**The bottleneck is particle filter × MC sampling = 4 samples × 500 particles = 2000 forward model evaluations per measurement decision.**

**For CI:** Use --fast mode (10 trials, 20 steps, 4 MC, 500 particles) → 10-15 min
**For production:** Use full mode after training Phase 1 models → 2-4 hours
**For profiling:** Add --profile flag and check hot spots in cProfile output