# Phase 2 Performance Guide
## TL;DR — The Right Way to Optimize
**DO NOT** blindly reduce computational budgets. Instead:
1. **Profile** to find the real bottleneck
2. **Optimize the bottleneck** (likely CIM vectorization)
3. **Validate any reductions** with ablation studies
4. **Document** the empirical justification for the paper
**Current status:**
- Baseline: 1000 particles, 8 MC samples
- CI fast mode: 10 trials, 20 steps (infrastructure testing only)
- Bottleneck: CIM forward model called 8000× per measurement (not vectorized)
---
## The Performance Problem
**Symptom:** A single trial takes about 30 minutes in CI
**Math:**
- 8 MC samples × 1000 particles × ~100 steps × 256 CIM calls/patch = **204,800,000 forward model evaluations**
- At ~0.01 ms per call (Python loop overhead), that is about 2,048 seconds, or roughly 34 minutes
**Root cause:** `CIMObservationModel.predicted_conductance_2d()` calls `device.current()` in a nested Python loop (256 times for a 16×16 patch).
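The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope model using the figures quoted in this section; the variable names are illustrative and do not come from the codebase.

```python
# Back-of-the-envelope cost model using the figures quoted above.
n_mc_samples = 8
n_particles = 1000
n_steps = 100               # approximate steps per trial
calls_per_patch = 16 * 16   # one device.current() call per pixel
seconds_per_call = 0.01e-3  # ~0.01 ms of Python overhead per call

total_calls = n_mc_samples * n_particles * n_steps * calls_per_patch
total_minutes = total_calls * seconds_per_call / 60

print(f"{total_calls:,} calls, about {total_minutes:.0f} minutes")
# 204,800,000 calls, about 34 minutes
```

Changing any single factor shows how the total scales, which is useful when deciding what to optimize first.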
---
## The Right Fix: Vectorization
### Current Code (Slow)
```python
# belief.py line 156-158
for i, v2 in enumerate(v2_vals):
    for j, v1 in enumerate(v1_vals):
        patch[i, j] = self.device.current(v1, v2)  # Python loop + function call overhead
```
### Optimized Code (10-50× Faster)
```python
# Compute all voltage points at once with numpy
v1_grid, v2_grid = np.meshgrid(v1_vals, v2_vals)
patch = self.device.current_2d(v1_grid, v2_grid) # Vectorized in numpy/C
```
**Why this matters:**
- NumPy evaluates the whole grid in compiled C loops, which can use SIMD instructions
- Eliminates 256 Python function calls (and their overhead) per patch
- Better CPU cache utilization
- Expected speedup: 10-50× for the forward model step (an estimate until benchmarked)
### Implementation Roadmap
1. Add `ConstantInteractionDevice.current_2d()` method (vectorized)
2. Update `CIMObservationModel.predicted_conductance_2d()` to use it
3. Benchmark: measure speedup on single trial
4. Validate: run ablation to confirm no accuracy loss
**This is the proper optimization** — improve efficiency without sacrificing scientific accuracy.
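A minimal sketch of the loop-versus-vectorized comparison, including the equivalence check that should accompany any such change. `ToyDevice` and its formula are hypothetical stand-ins, not the real `ConstantInteractionDevice` physics; only the pattern (same formula, arrays instead of scalars) is the point.

```python
import time
import numpy as np

class ToyDevice:
    """Stand-in device with a single peak-like response. NOT the real CIM physics."""

    def current(self, v1, v2):
        # Scalar evaluation: one voltage point per call.
        return 1.0 / np.cosh((v1 + 0.5 * v2 - 0.3) / 0.05) ** 2

    def current_2d(self, v1_grid, v2_grid):
        # Same formula applied to whole arrays at once.
        return 1.0 / np.cosh((v1_grid + 0.5 * v2_grid - 0.3) / 0.05) ** 2

device = ToyDevice()
v1_vals = np.linspace(0.0, 1.0, 16)
v2_vals = np.linspace(0.0, 1.0, 16)

# Loop version: 256 Python-level calls.
t0 = time.perf_counter()
patch_loop = np.empty((len(v2_vals), len(v1_vals)))
for i, v2 in enumerate(v2_vals):
    for j, v1 in enumerate(v1_vals):
        patch_loop[i, j] = device.current(v1, v2)
t_loop = time.perf_counter() - t0

# Vectorized version: one call. Default meshgrid indexing ("xy") yields
# arrays of shape (len(v2_vals), len(v1_vals)), matching patch_loop[i, j].
t0 = time.perf_counter()
v1_grid, v2_grid = np.meshgrid(v1_vals, v2_vals)
patch_vec = device.current_2d(v1_grid, v2_grid)
t_vec = time.perf_counter() - t0

# The two results must agree before the faster path replaces the slow one.
assert np.allclose(patch_loop, patch_vec)
print(f"loop: {t_loop * 1e3:.2f} ms, vectorized: {t_vec * 1e3:.2f} ms")
```

Timings from a single run are noisy; for the real benchmark, repeat the measurement (for example with `timeit`) before quoting a speedup.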
---
## Only After Vectorization: Consider Budget Reductions
If vectorization isn't enough, run ablations:
```bash
python experiments/ablation_phase2.py --n-trials 20
```
This tests:
- **baseline:** 1000 particles, 8 MC samples
- **reduced_particles:** 500 particles, 8 MC samples
- **reduced_mc:** 1000 particles, 4 MC samples
- **both_reduced:** 500 particles, 4 MC samples
Output:
```
Config              Success%   Reduction%      Duration(s)     Speedup
----------------------------------------------------------------------
baseline            90.0%      52.3% ± 3.1%    120.5 ± 12.3    -
reduced_particles   88.5%      51.1% ± 3.4%    65.2 ± 8.1      1.85x
reduced_mc          89.0%      50.8% ± 3.5%    68.1 ± 9.2      1.77x
both_reduced        86.5%      48.9% ± 4.1%    35.4 ± 6.7      3.40x

KEY FINDINGS:
  reduced_particles:
    ✗ Performance differs from baseline (Δsuccess=1.5%, Δreduction=1.2%)
    → Not recommended despite 1.85× speedup
```
**Accept reductions only if:**
- Success rate Δ < 5%
- Measurement reduction Δ < 5%
- Documented in paper methods section
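The two numeric criteria can be expressed as a small check. This is a sketch that assumes "Δ < 5%" means an absolute difference in percentage points; `accept_reduction` is a hypothetical helper, not part of `ablation_phase2.py`, and the candidate numbers are made up for illustration.

```python
def accept_reduction(baseline, candidate, max_delta=5.0):
    """Return True if both metrics stay within max_delta percentage points.

    baseline and candidate are (success_pct, reduction_pct) tuples.
    """
    d_success = abs(baseline[0] - candidate[0])
    d_reduction = abs(baseline[1] - candidate[1])
    return d_success < max_delta and d_reduction < max_delta

# Illustrative numbers only, not measured results.
baseline = (90.0, 52.3)
print(accept_reduction(baseline, (88.0, 50.0)))  # True
print(accept_reduction(baseline, (80.0, 51.0)))  # False
```

The third criterion (documentation in the methods section) is a process requirement and has to be checked by hand.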
---
## Computational Bottlenecks
Phase 2 introduces several compute-intensive operations:
### 1. Particle Filter (BeliefUpdater)
**Cost:** O(n_particles × n_measurements)
Each measurement update:
- Computes likelihood for each particle (CIM forward model)
- Resamples when effective sample size drops below threshold
- Syncs to `belief.charge_probs` for other components
**Default:** 500 particles
**Trade-off:**
- 100 particles: Fast but coarse uncertainty estimates
- 500 particles: Good balance (CI default)
- 1000 particles: High accuracy for critical experiments
- 2000+ particles: Overkill for most cases
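The resampling trigger described above can be sketched with the standard effective-sample-size formula and systematic resampling. The function names are hypothetical, and `resample_threshold` is assumed here to be a fraction of the particle count; the real `BeliefUpdater` may define it differently.

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized weights; ranges from 1 to len(weights)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def systematic_resample(weights, rng):
    """Return particle indices drawn by systematic resampling."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    n = len(w)
    # One random offset, then n evenly spaced positions in [0, 1).
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(w)
    cumulative[-1] = 1.0  # guard against floating-point round-off
    return np.searchsorted(cumulative, positions, side="right")

def maybe_resample(particles, weights, rng, resample_threshold=0.5):
    """Resample only when ESS falls below resample_threshold * n_particles."""
    n = len(weights)
    if effective_sample_size(weights) < resample_threshold * n:
        idx = systematic_resample(weights, rng)
        return particles[idx], np.full(n, 1.0 / n)
    return particles, weights

rng = np.random.default_rng(0)
particles = np.arange(4)
degenerate = np.array([0.0, 0.0, 1.0, 0.0])
new_particles, new_weights = maybe_resample(particles, degenerate, rng)
print(new_particles, new_weights)  # [2 2 2 2] [0.25 0.25 0.25 0.25]
```

With this definition, a lower threshold means fewer resampling events, which is the knob to turn when resampling itself shows up in a profile.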
### 2. Active Sensing Monte Carlo (ActiveSensingPolicy)
**Cost:** O(n_mc_samples × n_particles × n_candidate_plans)
Each sensing decision:
- Samples n_mc_samples hypothetical measurements
- For each sample, updates a copy of the particle filter
- Estimates information gain for each candidate plan
- Typically evaluates 3-5 candidate plans per decision
**Default:** 4 MC samples
**Trade-off:**
- 2 samples: Very rough IG estimates, fast
- 4 samples: Reasonable estimates (CI default)
- 8 samples: Good estimates (production)
- 16+ samples: Diminishing returns
**Combined cost:** 4 MC × 500 particles × 4 plans = 8,000 forward model evaluations per measurement selection
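To make the cost structure concrete, here is a minimal sketch of a Monte Carlo information-gain estimate for one candidate plan. It assumes a scalar observation with Gaussian noise and strictly positive particle weights; the function names and the 1D observation model are simplifications for illustration, not the actual `ActiveSensingPolicy` implementation.

```python
import numpy as np

def entropy(weights):
    """Shannon entropy (nats) of a normalized weight vector; 0*log(0) counts as 0."""
    w = np.asarray(weights, dtype=float)
    nz = w > 0
    return float(-np.sum(w[nz] * np.log(w[nz])))

def estimate_information_gain(weights, predictions, noise_sigma, n_mc_samples, rng):
    """Prior entropy minus the average posterior entropy over simulated measurements.

    predictions[i] is the forward-model output for particle i under this plan.
    Each MC sample re-weights every particle, hence n_mc_samples * n_particles work.
    """
    weights = np.asarray(weights, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    prior_entropy = entropy(weights)
    posterior_entropies = []
    for _ in range(n_mc_samples):
        # Simulate a measurement from a particle drawn according to the belief.
        k = rng.choice(len(weights), p=weights)
        y = predictions[k] + rng.normal(0.0, noise_sigma)
        # Bayesian update of a copy of the weights, done in log space.
        log_post = np.log(weights) - 0.5 * ((y - predictions) / noise_sigma) ** 2
        log_post -= log_post.max()
        post = np.exp(log_post)
        post /= post.sum()
        posterior_entropies.append(entropy(post))
    return prior_entropy - float(np.mean(posterior_entropies))

rng = np.random.default_rng(0)
uniform = np.full(4, 0.25)
# A plan whose predictions separate the hypotheses is maximally informative.
ig_good = estimate_information_gain(uniform, [0.0, 1.0, 2.0, 3.0], 0.01, 4, rng)
# A plan where every particle predicts the same value tells us nothing.
ig_flat = estimate_information_gain(uniform, [1.0, 1.0, 1.0, 1.0], 0.01, 4, rng)
print(f"{ig_good:.3f} {ig_flat:.3f}")
```

In this toy setup the informative plan recovers nearly the full prior entropy (ln 4 nats) and the uninformative plan gains essentially nothing. Repeating the estimate for every candidate plan gives the n_mc_samples × n_particles × n_candidate_plans cost quoted above.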
### 3. Bayesian Optimization (MultiResBO)
**Cost:** O(n_bo_history³) time for exact GP fitting (kernel matrix factorization), O(n_bo_history²) memory
Each BO proposal:
- Fits a Gaussian Process on growing BO history
- Optimizes acquisition function (UCB) over voltage space
**Grows over time:** More expensive as experiment progresses
### 4. CIM Forward Model
**Cost:** O(1) per call, but called repeatedly
The CIM simulator computes conductance at each voltage point:
- Chemical potential calculation (depends on charge state)
- Fermi-Dirac statistics
- Tunneling current formula
**Not a bottleneck** for single calls, but becomes significant when multiplied by particle filter and MC sampling.
---
## Profiling
### Quick Profile
```bash
python experiments/benchmark_phase2.py \
    --fast \
    --profile \
    --skip-missing-checkpoints
```
This runs the benchmark under Python's `cProfile` and prints the 20 most expensive functions.
### Detailed Profile
```python
import cProfile
import pstats
from qdot.agent.executive import ExecutiveAgent
# ... setup state, adapter, etc.
profiler = cProfile.Profile()
profiler.enable()
summary = agent.run()
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(50)
```
### Expected Hot Spots
Based on computational complexity:
1. `_ParticleSet.update()` - particle filter updates
2. `ActiveSensingPolicy._estimate_information_gain()` - MC sampling
3. `CIMObservationModel.log_likelihood_2d()` - forward model
4. `GaussianProcess.fit()` - GP kernel matrix inversion
5. `MultiResBO.propose()` - acquisition optimization
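To check whether these functions actually dominate, the profiler output can be filtered by name. `pstats.Stats.print_stats` accepts regular-expression restrictions, so a sketch like the following prints only the matching rows. The workload here is a toy stand-in that merely reuses one of the names above.

```python
import cProfile
import io
import pstats

def log_likelihood_2d(n):
    # Toy stand-in for an expensive function; not the real forward model.
    return sum(i * i for i in range(n))

def run_workload():
    return [log_likelihood_2d(20_000) for _ in range(20)]

profiler = cProfile.Profile()
profiler.enable()
run_workload()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative")
# String arguments are regular expressions matched against the
# "file:line(function)" label, so only matching rows are printed.
stats.print_stats("log_likelihood_2d|run_workload")
report = buffer.getvalue()
print(report)
```

For a real run, replace the pattern with the hot-spot names listed above and compare their cumulative time to the total.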
---
## Tuning Guidelines
### For CI (Fast Turnaround)
```python
# Already configured in benchmark_phase2.py --fast
# 10 trials, 20 steps, 512 measurement budget
# Runtime: 10-15 minutes
```
### For Development (Moderate Accuracy)
```python
agent = ExecutiveAgent(
    state=state,
    adapter=adapter,
    max_steps=50,
    measurement_budget=1024,
)
# BeliefUpdater uses 500 particles (default)
# ActiveSensingPolicy uses 4 MC samples (default)
# Runtime: ~20-30 minutes for 10 trials
```
### For Production (High Accuracy)
```python
# Create custom components with higher budgets
belief_updater = BeliefUpdater(
    belief=state.belief,
    n_particles=1000,  # 2x particles
)
sensing_policy = ActiveSensingPolicy(
    n_mc_samples=8,  # 2x MC samples
)
# Inject into ExecutiveAgent (Phase 3 feature)
# For Phase 2, edit the defaults in the source files
```
### For Benchmarking
```bash
# Full 100-trial evaluation with trained models
python experiments/benchmark_phase2.py \
    --n-trials 100 \
    --budget 2048 \
    --max-steps 100
# Runtime: 2-4 hours (depends on hardware)
```
---
## Performance Expectations
### Single Trial Timing (Intel i7, 8 cores)
- Bootstrap: ~5 seconds (line scan)
- Coarse Survey: ~30 seconds (coarse 2D scan)
- Charge ID: ~20 seconds (local patch + classification)
- Navigation: ~10 seconds per voltage move (BO + belief update)
- Verification: ~30 seconds (repeated measurements)
**Total per trial:** 1-3 minutes on average (depends on backtracking)
### 100-Trial Benchmark
- **Fast mode (CI):** 10 trials × 20 steps, 10-15 minutes
- **Full mode:** 100 trials × 100 steps, 2-4 hours
### Scaling Factors
- **Particles:** Linear scaling (2x particles = 2x runtime)
- **MC samples:** Linear scaling (2x samples = 2x runtime for sensing)
- **Step count:** Linear scaling (2x steps = 2x runtime)
- **BO history:** Cubic scaling for exact GP fitting (2x history ≈ 8x GP fit time)
---
## Optimization Strategies
### If BeliefUpdater is the Bottleneck
1. Reduce `n_particles` to 250-300
2. Lower `resample_threshold` so resampling triggers less often (it fires when effective sample size drops below the threshold)
3. Use a coarser voltage grid for likelihood evaluation
4. Cache CIM forward model results for repeated voltage points
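As an illustration of the caching idea in item 4, `functools.lru_cache` can memoize forward-model calls once the voltages are rounded to a fixed precision. `device_current` below is a hypothetical stand-in, and the rounding precision is an assumption that trades a small voltage error for cache hits.

```python
from functools import lru_cache
import math

def device_current(v1, v2):
    # Hypothetical stand-in for an expensive forward-model call.
    return 1.0 / math.cosh((v1 + 0.5 * v2 - 0.3) / 0.05) ** 2

@lru_cache(maxsize=65536)
def _cached_current(v1_key, v2_key):
    return device_current(v1_key, v2_key)

def cached_current(v1, v2, decimals=4):
    """Round voltages so nearly identical floats share one cache entry."""
    return _cached_current(round(v1, decimals), round(v2, decimals))

for _ in range(3):
    cached_current(0.25, 0.10)

info = _cached_current.cache_info()
print(info.hits, info.misses)  # 2 1
```

A cache like this is only valid while the device parameters are fixed. If each particle carries its own parameters, those must be part of the cache key as well, which may reduce the hit rate considerably.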
### If ActiveSensingPolicy is the Bottleneck
1. Reduce `n_mc_samples` to 2-3
2. Reduce the number of candidate plans considered
3. Skip active sensing for certain stages (e.g., bootstrap always uses line scan)
4. Use a heuristic policy (e.g., always take coarse 2D in survey stage)
### If BO is the Bottleneck
1. Limit BO history to last N points (e.g., 50 points)
2. Use a sparse GP approximation
3. Use simpler acquisition (e.g., probability of improvement vs UCB)
4. Skip BO optimization and use greedy search
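Item 1 can be sketched with a bounded `collections.deque`, which discards the oldest observation automatically once the limit is reached. `BoundedHistory` is a hypothetical helper, not part of `MultiResBO`.

```python
from collections import deque

class BoundedHistory:
    """Keep only the most recent max_points (x, y) observations."""

    def __init__(self, max_points=50):
        self._points = deque(maxlen=max_points)

    def add(self, x, y):
        # Appending to a full deque silently drops the oldest entry.
        self._points.append((x, y))

    def as_lists(self):
        xs = [p[0] for p in self._points]
        ys = [p[1] for p in self._points]
        return xs, ys

history = BoundedHistory(max_points=50)
for step in range(120):
    history.add(x=(step * 0.01, step * 0.02), y=float(step))

xs, ys = history.as_lists()
print(len(xs), ys[0], ys[-1])  # 50 70.0 119.0
```

Dropping old points caps the GP fit cost but also discards information about regions visited early, so the window size is worth validating like any other budget reduction.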
### Parallelization (Future Work)
- Particle filter updates are embarrassingly parallel
- MC sampling can be parallelized across samples
- Multiple trials in benchmark can run in parallel
**Not implemented in Phase 2** - requires careful handling of NumPy random state and PyTorch device placement.
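One common way to handle the NumPy random state issue is to give each trial its own generator spawned from a single `SeedSequence`. The sketch below assumes trials are independent functions of their seed; `run_trial` is a toy placeholder, not the real benchmark trial.

```python
import numpy as np

def run_trial(seed_seq):
    """Toy trial: returns the mean of 1000 draws from its own generator."""
    rng = np.random.default_rng(seed_seq)
    return float(rng.normal(size=1000).mean())

def run_trials(n_trials, seed, map_fn=map):
    # spawn() yields statistically independent child seeds, one per trial.
    children = np.random.SeedSequence(seed).spawn(n_trials)
    return list(map_fn(run_trial, children))

results_a = run_trials(4, seed=1234)
results_b = run_trials(4, seed=1234)
print(results_a == results_b)  # True
```

Because each trial depends only on its child seed, the serial `map` can be swapped for `executor.map` from a `concurrent.futures.ProcessPoolExecutor` without changing the results. PyTorch device placement is a separate concern that this sketch does not address.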
---
## Debugging Slow Runs
### Check if Agent is Stuck
```python
# Add verbose logging to ExecutiveAgent._step()
if self.state.step % 10 == 0:
    print(f"Step {self.state.step}: stage={self.state.stage}, "
          f"measurements={self.state.total_measurements}")
```
### Common Causes of Slowdown
1. **Backtracking loop:** State machine gets stuck retrying failed stages
2. **Low-quality measurements:** DQC repeatedly rejects measurements
3. **Poor BO convergence:** BO proposals don't improve, agent exhausts step budget
4. **HITL blocking:** HITL not in test mode, waiting for human input
5. **Excessive logging:** Governance logger writing large decision objects
### Quick Diagnosis
```bash
# Run with verbose output
python -u experiments/benchmark_phase2.py --fast 2>&1 | tee benchmark.log
# Check for repeated stage names (stuck in backtracking)
grep "stage=" benchmark.log | tail -50
# Check measurement count vs step count (efficiency)
grep "meas" benchmark.log | tail -20
```
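The `grep` check for repeated stage names can also be automated. The sketch below assumes log lines in the `Step N: stage=..., measurements=...` format printed by the logging snippet above; the stage names in the sample are made up.

```python
import re

STAGE_RE = re.compile(r"stage=([^,\s]+)")

def longest_stage_run(lines):
    """Return (stage, count) for the longest run of consecutive identical stages."""
    best_stage, best_count = None, 0
    current_stage, current_count = None, 0
    for line in lines:
        match = STAGE_RE.search(line)
        if not match:
            continue  # ignore lines without a stage field
        stage = match.group(1)
        if stage == current_stage:
            current_count += 1
        else:
            current_stage, current_count = stage, 1
        if current_count > best_count:
            best_stage, best_count = current_stage, current_count
    return best_stage, best_count

sample = [
    "Step 0: stage=BOOTSTRAP, measurements=16",
    "Step 10: stage=NAVIGATION, measurements=120",
    "Step 20: stage=NAVIGATION, measurements=180",
    "Step 30: stage=NAVIGATION, measurements=240",
    "Step 40: stage=VERIFICATION, measurements=300",
]
print(longest_stage_run(sample))  # ('NAVIGATION', 3)
```

A long run is not proof of a stuck agent (navigation legitimately takes many steps), but a run that keeps growing while the measurement count stalls is a strong hint.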
---
## When to Profile
**Profile when:**
- CI timeout despite --fast mode
- Single trial takes >5 minutes
- Benchmark takes >1 hour for 10 trials
- Memory usage grows unbounded
**Don't profile when:**
- Runs complete successfully in expected time
- Small variance across trials (<2x)
- Just need to reduce accuracy for faster turnaround (adjust budgets directly)
---
## Summary
**The bottleneck is particle filter × MC sampling: at the defaults, 4 samples × 500 particles = 2,000 forward model evaluations per candidate plan, or 8,000 per measurement decision with 4 candidate plans.**
**For CI:** Use --fast mode (10 trials, 20 steps, 4 MC, 500 particles) → 10-15 min
**For production:** Use full mode after training Phase 1 models → 2-4 hours
**For profiling:** Add --profile flag and check hot spots in cProfile output