
simquantum-tuning-lab / PERFORMANCE.md

100enigma

SimQuantum — AMD Developer Hackathon

da98415 about 2 months ago


	# Phase 2 Performance Guide

	## TL;DR — The Right Way to Optimize

	DO NOT blindly reduce computational budgets. Instead:

	1. Profile to find the real bottleneck
	2. Optimize the bottleneck (likely CIM vectorization)
	3. Validate any reductions with ablation studies
	4. Document the empirical justification for the paper

	Current status:
	- Baseline (production budgets): 1000 particles, 8 MC samples
	- CI fast mode: 10 trials, 20 steps (infrastructure testing only)
	- Bottleneck: CIM forward model called 8000× per measurement (not vectorized)

	---

	## The Performance Problem

	Symptom: Single trial takes 30 minutes in CI

	Math:
	- 8 MC samples × 1000 particles × ~100 steps × 256 CIM calls/patch = 204,800,000 forward model evaluations
	- At ~0.01ms per call (Python loop overhead), that's 34 minutes
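
The estimate above can be reproduced with a few lines of arithmetic. This is only a sanity check of the numbers quoted here; the ~0.01 ms per-call figure is the rough value from the text, not a measurement.

```python
# Reproduce the back-of-the-envelope estimate for one trial.
n_mc_samples = 8
n_particles = 1000
n_steps = 100                 # approximate steps per trial
calls_per_patch = 16 * 16     # one device.current() call per pixel of a 16x16 patch
seconds_per_call = 0.01e-3    # ~0.01 ms, rough Python-loop overhead (assumed)

total_calls = n_mc_samples * n_particles * n_steps * calls_per_patch
total_minutes = total_calls * seconds_per_call / 60

print(f"{total_calls:,} forward model evaluations")
print(f"{total_minutes:.1f} minutes")
```

Changing any one factor shows how the total scales: halving the particle count halves the call count, while a 10× faster forward model call cuts the total time by 10× without touching the budgets.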

	Root cause: `CIMObservationModel.predicted_conductance_2d()` calls `device.current()` in a nested Python loop (256 times for 16×16 patch).

	---

	## The Right Fix: Vectorization

	### Current Code (Slow)
	```python
	# belief.py lines 156-158
	for i, v2 in enumerate(v2_vals):
	    for j, v1 in enumerate(v1_vals):
	        patch[i, j] = self.device.current(v1, v2)  # Python loop + function call overhead
	```

	### Optimized Code (10-50× Faster)
	```python
	# Compute all voltage points at once with numpy
	v1_grid, v2_grid = np.meshgrid(v1_vals, v2_vals)
	patch = self.device.current_2d(v1_grid, v2_grid) # Vectorized in numpy/C
	```

	Why this matters:
	- Numpy operations are vectorized in C (SIMD instructions)
	- Eliminates 256 Python function call overheads per patch
	- Better CPU cache utilization
	- Expected speedup: 10-50× for the forward model step
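
A minimal sketch of the loop-to-vectorized change, useful as a template for checking that a `current_2d()` method reproduces the loop output exactly. `ToyDevice` and its formula are hypothetical stand-ins, not the real `ConstantInteractionDevice` physics; only the method names `current` and `current_2d` come from this guide.

```python
import math
import numpy as np

class ToyDevice:
    """Hypothetical stand-in for ConstantInteractionDevice (not the real physics)."""

    def __init__(self, width=0.05):
        self.width = width

    def current(self, v1, v2):
        # Scalar version: one Python call per voltage point.
        x = (v1 + 0.5 * v2) / self.width
        return 1.0 / math.cosh(x) ** 2

    def current_2d(self, v1_grid, v2_grid):
        # Vectorized version: the same formula applied to whole arrays at once.
        x = (v1_grid + 0.5 * v2_grid) / self.width
        return 1.0 / np.cosh(x) ** 2

device = ToyDevice()
v1_vals = np.linspace(-0.2, 0.2, 16)
v2_vals = np.linspace(0.0, 0.3, 16)

# Loop version: patch[i, j] with i indexing v2 and j indexing v1.
patch_loop = np.empty((len(v2_vals), len(v1_vals)))
for i, v2 in enumerate(v2_vals):
    for j, v1 in enumerate(v1_vals):
        patch_loop[i, j] = device.current(v1, v2)

# np.meshgrid's default indexing="xy" returns arrays of shape
# (len(v2_vals), len(v1_vals)), matching the patch[i, j] layout above.
v1_grid, v2_grid = np.meshgrid(v1_vals, v2_vals)
patch_vec = device.current_2d(v1_grid, v2_grid)

print(np.allclose(patch_loop, patch_vec))
```

The equivalence check is the important part: a transposed grid is an easy mistake to make, and it only shows up when the two voltage axes are not symmetric, which is why the toy ranges and formula are deliberately asymmetric.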

	### Implementation Roadmap

	1. Add `ConstantInteractionDevice.current_2d()` method (vectorized)
	2. Update `CIMObservationModel.predicted_conductance_2d()` to use it
	3. Benchmark: measure speedup on single trial
	4. Validate: run ablation to confirm no accuracy loss

	This is the proper optimization — improve efficiency without sacrificing scientific accuracy.

	---

	## Only After Vectorization: Consider Budget Reductions

	If vectorization isn't enough, run ablations:

	```bash
	python experiments/ablation_phase2.py --n-trials 20
	```

	This tests:
	- baseline: 1000 particles, 8 MC samples
	- reduced_particles: 500 particles, 8 MC samples
	- reduced_mc: 1000 particles, 4 MC samples
	- both_reduced: 500 particles, 4 MC samples

	Output:
	```
	Config              Success%   Reduction%      Duration(s)     Speedup
	----------------------------------------------------------------------
	baseline            90.0%      52.3% ± 3.1%    120.5 ± 12.3    -
	reduced_particles   88.5%      51.1% ± 3.4%    65.2 ± 8.1      1.85x
	reduced_mc          89.0%      50.8% ± 3.5%    68.1 ± 9.2      1.77x
	both_reduced        86.5%      48.9% ± 4.1%    35.4 ± 6.7      3.40x

	KEY FINDINGS:
	reduced_particles:
	  ✗ Performance differs from baseline (Δsuccess=1.5%, Δreduction=1.2%)
	  → Not recommended despite 1.85× speedup
	```

	Accept reductions only if:
	- Success rate Δ < 5%
	- Measurement reduction Δ < 5%
	- Documented in paper methods section
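
The two numeric criteria above can be written as a small check. This sketch assumes the Δ values are absolute differences in percentage points; the function name, dictionary keys, and numbers are illustrative, not taken from `ablation_phase2.py`, whose own verdict may rest on additional tests not described here.

```python
def reduction_is_acceptable(baseline, candidate, max_delta=5.0):
    """Check both thresholds, in absolute percentage points (assumed meaning of Δ)."""
    d_success = abs(baseline["success_pct"] - candidate["success_pct"])
    d_reduction = abs(baseline["reduction_pct"] - candidate["reduction_pct"])
    return d_success < max_delta and d_reduction < max_delta

# Illustrative numbers only, not measured results.
baseline = {"success_pct": 90.0, "reduction_pct": 50.0}
small_drop = {"success_pct": 88.0, "reduction_pct": 49.0}
large_drop = {"success_pct": 82.0, "reduction_pct": 47.0}

print(reduction_is_acceptable(baseline, small_drop))  # True
print(reduction_is_acceptable(baseline, large_drop))  # False
```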

	---

	## Computational Bottlenecks

	Phase 2 introduces several compute-intensive operations:

	### 1. Particle Filter (BeliefUpdater)
	Cost: O(n_particles × n_measurements)

	Each measurement update:
	- Computes likelihood for each particle (CIM forward model)
	- Resamples when effective sample size drops below threshold
	- Syncs to `belief.charge_probs` for other components

	Default: 500 particles
	Trade-off:
	- 100 particles: Fast but coarse uncertainty estimates
	- 500 particles: Good balance (CI default)
	- 1000 particles: High accuracy for critical experiments
	- 2000+ particles: Overkill for most cases
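
The measurement update described above (weight each particle by its likelihood, then resample when the effective sample size drops) can be sketched as follows. This is a generic particle filter step under assumed choices: a toy 1D parameter, a Gaussian likelihood, systematic resampling, and a threshold of half the particle count. None of these are confirmed details of `BeliefUpdater`.

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2) for normalized weights."""
    return 1.0 / np.sum(weights ** 2)

def update_weights(weights, log_likelihoods):
    """Bayes update in log space, returning normalized weights."""
    log_w = np.log(weights) + log_likelihoods
    log_w -= log_w.max()            # subtract the max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

def systematic_resample(particles, weights, rng):
    """Draw len(particles) samples with probability proportional to weight."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0            # guard against round-off
    idx = np.searchsorted(cumulative, positions)
    return particles[idx], np.full(n, 1.0 / n)

rng = np.random.default_rng(0)
n_particles = 500
particles = rng.normal(0.0, 1.0, size=n_particles)   # toy 1D parameter
weights = np.full(n_particles, 1.0 / n_particles)

# Toy Gaussian likelihood for one observation y = 0.8 with noise sigma = 0.2.
log_lik = -0.5 * ((0.8 - particles) / 0.2) ** 2
weights = update_weights(weights, log_lik)

if effective_sample_size(weights) < 0.5 * n_particles:   # hypothetical threshold
    particles, weights = systematic_resample(particles, weights, rng)

posterior_mean = np.average(particles, weights=weights)
```

The likelihood line is where the per-particle forward model call would sit, which is why the cost of this step grows linearly with the particle count.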

	### 2. Active Sensing Monte Carlo (ActiveSensingPolicy)
	Cost: O(n_mc_samples × n_particles × n_candidate_plans)

	Each sensing decision:
	- Samples n_mc_samples hypothetical measurements
	- For each sample, updates a copy of the particle filter
	- Estimates information gain for each candidate plan
	- Typically evaluates 3-5 candidate plans per decision

	Default: 4 MC samples
	Trade-off:
	- 2 samples: Very rough IG estimates, fast
	- 4 samples: Reasonable estimates (CI default)
	- 8 samples: Good estimates (production)
	- 16+ samples: Diminishing returns

	Combined cost: 4 MC × 500 particles × 4 plans = 8,000 forward model evaluations per measurement selection
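
To make the n_mc_samples × n_particles × n_candidate_plans cost concrete, here is a minimal Monte Carlo information gain sketch. It assumes Gaussian measurement noise and uses the entropy of the particle weights as the uncertainty measure; both are assumptions for illustration, and `ActiveSensingPolicy._estimate_information_gain()` may work differently.

```python
import numpy as np

def entropy(weights):
    """Shannon entropy (nats) of a normalized weight vector."""
    w = weights[weights > 0]
    return -np.sum(w * np.log(w))

def estimate_information_gain(weights, predictions, noise_sigma, n_mc_samples, rng):
    """MC estimate of the expected entropy reduction for one candidate plan.

    predictions[k] is the measurement that particle k predicts for this plan,
    so each MC sample costs one likelihood evaluation per particle.
    """
    prior_entropy = entropy(weights)
    posterior_entropies = []
    for _ in range(n_mc_samples):
        # Sample a hypothetical measurement from the predictive distribution.
        k = rng.choice(len(weights), p=weights)
        y = predictions[k] + rng.normal(0.0, noise_sigma)
        # Reweight a copy of the particle weights given that measurement.
        log_w = np.log(weights) - 0.5 * ((y - predictions) / noise_sigma) ** 2
        w = np.exp(log_w - log_w.max())
        posterior_entropies.append(entropy(w / w.sum()))
    return prior_entropy - np.mean(posterior_entropies)

rng = np.random.default_rng(0)
n_particles, n_mc_samples = 500, 4
weights = np.full(n_particles, 1.0 / n_particles)

# Two hypothetical candidate plans with toy per-particle predictions.
plans = {
    "informative": np.linspace(0.0, 1.0, n_particles),   # particles disagree
    "uninformative": np.zeros(n_particles),              # particles all agree
}
gains = {name: estimate_information_gain(weights, pred, 0.05, n_mc_samples, rng)
         for name, pred in plans.items()}
best_plan = max(gains, key=gains.get)
```

A plan on which all particles predict the same value cannot change the weights, so its estimated gain is zero. With few MC samples the estimate for the other plans is noisy, which is the trade-off listed above.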

	### 3. Bayesian Optimization (MultiResBO)
	Cost: O(n_bo_history³) for exact GP fitting (kernel matrix factorization); building the kernel matrix alone is O(n_bo_history²)

	Each BO proposal:
	- Fits a Gaussian Process on growing BO history
	- Optimizes acquisition function (UCB) over voltage space

	Grows over time: More expensive as experiment progresses
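
A minimal NumPy sketch of the two steps of a BO proposal: an exact GP fit via Cholesky factorization, then a UCB argmax over candidate voltages. The RBF kernel, length scale, noise level, `kappa`, and the random candidate set are illustrative assumptions, not the settings or structure of `MultiResBO`.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between point sets of shape (n, d) and (m, d)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """Exact GP regression: posterior mean and standard deviation."""
    k_train = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    chol = np.linalg.cholesky(k_train)   # dominant cost, grows with history size
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    k_cross = rbf_kernel(x_train, x_query)
    mean = k_cross.T @ alpha
    v = np.linalg.solve(chol, k_cross)
    var = 1.0 - np.sum(v ** 2, axis=0)   # prior variance of this RBF kernel is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))

def propose_ucb(x_train, y_train, candidates, kappa=2.0):
    """Pick the candidate that maximizes the upper confidence bound."""
    mean, std = gp_posterior(x_train, y_train, candidates)
    return candidates[np.argmax(mean + kappa * std)]

rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=(20, 2))    # toy (v1, v2) history
y_train = -np.sum(x_train ** 2, axis=1)           # toy objective, peak at origin
candidates = rng.uniform(-1.0, 1.0, size=(200, 2))
next_point = propose_ucb(x_train, y_train, candidates)
```

The Cholesky call is the part that becomes more expensive as the history grows, so a bounded or thinned history keeps each proposal cheap.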

	### 4. CIM Forward Model
	Cost: O(1) per call, but called repeatedly

	The CIM simulator computes conductance at each voltage point:
	- Chemical potential calculation (depends on charge state)
	- Fermi-Dirac statistics
	- Tunneling current formula

	Not a bottleneck for single calls, but becomes significant when multiplied by particle filter and MC sampling.

	---

	## Profiling

	### Quick Profile
	```bash
	python experiments/benchmark_phase2.py \
	    --fast \
	    --profile \
	    --skip-missing-checkpoints
	```

	This runs the benchmark under Python's cProfile and prints the 20 most expensive functions.

	### Detailed Profile
	```python
	import cProfile
	import pstats

	from qdot.agent.executive import ExecutiveAgent
	# ... set up state, adapter, etc.
	agent = ExecutiveAgent(state=state, adapter=adapter)

	profiler = cProfile.Profile()
	profiler.enable()

	summary = agent.run()

	profiler.disable()
	stats = pstats.Stats(profiler)
	stats.sort_stats('cumulative')
	stats.print_stats(50)
	```

	### Expected Hot Spots
	Based on computational complexity:
	1. `_ParticleSet.update()` - particle filter updates
	2. `ActiveSensingPolicy._estimate_information_gain()` - MC sampling
	3. `CIMObservationModel.log_likelihood_2d()` - forward model
	4. `GaussianProcess.fit()` - GP kernel matrix inversion
	5. `MultiResBO.propose()` - acquisition optimization
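
To compare the expected hot spots above against a real run, a small helper that profiles any callable and returns the report as text can be convenient, for example to save it as a CI artifact. The helper name and the toy workload are hypothetical; only standard-library `cProfile` and `pstats` calls are used.

```python
import cProfile
import io
import pstats

def profile_top(func, *args, top_n=20, **kwargs):
    """Run func under cProfile and return (result, report of the top_n entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.strip_dirs().sort_stats("cumulative").print_stats(top_n)
    return result, buffer.getvalue()

def toy_workload(n):
    # Stand-in for agent.run(); replace with the real entry point.
    return sum(i * i for i in range(n))

result, report = profile_top(toy_workload, 100_000)
print(report)
```

Sorting by cumulative time surfaces the high-level callers first; sorting by `"tottime"` instead highlights the functions where time is spent directly, which is usually where the forward model would appear.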

	---

	## Tuning Guidelines

	### For CI (Fast Turnaround)
	```python
	# Already configured in benchmark_phase2.py --fast
	# 10 trials, 20 steps, 512 measurement budget
	# Runtime: 10-15 minutes
	```

	### For Development (Moderate Accuracy)
	```python
	agent = ExecutiveAgent(
	    state=state,
	    adapter=adapter,
	    max_steps=50,
	    measurement_budget=1024,
	)
	# BeliefUpdater uses 500 particles (default)
	# ActiveSensingPolicy uses 4 MC samples (default)
	# Runtime: ~20-30 minutes for 10 trials
	```

	### For Production (High Accuracy)
	```python
	# Create custom components with higher budgets
	belief_updater = BeliefUpdater(
	    belief=state.belief,
	    n_particles=1000,  # 2x particles
	)
	sensing_policy = ActiveSensingPolicy(
	    n_mc_samples=8,  # 2x MC samples
	)

	# Inject into ExecutiveAgent (Phase 3 feature)
	# For Phase 2, edit the defaults in the source files
	```

	### For Benchmarking
	```bash
	# Full 100-trial evaluation with trained models
	python experiments/benchmark_phase2.py \
	    --n-trials 100 \
	    --budget 2048 \
	    --max-steps 100
	# Runtime: 2-4 hours (depends on hardware)
	```

	---

	## Performance Expectations

	### Single Trial Timing (Intel i7, 8 cores)
	- Bootstrap: ~5 seconds (line scan)
	- Coarse Survey: ~30 seconds (coarse 2D scan)
	- Charge ID: ~20 seconds (local patch + classification)
	- Navigation: ~10 seconds per voltage move (BO + belief update)
	- Verification: ~30 seconds (repeated measurements)

	Total per trial: 1-3 minutes on average (depends on backtracking)

	### 100-Trial Benchmark
	- Fast mode (CI): 10 trials × 20 steps → 10-15 minutes
	- Full mode: 100 trials × 100 steps → 2-4 hours

	### Scaling Factors
	- Particles: Linear scaling (2x particles = 2x runtime)
	- MC samples: Linear scaling (2x samples = 2x runtime for sensing)
	- Step count: Linear scaling (2x steps = 2x runtime)
	- BO history: Superlinear scaling (for an exact GP fit, 2x history costs 4x to build the kernel matrix and up to 8x to factorize it)

	---

	## Optimization Strategies

	### If BeliefUpdater is the Bottleneck
	1. Reduce `n_particles` to 250-300
	2. Increase `resample_threshold` to avoid frequent resampling
	3. Use a coarser voltage grid for likelihood evaluation
	4. Cache CIM forward model results for repeated voltage points
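
Item 4 can be sketched with `functools.lru_cache`. The wrapper class, its name, and the rounding precision are assumptions for illustration. Note that a cache like this is only valid while the device parameters stay fixed, so with one parameter set per particle it would need to be per particle or keyed on the parameters as well.

```python
from functools import lru_cache

class CachedForwardModel:
    """Memoize a scalar forward model on a rounded voltage grid."""

    def __init__(self, current_fn, decimals=4, maxsize=100_000):
        self.calls = 0                      # number of real evaluations
        self._decimals = decimals
        self._current_fn = current_fn
        self._cached = lru_cache(maxsize=maxsize)(self._evaluate)

    def _evaluate(self, v1, v2):
        self.calls += 1
        return self._current_fn(v1, v2)

    def current(self, v1, v2):
        # Round so that float noise does not defeat the cache.
        return self._cached(round(v1, self._decimals), round(v2, self._decimals))

# Toy forward model standing in for device.current().
model = CachedForwardModel(lambda v1, v2: v1 * v2)
first = model.current(0.1, 0.2)
second = model.current(0.1 + 1e-9, 0.2)    # same point after rounding: cache hit
print(model.calls)  # 1
```

The rounding precision trades accuracy for hit rate, so it belongs in the same ablation as any other budget change.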

	### If ActiveSensingPolicy is the Bottleneck
	1. Reduce `n_mc_samples` to 2-3
	2. Reduce the number of candidate plans considered
	3. Skip active sensing for certain stages (e.g., bootstrap always uses line scan)
	4. Use a heuristic policy (e.g., always take coarse 2D in survey stage)

	### If BO is the Bottleneck
	1. Limit BO history to last N points (e.g., 50 points)
	2. Use a sparse GP approximation
	3. Use a simpler acquisition function (e.g., probability of improvement instead of UCB)
	4. Skip BO optimization and use greedy search
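
Item 1 needs very little code. A sketch using `collections.deque` with a fixed maximum length, where `BoundedHistory` is a hypothetical helper rather than part of `MultiResBO`:

```python
from collections import deque

class BoundedHistory:
    """Keep only the most recent max_points BO observations."""

    def __init__(self, max_points=50):
        self._points = deque(maxlen=max_points)  # oldest entries drop automatically

    def add(self, voltages, score):
        self._points.append((tuple(voltages), float(score)))

    def __len__(self):
        return len(self._points)

    def as_lists(self):
        """Return (X, y) lists ready to hand to a GP fit."""
        xs = [v for v, _ in self._points]
        ys = [s for _, s in self._points]
        return xs, ys

history = BoundedHistory(max_points=50)
for step in range(120):
    history.add((0.01 * step, -0.01 * step), score=float(step))

xs, ys = history.as_lists()
print(len(history), ys[0], ys[-1])  # 50 70.0 119.0
```

Dropping the oldest points also discards information about regions already explored, so this is another change to validate with the ablation before relying on it.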

	### Parallelization (Future Work)
	- Particle filter updates are embarrassingly parallel
	- MC sampling can be parallelized across samples
	- Multiple trials in benchmark can run in parallel

	Not implemented in Phase 2 - requires careful handling of NumPy random state and PyTorch device placement.
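
For the random state concern, NumPy's `SeedSequence.spawn` gives each trial an independent, reproducible stream. The sketch below is an illustration with a toy trial function, not the benchmark's code. It uses `ThreadPoolExecutor` for simplicity; for CPU-bound Python work, `ProcessPoolExecutor` offers the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def run_trial(seed_seq):
    """Toy trial: each call owns its own Generator, so no state is shared."""
    rng = np.random.default_rng(seed_seq)
    return float(rng.normal(size=1000).mean())

def run_trials(n_trials, base_seed=0, max_workers=4):
    # One child SeedSequence per trial: streams are independent of each other
    # and reproducible from base_seed regardless of scheduling order.
    child_seeds = np.random.SeedSequence(base_seed).spawn(n_trials)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_trial, child_seeds))

results = run_trials(8)
```

Because each trial derives its generator from its own child seed, rerunning with the same `base_seed` gives identical per-trial results even if the trials finish in a different order.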

	---

	## Debugging Slow Runs

	### Check if Agent is Stuck
	```python
	# Add verbose logging to ExecutiveAgent._step()
	if self.state.step % 10 == 0:
	    print(f"Step {self.state.step}: stage={self.state.stage}, "
	          f"measurements={self.state.total_measurements}")
	```

	### Common Causes of Slowdown
	1. Backtracking loop: State machine gets stuck retrying failed stages
	2. Low-quality measurements: DQC repeatedly rejects measurements
	3. Poor BO convergence: BO proposals don't improve, agent exhausts step budget
	4. HITL blocking: HITL not in test mode, waiting for human input
	5. Excessive logging: Governance logger writing large decision objects

	### Quick Diagnosis
	```bash
	# Run with verbose output
	python -u experiments/benchmark_phase2.py --fast 2>&1 | tee benchmark.log

	# Check for repeated stage names (stuck in backtracking)
	grep "stage=" benchmark.log | tail -50

	# Check measurement count vs step count (efficiency)
	grep "meas" benchmark.log | tail -20
	```
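
The same checks can be scripted. This sketch assumes log lines in the format printed by the verbose logging snippet in "Check if Agent is Stuck" (`Step N: stage=..., measurements=M`); the stage names in the sample are made up for illustration.

```python
import re
from collections import Counter

STEP_LINE = re.compile(r"Step (\d+): stage=([^,]+), measurements=(\d+)")

def summarize_log(lines):
    """Return (stage counts, measurements per step at the last logged step)."""
    records = [(int(m[1]), m[2], int(m[3]))
               for m in map(STEP_LINE.search, lines) if m]
    stage_counts = Counter(stage for _, stage, _ in records)
    last_step, _, last_meas = records[-1] if records else (0, None, 0)
    per_step = last_meas / last_step if last_step else 0.0
    return stage_counts, per_step

# Sample lines with hypothetical stage names.
sample = [
    "Step 0: stage=BOOTSTRAP, measurements=0",
    "Step 10: stage=NAVIGATION, measurements=100",
    "some unrelated log line",
    "Step 20: stage=NAVIGATION, measurements=200",
    "Step 30: stage=NAVIGATION, measurements=300",
]
stage_counts, per_step = summarize_log(sample)
print(stage_counts.most_common(1), per_step)
```

A stage that dominates the counts late in a run is a hint that the state machine is retrying it. To run this on a real log, pass `open("benchmark.log")` as `lines`.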

	---

	## When to Profile

	Profile when:
	- CI times out despite `--fast` mode
	- Single trial takes >5 minutes
	- Benchmark takes >1 hour for 10 trials
	- Memory usage grows unbounded

	Don't profile when:
	- Runs complete successfully in expected time
	- Small variance across trials (<2x)
	- You only need faster turnaround at reduced accuracy (adjust budgets directly)

	---

	## Summary

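	The dominant cost is particle filter × MC sampling: 4 samples × 500 particles = 2,000 forward model evaluations per candidate plan, or 8,000 per measurement decision with 4 candidate plans. Each of those evaluations currently runs the unvectorized CIM forward model.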

	- For CI: Use `--fast` mode (10 trials, 20 steps, 4 MC samples, 500 particles) → 10-15 min
	- For production: Use full mode after training Phase 1 models → 2-4 hours
	- For profiling: Add the `--profile` flag and check hot spots in the cProfile output